Duplicate removal

Nick Lamb (njl98r nospam at ecs.soton.ac.uk)
Thu, 2 Dec 1999 06:26:47 +0000

Messages sorted by: [ date ][ thread ][ subject ][ author ]
Next message: Nick Lamb: "Re: The competition (CDDB2)"
Previous message: Greg Stein: "Re: Classifying music samples (MP3/WAV/...)"
Next in thread: Julia Klein: "Re: Duplicate removal"
Maybe reply: Julia Klein: "Re: Duplicate removal"

--OROCMA9jn6tkzFBc
Content-Type: multipart/mixed; boundary="PuGuTyElPB9bOcsM"

--PuGuTyElPB9bOcsM
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: quoted-printable

I am not dead!

Hopefully this project isn't dead either, so here is a Perl script which
when run by the administrator of a working CDindex server should tell
them how many duplicates could be merged by a future version of the same
script.

Why doesn't this script just do the merge?

Well, I'm a little bit paranoid, I'm sure CDindex is backed up (right?)
but I have only a very tiny test database for my work on CDindex, and
not much time for testing.=20

So, there are basically four lines missing, without which the worst
that will happen is that it will clog up the DB server while it churns
through all the possible dupes.

------
If someone runs this on a DB with more than a few dozen CDs in it and
gets good results I'd like to hear about it, and hopefully provide
the extra four or so lines to do the merge for real.
------

How does it work?

The script considers each album in turn, and if it finds two albums
with the same title, by the same artist, then it considers that a
possible candidate for merging.

Such candidates are further narrowed down by looking at every track
on the two albums and if the two albums have different track names
or different numbers of tracks then they remain separate, otherwise
they are (potentially, see above) merged.

This way, if you see two albums for your favourite artist which are
definitely duplicates, all you need to do is correct the track
names to match (which should happen in routine typo fixing for
any really identical albums) and my script will do the rest!

If this script works OK in its finished form, I'd like it to be a
regular clean-up process for CDindex, because the alternative is
the same mess we see in CDDB, and nowadays FreeDB too.

Nick.

--PuGuTyElPB9bOcsM
Content-Type: text/plain; charset=us-ascii
Content-Disposition: attachment; filename="duplicates.pl"
Content-Transfer-Encoding: quoted-printable

#!/usr/bin/perl -w
#__________________________________________________________________________=
__
#
# CD Index - The Internet CD Index
#
# Copyright (C) 1999 Nick Lamb
#
# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation; either version 2 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program; if not, write to the Free Software
# Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
#
#__________________________________________________________________________=
__

use DBI;
use DBDefs;
use strict;

my $num_tracks;
my ($i, $toc);
my ($dbh, $sth, $rv, $sql);=20

my $possible =3D 0;
my $dupes =3D 0;

$dbh =3D DBI->connect(DBDefs->DSN,DBDefs->DB_USER,DBDefs->DB_PASSWD);
if (!$dbh)
{
print "(Error: ".$DBI::errstr.")";
die "Sorry, the database is currently not available.";
}=20
else
{
$sth =3D $dbh->prepare("select Id, Name, Artist from Album order by Nam=
e");
if ($sth->execute())
{
my ( nospam at row, nospam at old);

for( nospam at old =3D $sth->fetchrow_array; nospam at row =3D $sth->fetchrow_array;)
{
if ($old[1] eq $row[1] && $old[2] =3D=3D $row[2]) {
duplicate($old[0], $row[0], $row[1]);
} else {
nospam at old =3D nospam at row;
}
}
$sth->finish;
}
}

if ($dbh)
{
$dbh->disconnect();
}

print $possible." possible duplicates considered...\n";
print $dupes." duplicate albums would have been merged in database...\n";

sub duplicate {
my($oldid, $newid, $title) =3D nospam at _;
my($s1,$s2);
my( nospam at row1, nospam at row2);
my $different =3D 0;

$possible++;

$s1 =3D $dbh->prepare("select Id, Name from Track where Album =3D $oldid =
order by Sequence");
$s2 =3D $dbh->prepare("select Id, Name from Track where Album =3D $newid =
order by Sequence");
=20
if ($s1->execute() && $s2->execute())
{
do
{
nospam at row1 =3D $s1->fetchrow_array;
nospam at row2 =3D $s2->fetchrow_array;
if (( nospam at row1 !=3D nospam at row2) || (( nospam at row1 > 0) && ($row1[1] ne $row2[1])))
{
$different=3D 1;
}
} while (( nospam at row1 > 0) && ( nospam at row2 > 0));

if ($different =3D=3D 0)
{
$dupes++;
print "Would have merged two albums called ".$title."\n";
}

$s1->finish;
$s2->finish;
}
}

--PuGuTyElPB9bOcsM--

--OROCMA9jn6tkzFBc
Content-Type: application/pgp-signature

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.0.0 (GNU/Linux)
Comment: For info see http://www.gnupg.org

iD8DBQE4RhEnJL0BVnQb59gRAsTtAJ9hUTW/pjA7Ql08UZrrVhDc9QYBkwCeK29F
012/pW00NJdO17q9fmbvmZs=
=Gq90
-----END PGP SIGNATURE-----

--OROCMA9jn6tkzFBc--

Next message: Nick Lamb: "Re: The competition (CDDB2)"
Previous message: Greg Stein: "Re: Classifying music samples (MP3/WAV/...)"
Next in thread: Julia Klein: "Re: Duplicate removal"
Maybe reply: Julia Klein: "Re: Duplicate removal"