Re: Fuzzy matching

Nick Lamb (njl98r nospam at ecs.soton.ac.uk)
Wed, 7 Apr 1999 04:31:03 +0100 (BST)

On Wed, 7 Apr 1999, Marc van Woerkom wrote:

> > Now, I think I'll ramble on about fuzzy matching...
>
> Did you have a look at agrep?

Ah yes, I wasn't thinking of fuzzy matching on text. Though it's an idea
which should probably go on Rob's enormously long TODO list, for the
current Search in CDindex is adequate, but not very helpful either for
people who type "Bjork" (needs an accent) or "Greatfull Dead" (can't
begin to explain what's wrong there :) When someone tries to find
"Colour" or "Color" they'll get pretty confused too.

I was thinking about the fuzzy matching to existing CDindex entries. Its
something we *should* have, but not something which should be designed as
a quick hack, or a kludge. This is where CDDB started down the wrong
path, and I guess (since CDindex doesn't use CDDB DISCIDs) no-one reading
wants to repeat that mistake.

1. What's wrong with DISCID?

The CDDB DISCID is a fuzzy identifier for CDs, very unlike the SHA1 hash
used by CDindex. Any slight variation (a track with four more frames for
example) will result in a very different SHA1 CDindex identifier. The
CDDB DISCID on the other hand, is the same for any CD with the same
number of tracks, same length (in seconds) and the same track lengths,
in seconds.

The problem with the CDDB DISCID is that for shorter discs, or those with
fewer tracks, the DISCID is no longer unique. For my four track single of
"Change" by The Lightning Seeds, CDDB has at least three possible DISCID
matches. In this case CDDB attempts to locate the "closest" match from
the Track frame offsets recorded in the full record.

Despite this, CDDB can't correctly identify this variation of "Change",
and the only way to fix that would be to manually create a new CDDB
record for the CD (with the right track offsets) edit it from the CD player
and then submit it to the database.

I also have a copy of Nirvana's "Unplugged in New York", which is very
slightly different from most other copies. This difference shows up as
a different CDDB DISCID. If I added my copy to CDDB it would merge the
records, so that one record appears under two distinct identifiers. For
a popular album there may be half a dozen different CDDB DISCIDs for
the many slightly different pressings of the album.

But, meanwhile, I haven't yet (and while CDindex lives, I won't ever)
added this pressing of "Unplugged in New York" to CDDB. So when a CDDB
client sees this CD, it can't find an exact match. The CDDB server falls
back to a "fuzzy match". The potential matches must have the same number
of tracks (last two digits) and similar or identical length (middle four
digits) but may have any hash value (first two digits)

The CDDB client is supposed to show the user a list of matches and offer
them a choice. Then it can keep the record for the "correct" album and
and dispose of the others. AFAIK There is no feedback for the CDDB server,
so the useful knowledge (which one was right?) is lost.

The most serious weakness of CDDB shows up on singles with only one track.
The CDDB DISCID for such a single really[1] depends only on the length of
that track in seconds. They don't preserve the precise TOC information,
so most such singles will generate *multiple* exact matches, even though
the TOC would be enough to distinguish them. The CDDB web site lies about
this, claiming it can't be solved, but CDindex solves this problem :)

A related weakness is that the "genre" concept in CDDB is overloaded as
a means to differentiate between two otherwise identical DISCIDs. This
means that, once in a while, Billy Ocean gets to be "New Age" according
to CDDB! They seem to be struggling to fix this in protocol revision 5
but the bad data is already in the database.

Nick.

[1] In theory the CDDB hash should help, but in fact it uses the start
time only for these CDs, and that doesn't vary significantly.