Fuzzy Matching (2)

Nick Lamb (njl98r nospam at ecs.soton.ac.uk)
Wed, 7 Apr 1999 05:27:51 +0100 (BST)

2. A proposed CDindex fuzzy match

The general *idea* of fuzzy matching offered by CDDB is good one, we don't
want users to create a new CDindex entry if there's a perfectly good one
already, under a different ID. However it's obvious that we can't use the
CDDB DISCID mechanism if we want all the benefits of CDindex too.

The important thing recognised in the CDDB fuzzy match is that there are
some relatively constant factors in the TOC for all the pressings of any
given album. The number of tracks will be identical (or we certainly do
have a different CD if not, regardless of whether the musical content is
the same). And the length of each track shouldn't vary much, nor should
the length of the CD as a whole.

So, it seems that the sensible way to go about fuzzy matching, is for
the client to offer the fuzzy match service a list of track offsets,
starting with the lead-out and then going through each track in order,
rather like the submission process. The server can then take all the CDs
which match +/- 100 frames for the lead-out, and find the closest
match using an algorithm like

for (close = 0, track= 1; track <= tracks; ++track) {
if (test_cd_offset[track] > client_cd_offset[track] )
close -= test_cd_offset[track] - client_cd_offset[track];
else
close += test_cd_offset[track] - client_cd_offset[track];
}

This gives a close value which gets more negative as track lengths
diverge from the client's TOC info.

This approach seems very likely to work, but its computationally very
expensive, and probably needs a lot of CPU and Memory in the server side
if we're going to handle 5--10 or even up to 100 simultaneous users of
the server. One possibility would be to do this only for submission, so
we eliminate dupes in the submission process...

----
The closest existing entry in CDindex was
Title: Under The Pink
Artist: Amos, Tori
Track1: blah
(...)

Is the CD you have identical to this one? [Click Here>>]
Otherwise follow the instructions below to submit your CD.

--

At submission time its worth having the fuzzy match, to reduce typos and goofs from the users. Every time a CD is found by the fuzzy match we've saved the user time, and we've improved the quality of CDindex.

The next question is -- how many variants of a popular CD are out there? the CDDB data doesn't differentiate CDs with TOCs which vary only by a few frames, not whole seconds. So we can't judge easily from their records how many aliases CDindex can expect for a given album. My guess is we're looking at perhaps 10x as many aliases as exist in CDDB, due to the increased precision of the CDindex IDs, but it could be as much as 75x as many.

An alternative approach is to calculate a corresponding CDDB DISCID for every CDindex SHA1 hash in the database, this is possible because CDindex preserves the TOC information. This would allow us to try fuzzy matches by searching for the DISCID corresponding to the submission, and clients could fall back to the DISCID match if they were unable to get an exact match using the SHA1 hash. Of course, when using DISCID we sacrifice the extra accuracy of CDindex, and all the disadvantages of CDDB apply.

Nick.