Fuzzy matching

robert nospam at moon.eorbit.net
Mon, 12 Jul 1999 10:52:58 -0700 (PDT)

Messages sorted by: [ date ][ thread ][ subject ][ author ]
Next message: Nick Lamb: "Re: Fuzzy matching"
Previous message: Stephen van Egmond: "Re: Library ideas"
Next in thread: robert nospam at moon.eorbit.net: "Re: Fuzzy matching"
Maybe reply: robert nospam at moon.eorbit.net: "Re: Fuzzy matching"

Ok, the other night I couldn't sleep so I sat down and coded up a fuzzy
matching algorithm for the cdindex.

Nick Lamb wrote:
>starting with the lead-out and then going through each track in order,
>rather like the submission process. The server can then take all the CDs
>which match +/- 100 frames for the lead-out, and find the closest
>match using an algorithm like

>for (close = 0, track= 1; track <= tracks; ++track) {
> if (test_cd_offset[track] > client_cd_offset[track] )
> close -= test_cd_offset[track] - client_cd_offset[track];
> else
> close += test_cd_offset[track] - client_cd_offset[track];
>}
>This gives a close value which gets more negative as track lengths
>diverge from the client's TOC info.
>
>This approach seems very likely to work, but its computationally very
>expensive, and probably needs a lot of CPU and Memory in the server side
>if we're going to handle 5--10 or even up to 100 simultaneous users of
>the server. One possibility would be to do this only for submission, so
>we eliminate dupes in the submission process...

I tried this at first and it found a ton of invalid matches. Seemed
like an intuitive approach to me as well, but alas no.

The actual algorithm that I found to work nicely looks like this:

for(track = 1; track <= tracks; ++track)
{
if (abs(test_cd_offset[track] - client_cd_offset[track]) > 100)
break;
}
if (track == tracks)
{
// this is a match
}

This works ok, and I found some interesting matches that my text
searching wasn't finding. But, as nick points out this approach is
computationally *very* expensive and I haven't really come up with a
way to speed it up.

Subdividing the space on lead-out is not a good idea. Most of the lead
outs that I've seen start at 150. That's no bueno. If I organize the
database such that I can search on the first few tracks, if not all of
them, then I could construct a query that return quite a lot fewer
items.

Net time I can't sleep I'll just implement the enitre fuzzy search in
SQL and we'll see what happens.

--ruaok Freezerburn! All else is only icing. -- Soul Coughing

Robert Kaye -- robert nospam at moon.eorbit.net http://moon.eorbit.net/~robert

Next message: Nick Lamb: "Re: Fuzzy matching"
Previous message: Stephen van Egmond: "Re: Library ideas"
Next in thread: robert nospam at moon.eorbit.net: "Re: Fuzzy matching"
Maybe reply: robert nospam at moon.eorbit.net: "Re: Fuzzy matching"