possible problems with MD5 usage

Jeffrey Ebert (jeffrey nospam at ebertland.net)
Wed, 10 Mar 1999 01:20:23 -0800

Rob,

While implementing the Python version of the cdindex application, I found
a couple of problems with your MD5 usage that could cause the same CD to
produce different search keys.

MD5Update(&md5, (unsigned char *)pCDInfo, sizeof(CDINDEX_CDINFO));

First, the size of the CDINDEX_CDINFO structure is dependent on the
number of tracks you want to support. You have set it to 1000,
presumably arbitrarily. The MD5 algorithm _is_ sensitive to zeros, so
all the zeros that are in the last 980 or so entries of the array are
changing the MD5 hash. If you change that constant in the future, the
search key will change for the same CD.
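
To make the first point concrete, here is a rough sketch. The two
structures are hypothetical stand-ins for CDINDEX_CDINFO built with
different track limits (I am assuming int fields; the real layout may
differ), and the helper names are mine. It only compares how many bytes
would be fed to MD5Update in each case:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for CDINDEX_CDINFO, compiled with two different
   track limits. The real structure may be laid out differently. */
struct cdinfo_100 {
    int FirstTrack;
    int LastTrack;
    int FrameOffset[100];
};

struct cdinfo_1000 {
    int FirstTrack;
    int LastTrack;
    int FrameOffset[1000];
};

/* Bytes consumed when the whole structure is passed to MD5Update(). */
size_t raw_input_len_100(void)  { return sizeof(struct cdinfo_100); }
size_t raw_input_len_1000(void) { return sizeof(struct cdinfo_1000); }

/* Bytes consumed if only the two track numbers and the entries
   0..last_track were hashed. This does not depend on the array limit. */
size_t used_input_len(int last_track)
{
    assert(last_track >= 0);
    return (2 + (size_t)last_track + 1) * sizeof(int);
}
```

The two raw lengths differ, so the digests differ even when every entry
that is actually in use is identical.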

Second, hashing the raw structure is sensitive to machine endianness, I
think. For track offset 150 stored in a 4-byte integer, a little-endian
machine sees "\226\0\0\0" as a char string in the MD5 processing, while a
big-endian machine sees "\0\0\0\226". Again, the search key would be
different for the same CD.

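A small sketch of the byte-order issue, assuming a 32-bit offset (the
function names are made up for illustration). native_bytes() copies the
value the way a raw-structure hash would see it, while hex_text() gives
the text form, which does not depend on the host:

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Copy the in-memory bytes of value, as a raw-structure hash sees them.
   The order of the four bytes depends on the host. */
void native_bytes(uint32_t value, unsigned char *out)
{
    assert(out != NULL);
    memcpy(out, &value, sizeof value);
}

/* Returns 1 on a little-endian host, 0 otherwise. */
int host_is_little_endian(void)
{
    unsigned char b[4];
    native_bytes(1u, b);
    return b[0] == 1;
}

/* Format value as eight uppercase hex digits plus a terminating NUL.
   out must have room for 9 chars. The result is the same on any host. */
void hex_text(uint32_t value, char *out)
{
    assert(out != NULL);
    sprintf(out, "%08lX", (unsigned long)value);
}
```

For 150, hex_text() gives "00000096" everywhere, while the first byte
from native_bytes() is \226 on little-endian hosts and \0 on big-endian
ones.
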
I suggest that we keep things in the string realm to avoid these
problems.

Something like this:

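MD5Init(&md5);

/* First and last track numbers, two hex digits each */
sprintf(s, "%02X", pCDInfo->FirstTrack);
MD5Update(&md5, (unsigned char *)s, 2);
sprintf(s, "%02X", pCDInfo->LastTrack);
MD5Update(&md5, (unsigned char *)s, 2);

/* Only the entries in use, eight hex digits each */
for (i = 0; i <= pCDInfo->LastTrack; i++) {
    sprintf(s, "%08X", pCDInfo->FrameOffset[i]);
    MD5Update(&md5, (unsigned char *)s, 8);
}

MD5Final(digest, &md5);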

That looks a bit slow, but I don't think it matters for this
application. It's more important that the same CD yields the same search
key, regardless of which machine and software version do the processing.
Also note that in this implementation the MD5 hash is not a function of
the number of entries in the FrameOffset array.
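
To check that last claim without pulling in an MD5 library, the text
that would be hashed can be built in one buffer and compared. This is
only a sketch: build_key_text() is a hypothetical helper, and I am
assuming the offsets fit in 32 bits and that entries 0 through LastTrack
are used, as in the loop above.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: write the text that would be fed to MD5Update().
   Uses offsets[0..last]. Returns the number of characters written
   (not counting the NUL), or 0 if the arguments are out of range or
   buf is too small. */
size_t build_key_text(int first, int last, const unsigned long *offsets,
                      char *buf, size_t bufsize)
{
    size_t need;
    size_t pos = 0;
    int i;

    assert(offsets != NULL && buf != NULL);
    if (first < 0 || first > 255 || last < 0 || last > 255)
        return 0;

    /* 2 + 2 characters for the track numbers, 8 per offset, 1 for NUL */
    need = 4 + 8 * ((size_t)last + 1) + 1;
    if (bufsize < need)
        return 0;

    pos += (size_t)sprintf(buf + pos, "%02X", (unsigned)first);
    pos += (size_t)sprintf(buf + pos, "%02X", (unsigned)last);
    for (i = 0; i <= last; i++)
        pos += (size_t)sprintf(buf + pos, "%08lX",
                               offsets[i] & 0xFFFFFFFFUL);
    return pos;
}
```

Calling it with the same three offsets stored in a 3-entry array and in
a 1000-entry array gives identical text, so the digest would be
identical as well.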

Any thoughts?

PS: Perhaps a warning should be placed on the site that submissions to
the database using the current alpha code may need to be redone in the
future, at least until we're sure that the search key algorithm is
robust.

-- 
Jeff Ebert
jeffrey nospam at ebertland.net