CDDB ID: A Tale of Woe

By Lars Magne Ingebrigtsen

A lifetime ago (in 1993), a guy called Ti Kan wrote a CD player called xmcd. He thought it would be a nice idea to allow people to identify what CDs they were playing, so he invented a database to store CD and track names in. To identify each CD, he created a "CDDB ID", which was supposed to be as unique as possible. This rant is about that CDDB ID, which plagues us to this day.

The people who invented the Red Book CD format didn't see the need to be able to (electronically) identify CDs. There are no unique identifiers on CDs. What we have, really, is only the table of contents (TOC), which lists where each song starts by frame number. (Each second has 75 frames.)

Here's an example:

# Track frame offsets:
#	150
#	15095
#	28530
#	40556
#	60479
#	81952
#	100762
#	112675
#	128656
#	145954
#
# Disc length: 2260

Now, that looks like quite a bit of data to work with. You could either just use all that as the "signature" of the CD (which might be a handful), or you could just use a hash of that.

Ti Kan wanted to use a hash. He could have chosen something like CRC32, which would have given him a 32 bit number, yielding 4 billion unique IDs.
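
For illustration only, here's a rough sketch of what a CRC32-based ID could look like for the example above. The function name and the way the TOC is serialized (big-endian 32-bit integers, track offsets followed by the disc length) are my own assumptions, not anything xmcd or any CD database actually used.

```python
import struct
import zlib

# Track start offsets (in frames) and disc length (in seconds) from the example above.
OFFSETS = [150, 15095, 28530, 40556, 60479, 81952, 100762, 112675, 128656, 145954]
DISC_LENGTH = 2260


def crc32_toc_id(offsets, disc_length):
    """Hypothetical ID: CRC32 over the TOC packed as big-endian 32-bit integers."""
    values = list(offsets) + [disc_length]
    payload = struct.pack(f">{len(values)}I", *values)
    # zlib.crc32 returns an unsigned 32-bit value in Python 3.
    return f"{zlib.crc32(payload):08x}"


print(crc32_toc_id(OFFSETS, DISC_LENGTH))
```

With a hash like this, every frame offset feeds into all 32 bits of the result, so two discs that differ by a single frame somewhere get different IDs.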

Instead he wrote his own hash.

The CDDB ID of the album up there is "6808d20a". As we can see, it's a 32 bit number. However, it doesn't yield 4 billion unique IDs.

Let's decompose the ID.

The last byte (0a) is the number of tracks on the album. Let's count. Yup. Ten.

The two middle bytes (08d2) are the playing time of the album in seconds. Get a hex calculator out, and you'll see that that equals 2258: the disc length of 2260 seconds, minus the 2 seconds at which the first track starts.

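The first byte is actually a checksum of sorts. It's the sum of all the decimal digits of the start times of all the tracks (in seconds), modulo 255. For the album above, the tracks start at 2, 201, 380, 540, 806, 1092, 1343, 1502, 1715 and 1946 seconds, and those digits add up to 104, which is 68 in hex.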

Put these three parts together, and you have the CDDB ID.
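
The three parts can be put together in a few lines. This is a minimal sketch that follows the description above, not a copy of xmcd's code, and the function names are mine. Note that the middle bytes are computed as the disc length minus the start of the first track, which is what makes the example come out as 08d2.

```python
# Track start offsets (in frames) and disc length (in seconds) from the example above.
OFFSETS = [150, 15095, 28530, 40556, 60479, 81952, 100762, 112675, 128656, 145954]
DISC_LENGTH = 2260
FRAMES_PER_SECOND = 75


def digit_sum(n):
    """Sum of the decimal digits of n, e.g. 1946 -> 1 + 9 + 4 + 6 = 20."""
    return sum(int(d) for d in str(n))


def cddb_id(offsets, disc_length):
    """Compute the CDDB ID as an 8-digit hex string."""
    start_seconds = [frames // FRAMES_PER_SECOND for frames in offsets]
    # First byte: digit sums of all track start times, modulo 255.
    checksum = sum(digit_sum(s) for s in start_seconds) % 255
    # Middle two bytes: playing time, from the first track to the end of the disc.
    playing_time = disc_length - start_seconds[0]
    # Last byte: number of tracks.
    return f"{checksum:02x}{playing_time:04x}{len(offsets):02x}"


print(cddb_id(OFFSETS, DISC_LENGTH))  # prints 6808d20a
```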

The upshot of this is that if you have two albums that have ten tracks, and they are of the same length in seconds, you have a (to be charitable) 1-in-255 chance of getting the same ID. That might seem like a pretty good deal, but it's not. The problem is that 90% of pop music albums are in the 40-50 minute range, and have 8-12 tracks. And there are millions of CDs out there.

There's also the fascinating fact that almost all CDs have the first track starting around frame 150-180, which means that the checksum of CDs with a single track ends up being "2". Virtually all single-track CDs that have a length of 500 seconds have the following identifier: "0201f401".
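
A quick way to check this is to compute the ID for every plausible start frame. The sketch below assumes that "length" means the playing time from the start of the track to the end of the disc; the helper name is made up for this example.

```python
def single_track_id(start_frame, playing_seconds):
    """CDDB ID of a one-track disc, following the three-part layout described above."""
    start_seconds = start_frame // 75
    checksum = sum(int(d) for d in str(start_seconds)) % 255
    return f"{checksum:02x}{playing_seconds:04x}01"


# Every start frame from 150 to 180 falls within second 2, so the checksum never changes.
ids = {single_track_id(frame, 500) for frame in range(150, 181)}
print(ids)  # prints {'0201f401'}
```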

When submitting a CD in the "rock" category to freedb, the likelihood of it being rejected due to a CDDB ID collision is over 50%. There are only 2.5M CDs in that database. With an ideal 32-bit hash you'd have 4 billion possible identifiers, and the likelihood of collisions would be minuscule.
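
To put a rough number on "minuscule": the back-of-the-envelope sketch below takes the 2.5M figure as the number of existing entries and assumes an ideal hash that spreads IDs uniformly over all 32 bits. Both are simplifications, and the function name is mine.

```python
ID_SPACE = 2 ** 32       # number of distinct 32-bit IDs
ENTRIES = 2_500_000      # existing CDs, per the figure above


def chance_new_id_collides(entries, id_space=ID_SPACE):
    """Probability that one new, uniformly random ID matches any of `entries` random IDs."""
    return 1 - (1 - 1 / id_space) ** entries


p = chance_new_id_collides(ENTRIES)
print(f"{p:.4%}")  # roughly 0.06%
```

That's around one rejected submission in 1,700, compared to more than one in two.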

Perhaps it's unfair to rant about this at this late stage, but it seems like nobody else has bothered. And we're still being plagued by this awful design, 15 years after the fact. The number of deployed CD players that use the CDDB ID is huge (thousands of different players used by millions of people). Fixing this problem now basically means abandoning the CDDB ID altogether, which is what services like Gracenote and MusicBrainz are doing.

Ti Kan was made aware (not by me) of this problem back in 1994, and given a script to convert this format into a CRC32-based format, but he rejected it because the deployed base was too big. At that point it was probably in the high dozens.


2007-11-10 16:30:32