The Herd of Kittens - Library Background

UPC codes are not ideal for books. They're too short, and they are split into a vendor code and an item code that is not unique within the vendor. For example, Baen Books' "Shards of Honor" by Lois McMaster Bujold has UPC 76714 00599: 76714 is Baen, and 00599 is "any Baen paperback that lists for $5.99". There is an extension code, 72087 on this copy, which is a subset of the digits of the ISBN. It provides uniqueness among actual Baen books, but not identification: you still need a database that maps the codes to titles. You can build such a database, with some effort; Andrew Plotkin's collection is a start for science fiction, but it isn't by any means complete, and "official" registration data is probably expensive to find (and doesn't necessarily exist: the UPC registry and ISBN registry don't actually have exact common keys, so you'd have to match them up by company name).

What you really want is the EAN (which is set to supplant UPC even in the US over the next 3 years or so). The EAN is a longer code. Even on mass-market paperbacks you'll find it inside the front cover if the back still has a UPC. Over the next couple of years the EAN will get primary placement on the back, sometimes alongside a UPC; more often the UPC will go away altogether (look at any XML book for an example: they're all new enough to have only an EAN).

The really cool bit about EAN is the content (the encoding in ink is the same as UPC, just longer): the aforementioned book is 9780671720872 with an extra field of 50599. You see, the first three digits of the EAN are the country code, and 978 is a very special country called "Bookland" (really, I'm not kidding; do a web search on it if you don't believe me). In a "Bookland EAN" the 978 is followed directly by the ISBN without its check digit: 978-067172087-2 corresponds to ISBN 067172087-2. The last digit of each is a checksum; the two algorithms differ, and it is a coincidence that they match for this example. Bookland EANs also usually carry a five-digit extension field, which is a "list price" field: 5-0599 is "US$" 5.99, etc.
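To make the Bookland arithmetic concrete, here is a minimal Python sketch of the two checksums and the EAN-to-ISBN conversion described above. The function names are mine, purely for illustration; the only currency digit it knows is the "5" for US$ mentioned in the text, and it returns the price in cents to avoid floating-point surprises.

```python
def ean13_check_digit(first12: str) -> int:
    """EAN-13 check digit: weights alternate 1, 3 from the left, modulo 10."""
    total = sum(int(d) * (3 if i % 2 else 1) for i, d in enumerate(first12))
    return (10 - total % 10) % 10


def isbn10_check_char(first9: str) -> str:
    """ISBN-10 check character: weights 10 down to 2, modulo 11 (10 is 'X')."""
    total = sum(int(d) * w for d, w in zip(first9, range(10, 1, -1)))
    value = (11 - total % 11) % 11
    return "X" if value == 10 else str(value)


def bookland_to_isbn10(ean: str) -> str:
    """Convert a 978-prefixed EAN-13 into the matching 10-character ISBN."""
    digits = ean.replace("-", "")
    if len(digits) != 13 or not digits.isdigit():
        raise ValueError("expected 13 digits")
    if not digits.startswith("978"):
        raise ValueError("not a 978 Bookland EAN")
    if ean13_check_digit(digits[:12]) != int(digits[12]):
        raise ValueError("bad EAN check digit")
    core = digits[3:12]  # the ISBN without its own check character
    return core + isbn10_check_char(core)


def decode_price_addon(addon: str):
    """Split the 5-digit extension into (currency, price in cents).

    Assumption: only the leading digit 5 (US$) from the text is recognised;
    any other leading digit yields a currency of None.
    """
    if len(addon) != 5 or not addon.isdigit():
        raise ValueError("expected 5 digits")
    currency = "US$" if addon[0] == "5" else None
    return currency, int(addon[1:])
```

With the example book, bookland_to_isbn10("9780671720872") gives "0671720872", and decode_price_addon("50599") gives ("US$", 599). Both checksum functions return 2 for this title, which is the coincidence noted above.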

Q: Would a database with the codes for books be free/affordable/expensive?

And there's the tricky part. I have some tools I've written which get pointers to other libraries from the Library of Congress Z39.50 index page. I selected 10 random science fiction titles from my collection and queried 160 libraries: no more than 20 libraries got hits for any of them, and I think the most hits from a single library was 6 out of 10. The problem seems to be that libraries only have "cards" for books that they actually own. I'm not sure why the LoC itself isn't better, but last I checked they were still only running the production server during business hours and shutting it down at night.

There are sources for raw MARC card catalog data, but apparently the data sets start in the US$20,000 range, so I'm planning to:

  1. slurp what individual "cards" I can from the libraries out there
  2. edit and republish them (there is a lot of variation in quality in the cards I have gotten)
  3. write and publish software to manage it all (with a heavy XML and web slant, but a pilot app is also a minimal requirement :-)

"Zeta" is a Z39.50 implementation in pure perl that would simplify a lot of the scripting I did previously, especially when combined with perl DBI for the database.

Barcode scanning itself is the easy part. The "keyboard wedge" model seems to have basically taken over the industry: any barcode device above the "parts" level will do either ASCII serial or PC keyboard output, and the latter is most common. In the extreme case, Radio Shack has started giving away the ":CueCat", a PS/2-style keyboard wedge barcode scanner, so that you can scan things from the new catalog. There are pointers on Slashdot to information on its privacy-detrimental features, but also to simple code to "decrypt" the scans into normal UPC and ISBN codes.
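Because a keyboard wedge just "types" the digits, reading scans needs nothing more than reading lines of text. The sketch below assumes the scanner sends one code per line (terminated by Enter) and that an extension arrives as its own 5-digit line; real devices vary, and the names classify_scan and read_scans are hypothetical. It does not attempt the :CueCat decoding.

```python
def mod10_check_ok(digits: str) -> bool:
    """Validate the check digit of a UPC-A (12 digits) or EAN-13 (13 digits).

    Counting from the right, the check digit has weight 1 and the weights
    alternate 1, 3; a valid code sums to a multiple of 10.
    """
    total = sum(int(d) * (3 if i % 2 else 1)
                for i, d in enumerate(reversed(digits)))
    return total % 10 == 0


def classify_scan(line: str):
    """Return (kind, code) for one line 'typed' by the scanner."""
    code = line.strip()
    if not (code.isascii() and code.isdigit()):
        return ("unknown", code)
    if len(code) == 13 and mod10_check_ok(code):
        kind = "bookland-ean" if code.startswith("978") else "ean-13"
        return (kind, code)
    if len(code) == 12 and mod10_check_ok(code):
        return ("upc-a", code)
    if len(code) == 5:
        return ("addon-5", code)  # extension field, assumed sent separately
    return ("unknown", code)


def read_scans(stream):
    """Yield (kind, code) for each non-blank line of an iterable of lines."""
    for line in stream:
        if line.strip():
            yield classify_scan(line)
```

Pointing read_scans at sys.stdin in a terminal window is enough to try it with a wedge scanner, since the scanner's output is indistinguishable from typing.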

I'm actually taking the approach of scanning the EAN and UPC together; since the extended-UPC is still unique, I can use it to find the records I have without opening up the book, when signing out books for friends...
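A minimal sketch of that lookup, using Python's built-in sqlite3 module as a stand-in for whatever database ends up being used. The table layout and function names are assumptions of mine, and the second row is a made-up placeholder that only exists to show why the UPC alone is not enough.

```python
import sqlite3


def open_catalog() -> sqlite3.Connection:
    """Create an in-memory catalog keyed on the back-cover UPC plus extension."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE books (
               ean     TEXT,
               upc     TEXT NOT NULL,
               upc_ext TEXT NOT NULL,
               title   TEXT NOT NULL,
               PRIMARY KEY (upc, upc_ext)
           )"""
    )
    return conn


def add_book(conn, ean, upc, upc_ext, title):
    """Record one book under both its EAN and its extended UPC."""
    conn.execute(
        "INSERT INTO books (ean, upc, upc_ext, title) VALUES (?, ?, ?, ?)",
        (ean, upc, upc_ext, title),
    )


def find_by_back_cover(conn, upc, upc_ext):
    """Look a book up from the back cover alone; returns (ean, title) or None."""
    return conn.execute(
        "SELECT ean, title FROM books WHERE upc = ? AND upc_ext = ?",
        (upc, upc_ext),
    ).fetchone()


def count_by_upc(conn, upc):
    """How many catalog records share a bare UPC (ignoring the extension)."""
    return conn.execute(
        "SELECT COUNT(*) FROM books WHERE upc = ?", (upc,)
    ).fetchone()[0]


conn = open_catalog()
add_book(conn, "9780671720872", "7671400599", "72087", "Shards of Honor")
# Placeholder row, NOT a real book: stands in for any other $5.99 Baen paperback.
add_book(conn, None, "7671400599", "99999", "(placeholder title)")
```

Here find_by_back_cover(conn, "7671400599", "72087") returns the "Shards of Honor" record, while count_by_upc(conn, "7671400599") reports two records, which is exactly the ambiguity the extension resolves.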

Anyone interested in this is welcome to contact me directly; there isn't that much relevance to Debian until I get an implementation started. That said, if there's a good XML interface to any of the free *SQL databases in Debian, that would be good to know. Also, if there's actually enough pure Debian support for Java (by which I mean "stuff I can apt-get install, without contrib/non-free") that it becomes a viable development option, I'd like to see that discussed too, though debian-java is a better list for it.