Using CAPTCHA to digitise old books


This morning in the shower I was listening to Digital Planet, one of my favourite podcasts from the BBC. No, I’m not sad, I just like to maximise my time :)

They interviewed Luis von Ahn about how Carnegie Mellon University is using CAPTCHA technology to help digitise very old books that are in the public domain.

What is CAPTCHA? It is an acronym for “Completely Automated Public Turing test to tell Computers and Humans Apart”. You will have found that many times when you register to use an application on the web, or perhaps when you want to invite someone to be your ‘friend’ on MySpace or leave a message, you see a small clump of distorted letters and have to type what you see into a text box.

The reason for this is that spammers and hackers create bots, which are programs that let them access information while pretending to be real people. For example, there are people running businesses that guarantee you lots of ‘friends’ on MySpace for a fee. Personally I am against this and ultimately it is a waste of time, because just being able to say that you have thousands of friends doesn’t actually help you in any way.

Just to sidetrack for a moment: I have lots of ‘friends’ on my MySpace page, 3967 at last count. They are people who have requested my ‘friendship’ or vice versa, and wherever possible there is a personal relationship behind it. Because of that, I have a fan base that I can contact when I have a concert or gig coming up, even by geography, but that is really a topic for my About Songwriting blog.

Anyway, many organisations are trying to digitise as many books as possible to allow them to be read as eBooks. The best known of these is Project Gutenberg, which has already digitised more than 25,000 books.

The problem with older books, especially those from before 1900, is that the pages are fading and the fonts are harder to read for OCR (Optical Character Recognition) tools, which themselves are still not 100% reliable. On a tangent, I hate reading books that are not perfect. My eBook Unleashing the Road Warrior was edited 12 times to get it as good as possible, and I was disappointed to find an error on page 309 of Stephen King’s latest book, Duma Key, but that’s another story :)

So what the Carnegie Mellon people have done is scan the pages and create a tool which grabs 2 words at a time and feeds them into the CAPTCHA environment. So now when you complete a CAPTCHA that has 2 words instead of random letters, you are not only authenticating that you are indeed a human, you are also helping to transcribe these old books and ensuring their texts are protected for future generations to enjoy. Is that cool or what?
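
For the curious, here is a rough Python sketch of how a 2-word check like this could work. It rests on an assumption of mine: one of the 2 words is a control word the system already knows, and the other is a word the OCR could not read. If you type the control word correctly, your reading of the unknown word is counted as a vote, and once enough people agree the word is accepted. The function names and the threshold of 3 votes are made up for illustration and are not the actual Carnegie Mellon code.

```python
from collections import Counter


def check_captcha(control_word, control_answer, unknown_answer, votes):
    """Pass the user only if the known control word was typed correctly.

    On a pass, the user's reading of the unknown word is recorded as a
    vote in `votes` (a Counter). Hypothetical helper, for illustration.
    """
    if control_answer.strip().lower() != control_word.lower():
        return False  # failed the human test, so ignore the other word
    votes[unknown_answer.strip().lower()] += 1
    return True


def agreed_reading(votes, min_votes=3):
    """Return the most popular reading once it has at least `min_votes`
    votes, otherwise None. The threshold is an assumed value."""
    if not votes:
        return None
    word, count = votes.most_common(1)[0]
    return word if count >= min_votes else None


# Example: four people see the control word "morning" plus a faded scan.
votes = Counter()
check_captcha("morning", "morning", "upon", votes)
check_captcha("morning", "Morning", "upon", votes)
check_captcha("morning", "mourning", "open", votes)  # fails, no vote
check_captcha("morning", "morning", "upon", votes)
print(agreed_reading(votes))
```

The nice part of a design like this is that a bot that cannot read the control word never gets to vote, so only answers from people who have passed the human test go into the transcription.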

Personally I find CAPTCHAs a pain in the proverbial, but having learned this, I am feeling a lot better about them.

While this blog is starting to get a good following, I would love to get more readers to encourage me to keep writing. If you feel that my blog is interesting, I would be very grateful if you would vote for me in the category of best blog at the NetGuide Web Awards. Note that the form starts each site with www, whereas my blog doesn’t and is of course http://luigicappel.wordpress.com.

Thanks so much for your support:)