New Searchable Scanned Docs and a Common Screw Up
January 5, 2009 by Michael Alexander
I mention Google so often that many people must think I’m getting paid by the company. Just for the record, they don’t pay me. I just can’t help myself–Google is always doing cool things.
Recently, Google said it is now able to use optical character recognition to index scanned documents stored as Adobe PDFs. Previously, the company rarely indexed scanned docs because it couldn’t be sure of the search results. According to Google’s blog:
While we’ve indexed documents saved as PDFs for some time now, scanned documents are a lot more difficult for a computer to read. Scanning is the reverse of printing. Printing turns digital words into text on paper, while scanning makes a digital picture of the physical paper (and text) so you can store and view it on a computer.
To take a test drive of this new-new thing, click on this search query–Steady Success in a Volatile World–and you’ll see an excerpt of the PDF doc in the search results followed by the View as HTML link.
Convert scanned PDFs to text
Tech site Digital Inspiration has an angle on this new feature worth mentioning:
If you have scanned PDF files on your hard drive but lack OCR software, you can still convert them into recognizable text, DI says.
Create a folder on your Website (say your site is abc.com and the folder is named pdf) and upload all your PDFs to that folder. Then, create a public Web page that links to all the PDFs. Wait for the Google searchbots to spider your stuff. After that’s done, type site:abc.com/pdf filetype:pdf into Google to see your PDFs as HTML.
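If you have a lot of PDFs, writing that link page by hand gets old fast. Here is a minimal Python sketch of one way to generate it, not something DI or Google provides. The function names (`build_index_html`, `site_query`, `write_index`) and the `index.html` file name are my own assumptions for illustration, and abc.com/pdf is just the example address from above.

```python
from html import escape
from pathlib import Path
from urllib.parse import quote


def build_index_html(pdf_names, base_url):
    """Return a bare-bones HTML page with one link per PDF file name."""
    base = base_url.rstrip("/")
    links = [
        # quote() makes the file name URL-safe; escape() makes it HTML-safe.
        '<li><a href="{}/{}">{}</a></li>'.format(base, quote(name), escape(name))
        for name in sorted(pdf_names)
    ]
    return "<html><body><ul>\n" + "\n".join(links) + "\n</ul></body></html>\n"


def site_query(domain_and_path):
    """Build the Google query that lists indexed PDFs under a site path."""
    return "site:{} filetype:pdf".format(domain_and_path)


def write_index(folder, base_url):
    """Write index.html (hypothetical name) into a local folder of PDFs."""
    folder = Path(folder)
    names = [p.name for p in folder.glob("*.pdf")]
    (folder / "index.html").write_text(
        build_index_html(names, base_url), encoding="utf-8"
    )
    return names
```

Point `write_index` at the local copy of your PDF folder, upload the folder along with the generated page, and paste the string from `site_query("abc.com/pdf")` into Google once the searchbots have had time to visit. How long that takes is up to Google, not the script.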
Lifehacker adds this twist to converting PDFs to HTML:
You can use Google’s Webmaster Tools to reign in what gets scanned and indexed on your site, although you should assume anything you put online can be found by those looking for it.
Can you tell what’s wrong with Lifehacker’s sentence above? Read “21 Words That Sound Alike but Mean Different Things.” What’s wrong in this picture?






