Wed Oct 3 10:31:10 PDT 2012

Making Scanned PDFs Searchable

I recently posted a scanned version of Bill Shockley's thesis. Scanning a document produces a PDF file that contains only bitmaps of the pages. There is no electronic representation of the words, so the search function of a PDF viewer finds nothing. Web crawlers and hard-drive indexing programs likewise find no keywords to index, so the document does not turn up in searches where you would expect it to.
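
One way to tell whether a given PDF has any embedded text is to run it through a text extractor and see if anything comes out. Here is a minimal sketch, assuming the pdftotext tool from poppler-utils is installed; the function names are made up for illustration.

```python
import subprocess


def has_words(extracted_text):
    """True if the extracted text contains anything other than whitespace."""
    # An image-only PDF typically yields just form feeds and newlines.
    return bool(extracted_text.strip())


def pdf_has_text_layer(pdf_path):
    """Extract text with pdftotext (assumed installed) and check for words."""
    # "-" as the output file tells pdftotext to write to stdout.
    result = subprocess.run(
        ["pdftotext", str(pdf_path), "-"],
        capture_output=True,
        encoding="utf-8",
        errors="replace",
        check=True,
    )
    return has_words(result.stdout)
```

Calling pdf_has_text_layer("thesis.pdf") (a hypothetical file name) on a raw scan should return False, and True once OCR text has been embedded.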

So I had a look around for ways to correct this. I quickly found pdfocr, which takes apart the PDF file, runs OCR on the page images, and reassembles the PDF with searchable text embedded in it. This sounded good in principle, but there were problems in practice, probably because the packages which pdfocr relies on have evolved in the last few years.

However, I found a nice bash script which uses the tesseract OCR package: http://ubuntuforums.org/showthread.php?t=1456756&page=4

The script takes a while to run, but on completion you have a copy of the PDF, of approximately the same size, with embedded searchable text from an OCR run on each page. This makes your PDF file searchable, with a fair chance of finding important phrases, and also indexable by web crawlers and hard disk indexers.
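
For readers who want to see the general shape of the approach (split the PDF into page images, OCR each page, reassemble), here is a rough sketch in Python. It is not the forum script. It assumes pdftoppm and pdfunite from poppler-utils are installed, along with a tesseract build recent enough to support the "pdf" output config. The function names and the 300 dpi default are my own choices, and the output size may differ from what the forum script produces.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path


def rasterize_command(pdf, prefix, dpi=300):
    """Command that renders each PDF page to <prefix>-<n>.png."""
    return ["pdftoppm", "-r", str(dpi), "-png", str(pdf), str(prefix)]


def ocr_command(image, lang="eng"):
    """Command that OCRs one image into a one-page searchable PDF."""
    outputbase = Path(image).with_suffix("")  # tesseract appends ".pdf" itself
    return ["tesseract", str(image), str(outputbase), "-l", lang, "pdf"]


def merge_command(page_pdfs, output):
    """Command that concatenates the per-page PDFs in the given order."""
    return ["pdfunite", *map(str, page_pdfs), str(output)]


def make_searchable(pdf, output, lang="eng", dpi=300):
    """Run the three steps; assumes the external tools are on PATH."""
    for tool in ("pdftoppm", "tesseract", "pdfunite"):
        if shutil.which(tool) is None:
            raise RuntimeError(f"required tool not found on PATH: {tool}")
    with tempfile.TemporaryDirectory() as tmp:
        prefix = Path(tmp) / "page"
        subprocess.run(rasterize_command(pdf, prefix, dpi), check=True)
        # pdftoppm zero-pads page numbers, so a plain sort keeps page order.
        images = sorted(Path(tmp).glob("page-*.png"))
        for image in images:
            subprocess.run(ocr_command(image, lang), check=True)
        page_pdfs = [image.with_suffix(".pdf") for image in images]
        if len(page_pdfs) == 1:
            shutil.copy(page_pdfs[0], output)  # nothing to merge
        else:
            subprocess.run(merge_command(page_pdfs, output), check=True)
```

Usage would be something like make_searchable("scan.pdf", "scan-ocr.pdf"), where both file names are placeholders. The command builders are kept separate from the runner so each step can be inspected or swapped out without running OCR.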


Posted by ZFS