Wed Oct 21 21:36:29 PDT 2015

Searchable PDF Output with Tesseract

Tesseract version 3.03 can output a searchable PDF directly. I gave this a try recently. Here is how I got on.

If you scan a document or a book and you want to be able to search that document, you need to employ an OCR program. The OCR program identifies letters and words, and can provide output that you can use to make a searchable PDF, either directly or via a separate program, such as hocr2pdf.
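The separate-program route could look roughly like the sketch below. It assumes that hocr2pdf (from ExactImage) is installed and reads the hOCR data on standard input. The function name and its two arguments are made up for illustration. Depending on the Tesseract release, the hOCR file may be written with a .hocr or a .html extension, so the sketch checks for both.

```shell
#!/bin/sh
# Sketch: build a searchable PDF in two steps (tesseract -> hOCR -> hocr2pdf).
# $1 is the scanned image, $2 the output base name (both hypothetical).
hocr_to_pdf() {
    img=$1
    base=$2
    # Ask tesseract for hOCR output instead of a PDF.
    tesseract "$img" "$base" hocr || return 1
    # Assumption: the hOCR file is $base.hocr, or $base.html on older releases.
    hocr=$base.hocr
    [ -f "$hocr" ] || hocr=$base.html
    # hocr2pdf takes the image with -i, the PDF with -o, and hOCR on stdin.
    hocr2pdf -i "$img" -o "$base.pdf" < "$hocr"
}

# Usage: hocr_to_pdf 000010.png 000010
```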

Here is the simple command line to output a searchable PDF file using Tesseract 3.03.

   tesseract 000010.png 000010 pdf

(Image: highlighting words in a tesseract-produced PDF file)

Used in this way, tesseract writes a searchable PDF to 000010.pdf. The image above shows the resulting file with some text highlighted.
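For a whole scanned book, the same command can be wrapped in a loop. This is only a sketch: the function names ocr_base and ocr_pages are made up here, and it assumes one PNG per page, with each PDF written next to its image.

```shell
#!/bin/sh
# Strip the extension to get tesseract's output base: 000010.png -> 000010
ocr_base() {
    printf '%s\n' "${1%.*}"
}

# Run the single-page command from above on every image given.
ocr_pages() {
    for img in "$@"; do
        # Skip anything that is not a regular file (e.g. an unmatched *.png).
        [ -f "$img" ] || continue
        tesseract "$img" "$(ocr_base "$img")" pdf
    done
}

# Usage: ocr_pages *.png
```

Tesseract appends the .pdf extension itself, which is why only the base name is passed.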

However, I find that if I extract the text using pdf2text, the text from such a file has problems: words are run together.
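One rough way to spot the run-together words is to count unusually long tokens in the extracted text. The sketch below is a heuristic, not a real check: the function name and the 20-character default are arbitrary choices, and the usage line assumes Poppler's pdftotext as the extractor, which may not be the tool used above.

```shell
#!/bin/sh
# Count whitespace-separated tokens longer than a limit (default 20 characters).
# Reads text on stdin; the optional first argument overrides the limit.
count_long_tokens() {
    awk -v max="${1:-20}" '
        { for (i = 1; i <= NF; i++) if (length($i) > max) n++ }
        END { print n + 0 }
    '
}

# Usage (assumes pdftotext is installed; "-" sends the text to stdout):
#   pdftotext 000010.pdf - | count_long_tokens
```

A high count on a page of ordinary prose suggests that word spacing was lost during extraction.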

This is a little frustrating. I can either have searchable PDFs in which I can find the page that contains a certain string, and from which I can easily output large chunks of text to ASCII with correct word spacing; or I can have PDFs in which I can find individual words, but where output of large chunks of text is impeded by word spacing problems.

Currently, I am planning to stick to the solution that gives correct word spacing, though I would much prefer to be able to achieve both objectives!

Here is where I found the necessary information to get tesseract 3.03 built successfully: Installing Tesseract on a Mac. (Thank you to the author.)


Posted by ZFS | Permanent link | File under: bash