Wed Nov 4 20:32:29 PST 2015

Searchable PDF From a Set of Image Files

In recent days I have experimented further with tesseract version 3.03 and, as a result, improved the PDF creation script that I use. Here are my thoughts:

Firstly, I decided to make use of the fact that tesseract can now accurately position most words over the underlying image of the word in the scan. This was not the case in the past. The benefit of this improvement is that when you find a word in the text, you can see the part of the image that the tesseract OCR algorithm associated with the word. The downside is that sometimes the spacing between the words is not correct. When scanning to plain text only, I found many fewer mistakes where words are run together. However, the benefit of being able to see what was hit in the search outweighs the drawback of not being able to cut and paste the scanned text freely, in my view.

Secondly, I decided to drop the resolution of scanned color images. I use Group 4 compression for black and white images, and this leads to a page of around 100k in the final OCR'd PDF, or about 20MB for around 200 pages, which is manageable. (100000*200/1000000=20, not accounting for different ideas about the size of a megabyte, etc.) Such files could be much smaller if one were to use jbig2, but jbig2 isn't yet widely supported by PDF viewers, so I decided to hold off on it for now.
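A quick sketch of that arithmetic in shell, in case you want to plug in your own page counts. The function name estimate_pdf_mb is just something I made up for illustration, and it uses decimal megabytes (1,000,000 bytes) to match the sum above.

```shell
#!/bin/sh
# Rough size estimate: pages * bytes-per-page, in whole decimal megabytes.
# estimate_pdf_mb is a hypothetical helper, not part of the script in this post.
estimate_pdf_mb() {
  pages=$1
  bytes_per_page=$2
  echo $(( pages * bytes_per_page / 1000000 ))
}

estimate_pdf_mb 200 100000   # 200 bilevel pages at ~100k each, prints 20
```

The real figure will vary with how dense the text on each page is, so treat this as a ballpark only.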

For color images, reducing the resolution does not seem to hurt too much, and the improvement in viewability is dramatic.
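To get a feel for how much a resolution drop saves, here is a small sketch of the numbers. It assumes the original scans are 600 dpi, which I have not stated anywhere but which matches the density values in the script below, and it uses the 25% scale factor from that script. The function names are hypothetical.

```shell
#!/bin/sh
# Effective resolution after scaling a scan by a percentage.
# Assumption: the original scan is 600 dpi (inferred from the script's
# -density values, not measured).
scaled_dpi() {
  echo $(( $1 * $2 / 100 ))
}

# Percentage of the original pixels left after scaling both dimensions.
pixels_left_pct() {
  awk -v p="$1" 'BEGIN { printf "%.2f\n", p * p / 100 }'
}

scaled_dpi 600 25      # prints 150
pixels_left_pct 25     # prints 6.25
```

So a 25% resize keeps only about one sixteenth of the pixels, which is where most of the saving on color pages comes from.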

Here is the script. It assumes that you have tesseract installed correctly and TESSDATA_PREFIX set up correctly.


#!/bin/sh

MAX=99999
CURRENTDIR=`pwd | sed 's#/home/person/scans/##' | sed 's#/new_method##'`
NPAGES=`ls 0*.tif | wc | awk '{print $1}'`

i=0
for FILE in 0*.tif
do
  BASE=`basename $FILE .tif`
  i=`expr $i + 1`
  # zero-pad the page counter to five digits ($1 is awk's first input field)
  d=`echo $i | awk '{printf "%05d",$1}'`
  echo $d " $CURRENTDIR $NPAGES"
  tifftopnm $FILE > tmp.pnm
  TYPE=`pnmfile tmp.pnm | awk '{print $2}'`
  # PPM means a color image; quote TYPE so an empty value cannot break the test
  if [ "$TYPE" = "PPM" ]
  then
    pnmquant 256 tmp.pnm | pnmtotiff -lzw > tmp.tif
    convert tmp.tif -adaptive-resize 25% -density 150 new$d.tif
  else
    pnmtotiff -g4 tmp.pnm > tmp.tif
    convert tmp.tif -density 600 new$d.tif
  fi
  tesseract new$d.tif newpage$d pdf
  rm new$d.tif
  if [ $i -eq $MAX ]
  then
    break
  fi
done

pdftk newpage*.pdf cat output output.pdf

rm newpage0*.pdf
rm tmp.pnm tmp.tif
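
Since the script assumes all of its tools are present, a preflight check along these lines might save some confusion. This is only a sketch: the names missing_tools and preflight are hypothetical, the tool list is simply every command the script above calls, and the TESSDATA_PREFIX test only matters if your tesseract installation relies on that variable.

```shell
#!/bin/sh
# Print the name of every command in the argument list that is not on PATH.
# missing_tools and preflight are hypothetical helpers for illustration.
missing_tools() {
  for tool in "$@"
  do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

# Return non-zero if a tool used by the script is missing or
# TESSDATA_PREFIX is unset or empty.
preflight() {
  MISSING=`missing_tools tifftopnm pnmfile pnmquant pnmtotiff convert tesseract pdftk`
  if [ -n "$MISSING" ]
  then
    echo "missing tools:" $MISSING >&2
    return 1
  fi
  if [ -z "${TESSDATA_PREFIX:-}" ]
  then
    echo "TESSDATA_PREFIX is not set" >&2
    return 1
  fi
  return 0
}
```

Calling "preflight || exit 1" near the top of the script would stop it before any temporary files are written.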

And here is a comparison of the results: Macaulay's Lord Clive (new script) versus the older version, Macaulay's Lord Clive (old script). The new version is 18MB and the old version is 23MB, and the new version looks nicer because it has color!


Posted by ZFS | File under: bash