Mon Apr 20 21:11:19 PDT 2015

Creating Searchable PDFs

Here is a script that I have used to create searchable PDFs on a number of occasions. The input is a set of sequentially numbered .tif files. The output is a searchable .pdf file that contains the original .tif page images together with searchable ASCII text.

To keep the output .pdf file small, large images (in practice, pages with many colors) are reduced to just 4 shades of grey. The script applies this to any input file over 1,000,000 bytes. It could be 50 shades of grey, and that might help provide a few more hits, but that is left as an exercise for the user.
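
The size test behind this reduction can be sketched as a small helper. This is only a sketch and not part of the script: the function name is_large and the LIMIT variable are hypothetical, and the 1000000-byte default simply mirrors the threshold the script uses.

```shell
#!/bin/sh
# Hypothetical helper: succeed when file $1 is larger than LIMIT bytes.
LIMIT=${LIMIT:-1000000}

is_large() {
  # Read from stdin so wc prints only the count; tr strips any padding.
  size=$(wc -c < "$1" | tr -d ' ')
  [ "$size" -gt "$LIMIT" ]
}

# Usage: report which pages would be reduced to 4 grey levels.
for f in 0*.tif
do
  [ -e "$f" ] || continue   # nothing matched: the glob stays literal
  if is_large "$f"
  then
    echo "$f: would be reduced with pnmquant 4"
  else
    echo "$f: kept as is"
  fi
done
```

Raising the argument to pnmquant in the script keeps more grey levels at the cost of a larger output file.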

The input files are assumed to be named 00001.tif, 00002.tif, and so on, and the output file is called output.pdf. Optical character recognition is carried out with tesseract, which seems to do a good job.
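
If the scanned pages do not already follow this naming scheme, they can be renamed first. The following is a sketch under assumptions: the source files are taken to match scan*.tif (a hypothetical pattern), their names are assumed to sort in page order already, and the function name page_name is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print the five-digit file name for page number $1.
page_name() {
  printf '%05d.tif\n' "$1"
}

# Usage: rename scan*.tif (assumed pattern) to 00001.tif, 00002.tif, ...
# The glob expands in lexical order, so scan10.tif sorts before scan2.tif.
n=0
for f in scan*.tif
do
  [ -e "$f" ] || continue          # skip if nothing matches
  n=$((n + 1))
  mv -i "$f" "$(page_name "$n")"   # -i asks before overwriting
done
```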

Use at your own risk!

#!/bin/sh

# Stop after this many pages.
MAX=99999
# Directory name and page count, used only in the progress message.
CURRENTDIR=`pwd | sed 's#/home/users/Papers/##'`
NPAGES=`ls 0*.tif | wc -l`

i=0
for FILE in 0*.tif
do
  BASE=`basename $FILE .tif`
  i=`expr $i + 1`
  d=`echo $i | awk '{printf "%05d",$1}'`
  echo $d " $CURRENTDIR $NPAGES"

  SIZE=`ls -ltra $FILE | awk '{print $5}'`
  if [ $SIZE -gt 1000000 ]
  then
    echo "FILE $FILE IS LARGE SO MINIMIZING"
    tifftopnm $FILE | ppmtopgm | pnmquant 4 | pnmtotiff -lzw > new.tif
    FILE=new.tif
  fi
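  # Convert the page to greyscale PostScript. Note that $? below reflects
  # only pnmtops, the last command in the pipeline.
  tifftopnm $FILE 2> tifftopnm.err | ppmtopgm | \
    pnmtops -noturn -rle 2> pnmtops.err > tmp.ps
  status=$?
  if [ $status -ne 0 ]
  then
    echo "Initial tifftopnm pipeline failed"
    cat tifftopnm.err
    cat pnmtops.err
  fi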
  ps2pdf -dEPSCrop tmp.ps
  # First pass: plain-text OCR, written to $BASE.txt.
  tesseract $FILE $BASE 2> tesseract.err > tesseract.txt
  status=$?
  if [ $status -ne 0 ]
  then
    echo "TESSERACT FAILED"
    cat tesseract.err
    cat tesseract.txt
  fi
  sed 's/</lt/g' $BASE.txt | sed 's/>/gt/g' > tmp.txt
  mv tmp.txt $BASE.txt
  hocr2pdf -n -i $FILE -o tmp2.pdf < $BASE.txt 2> hocr2pdf.err > hocr2pdf.txt
  status=$?
  if [ $status -ne 0 ]
  then
    echo "hocr2pdf failed"
    cat hocr2pdf.err
    cat hocr2pdf.txt
    echo "Retrying using hocr"
    tesseract $FILE $BASE hocr
    hocr2pdf -n -s -i $FILE -o tmp2.pdf < $BASE.html
  fi
  pdftk tmp.pdf background tmp2.pdf output newpage$d.pdf

  if [ $i -eq $MAX ]
  then
    break
  fi
done

pdftk newpage*.pdf cat output output.pdf

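# Remove the per-page PDFs and the temporary files created above.
rm -f newpage0*.pdf
rm -f tmp.ps tmp.pdf tmp2.pdf new.tif hocr2pdf.txt tesseract.txt
rm -f tifftopnm.err pnmtops.err tesseract.err hocr2pdf.err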
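
The script above calls several external programs, and it fails partway through if one of them is not installed. A check along the following lines could be run first. It is a sketch only: the function name missing_tools is hypothetical, and the list of programs is taken from the commands the script invokes.

```shell
#!/bin/sh
# Hypothetical helper: print the names of any listed programs not on PATH.
missing_tools() {
  for tool in "$@"
  do
    command -v "$tool" > /dev/null 2>&1 || echo "$tool"
  done
}

# Programs invoked by the script above.
TOOLS="tifftopnm ppmtopgm pnmquant pnmtotiff pnmtops ps2pdf tesseract hocr2pdf pdftk"

# $TOOLS is left unquoted on purpose so that it splits into separate names.
missing=$(missing_tools $TOOLS)
if [ -n "$missing" ]
then
  echo "Missing programs:" $missing
fi
```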

Posted by ZFS