Here is a script that I have used on a number of occasions to create searchable PDFs. The input is a set of sequentially numbered .tif files. The output is a searchable .pdf file that contains the original .tif images plus a searchable ASCII text layer.
To keep the output .pdf file size down, large images (over 1,000,000 bytes, which usually means pages with many colors) are reduced to just 4 shades of grey. It could be 50 shades of grey, and that might help provide a few more hits, but that is left as an exercise for the user.
The input files are assumed to be called 00001.tif, 00002.tif and so on, and the output file is called output.pdf. Optical character recognition is carried out using tesseract, which seems to do a good job.
Use at your own risk!
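The script calls quite a few external programs: the Netpbm tools (tifftopnm, ppmtopgm, pnmquant, pnmtotiff, pnmtops), ps2pdf, tesseract, hocr2pdf and pdftk. A minimal preflight sketch along these lines could be run first to report any that are missing. The function name missing_tools and the variable names are my own choices for illustration and are not part of the script itself.

```shell
#!/bin/sh
# Hypothetical helper (not part of the script): print the name of every
# command given as an argument that cannot be found on the PATH.
missing_tools() {
    for tool in "$@"
    do
        command -v "$tool" > /dev/null 2>&1 || echo "$tool"
    done
}

# Programs called by the conversion script.
REQUIRED="tifftopnm ppmtopgm pnmquant pnmtotiff pnmtops ps2pdf tesseract hocr2pdf pdftk"

MISSING=`missing_tools $REQUIRED`
if [ -n "$MISSING" ]
then
    echo "Missing tools:" $MISSING
fi
```

If nothing is printed, everything the script needs was found.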
#!/bin/sh
MAX=99999
CURRENTDIR=`pwd | sed 's#/home/users/Papers/##'`
NPAGES=`ls 0*.tif | wc | awk '{print $1}'`
i=0
for FILE in 0*.tif
do
BASE=`basename $FILE .tif`
i=`expr $i + 1`
# Zero-pad the page counter to five digits (00001, 00002, ...)
d=`printf "%05d" $i`
echo $d " $CURRENTDIR $NPAGES"
# File size in bytes (wc -c is less fragile than parsing ls output)
SIZE=`wc -c < $FILE`
if [ $SIZE -gt 1000000 ]
then
echo "FILE $FILE IS LARGE SO MINIMIZING"
tifftopnm $FILE | ppmtopgm | pnmquant 4 | pnmtotiff -lzw > new.tif
FILE=new.tif
fi
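# Convert the page image to PostScript. Note that $? below only reflects
# the last command in the pipeline (pnmtops).
tifftopnm $FILE 2> tifftopnm.err | ppmtopgm | \
pnmtops -noturn -rle 2> pnmtops.err > tmp.ps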
status=$?
if [ $status -ne 0 ]
then
echo "Initial tifftopnm pipeline failed"
cat tifftopnm.err
cat pnmtops.err
fi
ps2pdf -dEPSCrop tmp.ps
tesseract $FILE $BASE 2> tesseract.err > tesseract.txt
status=$?
if [ $status -ne 0 ]
then
echo "TESSERACT FAILED"
cat tesseract.err
cat tesseract.txt
fi
# Replace angle brackets in the OCR text so hocr2pdf does not mistake them for markup
sed -e 's/</lt/g' -e 's/>/gt/g' $BASE.txt > tmp.txt
mv tmp.txt $BASE.txt
hocr2pdf -n -i $FILE -o tmp2.pdf < $BASE.txt 2> hocr2pdf.err > hocr2pdf.txt
status=$?
if [ $status -ne 0 ]
then
echo "hocr2pdf failed"
cat hocr2pdf.err
cat hocr2pdf.txt
echo "Retrying using hocr"
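# Recent tesseract releases name the hOCR output $BASE.hocr rather than
# $BASE.html; adjust the file name below if that applies to you.
tesseract $FILE $BASE hocr
hocr2pdf -n -s -i $FILE -o tmp2.pdf < $BASE.html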
fi
pdftk tmp.pdf background tmp2.pdf output newpage$d.pdf
if [ $i -eq $MAX ]
then
break
fi
done
pdftk newpage*.pdf cat output output.pdf
# Remove per-page PDFs and temporary files (the .err logs are kept)
rm -f newpage0*.pdf
rm -f tmp.pdf tmp2.pdf tmp.ps new.tif hocr2pdf.txt tesseract.txt
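As for the exercise of using more shades of grey: one way to approach it is to pull the size threshold and the number of grey levels out into variables. What follows is only a sketch. The names LIMIT, GREYS, needs_reduction and reduce_greys are hypothetical and do not appear in the script above; the Netpbm pipeline is the same one the script already uses.

```shell
#!/bin/sh
# Hypothetical settings (not in the original script).
LIMIT=1000000   # size in bytes above which a page is reduced
GREYS=4         # number of grey levels to keep; try 16 or 50

# Succeeds (exit status 0) if file $1 is larger than $2 bytes.
needs_reduction() {
    size=`wc -c < "$1"`
    [ $size -gt $2 ]
}

# Convert TIFF $1 to a greyscale TIFF $3 with at most $2 grey levels.
reduce_greys() {
    tifftopnm "$1" | ppmtopgm | pnmquant "$2" | pnmtotiff -lzw > "$3"
}

# Possible use inside the loop:
# if needs_reduction "$FILE" $LIMIT
# then
#     reduce_greys "$FILE" $GREYS new.tif
#     FILE=new.tif
# fi
```

More grey levels should give tesseract more to work with on faint text, at the cost of a larger output.pdf, so it is worth trying a few values of GREYS on a sample page.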