Fri Jul 3 15:26:35 PDT 2015

Optimizing a PDF from a Scanned Paper for Text II

I scanned an old paper recently - and was left with a huge PDF. The PDF stored each page as a full-color image, with far more intensity variation per pixel than black text on a white page needs - hence its size. Text can be stored much more efficiently than that!
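If you want to see what a scan is actually storing before going further, pdfimages from poppler can list the embedded images (the -list option is available in reasonably recent poppler releases); each row reports the image's color space and bits per component, which makes it obvious when full-color images are being used for plain text:

pdfimages -list paper.pdf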

So, here is a short script which extracts the pages of the PDF, reduces them to black and white, and reconstructs the PDF. This produced about a factor of ten reduction in size for me, and also improved the legibility, as the text is now high-contrast black on white.

Yes, the script is crude, and it contains a useless use of cat. Clean-up and optimization are left as exercises for anyone interested (one small improvement is sketched after the script) ...

#!/bin/sh

# Loop over the 27 pages of the scanned PDF (the page count is hard-coded).
i=0
while [ $i -lt 27 ]
do
  i=`expr $i + 1`
  echo $i
  # Zero-pad the page number so that newpage*.pdf sorts correctly below.
  d=`echo $i | awk '{printf "%02d",$1}'`
  echo $d
  # Pull out page $i as a single-page PDF.
  pdftk A=paper.pdf cat A$i output page$d.pdf
  # Render that page to a greyscale PGM image.
  pdftoppm -gray page$d.pdf eh
  # Quantize to two levels, dither to pure black and white, and wrap the
  # result in PostScript scaled to an 8.5 inch wide page.
  cat eh-000001.pgm | pnmquant 2 | pgmtopgm | \
  pamditherbw -threshold | pnmtops -nocenter -imagewidth=8.5 > tmp.ps
  # Convert the PostScript back to a compact PDF page.
  ps2pdf -dPDFSETTINGS=/ebook tmp.ps
  mv tmp.pdf newpage$d.pdf
done

# Stitch the processed pages back together into a single PDF.
pdftk newpage*.pdf cat output newcombined.pdf
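As one example of the clean-up mentioned above, the hard-coded page count of 27 can be replaced by asking pdftk for it. A minimal sketch, assuming paper.pdf is in the current directory and that your pdftk prints a NumberOfPages field in its dump_data output:

# Read the page count from pdftk's metadata dump.
n=`pdftk paper.pdf dump_data | awk '/NumberOfPages/ {print $2}'`
echo "paper.pdf has $n pages"

With that in hand, the loop condition becomes while [ $i -lt $n ], and a final ls -lh paper.pdf newcombined.pdf will show how much the size has come down.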

Posted by ZFS | Permanent link | File under: bash