Fri Oct 5 19:04:52 PDT 2012

Reducing the Size of Scanned PDFs

I have had occasion to scan a few papers to PDF format recently. Typically this produces large PDF files, because many bytes are used to represent the RGB value of each pixel of the document image. Of course, the original document is generally black and white, so faithfully representing its color is pointless and makes the PDF file larger than it needs to be.
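As a rough illustration of the numbers involved, here is a small sketch that computes the uncompressed size of one page. The 300 dpi resolution, the 8.5 x 11 inch page, and the helper name page_bytes are assumptions made for the example; real scanners compress their output, so actual files will be smaller than this, but the ratio gives the idea.

```shell
#!/bin/sh
# page_bytes DPI BITS_PER_PIXEL
# Uncompressed size in bytes of one 8.5 x 11 inch page image.
# Hypothetical helper for illustration only; the page size is hard-coded.
page_bytes() {
  awk -v dpi="$1" -v bpp="$2" 'BEGIN {
    pixels = (8.5 * dpi) * (11 * dpi)   # width in pixels * height in pixels
    printf "%d\n", pixels * bpp / 8     # bits per pixel -> bytes
  }'
}

page_bytes 300 24   # 24-bit RGB:       prints 25245000
page_bytes 300 1    # 1-bit monochrome: prints 1051875
```

Under these assumptions a monochrome page needs 24 times fewer bytes than an RGB one before any compression is applied.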

So I made a little script which takes a PDF file written by the scanning program and 'monochromizes' it.

Here is the script - it uses various image processing commands from the Linux world. It is not 'fancy', just utilitarian - use at your own risk! The reduction in size of the PDF file can be significant - so if you are struggling with overly large PDFs, this script (or your own modification of it) may be of value.

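#!/bin/sh
# Pass the scanned PDF as the only argument, e.g. scan.pdf -> scan.bw.pdf
# (the output is written to the current directory).

# Number of pages in the input PDF
NPAGES=`pdftk "$1" dump_data | grep NumberOfPages | awk '{print $2}'`

OUTPUTFILE=`basename "$1" .pdf`.bw.pdf

i=0
while [ "$i" -lt "$NPAGES" ]
do
  i=`expr $i + 1`
  echo $i
  # Zero-padded page number, so that newpage*.pdf expands in page order
  d=`echo $i | awk '{printf "%05d",$1}'`
  echo $d
  # Pull out page $i and render it as a grayscale image
  pdftk A="$1" cat A$i output page$d.pdf
  pdftoppm -gray page$d.pdf tmp
  # Threshold to pure black and white, then wrap the bitmap in PostScript.
  # pdftoppm names its output tmp-000001.pgm or tmp-1.pgm depending on the
  # version, hence the wildcard.
  ppmtopgm tmp-*.pgm | \
           pamthreshold -simple -threshold=0.85 | \
           pnmtops -imagewidth=8.5 > tmp.ps
  ps2pdf -dPDFSETTINGS=/ebook tmp.ps
  mv tmp.pdf newpage$d.pdf
  rm page$d.pdf
done

# Join the monochrome pages back into a single PDF
pdftk newpage*.pdf cat output "$OUTPUTFILE"

rm newpage0*.pdf
rm tmp.ps
rm tmp-*.pgm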
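
The step that does the real work in the script above is the thresholding: every gray pixel is forced to either black or white depending on whether it is brighter than 85% of the maximum value. The sketch below imitates that idea in plain awk on a tiny ASCII image, so it can be tried without netpbm installed. It is not how pamthreshold is implemented; the function name threshold_pgm is made up for the example, it only reads plain ("P2") PGM files without comment lines, and treating a pixel exactly at the cutoff as white is an assumption.

```shell
#!/bin/sh
# threshold_pgm FRACTION
# Reads a plain (ASCII, "P2") PGM on stdin and writes a plain ("P1") PBM
# on stdout. Hypothetical helper for illustration only; header comment
# lines are not handled.
threshold_pgm() {
  awk -v t="$1" '
    {
      for (f = 1; f <= NF; f++) {
        n++                                   # running count of tokens read
        if (n == 1) continue                  # magic number "P2"
        else if (n == 2) w = $f               # image width
        else if (n == 3) { print "P1"; print w, $f }   # PBM header
        else if (n == 4) maxval = $f          # brightest possible value
        else {
          # PBM convention: 1 is black, 0 is white
          bit = ($f >= t * maxval) ? 0 : 1
          col++
          printf "%s%s", bit, (col % w == 0 ? "\n" : " ")
        }
      }
    }'
}

# A 3 x 2 test image with maxval 255; the cutoff is 0.85 * 255 = 216.75
printf 'P2\n3 2\n255\n255 230 200\n10 216 217\n' | threshold_pgm 0.85
```

The sample call prints a P1 header followed by the rows "0 0 1" and "1 1 0": the light pixels (255, 230, 217) become white and the rest become black. A lower threshold keeps faint pencil marks at the cost of more background speckle, so 0.85 is worth tuning for your own scans.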

Posted by ZFS | File under: bash