November 2015 Archives

Sun Nov 29 21:50:33 PST 2015

Exchanging Path1 for Path2 With an .htaccess File

I found that my site's Google search box (implemented a while back as a form that submits a Google search) returned results that looked promising but all pointed to URLs containing the initial string 'blog' rather than 'data'. Evidently the Google index was elderly...

...I had recently decided to replace the string 'blog' with the string 'data' - (the idea behind this change was to disguise the fact that I use Nano-Blogger to prepare the site's pages!). Anyway, my initial way of handling the switch over from 'blog' to 'data' was to add a line in my .htaccess file to send requests for 'blog' pages to the index for the 'data' section of the web site. I hadn't realized that the line in question was sending everything only to the main index of the site - not very useful.

A little experimentation with .htaccess rules, though, yielded a more correct rule, one that changes URLs containing 'blog/' into URLs containing 'data/', which is what I needed. Here is the necessary line for the .htaccess file:

RewriteRule ^blog/(.*) http://www.themolecularuniverse.com/data/$1 [L]

For the record the previous (failing) .htaccess line was:

RewriteRule ^blog/* http://www.themolecularuniverse.com/data/$1

I am not sure where the incorrect line came from - but I am learning that things in parentheses in .htaccess files are important: the parentheses capture the part of the URL that $1 refers to, so without them $1 is empty...
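
To make the role of the parentheses concrete, here is a rough sketch that uses sed as a stand-in for mod_rewrite. It is only an illustration: sed's regular expressions are similar but not identical to Apache's, and rewrite_good and rewrite_bad are hypothetical helper names, not anything Apache provides.

```shell
#!/bin/sh
# Illustration only: sed stands in for mod_rewrite to show what a capture group does.
# rewrite_good and rewrite_bad are hypothetical helper names, not part of Apache.

# With a group: (.*) captures the rest of the path and \1 (like $1) reuses it.
rewrite_good() {
  printf '%s\n' "$1" | sed -E 's#^blog/(.*)#http://www.themolecularuniverse.com/data/\1#'
}

# Without a group: nothing is captured, so there is nothing to append.
# The trailing .* only makes sed replace the whole string, as Apache would.
rewrite_bad() {
  printf '%s\n' "$1" | sed -E 's#^blog/*.*#http://www.themolecularuniverse.com/data/#'
}

rewrite_good "blog/archives/2015/11/index.html"
# prints http://www.themolecularuniverse.com/data/archives/2015/11/index.html
rewrite_bad "blog/archives/2015/11/index.html"
# prints http://www.themolecularuniverse.com/data/
```

The second call loses the page name entirely, which matches the symptom of every old URL landing on an index page.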


Posted by ZFS | Permanent link | File under: blogging

Mon Nov 9 14:58:35 PST 2015

Media Manipulation...

(Anarchists Destroying a Police Car in London)
 

I thought that these two pictures were interesting. The first photograph shows an anarchist attacking a police car on November 5, 2015. The second photograph shows a row of photographers taking a photograph of the same vehicle on the same day. It looks as though the photographers significantly outnumbered the anarchists ... and the value of a photograph outweighs the honour of performing a citizen's arrest...

(Photographers and a Police Car in London)
 

Posted by ZFS | Permanent link | File under: politics

Wed Nov 4 20:32:29 PST 2015

Searchable PDF From a Set of Image Files

In recent days I have experimented further with tesseract version 3.03 and improved the pdf creation script that I use as a result. Here are my thoughts:

Firstly, I decided to make use of the fact that tesseract can now accurately align the majority of words with the underlying image of the word in the scan. This was not the case in the past. The benefit of this improvement is that when you find a word in the text, you can see the part of the image which the tesseract OCR algorithm associated with the word. The downside is that sometimes the spacing between the words is not correct. With a scan to just text, I found many fewer mistakes where words are run together. However, the benefit of being able to see what was hit in the search outweighs the deficit of not being able to cut and paste the scanned text freely, in my view.

Secondly, I decided to drop the resolution of scanned color images. I use Group 4 compression for black and white images, and this leads to a page of around 100k in a final OCRd pdf, or a pdf size of about 20MB for around 200 pages, which is manageable. (100000*200/1000000=20, not accounting for different ideas about the size of a mega, etc.). Such files could be much smaller if one were to use jbig2, but jbig2 isn't yet widely supported by pdf viewers, so I decided to hold off on jbig2 for now.
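
The size estimate above is easy to redo for other page counts. A minimal sketch, assuming decimal megabytes as in the sum above; estimate_mb is a hypothetical helper name of mine:

```shell
#!/bin/sh
# estimate_mb is a hypothetical helper: pages x bytes-per-page, in decimal megabytes.
# Integer arithmetic only, so the result is rounded down.
estimate_mb() {
  pages=$1
  bytes_per_page=$2
  echo $(( pages * bytes_per_page / 1000000 ))
}

estimate_mb 200 100000
# prints 20
```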

For color images, reducing the resolution does not seem to hurt too much, and the improvement in viewability for color images is dramatic.

Here is the script...it assumes that you have tesseract installed correctly and TESSDATA_PREFIX set up correctly.


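#!/bin/sh
# Build one searchable PDF from the scanned pages 0*.tif in the current directory.

MAX=99999   # stop after this many pages (lower it for a quick test run)
# Short label for the progress output: the working directory minus the scan root
CURRENTDIR=`pwd | sed 's#/home/person/scans/##' | sed 's#/new_method##'`
NPAGES=`ls 0*.tif | wc -l | awk '{print $1}'`

i=0
for FILE in 0*.tif
do
  i=`expr $i + 1`
  d=`printf "%05d" $i`   # zero-padded page number keeps the page PDFs in order
  echo "$d $CURRENTDIR $NPAGES"
  tifftopnm "$FILE" > tmp.pnm
  # pnmfile reports the image type (PBM, PGM or PPM) as its second field
  TYPE=`pnmfile tmp.pnm | awk '{print $2}'`
  if [ "$TYPE" = "PPM" ]
  then
    # Color page: quantize to 256 colors, shrink to 25% and tag as 150 dpi
    pnmquant 256 tmp.pnm | pnmtotiff -lzw > tmp.tif
    convert tmp.tif -adaptive-resize 25% -density 150 "new$d.tif"
  else
    # Anything else is treated as black and white: Group 4, tagged as 600 dpi
    pnmtotiff -g4 tmp.pnm > tmp.tif
    convert tmp.tif -density 600 "new$d.tif"
  fi
  # OCR the page; tesseract writes newpage$d.pdf with a text layer
  tesseract "new$d.tif" "newpage$d" pdf
  rm "new$d.tif"
  if [ $i -eq $MAX ]
  then
    break
  fi
done

# Join the per-page PDFs into a single file
pdftk newpage*.pdf cat output output.pdf

rm newpage0*.pdf
rm tmp.pnm tmp.tif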

And here is a comparison of the results: Macaulay's Lord Clive (new script) versus the older version, Macaulay's Lord Clive (old script). The new version is 18MB and the old version 23MB - and the new version looks nicer because it has color...!
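
Since the script assumes that all of its tools are installed, a quick preflight check can save a failed run. This is a minimal sketch rather than part of the script itself; missing_tools is a hypothetical helper name, and the list of commands is simply the set the script calls.

```shell
#!/bin/sh
# missing_tools is a hypothetical helper: it prints each named command
# that cannot be found on PATH, one per line.
missing_tools() {
  for tool in "$@"
  do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

# The external commands the pdf creation script calls.
missing=`missing_tools tifftopnm pnmfile pnmquant pnmtotiff convert tesseract pdftk`
if [ -n "$missing" ]
then
  echo "missing:" $missing
fi
```

If nothing is printed, every command was found; it does not check that TESSDATA_PREFIX is set.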


Posted by ZFS | Permanent link | File under: bash

Tue Nov 3 17:07:12 PST 2015

Compiling Tesseract 3.03 on Ubuntu 12.04 LTS

I decided to install Tesseract 3.03 on my Ubuntu box recently. (I wanted to have the text layer on my scanned PDFs correctly lined up with the underlying page image - Tesseract 3.03 does this.) So I downloaded the appropriate source and set about building.

I had to build and install leptonica first - I used version 1.72. Thereafter, there was a problem with make in the tesseract 'api' directory. I resolved this by simply executing the required command by hand:

# /bin/bash ../libtool --tag=CXX   --mode=link g++     -o tesseract tesseract-tesseractmain.o libtesseract.la   -lrt  -lpthread /usr/local/lib/liblept.a

This is just the original line emitted by the Makefile with the location of the leptonica library (i.e. /usr/local/lib/liblept.a) corrected.

After that, everything was relatively straightforward. I had to download the English 'trained' data from the appropriate site, and then tesseract was ready to use.
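
A quick way to confirm that the trained data ended up where tesseract will look for it is sketched below. It rests on two assumptions of mine: that Tesseract 3.x treats TESSDATA_PREFIX as the directory containing the tessdata directory, and that a source build with the default prefix puts it under /usr/local/share. has_traineddata is a hypothetical helper name.

```shell
#!/bin/sh
# has_traineddata is a hypothetical helper. It assumes the Tesseract 3.x layout,
# where TESSDATA_PREFIX is the directory that contains the tessdata directory.
has_traineddata() {
  prefix=$1
  lang=$2
  [ -f "$prefix/tessdata/$lang.traineddata" ]
}

# /usr/local/share is an assumed default for a source build; adjust as needed.
if has_traineddata "${TESSDATA_PREFIX:-/usr/local/share}" eng
then
  echo "English trained data found"
else
  echo "English trained data not found"
fi
```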


Posted by ZFS | Permanent link | File under: bash