Sunday, June 29, 2008

Open Source OCR on a Mac

Tesseract-OCR is an effective open source project whose engine dates from the mid-1990s. The actual character recognition is terrific, so it's not hard to get good results, and it's quite a time-saver.

But it's not exactly turnkey. You need to:

1. download the source
2. compile it for a Mac
3. download a language file
4. copy it to the appropriate directory
5. run it on TIFF files, which must have a .tif extension (rename them if necessary).

Tesseract won't run unless you copy the language file to /usr/local/share/tessdata, which is strange, because it seems to use it very inconsistently. Most of the misread results are simple English words: you get "iist" instead of "list", "lf" instead of "if". It makes you wonder how exactly it is applying this language file.

If you use a Mac application like TextEdit, Word, or OpenOffice, the spell-checker can find and help you fix these in a matter of moments. Still, it's irritating when you have a long document. This software needs to be 'productized'.
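
If you'd rather stay in the Terminal, the same idea can be sketched in a few lines of shell: split the OCR output into words and list the ones that aren't in a word list. This is only a rough sketch. flag_unknown_words is a name I made up, and it assumes a one-word-per-line list such as /usr/share/dict/words. It also splits on apostrophes, so expect a little noise.

```shell
#!/bin/sh
# flag_unknown_words TEXTFILE [WORDLIST]   (hypothetical helper)
# Prints each distinct word in TEXTFILE, lowercased, that does not
# appear (ignoring case) as a whole line in WORDLIST.
flag_unknown_words() {
    text=$1
    words=${2:-/usr/share/dict/words}    # assumed default word list
    tr -cs '[:alpha:]' '\n' < "$text" |  # one word per line
        tr '[:upper:]' '[:lower:]' |     # fold to lowercase
        grep -v '^$' |                   # drop empty lines
        sort -u |                        # each word once
        grep -vxiFf "$words"             # keep words missing from the list
}
```

For example, flag_unknown_words document-results.txt prints the suspicious words, one per line, so you know what to search for in the text.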

So, the actual sequence:

1. go here, and download tesseract-2.03.tar.gz.

2. In a Terminal window (Applications->Utilities), find your download directory, cd there, and:

: gunzip tesseract-2.03.tar.gz
: tar xvf tesseract-2.03.tar


3. cd to the tesseract-2.03 directory, then:

: ./configure
: make
: sudo make install


4. Go back here, and download tesseract-2.00.eng.tar.gz, then, find your download directory, and:

: gunzip tesseract-2.00.eng.tar.gz 
: tar xvf tesseract-2.00.eng.tar
: cd tessdata
: sudo bash
: cp * /usr/local/share/tessdata/


Then hit control-d to exit the sudo bash shell.
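
Since Tesseract refuses to run without the language data, a quick check that the copy worked can save some head-scratching. This is only a sketch: have_lang is a made-up helper, not part of Tesseract, and it assumes the language files carry the language code as a prefix (eng.something).

```shell
#!/bin/sh
# have_lang LANG [DIR]   (hypothetical helper)
# Succeeds if at least one file named LANG.* exists in DIR.
have_lang() {
    lang=$1
    dir=${2:-/usr/local/share/tessdata}   # directory used in the steps above
    for f in "$dir/$lang".*; do
        # If nothing matched, the pattern stays literal and -e fails.
        [ -e "$f" ] && return 0
    done
    return 1
}
```

Something like have_lang eng || echo "no English data installed" would then tell you whether step 4 took.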

Make a TIFF file, be sure it has a .tif extension, and then issue a command like this:

tesseract document-image.tif document-results


... and then you'll have text in document-results.txt
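
With more than a page or two, the rename-and-run routine gets tedious, so here is a sketch of how it might be looped. The function names ocr_base and ocr_dir are mine, not Tesseract's, and the only assumption about Tesseract is the command form shown above (image file, then output base name).

```shell
#!/bin/sh
# Hypothetical helpers for running tesseract over a directory of scans.

# Print the output base name for an image: "scans/page1.tif" -> "scans/page1".
ocr_base() {
    printf '%s\n' "${1%.*}"
}

# Rename every *.tiff in a directory to *.tif, then OCR every *.tif.
ocr_dir() {
    dir=${1:-.}
    for f in "$dir"/*.tiff; do
        [ -e "$f" ] || continue              # no .tiff files present
        mv "$f" "${f%.tiff}.tif"
    done
    for f in "$dir"/*.tif; do
        [ -e "$f" ] || continue              # no .tif files present
        tesseract "$f" "$(ocr_base "$f")"    # writes <base>.txt
    done
}
```

Running ocr_dir scans should leave a .txt file next to each image in the scans directory.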

Works great. It should come standard with a Mac. With a graphical user interface. And some fixes to how it uses the language file.

Sunday, June 15, 2008

Server heartbeat script

A simple shell script for a GNU/Linux or Unix system that monitors the uptime of your server and tells you, even calls you (well, texts your cell phone), when something's wrong. I just call this one heartbeat:


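#! /bin/sh -v
# heartbeat: poll the server every 10 minutes and mail an alert if it stops answering.
while true
do
    # First make sure our own connection works, so we don't raise a false alarm.
    if wget --no-dns-cache --no-proxy --delete-after google.com
    then
        # One try, 60-second timeout, against the server's IP address.
        if wget --no-dns-cache --no-proxy -t 1 -T 60 --delete-after \
            your.domain's.IP.address/beat.php
        then
            echo "success"
        else
            echo "Heartbeat stopped" | mail -s "MACHINE DOWN" -c \
                your@emailaddress.com \
                yourphonenumber@txt.att.net
            sleep 1200    # after an alert, wait an extra 20 minutes
        fi
    fi
    sleep 600    # 10 minutes between checks
done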



We need a server-side script ... the beat.php above:


<?php
header("Cache-Control: no-cache, must-revalidate"); // HTTP/1.1
header("Expires: Mon, 26 Jul 1997 01:00:00 GMT"); // Date in the past
?>

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
"http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
<title>Beat</title>
</head>
<body>

<?php print(date("l F d, Y ... g:i A")); ?>

<br>
</body>
</html>
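
To see whether beat.php really comes back fresh rather than from somebody's cache, one approach is to fetch it twice and check that the two results differ. The sketch below reuses the wget options from the heartbeat script; fetch_page and is_fresh are names I made up. Because the page prints the time only to the minute, the default wait between fetches is 61 seconds.

```shell
#!/bin/sh
# Hypothetical helpers for checking that a page is not served from a cache.

# Fetch a URL to standard output with the heartbeat's wget options.
fetch_page() {
    wget -q --no-dns-cache --no-proxy -t 1 -T 60 -O - "$1"
}

# is_fresh URL [SECONDS]
# Returns 0 if two fetches taken SECONDS apart differ,
# 1 if they are identical, and 2 if either fetch fails.
is_fresh() {
    first=$(fetch_page "$1") || return 2
    sleep "${2:-61}"
    second=$(fetch_page "$1") || return 2
    [ "$first" != "$second" ]
}
```

A result of 1 from is_fresh suggests that something between you and the server is handing back a stored copy.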



A few explanations:

* We hit Google first just to make sure our own connection to the web is working. No sense queueing up an unnecessary e-mail to yourself.

* wget's --delete-after option removes the downloaded file once the fetch completes. wget is built to save pages from a URL, and here we only care whether the fetch succeeded.

* wget doesn't error when your local resolver cannot find a domain name, which is why we use the IP address. If you need to test whether your domain's name server is running, create a separate script for that purpose; don't rely on this one.

* Most cell phones have a way to receive e-mail as text messages. Try sending a message from your phone to your e-mail address, and reply to it, to discover, and test, your cell phone's e-mail address.

* Of course, all the "cache expire" headers and the fresh-content dynamic server-side script are necessary. Otherwise, when your server goes down, some helpful cache server along the way will provide you with the beat.php results anyway, from your numerous previous wgets. It's best to test this, despite my precautions; there are lots of cache mechanisms being inserted on the web these days.
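
Because the heartbeat goes straight to the IP address, it will never notice a dead name server. A separate check might be sketched like this, assuming the standard host utility is installed. dns_ok is a made-up name, and the domain and name server in the usage line are placeholders.

```shell
#!/bin/sh
# dns_ok DOMAIN NAMESERVER   (hypothetical helper)
# Succeeds if NAMESERVER answers a lookup for DOMAIN.
dns_ok() {
    host "$1" "$2" >/dev/null 2>&1
}
```

Used as dns_ok your.domain your.name.server || echo "DNS stopped", it can feed the same mail command the heartbeat uses.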