Sunday, June 29, 2008

Open Source OCR on a Mac

Tesseract-OCR is an effective open source project from the mid-1990s. The actual character recognition is terrific, so it's not hard to get good results, and it's quite a time-saver.

But it's not exactly turnkey. You need to:

1. download the source
2. compile it for a Mac
3. download a language file
4. copy it to the appropriate directory
5. run it on TIFF files that need to be renamed to a .tif extension.

Tesseract won't run unless you copy the language file to /usr/local/share/tessdata, which is strange, because it seems to use that file very inconsistently. Most of the misread results are simple English words: you get "iist" instead of "list", "lf" instead of "if". It makes you wonder how exactly it is applying this language file.

If you use a Mac application like TextEdit, Word, or OpenOffice, the spell-checker can find and help you fix these in a matter of moments. Still, it's irritating when you have a long document. This software needs to be 'productized'.
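If you'd rather not open a word processor, a rough first pass can be done from the Terminal. This is only a sketch: the function name fix_misreads and its two-entry word list are my own, taken from the misreads mentioned above, and you'd extend the list with whatever errors show up in your own documents.

```shell
# fix_misreads: replace known OCR misreadings, matching whole words only.
# Reads text on stdin and writes the corrected text to stdout.
fix_misreads() {
  awk '
    BEGIN {
      # misread -> intended word (example entries; add your own)
      fix["iist"] = "list"
      fix["lf"]   = "if"
    }
    {
      out = ""
      # Walk the line one alphabetic word at a time, keeping
      # everything between words (spaces, punctuation) unchanged.
      while (match($0, /[A-Za-z]+/)) {
        word = substr($0, RSTART, RLENGTH)
        if (word in fix) word = fix[word]
        out = out substr($0, 1, RSTART - 1) word
        $0 = substr($0, RSTART + RLENGTH)
      }
      print out $0
    }
  '
}
```

You'd run it as: fix_misreads < document-results.txt > document-fixed.txt. It only swaps whole words, so "shelf" is left alone, but it will also "fix" a legitimate "lf", so a spell-checker pass is still worth doing.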

So, the actual sequence:

1. Go here and download tesseract-2.03.tar.gz.

2. In a Terminal window (Applications->Utilities), find your download directory, cd there, and:

: gunzip tesseract-2.03.tar.gz
: tar xvf tesseract-2.03.tar

3. cd to the tesseract-2.03 directory, then:

: ./configure
: make
: sudo make install

4. Go back here and download tesseract-2.00.eng.tar.gz, then find your download directory and:

: gunzip tesseract-2.00.eng.tar.gz
: tar xvf tesseract-2.00.eng.tar
: cd tessdata
: sudo bash
: cp * /usr/local/share/tessdata/

Then hit control-d to exit the sudo bash shell.
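If the build in step 3 fails because no C++ compiler can be found, it's usually because the developer tools (Xcode) aren't installed. Here's a small check you can run first. It's a sketch, not part of Tesseract: the function name find_compiler and the default list of compiler names are my assumptions.

```shell
# find_compiler: report the first of the given commands found on the PATH.
# With no arguments it looks for the usual C++ compiler names.
find_compiler() {
  if [ "$#" -eq 0 ]; then
    set -- g++ c++
  fi
  for c in "$@"; do
    if command -v "$c" >/dev/null 2>&1; then
      echo "found: $c"
      return 0
    fi
  done
  echo "no C++ compiler found; install Xcode and try again" >&2
  return 1
}
```

Run find_compiler with no arguments before step 3. If it prints the "no C++ compiler found" message, install Xcode before going any further.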

Make a TIFF file, be sure it has a .tif extension, and then issue a command like this:

tesseract document-image.tif document-results

... and then you'll have text in document-results.txt
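For a whole folder of scans, the rename-and-run routine can be wrapped in a loop. This is a minimal sketch under a few assumptions of mine: the function name ocr_dir is made up, the TESSERACT variable exists only so you can swap in another command (echo, say) for a dry run, and the only Tesseract invocation used is the two-argument form shown above.

```shell
# ocr_dir: run OCR on every .tif or .tiff image in a directory
# (default: the current directory), renaming .tiff files to .tif first.
ocr_dir() {
  dir=${1:-.}
  for img in "$dir"/*.tif "$dir"/*.tiff; do
    [ -e "$img" ] || continue            # pattern matched nothing
    case $img in
      *.tiff)
        tif=${img%.tiff}.tif
        if [ -e "$tif" ]; then           # don't overwrite an existing .tif
          echo "skipping $img: $tif already exists" >&2
          continue
        fi
        mv "$img" "$tif"
        img=$tif
        ;;
    esac
    # Same form as the command above: image file, then output base name.
    "${TESSERACT:-tesseract}" "$img" "${img%.tif}"
  done
}
```

Calling ocr_dir ~/scans should leave a .txt file next to each image. Running TESSERACT=echo ocr_dir ~/scans just prints the commands it would run, although it still renames any .tiff files.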

Works great. It should come standard with a Mac. With a graphical user interface. And some fixes to how it uses the language file.


Blogger Unknown said...

Please help me, I'm very eager to learn and trying my best to research.

I get this error:
eoins-macbook:tesseract-2.03 Eoin$ ./configure
checking build system type... i686-apple-darwin9.6.0
checking host system type... i686-apple-darwin9.6.0
checking for cl.exe... no
checking for g++... no
checking for C++ compiler default output file name... configure: error: C++ compiler cannot create executables
See `config.log' for more details.

I cannot continue with the process.

9:56 AM  
Blogger Jakob Fricke said...

I get the same error message ...

10:04 AM  
Blogger Unknown said...

jafri, try installing Xcode for your Mac.

Then install all the Mac updates you can get.

Then try the process again.

Let me know. It was so good when I got this working.

3:48 AM  
