Sunday, June 29, 2008

Open Source OCR on a Mac

Tesseract-OCR is an effective open source project from the mid-1990's. The actual character recognition is terrific, so it's not hard to get good results, and it's quite a time-saver.

But, it's not exactly turn-key. You need to:

1. download the source
2. compile it for a Mac
3. download a language file
4. copy it to the appropriate directory
5. run it on TIFF files that need to be renamed to a .tif extension.

Tesseract won't run unless you copy the language file to /usr/local/share/tessdata. Which is strange, because it uses it very irregularly. Most of the miss-read results are simple English words: you get "iist" instead of "list", "lf" instead of "if". It makes you wonder how exactly it is applying this language file.

If you use a Mac utility like Textedit, or Word, or Open Office, the spell-checker can find and help you fix these in a matter of moments. But, still, it's irritating, when you have a long document. This software needs to be 'productized'.

So, the actual sequence:

1. go here, and download tesseract-2.03.tar.gz.

2. In a Terminal window (Applications->Utilities), find your download directory, cd there, and:

: gunzip  tesseract-2.03.tar.gz
: tar xvf tesseract-2.03.tar


3. cd to the tesseract-2.03 directory, then:

:./configure
:sudo make
:sudo make install


4. Go back here, and download tesseract-2.00.eng.tar.gz, then, find your download directory, and:

: gunzip tesseract-2.00.eng.tar.gz 
: tar xvf tesseract-2.00.eng.tar
: cd tessdata
: sudo bash
: cp * /usr/local/share/tessdata/


Then hit control-d to exit the sudo bash shell.

Make a TIFF file, be sure it has a .tif extension, and then issue a command like this:

tesseract document-image.tif document-results


... and then you'll have text in document-results.txt

Works great. It should come standard with a Mac. With a graphic user interface. And some corrections to the language file use.

3 Comments:

Blogger eoin said...

Please help me im very eager to learn and trying my best to research.

i get this error:
eoins-macbook:tesseract-2.03 Eoin$ ./configure
checking build system type... i686-apple-darwin9.6.0
checking host system type... i686-apple-darwin9.6.0
checking for cl.exe... no
checking for g++... no
checking for C++ compiler default output file name... configure: error: C++ compiler cannot create executables
See `config.log' for more details.

i cannot continue with the process.

9:56 AM  
Blogger jafri said...

i get the same error message ...

10:04 AM  
Blogger eoin said...

jafri, try to install
http://developer.apple.com/TOOLS/Xcode/

Xcode for your mac,

Then update all the mac updates you can get.

Then try the process again.

let me know, it was so good when i got this working.

3:48 AM  

Post a Comment

<< Home