Sunday, June 29, 2008

Open Source OCR on a Mac

Tesseract-OCR is an effective open source project whose code dates from the mid-1990s. The character recognition itself is terrific, so it's not hard to get good results, and it's quite a time-saver.

But, it's not exactly turn-key. You need to:

1. download the source
2. compile it for a Mac
3. download a language file
4. copy it to the appropriate directory
5. run it on TIFF files that need to be renamed to a .tif extension.

Tesseract won't run unless you copy the language file to /usr/local/share/tessdata. Which is strange, because it seems to use it very irregularly. Most of the misread results are simple English words: you get "iist" instead of "list", "lf" instead of "if". It makes you wonder how exactly it is applying this language file.

If you use a Mac application like TextEdit, Word, or OpenOffice, the spell-checker can help you find and fix these in a matter of moments. Still, it's irritating when you have a long document. This software needs to be 'productized'.

So, the actual sequence:

1. go here, and download tesseract-2.03.tar.gz.

2. In a Terminal window (Applications->Utilities), find your download directory, cd there, and:

: gunzip  tesseract-2.03.tar.gz
: tar xvf tesseract-2.03.tar


3. cd to the tesseract-2.03 directory, then:

: ./configure
: make
: sudo make install


4. Go back here, and download tesseract-2.00.eng.tar.gz, then, find your download directory, and:

: gunzip tesseract-2.00.eng.tar.gz 
: tar xvf tesseract-2.00.eng.tar
: cd tessdata
: sudo bash
: cp * /usr/local/share/tessdata/


Then hit control-d to exit the sudo bash shell.

Make a TIFF file, be sure it has a .tif extension, and then issue a command like this:

tesseract document-image.tif document-results


... and then you'll have text in document-results.txt

Works great. It should come standard with a Mac. With a graphical user interface. And some corrections to the way it uses the language file.
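Step 5 is the easiest one to trip over when you have a folder full of scans. A minimal sketch, assuming the scans were saved with a .tiff extension in a single directory; the function name rename_tiff_to_tif is made up for this example:

```shell
# Hypothetical helper: rename every *.tiff file in a directory to *.tif,
# the extension the tesseract command above expects.
rename_tiff_to_tif() {
    dir=${1:-.}
    for f in "$dir"/*.tiff; do
        [ -e "$f" ] || continue      # no matches: the glob stays literal
        mv -- "$f" "${f%.tiff}.tif"
    done
}

# Example use (not run here):
#   rename_tiff_to_tif ~/scans
#   for f in ~/scans/*.tif; do tesseract "$f" "${f%.tif}"; done
```

Files that already end in .tif are left alone.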

Sunday, June 15, 2008

Server heartbeat script

A simple shell script for a GNU/Linux or Unix system, to monitor the uptime of your server and tell you (well, text your cell phone) when something's wrong. I just call this one heartbeat:


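#! /bin/sh -v
# Check the server every 10 minutes; send mail if the heartbeat page fails.
while true
do
    # First make sure our own connection to the web is working.
    if wget --no-dns-cache --no-proxy --delete-after google.com
    then
        # One try, 60-second timeout, against the server's IP address.
        if wget --no-dns-cache --no-proxy -t 1 -T 60 --delete-after \
            your.server.ip.address/beat.php
        then
            echo "success"
        else
            echo "Heartbeat stopped" | mail -s "MACHINE DOWN" -c \
                your@emailaddress.com \
                yourphonenumber@txt.att.net
            sleep 20m
        fi
    fi
    sleep 10m
done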



We need a server-side script ... the beat.php above:


<?php
header("Cache-Control: no-cache, must-revalidate"); // HTTP/1.1
header("Expires: Mon, 26 Jul 1997 01:00:00 GMT"); // Date in the past
?>

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
"http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
<title>Beat</title>
</head>
<body>

<?php print(date("l F d, Y ... g:i A")); ?>

<br>
</body>
</html>



A few explanations:

* We hit Google first just to make sure the web is working. No sense queueing up an unnecessary e-mail to yourself.

* wget's --delete-after option cleans up the downloaded file, since wget is built to fetch and save pages from a URL.

* We use the IP address rather than the domain name, so that a lookup failure in your local resolver is not mistaken for a dead server. If you need to test whether your domain's zone name server is running, create a script for that purpose. Don't rely on this one.

* Most cell phones have a way to receive e-mail / text messages. Try sending a message from your phone to your e-mail address, and reply to it, to discover, and test, your cell phone's e-mail address.

* Of course, all the "cache expire" stuff, and the fresh-content dynamic server-side script, is necessary; otherwise, when your server goes down, some helpful cache server along the way will provide you with the beat.php results anyway, from your numerous previous wgets. It's best to test this, despite my precautions ... there are lots of cache mechanisms being inserted on the web these days.
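One more precaution against stale cached copies, which is not part of the script above: make every request look different by adding a throwaway query string. A minimal sketch, where the function name and the nocache parameter are invented for illustration, 192.0.2.10 is a placeholder address, and date +%s is assumed to be available (it is in GNU and BSD date):

```shell
# Hypothetical helper: append the current Unix time as a query string,
# so each fetch uses a URL that no cache has stored yet.
cache_busting_url() {
    printf '%s?nocache=%s\n' "$1" "$(date +%s)"
}

# Example use inside the heartbeat loop (not run here):
#   wget --no-cache --no-proxy -t 1 -T 60 --delete-after \
#       "$(cache_busting_url 192.0.2.10/beat.php)"
```

beat.php ignores the extra parameter, so nothing changes on the server side.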

Friday, May 23, 2008

Ubuntu command line mail

Ubuntu doesn't come pre-packaged with command line mail. That's strange, because this is one of the simplest, most important commands in Linux / Unix admin and network scripting. It lets you interact with the real world.

Unfortunately, the older mailutils package is deprecated on Ubuntu, even though it's listed in all the Google results for [ "command line mail" ubuntu ].

So, installing good old unix command line mail (e-mail) on Ubuntu, as of late May 2008:

apt-get update
apt-get install postfix
apt-get install mailx

Then use mail freely.
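For scripts, it helps to wrap mail in a small function. A sketch, assuming the usual "mail -s subject recipient" form; the function name notify and the MAILER override are made up here (the override only exists so the function can be tried without sending anything):

```shell
# Hypothetical wrapper: read the message body from stdin and mail it.
# MAILER defaults to "mail"; point it at another command to test safely.
notify() {
    subject=$1
    recipient=$2
    "${MAILER:-mail}" -s "$subject" "$recipient"
}

# Example use (not run here):
#   df -h | notify "disk report" you@example.com
```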

Tuesday, March 18, 2008

Safari form spacing problem

Safari will render a form with extra space at the bottom (typically, right under your submit button), whereas more compliant browsers like Firefox do not. This makes login boxes and other colored form divs and tables render improperly.

The fix is simple ... add this to your opening form tag:

form style="margin: 0px; padding: 0px;"

Thursday, September 27, 2007

Safari crash on mismatched tags

If Apple's Mac web browser, Safari, crashes on your web page, look for mismatched opening and closing tags. With a fairly complex page, using lots of CSS, we found that Safari was crashed by a simple mismatch of h1 (opening) and h2 (closing) tags ... note that we first narrowed the error down to the CSS file, but there was no error to fix there ... the typo was in the HTML that used the CSS file ...

(keywords: crash, crashes, crashed, safari browser)

Tuesday, September 25, 2007

Firefox img border problem

Of course, since Firefox is probably the most compliant browser, this is more an artifact than a problem. But I'd like to help people out with this, so I'll keyword it: problem, bug, issue ...

In Firefox an image (img) with (a href) link tags around it, often shows up with a visually unwanted blue border. The fix is simple. Set " border=0 " in the img tag:

<img src="file.jpg" border="0">

Friday, August 31, 2007

Good habits: lengthen your history

It's important to know what you've done, as a sysadmin. If you're one of a group of admins, history becomes even more important.

The first thing I do on a legacy Linux system is edit /etc/profile, and change HISTSIZE to 100000 or more. Note, this does not erase previous history (although setting it to zero will).
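The change itself is small. A sketch of the lines for /etc/profile, assuming bash; HISTFILESIZE is an extra suggestion that goes beyond the HISTSIZE change described above:

```shell
# Keep a long in-memory history (the setting described above).
HISTSIZE=100000
# Assumption: also raise the cap on the saved history file, which bash
# truncates to HISTFILESIZE lines when the shell exits.
HISTFILESIZE=100000
export HISTSIZE HISTFILESIZE
```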

Of course, "history" records only one side of the conversation. The responses of the machine are just as important. Unfortunately, there's no straightforward mechanism like history, which would automatically log everything you see. Your terminal client can often do this for you, but it would be nice to do it on the machine itself.

To bug the bash people about this, for their next release, go here.

The "script" command is a temporary fix ... it records your session in a transcript file. You type "exit" or ^d to stop recording. It writes into a file called "typescript", unless you specify something different.

Trying to automate this:

... you cannot just put "script" in your .bashrc file ... it will go into an infinite loop, because "script" itself starts a new shell (when you stop this, amusingly, you'll find yourself at the bottom of a few hundred nested shells ...)

You can't put it in .login ... because script starts a shell. When .login exits, so will the "script" shell. But that's not what you need.

What you need to do is set an environment variable in .bash_login, unset it in .bashrc, and then start "script" just once.
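One way that could look, as a sketch rather than a tested recipe. It assumes .bash_login sources .bashrc (a common setup), and the names SCRIPT_PENDING, claim_script_flag and start_script_once are invented for this example:

```shell
# In ~/.bash_login (assumed to source ~/.bashrc afterwards):
#   export SCRIPT_PENDING=1
#   . ~/.bashrc

# In ~/.bashrc:

# Succeeds only while the flag is set, and clears it, so the shell that
# "script" starts (which reads .bashrc again) will not start another one.
claim_script_flag() {
    [ -n "${SCRIPT_PENDING:-}" ] || return 1
    unset SCRIPT_PENDING
    return 0
}

start_script_once() {
    if claim_script_flag; then
        # -a appends to the transcript instead of overwriting it
        script -a "$HOME/typescript"
    fi
}

# Last line of ~/.bashrc (commented out so this sketch is inert):
#   start_script_once
```

Because the flag is unset before "script" runs, the nested shell never sees it, which is what breaks the loop.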

Monday, August 13, 2007

MSN Messenger very broken on Macs

Microsoft's closed-protocol Windows Live / Microsoft Messenger IM environment will work on a Mac under some conditions. If you first authenticate on the machine you first sign in on, you're OK, and the client works passably well (although there are many irritations, like a nearly useless search feature) ... but if you need to take on previously authenticated identities, you're hosed. It simply doesn't work on a Mac. The re-authentication e-mail never gets sent. Also, the web UI, "webmessenger.msn.com", is simply broken for any browser that runs on a Mac (Firefox, Safari, and the old IE) ... it loses connections and messages quite easily.

However, MSN Messenger even works well on 8-year-old machines running Windows 2000 Pro ... set one up so you don't need to buy a new machine and give money towards Microsoft's monopoly-driven attempts to create working software.

Monday, March 26, 2007

cp -r hangs. use cp -R.

My primary purpose for this blog is to make little nuggets that a search engine will find when you have a problem, you haven't slept for three days, and you type in the first thing that comes to you.

I could do this with keywords: cp, cp -r, cp -R, hangs, takes forever, special devices, licq_fifo, linux, unix, large copy, massive copy, etc.

But that's not the way you think when you haven't slept. If you're a human, you ask questions like "why is my cp hanging?" or "my copy is hanging, but why?" or just make statements like "my cp hangs" or "my copy hangs." Very commonly, you think in terms of time-based sequences of statements, as if you were talking to a doctor ... something like "I'm on Linux. I had to copy a lot of data from one disk to another. One disk is remote, using NFS. I used cp -r. It took forever, so I did df, and found that the new disk size wasn't increasing. Then I killed it with control-c. And ran it again as cp -r -v, to see what was going on. It got hung up on a file licq_fifo. I googled licq_fifo, and found out it was a special device. cp -r doesn't follow symbolic links, why would it not recognize special devices? Oh, wait, I remember this, it's a normal problem. I typed "man cp", and then the more recent "info cp". And, right, cp -R copies special device files as special files, instead of trying to read from them the way cp -r does. I should go to sleep."

This is of course the simplest problem, quite quick to resolve. But look at the volume of problem-solving information in the above paragraph. What can we do to make search engines more like expert systems, so that we can preserve more of the world's problem-solving-sequence information? Not just with IT, but with everything involving such traversals of ordered fuzzy graphs?
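A check that can be run before a big copy: look for FIFOs and device files in the source tree. A minimal sketch; the function name is made up, and the -type tests are standard find:

```shell
# Hypothetical helper: list FIFOs (p), block devices (b), character
# devices (c) and sockets (s) under a directory tree.
list_special_files() {
    find "${1:-.}" \( -type p -o -type b -o -type c -o -type s \) -print
}

# Example use (not run here):
#   list_special_files /home
```

Anything it prints (licq_fifo, for instance) is a file that a plain cp -r may try to read forever.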

Friday, March 23, 2007

Discovering a legacy purpose for LDAP / slapd

If you're on a legacy Linux/Unix system, and you notice slapd running a directory service, you might wonder: what is this directory service for? What is slapd doing? Why is ldap here? Why do I have slapd running? What data is slapd serving? [I'm playing with search engines here -- I typed these questions into Google and found nothing quickly.]

A search of the data will answer most of your questions.

ldapsearch

... is a common LDAP client command on Linux/Unix. It will tell you if no LDAP server is available ("Can't contact LDAP server").

But, even with a slapd daemon running, you won't find any data by typing this command with no arguments. You need to tell it where to start looking: the base of the ordered data on the local server.

So, find the config file -- usually somewhere like /etc/ldap/slapd.conf. If it isn't there, use "locate slapd" or "locate ldap" to find it.

In the file, you'll see something vaguely like this:

# The base of your directory
suffix "dc=onething,dc=something,dc=else"

Use the string in quotes to search your directory:

ldapsearch -b "dc=onething,dc=something,dc=else"

You'll see the data, and then the purpose of LDAP on your server should be quite clear.
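The two steps can be strung together. A sketch, assuming the config file has a single quoted suffix line like the one shown; the function name ldap_suffix is invented here, and the -x flag in the usage line (simple authentication) is an assumption about how the server accepts anonymous queries:

```shell
# Hypothetical helper: print the first quoted suffix found in a
# slapd.conf-style file.
ldap_suffix() {
    sed -n 's/^[[:space:]]*suffix[[:space:]][[:space:]]*"\([^"]*\)".*/\1/p' "$1" | head -n 1
}

# Example use (not run here):
#   ldapsearch -x -b "$(ldap_suffix /etc/ldap/slapd.conf)"
```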

Friday, March 16, 2007

BIOS disappears

I'm putting an EIDE drive in a 7-year-old server, which has SCSI disks from that time -- still working, but they're very small.

There's only one startup setting I need to change:

I hit "delete" to enter the "CMOS Setup Utility".

I go to "BIOS Features Setup".

The setting I need to change is "HDD Sequence SCSI/IDE First".

I need to set it to SCSI, which is still the boot drive. The default is IDE ... the shop that made the server didn't need to change this, because there were no IDE drives ... they should have, but oh well.

But, inexplicably, sometimes, the system doesn't boot up properly. What's up?

Oh. 7-year-old server. The little CMOS backup battery is dead. I've been turning the machine off at the power strip, and it doesn't get the 3 volts it needs to keep my one change to the BIOS ...

Sunday, January 28, 2007

In a deadirectory

After some apt-get update and apt-get install, I get this funny error message when I try to restart apache:

"shell-init: could not get current directory: getcwd: cannot access parent directories: No such file or directory"

But the command runs fine.

Something felt familiar ... oh right. This is a UNIX error message for "you are in a directory that no longer exists".
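A quick way to confirm the diagnosis, and to get out. A minimal sketch; the function name is made up, and it only checks whether the path in $PWD still names a directory:

```shell
# Hypothetical helper: succeeds when the path the shell believes it is
# in ($PWD) no longer names a directory.
in_dead_directory() {
    [ ! -d "$PWD" ]
}

# Example use (not run here):
#   in_dead_directory && cd "$HOME"
```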

Monday, January 01, 2007

MySQL broken increment

While changing table names, I changed the name of the primary key column as well, and forgot to set all of its attributes again.

If you don't restate them, things start to break, because the attributes aren't "saved" for you in any way. For example, the auto-increment stops working, the table is no longer functional, and you get this error:

Duplicate entry '0' for key 1

To fix this, add the attributes to the primary key field. This won't hurt the values (at least, it didn't for me):

alter table X change column key_name key_name int(12) unsigned not null auto_increment

Friday, October 27, 2006

Upgrading to PHP 5 -- fix 3

And, of course, there's the famous "backslash" or "magic quotes" problem. The only way to sensibly develop a large PHP app is to turn magic_quotes_gpc Off in the php.ini file. That way, you can get reliable behavior from addslashes & stripslashes. When you port to a new PHP implementation, however, you'll find that, by default, magic_quotes_gpc is On. Edit php.ini, turn it Off, issue a /etc/init.d/apache2 restart, and your text inserts & fetches should be fine.

Sunday, October 22, 2006

Upgrading to php5 -- fix 2

I get:

"Warning: mysql_connect() [function.mysql-connect]: Can't connect to local MySQL server through socket '/var/run/mysqld/mysqld.sock' " ....

I notice that my old command line entrance into mysql isn't working, and I get:

"Can't connect to local MySQL server through socket '/var/run/mysqld/mysqld.sock' "

Ah ... mysqld isn't running. So:

mysqld_safe &

... which finds the old config files fine. Errors disappear. Everything's up.
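The error message names the socket file, so that is the first thing to look for. A minimal sketch; the function name is invented, and the default path is the one from the error above. A missing socket suggests mysqld is down; a socket that is present is not proof that it is up.

```shell
# Hypothetical helper: succeeds if a Unix socket exists at the path
# MySQL clients are trying to use.
mysql_socket_present() {
    [ -S "${1:-/var/run/mysqld/mysqld.sock}" ]
}

# Example use (not run here):
#   mysql_socket_present || echo "no socket: is mysqld running?"
```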

Upgrading to php5

After a major Linux distro upgrade, I got this message (from PHP):

Fatal error: Call to undefined function: mysql_connect() ...

MySQL support wasn't bundled in. So I typed:

apt-get install php5-mysql

... which fixed the problem

Sunday, October 15, 2006

Odd characters in PHP server output

Switched from php4 to php5.

First problem: strange control characters that the browser renders as black diamonds with a question mark inside. Like this: ��

So I saved the html file.

Looked at it with vim -b.

The confused black diamonds were "a0" in hex.

I typed "a0" into Google.

Others had this problem, with Perl. It was a matter of content-type character sets (charset): iso-8859-1 vs. UTF-8 ...

Looked at the HTML output of another server I had (with php4) for phpinfo(). Compared side-by-side with the php5 output. Very useful ... a bunch of sections with settings in tables. It was in a section entitled "HTTP Headers Information", subheading "HTTP Response Headers". The good server had the name/value pair:

Content-Type text/html; charset=iso-8859-1

... the bad, problem server had:

Content-Type text/html; charset=UTF-8

So, I looked at the /etc/php5/apache2/php.ini file.

There's a line that reads:

;default_charset = "iso-8859-1"

I tried removing the leading semi-colon (uncommenting). Typed:

/etc/init.d/apache2 restart

And that fixed it. The diamonds disappeared.
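The vim -b step can also be done from the command line. A rough sketch, with a made-up function name: it counts raw a0 bytes in a saved page. Some legitimate UTF-8 sequences contain a0 bytes too, so a non-zero count is a hint to look closer, not a verdict.

```shell
# Hypothetical helper: count raw 0xa0 bytes in a file, using od to dump
# the file as hex bytes and grep to count the matching ones.
count_a0_bytes() {
    od -A n -v -t x1 "$1" | tr -s ' ' '\n' | grep -c '^a0$' || true
}

# Example use (not run here):
#   count_a0_bytes saved-page.html
```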