The poor mans OCR

Optical Character Recognition (OCR) has been around a long time. One of it’s main uses, for those not familiar, is to gather text from images.

Off the shelf products, such as Abby’s FineReader exist, with prices ranging from $150 for a single user copy, to up to $10000 for large enterprise ‘site licenses’. But where’s the fun in buying it!

I learnt recently that Microsoft Office 2007 has built in OCR capabilities which can be accessed from C# via a COM interface. I will explain in this post how to leverage these capabilites.

First off, you need to have MS Office 2007 installed. This is obviously a dependency if you develop an application to use the OCR capabilites in the field – it won’t work without Office installed. Furthermore, the OCR capability doesn’t install by default when you install Office, you need to add a component called ‘Microsoft Office Document Imaging’ (MODI).

For instructions on how to add the required component, look here.

Now that you have MODI installed, you can create an OCR application! Boot up Visual Studio and create a new C# console application.

You’ll first need to add a reference to MODI, so we can use it from your application. From the Visual Studio Solution Explorer window, right-click on the ‘References’ folder. When the dialog box appears, select the ‘COM’ tab. Finally, select the object named ‘Microsoft Office Document Imaging 12.0 Type Library’.

The code below will create a new MODI document, retrieve the text, and ouput it word by word to the console, (you could also output it to a text file or a custom XML file, I leave that as an exercise for the reader). I’ve assumed below the image file you wish to retrieve the text from is located at ‘C:\Images’.

// Grab the text from an image
MODI.Document md = new MODI.Document();

// Retrieve the text gathered from the image
MODI.Image image = (MODI.Image)md.Images[0];
MODI.Layout layout = image.Layout;

// Loop through the list of words
for (int j = 0; j < layout.Words.Count; j++) { MODI.Word word = (MODI.Word)layout.Words[j]; Console.WriteLine(word.Text); } md.Close(false);

Notice the 'MODI.MiLANGUAGES.miLANG_ENGLISH' parameter above, this is set to the language you are dealing with. It looks like 22 languages are supported, including Japanese and the Chinese variants.

When I ran this, (using the home page of my McAfee 'Total Protection' 2010 suite as the guinea pig), the results were surprisingly accurate (for English anyway), with only two recognition errors.

Take a look here.

I wonder if it is also as accurate for double byte languages like Japanese etc. I'd also like to check it against an RTL language like Arabic.

Anyway, it's definetely worth a look if you want to develop a custom OCR application on a shoestring budget.