OC about OCR

Published by rudy. Date posted on February 7, 2012

I RECENTLY needed to scan and post a book excerpt for a class I teach but wanted the pages to be available as text, not images. That meant using optical character recognition (OCR) software, which translates scanned pages into machine-encoded text.

The advantages of storing the pages as text, not images, are clear. First, text files take up a lot less space, which means you can upload and download them much faster than image files. Second, unlike with images, you can easily search through a text file, and copy and paste selections you need when you’re taking notes or quoting material.

Using OCR software in the earlier days was a hit-or-miss business, with some programs calibrated to work only with certain fonts. This often meant if the document you wanted was set in some unusual font, you were out of luck and would probably have to type it in manually. The early programs weren’t very accurate, either, and required a lot of editing afterward, even when they did recognize the font used.

Fortunately, the technology has improved significantly. These days, many OCR programs are smart enough to work with a variety of fonts at accuracy rates usually in the high 90-percent range.

On Windows, there’s a wide selection of paid and free OCR programs. ABBYY FineReader and OmniPage are among the most popular commercial packages, and versions of both are available on the Mac, too. These packages don’t come cheap, though. ABBYY FineReader 11 Professional and OmniPage 18 both cost about $150.

On Linux, I use a free and open source program called OCRFeeder, which can scan page images and distinguish between graphics and text, and perform character recognition on the text portions.

On its backend, OCRFeeder uses Tesseract, considered one of the most accurate free OCR engines available today. Tesseract was first developed by Hewlett-Packard in the mid-1980s and released as open source software in 2005. Its development has been sponsored by Google since 2006.

OCRFeeder did a decent job of recognizing the pages I scanned using my old Epson Perfection 660 scanner at a resolution of 300 dots per inch. Although I did have to go in and clean up the recognized text, it sure beat typing everything in by hand.

Of course, these days, you don’t even need to install any software to do simple OCR jobs. You can simply upload your images to a Web service that will do the character recognition for you, in many instances for free. A number of sites also support multiple languages.

Being somewhat obsessive-compulsive, I fed the same document (a scanned image of the first page of Miguel Syjuco’s novel, Ilustrado) through a number of free Web-based services. My goal was simply to determine the accuracy of the character recognition, and I did no tests on their ability to analyze and retain page layouts.

As a control, I also ran the same page through OCRFeeder, which gave me a 98-percent recognition accuracy—missing only four words in the 230-word page.
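The percentages quoted in this column are word-level figures: the share of words on the page that came through correctly. A minimal sketch of that arithmetic follows. The function name `word_accuracy` is made up for illustration, and the column does not say exactly how misses were counted, so treat this as one plausible reading rather than the method actually used.

```python
def word_accuracy(total_words: int, missed_words: int) -> float:
    """Return the percentage of words recognized correctly."""
    if total_words <= 0:
        raise ValueError("total_words must be positive")
    if not 0 <= missed_words <= total_words:
        raise ValueError("missed_words must be between 0 and total_words")
    return 100.0 * (total_words - missed_words) / total_words


# Four words missed on a 230-word page, as in the OCRFeeder control run.
print(round(word_accuracy(230, 4)))  # 98
```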

At first, a number of the Web sites inexplicably dropped words from either side of the margins, rendering the results next to useless. I soon realized, however, that I could fix this by cropping the image closer to the text, with less white space on either side of the document. After this, the online services I tested gave me accuracy ratings of between 97 percent and 100 percent.
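The cropping step can also be automated. The sketch below assumes the scan has already been loaded as a grid of grayscale values (0 for black, 255 for white). It finds the smallest rectangle that contains every dark pixel and trims the page to that rectangle plus a small margin. The names `text_bounding_box` and `crop_to_text`, the darkness threshold of 200 and the default margin are all illustrative choices, not anything the services themselves prescribe.

```python
from typing import List, Tuple

Pixels = List[List[int]]  # rows of grayscale values, 0 = black, 255 = white


def text_bounding_box(pixels: Pixels, threshold: int = 200) -> Tuple[int, int, int, int]:
    """Return (left, top, right, bottom) around all pixels darker than threshold.

    The right and bottom values are exclusive, so they can be used as slice ends.
    """
    rows = [y for y, row in enumerate(pixels) if any(v < threshold for v in row)]
    cols = [x for row in pixels for x, v in enumerate(row) if v < threshold]
    if not rows:
        raise ValueError("no dark pixels found; nothing to crop to")
    return min(cols), rows[0], max(cols) + 1, rows[-1] + 1


def crop_to_text(pixels: Pixels, margin: int = 1, threshold: int = 200) -> Pixels:
    """Trim white space around the text, keeping `margin` pixels on each side."""
    left, top, right, bottom = text_bounding_box(pixels, threshold)
    height, width = len(pixels), len(pixels[0])
    # Widen the box by the margin without stepping outside the page.
    left, top = max(left - margin, 0), max(top - margin, 0)
    right, bottom = min(right + margin, width), min(bottom + margin, height)
    return [row[left:right] for row in pixels[top:bottom]]


# A tiny 6 x 8 "page" that is white except for two dark pixels.
page = [[255] * 8 for _ in range(6)]
page[2][3] = 0
page[3][4] = 0
cropped = crop_to_text(page, margin=1)
print(len(cropped), len(cropped[0]))  # 4 4
```

On a real 300-dpi scan the margin would be tens of pixels rather than one, and specks of dust near the page edge may need to be cleaned up first, since a single stray dark pixel stretches the box.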

Among the non-registration sites, Sciweavers i2OCR (http://www.sciweavers.org/free-online-ocr) gave me the best accuracy at 99 percent, missing only two words in 230. The maximum image size is 10MB.

Free Online OCR (http://free-online-ocr.com), which initially had problems before I cropped the image, scored a high 99 percent after I narrowed the margins. The site does not specify any limits on image size or the number of pages you can submit.

Free OCR (http://www.free-ocr.com/), with an accuracy score of 98 percent, limits image uploads to 2MB and to no more than 5,000 pixels in width or height. There is also a limit of 10 image uploads an hour.

Two other sites, NewOCR (www.newocr.com) and OCRextrACT (http://www.ocr-extract.com/), both yielded 97-percent accuracy on my test document.

I explored two other avenues for OCR. The first was the free online service of ABBYY FineReader (http://finereader.abbyyonline.com), which requires you to register with a valid e-mail address. The free service limits you to a number of pages a month (you can buy more if you need to), but I couldn’t find any specific page limits on the Web page. That is unfortunate, because ABBYY FineReader Online gave me the only 100-percent accuracy rating among all the free services.

Finally, I also tried using the OCR capabilities of Google Docs, the free online productivity suite. To do this, you simply upload an image to Google Docs and check the box that offers to convert text found in the image to Google Docs format. Doing this results in a document with the original image followed by the converted text. As you might expect from a company that has digitized 15 million books, accuracy was pretty high, at 98 percent.

Chin Wong, Manila Standard Today

Column archives and blog at: http://www.chinwong.com
