
Re: [Bug-ocrad] Not recognizing obvious text


From: Tony Maro
Subject: Re: [Bug-ocrad] Not recognizing obvious text
Date: Sun, 22 Jan 2006 19:13:54 -0500
User-agent: Mozilla Thunderbird 1.0.7 (Windows/20050923)

Antonio Diaz Diaz wrote:

> Tony Maro wrote:
>
>> I've got a sample for you:
>> In the center of the page in large font is "SEPARATOR PAGE".
>> Not a single character is recognized, however it does try to interpret
>> the barcode above it.
>> I do know that if I crop the image to just the barcode and text, and
>> remove all the whitespace it reads it fine.
>
>
> The problem is a large black block to the right of the page. The image
> goes beyond the sheet of paper on the scanner.
>
> The solution is easy, use the option `-l1' or `-l2' to remove the block.
>
> `ocrad -l1 page.pbm'

Ah, thank you, that explains it.  Is there a way to limit the area of
the page that OCR runs on, like zone OCR?  I'm going for
speed.  What I'm actually doing is trying to detect page rotation by
doing OCR on the page one way, and if the ratio of letters to garbage
isn't high enough I flip the page and OCR again.  I've figured out I can
do this on only around 1/4 of the page and get accurate results, and the
OCR doesn't take as long.

I really only need to OCR either the middle of the page or the top left
quarter of the page.  Unfortunately, cropping with ImageMagick is slow.

* I'm using tiffsplit to split around 500 pages into single pages
* I then call tifftopnm and convert a single page to PBM for processing
* I then use ImageMagick convert to crop the PBM into a temp file
* I run ocrad on it and check the produced text
* If the ratio of letters to garbage is greater than 1.8, I assume it's
  right and go on...
* If not, I rotate the PBM and crop again with mogrify
* I run ocrad on the rotated PBM and compare the text again
* If the ratio is better than the first try, I assume the page is upside
  down and rotate the original TIFF page
* When done, I reassemble all the TIFF pages using tiffcp
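
The decision step in the list above can be sketched in Python roughly as
follows.  This is only an illustration: the message does not say how
"letters" and "garbage" are counted, so the sketch assumes letters are
ASCII alphabetic characters, garbage is any non-whitespace character
that is not an ASCII letter or digit, and digits are ignored.  The
function names are made up for the example, and both OCR texts are
passed in as strings even though the real process only runs the second
OCR pass when the first one scores too low.

```python
def letter_garbage_ratio(text: str) -> float:
    """Ratio of letters to garbage characters in OCR output.

    Assumption (not from the original message): letters are ASCII
    alphabetic characters, garbage is anything that is neither
    whitespace nor an ASCII letter/digit. Digits count as neither.
    """
    letters = sum(1 for ch in text if ch.isascii() and ch.isalpha())
    garbage = sum(
        1 for ch in text
        if not ch.isspace() and not (ch.isascii() and ch.isalnum())
    )
    if garbage == 0:
        # No garbage at all: any letters count as a perfect score.
        return float("inf") if letters else 0.0
    return letters / garbage


def page_is_upside_down(text_as_scanned: str, text_rotated: str,
                        threshold: float = 1.8) -> bool:
    """Apply the two-pass check: accept the page if the first ratio
    beats the threshold, otherwise prefer whichever pass scored higher."""
    first = letter_garbage_ratio(text_as_scanned)
    if first > threshold:
        return False  # good enough as scanned, no rotation needed
    return letter_garbage_ratio(text_rotated) > first


if __name__ == "__main__":
    print(page_is_upside_down("_|~^ a", "SEPARATOR PAGE!"))
```

The 1.8 threshold is the value quoted in the list; everything else here
is a guess at one reasonable way to count.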

I actually rotate the pbm rather than use the rotation in ocrad so I can
grab the opposite corner of the document prior to cropping.  There's
generally more text in the top left corner of the page, which leads to
more accurate results.
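
To make the "opposite corner" point concrete, here is a small Python
sketch of the crop geometry.  It assumes "quarter" means half the width
by half the height, and the 1700 x 2200 pixel page (8.5 x 11 inches at
200 dpi) is just an example size; the function names are hypothetical.

```python
def top_left_quarter(width: int, height: int) -> tuple:
    """Crop box (left, top, right, bottom) for the top-left quarter."""
    return (0, 0, width // 2, height // 2)


def region_before_rotation(box: tuple, width: int, height: int) -> tuple:
    """Map a box on the 180-degree-rotated page back to the original page."""
    left, top, right, bottom = box
    return (width - right, height - bottom, width - left, height - top)


if __name__ == "__main__":
    w, h = 1700, 2200  # example page size, an assumption
    box = top_left_quarter(w, h)
    print(box)
    print(region_before_rotation(box, w, h))
```

Cropping the top-left quarter after a 180-degree rotation reads what was
the bottom-right quarter of the page as scanned, which is the top-left
of the text if the page really was upside down.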

For a single document of around 500 pages I've trimmed it down to just
over 7 minutes to do the above checks, correct orientation of any pages
and reassemble the multi-page TIFF.  Accuracy of rotation detection is
around 98% with reasonable quality scans at 200 dpi.  Most pages that
should have been rotated but were not are really bad quality to
begin with.  Out of 500 pages only 2 got rotated that should not have
been.  That's a huge boost considering around 40% of the pages are
upside down when I start, and the original documents are in horrible
shape.  You couldn't even consider doing a true OCR and getting readable
results on at least half the pages.  Many are mostly handwritten as
well, which of course doesn't OCR, but at least are on forms that have
some typed text.

Right now about 3 minutes of that 7 minutes is OCR processing, and 2
minutes is cropping, with the rest split amongst splitting, converting
and reassembling.  If I drop the cropping and try to OCR the entire page
it jumps to over 11 minutes.

Bet you guys never thought ocrad would be used for that, eh? ;-)

So, anyone have an idea that might speed up the process?  I'm already
using an AMD 64 3200+ with a 64 bit kernel.  Right now I'm at about 70
pages per minute, but I'd like to get it to 100 pages per minute
processed.  At that speed I still won't keep up with the scanners, but I
should be able to catch back up every night, or at least over the
weekend.  Yes, you read that right.  I'll be processing as much as
150,000 pages per day on one server, and am designing this process so it
could be clustered to handle more.
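
For anyone who wants to check the arithmetic, here is a rough Python
sketch using only the figures above (500 pages in about 7 minutes,
150,000 pages per day, a 100 pages per minute target).  The timings are
approximate, so treat the results as ballpark numbers.

```python
def hours_needed(pages: int, pages_per_minute: float) -> float:
    """Wall-clock hours to process `pages` at a steady rate."""
    return pages / pages_per_minute / 60


def minutes_per_batch(pages: int, pages_per_minute: float) -> float:
    """Minutes to process one batch of `pages` at a steady rate."""
    return pages / pages_per_minute


if __name__ == "__main__":
    print(round(500 / 7, 1))                      # current rate, pages/min
    print(round(hours_needed(150_000, 70), 1))    # hours per day's scans now
    print(round(hours_needed(150_000, 100), 1))   # hours at the target rate
    print(minutes_per_batch(500, 100))            # budget for a 500-page batch
```

At 70 pages per minute a full day's scanning takes roughly 36 hours to
process, and at 100 pages per minute about 25 hours, so one server
still falls slightly behind each day.  Hitting the target means getting
a 500-page batch down from about 7 minutes to 5, which is about the
same as the 2 minutes currently spent on cropping.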

-Tony



