Image optimization before OCR

Project: Linux software
Component: Documentation
Category: support request
Priority: normal
Assigned: Unassigned
Status: closed
Related pages:
#10: OCR - optical character recognition
#17: Which scanner for Ubuntu?
#21: Most Linux friendly scanner manufacturers?
#13: mass processing TIFF images: GIMP scripts

Description

A friend replicated a test OCR from a test scan I did earlier. The text is in French (with accented letters...)

His result is much better than mine: see the attached image (yellow: his result; green: mine).

He said he increased the contrast.

It shows that it pays to optimize the image before trying to OCR it. This comes back to this ticket:
#13: mass processing TIFF images: GIMP scripts
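
The ticket does not say which tool or method was used to increase the contrast, so the following is only a minimal sketch of one common approach, a linear contrast stretch, written in plain Python. It assumes the page has already been loaded as a flat list of 8-bit grayscale values (0 to 255); the function name `stretch_contrast` and its percentile parameters are hypothetical and chosen for illustration.

```python
def stretch_contrast(pixels, low_pct=2.0, high_pct=98.0):
    """Linearly rescale 8-bit grayscale values.

    The values found at the low/high percentiles are mapped to 0 and 255,
    and everything outside that range is clipped. This is an illustrative
    assumption, not the method actually used in this ticket.
    """
    if not pixels:
        return []
    ordered = sorted(pixels)
    last = len(ordered) - 1
    lo = ordered[int(last * low_pct / 100)]
    hi = ordered[int(last * high_pct / 100)]
    if hi <= lo:
        # Flat image: there is no range to stretch, return it unchanged.
        return list(pixels)
    scale = 255.0 / (hi - lo)
    return [min(255, max(0, round((p - lo) * scale))) for p in pixels]
```

Clipping at percentiles rather than at the absolute minimum and maximum keeps a few stray dark or bright specks from limiting the stretch. On real scans the same effect is more conveniently obtained with an image editor or a batch script, as discussed in ticket #13.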

Attachment Size
comparaison-ocr.png 104.87 KB

Comments

#1

Same image, but more compact.

Attachment Size
comparaison_ocr.png 128.59 KB

#2

wiki.

#3

Actually the image in #1 is misleading.
The text is in French and I forgot to pass the -l fra option, so the OCR was not run with the French language setting.

Attached is a comparison between two OCR runs of the same page: in yellow, the original scan (without lang fr), and in green, the optimized result with the proper language setting.

Attachment Size
tesserract_diff_without_and_with_lang_fr.png 111.21 KB
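
For reference, here is a small sketch of how the language option could be passed when calling the OCR from a script. It assumes the `tesseract` command-line tool is on the PATH with the French language data installed; the helper names and the file names in the usage note are hypothetical.

```python
import subprocess


def tesseract_command(image_path, output_base, lang="fra"):
    """Build the tesseract argument list, including the -l language option."""
    return ["tesseract", image_path, output_base, "-l", lang]


def run_ocr(image_path, output_base, lang="fra"):
    """Run tesseract; the recognized text is written to <output_base>.txt."""
    subprocess.run(tesseract_command(image_path, output_base, lang), check=True)
```

For example, `run_ocr("page-001.tif", "page-001")` would be equivalent to typing `tesseract page-001.tif page-001 -l fra` in a shell.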

#4

Attached is the actual comparison between the TIFF file I originally used (yellow) and mose's optimized TIFF file (green).

Attachment Size
tesserract_diff_from_mo_au_source.png 65.13 KB

#5

Status: active » fixed

I finished scanning the book. Overall, what made the most difference was the dpi setting (300 and not more!) and the language setting (don't forget -l fra).

Anyhow, I documented what I could. I finished scanning and OCR'ing what I had.

#6

Status: fixed » closed
Related pages: -10: OCR - optical character recognition

Automatically closed -- issue fixed for 2 weeks with no activity.