OCR with tesseract: garbage output
Project: | Linux software |
Component: | Miscellaneous |
Category: | support request |
Priority: | normal |
Assigned: | Unassigned |
Status: | closed |
Related pages: | #10: OCR - optical character recognition |
Description
I'm trying to scan and OCR an old book. I tried with tesseract:
tesseract Scan10005.TIF output
However, the output is only garbage:
4-¤ cv ·—• g O Eq ."*§•**0""¤»·D. O ’“
¤¤;I¤; 4-» °¤cnE Eb D ¤ <¤5b¤.>.
$-4 ·¤·~¢W • O as
Oo 0 _o. _·O 0 ····-· _¤>*~=··»2°=$§»"’·~¤¤=&~~~¤2
C: :1 on s-4
E;] m..,§">mSE¤*—`§_'§.2u—'§.g§E~SE,-¤§§8 m=`?·*`°g
¤. ¤¤> =¤= own-· ¤.2.2E_¤ ¤¤=·*"=¤»¤>¤~·E¤¤-¤
What does it take for OCR to work?
Comments
#1
In French, why I need this:
http://3enjeux.overshoot.tv/billet/48
#2
Oh! I found out why!
The TIFF image had been scanned in the wrong orientation. I had to crop it, rotate it and flatten it, following the instructions given in the Ubuntu wiki:
The process to prepare them with GIMP is very simple:
1. Go to the Image→Mode menu and make sure the image is in RGB or Grayscale mode.
2. Select from the menu Tools→Color Tools→Threshold and choose an adequate threshold value.
3. Select from the menu Image→Mode→Indexed and from the options choose 1-bit and no dithering.
4. Save the image in TIFF format.
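For anyone who has many pages to process, the same preparation can probably be scripted instead of done by hand in GIMP. Below is a minimal sketch in Python, assuming the Pillow library is installed and `tesseract` is on the PATH. The function names, the default threshold of 128, and the file name `prepared.tif` are my own illustrative choices, not something from the Ubuntu wiki; the right threshold and rotation depend on the scan.

```python
import subprocess

from PIL import Image


def prepare_for_ocr(img, threshold=128, rotate_degrees=0, crop_box=None):
    """Return a cropped, rotated, 1-bit copy of img (hypothetical helper).

    Mirrors the manual GIMP steps: grayscale, threshold, 1-bit mode.
    """
    if crop_box is not None:
        # crop_box is (left, upper, right, lower) in pixels
        img = img.crop(crop_box)
    if rotate_degrees:
        # Counter-clockwise rotation; expand=True grows the canvas
        # so that nothing is clipped for 90/270 degree turns.
        img = img.rotate(rotate_degrees, expand=True)
    gray = img.convert("L")
    # Hard threshold straight into 1-bit mode, so no dithering is involved:
    # pixels brighter than the cutoff become white, the rest black.
    return gray.point(lambda p: 255 if p > threshold else 0, mode="1")


def ocr_file(src_path, output_base, tiff_path="prepared.tif", **kwargs):
    """Prepare src_path, save it as TIFF, then run tesseract on it."""
    with Image.open(src_path) as img:
        prepare_for_ocr(img, **kwargs).save(tiff_path, format="TIFF")
    # Same invocation as in the description: tesseract <image> <output base>
    subprocess.run(["tesseract", tiff_path, output_base], check=True)
```

With the scan from the description, a call might look like `ocr_file("Scan10005.TIF", "output", rotate_degrees=180)`. The 180 here is only a guess at how the page was turned; try 90 or 270 if the text still comes out sideways.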
Complete this site's documentation.
#3
#4
Automatically closed -- issue fixed for 2 weeks with no activity.