OCR with tesseract: garbage output

Project:Linux software
Component:Miscellaneous
Category:support request
Priority:normal
Assigned:Unassigned
Status:closed
Related pages:#10: OCR - optical character recognition
Description

I'm trying to scan and OCR an old book. I tried with tesseract:
tesseract Scan10005.TIF output

However, the output is only garbage:

4-¤ cv ·—• g O Eq ."*§•**0""¤»·D. O ’“
¤¤;I¤; 4-» °¤cnE Eb D ¤ <¤5b¤.>.
$-4 ·¤·~¢W • O as
Oo 0 _o. _·O 0 ····-· _¤>*~=··»2°=$§»"’·~¤¤=&~~~¤2
C: :1 on s-4
E;] m..,§">mSE¤*—`§_'§.2u—'§.g§E~SE,-¤§§8 m=`?·*`°g
¤. ¤¤> =¤= own-· ¤.2.2E_¤ ¤¤=·*"=¤»¤>¤~·E¤¤-¤

What does it take for OCR to work?

Comments

#1

In French, why I need it:
http://3enjeux.overshoot.tv/billet/48

#2

Oh! I found out why!

The TIFF image was scanned the wrong way around. I had to crop it, rotate it and flatten it, following the instructions given in the Ubuntu wiki:

The process to prepare them with GIMP is very simple:
1. Go to the Image→Mode menu and make sure the image is in RGB or Grayscale mode.
2. Select from the menu Tools→Color Tools→Threshold and choose an adequate threshold value.
3. Select from the menu Image→Mode→Indexed and from the options choose 1-bit and no dithering.
4. Save the image in TIFF format.
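
For anyone who wants to see what those steps do to the pixels, here is a minimal sketch in plain Python. It is not what GIMP runs internally: the function names, the cutoff of 128 and the Rec. 601 luminance weights are my own assumptions for illustration, and the "image" is just a small list of RGB tuples rather than a real TIFF.

```python
def to_grayscale(rgb_rows):
    """Convert rows of (r, g, b) tuples to 0-255 luminance values.

    Uses the Rec. 601 weights; GIMP's own conversion may differ.
    """
    return [
        [round(0.299 * r + 0.587 * g + 0.114 * b) for (r, g, b) in row]
        for row in rgb_rows
    ]


def threshold(gray_rows, cutoff=128):
    """Reduce grayscale to 1-bit: 1 (white) at or above cutoff, else 0 (black)."""
    return [[1 if value >= cutoff else 0 for value in row] for row in gray_rows]


def rotate_180(rows):
    """Turn an upside-down page the right way up (reverse rows and columns)."""
    return [row[::-1] for row in rows[::-1]]


# Hypothetical 2x3 "scan": light paper with two dark ink pixels.
page = [
    [(250, 250, 250), (20, 20, 20), (245, 245, 245)],
    [(240, 240, 240), (235, 235, 235), (10, 10, 10)],
]

binary = threshold(to_grayscale(page))
upright = rotate_180(binary)
```

The cutoff is the value to experiment with: if it is too low, faint ink disappears into the paper, and if it is too high, paper stains turn into black blobs, which tesseract then reads as garbage.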

Complete this site's documentation.

#3

Status:active» fixed

#4

Status:fixed» closed
Related pages:-10: OCR - optical character recognition

Automatically closed -- issue fixed for 2 weeks with no activity.