OCR: is the highest-dpi scan the best?

Wed, 06/02/2010 - 08:09 - augustin

#!/bin/sh
PAGES=1 # set to the number of pages in the PDF
SOURCE=book.pdf # set to the file name of the PDF
OUTPUT=book.txt # set to the final output file
RESOLUTION=600 # set to the resolution the scanner used (the higher, the better)
touch "$OUTPUT"
# ImageMagick numbers PDF pages from 0, so loop over 0 .. PAGES-1
for i in $(seq 0 $((PAGES - 1))); do
    # render one page at the given density as a monochrome TIFF
    convert -monochrome -density "$RESOLUTION" "${SOURCE}[$i]" "page$i.tif"
    # tesseract writes the recognised text to page$i.txt
    tesseract "page$i.tif" "page$i"
    # append this page to the output, then remove the temporary files
    cat "page$i.txt" >> "$OUTPUT"
    rm "page$i.tif" "page$i.txt"
done

I haven't tried the -density option. Maybe it makes a difference...

Does anyone know better?

Comments

#1

augustin - 06/02/2010 - 08:13

Well, I just tried on a 1200dpi scan:

convert -density 1200 greyscale-1200.TIF 1200.tif
tesseract 1200.tif 1200 -l fra

and the result was even worse.

I'll leave this issue open for more knowledgeable people to comment on, but for me, 300dpi seems good enough.
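
For anyone who wants to check this more systematically, here is a rough sketch that runs tesseract on the same scan at several effective resolutions. It is only an illustration under some assumptions: the function names (scale_percent, ocr_at_dpi, compare_dpis) and the ocr-N.tif output names are made up for this example, the lower resolutions are simulated by shrinking the 1200dpi scan rather than by rescanning, and the word count is a crude signal, not a measure of accuracy. Reading the text files side by side is still the real test.

```shell
#!/bin/sh
# Sketch: OCR one scan at several effective resolutions and compare.
# Assumes ImageMagick (convert) and tesseract are installed.

# Print the -resize percentage that takes a scan from one dpi to another.
# $1 = dpi of the original scan, $2 = target dpi
scale_percent() {
    echo $(( $2 * 100 / $1 ))
}

# Downscale a scan to a target dpi and run tesseract on the result.
# $1 = input image, $2 = its dpi, $3 = target dpi, $4 = tesseract language
ocr_at_dpi() {
    pct=$(scale_percent "$2" "$3")
    # shrink the pixels and record the new resolution in the TIFF metadata
    convert "$1" -resize "${pct}%" -units PixelsPerInch -density "$3" "ocr-$3.tif"
    # tesseract writes its text to ocr-<dpi>.txt
    tesseract "ocr-$3.tif" "ocr-$3" -l "$4"
}

# OCR the scan at each requested dpi and print a word count per run.
# $1 = input image, $2 = its dpi, $3 = language, remaining args = target dpis
compare_dpis() {
    src=$1; src_dpi=$2; lang=$3; shift 3
    for dpi in "$@"; do
        ocr_at_dpi "$src" "$src_dpi" "$dpi" "$lang"
        echo "$dpi dpi: $(wc -w < "ocr-$dpi.txt") words"
    done
}
```

With the file from my test above, it would be called as: compare_dpis greyscale-1200.TIF 1200 fra 300 600 1200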
