OCR: highest dpi scan the best?

Project:Linux software
Component:Documentation
Category:support request
Priority:normal
Assigned:Unassigned
Status:active
Related pages:#10: OCR - optical character recognition :-:-: #18: Scanners
Description

I did some tests, scanning the same page at 300dpi, 400dpi, 600dpi and 1200dpi. The OCR results were as follows:

The 300dpi scan gave slightly better OCR output than the 400dpi one.
The 600dpi and 1200dpi scans failed completely: the output was only garbage.
Which is nice: it's quicker to scan at 300dpi than at 1200dpi!

However, the OCR was not 100% perfect (maybe around 95%, which is good enough).
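
To put a number on that "around 95%" rather than eyeballing it, one option is to type out a corrected copy of the page by hand and count how many of its words also turn up in the OCR output. Below is a minimal sketch of that idea, not something used in the tests above: the function name ocr_word_accuracy is made up for illustration, and it only counts matching words (word order is ignored).

```shell
#!/bin/sh
# Rough word-level accuracy: percentage of words in the reference text
# that also appear in the OCR output (duplicates counted, order ignored).
# Usage: ocr_word_accuracy OCR_FILE REFERENCE_FILE
ocr_word_accuracy() {
    awk '
        # First file (OCR output): count how often each word occurs.
        NR == FNR { for (i = 1; i <= NF; i++) seen[$i]++; next }
        # Second file (reference): a word is a hit if the OCR output
        # still has an unused occurrence of it.
        {
            for (i = 1; i <= NF; i++) {
                total++
                if (seen[$i] > 0) { seen[$i]--; hit++ }
            }
        }
        END {
            if (total > 0) { printf "%d\n", 100 * hit / total }
            else { print 0 }
        }
    ' "$1" "$2"
}
```

Called as, say, ocr_word_accuracy page-300.txt page-corrected.txt (both file names hypothetical), it prints an integer percentage. Running it on the OCR output for each resolution would make the comparison less subjective.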

Still, I read in a script I downloaded from somewhere:

#!/bin/sh
PAGES=1 # set to the number of pages in the PDF
SOURCE=book.pdf # set to the file name of the PDF
OUTPUT=book.txt # set to the final output file
RESOLUTION=600 # set to the resolution the scanner used (the higher, the better)
touch "$OUTPUT"
# ImageMagick numbers PDF pages from 0, so loop over 0..PAGES-1
for i in $(seq 0 $((PAGES - 1))); do
    # -density goes before the input so the PDF page is rendered at that resolution
    convert -density "$RESOLUTION" "${SOURCE}[$i]" -monochrome "page$i.tif"
    tesseract "page$i.tif" "page$i" # writes page$i.txt
    cat "page$i.txt" >> "$OUTPUT" # append this page to the final output
    rm "page$i.tif" "page$i.txt"
done

I haven't tried the -density bit. Maybe it makes a difference....
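
For reference, as far as I understand it, when convert reads a PDF the -density value is the resolution the page is rendered at, so the pixel size of each page image grows with it. A quick sketch of the arithmetic (page_pixels is just a made-up helper name, and the 8.5 x 11 inch page size is an assumption, not something taken from the script):

```shell
#!/bin/sh
# Pixel dimensions of a page rendered at a given resolution.
# Usage: page_pixels WIDTH_INCHES HEIGHT_INCHES DPI
page_pixels() {
    awk -v w="$1" -v h="$2" -v dpi="$3" \
        'BEGIN { printf "%dx%d\n", w * dpi, h * dpi }'
}

page_pixels 8.5 11 300   # 2550x3300
page_pixels 8.5 11 600   # 5100x6600
```

So going from 300 to 1200 multiplies the number of pixels by 16, which at least explains why everything gets slower at the higher settings.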

Does anyone know better?

Comments

#1

Well, I just tried on a 1200dpi scan:

convert -density 1200 greyscale-1200.TIF 1200.tif
tesseract 1200.tif 1200 -l fra

and the result was even worse.

I'll leave this issue open for more knowledgeable people to comment, but for me, 300dpi seems good enough.
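
One thing that might be worth trying, untested here: shrink the 1200dpi scan down to 300dpi before handing it to tesseract, since 300dpi is what worked. A minimal sketch, assuming ImageMagick's -resize with a percentage; resize_percent and downsample are made-up helper names, and whether this actually improves the OCR is an open question:

```shell
#!/bin/sh
# Percentage to scale an image by to go from one resolution to another.
# Usage: resize_percent SOURCE_DPI TARGET_DPI
resize_percent() {
    echo $(( 100 * $2 / $1 ))
}

# Downsample a scan; the -units/-density pair is meant to record the
# new resolution in the output file.
# Usage: downsample INPUT OUTPUT SOURCE_DPI TARGET_DPI
downsample() {
    pct=$(resize_percent "$3" "$4")
    convert "$1" -resize "${pct}%" -units PixelsPerInch -density "$4" "$2"
}

# Example, reusing the input file name from the commands above:
# downsample greyscale-1200.TIF 300.tif 1200 300
# tesseract 300.tif 300 -l fra
```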

#2

Title:OCR: highest dpi the best? » OCR: highest dpi scan the best?