OCR: is the highest dpi scan the best?
I did some tests, scanning the same page at 300dpi, 400dpi, 600dpi and 1200dpi. The OCR results were as follows:
The 300dpi scan gave slightly better OCR results than the 400dpi one.
The 600dpi and 1200dpi scans failed completely: the output was nothing but garbage.
Which is nice: it's quicker to scan at 300dpi than at 1200dpi!
However, the 300dpi OCR was not perfect (maybe around 95% accurate, which is good enough).
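The 95% figure above is only a guess. A rough way to put a number on it, assuming a hand-corrected transcript of the page is available, is to count how many of its words also turn up in the OCR output. The sketch below does that with standard tools. The function name `word_accuracy` and the file names in the usage comment are made up for illustration, and the measure ignores word order, so treat it as an approximation rather than a proper OCR error rate.

```shell
#!/bin/sh
# word_accuracy REFERENCE OCR_OUTPUT   (hypothetical helper, not part of tesseract)
# Prints the percentage of words in REFERENCE that also appear in OCR_OUTPUT.
# Words are compared as a multiset, so word order is ignored.
word_accuracy() {
    ref_words=$(mktemp)
    ocr_words=$(mktemp)
    # One word per line, sorted bytewise so that comm sees a consistent order.
    tr -s '[:space:]' '\n' < "$1" | grep -v '^$' | LC_ALL=C sort > "$ref_words"
    tr -s '[:space:]' '\n' < "$2" | grep -v '^$' | LC_ALL=C sort > "$ocr_words"
    total=$(( $(wc -l < "$ref_words") ))
    matched=$(( $(LC_ALL=C comm -12 "$ref_words" "$ocr_words" | wc -l) ))
    rm -f "$ref_words" "$ocr_words"
    if [ "$total" -eq 0 ]; then
        echo 0
    else
        echo $(( matched * 100 / total ))   # integer percentage, rounded down
    fi
}

# Usage, assuming scans named scan-300.tif etc. and a transcript reference.txt:
#   for dpi in 300 400 600 1200; do
#       tesseract "scan-$dpi.tif" "scan-$dpi"
#       echo "$dpi dpi: $(word_accuracy reference.txt "scan-$dpi.txt")%"
#   done
```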
Still, I read in a script I downloaded from somewhere:
PAGES=1 # set to the number of pages in the PDF
SOURCE=book.pdf # set to the file name of the PDF
OUTPUT=book.txt # set to the final output file
RESOLUTION=600 # set to the resolution the scanner used (the higher, the better)
: > "$OUTPUT" # start with an empty output file
for i in $(seq 1 "$PAGES"); do
    # ImageMagick numbers PDF pages from 0, so page $i is index $((i - 1))
    convert -density "$RESOLUTION" "${SOURCE}[$((i - 1))]" -monochrome "page$i.tif"
    # tesseract writes its text to page$i.txt
    tesseract "page$i.tif" "page$i"
    cat "page$i.txt" >> "$OUTPUT"
done
I haven't tried the -density option. Maybe it makes a difference.
Does anyone know better?
Well, I just tried it on a 1200dpi scan:
convert -density 1200 greyscale-1200.TIF 1200.tif
tesseract 1200.tif 1200 -l fra
and the result was even worse.
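One possible reason, stated as an assumption rather than something tested here: for a raster file such as a TIFF, `-density` only sets the resolution recorded in the image and does not change its pixels, so tesseract still sees the full 1200dpi image. To actually shrink the scan, ImageMagick's `-resample` option can be used instead. The sketch below wraps that in a small function. The name `downsample_scan` and the output file names are hypothetical, it assumes the TIFF carries correct resolution metadata from the scanner, and whether the downsampled image OCRs any better is untested.

```shell
#!/bin/sh
# downsample_scan INPUT OUTPUT [DPI]   (hypothetical helper)
# Resamples INPUT to DPI (default 300) with ImageMagick and writes OUTPUT.
# Assumes INPUT records its true scan resolution in its metadata.
downsample_scan() {
    convert "$1" -resample "${3:-300}" "$2"
}

# Usage with the 1200dpi scan from the test above:
#   downsample_scan greyscale-1200.TIF 300.tif 300
#   tesseract 300.tif 300 -l fra
```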
I'll leave this question open for more knowledgeable people to comment on, but for me, 300dpi seems good enough.