OCR: is the highest dpi scan the best?
Project: | Linux software |
Component: | Documentation |
Category: | support request |
Priority: | normal |
Assigned: | Unassigned |
Status: | active |
Related pages: | #10: OCR - optical character recognition; #18: Scanners |
I ran some tests, scanning the same page at 300dpi, 400dpi, 600dpi and 1200dpi. The OCR results were as follows:
The 300dpi scan gave slightly better OCR output than the 400dpi one.
The 600dpi and 1200dpi scans failed completely: the output was nothing but garbage.
Which is nice: it's quicker to scan at 300dpi than at 1200dpi!
However, even the best OCR output was not 100% perfect (maybe around 95% correct, which is good enough).
Still, this is what I read in a script I downloaded from somewhere:
#!/bin/sh
PAGES=1 # set to the number of pages in the PDF
SOURCE=book.pdf # set to the file name of the PDF
OUTPUT=book.txt # set to the final output file
RESOLUTION=600 # set to the resolution the scanner used (the higher, the better)
touch "$OUTPUT"
# ImageMagick numbers PDF pages from 0, so the last page is PAGES-1
for i in $(seq 0 $((PAGES - 1))); do
convert -monochrome -density "$RESOLUTION" "$SOURCE[$i]" "page$i.tif"
tesseract "page$i.tif" "page$i"
cat "page$i.txt" >> "$OUTPUT" # append this page's text to the output
rm "page$i.tif" "page$i.txt"
done
I haven't tried the -density option myself. Maybe it makes a difference...
Does anyone know more about this?
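To put a number on "around 95%" instead of eyeballing it, here is a rough sketch that compares the OCR output with a hand-typed copy of the same page. It only counts how many distinct words of the reference also appear in the OCR text, so it ignores word order and repeated words, and punctuation stays attached to the words. The function names `words` and `ocr_accuracy` are made up for this example, and it relies on `mktemp`, which is widely available but not strictly POSIX.

```shell
#!/bin/sh
# words FILE: print the distinct whitespace-separated words of FILE, sorted.
words() {
    tr -s '[:space:]' '\n' < "$1" | sed '/^$/d' | LC_ALL=C sort -u
}

# ocr_accuracy REFERENCE OCR_OUTPUT: print the percentage (0-100, rounded
# down) of distinct reference words that also occur in the OCR output.
ocr_accuracy() {
    ref_words=$(mktemp) || return 1
    ocr_words=$(mktemp) || { rm -f "$ref_words"; return 1; }
    words "$1" > "$ref_words"
    words "$2" > "$ocr_words"
    total=$(wc -l < "$ref_words" | tr -d ' ')
    # comm -12 keeps only the lines present in both sorted lists
    common=$(LC_ALL=C comm -12 "$ref_words" "$ocr_words" | wc -l | tr -d ' ')
    rm -f "$ref_words" "$ocr_words"
    # An empty reference gives nothing to compare against
    [ "$total" -gt 0 ] || { echo 0; return 1; }
    echo $((100 * common / total))
}
```

With a hypothetical hand-typed `reference.txt` and tesseract output `300.txt`, running `ocr_accuracy reference.txt 300.txt` prints a whole-number percentage, which makes it easier to rank the 300dpi, 400dpi, 600dpi and 1200dpi runs against each other.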
Comments
#1
Well, I just tried it on a 1200dpi scan:
convert -density 1200 greyscale-1200.TIF 1200.tif
tesseract 1200.tif 1200 -l fra
and the result was even worse.
I'll leave this issue open so that more knowledgeable people can comment, but for me, 300dpi seems good enough.
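One thing that might be worth trying, though it is untested here: actually shrink the 1200dpi scan down to 300dpi worth of pixels before running tesseract. As far as I understand, `-density` in front of a raster file such as a TIFF only affects the resolution recorded for the image and does not resample the pixels, so the characters stay just as large. The sketch below assumes ImageMagick's `convert` is installed; `scale_percent` and `downsample_scan` are hypothetical helper names, not part of any tool.

```shell
#!/bin/sh
# scale_percent FROM_DPI TO_DPI: print the resize percentage needed to go
# from one scan resolution to another (integer division, rounds down).
scale_percent() {
    echo $((100 * $2 / $1))
}

# downsample_scan INPUT OUTPUT FROM_DPI TO_DPI: shrink the pixels of INPUT
# and ask ImageMagick to record TO_DPI as the resolution of OUTPUT.
# Assumption: ImageMagick's convert is on the PATH.
downsample_scan() {
    convert "$1" -resize "$(scale_percent "$3" "$4")%" \
        -units PixelsPerInch -density "$4" "$2"
}
```

For the scan above this would be `downsample_scan greyscale-1200.TIF 300-from-1200.tif 1200 300`, followed by `tesseract 300-from-1200.tif 300-from-1200 -l fra` (the output file name is just an example). Whether this gets a 1200dpi scan back to the quality of a native 300dpi one is something I can't promise.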
#2