
OCR: highest dpi scan the best?

Project: Linux software
Component: Documentation
Category: support request
Priority: normal
Assigned: Unassigned
Status: active
Related pages: #10: OCR - optical character recognition; #18: Scanners
Description

I did some tests, scanning the same page at 300dpi, 400dpi, 600dpi and 1200dpi. The OCR results were as follows:

The 300dpi scan gave slightly better OCR than the 400dpi one.
The 600dpi and 1200dpi scans failed completely: the output was only garbage.
Which is nice: it's quicker to scan at 300dpi than at 1200dpi!

However, the OCR was not 100% perfect (maybe around 95%, which is good enough).
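
The "around 95%" above is an eyeball estimate. One rough way to put a number on it is to type up (or otherwise obtain) a correct copy of the page and count how many of its words show up in the OCR output. The sketch below does that; `ocr_accuracy` is a hypothetical helper written for this example, not part of tesseract, and it ignores word order, so treat the result as an approximation only.

```shell
#!/bin/sh
# ocr_accuracy REFERENCE OCR_OUTPUT
# Prints the percentage of words in REFERENCE that also occur in OCR_OUTPUT.
# Words are compared as a multiset: order is ignored, duplicates are counted.
# Hypothetical helper for illustration; not part of tesseract.
ocr_accuracy() {
    tmp=$(mktemp -d) || return 1
    # split each file into one word per line and sort, as comm requires
    tr -s '[:space:]' '\n' < "$1" | sed '/^$/d' | sort > "$tmp/ref"
    tr -s '[:space:]' '\n' < "$2" | sed '/^$/d' | sort > "$tmp/ocr"
    total=$(( $(wc -l < "$tmp/ref") ))
    # comm -12 keeps only the lines common to both sorted lists
    matched=$(( $(comm -12 "$tmp/ref" "$tmp/ocr" | wc -l) ))
    rm -rf "$tmp"
    if [ "$total" -eq 0 ]; then
        echo 0
        return 0
    fi
    echo $(( matched * 100 / total ))
}

# Example (file names are placeholders):
# ocr_accuracy reference.txt scan-300.txt
```

Running it once per resolution against the same reference text would make the 300dpi versus 400dpi comparison less subjective.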

Still, I read in a script I downloaded from somewhere:

#!/bin/sh
PAGES=1          # set to the number of pages in the PDF
SOURCE=book.pdf  # set to the file name of the PDF
OUTPUT=book.txt  # set to the final output file
RESOLUTION=600   # set to the resolution the scanner used (the higher, the better)
touch "$OUTPUT"
for i in $(seq 1 "$PAGES"); do
    # ImageMagick numbers PDF pages from 0, so page $i is index $i - 1
    convert -monochrome -density "$RESOLUTION" "${SOURCE}[$((i - 1))]" "page$i.tif"
    # tesseract writes the recognized text to page$i.txt
    tesseract "page$i.tif" "page$i"
    # append this page's text to the output, then remove the intermediate files
    cat "page$i.txt" >> "$OUTPUT"
    rm "page$i.tif" "page$i.txt"
done

I haven't tried the -density option. Maybe it makes a difference...

Does anyone know better?

Comments

#1

Well, I just tried on a 1200dpi scan:

convert -density 1200 greyscale-1200.TIF 1200.tif
tesseract 1200.tif 1200 -l fra

and the result was even worse.

I'll leave this issue open for more knowledgeable people to comment, but for me, 300dpi seems good enough.
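
One thing not tried in this thread is shrinking the 1200dpi scan down to 300dpi before OCR. As far as I know, on an image that is already a raster (such as a TIFF), -density only relabels the resolution stored in the file and leaves the pixels unchanged, whereas -resize actually changes the pixel dimensions. The sketch below computes the resize percentage and applies it; `scale_percent` and `downsample_for_ocr` are hypothetical helpers, and whether a downsampled scan gives better OCR than a native 300dpi scan is an untested assumption.

```shell
#!/bin/sh
# scale_percent SOURCE_DPI TARGET_DPI
# Prints the whole-number percentage to pass to ImageMagick's -resize
# (integer division, so 600 -> 400 gives 66, not 66.67).
scale_percent() {
    echo $(( $2 * 100 / $1 ))
}

# downsample_for_ocr INPUT OUTPUT SOURCE_DPI TARGET_DPI
# Hypothetical helper: shrinks INPUT and records the new resolution in OUTPUT.
downsample_for_ocr() {
    pct=$(scale_percent "$3" "$4")
    # -resize changes the pixel dimensions; -density then stores the new dpi
    convert "$1" -resize "${pct}%" -units PixelsPerInch -density "$4" "$2"
}

# Example, using the file name from the test above:
# downsample_for_ocr greyscale-1200.TIF 300.tif 1200 300
# tesseract 300.tif 300 -l fra
```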

#2

Title: OCR: highest dpi the best? » OCR: highest dpi scan the best?
