OCR - optical character recognition
Existing documentation
https://help.ubuntu.com/community/OCR
Tests and comparison of several OCR programs:
http://www.mscs.dal.ca/~selinger/ocr-test/
Scanned document
To have a good OCR result, the quality of the source (scanned) document is obviously important.
Contrary to what one might expect, a higher dpi (dots per inch) scan is not necessarily better. Tests have shown that a 300 dpi scan gives slightly better OCR results than a 400 dpi scan, while 600 dpi and 1200 dpi scans failed miserably [1].
Also, digital photographs can be used in lieu of scanned images, but it is difficult to control the lighting conditions and the dpi-equivalent for optimum results.
If you want to scan many documents (e.g. a whole book or more), it is worthwhile to run a few tests first.
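To judge whether a digital photograph is roughly equivalent to a 300 dpi scan, it can help to work out how many pixels a page needs at a given resolution. The following is a minimal sketch, assuming page dimensions given in millimetres; the function name page_pixels is made up for this example.

```shell
#!/bin/sh
# page_pixels WIDTH_MM HEIGHT_MM DPI
# Prints the pixel width and height needed to cover a page at the given dpi.
# 1 inch = 25.4 mm, so pixels = mm * dpi / 25.4, rounded to the nearest integer.
page_pixels() {
    w_mm=$1; h_mm=$2; dpi=$3
    # Multiply by 10 and divide by 254 to stay in integer arithmetic;
    # adding 127 (half of 254) rounds instead of truncating.
    w_px=$(( (w_mm * dpi * 10 + 127) / 254 ))
    h_px=$(( (h_mm * dpi * 10 + 127) / 254 ))
    echo "$w_px $h_px"
}
```

For an A4 page (210 × 297 mm), page_pixels 210 297 300 prints "2480 3508", so a photograph needs roughly 8.7 megapixels covering the page itself to match a 300 dpi scan.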
History and other software
Google announced the open source release of Tesseract in 2006 [2]. Since then, it has been the best open source OCR software, so much so that other open source OCR programs are no longer maintained. However, Tesseract only has a command-line interface.
Kooka (http://kooka.kde.org/) is a GUI OCR front end for KDE. Unfortunately, it is no longer maintained. Any OCR library (like Tesseract) can be plugged into Kooka. If Kooka had a new maintainer, we would have a top-class OCR engine coupled with a familiar and easy-to-use KDE GUI.
Google's Tesseract OCR engine is a quantum leap forward (September 28, 2006):
http://www.linux.com/archive/articles/57222
Tesseract
Tesseract home page:
http://code.google.com/p/tesseract-ocr/
If your page is not in English, make sure to select the right language [3]. For example, for a text in French:
tesseract scan.tif output -l fra
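If you OCR documents in several languages, a small wrapper that takes the language as an argument can save typing. This is only a sketch: the function name ocr_page and the TESSERACT variable are invented for this example, and the underlying command is the same tesseract call shown above. On reasonably recent versions, tesseract --list-langs shows which language data is installed.

```shell
#!/bin/sh
# ocr_page IMAGE [LANG]
# Runs tesseract on IMAGE and writes the text to IMAGE's base name + ".txt".
# LANG defaults to "eng"; use e.g. "fra" for French.
# TESSERACT can be overridden (e.g. TESSERACT=echo) to preview the command
# without running it.
ocr_page() {
    image=$1
    lang=${2:-eng}
    base=${image%.*}    # strip the extension: scan.tif -> scan
    "${TESSERACT:-tesseract}" "$image" "$base" -l "$lang"
}
```

For example, ocr_page scan.tif fra runs tesseract scan.tif scan -l fra, which should leave the recognized text in scan.txt.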
Preparing the image to OCR
When scanning, take some time to familiarize yourself with the scanner settings and find the ones best suited to OCR. If you have many pages to scan, you will save some time pre-processing the images in GIMP before using the character recognition software.
Prepare the image to scan so that:
1) there is only one column of text.
2) the text is the right way up.
The process to prepare scans with GIMP is very simple:
Go to the Image→Mode menu and make sure the image is in RGB or Grayscale mode.
Select from the menu Tools→Color Tools→Threshold and choose an adequate threshold value.
Select from the menu Image→Mode→Indexed and from the options choose 1-bit and no dithering.
Save the image in TIFF format.
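The same preparation can be approximated from the command line with ImageMagick's convert, which is handy when there are many pages. This is a rough sketch rather than an exact equivalent of the GIMP steps: the function name prepare_scan, the CONVERT variable and the default threshold of 50% are assumptions for illustration, and the best threshold depends on your scans.

```shell
#!/bin/sh
# prepare_scan INPUT OUTPUT.tif [THRESHOLD]
# Converts INPUT to grayscale, applies a black/white threshold and asks
# ImageMagick for a bilevel (black and white only) image.
# THRESHOLD defaults to 50%, an arbitrary starting point.
# CONVERT can be overridden (e.g. CONVERT=echo) to preview the command
# without running it.
prepare_scan() {
    input=$1
    output=$2
    threshold=${3:-50%}
    "${CONVERT:-convert}" "$input" -colorspace Gray \
        -threshold "$threshold" -type Bilevel "$output"
}
```

For example, prepare_scan page.jpg page.tif 60% uses a 60% threshold. Try a few values on a sample page and compare the OCR output before processing a whole book.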
Scripts
The best script will be one adapted to your needs. Here are some script snippets that you can adapt:
Call with a .jpg image as argument:
#!/bin/bash
convert "$1" -colorspace Gray OCR.tif
If your scans are not the right way up, you can rotate them:
#!/bin/bash
convert "$1" -colorspace Gray -rotate 90 OCR.tif
To iterate over a large number of images:
#!/bin/sh
PAGES=50         # number of TIF images; they must be named sequentially (Image_1.TIF, Image_2.TIF, ...)
OUTPUT=book.txt  # final output file
touch "$OUTPUT"
for i in $(seq 1 "$PAGES"); do
    # tesseract adds the .txt extension to the output base name
    tesseract "Image_$i.TIF" "page$i" -l fra
    # append this page's text to the final file, then remove the per-page file
    cat "page$i.txt" >> "$OUTPUT"
    rm "page$i.txt"
done
Don't forget the language option if needed (e.g. -l fra).
Make the script executable and put it in your ~/bin folder.
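The installation step could look like the following sketch. The function name install_script is hypothetical, and ~/bin must also be on your PATH for the script to be found by name.

```shell
#!/bin/sh
# install_script FILE
# Marks FILE as executable and copies it into ~/bin,
# creating the folder first if it does not exist yet.
install_script() {
    script=$1
    chmod +x "$script" || return 1
    mkdir -p "$HOME/bin" || return 1
    cp "$script" "$HOME/bin/"
}
```

For example, if you saved the loop above as ocr-book.sh (a made-up name), install_script ocr-book.sh copies it to ~/bin/ocr-book.sh.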
See also...
Scanners (hardware):
http://linux.overshoot.tv/wiki/scanners
- 1. See #103: OCR: highest dpi scan the best?.
- 2. See http://googlecode.blogspot.com/2006/08/announcing-tesseract-ocr.html
- 3. If you are interested in seeing the difference the proper language argument makes, check the documents attached to this issue: #41: Image optimization before OCR.
Issues related to this page:
Project | Summary | Status | Priority | Category | Last updated | Assigned to |
---|---|---|---|---|---|---|
Linux software | OCR: highest dpi scan the best? | active | normal | support request | 14 years 25 weeks |