OCR - optical character recognition

Table of Contents

Existing documentation
Scanned document
History and other software
Tesseract
Preparing the image to OCR
Scripts
See also...

Existing documentation

https://help.ubuntu.com/community/OCR

Tests and comparison of several OCR software:
http://www.mscs.dal.ca/~selinger/ocr-test/

Scanned document

To get a good OCR result, the quality of the source (scanned) document is important.
Contrary to what one might think, a higher scan resolution (dpi, dots per inch) is not necessarily better. Tests have shown that a 300 dpi scan gives slightly better OCR results than a 400 dpi scan, while 600 dpi and 1200 dpi scans failed badly [1].

Also, digital photographs can be used in lieu of scanned images, but it is difficult to control the lighting conditions and the dpi-equivalent for optimum results.

If you want to scan many documents (e.g. a whole book or more), it is worthwhile to run a few tests on sample pages first.
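
A rough way to run such a test, assuming the same page has been scanned at several resolutions into hypothetical files such as page_300.tif and page_400.tif, is to run tesseract on each file and compare how much text was recognized. Word count is only a crude proxy for quality, so the resulting text files should still be proofread.

```shell
#!/bin/sh
# Sketch: OCR the same page scanned at several resolutions and print the
# number of words recognized in each scan. File names are hypothetical.

compare_scans() {
  lang=$1
  shift
  for scan in "$@"; do
    base=${scan%.*}                          # page_300.tif -> page_300
    tesseract "$scan" "$base" -l "$lang" >/dev/null 2>&1
    if [ ! -f "$base.txt" ]; then
      echo "$scan: OCR failed" >&2
      continue
    fi
    # tr strips the padding some wc implementations add
    words=$(wc -w < "$base.txt" | tr -d ' ')
    echo "$scan: $words words"
  done
}

# Run only when a language and at least one file are given, e.g.:
#   ./compare_scans.sh eng page_300.tif page_400.tif
if [ "$#" -ge 2 ]; then
  compare_scans "$@"
fi
```

The text files (page_300.txt, page_400.txt, ...) are left in place so that they can be compared by eye.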

History and other software

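Tesseract was originally developed at Hewlett-Packard, released as open source in 2005, and has been sponsored by Google since 2006 [2]. Since then, it has been regarded as the best open source OCR software, so much so that several other open source OCR programs are no longer maintained. However, Tesseract only has a command-line interface.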

Kooka (http://kooka.kde.org/) is a GUI OCR front end for KDE. Unfortunately, it is no longer maintained. Any OCR library (such as Tesseract) can be plugged into Kooka. If Kooka had a new maintainer, we would have a top-class OCR engine coupled with a familiar and easy-to-use KDE GUI.

Google's Tesseract OCR engine is a quantum leap forward (September 28, 2006):
http://www.linux.com/archive/articles/57222

Tesseract

Tesseract home page:
http://code.google.com/p/tesseract-ocr/

If your page is not in English, make sure to select the right language [3]. For example, for a text in French:
tesseract scan.tif output -l fra
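
The -l option only works if the matching language data is installed. A minimal sketch of a check, assuming a tesseract version that supports the --list-langs option (older releases may not have it); the function names has_lang and ocr_with_lang are hypothetical helpers:

```shell
#!/bin/sh
# Sketch: check that a language pack is installed before running OCR.
# Assumes a tesseract version that supports --list-langs.

has_lang() {
  # --list-langs prints a header line followed by one language code per
  # line; some versions write the list to stderr, hence 2>&1.
  tesseract --list-langs 2>&1 | grep -qx "$1"
}

ocr_with_lang() {
  scan=$1
  out=$2
  lang=$3
  if has_lang "$lang"; then
    tesseract "$scan" "$out" -l "$lang"
  else
    echo "Language data for '$lang' is not installed" >&2
    return 1
  fi
}

# Run only when called with three arguments, e.g.:
#   ./ocr_with_lang.sh scan.tif output fra
if [ "$#" -eq 3 ]; then
  ocr_with_lang "$@"
fi
```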

Preparing the image to OCR

When scanning the image, take some time to familiarize yourself with the scanner settings and find the ones best suited to OCR. If you have many pages to scan, pre-processing the images in GIMP before running the character recognition software will save time on corrections afterwards.

Prepare the image so that:
1) there is only one column of text;
2) the text is the right way up.

The process to prepare scans with GIMP is very simple:
Go to the Image→Mode menu and make sure the image is in RGB or Grayscale mode.
Select from the menu Tools→Color Tools→Threshold and choose an adequate threshold value.
Select from the menu Image→Mode→Indexed and from the options choose 1-bit and no dithering.
Save the image in TIFF format.
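
For many pages, the same steps can be scripted. The following is a minimal sketch that uses ImageMagick's convert instead of GIMP (a substitute tool, not the GIMP procedure itself). The 50% default threshold is an assumption and should be tuned per document; to_bilevel and batch_bilevel are hypothetical function names.

```shell
#!/bin/sh
# Sketch: grayscale -> threshold -> 1-bit TIFF, mirroring the GIMP steps.
# The 50% default threshold is an assumption; tune it for your scans.

to_bilevel() {
  in=$1
  out=$2
  threshold=${3:-50%}
  convert "$in" -colorspace Gray -threshold "$threshold" -type Bilevel "$out"
}

# Convert every .jpg in the current directory to a .tif of the same name.
# An optional first argument overrides the threshold.
batch_bilevel() {
  for img in *.jpg; do
    [ -e "$img" ] || continue              # no match: the pattern stays literal
    to_bilevel "$img" "${img%.jpg}.tif" "$1"
  done
}
```

After sourcing the file, a single page can be converted with `to_bilevel scan.jpg scan.tif 60%`, and a whole directory with `batch_bilevel 60%`.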

Scripts

The best script will be one adapted to your needs. Here are some script snippets that you can adapt:

Call with a .jpg image as argument:

#!/bin/bash
convert "$1" -colorspace Gray OCR.tif

If your scans are not the right way up, you can rotate them:

#!/bin/bash
convert "$1" -colorspace Gray -rotate 90 OCR.tif

To iterate over a large number of images:

#!/bin/sh
PAGES=50 # set to the number of TIF images. The images must be named sequentially (Image_1.TIF, Image_2.TIF, ...).
OUTPUT=book.txt # set to the final output file
touch "$OUTPUT"

for i in $(seq 1 "$PAGES"); do
  # OCR one page, append its text to the output, then remove the per-page file
  tesseract "Image_$i.TIF" "page$i" -l fra
  cat "page$i.txt" >> "$OUTPUT"
  rm "page$i.txt"
done

Don't forget the language attribute if needed (e.g. -l fra).
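
To avoid hard-coding the language, the loop can be wrapped in a function that takes it as a parameter. This is a sketch under the same file naming assumption (Image_1.TIF, Image_2.TIF, ...); ocr_book is a hypothetical name, and unlike the snippet above it starts from an empty output file and skips missing pages.

```shell
#!/bin/sh
# Sketch: the page loop as a function, with page count, language and
# output file as parameters. Missing pages are reported and skipped.

ocr_book() {
  pages=$1
  lang=$2
  output=$3
  : > "$output"                            # start from an empty output file
  i=1
  while [ "$i" -le "$pages" ]; do
    if [ -f "Image_$i.TIF" ]; then
      # Append the page text only if tesseract succeeded
      tesseract "Image_$i.TIF" "page$i" -l "$lang" &&
        cat "page$i.txt" >> "$output" &&
        rm "page$i.txt"
    else
      echo "Image_$i.TIF not found, skipping" >&2
    fi
    i=$((i + 1))
  done
}

# Run only when called with three arguments, e.g.:
#   ./ocr_book.sh 50 fra book.txt
if [ "$#" -eq 3 ]; then
  ocr_book "$@"
fi
```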

Make the script executable and put it in your ~/bin folder.
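
A minimal sketch of that last step, where ocr.sh is a hypothetical script name and install_script a hypothetical helper. It assumes that ~/bin is on your PATH, which depends on your distribution and shell configuration.

```shell
#!/bin/sh
# Sketch: copy a script into a personal bin directory and make it
# executable. The target directory defaults to ~/bin.

install_script() {
  script=$1
  bindir=${2:-$HOME/bin}
  mkdir -p "$bindir" &&
    cp "$script" "$bindir/" &&
    chmod +x "$bindir/$(basename "$script")"
}

# Run only when called with a script name, e.g.:
#   sh install_script.sh ocr.sh
if [ "$#" -ge 1 ]; then
  install_script "$@"
fi
```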
