OCR in Plone using Tesseract OCR
This document shows you how to add the ability to OCR documents in Plone using Tesseract OCR. An "OCR Document" action will be added to appropriate files, and when the user chooses this action, a text file with the OCR results will be added to the container.
Purpose
Tesseract OCR is an open source OCR (Optical Character Recognition) engine that is currently sponsored by Google. It can take a TIFF image as input, recognize it and output text. Our goal is to add this functionality into Plone.
We will add an action to TIFF files that OCRs the image and creates a text file with the OCR results in the same container. OCRing an image extracts its text so that it can be edited, and in Plone the extracted text can also be indexed for searches.
First we will create a script that takes an image as its input and returns the results as a string. We will use the Tesseract command line application to accomplish the OCR. Then we will create a script that passes a Plone file to the previous script and creates a text file from the results. Finally we will call the latter script from an action.
I hope to improve this project and add more functionality, such as PDF support. The goals for improvement are listed at the end of the tutorial.
Prerequisites
- Tesseract OCR must be installed for this to work. You can download it here. Read the documentation on how to install it. You can also compile it with libtiff to support compressed TIFF files. Since I could not find any documentation on installing Tesseract and libtiff together, I have documented my experience here.
- You will need to know how to add scripts to Plone and import external scripts. This is not difficult and the basic steps are outlined in this document.
See It in Action
If you want to see this script in action I have set it up on ABillionBillion.com. Just create an account, upload a TIFF image and click OCR Document.
Step by step
- Tesseract OCR must be installed. Read the prerequisites in the introduction for more information.
- Add the following script to your Plone Extensions folder (Example: /opt/Plone-x.x.x/zinstance/Extensions) and call it ocrfile.py. The script contains a function that takes image data, writes it to disk, runs Tesseract from the command line and returns the output text:

    import os
    import tempfile

    def ocrfile(self, f):
        # Path to the Tesseract executable
        tess = '/usr/local/bin/tesseract'
        # Work in a private temporary directory
        dir1 = tempfile.mkdtemp()
        txtfilename = dir1 + '/output'
        imagefilename = dir1 + '/image.tif'
        # Write the image data to disk so Tesseract can read it
        imagefile = open(imagefilename, "wb")
        imagefile.write(f)
        imagefile.close()
        # Run Tesseract and wait for it; it appends '.txt' to the output name
        os.spawnv(os.P_WAIT, tess, (tess, imagefilename, txtfilename))
        textfile = open(txtfilename + '.txt', "r")
        s = textfile.read()
        textfile.close()
        # Clean up the temporary files and directory
        os.remove(txtfilename + '.txt')
        os.remove(imagefilename)
        os.rmdir(dir1)
        return s

The variable "tess" contains the path to tesseract (the one above was Tesseract's default location after installation on Linux); change it if it is different on your system.
Before importing this script you can test it on the file system by saving the following script as ocrfiletest.py in the same directory as ocrfile.py. This script takes an image file as a parameter, opens it and sends its contents to ocrfile.py:

    import sys

    import ocrfile

    filename = sys.argv[1]
    # Read the image as binary data
    f = open(filename, 'rb')
    foo = f.read()
    f.close()
    # The first argument stands in for "self", which ocrfile does not use
    s = ocrfile.ocrfile(0, foo)
    print(s)
Test the script by running the following command; the recognized text should appear in your terminal:
% python ocrfiletest.py /YOURPATHTO/phototest.tif
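If the test fails because Tesseract is not installed at the path stored in "tess", the executable can be looked up on the system PATH instead of being hard-coded. The following is only a minimal sketch: the helper name find_executable and its fallback argument are illustrative and not part of the original script.

```python
import os


def find_executable(name, default=None):
    """Return the full path of `name` if it is found on PATH, else `default`."""
    for directory in os.environ.get('PATH', '').split(os.pathsep):
        candidate = os.path.join(directory, name)
        # Accept only regular files that the current user may execute
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    return default
```

In ocrfile.py, the assignment could then read tess = find_executable('tesseract', '/usr/local/bin/tesseract'), which keeps the original location as a fallback.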
Import the script:
- Go to the ZMI -> portal_skins -> custom.
- Click Add, select External Method.
- Type 'ocrfile', without the quotes, for Id, Title, Module Name and Function Name.
- Now we are going to create a local script that will take a Plone file, pass it to the OCR script and then save the results as a new text file in the parent container:
- Go back to the ZMI -> portal_skins -> custom.
- Click Add, select Script (Python).
- Type 'ocr_document' without the quotes for Id and Title.
- Click "Add and Edit".
- Delete the default code and paste the following code instead, then click "Save Changes":
    contentObject = context
    parent = contentObject.aq_inner.aq_parent

    # Pass file to external module to be OCRed
    f = contentObject.data
    s = script.ocrfile(f)

    ocrresultid = contentObject.id + "_ocr"
    ocrresulttitle = contentObject.title + " OCR Results"

    # Delete ocr text file if it exists
    if ocrresultid in parent.objectIds():
        parent.manage_delObjects([ocrresultid])

    ocrresultid = parent.invokeFactory("File", id=ocrresultid,
                                       title=ocrresulttitle, file=s)

    # TODO: Change this so that it changes the original file's extension
    # rather than appending a .txt
    ocrresultobj = getattr(parent, ocrresultid)
    ocrresultobj.setFilename(ocrresultid + '.txt')

    # Forward the user to the newly created text file
    return context.REQUEST['RESPONSE'].redirect(
        '%s/%s/view' % (parent.absolute_url(), ocrresultid))

- Now let's create an action that lets us OCR files. The same thing can be done for the image content type:
- Go to the ZMI -> portal_types -> File.
- Click the Actions tab.
- Scroll down to the bottom to Add a new action. Add it with the following values, then click "Add":
Title: OCR Document
Id: ocr_document
URL (Expression): string:${object_url}/ocr_document
Condition (Expression): python:object.content_type=='image/tiff'
Permission: Modify Portal Content
Category: object
Visible: Checked
- You're done! You should now be able to OCR TIFF images by clicking the newly created OCR Document tab above files. Note: We only added the action to the file content type, so if you have added TIFF images as images (image content type), the action will not show up. You can add the action to images by following the instructions in the previous step and selecting the image content type instead of the file content type in its first sub-step. You can also test the script on images or any other content type that contains a TIFF file by adding "/ocr_document" to the end of the content's URL.
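Because ocr_document can be reached by URL on any content, a guard that checks for TIFF data before calling Tesseract may be useful, and the TODO in the script about replacing the file extension can be handled with the same standard library call. The sketch below is one possible approach, not part of the original scripts: the names is_tiff, ocr_result_filename, TIFF_TYPES and TIFF_EXTENSIONS are hypothetical. Script (Python) objects run restricted code and probably cannot import os themselves, so this sketch assumes the helpers would live next to ocrfile.py in the Extensions folder.

```python
import os

# Content types and extensions treated as TIFF (an assumption of this sketch)
TIFF_TYPES = ('image/tiff',)
TIFF_EXTENSIONS = ('.tif', '.tiff')


def is_tiff(content_type, filename=''):
    """Return True if the MIME type or the file extension indicates TIFF."""
    if content_type in TIFF_TYPES:
        return True
    extension = os.path.splitext(filename)[1].lower()
    return extension in TIFF_EXTENSIONS


def ocr_result_filename(filename):
    """Replace the extension of `filename` with .txt for the OCR result."""
    base = os.path.splitext(filename)[0]
    return base + '.txt'
```

With these helpers, a file named phototest.tif would pass the guard and its OCR result would be named phototest.txt.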
Screen Shots
The action we created will add a tab above files of the TIFF mimetype:

The file I have uploaded is phototest.tif, which comes with Tesseract. Here is what it looks like:

Clicking the OCR Document tab will OCR the document and bring you to the OCR results. Here is a screen shot of the results:

Improvements for the future
The following are improvements I would like to see added to this project. Some of these I will add myself but some are beyond my knowledge. Let me know if you can do any of these:
General
- Make this into a product: From what I understand, a product can be installed easily, carries version numbers and makes collaboration easier. Unfortunately, this is beyond my capabilities, but maybe someone with an interest in this will productize it.
Error Handling
- Add code to the ocr_document script so that it will not run on files with the wrong extension, since Tesseract will reject them as well.
- Add exception handling to the scripts.
- Add the ability for the scripts to return error text to the user or write it to an error log when errors occur. Make sure feedback from Tesseract gets logged as well.
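As a starting point for the error handling goals above, the OCR function could raise an exception when Tesseract cannot be started or exits with an error, and log whatever Tesseract prints. This is a rough sketch under stated assumptions rather than a finished implementation: the names ocrfile_checked and OCRError and the logger name 'ocrfile' are hypothetical, and the sketch uses the standard subprocess module in place of os.spawnv so that Tesseract's output can be captured.

```python
import logging
import os
import shutil
import subprocess
import tempfile

logger = logging.getLogger('ocrfile')  # hypothetical logger name


class OCRError(Exception):
    """Raised when Tesseract cannot be run or reports a failure."""


def ocrfile_checked(image_data, tess='/usr/local/bin/tesseract'):
    """OCR TIFF data and return the text, raising OCRError on failure."""
    workdir = tempfile.mkdtemp()
    try:
        image_path = os.path.join(workdir, 'image.tif')
        output_base = os.path.join(workdir, 'output')
        with open(image_path, 'wb') as image_file:
            image_file.write(image_data)
        try:
            # Capture stdout and stderr together so the feedback can be logged
            process = subprocess.Popen(
                [tess, image_path, output_base],
                stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
        except OSError as error:
            raise OCRError('Could not start %s: %s' % (tess, error))
        feedback = process.communicate()[0]
        if feedback:
            logger.info('Tesseract output: %r', feedback)
        if process.returncode != 0:
            raise OCRError(
                'Tesseract exited with status %d' % process.returncode)
        result_path = output_base + '.txt'
        if not os.path.exists(result_path):
            raise OCRError('Tesseract produced no output file')
        with open(result_path, 'r') as result_file:
            return result_file.read()
    finally:
        # Always remove the temporary directory, even after an error
        shutil.rmtree(workdir, ignore_errors=True)
```

The calling script could catch OCRError and show its message to the user instead of creating an empty text file.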
Features
- Add functionality to OCR PDF files. It would be great if it could convert PDF files to PDF files with embedded text rather than uploading a new text file to the directory.
- Deal with multipage files.
- Link resulting OCR text with TIFF image.

PDFtoOCR
http://plone.org/products/pdftoocr/