OCR in Plone using Tesseract OCR
This tutorial will show you how to add the ability to OCR documents in Plone using Tesseract OCR. An "OCR Document" action will be added to appropriate files, and when the user chooses this action, a text file with the OCR results will be added to the container.
Introduction
We will add OCR capabilities into Plone by integrating Tesseract OCR.
Tesseract OCR is an open source OCR (Optical Character Recognition) engine that is currently sponsored by Google. It can take a TIFF image as input, recognize it and output text. Our goal is to add this functionality into Plone.
We will add an action to TIFF files, which will OCR the image and create a text file with the OCR results in the same container. OCRing an image extracts its text so that it can be edited, and in Plone the extracted text can also be indexed for searches.
First we will create a script that takes an image as its input and returns the results as a string. We will use the Tesseract command line application to accomplish the OCR. Then we will create a script that passes a Plone file to the previous script and creates a text file from the results. Finally we will call the latter script from an action.
I hope to improve this project and add more functionality, such as PDF support. The goals for improvement are listed at the end of the tutorial.
Prerequisites:
- Tesseract OCR must be installed for this to work. You can download it here. Read the documentation on how to install it. You can also compile it with libtiff to support compressed TIFF files. Since I did not find anywhere that documented installing Tesseract and libtiff together, I have documented my experience here.
- You will need to know how to add scripts to Plone and import external scripts. This is not difficult and the basic steps are outlined in this document.
See It in Action
If you want to see this script in action I have set it up on ABillionBillion.com. Just create an account, upload a TIFF image and click OCR Document.
Let's get started!
Step by Step Instructions
The following are step by step instructions for adding Tesseract OCR functionality to your site.
- Tesseract OCR must be installed. Read the prerequisites in the introduction for more information.
- Add the following script to your Plone Extensions folder (Example: /opt/Plone-x.x.x/zinstance/Extensions) and call it ocrfile.py. The script defines a function that takes image data, writes it to disk, runs Tesseract from the command line and returns the output text:

    import os
    import tempfile

    def ocrfile(self, f):
        # Path to the Tesseract executable
        tess = '/usr/local/bin/tesseract'
        # Work in a private temporary directory
        dir1 = tempfile.mkdtemp()
        txtfilename = dir1 + '/output'
        imagefilename = dir1 + '/image.tif'
        # Write the image data to disk so Tesseract can read it
        imagefile = open(imagefilename, "wb")
        imagefile.write(f)
        imagefile.close()
        # Run Tesseract and wait for it; it appends '.txt' to the output name
        os.spawnv(os.P_WAIT, tess, (tess, imagefilename, txtfilename))
        # Read the recognized text
        txtfile = open(txtfilename + '.txt', "r")
        s = txtfile.read()
        txtfile.close()
        # Clean up the temporary files and directory
        os.remove(txtfilename + '.txt')
        os.remove(imagefilename)
        os.rmdir(dir1)
        return s

- The variable "tess" contains the path to the Tesseract executable. The value above was Tesseract's default location after installation on Linux; change it if it is different on your system.
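A more defensive variant of the ocrfile function can be sketched as follows. This is not the version used in this tutorial. It assumes a modern Python 3 interpreter rather than the Python bundled with older Plone releases, and it assumes Tesseract exits with status 0 on success. The names ocr_bytes, DEFAULT_TESS and the run parameter are illustrative only. The sketch looks for tesseract on the PATH before falling back to the default location, and it removes the temporary directory even when Tesseract fails.

```python
import os
import shutil
import subprocess
import tempfile

DEFAULT_TESS = '/usr/local/bin/tesseract'  # same default as in ocrfile.py


def ocr_bytes(data, tess=None, run=subprocess.call):
    """Return the text Tesseract recognizes in the TIFF bytes `data`.

    `run` is a hypothetical hook: any callable that takes the command
    list and returns an exit status, so the runner can be swapped out.
    """
    if tess is None:
        # Prefer a tesseract found on PATH, else the default location
        tess = shutil.which('tesseract') or DEFAULT_TESS
    workdir = tempfile.mkdtemp()
    try:
        image_path = os.path.join(workdir, 'image.tif')
        output_base = os.path.join(workdir, 'output')
        with open(image_path, 'wb') as image_file:
            image_file.write(data)
        # Tesseract appends '.txt' to the output base name itself
        status = run([tess, image_path, output_base])
        if status != 0:
            raise RuntimeError('tesseract exited with status %r' % (status,))
        with open(output_base + '.txt') as text_file:
            return text_file.read()
    finally:
        # Remove the directory and its contents, even after an error
        shutil.rmtree(workdir, ignore_errors=True)
```

The try/finally block is the main design difference: in the original function, a failed Tesseract run raises an error when the missing output file is opened, and the temporary directory is left behind.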
- Before importing this script, you can test it on the file system by saving the following script alongside ocrfile.py and calling it ocrfiletest.py. This script takes an image file as a parameter, reads it and passes the data to the ocrfile function:
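
    import sys

    import ocrfile

    filename = sys.argv[1]
    # Read the image in binary mode
    f = open(filename, "rb")
    foo = f.read()
    f.close()
    # ocrfile does not use its first argument (self), so a placeholder works
    s = ocrfile.ocrfile(0, foo)
    print(s)
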
Test the script by running the following command. The recognized text should be displayed in your terminal:

    % python ocrfiletest.py /YOURPATHTO/phototest.tif

- Import the script:
  - Go to the ZMI -> portal skins -> Custom.
  - Click Add, select External Method.
  - Type 'ocrfile', without the quotes, for Id, Title, Module Name and Function Name.
- Now we are going to create a local script that will take a Plone file, pass it to the OCR script and then save the results as a new text file in the parent container:
  - Go back to ZMI -> portal skins -> Custom.
  - Click Add, select Script (Python).
  - Type 'ocr_document', without the quotes, for Id and Title.
  - Click "Add and Edit".
  - Delete the default code and paste the following code instead, then click "Save Changes":

      contentObject = context
      parent = contentObject.aq_inner.aq_parent

      # Pass file to external module to be OCRed
      f = contentObject.data
      s = script.ocrfile(f)

      ocrresultid = contentObject.id + "_ocr"
      ocrresulttitle = contentObject.title + " OCR Results"

      # Delete ocr text file if it exists
      if ocrresultid in parent.objectIds():
          parent.manage_delObjects([ocrresultid])

      ocrresultid = parent.invokeFactory(
          "File", id=ocrresultid, title=ocrresulttitle, file=s)

      # TODO: Change this so that it changes the original file's
      # extension rather than appending on a .txt
      ocrresultobj = getattr(parent, ocrresultid)
      ocrresultobj.setFilename(ocrresultid + '.txt')

      # Forward the user to the newly created text file
      return context.REQUEST['RESPONSE'].redirect(
          '%s/%s/view' % (parent.absolute_url(), ocrresultid))

- Now let's create an action that lets us OCR files. The same thing can be done for the image content type:
  - Go to the ZMI -> Portal_Types -> File.
  - Click the Actions tab.
  - Scroll down to the bottom to Add a new action. Add it with the following values, then click "Add":
      Title: OCR Document
      Id: ocr_document
      URL (Expression): string:${object_url}/ocr_document
      Condition (Expression): python:object.content_type=='image/tiff'
      Permission: Modify Portal Content
      Category: object
      Visible: Checked
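The condition above relies on the MIME type recorded for the file. As a complementary safeguard, the data itself could be checked before it is handed to Tesseract. The sketch below is an assumption about how such a check might look and is not part of the original scripts. looks_like_tiff is a hypothetical helper that tests for the two standard TIFF byte-order signatures.

```python
# Classic TIFF files start with a byte-order mark followed by the number 42
TIFF_SIGNATURES = (
    b'II*\x00',  # little-endian ("Intel") byte order
    b'MM\x00*',  # big-endian ("Motorola") byte order
)


def looks_like_tiff(data):
    """Return True if `data` (bytes) starts with a TIFF signature."""
    return data[:4] in TIFF_SIGNATURES
```

In ocrfile.py, a check like this could run before the image is written to disk, so that non-TIFF data is rejected early instead of being passed to Tesseract.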
- You're done! You should now be able to OCR TIFF images by clicking the newly created OCR Document tab above files. Note: We only added the action to the File content type. If you have added TIFF images as images (Image content type), the action will not show up. You can add the action to images by following the instructions in the previous step and selecting the Image content type in its first sub-step instead of the File content type. You can also test the script on images or any other content type that contains a TIFF file by adding "/ocr_document" to the end of the content's URL.
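The TODO in the ocr_document script (change the original file's extension instead of appending .txt) could be handled with a small helper like the one below. This is a sketch under assumptions: ocr_result_filename is a hypothetical name, and because it imports os.path, it would probably have to live in the external method module rather than in the restricted Script (Python).

```python
import os.path


def ocr_result_filename(original_filename, extension='.txt'):
    """Swap the extension of `original_filename` for `extension`."""
    base, _old_extension = os.path.splitext(original_filename)
    return base + extension
```

With this helper, phototest.tif would map to phototest.txt, and a name without any extension would simply gain .txt.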
The next section shows screen shots of what this is supposed to look like.
Screen Shots
This page displays screen shots of OCRing a document using the OCR Document action.
The action we created will add a tab above files of the TIFF mimetype:

The file I have uploaded is phototest.tif, which comes with Tesseract. Here is what it looks like:
Clicking the OCR Document tab will OCR the document and bring you to the OCR results. Here is a screen shot of the results:

