Warning

This document hasn't been checked for compatibility with current versions of Plone. Use at your own risk.

Step by Step Instructions

by NA last modified Dec 30, 2008 03:06 PM
The following are step-by-step instructions for adding Tesseract OCR functionality to your site.
  1. Tesseract OCR must be installed. Read the prerequisites in the introduction for more information.
  2. Add the following script to your Plone Extensions folder (Example: /opt/Plone-x.x.x/zinstance/Extensions) and call it ocrfile.py. The script defines a function that takes image data, writes it to disk, runs Tesseract from the command line and returns the recognized text:
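    def ocrfile(self, f):
        import os
        import shutil
        import tempfile

        # Path to the Tesseract executable
        tess = '/usr/local/bin/tesseract'

        # Work in a fresh temporary directory
        dir1 = tempfile.mkdtemp()
        txtfilename = os.path.join(dir1, 'output')
        imagefilename = os.path.join(dir1, 'image.tif')

        try:
            # Write the image data to disk so Tesseract can read it
            imagefile = open(imagefilename, "wb")
            imagefile.write(f)
            imagefile.close()

            # Tesseract appends '.txt' to the output name it is given
            os.spawnv(os.P_WAIT, tess, (tess, imagefilename, txtfilename))

            textfile = open(txtfilename + '.txt', "r")
            s = textfile.read()
            textfile.close()
        finally:
            # Remove the temporary files even if a step above failed
            shutil.rmtree(dir1, ignore_errors=True)

        return s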
    
  3. The variable "tess" contains the path to the tesseract executable. The value above was Tesseract's default location after installation on Linux; change it if it is different on your system.
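
If the install location varies between machines, the hard-coded path could be replaced by a lookup. The following is a minimal sketch and not part of the original script: the helper name find_tesseract and its parameters are hypothetical, and it assumes the executable is named "tesseract", as on Linux. It tries a list of explicit candidate paths first and then each directory on PATH.

```python
import os


def find_tesseract(candidates=('/usr/local/bin/tesseract',), search_path=True):
    """Return the first executable Tesseract path found, or None.

    Hypothetical helper: 'candidates' are explicit paths to try first;
    when 'search_path' is true, each directory on PATH is tried as well.
    """
    paths = list(candidates)
    if search_path:
        for directory in os.environ.get('PATH', '').split(os.pathsep):
            if directory:
                paths.append(os.path.join(directory, 'tesseract'))
    for path in paths:
        # Accept only regular files that the current user may execute
        if os.path.isfile(path) and os.access(path, os.X_OK):
            return path
    return None
```

Inside ocrfile, the result could be assigned to "tess", with an explicit error raised when the function returns None.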

  4. Before importing this script you can test it on the file system. Save the following script as ocrfiletest.py in the same directory as ocrfile.py, so that the import works. It takes an image file as a parameter, reads it and passes the data to the ocrfile function:

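    import ocrfile
    import sys

    # Path of the image to recognize, given on the command line
    filename = sys.argv[1]

    # Read the image as binary data
    f = open(filename, 'rb')
    foo = f.read()
    f.close()

    # The first argument stands in for "self", which ocrfile does not use
    s = ocrfile.ocrfile(None, foo)

    # The parenthesized form is valid in both Python 2 and Python 3
    print(s)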
    

    Test the script by running the following command; the recognized text should be displayed in your terminal:

    % python ocrfiletest.py /YOURPATHTO/phototest.tif
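
The script in step 2 discards the exit status that os.spawnv returns, so a Tesseract failure only surfaces later as a missing output file. A minimal sketch of a stricter call follows, assuming the standard-library subprocess module is available (Python 2.4 or later). The helper name run_checked is hypothetical.

```python
import subprocess


def run_checked(args):
    """Run a command and raise RuntimeError if its exit status is non-zero."""
    status = subprocess.call(list(args))
    if status != 0:
        raise RuntimeError('%s exited with status %d' % (args[0], status))
    return status
```

Inside ocrfile this could stand in for the os.spawnv line, for example run_checked([tess, imagefilename, txtfilename]).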
    
  5. Import the script as an External Method:

    1. Go to the ZMI -> portal_skins -> custom.
    2. Click Add, select External Method.
    3. Type 'ocrfile', without the quotes, for Id, Title, Module Name and Function Name.
  6. Now we are going to create a local script that takes a Plone file, passes it to the OCR script and saves the result as a new text file in the parent container:
    1. Go back to ZMI -> portal_skins -> custom.
    2. Click Add, select Script (Python).
    3. Type 'ocr_document' without the quotes for Id and Title.
    4. Click "Add and Edit".
  7. Delete the default code, paste in the following code instead, then click "Save Changes":
    contentObject = context
    parent = contentObject.aq_inner.aq_parent
    
    #Pass file to external module to be OCRed
    f = contentObject.data
    s = script.ocrfile(f)
    
    ocrresultid = contentObject.id + "_ocr"
    ocrresulttitle = contentObject.title + " OCR Results" 
    
    #Delete ocr text file if it exists
    if ocrresultid in parent.objectIds():
        parent.manage_delObjects([ocrresultid])
    
    ocrresultid = parent.invokeFactory("File", id=ocrresultid, title=ocrresulttitle, file=s)
    
    #TODO: Change this so that it changes the original file's extension rather than appending on a .txt
    ocrresultobj = getattr(parent, ocrresultid)
    
    ocrresultobj.setFilename(ocrresultid + '.txt')
    
    #Forward the user to the newly created text file
    return context.REQUEST['RESPONSE'].redirect(
        '%s/%s/view' % (parent.absolute_url(), ocrresultid))
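
The TODO in the script above concerns the file name: the result is currently named by appending '.txt' to the new id. One possible way to replace the original extension instead is sketched below. The function name ocr_filename is hypothetical, and the sketch uses only string methods because Script (Python) objects run restricted code and generally cannot import modules such as os.

```python
def ocr_filename(original):
    """Return 'original' with its extension replaced by '.txt'."""
    # Split on the last dot only, so 'archive.2008.tif' keeps 'archive.2008'
    parts = original.rsplit('.', 1)
    base = parts[0]
    if len(parts) == 1 or not base:
        # No extension, or a name such as '.hidden': keep the whole name
        base = original
    return base + '.txt'
```

In the script this could be used as ocrresultobj.setFilename(ocr_filename(contentObject.getFilename())), assuming the content object provides a getFilename() method.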
  8. Now let's create an action that lets us OCR files. The same thing can be done for the image content type:
    1. Go to the ZMI -> portal_types -> File.
    2. Click the Actions tab.
    3. Scroll to the bottom of the page, to the form for adding a new action. Enter the following values, then click "Add":
      Title: OCR Document
      Id: ocr_document
      URL (Expression): string:${object_url}/ocr_document
      Condition (Expression): python:object.content_type=='image/tiff'
      Permission: Modify Portal Content
      Category: object
      Visible: Checked
  9. You're done! You should now be able to OCR TIF images by clicking the newly created OCR Document tab above files. Note: the action was added only to the File content type, so it will not show up for TIF images that were added as Image content. To add it there as well, repeat step 8 and select the Image content type in sub-step 1 instead of File. You can also test the script on images, or on any other content type that contains a TIF file, by appending "/ocr_document" to the content's URL.
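
The Condition expression in step 8 compares the stored content type with 'image/tiff'. To check in advance which content type a file name is likely to map to, Python's standard mimetypes module can be used on the file system. This is only a rough sketch: the helper name is_tiff is hypothetical, and the type Plone actually stores for an upload may differ from this guess.

```python
import mimetypes


def is_tiff(filename):
    """Guess from the file name whether a file would be treated as TIFF."""
    content_type, _encoding = mimetypes.guess_type(filename)
    return content_type == 'image/tiff'
```

A file for which is_tiff returns False is unlikely to show the OCR Document tab, since the condition would not match.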

 

The next step shows screenshots of what this is supposed to look like.

