Enable full-text indexing of Word documents and PDFs in Plone 3.0 (GNU/Linux)

by Kamal Gill last modified Aug 12, 2010 04:29 PM
How to install third-party command-line converters to enable full-text indexing of Word documents and PDFs in Plone 3.0 on GNU/Linux (Ubuntu, Debian, et al.)

Plone 3.0 offers native support for full-text indexing of Word documents and PDFs and no longer requires add-on products such as TextIndexNG. However, third-party command-line converters are still required to complete the setup. The converter for Word documents, wvText, is bundled in a package known as wv (often identified as wvWare). The converter for PDFs, pdftotext, is bundled in either poppler-utils or xpdf. (Note: poppler-utils offers command-line utilities for PDF conversion, while xpdf includes a full PDF viewer and requires an X window system.)

This document details the steps required to enable full-text indexing of Word documents and PDFs for Plone 3.0 on Debian GNU/Linux, Ubuntu, and other Debian-based systems (Linspire, Freespire, Xandros). The steps should be similar for RPM-based distributions such as Fedora, CentOS, YDL, Mandriva, and openSUSE, using the distribution's own package manager in place of apt-get.
 

Instructions

 

1) Install wv for full-text indexing of Word documents:

sudo apt-get install wv


2) Install poppler-utils or xpdf for full-text indexing of PDFs:

If pdftotext is not installed on your system (that is, `which pdftotext` returns nothing at the command line), install poppler-utils, which bundles pdftotext (PDF to text converter), pdfinfo (PDF document information extractor), pdfimages (PDF image extractor), and pdffonts (PDF font analyzer). Note: xpdf bundles these utilities and more, but poppler-utils is the preferred option since only the command-line converters are required.

sudo apt-get install poppler-utils


If poppler-utils is unavailable, install xpdf:

sudo apt-get install xpdf
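
Once the packages from steps 1 and 2 are installed, a quick check can confirm that both converters are on the PATH. The following is a minimal sketch; check_converter is a hypothetical helper name used only for illustration and is not part of Plone, wv, or poppler-utils.

```shell
#!/bin/sh
# check_converter is a hypothetical helper (not part of Plone or the packages above).
# It prints whether the named command can be found on the current PATH.
check_converter() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1: found"
    else
        echo "$1: missing"
        return 1
    fi
}

# The two converters this document relies on.
for tool in wvText pdftotext; do
    check_converter "$tool" || true
done
```

It is worth running the check as the same user that starts Plone, since a converter that is missing from that user's PATH would presumably not be found when documents are indexed.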

 

3) Add the pdf_to_text transform:

  • In the Zope Management Interface (ZMI) of your Plone site, click portal_transforms
  • Click on "Add Transform"
  • Enter ID: pdf_to_text
  • Enter module: Products.PortalTransforms.transforms.pdf_to_text
  • Click Submit

 

4) Restart Plone:

sudo /opt/Plone-3.0/zeocluster/bin/restartcluster.sh


5) Update portal_catalog to re-index Word documents and PDFs that were added to the site before the converters were available:

In the Zope Management Interface (ZMI) of your Plone site, click portal_catalog > "Advanced" tab > "Update Catalog" button. This re-indexes all content on the site, including the full text of Word documents and PDFs. Note: Because the catalog update consumes a significant amount of time and computing resources, perform it during downtime if the site runs on a production server.
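
If a document still cannot be found by its contents after the catalog update, running the converters by hand on that file can show whether a converter is at fault. The following is a minimal sketch, assuming wvText and pdftotext are installed as described above; extract_text and the file names in the comments are hypothetical and used only for illustration.

```shell
#!/bin/sh
# extract_text is a hypothetical helper for manual troubleshooting.
# It prints the plain text of a .doc or .pdf file to standard output.
extract_text() {
    case "$1" in
        *.doc)
            # wvText writes to an output file, so go through a temporary file.
            tmp=$(mktemp) || return 1
            wvText "$1" "$tmp" && cat "$tmp"
            status=$?
            rm -f "$tmp"
            return "$status"
            ;;
        *.pdf)
            # A dash as the output name sends pdftotext's result to stdout.
            pdftotext "$1" -
            ;;
        *)
            echo "unsupported file type: $1" >&2
            return 2
            ;;
    esac
}

# Example (hypothetical file names):
#   extract_text report.doc | head
#   extract_text manual.pdf | head
```

Empty output from a converter for a given file suggests that the file holds no extractable text (a scanned PDF, for example), in which case re-indexing alone is unlikely to make it searchable.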

 

Conclusion

The instructions provided here only enable full-text indexing of Word documents and PDFs. For full-text indexing of additional types of office documents such as OpenOffice (OpenDocument), PowerPoint, and Excel files, AROfficeTransforms is a recommended add-on product, available at http://plone.org/products/arofficetransforms