Attention

This document was written for an old version of Plone, Plone 3, and was last updated 649 days ago.

To learn how to upgrade to the current version of Plone, read the upgrade manual.

Enable full-text indexing of Word documents and PDFs in Plone 3.0 (GNU/Linux)

by Kamal Gill, last modified Aug 12, 2010 04:29 PM
How to install third-party command-line converters to enable full-text indexing of Word documents and PDFs in Plone 3.0 on GNU/Linux (Ubuntu, Debian, et al.)

Plone 3.0 supports full-text indexing of Word documents and PDFs natively and no longer requires add-on products such as TextIndexNG. However, third-party command-line converters must be installed to complete the setup. The converter for Word documents, wvText, is bundled in a package named wv (often identified as wvWare). The converter for PDFs, pdftotext, is bundled in either poppler-utils or xpdf. Note: poppler-utils provides only the command-line utilities for PDF conversion, while xpdf includes a full PDF viewer and requires an X Window System.

This document details the steps required to enable full-text indexing of Word documents and PDFs for Plone 3.0 on Debian GNU/Linux, Ubuntu, and other Debian-based systems (Linspire, Freespire, Xandros).  The steps should be similar for RPM-based distributions such as Fedora, CentOS, YDL, Mandriva, and openSUSE.
 

Instructions

 

1) Install wv for full-text indexing of Word documents:

sudo apt-get install wv


2) Install poppler-utils or xpdf for full-text indexing of PDFs:

If pdftotext is not installed on your system (that is, `which pdftotext` prints nothing at the command line), install poppler-utils, which bundles pdftotext (PDF to text converter), pdfinfo (PDF document information extractor), pdfimages (PDF image extractor), and pdffonts (PDF font analyzer). Note: xpdf bundles these utilities and more, but poppler-utils is the preferred option since only the command-line converters are required.

sudo apt-get install poppler-utils


If poppler-utils is unavailable, install xpdf:

sudo apt-get install xpdf
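
Before moving on to the Zope side, it can help to confirm that both converters are on the PATH. The following is a minimal sketch, not part of Plone: the function name check_converters is made up for illustration, and it assumes the Zope process sees the same PATH as the shell you run it in.

```shell
#!/bin/sh
# check_converters CMD...: report whether each named command is on the PATH.
# Returns 0 if all are found, 1 if at least one is missing.
# (Hypothetical helper name, used here only for illustration.)
check_converters() {
    missing=0
    for cmd in "$@"; do
        if command -v "$cmd" >/dev/null 2>&1; then
            echo "found: $cmd"
        else
            echo "missing: $cmd"
            missing=1
        fi
    done
    return "$missing"
}

# wvText comes from the wv package; pdftotext from poppler-utils or xpdf.
check_converters wvText pdftotext || echo "Install the missing packages before continuing."
```

If either converter is reported as missing, repeat the matching install step above before adding the transform.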

 

3) Add the pdf_to_text transform:

  • In the Zope Management Interface (ZMI) of your Plone site, click portal_transforms
  • Click on "Add Transform"
  • Enter ID: pdf_to_text
  • Enter module: Products.PortalTransforms.transforms.pdf_to_text
  • Click Submit

 

4) Restart Plone:

sudo /opt/Plone-3.0/zeocluster/bin/restartcluster.sh


5) Update portal_catalog to re-index Word documents and PDFs that were added to the site before the converters were available:

In the Zope Management Interface (ZMI) of your Plone site, click portal_catalog > "Advanced" tab > "Update Catalog" button. This re-indexes all content on the site, including the full text of Word documents and PDFs. Note: The catalog update can consume a significant amount of time and computing resources, so on a production server, perform it during downtime.
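
If a particular document still does not appear in search results after the catalog update, one way to narrow down the problem is to run the converter on that file by hand and see whether any text comes out. The sketch below is an illustration under stated assumptions: extract_text is a hypothetical helper name, file types are guessed from the extension only, and the file name in the example comment is made up.

```shell
#!/bin/sh
# extract_text FILE: print the plain text of a PDF or Word document to stdout.
# Hypothetical helper for manual spot checks; it is not part of Plone.
extract_text() {
    file=$1
    case "$file" in
        *.pdf|*.PDF)
            # "-" as the output name makes pdftotext write to stdout.
            pdftotext "$file" -
            ;;
        *.doc|*.DOC)
            # wvText writes to a named output file, so use a temporary one.
            tmp=$(mktemp) || return 1
            wvText "$file" "$tmp" && cat "$tmp"
            status=$?
            rm -f "$tmp"
            return "$status"
            ;;
        *)
            echo "unsupported file type: $file" >&2
            return 2
            ;;
    esac
}

# Example (hypothetical file name):
#   extract_text report.pdf | head -n 20
```

If the converter prints no text for a file (for example, a PDF that contains only scanned images), there is nothing for the catalog to index for that file.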

 

Conclusion

The instructions provided here only enable full-text indexing of Word documents and PDFs. For full-text indexing of additional types of office documents, such as OpenOffice (OpenDocument), PowerPoint, and Excel files, AROfficeTransforms is a recommended add-on product. AROfficeTransforms is available at http://plone.org/products/arofficetransforms

 

 

