Enable full-text indexing of Word documents and PDFs in Plone 3.0 (GNU/Linux)
Plone 3.0 supports full-text indexing of Word documents and PDFs natively, so add-on products such as TextIndexNG are no longer needed. The setup still depends on third-party command-line converters, however. wvText, which converts Word documents, is part of the wv package (often identified as wvWare). pdftotext, which converts PDFs, is bundled in either poppler-utils or xpdf. (Note: poppler-utils provides only command-line utilities for PDF conversion, while xpdf also includes a full PDF viewer and requires an X Window System.)
This document details the steps required to enable full-text indexing of Word documents and PDFs for Plone 3.0 on Debian GNU/Linux, Ubuntu, and other Debian-based systems (Linspire, Freespire, Xandros). The steps should be similar for RPM-based distributions such as Fedora, CentOS, YDL, Mandriva, and openSUSE.
Instructions
1) Install wv for full-text indexing of Word documents:
sudo apt-get install wv
2) Install poppler-utils or xpdf for full-text indexing of PDFs:
If pdftotext is not installed on your system (i.e. `which pdftotext` returns nothing at the command line), install poppler-utils, which bundles pdftotext (PDF to text converter), pdfinfo (PDF document information extractor), pdfimages (PDF image extractor), and pdffonts (PDF font analyzer). Note: xpdf bundles these utilities and more, but poppler-utils is the preferred option since only the command-line converters are required.
sudo apt-get install poppler-utils
If poppler-utils is unavailable, install xpdf:
sudo apt-get install xpdf
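Once the packages from steps 1 and 2 are installed, it can be worth confirming that both converters are on the PATH before moving on. The following is a minimal sketch, not part of Plone itself; the function name missing_converters is made up for this example.

```shell
#!/bin/sh
# Hypothetical helper: print the name of each command that is NOT found on PATH.
missing_converters() {
    for cmd in "$@"; do
        command -v "$cmd" >/dev/null 2>&1 || printf '%s\n' "$cmd"
    done
}

# Check the two converters this document relies on.
missing=$(missing_converters wvText pdftotext)
if [ -n "$missing" ]; then
    echo "Not installed:" $missing
else
    echo "wvText and pdftotext are both available."
fi
```

Note that the check runs as the current user; if Plone runs under a different account with a different PATH, the result for that account may differ.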
3) Add the pdf_to_text transform:
- In the Zope Management Interface (ZMI) of your Plone site, click portal_transforms
- Click on "Add Transform"
- Enter ID: pdf_to_text
- Enter module: Products.PortalTransforms.transforms.pdf_to_text
- Click Submit
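The module name entered above corresponds, by the usual Python package layout, to a file named PortalTransforms/transforms/pdf_to_text.py inside the installation. The sketch below looks for that file; it assumes Plone is installed under /opt/Plone-3.0 (the prefix used by the restart command in this document), and the function name find_pdf_transform is invented for illustration.

```shell
#!/bin/sh
# Hypothetical helper: list copies of the pdf_to_text transform module
# found anywhere under the directory given as $1.
find_pdf_transform() {
    find "$1" -type f -path '*/PortalTransforms/transforms/pdf_to_text.py' 2>/dev/null
}

# Assumption: Plone lives under /opt/Plone-3.0; pass another prefix as the
# first argument if your installation is elsewhere.
prefix="${1:-/opt/Plone-3.0}"
matches=$(find_pdf_transform "$prefix") || true
if [ -n "$matches" ]; then
    echo "$matches"
else
    echo "pdf_to_text.py not found under $prefix"
fi
```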
4) Restart Plone:
sudo /opt/Plone-3.0/zeocluster/bin/restartcluster.sh
5) Update portal_catalog to re-index Word documents and PDFs that were added to the site before the converters were available:
In the Zope Management Interface (ZMI) of your Plone site, click portal_catalog > "Advanced" tab > "Update Catalog" button. This re-indexes all content on the site, including the full text of Word documents and PDFs. Note: Because the catalog update can take a long time and consume significant computing resources, run it during downtime on a production server.
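If a document still does not appear in search results after the catalog update, running the converters by hand on that file can show whether any text can be extracted at all. The sketch below calls wvText and pdftotext directly rather than through Plone, so the options Plone passes to them may differ; the function name extract_text is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: write the plain text of a .doc or .pdf file to stdout.
extract_text() {
    case "$1" in
        *.pdf)
            # "-" as the output file tells pdftotext to write to stdout.
            pdftotext "$1" -
            ;;
        *.doc)
            # wvText takes an input file and an output file, so go through a temp file.
            tmp=$(mktemp) || return 1
            wvText "$1" "$tmp" && cat "$tmp"
            status=$?
            rm -f "$tmp"
            return $status
            ;;
        *)
            echo "unsupported file type: $1" >&2
            return 2
            ;;
    esac
}

# Show the first 20 lines of extracted text for the file given as $1, if any.
if [ $# -gt 0 ]; then
    extract_text "$1" | head -n 20
fi
```

Empty output for a PDF often means the file contains only scanned images, in which case there is no text for the converter to extract.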
Conclusion
The instructions provided here only enable full-text indexing of Word documents and PDFs. For full-text indexing of additional types of office documents such as OpenOffice (OpenDocument), PowerPoint, and Excel files, AROfficeTransforms is a recommended add-on product. AROfficeTransforms is available at http://plone.org/products/arofficetransforms

