Enable full-text indexing of Word documents and PDFs in Plone 3.0 (Windows)
How to install third-party command-line converters to enable full-text indexing of Word documents and PDFs in Plone 3.0 on Windows
In Plone 3.0, no additional add-on products are needed to index Word documents and PDFs. However, you must install tools that can convert Word documents and PDFs to HTML or plain text. This how-to uses wvware for Word documents and pdftohtml for PDFs. You can also install MS Word or OpenOffice to convert Word documents, but I think most people would rather avoid these applications on their servers. I wrote this how-to for them.
On the Linux platform the installation is pretty simple (see the GNU/Linux how-to listed under Related content for more information, thanks to Kamal Gill). On Windows, in addition to the installation, you have to make two small changes in the Python code to enable Word document indexing. But don't be afraid; they are really small.
Step-by-step
- First, download the command-line tools:
  - Go to http://gnuwin32.sourceforge.net/packages/wv.htm and download the Windows "binaries" and "dependencies" files for wvware.
  - Go to http://sourceforge.net/projects/pdftohtml/ and download the Windows binary file for pdftohtml.
- Next, install the two tools:
  - To install wvware, extract the "binaries" zip file into a directory of your choice. Next, extract the "dependencies" zip file into a directory of your choice. Then copy the DLL files from the bin directory of the "dependencies" directory into the bin directory of the "binaries" directory.
  - To install pdftohtml, extract the zip file into a directory of your choice.
- To make these command-line tools available, add their executables to the system's PATH: add C:\path\to\wvware\bin and C:\path\to\pdftohtml\ to the PATH environment variable.
- Next, we have to patch two files, word_to_html.py and office_wvware.py. You can find both under C:\path\to\Plone3\Data\Products\PortalTransforms\transforms:
  - Let's start with word_to_html.py. Lines 13 to 28 show that office_wvware is only used on a POSIX platform, so we have to change that: replace office_com in line 28 with office_wvware. Save your changes and close word_to_html.py.
  - office_wvware.py needs a little more work. At about line 23 there is another if statement that prevents the following command from being executed on any platform other than POSIX. Replace the whole block with this Windows-compatible command, keeping the indentation of the code it replaces:

        os.system('cd %s && wvware.exe --charset=utf-8 %s > %s.html' % (tmpdir,
                                                                       self.fullname,
                                                                       self.__name__))

    Save your changes and close office_wvware.py.
- Now, restart your Plone instance.
- During the creation of a site, Plone checks whether a pdftohtml binary is available. If there is none, no pdf_to_html transform is added to that site, so we have to add it by hand. Go to the portal_transforms tool in your ZMI and select Transform in the drop-down box. Add a new transform with the ID pdf_to_html and the Module Products.PortalTransforms.transforms.pdf_to_html by clicking the Add button. This step is not necessary for a Plone site that you create after installing the pdftohtml binary.
- Now there is only one more step: go to the portal_catalog tool, choose the Advanced tab, and click the Update Catalog button.
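To see what the command patched into office_wvware.py actually hands to the shell, the string formatting can be reproduced outside of Plone. This is only an illustrative sketch: build_wvware_command is a hypothetical helper that is not part of PortalTransforms, and the paths are made-up example values.

```python
def build_wvware_command(tmpdir, fullname, name):
    """Return the shell command string built by the patched os.system line.

    Hypothetical helper for illustration; in office_wvware.py the values
    come from tmpdir, self.fullname and self.__name__.
    """
    return 'cd %s && wvware.exe --charset=utf-8 %s > %s.html' % (tmpdir, fullname, name)


if __name__ == "__main__":
    # Made-up example values, not paths used by Plone itself.
    print(build_wvware_command(r"C:\Temp\work", r"C:\Temp\work\report.doc", "report"))
    # prints: cd C:\Temp\work && wvware.exe --charset=utf-8 C:\Temp\work\report.doc > report.html
```

Running the printed command by hand in a Command Prompt is a quick way to check that wvware.exe itself works. Note that the arguments are not quoted, so a path containing spaces would be split by the shell.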
Finally, we are done. You can now search for text in the PDFs and Word documents of your site.
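If a search still finds nothing, a first thing to rule out is the PATH change from the steps above. The following is a minimal sketch, assuming a separate Python 3 installation (shutil.which is not available in the Python 2 interpreter that ships with Plone 3.0). The names find_tools and missing_tools are hypothetical helpers, not part of Plone.

```python
import shutil

# Tool names taken from the steps above. On Windows, shutil.which
# resolves the ".exe" suffix through the PATHEXT variable.
REQUIRED_TOOLS = ("wvware", "pdftohtml")


def find_tools(names=REQUIRED_TOOLS):
    """Map each tool name to its resolved executable path, or None if not on PATH."""
    return {name: shutil.which(name) for name in names}


def missing_tools(found):
    """Return the sorted names of the tools that could not be resolved."""
    return sorted(name for name, location in found.items() if location is None)


if __name__ == "__main__":
    found = find_tools()
    for name, location in sorted(found.items()):
        print("%-10s %s" % (name, location or "NOT FOUND on PATH"))
```

The result only reflects the PATH of the account and session that runs the script. A Plone instance that was started before PATH was changed will not see the new value until it is restarted.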
Further information
If you are running Plone as a service, make sure that the account used to run the service has a temporary folder configured where the command-line tools can save their output.
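One way to check the temporary folder is a small script run under the same account as the service. This is a sketch under assumptions: it requires a Python 3 interpreter, check_temp_dir is a hypothetical helper, and it only tests the directory that Python's tempfile module selects (on Windows this is derived from the TMPDIR, TEMP and TMP environment variables), which may differ from the directory the transforms actually use.

```python
import os
import tempfile


def check_temp_dir():
    """Return (directory, ok); ok is True if a file can be created, written and removed."""
    try:
        directory = tempfile.gettempdir()
    except OSError:
        # No usable temporary directory could be determined at all.
        return None, False
    try:
        fd, name = tempfile.mkstemp(prefix="plone_transform_check_", dir=directory)
        try:
            os.write(fd, b"test")
        finally:
            os.close(fd)
            os.remove(name)
    except OSError:
        return directory, False
    return directory, True


if __name__ == "__main__":
    directory, ok = check_temp_dir()
    print("Temporary folder: %s (%s)" % (directory, "writable" if ok else "NOT writable"))
```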
Related content
- Enable full-text indexing of Word documents and PDFs in Plone 3.0 (GNU/Linux)
- How to install third-party command-line converters to enable full-text indexing of Word documents and PDFs in Plone 3.0 on GNU/Linux (Ubuntu, Debian, et al.)

- Set PATH variable in Windows