
Enable full-text indexing of Word documents and PDFs in Plone 3.0 (Windows)

This How-to applies to: Plone 3.0.x
This How-to is intended for: Server Administrators

How to install third-party command-line converters to enable full-text indexing of Word documents and PDFs in Plone 3.0 on Windows

In Plone 3.0, no additional add-on products are needed to index Word documents and PDFs. However, you must install tools that can convert Word documents and PDFs to HTML or plain text. This how-to uses wvware for Word documents and pdftohtml for PDFs. You can also install MS Word or OpenOffice to convert Word documents, but I think most people would rather avoid these applications on their servers. I wrote this how-to for those people.

On Linux the installation is pretty simple (see this how-to for more information, thanks to Kamal Gill). On Windows, in addition to the installation, you have to make two small changes to the Python code to enable Word document indexing. But don't be afraid; they are really small.

Step-by-step

  1. First you have to download the command-line tools:
    1. Go to http://gnuwin32.sourceforge.net/packages/wv.htm and download the Windows "binaries" and "dependencies" files for wvware.
    2. Go to http://sourceforge.net/projects/pdftohtml/ and download the Windows binary file for pdftohtml.
  2. The next step is to install these two tools:
    1. To install wvware, extract the "binaries" zip file into a directory of your choice. Next, extract the "dependencies" zip file into a directory of your choice. Then copy the DLL files from the bin directory of the "dependencies" directory into the bin directory of the "binaries" directory.
    2. To install pdftohtml, just extract the zip file into a directory of your choice.
  3. To make these command-line tools available, you now have to add the executables to the system's PATH. Add C:\path\to\wvware\bin and C:\path\to\pdftohtml\ to the PATH environment variable.
  4. Next, we have to patch two files: word_to_html.py and office_wvware.py. You can find both under C:\path\to\Plone3\Data\Products\PortalTransforms\transforms:
    1. Let's start with word_to_html.py. In lines 13 to 28 you can see that office_wvware is only used on a POSIX platform, so we have to change that: replace office_com in line 28 with office_wvware.
    2. Save your changes and close word_to_html.py.

    3. In office_wvware.py we have to do a little more. At about line 23 there is also an if statement that prevents the following command from being executed on any platform other than POSIX. Replace the whole block with this Windows-compatible command:

        os.system('cd %s && wvware.exe --charset=utf-8 %s > %s.html' % (tmpdir,
                                                                        self.fullname,
                                                                        self.__name__))

    4. Save your changes and close office_wvware.py.

  5. Now, restart your Plone instance.
  6. When a site is created, Plone checks whether a pdftohtml binary is available. If there is none, no pdf_to_html transform is added to that site, so we have to add it by hand. Go to the portal_transforms tool in your ZMI and select Transform in the drop-down box. Add a new transform with the ID pdf_to_html and the module Products.PortalTransforms.transforms.pdf_to_html by clicking the Add button. This step is not necessary for a Plone site that you create after installing the pdftohtml binary.
  7. Now there is only one more step. Go to the portal_catalog tool, choose the Advanced tab, and click the Update Catalog button.
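
For reference, the replacement line in step 4 does nothing more than build a shell command and hand it to os.system. The sketch below builds the same string so you can see what ends up being executed. The function name build_wvware_command and the sample paths are made up for illustration; in office_wvware.py the values come from tmpdir, self.fullname and self.__name__. Note that a plain cd does not switch drives in cmd.exe and that paths containing spaces would need quoting; neither case is handled by this command.

```python
def build_wvware_command(tmpdir, fullname, name):
    """Return the shell command string used by the patched transform.

    Hypothetical helper for illustration: the patched office_wvware.py
    builds this string inline and passes it straight to os.system().
    """
    return 'cd %s && wvware.exe --charset=utf-8 %s > %s.html' % (
        tmpdir, fullname, name)


if __name__ == '__main__':
    # Sample values only; the real ones are chosen by PortalTransforms.
    print(build_wvware_command(r'C:\Temp\abc',
                               r'C:\Temp\abc\report.doc',
                               'report'))
```

Running the printed command by hand in a command prompt is a quick way to check that wvware.exe converts a given document at all.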

Finally we are done. You can now search for text in the PDFs and Word documents of your site.
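
If the search does not find anything, a first thing to check is whether the two converters can actually be found on the PATH. The following is a minimal sketch of such a check, not part of Plone. It assumes a recent standalone Python 3 (shutil.which is not available in the Python bundled with Plone 3), and the function name find_converters is made up for this example. Keep in mind that a process only sees the PATH it was started with, so the result is only meaningful in a freshly opened command prompt.

```python
import shutil


def find_converters(names=('wvware', 'pdftohtml')):
    """Map each tool name to the executable found on PATH, or None."""
    # shutil.which searches PATH; on Windows it also tries the
    # extensions listed in PATHEXT, so 'wvware' matches wvware.exe.
    return dict((name, shutil.which(name)) for name in names)


if __name__ == '__main__':
    for name, location in find_converters().items():
        print('%-10s %s' % (name, location or 'NOT FOUND on PATH'))
```

A tool reported as not found here will not be found by Plone either when it is started from the same environment.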

 

Further information

If you are running Plone as a service, you have to make sure that the account used to run that service has a temporary folder configured where the command-line tools can save their output.
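
One way to verify this is a small script like the one below. It is only a sketch: check_temp_writable is a made-up name, and the script tells you something useful only when it is run under the same account as the Plone service. It relies on Python's tempfile module, which picks the temporary directory from the TMPDIR, TEMP and TMP environment variables before falling back to platform defaults.

```python
import os
import tempfile


def check_temp_writable():
    """Return (directory, ok); ok is True if a file can be created there."""
    try:
        directory = tempfile.gettempdir()
    except (IOError, OSError):
        # No usable temporary directory could be determined at all.
        return None, False
    try:
        handle, path = tempfile.mkstemp(suffix='.html', dir=directory)
    except (IOError, OSError):
        return directory, False
    # Clean up the probe file again.
    os.close(handle)
    os.remove(path)
    return directory, True


if __name__ == '__main__':
    print('temp dir: %s, writable: %s' % check_temp_writable())
```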

See also:

Enable full-text indexing of Word documents and PDFs in Plone 3.0 (GNU/Linux)
How to install third-party command-line converters to enable full-text indexing of Word documents and PDFs in Plone 3.0 on GNU/Linux (Ubuntu, Debian, et al.)
by Dominik Ruf last modified October 21, 2007 - 16:21
Contributors: Sheila Maxwell
All content is copyright Plone Foundation and the individual contributors.

Set PATH variable in Windows

Posted by Dan Thomas at October 22, 2007 - 00:08
See this article for a good description of how to change the PATH statement in Windows: http://plone.org/documentation/how-to/using-ploneout-on-windows

Error Getting

Posted by Allyson Roberto Alves Cavalcanti at October 24, 2007 - 18:40
I tried to follow this how-to and I am getting this error.

Traceback (innermost last):
Module ZPublisher.Publish, line 119, in publish
Module ZPublisher.mapply, line 88, in mapply
Module ZPublisher.Publish, line 42, in call_object
Module Products.PortalTransforms.TransformEngine, line 389, in manage_addTransform
Module Products.PortalTransforms.TransformEngine, line 263, in _mapTransform
Module Products.MimetypesRegistry.MimeTypesRegistry, line 218, in lookup
- __traceback_info__: ("'BROKEN'", 'BROKEN')
Module Products.MimetypesRegistry.MimeTypesRegistry, line 449, in split
MimeTypeException: Malformed MIME type (BROKEN)

Am I doing something wrong?

Same here

Posted by Yves Moisan at November 14, 2007 - 20:19
I get the same error. Why isn't all of that bundled in the installer? OOTB PDF indexing isn't quite there yet :-(

Rebooting does the trick

Posted by Yves Moisan at November 16, 2007 - 16:47
As usual when things fail on Windows, a Zope restart is not sufficient: reboot the machine. After that, one can add the new transform without a problem.

One issue I have is that I uploaded a PDF in French (so with accented characters). Since I'm on Windows, the encoding must be different from the UTF-8 that Plone uses, so searching for accented words does not bring up the PDF in the search results. Using the closest letter without an accent seems to work, but that obviously is a hack users shouldn't have to resort to. Still, it works!

rebooting not helping

Posted by Amandeep Singh Tur at February 13, 2008 - 03:47
No yvesm, even rebooting does not help, and I'm getting the same broken MIME type error.

.

Posted by Amandeep Singh Tur at February 13, 2008 - 05:15
In addition to the broken MIME type error, I'm also getting
ImportError: No module named pdf

Solving Problem on Windows

Posted by Amandeep Singh Tur at February 14, 2008 - 03:51
Neither rebooting the machine nor adding pdftohtml and wvware to the system PATH helped. I had to trick Plone into believing that the tools are installed by creating pdftohtml.bat and wvHtml.bat in the Windows directory. Also, there is no wvHtml.exe in the wvware suite, so I used wvware.exe in the batch file. The pdftohtml.bat contains this line:
c:\pdftohtml\pdftohtml.exe %*

The wvHtml.bat contains:
c:\wv\bin\wvware.exe %*

Now I'm facing a problem on a Windows 2003 box: wvWare says "I won't mmap that file, using a slower method". If anybody has a solution for this, please respond. I'm off to find one.
