Enable full-text indexing of Word documents and PDFs in Plone 3.0 (Windows)

How to install third-party command-line converters to enable full-text indexing of Word documents and PDFs in Plone 3.0 on Windows

Plone 3.0 needs no additional add-on products to index Word documents and PDFs. However, you must install tools that can convert Word documents and PDFs to HTML or plain text. This how-to uses wvware for Word documents and pdftohtml for PDFs. You can also install MS Word or OpenOffice to convert Word documents, but I think most people would rather keep these applications off their servers. I wrote this how-to for those people.

On Linux the installation is pretty simple (see this how-to for more information, thanks to Kamal Gill). On Windows, in addition to the installation, you have to make two small changes to the Python code to enable Word document indexing. But don't be afraid; they are really small.

Step-by-step

  1. First you have to download the command-line tools:
    1. Go to http://gnuwin32.sourceforge.net/packages/wv.htm and download the Windows "binaries" and "dependencies" files for wvware.
    2. Go to http://sourceforge.net/projects/pdftohtml/ and download the Windows binary file for pdftohtml.
  2. The next step is to install these two tools:
    1. To install wvware, extract the "binaries" zip file into a directory of your choice. Next, extract the "dependencies" zip file into a directory of your choice. Then copy the DLL files from the bin directory of the "dependencies" directory into the bin directory of the "binaries" directory.
    2. To install pdftohtml, just extract the zip file into a directory of your choice.
  3. To make these command-line tools available, you now have to add the executables to the system's PATH. Add C:\path\to\wvware\bin and C:\path\to\pdftohtml\ to the PATH environment variable.
  4. Next, we have to patch two files: word_to_html.py and office_wvware.py. You can find both under C:\path\to\Plone3\Data\Products\PortalTransforms\transforms:
    1. Let's start with word_to_html.py. In lines 13 to 28 you can see that office_wvware is only used on a POSIX platform, so we have to change that: replace office_com in line 28 with office_wvware.
    2. Save your changes and close word_to_html.py.
    3. In office_wvware.py we have to do a little bit more. At about line 23 there is also an if statement that prevents the following command from being executed on any platform other than POSIX. Replace the whole block with this Windows-compatible command:

            os.system('cd %s && wvware.exe --charset=utf-8 %s > %s.html'
                      % (tmpdir, self.fullname, self.__name__))

    4. Save your changes and close office_wvware.py.

  5. Now, restart your Plone instance.
  6. During the creation of a site, Plone checks to see if a pdftohtml binary is available. If there is no pdftohtml binary, then no pdf_to_html transform will be added to that site. So we have to add it. Go to the portal_transforms tool in your ZMI, select Transform in the drop-down box. Add a new transform with the ID pdf_to_html and the Module Products.PortalTransforms.transforms.pdf_to_html by clicking on the Add button. This step is not necessary for a Plone site that you create after the installation of the pdftohtml binary.
  7. Now there is only one more step: go to the portal_catalog tool, choose the Advanced tab, and click the Update Catalog button.
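
The os.system command from step 4 builds a shell string by plain substitution, so a temporary directory or file name containing spaces would probably break it. As a rough sketch only, and not the PortalTransforms code, the same call could be expressed with the subprocess module. The helper names wvware_command and convert_word_to_html are made up for this example, and the code targets a current Python, so it would need adapting before it could be pasted into office_wvware.py.

```python
import os
import subprocess


def wvware_command(fullname, charset="utf-8"):
    """Build the argument list for the wvware call used in step 4."""
    # Passing a list avoids shell quoting problems with spaces in paths.
    return ["wvware.exe", "--charset=%s" % charset, fullname]


def convert_word_to_html(tmpdir, fullname, name):
    """Run wvware inside tmpdir and write its output to <name>.html there."""
    output_path = os.path.join(tmpdir, name + ".html")
    with open(output_path, "wb") as output:
        # cwd stands in for the "cd %s &&" prefix of the original command,
        # and stdout stands in for the "> %s.html" redirect.
        subprocess.check_call(wvware_command(fullname), cwd=tmpdir, stdout=output)
    return output_path
```

check_call raises an exception when wvware exits with a non-zero status, whereas os.system only returns the status, so a failed conversion would be harder to miss.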

Finally we are done. You can search for text in the PDFs and Word documents of your site now.
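
If searches still return nothing, a first thing to check is whether both converters are really reachable through the PATH set in step 3. The following is a minimal sketch, not part of Plone; find_on_path is a hypothetical helper, and the list of extensions it tries is an assumption. Run it under the same account that runs Plone.

```python
import os


def find_on_path(name, path=None, extensions=(".exe", ".bat", ".cmd", "")):
    """Return the first file named name + extension found on path, or None."""
    if path is None:
        path = os.environ.get("PATH", "")
    for directory in path.split(os.pathsep):
        if not directory:
            continue  # skip empty PATH entries
        for ext in extensions:
            candidate = os.path.join(directory, name + ext)
            if os.path.isfile(candidate):
                return candidate
    return None


if __name__ == "__main__":
    for tool in ("wvware", "pdftohtml"):
        location = find_on_path(tool)
        print("%-10s %s" % (tool, location or "NOT FOUND on PATH"))
```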

 

Further information

If you are running Plone as a service, make sure that the account running that service has a temporary folder configured where the command-line tools can save their output.
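
To see which temporary folder a given account gets and whether files can be created there, a small check like the one below can be run under the service account. It is only a sketch: check_temp_writable is a hypothetical helper, and it assumes the converters' output ends up in the folder that Python's tempfile module resolves, which this how-to does not confirm.

```python
import os
import tempfile


def check_temp_writable(folder=None):
    """Return (folder, ok); ok is True if a file can be created in folder."""
    # By default, use the folder Python resolves from the TMPDIR, TEMP or TMP
    # environment variables, or a platform default.
    folder = folder or tempfile.gettempdir()
    try:
        handle, path = tempfile.mkstemp(suffix=".html", dir=folder)
    except OSError:
        return folder, False
    os.close(handle)
    os.remove(path)
    return folder, True


if __name__ == "__main__":
    folder, ok = check_temp_writable()
    print("temp folder: %s (%s)" % (folder, "writable" if ok else "NOT writable"))
```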

Related content

Enable full-text indexing of Word documents and PDFs in Plone 3.0 (GNU/Linux)
How to install third-party command-line converters to enable full-text indexing of Word documents and PDFs in Plone 3.0 on GNU/Linux (Ubuntu, Debian, et al.)

Set PATH variable in Windows

Posted by Dan Thomas at Oct 22, 2007 12:08 AM
See this article for a good description of how to change the PATH statement in Windows: http://plone.org/[…]/using-ploneout-on-windows

Error Getting

Posted by Allyson Roberto Alves Cavalcanti at Oct 24, 2007 06:40 PM
I tried to follow this how-to and I am getting this error.

Traceback (innermost last):
  Module ZPublisher.Publish, line 119, in publish
  Module ZPublisher.mapply, line 88, in mapply
  Module ZPublisher.Publish, line 42, in call_object
  Module Products.PortalTransforms.TransformEngine, line 389, in manage_addTransform
  Module Products.PortalTransforms.TransformEngine, line 263, in _mapTransform
  Module Products.MimetypesRegistry.MimeTypesRegistry, line 218, in lookup
   - __traceback_info__: ("'BROKEN'", 'BROKEN')
  Module Products.MimetypesRegistry.MimeTypesRegistry, line 449, in split
MimeTypeException: Malformed MIME type (BROKEN)

Am I doing something wrong?

Same here

Posted by Yves Moisan at Nov 14, 2007 08:19 PM
I get the same error. Why isn't all of that bundled in the installer? OOTB PDF indexing isn't quite there yet :-(

Rebooting does the trick

Posted by Yves Moisan at Nov 16, 2007 04:47 PM
As usual when things fail on Windows, a Zope restart is not sufficient: reboot the machine. After that, one can add the new transform without a problem.

One issue I have is that I uploaded a PDF in French (so with accented characters). Since I'm on Windows, the encoding must be different from the UTF-8 that Plone uses, so searching for accented words does not bring up the PDF in the search results. Using the closest letter without an accent seems to work, but that obviously is a hack users shouldn't have to resort to. Still, it works!

rebooting not helping

Posted by Amandeep Singh Tur at Feb 13, 2008 03:47 AM
No yvesm, even rebooting does not help, and I'm getting the same broken MIME type error.

.

Posted by Amandeep Singh Tur at Feb 13, 2008 05:15 AM
In addition to the broken MIME type error, I'm also getting
ImportError: No module named pdf

Solving Problem on Windows

Posted by Amandeep Singh Tur at Feb 14, 2008 03:51 AM
Neither rebooting the machine nor adding pdftohtml and wvware to the system PATH helped. I had to trick Plone into believing that the tools are installed by creating pdftohtml.bat and wvHtml.bat in the Windows directory. Also, there is no wvHtml.exe in the wvware suite, so I used wvware.exe in the batch file. pdftohtml.bat contains this line:
c:\pdftohtml\pdftohtml.exe %*

wvHtml.bat contains:
c:\wv\bin\wvware.exe %*

Now I'm facing a problem on a Windows 2003 box: wvWare says "I won't mmap that file, using a slower method". If anybody has a solution for this, please respond. I'm off to find one.

I won't mmap that file, using a slower method

Posted by Lukasz Lakomy at Jul 30, 2008 10:39 AM
I have the same problem on Windows 2008 Server. Has anyone found a solution?

I've run wvware.exe outside Plone and it works: documents are converted to HTML. But after the conversion I get a popup with the above message. I choose the option to close the program, and the converted HTML is still there.

I've read on some forums that this message is harmless, and it looks that way because I do get the result. When I view "Problem Details" in that popup, I see that the fault is in libwv2.dll.

So maybe this is some kind of 'security feature' in new OSes from Microsoft?

User discrimination

Posted by Veronica Cuello at Nov 21, 2008 04:32 PM
Great Stuff! Thanks!

... now the indexing of files is partially working for me.
Somehow files from particular users are not indexed. I thought that maybe it was the version of Word (Office) those users are using, but they are using an old version, and OpenOffice should not have any trouble opening those. Any guess?

Thanks guys!

Support for OCR scanned documents

Posted by Sheila Maxwell at Mar 26, 2009 08:06 PM
When we first attempted to set this up, we found that scanned documents that were searchable from a PDF reader would not produce results in Plone. To fix this, open pdf_to_text.py and change both instances of binaryArgs to include -i and -hidden.