#177: Include support for indexing Word, PDF and other common types

Contents
  1. Motivation
  2. Proposal
  3. Implementation
  4. Risks
  5. Participants
by Alexander Limi last modified August 28, 2006 - 06:29
It is increasingly important for Plone to be able to "search inside" binary file types like Word documents and PDF files. While this is already possible with add-ons like TextIndexNG, the basic functionality can be supported without including the entire TextIndexNG framework.
Proposed by: Alexander Limi
Seconded by: Kapil Thangavelu
Proposal type: Architecture
State: in-progress

Motivation

The main reason for including this in the core is that no self-respecting CMS can be without Word/PDF indexing these days - especially if used for intranets and extranets. Most of the knowledge in the average company is hidden inside Word documents and PDFs, and one of Plone's goals is to make this information accessible to people.

Also, the fact that 50% (very scientifically measured ;) of the questions on plone-users are from people who can't install TextIndexNG should be a good indicator that this is functionality a lot of people need.

Proposal

Instead of including the kitchen sink (i.e. TextIndexNG plus converters, switching index types, and so on), there is a more lightweight approach that will solve 95% of the indexing needs for most users.

In short, the required changes are:

  1. Make sure the Plone installers ship with the conversion tools required (I believe wvware handles most of them and is the preferred one; correct me if I'm mistaken).
  2. Make a change to Archetypes' BaseObject to index the content of the object's SearchableText using the conversion tools and Archetypes' transform infrastructure.
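
The second change can be sketched in isolation. The following is a minimal illustration of the idea, not the Archetypes code: `Field`, `searchable_text` and the `converters` registry are hypothetical stand-ins for the schema fields and the transform infrastructure, and the fake MIME type in the usage example is invented.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Union

# Hypothetical converter registry: MIME type -> function returning plain text.
Converter = Callable[[bytes], str]


@dataclass
class Field:
    """Stand-in for a schema field; not the real Archetypes class."""
    name: str
    value: Union[str, bytes]
    searchable: bool = True
    mime_type: str = "text/plain"
    is_file: bool = False


def searchable_text(fields: List[Field], converters: Dict[str, Converter]) -> str:
    """Concatenate the indexable text of all searchable fields."""
    parts: List[str] = []
    for field in fields:
        if not field.searchable:
            continue
        if field.is_file:
            convert = converters.get(field.mime_type)
            # No converter registered: skip the payload instead of indexing raw bytes.
            if convert is not None:
                raw = field.value
                if not isinstance(raw, bytes):
                    raw = str(raw).encode("utf-8")
                parts.append(convert(raw))
            continue
        parts.append(str(field.value))
    return " ".join(part for part in parts if part)
```

With a fake converter registered for one MIME type, a title field and a file field end up in the same text, while non-searchable fields and files without a converter contribute nothing:

```python
fields = [
    Field("title", "Report"),
    Field("file", b"HELLO", mime_type="application/x-fake", is_file=True),
    Field("image", b"\x00", mime_type="image/png", is_file=True),
]
text = searchable_text(fields, {"application/x-fake": lambda b: b.decode("ascii").lower()})
```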

Implementation

Including the binaries in the installers should be pretty straightforward: most Linux distros already have wvware packages available, and there is even a Windows installer with all the binaries compiled and ready to run.

On the AT side, the following patch will have to be applied. (Kapil is out of town this weekend, so I took the responsibility of writing the PLIP for him. I assume he can make a bundle when he comes back, but as you can see, the changes are minimal.)

    Index: BaseObject.py
    ===================================================================
    --- BaseObject.py   (revision 6689)
    +++ BaseObject.py   (working copy)
    @@ -524,6 +524,34 @@
             for field in self.Schema().fields():
                 if not field.searchable:
                     continue
    +
    +            if isinstance( field, FileField) and not isinstance( field, ImageField):
    +                mime_type = field.getContentType( self )
    +                file_name = field.getFilename(self, 0)
    +                content = field.get( self )  # assumption: the posted patch never assigns "content"
    +                if not isinstance( content, str ):
    +                    raw = str( content )
    +                else:
    +                    raw = content
    +
    +                # XXX need to catch and log errors .. prints the traceback to stderr for now
    +                try:
    +                    text_content = transforms.convertTo(
    +                        "text/plain",
    +                        raw,
    +                        mimetype = mime_type,
    +                        filename = file_name
    +                    )
    +                except MissingBinary:
    +                    traceback.print_exc()
    +                    text_content = ""
    +                except IOError:
    +                    traceback.print_exc()
    +                    text_content = ""
    +                # "data" is the enclosing list of text fragments; "raw" avoids shadowing it
    +                data.append( text_content )
    +                continue
    +                
                 method = field.getIndexAccessor(self)
                 try:
                     datum =  method(mimetype="text/plain")
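
The XXX comment in the patch asks for conversion errors to be logged rather than printed. A minimal sketch of that, using only the standard `logging` module: the `MissingBinary` class below is a local stand-in for the exception named in the patch, and `convert_or_empty` and the logger name are hypothetical.

```python
import logging

logger = logging.getLogger("plip177.sketch")  # hypothetical logger name


class MissingBinary(Exception):
    """Local stand-in for the exception named in the patch (assumption)."""


def convert_or_empty(convert, data, **kwargs):
    """Call convert(data, **kwargs); on a known failure, log it and return ""."""
    try:
        return convert(data, **kwargs)
    except (MissingBinary, IOError):
        # logger.exception records the message plus the active traceback.
        logger.exception("text conversion failed; indexing nothing for this field")
        return ""
```

Returning an empty string keeps the save going when a converter is missing, which matches what the patch does, while the traceback goes to whatever handlers the site has configured.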

Risks

Saving the object will take slightly longer than it currently does if it's a binary file because of the indexing - but since the BLOB handling is already slow, I don't think it makes much of a difference. Of course, uploading a 600MB Word document will tie up the thread for a while - but it will do that anyway, regardless of whether the indexing is enabled or not.

Participants

Kapil Thangavelu

Two clarifications

Posted by Alexander Limi at August 31, 2006 - 20:34
1. The above patch might not even be necessary according to Ben - there might be a bug or something that should be improved in the original code, but it should actually already be doing this.

2. We should probably also include mxTidy in the list of dependencies, as Plone doesn't always produce valid HTML without it installed.

External Binaries

Posted by Maik Roeder at September 4, 2006 - 14:43
Since you plan to depend on external binaries, what do you do in case of:

1. infinite loops
2. memory leaks
3. garbled return values

Reinventing wheels?!

Posted by Andreas Jung at December 11, 2006 - 14:51
Obviously you guys have nothing better to do than reinventing wheels.

"""
Also, the fact that 50% (very scientifically measured ;) of the questions on plone-users is from people who can't install TextIndexNG should be a good indicator that it is a functionality a lot of people need.
"""

What are the reasons for this?

a) People are stupid or too lazy to read documentation

b) The community refuses to contribute back, e.g. by providing help in building the binary version of the extension modules for Windows

AttachmentField already ships with three tons of external converters. Now you are trying to do the same. Why do you always have to invent something better (in your eyes), just for the sake of doing things differently?

Please stop the flaming

Posted by Alexander Limi at December 17, 2006 - 03:19
If you want to discuss the issue with the people that make the decisions, please join the Framework Team list (in Gmane and on lists.plone.org).

Quoting from a mail Kapil sent out there:

"""
TXNG3 is a huge package, it's not just a plugin index — it's basically its own catalog infrastructure, with lots of code, including C extensions, with one maintainer, afaik. 98% of the people (I bet) install it for one reason, namely the focus of this PLIP, indexing common office file types, and all its extra complexity, features, and options ignored.

For this particular purpose, under the hood TXNG3 is utilizing the same machinery, so it's best I think to just give the functionality that most users already want, is already in the codebase, via just exposing the functionality, as opposed to including an entirely new framework that needs to be supported and maintained.
"""

Implemented on Archetypes trunk

Posted by Daniel Nouri at February 21, 2007 - 13:40
In Archetypes trunk, in r7501 [1], I added some bits to the FileField so that it indexes all files that it can convert to text/plain on SearchableText().

However, for Plone 3.0, I plan to make this behaviour non-default, because wvware can hang your process forever when it doesn't like the doc file, and portal_transforms is not clever enough to detect this. See r7517 [2].
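
One common way to contain a converter that hangs is to run it as a child process with a timeout. The sketch below uses only the standard `subprocess` module and is not how portal_transforms works: `run_converter`, its argument list and the choice to feed the file on stdin are assumptions for illustration, and a real converter may expect a file path instead.

```python
import subprocess


def run_converter(argv, data, timeout_seconds=30.0):
    """Run an external converter with `data` on stdin; return its text or ""."""
    try:
        result = subprocess.run(
            argv,
            input=data,
            stdout=subprocess.PIPE,
            stderr=subprocess.DEVNULL,
            timeout=timeout_seconds,  # the child is killed when this expires
            check=True,               # a non-zero exit status counts as failure
        )
    except (subprocess.TimeoutExpired, subprocess.CalledProcessError, OSError):
        # Hung, crashed, or missing binary: index nothing for this file.
        return ""
    # Undecodable bytes become U+FFFD instead of raising.
    return result.stdout.decode("utf-8", errors="replace")
```

Because the converter runs in its own process, any memory it leaks is released when the child exits. The timeout only bounds how long the calling thread waits; it does not make the conversion itself any cheaper.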

I plan to include a control panel for 3.0 that lets you enable indexing per portal. I'd also like to include a maximum size for files to be indexed.
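
The per-portal switch and the size limit amount to a small guard in front of the conversion. This is a sketch only: `IndexingSettings`, `should_index` and the 10 MB default are hypothetical and not taken from the Archetypes changesets.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexingSettings:
    """Hypothetical per-portal settings; names and defaults are illustrative."""
    enabled: bool = False               # off unless the site admin opts in
    max_bytes: int = 10 * 1024 * 1024   # arbitrary example limit


def should_index(size_bytes: int, settings: IndexingSettings) -> bool:
    """Return True only if indexing is enabled and the file is within the limit."""
    return settings.enabled and 0 < size_bytes <= settings.max_bytes
```

Checking the size before handing the file to a converter means an oversized upload never reaches the external binary at all.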

For the future, oooconv (https://infrae.com/svn/buildout/oooconv-dev/trunk/) would definitely be something to check out. DocumentLibrary uses it. I can look into building a bridge for Plone that uses oooconv to do the conversion, if someone decides they want it.


[1] http://dev.plone.org/archetypes/changeset/7501
[2] https://dev.plone.org/archetypes/changeset/7517

wvware painful on OS X, Solaris

Posted by Chris Shenton at April 6, 2007 - 13:53
I've used TNG3 with wmware but when trying to run Plone on OS X or Solaris, I've found vmware a nightmare to build. I've recently found some ancient prebuilt wmare binaries for PPC OS X but that won't help x86 Mac. I've also spent over a week trying to get wmware to build on various Solaris versions (sparc, x86) and had no joy. There are a huge number of recursive dependencies, as wmware is exceptionally Linux-centric, and most seem irrelevant to the task of text extraction.

It might be easier if we could use vmware2 -- dunno -- but some less dependency-bloated indexer would certainly be a help, whether TNG3 or some newfangled indexer is used.

This is hurting one of my .gov clients because their standard UNIX platform is Solaris and some of the departments we support have Xserves. Not Linux or FreeBSD.

Thanks.

wvware

Posted by Chris Shenton at April 6, 2007 - 14:13
oops, I suck. I meant "wvware" in all the above.
