#177: Include support for indexing Word, PDF and other common types
- Proposed by: Alexander Limi
- Seconded by: Kapil Thangavelu
- Proposal type: Architecture
- State: in-progress
Motivation
The main reason for including this in the core is that no self-respecting CMS can be without Word/PDF indexing these days - especially if used for intranets and extranets. Most of the knowledge in the average company is hidden inside Word documents and PDFs, and one of Plone's goals is to make this information accessible to people.
Also, the fact that 50% (very scientifically measured ;) of the questions on plone-users are from people who can't install TextIndexNG should be a good indicator that this is functionality a lot of people need.
Proposal
Instead of including the kitchen sink (i.e. TextIndexNG plus converters), switching index types, and so on, there is a more lightweight approach that will solve 95% of the indexing needs for most users.
In short, the required changes are:
- Make sure the Plone installers ship with the conversion tools required (I believe wvware handles most of them and is the preferred one; correct me if I'm mistaken).
- Change Archetypes' BaseObject so that the object's SearchableText includes the text content of file fields, extracted with the conversion tools and Archetypes' transform infrastructure.
Implementation
Having the installers include the binaries should be pretty straightforward: most Linux distros already have the wvware packages available, and there is even a Windows installer with all the binaries compiled and ready to run.
On the AT side, the following patch will have to be applied. (Kapil is out of town this weekend, so I took the responsibility of writing the PLIP for him. I assume he can make a bundle when he comes back, but as you can see, the changes are minimal):
Index: BaseObject.py
===================================================================
--- BaseObject.py (revision 6689)
+++ BaseObject.py (working copy)
@@ -524,6 +524,34 @@
         for field in self.Schema().fields():
             if not field.searchable:
                 continue
+
+            if isinstance(field, FileField) and not isinstance(field, ImageField):
+                mime_type = field.getContentType(self)
+                file_name = field.getFilename(self, 0)
+
+                if not isinstance(content, str):
+                    raw = str(content)
+                else:
+                    raw = content
+
+                # XXX need to catch and log errors .. prints to stdout for now
+                try:
+                    text_content = transforms.convertTo(
+                        "text/plain",
+                        raw,
+                        mimetype=mime_type,
+                        filename=file_name
+                    )
+                except MissingBinary:
+                    traceback.print_exc()
+                    text_content = ""
+                except IOError:
+                    traceback.print_exc()
+                    text_content = ""
+
+                data.append(text_content)
+                continue
+
             method = field.getIndexAccessor(self)
             try:
                 datum = method(mimetype="text/plain")
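The XXX comment in the patch asks for conversion errors to be logged rather than printed. Below is a minimal, standalone sketch of the same loop that does so. It assumes simplified stand-ins: `Field`, `MissingBinary`, `collect_searchable_text` and the `convert` callable are hypothetical names for illustration, not the Archetypes or portal_transforms API. Only the `logging` calls are real library API.

```python
import logging
from dataclasses import dataclass

logger = logging.getLogger("searchable_text_sketch")


class MissingBinary(Exception):
    """Hypothetical stand-in for 'the converter binary is not installed'."""


@dataclass
class Field:
    # Hypothetical, simplified field; not the Archetypes Field class.
    name: str
    value: object
    searchable: bool = True
    is_file: bool = False
    mime_type: str = "text/plain"


def collect_searchable_text(fields, convert):
    """Join the text of all searchable fields, converting file fields first.

    `convert(data, mime_type)` is an assumed callable returning plain text.
    """
    parts = []
    for field in fields:
        if not field.searchable:
            continue
        if field.is_file:
            try:
                parts.append(convert(field.value, field.mime_type))
            except (MissingBinary, IOError):
                # Log the traceback and index nothing for this field,
                # so one bad file never blocks saving the object.
                logger.exception("Could not convert field %r", field.name)
            continue
        parts.append(str(field.value))
    return " ".join(parts)
```

A failed conversion contributes no text to the index, which matches the patch's fallback of an empty string.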
Risks
Saving the object will take slightly longer than it currently does if it's a binary file, because of the indexing. Since the BLOB handling is already slow, I don't think it makes much of a difference. Of course, uploading a 600MB Word document will tie up the thread for a while, but it will do that anyway, regardless of whether indexing is enabled.
Participants
Kapil Thangavelu
External Binaries
1. infinite loops
2. memory leaks
3. garbled return values
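One way to contain these three risks is to run the converter as a child process with a timeout and to decode its output defensively. The following is a minimal sketch under assumptions: `run_converter` is a hypothetical helper, and it assumes a converter that reads the file on stdin and writes text to stdout, which may not match how a given tool is invoked.

```python
import subprocess
import sys


def run_converter(command, data, timeout=30.0):
    """Run an external converter and return its output as text.

    `command` is an assumed argument list for a tool that reads bytes on
    stdin and writes text to stdout. Returns "" when the tool hangs,
    fails, or cannot be started.
    """
    try:
        result = subprocess.run(
            command,
            input=data,
            stdout=subprocess.PIPE,
            stderr=subprocess.DEVNULL,
            timeout=timeout,  # child is killed on expiry: guards against infinite loops
        )
    except (subprocess.TimeoutExpired, OSError):
        return ""
    if result.returncode != 0:
        return ""
    # Undecodable bytes become U+FFFD instead of raising UnicodeDecodeError.
    return result.stdout.decode("utf-8", errors="replace")


if __name__ == "__main__":
    # Stand-in "converter": a Python one-liner that upper-cases its input.
    fake = [sys.executable, "-c",
            "import sys; sys.stdout.write(sys.stdin.read().upper())"]
    print(run_converter(fake, b"hello"))
```

Because the converter runs in its own process, any memory it leaks is released when that process exits instead of accumulating in the server process.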
Reinventing wheels?!
"""
Also, the fact that 50% (very scientifically measured ;) of the questions on plone-users are from people who can't install TextIndexNG should be a good indicator that this is functionality a lot of people need.
"""
What are the reasons for this?
a) People are stupid or too lazy to read documentation
b) The community refuses to contribute back, e.g. by providing help in building the binary version of the extension modules for Windows
AttachmentField already ships with three tons of external converters. Now you are trying to do the same. Why do you always have to invent something better (in your eyes), just for the sake of doing things differently?
Please stop the flaming
Quoting from a mail Kapil sent out there:
"""
TXNG3 is a huge package, it's not just a plugin index — it's basically its own catalog infrastructure, with lots of code, including C extensions, with one maintainer, afaik. 98% of the people (I bet) install it for one reason, namely the focus of this PLIP, indexing common office file types, and all its extra complexity, features, and options ignored.
For this particular purpose, under the hood TXNG3 is utilizing the same machinery, so it's best I think to just give the functionality that most users already want, is already in the codebase, via just exposing the functionality, as opposed to including an entirely new framework that needs to be supported and maintained.
"""
Implemented on Archetypes trunk
However, for Plone 3.0, I plan to make this behaviour non-default, because wvware can hang your process forever when it doesn't like the doc file, and portal_transforms is not clever enough to detect this. See r7517 [2].
I plan to include a control panel for 3.0 that lets you enable indexing per portal. Also, I'd like to include a maximum size for files to be indexed.
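The per-portal switch and size limit could be checked before any converter is called. This is a minimal sketch only: `IndexingSettings` and `should_index` are hypothetical names, and the default cap is an illustrative value, not one taken from this PLIP or from Plone.

```python
from dataclasses import dataclass


@dataclass
class IndexingSettings:
    # Hypothetical per-portal settings, as a control panel might store them.
    enabled: bool = False              # off by default, as planned for 3.0
    max_bytes: int = 5 * 1024 * 1024   # illustrative cap, not a value from the PLIP


def should_index(settings, data):
    """Return True when file content may be handed to a converter."""
    if not settings.enabled:
        return False
    return len(data) <= settings.max_bytes
```

Checking the size first means an oversized upload is stored as usual but never reaches the external binary.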
For the future, oooconv (https://infrae.com/svn/buildout/oooconv-dev/trunk/) would definitely be something to check out. DocumentLibrary uses it. I can look into building a bridge for Plone that uses oooconv to do the conversion, if someone decides they want it.
[1] http://dev.plone.org/archetypes/changeset/7501
[2] https://dev.plone.org/archetypes/changeset/7517
wvware painful on OS X, Solaris
It might be easier if we could use vmware2 -- dunno -- but some less dependency-bloated indexer would certainly help, whether TXNG3 or some newfangled indexer is used.
This is hurting one of my .gov clients because their standard UNIX platform is Solaris and some of the departments we support have Xserves, not Linux or FreeBSD.
Thanks.
Two clarifications
2. We should probably also include mxTidy in the list of dependencies, as Plone doesn't always produce valid HTML without it installed.