#177: Include support for indexing Word, PDF and other common types
It is increasingly important for Plone to be able to "search inside" binary file types like Word documents and PDF files. While add-ons like TextIndexNG already provide this, the basic functionality can be supported without including the entire TextIndexNG framework.
- Proposed by
- Alexander Limi
- Seconded by
- Kapil Thangavelu
- Proposal type
- Architecture
- State
- completed
Motivation
The main reason for including this in the core is that no self-respecting CMS can be without Word/PDF indexing these days - especially if used for intranets and extranets. Most of the knowledge in the average company is hidden inside Word documents and PDFs, and one of Plone's goals is to make this information accessible to people.
Also, the fact that 50% (very scientifically measured ;) of the questions on plone-users are from people who can't install TextIndexNG should be a good indicator that this is functionality a lot of people need.
Proposal
Instead of including the kitchen sink (i.e. TextIndexNG plus converters, switching index types, and so on), there is a more lightweight approach that will solve 95% of the indexing needs for most users.
Briefly summarized, the required changes are:
- Make sure the Plone installers ship with the required conversion tools (I believe wvware handles most of the formats and is the preferred one; correct me if I'm mistaken).
- Change Archetypes' BaseObject so that SearchableText includes the text of file fields, converted with the conversion tools and Archetypes' transform infrastructure.
Implementation
Including the binaries in the installers should be pretty straightforward: most Linux distros already have the wvware packages available, and there is even a Windows installer with all the binaries compiled and ready to run.
On the AT side, the patch below will have to be applied. Kapil is out of town this weekend, so I took responsibility for writing the PLIP for him; I assume he can make a bundle when he comes back. As you can see, the changes are minimal:
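As a rough illustration of what "shipping the binaries" has to guarantee, an installer or a post-install check could verify that the converter executables are on the PATH. This is only a sketch: the executable names `wvHtml` and `pdftotext` are assumptions about which converters end up being used, and `missing_converters` is a hypothetical helper, not part of Plone or Archetypes.

```python
import shutil

# Assumed executable names; the real list depends on which transforms are installed.
ASSUMED_CONVERTERS = ("wvHtml", "pdftotext")


def missing_converters(names=ASSUMED_CONVERTERS, which=shutil.which):
    """Return the converter executables that cannot be found on the PATH.

    `which` is injectable so the check can be exercised without the
    binaries actually being installed.
    """
    return [name for name in names if which(name) is None]
```

Calling `missing_converters()` with no arguments returns an empty list when every assumed converter is installed, and the names of the absent ones otherwise.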
Index: BaseObject.py
===================================================================
--- BaseObject.py (revision 6689)
+++ BaseObject.py (working copy)
@@ -524,6 +524,34 @@
         for field in self.Schema().fields():
             if not field.searchable:
                 continue
+
+            if isinstance( field, FileField) and not isinstance( field, ImageField):
+                mime_type = field.getContentType( self )
+                file_name = field.getFilename(self, 0)
+
+                if not isinstance( content, str ):
+                    raw = str( content )
+                else:
+                    raw = content
+
+                # XXX need to catch and log errors .. prints to stdout for now
+                try:
+                    text_content = transforms.convertTo(
+                        "text/plain",
+                        raw,
+                        mimetype = mime_type,
+                        filename = file_name
+                        )
+                except MissingBinary:
+                    traceback.print_exc()
+                    text_content = ""
+                except IOError:
+                    traceback.print_exc()
+                    text_content = ""
+
+                data.append( text_content )
+                continue
+
             method = field.getIndexAccessor(self)
             try:
                 datum = method(mimetype="text/plain")
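The error handling in the patch (convert the file, fall back to an empty string if a converter is missing or fails) can be sketched in isolation, with the "XXX" note addressed by logging instead of printing the traceback. This is a minimal sketch under stated assumptions: `file_text_for_index` and the `convert` callable are hypothetical names, and the `MissingBinary` class here is a local stand-in for the exception the patch catches, not the real one.

```python
import logging

logger = logging.getLogger("searchable_text_sketch")


class MissingBinary(Exception):
    """Local stand-in for the error raised when a converter executable is absent."""


def file_text_for_index(raw, mime_type, file_name, convert):
    """Return plain text for the search index, or "" if conversion fails.

    `convert` is a hypothetical callable (raw, mime_type, file_name) -> str
    standing in for the transform call made in the patch.
    """
    try:
        return convert(raw, mime_type, file_name)
    except (MissingBinary, IOError):
        # Log the traceback rather than printing it, then index nothing
        # for this field so that saving the object still succeeds.
        logger.exception(
            "Could not convert %r (%s) to text/plain", file_name, mime_type
        )
        return ""
```

The point of the fallback is that a missing or failing converter only costs the file's text in the index; it never prevents the object from being saved.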
Risks
Saving the object will take slightly longer than it currently does if it's a binary file, because of the indexing, but since the BLOB handling is already slow, I don't think it makes much of a difference. Of course, uploading a 600 MB Word document will tie up the thread for a while, but it will do that anyway, regardless of whether the indexing is enabled or not.
Participants
Kapil Thangavelu
Two clarifications
2. We should probably also include mxTidy in the list of dependencies, as Plone doesn't always produce valid HTML without it installed.