Indexing and searching
Plone Developer Manual is a comprehensive guide to Plone programming.
1. Introduction to ZCatalogs and the Catalog Tool
A brief introduction to ZCatalogs, the Catalog Tool and what they're used for.
Why ZCatalogs?
Plone is built on the CMF, which uses the ZODB to store content in a very free-form manner with arbitrary hierarchy and a lot of flexibility in general. For some content use cases, however, it is very useful to treat content as more ordered, or tabular. This is where ZCatalog comes in.
Searching, for example, requires being able to query content on structured data such as dates or workflow states. Additionally, query results often need to be sorted based on structured data of some sort. So when it comes to searching it is very valuable to treat our free-form persistent ZODB objects as if they were more tabular. ZCatalog indexes do exactly this.
Since the ZCatalog is in the business of treating content as tabular when it isn't necessarily so, it is very tolerant of any missing data or exceptions when indexing. For example, Plone includes "start" and "end" indexes to support querying events on their start and end dates. When a page is indexed, however, it doesn't have start or end dates. Since the ZCatalog is tolerant, it doesn't raise any exception when indexing the start or end dates on a page. Instead it simply doesn't include pages in those indexes. As such, it is appropriate to use indexes in the catalog to support querying or sorting when not all content provides the data indexed.
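The tolerant behaviour described above can be pictured with a toy model. This is only a sketch in plain Python, not ZCatalog's actual implementation; ToyFieldIndex, Page and Event are made-up names used purely for illustration.

```python
class ToyFieldIndex:
    """Illustrative stand-in for a catalog index (not the real ZCatalog code)."""

    def __init__(self, attr):
        self.attr = attr
        self._values = {}  # maps object id -> indexed value

    def index_object(self, obj_id, obj):
        # Objects lacking the attribute are skipped silently instead of
        # raising, so they simply never appear in this index.
        try:
            value = getattr(obj, self.attr)
        except AttributeError:
            return False
        self._values[obj_id] = value
        return True

    def query(self, value):
        return sorted(oid for oid, v in self._values.items() if v == value)


class Page:  # hypothetical content type without start/end dates
    title = "About us"


class Event:  # hypothetical content type with a start date
    title = "Sprint"
    start = "2010-05-01"


start_index = ToyFieldIndex("start")
start_index.index_object("page-1", Page())    # skipped: a page has no 'start'
start_index.index_object("event-1", Event())  # indexed
matches = start_index.query("2010-05-01")
```

In this model the page is never an error case; it is just absent from the "start" index, which is the behaviour the paragraph above relies on.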
This manual is intended as a brief getting-started guide to ZCatalogs, aimed especially at tasks specific to Plone; it does not cover advanced ZCatalog concepts in depth. If you want to learn more about ZCatalogs in the context of Zope, please refer to The Zope Book, Searching and Categorizing Content. If you want to perform advanced searches, AdvancedQuery, which has been included with Plone since the 3.0 release, is what you're looking for. See Searching with AdvancedQuery for a brief introduction.
Quick start
Every ZCatalog is composed of indexes and metadata. Indexes are fields you can search by, and metadata are copies of the contents of certain fields which can be accessed without waking up the associated content object.
Most indexes are also metadata fields. For example, you can search objects by Title and then display the Title of each object found without fetching the objects. Note, however, that not every index needs a matching metadata field.
When you search inside the catalog, what you get as a result is a list of elements known as brains. Brains have one attribute for each metadata field defined in the catalog, in addition to some methods to retrieve the underlying object and its location. Metadata values for each brain are saved in the metadata table of the catalog upon the (re)indexing of each object.
Brains are said to be lazy for two reasons: first, because they are only created 'just in time' as your code requests each result, and second, because retrieving a catalog brain doesn't wake up the objects themselves, avoiding a huge performance hit.
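The two kinds of laziness can be sketched in plain Python. The example below is a toy model under stated assumptions, not the catalog's real code: METADATA, load_object and toy_search are hypothetical names, and the "brains" are simple namespaces that carry metadata columns plus getPath() and getObject().

```python
from types import SimpleNamespace

# Hypothetical metadata table: one row of stored column values per path.
METADATA = {
    "/site/news/a": {"Title": "Alpha", "review_state": "published"},
    "/site/news/b": {"Title": "Beta", "review_state": "pending"},
}

loaded = []  # records which full objects were actually woken up


def load_object(path):
    """Pretend to fetch the full content object (the expensive step)."""
    loaded.append(path)
    return SimpleNamespace(path=path, body="full content of " + path)


def toy_search():
    """Yield brain-like records one at a time, built only from metadata."""
    for path, columns in METADATA.items():
        yield SimpleNamespace(
            getPath=lambda p=path: p,
            getObject=lambda p=path: load_object(p),
            **columns
        )


# Reading metadata columns never touches the underlying objects.
titles = [brain.Title for brain in toy_search()]
```

Only an explicit getObject() call on one of these records would append to the loaded list, which mirrors why listings built from brains alone stay cheap.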
To see ZCatalogs in action, fire up your favourite browser and open the ZMI. You'll see an object in the root of your Plone site named portal_catalog. This is the Catalog Tool, a Plone tool (like the Membership Tool or the Quickinstaller Tool) based on ZCatalog. It is created by default in every Plone site and indexes all created content.
Open it and click the Catalog tab at the top of the screen. There you can see the full list of currently indexed objects, filter them by path, and update and remove entries. If you click on any entry, a new tab (or window) will open showing the metadata and index values for the selected indexed object. Note that most fields are "duplicated" in the Index Contents and Metadata Contents tables, but their contents have different formats because, as mentioned earlier, indexes exist to be searched, while metadata exists so that certain attributes can be read without waking up the content object.
Back in the management view of the Catalog Tool, if you click the Indexes or the Metadata tab you'll see the full list of currently available indexes or metadata fields, respectively, along with their types and other details. There you can also add and remove indexes and metadata fields. If you're working in a test environment, you can use this management view to play with the catalog, but be aware that indexes and metadata are usually added through GenericSetup rather than through the ZMI.
2. Querying the catalog
How to search and list content by title, description, interface, location, etc.
The Catalog Tool has an easy and clean API to search for content. First of all, you need to acquire it from the current context. Here context can be any object in the site:
from Products.CMFCore.utils import getToolByName
catalog = getToolByName(context, 'portal_catalog')
To search for something and get the resulting brains, write:
results = catalog.searchResults(**kwargs)
Where kwargs is a dictionary of index names and their associated query values. Only the indexes that you care about need to be included. This is really useful if you have variable searching criteria, for example, coming from a form where the users can select different fields to search for. For example:
results = catalog.searchResults({'portal_type': 'Event', 'review_state': 'pending'})
Note that the indexes you include are combined with a logical AND, not OR. In other words, the query above will find all the items that are both an Event AND in the review state of pending.
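For the variable-criteria case mentioned above, one possible approach is to build the query dictionary from whatever the user actually filled in. This is a minimal sketch: build_query and the form dictionary are hypothetical, and the whitelist of index names is only an assumption about which indexes you want to expose.

```python
def build_query(form, allowed=("portal_type", "review_state", "Subject")):
    """Return a catalog query holding only the criteria the user supplied.

    `form` is any dict-like mapping of submitted field names to values;
    field names outside `allowed` are ignored.
    """
    query = {}
    for name in allowed:
        value = form.get(name)
        if value:  # skip None, '' and empty lists
            query[name] = value
    return query


# Hypothetical form submission: one field filled in, one left blank,
# and one field that is not in the whitelist.
form = {"portal_type": "Event", "review_state": "", "color": "red"}
query = build_query(form)
# The resulting dict could then be passed on, e.g.:
# results = catalog.searchResults(**query)
```

Because every key that survives is ANDed with the others, leaving blank fields out of the query is what keeps an empty form field from narrowing the results.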
Additionally, you can call the catalog tool directly, which is equivalent to calling catalog.searchResults():
results = catalog(portal_type='Event')
Available indexes
To see the full list of available indexes in your catalog, open the ZMI (which usually means navigating to http://yoursiteURL/manage), look for the portal_catalog tool in the root of your Plone site and check the Indexes tab. Note that there are different types of indexes; each accepts different types of search parameters and behaves differently. For example, FieldIndex and KeywordIndex support sorting, but ZCTextIndex doesn't. To learn more about indexes, see The Zope Book, Searching and Categorizing Content.
Some of the most commonly used ones are:
- Title
- The title of the content object.
- Description
- The description field of the content.
- Subject
- The keywords used to categorize the content. Example:
catalog.searchResults(Subject=('cats', 'dogs'))
- portal_type
- As its name suggests, this searches for content of the given portal type. For example:
catalog.searchResults(portal_type='News Item')
You can also specify several types using a list or tuple format:
catalog.searchResults(portal_type=('News Item', 'Event'))
- review_state
- The current workflow review state of the content. For example:
catalog.searchResults(review_state='pending')
- object_provides
- From Plone 3, you can search by the interface provided by the content. Example:
from Products.MyProduct.path.to import IIsCauseForCelebration
catalog(object_provides=IIsCauseForCelebration.__identifier__)
Searching by interface has some benefits. Suppose several types in your portal, for example event types like Birthday, Wedding and Graduation, implement the same interface (for example, IIsCauseForCelebration), and you want to get items of these types from the catalog. Querying by interface is more exact than naming the types explicitly (like portal_type=['Birthday', 'Wedding', 'Graduation']), because you don't care what the types are called: all you care about is the interface.
This has the additional advantage that if products added or modified later add types which implement the interface, these new types will also show up in your query.
Sorting and limiting the number of results
To sort the results, use the sort_on and sort_order arguments. The sort_on argument accepts any available index, even if you're not searching by it. The sort_order can be either 'ascending' or 'descending', where 'ascending' means from A to Z for a text field. 'reverse' is an alias equivalent to 'descending'. For example:
results = catalog.searchResults(Description='Plone documentation',
                                sort_on='sortable_title', sort_order='ascending')
catalog.searchResults() returns a list-like object, so to limit the number of results you can just use Python's slicing. For example, to get only the first 3 items:
results = catalog.searchResults(Description='Plone documentation')[:3]
In addition, ZCatalogs accept a sort_limit argument. It takes effect only when the results are sorted with sort_on, and it is only a hint for the search algorithms, so a few more items than requested may be returned. It is therefore preferable to use sort_limit and slicing together:
limit = 50
results = catalog.searchResults(Description='Plone documentation',
                                sort_on='sortable_title',
                                sort_limit=limit)[:limit]
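If this pattern is needed in several places, it could be wrapped in a small helper. The function below is a sketch, not part of the Catalog Tool API: limited_search is a hypothetical name, and catalog is assumed to be the Catalog Tool obtained with getToolByName as shown earlier.

```python
def limited_search(catalog, limit, sort_on, **criteria):
    """Return at most `limit` brains, sorted on the `sort_on` index.

    sort_limit is passed as an optimisation hint, and the slice enforces
    the exact upper bound in case more items come back.
    """
    results = catalog.searchResults(sort_on=sort_on, sort_limit=limit,
                                    **criteria)
    return results[:limit]
```

A call might look like limited_search(catalog, 5, 'sortable_title', portal_type='Event'), which would return no more than five brains.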
Searching for content within a folder
Use the 'path' argument to specify the physical path to the folder you want to search in.
By default, this will match objects in the specified folder and in all its sub-folders. To change this behaviour, pass a dictionary with the keys 'query' and 'depth' to the 'path' argument, where
- 'query' is the physical path, and
- 'depth' can be either 0, which returns only the brain for the queried path itself, or a greater number, which queries all items down to that depth (e.g. 1 searches just inside the specified folder, 2 searches inside the folder and inside its child folders, and so on).
The most common use case is listing the contents of an existing folder, which we'll assume to be the context object in this example:
folder_path = '/'.join(context.getPhysicalPath())
results = catalog(path={'query': folder_path, 'depth': 1})
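The two forms of the 'path' argument can be produced by one small helper. This is a sketch only: path_query is a hypothetical name, and it assumes the path is given as the tuple returned by getPhysicalPath().

```python
def path_query(physical_path, depth=None):
    """Build a value for the catalog's 'path' argument.

    With no depth, a plain path string is returned, which matches the
    folder and everything below it. With a depth, the dictionary form
    with 'query' and 'depth' keys is returned.
    """
    path = "/".join(physical_path)
    if depth is None:
        return path
    return {"query": path, "depth": depth}


# getPhysicalPath() returns a tuple whose first element is an empty
# string, so joining it yields a path with a leading slash.
listing = path_query(("", "Plone", "news"), depth=1)
everything_below = path_query(("", "Plone", "news"))
```

Either value could then be passed as catalog(path=listing) or catalog(path=everything_below).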
Getting the underlying object, its path, and its URL from a brain
As mentioned earlier, searching the catalog returns catalog brains, not the objects themselves. If you want to get the object associated with a brain, do:
brain.getObject()
To get the path of the object without fetching it:
brain.getPath()
which is equivalent to '/'.join(obj.getPhysicalPath()).
And finally, to get the URL of the underlying object, usually to provide a link to it:
brain.getURL()
which is equivalent to obj.absolute_url().
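A typical use of these methods is building a listing of links from brains alone. The sketch below assumes that Title is available as a metadata column (as it is in the examples earlier in this manual); listing_items is a hypothetical helper name.

```python
def listing_items(brains):
    """Collect title, URL and path for each brain without calling getObject()."""
    items = []
    for brain in brains:
        items.append({
            "title": brain.Title,    # metadata column, read from the catalog
            "url": brain.getURL(),   # suitable for an href
            "path": brain.getPath(), # physical path as a string
        })
    return items
```

Since getObject() is never called, no content object is woken up while the listing is built.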
3. Configuring Catalogs with GenericSetup
Adding, removing and changing indexes and metadata.
The Catalog Tool can be configured through the ZMI or programmatically in Python, but current best practice in the CMF world is to use GenericSetup to configure it using the declarative catalog.xml file. The GenericSetup profile for Plone, for example, uses the CMFPlone/profiles/default/catalog.xml XML data file to configure the Catalog Tool when a Plone site is created. It is fairly readable, so taking a quick look through it can be very informative.
When using a GenericSetup extension profile to customize the Catalog Tool in your portal, you only need to include XML for the pieces of the catalog you are changing. To add an index for the Archetypes location field, for example, a policy package could include the following profiles/default/catalog.xml:
<?xml version="1.0"?>
<object name="portal_catalog" meta_type="Plone Catalog Tool">
  <index name="location" meta_type="FieldIndex">
    <indexed_attr value="location"/>
  </index>
</object>
The GenericSetup import handler for the Catalog Tool also supports removing indexes from the catalog if present using the "remove" attribute of the <index> element. To remove the "start" and "end" indexes used for events, for example, a policy package could include the following profiles/default/catalog.xml:
<?xml version="1.0"?>
<object name="portal_catalog" meta_type="Plone Catalog Tool">
  <index name="start" remove="True" />
  <index name="end" remove="True" />
</object>
Care must be taken when setting up indexes with GenericSetup: if the import step for a catalog.xml is run a second time (for example, when you reinstall the product), the indexes specified will be destroyed, losing all currently indexed entries, and then re-created fresh (and empty!). To work around this behaviour, you can either update the catalog afterwards or add the indexes yourself in Python code using a custom import handler.
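The second option, adding indexes in Python, might look roughly like the following. This is a sketch under stated assumptions rather than a ready-made import handler: add_missing_indexes is a hypothetical helper, the wiring into a GenericSetup import step is not shown, and it only covers simple index types such as FieldIndex or KeywordIndex that need no extra configuration. It relies on the ZCatalog methods indexes(), addIndex() and manage_reindexIndex().

```python
def add_missing_indexes(catalog, wanted):
    """Add only the indexes that do not exist yet, then populate them.

    `wanted` is a sequence of (index name, index meta_type) pairs.
    Existing indexes are left untouched, so re-running this function
    does not throw away already indexed entries.
    """
    existing = set(catalog.indexes())
    added = []
    for name, meta_type in wanted:
        if name not in existing:
            catalog.addIndex(name, meta_type)
            added.append(name)
    if added:
        # Only the freshly created (and therefore empty) indexes
        # need to be filled.
        catalog.manage_reindexIndex(ids=added)
    return added
```

Calling it as add_missing_indexes(catalog, [('location', 'FieldIndex')]) would create and fill the index on the first run and do nothing on later runs.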
4. Custom indexing strategies
How to add special logic to indexing.
Sometimes you want to index "virtual" attributes of an object computed from existing ones, or just want to customize the way certain attributes are indexed, for example, saving only the 10 first characters of a field instead of its whole content.
To do so in an elegant and flexible way, Plone 3.3 includes a new package, plone.indexer, which provides a series of primitives to delegate indexing operations to adapters.
Let's say you have a content-type providing the interface IMyType. To define an indexer for your type which takes the first 10 characters from the body text, just type (assuming the attribute's name is 'text'):
from plone.indexer.decorator import indexer

@indexer(IMyType)
def mytype_description(object, **kw):
    return object.text[:10]
Finally, register this factory function as a named adapter using ZCML. Assuming you've put the code above into a file named indexers.py:
<adapter name="description" factory=".indexers.mytype_description" />
And that's all! Easy, wasn't it?
Note you can omit the for attribute because you passed this to the @indexer decorator, and you can omit the provides attribute because the thing returned by the decorator is actually a class providing the required IIndexer interface.
To learn more about the plone.indexer package, read its doctest.
For more info about how to create content-types, refer to the Archetypes Developer Manual.
Important note: If you want to adapt an out-of-the-box Archetypes content-type like Event or News Item, keep in mind that you will have to pass the indexer decorator the Zope 3 interfaces defined in the Products.ATContentTypes.interface.* files, not the deprecated Zope 2 ones in the Products.ATContentTypes.interfaces file.
5. Searching with AdvancedQuery
A brief primer on using AdvancedQuery to simplify searches that are otherwise hard with plain ZCatalog.
AdvancedQuery is an excellent product that overcomes several of the more cumbersome limitations otherwise present with plain ZCatalog queries. The comprehensive documentation is available here.
If you want to install it, require it in your add-on product's setup.py:
install_requires=[
    'setuptools',
    'Products.AdvancedQuery',
],
AdvancedQuery is straightforward to use. In the simplest scenario, it can simply duplicate the action of running a normal ZCatalog query:
from Products.CMFCore.utils import getToolByName
cat = getToolByName(context, 'portal_catalog')
aq = cat.makeAdvancedQuery({'portal_type' : 'Event', 'review_state' : 'pending'})
brains = cat.evalAdvancedQuery(aq)
At this stage, this looks like a slightly more complicated way of doing things that you already know how to do. However, AdvancedQuery comes into its own by making possible things that are otherwise very hard to do with plain ZCatalog queries. For example, suppose we want to get all published Documents sorted first by Creator and then sub-sorted by date of publication:
from Products.CMFCore.utils import getToolByName
cat = getToolByName(context, 'portal_catalog')
aq = cat.makeAdvancedQuery({'portal_type' : 'Document', 'review_state' : 'published'})
brains = cat.evalAdvancedQuery(aq, (('Creator', 'asc'), ('effective', 'asc')))
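To make the requested ordering concrete, the following plain-Python sketch reproduces what a (('Creator', 'asc'), ('effective', 'asc')) sort specification asks for. It does not use AdvancedQuery at all: the records are made-up stand-ins for brains, and the dates are simple ISO strings chosen so that they sort chronologically.

```python
from operator import attrgetter
from types import SimpleNamespace

# Hypothetical brain-like records with Creator and effective columns.
brains = [
    SimpleNamespace(Creator="bob", effective="2009-03-01", Title="B1"),
    SimpleNamespace(Creator="alice", effective="2009-05-01", Title="A2"),
    SimpleNamespace(Creator="alice", effective="2009-01-15", Title="A1"),
]

# Primary key: Creator; ties are broken by the effective date.
ordered = sorted(brains, key=attrgetter("Creator", "effective"))
titles = [brain.Title for brain in ordered]
```

Here alice's two documents come first, ordered by their effective dates, followed by bob's document.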
Or how about restricting the query above to only those documents that have related items set?
from Products.AdvancedQuery import Ge
from Products.CMFCore.utils import getToolByName
cat = getToolByName(context, 'portal_catalog')
aq = cat.makeAdvancedQuery({'portal_type' : 'Document', 'review_state' : 'published'})
aq &= Ge('getRawRelatedItems', None)
brains = cat.evalAdvancedQuery(aq, (('Creator', 'asc'), ('effective', 'asc')))
As you can see, AdvancedQuery makes specifying exactly what you want from the catalog very easy, and the transition to using it is very straightforward, as it already accepts the same sort of query parameters and format that you are already familiar with. When you are ready, you can mix in more advanced criteria without disturbing your existing way of working with the Catalog Tool.
It's strongly recommended to read the AdvancedQuery documentation linked to above and to play with some of the more advanced options it details.
