
#103: Adding RDF to Plone

Contents
  1. Definitions
  2. Motivation
  3. Proposal
  4. Implementation
  5. Deliverables
  6. Risks
  7. Progress log
by Michel Pelletier last modified July 17, 2006 - 16:54
Include an rdflib-based tool in Plone to bring RDF to Plone content.
Proposed by: Michel Pelletier
Proposal type: Architecture
State: deferred

Definitions

RDF: The Resource Description Framework http://www.w3.org/RDF/

rdflib: Python rdf library http://rdflib.net/

Zemantic: Zope 3 rdflib integration http://zemantic.org/

Dublin Core: Content Metadata Vocabulary http://dublincore.org/schemas/rdfs/

FOAF: Friend of a Friend Vocabulary http://www.foaf-project.org/

XPackage: Package Description Vocabulary http://www.globalmentor.com/reference/specifications/xpackage/specification/

DOAP: Project Description Vocabulary http://usefulinc.com/doap

Motivation

RDF is a way of encoding relationship statements. Each statement is composed of three parts, the subject, the predicate, and the object, and is called a "triple". In Plone terms, the subject of a statement could be the URI for a piece of content, the predicate could be the URI for a well-known property like Dublin Core Title (the URI http://purl.org/dc/elements/1.1/title), and the object could be a literal string that contains the title of the document.
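As a minimal sketch of such a statement, here is one triple written as a plain Python tuple. This is not rdflib's API; the content URL and the title string are made up for illustration, and only the Dublin Core URI comes from the text above.

```python
from collections import namedtuple

# A statement is a (subject, predicate, object) triple.
Triple = namedtuple("Triple", ["subject", "predicate", "object"])

# Dublin Core "title" property URI, as given in the text.
DC_TITLE = "http://purl.org/dc/elements/1.1/title"

# Hypothetical URL of a piece of Plone content (illustration only).
doc_uri = "http://example.org/plone/front-page"

# "The document at doc_uri has the title 'Welcome to Plone'."
statement = Triple(subject=doc_uri, predicate=DC_TITLE, object="Welcome to Plone")

print(statement.subject, statement.predicate, statement.object)
```

The subject and predicate are URIs, while the object here is a literal string; an object can also be another URI, which is how statements link content together.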

Using RDF, many kinds of statements can be made about content, people, or anything else using well-known vocabularies: Dublin Core to describe content, Friend Of A Friend (FOAF) to describe members, XPackage or DOAP to describe packages and products, or custom vocabularies to describe custom things.

This RDF data is kept in a database and encapsulated in an object called a "graph". By indexing triples, statement by statement, an RDF graph can be searched quickly for matching statements and patterns of statements. SQL-like query languages like SPARQL can be used on a store to construct complex queries over the data.
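The idea of indexing triples so that patterns can be matched quickly can be sketched in a few lines of plain Python. This is a toy illustration, not how rdflib or its backends are implemented; the class name TinyGraph and the example URLs are assumptions made for the sketch.

```python
from collections import defaultdict


class TinyGraph:
    """Toy in-memory triple store with one index per triple position."""

    def __init__(self):
        self._triples = set()
        self._by_subject = defaultdict(set)
        self._by_predicate = defaultdict(set)
        self._by_object = defaultdict(set)

    def add(self, triple):
        """Index the statement under its subject, predicate, and object."""
        s, p, o = triple
        self._triples.add(triple)
        self._by_subject[s].add(triple)
        self._by_predicate[p].add(triple)
        self._by_object[o].add(triple)

    def triples(self, pattern):
        """Return the set of triples matching pattern; None is a wildcard."""
        s, p, o = pattern
        candidate_sets = []
        if s is not None:
            candidate_sets.append(self._by_subject.get(s, set()))
        if p is not None:
            candidate_sets.append(self._by_predicate.get(p, set()))
        if o is not None:
            candidate_sets.append(self._by_object.get(o, set()))
        if not candidate_sets:
            return set(self._triples)
        # Intersect the index hits, starting from the smallest set.
        candidate_sets.sort(key=len)
        result = set(candidate_sets[0])
        for other in candidate_sets[1:]:
            result &= other
        return result


DC = "http://purl.org/dc/elements/1.1/"
g = TinyGraph()
g.add(("http://example.org/doc1", DC + "title", "First"))
g.add(("http://example.org/doc1", DC + "creator", "michel"))
g.add(("http://example.org/doc2", DC + "creator", "michel"))

# Pattern query: everything created by "michel".
by_michel = g.triples((None, DC + "creator", "michel"))
```

A query language such as SPARQL combines several patterns of this kind and joins them on shared variables; the sketch covers only the single-pattern lookup.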

The driving use case discussed at EuroPython 05 is that the portal_catalog for Plone contains lots of indexes, some of them driving site functionality and some of them "pure" meta-data. I proposed to Martin and Alexander extracting (most of) the meta-data managed by the catalog out into one or more rdflib Graphs, and they suggested this PLIP.

Proposal

This PLIP proposes adding Zemantic/rdflib integration into Plone. This would bring a number of features to Plone:

  • Represent Plone content in RDF
  • Import/Export RDF data through Plone
  • Store RDF data in a variety of backends (ZODB, SQL, Sleepycat)
  • Query RDF data with the SPARQL query language
  • Work with existing AT CatalogMultiplex framework
  • Work with/leverage existing AT references

Explaining RDF is beyond the scope of this PLIP; for more information see the RDF Primer http://www.w3.org/TR/rdf-primer/. I'll try to give a brief introduction for those developers not familiar with it.

Implementation

This proposal integrates rdflib in three ways: RDF Graph content objects, an rdf_tool default graph, and Archetypes reference support.

RDF Graphs

A new content type called 'RDF Graph' would be added to Plone 2.2. It is a Zope content object wrapper around an rdflib.Graph instance. This instance could have a pluggable backend that allows the graph to be stored outside of the ZODB, but the ZODB backend will be the default.

RDF data can get into the graph in many ways. It can be loaded from a URL or file, or it can be added/removed as content is added and removed from Plone (using the existing ZCatalog CatalogAwareness "feature").

The most common use case will be that content type programmers create a page template that renders their content, via adaptation, into RDF/XML. When they register their type with Plone (in Install.py), they also create (if necessary) and register with the archetype_tool whatever RDF Graphs their application needs.

When an instance of that content type is added to Plone, the CatalogMultiplexer will dispatch the object to any registered catalogs or graphs for that type. If a graph is registered for that type, the object is adapted to RDF and then added or removed from the graph, depending on the dispatched event.
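The dispatch flow described above can be sketched as follows. This is a plain-Python illustration, not the Archetypes CatalogMultiplex API: the class GraphMultiplexer, the event names "added" and "removed", the adapter function, and the dictionary standing in for a content object are all hypothetical, and graphs are modelled as plain sets of triples.

```python
DC = "http://purl.org/dc/elements/1.1/"


def document_to_triples(doc):
    """Hypothetical adapter: turn a content object into RDF-style triples."""
    return [
        (doc["url"], DC + "title", doc["title"]),
        (doc["url"], DC + "creator", doc["creator"]),
    ]


class GraphMultiplexer:
    """Hypothetical stand-in for the multiplexer described in the text."""

    def __init__(self):
        self._graphs_by_type = {}  # portal type name -> list of graphs
        self._adapters = {}        # portal type name -> adapter function

    def register(self, portal_type, graph, adapter):
        """Register a graph (a set of triples) and an adapter for a type."""
        self._graphs_by_type.setdefault(portal_type, []).append(graph)
        self._adapters[portal_type] = adapter

    def dispatch(self, event, portal_type, obj):
        """Adapt obj to triples and add or remove them in each graph."""
        graphs = self._graphs_by_type.get(portal_type, [])
        if not graphs:
            return  # no graph registered for this type: nothing to do
        triples = set(self._adapters[portal_type](obj))
        for graph in graphs:
            if event == "added":
                graph.update(triples)
            elif event == "removed":
                graph.difference_update(triples)
            else:
                raise ValueError("unknown event: %r" % (event,))


site_graph = set()
mux = GraphMultiplexer()
mux.register("Document", site_graph, document_to_triples)

doc = {"url": "http://example.org/plone/doc1", "title": "Hello", "creator": "michel"}
mux.dispatch("added", "Document", doc)
count_after_add = len(site_graph)
mux.dispatch("removed", "Document", doc)
count_after_remove = len(site_graph)
```

Because the graphs are looked up per type at dispatch time, removing a registration is enough to stop a type from feeding a graph, without touching the content type itself.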

RDF Tool

The above use case describes how third-party developers will use rdflib Graphs in Plone, but Plone itself will also need a default graph just like it has a default portal_catalog. I propose a new tool (perhaps 'rdf_tool') that manages the default meta-data for Plone in an rdflib Graph.

Obviously Plone 2.2 can't just start using rdf_tool instead of portal_catalog. This is a big change and the users should be given options, especially for existing deployments. They should have the default option of keeping Plone 2.1's current behavior and not migrating any data to an rdf_tool.

They should have the option to copy their existing catalog to an rdf_tool to "try out" the RDF features on their existing data. When stock ATCT types are added, the multiplexer adds the content to both the portal_catalog and rdf_tool. If the new RDF features aren't wanted, the archetype_tool can be easily reconfigured not to dispatch to the rdf_tool to "turn off" the feature and revert exactly to Plone 2.1's behavior.

Although it duplicates data and consumes more storage, this option retains the existing Plone 2.1 behavior while adding RDF behavior, and it is backwards compatible with applications that expect the portal_catalog to contain certain indexes.

A final option would be to copy existing portal_catalog data to an rdf_tool and then delete the catalog indexes for that data so that the data is stored solely in the rdf_tool. This option carries with it the most risk and is not backwards compatible but has no duplication of data.

Another possible option would be to invoke magic and simulate the indexes in portal_catalog with data from the rdf_tool, preserving backwards compatibility while only storing the data in one place.

Archetype References

This PLIP proposes no changes to the use of AT refs or their API. As above, the user should have options that allow them to use and leverage their existing AT ref code with an RDF tool.

I propose to create an rdflib backend that adapts the reference_catalog object to an rdflib Graph object. This does not use the existing rdflib.backends.ZODB backend but instead uses existing AT refs in the reference_catalog as the actual backing store of the graph.

Thus, existing AT ref based applications can have their references represented in RDF data immediately and take advantage of the SPARQL query language to do advanced queries on their AT ref structures without changing any of the existing AT ref code.
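A rough sketch of this adapter idea: a read-only, graph-like view whose backing store is the existing reference data rather than a separate triple store. This is plain Python, not rdflib's backend interface or the real reference_catalog API; the class name, the (source UID, relationship, target UID) record shape, and the URI scheme are assumptions made for illustration.

```python
class ReferenceGraphView:
    """Read-only, triple-style view over reference records.

    Each record is assumed to be a (source_uid, relationship, target_uid)
    tuple. The real reference_catalog stores richer objects.
    """

    def __init__(self, references, base_uri):
        self._references = references  # kept by reference, never copied
        self._base_uri = base_uri

    def _to_triple(self, record):
        """Map one reference record to a triple of (hypothetical) URIs."""
        source_uid, relationship, target_uid = record
        return (
            self._base_uri + "uid/" + source_uid,
            self._base_uri + "rel/" + relationship,
            self._base_uri + "uid/" + target_uid,
        )

    def triples(self, pattern=(None, None, None)):
        """Yield triples matching pattern; None is a wildcard."""
        for record in self._references:
            triple = self._to_triple(record)
            if all(want is None or want == got
                   for want, got in zip(pattern, triple)):
                yield triple


BASE = "http://example.org/plone/"
refs = [("a1", "relatesTo", "b2"), ("a1", "isChildOf", "c3")]
view = ReferenceGraphView(refs, BASE)

related = list(view.triples((None, BASE + "rel/relatesTo", None)))

# A reference added through the existing code path is visible immediately,
# because the view reads the same underlying list.
refs.append(("b2", "relatesTo", "c3"))
related_after = list(view.triples((None, BASE + "rel/relatesTo", None)))
```

The point of the design is that nothing is migrated or duplicated: the reference data stays where it is, and the view only translates it into triples on demand.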

Deliverables

rdflib integration into Plone
RDF Graph content object
rdf_tool default graph tool
AT catalog multiplexing support
AT reference wrapping

Risks

rdflib's ZODB storage could turn out to be far less efficient than ZCatalog indexes, but I doubt it: both are based on the underlying ZODB BTrees, and their structures are of essentially the same complexity. The ability to use other backends to store the data could largely mitigate this for huge graphs.

Progress log

mp - first draft Jul 5, 05

some notes

Posted by Matt Halstead at July 6, 2005 - 01:08

There are some technical hills with RDF persistence for rdflib since it uses new style classes. But that won't be an issue for much longer.

The bigger question I have is that all data for which there is an RDF transform or representation forms part of a bigger overall graph of all data. I presume you are saying the graph is this bigger graph?

My experience in using RDF in Python (and in Zope) is that RDF graphs become massive very quickly, and do not work well as single in-memory objects or persisted as a flat RDF file in the filesystem. It seems that using the relational database backends for various RDF libraries is more successful.

RDF in Plone is a very good idea. I'd like to see more added to the use-cases before an implementation is started.

more on 103 (why doesn't it quote the subject? ;)

Posted by Michel Pelletier at July 7, 2005 - 19:11

> There are some technical hills with RDF persistence for rdflib since it uses new style classes. But that won't be an issue for much longer.

Yeah, Plone 2.2 will run on Zope 2.8, which uses new style classes.

> The bigger question I have is that all data for which there is an RDF transform or representation forms part of a bigger overall graph of all data. I presume you are saying the graph is this bigger graph?

Yes, the rdf_tool is a stored, indexed graph of all the site content's RDF descriptions. Similar to how the ZCatalog holds this data now, it will instead be held in a persistent graph in the rdf_tool.

> My experience in using RDF in Python (and in Zope) is that RDF graphs become massive very quickly, and do not work well as single in-memory objects or persisted as a flat RDF file in the filesystem. It seems that using the relational database backends for various RDF libraries is more successful.

I agree that an in-memory or file-serialized representation of even a modest Plone site would be unmanageable, but rdflib comes with a ZODB backend which stores the RDF using a three-dimensional IOBTree, which manages its memory in smaller chunks and can easily manage more data than can fit in memory.

rdflib also has backends for Sleepycat and SQLite. Generic RDBMS backends can be written easily or adapted to whatever Zope 3's relational framework is.

> RDF in Plone is a very good idea. I'd like to see more added to the use-cases before an implementation is started.

Yeah, I'll work on more use cases, that seems to be the consensus of most comments. Thanks greeman!

Put machinery in CMF first?

Posted by Paul Everitt at July 6, 2005 - 06:01

First, I love the idea (as you know). [wink] However, I wonder if 2.2 is too ambitious a target. This feels like something that should bake a bit before becoming a long-term promise on how it will be used.

IMO, we should first target the CMF and get the portal_tool and the graph machinery as default at that layer. Once there is a CMF shipping that provides the long-term contracts, we do the UI part in Plone.

add to CMF

Posted by Michel Pelletier at July 7, 2005 - 19:23

Good point, that's probably a wise decision. What Tres doesn't have is a PLIP-style framework for me to propose those changes, but I guess I could just mail him. I'll send him a note and see how he might want to move forward on this.

Define smaller milestones

Posted by Munwar Shariff at July 6, 2005 - 15:45

I agree that the deadline (to put these features in Plone 2.2) is aggressive. But with the help of Plone developers, I think Michel can achieve it. It is better to define the smaller milestones so that people can enjoy the benefits of Zemantic earlier.

  1. First implement with Archetypes reference_catalog (make sure the existing AT Ref based applications do not break)
  2. Implement in portal_catalog (RDF Tool and Graph)
  3. Test for the entire Plone site
  4. Move the implementation to Z3/Z3ECM

Just curious, has any other CMS system implemented this kind of feature?

Other CMS

Posted by Paul Everitt at July 7, 2005 - 14:04

On Munwar's last point, I don't think the major open source CMS projects have yet pursued the semantic web. At OSCOM 3 (in Harvard), I made a prediction that somebody would create an open source CMS that made the power of the semantic web useful for secretaries. Whoever does that first will crush the rest of us.

I just hope that we're the first. [wink]

An important Plip!

Posted by Nick Bower at September 19, 2006 - 03:33
This is a great Plip. We experimented about 6-9 months ago with multiple RDF back-ends with the goal of back-ending Zope with an RDF engine. Using a central CMF catalog to manage distributed heterogeneous metadata was becoming very limited. Some random thoughts below.

The choice for RDF storage is critical and most options seemed totally immature at the time. Apart from speed/optimisation of the storage (the RDBMS adapter models seemed far faster), the true power of RDF in our case would be realized with support for features such as inferencing rules and reification. For example, querying large triple collections was disappointingly slow for the ZODB and Sleepycat backends to rdflib. Sesame2 lacked required documentation at the time and I recall running into alpha-level problems (although maybe things have changed). Oracle Semantic (10.2), while well documented and highly promising, only had weak inferencing ruleset support and unfortunately had to be eliminated due to the non-updating indexes (i.e. the db indexes needed to be refreshed periodically).

We ended up concluding that none of the storage backends we tried were up to the task. Given that this situation would eventually change, we also thought that a sound choice might be to use rdflib as a pluggable common interface to an eventual backend selection. This would help mitigate the problem that there was no clear leading project that provided all the features to justify implementing an RDF storage mechanism. Note that we did not try Redland/librdf (not to be confused with rdflib!) and this could be worth looking into.

I'd encourage separation of an rdf_tool from the catalog multiplex method of triggering updates so that this isn't unnecessarily linked to Plone/AT. Catalog multiplexing is fine, but isn't an option (yet) for CMF sites. It's nice to have out of the box functionality for Plone with the option of providing other ways to access the RDF backend from CMF sites.

I'll post more if I think of it. It would be highly valuable here to gain input from people who have deployed RDF backends with realistic triples collections and drive queries using installed inferencing rules.

What nick said

Posted by Tim Hoffman at September 19, 2006 - 04:20
Hi

I work with Nick, and I second everything he said. In addition, whatever backends are used really need to support context (quads); otherwise you can't find the owner of any set of triples. In many of our use cases for a triple store we wanted to annotate preexisting entities with additional relationships owned by other entities, so context was crucial.
