#103: Adding RDF to Plone
Include an rdflib-based tool in Plone to bring RDF to Plone content.
- Proposed by: Michel Pelletier
- Proposal type: Architecture
- State: deferred
Definitions
RDF: The Resource Description Framework http://www.w3.org/RDF/
rdflib: Python RDF library http://rdflib.net/
Zemantic: Zope 3 rdflib integration http://zemantic.org/
Dublin Core: Content Metadata Vocabulary http://dublincore.org/schemas/rdfs/
FOAF: Friend of a Friend Vocabulary http://www.foaf-project.org/
XPackage: Package Description Vocabulary http://www.globalmentor.com/reference/specifications/xpackage/specification/
DOAP: Project Description Vocabulary http://usefulinc.com/doap
Motivation
RDF is a way of encoding relationship statements. Each statement is composed of three parts and is called a "triple": the subject, the predicate, and the object of the statement. In Plone terms, the subject of a statement could be the URI for a piece of content, the predicate could be the URI for a well-known XML property like Dublin Core Title (the URI http://purl.org/dc/elements/1.1/title), and the object could be a literal string that contains the title of the document.
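As a minimal illustration of the triple described above, the statement "this document has this title" can be written as a plain Python tuple. This is only a sketch: it uses ordinary strings rather than rdflib term objects, and the content URL and title are hypothetical values chosen for the example.

```python
# A triple is (subject, predicate, object). Plain strings stand in for
# RDF terms here; this is not rdflib's API.
DC_TITLE = "http://purl.org/dc/elements/1.1/title"

# Hypothetical URL of a piece of Plone content, for illustration only.
subject = "http://example.org/plone/front-page"

# The object is a literal: the title of the document.
triple = (subject, DC_TITLE, "Welcome to Plone")

s, p, o = triple
print("subject:  ", s)
print("predicate:", p)
print("object:   ", o)
```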
Using RDF, many kinds of statements can be made about content, people, or anything else using well-known vocabularies: Dublin Core to describe content, Friend Of A Friend (FOAF) to describe members, XPackage or DOAP to describe packages and products, or custom vocabularies to describe custom things.
This RDF data is kept in a database and encapsulated in an object called a "graph". By indexing triples, statement by statement, an RDF graph can be searched quickly for matching statements and patterns of statements. SQL-like query languages like SPARQL can be used on a store to construct complex queries over the data.
The driving use case, discussed at EuroPython 05, is that Plone's portal_catalog contains lots of indexes, some of them driving site functionality and some of them "pure" meta-data. I proposed to Martin and Alexander extracting (most of) the meta-data managed by the catalog into one or more rdflib Graphs, and they suggested this PLIP.
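The idea of indexing triples so that patterns can be matched quickly can be sketched with a toy in-memory store. This is an assumption-laden illustration, not how rdflib or any of its backends is implemented: the class name `TinyGraph` and its methods are made up for this example, and `None` is used as a wildcard in patterns.

```python
from collections import defaultdict


class TinyGraph:
    """Toy triple store with one index per triple position (hypothetical)."""

    def __init__(self):
        self._all = set()
        self._by_s = defaultdict(set)  # subject   -> triples
        self._by_p = defaultdict(set)  # predicate -> triples
        self._by_o = defaultdict(set)  # object    -> triples

    def add(self, triple):
        s, p, o = triple
        self._all.add(triple)
        self._by_s[s].add(triple)
        self._by_p[p].add(triple)
        self._by_o[o].add(triple)

    def remove(self, triple):
        s, p, o = triple
        self._all.discard(triple)
        self._by_s[s].discard(triple)
        self._by_p[p].discard(triple)
        self._by_o[o].discard(triple)

    def triples(self, pattern):
        """Return the set of triples matching pattern; None matches anything."""
        s, p, o = pattern
        candidates = [
            index.get(term, set())
            for index, term in ((self._by_s, s), (self._by_p, p), (self._by_o, o))
            if term is not None
        ]
        if not candidates:
            return set(self._all)
        # Intersect the per-position matches instead of scanning every triple.
        return set.intersection(*candidates)


DC = "http://purl.org/dc/elements/1.1/"
g = TinyGraph()
g.add(("http://example.org/doc1", DC + "title", "First"))
g.add(("http://example.org/doc1", DC + "creator", "mp"))
g.add(("http://example.org/doc2", DC + "creator", "mp"))

# "Which subjects have creator 'mp'?"
by_mp = g.triples((None, DC + "creator", "mp"))
print(sorted(t[0] for t in by_mp))
```

A query language such as SPARQL composes many such pattern lookups and joins their results on shared variables.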
Proposal
This PLIP proposes adding Zemantic/rdflib integration into Plone. This would bring a number of features to Plone:
- Represent Plone content in RDF
- Import/Export RDF data through Plone
- Store RDF data in a variety of backends (ZODB, SQL, Sleepycat)
- Query RDF data with the SPARQL query language
- Work with existing AT CatalogMultiplex framework
- Work with/leverage existing AT references
Explaining RDF is beyond the scope of this PLIP; for more information see the RDF Primer http://www.w3.org/TR/rdf-primer/. I'll try to give a brief introduction for those developers not familiar with it.
Implementation
This proposal integrates rdflib in three ways: RDF Graph content objects, an rdf_tool default graph, and Archetypes reference support.
RDF Graphs
A new content type would be added to Plone 2.2, called 'RDF Graph', which is a Zope content object wrapper around an rdflib.Graph instance. This instance could have a pluggable backend that allows the graph to be stored outside of the ZODB, but the ZODB backend will be the default.
RDF data can get into the graph in many ways. It can be loaded from a URL or file, or it can be added/removed as content is added and removed from Plone (using the existing ZCatalog CatalogAwareness "feature").
The most common use case will be that content type programmers create a page template that renders their content, via adaptation, into RDF/XML. When they register their type with Plone (in Install.py), they also create (if necessary) and register with the archetype_tool whatever RDF Graphs their application needs.
When an instance of that content type is added to Plone, the CatalogMultiplexer will dispatch the object to any registered catalogs or graphs for that type. If a graph is registered for that type, the object is adapted to RDF and then added or removed from the graph, depending on the dispatched event.
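The dispatch flow described above could look roughly like the following sketch. Everything here is hypothetical and simplified: `Multiplexer`, `SetGraph`, and `document_to_triples` are names invented for this example and are not the Archetypes CatalogMultiplex API, content objects are plain dicts, and the adapter produces triples directly instead of rendering RDF/XML through a page template.

```python
DC = "http://purl.org/dc/elements/1.1/"


class SetGraph:
    """Stand-in for an RDF graph: a set of (subject, predicate, object)."""

    def __init__(self):
        self.triples = set()


def document_to_triples(obj):
    """Hypothetical adapter that renders a content object as triples."""
    return {
        (obj["url"], DC + "title", obj["title"]),
        (obj["url"], DC + "creator", obj["creator"]),
    }


class Multiplexer:
    """Hypothetical dispatcher mapping a portal type to registered graphs."""

    def __init__(self):
        self._graphs = {}  # portal_type -> list of (graph, adapter)

    def register(self, portal_type, graph, adapter):
        self._graphs.setdefault(portal_type, []).append((graph, adapter))

    def dispatch(self, event, obj):
        # Adapt the object to RDF, then add or remove depending on the event.
        for graph, adapter in self._graphs.get(obj["portal_type"], []):
            triples = adapter(obj)
            if event == "added":
                graph.triples |= triples
            elif event == "removed":
                graph.triples -= triples


mux = Multiplexer()
graph = SetGraph()
mux.register("Document", graph, document_to_triples)

doc = {
    "portal_type": "Document",
    "url": "http://example.org/plone/doc1",
    "title": "First",
    "creator": "mp",
}
mux.dispatch("added", doc)
print(len(graph.triples), "triples after add")
```

Types with no registered graph are simply skipped, so existing content types are unaffected until a graph is registered for them.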
RDF Tool
The above use case describes how third-party developers will use rdflib Graphs in Plone, but Plone itself will also need a default graph just like it has a default portal_catalog. I propose a new tool (perhaps 'rdf_tool') that manages the default meta-data for Plone in an rdflib Graph.
Obviously Plone 2.2 can't just start using rdf_tool instead of portal_catalog. This is a big change and the users should be given options, especially for existing deployments. They should have the default option of keeping Plone 2.1's current behavior and not migrating any data to an rdf_tool.
They should have the option to copy their existing catalog to an rdf_tool to "try out" RDF with their existing data. When stock ATCT types are added, the multiplexer adds the content to both the portal_catalog and the rdf_tool. If the new RDF features aren't wanted, the archetype_tool can easily be reconfigured not to dispatch to the rdf_tool, which "turns off" the feature and reverts exactly to Plone 2.1's behavior.
Although it duplicates data and consumes more storage, this option retains the existing Plone 2.1 behavior while adding RDF behavior, and it is backwards compatible with applications that expect the portal_catalog to contain certain indexes.
A final option would be to copy existing portal_catalog data to an rdf_tool and then delete the catalog indexes for that data so that the data is stored solely in the rdf_tool. This option carries with it the most risk and is not backwards compatible but has no duplication of data.
Another possible option would be to invoke magic and simulate the indexes in portal_catalog with data from the rdf_tool, preserving backwards compatibility while storing the data in only one place.
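One way to read the "simulate the indexes" option is a read-only facade that answers index-style lookups from the triples themselves. The sketch below is purely illustrative: `SimulatedIndex` and its `search` method are hypothetical names, not part of ZCatalog or rdflib, and a real implementation would need to match the catalog's actual index interface.

```python
DC = "http://purl.org/dc/elements/1.1/"


class SimulatedIndex:
    """Hypothetical index view backed by a set of triples, storing nothing itself."""

    def __init__(self, triples, predicate):
        self._triples = triples      # shared with the graph, not copied
        self._predicate = predicate  # the predicate this "index" represents

    def search(self, value):
        """Return the subjects whose value for this predicate equals value."""
        return sorted(
            s for (s, p, o) in self._triples
            if p == self._predicate and o == value
        )


# The graph's triples are the only copy of the data.
triples = {
    ("http://example.org/doc1", DC + "creator", "mp"),
    ("http://example.org/doc2", DC + "creator", "mp"),
    ("http://example.org/doc3", DC + "creator", "someone-else"),
}

creator_index = SimulatedIndex(triples, DC + "creator")
print(creator_index.search("mp"))
```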
Archetype References
This PLIP proposes no changes to the use of AT refs or their API. As above, users should have options that allow them to leverage their existing AT ref code with an RDF tool.
I propose to create an rdflib backend that adapts the reference_catalog object to an rdflib Graph object. This does not use the existing rdflib.backends.ZODB backend but instead uses existing AT refs in the reference_catalog as the actual backing store of the graph.
Thus, existing AT ref based applications can have their references represented in RDF data immediately and take advantage of the SPARQL query language to do advanced queries on their AT ref structures without changing any of the existing AT ref code.
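The reference-backed graph can be sketched as a read-only view that turns each stored reference into a triple on demand. This is an illustration under stated assumptions: the `Reference` record with `sourceUID`, `relationship`, and `targetUID` fields is a simplified stand-in for what the reference_catalog holds, `ReferenceGraphView` is a hypothetical name, and the URI prefixes are invented for the example.

```python
from collections import namedtuple

# Simplified stand-in for a stored AT reference (assumption for this sketch).
Reference = namedtuple("Reference", "sourceUID relationship targetUID")


class ReferenceGraphView:
    """Hypothetical graph view whose backing store is a list of references."""

    def __init__(self, references,
                 uid_base="urn:uid:",
                 rel_base="http://example.org/relationship/"):
        self._references = references  # the live store, not a copy
        self._uid_base = uid_base
        self._rel_base = rel_base

    def triples(self, pattern=(None, None, None)):
        """Yield matching (subject, predicate, object); None matches anything."""
        for ref in self._references:
            triple = (
                self._uid_base + ref.sourceUID,
                self._rel_base + ref.relationship,
                self._uid_base + ref.targetUID,
            )
            if all(want is None or want == got
                   for want, got in zip(pattern, triple)):
                yield triple


refs = [
    Reference("a1", "relatesTo", "b2"),
    Reference("a1", "relatesTo", "c3"),
    Reference("b2", "translationOf", "c3"),
]
view = ReferenceGraphView(refs)

# Everything that "a1" relates to.
related = list(view.triples(("urn:uid:a1", None, None)))
print(len(related), "outgoing references from a1")
```

Because the view reads the reference store directly, references created through the existing AT API show up in the graph without any extra write step.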
Deliverables
- rdflib integration into Plone
- RDF Graph content object
- rdf_tool default graph tool
- AT catalog multiplexing support
- AT reference wrapping
Risks
rdflib's ZODB storage could turn out to be much less efficient than ZCatalog indexes, but I doubt it: both are based on the underlying ZODB BTrees, and their structures are of essentially the same complexity. The ability to use other backends to store the data could largely mitigate this for huge graphs.
Progress log
mp - first draft Jul 5, 05
some notes
The bigger question I have is that all data for which there is an RDF transform or representation forms part of a bigger overall graph of all data. I presume you are saying 'the graph' is this bigger graph?
In my experience of using RDF in Python (and in Zope), RDF graphs become massive very quickly and do not work well as single in-memory objects or when persisted as a flat RDF file in the filesystem. Using the relational database backends of various RDF libraries seems to be more successful.
RDF in Plone is a very good idea. I'd like to see more added to the use cases before an implementation is started.