#44: Bulk loading of external content into Archetypes-based objects
- Proposed by: dreamcatcher
- Proposal type
- State: rejected
Motivation
People coming to Plone from another CMS, or from no CMS at all, often want to bulk import existing content. Some sites also produce a high volume of content that needs to be published continuously.
Customers may have existing content that appears on Intranet drives, public websites, and other web applications. Some of this content will be migrated into the Plone CMS. Other content will continue to be managed by the existing application. In both cases, though, the CMS (and sites that subscribe to content in the repository) need some awareness of this content.
For content that is migrated, there needs to be a one-time facility for bulk-loading the content into the CMS. Customers can be responsible for getting the content into a format that can be loaded. The CMS, though, needs to specify this format and support bulk loading using the format.
For content that remains external, the goal is to have a small representation, like a ghost or a fingerprint, that exists in the CMS. This representation is treated like content: it can have security, versioning, workflow, and be shared between multiple sites. A site manager can have this small representation appear in navigation on sites and even have some parts of the representation get indexed to show up in searches.
However, for non-migrated content, clicking the hyperlink takes the site visitor to the external system. This is essentially what the Link content type in Plone does.
Thus, any external application can "integrate" with the CMS by maintaining import files that provide a link representation with rich metadata adhering to the project standards.
Proposal
This proposal aims to come up with a standard for bulk-loading content and 'rich metadata' where applicable into a Plone site, using Archetypes-based content types and custom marshallers.
Implementation
Client side
Note: Although we make suggestions about how the client should work, this PLIP doesn't deal with creating the client side of the solution.
Content to be integrated will first be staged into a directory. The content hierarchy will be represented as directories. Each content item will be represented by the content itself plus a companion XML file for metadata. Directories will also have an XML file for their metadata.
A script will be run that will then load this directory hierarchy into a folder in the CMS. Error messages should be displayed as error output by the script.
The client will use a standard 'PUT' command to send the content to the server. This will be done by:
- Sending the metadata file so that the content is created on the server if it doesn't exist yet.
- Sending the actual content on a second 'PUT' request.
The process should follow this order so that the right content type to be created can be controlled by the metadata file. Also, some content types may have required fields that need to be present on creation, and this allows those fields to be specified on the metadata file.
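The ordering above can be sketched on the client side as follows. This is only an illustration, not part of the proposal: the `.meta.xml` suffix for companion files, the use of a bare `.meta.xml` file for directory metadata, and the function names are all assumptions, since the PLIP does not define a naming convention.

```python
import os
from urllib.request import Request

# Hypothetical naming convention: "page.html" is described by
# "page.html.meta.xml", and a directory by a file named ".meta.xml".
META_SUFFIX = ".meta.xml"


def plan_puts(staging_dir, base_url):
    """Return (url, local_path) pairs in the order they should be PUT.

    For every item, the metadata file comes before the content itself,
    and folders are visited before the items they contain.
    """
    base_url = base_url.rstrip("/")
    plan = []
    for root, dirs, files in os.walk(staging_dir):
        dirs.sort()  # deterministic, top-down traversal
        rel = os.path.relpath(root, staging_dir)
        if rel == ".":
            folder_url = base_url
        else:
            folder_url = base_url + "/" + rel.replace(os.sep, "/")
        if META_SUFFIX in files:
            # Directory metadata goes to the folder URL itself.
            plan.append((folder_url, os.path.join(root, META_SUFFIX)))
        for name in sorted(files):
            if name.endswith(META_SUFFIX):
                continue  # companion files are sent with their item
            item_url = folder_url + "/" + name
            if name + META_SUFFIX in files:
                plan.append((item_url, os.path.join(root, name + META_SUFFIX)))
            plan.append((item_url, os.path.join(root, name)))
    return plan


def build_request(url, local_path):
    """Build (but do not send) a PUT request carrying the file's bytes."""
    with open(local_path, "rb") as fh:
        return Request(url, data=fh.read(), method="PUT")
```

A real client would pass each request to `urllib.request.urlopen`, add authentication, and write any failure to standard error, as the proposal asks.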
Server side
Content, after import, should behave as if it had been created using the web interface.
Plone, through CMF, provides a tool called 'content_type_registry' which acts as a place to hook policies for deciding which content type should be created, when it doesn't exist yet, by sniffing the body or headers of the incoming 'PUT' request.
Our solution will consist of two parts:
The first part is a 'content_type_registry' policy that, given a request following our metadata spec, decides which content type to create. It uses the content type named in the metadata file if one is given; otherwise it chooses, amongst the existing content types, the one that best fits the profile.
The second part is a to-be-created marshaller for Archetypes that, given a request, chooses which of a set of 'marshalling handlers' to use for processing. Those marshalling handlers will be registered with some kind of registry, either global (module-level) or local (site-level tool). As a first step, we will provide only the global-level registry. We will also provide two marshalling handlers. One is for our custom metadata spec: it will parse the metadata file and change the content fields accordingly. The other will be very similar to the 'PrimaryFieldMarshaller' currently in Archetypes and will be the 'fallback' when we receive data that doesn't meet the spec. That is, if we receive a file that does not meet the metadata spec, we will assume that the file is the value for the 'body' or 'primary field' of the content at hand.
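The handler selection described above might look roughly like the sketch below. It is plain Python and does not use the Archetypes marshaller API: the class and function names, the `<metadata>`/`<field name="...">` XML layout, and the dict standing in for a content object are all hypothetical, chosen only to show a module-level registry with a primary-field fallback.

```python
import xml.etree.ElementTree as ET

# Module-level (global) registry; handlers are tried in registration order.
_HANDLERS = []


def register_handler(handler):
    _HANDLERS.append(handler)


class MetadataHandler:
    """Handles bodies that follow a (hypothetical) XML metadata layout."""

    def accepts(self, body, content_type):
        if "xml" not in content_type:
            return False
        try:
            return ET.fromstring(body).tag == "metadata"
        except ET.ParseError:
            return False

    def demarshall(self, obj, body):
        # Copy every <field name="...">value</field> onto the object.
        for field in ET.fromstring(body).findall("field"):
            obj[field.get("name")] = field.text or ""


class PrimaryFieldHandler:
    """Fallback: treat the whole body as the primary field's value."""

    def accepts(self, body, content_type):
        return True

    def demarshall(self, obj, body):
        obj["body"] = body


register_handler(MetadataHandler())


def demarshall(obj, body, content_type):
    """Pick the first handler that accepts the request, else fall back."""
    for handler in _HANDLERS:
        if handler.accepts(body, content_type):
            break
    else:
        handler = PrimaryFieldHandler()
    handler.demarshall(obj, body)
    return handler
```

Keeping the fallback outside the registry means that registering further handlers later can never shadow it by accident.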
Considerations about the handling of metadata
Some of the metadata may have special meaning. For example, references between content objects need to be handled correctly, and a desirable goal is to be able to do roundtrip exporting and importing of content from the CMS.
Regarding references, some considerations need to be made:
- There may be references to content that doesn't yet exist. This should be very uncommon, so we won't get into it yet.
- There may be references to content that is about to be created in the same batch of content being uploaded. In this case, the client is responsible for ordering the dependencies between content being imported so that content that is referenced is imported before the content that references it. There may be problems with circular references, but we don't expect those to be common either.
- References may have metadata too. This should be encoded on the metadata file along with the reference.
- References may be done by either Path or UID.
- We should try as much as possible to keep the same UID when importing, if one exists (as a result of exporting existing content). There may be UID clashes, but those should also be very uncommon given the nature of how UIDs are created. If a UID clashes, we should assume that the content probably already existed, maybe at a different location. We should try to compare the content being imported with the existing content and raise an appropriate error message.
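The client-side ordering of references within a batch amounts to a topological sort. A minimal sketch, assuming the client has already collected which items reference which (the function name and the dict layout are illustrative, not part of the proposal), using the standard library's `graphlib` (Python 3.9 or later):

```python
from graphlib import CycleError, TopologicalSorter


def import_order(references):
    """Order items so that referenced content is imported first.

    `references` maps each item (a path or UID) to the items it refers to.
    Raises ValueError if the batch contains circular references.
    """
    sorter = TopologicalSorter(references)
    try:
        return list(sorter.static_order())
    except CycleError as exc:
        # exc.args[1] holds the nodes that form the detected cycle.
        cycle = " -> ".join(str(node) for node in exc.args[1])
        raise ValueError("circular references in batch: " + cycle) from exc
```

For a batch with circular references, one possible client strategy (not specified here) would be to create the items first and send the references in a later pass.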
Dates also need some special handling:
- It may be desirable to have creation and modification dates match the creation and modification dates of content on the filesystem. The server won't do anything fancy to handle this though. The responsibility for sending the right dates on the metadata file is on the client.
- When exporting data from the server, creation and modification dates will be encoded as metadata. The client must make sure not to override those dates with the content creation/modification dates on the filesystem, as they will differ. We are not sure that setting creation/modification dates on the filesystem works transparently across different filesystems.
ftp & mime-types
With two Plone 2.1.1 websites, I was trying to use FTP to transfer pages from one Plone site to another, via an intermediate box.
After downloading the files (specifically ATDocuments), I get a file with headers (field: content), including Content-Type: text/html
After uploading one of these files (again via FTP) to the destination site, these are transformed into DTML documents.
It would be really nice if the ftp download/ftp upload process would keep the same content types.
XMLForest: Generic IMS Content Package Im/Export for all Archetypes based content in Plone
Folks, look at http://plone.org/products/xmlforest
Content Migration
In the case of wanting to bulk import a site that is mostly HTML pages, wouldn't it be most valuable to focus on parsing out existing metadata where it is available and then creating your objects accordingly? Hopefully they have at least a title tag and possibly a description? If you're lucky, they may even have keywords. Tools like htmlTidy could be used to prep the documents to get them ready for parsing.
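As a rough illustration of that idea, the title, description and keywords of a page can be pulled out with Python's standard `html.parser`. The class and function names below are made up for this sketch, and it assumes the page has already been cleaned up enough to parse:

```python
from html.parser import HTMLParser


class MetadataExtractor(HTMLParser):
    """Collects <title> text and description/keywords <meta> tags."""

    def __init__(self):
        super().__init__()
        self.metadata = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            name = (attrs.get("name") or "").lower()
            if name in ("description", "keywords") and attrs.get("content"):
                self.metadata[name] = attrs["content"].strip()

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            # Title text may arrive in several chunks; accumulate it.
            self.metadata["title"] = self.metadata.get("title", "") + data


def extract_metadata(html):
    """Return a dict with whichever of title/description/keywords exist."""
    parser = MetadataExtractor()
    parser.feed(html)
    parser.close()
    meta = parser.metadata
    if "title" in meta:
        meta["title"] = meta["title"].strip()
    if "keywords" in meta:
        meta["keywords"] = [k.strip() for k in meta["keywords"].split(",") if k.strip()]
    return meta
```

The resulting dict could then be used to fill the fields of the objects being created.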