Converting text with Portal Transforms and the MIME Types Registry

In this tutorial, you will learn how Plone and Archetypes keep track of content MIME types, and how PortalTransforms enables you to convert between content of different MIME types. You will learn how to register a new MIME type, and how to create new transforms.

Introduction

Background on MIME types, MimetypesRegistry and PortalTransforms

MIME types are classifications of content, typically used for email attachments and other places where data is interchanged and must be described. For example, the 'text/html' type represents HTML content, whilst the 'text/structured' type represents StructuredText. The 'MimetypesRegistry', bundled with Archetypes, keeps track of the various MIME types available to Plone in the 'mimetypes_registry' tool.

'PortalTransforms', also bundled with Archetypes, provides the 'portal_transforms' tool, which is used to transform data between two MIME types. For example, if you enter StructuredText in a Page, it is transformed to HTML via the 'st' transform when the object is displayed.

In this tutorial, you will learn how to register new MIME types with the 'mimetypes_registry' tool, and how to create new transforms with 'PortalTransforms'. The example used throughout the tutorial is the 'intelligenttext' product, "found in the Collective":https://svn.plone.org/svn/collective/intelligenttext/trunk. This provides a new MIME type, 'text/x-web-intelligent', which describes a plain text type that can be transformed to HTML in such a way that paragraph breaks and indentation remain, and web and email addresses become clickable links. The transform that converts from 'text/x-web-intelligent' to 'text/html' is called 'web_intelligent_plain_text_to_html'.

Additionally, the product provides a 'html_to_web_intelligent_plain_text' transform that can convert HTML to plain text. Note that these transforms are not exact inverses of each other, so if you convert from 'text/x-web-intelligent' to 'text/html' and then back to 'text/x-web-intelligent', you may not end up with exactly what you started with. This is because HTML is much richer than plain text in formatting, and certain assumptions are made about how to sensibly represent various HTML tags and entities as plain text.
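
To make the lossy round trip concrete, here is a minimal sketch in plain Python 3. The two functions are hypothetical stand-ins written for this illustration, not the product's transforms; they only assume that a tab is rendered as a fixed number of non-breaking spaces on the way out and that non-breaking spaces become ordinary spaces on the way back.

```python
def toy_text_to_html(text, tab_width=4):
    """Hypothetical stand-in: tabs become &nbsp; runs, newlines become <br />."""
    text = text.replace('\t', '&nbsp;' * tab_width)
    return text.replace('\n', '<br />')


def toy_html_to_text(markup):
    """Hypothetical stand-in: the reverse mapping, as far as it can be known."""
    text = markup.replace('<br />', '\n')
    return text.replace('&nbsp;', ' ')


original = '\tindented\nline'
as_html = toy_text_to_html(original)
restored = toy_html_to_text(as_html)

# The tab cannot be recovered: the HTML only records four non-breaking spaces.
print(repr(original))
print(repr(restored))
```

The restored text starts with four spaces rather than a tab, because the HTML no longer records which of the two the author typed.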

Mimetypes

Working with the MimetypesRegistry's mimetypes_registry tool to describe a custom MIME type

Registering a new MIME type with the 'mimetypes_registry' tool is easy. You must write a simple class that represents the type, and call the 'register()' method on the tool to announce its presence. Once registered, the type will become available to 'portal_transforms' and other parts of Archetypes.

The 'text_web_intelligent' class is found in 'intelligenttext/mimetype.py':

from Products.MimetypesRegistry.interfaces import IClassifier
from Products.MimetypesRegistry.MimeTypeItem import MimeTypeItem
from Products.MimetypesRegistry.common import MimeTypeException

from types import InstanceType

class text_web_intelligent(MimeTypeItem):

    __implements__ = MimeTypeItem.__implements__
    __name__   = "Web Intelligent Plain Text"
    mimetypes  = ('text/x-web-intelligent',)
    extensions = ('txt',)
    binary     = 0

This uses the 'MimeTypeItem' base class to provide all the necessary functionality. All you have to do is ensure that the class implements the appropriate interfaces, has a sensible human-readable name (via the '__name__' attribute), and is linked to the appropriate file extension(s) and MIME type(s).

According to RFC 2046, a MIME type is described by a string containing a major part (the type) and a minor part (the subtype), separated by a '/'. In this case, the major type is 'text', and the minor type is 'x-web-intelligent'. The 'x-' prefix is conventionally used for "unofficial" types such as 'text/x-web-intelligent'. For a list of the other registered types, look at 'mimetypes_registry' in the ZMI.
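
As a small illustration of that naming scheme, the following plain Python 3 sketch splits a MIME type string into its two parts and checks for the 'x-' prefix. The function name 'split_mimetype' is made up for this example and is not part of MimetypesRegistry.

```python
def split_mimetype(mimetype):
    """Return (major, minor, is_unofficial) for a 'major/minor' string."""
    major, minor = mimetype.split('/', 1)
    # By convention, a minor part starting with 'x-' marks an unofficial type.
    return major, minor, minor.startswith('x-')


for name in ('text/html', 'text/x-web-intelligent'):
    print(name, split_mimetype(name))
```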

It is possible to register more than one MIME type for the same class, for example where two or more types are equivalent. Hence, the 'mimetypes' variable must be a tuple. The 'extensions' tuple describes the associated file extensions. It can be empty if there is no sensible extension. The 'binary' variable should be 1 if the MIME type describes binary content, or 0 if it represents textual content.

To register the type with the 'mimetypes_registry' tool, 'Extensions/Install.py' contains the following code:

from Products.CMFCore.utils import getToolByName

from StringIO import StringIO
from types import InstanceType

from Products.intelligenttext.mimetype import text_web_intelligent

def registerMimeType(self, out, mimetype):
    if type(mimetype) != InstanceType:
        mimetype = mimetype()
    mimetypes_registry = getToolByName(self, 'mimetypes_registry')
    mimetypes_registry.register(mimetype)
    print >> out, "Registered mimetype", mimetype

def unregisterMimeType(self, out, mimetype):
    if type(mimetype) != InstanceType:
        mimetype = mimetype()
    mimetypes_registry = getToolByName(self, 'mimetypes_registry')
    mimetypes_registry.unregister(mimetype)
    print >> out, "Unregistered mimetype", mimetype

...


def install(self):

    out = StringIO()

    print >> out, "Installing text/x-web-intelligent mimetype and transform"

    # Register mimetype
    registerMimeType(self, out, text_web_intelligent)


    ...

    return out.getvalue()

def uninstall(self):

    out = StringIO()

    ...

    # Remove mimetype
    unregisterMimeType(self, out, text_web_intelligent)

    return out.getvalue()

Transforms

Working with PortalTransforms' portal_transforms tool to register new transforms, and writing tests for transforms.

Transforms are registered for one or more input MIME types, and a single output MIME type. Once registered, 'portal_transforms' will be able to use the available transforms to convert between two MIME types.
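
The relationship between transforms and MIME types can be pictured with a toy registry. This is only a sketch in plain Python 3 under simplifying assumptions: 'ToyTransformRegistry' and its methods are invented for the illustration, a transform is reduced to a function, and the real 'portal_transforms' tool does considerably more than a single dictionary lookup.

```python
class ToyTransformRegistry:
    """Hypothetical, much-simplified picture of a transform registry."""

    def __init__(self):
        self._by_name = {}
        self._by_types = {}

    def register(self, name, inputs, output, func):
        # One transform: several possible input types, exactly one output type.
        self._by_name[name] = func
        for input_type in inputs:
            self._by_types[(input_type, output)] = func

    def convert(self, name, text):
        """Run a transform chosen by name."""
        return self._by_name[name](text)

    def convert_to(self, target, text, mimetype):
        """Run whichever transform maps 'mimetype' to 'target', if any."""
        func = self._by_types.get((mimetype, target))
        if func is None:
            return None
        return func(text)


registry = ToyTransformRegistry()
registry.register(
    'shout_to_html',                      # made-up transform name
    inputs=('text/x-shout', 'text/x-loud'),  # made-up input types
    output='text/html',
    func=lambda text: '<p>%s</p>' % text.upper(),
)

print(registry.convert('shout_to_html', 'hello'))
print(registry.convert_to('text/html', 'hello', mimetype='text/x-loud'))
print(registry.convert_to('text/plain', 'hello', mimetype='text/x-loud'))
```

The last call prints 'None' because nothing in the toy registry produces 'text/plain' from 'text/x-loud'.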

The 'intelligenttext' transforms are found in 'intelligenttext/transforms'. The structure of this directory should follow the convention that each transform is in its own module (i.e. its own .py file), each of which should contain a class implementing the 'itransform' interface and a 'register()' function that returns a new instance of the transform itself. The '__init__.py' file in the 'transforms' module (directory) should be able to register the available transforms. As before, we will use 'Extensions/Install.py' to register the transforms manually at install time.

First of all, '__init__.py' contains the following code:

from Products.PortalTransforms.libtransforms.utils import MissingBinary
modules = [
    'web_intelligent_plain_text_to_html',
    'html_to_web_intelligent_plain_text',
    ]

g = globals()
transforms = []
for m in modules:
    try:
        ns = __import__(m, g, g, None)
        transforms.append(ns.register())
    except ImportError, e:
        print "Problem importing module %s : %s" % (m, e)
    except MissingBinary, e:
        print e
    except:
        import traceback
        traceback.print_exc()

def initialize(engine):
    for transform in transforms:
        engine.registerTransform(transform)

All of this is boilerplate, except for the list of 'modules'. These are the names of the Python modules under 'transforms/'.

Each transform module contains a transform class and a 'register()' function. The module 'intelligenttext/transforms/web_intelligent_plain_text_to_html.py' contains the following:

from Products.PortalTransforms.interfaces import itransform
from htmlentitydefs import entitydefs
import re

class WebIntelligentPlainTextToHtml:
    """Transform which replaces urls and email into hyperlinks"""

    __implements__ = itransform

    __name__ = "web_intelligent_plain_text_to_html"
    output = "text/html"

    def __init__(self, name=None, inputs=('text/x-web-intelligent',),
                    tab_width = 4):
        self.config = {'inputs': inputs, 'tab_width': tab_width}
        self.config_metadata = {
            'inputs' : ('list', 'Inputs',
                            'Input(s) MIME type. Change with care.'),
            'tab_width' : ('string', 'Tab width',
                            'Number of spaces for a tab in the input')
            }
        if name:
            self.__name__ = name

        self.urlRegexp = re.compile(r'((?:ftp|https?)://(?:[a-z0-9]' \
        r'(?:[-a-z0-9]*[a-z0-9])?\.)+(?:com|edu|biz|org|gov|int|info' \
        r'|mil|net|name|museum|coop|aero|[a-z][a-z])\b(?:\d+)' \
        r'?(?:\/[^;"\'<>()\[\]{}\s\x7f-\xff]*(?:[.,?]+[^;"\'<>()' \
        r'\[\]{}\s\x7f-\xff]+)*)?)', re.I|re.S)
        self.emailRegexp = re.compile(r'["=]?(\b[A-Z0-9._%-]+@' \
        r'[A-Z0-9._%-]+\.[A-Z]{2,4}\b)', re.I|re.S)
        self.indentRegexp = re.compile(r'^(\s+)', re.M)

    def name(self):
        return self.__name__

    def __getattr__(self, attr):
        if attr in self.config:
            return self.config[attr]
        raise AttributeError(attr)

    def convert(self, orig, data, **kwargs):

        text = orig

        # Do &amp; separately, else, it may replace an already-inserted & from
        # an entity with &amp;, so < becomes &lt; becomes &amp;lt;
        text = text.replace('&', '&amp;')
        # Make funny characters into html entity defs
        for entity, letter in entitydefs.items():
            if entity != 'amp':
                text = text.replace(letter, '&' + entity + ';')

        # Replace hyperlinks with clickable <a> tags
        def replaceURL(match):
            url = match.groups()[0]
            return '<a href="%s">%s</a>' % (url, url)
        text = self.urlRegexp.subn(replaceURL, text)[0]

        # Replace email strings with mailto: links
        def replaceEmail(match):
            url = match.groups()[0]
            return '<a href="mailto:%s">%s</a>' % (url, url)
        text = self.emailRegexp.subn(replaceEmail, text)[0]

        # Make leading whitespace on a line into &nbsp; to preserve indents
        def indentWhitespace(match):
            indent = match.groups()[0]
            indent = indent.replace(' ', '&nbsp;')
            return indent.replace('\t', '&nbsp;' * self.tab_width)
        text = self.indentRegexp.subn(indentWhitespace, text)[0]

        # Finally, make \n's into br's
        text = text.replace('\n', '<br />')

        data.setData(text)
        return data

def register():
    return WebIntelligentPlainTextToHtml()

The class 'WebIntelligentPlainTextToHtml' implements 'itransform'. Notice the '__name__' attribute, which contains the name of the transform as registered with 'portal_transforms', and the 'output' attribute, which specifies the output type of the transform. The '__init__()' method is used to initialise the transform. By providing 'self.config' and 'self.config_metadata', the transform becomes configurable through the web. By convention, we allow the list of input MIME types to be configured. We also allow the tab width to be specified.
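
The interplay between 'self.config' and '__getattr__()' is easy to miss, so here is the pattern in isolation as a plain Python 3 sketch. 'ConfigurableSketch' is a hypothetical class written for this illustration, and the sketch assumes that whatever edits the configuration does so by writing into the 'config' dictionary.

```python
class ConfigurableSketch:
    """Hypothetical class showing only the config / __getattr__ pattern."""

    def __init__(self, inputs=('text/x-web-intelligent',), tab_width=4):
        self.config = {'inputs': inputs, 'tab_width': tab_width}

    def __getattr__(self, attr):
        # Python calls this only when normal attribute lookup has failed,
        # so 'config' itself and real methods never pass through here.
        config = self.__dict__.get('config', {})
        if attr in config:
            return config[attr]
        raise AttributeError(attr)


sketch = ConfigurableSketch(tab_width=8)
print(sketch.tab_width)          # read from config at construction time

sketch.config['tab_width'] = 2   # simulate a later configuration change
print(sketch.tab_width)          # the attribute follows the config
```

Because the values are looked up in the dictionary on every access, a configuration change takes effect without the transform having to copy settings onto itself.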

All the magic happens in the 'convert()' method. Here, we use the regular expressions compiled in the '__init__()' method (to avoid compiling the same expression more than once) to find and replace URLs and mail addresses with clickable hyperlinks, handling whitespace and converting newlines to '<br />' tags. The method returns a data stream, described in the 'idatastream' interface. In this case, the stream simply contains the replaced text.
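
The order of the steps in 'convert()' matters: markup characters are escaped first, so that the tags inserted afterwards are not escaped themselves. The following standalone Python 3 sketch walks through the same sequence without any Plone imports. It is an approximation for study purposes only: 'text_to_html_sketch' is a made-up name, the URL pattern is deliberately far simpler than the one used by the product, and email handling is left out.

```python
import html
import re

# Simplified, assumed patterns; not the regular expressions from the product.
URL_RE = re.compile(r'(https?://[^\s<>"]+[^\s<>".,;:!?)])')
INDENT_RE = re.compile(r'^([ \t]+)', re.M)


def text_to_html_sketch(text, tab_width=4):
    # 1. Escape &, < and > before any tags are added.
    text = html.escape(text, quote=False)

    # 2. Wrap each URL in an anchor tag.
    text = URL_RE.sub(r'<a href="\1">\1</a>', text)

    # 3. Turn leading whitespace into &nbsp; so indentation survives.
    def indent(match):
        whitespace = match.group(1).replace(' ', '&nbsp;')
        return whitespace.replace('\t', '&nbsp;' * tab_width)
    text = INDENT_RE.sub(indent, text)

    # 4. Finally, turn newlines into line breaks.
    return text.replace('\n', '<br />')


print(text_to_html_sketch('Go to http://plone.org.\n  a < b'))
```

Note that the trailing full stop is kept outside the link, which is the kind of detail the product's much longer URL expression also has to deal with.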

The 'html_to_web_intelligent_plain_text' transform follows the same pattern, but is rather longer and more complicated.

To install the transforms, 'Extensions/Install.py' contains:

from Products.CMFCore.utils import getToolByName

from StringIO import StringIO
from types import InstanceType

...

def registerTransform(self, out, name, module):
    transforms = getToolByName(self, 'portal_transforms')
    transforms.manage_addTransform(name, module)
    print >> out, "Registered transform", name

def unregisterTransform(self, out, name):
    transforms = getToolByName(self, 'portal_transforms')
    try:
        transforms.unregisterTransform(name)
        print >> out, "Removed transform", name
    except AttributeError:
        print >> out, "Could not remove transform", name, "(not found)"


def install(self):

    out = StringIO()

    print >> out, "Installing text/x-web-intelligent mimetype and transform"

    ...

    # Register transforms
    registerTransform(self, out, 'web_intelligent_plain_text_to_html',
        'Products.intelligenttext.transforms.web_intelligent_plain_text_to_html')
    registerTransform(self, out, 'html_to_web_intelligent_plain_text',
        'Products.intelligenttext.transforms.html_to_web_intelligent_plain_text')

    return out.getvalue()

def uninstall(self):

    out = StringIO()

    # Remove transforms
    unregisterTransform(self, out, 'web_intelligent_plain_text_to_html')
    unregisterTransform(self, out, 'html_to_web_intelligent_plain_text')

    ...

    return out.getvalue()

Finally, we need to test our transforms. The appropriate tests are found in 'intelligenttext/tests/test_transforms.py'. This contains two simple Archetypes test cases that exercise the transforms via various strings. Take a look at this file if you want to understand the transforms in more detail.
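
If you have not written a transform test before, the general shape is to feed a string to 'convert()' and compare what comes back from 'getData()'. The sketch below shows that shape in plain Python 3 with the standard 'unittest' module. It is not taken from 'test_transforms.py': 'FakeDataStream' and 'NewlineToBreak' are hypothetical stand-ins so that the example runs without Plone, whereas the real tests exercise the registered transforms.

```python
import unittest


class FakeDataStream:
    """Hypothetical stand-in offering only setData() and getData()."""

    def __init__(self):
        self._data = None

    def setData(self, value):
        self._data = value

    def getData(self):
        return self._data


class NewlineToBreak:
    """Hypothetical toy transform with the same convert() signature."""

    def convert(self, orig, data, **kwargs):
        data.setData(orig.replace('\n', '<br />'))
        return data


class TestNewlineToBreak(unittest.TestCase):

    def test_newlines_become_breaks(self):
        result = NewlineToBreak().convert('one\ntwo', FakeDataStream())
        self.assertEqual(result.getData(), 'one<br />two')

    def test_returns_the_stream_it_was_given(self):
        stream = FakeDataStream()
        self.assertIs(NewlineToBreak().convert('x', stream), stream)
```

Saved to a file, these cases can be run with 'python -m unittest' followed by the file name.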

Using the transforms

Using the transforms via the PortalTransforms API, and automatically in Archetypes fields.

Now that the transforms are registered, you can use 'portal_transforms' to convert between types. If you took the time to read the test cases in 'tests/test_transforms.py', you will already have seen how this works.

It is possible to invoke transforms by name:

text = 'Make this a link: http://plone.org.'

portal_transforms = getToolByName(self, 'portal_transforms')
data = portal_transforms.convert('web_intelligent_plain_text_to_html', text)
html = data.getData()

However, in most cases, you will be more interested in converting between two MIME types:

text = 'Make this a link: http://plone.org.'

portal_transforms = getToolByName(self, 'portal_transforms')
data = portal_transforms.convertTo('text/html', text,
                                        mimetype='text/x-web-intelligent')
html = data.getData()

Here, 'text' contains some text, and we tell 'convertTo()' that this should be treated as 'text/x-web-intelligent'. We then ask it to convert this text to 'text/html', and fetch the actual HTML from the 'idatastream' returned via the 'getData()' method.

The nice thing about 'PortalTransforms' is that it completely isolates you from the underlying transform code. For example, you can now create an Archetypes schema containing:

TextField('text',
    widget=TextAreaWidget(
        label="Text",
        description="Enter some text",
    ),
    searchable=True,
    default_content_type="text/x-web-intelligent",
    allowable_content_types=('text/x-web-intelligent',),
    default_output_type="text/html"
),

If edited through the web, Archetypes will ensure that the content is saved with mimetype 'text/x-web-intelligent', and is output as 'text/html'. If you are calling the mutator directly, you need to specify the MIME type manually:

instance.setText('Go to http://plone.org',
                    mimetype='text/x-web-intelligent')

When this is displayed or you call the accessor, 'portal_transforms' will look for a way to convert 'text/x-web-intelligent' to 'text/html', and will invoke the 'web_intelligent_plain_text_to_html' transform as long as 'intelligenttext' is installed in 'portal_quickinstaller':

html = instance.getText()

You can also be explicit about which MIME type you'd like back:

text = instance.getText(mimetype='text/x-web-intelligent')

For more details, see 'PortalTransforms/interfaces.py'.