HTML Filtering options
This How-to applies to:
Plone 2.1.x
This How-to is intended for:
Integrators, Customizers
Plone filters HTML in several different places, each of which has its own pros and cons. Unfortunately, this can cause confusion because the different filters have different effects.
- Kupu
- Filters out unwanted tags and attributes. Also turns HTML into XHTML. Configurable TTW. Runs client-side only, before the form is submitted, so it is not secure.
- mxTidy
- Runs the HTML through the HTML Tidy program. XML configuration file (not TTW). The default configuration can mess up non-ASCII characters and <pre> blocks. Runs on the server when an HTML field is updated.
- Safe HTML
- Removes dangerous tags and javascript from attributes. No configuration (although it can be disabled TTW). Runs on the server when an HTML field is rendered.
Kupu HTML Filter
Kupu's HTML filter runs on the client browser whenever HTML is saved. The client can bypass this filtering so it cannot be relied on for any form of security.
Kupu lets you blacklist attributes on particular tags, or every occurrence of a specific attribute or tag.
By default the following tags are removed:
- center
- span
- tt
- big
- small
- u
- s
- strike
- basefont
- font
The following attributes are removed from any tag:
- dir
- lang
- valign
- halign
- border
- frame
- rules
- cellspacing
- cellpadding
- bgcolor
The attributes width and height are removed from the tags table, th, and td (largely because IE keeps inserting them when you don't want them, and pasting from Word always includes inappropriate width and height attributes).
The style attribute is handled specially: most style attribute values are stripped, but text-align, list-style-type, and float are allowed to remain since Kupu can generate them under some circumstances.
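As an illustration of the style rule above, the whitelist can be sketched in Python. This is not Kupu's code (Kupu's filter is JavaScript running in the browser); the function name `filter_style` and the naive split on semicolons are assumptions made for this example only.

```python
# Illustrative sketch only: Kupu's real filter is JavaScript in the browser.
# The three property names come from the text above; everything else is assumed.
ALLOWED_STYLE_PROPERTIES = {"text-align", "list-style-type", "float"}


def filter_style(value):
    """Keep only the whitelisted declarations from a style attribute value."""
    kept = []
    # Naive split: good enough for simple values, not for url(...) with ';'.
    for declaration in value.split(";"):
        name, sep, setting = declaration.partition(":")
        name = name.strip().lower()
        setting = setting.strip()
        if sep and setting and name in ALLOWED_STYLE_PROPERTIES:
            kept.append("%s: %s" % (name, setting))
    return "; ".join(kept)


print(filter_style("color: red; text-align: center; FLOAT:left"))
# prints: text-align: center; float: left
```

A value containing no whitelisted property comes back as an empty string, in which case the attribute could be dropped altogether.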
There is also a blacklist for the class attribute: by default it is empty but there is a sample kupu configuration which adds in some class names used by Microsoft Word.
All of these are configurable either through the control panel or by scripting in Python. A sample script is supplied which can be edited fairly easily to change any of Kupu's configurable options.
Kupu also scrubs any event attributes on..., as well as any tags or attributes which the HTML spec doesn't define, or any attributes which aren't permitted on a particular tag. This is not configurable (except by editing the filter code).
Configuring Kupu's filter
The simple way is just to use the control panel to change the options. A better way is to copy Kupu's sample-kupu-customisation-policy.py script from the kupu_plone skin folder, name the copy kupu-customisation-policy.py, and then edit it. This ensures that any customisations you make will be preserved even if you uninstall and reinstall Kupu (e.g. when upgrading).
Disabling Kupu's filter
Short of editing Kupu's source, there is no way to entirely disable the content filter.
mxTidy
By default Plone's content types run all HTML through the external HTML Tidy program (if it is available) whenever HTML is saved. HTML Tidy will pretty-print the HTML, but it does not have options to strip specific unwanted tags or attributes, so it cannot be used for the same kind of tidying as Kupu or the Safe HTML cleanup. Also, many combinations of mxTidy options can produce output HTML which does not display the same as the input HTML.
Configuring mxTidy
The default configuration is in the ATContentTypes folder ATContentTypes\etc\atcontenttypes.conf.in. You should not edit this directly: make a copy of the file renamed to atcontenttypes.conf and put it in your Zope instance's etc folder, then edit that copy.
The default options are:
<mxtidy>
enable yes
drop_font_tags yes
drop_empty_paras yes
input_xml no
output_xhtml yes
quiet yes
show_warnings yes
indent_spaces yes
word_2000 yes
wrap 72
tab_size 4
char_encoding raw
</mxtidy>
Problems with this configuration include:
- No character encoding has been specified, so accented characters will be corrupted: even numeric character entities are reduced modulo 256. Solution: Set char_encoding to utf8.
- HTML Tidy will add a newline following <br> tags inside <pre> sections. This double-spaces the pre section. I'm not sure whether there is any combination of options which avoids this issue.
- Kupu does not generate a summary attribute for tables (although it should), so HTML Tidy will generate a warning message every time you save a document containing a table. You can suppress this by setting show_warnings to no.
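To make the character-encoding problem concrete, the following Python 3 sketch models the symptom as described above, assuming the damage amounts to replacing each decimal entity's code point with its remainder modulo 256. The function name `corrupt_entities` is hypothetical, and this is not HTML Tidy's actual code.

```python
import re


def corrupt_entities(html):
    """Model the reported symptom: rewrite &#N; as &#(N mod 256);."""
    def reduce(match):
        return "&#%d;" % (int(match.group(1)) % 256)
    return re.sub(r"&#(\d+);", reduce, html)


print(corrupt_entities("Price: &#8364;5, caf&#233;"))
# prints: Price: &#172;5, caf&#233;
```

The euro sign (8364 = 32 * 256 + 172) comes back as entity 172, the "not" sign, while the e-acute (233) survives only because its code point is already below 256.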
Disabling mxTidy
One option is simply to ensure that HTML Tidy is not accessible to Plone. If you don't install HTML Tidy, Plone will simply not try to use it.
A less drastic solution is to copy the configuration file as described above, and change the enable line to:
enable no
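If you prefer to script the change rather than edit the copied file by hand, something along these lines might do. It is a minimal sketch: the helper name `set_mxtidy_options` is hypothetical, it assumes the one-option-per-line layout of the default options shown earlier, and it works on the file's text only, leaving reading and writing atcontenttypes.conf to you.

```python
def set_mxtidy_options(conf_text, overrides):
    """Return conf_text with the given options changed inside <mxtidy>.

    Only options already present in the section are rewritten; options
    missing from the section are not added.
    """
    out = []
    in_section = False
    for line in conf_text.splitlines():
        stripped = line.strip()
        if stripped == "<mxtidy>":
            in_section = True
        elif stripped == "</mxtidy>":
            in_section = False
        elif in_section and stripped and not stripped.startswith("#"):
            key = stripped.split()[0]
            if key in overrides:
                # Preserve the original indentation of the line.
                indent = line[: len(line) - len(line.lstrip())]
                line = "%s%s %s" % (indent, key, overrides[key])
        out.append(line)
    return "\n".join(out) + "\n"


sample = "<mxtidy>\n    enable yes\n    char_encoding raw\n</mxtidy>\n"
print(set_mxtidy_options(sample, {"enable": "no"}))
```

The same helper could set char_encoding to utf8 or show_warnings to no if you want to keep mxTidy enabled but work around the problems listed above.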
Safe HTML
The Safe HTML transform is applied to the main body of documents when they are rendered by Plone. The intention of this filter is simply to ensure that certain security holes are closed. The most obvious such hole is that if a (non-Manager) user can create a web page containing a script tag and get a site administrator to view that page, the script can then perform actions on the site which require administrator access.
Before version 1.3.9 of PortalTransforms there is no configuration for Safe HTML, either TTW or through a configuration file (it is of course possible to edit the source, or to inject patches from another product). It is also possible, but extremely inadvisable, to disable it. If you need to customise the transform then upgrade to version 1.3.9-rc1 or later. This is [or at the time of writing 'will be'] available in Plone 2.1.2, or check it out from Subversion at https://svn.plone.org/svn/archetypes/PortalTransforms/branches/archetypes-1_3-branch
Plone 2.1.2 will include PortalTransforms 1.3.9 which allows customising as follows:
- You can disable the entire transform.
- You can edit the lists of nasty tags, valid tags, and whether or not javascript is to be stripped.
The following tags are permitted by Safe HTML:
a b base big blockquote body br caption cite code dd del div dl dt em h1 h2 h3 h4 h5 h6 head hr html i img ins kbd li meta ol p pre q small span strong sub sup table tbody td th title tr tt u ul
Version 1.3.9 also permits:
area map
These tags and anything they contain are removed:
script object embed applet
Any other tags are removed, although their content will remain. This means that the following tags, although legal HTML, will be removed by Safe HTML:
abbr acronym address area basefont bdo button center dfn dir fieldset font form iframe input isindex label map menu noframes noscript s samp select strike textarea var
Attributes whose values start with 'javascript:' are also removed.
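The rules above can be sketched in Python 3 with the standard library's `html.parser`. This is an illustration of the described behaviour, not the PortalTransforms implementation: the class name `SafeHTMLSketch` and the function `safe_html` are made up for the example, the tag lists are copied from the text above, and a real filter must cope with far more malformed input than this does.

```python
from html import escape
from html.parser import HTMLParser

# Tag lists copied from the text above.
VALID_TAGS = set(
    "a b base big blockquote body br caption cite code dd del div dl dt em "
    "h1 h2 h3 h4 h5 h6 head hr html i img ins kbd li meta ol p pre q small "
    "span strong sub sup table tbody td th title tr tt u ul".split()
)
NASTY_TAGS = {"script", "object", "embed", "applet"}


class SafeHTMLSketch(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.nasty_depth = 0  # > 0 while inside a nasty element

    def handle_starttag(self, tag, attrs):
        if tag in NASTY_TAGS:
            # embed has no end tag, so it must not open a skipped region.
            if tag != "embed":
                self.nasty_depth += 1
            return
        if self.nasty_depth or tag not in VALID_TAGS:
            return  # unknown tag: drop the tag itself, keep its content
        kept = [
            (name, value or "")
            for name, value in attrs
            if not (value or "").strip().lower().startswith("javascript:")
        ]
        rendered = "".join(
            ' %s="%s"' % (name, escape(value, quote=True))
            for name, value in kept
        )
        self.out.append("<%s%s>" % (tag, rendered))

    def handle_endtag(self, tag):
        if tag in NASTY_TAGS:
            if tag != "embed" and self.nasty_depth:
                self.nasty_depth -= 1
            return
        if not self.nasty_depth and tag in VALID_TAGS:
            self.out.append("</%s>" % tag)

    def handle_data(self, data):
        if not self.nasty_depth:
            self.out.append(escape(data, quote=False))


def safe_html(html):
    parser = SafeHTMLSketch()
    parser.feed(html)
    parser.close()  # flush any buffered text
    return "".join(parser.out)


print(safe_html(
    '<p>Hi <font color="red">there</font><script>alert(1)</script> '
    '<a href="javascript:alert(2)" title="x">link</a></p>'
))
# prints: <p>Hi there <a title="x">link</a></p>
```

Note how the font tag disappears but its text stays, the script element vanishes together with its content, and the link keeps its title while losing the javascript: href.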
Disabling Safe HTML
This section has been removed. Versions prior to 1.3.9 could not reliably delete a transform so the instructions which were given here wouldn't work reliably (if at all). Upgrade and disable it through the configuration screen (.../portal_transforms/safe_html/manage_main) instead.
Do this only if you trust everyone who can create content for your site, as you are leaving it wide open for scripting attacks.
Alternatives to disabling Safe HTML
Another solution is to decide what JavaScript you want to permit, and attach it to tags based on the HTML. For example, Plone applies clickable column headings to any table with the class 'listing' when the page loads. Adding additional behaviour this way allows end users to use JavaScript in a controlled manner.
Event Attributes
Found this while searching on how to enable event attributes. If you wish to enable event attributes such as "onclick" as stated above, here is how you do it.
From Duncan Booth's original message:
The reason [event attributes] are filtered out is simply that allowing end users to set events in the HTML they create is a big security threat, so in general blocking all events makes sense.
If you are happy to customise your copy of kupucontentfilters.js, then around line 280 you should find the event attributes:
// All event attributes are here but commented out so we don't
// have to remove them later.
this.events = []; // 'onclick|ondblclick|onmousedown|onmouseup|onmouseover|onmousemove|onmouseout|onkeypress|onkeydown|onkeyup'.split('|');
this.focusevents = []; // ['onfocus','onblur']
this.loadevents = []; // ['onload', 'onunload']
this.formevents = []; // ['onsubmit','onreset']
this.inputevents = [] ; // ['onselect', 'onchange']
For these lines, simply deleting "[] ; // " should be sufficient to add the events back in to the attribute tables. You should then be able to control the events in the same way as other attributes.
Reload the transforms
/portal_transforms/manage_reloadAllTransforms
PHP-based Filter
bioinformatics.org/phplabware/internal_utilities/htmLawed
The htmLawed filter is almost perfect for me.
CMF filters tags also
I was asked in comments to add a link to this How-to from my FAQ which explains tag filtering in the CMF:
http://plone.org/documentation/faq/tags-filtered
Maybe a link from this How-to back to my FAQ would complete the loop?