#2: Media Metadata
- Contents
Media carry a lot of metadata, some of which is encoded in the file itself. Products should be able to use this metadata without worrying about how to extract it from the file (ogg/mp3/m4a, exif/iptc, ...). Also, the transition between tags in the file itself and application-level metadata should be smooth.
- Proposed by
- Jean Jordaan
- Proposal type
- Architecture
- State
- being-discussed
Motivation
- Many different libraries exist to get metadata out of files. Plone product authors should be able to use this metadata without worrying about the library chosen.
- The metadata within files doesn't tell the whole story. A specific application may add more properties that describe a media object. It should be possible to augment the metadata encoded within the file with more data.
- Besides the metadata within a file and the metadata added by Plone, much more metadata is available online, via services such as MusicBrainz, Bitzi, FreeDB, Audioscrobbler, and so on. It should be possible to use these data sources to improve data quality and prevent repetitive data entry.
- Media metadata has many consumers, not limited to Plone. Therefore, metadata should be available in formats such as RDF and XSPF for the easy creation of podcasts and playlists.
Proposal
Metadata extraction tool for media files
The Plone Multimedia project is gathering and integrating products to deal with audio and video content.
This PLIP proposes a Metadata tool, intended to provide a unified interface for these products to get metadata from media files.
It allows registration of metadata proxies. Each metadata proxy returns a specific set of metadata from a specific file type or types, e.g. the ID3 data of an MP3 file, or either the EXIF or the IPTC data of a photo. If the metadata is writable, the proxy can also set metadata on the file: ID3 proxies set ID3 tags, OGG proxies set OGG tags, and so on.
The metadata tool needs to deal with separate layers.
Raw tags
At the most basic layer, it simply gets the metadata encoded within a file, hiding the details of this from the application programmer. In your code you get the tool, call 'mdtool.getMetadataProxies(file)' (where file is the blob you'd like info on). 'file' might be ogg/mp3/wma/avi/divx/...
A metadata proxy returns a metadata object, which subclasses 'dict'. It also provides other methods to get e.g. the type of the metadata returned.
Parsed tags
Wrappers or proxies at a higher level may augment this metadata object by mapping the literal tags in the media files to a single set of names preferred by PloneMultimedia, e.g. derived from MusicBrainz metadata. (See MusicBrainz Metadata Initiative 2.1, and MusicBrainz Metadata Vocabulary). mmpython has its own mappings of tags to names (see mmpython.mediainfo).
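The mapping layer described above might look roughly like the following sketch. The ID3v2 frame IDs and Vorbis comment field names on the left are standard; the common names on the right, the 'ogg' type label, and the function name parse_tags are assumptions made for illustration, not names defined by PloneMultimedia, MusicBrainz, or mmpython.

```python
# Raw tag name -> common name. The common names are hypothetical.
ID3_TO_COMMON = {
    'TIT2': 'title',
    'TPE1': 'artist',
    'TALB': 'album',
    'TYER': 'date',
    'TRCK': 'tracknumber',
    'TCON': 'genre',
}
VORBIS_TO_COMMON = {
    'TITLE': 'title',
    'ARTIST': 'artist',
    'ALBUM': 'album',
    'DATE': 'date',
    'TRACKNUMBER': 'tracknumber',
    'GENRE': 'genre',
}
MAPPINGS = {'id3': ID3_TO_COMMON, 'ogg': VORBIS_TO_COMMON}


def parse_tags(raw_tags, metadata_type):
    """Return a copy of raw_tags augmented with common-name keys.

    The raw tags are kept, so an application can still reach the
    literal values stored in the file.
    """
    parsed = dict(raw_tags)
    for raw_name, common_name in MAPPINGS.get(metadata_type, {}).items():
        if raw_name in raw_tags:
            parsed[common_name] = raw_tags[raw_name]
    return parsed


raw = {'TIT2': u'Hooplas Involving Circus Tricks', 'TYER': u'2004', 'TENC': u''}
parsed = parse_tags(raw, 'id3')
```

Keeping the raw keys alongside the mapped ones mirrors the "include original tag" approach in the sample session; whether the two sets of names should live in one dictionary is left open by the proposal.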
Adding application data
An MP3 file contains metadata about the file in isolation. However, the file might form part of a radio show, in which case additional properties become relevant, e.g. date of broadcast and name of presenter. In this case, the application would register a proxy which returns this data.
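A minimal sketch of such an application-level proxy follows. The class names RadioShowEpisode and RadioShowProxy, the 'radioshow' type label, and the field names are all hypothetical, and a plain dict stands in for the metadata object.

```python
class RadioShowEpisode(object):
    """Hypothetical stand-in for an application content object."""

    def __init__(self, broadcast_date, presenter):
        self.broadcast_date = broadcast_date
        self.presenter = presenter


class RadioShowProxy(object):
    """Hypothetical proxy exposing application data as metadata."""

    metadata_type = 'radioshow'
    fields = ('broadcast_date', 'presenter')

    def __init__(self, context):
        self.context = context

    def getMetadata(self):
        # Read the values from the application object, not from the file.
        return dict((name, getattr(self.context, name))
                    for name in self.fields)


episode = RadioShowEpisode(u'2005-06-01', u'A. Presenter')
show_metadata = RadioShowProxy(episode).getMetadata()
```

Because this proxy answers the same getMetadata call as a file-based proxy, the application could treat both layers uniformly.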
Metadata objects
A metadata object subclasses dict. In addition, it knows what type of metadata it represents:
>>> print metadata.getMetadataType()
'id3'
>>> print metadata
{'WXXX': u'(User defined URL link): (): ',
'TCOP': u'(Copyright message): ',
'TOPE': u'(Original artist(s)/performer(s)): ',
...
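One way such an object could be written is sketched below. Only the dict base class and the getMetadataType method come from the proposal; the constructor signature is an assumption.

```python
class Metadata(dict):
    """A dict of tags that also remembers which kind of metadata it holds."""

    def __init__(self, metadata_type, *args, **kwargs):
        # Remaining arguments are passed straight to dict.
        dict.__init__(self, *args, **kwargs)
        self._metadata_type = metadata_type

    def getMetadataType(self):
        return self._metadata_type


metadata = Metadata('id3', {'TRCK': u'4', 'TYER': u'2004'})
```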
MetadataTool
The tool provides getMetadataProxies, which takes either a Plone content object or a filename and mime type specification, and looks at the registered proxies, returning all the matching ones. Multiple proxies might match, e.g. id3_proxy and itunes_proxy. It's up to the application to choose among them.
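The lookup could be sketched as a simple registry keyed by mime type. This is an illustration under assumptions: registerProxy is an invented name, the proxy classes are empty placeholders, and only the filename and mime type form of the call is covered, not the Plone content object form.

```python
import mimetypes


class MetadataTool(object):
    """Hypothetical registry that matches proxies to files by mime type."""

    def __init__(self):
        self._proxies = {}  # mime type -> list of registered proxies

    def registerProxy(self, mime_type, proxy):
        # Assumed name; the proposal does not name the registration call.
        self._proxies.setdefault(mime_type, []).append(proxy)

    def getMetadataProxies(self, filename, mime_type=None):
        # Without an explicit mime type, guess it from the file extension.
        if mime_type is None:
            mime_type, _encoding = mimetypes.guess_type(filename)
        # Return every match; choosing among them is up to the application.
        return list(self._proxies.get(mime_type, []))


class ID3Proxy(object):
    metadata_type = 'id3'


class ITunesProxy(object):
    metadata_type = 'itunes'


tool = MetadataTool()
tool.registerProxy('audio/mpeg', ID3Proxy())
tool.registerProxy('audio/mpeg', ITunesProxy())

matches = tool.getMetadataProxies('somefile.mp3', 'audio/mpeg')
types = [proxy.metadata_type for proxy in matches]
```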
MetadataProxy
A metadata proxy is registered for a mime type. It knows its type, and you can iterate over it and subscript it like a dictionary. If a marshaller is registered for the proxy's metadata type, that marshaller provides the proxy's default string representation. If multiple marshallers are registered, the application can choose among them.
Marshallers
In addition to extraction adapters, the metadata tool allows registration of marshalling adapters. Marshalling adapters are registered for metadata types. E.g. a MusicBrainzRDFMarshaller might be registered for id3 tags, for ogg tags, or for a higher-level metadata type such as the tags used by mmpython.
Here is a proxy whose primary marshaller is MusicBrainzRDFMarshaller:
>>> metadata
<?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:mq="http://musicbrainz.org/mm/mq-1.1#"
xmlns:mm="http://musicbrainz.org/mm/mm-2.1#"
xmlns:ar="http://musicbrainz.org/ar/ar-1.0#"
xmlns:az="http://www.amazon.com/gp/aws/landing.html#">
<mq:Result>
<mq:status>OK</mq:status>
<mm:albumList>
<rdf:Bag>
<rdf:li rdf:resource="http://musicbrainz.org/mm-2.1/album/1073abfc-768e-455b-9937-9b41b923c746"/>
</rdf:Bag>
</mm:albumList>
</mq:Result>
<mm:Album rdf:about="http://musicbrainz.org/mm-2.1/album/1073abfc-768e-455b-9937-9b41b923c746">
<dc:title>Beaucoup Fish</dc:title>
...
<az:Asin>B00000IFTF</az:Asin>
</mm:Album>
</rdf:RDF>
Let's see what else we have:
>>> metadata.getMarshallers()
[MusicBrainzRDF, RFC822, BitziRDF]
>>> metadata.asRFC822()
Title: Girl
Creator: Beck
AlbumName: Guero
Date: 2005
FileName: 03 - Beck - Girl - www.torrentazos.com.mp3
...
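Marshaller registration per metadata type might be sketched as follows. The names registerMarshaller, getMarshallers (as a module-level function), and as_rfc822 are assumptions, and the output is only a simplified "Name: value" rendering without the line folding or escaping that real RFC 822 headers require.

```python
MARSHALLERS = {}  # metadata type -> {marshaller name: callable}


def registerMarshaller(metadata_type, name, marshaller):
    """Register a callable that turns a metadata dict into a string."""
    MARSHALLERS.setdefault(metadata_type, {})[name] = marshaller


def getMarshallers(metadata_type):
    """Return the names of the marshallers registered for a metadata type."""
    return sorted(MARSHALLERS.get(metadata_type, {}))


def as_rfc822(metadata,
              field_order=('Title', 'Creator', 'AlbumName', 'Date')):
    # Emit one header-style line per known field, skipping missing ones.
    lines = []
    for name in field_order:
        if name in metadata:
            lines.append('%s: %s' % (name, metadata[name]))
    return '\n'.join(lines)


registerMarshaller('id3', 'RFC822', as_rfc822)
text = MARSHALLERS['id3']['RFC822'](
    {'Title': 'Girl', 'Creator': 'Beck', 'Date': '2005'})
```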
Bundled proxies and marshallers
A set of standard proxies and marshallers can be bundled as a separate product, similar to ATExtensions and MoreFieldsAndWidgets that bundle fields and widgets. Specific products (e.g. PloneRadioShowVapourware) provide their own specialised proxies and marshallers.
Sample session
Here's a sample session at the zopectl debug Python interpreter:
>>> # A Zope object
>>> song = container['somefile.mp3']
>>> song.portal_type
'ATAudio'
>>> photo = container['somephoto.jpg']
>>> # e.g. CMFPhoto or ATImage
>>> # See if the metadata_tool has adapters for us
>>> song_metadata_proxies = metadata_tool.getMetadataProxies(song)
>>> photo_metadata_proxies = metadata_tool.getMetadataProxies(photo)
>>> # E.g. one for EXIF, one for IPTC data
>>> len(photo_metadata_proxies)
2
>>> a = photo_metadata_proxies[0]
>>> # Gather metadata from all the metadata_proxies
>>> metadata = [a.getMetadata() for a in photo_metadata_proxies]
>>> metadata[0][u'title']
u'asdfdsaf sadfsadf' # Return unicode
>>> # Does this make sense? Shouldn't we just return the bytes that were
>>> # included in the file?
>>> metadata[0].getMetadataType()
'exif'
>>> exif_metadata = metadata[0]
>>> metadata[1].getMetadataType()
'iptc'
>>> iptc_metadata = metadata[1]
>>> exif_metadata.keys() # dict interface
[u'Aperture', u'Make', u'Model', ....]
>>> exif_metadata['Aperture']
u'8' # We can't assume the type of the value, so make it string
>>> itunes_song = container['indistinguishablefromaplaintrack.mp3']
>>> song_metadata_proxies = metadata_tool.getMetadataProxies(itunes_song)
>>> metadata_list = [a.getMetadata() for a in song_metadata_proxies]
>>> for metadata in metadata_list:
...     print '---'
...     print metadata.getMetadataType()
...     print metadata
---
'id3'
{'WXXX': u'(): ',
'TCOP': u'',
'TOPE': u'',
'TCOM': u'',
'TRCK': u'4',
'TIT2': u'Hooplas Involving Circus Tricks',
'TENC': u'',
'COMM': u'(iTunNORM)[eng]: 0000102D 00000707 000069F9 00003FAF 000119F0 000119D6 00008A67 000089ED 00011CCB 0000D98D',
'TPE1': u'Say Hi To Your Mom',
'TALB': u'Numbers & Mumbles',
'TYER': u'2004',
'TCON': u'(131)Indie',
# Repeated tag as list
'COMM': [u'()[eng]: http://www.sayhitoyourmom.com',
u'(ID3v1 Comment)[XXX]: http://www.sayhitoyourmom.co',
u'(iTunes_CDDB_1)[eng]: 0200D201+213+1+150',
u'(iTunes_CDDB_TrackNumber)[eng]: 1',
]
}
---
'itunes'
{'WXXX': u'(): ',
'TCOP': u'',
'TOPE': u'',
'TCOM': u'',
'TRCK': u'4',
'TIT2': u'Hooplas Involving Circus Tricks',
'TENC': u'',
# Include original tag
'COMM': u'(iTunNORM)[eng]: 0000102D 00000707 000069F9 00003FAF 000119F0 000119D6 00008A67 000089ED 00011CCB 0000D98D',
# Parse it into an additional tag. Should this be distinguished from the
# id3 tags, or is that something the application should take care of?
'(iTunNORM)[eng]': u'0000102D 00000707 000069F9 00003FAF 000119F0 000119D6 00008A67 000089ED 00011CCB 0000D98D',
'TPE1': u'Say Hi To Your Mom',
'TALB': u'Numbers & Mumbles',
'TYER': u'2004',
'TCON': u'(131)Indie',
'COMM': [u'()[eng]: http://www.sayhitoyourmom.com',
u'(ID3v1 Comment)[XXX]: http://www.sayhitoyourmom.co',
u'(iTunes_CDDB_1)[eng]: 0200D201+213+1+150',
u'(iTunes_CDDB_TrackNumber)[eng]: 1', ],
'(iTunes_CDDB_1)[eng]': u'0200D201+213+1+150',
'(iTunes_CDDB_TrackNumber)[eng]': u'1',
}
References
Implementation
The Metadata implementation should aim to use as much existing architecture as possible. Possible candidates seem to be PortalTransforms and Sidnei's Marshall.
During the sprint, Godefroid remarked that the pattern sketched above should be implemented in terms of adapters and interfaces using Five.
Participants
Jean Jordaan
anyone else who'd like to!
