#37 — UnicodeDecodeError in feed parsing

State Resolved
Version:
Area Functionality
Issue type Bug
Severity Medium
Submitted by Mikko Ohtamaa
Submitted on Jun 20, 2011
Responsible
Target release: 2.0.5
Tested with trunk and latest release.

I think utilities.py improperly mixes UTF-8 byte strings and unicode strings, which leads to the following exception:

Traceback (innermost last):
  Module ZPublisher.Publish, line 126, in publish
  Module ZPublisher.mapply, line 77, in mapply
  Module ZPublisher.Publish, line 46, in call_object
  Module Products.feedfeeder.browser.feed, line 46, in __call__
  Module Products.feedfeeder.browser.feed, line 43, in update
  Module Products.feedfeeder.utilities, line 97, in retrieveFeedItems
  Module Products.feedfeeder.utilities, line 275, in _retrieveSingleFeed
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 3: ordinal not in range(128)


                if summary == convert_summary(content['value']):
                    # summary and content is the same so we can cut
                    # the summary. The transform can stumble over
                    # unicode, so we convert to a utf-8 string.
                    summary = summary.encode('utf-8')
                    data = portal_transforms.convert('html_to_text', summary)
                    summary = data.getData()
                    words = summary.split()[:72]
                    summarywords = words[:45]
                    if len(words) > 70:
                        # use the first 50-70 words as a description
                        for word in words[45:]:
                            summarywords.append(word)
                            if word.endswith(u'.'):  # <---- fails here

Steps to reproduce:
It happens with these feeds (I am not sure which post triggers it):

http://www.businessdailyafrica.com/[…]/index.xml
http://www.busiweek.com/[…]/index.php?format=feed&type=rss
Added by Mikko Ohtamaa on Jun 20, 2011 03:06 PM
I think you can get around this by changing the u"." strings to byte strings ".", which avoids triggering the automatic unicode decode.
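To illustrate the idea, here is a minimal sketch rather than the actual feedfeeder code. It is written for Python 3, where mixing bytes and text raises a TypeError instead of decoding implicitly, so the ASCII decode that Python 2 performs behind the scenes is spelled out by hand. The helper names and the sample word are made up for the example; I have not identified the real offending word in the feeds.

```python
# Sketch only: hypothetical helpers, not part of Products.feedfeeder.


def ends_sentence_py2_style(word):
    # Python 2 evaluated str.endswith(u'.') by first decoding the byte
    # string with the default 'ascii' codec. Emulate that step explicitly.
    return word.decode('ascii').endswith(u'.')


def ends_sentence_bytes(word):
    # Workaround: compare bytes with bytes, so no decoding takes place.
    return word.endswith(b'.')


# Made-up sample: a non-breaking space (U+00A0) becomes 0xc2 0xa0 in UTF-8,
# and splitting a byte string on whitespace does not break the word there.
sample_word = u'Ksh\u00a05.'.encode('utf-8')

try:
    ends_sentence_py2_style(sample_word)
except UnicodeDecodeError as exc:
    print('implicit decode fails: %s' % exc)

print('byte comparison result: %s' % ends_sentence_bytes(sample_word))
```

The byte comparison works because summary was already encoded to UTF-8 before the transform, so every word coming out of split() is a byte string.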
Added by Maurits van Rees on Sep 02, 2011 11:05 PM
Issue state: Unconfirmed → Resolved
You are right. Fixed in r244105.

I will look at the other pending tickets and will then aim for a release soon.
Added by Maurits van Rees on Dec 27, 2011 10:15 PM
Target release: None → 2.0.5
This fix is in 2.0.5.

2.0.7 is the most recent release (made today).
