#37 — UnicodeDecodeError in feed parsing
by Mikko Ohtamaa, last modified Dec 27, 2011 10:15 PM
| State | Resolved |
|---|---|
| Version | - |
| Area | Functionality |
| Issue type | Bug |
| Severity | Medium |
| Submitted by | Mikko Ohtamaa |
| Submitted on | Jun 20, 2011 |
| Responsible | - |
| Target release | 2.0.5 |
Tested with trunk and latest release.
I think utilities.py improperly mixes UTF-8 byte strings and unicode strings, leading to the following exception:
Traceback (innermost last):
  Module ZPublisher.Publish, line 126, in publish
  Module ZPublisher.mapply, line 77, in mapply
  Module ZPublisher.Publish, line 46, in call_object
  Module Products.feedfeeder.browser.feed, line 46, in __call__
  Module Products.feedfeeder.browser.feed, line 43, in update
  Module Products.feedfeeder.utilities, line 97, in retrieveFeedItems
  Module Products.feedfeeder.utilities, line 275, in _retrieveSingleFeed
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 3: ordinal not in range(128)
    if summary == convert_summary(content['value']):
        # summary and content is the same so we can cut
        # the summary. The transform can stumble over
        # unicode, so we convert to a utf-8 string.
        summary = summary.encode('utf-8')
        data = portal_transforms.convert('html_to_text', summary)
        summary = data.getData()
        words = summary.split()[:72]
        summarywords = words[:45]
        if len(words) > 70:
            # use the first 50-70 words as a description
            for word in words[45:]:
                summarywords.append(word)
                if word.endswith(u'.'):   <---- Here
Steps to reproduce:
- Happens with these feeds (not sure which post):
  http://www.businessdailyafrica.com/[…]/index.xml
  http://www.busiweek.com/[…]/index.php?format=feed&type=rss
Added by Mikko Ohtamaa on Jun 20, 2011 03:06 PM
I think you can get around this by changing the u"." literals to byte strings ".", so that the comparison no longer triggers an automatic unicode decode of the UTF-8 encoded word.
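A minimal sketch of the failure and the suggested workaround, not the actual feedfeeder code. Under Python 2, calling `endswith(u'.')` on a byte string first decodes that byte string with the ASCII codec; the helper `py2_style_endswith` is a hypothetical name that makes this implicit step explicit so the sketch also runs on Python 3. The sample word is made up, chosen only so that byte 0xc2 lands at position 3 as in the traceback.

```python
# Hypothetical sample word: "100°." encoded as UTF-8, as the summary is
# after summary.encode('utf-8'). The degree sign becomes bytes 0xc2 0xb0.
word = u"100\u00b0.".encode("utf-8")  # b'100\xc2\xb0.'


def py2_style_endswith(byte_word, unicode_suffix):
    """Emulate Python 2's implicit coercion (assumed helper, not in feedfeeder).

    Python 2 decodes the byte string with the default 'ascii' codec
    before comparing it against a unicode suffix.
    """
    return byte_word.decode("ascii").endswith(unicode_suffix)


# Mixed types: the implicit ASCII decode fails on the first non-ASCII byte.
try:
    py2_style_endswith(word, u".")
    error = None
except UnicodeDecodeError as exc:
    error = exc

# Suggested workaround: compare bytes with bytes, so nothing is decoded.
ends_with_period = word.endswith(b".")
```

Here `error` holds the same kind of `UnicodeDecodeError` as in the report, while the bytes-only comparison succeeds without any decoding.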
Added by Maurits van Rees on Sep 02, 2011 11:05 PM
You are right. Fixed in r244105.
Issue state: Unconfirmed → Resolved
I will look at the other pending tickets and will then aim for a release soon.
Added by Maurits van Rees on Dec 27, 2011 10:15 PM
This fix is in 2.0.5.
Target release: None → 2.0.5
2.0.7 is the most recent release (made today).
