#146 — Unicode Encode Error in Response

by Stefano Deponti last modified Jan 05, 2009 09:30 AM
State: Resolved
Version: 1.1
Area: Functionality
Issue type: Feature
Severity: Important
Submitted by: Stefano Deponti
Submitted on: Nov 21, 2007
Responsible: Maurits van Rees
Target release: 1.1
I get a UnicodeEncodeError when I try to write a response using non-ASCII characters (such as accented vowels: à è ì ò ù).

I’m using Poi 1.1 beta 2 from SVN on Plone 3.0.2 and Zope 2.10.

Here is the full error message:

Exception Type UnicodeEncodeError
Exception Value 'ascii' codec can't encode characters in position 48-49: ordinal not in range(128)

I've attached the traceback here.

I tried to solve the issue by changing line 433 of PoiTracker.py from

   textPart = MIMEText(rstText, 'plain', charset)

to

   textPart = MIMEText(rstText.encode('utf-8','replace'), 'plain', 'utf-8')

With this change I no longer get Unicode errors, but the e-mails show wrong characters in place of the accented ones.
Steps to reproduce:
Write a response like this: “Testo con lettere accentate: à è ì ò ù”.
Attached:
allegato.txt (IntelligentText, 3 kB, 3693 bytes)
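The MIMEText call in the report above only produces a readable mail when the bytes of the body and the declared charset agree. A minimal sketch of that idea follows, written for Python 3 rather than the Python 2 used by Poi at the time; the helper name `build_text_part` is an assumption for illustration and is not taken from PoiTracker.py.

```python
from email.mime.text import MIMEText


def build_text_part(text, charset="utf-8"):
    """Return a text/plain MIME part for `text` (hypothetical helper)."""
    # In Python 3, MIMEText takes a str and encodes it with `charset` itself,
    # so the payload bytes and the Content-Type charset are derived from the
    # same value.
    return MIMEText(text, "plain", charset)


part = build_text_part("Testo con lettere accentate: à è ì ò ù")

# A mail client reverses the process: undo the transfer encoding, then decode
# the bytes with the charset announced in the Content-Type header.
decoded = part.get_payload(decode=True).decode(part.get_content_charset())
print(decoded)
```

If the body were encoded with one charset while the header announced another, the last step would yield garbled characters, which matches the symptom described in the report.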
Added by Maurits van Rees on Nov 21, 2007 10:20 PM
Issue state: unconfirmed → open
Severity: Medium → Important
Target release: None → 1.1
Responsible manager: (UNASSIGNED) → maurits
Of course this goes fine for me. Sigh. Unicode errors like this are notoriously hard to debug. :-(

The only thing that goes slightly wrong for me is that each accented character in the subject gets shown as XX. A header inserted by a virus checker sheds some light here:

 X-Amavis-Alert: BAD HEADER Non-encoded 8-bit data (char C3 hex): Subject: ...:
        Testo con lettere accentate: \303\240 \303\250 \303\254 \303\262
        \303\271\n

When I paste that subject into a python prompt and print it, those accented characters show up again. So something goes right at least, but we should not be sending out bad headers.

When I try "rstText.encode('utf-8','replace')" like you do above I get an error. Do you mean 'decode' instead of 'encode'? That at least passes.

(Pdb) rstText.encode('utf-8','replace')
*** UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 88: ordinal not in range(128)
(Pdb) rstText.decode('utf-8','replace')
u'A new response has been given to the issue **unicode test: Testo con lettere accentate: \xe0 \xe8 ...'

Hm, but that is unicode, which is not accepted by MIMEText again...
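The UnicodeDecodeError in the pdb session above comes from calling encode on a byte string: Python 2 first decodes the bytes with the 'ascii' codec behind the scenes, and that hidden step is what fails. Python 3 has no such implicit step, so the sketch below (Python 3, with a hypothetical helper name) performs the ASCII decode explicitly to show the same failure, followed by the decode that works.

```python
def implicit_ascii_decode_fails(raw):
    """Return True if `raw` cannot be decoded as ASCII.

    This mimics the hidden step Python 2 performs when .encode() is called
    on a byte string that contains non-ASCII bytes.
    """
    try:
        raw.decode("ascii")
    except UnicodeDecodeError:
        return True
    return False


# UTF-8 bytes, comparable to a Python 2 byte string with accented characters.
raw = "lettere accentate: à è".encode("utf-8")

print(implicit_ascii_decode_fails(raw))   # the hidden ASCII step fails
print(raw.decode("utf-8", "replace"))     # decoding with the real charset works
```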


BTW, instead of hardcoding 'utf-8' there it is better to use the already defined charset. That gets the encoding from your site, although I would expect that to be utf-8 too in most cases. Can you go in with a pdb and tell me what value charset has at that point?


There were some commented out tests for email sending in Poi. I uncommented them. And I added two tests for unicode characters (well, utf-8 at least, which is not really the same). They pass for me. Can you run them?
Added by (anonymous) on Nov 22, 2007 04:46 PM
My line is actually:

 textPart = MIMEText(rstText.encode('utf-8','replace'), 'plain', charset)
 
I got inspiration from http://mail.python.org/[…]/353346.html

There is something like this at http://mail.python.org/[…]/244870.html

By the way, that is where I took the 'replace' parameter from.

As you can see, I changed the hardcoded 'utf-8' to charset shortly after I wrote the issue.

Now for the painful part... I'm sorry, but I do not understand what you mean by "pdb". I guess it is something like "point of debug" or "plone debug", but could you tell me more about it? I will google for it myself, but not this evening, because I have another job to deal with.

Tomorrow I'll also run the tests you asked me to run.
Added by Maurits van Rees on Nov 22, 2007 06:56 PM
Issue state: open → in-progress
The current Poi code gives you problems. Your code gives me problems. So after reading the links you posted I think we need an extra check:

+    if isinstance(rstText, unicode):
+        rstText = rstText.encode(charset, 'replace')
+

In my case the if-statement is False so for me it keeps going fine. In your case the if-statement is True so your text will be encoded. I hope this works. Committed in r54328. Can you try again?
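The check above can be sketched as a small standalone function. This is a Python 3 rendering, where `str` plays the role of Python 2's `unicode` and `bytes` the role of a Python 2 byte string; the name `ensure_encoded` is an assumption for illustration and does not appear in Poi.

```python
def ensure_encoded(text, charset="utf-8"):
    """Return bytes: encode text strings, pass byte strings through untouched."""
    if isinstance(text, str):
        # 'replace' substitutes '?' for characters the charset cannot
        # represent instead of raising UnicodeEncodeError.
        return text.encode(charset, "replace")
    return text


print(ensure_encoded("è"))              # text gets encoded
print(ensure_encoded(b"\xc3\xa8"))      # bytes are left alone
print(ensure_encoded("“", "latin-1"))   # curly quote is not in Latin-1
```

Passing bytes through unchanged assumes they are already in the target charset; the function cannot verify that, which is one reason the two sides of this thread could see different results from the same code.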

About the pdb: that is the python debugger. You can add this somewhere in your python code:

  import pdb; pdb.set_trace()

Then you start up Zope in the foreground. Now when the execution of the python code gets to the point where you added this statement, you will be presented with a python prompt. Then you can inspect variables. Just type self or charset or rstText and see what values they have. Press ? for help. Press c to continue.

I am interested in the charset. But actually the easiest way is to look that up in the Zope Management Interface, in portal_properties/site_properties/default_charset. In my case (and also when running the tests) this is 'utf-8'.
Added by (anonymous) on Nov 22, 2007 07:40 PM
Wonderful, I learned something new! Thank you.

Here is a quick response before I deal with my little removal.

Your code works for me. But the generated email message still has wrong characters.

Here are the Python debugger results:

(Pdb) print self
<PoiTracker at poitracker.2007-08-27.0439070068>

(Pdb) print charset
utf-8

(Pdb) print rstText
A new response has been given to the issue **Poi e codifica Unicode**
in the tracker **Plone 3.0** by **Amministratore**.

Response Information
--------------------

Issue
  Poi e codifica Unicode (http://cms.web.bose/tecnoteca/problemi/poitracker.2007-08-27.0439070068/4)


**Response Details**::

    Ã¢ÂÂTesto con lettere accentate: àÚ ì ò ùâÂÂ

\* This is an automated email, please do not reply - Amministratore Tecnoteca
Added by Stefano Deponti on Nov 22, 2007 07:47 PM
I'm sorry. I have only now realized that I wasn't logged in. The anonymous author of the last responses was, of course, me.
Added by Stefano Deponti on Nov 23, 2007 08:07 PM
I put "import pdb; pdb.set_trace()" in PoiTracker.py at line 434, just before:

         if isinstance(rstText,unicode):

If I type:

(Pdb) rstText = "“Testo con lettere accentate: à è ì ò ù”."

I get a correct email message and a correct response in Poi.

Should we check PoiResponse.py?
Added by Stefano Deponti on Nov 23, 2007 08:33 PM
I put two breakpoints in PoiResponse.py, one at line 460 and the other at line 471.

Then I looked at responseDetails:

(Pdb) print responseDetails
“Testo con lettere accentate: à è ì ò ù”.

And then I looked at mailText:

(Pdb) print mailText
A new response has been given to the issue **Poi e codifica Unicode**
in the tracker **Plone 3.0** by **Amministratore**.

Response Information
--------------------

Issue
  Poi e codifica Unicode (http://cms.web.bose/tecnoteca/problemi/poitracker.2007-08-27.0439070068/4)


**Response Details**::

    âTesto con lettere accentate: à è ì ò ùâ.

\* This is an automated email, please do not reply - Amministratore Tecnoteca


So there is something wrong in self.poi_email_new_response.

Added by Stefano Deponti on Nov 24, 2007 06:54 AM
Maurits, I ran the tests you asked for in your first response.

I typed:

    $ bin/zopectl test -s Products.Poi

Is that right?

The output is in the file attached here.
Added by Stefano Deponti on Nov 24, 2007 09:11 AM
I changed PoiResponse.py at line 470:

        if isinstance(mailText, unicode):
            tracker.sendNotificationEmail(addresses, subject, mailText.encode('latin1', 'replace'))
        else:
            tracker.sendNotificationEmail(addresses, subject, mailText)

and both the email and the response came out right (correctly encoded, I mean).

I know I hardcoded the 'latin1' encoding, but I don't know how to do it otherwise.

I suppose the patch in PoiTracker.py is no longer needed, is it?


Added by Maurits van Rees on Nov 24, 2007 01:29 PM
Hi Stefano,

Sorry, I have not got time over this weekend to really look at this. But it looks like your fix is good. Well, it should not be latin-1, but I know how to change that. I will see about fixing this next week, hopefully so it works for everyone.

Is sending of new issues going well for you? You can fill in more fields there, so there are more possibilities for things to go wrong with accented characters.

That one test failure you saw is a problem with intelligenttext. I fixed that in both Products.intelligenttext and plone.intelligenttext a week or two ago. So that should fix itself on Plone 3 with a new Plone release.

That the other tests pass, also without your fix, shows that we are not testing those accented characters well enough. I want to try to improve that, but this may be hard.

[Note to self: keep the default charset utf-8 but try adding a response with latin-1 in the tests.]
Added by Stefano Deponti on Nov 26, 2007 03:52 PM
I think you should apply the same patch in content/PoiIssue.py (line 563) and Extensions/poi_issue_workflow_script.py (line 75).

Added by Maurits van Rees on Nov 27, 2007 11:52 PM
Hi Stefano,

I added your changes (slightly adapted) in r54582. For me those changes make no difference at all though. And I have not managed to create any tests that fail with the previous code and pass with the current code. So it is hard to reproduce this completely and hard to show that the bug is gone.

I hope this fixes it for you. Can you try again?

BTW, are you looking at the plain text emails or at the html emails? Is there any difference in how the characters render for you in those two versions?

For me, the only problem I still see is that the subject has illegal characters when the issue title has accented characters. I have no idea how to solve that.
Added by Stefano Deponti on Nov 29, 2007 06:53 PM
Hello Maurits,

I tested the new revision a little.

As for the text in the mail message, your code doesn't work for me. I still get bad characters instead of accented vowels and so on. I read the mail messages in Mozilla Thunderbird, so I can view them in either HTML or plain-text mode: in both views I see bad characters.

I think this is because in my environment “charset” is “utf-8” and so the line

  mailText = mailText.encode(charset, 'replace')

doesn't encode “mailText” in a character set suitable for “sendNotificationEmail”.

On the other hand, I can see the right characters in the subject of the email (no illegal characters).

I tried to patch your code with my hardcoded “latin1” lines (poi_issue_workflow_scripts.py lines 78-80; PoiIssue.py lines 565-567; PoiResponse.py lines 473-475), and things went well for me. The characters in the subject are also still right. But I don't know why.

By the way, I confirm that the patch to issue #151 works (it also works for issue #153, of course; I’m sorry for the duplicate, but sometimes the Internet is a jungle).

P.S.: in the meantime I upgraded to Plone 3.0.3.
Added by Maurits van Rees on Nov 29, 2007 10:49 PM
Hello Stefano,

Your charset is utf-8 and you get wrong characters in the email body but good characters in the subject.

My charset is also utf-8 and I get right characters in the email body but wrong characters in the subject.

Crazy...

As you realize, hardcoding latin 1 in Poi is not an option. The only sane default would be utf-8. You could try setting the charset of your Plone Site to latin-1.

But wait, isn't there an email_charset property somewhere? Ah, Plone 3.0 adds that to the property sheet of the portal during migration. You could set that one to latin-1 and keep the default_charset utf-8 if you want. (Users on Plone 2.5 could just add that property to the portal by hand.)

Poi could then try to get this email_charset first, and only use the default_charset if this property is not available. I just added that change in r54705. I only changed PoiTracker.py. Maybe this fix needs to go in some other spots as well that deal with encoding. Actually, the spots I am thinking about (like in PoiResponse.py) are superfluous: they deal with encoding and after that they call tracker.sendNotificationEmail(). But that method already deals with encoding. So those other lines can be removed. Done in r54706.
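The lookup order described above (email_charset first, default_charset only if that is missing) can be sketched as follows. Plain dicts stand in for the portal's property sheet and for site_properties; that substitution, the function name `pick_email_charset` and the final 'utf-8' fallback are assumptions for illustration, not the code committed in r54705.

```python
def pick_email_charset(portal_props, site_props, fallback="utf-8"):
    """Prefer the mail-specific charset, then the site default, then a fallback.

    Both arguments are plain dicts standing in for Plone property sheets;
    this is not the Plone property API.
    """
    charset = portal_props.get("email_charset") or site_props.get("default_charset")
    return charset or fallback


# Portal with an email_charset property: it wins over default_charset.
print(pick_email_charset({"email_charset": "latin-1"}, {"default_charset": "utf-8"}))

# Site without email_charset: default_charset is used instead.
print(pick_email_charset({}, {"default_charset": "utf-8"}))
```

Keeping the lookup in one place fits the observation that only sendNotificationEmail needs to deal with encoding.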

BTW, if I have utf-8 as charset (default or email) and I add a Response with text like "hälló" then the email body is just fine. When I pick latin-1 as charset, the email body gets weird text like "A~A" instead of an accented character. I highly suspect for you it will be the other way around...
Added by Stefano Deponti on Nov 30, 2007 01:27 PM
Hi Maurits

I tried to set email_charset to “latin-1”. Actually I didn't have this property in the site properties, so I added it at http://mysite.example.org/[…]/manage_propertiesForm (perhaps this property isn't set in a fresh Plone 3.0 installation?).

I installed the new Poi code, of course. I added a response to an issue: no mail came. Wow. But the response published on Plone was fine.

So I tried “import pdb; pdb.set_trace()” in PoiTracker.py at line 448.

First of all, I saw that email_charset was still set to “utf-8”, so my change was wrong (isn't the property at http://mysite.example.org/[…]/manage_propertiesForm?).

Secondly, it seems that it sends emails only to the mailing list. That is different behavior from the last version of Poi. Do you agree with that?

Added by Maurits van Rees on Nov 30, 2007 04:10 PM
The email_charset should be a property of the portal. So do not look in portal_properties, but in the Properties tab of the portal.

There should be no change in to whom the emails are being sent. For possible reasons why emails are not sent, see issue #149.
Added by Stefano Deponti on Nov 30, 2007 05:52 PM
Ok, you're right.

I succeeded in changing email encoding to latin1, but if it is so the email are latin1 encoded (and it's not right).

So I did another test. I set email_charset back to 'utf-8'. Then I added to poi_email_new_response.dtml a line with an accented vowel:

   Questa \xc3\xa8 un\xe2\x80\x99accentata.

Maurits, you are free to not believe me, but it works! I receive emails with the right characters.

Could you tell me why?

I attached here my poi_email_new_response.dtml.

Added by Maurits van Rees on Dec 01, 2007 12:44 AM
Hi Stefano,

You wrote: "I succeeded in changing email encoding to latin1, but if it is so the email are latin1 encoded (and it's not right)." But when you changed the encoding to latin-1 in the code itself one or two responses earlier it went fine for you. So that is strange.

On your test with adding accented characters in poi_email_new_response.dtml: the dtml page is probably fine with just ascii I think. When a response with accented characters is filled in, or it is a response to an issue with accented characters (as the title is filled in), the encoding changes. Adding accented characters directly into the dtml page does the same thing. But it kind of looks like there is a difference in the encoding used in those cases.

Here is some testing in a python prompt. I will not claim I understand every line here. :-)

>>> "àccènted title"
'\xc3\xa0cc\xc3\xa8nted title'
>>> u"àccènted title"
u'\xe0cc\xe8nted title'
>>> text = "Title: %s."
>>> text
'Title: %s.'
>>> text % "normal title"
'Title: normal title.'
>>> text % u"normal title"
u'Title: normal title.'
>>> text % "àccènted title"
'Title: \xc3\xa0cc\xc3\xa8nted title.'
>>> text % u"àccènted title"
'Title: u\xc3\xa0cc\xc3\xa8nted title.'
>>> "àccènted title" == u"àccènted title".encode('utf-8')
True
>>> "àccènted title".decode('utf-8') == u"àccènted title"
True
>>> "àccènted title".decode('latin-1') == u"àccènted title"
False
>>> "àccènted title".decode('latin-1')
u'\xc3\xa0cc\xc3\xa8nted title'
>>> u"àccènted title"
u'\xe0cc\xe8nted title'

I wonder if when you try this, an accented string decoded in latin-1 is the same as the unicode version, instead of utf-8 as it is for me.

The thing to note is probably that a unicode object is the same for you and for me; at least it should be if I understand this stuff correctly; I am learning as I go along. :-) But an (accented) string can easily have a different encoding, even when copy-pasted from me to you or the other way around.
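The point that one text object can correspond to different byte strings, and that the same bytes read back differently under different charsets, can be reproduced in a few lines. This sketch uses Python 3, where text (`str`) and bytes are separate types, so the session differs slightly from the Python 2 prompt above.

```python
title = "àccènted title"

# The same text gives different bytes depending on the charset.
utf8_bytes = title.encode("utf-8")      # two bytes per accented letter
latin1_bytes = title.encode("latin-1")  # one byte per accented letter

# Decoding with the charset that was used for encoding restores the text.
right = utf8_bytes.decode("utf-8")

# Decoding UTF-8 bytes as Latin-1 never raises, because every byte is a valid
# Latin-1 character, but each accented letter turns into two wrong characters.
wrong = utf8_bytes.decode("latin-1")

print(repr(utf8_bytes))
print(repr(latin1_bytes))
print(repr(right))
print(repr(wrong))
```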
Added by Stefano Deponti on Dec 01, 2007 03:07 PM
Hi, Maurits.

First of all, the good news: both emails and responses now show the correct characters. I don't know why it seemed to me yesterday that things went well only if I changed poi_email_new_response.dtml. Perhaps I made a mistake somewhere. Today I reloaded the plain svn code and things went very well. I’m sorry.

About your first question, I can say that when I hardcoded 'latin1' in PoiTracker.py I didn't change line 441:

 textPart = MIMEText(rstText, 'plain', charset)
 
but I only changed line 437:

 rstText = rstText.encode('latin1', 'replace')

So the character set of the mime text was 'utf-8'. If charset gets its value from email_charset, the character set of the mime text will be the one of email_charset. This is the difference.
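One reading that is consistent with this observation, though it is an assumption and not something confirmed in the thread, is that the text object had been built from UTF-8 bytes mis-decoded as Latin-1 somewhere upstream. In that case encoding with 'latin1' restores the original UTF-8 bytes, which then match a MIME part declared as 'utf-8'. A Python 3 sketch of that scenario:

```python
original = "à è ì ò ù"

# Assumed upstream mistake: UTF-8 bytes interpreted as Latin-1 text.
mis_decoded = original.encode("utf-8").decode("latin-1")

# Encoding the garbled text as Latin-1 gives back the original UTF-8 bytes.
repaired_bytes = mis_decoded.encode("latin-1", "replace")

# A mail client told that the body is UTF-8 then shows the right text.
shown_with_latin1_patch = repaired_bytes.decode("utf-8")

# Encoding the garbled text as UTF-8 instead just preserves the garbling.
shown_with_utf8_encode = mis_decoded.encode("utf-8").decode("utf-8")

print(shown_with_latin1_patch == original)
print(shown_with_utf8_encode == original)
```

Under that assumption the hardcoded 'latin1' only worked because it undid an earlier error, which would explain why it could not serve as a general fix.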

For completeness, here are the results of the tests:

>>> "àccènted title"
'\xc3\xa0cc\xc3\xa8nted title'
>>> u"àccènted title"
u'\xe0cc\xe8nted title'
>>> text = "Title: %s."
>>> text
'Title: %s.'
>>> text % "normal title"
'Title: normal title.'
>>> text % u"normal title"
u'Title: normal title.'
>>> text % "àccènted title"
'Title: \xc3\xa0cc\xc3\xa8nted title.'
>>> text % u"àccènted title"
u'Title: \xe0cc\xe8nted title.'
>>> "àccènted title" == u"àccènted title".encode('utf-8')
True
>>> "àccènted title".decode('utf-8') == u"àccènted title"
True
>>> "àccènted title".decode('latin-1') == u"àccènted title"
False
>>> "àccènted title".decode('latin-1')
u'\xc3\xa0cc\xc3\xa8nted title'
>>> u"àccènted title"
u'\xe0cc\xe8nted title'
>>>

It looks like your result.

So maybe the issue is resolved?
Added by Maurits van Rees on Dec 01, 2007 11:15 PM
Hello Stefano,

I see one difference between us in those tests. I have:

>>> text % u"àccènted title"
'Title: u\xc3\xa0cc\xc3\xa8nted title.'

where you have as result:
u'Title: \xe0cc\xe8nted title.'

But when I try it now I have the same result as you. Why I had a different result earlier I do not know. I *do* know my earlier result made no sense whatsoever... :-)

Ah well, looks like we can close this issue indeed. Except for one thing: the subject of emails.

You never had problems with accented characters in the subjects of the Poi emails, right? I did have problems with them; somewhere along the line (likely in an smtp server) they were replaced by "XX" as they were not ascii characters. I made some changes to Poi so the subjects are now mime encoded. This was done in r54747. For me this now works fine. Can you check if it still works correctly for you?

Thanks,

Maurits
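MIME-encoding a subject, as mentioned in the post above, turns non-ASCII text into an ASCII-only "encoded word" that mail servers can pass along safely. The sketch below uses the standard library's email.header module in Python 3; the helper names `encode_subject` and `decode_subject` are assumptions for illustration, and this is not the change made in r54747.

```python
from email.header import Header, decode_header, make_header


def encode_subject(subject, charset="utf-8"):
    """Return an ASCII-only, MIME-encoded form of `subject` (hypothetical helper)."""
    return Header(subject, charset).encode()


def decode_subject(encoded):
    """Reverse encode_subject, as a mail client would when displaying the header."""
    return str(make_header(decode_header(encoded)))


subject = "Lettere accentate: à è ì ò ù"
encoded = encode_subject(subject)

print(encoded)                  # only ASCII characters, safe for a mail header
print(decode_subject(encoded))  # the original subject again
```

Because the encoded form contains no 8-bit data, a filter such as the one that produced the X-Amavis-Alert header earlier in the thread would have nothing to object to.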
Added by Stefano Deponti on Dec 03, 2007 08:28 PM
Hi Maurits,

I can confirm that all is going well, even after the last change. Right characters in the response, right characters in the email body and in the email subject, too. Fine!

I could verify with Thunderbird that the subject is mime encoded.

So, thank you, Maurits. I’ll drink a beer in your honour.
Added by Maurits van Rees on Dec 03, 2007 10:12 PM
Issue state: in-progress → resolved
That is good to hear, Stefano. I am glad we got it fixed. Thank you for providing so much thorough testing and quick feedback! That really helped in getting this solved.

I will join you with that beer and pick a "Leffe Blond". :-)
