Roundup Tracker - Issues

Issue 1381559

classification
text/plain nosy attachmt can miss encoding/charset(w kludge)
Type: Severity: normal
Components: Mail interface Versions:
process
Status: closed Resolution: works for me
Assigned To: richard Nosy List: a1s, ber, richard
Priority: normal

Created on 2005-12-15 12:39 by ber, last changed 2006-01-30 16:58 by ber.

Files
File name Uploaded Description
roundup-20051214-fix-text-plain-attachment-encoding-for-nosy-emails.diff ber, 2005-12-15 12:39 Patch with workaround and description
Messages
msg2070 Author: [hidden] (ber) Date: 2005-12-15 12:39
When you have a type="text/plain" file attachment,
the nosy mailer does not apply any transfer encoding
and always uses 7bit.
This is wrong for texts with umlauts.

My patch has a workaround and outlines a better
solution. It should apply to CVS from yesterday.
msg2071 Author: [hidden] (a1s) Date: 2005-12-25 16:31
i am sorry, i do not think that charset guessing as in the
attached patch is the right thing.  if a charset must be
specified, it should be explicitly set in the file mime type
(you can use database detectors to apply site defaults).

committed in the HEAD branch (will appear in 0.9) is a fix
that checks if a text/plain attachment can be 7bit-encoded,
and uses quoted-printable encoding if it cannot.
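
[Editor's sketch] The check described above can be illustrated roughly as follows. This is not the code committed to Roundup; the function name encode_text_attachment is hypothetical, and "can be 7bit-encoded" is simplified here to "is pure ASCII" (a full check would also look at line lengths).

```python
import quopri
from typing import Tuple


def encode_text_attachment(data: bytes) -> Tuple[str, bytes]:
    """Return (Content-Transfer-Encoding, encoded body) for a text attachment.

    Hypothetical helper: pure-ASCII content is sent as 7bit,
    anything else is encoded as quoted-printable.
    """
    try:
        data.decode("ascii")
    except UnicodeDecodeError:
        # 8-bit bytes present: 7bit would be wrong, so fall back to QP.
        return "quoted-printable", quopri.encodestring(data)
    return "7bit", data


if __name__ == "__main__":
    print(encode_text_attachment(b"plain ascii text"))
    print(encode_text_attachment("Grüße".encode("iso-8859-15")))
```

Note that this only makes the transfer safe. It says nothing about which charset the bytes are in, which is the point debated in the rest of this thread.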
msg2072 Author: [hidden] (ber) Date: 2005-12-27 10:49
Hi Alexander,
thanks for answering and helping to get this bug fixed.

I completely agree with you that charset guessing is
the wrong solution, but actually it _is_ a solution.
Unless the charset is saved somewhere, I do not see a better
solution, though.
Guesses about the encoding will not fix the bug that emails
are sent out with the wrong charset (= text encoding).
Thus this bug probably should still be open.

If you suggest saving the charset as part of the filetype
string:
This could potentially be a solution, but it might break the
current assumption about what is in this string. The charset
usually is not part of the MIME-Type, but of the email
Content-Type.

This is why I propose to save the charset in an extra field
if the filetype is "text/plain". This would be a larger change
to roundup, as all input and output channels would need to be
checked to save the charset or guess it instead.

Also I do not fully understand what you mean by "applying
site defaults". A good site will get utf-8 and latin-1
encoded text attachments and must record that encoding and
possibly recode it on a few occasions.

        Bernhard
msg2073 Author: [hidden] (a1s) Date: 2005-12-27 11:15
i do not understand what you mean by differentiating
"MIME-Type" and "Content-Type".  i am not aware of any
"MIME-Type" other than content type.

charset is a perfectly acceptable parameter of the MIME content
type, so the type property of files in the classic schema is the
right place to store the name of the character set.

by "site defaults" i mean that latin1 may be a mostly correct
guess at your site, but is absolutely incorrect for my site
where text attachments most probably will be encoded with
cp866 or cp1251.

i consider this bug fixed and closed, but won't close it
again if you insist on having it open.
msg2074 Author: [hidden] (ber) Date: 2005-12-30 07:40
Hi Alexander,

first thanks for not insisting on closing the bug.
Let me try to explain why I believe that the change you
have described does not solve the problem.

Here is a case where things go wrong:
Let us assume your site default is cp1251, so you want
to write in cyrillic.

You get one (a) text/plain attachment which is cyrillic
with charset utf-8 and another one (b) which is cyrillic
with charset cp1251. 

Now they both get sent out with nosy, both probably cannot
be "mail-encoded" with 7-bit and with your fix will most
likely get encoded as quoted-printable.
But if you do not save the charset you probably will send
out a) with a charset of cp1251, which will break the text.
This is why I think the bug is not fully solved.
I have tried to change the subject to better reflect this.
What do you think?
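
[Editor's sketch] The scenario above can be reproduced in a few lines. The sample word and the helper name decode_with_site_default are illustrative assumptions; the point is only that the same text stored under two charsets cannot both be read correctly with one site-wide default.

```python
SAMPLE = "Привет"  # illustrative Cyrillic text, not from the report

# (a) arrives as utf-8, (b) arrives as cp1251
attachment_a = SAMPLE.encode("utf-8")
attachment_b = SAMPLE.encode("cp1251")


def decode_with_site_default(data: bytes, site_charset: str = "cp1251") -> str:
    """Decode bytes the way a reader would if only the site default is known."""
    return data.decode(site_charset, errors="replace")


if __name__ == "__main__":
    # (b) survives, (a) turns into mojibake
    print(decode_with_site_default(attachment_b) == SAMPLE)
    print(decode_with_site_default(attachment_a) == SAMPLE)
```

Quoted-printable transfer encoding changes nothing here: it round-trips the bytes faithfully, so the wrong charset label is still wrong on arrival.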

So we must save the charset to have a real fix.
My patch was a dirty workaround which fixed the problem
in an expensive way, but at one place.
If you replace "latin-1" with the EMAIL charset default,
it would be an even better kludge.

But of course a better solution should be implemented.
msg2075 Author: [hidden] (a1s) Date: 2005-12-30 07:56
if charset is not specified in file.type then the mail
attachment will not have a character set name either.  the MIME
type of the mail attachment is exactly what's saved in the type
field.

contemporary email clients should be able to cope with text
attachments without character set designator.

anyway, you can add charset to the type if you want to. 
please see detectors section in "customizing roundup" document.
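
[Editor's sketch] A detector along the lines suggested above might look like the following. It assumes the classic schema's file class with a "type" property and uses the auditor signature and init(db) hook from the Roundup customisation documentation. The constant SITE_DEFAULT_CHARSET and both helper names are hypothetical, and whether "type" is present in newvalues on creation may depend on the input channel.

```python
SITE_DEFAULT_CHARSET = "cp1251"  # assumption: chosen per site


def with_default_charset(mime_type, charset=SITE_DEFAULT_CHARSET):
    """Append a charset parameter to text/* types that lack one."""
    if not mime_type:
        return mime_type
    lowered = mime_type.lower()
    if lowered.startswith("text/") and "charset=" not in lowered:
        return "%s; charset=%s" % (mime_type, charset)
    return mime_type


def add_default_charset(db, cl, nodeid, newvalues):
    """Auditor: called before a file item is created."""
    if "type" in newvalues:
        newvalues["type"] = with_default_charset(newvalues["type"])


def init(db):
    # Register the auditor for newly created file items.
    db.file.audit("create", add_default_charset)
```

This applies one label to every text attachment, so it is only as good as the site default. It does not distinguish the utf-8 and cp1251 cases discussed earlier.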
msg2076 Author: [hidden] (ber) Date: 2005-12-30 08:00
Hi Alexander,

and now to the question of where to save the charset,
which seems unavoidable to save somewhere. ;-)

Reading RFC2046 the MIME maintype is "text" and
the subtype is "plain". "charset" would be a critical, but
optional parameter. So where do we save this within roundup?
Two ideas:
a) added as string to the filetype
b) creating a new parameter to the file class
   b.1) calling this new parameter "charset"
   b.2) making this new parameter more generic

Idea a) has the potential to break code that relies on the
assumption that all filetypes are of the form MAIN/SUB.
In addition it would always need more parsing to separate
the main- and subtypes from the parameters, if they are
needed separately.
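
[Editor's sketch] For idea a), the extra parsing does not have to be hand-written: Python's standard email package already splits a content type into main type, subtype and parameters. The wrapper name split_content_type is hypothetical, and this is not how Roundup itself handles the string.

```python
from email.message import Message


def split_content_type(stored_type):
    """Split a stored type string into (maintype, subtype, charset or None)."""
    msg = Message()
    msg["Content-Type"] = stored_type
    return (
        msg.get_content_maintype(),
        msg.get_content_subtype(),
        msg.get_content_charset(),  # lower-cased, None if absent
    )


if __name__ == "__main__":
    print(split_content_type("text/plain; charset=UTF-8"))
    print(split_content_type("image/png"))
```

Code that compares the stored string directly against "text/plain" would still need to be changed to compare the parsed main type and subtype instead.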

Idea b.1) would need a change in the schema and be specific
to "text" maintypes. "text" will be an important case, so it
might be fine.

Idea b.2) would be generic for all parameters that may
come up for any attachment type, so ideally it should be
implemented similar to a python dictionary.
Implementation and usage in the code would be more
complicated than in b.1.

In principle I do not care which solution is implemented,
as long as one is done, though. I have a tendency for the
b.1 or b.2 solutions, as I cannot judge if a) will break
anything and I do not like parsing the string each time I
want a parameter.

Thanks again for considering this
and I hope you will have a happy rollover!
   Bernhard
msg2077 Author: [hidden] (a1s) Date: 2005-12-30 08:10
assumption that no filetype has parameters is plain wrong. 
parameters were in content-type since rfc1049 - more than 15
years!
msg2078 Author: [hidden] (ber) Date: 2006-01-02 10:42
I did not write that "no filetype has parameters" 
and I know how to operate Roundup's detectors. 
Thanks for the lecture. 
 
My first post was about roundup needing to record
and then set the charset parameter for text/plain file
attachments on all occasions. This does not seem to be done
yet. Sending out an 8bit attachment without that parameter
calls for trouble. I did not look up whether latin-1 or utf-8
should be assumed in this case, but either way it is quite
likely to be wrong.

You have not answered my question whether your patch will
fix the scenario for the bug that I have described, btw.
Can I conclude that you agree that the charset parameter
should be saved and that this is not the case currently?
Then we only disagree on whether this should be done by default
or not. I say: Yes, it is a serious bug, as roundup is
unusable in environments where e.g. latin-1 and utf-8 based
texts are used. Umlauts break frequently and users
rightfully think that this is the software's fault.

My second post was about where to save the parameter.
Just because in emails this is saved in a body part header
field as a string in a parameter, roundup does not need to do
this. From your answer I conclude that you like method a)
best and do not care about the style of code when the
string is parsed. Also you want places within Roundup (not
within RFC anything) to break if they have made the
assumption that "type/subtype" is what they get as string?
msg2079 Author: [hidden] (a1s) Date: 2006-01-02 10:58
roundup does not need to record charset on all occasions. 
if you need that, you can do that with database detectors.

sending mail attachments without charset parameter in
content-type is not a bug.  8-bit characters with 7-bit
transfer encoding was a bug.  it is fixed now.

yes, i think that the best place to store charset is mime
type property.  but if you want to store character set name
separately in your tracker, you are free to do that.

there should be no places in roundup breaking if file type
contains parameters.  if there are such places, they must be
fixed.
msg2080 Author: [hidden] (ber) Date: 2006-01-02 11:03
So you do not think that the behaviour that I have outlined
is a bug? To me it clearly shows that roundup will have to
record the charset to be able to display those attachments
correctly in a browser or per email.

Not having a charset in an email is not a hard bug, but
it leads to a bug for the users, because the umlauts will
be broken. I have this on a live system.

I also have the problem with web browsers, btw:
because roundup does not know the charset, it cannot give
it to the browser, which will then display the texts wrongly.

How do you envision fixing the bad behaviour without saving
the charset? I created a case that will occur often in
non-US environments and will lead to broken behaviour (for
the users).
msg2081 Author: [hidden] (a1s) Date: 2006-01-02 11:23
if the file type contains a charset name it will be set in
content-type headers both in emails and in http displays.
no umlauts will be broken.
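
[Editor's sketch] To illustrate the claim above on the email side: if the stored type string is used verbatim as the Content-Type header, a receiving client can read the charset back and decode the body correctly. The function make_part is hypothetical and base64 is chosen only for brevity. This is not Roundup's mailer code.

```python
from email import encoders
from email.message import Message


def make_part(stored_type, data):
    """Build a MIME part whose Content-Type is the stored type string."""
    part = Message()
    part["Content-Type"] = stored_type
    part.set_payload(data)
    encoders.encode_base64(part)  # sets Content-Transfer-Encoding: base64
    return part


if __name__ == "__main__":
    part = make_part("text/plain; charset=iso-8859-15",
                     "Grüße".encode("iso-8859-15"))
    charset = part.get_content_charset()
    print(charset)
    print(part.get_payload(decode=True).decode(charset))
```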

if there is no charset name recorded, the user agent (email
program or web browser) lets the user select the correct
charset.  no umlauts are broken.

but there is no globally correct way for roundup to guess
the character set if it is not specified explicitly.  (and
with incorrect guesses umlauts will be broken for sure.)  if
you think there is a sitewide correct way to do that for your
site, please use database detectors.
msg2082 Author: [hidden] (ber) Date: 2006-01-02 13:33
If you are saying that the current version does record
the charset parameters when input comes from email or
http, then indeed the bug would be fixed to a large extent.
(My testing was done with a 0.7.x version where the
charset is not recorded.)

In addition I would say that a text/plain attachment
without charset is incomplete within Roundup, as Roundup by
definition should talk to several systems. So adding a
guess by default (no matter how it is done technically)
would be wise. To make the best guess, it would need the
full information about the input channel (web, http or
email). Is this available from the detectors?
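
[Editor's sketch] One common way to make such a default guess is to try strict UTF-8 first and fall back to a site default only when that fails. This is not the heuristic from the attached patch, and the names guess_charset and site_default are hypothetical. As noted earlier in the thread, no guess of this kind is globally correct.

```python
def guess_charset(data: bytes, site_default: str = "iso-8859-15") -> str:
    """Guess a charset label for undeclared text bytes.

    Pure ASCII is labelled us-ascii, valid UTF-8 is labelled utf-8,
    and everything else gets the (assumed) site default.
    """
    try:
        data.decode("ascii")
        return "us-ascii"
    except UnicodeDecodeError:
        pass
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return site_default


if __name__ == "__main__":
    print(guess_charset(b"hello"))
    print(guess_charset("Grüße".encode("utf-8")))
    print(guess_charset("Grüße".encode("iso-8859-15")))
```

The UTF-8 test is fairly reliable because most 8-bit legacy text is not valid UTF-8. The fallback cannot tell single-byte charsets such as latin-1, cp866 and cp1251 apart.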
msg2083 Author: [hidden] (ber) Date: 2006-01-30 16:58
Hi Richard, 
does closing the bug with "works for me" mean
that you have retested this with roundup 1.0 and you are
sure that the bug is gone?

I definitely saw these problems in an environment
where people use web browsers and email clients with
different locales (iso-8859-15 and utf8) and add attachments
with umlauts.
 
But if you say it is gone with 1.0, this would be cool! 
History
Date User Action Args
2005-12-15 12:39:38 ber create