Roundup Tracker - Issues

Issue 2550799

classification
provide basic support for handling html only emails
Type: rfe Severity: normal
Components: Mail interface Versions: 1.4
process
Status: fixed Resolution: fixed
Assigned To: rouilj Nosy List: ber, marlowa, rouilj
Priority: high Keywords: Effort-Medium

Created on 2013-03-07 04:39 by rouilj, last changed 2017-10-16 00:15 by rouilj.

Files
File name Uploaded Description
dehtml.py rouilj, 2013-03-07 04:39
unnamed marlowa, 2014-03-20 15:05
unnamed marlowa, 2014-03-20 15:07
Messages
msg4816 Author: [hidden] (rouilj) Date: 2013-03-07 04:39
Currently, when an html-only email is sent, roundup rejects the email.

We should make roundup extract the text from the html and post that
(possibly adding the html as an attachment).

To do this we need to change the mail gateway to find the html
portion of the email and convert to text. There are a few ways
to do the conversion:

  1) use an external program like links -dump
  2) use code like beautiful soup, nltk.clean_html()
  3) use the stupid little class/function attached that you
     can drop in utils as well if you wish.
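
For illustration, option 3 could look something like the following. This is a minimal sketch using only the Python 3 standard library, not the attached dehtml.py; the names TextExtractor and html_to_text are made up for this example, and the list of tags treated as line breaks is an assumption.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text from HTML, skipping script and style content."""

    SKIP = {"script", "style"}
    # Tags assumed to start a new line in the text output.
    BREAK = {"p", "br", "div", "li", "tr", "h1", "h2", "h3"}

    def __init__(self):
        super().__init__(convert_charrefs=True)  # entities arrive already decoded
        self.chunks = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1
        elif tag in self.BREAK:
            self.chunks.append("\n")

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            self.chunks.append(data)


def html_to_text(html):
    """Return the text content of an HTML string, one block per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = "".join(parser.chunks)
    # Collapse runs of whitespace inside each line and drop empty lines.
    lines = (" ".join(line.split()) for line in text.splitlines())
    return "\n".join(line for line in lines if line)
```

An external program or beautiful soup will likely cope better with badly broken markup; the point here is only that a small in-tree fallback is feasible.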
msg4817 Author: [hidden] (ber) Date: 2013-03-07 09:40
I agree, this was one of the wishes mentioned twice in our short user
survey.
msg4818 Author: [hidden] (rouilj) Date: 2013-03-07 14:07
In message <1362649218.01.0.546514816848.issue2550799@psf.upfronthosting.co.za>,
Bernhard Reiter writes:
>Bernhard Reiter added the comment:
>I agree, this was one of the wishes mentioned twice in our short user
>survey.
>
>----------
>keywords:  -Effort-Medium

Do you think it's a high effort task?

I gave it a medium rating because I think it's a task isolated to the
mail gateway. It's not trivial, as you will have to learn about the mail
gateway's code. However, understanding the gateway code to find out
where it exits if there is only an html part will tell you where to
hook the new code into the flow. Then converting the html to text and
creating a text/plain part should be person-days of work, not
person-weeks or longer, right?
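
As a rough illustration of that hook (not the actual mail gateway code), the following sketch uses the standard email package to prefer a text/plain part and fall back to converting the first text/html part. The function names and the crude tag-stripping converter are assumptions made for this example; a real converter would be plugged in through the convert argument.

```python
import email
import re
from html import unescape


def crude_html_to_text(html):
    # Placeholder converter (an assumption for this sketch):
    # drop tags, then decode entities.
    return unescape(re.sub(r"<[^>]+>", "", html)).strip()


def body_text(raw_message, convert=crude_html_to_text):
    """Return the text/plain body, or converted text/html if there is no plain part.

    Returns None when the message has neither.
    """
    msg = email.message_from_string(raw_message)
    html = None
    for part in msg.walk():
        if part.get_content_maintype() == "multipart":
            continue  # containers carry no body of their own
        if part.get_filename():
            continue  # treat named parts as attachments
        charset = part.get_content_charset() or "utf-8"
        ctype = part.get_content_type()
        if ctype == "text/plain":
            return part.get_payload(decode=True).decode(charset, "replace")
        if ctype == "text/html" and html is None:
            html = part.get_payload(decode=True).decode(charset, "replace")
    return convert(html) if html is not None else None
```

Decoding with the charset declared on the part is where non-ascii input is most likely to go wrong, so that line deserves its own test cases.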
msg4819 Author: [hidden] (ber) Date: 2013-03-07 15:05
(I removed the keyword by accident.)
msg5041 Author: [hidden] (marlowa) Date: 2014-03-19 09:25
This issue was brought up way back in 2007. See the thread at
http://sourceforge.net/p/roundup/mailman/message/13189177/. The thread
discusses some open source software called the ASCII-nator, which
converts HTML to ASCII for just such a case as this. The discussion sort
of fizzled out but I am pleased to see that other developers still
consider this to be an issue. AFAICS it has a simple solution with the
ASCII-nator.
msg5042 Author: [hidden] (ber) Date: 2014-03-20 10:54
Hi Andre,
as far as I can see, ASCII-nator still has license problems.
As John has pointed out, there are other solutions.
The solution chosen should be reasonably safe, of course. :)

So now we just need someone to do the work. >;)
Bernhard
msg5043 Author: [hidden] (rouilj) Date: 2014-03-20 13:40
Hi Bernhard:

In message <1395312895.49.0.517509067968.issue2550799@psf.upfronthosting.co.za>,
Bernhard Reiter writes:
>as far as I can see, ASCII-nator still has license problems.

Is the problem you are referring to GPL V3's more restrictive license
and viral nature? The current roundup/zope page templates license is
more permissive and almost BSD-like.

Does the GPL V3 kick in on roundup code if we include ASCII-nator
source and call it as an external program, as I suggested with links
-dump or whatever? Having a native python mechanism, even if accessed
via fork, seems to be a better alternative than a totally third-party
program that has to be built. (I am thinking of a windows install
here, but I realise that the fork mechanism may not be possible on
windows.)
msg5044 Author: [hidden] (ber) Date: 2014-03-20 14:43
On Thursday 20 March 2014 at 14:40:50, John Rouillard wrote:
> Is the problem you are referring to GPL V3's more restrictive license
> and viral nature. 

It is more like a vaccination effect, if you ask me. :)
Yes, I believe that there may be a problem and roundup or a solution built on
roundup could be considered a derived work. So when in doubt, we should
consider alternative solutions.

The subprocess would work on windows, but I don't think it is a particularly
nice technical solution. So I recommend looking at the alternatives first.
msg5045 Author: [hidden] (marlowa) Date: 2014-03-20 15:05
On 20 March 2014 14:43, Bernhard Reiter <issues@roundup-tracker.org> wrote:

>
> Bernhard Reiter added the comment:
>
> On Thursday 20 March 2014 at 14:40:50, John Rouillard wrote:
> > Is the problem you are referring to GPL V3's more restrictive license
> > and viral nature.
>
> It is more like a vaccination effect, if you ask me. :)
>

It has been compared to taking a cutting. One takes a cutting consciously,
knowing what will happen, whereas a virus is caught by accident.

> Yes, I believe that there may be a problem and roundup or a solution built
> on roundup could be considered a derived work. So when in doubt, we should
> consider alternative solutions.
>

Maybe this was the view back in 2007.

>
> The subprocess would work on windows, but I don't think it is a particularly
> nice technical solution. So I recommend looking at the alternatives first.
>

I think we might be able to go with what was suggested back in 2007, namely
that the code could try to do the import and use it if successful. The
documentation could mention that the ASCII-nator will be used if present but
that its absence is not harmful. Thus the ASCII-nator could be installed on
the same system as roundup, and roundup may use it if present, but it doesn't
matter that the two pieces of software have different licences.


-- 
Regards,

Andrew Marlow
http://www.andrewpetermarlow.co.uk
msg5046 Author: [hidden] (marlowa) Date: 2014-03-20 15:07
msg5150 Author: [hidden] (rouilj) Date: 2014-10-18 02:13
Another (sadly also GPL V3) choice is:

   https://github.com/aaronsw/html2text

which produces markdown from html (given that markdown is safer
than reStructured text it may be a better choice for the conversion).

Then convert to reStructured text (maybe

   pandoc --from=markdown --to=rst --output=message.rst message.md

could work.)

In any case, when saved as a file the mime type could be
text/reStructured text and if the libraries are present,
the message could be converted to html.

If anybody decides to do this, make sure to secure
the conversion according to:

 http://docutils.sourceforge.net/docs/howto/security.html
msg6027 Author: [hidden] (rouilj) Date: 2017-10-10 23:16
I am adapting the patch at: 

https://sourceforge.net/u/iippolitov/roundup/ci/2ee03ad0b0a5edbb8e68763fbf03a1032cf8a83d/

from Igor Ippolitov which uses beautiful soup 4 to do the html 
processing.

I can't get the debian python-bs4 to work right, so I am merging his 
patch with the html2text code in dehtml.py attached to this issue.

The patch currently attempts to load beautiful soup and if it gets an 
import error will fall back to using dehtml.py.

I am currently working on the test cases and so far all existing tests
now pass. The new test cases I have (an email with one text/html part, and
one multipart email with text/csv and text/html parts) seem to work for
ascii. I am having issues with character representations for international
chars.

Does anybody have some time to test this code and see if it
at least doesn't break anything and may be useful for turning html
into text?

I still need to add a three-valued config option to select the converter
or turn the conversion off:

  beautifulsoup, dehtml, none

before I do the full commit.
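
To make the intended behaviour concrete, here is a rough sketch of how such an option could map to a converter, including the fall back on import error described above. It is not the committed code: pick_converter and _fallback_convert are hypothetical names, and the fallback is a crude stand-in for dehtml.py, whose contents are not shown in this issue.

```python
import re
from html import unescape


def _fallback_convert(html):
    # Stand-in for the dehtml.py converter (an assumption for this sketch):
    # drop tags, then decode entities.
    return unescape(re.sub(r"<[^>]+>", "", html)).strip()


def pick_converter(option):
    """Return an html-to-text callable for 'beautifulsoup', 'dehtml' or 'none'.

    'none' returns None, meaning html parts are not converted at all.
    """
    if option == "none":
        return None
    if option == "dehtml":
        return _fallback_convert
    if option == "beautifulsoup":
        try:
            from bs4 import BeautifulSoup
        except ImportError:
            # beautiful soup 4 is not installed: fall back quietly.
            return _fallback_convert
        return lambda html: BeautifulSoup(html, "html.parser").get_text()
    raise ValueError("unknown converter option: %r" % option)
```

Doing the import inside the function keeps beautiful soup an optional dependency: a tracker without it installed still starts and still converts html mail.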
msg6033 Author: [hidden] (rouilj) Date: 2017-10-14 14:12
Committed a first pass at this in rev e20f472fde7d.

Commit hg5306:91354bf0b683 fixed a bug found after looking at code
coverage and testing some missed code paths. Plus hg5307:5b4931cfc182
added a test for the entity conversion code path in the dehtml routine.

Using beautiful soup 4 is enabled but I couldn't develop tests for it,
so mileage may vary.
msg6037 Author: [hidden] (rouilj) Date: 2017-10-16 00:15
This is another solution, by twb, using lynx with options to set
the encoding. I wanted to record it here for future reference,
as it uses the same -dump approach I originally suggested (there
with links).

import email
import subprocess
import syslog


def strip_html():
    '''Parse global "message_string" variable as a MIME message.
       Look for text/html MIME objects,
       run lynx -dump on them,
       insert them as text/plain MIME objects.
       Return the new message (as a message object, not a string).

       NB: this often results in a multipart/alternative branch with TWO text/plain leaves,
       but that actually seems to work out pretty well.'''

    # message_string is a module-level global that must be set before this runs.
    message_object = email.message_from_string(message_string)

    # NB: this is the earliest point we can extract header fields.
    syslog.syslog('SUBJECT {}'.format(message_object.get('Subject', 'No Subject')))
    syslog.syslog('MESSAGE-ID {}'.format(message_object.get('Message-ID', 'No Message-ID')))

    # NB: this walk traverses ALL nodes in a flattened tree,
    # so we do not need to manually recurse on branch nodes.
    # (unlike perl). --twb, Sep 2015
    for part in message_object.walk():
        if 'text/html' == part.get_content_type():
            syslog.syslog('STRIPPED an html part')
            # Pipe it through lynx to render as plain text.
            #
            # NOTE: postfix runs maxwell with C (not C.UTF-8) locale!
            # This breaks non-ASCII for things like open(mode='wt') and check_output(universal_newlines).
            # As a workaround leave everything as b'' bytes and pass encoding hints to lynx and set_payload.
            #
            # NOTE: It is very unintuitive, but
            #           get_payload(decode=False) ⇒ u'…'
            #           get_payload(decode=True)  ⇒ b'…'
            #       This is because decode=True decodes only C-T-E: base64 (or quoted-printable);
            #       the e.g. ISO-8859-1 to Unicode decoding happens later!
            output = subprocess.check_output(
                ['lynx',
                 '--dump',
                 '--stdin',
                 '--assume-charset', part.get_content_charset(failobj='UTF-8')],
                universal_newlines=False,  # KLUDGE: see above
                input=part.get_payload(decode=True),
                env={'LC_ALL': 'C.UTF-8'})
            # Edit the part to be plain text.
            # FIXME: instead of editing the old MIME object,
            # *delete* it and create a new text/plain object.
            # The only part we might want to keep is inline vs. attachment (disposition)?
            # But how do I *do* that during a .walk()? --- it's not a foldr!
            del part['Content-Type']
            del part['Content-Transfer-Encoding']  # alloc #31444
            part['Content-Type'] = 'text/plain'
            part.set_payload(output, 'UTF-8')
            # FIXME: should append '.txt' to filename=fred.html where present.
            # I can't see how to change it without also overriding the disposition,
            # which is undesirable. --twb, Sep 2015

    return message_object
History
Date User Action Args
2017-10-16 00:15:58 rouilj set messages: + msg6037
2017-10-14 21:09:26 rouilj set status: open -> fixed; resolution: fixed
2017-10-14 14:12:02 rouilj set messages: + msg6033
2017-10-10 23:16:26 rouilj set status: new -> open; assignee: rouilj; messages: + msg6027
2014-10-18 02:13:45 rouilj set messages: + msg5150
2014-03-20 15:07:02 marlowa set files: + unnamed; messages: + msg5046
2014-03-20 15:05:17 marlowa set files: + unnamed; messages: + msg5045
2014-03-20 14:43:15 ber set messages: + msg5044
2014-03-20 13:40:50 rouilj set messages: + msg5043
2014-03-20 10:54:55 ber set messages: + msg5042
2014-03-19 09:25:32 marlowa set nosy: + marlowa; messages: + msg5041
2013-03-07 15:05:03 ber set keywords: + Effort-Medium; messages: + msg4819
2013-03-07 14:07:11 rouilj set messages: + msg4818
2013-03-07 09:40:17 ber set keywords: - Effort-Medium; priority: normal -> high; messages: + msg4817; nosy: + ber
2013-03-07 04:39:37 rouilj set title: rovide basic support for handling html only emails -> provide basic support for handling html only emails
2013-03-07 04:39:26 rouilj create