Issue 2550799
Created on 2013-03-07 04:39 by rouilj, last changed 2017-10-16 00:15 by rouilj.
Files | ||
---|---|---|
File name | Uploaded | Description |
dehtml.py | rouilj, 2013-03-07 04:39 | |
unnamed | marlowa, 2014-03-20 15:05 | |
unnamed | marlowa, 2014-03-20 15:07 | |
Messages | |||
---|---|---|---|
msg4816 | Author: [hidden] (rouilj) | Date: 2013-03-07 04:39 | |
Currently, when an html-only email is sent, roundup rejects the email. We should make roundup extract the text from the html and post that (possibly adding the html as an attachment). To do this we need to change the mail gateway to find the html portion of the email and convert it to text. There are a few ways to do the conversion: 1) use an external program like links -dump; 2) use code like beautiful soup or nltk.clean_html(); 3) use the stupid little class/function attached, which you can drop into utils if you wish. |
|||
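A minimal sketch of the third approach (a small drop-in class), assuming only the Python standard library. This is not the attached dehtml.py; the names `TextExtractor` and `html_to_text` are made up for illustration, and the list of block-level tags is an arbitrary choice.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    SKIP = {"script", "style"}
    BLOCK = {"p", "br", "div", "li", "tr", "h1", "h2", "h3"}

    def __init__(self):
        # convert_charrefs=True turns entities such as &eacute; into characters.
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1
        elif tag in self.BLOCK:
            # Start block-level elements on a new line.
            self.chunks.append("\n")

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            self.chunks.append(data)


def html_to_text(html):
    """Return the text content of an HTML string, one block per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    joined = "".join(parser.chunks)
    # Collapse runs of whitespace and drop empty lines.
    lines = (" ".join(line.split()) for line in joined.splitlines())
    return "\n".join(line for line in lines if line)
```

Something like `html_to_text(body)` could then be called on the decoded html part before the message is stored; links, tables and lists are flattened rather than rendered.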
msg4817 | Author: [hidden] (ber) | Date: 2013-03-07 09:40 | |
I agree, this was one of the wishes mentioned twice in our short user survey. |
|||
msg4818 | Author: [hidden] (rouilj) | Date: 2013-03-07 14:07 | |
In message <1362649218.01.0.546514816848.issue2550799@psf.upfronthosting.co.za>, Bernhard Reiter writes: >Bernhard Reiter added the comment: >I agree, this was one the wishes mentioned twice in our short user >survey. > >---------- >keywords: -Effort-Medium Do you think it's a high effort task? I gave it a medium rating because I think it's a task isolated to the mail gateway. It is not trivial, as you will have to learn about the mail gateway's code. However, understanding the gateway code to find out where it exits if there is only an html part will tell you where to hook the new code into the flow. Then converting the html to text and creating a text/plain part should be person-days of work, not person-weeks or longer, right? |
|||
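The flow described above (detect that a message has only an html body, convert it, create a text/plain part) can be sketched with the standard `email` package. This is not roundup's mail gateway code; `split_bodies`, `is_html_only` and `html_part_as_text_part` are hypothetical helper names, and `convert` stands for whichever html-to-text function ends up being chosen.

```python
import email
from email.mime.text import MIMEText


def split_bodies(msg):
    """Return (plain_parts, html_parts) for the non-attachment text parts."""
    plain, html = [], []
    for part in msg.walk():
        if part.get_content_disposition() == "attachment":
            continue
        ctype = part.get_content_type()
        if ctype == "text/plain":
            plain.append(part)
        elif ctype == "text/html":
            html.append(part)
    return plain, html


def is_html_only(msg):
    """True when the message has an html body but no text/plain body."""
    plain, html = split_bodies(msg)
    return bool(html) and not plain


def html_part_as_text_part(part, convert):
    """Build a new text/plain part from an html part using convert(str) -> str."""
    charset = part.get_content_charset(failobj="utf-8")
    raw = part.get_payload(decode=True)  # bytes, transfer encoding removed
    text = convert(raw.decode(charset, errors="replace"))
    return MIMEText(text, "plain", "utf-8")
```

The gateway would call `is_html_only()` at the point where it currently gives up, then feed the new part into the normal text/plain handling.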
msg4819 | Author: [hidden] (ber) | Date: 2013-03-07 15:05 | |
(I removed the keyword by accident.) |
|||
msg5041 | Author: [hidden] (marlowa) | Date: 2014-03-19 09:25 | |
This issue was brought up way back in 2007. See the thread at http://sourceforge.net/p/roundup/mailman/message/13189177/. The thread discusses some open source software called the ASCII-nator, which converts HTML to ASCII for just such a case as this. The discussion sort of fizzled out but I am pleased to see that other developers still consider this to be an issue. AFAICS it has a simple solution with the ASCII-nator. |
|||
msg5042 | Author: [hidden] (ber) | Date: 2014-03-20 10:54 | |
Hi Andre, as far as I can see, ASCII-nator still has license problems. As John has pointed out, there are other solutions. The solution chosen should be reasonably safe, of course. :) So now we just need someone to do the work. >;) Bernhard |
|||
msg5043 | Author: [hidden] (rouilj) | Date: 2014-03-20 13:40 | |
Hi Bernhard: In message <1395312895.49.0.517509067968.issue2550799@psf.upfronthosting.co.za>, Bernhard Reiter writes: >as far as I can see, ASCII-nator still has license problems. Is the problem you are referring to GPL V3's more restrictive license and viral nature? The current roundup/zope page templates license is more permissive and almost BSD-like. Does the GPL V3 kick in on roundup code if we include ASCII-nator source and call it as an external program, as I suggested with links -dump or whatever? Having a native python mechanism, even if accessed via fork, seems to be a better alternative than a totally third-party program that has to be built. (I am thinking of a windows install here, but I realise that the fork mechanism may not be possible on windows.) |
|||
msg5044 | Author: [hidden] (ber) | Date: 2014-03-20 14:43 | |
On Thursday 20 March 2014 at 14:40:50, John Rouillard wrote: > Is the problem you are referring to GPL V3's more restrictive license > and viral nature. It is more like a vaccination effect, if you ask me. :) Yes, I believe that there may be a problem, and roundup or a solution built on roundup could be considered a derived work. So when in doubt, we should consider alternative solutions. The subprocess would work on windows, but I don't think it is a particularly nice technical solution. So I recommend looking at the alternatives first. |
|||
msg5045 | Author: [hidden] (marlowa) | Date: 2014-03-20 15:05 | |
On 20 March 2014 14:43, Bernhard Reiter <issues@roundup-tracker.org> wrote: > On Thursday 20 March 2014 at 14:40:50, John Rouillard wrote: > > Is the problem you are referring to GPL V3's more restrictive license > > and viral nature. > It is more like a vaccination effect, if you ask me. :) It has been compared to taking a cutting. One takes a cutting consciously, knowing what will happen, whereas a virus is caught by accident. > Yes, I believe that there may be a problem and roundup or a solution built on roundup could be considered a derived work. So when in doubt, we should consider alternative solutions. Maybe this was the view back in 2007. > The subprocess would work on windows, but I don't think it is a particularly nice technical solution. So I recommend looking at the alternatives first. I think we might be able to go with what was suggested back in 2007, namely that the code could try to do the import and use it if successful. The documentation could mention that the ASCIInator will be used if present but that its absence is not harmful. Thus the ASCIInator could be installed on the same system as roundup, and roundup may use it if present, but it doesn't matter that the two pieces of software have different licences. -- Regards, Andrew Marlow http://www.andrewpetermarlow.co.uk |
|||
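The "try the import and use it if successful" idea can be sketched as below. The module name `asciinator` and its `html2text` attribute are placeholders, since the thread does not say how the ASCIInator is imported; the regex fallback is deliberately crude and only stands in for a real converter.

```python
import importlib
import re


def load_optional_converter(module_name, attr="html2text"):
    """Return module.attr if the optional module is installed, else None."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return None
    return getattr(module, attr, None)


def crude_strip_tags(html):
    """Fallback used when no optional converter is available."""
    return " ".join(re.sub(r"<[^>]+>", " ", html).split())


def pick_converter(module_name="asciinator"):  # hypothetical module name
    """Prefer the optional third-party converter; its absence is not an error."""
    return load_optional_converter(module_name) or crude_strip_tags
```

With this shape the optional package is never bundled or required, which is the property the licensing discussion above turns on; whether that settles the licence question is a separate matter.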
msg5046 | Author: [hidden] (marlowa) | Date: 2014-03-20 15:07 | |
On 20 March 2014 15:05, Andrew Marlow <issues@roundup-tracker.org> wrote: > On 20 March 2014 14:43, Bernhard Reiter <issues@roundup-tracker.org> wrote: > > On Thursday 20 March 2014 at 14:40:50, John Rouillard wrote: > > > Is the problem you are referring to GPL V3's more restrictive license > > > and viral nature. > > It is more like a vaccination effect, if you ask me. :) > It has been compared to taking a cutting. One takes a cutting consciously, knowing what will happen, whereas a virus is caught by accident. > > Yes, I believe that there may be a problem and roundup or a solution built on roundup could be considered a derived work. So when in doubt, we should consider alternative solutions. > Maybe this was the view back in 2007. > > The subprocess would work on windows, but I don't think it is a particularly nice technical solution. So I recommend looking at the alternatives first. > I think we might be able to go with what was suggested back in 2007, namely that the code could try to do the import and use it if successful. The documentation could mention that the ASCIInator will be used if present but that its absence is not harmful. Thus the ASCIInator could be installed on the same system as roundup, and roundup may use it if present, but it doesn't matter that the two pieces of software have different licences. -- Regards, Andrew Marlow http://www.andrewpetermarlow.co.uk |
|||
msg5150 | Author: [hidden] (rouilj) | Date: 2014-10-18 02:13 | |
Another (sadly also GPL V3) choice is https://github.com/aaronsw/html2text, which produces markdown from html (given that markdown is safer than reStructuredText, it may be a better choice for the conversion). Then convert to reStructuredText (maybe pandoc --from=markdown --to=rst --output=message.rst message.md could work). In any case, when saved as a file the mime type could be text/reStructured text and, if the libraries are present, the message could be converted to html. If anybody decides to do this, make sure to secure the conversion according to: http://docutils.sourceforge.net/docs/howto/security.html |
|||
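If the pandoc step were driven from Python, it might look like the sketch below. It only wraps the command line quoted above; the function names are made up, and passing an argument list instead of a shell string is a choice made here so that file names are not interpreted by a shell.

```python
import shutil
import subprocess


def pandoc_md_to_rst_argv(src="message.md", dest="message.rst"):
    """Build the pandoc argument list for a markdown to reStructuredText run."""
    return ["pandoc", "--from=markdown", "--to=rst", "--output=" + dest, src]


def convert_md_to_rst(src="message.md", dest="message.rst"):
    """Run pandoc if it is installed; return False when it is not available."""
    if shutil.which("pandoc") is None:
        return False
    subprocess.run(pandoc_md_to_rst_argv(src, dest), check=True)
    return True
```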
msg6027 | Author: [hidden] (rouilj) | Date: 2017-10-10 23:16 | |
I am adapting the patch at: https://sourceforge.net/u/iippolitov/roundup/ci/2ee03ad0b0a5edbb8e68763fbf03a1032cf8a83d/ from Igor Ippolitov, which uses beautiful soup 4 to do the html processing. I can't get the debian python-bs4 to work right, so I am merging his patch with the html2text code in dehtml.py attached to this issue. The patch currently attempts to load beautiful soup and, if it gets an import error, falls back to using dehtml.py. I am currently working on the test cases and so far all existing tests now pass. The new test cases I have (an email with one text/html part, and one multipart with text/csv and text/html) seem to work for ascii. I am having issues with character representations for international chars. Does anybody have some time to test this code and see if it at least doesn't break anything and may be useful for turning html into text? I still need to add a three-valued config option to select/deselect the converter (beautifulsoup, dehtml, none) before I do the full commit. |
|||
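A rough sketch of how a three-valued option with an import fallback could be wired up. This is not the committed roundup code: the function names are invented, the option values are taken from the message above, and `_fallback_to_text` is a stand-in for dehtml.py built on the standard library.

```python
from html.parser import HTMLParser


class _Collector(HTMLParser):
    """Gather character data; entities are converted to characters."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def _fallback_to_text(html):
    """Stand-in for dehtml.py: text content with whitespace collapsed."""
    collector = _Collector()
    collector.feed(html)
    collector.close()
    return " ".join("".join(collector.parts).split())


def _bs4_to_text(html):
    """Convert with Beautiful Soup 4, which must be importable."""
    from bs4 import BeautifulSoup
    return BeautifulSoup(html, "html.parser").get_text(" ", strip=True)


def get_html_converter(option):
    """Map 'beautifulsoup', 'dehtml' or 'none' to a converter (or None)."""
    if option == "none":
        return None  # keep the old behaviour: html is not converted
    if option == "dehtml":
        return _fallback_to_text
    if option == "beautifulsoup":
        try:
            import bs4  # noqa: F401  (only checking availability)
        except ImportError:
            return _fallback_to_text
        return _bs4_to_text
    raise ValueError("unknown html converter option: %r" % (option,))
```

Working on decoded `str` input, as here, keeps the international character handling in one place: the bytes of the html part are decoded with the part's charset before either converter sees them.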
msg6033 | Author: [hidden] (rouilj) | Date: 2017-10-14 14:12 | |
Committed a first pass at this in rev e20f472fde7d. Commit hg5306:91354bf0b683 fixed a bug found after looking at code coverage and testing some missed code paths. Plus hg5307:5b4931cfc182 added a test for the entity conversion code path in the dehtml routine. Using beautiful soup 4 is enabled, but I couldn't develop tests for it, so mileage may vary. |
|||
msg6037 | Author: [hidden] (rouilj) | Date: 2017-10-16 00:15 | |
This is another solution by twb using lynx with options to set the encoding. Wanted to record it here for future reference as it uses the lynx -dump option I originally suggested.

    import email
    import subprocess
    import syslog

    def strip_html():
        '''Parse global "message_string" variable as a MIME message.
        Look for text/html MIME objects,
        run lynx -dump on them,
        insert them as text/plain MIME objects.
        Return the new message (as a string).

        NB: this often results in a multipart/alternative branch with TWO text/plain leaves,
        but that actually seems to work out pretty well.'''

        message_object = email.message_from_string(message_string)

        # NB: this is the earliest point we can extract header fields.
        syslog.syslog('SUBJECT {}'.format(message_object.get('Subject', 'No Subject')))
        syslog.syslog('MESSAGE-ID {}'.format(message_object.get('Message-ID', 'No Message-ID')))

        # NB: this walk traverses ALL nodes in a flattened tree,
        # so we do not need to manually recurse on branch nodes.
        # (unlike perl). --twb, Sep 2015
        for part in message_object.walk():
            if 'text/html' == part.get_content_type():
                syslog.syslog('STRIPPED an html part')
                # Pipe it through lynx to render as plain text.
                #
                # NOTE: postfix runs maxwell with C (not C.UTF-8) locale!
                # This breaks non-ASCII for things like open(mode='wt') and check_output(universal_newlines).
                # As a workaround leave everything as b'' bytes and pass encoding hints to lynx and set_payload.
                #
                # NOTE: It is very unintuitive, but
                #   get_payload(decode=False) ⇒ u'…'
                #   get_payload(decode=True)  ⇒ b'…'
                # This is because decode=True decodes only C-T-E: base64 (or quoted-printable);
                # the e.g. ISO-8859-1 to Unicode decoding happens later!
                output = subprocess.check_output(
                    ['lynx',
                     '--dump',
                     '--stdin',
                     '--assume-charset', part.get_content_charset(failobj='UTF-8')],
                    universal_newlines=False,  # KLUDGE: see above
                    input=part.get_payload(decode=True),
                    env={'LC_ALL': 'C.UTF-8'})
                # Edit the part to be plain text.
                # FIXME: instead of editing the old MIME object,
                # *delete* it and create a new text/plain object.
                # The only part we might want to keep is inline vs. attachment (disposition)?
                # But how do I *do* that during a .walk()? --- it's not a foldr!
                del part['Content-Type']
                del part['Content-Transfer-Encoding']  # alloc #31444
                part['Content-Type'] = 'text/plain'
                part.set_payload(output, 'UTF-8')
                # FIXME: should append '.txt' to filename=fred.html where present.
                # I can't see how to change it without also overriding the disposition,
                # which is undesirable. --twb, Sep 2015

        return message_object
|
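The note in twb's comments about `get_payload(decode=False)` versus `get_payload(decode=True)` can be checked with a small self-contained example. The message below is made up for the demonstration; only standard library calls are used.

```python
import base64
import email

html = "<p>caf\u00e9</p>"
encoded = base64.b64encode(html.encode("iso-8859-1")).decode("ascii")
raw = (
    "MIME-Version: 1.0\n"
    'Content-Type: text/html; charset="iso-8859-1"\n'
    "Content-Transfer-Encoding: base64\n"
    "\n" + encoded + "\n"
)
part = email.message_from_string(raw)

# decode=False: a str that still carries the base64 transfer encoding.
undecoded = part.get_payload(decode=False)

# decode=True: bytes with only the Content-Transfer-Encoding removed.
transfer_decoded = part.get_payload(decode=True)

# The charset step is separate and has to be done by the caller.
text = transfer_decoded.decode(part.get_content_charset(failobj="utf-8"))
```

This is why the lynx call above is handed bytes together with an explicit charset hint instead of a decoded string.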
History | |||
---|---|---|---|
Date | User | Action | Args |
2017-10-16 00:15:58 | rouilj | set | messages: + msg6037 |
2017-10-14 21:09:26 | rouilj | set | status: open -> fixed resolution: fixed |
2017-10-14 14:12:02 | rouilj | set | messages: + msg6033 |
2017-10-10 23:16:26 | rouilj | set | status: new -> open assignee: rouilj messages: + msg6027 |
2014-10-18 02:13:45 | rouilj | set | messages: + msg5150 |
2014-03-20 15:07:02 | marlowa | set | files: + unnamed messages: + msg5046 |
2014-03-20 15:05:17 | marlowa | set | files: + unnamed messages: + msg5045 |
2014-03-20 14:43:15 | ber | set | messages: + msg5044 |
2014-03-20 13:40:50 | rouilj | set | messages: + msg5043 |
2014-03-20 10:54:55 | ber | set | messages: + msg5042 |
2014-03-19 09:25:32 | marlowa | set | nosy: + marlowa messages: + msg5041 |
2013-03-07 15:05:03 | ber | set | keywords: + Effort-Medium messages: + msg4819 |
2013-03-07 14:07:11 | rouilj | set | messages: + msg4818 |
2013-03-07 09:40:17 | ber | set | keywords: - Effort-Medium priority: normal -> high messages: + msg4817 nosy: + ber |
2013-03-07 04:39:37 | rouilj | set | title: rovide basic support for handling html only emails -> provide basic support for handling html only emails |
2013-03-07 04:39:26 | rouilj | create |