Message 7728 - Roundup tracker

Author	rouilj
Recipients	Heiko, rouilj, schlatterbeck
Date	2023-02-23.18:44:58
Message-id	<20230223184453.0AC896A0010@pe15.cs.umb.edu>
In-reply-to	<20230223095547.p4hypcsuawoqwcl3@runtux.com>
Hi Ralf:

In message <20230223095547.p4hypcsuawoqwcl3@runtux.com>,
Ralf Schlatterbeck writes:
>Ralf Schlatterbeck added the comment:
>
>On Wed, Feb 22, 2023 at 05:20:45PM +0000, John Rouillard wrote:
>> 
>> Roundup has two additional mechanisms to try to assign
>> emails to issues if prefix parsing fails.
>> 
>>  1. It looks for an in-reply-to header in the email and
>>     tries to match that up against an existing message with
>>     the same message id. Then it puts the new message on
>>     issue with the matching message.
>
>Are you sure this mechanism works in all cases? I have lots of cases
>where a direct reply does not work, I've never considered that matching
>happens by in-reply-to header.

Well I don't know about all cases. But the test

  testReplytoMultiMatch

in test_mailgw.py works. And the code in Codecov shows that the
branches are being executed.

Here is a funny one though, there is a test testIssueidLast which does
find the [issue1] at the end of the subject line.

(note, if I modify that test to remove the "[issue1]" at the end it does
properly match the new message by message-id/in-reply-to to issue
1.)

However if I keep the "prefix" at the end, it gets matched by
extracting it via this (unanchored) regexp:

  \\[(?P<classname>(file|issue|keyword|msg|priority|query|status|user))(?P<nodeid>\\d+)?\\]'

even though it correctly records a missing prefix. The comment with
this is that mailing list software could add an identifier as
well. This would mess up prefix detection, but designator detection would
still happen.

So it would appear that '[classNN]' is matched anywhere in the subject
line case insensitively via:

  m = re.search(class_re, tmpsubject, re.IGNORECASE)

8-/. Only the first one found is used. This is with strict parsing
too. In this case, there is a side effect of erasing the new subject
as it moves past, so the issue preserves the original title.

>>  2. If message id matching fails, it tries to match on the subject.
>>     If the new message's subject is changed sufficiently from the
>>     title of an existing issue, this will fail as well.
>
>Yes, the particular use-case Heiko is referring to is something like
>
>[issue 4711]
>i.e. an space after issue is the most common thing people get wrong.
>Now if someone is editing the subject by hand, chances are that the
>whole mail is not composed as an answer to an existing incoming mail...

True.

>So the idea would be to make matching more tolerant like in a message
>body. This allows different case and/or a space between class name and
>node-id.
>So if we change this it probably should be configurable.

Agreed. Should it be added to the definition of loose? Should there be
a separate setting (or settings in an already long file)? Should the
values of:

   subject_prefix_parsing

be extended to <required> [optional]:

   <strict|loose|none> [internal_whitespace] [delim_whitespace]
     [prefix_only]

(Better naming is needed.)

  * internal_whitespace is "[issue 24]"
  * delim_whitespace (if we want to support it) is "[ issue24 ]".
  * using both allows: "[ issue 24 ]".

Implementing prefix_only might be tricky as noted above with junk
before it.

>And when at it we might want to make the message-body matching
>configurable, too for the use-case above.

This is for a different ticket. BTW it's your fault for not naming the
class QSOcontacts rather than qso 8-). 73!

>> Ralf, let's assume Heiko's experiment indicates that
>> inreplyto matching should work. [...]
>
>I don't think it matches anything, see use-case above.

Agreed. The space between the class and id would prevent any match.

>But it may well be that when it started matching (message begins with
>'[' after stripping reply/fwd prefixes) that no other alternatives are
>tried. See above for my experience on subject matching.

The in-reply-to should be applied, but as you noted there may not be
an in-reply-to on the inbound email. Also if Roundup doesn't receive a
copy of every email in an external discussion, it will break the
chain.

Hence my asking Heiko to fire up roundup-admin for a look.

>> Parsing the subject line is tricky. We don't want false
>> positives where a match happens, but shouldn't be. To guard
>> against this we use:
>> 
>>   1. known location for the prefix
>Yes, I would only consider a *prefix*, i.e. directly after a 'Re:' or
>similar prefix in the subject (this is already a configurable regex).

I wonder if we should anchor the identifier prefix and handle any
other mailing list junk as part of the Re: removal stuff.

>>   2. delimiters around the prefix
>Yes I would not depart from []. We *do* have the occasional [issue4711}
>or so but I would not try to parse these.

ok.

>>   3. strict format inside the delimiters
>See above, for a first change I'd only allow a space between classname
>and node-id. This is the most common case as far as my observations go.

with some format variations (spacing) allowed by configuration.

>> It is more complex that that, you can have:
>> 
>>   Subject: [user3] [realname=Fred]
>
>Yes, but the second thing must currently be last in the subject.

It would appear not. Also if you do:

   Subject: [realname=Fred] [user3]

I think the suffix parsing will not happen. When suffix parsing
happens, the subject line consists what is after '[user3]'.

>> What is the setting for 'subject_prefix_parsing' in your
>> tracker's config.ini?
>> [...]
>> If you aren't using strict, it might help.
>
>Yes, maybe that would indeed solve that particular issue.

From the code it looks like it should if the designator is found as a
prefix.

>> >My request is to make the mail gateway based issue matching
>> >more fault tolerant. It should not matter where in the
>> >subject line the issue designator is placed.
>
>I'm very reluctant to allow the designator anywhere in the message. This
>simply could create too much ambiguity, especially together with the
>second bracketed expression where you can set properties.

Well we have that already so... Heiko consider this implemented 8-).

>> >Blank spaces inside the brackets
>> [...]
>> or do they split the designator like:
>> 
>>   [  issue 24 ]
>
>I think this is a good use-case and most of the cases I've seen match
>that pattern.

Ok.

>>  [ iss ue 24 ]
>
>I wouldn't allow these.

Ok.

>> >It should also be case insensitive.
>
>I also would not allow case-insensitive matches.

Apparently it is already. Although nothing I found in the code
indicates that classes are supposed to be case sensitive.

>But note that the parsing inside the message body seems to allow
>capitalization. So I'm seeing something like Issue 23 highlighted as a
>link to issue23. An all-uppercase ISSUE 23 is *not* linked, though.

I wonder if a class named QSO would be matched by qso 73?
History
Date	User	Action	Args
2023-02-23 18:44:59	rouilj	set	recipients: + rouilj, schlatterbeck, Heiko
2023-02-23 18:44:59	rouilj	link	issue2551262 messages
2023-02-23 18:44:58	rouilj	create
Roundup Tracker - Issues

Message7728