Roundup Tracker - Issues

Issue 2551250

classification
Weird error with PrioList of detectors
Type: crash Severity: urgent
Components: Interface Versions: devel
process
Status: new
:
: schlatterbeck : rouilj, schlatterbeck
Priority: : python3

Created on 2022-12-21 18:36 by schlatterbeck, last changed 2022-12-27 15:33 by schlatterbeck.

Messages
msg7693 Author: [hidden] (schlatterbeck) Date: 2022-12-21 18:36
I've recently moved a production tracker from python2 to python3.
Now I'm getting *occasionally* a Traceback

TypeError: '<' not supported between instances of 'function' and 'function'

when sorting detectors (auditors and reactors).
These are using a PrioList object which sorts on demand before iterating over auditors and/or reactors.
The idea there is that detectors can have a priority.
The objects in the list are 3-tuples with prio, name, function.
This should never try to look at the function unless we have two entries with the same prio (the default is 100 and this is seldomly set explicitly so they may all be the same) and name.

Since this is quite urgent for me I'll try to fix this by allowing a key function to be passed to the PrioList. This would then be something like "lambda x: x[:2]" so that we never compare function objects. This is probably the right thing to do.

What I find very confusing is that this should happen everytime when sorting the detectors. It doesn't.
msg7694 Author: [hidden] (schlatterbeck) Date: 2022-12-21 19:17
I've pushed a fix that fixes the sorting issue should there ever be two detectors for the same event with the same name and prio.

But: In my tracker I do not have duplicate detector names.
In fact I'm currently unable to reproduce the problem.
Is it possible that under certain circumstances we add detectors twice?

Note that this occurred also in the mailgw interface, maybe when processing more than one mail this can occur?
msg7695 Author: [hidden] (rouilj) Date: 2022-12-21 20:01
Hi Ralf:

In message <1671647765.44.0.499935488614.issue2551250@roundup.psfhosted.org>,
Ralf Schlatterbeck writes:
>I've recently moved a production tracker from python2 to python3.
>Now I'm getting *occasionally* a Traceback
>
>TypeError: '<' not supported between instances of 'function' and 'function'
>
>when sorting detectors (auditors and reactors).

>These are using a PrioList object which sorts on demand before
>iterating over auditors and/or reactors.
[...]
>The objects in the list are 3-tuples with prio, name, function.

>This should never try to look at the function unless we have two
>entries with the same prio (the default is 100 and this is seldomly
>set explicitly so they may all be the same) and name.

I would expect the prio to be the same but not the name. Could this be
a string vs bytestring issue on the name? Are any of the names in
German?

>Since this is quite urgent for me I'll try to fix this by allowing a
>key function to be passed to the PrioList. This would then be
>something like "lambda x: x[:2]" so that we never compare function
>objects. This is probably the right thing to do.

Agreed. I wonder if this may be an issue in other places we use
Priolist. Can you dump the list it is sorting as well so we have a
clue where the first two elements of the tuple are identical.

This data dump would also serve as a test case for PrioList.

>What I find very confusing is that this should happen everytime when
>sorting the detectors. It doesn't.

To make this happen the name also has to compare the same right? So I
would not expect it to happen. Any idea how the name is ending up the
same? Maybe something is being doubly added under certain
circumstances?

I seem to recall some changes to python internals regarding iteration
orders for dicts, but it shouldn't affect tuples (i.e. the function
gets sorted before the prio/name).
msg7696 Author: [hidden] (schlatterbeck) Date: 2022-12-22 07:43
On Wed, Dec 21, 2022 at 08:01:46PM +0000, John Rouillard wrote:
> I would expect the prio to be the same but not the name. Could this be
> a string vs bytestring issue on the name? Are any of the names in
> German?

No, the names are function names in python, they're plain ascii.
And I've checked, I don't have any duplicate names.

> Agreed. I wonder if this may be an issue in other places we use
> Priolist. Can you dump the list it is sorting as well so we have a
> clue where the first two elements of the tuple are identical.

I think the PrioList is currently only used for auditors/reactors.
I think I implemented this a decade or two ago :-)
And, yes, I've dumped the list here and found no duplicates. But then I
also could not reproduce the error so far.

> >What I find very confusing is that this should happen everytime when
> >sorting the detectors. It doesn't.
> 
> To make this happen the name also has to compare the same right? So I
> would not expect it to happen. Any idea how the name is ending up the
> same? Maybe something is being doubly added under certain
> circumstances?

Yes, the name would have to be the same, too. And I've checked, I do not
have any reactors/auditors with duplicate names. So my current
hypothesis is, too, that we get some duplication in some circumstances.

> I seem to recall some changes to python internals regarding iteration
> orders for dicts, but it shouldn't affect tuples (i.e. the function
> gets sorted before the prio/name).

Yes, and in our case we explicitly sort the list. So iteration order
should be the same in every invocation.

I've rolled out my change yesterday evening so I think I will not see
the error anymore. But I'll try to find out if we get duplication
somewhere.

Ralf
-- 
Dr. Ralf Schlatterbeck                  Tel:   +43/2243/26465-16
Open Source Consulting                  www:   www.runtux.com
Reichergasse 131, A-3411 Weidling       email: office@runtux.com
msg7703 Author: [hidden] (rouilj) Date: 2022-12-25 04:51
Hi Ralf:

In message <1671650263.38.0.386812066679.issue2551250@roundup.psfhosted.org>,
Ralf Schlatterbeck writes:
>I've pushed a fix that fixes the sorting issue should there ever
>be two detectors for the same event with the same name and prio.

Should two entries with the same name/prio even be allowed?

Since list.sort() is stable, identical entries will preserve the
original relative order of identical keys, but that's not documented
anywhere that I can see. In fact we don't document that identically
prioritied and named items can be added. Given that it used to cause a
crash, I assume it's not being used.

>But: In my tracker I do not have duplicate detector names.
>In fact I'm currently unable to reproduce the problem.

I assume you're trying to reproduce without the patch.  Do you have
any debugging statements in place that may perturb things?  Also what
version of Python are you running?

I wonder if we should raise an exception when adding an element with
the same name and priority as an existing element. Detecting this at
the source should identify where the duplicates are coming from. This
might expose another bug. I am concerned about the performance issue
but we could add the key (priority, name) to a dict and just compare
the key to determine membership.

Ideally we would detect the issue at time of creation, but wrapping:

  self.list.sort()

in try/except to capture the exception and dump self.list might
provide some clues. I have been looking for changes between python 2
and python 3 that could trigger this but I got nothing.

>Is it possible that under certain circumstances we add detectors twice?

I suggested that as a possibility in my response to your initial
report.  I have no idea how that would happen (intermittently) though.

>Note that this occurred also in the mailgw interface, maybe when
>processing more than one mail this can occur?

Maybe. I assume you are fetching email via imap from cron?

Might:

   https://github.com/bamthomas/aioimaplib/blob/master/aioimaplib/tests/imapserver.py

or

  https://github.com/antespi/docker-imap-devel

or some other imap mock server be useful for testing?
msg7704 Author: [hidden] (schlatterbeck) Date: 2022-12-27 14:19
On Sun, Dec 25, 2022 at 04:51:14AM +0000, John Rouillard wrote:
> 
> In message <1671650263.38.0.386812066679.issue2551250@roundup.psfhosted.org>,
> Ralf Schlatterbeck writes:
> >I've pushed a fix that fixes the sorting issue should there ever
> >be two detectors for the same event with the same name and prio.
> 
> Should two entries with the same name/prio even be allowed?

I don't see any good reason to specifically check for this.

> Since list.sort() is stable, identical entries will preserve the
> original relative order of identical keys, but that's not documented
> anywhere that I can see. In fact we don't document that identically
> prioritied and named items can be added. Given that it used to cause a
> crash, I assume it's not being used.

Yes I think this is a minor issue. In most use-cases you don't even assign
an explicit prio to detectors.

> >But: In my tracker I do not have duplicate detector names.
> >In fact I'm currently unable to reproduce the problem.
> 
> I assume you're trying to reproduce without the patch.  Do you have
> any debugging statements in place that may perturb things?  Also what
> version of Python are you running?

This is 3.8 on the production machine. Note that this will not occur on
python2 because there comparison of functions is allowed. I currently do
not yet have debug code in place (I just wanted to get out a fix before
christmas :-)

Also note: I'm *newly* getting the same error message on a tracker that
is running on python3 for quite some time (also 3.8 afair). So is it
possible that recently something has been introduce that duplicates
detectors? On the other hand this tracker is quite low-volume... so it
may well be that we never hat multiple incoming emails in a single
mailgw run yet.

> I wonder if we should raise an exception when adding an element with
> the same name and priority as an existing element. Detecting this at
> the source should identify where the duplicates are coming from. This
> might expose another bug. I am concerned about the performance issue
> but we could add the key (priority, name) to a dict and just compare
> the key to determine membership.

or a set()

> 
> Ideally we would detect the issue at time of creation, but wrapping:
> 
>   self.list.sort()
> 
> in try/except to capture the exception and dump self.list might
> provide some clues. I have been looking for changes between python 2
> and python 3 that could trigger this but I got nothing.

I don't think there were changes when introducing py3, the comparison
(for sorting) of functions works fine in python2 but not in python3:

: ralf@bee:3; python3
>>> def f1 ():
...  pass
...
>>> def f2 ():
...  pass
...
>>> f1 < f2
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: '<' not supported between instances of 'function' and 'function'

: ralf@bee:4; python2
...
>>> f1 < f2
True

> >Is it possible that under certain circumstances we add detectors twice?
> 
> I suggested that as a possibility in my response to your initial
> report.  I have no idea how that would happen (intermittently) though.

> >Note that this occurred also in the mailgw interface, maybe when
> >processing more than one mail this can occur?
> 
> Maybe. I assume you are fetching email via imap from cron?

It might occur when processing more than one email in the mailgw if
there is a bug that repeatedly adds detectors for each mail that is
processed. That's my current hypothesis...

And, yes, I'm using imap.

> Might:
> 
>    https://github.com/bamthomas/aioimaplib/blob/master/aioimaplib/tests/imapserver.py
> 
> or
> 
>   https://github.com/antespi/docker-imap-devel
> 
> or some other imap mock server be useful for testing?

Yes, unless we can reproduce it with some other mailgw method, it might
not be related to imap.

If we don't write each other, I wish you 'einen guten Rutsch ins Neue
Jahr' (roughly translates as 'a good slide to the new year' :-)

Ralf
-- 
Dr. Ralf Schlatterbeck                  Tel:   +43/2243/26465-16
Open Source Consulting                  www:   www.runtux.com
Reichergasse 131, A-3411 Weidling       email: office@runtux.com
msg7705 Author: [hidden] (schlatterbeck) Date: 2022-12-27 15:33
On Tue, Dec 27, 2022 at 02:19:33PM +0000, Ralf Schlatterbeck wrote:
> Also note: I'm *newly* getting the same error message on a tracker that
> is running on python3 for quite some time (also 3.8 afair).

Turns out that was a different error (comparison between NoneType and
float in one of my detectors).

Ralf
-- 
Dr. Ralf Schlatterbeck                  Tel:   +43/2243/26465-16
Open Source Consulting                  www:   www.runtux.com
Reichergasse 131, A-3411 Weidling       email: office@runtux.com
History
Date User Action Args
2022-12-27 15:33:18schlatterbecksetmessages: + msg7705
2022-12-27 14:19:33schlatterbecksetmessages: + msg7704
2022-12-25 04:51:14rouiljsetmessages: + msg7703
2022-12-22 07:43:38schlatterbecksetmessages: + msg7696
2022-12-21 20:01:46rouiljsetmessages: + msg7695
2022-12-21 19:17:43schlatterbecksetmessages: + msg7694
2022-12-21 18:36:05schlatterbeckcreate