Roundup Tracker - Issues

Issue 2550514

classification
optimizing menu() et al.
Type: resource usage Severity: major
Components: Database Versions: devel
process
Status: open
Assigned To: stefan Nosy List: rawler, richard, rouilj, schlatterbeck, stefan
Priority: Keywords: patch

Created on 2009-02-21 02:00 by stefan, last changed 2021-03-24 08:58 by schlatterbeck.

Files
File name Uploaded Description Edit Remove
roundup-dbperf-improve.diff rawler, 2009-02-21 15:06
getnodes.patch stefan, 2009-03-20 15:40
Messages
msg3565 Author: [hidden] (stefan) Date: 2009-02-21 01:59
The LinkHTMLProperty.menu() function does conceptually this:

---

options = [o for o in filter(spec) if permitted('View', o)]
for o in options:
  add_menu_item(o)

---

This seemingly benign chunk of code may involve a lot of SQL calls:

* at least one SQL call inside the filter() call, then possibly N more
  in the N calls to permitted() (where N is the number of objects
  returned by filter())
* N more SQL calls as add_menu_item() needs to fetch the label property
  for each item.

So, 2N + 1 distinct SQL calls to fill a single menu with N options.

I believe at least some of this can be optimized:

1) With an enhanced version of filter() that returns not only matching
  node ids but also the requested properties. This would get rid of the
  SQL calls inside add_menu_item().

2) The permission checks could possibly be fused, too. However, as in 
  general permissions are handled outside SQL, this is much harder.

Therefore, I'd suggest enhancing filter() so that instead of simply
returning matching node ids it returns node ids together with requested
properties:

filter(search_matches, filterspec, sort=[], group=[], properties=[])

would return a list of items, where each item is either a node id (if
'properties' is empty) or a tuple containing the node id together with
all requested properties of that node.
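As a back-of-the-envelope illustration (this is not Roundup code; the backend and all names are made up), a toy backend can count the "SQL calls" issued by the current ids-then-get() pattern versus a filter() that returns the requested properties in the same query:

```python
# Toy model of the two access patterns. ToyBackend stands in for an SQL
# backend; sql_calls counts round trips to the database.
class ToyBackend:
    def __init__(self, rows):
        self.rows = rows            # {nodeid: {propname: value}}
        self.sql_calls = 0

    def filter(self, properties=()):
        """One query; optionally fused with the requested properties."""
        self.sql_calls += 1
        if properties:
            return [(nid, [row[p] for p in properties])
                    for nid, row in self.rows.items()]
        return list(self.rows)

    def get(self, nodeid, prop):
        """Per-node property fetch: one query each time."""
        self.sql_calls += 1
        return self.rows[nodeid][prop]

rows = {str(i): {'name': 'option%d' % i} for i in range(1, 6)}

# Current pattern: filter() for the ids, then get() per node for the label.
old = ToyBackend(rows)
menu_old = [(nid, old.get(nid, 'name')) for nid in old.filter()]

# Proposed pattern: one fused query returns ids and labels together.
new = ToyBackend(rows)
menu_new = [(nid, props[0]) for nid, props in new.filter(properties=['name'])]

print(old.sql_calls, new.sql_calls)  # prints: 6 1
```

For N = 5 menu options that is already 6 queries versus 1; the per-node permission checks discussed above would add up to N more on top of the old pattern.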

Comments ?
msg3566 Author: [hidden] (richard) Date: 2009-02-21 02:12
Sounds good to me. Was something I mulled over ages ago.
msg3567 Author: [hidden] (rawler) Date: 2009-02-21 15:06
I had some slightly different solutions, although what you call
"getnodes" I've called "fetch()". It fetches nodes based on a collection
of ids, using the 'using' SQL keyword. It also provides default
implementations for other backends.

There are also a few smaller optimisations, including caching
pre:fetch() for batch()es. Also, in a few places identified by
profiling, I've replaced multiple self.db calls with a local
db variable. (Tip from
http://wiki.python.org/moin/PythonSpeed/PerformanceTips#Avoidingdots...)

In general, though, I think that filter() returning complete nodes is
the way to go. The get()-API will probably only ever perform well on
DBM-style direct-file databases. Other databases, in particular *SQL,
should attempt to fetch as many rows in one go as possible, and the API
should encourage that.
msg3664 Author: [hidden] (stefan) Date: 2009-03-20 15:40
Here is a patch that does two things (beside some cleanup):

* it adds 'getnodes()' methods that allow for more compact SQL queries.
* it adds a 'properties' argument to the 'filter()' methods, to allow
fusing the query of properties with the original filter (again to reduce
the number of required SQL queries).

My plan is to adjust various functions (notably in cgi.templating) to
use these enhancements to reduce the number of queries made on the
database, and thus to speed up page rendering.

How does this look ?
msg3665 Author: [hidden] (stefan) Date: 2009-03-20 16:10
I just realized the patch contains some experimental bits not meant to be
included here. Sorry for the noise.
msg3816 Author: [hidden] (stefan) Date: 2009-07-19 23:12
Richard, any opinion on this ?
msg3818 Author: [hidden] (richard) Date: 2009-07-20 06:14
Stefan, your plan seems sound, go for it.
msg3947 Author: [hidden] (schlatterbeck) Date: 2010-01-03 12:07
When iterating over *many* results from a database query, we should have
an iterator interface instead of fetching all (possibly thousands of)
whole nodes at once. I've experimented with queries that return
thousands of nodes and simply enlarged Roundup's node cache. The result
is only marginally faster, due to two things:
- lots of sql queries
- memory allocation for the cache

only the first will be addressed by the proposed patch...
Maybe we return an iterator from getnodes?
msg3948 Author: [hidden] (stefan) Date: 2010-01-03 14:25
On 01/03/2010 07:07 AM, Ralf Schlatterbeck wrote:
>
> Ralf Schlatterbeck<rsc@runtux.com>  added the comment:
>
> When iterating over *many* results from a database query, we should have
> an iterator interface instead of fetching (possibly thousands of) all
> whole nodes at once. I've experimented with queries that return
> thousands of nodes and simply enlarged roundups node cache. The result
> is only marginally faster due to two things
> - lots of sql queries
> - memory allocation for the cache
>
> only the first will be addressed by the proposed patch...
> Maybe we return an iterator from getnodes?

I agree that an iterator is a good idea. Though the bottleneck we
observed (and which led me to propose a change in the API) is due to IPC
overhead caused by many individual queries, rather than a few fused ones.
msg3949 Author: [hidden] (schlatterbeck) Date: 2010-01-03 15:13
On Sun, Jan 03, 2010 at 02:25:06PM +0000, Stefan Seefeld wrote:
> > - lots of sql queries
> > - memory allocation for the cache
> >
> > only the first will be addressed by the proposed patch...
> > Maybe we return an iterator from getnodes?
> 
> I agree that an iterator is a good idea. Though the bottleneck we 
> observed (and which let me to propose a change in the API) is due to IPC 
> overhead caused by many individual queries, instead of few fused ones.

I've also done some measurements. Roundup uses an internal node cache of
100 items by default (this is a config option). I've done some
experiments with a large query (which returns about 140000 items):

  Cachesize   Time (minutes)
        2     1:54-2:36
       10     1:29-1:35 (only 2 Measurements)
      100     1:14-1:37
   100000     1:23-1:35 (usually at the lower end)

so we see diminishing returns on larger cache sizes inside roundup.
The number of queries when going from cache size 100 to 100000 goes down
from 240000 to 140000 (!) because the same nodes are no longer re-queried
(I'm using some of the nodes more than once in the computation). The
remaining time is memory-bound.

In all cases the initial SQL query that returns all the IDs takes less
than 1 second real-time.

So I guess my example can really profit from using an iterator --
especially since the whole database is in RAM because I configured the
PostgreSQL cache large enough. In these cases the database thread and
the roundup thread consuming the query data can run in parallel.

Ralf
-- 
Dr. Ralf Schlatterbeck                  Tel:   +43/2243/26465-16
Open Source Consulting                  Fax:   +43/2243/26465-23
Reichergasse 131                        www:   http://www.runtux.com
A-3411 Weidling                         email: office@runtux.com
osAlliance member                       email: rsc@osalliance.com
msg3950 Author: [hidden] (stefan) Date: 2010-01-03 17:10
On 01/03/2010 10:13 AM, Ralf Schlatterbeck wrote:

> So I guess my example can really profit from using an iterator --
> especially since the whole database is in RAM because I configured the
> PostgreSQL cache large enough. In these cases the database thread and
> the roundup thread consuming the query data can run in parallel.

I'd suggest to create a new issue about the iterator, so we can keep 
this one for the fused SQL queries idea I submitted initially.

Thanks,
		Stefan
msg3956 Author: [hidden] (schlatterbeck) Date: 2010-01-04 10:33
On Sun, Jan 03, 2010 at 05:10:58PM +0000, Stefan Seefeld wrote:
> 
> On 01/03/2010 10:13 AM, Ralf Schlatterbeck wrote:
> 
> > So I guess my example can really profit from using an iterator --
> > especially since the whole database is in RAM because I configured the
> > PostgreSQL cache large enough. In these cases the database thread and
> > the roundup thread consuming the query data can run in parallel.
> 
> I'd suggest to create a new issue about the iterator, so we can keep 
> this one for the fused SQL queries idea I submitted initially.

Why not make the initial implementation of `getnodes` an iterator? Your
initial use-case in msg3565 is nicely implemented using an iterator. And
if we really have a case where we want the nodes to be materialized we
can always write
nodes = list(db.someclass.getnodes(...))

or am I missing something?

Ralf
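A rough sketch of the generator-based getnodes Ralf suggests, written against a plain DB-API cursor and an in-memory SQLite table (the table and column names are made up; Roundup's real backends differ): rows are fetched in chunks and yielded one at a time, so a large result set is never fully materialized.

```python
import sqlite3

def getnodes(conn, table, ids, properties, chunk=100):
    """Yield (id, properties-dict) pairs without materializing all rows."""
    cols = ', '.join(['id'] + list(properties))
    marks = ', '.join('?' * len(ids))
    cur = conn.execute('SELECT %s FROM %s WHERE id IN (%s) ORDER BY id'
                       % (cols, table, marks), ids)
    while True:
        rows = cur.fetchmany(chunk)     # stream in chunks, not fetchall()
        if not rows:
            break
        for row in rows:
            yield row[0], dict(zip(properties, row[1:]))

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE issue (id INTEGER PRIMARY KEY, title TEXT)')
conn.executemany('INSERT INTO issue VALUES (?, ?)',
                 [(i, 'issue %d' % i) for i in range(1, 6)])

# Consumers can either iterate lazily or materialize with list(...).
nodes = list(getnodes(conn, 'issue', [1, 3, 5], ['title']))
print(nodes[0])  # prints: (1, {'title': 'issue 1'})
```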
msg3957 Author: [hidden] (stefan) Date: 2010-01-04 14:05
On 01/04/2010 05:33 AM, Ralf Schlatterbeck wrote:
>
> Ralf Schlatterbeck<rsc@runtux.com>  added the comment:
>
> On Sun, Jan 03, 2010 at 05:10:58PM +0000, Stefan Seefeld wrote:
>>
>> On 01/03/2010 10:13 AM, Ralf Schlatterbeck wrote:
>>
>>> So I guess my example can really profit from using an iterator --
>>> especially since the whole database is in RAM because I configured the
>>> PostgreSQL cache large enough. In these cases the database thread and
>>> the roundup thread consuming the query data can run in parallel.
>>
>> I'd suggest to create a new issue about the iterator, so we can keep
>> this one for the fused SQL queries idea I submitted initially.
>
> Why not make the initial implementation of `getnodes` an iterator? Your
> initial use-case in msg3565 is nicely implemented using an iterator. And
> if we really have a case where we want the nodes to be materialized we
> can always write
> nodes = list(db.someclass.getnodes(...))
>
> or am I missing something?

No, that sounds fine. All I'm saying is that this issue is about 
reducing the number of SQL queries. Your suggestion is certainly good, 
but tangential to the problems I'm describing here.

Thanks,
		Stefan
msg5622 Author: [hidden] (rouilj) Date: 2016-06-25 20:52
Open questions:

Was this implemented using an iterator for getnodes, as agreed in the
final message?

Does the code as implemented in the patch work for anydbm and all rdbms
back ends?
msg7132 Author: [hidden] (rouilj) Date: 2021-03-22 20:37
Ping: was anything done with these patches? If not, is anybody able to
update them to the current 2.0 code base and port them to Python 3 with
tests?
msg7134 Author: [hidden] (schlatterbeck) Date: 2021-03-23 10:34
On Mon, Mar 22, 2021 at 08:37:17PM +0000, John Rouillard wrote:
> 
> Ping was anything done with these patches? If not is anybody able to 
> update them to the current 2.0 code base and python3ize them with
> tests?

This is implemented with filter_iter a long time ago.

But it isn't used in the html framework. The idea of filter_iter is to
do the same as filter but as an iterator. So in each iteration it will
return a single ID (instead of a full list of IDs). But in addition it
internally fetches all the attributes of a node, not just the ID, and
populates the node cache. In this way we get a single query that
does the same as the getnodes() proposed in this issue: fetch the whole
row. But we're doing it in an iterator, which can conserve a lot of
memory when the query is large (especially since we're not doing a
fetchall of the resulting query data).

So instead of

for id in db.someclass.filter(...):
    n = db.someclass.getnode(id)
    # do something with node n

(which will perform an additional SQL query for each getnode),

using

for id in db.someclass.filter_iter(...):
    n = db.someclass.getnode(id)
    # do something with node n

will perform only a single query at the start. The getnode call will hit
the cache. In addition it will not slurp the whole query result into
memory like the proposed getnodes would (I had proposed making getnodes
an iterator in msg3956 in this issue).

Note that the original analysis that this will perform 2N + 1 SQL queries
is wrong, because the first access to a node puts the node in the
cache (which still leaves N + 1 SQL queries :-)

So the only thing that remains to be done is to implement this in
LinkHTMLProperty.menu and probably in more methods of the HTML wrapper
classes. Note that most use-cases of filter fit a pattern where filter
can be replaced with filter_iter.

The only exception is sorting by Multilinks: This cannot be done in SQL
alone and is therefore not implemented in filter_iter. This is something
that should be deprecated anyway.

And: If the code in the loop uses Multilinks, these *are* fetched in
additional SQL queries (Multilinks are fetched lazily).

Ralf
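A toy illustration (not Roundup internals; the class and counters are made up) of the effect Ralf describes: filter_iter populates the node cache while iterating, so the getnode() calls inside the loop never issue their own queries.

```python
# ToyClass stands in for a hyperdb class; sql_calls counts queries.
class ToyClass:
    def __init__(self, rows):
        self.rows = rows        # {nodeid: {propname: value}}
        self.cache = {}         # the node cache
        self.sql_calls = 0

    def filter(self):
        self.sql_calls += 1                  # one query: ids only
        return list(self.rows)

    def filter_iter(self):
        self.sql_calls += 1                  # one query: ids plus full rows
        for nid, row in self.rows.items():
            self.cache[nid] = row            # populate the node cache
            yield nid

    def getnode(self, nid):
        if nid not in self.cache:
            self.sql_calls += 1              # cache miss: per-node query
            self.cache[nid] = self.rows[nid]
        return self.cache[nid]

rows = {str(i): {'title': 'issue %s' % i} for i in range(1, 101)}

a = ToyClass(rows)
for nid in a.filter():
    a.getnode(nid)              # 100 cache misses, one query each

b = ToyClass(rows)
for nid in b.filter_iter():
    b.getnode(nid)              # every call hits the pre-populated cache

print(a.sql_calls, b.sql_calls)  # prints: 101 1
```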
msg7135 Author: [hidden] (schlatterbeck) Date: 2021-03-23 10:54
On Tue, Mar 23, 2021 at 10:34:14AM +0000, Ralf Schlatterbeck wrote:
> And: If the code in the loop uses Multilinks, these *are* fetched in
> additional SQL queries (Multilinks are fetched lazily).

This should be clarified: if the code in the loop accesses Multilink
properties of the node, those would be fetched separately. Which
certainly does *not* happen in the original use-case of optimizing
LinkHTMLProperty.menu and will not happen in many other use cases (like,
e.g., fetching rows for display in index templates in the web interface
in most cases I've seen).

Ralf
msg7136 Author: [hidden] (rouilj) Date: 2021-03-23 21:25
Hi Ralf:

Responding to two updates here.

In message <20210323103411.z4l6lop3tqihfcp3@runtux.com>,
Ralf Schlatterbeck writes:
>On Mon, Mar 22, 2021 at 08:37:17PM +0000, John Rouillard wrote:
>> Ping was anything done with these patches? If not is anybody able to
>> update them to the current 2.0 code base and python3ize them with
>> tests?
>
>This is implemented with filter_iter a long time ago.
>
>But it isn't used in the html framework. 

What has to be done to get this implemented? I see that tests in
test/db_test_base.py do include calls to filter_iter.

Do we need a test for the templating system
(e.g. LinkHTMLProperty::menu())? The data in the database and schema
probably need to be set up to test the conditions:

   user is not allowed to see one of the items in the class
   one of the items in the class is retired

to make sure we hit the 3 main paths through the code. The final test is
to compare the html. Then we can rework menu() to use filter_iter.

Question: should the html be tested as a string or by comparing the two
parse trees? I think comparing by tree is more robust, but I don't
know of any tools in python to do that. Simple html parser, difflib etc.
don't quite fill the niche.

While I was looking at the code, I noticed we support xhtml and html4. At
this point I think xhtml is kind of dead, but we probably should maintain
it. However, should we add html5 to the output format? In most cases
html4 and html5 would be the same but .... If we do this, should we
change the templates to html5 format (doctype declaration, lang=
attribute etc.)?

Also I noticed that there is a checklist() spec commented out at the
end of the menu() definition. I see a reference to checklist in
customizing.doc and in doc/html_extra/spec.html. Do you know if there
was ever an implementation of this (a series of checkboxes to select
options (e.g. for search), or radio buttons (with a none/unset value)
for setting the value of a Link)? It would have been nice to have, as I
had to implement these at the templating level.

In message <20210323105412.nu5ohfxsxk35awp5@runtux.com>,
>On Tue, Mar 23, 2021 at 10:34:14AM +0000, Ralf Schlatterbeck wrote:
>> And: If the code in the loop uses Multilinks, these *are* fetched in
>> additional SQL queries (Multilinks are fetched lazily).
>
>This should be clarified: If the code in the loop accesses Multilinks
>properties of the node, those would be fetched separately. Which
>certainly does *not* happen in the original use-case of optimizing
>LinkHTMLProperty.menu and will not happen in many other use cases (like,
>e.g., fetching rows for display in index templates in the web interface
>in most cases I've seen).

To make sure I understand: if issue.index.html included the superseder
property, that would trigger this additional set of lookups (one per id
in the superseder list), correct?

Also, if we use filter_iter we can't sort by multilink in the sql. Can
we detect whether the user is requesting a multilink sort and fall back
to filter()? I think sorting by multilink might be a requirement for
index views when grouping by a multilink. (I agree with you that
sorting by multilink is confusing.)

Do you know if the multilink 1,9,10 has the same value as the
multilink 9,10,1? I think it must for sorting at the db level to make
any sense but ....

Have a great week.
msg7138 Author: [hidden] (schlatterbeck) Date: 2021-03-24 08:58
On Tue, Mar 23, 2021 at 09:25:28PM +0000, John Rouillard wrote:
> >This is implemented with filter_iter a long time ago.
> >
> >But it isn't used in the html framework. 
> 
> What has to be done to get this implemented? I see that tests in
> test/db_test_base.py do include calls to filter_iter.
> 
> Do we need a test for the templating system
> (e.g. LinkHTMLProperty::menu())? The data in the database and schema
> probably need to be set up to test the conditions:

Yes, that would be cool, I think the HTML layer is currently largely
untested.

> Question should the html be tested as a string or by comparing the two
> parse trees. I think comparing by tree is more robust, but I don't
> know of any tools in python to do that. Simple html parser, difflib etc.
> don't quite fill the niche.

Beautiful Soup (bs4) is a nice tool for parsing html and coding
regression tests, but it would be an external dependency. Note that we
already optionally use it for parsing incoming HTML emails.
On the other hand, the generated string does not change much, so a
string compare would probably be enough.

> While I was looking at the code, we support xhtml and html4. At this
> point I think xhtml is kind of dead, but we probably should maintain
> it.  However should we add html5 to the output format? In most cases
> html4 and html5 would be the same but .... If we do this should we
> change the templates to html5 format (doctype declaration, lang=
> attribute etc.).

Yes that would be nice. I think the generated html type can currently be
configured. I'd go for html5 for the standard templates of the classic
tracker.

> Also I noticed that there is a checklist() spec commented out at the
> end of the menu() definition. I see a reference to checklist in the
> customizing.doc and in doc/html_extra/spec.html. Do you know if there
> was ever an implementation for this (series of checkboxes to select
> options (e.g. for search) or radiobuttons (with a none/unset value)
> for setting the value for a Link)? Would have been nice to have as I
> had to implement these at the templating level.

No, I haven't ever used this.
I mainly use classhelp, which pops up a separate window where you can
select things that end up in the input field when you press a button.

> >This should be clarified: If the code in the loop accesses Multilinks
> >properties of the node, those would be fetched separately. Which
> >certainly does *not* happen in the original use-case of optimizing
> >LinkHTMLProperty.menu and will not happen in many other use cases (like,
> >e.g., fetching rows for display in index templates in the web interface
> >in most cases I've seen).
> 
> To make sure I understand. If issue.index.html included the superseder
> property, that would trigger this addition set of lookups (one per id
> in the superseder list) correct?

Code:

for id in db.someclass.filter_iter(...):
    node = db.someclass.getnode(id)
    x = node.some_multilink  # this triggers the SQL lookup of the multilink

And, yes, concerning your question: as soon as something accesses a
multilink inside the loop, it is materialized with an additional
query. I don't find this too worrying:
- If a class has several multilinks, only the used ones are materialized;
  e.g., 'files' and 'messages' would not be queried in your example.
- The multilink is a separate table in SQL anyway. Better to fetch this
  lazily than to materialize the data up-front with a large join.

> Also if we use filter_iter we can't sort by multilink in the sql. Can
> we detect if the user is requesting a multilink sort and fall back to
> filter()? I think sorting by multilink might be a requirement for
> index views when grouping by a multilink. (I agree with you that
> sorting by multilink is confusing.)

The internal data structure that represents the tree for the generated
sql is aware of the fact that there is sorting by multilinks, so in
theory this is possible to detect. On the other hand, I do not think
that a list sorted by multilinks is comprehensible to humans. So I guess
we should drop this feature and just ignore it during searches, maybe
issuing a warning. In addition, it is probably not easy for users to
understand why some queries take a lot more time than others.

> Do you know if the multilink 1,9,10 has the same value as the
> multilink 9,10,1? I think it must for sorting at the db level to make
> any sense but ....

Yes, they have the same value. First the individual values are sorted
(by the order property of the linked class), then the dataset is sorted
by the resulting lists (again by the order prop, of course).
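A small sketch of those semantics with made-up order values: each node's multilink ids are mapped to their order values and sorted, and nodes are then compared on the resulting lists, so [1, 9, 10] and [9, 10, 1] yield the same sort key.

```python
# Hypothetical order properties for linked nodes 1, 9 and 10.
order = {1: 1.0, 9: 2.0, 10: 3.0}

def multilink_sort_key(ids):
    # ids are replaced by their order values and sorted, so the key is
    # independent of the order in which the ids are stored
    return sorted(order[i] for i in ids)

assert multilink_sort_key([1, 9, 10]) == multilink_sort_key([9, 10, 1])

# The dataset is then ordered by comparing these lists lexicographically.
nodes = {'a': [9, 10, 1], 'b': [9], 'c': [1, 10]}
print(sorted(nodes, key=lambda n: multilink_sort_key(nodes[n])))
# prints: ['a', 'c', 'b']
```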

> Have a great week.
Same to you!

Ralf
History
Date User Action Args
2021-03-24 08:58:34schlatterbecksetmessages: + msg7138
2021-03-23 21:25:28rouiljsetmessages: + msg7136
2021-03-23 10:54:14schlatterbecksetmessages: + msg7135
2021-03-23 10:34:14schlatterbecksetmessages: + msg7134
2021-03-22 20:37:17rouiljsetmessages: + msg7132
2016-06-25 20:52:46rouiljsetnosy: + rouilj
messages: + msg5622
2010-01-04 14:05:47stefansetmessages: + msg3957
2010-01-04 10:33:00schlatterbecksetmessages: + msg3956
2010-01-03 17:10:58stefansetmessages: + msg3950
2010-01-03 15:13:33schlatterbecksetmessages: + msg3949
2010-01-03 14:25:06stefansetmessages: + msg3948
2010-01-03 12:07:09schlatterbecksetnosy: + schlatterbeck
messages: + msg3947
2009-07-20 06:14:25richardsetmessages: + msg3818
2009-07-19 23:12:51stefansetmessages: + msg3816
2009-03-20 16:10:33stefansetmessages: + msg3665
2009-03-20 15:40:47stefansetstatus: new -> open
files: + getnodes.patch
messages: + msg3664
2009-02-21 15:06:06rawlersetfiles: + roundup-dbperf-improve.diff
nosy: + rawler
messages: + msg3567
keywords: + patch
2009-02-21 02:12:55richardsetmessages: + msg3566
2009-02-21 02:00:00stefancreate