Issue 2550514
Created on 2009-02-21 02:00 by stefan, last changed 2021-03-24 08:58 by schlatterbeck.
Files | ||||
---|---|---|---|---|
File name | Uploaded | Description | Edit | Remove |
roundup-dbperf-improve.diff | rawler, 2009-02-21 15:06 | |||
getnodes.patch | stefan, 2009-03-20 15:40 |
Messages | |||
---|---|---|---|
msg3565 | Author: [hidden] (stefan) | Date: 2009-02-21 01:59 | |
The LinkHTMLProperty.menu() function does conceptually this: --- options = [o for o in filter(spec) if permitted('View', o)] for o in options: add_menu_item(o) --- This seemingly benign chunk of code may involve a lot of SQL calls: * at least one SQL call inside the filter() call, then possible N more in the N calls to permitted() (where N is the number of objects returned by filter() * N more SQL calls as add_menu_item() needs to fetch the label property for each item. So, 2N + 1 distinct SQL calls to fill a single menu with N options. I believe at least some of this can be optimized: 1) With an enhanced version of filter(), not only matching nodeids would be returned, but requested properties, too. This would get rid of the SQL calls inside add_menu_item(). 2) The permission checks could possibly be fused, too. However, as in general permissions are handled outside SQL, this is much harder. Therefor, I'd suggest to enhance filter() so instead of simply returning matching node ids it returns node ids in addition to requested properties: filter(search_matches, filterspec, sort=[], group=[], properties=[]) would return a list of items, where an item is either a nodeid, if 'properties' is empty, or a tuple containing nodeid, as well as all requested properties of that node, otherwise. Comments ? |
|||
msg3566 | Author: [hidden] (richard) | Date: 2009-02-21 02:12 | |
Sounds good to me. Was something I mulled over ages ago. |
|||
msg3567 | Author: [hidden] (rawler) | Date: 2009-02-21 15:06 | |
I had some slightly different solutions. Although "getnodes" I've called "fetch()". It fetches nodes based on a collection of ids and using the 'using' SQL-keyword. It also features defaults for other backends. There are also a few smaller optimisations, including caching pre:fetch() for batch()es. Also, in a few places identified by profiling, I've replaced multiple self.db-calls, with a local db-variable. (Tip from http://wiki.python.org/moin/PythonSpeed/PerformanceTips#Avoidingdots...) In general, though, I think that filter() returning complete nodes is the way to go. The get()-API will probably only ever perform well on DBM-style direct-file databases. Other databases, in particular *SQL, should attempt to fetch as many rows in one go as possible, and the API should encourage that. |
|||
msg3664 | Author: [hidden] (stefan) | Date: 2009-03-20 15:40 | |
Here is a patch that does two things (beside some cleanup): * it adds 'getnodes()' methods that allow for more compact SQL queries. * it adds a 'properties' argument to the 'filter()' methods, to allow to fuse the query of properties with the original filter (again to reduce the number of required SQL queries). My plan is to adjust various functions (notably in cgi.templating) to use these enhancements to reduce the number of queries made on the database, and thus to speed up page rendering. How does this look ? |
|||
msg3665 | Author: [hidden] (stefan) | Date: 2009-03-20 16:10 | |
I just realize the patch contains some experimental bits not meant to be included here. Sorry for the noise. |
|||
msg3816 | Author: [hidden] (stefan) | Date: 2009-07-19 23:12 | |
Richard, any opinion on this ? |
|||
msg3818 | Author: [hidden] (richard) | Date: 2009-07-20 06:14 | |
Stefan, your plan seems sound, go for it. |
|||
msg3947 | Author: [hidden] (schlatterbeck) | Date: 2010-01-03 12:07 | |
When iterating over *many* results from a database query, we should have an iterator interface instead of fetching (possibly thousands of) all whole nodes at once. I've experimented with queries that return thousands of nodes and simply enlarged roundups node cache. The result is only marginally faster due to two things - lots of sql queries - memory allocation for the cache only the first will be addressed by the proposed patch... Maybe we return an iterator from getnodes? |
|||
msg3948 | Author: [hidden] (stefan) | Date: 2010-01-03 14:25 | |
On 01/03/2010 07:07 AM, Ralf Schlatterbeck wrote: > > Ralf Schlatterbeck<rsc@runtux.com> added the comment: > > When iterating over *many* results from a database query, we should have > an iterator interface instead of fetching (possibly thousands of) all > whole nodes at once. I've experimented with queries that return > thousands of nodes and simply enlarged roundups node cache. The result > is only marginally faster due to two things > - lots of sql queries > - memory allocation for the cache > > only the first will be addressed by the proposed patch... > Maybe we return an iterator from getnodes? I agree that an iterator is a good idea. Though the bottleneck we observed (and which let me to propose a change in the API) is due to IPC overhead caused by many individual queries, instead of few fused ones. |
|||
msg3949 | Author: [hidden] (schlatterbeck) | Date: 2010-01-03 15:13 | |
On Sun, Jan 03, 2010 at 02:25:06PM +0000, Stefan Seefeld wrote: > > - lots of sql queries > > - memory allocation for the cache > > > > only the first will be addressed by the proposed patch... > > Maybe we return an iterator from getnodes? > > I agree that an iterator is a good idea. Though the bottleneck we > observed (and which let me to propose a change in the API) is due to IPC > overhead caused by many individual queries, instead of few fused ones. I've also done some measurements. Roundup uses an internal node cache of 100 items by default (this is a config option). I've done some experiments with a large query (which returns about 140000 items): Cachesize Time (minutes) 2 1:54-2:36 10 1:29-1:35 (only 2 Measurements) 100 1:14-1:37 100000 1:23-1:35 (usually at the lower end) so we see diminishing returns on larger cache sizes inside roundup. The number of queries when going from cache-size 100 to 100000 goes down from 240000 to 140000 (!) due to no re-querying the same nodes (I'm using some of the nodes more than once in the computation). The time requirement is memory-bound. In all cases the initial SQL query that returns all the IDs takes less than 1 second real-time. So I guess my example can really profit from using an iterator -- especially since the whole database is in RAM because I configured the PostgreSQL cache large enough. In these cases the database thread and the roundup thread consuming the query data can run in parallel. Ralf -- Dr. Ralf Schlatterbeck Tel: +43/2243/26465-16 Open Source Consulting Fax: +43/2243/26465-23 Reichergasse 131 www: http://www.runtux.com A-3411 Weidling email: office@runtux.com osAlliance member email: rsc@osalliance.com |
|||
msg3950 | Author: [hidden] (stefan) | Date: 2010-01-03 17:10 | |
On 01/03/2010 10:13 AM, Ralf Schlatterbeck wrote: > So I guess my example can really profit from using an iterator -- > especially since the whole database is in RAM because I configured the > PostgreSQL cache large enough. In these cases the database thread and > the roundup thread consuming the query data can run in parallel. I'd suggest to create a new issue about the iterator, so we can keep this one for the fused SQL queries idea I submitted initially. Thanks, Stefan |
|||
msg3956 | Author: [hidden] (schlatterbeck) | Date: 2010-01-04 10:33 | |
On Sun, Jan 03, 2010 at 05:10:58PM +0000, Stefan Seefeld wrote: > > On 01/03/2010 10:13 AM, Ralf Schlatterbeck wrote: > > > So I guess my example can really profit from using an iterator -- > > especially since the whole database is in RAM because I configured the > > PostgreSQL cache large enough. In these cases the database thread and > > the roundup thread consuming the query data can run in parallel. > > I'd suggest to create a new issue about the iterator, so we can keep > this one for the fused SQL queries idea I submitted initially. Why not make the initial implementation of `getnodes` an iterator? Your initial use-case in msg3565 is nicely implemented using an iterator. And if we really have a case where we want the nodes to be materialized we can always write nodes = [db.someclass.getnodes (...)] or am I missing something? Ralf -- Dr. Ralf Schlatterbeck Tel: +43/2243/26465-16 Open Source Consulting Fax: +43/2243/26465-23 Reichergasse 131 www: http://www.runtux.com A-3411 Weidling email: office@runtux.com osAlliance member email: rsc@osalliance.com |
|||
msg3957 | Author: [hidden] (stefan) | Date: 2010-01-04 14:05 | |
On 01/04/2010 05:33 AM, Ralf Schlatterbeck wrote: > > Ralf Schlatterbeck<rsc@runtux.com> added the comment: > > On Sun, Jan 03, 2010 at 05:10:58PM +0000, Stefan Seefeld wrote: >> >> On 01/03/2010 10:13 AM, Ralf Schlatterbeck wrote: >> >>> So I guess my example can really profit from using an iterator -- >>> especially since the whole database is in RAM because I configured the >>> PostgreSQL cache large enough. In these cases the database thread and >>> the roundup thread consuming the query data can run in parallel. >> >> I'd suggest to create a new issue about the iterator, so we can keep >> this one for the fused SQL queries idea I submitted initially. > > Why not make the initial implementation of `getnodes` an iterator? Your > initial use-case in msg3565 is nicely implemented using an iterator. And > if we really have a case where we want the nodes to be materialized we > can always write > nodes = [db.someclass.getnodes (...)] > > or am I missing something? No, that sounds fine. All I'm saying is that this issue is about reducing the number of SQL queries. Your suggestion is certainly good, but tangential to the problems I'm describing here. Thanks, Stefan |
|||
msg5622 | Author: [hidden] (rouilj) | Date: 2016-06-25 20:52 | |
Open questions: was this implemented using an iterator for getnodes as agreed in the final message? does the code as implemented in the patch work for anydbm and all rdbm back ends? |
|||
msg7132 | Author: [hidden] (rouilj) | Date: 2021-03-22 20:37 | |
Ping was anything done with these patches? If not is anybody able to update them to the current 2.0 code base and python3ize them with tests? |
|||
msg7134 | Author: [hidden] (schlatterbeck) | Date: 2021-03-23 10:34 | |
On Mon, Mar 22, 2021 at 08:37:17PM +0000, John Rouillard wrote: > > Ping was anything done with these patches? If not is anybody able to > update them to the current 2.0 code base and python3ize them with > tests? This is implemented with filter_iter a long time ago. But it isn't used in the html framework. The idea of filter_iter is to do the same as filter but as an iterator. So in each iteration it will return a single ID (instead a full list of IDs). But: In addition it internally fetches all attributes of a node, not just the ID and populates the node-cache. In this way we will have a single query that does the same as the proposed getnodes() in the issue: Fetch the whole row. But we're doing it in an iterator which can conserve a lot of memory when the query is large (especially since we're not doing a fetchall of the resulting query data). So instead of for id in db.someclass.filter(...): n = db.someclass.getnode(id) # do domething with node n (which will perform an additional sql query for each getnode) using for id in db.someclass.filter_iter(...): n = db.someclass.getnode(id) # do domething with node n will perform only a single query at the start. The getnode call will hit the cache. In addition it will not slurp in the whole query into the memory like the proposed getnodes would (I had proposed to make getnodes an iterator in msg3956 in that issue). Note that the original analysis that this will perform 2*n+1 sql queries is wrong, because the first access to a node will put the node in the cache (which still leaves n + 1 sql queries :-) So the only thing that remains to be done is to implement this in LinkHTMLProperty.menu and probably in more methods of the HTML wrapper classes. Note that most use-cases of filter fit a pattern where filter can be replaced with filter_iter. The only exception is sorting by Multilinks: This cannot be done in SQL alone and is therefore not implemented in filter_iter. This is something that should be deprecated anyway. And: If the code in the loop uses Multilinks, these *are* fetched in additional SQL queries (Multilinks are fetched lazily). Ralf -- Dr. Ralf Schlatterbeck Tel: +43/2243/26465-16 Open Source Consulting www: www.runtux.com Reichergasse 131, A-3411 Weidling email: office@runtux.com |
|||
msg7135 | Author: [hidden] (schlatterbeck) | Date: 2021-03-23 10:54 | |
On Tue, Mar 23, 2021 at 10:34:14AM +0000, Ralf Schlatterbeck wrote: > And: If the code in the loop uses Multilinks, these *are* fetched in > additional SQL queries (Multilinks are fetched lazily). This should be clarified: If the code in the loop accesses Multilinks properties of the node, those would be fetched separately. Which certainly does *not* happen in the original use-case of optimizing LinkHTMLProperty.menu and will not happen in many other use cases (like, e.g., fetching rows for display in index templates in the web interface in most cases I've seen). Ralf -- Dr. Ralf Schlatterbeck Tel: +43/2243/26465-16 Open Source Consulting www: www.runtux.com Reichergasse 131, A-3411 Weidling email: office@runtux.com |
|||
msg7136 | Author: [hidden] (rouilj) | Date: 2021-03-23 21:25 | |
Hi Ralf: Responding to two updates here. In message <20210323103411.z4l6lop3tqihfcp3@runtux.com>, Ralf Schlatterbeck writes: >On Mon, Mar 22, 2021 at 08:37:17PM +0000, John Rouillard wrote: >> Ping was anything done with these patches? If not is anybody able to >> update them to the current 2.0 code base and python3ize them with >> tests? > >This is implemented with filter_iter a long time ago. > >But it isn't used in the html framework. What has to be done to get this implemented? I see that tests in test/db_test_base.py do include calls to filter_iter. Do we need a test for the templating system (e.g. LinkHTMLProperty::menu())? The data in the database and schema probably need to be set up to test the conditions: user is not allowed to see one of the items in the class one of the items in the class is retired to make sure we hit the 3 main paths through the code. Final test is to compare the html. Then we can rework menu() to use filter_iter. Question should the html be tested as a string or by comparing the two parse trees. I think comparing by tree is more robust, but I don't know of any tools in python to do that. Simple html parser, difflib etc. don't quite fill the niche. While I was looking at the code, we support xhtml and html4. At this point I think xhtml is kind of dead, but we probably should maintain it. However should we add html5 to the output format? In most cases html4 and html5 would be the same but .... If we do this should we change the templates to html5 format (doctype declaration, lang= attribute etc.). Also I noticed that there is a checklist() spec commented out at the end of the menu() definition. I see a reference to checklist in the customizing.doc and in doc/html_extra/spec.html. Do you know if there was ever an implementation for this (series of checkboxes to select options (e.g. for search) or radiobuttons (with a none/unset value) for setting the value for a Link)? Would have been nice to have as I had to implement these at the templating level. In message <20210323105412.nu5ohfxsxk35awp5@runtux.com>, >On Tue, Mar 23, 2021 at 10:34:14AM +0000, Ralf Schlatterbeck wrote: >> And: If the code in the loop uses Multilinks, these *are* fetched in >> additional SQL queries (Multilinks are fetched lazily). > >This should be clarified: If the code in the loop accesses Multilinks >properties of the node, those would be fetched separately. Which >certainly does *not* happen in the original use-case of optimizing >LinkHTMLProperty.menu and will not happen in many other use cases (like, >e.g., fetching rows for display in index templates in the web interface >in most cases I've seen). To make sure I understand. If issue.index.html included the superseder property, that would trigger this addition set of lookups (one per id in the superseder list) correct? Also if we use filter_iter we can't sort by multilink in the sql. Can we detect if the user is requesting a multilink sort and fall back to filter()? I think sorting by multilink might be a requirement for index views when grouping by a multilink. (I agree with you that sorting by multilink is confusing.) Do you know if the multilink 1,9,10 has the same value as the multilink 9,10,1? I think it must for sorting at the db level to make any sense but .... Have a great week. |
|||
msg7138 | Author: [hidden] (schlatterbeck) | Date: 2021-03-24 08:58 | |
On Tue, Mar 23, 2021 at 09:25:28PM +0000, John Rouillard wrote: > >This is implemented with filter_iter a long time ago. > > > >But it isn't used in the html framework. > > What has to be done to get this implemented? I see that tests in > test/db_test_base.py do include calls to filter_iter. > > Do we need a test for the templating system > (e.g. LinkHTMLProperty::menu())? The data in the database and schema > probably need to be set up to test the conditions: Yes, that would be cool, I think the HTML layer is currently largely untested. > Question should the html be tested as a string or by comparing the two > parse trees. I think comparing by tree is more robust, but I don't > know of any tools in python to do that. Simple html parser, difflib etc. > don't quite fill the niche. Beautiful soup (bs4) is a nice tool for parsing html and coding regression tests. But this would be an external dependency. Note that we already optionally use it for parsing incoming HTML emails. On the other hand the created string is not changing too much, so a string compare would probably be enough. > While I was looking at the code, we support xhtml and html4. At this > point I think xhtml is kind of dead, but we probably should maintain > it. However should we add html5 to the output format? In most cases > html4 and html5 would be the same but .... If we do this should we > change the templates to html5 format (doctype declaration, lang= > attribute etc.). Yes that would be nice. I think the generated html type can currently be configured. I'd go for html5 for the standard templates of the classic tracker. > Also I noticed that there is a checklist() spec commented out at the > end of the menu() definition. I see a reference to checklist in the > customizing.doc and in doc/html_extra/spec.html. Do you know if there > was ever an implementation for this (series of checkboxes to select > options (e.g. for search) or radiobuttons (with a none/unset value) > for setting the value for a Link)? Would have been nice to have as I > had to implement these at the templating level. No I haven't used this ever. I'm mainly using classhelp, this pops up a separate window where you can select things that end up in the input field when pressing a button. > >This should be clarified: If the code in the loop accesses Multilinks > >properties of the node, those would be fetched separately. Which > >certainly does *not* happen in the original use-case of optimizing > >LinkHTMLProperty.menu and will not happen in many other use cases (like, > >e.g., fetching rows for display in index templates in the web interface > >in most cases I've seen). > > To make sure I understand. If issue.index.html included the superseder > property, that would trigger this addition set of lookups (one per id > in the superseder list) correct? Code: for id in db.someclass.filter_iter(...): node = db.someclass.getnode(id) x = node.some_multilink <- This triggers the SQL lookup of the multilink And, yes, concerning your question: As soon as something accesses a multilink inside the loop it would be materialized with an additional query. I don't find this too worrying: - If a class has several multilinks only the used ones are materialized, e.g., 'files' and 'messages' would not be queried for in your example - The multilink is a separate table in SQL anyway. Better to do this lazily than materialize the data up-front with a large join. > Also if we use filter_iter we can't sort by multilink in the sql. Can > we detect if the user is requesting a multilink sort and fall back to > filter()? I think sorting by multilink might be a requirement for > index views when grouping by a multilink. (I agree with you that > sorting by multilink is confusing.) The internal data structure that represents the tree for the generated sql is aware of the fact that there is sorting by multilinks. So in theory this is possible to detect. On the other hand: I do not think that a list sorted by multilinks is comprehensible to humans. So I guess we should drop this feature and just ignore it during searches. Maybe issue a warning. In addition it is probably not easy to understand why some queries take a lot more time than others. > Do you know if the multilink 1,9,10 has the same value as the > multilink 9,10,1? I think it must for sorting at the db level to make > any sense but .... Yes, they have the same value. First the individual values are sorted (by the order property of the multilink) then the dataset is sorted by the resulting lists (again by order prop of course). > Have a great week. Same to you! Ralf -- Dr. Ralf Schlatterbeck Tel: +43/2243/26465-16 Open Source Consulting www: www.runtux.com Reichergasse 131, A-3411 Weidling email: office@runtux.com |
History | |||
---|---|---|---|
Date | User | Action | Args |
2021-03-24 08:58:34 | schlatterbeck | set | messages: + msg7138 |
2021-03-23 21:25:28 | rouilj | set | messages: + msg7136 |
2021-03-23 10:54:14 | schlatterbeck | set | messages: + msg7135 |
2021-03-23 10:34:14 | schlatterbeck | set | messages: + msg7134 |
2021-03-22 20:37:17 | rouilj | set | messages: + msg7132 |
2016-06-25 20:52:46 | rouilj | set | nosy:
+ rouilj messages: + msg5622 |
2010-01-04 14:05:47 | stefan | set | messages: + msg3957 |
2010-01-04 10:33:00 | schlatterbeck | set | messages: + msg3956 |
2010-01-03 17:10:58 | stefan | set | messages: + msg3950 |
2010-01-03 15:13:33 | schlatterbeck | set | messages: + msg3949 |
2010-01-03 14:25:06 | stefan | set | messages: + msg3948 |
2010-01-03 12:07:09 | schlatterbeck | set | nosy:
+ schlatterbeck messages: + msg3947 |
2009-07-20 06:14:25 | richard | set | messages: + msg3818 |
2009-07-19 23:12:51 | stefan | set | messages: + msg3816 |
2009-03-20 16:10:33 | stefan | set | messages: + msg3665 |
2009-03-20 15:40:47 | stefan | set | status: new -> open files: + getnodes.patch messages: + msg3664 |
2009-02-21 15:06:06 | rawler | set | files:
+ roundup-dbperf-improve.diff nosy: + rawler messages: + msg3567 keywords: + patch |
2009-02-21 02:12:55 | richard | set | messages: + msg3566 |
2009-02-21 02:00:00 | stefan | create |