Issue 2550514
Created on 2009-02-21 02:00 by stefan, last changed 2016-06-25 20:52 by rouilj.
Files
File name | Uploaded | Description
---|---|---
roundup-dbperf-improve.diff | rawler, 2009-02-21 15:06 |
getnodes.patch | stefan, 2009-03-20 15:40 |
Messages | |||
---|---|---|---|
msg3565 | Author: [hidden] (stefan) | Date: 2009-02-21 01:59 | |
The LinkHTMLProperty.menu() function does conceptually this:

---
options = [o for o in filter(spec) if permitted('View', o)]
for o in options:
    add_menu_item(o)
---

This seemingly benign chunk of code may involve a lot of SQL calls:

* at least one SQL call inside the filter() call, then possibly N more in the N calls to permitted() (where N is the number of objects returned by filter()),
* N more SQL calls, as add_menu_item() needs to fetch the label property for each item.

So, 2N + 1 distinct SQL calls to fill a single menu with N options. I believe at least some of this can be optimized:

1) With an enhanced version of filter(), not only the matching node ids would be returned, but the requested properties, too. This would get rid of the SQL calls inside add_menu_item().

2) The permission checks could possibly be fused, too. However, as permissions are in general handled outside SQL, this is much harder.

Therefore, I'd suggest enhancing filter() so that instead of simply returning matching node ids, it returns node ids together with the requested properties:

filter(search_matches, filterspec, sort=[], group=[], properties=[])

would return a list of items, where each item is either a nodeid, if 'properties' is empty, or otherwise a tuple containing the nodeid as well as all requested properties of that node.

Comments? |
|||
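The fused-query contract proposed in msg3565 can be sketched with an in-memory stand-in. This is a minimal illustration only: `MockClass` and its data are invented for the example and are not Roundup's actual hyperdb API; the point is the return-shape change, where passing `properties` makes one call return ids together with the label data that add_menu_item() would otherwise fetch one node at a time.

```python
class MockClass:
    """Toy stand-in for a Roundup class, illustrating the proposed
    filter() signature: ids only by default, (id, prop, ...) tuples
    when 'properties' is given."""

    def __init__(self, nodes):
        self.nodes = nodes  # maps nodeid -> {property: value}

    def filter(self, search_matches=None, filterspec=None,
               sort=(), group=(), properties=()):
        ids = sorted(self.nodes)
        if not properties:
            return ids  # today's behaviour: ids only
        # Proposed behaviour: fuse the property fetch into the same call.
        return [(nid,) + tuple(self.nodes[nid][p] for p in properties)
                for nid in ids]

status = MockClass({
    "1": {"name": "open"},
    "2": {"name": "closed"},
})

print(status.filter())                      # ids only
print(status.filter(properties=["name"]))   # ids fused with the label
```

With the fused form, menu() could build all its items from a single result set instead of issuing one label lookup per option.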
msg3566 | Author: [hidden] (richard) | Date: 2009-02-21 02:12 | |
Sounds good to me. Was something I mulled over ages ago. |
|||
msg3567 | Author: [hidden] (rawler) | Date: 2009-02-21 15:06 | |
I had arrived at some slightly different solutions, although what you call "getnodes" I've called "fetch()". It fetches nodes based on a collection of ids, using the 'using' SQL keyword. It also features defaults for the other backends.

There are also a few smaller optimisations, including caching pre-fetch() results for batch()es. Also, in a few places identified by profiling, I've replaced multiple self.db calls with a local db variable. (Tip from http://wiki.python.org/moin/PythonSpeed/PerformanceTips#Avoidingdots...)

In general, though, I think that filter() returning complete nodes is the way to go. The get() API will probably only ever perform well on DBM-style direct-file databases. Other databases, in particular the *SQL ones, should attempt to fetch as many rows in one go as possible, and the API should encourage that. |
|||
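The batched fetch() idea can be sketched against an in-memory SQLite database. This is a hedged illustration, not the patch's actual code: the `_status` table, column names, and the `fetch()` signature are made up for the example; the real patch targets Roundup's backend schema. The technique shown is the core one: pull many nodes in a single `SELECT ... WHERE id IN (...)` round trip instead of issuing one query per id.

```python
import sqlite3

# A throwaway in-memory table standing in for a Roundup class table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE _status (id INTEGER PRIMARY KEY, _name TEXT)")
conn.executemany("INSERT INTO _status VALUES (?, ?)",
                 [(1, "open"), (2, "closed"), (3, "deferred")])

def fetch(ids, props=("_name",)):
    """Fetch several nodes with a single SELECT, one backend round trip."""
    marks = ",".join("?" * len(ids))
    cols = ",".join(("id",) + tuple(props))
    cur = conn.execute(
        "SELECT %s FROM _status WHERE id IN (%s)" % (cols, marks), ids)
    return dict((row[0], row[1:]) for row in cur)

print(fetch([1, 3]))  # two nodes, one query
```

Fetching N nodes this way costs one query instead of N, which is exactly the IPC overhead the thread is trying to eliminate.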
msg3664 | Author: [hidden] (stefan) | Date: 2009-03-20 15:40 | |
Here is a patch that does two things (besides some cleanup):

* it adds 'getnodes()' methods that allow for more compact SQL queries;
* it adds a 'properties' argument to the 'filter()' methods, allowing the query for properties to be fused with the original filter (again, to reduce the number of required SQL queries).

My plan is to adjust various functions (notably in cgi.templating) to use these enhancements to reduce the number of queries made on the database, and thus speed up page rendering. How does this look? |
|||
msg3665 | Author: [hidden] (stefan) | Date: 2009-03-20 16:10 | |
I just realized the patch contains some experimental bits not meant to be included here. Sorry for the noise. |
|||
msg3816 | Author: [hidden] (stefan) | Date: 2009-07-19 23:12 | |
Richard, any opinion on this? |
|||
msg3818 | Author: [hidden] (richard) | Date: 2009-07-20 06:14 | |
Stefan, your plan seems sound, go for it. |
|||
msg3947 | Author: [hidden] (schlatterbeck) | Date: 2010-01-03 12:07 | |
When iterating over *many* results from a database query, we should have an iterator interface instead of fetching all (possibly thousands of) whole nodes at once. I've experimented with queries that return thousands of nodes and simply enlarged roundup's node cache. The result is only marginally faster, due to two things:

- lots of SQL queries
- memory allocation for the cache

Only the first will be addressed by the proposed patch... Maybe we return an iterator from getnodes? |
|||
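The iterator interface suggested above can be sketched as a generator that fetches in fixed-size chunks: each chunk costs one backend call, but only one chunk of nodes is held in memory at a time. The names here (`getnodes` as a free function, `fake_backend`, the `(id, node)` pair shape) are illustrative assumptions, not Roundup's actual API.

```python
def getnodes(ids, fetch_chunk, chunksize=100):
    """Yield (id, node) pairs lazily, fetching `chunksize` ids per
    backend call, so huge result sets never sit in memory at once."""
    for start in range(0, len(ids), chunksize):
        chunk = ids[start:start + chunksize]
        for nid, node in fetch_chunk(chunk):
            yield nid, node

# A toy backend standing in for one fused SQL query per chunk.
def fake_backend(chunk):
    return [(nid, {"title": "node %s" % nid}) for nid in chunk]

ids = [str(i) for i in range(1, 6)]
for nid, node in getnodes(ids, fake_backend, chunksize=2):
    print(nid, node["title"])
```

A caller that really needs everything materialized can still write `list(getnodes(...))`, so the lazy form subsumes the eager one.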
msg3948 | Author: [hidden] (stefan) | Date: 2010-01-03 14:25 | |
On 01/03/2010 07:07 AM, Ralf Schlatterbeck wrote:
> When iterating over *many* results from a database query, we should have
> an iterator interface instead of fetching (possibly thousands of) all
> whole nodes at once. [...] Maybe we return an iterator from getnodes?

I agree that an iterator is a good idea. Though the bottleneck we observed (and which led me to propose a change in the API) is due to IPC overhead caused by many individual queries, instead of a few fused ones. |
|||
msg3949 | Author: [hidden] (schlatterbeck) | Date: 2010-01-03 15:13 | |
On Sun, Jan 03, 2010 at 02:25:06PM +0000, Stefan Seefeld wrote:
> I agree that an iterator is a good idea. Though the bottleneck we
> observed (and which led me to propose a change in the API) is due to IPC
> overhead caused by many individual queries, instead of a few fused ones.

I've also done some measurements. Roundup uses an internal node cache of 100 items by default (this is a config option). I've done some experiments with a large query (which returns about 140000 items):

Cache size | Time (minutes)
---|---
2 | 1:54-2:36
10 | 1:29-1:35 (only 2 measurements)
100 | 1:14-1:37
100000 | 1:23-1:35 (usually at the lower end)

So we see diminishing returns on larger cache sizes inside roundup. The number of queries when going from cache size 100 to 100000 goes down from 240000 to 140000 (!) due to not re-querying the same nodes (I'm using some of the nodes more than once in the computation). The time requirement is memory-bound. In all cases the initial SQL query that returns all the IDs takes less than 1 second of real time.

So I guess my example can really profit from using an iterator -- especially since the whole database is in RAM, because I configured the PostgreSQL cache large enough. In these cases the database thread and the roundup thread consuming the query data can run in parallel.

Ralf
--
Dr. Ralf Schlatterbeck, Open Source Consulting
Reichergasse 131, A-3411 Weidling
Tel: +43/2243/26465-16, Fax: +43/2243/26465-23
www: http://www.runtux.com, email: office@runtux.com
osAlliance member, email: rsc@osalliance.com |
|||
msg3950 | Author: [hidden] (stefan) | Date: 2010-01-03 17:10 | |
On 01/03/2010 10:13 AM, Ralf Schlatterbeck wrote:
> So I guess my example can really profit from using an iterator --
> especially since the whole database is in RAM, because I configured the
> PostgreSQL cache large enough. In these cases the database thread and
> the roundup thread consuming the query data can run in parallel.

I'd suggest creating a new issue for the iterator, so we can keep this one for the fused SQL queries idea I submitted initially.

Thanks, Stefan |
|||
msg3956 | Author: [hidden] (schlatterbeck) | Date: 2010-01-04 10:33 | |
On Sun, Jan 03, 2010 at 05:10:58PM +0000, Stefan Seefeld wrote:
> I'd suggest creating a new issue for the iterator, so we can keep
> this one for the fused SQL queries idea I submitted initially.

Why not make the initial implementation of `getnodes` an iterator? Your initial use-case in msg3565 is nicely implemented using an iterator. And if we really have a case where we want the nodes to be materialized, we can always write

    nodes = list(db.someclass.getnodes(...))

or am I missing something?

Ralf |
|||
msg3957 | Author: [hidden] (stefan) | Date: 2010-01-04 14:05 | |
On 01/04/2010 05:33 AM, Ralf Schlatterbeck wrote:
> Why not make the initial implementation of `getnodes` an iterator? Your
> initial use-case in msg3565 is nicely implemented using an iterator.
> [...] or am I missing something?

No, that sounds fine. All I'm saying is that this issue is about reducing the number of SQL queries. Your suggestion is certainly good, but tangential to the problems I'm describing here.

Thanks, Stefan |
|||
msg5622 | Author: [hidden] (rouilj) | Date: 2016-06-25 20:52 | |
Open questions: was this implemented using an iterator for getnodes, as agreed in the final message? Does the code as implemented in the patch work for anydbm and all RDBMS back ends? |
History | |||
---|---|---|---|
Date | User | Action | Args |
2016-06-25 20:52:46 | rouilj | set | nosy: + rouilj; messages: + msg5622 |
2010-01-04 14:05:47 | stefan | set | messages: + msg3957 |
2010-01-04 10:33:00 | schlatterbeck | set | messages: + msg3956 |
2010-01-03 17:10:58 | stefan | set | messages: + msg3950 |
2010-01-03 15:13:33 | schlatterbeck | set | messages: + msg3949 |
2010-01-03 14:25:06 | stefan | set | messages: + msg3948 |
2010-01-03 12:07:09 | schlatterbeck | set | nosy: + schlatterbeck; messages: + msg3947 |
2009-07-20 06:14:25 | richard | set | messages: + msg3818 |
2009-07-19 23:12:51 | stefan | set | messages: + msg3816 |
2009-03-20 16:10:33 | stefan | set | messages: + msg3665 |
2009-03-20 15:40:47 | stefan | set | status: new -> open; files: + getnodes.patch; messages: + msg3664 |
2009-02-21 15:06:06 | rawler | set | files: + roundup-dbperf-improve.diff; nosy: + rawler; messages: + msg3567; keywords: + patch |
2009-02-21 02:12:55 | richard | set | messages: + msg3566 |
2009-02-21 02:00:00 | stefan | create |