Issue 2550727
Created on 2011-09-15 17:05 by mpm, last changed 2016-07-14 02:31 by rouilj.
Files
File name | Uploaded | Description
---|---|---
newid_with_transactions.diff | rouilj, 2016-07-10 17:06 |
Messages
msg4423 | Author: [hidden] (mpm) | Date: 2011-09-15 17:05 | |
Calls to db.newid in separate processes can race, resulting in duplicate ids and failed database insertion.

    Traceback (most recent call last):
      File "/usr/lib/pymodules/python2.6/roundup/mailgw.py", line 1388, in handle_Message
        return self.handle_message(message)
      File "/usr/lib/pymodules/python2.6/roundup/mailgw.py", line 1444, in handle_message
        return self._handle_message(message)
      File "/usr/lib/pymodules/python2.6/roundup/mailgw.py", line 1522, in _handle_message
        parsed_message.create_msg()
      File "/usr/lib/pymodules/python2.6/roundup/mailgw.py", line 1098, in create_msg
        messageid=messageid, inreplyto=inreplyto, **msg_props)
      File "/usr/lib/pymodules/python2.6/roundup/backends/rdbms_common.py", line 2945, in create
        newid = self.create_inner(**propvalues)
      File "/usr/lib/pymodules/python2.6/roundup/backends/rdbms_common.py", line 1630, in create_inner
        self.db.addnode(self.classname, newid, propvalues)
      File "/usr/lib/pymodules/python2.6/roundup/backends/rdbms_common.py", line 948, in addnode
        self.sql(sql, vals)
      File "/usr/lib/pymodules/python2.6/roundup/backends/rdbms_common.py", line 219, in sql
        cursor.execute(sql, args)
    IntegrityError: PRIMARY KEY must be unique

Recommend atomically incrementing the id first, and then reading it. |
|||
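[Editorial note] The "increment first, then read" suggestion at the end of the message above can be sketched with Python's standard sqlite3 module. The table layout and function name below are illustrative, not roundup's actual code: the UPDATE acquires sqlite's write lock before any value is observed, so a second process cannot read the same counter value.

```python
import sqlite3

def newid_atomic(conn, classname):
    # Take the write lock first by incrementing, then read the value
    # back: the UPDATE starts sqlite's write transaction, so a second
    # process blocks until this one commits and cannot observe the
    # same counter value.
    cur = conn.cursor()
    cur.execute('update ids set num = num + 1 where name = ?', (classname,))
    cur.execute('select num from ids where name = ?', (classname,))
    newid = cur.fetchone()[0] - 1   # the value this call reserved
    conn.commit()
    return str(newid)

conn = sqlite3.connect(':memory:')
conn.execute('create table ids (name text primary key, num integer)')
conn.execute("insert into ids values ('issue', 1)")
conn.commit()
print(newid_atomic(conn, 'issue'))  # '1'
print(newid_atomic(conn, 'issue'))  # '2'
```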
msg4424 | Author: [hidden] (ber) | Date: 2011-09-16 07:30 | |
Hi Matt, thanks for this report and for using or trying roundup! Can you say more about how you encountered this issue? This might make it easier for me to reproduce. (Somehow I was always under the impression that write access to the database would be locked and then serialised by the lock, but Richard will know better and I never checked.) Thanks, Bernhard |
|||
msg4425 | Author: [hidden] (mpm) | Date: 2011-09-16 16:21 | |
We get this on the Mercurial project when we close bugs in commit messages. We have a robot that scans each commit for the close signature and fires off an email to our bts mail gateway. Each message to the gateway starts a separate roundup process. We hit this race quite consistently. I think this is helped by Postfix, which scans its input queues on a timer, and thus consistently fires off messages simultaneously on different cores.

As I understand sqlite3/pysqlite's locking model, pysqlite opens the database in shared mode and automatically opens and closes transactions for certain kinds of SQL statements. So we end up with multiple processes looking up the msgid simultaneously, adding one, then doing write transactions (that succeed). Then we next go on to try to insert two messages with the same message id and fail on the second one. There's nothing anywhere in the sqlite backend that looks like locking. |
|||
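[Editorial note] The lost-update race described above can be reproduced in miniature with two connections to the same database file standing in for two roundup processes (a sketch, not the actual mail gateway code):

```python
import sqlite3, tempfile, os

# Two connections to one database file stand in for two processes.
# Both read the counter before either writes it back, so both obtain
# the same "new" id -- the second write silently repeats the first,
# and the later inserts using that id collide.
path = os.path.join(tempfile.mkdtemp(), 'race.db')
a = sqlite3.connect(path)
b = sqlite3.connect(path)
a.execute('create table ids (name text primary key, num integer)')
a.execute("insert into ids values ('msg', 1)")
a.commit()

id_a = a.execute("select num from ids where name='msg'").fetchone()[0]
id_b = b.execute("select num from ids where name='msg'").fetchone()[0]
a.execute("update ids set num=? where name='msg'", (id_a + 1,))
a.commit()
b.execute("update ids set num=? where name='msg'", (id_b + 1,))
b.commit()
print(id_a == id_b)  # True: both "processes" reserved the same id
```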
msg4429 | Author: [hidden] (ber) | Date: 2011-09-19 07:15 | |
Matt, thanks for the additional output. Do you have a workaround available? I would expect postfix to have an option for serializing external application calls. |
|||
msg4432 | Author: [hidden] (mpm) | Date: 2011-09-19 16:24 | |
We stuck a time.sleep(10) in our robot. |
|||
msg5790 | Author: [hidden] (rouilj) | Date: 2016-07-09 03:01 | |
I wonder if using "begin exclusive" in db.newid would solve this. Doing this should start a write transaction on the first select. It should block if some other process has an exclusive lock. This will block any read or write attempt until the commit() call after the update. It should prevent the same value from being read twice.

     def newid(self, classname):
         """ Generate a new id for the given class
         """
    +    print "|%r, %r, %r|"%(self.conn, self.conn.isolation_level,
    +                          self.cursor.connection.isolation_level)
    +
    +    # prevent other processes from reading while we increment,
    +    # otherwise multiple processes can end up with the same
    +    # new id and hilarity results.
    +    self.conn.isolation_level = 'EXCLUSIVE'
    +
    +    # force the creation of a new cursor and close out
    +    # any existing transaction
    +    self.sql_commit()
    +
         # get the next ID
         sql = 'select num from ids where name=%s'%self.arg
         self.sql(sql, (classname, ))
         newid = int(self.cursor.fetchone()[0])

         # update the counter
         sql = 'update ids set num=%s where name=%s'%(self.arg, self.arg)
         vals = (int(newid)+1, classname)
         self.sql(sql, vals)

    +    self.conn.isolation_level = ''  # go to standard begin transaction
    +    # sql_commit creates a new cursor.
    +    self.sql_commit()
    +
         # return as string
         return str(newid)

The first self.sql_commit() may not be needed. The documentation doesn't state whether a change of isolation_level takes effect on the next sql execute call. By committing, I am deleting the old cursor and starting a new cursor that has the requested isolation_level. One test in the python code seems to indicate that the isolation_level can be changed with existing cursors, so the first self.sql_commit() should not be needed, but I can't be sure.

Matt, are you able to try testing this patch? It sounds like you have the environment to produce the error. Also, if you can, test without the first self.sql_commit() and see if it works.

This does slow down some tests, e.g. test/test_sqlite::testIDGeneration goes from 16 seconds to 22 seconds on my laptop. So it's doing something 8-). -- rouilj |
|||
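[Editorial note] For context on the doubt above about when the isolation level takes effect: pysqlite's isolation_level only chooses which BEGIN statement is issued, and that BEGIN is issued lazily, just before the first data-modifying statement, not before a SELECT. A minimal demonstration with the standard sqlite3 module:

```python
import sqlite3

# Setting isolation_level = 'EXCLUSIVE' does not make the next SELECT
# take any lock; the BEGIN EXCLUSIVE is deferred until the first
# INSERT/UPDATE/DELETE.  So the initial read of the ids counter stays
# outside the exclusive transaction.
conn = sqlite3.connect(':memory:')
conn.execute('create table ids (name text, num integer)')
conn.execute("insert into ids values ('issue', 1)")
conn.commit()

conn.isolation_level = 'EXCLUSIVE'
conn.execute('select num from ids')
print(conn.in_transaction)   # False: the SELECT did not trigger a BEGIN
conn.execute("update ids set num = 2 where name = 'issue'")
print(conn.in_transaction)   # True: the UPDATE triggered BEGIN EXCLUSIVE
conn.commit()
```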
msg5811 | Author: [hidden] (rouilj) | Date: 2016-07-10 17:06 | |
I am attaching a patch that should work around pysqlite's handling of transactions and properly create an exclusive lock on the ids table while getting a new id. Note that the inline patch I submitted before didn't do the trick in a small test case. I think it's because the exclusive lock only fires when I try to use DML (data manipulation language), so it never takes effect between the query and the update. This should stop the duplicate ids returned when different processes are accessing the table.

This patch passes the full sqlite test suite. I tried to make test/test_sqlite.py::sqliteConcurrencyTest::testConcurrency fail using the original code so I could see if it passed with this patch, but I couldn't reproduce the failure using the test suite. I also tried it using scripts/add_issue; after 500 issues were added I didn't get any exceptions raised. So I guess I don't have a powerful enough laptop to generate the collisions. |
|||
msg5849 | Author: [hidden] (rouilj) | Date: 2016-07-14 02:31 | |
Figured out how to test this. My original script using add-issue must have been broken. I used the following:

    #! /bin/sh
    #set -xv
    for i in `seq 1 5`; do
      for j in `seq 1 5`; do
        PYTHONPATH=$PWD python scripts/add-issue demo urgent "title $i.$j" &
      done
      wait
    done

to get the following exception:

    Traceback (most recent call last):
      File "scripts/add-issue", line 53, in <module>
        messages=messages)
      File "/home/rouilj/develop/roundup.dev/roundup.hg/roundup/backends/rdbms_common.py", line 1524, in create
        newid = self.create_inner(**propvalues)
      File "/home/rouilj/develop/roundup.dev/roundup.hg/roundup/backends/rdbms_common.py", line 1668, in create_inner
        self.db.addnode(self.classname, newid, propvalues)
      File "/home/rouilj/develop/roundup.dev/roundup.hg/roundup/backends/rdbms_common.py", line 966, in addnode
        self.sql(sql, vals)
      File "/home/rouilj/develop/roundup.dev/roundup.hg/roundup/backends/rdbms_common.py", line 221, in sql
        cursor.execute(sql, args)
    sqlite3.IntegrityError: UNIQUE constraint failed: _issue.id

which I think is the 2016 version of the original 2011 error. I got that exception 8 times running the script above (25 processes total). After my patch, no failures using the loop above.

I then ran the script after increasing the number of processes and loops to 10. Printing the last 20 issues sorted by creation date produces:

    title 9.8
    title 9.1
    title 9.5
    title 9.9
    title 9.3
    title 9.7
    title 9.4
    title 9.2
    title 9.10
    title 9.6
    title 10.4
    title 10.10
    title 10.1
    title 10.2
    title 10.5
    title 10.9
    title 10.8
    title 10.3
    title 10.6
    title 10.7

so all entries are present and accounted for. The start order is not preserved, but I am ok with that.

The default lock timeout set in config.ini is 30 seconds, but I could set it to 10 seconds and still get each batch of 10 to complete. Setting it to 9 produces at least one sqlite3.OperationalError: database is locked exception.
If the default timeout of 30 seconds (set in config.RDBMS_SQLITE_TIMEOUT) needs to be increased because of timeout exceptions, the user probably needs to move to a database with better concurrent access (postgresql, mysql).

I was able to use a BEGIN IMMEDIATE transaction (rather than the BEGIN EXCLUSIVE in the attached patch). The immediate transaction allows reads from the database but prevents another BEGIN IMMEDIATE or write from occurring. Since all reads for the newid occur inside an IMMEDIATE transaction, other reads of the ids table can occur (if there are any) that are outside of reserving/generating a newid. Using immediate transactions seems to have reduced the impact by a small amount over exclusive transactions. But the full test_sqlite test suite takes 40m50s now, compared to 31m5s without the transaction code, probably because of having to recreate the database for every test that hits the newid code heavily. Note that it appears the slowdown is due to i/o wait; a better disk subsystem should help. But at least now it's correct.

Committed this patch as: 9cebf686c552 |
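[Editorial note] The committed approach can be sketched as follows with the standard sqlite3 module; the names and table layout are illustrative, not roundup's exact code:

```python
import sqlite3

def newid(conn, classname):
    # BEGIN IMMEDIATE takes sqlite's write lock before the read, so a
    # concurrent caller's BEGIN IMMEDIATE blocks (up to the connection
    # timeout) until this transaction commits.  Plain reads elsewhere
    # are still allowed while the lock is held.
    cur = conn.cursor()
    cur.execute('BEGIN IMMEDIATE')
    cur.execute('select num from ids where name = ?', (classname,))
    newid = int(cur.fetchone()[0])
    cur.execute('update ids set num = ? where name = ?',
                (newid + 1, classname))
    conn.commit()
    return str(newid)

# isolation_level=None stops pysqlite from issuing its own BEGINs, so
# the explicit BEGIN IMMEDIATE above is the only transaction control.
conn = sqlite3.connect(':memory:', timeout=30, isolation_level=None)
conn.execute('create table ids (name text primary key, num integer)')
conn.execute("insert into ids values ('issue', 1)")
print(newid(conn, 'issue'))  # '1'
```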
History
Date | User | Action | Args
---|---|---|---
2016-07-14 02:31:04 | rouilj | set | status: open -> fixed assignee: rouilj resolution: fixed messages: + msg5849 |
2016-07-11 12:04:43 | mpm | set | nosy: - mpm |
2016-07-10 17:07:03 | rouilj | set | messages: - msg5810 |
2016-07-10 17:06:38 | rouilj | set | files: + newid_with_transactions.diff keywords: + patch messages: + msg5811 |
2016-07-10 17:05:12 | rouilj | set | status: new -> open messages: + msg5810 |
2016-07-09 03:01:03 | rouilj | set | nosy: + rouilj messages: + msg5790 |
2011-09-19 16:24:01 | mpm | set | messages: + msg4432 |
2011-09-19 07:15:31 | ber | set | messages: + msg4429 |
2011-09-16 16:21:13 | mpm | set | type: crash messages: + msg4425 |
2011-09-16 07:30:29 | ber | set | nosy: + ber messages: + msg4424 |
2011-09-15 17:05:41 | mpm | create |