Roundup Tracker - Issues

Issue 2550727

Classification
Title: db.newid is broken with sqlite
Type: crash           Severity: major
Components: Database  Versions: 1.4
Process
Status: fixed         Resolution: fixed
Assigned To: rouilj   Nosy List: ber, rouilj
Priority:             Keywords: patch
Created on 2011-09-15 17:05 by mpm, last changed 2016-07-14 02:31 by rouilj.

Files
File name                     Uploaded                  Description
newid_with_transactions.diff  rouilj, 2016-07-10 17:06
Messages
msg4423 Author: mpm Date: 2011-09-15 17:05
Calls to db.newid in separate processes can race, resulting in duplicate
ids and failed database insertion.

Traceback (most recent call last):
  File "/usr/lib/pymodules/python2.6/roundup/mailgw.py", line 1388, in handle_Message
    return self.handle_message(message)
  File "/usr/lib/pymodules/python2.6/roundup/mailgw.py", line 1444, in handle_message
    return self._handle_message(message)
  File "/usr/lib/pymodules/python2.6/roundup/mailgw.py", line 1522, in _handle_message
    parsed_message.create_msg()
  File "/usr/lib/pymodules/python2.6/roundup/mailgw.py", line 1098, in create_msg
    messageid=messageid, inreplyto=inreplyto, **msg_props)
  File "/usr/lib/pymodules/python2.6/roundup/backends/rdbms_common.py", line 2945, in create
    newid = self.create_inner(**propvalues)
  File "/usr/lib/pymodules/python2.6/roundup/backends/rdbms_common.py", line 1630, in create_inner
    self.db.addnode(self.classname, newid, propvalues)
  File "/usr/lib/pymodules/python2.6/roundup/backends/rdbms_common.py", line 948, in addnode
    self.sql(sql, vals)
  File "/usr/lib/pymodules/python2.6/roundup/backends/rdbms_common.py", line 219, in sql
    cursor.execute(sql, args)
IntegrityError: PRIMARY KEY must be unique

Recommend atomically incrementing the id first, and then reading it.
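
A rough sketch of that suggestion with a bare sqlite3 connection
(illustrative code, not Roundup's; only the ids table layout is taken
from the tracker): doing the UPDATE first makes pysqlite open the write
transaction before anything is read, so two processes can no longer
both see the old counter value.

  import sqlite3

  def newid(conn, classname):
      """Reserve a new id by incrementing first, then reading back."""
      cur = conn.cursor()
      # The UPDATE makes pysqlite issue BEGIN and take sqlite's single
      # write lock; a second process waits (or retries until timeout)
      # here instead of racing.
      cur.execute('update ids set num = num + 1 where name = ?',
                  (classname,))
      # Still inside the same transaction, read the counter back;
      # num now holds the *next* id to hand out, so ours is num - 1.
      cur.execute('select num from ids where name = ?', (classname,))
      newid = int(cur.fetchone()[0]) - 1
      conn.commit()
      return str(newid)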
msg4424 Author: ber Date: 2011-09-16 07:30
Hi Matt,
thanks for this report and for using or trying roundup!

Can you say more about how you encountered this issue?
That might make it easier for me to reproduce.
(Somehow I was always under the impression that write access to the
database would be locked and thus serialised, but Richard will know
better and I never checked.)

Thanks,
Bernhard
msg4425 Author: mpm Date: 2011-09-16 16:21
We get this on the Mercurial project when we close bugs in commit
messages. We have a robot that scans each commit for the close signature
and fires off an email to our bts mail gateway. Each message to the
gateway starts a separate roundup process.

We hit this race quite consistently. I think this is helped by Postfix,
which scans its input queues on a timer, and thus consistently fires off
messages simultaneously on different cores.

As I understand sqlite3/pysqlite's locking model, pysqlite opens the
database in shared mode and automatically opens and closes transactions
for certain kinds of SQL statements. So we end up with multiple
processes looking up the msgid simultaneously, each incrementing it by
one, then doing write transactions (which succeed). Both then try to
insert messages with the same message id, and the second insert fails.

There's nothing anywhere in the sqlite backend that looks like locking.
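
This race is easy to demonstrate with two connections in one process
standing in for two mail gateway processes (a toy sketch; the file
name and schema are invented for the demo):

  import sqlite3

  def setup(path):
      conn = sqlite3.connect(path)
      conn.execute('create table ids (name text, num integer)')
      conn.execute('create table _msg (id integer primary key)')
      conn.execute("insert into ids values ('msg', 1)")
      conn.commit()
      conn.close()

  setup('/tmp/race-demo.db')
  a = sqlite3.connect('/tmp/race-demo.db')
  b = sqlite3.connect('/tmp/race-demo.db')

  # pysqlite does not begin a transaction for a bare SELECT, so both
  # "processes" read the counter before either takes a write lock.
  ida = a.execute("select num from ids where name = 'msg'").fetchone()[0]
  idb = b.execute("select num from ids where name = 'msg'").fetchone()[0]
  assert ida == idb == 1

  # Both write transactions succeed; they serialize, but each one
  # stores the same new counter value.
  a.execute("update ids set num = ? where name = 'msg'", (ida + 1,))
  a.commit()
  b.execute("update ids set num = ? where name = 'msg'", (idb + 1,))
  b.commit()

  a.execute('insert into _msg (id) values (?)', (ida,))
  a.commit()
  # sqlite3.IntegrityError: the second insert reuses id 1.
  b.execute('insert into _msg (id) values (?)', (idb,))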
msg4429 Author: ber Date: 2011-09-19 07:15
Matt, thanks for the additional output.
Do you have a workaround available?
I would expect postfix to have an option for serializing external
application calls.
msg4432 Author: mpm Date: 2011-09-19 16:24
We stuck a time.sleep(10) in our robot.
msg5790 Author: rouilj Date: 2016-07-09 03:01
I wonder if using "begin exclusive" in db.newid would solve this.

Doing this should start a write transaction at the first select, and
it should block if some other process already holds an exclusive lock.
The lock will then block any other read or write attempt until the
commit() call after the update, which should prevent the same value
from being read twice.

     def newid(self, classname):
         """ Generate a new id for the given class
         """
+        print "|%r, %r, %r|"%(self.conn, self.conn.isolation_level, self.cursor.connection.isolation_level)
+
+        # prevent other processes from reading while we increment
+        # otherwise multiple processes can end up with the same
+        # new id and hilarity results.
+        self.conn.isolation_level = 'EXCLUSIVE';
+
+        # force the creation of a new cursor and close out
+        # any existing transaction
+        self.sql_commit();
+
         # get the next ID
         sql = 'select num from ids where name=%s'%self.arg
         self.sql(sql, (classname, ))
         newid = int(self.cursor.fetchone()[0])

         # update the counter
         sql = 'update ids set num=%s where name=%s'%(self.arg, self.arg)
         vals = (int(newid)+1, classname)
         self.sql(sql, vals)
 
+        self.conn.isolation_level = ''; # go to standard begin transaction
+        # sql_commit creates a new cursor.
+        self.sql_commit();
+
         # return as string
         return str(newid)

The first self.sql_commit() may not be needed. The documentation
doesn't say whether a change of isolation_level takes effect on the
next execute call, so by committing I am discarding the old cursor and
starting a new one that has the requested isolation_level.

One test in the Python source seems to indicate that isolation_level
can be changed while cursors exist, so the first self.sql_commit()
should not be needed, but I can't be sure.
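
For what it's worth, the sqlite3 module documents isolation_level as a
plain connection attribute that only selects which BEGIN statement is
issued implicitly before the next data-modifying statement, which
would mean the first sql_commit() is unnecessary. A standalone sketch
of that documented behavior (not Roundup code):

  import sqlite3

  conn = sqlite3.connect('/tmp/demo.db')
  conn.execute('create table if not exists ids (name text, num integer)')
  conn.commit()

  # The attribute lives on the connection, not the cursor, so it can
  # be changed at any time and affects statements run on existing
  # cursors from here on.
  conn.isolation_level = 'EXCLUSIVE'

  cur = conn.cursor()
  # No BEGIN is issued here: SELECT is not a data-modifying statement.
  cur.execute("select num from ids where name = 'issue'")
  # pysqlite issues BEGIN EXCLUSIVE immediately before this UPDATE.
  cur.execute("update ids set num = num + 1 where name = 'issue'")
  conn.commit()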

Matt, are you able to test this patch? It sounds like you have the
environment to reproduce the error. Also, if you can, test without the
first self.sql_commit() and see if it still works.

This does slow down some tests,
e.g. test/test_sqlite::testIDGeneration goes from 16 seconds to 22
seconds on my laptop. So it's doing something 8-).

-- rouilj
msg5811 Author: rouilj Date: 2016-07-10 17:06
I am attaching a patch that should work around pysqlite's
handling of transactions and properly create an exclusive
lock on the ids table while getting a new id.

Note that the inline patch I submitted before didn't do the trick
in a small test case. I think that's because the exclusive lock
only fires when I use DML (data manipulation language) statements,
so it is not held between the query and the update.

This should stop the duplicate ids returned when different
processes are accessing the table.
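
The attached diff is not reproduced inline, but the shape of the fix
is roughly the following (a sketch against the newid() shown in
msg5790; the real patch may differ in details): take the lock yourself
with an explicit BEGIN before the SELECT, instead of waiting for
pysqlite to issue one at the first DML statement.

  def newid(self, classname):
      """ Generate a new id for the given class
      """
      # Close out any transaction pysqlite already has open and stop
      # it from issuing BEGIN itself, so we control the transaction
      # boundaries.
      self.sql_commit()
      self.conn.isolation_level = None

      # Take the lock *now*, so the SELECT below already runs under
      # it; another process blocks here until our commit.
      self.cursor.execute('begin exclusive')

      # get the next ID
      sql = 'select num from ids where name=%s'%self.arg
      self.sql(sql, (classname, ))
      newid = int(self.cursor.fetchone()[0])

      # update the counter
      sql = 'update ids set num=%s where name=%s'%(self.arg, self.arg)
      self.sql(sql, (int(newid)+1, classname))

      # Commit releases the lock; restore pysqlite's usual implicit
      # transaction handling afterwards.
      self.cursor.execute('commit')
      self.conn.isolation_level = ''

      # return as string
      return str(newid)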

This patch passes the full sqlite test suite. I tried to make

  test/test_sqlite.py::sqliteConcurrencyTest::testConcurrency

fail using the original code so I could check whether it passed with
this patch, but I couldn't reproduce the failure using the test suite.
I also tried it using scripts/add-issue; after 500 issues were added I
didn't get any exceptions. So I guess I don't have a powerful enough
laptop to generate the collisions.
msg5849 Author: rouilj Date: 2016-07-14 02:31
Figured out how to test this. My original script using add-issue must
have been broken. I used the following:

#! /bin/sh
#set -xv 
for i in `seq 1 5`; do
    for j in `seq 1 5`; do
        PYTHONPATH=$PWD python scripts/add-issue demo urgent "title $i.$j" &
    done
    wait
done

to get the following exception:

Traceback (most recent call last):
  File "scripts/add-issue", line 53, in <module>
    messages=messages)
  File
"/home/rouilj/develop/roundup.dev/roundup.hg/roundup/backends/rdbms_common.py",
line 1524, in create
    newid = self.create_inner(**propvalues)
  File
"/home/rouilj/develop/roundup.dev/roundup.hg/roundup/backends/rdbms_common.py",
line 1668, in create_inner
    self.db.addnode(self.classname, newid, propvalues)
  File
"/home/rouilj/develop/roundup.dev/roundup.hg/roundup/backends/rdbms_common.py",
line 966, in addnode
    self.sql(sql, vals)
  File
"/home/rouilj/develop/roundup.dev/roundup.hg/roundup/backends/rdbms_common.py",
line 221, in sql
    cursor.execute(sql, args)
sqlite3.IntegrityError: UNIQUE constraint failed: _issue.id

which I think is the 2016 version of the original 2011 error.  I got
that exception 8 times running the script above (25 processes total).

After my patch, no failures using the loop above.

I ran the script again after increasing the number of processes and
loops to 10. Printing the last 20 issues sorted by creation date
produces:

   title 9.8
   title 9.1
   title 9.5
   title 9.9
   title 9.3
   title 9.7
   title 9.4
   title 9.2
   title 9.10
   title 9.6
   title 10.4
   title 10.10
   title 10.1
   title 10.2
   title 10.5
   title 10.9
   title 10.8
   title 10.3
   title 10.6
   title 10.7

so all entries are present and accounted for. The start order is not
preserved, but I am ok with that. The default lock timeout set in
config.ini is 30 seconds; I could set it to 10 seconds and still get
each batch of 10 to complete, but setting it to 9 produces at least one

  sqlite3.OperationalError: database is locked

exception.

If the default timeout of 30 seconds needs to be increased (it is set
via config.RDBMS_SQLITE_TIMEOUT) because of timeout exceptions, the
user probably needs to move to a database with better concurrent
access (PostgreSQL, MySQL).
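
For reference, that option lives in the tracker's config.ini; the
section/option spelling below follows Roundup's usual RDBMS_* naming
convention and is worth double-checking against your version:

  [rdbms]
  # Number of seconds to wait when the sqlite database is locked
  # (exposed in code as config.RDBMS_SQLITE_TIMEOUT).
  sqlite_timeout = 30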

I was able to use a BEGIN IMMEDIATE transaction (rather than the BEGIN
EXCLUSIVE in the attached patch). An immediate transaction allows
reads from the database but prevents another BEGIN IMMEDIATE or a
write from occurring. Since all reads needed to generate a new id
occur inside the IMMEDIATE transaction, other reads of the ids table
(if there are any) that happen outside of reserving/generating a new
id can still proceed.
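
The difference is easy to see with two plain sqlite3 connections (a
demo assuming the default rollback-journal mode; not Roundup code):

  import sqlite3

  w = sqlite3.connect('/tmp/demo.db', timeout=1)
  w.execute('create table if not exists ids (name text, num integer)')
  w.commit()
  r = sqlite3.connect('/tmp/demo.db', timeout=1)

  w.execute('begin immediate')    # write lock: other writers wait,
  r.execute('select * from ids').fetchall()    # but reads get through
  w.rollback()

  w.execute('begin exclusive')    # exclusive lock: readers wait too
  try:
      r.execute('select * from ids').fetchall()
  except sqlite3.OperationalError:
      pass    # "database is locked" once the 1-second timeout expires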

Using immediate transactions seems to have reduced the impact by a
small amount compared to exclusive transactions. But the full
test_sqlite test suite now takes 40m50s compared to 31m5s without the
transaction code, probably because the database has to be recreated
for every test that hits the newid code heavily. Note that the
slowdown appears to be due to I/O wait, so a better disk subsystem
should help. But at least now it's correct.

Committed this patch as: 9cebf686c552
History
Date                 User    Action  Args
2016-07-14 02:31:04  rouilj  set     status: open -> fixed
                                     assignee: rouilj
                                     resolution: fixed
                                     messages: + msg5849
2016-07-11 12:04:43  mpm     set     nosy: - mpm
2016-07-10 17:07:03  rouilj  set     messages: - msg5810
2016-07-10 17:06:38  rouilj  set     files: + newid_with_transactions.diff
                                     keywords: + patch
                                     messages: + msg5811
2016-07-10 17:05:12  rouilj  set     status: new -> open
                                     messages: + msg5810
2016-07-09 03:01:03  rouilj  set     nosy: + rouilj
                                     messages: + msg5790
2011-09-19 16:24:01  mpm     set     messages: + msg4432
2011-09-19 07:15:31  ber     set     messages: + msg4429
2011-09-16 16:21:13  mpm     set     type: crash
                                     messages: + msg4425
2011-09-16 07:30:29  ber     set     nosy: + ber
                                     messages: + msg4424
2011-09-15 17:05:41  mpm     create