Yesterday afternoon across three development machines — all different platforms — we started to get failures logging in to the site we were working on. This of course brought all work to a halt whilst we tried to work out what was going on.
By early evening we'd spent hours going through the code, and we did know a few things.
The sequence of events that you see during log in is something like this:
What the user sees is that they log in, only to be presented with the log in page again. Very annoying.
By early yesterday evening we were spelunking through the session saving code (we'd already spent some hours going through the authentication code making sure that was working). Django pickles the session data and is able to save it to one of several places depending on the session back end in use. There is a plug-in architecture for the session back ends that I don't really like — it complicates configuration for no real gain as they could have just had a separate middleware for each of the back ends instead.
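The plug-in selection works from a single settings value rather than a per-back-end middleware. Roughly, the configuration looks like this (Django 1.0-era settings, shown purely as an illustration):

```python
# settings.py -- the session back end is chosen by one string,
# not by installing a different middleware per back end.
SESSION_ENGINE = 'django.contrib.sessions.backends.db'    # store sessions in the database
# Alternatives include:
#   'django.contrib.sessions.backends.file'   # filesystem storage
#   'django.contrib.sessions.backends.cache'  # cache-backed storage

MIDDLEWARE_CLASSES = (
    'django.contrib.sessions.middleware.SessionMiddleware',
    # ...
)
```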
Anyway, a few hours and another load of print statements later, we could see that the correct session data was being sent to the database, but the wrong session data was being read back out. This was going in:
gAJ9cQEoVRJfYXV0aF91c2VyX2JhY2tlbmRxAlUpZGphbmdvLmNvbnRyaWIuYXV0aC5iYWNrZW5k cy5Nb2RlbEJhY2tlbmRxA1UNX2F1dGhfdXNlcl9pZHEESwF1LjVjMDI4MWRjOGM0NmVhZDhjY2Qw ODFkMmI0ZjUxMmIx
But this was being read:
The session data itself is pickled, and that pickled data is salted (with the random secret string from the settings) and hashed (MD5). This hash is checked when the data is loaded back in, so we knew the data was legitimate, and as we had print statements all over the session saving code we knew that the session code was doing the right thing in the right order.
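That scheme can be sketched in a few lines. This is only an approximation of what Django's session back end does internally (the `SECRET_KEY` here is a stand-in for the real setting):

```python
import base64
import hashlib
import pickle

SECRET_KEY = 'not-the-real-secret'  # stands in for settings.SECRET_KEY

def encode(session_dict):
    """Pickle the session dict and append a salted MD5 of the pickle."""
    pickled = pickle.dumps(session_dict)
    tamper_check = hashlib.md5(pickled + SECRET_KEY.encode()).hexdigest().encode()
    return base64.encodebytes(pickled + tamper_check)

def decode(session_data):
    """Reverse encode(), refusing data whose hash does not match."""
    raw = base64.decodebytes(session_data)
    # An MD5 hexdigest is always 32 characters, so split it off the end.
    pickled, tamper_check = raw[:-32], raw[-32:]
    expected = hashlib.md5(pickled + SECRET_KEY.encode()).hexdigest().encode()
    if tamper_check != expected:
        raise ValueError('session data appears to have been tampered with')
    return pickle.loads(pickled)
```

The tamper check is why we could trust the data coming out of the database: a hash that verifies means the bytes really were written by code holding the secret key.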
We'd also discovered that when a new session is created it is saved whilst still empty. This save worked (we saw the data in the database), but the second save, after the session had been populated, didn't appear in the database even though we weren't seeing any error.
We're using the psycopg2 back end to talk to Postgres. We run slightly different versions of Postgres on our Linux and Windows boxes and were seeing the same behaviour on both platforms. We were also seeing the same behaviour across different versions of psycopg2 and databases built in different ways. On Linux using runserver the problem was always there; on Windows using PyISAPIe it was sometimes there.
As we knew that the session code was correct, the next place to look was the tangle of systems where Django meets the database and the environment exerts a strong influence. This is not a great place to be debugging, as there are too many moving parts interacting in subtle ways.
Thankfully Django makes it fairly easy to see the SQL that it creates. It didn't take us long to find the INSERT¹ [1: There is a really nasty hack down here in the O/RM that determines whether to do an INSERT or an UPDATE, but that's a story for another day.] statement that created the empty session:
INSERT INTO "django_session" ("session_key", "session_data", "expire_date") VALUES (E'e1dd12acd758224c7ee1391471ae7470', E'gAJ9cQEuN2M1N2I4YzA2NDVmMDFjZDZhNmEwYTUwNmRlNzBhODY= ', E'2008-11-27 16:39:58.165905')
Then a bit later on the UPDATE statement that saved (or rather didn't save) the session data with the authentication in it:
UPDATE "django_session" SET "session_data" = E'gAJ9cQEoVRJfYXV0aF91c2VyX2JhY2tlbmRxAlUpZGphbmdvLmNvbnRyaWIuYXV0aC5iYWNrZW5k cy5Nb2RlbEJhY2tlbmRxA1UNX2F1dGhfdXNlcl9pZHEESwF1LjVjMDI4MWRjOGM0NmVhZDhjY2Qw ODFkMmI0ZjUxMmIx ', "expire_date" = E'2008-11-27 16:39:58.177304' WHERE "django_session"."session_key" = E'e1dd12acd758224c7ee1391471ae7470'
Running this UPDATE at an SQL prompt would log me in, so I knew that the SQL was good. But why wasn't it updating the database when Django was running it?
A few more print statements in Django's model code that generates the SQL produced this debug trace:
[ (<django.db.models.fields.TextField object at 0x2c836d0>, None, 'gAJ9cQEoVRJfYXV0aF91c2VyX2JhY2tlbmRxAlUpZGphbmdvLmNvbnRyaWIuYXV0aC5iYWNrZW5k\n cy5Nb2RlbEJhY2tlbmRxA1UNX2F1dGhfdXNlcl9pZHEESwF1LjVjMDI4MWRjOGM0NmVhZDhjY2Qw\n ODFkMmI0ZjUxMmIx\n'), (<django.db.models.fields.DateTimeField object at 0x2c83790>, None, u'2008-11-27 16:39:58.177304') ] 1
You can see the two fields that are being updated by the SQL, and the number one at the end. That number is the row count returned by the database layer: the number of rows that the database layer says have been updated. So Django runs the UPDATE, the database says a row has been updated, and yet no data is changed anywhere. Grrr.
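This apparent contradiction is exactly what an uncommitted transaction looks like from the inside: the driver reports a row touched, but if the transaction never commits the change evaporates. The effect can be reproduced with any transactional database; here is a small sqlite3 illustration (not the psycopg2/Postgres setup from the story):

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE django_session (session_key TEXT PRIMARY KEY, session_data TEXT)')
conn.execute("INSERT INTO django_session VALUES ('abc', 'empty')")
conn.commit()

# Update inside a transaction: the driver dutifully reports one row touched...
cur = conn.execute("UPDATE django_session SET session_data = 'populated' "
                   "WHERE session_key = 'abc'")
print(cur.rowcount)   # prints: 1

# ...but if the transaction ends in a rollback instead of a commit,
# the "successful" update leaves no trace in the database.
conn.rollback()
row = conn.execute("SELECT session_data FROM django_session "
                   "WHERE session_key = 'abc'").fetchone()
print(row[0])   # prints: empty
```

A row count of 1 only means the statement matched a row inside its transaction; it says nothing about whether that transaction will ever commit.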
I already knew that there were some circumstances under which Postgres would ignore SQL commands during a transaction if an error had occurred. I would expect to see some notification of the error, but maybe there was something related going on, so I put some debug traces on the transaction middleware to check what was happening there. That led to this interesting trace:
admin True
Commits and leaves transaction management.
setting session
The first line is the authentication trace for the admin account and shows that the authentication information is inside the session data. The next line shows the database commit and the last shows the start of the session data save — the session data was being saved after the transaction had been committed.
A little more poking showed me that the session data is saved not when the user is authenticated, but as the request completes within the middleware. Making sure that the transaction middleware was earlier in the stack (so that it would be executed later, when the request processing was complete) ought to bring the session save inside the transaction, and that ought to fix the problem.
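Django runs response middleware in the reverse of the order listed in settings, so moving the transaction middleware up the list pushes its commit later on the response path. The fix amounts to an ordering like this (illustrative; module paths are the Django 1.0-era ones):

```python
MIDDLEWARE_CLASSES = (
    # Earlier in the list means later on the response path: the commit in
    # TransactionMiddleware.process_response now happens *after* the
    # session save in SessionMiddleware.process_response.
    'django.middleware.transaction.TransactionMiddleware',
    'django.contrib.sessions.middleware.SessionMiddleware',
    # ... the rest of the stack ...
)
```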
A few seconds later I had log in working. A commit and some checks later I could see it working reliably on all the environments I had available to me late at night, and I could go to bed happy that we wouldn't all be wasting more time the next day trying to sort this out.
If there is one, the moral here is that transactions are hard to get right. Given the slightest opportunity developers will get them wrong. This is why within FOST.3™ I designed it such that the O/RM layer will throw an exception if you try to save changes to the database without first putting a transaction in place. The new design for Fost 4 will take this one step further requiring a transaction even to read anything from the database.
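The guard described there — refuse to save unless a transaction is open — is simple to sketch. FOST itself is the author's own framework, so this is just the pattern in Python, with every name hypothetical:

```python
class NoTransactionError(Exception):
    """Raised when a save is attempted outside a transaction."""

class Connection:
    """Toy connection illustrating the 'no save without a transaction' rule."""
    def __init__(self):
        self.in_transaction = False
        self.rows = {}

    def begin(self):
        self.in_transaction = True

    def commit(self):
        self.in_transaction = False

    def save(self, key, value):
        # The guard: writing outside a transaction is a programming error,
        # surfaced immediately rather than silently mis-ordered later.
        if not self.in_transaction:
            raise NoTransactionError('save() called outside a transaction')
        self.rows[key] = value

conn = Connection()
try:
    conn.save('session', 'data')    # forgot to begin a transaction
except NoTransactionError:
    print('refused: no transaction')

conn.begin()
conn.save('session', 'data')        # fine inside the transaction
conn.commit()
```

Had Django's session save gone through a layer like this, the bug above would have been an immediate exception instead of a silent lost write.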
Like concurrency controls for multi-threaded programs, there is not a lot that library code can do to make transactions work invisibly. Transactions need to be started, stopped and committed at the application layer, because only that high up do you have a proper view of what the correct granularity is. The best library code can do is make the process as foolproof as possible.
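One common way to keep those start/commit/rollback decisions at the application layer while making them hard to botch is a context manager. A sketch using sqlite3:

```python
import sqlite3
from contextlib import contextmanager

@contextmanager
def transaction(conn):
    """Commit on success, roll back on any exception. What goes inside
    the block -- the granularity -- stays an application-level decision."""
    try:
        yield conn
        conn.commit()
    except Exception:
        conn.rollback()
        raise

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE t (k TEXT, v TEXT)')
conn.commit()

with transaction(conn):
    conn.execute("INSERT INTO t VALUES ('session', 'data')")
# committed here

try:
    with transaction(conn):
        conn.execute("INSERT INTO t VALUES ('session', 'doomed')")
        raise RuntimeError('something went wrong mid-request')
except RuntimeError:
    pass    # the doomed insert was rolled back

print(conn.execute('SELECT COUNT(*) FROM t').fetchone()[0])   # prints: 1
```

(Python's sqlite3 connections support essentially this pattern natively when used as context managers.)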
I can't really see any change to Django that enforces proper transaction handling making it into trunk, as it would break a lot of systems — arguably those systems are broken anyway, but for people using non-transactional database back ends I can well see that they wouldn't want to pay the price of all the extra checks. Maybe it's something that could be built into the database back ends, but their design is still quite unclear to me: I never found the Django code that wraps the submission of an SQL statement to the database. There are quite a few design differences in how the databases are talked to that make this checking much easier in Fost than in Django.
The one thing that we can't work out at all is why this suddenly started to affect us yesterday. I tracked down the revision that added the transaction middleware — it's from March 13th — and the order between the session and transaction middleware hadn't been changed until late last night.