Wednesday, April 23, 2014

UPSERTisms in Postgres, part 2: congestion

In my previous post I benchmarked a number of different UPSERTy implementations.  However, several people (rightfully) asked me how the results would change under concurrency.  So I had to come up with a way of testing that.

I wanted reproducible numbers, so just throwing a lot of concurrent UPSERTs at a database was not going to cut it, not to mention the difficulty of keeping that up while also ensuring that a certain fraction of the target rows already exist.  Instead, I changed the implementations slightly so that each function always takes the code path it would take if a concurrent transaction were UPSERTing the exact same value at the same time.  I only looked at INSERT-or-UPDATE operations, and only at methods one and two, since they looked the most promising in my previous tests.
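To make that concrete, here is an illustrative sketch of what such a function and the forced slow path can look like.  It is not the exact code from part 1: the upsert_1 name, the upsert_test table and its key/value columns are made up, and it assumes method one is the UPDATE-first retry loop from the PostgreSQL documentation (which is what the numbers below suggest).

package upsertbench

// Illustrative only: the actual functions benchmarked in part 1 may differ.
const createUpsert1 = `
CREATE FUNCTION upsert_1(p_key int, p_value text) RETURNS void AS $$
BEGIN
    LOOP
        -- the "row already exists" path: just UPDATE
        UPDATE upsert_test SET value = p_value WHERE key = p_key;
        IF FOUND THEN
            RETURN;
        END IF;
        -- the row did not exist when we looked, so try to INSERT it
        BEGIN
            INSERT INTO upsert_test (key, value) VALUES (p_key, p_value);
            RETURN;
        EXCEPTION WHEN unique_violation THEN
            -- the "congestion" path: a concurrent transaction created the
            -- row between our UPDATE and our INSERT; go around the loop
            -- and UPDATE it instead
        END;
    END LOOP;
END;
$$ LANGUAGE plpgsql;
`

With the row already present, making that first UPDATE deliberately miss (for example by tacking AND false onto its WHERE clause) forces every call through the unique_violation handler, which is exactly the path a losing concurrent UPSERT takes, so it can be timed without actually racing transactions against each other.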

I wrote a Go program which tested the various methods, and got the following results:

InsertOrUpdate1                10000 130578 ns/op
InsertOrUpdate1RowExists       10000 111433 ns/op
InsertOrUpdate1UnderCongestion 10000 176165 ns/op
InsertOrUpdate2                10000 108636 ns/op
InsertOrUpdate2RowExists       10000 152281 ns/op
(There isn't a version of the second method for the "congestion" code path, as that's the exact same code that is executed when the row already exists.)
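For reference, numbers of this shape can be produced with go test -bench along the following lines.  This is a minimal sketch rather than the actual program: the lib/pq driver, the DSN, the upsert_test table and the upsert_1 function name are placeholders.

package upsertbench

import (
    "database/sql"
    "testing"

    _ "github.com/lib/pq" // driver choice is an assumption
)

// benchmarkUpsert times one call to the upsert function per iteration.
// The setup statement runs outside the timer and controls whether the
// target row pre-exists.
func benchmarkUpsert(b *testing.B, setup, call string) {
    db, err := sql.Open("postgres", "dbname=upsert_bench sslmode=disable") // placeholder DSN
    if err != nil {
        b.Fatal(err)
    }
    defer db.Close()

    for i := 0; i < b.N; i++ {
        b.StopTimer()
        if _, err := db.Exec(setup); err != nil {
            b.Fatal(err)
        }
        b.StartTimer()
        if _, err := db.Exec(call, 1, "payload"); err != nil {
            b.Fatal(err)
        }
    }
}

// Row does not exist at the start of each iteration.
func BenchmarkInsertOrUpdate1(b *testing.B) {
    benchmarkUpsert(b, "DELETE FROM upsert_test WHERE key = 1", "SELECT upsert_1($1, $2)")
}

// Row already exists; the other variants follow the same pattern.
func BenchmarkInsertOrUpdate1RowExists(b *testing.B) {
    benchmarkUpsert(b, "SELECT upsert_1(1, 'x')", "SELECT upsert_1($1, $2)")
}

Run it against a scratch database with go test -bench .; each variant differs only in what the setup statement leaves in the table before the timed call.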

So now I knew the penalty paid by a transaction forced onto the slow code path.  To turn that into a meaningful graph, I used the following observations:

  • If the row already exists, the number of concurrent transactions doesn't matter.
  • If c transactions try to concurrently UPSERT a row that does not exist, the one that "wins the race" is as fast as the operation would have been with no concurrency at all.  The remaining c - 1 transactions are as slow as InsertOrUpdate1UnderCongestion or InsertOrUpdate2RowExists for methods one and two, respectively (see the sketch right after this list for how that turns into an estimate).
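Putting those two observations together with the measured numbers gives a back-of-the-envelope estimate of the cost per operation, given a probability p that the row already exists and c transactions racing when it does not.  This is only a sketch of the model; the actual graph may have been computed slightly differently.

package main

import "fmt"

// Measured ns/op from the benchmark above.
const (
    m1Base      = 130578.0 // InsertOrUpdate1
    m1RowExists = 111433.0 // InsertOrUpdate1RowExists
    m1Congested = 176165.0 // InsertOrUpdate1UnderCongestion
    m2Base      = 108636.0 // InsertOrUpdate2
    m2RowExists = 152281.0 // InsertOrUpdate2RowExists (also its congestion cost)
)

// expected estimates the ns/op when the target row pre-exists with
// probability p and, when it does not, c transactions race to create it:
// one pays the normal cost, the other c-1 pay the congested cost.
func expected(p, c, base, rowExists, congested float64) float64 {
    costWhenMissing := base/c + (c-1)/c*congested
    return p*rowExists + (1-p)*costWhenMissing
}

func main() {
    const c = 8 // an arbitrary level of concurrency
    for p := 0.0; p <= 1.0; p += 0.1 {
        m1 := expected(p, c, m1Base, m1RowExists, m1Congested)
        m2 := expected(p, c, m2Base, m2RowExists, m2RowExists)
        fmt.Printf("p=%.1f  method1=%.0f ns/op  method2=%.0f ns/op\n", p, m1, m2)
    }
}

With these numbers the two curves cross at roughly p ≈ 0.36, largely independently of c, which lines up with the 30-40% figure below.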

The result was the following graph:



So even under concurrency, the first method appears to be faster in applications where the target row already exists more than 30-40% of the time.  This matches the results of the previous benchmark.

Update:  Fixed some calculations for the second method; I was not charging it any penalty for congestion, which was wrong.

4 comments:

  1. What do you think is the best way for a table used for rate-limiting api_requests?

    Reply: I'd say not to use a table for that at all; something like Redis would seem to be a much better fit, and keeps all the load off your DB (presumably a goal, if not the primary goal, of the rate-limiting).

  2. Hi, thanks for the great posts. Did you test how different transaction isolation levels affect upsert operations? Which level was used in the benchmark?

    Reply: Everything was tested in READ COMMITTED. I don't think any of the approaches work correctly under REPEATABLE READ or SERIALIZABLE. They probably all have a corner case where they just loop forever.
