Wednesday, April 23, 2014

UPSERTisms in Postgres, part 2: congestion

In my previous post I benchmarked a number of different UPSERTy implementations.  However, a number of people (rightfully) asked me how the results would change under concurrency.  So I had to come up with a way of testing that.

I wanted reproducible numbers, so I figured that just throwing a lot of UPSERTs at a database was not acceptable, not to mention the problems with keeping that up while maintaining that a certain number of rows pre-exist.  Instead, I changed the implementation a bit: with a small change I persuaded the function to always take the code path which would be taken if a concurrent transaction was UPSERTing on the exact same value at the same time.  I only looked at INSERT or UPDATE operations, and only methods one and two, since they looked the most promising in my previous tests.

I wrote a Go program which tested the various methods, and got the following results:

InsertOrUpdate1                10000 130578 ns/op
InsertOrUpdate1RowExists       10000 111433 ns/op
InsertOrUpdate1UnderCongestion 10000 176165 ns/op
InsertOrUpdate2                10000 108636 ns/op
InsertOrUpdate2RowExists       10000 152281 ns/op
(There isn't a version of the second method for the "congestion" code path, as that's the exact same code that is executed when the row already exists.)

So now I knew the penalty the transaction forced onto the slow code path was taking.  To get a meaningful graph out of it, I used the following observations:

  • If the row already exists, the number of concurrent transactions doesn't matter.
  • If c transactions try to concurrently UPSERT a row that does not exist, the one that "wins the race" is as fast as the operation normally would have been, disregarding the concurrent transactions.  The rest, c - 1 transactions, are as slow as InsertOrUpdate1UnderCongestion or InsertOrUpdate2RowExists for methods one and two, respectively.

The result was the following graph:

So even under concurrency the first method appears to be faster in applications where more than 30-40% of the time the target row already exists.  This matches the results of the previous benchmark.

Update:  Fixed some calculations for the second method; I was not charging it any penalty for congestion, which was wrong.


  1. What do you think is the best way for a table used for rate-limiting api_requests ?

    1. I'd say not to use a table for that at all; something like Redis would seem to be a much better fit, and keeps all the load off your DB (presumably a goal, if not the primary goal, of the rate-limiting).

  2. Hi, thx for great posts. Did you test how different transaction Isolation levels affect on upsert operations? What level was used in benchmark?

    1. Everything was tested in READ COMMITTED. I don't think any of the approaches work correctly under REPEATABLE READ or SERIALIZABLE. They probably all have a corner case where they just loop forever.