SELECT 'foo' IN ('foo' 'bar'); ?column? ---------- f (1 row)
Be careful out there.
In my previous post I benchmarked a number of different UPSERTy implementations. However, a number of people (rightfully) asked me how the results would change under concurrency. So I had to come up with a way of testing that.
I wanted reproducible numbers, so I figured that just throwing a lot of UPSERTs at a database was not acceptable, not to mention the problems with keeping that up while maintaining that a certain number of rows pre-exist. Instead, I changed the implementation a bit: with a small change I persuaded the function to always take the code path which would be taken if a concurrent transaction was UPSERTing on the exact same value at the same time. I only looked at INSERT or UPDATE operations, and only methods one and two, since they looked the most promising in my previous tests.
I wrote a
which tested the various methods, and got the following results:
InsertOrUpdate1 10000 130578 ns/op InsertOrUpdate1RowExists 10000 111433 ns/op InsertOrUpdate1UnderCongestion 10000 176165 ns/op InsertOrUpdate2 10000 108636 ns/op InsertOrUpdate2RowExists 10000 152281 ns/op(There isn't a version of the second method for the "congestion" code path, as that's the exact same code that is executed when the row already exists.)
So now I knew the penalty the transaction forced onto the slow code path was taking.
To get a meaningful graph out of it, I used the following observations:
The result was the following graph:
So even under concurrency the first method appears to be faster in applications where more than 30-40% of the time the target row already exists. This matches the results of the previous benchmark.
Update: Fixed some calculations for the second method; I was not charging it any penalty for congestion, which was wrong.
As I'm sure a lot of people know already, PostgreSQL 9.4 will eventually be released without "native" support for UPSERT. That said, it's still one of the most common questions on the excellent #postgresql IRC channel, and people are sometimes tempted to try and solve it via TRIGGERs (hell, even RULEs), unsuccessfully.
Even though people are often correctly told to "use a function" and are pointed to the example in the documentation, I often find myself needing some more information about the specific case to give a satisfactory answer. The tricky part is that due to a race condition, a UNIQUE constraint is necessary to correctly implement UPSERTesque operations in Postgres, but unique violations abort the entire transaction unless a subtransaction is created. Unfortunately, entering a subtransaction is often said to be expensive, so the trick is to try and identify cases where you can still implement the operation correctly without entering a subtransaction unless absolutely necessary.
So I got curious about the cases where you can get substantial
performance benefits from not entering one, and benchmarked
different concurrency-safe solutions to three of the common
operations I've identified. The three operations are:
For testing, I pre-populated a table with N % of rows in a range, and then called the UPSERTy function once for every value in that range. The idea was to get a rough approximation of the performance in "the row exists in N % of operations" for different values of N. You can find all the code I used here. Have eye bleach (or a fork) handy if you plan on looking at the bash scripts, though.
For INSERT or IGNORE, I came up with three different solutions. The first variant first sees if the row exists, and if it doesn't it tries the INSERT in a subtransaction. This approach avoids having to enter a subtransaction if the value clearly already exists. The hypothesis was that this approach would be fastest if the value exists at least on every second attempt.
The second approach simply always enters a subtransaction and attempts to INSERT the record, ignoring the unique_violation exception if the row already exists. This is probably the fastest approach if most of the time the row does not already exist.
Finally, the last variant does an INSERT .. WHERE NOT EXISTS in a subtransaction. I expected this to be slightly faster than the previous approach in some cases, but slower in almost all cases.
Here are the results:
As you can see, at ~40% of rows existing avoiding the subtransaction is starting to become a performance gain. Perhaps slightly surprisingly, INSERT .. WHERE NOT EXISTS in a subtransaction doesn't seem to ever make sense.
To implement INSERT or SELECT, I came up with a total of four solutions. The first three are the equivalents of INSERT or IGNORE, as described above, and the last one uses a LOOP instead of copying the lookup directly into the exception handler block. I had no hypothesis on which one would be faster between the first one and the last one; I was just curious to see if there was any difference at all. The fourth version also supports DELETEs, unlike any of the other versions.
The ideas here look very similar to the INSERT or IGNORE case, though skipping the subtransaction makes sense even earlier, at around 30% of rows pre-existing in the target table. No clear difference between methods one and four, not surprisingly.
Very similar results here as well, though methods one and four don't get that big of a benefit anymore from the row existing compared to how fast they are against an empty table. Subtransactions are still expensive, though.
According to these results, you might want to consider avoiding the subtransaction if you expect the target row to exist roughly 30% of the time. Definitely consider that approach if you expect the row to exist in anywhere above 40% of cases.
Please remember that these results can only provide a general guideline. In particular, the behaviour was not tested under any actual concurrency (other than manual tests to verify correctness). If performance is very important for you, it's always up to you to measure different solutions and choose the one that works best for your application!
Update: Joel Jacobson suggested a slightly different variant in the comments, and I've added it to the graphs in the git repository. The results further highlight the fact that it's not the act of entering a subtransaction that's expensive, it's failing one and recovering from it. That also means that benchmarking these variants under concurrent workloads is likely going to give slightly different results.