Monday, April 21, 2014

UPSERTisms in Postgres

As I'm sure a lot of people know already, PostgreSQL 9.4 will eventually be released without "native" support for UPSERT.  That said, it's still one of the most common questions on the excellent #postgresql IRC channel, and people are sometimes tempted to try and solve it via TRIGGERs (hell, even RULEs), unsuccessfully.

Even though people are often correctly told to "use a function" and are pointed to the example in the documentation, I often find myself needing some more information about the specific case to give a satisfactory answer.  The tricky part is that due to a race condition, a UNIQUE constraint is necessary to correctly implement UPSERTesque operations in Postgres, but unique violations abort the entire transaction unless a subtransaction is created.  Unfortunately, entering a subtransaction is often said to be expensive, so the trick is to try and identify cases where you can still implement the operation correctly without entering a subtransaction unless absolutely necessary.
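To make the race condition concrete, here's the kind of naive implementation people often try first.  The schema and names here are mine for illustration, assuming a simple table like kv(k integer PRIMARY KEY, v text):

```sql
-- Hypothetical schema: CREATE TABLE kv (k integer PRIMARY KEY, v text);

-- NOT concurrency-safe: between the UPDATE and the INSERT, another
-- transaction can commit a row with the same key, and the INSERT will
-- then abort the entire transaction with a unique_violation.
CREATE FUNCTION naive_upsert(_k integer, _v text) RETURNS void AS $$
BEGIN
    UPDATE kv SET v = _v WHERE k = _k;
    IF NOT FOUND THEN
        INSERT INTO kv (k, v) VALUES (_k, _v);
    END IF;
END;
$$ LANGUAGE plpgsql;
```

Catching the unique_violation requires a BEGIN .. EXCEPTION block, and entering such a block is what creates the subtransaction.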

So I got curious about the cases where you can get substantial performance benefits from not entering one, and benchmarked different concurrency-safe solutions to three of the common operations I've identified.  The three operations are:

  • INSERT or IGNORE:  INSERT, but if the row already exists, don't do anything.  No result required other than "success".
  • INSERT or SELECT:  INSERT .. RETURNING, but if the row already exists, return data from the existing row as the result instead.
  • INSERT or UPDATE:  INSERT .. RETURNING, but if the row already exists, UPDATE .. RETURNING instead.
For the purposes of this post, I'm going to assume that no DELETEs or UPDATEs of the PRIMARY KEY are ever executed on the target table.  Some of the examples would work correctly in those cases as well, but most of them would need small adjustments.

For testing, I pre-populated a table with N % of rows in a range, and then called the UPSERTy function once for every value in that range.  The idea was to get a rough approximation of the performance in "the row exists in N % of operations" for different values of N.  You can find all the code I used here.  Have eye bleach (or a fork) handy if you plan on looking at the bash scripts, though.

INSERT or IGNORE

For INSERT or IGNORE, I came up with three different solutions.  The first variant checks whether the row exists, and only if it doesn't, tries the INSERT in a subtransaction.  This approach avoids having to enter a subtransaction if the value clearly already exists.  The hypothesis was that this approach would be fastest if the value exists at least on every second attempt.
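A minimal sketch of this first variant, again assuming the hypothetical kv(k integer PRIMARY KEY, v text) table (the names are mine, not from the benchmark repository):

```sql
CREATE FUNCTION insert_or_ignore_1(_k integer, _v text) RETURNS void AS $$
BEGIN
    PERFORM 1 FROM kv WHERE k = _k;
    IF FOUND THEN
        RETURN;  -- row already exists, no subtransaction needed
    END IF;
    BEGIN  -- a block with an EXCEPTION clause creates a subtransaction
        INSERT INTO kv (k, v) VALUES (_k, _v);
    EXCEPTION WHEN unique_violation THEN
        NULL;  -- somebody inserted the row concurrently; that's fine
    END;
END;
$$ LANGUAGE plpgsql;
```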

The second approach simply always enters a subtransaction and attempts to INSERT the record, ignoring the unique_violation exception if the row already exists.  This is probably the fastest approach if most of the time the row does not already exist.
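The second approach might look roughly like this, under the same assumptions as above:

```sql
CREATE FUNCTION insert_or_ignore_2(_k integer, _v text) RETURNS void AS $$
BEGIN
    -- The EXCEPTION clause means this entire function body runs in a
    -- subtransaction, whether or not the row exists.
    INSERT INTO kv (k, v) VALUES (_k, _v);
EXCEPTION WHEN unique_violation THEN
    NULL;  -- the row already exists; ignore
END;
$$ LANGUAGE plpgsql;
```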

Finally, the last variant does an INSERT .. WHERE NOT EXISTS in a subtransaction.  I expected this to be slightly faster than the previous approach in some cases, but slower in almost all cases.
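A sketch of this last variant, same hypothetical table as before:

```sql
CREATE FUNCTION insert_or_ignore_3(_k integer, _v text) RETURNS void AS $$
BEGIN
    -- The WHERE NOT EXISTS avoids most unique violations, but a
    -- concurrent insert can still slip in between the check and the
    -- INSERT, so the handler is still required for correctness.
    INSERT INTO kv (k, v)
    SELECT _k, _v
    WHERE NOT EXISTS (SELECT 1 FROM kv WHERE k = _k);
EXCEPTION WHEN unique_violation THEN
    NULL;
END;
$$ LANGUAGE plpgsql;
```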

Here are the results:


As you can see, at around 40% of rows existing, avoiding the subtransaction starts to pay off.  Perhaps slightly surprisingly, INSERT .. WHERE NOT EXISTS in a subtransaction doesn't seem to ever make sense.

INSERT or SELECT

To implement INSERT or SELECT, I came up with a total of four solutions.  The first three are the equivalents of INSERT or IGNORE, as described above, and the last one uses a LOOP instead of copying the lookup directly into the exception handler block.  I had no hypothesis on which one would be faster between the first one and the last one; I was just curious to see if there was any difference at all.  The fourth version also supports DELETEs, unlike any of the other versions.
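The LOOP-based variant might be sketched like this, with the same hypothetical kv table and names of my own choosing.  On a lost race, instead of repeating the SELECT inside the exception handler, control simply falls back to the top of the loop:

```sql
CREATE FUNCTION insert_or_select_loop(_k integer, _v text) RETURNS text AS $$
DECLARE
    _result text;
BEGIN
    LOOP
        SELECT v INTO _result FROM kv WHERE k = _k;
        IF FOUND THEN
            RETURN _result;
        END IF;
        BEGIN
            INSERT INTO kv (k, v) VALUES (_k, _v) RETURNING v INTO _result;
            RETURN _result;
        EXCEPTION WHEN unique_violation THEN
            NULL;  -- lost the race; loop back and SELECT the winner's row
        END;
    END LOOP;
END;
$$ LANGUAGE plpgsql;
```

Because the loop retries from the top after a unique_violation, this version keeps working even if the conflicting row is concurrently deleted again.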

Results:


The results here look very similar to the INSERT or IGNORE case, though skipping the subtransaction makes sense even earlier, at around 30% of rows pre-existing in the target table.  Not surprisingly, there's no clear difference between methods one and four.

INSERT or UPDATE

For the "traditional" UPSERT I just implemented the same approaches as for INSERT or SELECT.  Again, the fourth approach supports DELETEs.
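The loop-based UPSERT is essentially the pattern from the PostgreSQL documentation's example; a sketch against the same hypothetical kv table:

```sql
CREATE FUNCTION insert_or_update_loop(_k integer, _v text) RETURNS text AS $$
DECLARE
    _result text;
BEGIN
    LOOP
        -- Try the UPDATE first; if the row exists, we're done.
        UPDATE kv SET v = _v WHERE k = _k RETURNING v INTO _result;
        IF FOUND THEN
            RETURN _result;
        END IF;
        BEGIN
            INSERT INTO kv (k, v) VALUES (_k, _v) RETURNING v INTO _result;
            RETURN _result;
        EXCEPTION WHEN unique_violation THEN
            NULL;  -- a concurrent insert won; retry as an UPDATE
        END;
    END LOOP;
END;
$$ LANGUAGE plpgsql;
```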

Results:


Very similar results here as well, though methods one and four don't get that big of a benefit anymore from the row existing compared to how fast they are against an empty table.  Subtransactions are still expensive, though.

Conclusion

According to these results, you might want to consider avoiding the subtransaction if you expect the target row to exist roughly 30% of the time.  Definitely consider that approach if you expect the row to exist in anywhere above 40% of cases.

Please remember that these results can only provide a general guideline.  In particular, the behaviour was not tested under any actual concurrency (other than manual tests to verify correctness).  If performance is very important for you, it's always up to you to measure different solutions and choose the one that works best for your application!

Update:  Joel Jacobson suggested a slightly different variant in the comments, and I've added it to the graphs in the git repository.  The results further highlight the fact that it's not the act of entering a subtransaction that's expensive, it's failing one and recovering from it.  That also means that benchmarking these variants under concurrent workloads is likely going to give slightly different results.

9 comments:

  1. Why do you need to do a second BEGIN in "INSERT or SELECT" method 1?

    Example doing the same thing but with only one BEGIN...END block:

    CREATE FUNCTION Get_Flag_Value(_FlagName text, _FlagValue text) RETURNS integer
    LANGUAGE plpgsql
    AS $$
    DECLARE
        _FlagID integer;
        _FlagValueID integer;
    BEGIN
        SELECT FlagID INTO STRICT _FlagID FROM Flags WHERE Name = _FlagName;
        SELECT FlagValueID INTO _FlagValueID FROM FlagValues WHERE FlagID = _FlagID AND Value = _FlagValue;
        IF NOT FOUND THEN
            INSERT INTO FlagValues (FlagID, Value) VALUES (_FlagID, _FlagValue) RETURNING FlagValueID INTO STRICT _FlagValueID;
        END IF;
        RETURN _FlagValueID;
    EXCEPTION
        WHEN unique_violation THEN
            SELECT FlagValueID INTO _FlagValueID FROM FlagValues WHERE FlagID = _FlagID AND Value = _FlagValue;
            RETURN _FlagValueID;
    END;
    $$;

    Replies
    1. Entering a BEGIN block with an EXCEPTION handler in it creates a subtransaction, and the idea was to avoid that overhead when possible.

      That said, looking at the graphs, it appears that the real cost isn't actually *entering* a subtransaction, it's recovering from an aborted subtransaction. So it would be interesting to measure this case against the other ones to see how big the cost of merely entering a subtransaction is. I'll have to see about doing that a bit later today.

  2. > Unfortunately, entering a subtransaction is often said to be expensive

    In large databases you incur latent costs as well. Each operation consumes one transaction ID; after 200 million (sub)transactions, autovacuum will start re-writing all your tables to prevent transaction ID wraparound.

  3. Do you have methods that work with more than one row?

    I've got a naive bayes classifier written in MySQL that routinely upserts 2000 rows at a time.

    Replies
    1. Nothing that would be fast but still work correctly under concurrency out-of-the-box, no.

  4. Fantastic analysis. The missing piece though is performance in the face of contention. And under contention all methods should give correct results.

    Two methods I'd consider are CTE upsert (you should know!) and lock table/select/upsert.

    Replies
    1. I'm planning on doing another benchmark under concurrency a bit later. I just need to get the details locked down so that the numbers are reproducible.

  5. hi Marko,

    I'm the maintainer of a Ruby library that creates Postgres upsert functions on the fly:

    https://github.com/seamusabshere/upsert/blob/master/lib/upsert/merge_function/postgresql.rb#L101

    What implications might your benchmarks have for this implementation? If you would give me some pointers, I'll work through it!

    Thanks,
    Seamus

    Replies
    1. This and the followup benchmark I posted today agree that if the row exists at least 30% of the time, the method you've implemented there should be as fast as, if not faster than, all the alternatives I tested. For cases where the row exists less than 30% of the time, method two as covered here will likely be faster.

      As an aside, I think you should re-raise the exception instead of silently returning when first_try is not 1 in your implementation. I'm guessing that you put that guard there for cases where the unique_violation comes from a constraint other than the one you're upserting on, and in that case silently ignoring the operation is a bad idea.
