Thursday, November 6, 2014

PostgreSQL gotcha of the week, week 45

Something happened earlier this week (or maybe it was last week, I forget) that left me quite perplexed for a good 30 seconds, so I thought I'd try to make this problem a bit better known.  Consider the following schema:

CREATE TABLE blacklist (
  person text
);

INSERT INTO blacklist VALUES ('badguy');

CREATE TABLE orders (
  orderid serial PRIMARY KEY,
  person text,
  amount numeric
);

CREATE FUNCTION is_blacklisted(_person text)
  RETURNS boolean
  STABLE
AS $$
BEGIN
  RETURN EXISTS (SELECT 1 FROM blacklist
                 WHERE person = _person);
END
$$ LANGUAGE plpgsql;

CREATE FUNCTION new_order(_person text, _amount numeric)
  RETURNS bigint
AS $$
DECLARE
  _orderid bigint;
BEGIN
  IF is_blacklisted(_person) THEN
    RAISE EXCEPTION 'person has been blacklisted';
  END IF;

  INSERT INTO orders (person, amount)
    VALUES (_person, _amount)
  RETURNING orderid INTO _orderid;
  RETURN _orderid;
END
$$ LANGUAGE plpgsql;

Now everything is running smoothly:

=# select new_order('goodguy', 100.0);
 new_order 
-----------
         1
(1 row)

=# select new_order('badguy', 100.0);
ERROR:  person has been blacklisted

But something's wrong with the blacklist table: it doesn't tell you who blacklisted each person, or why.  You also happen to know that you were the one who added all of the people currently on the blacklist, so you want the new data to reflect that.  You run the following ALTER TABLEs:

ALTER TABLE blacklist
  ADD COLUMN reason text,
  -- use DEFAULT to assign values to existing rows
  ADD COLUMN blacklisted_by text
    NOT NULL DEFAULT 'you';

-- and then drop it
ALTER TABLE blacklist
  ALTER COLUMN blacklisted_by
    DROP DEFAULT;

And everything seems to be working nicely.  However, a few days later you realize that "badguy" stole all your money by getting through a couple of fake orders.  How did that happen?

The answer is a bit complicated, but the gist of it is that if a STABLE function accesses the table for the first time in a transaction while a table-rewriting operation (such as the ALTER TABLE we used) is going on concurrently, it will see an empty table instead of either the old or the new version of the data.  is_blacklisted() returned FALSE because of this, and "badguy" was able to slip an order through.  This caveat applies to ALL ISOLATION LEVELS; not even READ COMMITTED users are safe.  However, SQL language functions don't seem to have this problem in more recent versions of Postgres.
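
For comparison, the SQL-language equivalent of is_blacklisted() would look something like this (a sketch; as noted above, SQL functions don't seem to exhibit the problem on recent versions):

-- drop the plpgsql version first if you want to try this in the example schema
CREATE FUNCTION is_blacklisted(_person text)
  RETURNS boolean
  STABLE
AS $$
  SELECT EXISTS (SELECT 1 FROM blacklist
                 WHERE person = _person);
$$ LANGUAGE sql;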

This really calls into question the safety of table-rewriting operations in an application heavy on PL/pgSQL (or other procedural languages).  I know we're banning them internally.
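
If you do need the new column, the same end state can be reached without a table rewrite by splitting the change up (a sketch for the example above; the SET NOT NULL step still scans the whole table, but it doesn't rewrite it):

-- adding the columns without a default doesn't rewrite the table
ALTER TABLE blacklist
  ADD COLUMN reason text,
  ADD COLUMN blacklisted_by text;

-- backfill the existing rows with a plain UPDATE
UPDATE blacklist
  SET blacklisted_by = 'you'
  WHERE blacklisted_by IS NULL;

-- and only then add the constraint
ALTER TABLE blacklist
  ALTER COLUMN blacklisted_by
    SET NOT NULL;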

Thursday, May 8, 2014

PostgreSQL gotcha of the week, week 19

local:marko=# create table foo();
CREATE TABLE
local:marko=#* insert into foo default values;
INSERT 0 1
local:marko=#* create function getdata(foo)
returns table (a int, b int)
as $$
begin

if random() < 0.5 then
    a := 1; b := 1;
    return next;
else
    a := 2; b := 2;
    return next;
end if ;
end
$$ language plpgsql
;
CREATE FUNCTION
local:marko=#* select (getdata(foo)).* from foo;
 a | b 
---+---
 2 | 1
(1 row)

If you didn't figure out how that happened: first, (getdata(foo)).* gets expanded to (getdata(foo)).a, (getdata(foo)).b.  Then both expressions are evaluated separately, and here they happened to take different arms of the IF.
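
In other words, the query that actually gets executed is equivalent to the following, with two independent function calls per row:

select (getdata(foo)).a, (getdata(foo)).b from foo;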

To fix, on versions earlier than 9.3:

select (getdata).*
from
(
    select getdata(foo)
    from foo
    offset 0
) ss;

The OFFSET 0 prevents the subquery from being flattened, which also ensures that the function is only called once for every row in "foo".

On 9.3 and later:

select getdata.* from foo, getdata(foo);

Thursday, May 1, 2014

Why I consider USING harmful

When the topic of the different JOIN syntaxes in SQL comes up, I often mention that I consider ON the only reasonable way to specify JOIN clauses in production code, and I get asked for details.  It always takes me a while to recall the specific problem and to come up with a plausible scenario, so I figured it would be easier to document this once and for all.

Suppose you have a database containing information about cars and the parts that go into building said cars.  Your schema could look something like this:

create table cars (
  carid serial primary key,
  model text not null
);

create table manufacturers (
  manufacturerid serial primary key,
  name text not null
);

create table parts (
  partid serial primary key,
  manufacturerid int references manufacturers,
  model text not null
);

create table car_parts (
  carid int references cars,
  partid int references parts,
  primary key (carid, partid)
);

And then a query to get a specific car and the parts that go into building it might look like this:

select
  cars.model,
  car_parts.partid,
  manufacturers.name as manufacturer
from cars
join car_parts using (carid)
join parts using (partid)
join manufacturers using (manufacturerid)
where cars.carid = $1
;

Everything is working great, until one day someone decides that you should also track the manufacturer of each car.  So you run the following DDL:

alter table cars
  add column manufacturerid integer
  references manufacturers;

.. and boom.  The query shown above, which previously worked correctly, won't work anymore: the left-hand side of the join to manufacturers now has two manufacturerid columns (one from cars, one from parts), so USING (manufacturerid) is ambiguous and the query fails.  Even worse, if the query is in a view (as opposed to a function or SQL in the application), the existing view will continue to work, and you might not even know that it's broken until you try to restore a dump of the database.

This is a huge caveat, and also the reason I think USING does not belong in production code.  ON is far more reasonable, though it requires a bit more typing.  SQL is not for the lazy.
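
For comparison, the same query written with ON keeps working after the new column is added, since every join condition names the table each column comes from:

select
  cars.model,
  car_parts.partid,
  manufacturers.name as manufacturer
from cars
join car_parts on car_parts.carid = cars.carid
join parts on parts.partid = car_parts.partid
join manufacturers on manufacturers.manufacturerid = parts.manufacturerid
where cars.carid = $1
;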

Saturday, April 26, 2014

SQL gotcha of the week

SELECT 'foo' IN ('foo'
                 'bar');
 ?column? 
----------
 f
(1 row)

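The reason: two string constants separated only by whitespace that contains at least one newline are concatenated into a single literal, so the list above holds just one value and the query is equivalent to:

SELECT 'foo' IN ('foobar');
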
Be careful out there.

Wednesday, April 23, 2014

UPSERTisms in Postgres, part 2: congestion

In my previous post I benchmarked a number of different UPSERTy implementations.  However, a number of people (rightfully) asked me how the results would change under concurrency.  So I had to come up with a way of testing that.

I wanted reproducible numbers, so I figured that just throwing a lot of concurrent UPSERTs at a database was not acceptable, not to mention the difficulty of keeping that up while ensuring that a certain percentage of the target rows already exist.  Instead, I changed the implementation a bit: with a small change I persuaded the function to always take the code path it would take if a concurrent transaction were UPSERTing the exact same value at the same time.  I only looked at the INSERT or UPDATE operation, and only at methods one and two, since they looked the most promising in my previous tests.

I wrote a Go program which tested the various methods, and got the following results:

InsertOrUpdate1                10000 130578 ns/op
InsertOrUpdate1RowExists       10000 111433 ns/op
InsertOrUpdate1UnderCongestion 10000 176165 ns/op
InsertOrUpdate2                10000 108636 ns/op
InsertOrUpdate2RowExists       10000 152281 ns/op

(There isn't a version of the second method for the "congestion" code path, as that's the exact same code that is executed when the row already exists.)

So now I knew what penalty a transaction pays when it is forced onto the slow code path.  To get a meaningful graph out of these numbers, I used the following observations:

  • If the row already exists, the number of concurrent transactions doesn't matter.
  • If c transactions try to concurrently UPSERT a row that does not exist, the one that "wins the race" is as fast as the operation normally would have been, disregarding the concurrent transactions.  The rest, c - 1 transactions, are as slow as InsertOrUpdate1UnderCongestion or InsertOrUpdate2RowExists for methods one and two, respectively (see the sketch below).
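
To put the second observation into numbers (roughly; this isn't the exact calculation behind the graph, which also factors in how often the target row already exists), here it is for method one, using the ns/op figures above:

-- c concurrent transactions race to UPSERT the same missing row: one of them
-- pays the normal cost, the other c - 1 pay the congested cost
SELECT c,
       round((130578 + (c - 1) * 176165)::numeric / c) AS method1_avg_ns_per_op
FROM generate_series(1, 8) AS c;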

The result was the following graph:



So even under concurrency, the first method appears to be faster in applications where the target row already exists more than 30-40% of the time.  This matches the results of the previous benchmark.

Update:  Fixed some calculations for the second method; I was not charging it any penalty for congestion, which was wrong.

Monday, April 21, 2014

UPSERTisms in Postgres

As I'm sure a lot of people know already, PostgreSQL 9.4 will eventually be released without "native" support for UPSERT.  That said, it's still one of the most common questions on the excellent #postgresql IRC channel, and people are sometimes tempted to try and solve it via TRIGGERs (hell, even RULEs), unsuccessfully.

Even though people are often correctly told to "use a function" and are pointed to the example in the documentation, I often find myself needing some more information about the specific case to give a satisfactory answer.  The tricky part is that due to a race condition, a UNIQUE constraint is necessary to correctly implement UPSERTesque operations in Postgres, but unique violations abort the entire transaction unless a subtransaction is created.  Unfortunately, entering a subtransaction is often said to be expensive, so the trick is to identify the cases where you can still implement the operation correctly while only entering a subtransaction when absolutely necessary.

So I got curious about the cases where you can get substantial performance benefits from not entering one, and benchmarked different concurrency-safe solutions to three of the common operations I've identified.  The three operations are:

  • INSERT or IGNORE:  INSERT, but if the row already exists, don't do anything.  No result required other than "success".
  • INSERT or SELECT:  INSERT .. RETURNING, but if the row already exists, return data from the existing row as the result instead.
  • INSERT or UPDATE:  INSERT .. RETURNING, but if the row already exists, UPDATE .. RETURNING instead.

For the purposes of this post, I'm going to be assuming that no DELETEs or UPDATEs of the PRIMARY KEY are ever executed on the target table.  Some of the examples would work correctly in those cases as well, but most of them would need small adjustments.

For testing, I pre-populated a table with N% of the rows in a range, and then called the UPSERTy function once for every value in that range.  The idea was to get a rough approximation of the performance when the row already exists in N% of operations, for different values of N.  You can find all the code I used here.  Have eye bleach (or a fork) handy if you plan on looking at the bash scripts, though.

INSERT or IGNORE

For INSERT or IGNORE, I came up with three different solutions.  The first variant first checks whether the row exists, and only if it doesn't does it attempt the INSERT in a subtransaction.  This approach avoids having to enter a subtransaction if the value clearly already exists.  The hypothesis was that this approach would be the fastest if the value exists at least on every second attempt.
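
A minimal sketch of this first variant (using a hypothetical table upsert_test with a single key column covered by a PRIMARY KEY; the real code is behind the link above):

CREATE FUNCTION insert_or_ignore_1(_key int) RETURNS void AS $$
BEGIN
    -- fast path: if the row is already there, we're done
    PERFORM 1 FROM upsert_test WHERE key = _key;
    IF FOUND THEN
        RETURN;
    END IF;

    -- slow path: attempt the INSERT in a subtransaction, ignoring the
    -- unique violation if a concurrent transaction beat us to it
    BEGIN
        INSERT INTO upsert_test (key) VALUES (_key);
    EXCEPTION WHEN unique_violation THEN
        NULL;
    END;
END
$$ LANGUAGE plpgsql;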

The second approach simply always enters a subtransaction and attempts to INSERT the record, ignoring the unique_violation exception if the row already exists.  This is probably the fastest approach if most of the time the row does not already exist.
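
The second approach, sketched against the same hypothetical table:

CREATE FUNCTION insert_or_ignore_2(_key int) RETURNS void AS $$
BEGIN
    -- always pay for the subtransaction, never for a separate lookup
    BEGIN
        INSERT INTO upsert_test (key) VALUES (_key);
    EXCEPTION WHEN unique_violation THEN
        NULL;  -- the row already exists, which is fine
    END;
END
$$ LANGUAGE plpgsql;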

Finally, the last variant does an INSERT .. WHERE NOT EXISTS in a subtransaction.  I expected this to be slightly faster than the previous approach in some cases, but slower in almost all cases.
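
And the third variant, which only differs from the second one in the INSERT statement itself:

CREATE FUNCTION insert_or_ignore_3(_key int) RETURNS void AS $$
BEGIN
    BEGIN
        INSERT INTO upsert_test (key)
        SELECT _key
        WHERE NOT EXISTS (SELECT 1 FROM upsert_test WHERE key = _key);
    EXCEPTION WHEN unique_violation THEN
        NULL;  -- lost the race to a concurrent INSERT
    END;
END
$$ LANGUAGE plpgsql;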

Here are the results:


As you can see, at around 40% of rows pre-existing, avoiding the subtransaction starts to become a performance gain.  Perhaps slightly surprisingly, INSERT .. WHERE NOT EXISTS in a subtransaction doesn't seem to ever make sense.

INSERT or SELECT

To implement INSERT or SELECT, I came up with a total of four solutions.  The first three are the equivalents of INSERT or IGNORE, as described above, and the last one uses a LOOP instead of copying the lookup directly into the exception handler block.  I had no hypothesis on which one would be faster between the first one and the last one; I was just curious to see if there was any difference at all.  The fourth version also supports DELETEs, unlike any of the other versions.
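
A sketch of what the LOOP-based variant looks like (same hypothetical upsert_test table as above, now with a value column):

CREATE FUNCTION insert_or_select_4(_key int, _value text) RETURNS text AS $$
DECLARE
    _result text;
BEGIN
    LOOP
        -- the lookup sits at the top of the loop instead of being
        -- duplicated inside the exception handler
        SELECT value INTO _result FROM upsert_test WHERE key = _key;
        IF FOUND THEN
            RETURN _result;
        END IF;

        BEGIN
            INSERT INTO upsert_test (key, value) VALUES (_key, _value)
                RETURNING value INTO _result;
            RETURN _result;
        EXCEPTION WHEN unique_violation THEN
            NULL;  -- lost the race; loop around and read the winner's row
        END;
    END LOOP;
END
$$ LANGUAGE plpgsql;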

Results:


The results here look very similar to the INSERT or IGNORE case, though skipping the subtransaction makes sense even earlier, at around 30% of rows pre-existing in the target table.  There's no clear difference between methods one and four, which is not surprising.

INSERT or UPDATE

For the "traditional" UPSERT I just implemented the same approaches as for INSERT or SELECT.  Again, the fourth approach supports DELETEs.

Results:


Very similar results here as well, though methods one and four don't get that big of a benefit anymore from the row existing compared to how fast they are against an empty table.  Subtransactions are still expensive, though.

Conclusion

According to these results, you might want to consider avoiding the subtransaction if you expect the target row to exist roughly 30% of the time.  Definitely consider that approach if you expect the row to exist in anywhere above 40% of cases.

Please remember that these results can only provide a general guideline.  In particular, the behaviour was not tested under any actual concurrency (other than manual tests to verify correctness).  If performance is very important for you, it's always up to you to measure different solutions and choose the one that works best for your application!

Update:  Joel Jacobson suggested a slightly different variant in the comments, and I've added it to the graphs in the git repository.  The results further highlight the fact that it's not the act of entering a subtransaction that's expensive, it's failing one and recovering from it.  That also means that benchmarking these variants under concurrent workloads is likely going to give slightly different results.