Thursday, May 8, 2014

PostgreSQL gotcha of the week, week 19

local:marko=# create table foo();
CREATE TABLE
local:marko=#* insert into foo default values;
INSERT 0 1
local:marko=#* create function getdata(foo)
returns table (a int, b int)
as $$
begin

if random() < 0.5 then
    a := 1; b := 1;
    return next;
else
    a := 2; b := 2;
    return next;
end if ;
end
$$ language plpgsql
;
CREATE FUNCTION
local:marko=#* select (getdata(foo)).* from foo;
 a | b 
---+---
 2 | 1
(1 row)

If you didn't figure out how that happened: first, (getdata(foo)).*  gets expanded to  (getdata(foo)).a, (getdata(foo)).b.  Then both expressions are evaluated separately, where they happen to enter different arms of the IF.

To fix, on versions earlier than 9.3:

select (getdata).*
from
(
    select getdata(foo)
    from foo
    offset 0
) ss;
The OFFSET 0 prevents the subquery from being flattened, which also ensures that the function is only called once for every row in "foo".

On 9.3 and later:

select getdata.* from foo, getdata(foo);

Thursday, May 1, 2014

Why I consider USING harmful

Often on the topic of the different JOIN syntaxes in SQL I mention that I consider ON the only reasonable way to specify JOIN clauses in production code, and I get asked for details.  It always takes me a while to recall the specific problem and to come up with a plausible scenario, so I figured it would be easier to document this once and for all.

Suppose you have a database containing information about cars and the parts that go into building said cars.  Your schema could look something like this:

create table cars (
carid serial primary key,
model text not null
);

create table manufacturers (
manufacturerid serial primary key,
name text not null
);

create table parts (
partid serial primary key,
manufacturerid int references manufacturers,
model text not null
);

create table car_parts (
carid int references cars,
partid int references parts,
primary key (carid, partid)
);

And then a query to get a specific car and the parts that go into building it might look like this:

select
  cars.model,
  car_parts.partid,
  manufacturers.name as manufacturer
from cars
join car_parts using (carid)
join parts using (partid)
join manufacturers using (manufacturerid)
where cars.carid = $1
;

Everything is working great, until some day someone thinks that you should also track the manufacturer for each car.  So you run the following DDL:

alter table cars
  add column manufacturerid integer
  references manufacturers;

.. and boom.  The query shown above that previously worked correctly won't work anymore.  Even worse, if it's in a view (as opposed to a function or SQL in the application), the view will continue to work, and you might not even know that it's broken until you try to restore a dump of the database.

This is a huge caveat, and also the reason I think USING does not belong to production code.  ON is far more reasonable, though it requires a bit more typing.  SQL is not for the lazy.