Sunday, December 19, 2010

Queues in SQL

This is actually a post I should've written a long time ago, but never got around to it. The problem this post discusses is implementing a queue in SQL. There are a lot of pitfalls, and some of them aren't immediately obvious, so it's not surprising that this is a frequently asked question on the IRC channel. Let's use the following schema:

CREATE TABLE items
(
    id serial PRIMARY KEY,
    available boolean DEFAULT TRUE NOT NULL

    -- application-specific data here
);

-- fast lookup of available items
CREATE INDEX items_lookup_index ON
    items (id) WHERE available;

And you have one or more "consumer" processes grabbing items from the queue, processing them and then deleting them. When a consumer starts processing an item, it marks it "not available". After the consumer is done, it removes the row from the table. You also have zero or more "producer" processes inserting items into the queue.
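
With just the columns above, producing an item is a plain INSERT (any application-specific columns would of course be filled in as well):

INSERT INTO items DEFAULT VALUES;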



Inserting items is quite straightforward, but how would you pop an item off the queue? There is no UPDATE .. LIMIT in PostgreSQL (yet), so that won't work. Instead, you might be tempted to do something like this:

UPDATE
    items
SET available = FALSE
WHERE id =
    (
        SELECT
            min(id)
        FROM items
        WHERE available
    )
RETURNING *
;

But there is a small problem with correctness: two concurrently running consumers might both get the same row. The reason for this is that the subquery is evaluated using a snapshot taken when the query started. If the "availability" of the row changes after that (i.e. another concurrently running process grabs the item), we still UPDATE it and both clients think they got the item. Fortunately, there's a way around this; add the check to the WHERE clause of the UPDATE:

UPDATE
    items
SET available = FALSE
WHERE id =
    (
        SELECT
            min(id)
        FROM items
        WHERE available
    )

-- important
AND available

RETURNING *
;

This works, because the WHERE clause in an UPDATE is always evaluated against the latest version of the row (assuming we're dealing with a table and not a view). However, the subquery always returns exactly one row, and if that row does not match our criteria when we get to the UPDATE, we might end up updating no rows! So either the application needs to be prepared to redo the UPDATE or (what I think is a lot better) this code has to be in a server-side function which retries it automatically.
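
If you go the server-side route, a minimal sketch of such a function might look like this (my example, not from the post; it assumes the default READ COMMITTED isolation level and that the function is VOLATILE, which it is by default):

CREATE FUNCTION pop_item() RETURNS items AS
$$
DECLARE
    item items;
BEGIN
    LOOP
        UPDATE items
        SET available = FALSE
        WHERE id =
            (SELECT min(id) FROM items WHERE available)
        AND available
        RETURNING * INTO item;

        -- FOUND is set if the UPDATE affected a row
        IF FOUND THEN
            RETURN item;
        END IF;

        -- nothing available at all?  return NULL instead of looping forever
        PERFORM 1 FROM items WHERE available;
        IF NOT FOUND THEN
            RETURN NULL;
        END IF;

        -- otherwise another consumer grabbed the row from under us; try again
    END LOOP;
END;
$$ LANGUAGE plpgsql;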


Now that correctness is not an issue and you have deployed the code to your production server, you might notice that your clients spend a lot of time just popping items off the queue. Because all consumers try to lock the same row, they end up waiting for the "fastest" transaction to finish, only to notice that the row was already processed. So effectively, only one consumer transaction can be running at a time, and the rest do a lot of unnecessary work. The absolute worst case is when your consumers process items very quickly and almost immediately come back to compete for the next one. However, if you only have a small number of consumers and/or processing an item takes a significant amount of time, you might not experience any problems with this approach. But if you do, there is a way to make the queue faster: advisory locks.

What we need is a way to say "if this item is not immediately available, try the next one". There is no way to do that using only SQL, but fortunately we can mix function calls into our SQL. Viz:

UPDATE
    items
SET available = FALSE
WHERE id =
    (
        SELECT
            min(id)
        FROM items
        WHERE
            available AND
            pg_try_advisory_lock(id)
    )
RETURNING *
;

pg_try_advisory_lock() tries to lock an application-defined resource identified by a bigint and returns TRUE if the resource was successfully locked, or FALSE if it could not be locked because another session already holds a lock on it. Postgres does not use advisory locks internally; it is up to the application to define the meaning of the bigint parameter, and in this case the item's id works well. By doing this, we tell PG to skip any row whose lock is not immediately available, which is exactly what we wanted. Great! Well, not exactly. There are still three problems with this:

  1. The snapshot problem mentioned earlier is still there.
  2. The planner might prefer a sequential scan over an index scan for the subquery.
  3. Advisory locks are not released automatically (not even when the transaction ends), so we need to manually release them.

The first one is easy to fix: again, add the check to the WHERE clause of the UPDATE. To fix the second one, we need to do a bit more work. The usual solution looks as follows (just the subquery part):

SELECT
    id
FROM
(
    SELECT
        id
    FROM items
    WHERE available
    ORDER BY id
    LIMIT $n
) ss
WHERE
    pg_try_advisory_lock(id)
ORDER BY id
LIMIT 1
;

where $n is the maximum number of consumer processes you expect to be running at any time. What happens here is the subquery tells the planner exactly how many rows we end up processing in the worst case, and then on the outer level we only take the rows we need (and can get). This way the planner doesn't have to try to guess for which rows pg_try_advisory_lock() returns TRUE and it can usually choose the best plan. Here's our final query:

UPDATE
    items
SET available = FALSE
WHERE id =
    (
        SELECT
            id
        FROM
        (
            -- try to convince the planner to choose an
            -- index scan over a seq scan
            SELECT
                id
            FROM items
            WHERE available
            ORDER BY id
            LIMIT $n
        ) ss
        WHERE
            pg_try_advisory_lock(id)
        ORDER BY id
        LIMIT 1
    )

-- important
AND available

RETURNING *
;

Unfortunately, even if we use advisory locks, it is possible for this query to return 0 rows. The later we release the advisory lock, the smaller the probability of that happening. In practice, if you release the advisory lock after COMMIT, the chance of it happening is very close to 0, but releasing advisory locks after COMMIT puts the burden on the client because you can't force a COMMIT in a server-side function (anyone feel like implementing stored procedures? :-) ). Even if you release the lock before committing, you should be fine as long as you have the retry logic in place. Test it out and see how it works!
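
To make the "release after COMMIT" variant concrete, here's a sketch of what a consumer session might look like (:item_id is just a placeholder for the id the UPDATE returned):

BEGIN;
-- run the final UPDATE ... RETURNING query above and remember the id
-- ... process the item in the application ...
DELETE FROM items WHERE id = :item_id;
COMMIT;

-- advisory locks survive COMMIT, so release the lock explicitly
SELECT pg_advisory_unlock(:item_id);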

Keep in mind that this is not a complete solution. What if a consumer dies while processing an item? What if the server crashes? The best way to deal with this is very application specific, so I decided not to cover it, at least not in this post.

You might also have multiple queues in a single database, in which case the bigint version of advisory locks can cause problems: the same id in two different queue tables would map to the same lock. In that case, you can sometimes use the two-argument integer version of the function like this: pg_try_advisory_lock('tablename'::regclass::int, id). For information about regclass, see the docs.

Friday, November 5, 2010

PGDay

I will be talking at PGDay.eu in December about concurrency-related problems in PostgreSQL (and hopefully a few solutions too!). If your database has more than one user, this is the talk for you!

I'll start from the basics, but you should already be familiar with postgres (or some other SQL database).

Saturday, July 17, 2010

Smallest available ID

Today an interesting problem came up on the IRC channel: Given a table of integer IDs, find the first non-existing ID starting from zero. A straightforward solution using generate_series() and a left join is in the comments, but we're trying to do better than that.
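
For reference, that straightforward version might look something like this (my sketch, assuming the IDs live in a table foo (id int) like in the queries that follow):

SELECT
    gs.id
FROM
    generate_series(0, (SELECT coalesce(max(id), 0) + 1 FROM foo)) gs (id)
LEFT JOIN
    foo ON foo.id = gs.id
WHERE
    foo.id IS NULL
ORDER BY
    gs.id
LIMIT
    1
;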

For the better solution, there are four cases we need to worry about (and I only thought of one of these on my own; I'll blame the lack of caffeine):

  • There are no rows in the table.
  • There are rows in the table, but their IDs are all greater than zero.
  • There are rows between zero and N, but there are gaps.
  • There are rows between zero and N and no gaps.

The first two can be covered by returning 0 when it's available. This will either create a gap starting from 0, make a gap that already existed in the table the gap with the smallest ID, or make the table gapless. Finding the gap with the smallest ID is relatively trivial with window functions (8.4 or later only, sorry), and in a gapless table we can use max + 1. Let's take a look at the actual query:

-- Use zero if available
(SELECT
    0 AS id
 WHERE
    NOT EXISTS
        (SELECT 1 FROM foo WHERE id = 0) )

    UNION ALL

-- Find the smallest available ID inside a gap
(SELECT
    id + 1
 FROM
 (
    SELECT
        id, lead(id) OVER (ORDER BY id)
    FROM
        foo
 ) ss
 WHERE
    lead - id > 1
 ORDER BY
    id
 LIMIT
    1)

    UNION ALL

-- Or if there were no gaps and zero wasn't
-- available, use max + 1.  But don't return
-- a row if the table is empty; the first
-- query took care of that.
(SELECT
    max(id) + 1
 FROM
    foo
 HAVING
     max(id) IS NOT NULL)

ORDER BY
    id
LIMIT
    1
;

The last ORDER BY id LIMIT 1 affects the result of the whole UNION, so we always get the smallest available ID if more than one of these queries returns a row. For the same reason we also need the parentheses around the second query; we only want it to return one ID even if there are several gaps or the only gap is wide. I used parentheses around all three queries because I thought having them around only one query in a UNION looked funny.

For the record, finding the gap on a version older than 8.4 is still possible, but I'm not going to cover it here. Consider upgrading ;-)



Edit: It occurred to me that we don't need to separately query for the max value; lead(id) will be NULL for the last row. This gives us a slightly nicer query:

-- Use zero if available
(SELECT
    0 AS id
 WHERE
    NOT EXISTS
        (SELECT 1 FROM foo WHERE id = 0) )

    UNION ALL

-- Find the smallest available ID inside a gap, or max + 1
-- if there are no gaps.
(SELECT
    id + 1
 FROM
 (
    SELECT
        id, lead(id) OVER (ORDER BY id)
    FROM
        foo
 ) ss
 WHERE
    lead - id > 1 OR
    lead IS NULL
 ORDER BY
    id
 LIMIT
    1)

ORDER BY
    id
LIMIT
    1
;

Unfortunately, you still need the first query of the UNION, but there's a way around that, too: insert a dummy row with id = -1 to the table. That gives us an even nicer query:

SELECT
    id + 1
FROM
(
    SELECT
        id, lead(id) OVER (ORDER BY id)
    FROM
        foo
) ss
WHERE
    lead - id > 1 OR
    lead IS NULL
ORDER BY
    id
LIMIT
    1
;

You could also add the dummy row in a subquery instead of selecting straight from the table, but then the query can't use an index at all.
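
In case you want to try that anyway, the variant might look something like this (a sketch):

SELECT
    id + 1
FROM
(
    SELECT
        id, lead(id) OVER (ORDER BY id)
    FROM
        (SELECT -1 AS id
         UNION ALL
         SELECT id FROM foo) with_dummy
) ss
WHERE
    lead - id > 1 OR
    lead IS NULL
ORDER BY
    id
LIMIT
    1
;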

Tuesday, June 29, 2010

Faster count(*)

One of the most common questions on IRC is "how to make SELECT count(*) FROM tbl; faster?" While it's not clear why an application would need to run this query repeatedly, it is possible to make it a bit faster. There are two common approaches:

The easiest way is to use a trigger to update a static row count in a separate table:

CREATE TABLE counttbl
    (rowcount int NOT NULL,
     CHECK (rowcount >= 0));

CREATE UNIQUE INDEX "counttbl_single_row"
    ON counttbl((1)); -- only allow one row in counttbl

CREATE FUNCTION update_counttbl() RETURNS trigger AS
$$
BEGIN
    IF TG_OP = 'INSERT' THEN
        UPDATE counttbl SET rowcount = rowcount + 1;
    ELSIF TG_OP = 'DELETE' THEN
        UPDATE counttbl SET rowcount = rowcount - 1;
    END IF;
    RETURN NULL; -- ignored in an AFTER trigger
END;
$$ language plpgsql;

CREATE TRIGGER update_rowcount AFTER INSERT OR DELETE ON realtbl
    FOR EACH ROW EXECUTE PROCEDURE update_counttbl();

INSERT INTO counttbl SELECT count(*)
    FROM realtbl; -- populate the count table

This approach is fairly straightforward, but suffers from a huge problem: only one transaction can INSERT INTO or DELETE FROM the table at a time because UPDATE on counttbl locks the row exclusively (the lock is necessary though). With some additional trickery, we can avoid this and achieve better concurrency:

CREATE TABLE counttbl
    (rowcount int NOT NULL,
     CHECK (rowcount >= -1));

CREATE UNIQUE INDEX "counttbl_single_row" ON counttbl((1))
    WHERE rowcount > 1; -- only allow one "sum" row

CREATE FUNCTION update_counttbl() RETURNS trigger AS
$$
BEGIN
    IF TG_OP = 'INSERT' THEN
        INSERT INTO counttbl VALUES(1);
    ELSIF TG_OP = 'DELETE' THEN
        INSERT INTO counttbl VALUES(-1);
    END IF;
    RETURN NULL; -- ignored in an AFTER trigger
END;
$$ language plpgsql;

CREATE TRIGGER update_rowcount AFTER INSERT OR DELETE ON realtbl
    FOR EACH ROW EXECUTE PROCEDURE update_counttbl();

INSERT INTO counttbl SELECT count(*)
    FROM realtbl; -- populate the count table

Now we can run concurrent INSERTs and DELETEs on realtbl because we don't need to UPDATE any rows. To actually find out the count, you'd run SELECT sum(rowcount) FROM counttbl; instead. This approach has two problems, though: finding out the count is a bit slower than with the first solution, and you need to periodically compact the delta rows back into a single "sum" row.
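
One way to do that compaction (my sketch, not part of the original approach) is a function like the one below, called periodically from e.g. cron. Run it in a SERIALIZABLE transaction so that every statement sees the same snapshot: the DELETE then removes exactly the delta rows that were included in the sum, and rows committed concurrently are simply left for the next run. Two compactions running at the same time can fail with a serialization error, so don't schedule them concurrently.

CREATE FUNCTION compact_counttbl() RETURNS void AS
$$
DECLARE
    total bigint;
BEGIN
    -- sum up the delta rows visible in our snapshot...
    SELECT sum(rowcount) INTO total FROM counttbl;
    -- ...remove them...
    DELETE FROM counttbl;
    -- ...and replace them with a single "sum" row
    INSERT INTO counttbl VALUES (coalesce(total, 0));
END;
$$ LANGUAGE plpgsql;

-- for example:
-- BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE;
-- SELECT compact_counttbl();
-- COMMIT;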

As always, neither of these solutions works for every possible case. It's up to you to choose the one that works for your application.

Writeable CTEs

For the past year or so, I've worked on a feature known as "Writeable CTEs". While some people know what this feature is about and would want to use it right now (I'm talking about you, Merlin), there are still a lot of people who don't know what it offers. So far, I've identified two major use cases, but I'm quite sure people will find more as they wrap their heads around the feature. ;-) Let's take a look at what I've got so far:

1. Moving rows from one table to another

On any released PG version, you would usually do this:

BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE;
INSERT INTO target SELECT * FROM source;
DELETE FROM source;
COMMIT;

There are some drawbacks to this approach, though:

  1. If you have multiple processes running this same transaction at the same time, you get a lot of serialization errors.
  2. If the source table gets a lot of UPDATEs and DELETEs, you get a lot of serialization errors.
  3. You need to do two accesses to the heap.
  4. You need to do it in SERIALIZABLE isolation mode.

While you can avoid #1 quite easily by using a lock for these processes, effectively allowing only one of them to do this at a time, #2 is a lot harder to avoid. #3 shouldn't be significant in practice but can be in some scenarios. #4 can actually be problematic.

With the new feature, you can avoid all four drawbacks. The syntax is* also a lot more intuitive:

WITH t AS
    (DELETE FROM source RETURNING *)
INSERT INTO target SELECT * FROM t;

This will do exactly what it looks like: first delete the rows from "source" and then insert them into "target". While the first version of this feature (which I suspect we'll see in 9.1) will need to materialize the complete DELETE result set first, I'm hopeful that we can remove that need in the future.

2. INSERTing into multiple tables

Imagine you're writing a web application which collects information about people and their pets. When the user has typed in his real name and the names of his pets, you want to add him to your (normalized) database. Like any other web app, you're using surrogate keys. Normally, you would first INSERT the user and get the userid with RETURNING or currval() and then INSERT the pets. While this doesn't seem too bad, with a bigger application you might end up doing tens of round-trips to the server. With writeable CTEs, you can do this easily:

WITH person AS
    (INSERT INTO persons VALUES ('Marko Tiikkaja') RETURNING userid)
INSERT INTO pets
SELECT
    person.userid, pet
FROM
    person
CROSS JOIN
    unnest(ARRAY['dog1', 'dog2', 'cat1']) pet;

First, my name is added to the "persons" table, and the "person" CTE holds a single row with my userid. Now we want to add one row with this userid into the "pets" table for each pet. This can be done easily with a CROSS JOIN. I could've also used the syntax FROM person, unnest(..) but I wanted to make it clear that a cross join was desirable. We could also easily add different persons with different pets by putting the "INSERT INTO pets .." statements into their own CTEs (you can have about as many CTEs as you need); see the sketch below.
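
Here's a sketch of that idea, following the proposed syntax above (the persons and pets are made up, of course): two persons, each with their own pets, inserted in a single statement.

WITH alice AS
    (INSERT INTO persons VALUES ('Alice') RETURNING userid),
alice_pets AS
    (INSERT INTO pets
     SELECT alice.userid, pet
     FROM alice
     CROSS JOIN unnest(ARRAY['cat1']) pet),
bob AS
    (INSERT INTO persons VALUES ('Bob') RETURNING userid)
INSERT INTO pets
SELECT
    bob.userid, pet
FROM
    bob
CROSS JOIN
    unnest(ARRAY['dog1', 'dog2']) pet;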


* It is not yet clear that the syntax will be exactly this, but I'm going to try to get there. :-)

Edit: s/REPEATABLE READ/SERIALIZABLE/ to avoid confusion.

Sunday, June 20, 2010

Introduction

I have been thinking about this for quite some time now, and I finally decided to start a blog. Since most of you probably don't know me, I guess a small introduction is necessary.

My name is Marko Tiikkaja. I've used PostgreSQL for the past 7 years or so and loved it more and more every year. For the past two years I've been active on freenode's #postgresql channel under the alias "johto". About a year ago I started hacking on the backend code of PG when David Fetter came up with the idea of allowing DML statements inside CTEs. This project is known as "Writeable CTEs" and I have a blog post about it in the queue.

That's about all I wanted to say in the first post. Stay tuned for actually PG-related content. ;-)