Saturday, July 17, 2010

Smallest available ID

Today an interesting problem came up on the IRC channel: Given a table of integer IDs, find the first non-existing ID starting from zero. A straightforward solution using generate_series() and a left join is in the comments, but we're trying to do better than that.

For this solution, there are four cases we need to worry about (and I only thought of one of these on my own, I'll blame the lack of caffeine):

  • There are no rows in the table.
  • There are rows in the table, but their IDs are all greater than zero.
  • There are rows between zero and N, but there are gaps.
  • There are rows between zero and N and no gaps.
The first two can be covered by returning 0 when it's available. This will either create a gap starting from 0, make a gap that already existed in the table the gap that has the smallest ID, or make the table gapless. Finding the gap with the smallest ID is relatively trivial with window functions (8.4 or later only, sorry) and in a gapless table, we can use max + 1. Let's take a look at the actual query:

-- Use zero if available
(SELECT
    0 AS id
 WHERE
    NOT EXISTS
        (SELECT 1 FROM foo WHERE id = 0) )

    UNION ALL

-- Find the smallest available ID inside a gap
(SELECT
    id + 1
 FROM
 (
    SELECT
        id, lead(id) OVER (ORDER BY id)
    FROM
        foo
 ) ss
 WHERE
    lead - id > 1
 ORDER BY
    id
 LIMIT
    1)

    UNION ALL

-- Or if there were no gaps and zero wasn't
-- available, use max + 1.  But don't return
-- a row if the table is empty; the first
-- query took care of that.
(SELECT
    max(id) + 1
 FROM
    foo
 HAVING
     max(id) IS NOT NULL)

ORDER BY
    id
LIMIT
    1
;

The last ORDER BY id LIMIT 1 affects the result of the whole UNION so we always get the smallest available ID if more than one of these queries return a row. For that same reason we also need to use parentheses around the second query; we only want to return one ID if there are several gaps or the only gap is wide. I used parentheses for all three queries because I thought having them only around one query in a UNION looked funny.

For the record, finding the gap on a version older than 8.4 is still possible, but I'm not going to cover it here. Consider upgrading ;-)



Edit: It occurred to me that don't need to separately query for the max value; lead(id) will be NULL for the last row. This gives us a bit nicer query:

-- Use zero if available
(SELECT
    0 AS id
 WHERE
    NOT EXISTS
        (SELECT 1 FROM foo WHERE id = 0) )

    UNION ALL

-- Find the smallest available ID inside a gap, or max + 1
-- if there are no gaps.
(SELECT
    id + 1
 FROM
 (
    SELECT
        id, lead(id) OVER (ORDER BY id)
    FROM
        foo
 ) ss
 WHERE
    lead - id > 1 OR
    lead IS NULL
 ORDER BY
    id
 LIMIT
    1)

ORDER BY
    id
LIMIT
    1
;

Unfortunately, you still need the first query of the UNION, but there's a way around that, too: insert a dummy row with id = -1 to the table. That gives us an even nicer query:

SELECT
    id + 1
FROM
(
    SELECT
        id, lead(id) OVER (ORDER BY id)
    FROM
        foo
) ss
WHERE
    lead - id > 1 OR
    lead IS NULL
ORDER BY
    id
LIMIT
    1
;

You could also add the dummy row in a subquery instead of selecting straight from the table, but then the query can't use an index at all.

11 comments:

  1. Typo in last select. No comma after "max(id) + 1". But nice post :)

    ReplyDelete
  2. Thanks for writing this code Marko -- I'm the one who brought this up in IRC.

    Your SQL code is robust, and working very well in my project.

    ReplyDelete
  3. @Randolf:

    Thanks for putting up with my slowness and testing my failed attempts. ;-)

    ReplyDelete
  4. Why not this?

    select s.id from generate_series(1,(select max(id) + 1 from <table>)) as s(id)
    where s.id not in (select id from <table>)
    order by s.id limit 1

    This appears to be pretty quick with an index on id. You can specify the lower bound of the id range in the generate_series.

    Thoughts? Comments?

    ReplyDelete
  5. Putting up with? Are you kidding?? You put that together so quickly -- I was truly impressed. Paid support often isn't this fast nor good!

    I've written a few stored procedures over the years that took me weeks to write (mainly due to learning the plpgsql language at the time; I'm still learning), and you left me with an impression that this is a very strong area of expertise for you.

    Because you immediately saw a use for the lead() function (which also yielded better performance), it's also clear that your knowledge on PostgreSQL is current.

    Folks in irc.freenode.net#postgresql are always very helpful (and I'm grateful for this), and often times the discussion is insightful (there's always something new to learn). I find that PostgreSQL is very well supported in IRC by knowledgeable and helpful people, like yourself, who also seem very willing to spend some time to understand the challenges -- it seems as if you folks enjoy difficult challenges just to find out how quickly you can come up with working solutions.

    You've opened my eyes to how much more powerful an SQL statement can be in PostgreSQL. Thanks again.

    ReplyDelete
  6. @Anonymous:

    That would be better written as:

    select s.id
    from generate_series(0, (select coalesce(max(id) + 1, 0) from foo)) as s(id)
    left join foo t on (t.id = s.id)
    where t.id is null
    order by s.id
    limit 1;

    NOT IN is very unpredictable performance-wise so its use should be avoided. You also didn't account for the empty table case.

    When I was working on this problem in IRC, I also suggested the generate_series() solution. In my tests it took about 4 times as much time to execute as the solution covered in the blog post, and Randolf decided to go with the more complicated solution. But as always, everyone should take a look at the available solutions and test and see which works best for their use case.

    Thanks for pointing this out, I really should have covered that solution in the blog post too.

    ReplyDelete
  7. @Randolf:

    Thank you for your kind words.

    When I was still learning all the basic stuff about postgres, I, like you, found the people on #postgresql very helpful and am truly grateful that we have such an awesome community around this great product.

    I can only speak for myself, but like you guessed, I enjoy challenges. And I think helping people out is the least I can do. :-)

    ReplyDelete
  8. This stored procedure is working wonderfully well for me. I'm just starting to use it on PostgreSQL 9, and so far it seems to be working just fine (not surprising).

    ReplyDelete
  9. Thanks for writing ! I agree with Randolf. Indeed, Stored procedure is working great. I just tested it and it is working phenomenal. Man thank to you :)

    ReplyDelete
  10. It is my great pleasure to visit your website and to enjoy your great post here. I like it very much.

    ReplyDelete