Tuesday, June 29, 2010

Faster count(*)

One of the most common questions on IRC is "how to make SELECT count(*) FROM tbl; faster?" While it's not clear why an application would need to run this query repeatedly, it is possible to make it a bit faster. There are two common approaches:

The easiest way is to use a trigger to update a static row count in a separate table:

CREATE TABLE counttbl
    (rowcount int NOT NULL,
     CHECK (rowcount >= 0));

CREATE UNIQUE INDEX "counttbl_single_row"
    ON counttbl((1)); -- only allow one row in counttbl

CREATE FUNCTION update_counttbl() RETURNS trigger AS
$$
BEGIN
    IF TG_OP = 'INSERT' THEN
        UPDATE counttbl SET rowcount = rowcount + 1;
    ELSIF TG_OP = 'DELETE' THEN
        UPDATE counttbl SET rowcount = rowcount - 1;
    END IF;
    RETURN NULL; -- ignored in an AFTER trigger
END;
$$ language plpgsql;

CREATE TRIGGER update_rowcount AFTER INSERT OR DELETE ON realtbl
    FOR EACH ROW EXECUTE PROCEDURE update_counttbl();

INSERT INTO counttbl SELECT count(*)
    FROM realtbl; -- populate the count table

This approach is fairly straightforward, but suffers from a huge problem: only one transaction can INSERT INTO or DELETE FROM the table at a time because UPDATE on counttbl locks the row exclusively (the lock is necessary though). With some additional trickery, we can avoid this and achieve better concurrency:

CREATE TABLE counttbl
    (rowcount int NOT NULL,
     CHECK (rowcount >= -1));

CREATE UNIQUE INDEX "counttbl_single_row" ON counttbl((1))
    WHERE rowcount > 1; -- only allow one "sum" row

CREATE FUNCTION update_counttbl() RETURNS trigger AS
$$
BEGIN
    IF TG_OP = 'INSERT' THEN
        INSERT INTO counttbl VALUES(1);
    ELSIF TG_OP = 'DELETE' THEN
        INSERT INTO counttbl VALUES(-1);
    END IF;
    RETURN NULL; -- ignored in an AFTER trigger
END;
$$ language plpgsql;

CREATE TRIGGER update_rowcount AFTER INSERT OR DELETE ON realtbl
    FOR EACH ROW EXECUTE PROCEDURE update_counttbl();

INSERT INTO counttbl SELECT count(*)
    FROM realtbl; -- populate the count table

Now we can run concurrent INSERTs and DELETEs on realtbl because we don't need to UPDATE any rows. To actually find out the count you'd run SELECT sum(rowcount) FROM counttbl; This approach has two problems though: it's a bit slower to find out the count than in the first solution, and it requires you to periodically update the "sum" row.

As always, neither of these solutions work for every possible case. It's up to you to choose the one that works for your application.

15 comments:

  1. The advice is sound and I've implemented a similar trigger solution to this.

    The reason an application needs the number of rows on a table is for pagination. How can we display how many pages to show on a pager if we don't know how many records there are?

    ReplyDelete
  2. @h:

    Hmm. I see. I originally thought that you'd (almost) always have a WHERE clause, but I can see how this could actually be useful. Thanks.

    ReplyDelete
  3. This comment has been removed by the author.

    ReplyDelete
  4. This comment has been removed by the author.

    ReplyDelete
  5. @Alexander:

    *shrug* That seems more complicated in my opinion. It also requires additional space for the updated_at column and the index on (updated_at). Sometimes you already have the former or both of these so there would not be any additional cost.

    But what's even worse in my opinion is that you still need to periodically update the counttbl so you don't even gain anything.

    ReplyDelete
  6. And the pager still doesn't work with arbitrary where clauses...

    ReplyDelete
  7. For a pager, just LIMIT 11 if you display 10 rows per page. Then it's easy to know if you have to display a "next page" link.

    ReplyDelete
  8. @dim, often it is useful, or preferred, in a pager to present a list of page numbers so you can see how many pages there are and jump to particular ones, and not simply prev/next links; hence you need at least an estimate of the total record count.

    ReplyDelete
  9. Getting an estimate of the number of rows is possibly by using explain, but it's a little cumbersome to use...

    ReplyDelete
  10. @Anonymous:

    You can get it a lot more easily by querying PG's statistics tables. It's only as good as your most recent ANALYZE though, and this post is about getting the exact answer.

    ReplyDelete
  11. @Marko Tikkaja:

    I'm having some trouble getting the estimated row-count for a specific query by using the statistics tables - is there a technique here I'm missing?

    ReplyDelete
  12. @Anonymous:

    Oh, I see EXPLAIN estimates the row count differently. It effectively does this:

    SELECT (reltuples / relpages::float8) * (pg_relation_size('tblname'::regclass) / 8192) FROM pg_class WHERE oid='tblname'::regclass; where 8192 is the block size of your PG cluster.

    ReplyDelete
  13. @Marko Tiikkaja:

    ..but that would only work for a trivial query?

    I know this is quite tangential to the original content (faster count(*)), but at times it's interesting to see an estimate of how many rows are expected for a query with some joins and constraints. Replicating all the logic from the planner by querying the statistics tables seems more or less hopeless.

    ReplyDelete
  14. @Anonymous:

    Right. That would only work for SELECT count(*) FROM tblname; Replicating the logic from the planner is fairly pointless for any other query. I think the only reasonable option is parsing the EXPLAIN output, preferably with 9.0's new EXPLAIN formats.

    ReplyDelete
  15. I found this is an informative and interesting post so i think so it is very useful and knowledgeable.

    ReplyDelete