Wednesday, November 18, 2015

LISTENing connections aren't cheap

Recently I used perf to look into what could be the cause for our increased CPU usage on the PostgreSQL server (encouraged by Andres' great talk at  I was somewhat surprised to find that thirty percent of the CPU time used by postgres was spent spinning on spinlocks, i.e. doing no actual useful work.  Digging into the profile a bit more, most of these were coming from a function called asyncQueueReadAllNotifications.  To understand why this function might have such a large impact on the system, we need to understand the basics of how LISTEN / NOTIFY works in postgres.  The specifics can be found in the comment near the top of async.c, but the high level view is as follows.

When a session starts LISTENing on a channel the first time, it advertises its interest in notifications in a special shared memory segment.  The key is, however, that this is a black-and-white flag: if you're interested in any notification channel, shared memory will say you're interested in all of them.  This is also the reason you can't query which channels other backends are listening on; the data simply isn't available anywhere except in the local memory of each backend.  The only data that could be exposed on SQL level is the list of backends which are interested in notifications.  That is, however, at least currently not exposed, though if you're brave enough you can use GDB to extract that list.

For the NOTIFY part, things are a bit more complex.  While the transaction is going on, any issued notifications are queued into a local buffer.  If the transaction rolls back, this list is simply discarded.  But if the transaction commits, the notifications in the list are copied to a special queue in shared memory.  Once the queue has been populated, the backend goes through the list of all other backends listening on any channel and wakes them up.  And herein lies the problem: every single backend is being woken up, even if they're not interested in any of the notifications sent by this particular transaction.  Once you have a lot of workers listening on different channels and lots of notifications being delivered per time unit of your choice, the wakeups, transactions created for accessing the shared memory segment and locks around the data structures start eating up a lot of CPU.

So, what can we do?

Reduce the number of connections using LISTEN.  Really, it can save your day. This is what happened when we reduced the number of backends listening by around 50%:

We've since went the extra mile and migrated all listening connections to allas, which brought the spinlocks in our CPU profiles down from 30% to 0.16%.  asyncQueueReadAllNotifications doesn't even appear in the profiles, which makes me happy.

Of course, if all backends are actually interested in all (or most) notifications, perhaps this is not that big of an issue.  But if you have a lot of backends and most of them are only interested in a subset of notifications delivered through the system, you might want to think about ways to reduce the number of backends listening for connections directly in postgres.  I've presented one way of achieving that here, but obviously it's not the only way.

Wednesday, February 4, 2015

allas: connection pooling for LISTEN / NOTIFY

Lately I've been working on a connection pooler which only supports LISTEN / NOTIFY.  The idea is to be able to keep the number of Postgres connections down without having to give up (or come up with a workaround for) notifications.  With allas you can e.g. use pgbouncer or your environment's native connection pool and open a separate connection for notifications only.  allas internally uses only a single connection to PostgreSQL, so you can keep your max_connections setting down at a reasonable level and get more out of your database server.

If this sounds interesting, check it out!

Thursday, January 15, 2015

Queues, queues, they all fall down

Roughly five years ago I presented a way of pulling items off a queue table using advisory locks.  Now that the upcoming 9.5 release is going to have a similar method built in, I wanted to explore the performance characteristics of the different approaches available starting from PostgreSQL 9.5.  To recap, the problem is: given a table of items (or "jobs"), distribute the items to a set of workers pulling items from the table concurrently.  Each item must returned exactly once.

I've identified four general approaches and written what I thought were realistic, correct and good implementations of them.  Additionally, I wrote two versions of each algorithm:

  • one where an item is marked "unavailable" in one transaction, and then marked "processed" in another once the worker has finished working on the item
  • one where an item is marked "processed" right away, but the transaction is kept open until the worker has finished working on the item ("single-transaction version")

Some variations with different details are possible, but they should relatively accurately map to one of these two approaches in terms of performance.

Everything in this article assumes READ COMMITTED mode, and all approaches will likely break horribly under different isolation levels.

Naive Update

This approach is probably the most straightforward one you could think of: grab the first row in itemid order with a FOR UPDATE, and then UPDATE it to make sure others know you're working on it.  As of PostgreSQL 9.0 LIMIT and FOR UPDATE work together nicely, guaranteeing that you'll get a row if one is available.  The downside is that only one worker can be working on the queue at any given moment.  This is painfully obvious in the single-transaction version.

Advisory Locks

As some might recall, I've written about this approach before.  The idea is to allow workers to be concurrently working on the queue by using advisory locks to "skip" over locked rows.  This is especially nice with the single-transaction mode since you only have to UPDATE the row once, but other workers can still continue operating on the queue without disturbance.

The downside is that this approach is relatively complex: any correct and efficient solution has to deal with non-obvious race conditions and potential problems with planner interaction.

Random Offset

This is an approach I was only half serious about when writing the code.  The idea is to do exactly what the Naive Update algorithm does, but instead of always waiting for the first row, we skip a random number of rows with OFFSET.  Care must be taken when approaching the end of the queue, but other than that the approach is fairly straightforward, and looks decent on paper.


While working on this article, this was my absolute favourite approach.  It's very simple, and potentially extremely fast.  It relies on the SKIP LOCKED option for FOR UPDATE introduced in PostgreSQL 9.5.

Downsides?  It's hard to find legitimate ones.  Ideally, I think, we would want to attach SKIP LOCKED to an UPDATE with an ORDER BY and a LIMIT, but none of those three have been implemented in Postgres yet.


I wrote a simple Go program to test the different approaches.  The source code is available, but running it requires some tweaking; that's explained on the github page.  I also used a Go library called "plotinum" to draw the graphs directly from Go.

All methods were tested with 1, 2, 4, 6, 8, 12 and 16 workers running in parallel, pulling from a queue of 75,000 items on an 8-core x86_64 server.  To ensure correctness, each worker maintains a list of items it has seen, and at the end of the run these lists are merged, making sure that each item was seen exactly once.  All times are the fastest of three runs.

The PostgreSQL was built from commit fd3d894e4ea0021efa2628e4dfc5fe0ed3071859, which was the latest commit to the to-be-9.5 branch at the time.  autovacuum and fsync were off, and the usual tuning knobs were adjusted.

Advisory lock release timing

Before we take look at the results, I'd like to discuss the challenges of advisory lock based solutions a bit further.  As I mentioned, there's a race condition which isn't quite obvious if you haven't thought about the problem before.  Consider the following sequence of events:

  1. Transaction A marks itemid=1 in the queue as "not available", but does not yet commit.
  2. Transaction B takes an MVCC snapshot.  This snapshot will see itemid=1 as "available", since transaction A has not yet committed.
  3. Transaction A commits, and releases the advisory lock on itemid=1.
  4. Transaction B find itemid=1 thinking it's still available (remember the time at which the snapshot was taken), acquires the advisory lock (since it was just released), and thinks it's OK for it to start working on itemid=1.

There are a couple of ways the race condition in step #4 can be avoided, but the bottom line is to force the database to either look at the latest version of the row, or to force a new snapshot to be taken while holding the advisory lock.  If that version of the row still looks "available", then you're free to work on the item.  If on the other hand it's not "available" anymore, you have to not only start the process all over again, but also run a different query to see whether the queue is empty or you just hit the race condition.  As a result, if processing an item takes only a short period of time and there is a large number of concurrent workers, the slowdown might be noticeable.  Therefore the time at which the advisory lock is released can have significant effects on the performance of the queue.

As I was running my first round of tests, it become obvious that a number of workers were spending a significant portion of their time recovering from the race condition.  So I tried a few different methods of reducing the chance of hitting the race condition.  The ones I came up with were:

  1. Hold the advisory lock until the very last query in the item processing flow.  I wanted to avoid the extra round-trip of an additional query, so I added a  RETURNING pg_advisory_unlock()  to the last UPDATE query.
  2. Use transaction-level advisory locks, but try and avoid races with a random offset.  This is a lot like the "random offset" approach, but if you happen to hit the same offset with someone else, instead of locking you just risk hitting the normal advisory lock race condition.
  3. Release advisory locks in "batches" using pg_advisory_unlock_all().  I chose a batch size of 16 locks.  This method hold the locks for longer than the previous method, and it adds an extra round trip to the database every $batch_size attempts at picking up an item.
  4. Maintain a fixed-length queue of advisory locks being held by the current session, and release from the beginning one at a time in the RETURNING clause, similarly to approach #1.  This avoids the extra roundtrip and avoids the potential for "surges" of races with the call to pg_advisory_unlock_all().

#1 and #2 are nicer from a modularity standpoint, since they don't hold the advisory locks when they're not working on an item.  They also don't require any state to be kept for the database session.  #3 and #4 are both inferior in both ways.

So obviously I put on my science hat and benchmarked the different race condition mitigation strategies against each other.  The graph spat out looks like the following:

(Note: advisoryLocksSingleTxn is essentially "advisoryLocksBatchRelease", but written using the single-transaction approach outlined above.  I was too lazy to implement single-transaction versions of all lock release strategies, but I wanted to include one in this benchmark as well.)

I also made the Go tool print out the number of times the race condition was hit.  The effect is probably most obvious at 8 workers:

  METHOD: advisoryLocksNaive (100304 races)
  METHOD: advisoryLocksHold (27406 races)
  METHOD: advisoryLocksRandomOffset (11016 races)
  METHOD: advisoryLocksBatchRelease (19186 races)
  METHOD: advisoryLocksFixedLengthQueue (15 races)
  METHOD: advisoryLocksSingleTxn (5822 races)

A few interesting facts can be seen from these results:

  1. Transaction-level advisory locks are awful for this problem.
  2. #1 (advisoryLocksHold) is quite effective despite being simple to implement.  The number of races is almost down to a fourth of the naive approach, and runtime roughly halved.
  3. Adding a random offset appears to be very effective; bringing the number of races down to about 11% of those of the naive approach.
  4. The fixed-length queue approach can almost completely eliminate the race condition..
  5. .. but the single-transaction approach is still faster, despite suffering from a significantly larger a number of races.

For the final shootdown I chose to eliminate advisoryLocksNaive, advisoryLocksRandomOffset and advisoryLocksBatchRelease.  While the random offset approach was cute, it was relatively close to the more stable advisoryLocksHold approach in terms of runtime.

Final Results

With some solid approaches laid down and the technical details out of the way, this is the graph my final run produced:

Interestingly, the advisory lock approaches are actually slightly faster than SKIP LOCKED; around 16% with 8 workers pulling items in parallel with the two-transaction approach.  The difference is less noticeable with just a single transaction.  It would be interesting to try and figure out where the difference comes from, but I've done enough science for this week.

Wrapping up

So there we have it.  One person's benchmark attempting to give an idea on how the different approaches perform.  The benchmark itself slightly convoluted (since it's spending no time actually processing the items), so you probably should not take it too seriously.  On the other hand it does suggest that using SKIP LOCKED is a really good idea, since it's easy to use and very efficient in practice.

Another observation I consider noteworthy is that the naive approach is not so bad.  Even when pulling items from the queue as fast as possible, it's within 30% of the fastest approach.  Chances are that that's not even noticeable in your real application, so you don't have to start preparing for an upgrade to 9.5 quite yet.