Monitoring PostgreSQL Streaming Replication

Posted by dansketcher on January 27, 2013
hosting, monitoring, postgresql, replication / No Comments

Having migrated to PostgreSQL 9.2 and set up Streaming Replication successfully, the next obvious thing you want to know is, “Is it still working?” Fortunately, PostgreSQL comes out-of-the-box with a few tools that can help


pg_stat_replication is really only useful on a master database server. It shows the slaves connected to the master, and some other opaque information about ‘where it’s up to’, for example

template1=# select * from pg_stat_replication;
  pid  | usesysid |  usename   | application_name | client_addr | client_hostname | client_port |         backend_start        |   state   | sent_location | write_location | flush_location | replay_location | sync_priority | sync_state 
 30213 |   xxxxxx | replicator | walreceiver      |  |                 |       52889 | 2013-01-24 17:06:06.11849+00 | streaming | 9/42008AD0    | 9/42008AD0     | 9/42008AD0     | 9/42007FB0      |             0 | async
(1 row)

To a noob like me the _location fields are a bit opaque (apart from being hex and probably related to WAL segments), but Alexander Fortin’s post to the postgresql mailing list helps make this a bit clearer.

I’ve actually converted that statement to not require his custom hex_to_int function, like so:

    sent_offset - (
        replay_offset - (sent_xlog - replay_xlog) * 255 * 16 ^ 6 ) AS byte_lag
        ('x' || lpad(split_part(sent_location,   '/', 1), 8, '0'))::bit(32)::bigint AS sent_xlog,
        ('x' || lpad(split_part(replay_location, '/', 1), 8, '0'))::bit(32)::bigint AS replay_xlog,
        ('x' || lpad(split_part(sent_location,   '/', 2), 8, '0'))::bit(32)::bigint AS sent_offset,
        ('x' || lpad(split_part(replay_location, '/', 2), 8, '0'))::bit(32)::bigint AS replay_offset
    FROM pg_stat_replication
) AS s;

which gives an output like this:

 client_addr | byte_lag 
-------------+----------  |        0

meaning that slave is 0 bytes behind the master. Most of the time in my experience it stays pretty low, but how big is too big? As long as it stays below the size of 1-2 WAL segment (use “SHOW wal_segment_size” to work out what your config is) then things are fine. If it drops back further, you might want to find out what’s going on.

SELECT pg_last_xact_replay_timestamp()

There’s a few functions that only work on the slave too, most notably pg_last_xact_replay_timestamp(). It shows the timestamp of the last transaction applied to the slave. With a reasonably busy server, using something like:

SELECT EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp()))::INTEGER AS lag_seconds;

is good enough to see how far behind the slave is. However, if there’s not enough activity, although the server is completely caught up, this number will continue to increase as now() gets further and further away from the last committed transaction timestamp. Instead, per this post of the mailing list, compare the location information too:

CASE WHEN pg_last_xlog_receive_location() = pg_last_xlog_replay_location() 
ELSE EXTRACT (EPOCH FROM now() - pg_last_xact_replay_timestamp()) END AS log_delay;

Finally, for anyone using ServerDensity, I updated reinbach’s plugin to include a simple count check of connected slaves on a master, and the slave lag function above.
See it here: