2 PG Days in 2 days apart !!

Want to learn more about PostgreSQL? Are you at driving distance from Washington DC and/or Newyork area? If any of anwers is yes then you shound’t miss this opportunity to attend 2 PostgreSQL Days within 2 days apart in your area with plenty of knowledge sharing talks and networking events!br

1. Friday,March 30th,2012 PG Day in Washington DC

2. Monday,April 2nd,2012  PG day in Newyork

I will be attending both of the these days. If you are attending, will love to meet you in-person. br /br /See you soon!!

monitor bucardo replication lag using circonus

         I have been using circonus for monitoring, trending and alerting for any database metrics for quite a long time now. The circonus interface makes the monitoring, trending and alerting setup painless and you can see graph flowing in minutes. Another good thing about Circonus is that you can monitor anything that you can query from database ! This week, the task at my hand was to find a way to monitor bucardo replication lag.  `bucardo_ctl status sync_name` provides very important information that you can rely for trending and alerting purposes.

$ bucardo_ctl status my_slave
Sync name: my_sync
Current state: WAIT:22s (PID = 19500)
Type: pushdelta
Source herd/database: slave_herd / master_herd
Target database: my_slave
Tables in sync: 318
Last good: 23s (time to run: 1m 21s)
Last good time: Feb 29, 2012 15:27:14 Target: my_slave
Ins/Upd/Del: 142 / 0 / 0
Last bad: 1h 45m 9s (time to run: 19m 57s)
Last bad time: Feb 29, 2012 13:42:29 Target: my_slave
Latest bad reason: MCP removing stale q entry
PID file: /var/run/bucardo/bucardo.ctl.sync.my_sync.pid
PID file created: Wed Feb 29 13:42:33 2012
Status: active
Limitdbs: 0
Priority: 0
Checktime: none
Overdue time: 00:00:00
Expired time: 00:00:00
Stayalive: yes Kidsalive: yes
Rebuild index: 0 Do_listen: no
Ping: yes Makedelta: no
Onetimecopy: 0

All the information provided by `bucardo_ctl status` command is important but most interesting thing to monitor is “Last good:”. Last good shows bucardo replication lag on slave server.

Trending in circonus:

Now, I have metric identified for the monitoring. Next step is to find the best way to put the metric into the monitoring tool. After some poking around the output and ways to monitor stuff, I decided to grab the SQL query from bucardo_ctl perl script and stick it into circonus monitoring. Most of the time spent for setting up this monitor was to grab right query from big perl script (bucardo_ctl) and mapping the metric required from the query. After that, here is the query that I plugged  into Circonus in no-time.

‘bucardo_last_good’ , round(extract(epoch FROM now()-ended))
(SELECT * FROM bucardo.q WHERE sync = ‘my_sync’ AND cdate >= now() – interval ‘3 days’
SELECT * FROM freezer.master_q
WHERE sync = ‘my_sync’ AND cdate >= now() – interval ‘3 days’) AS foo
WHERE ended is NOT NULL AND aborted IS NULL

Alerting in circonus:

bucardo_ctl status shows Last good status to “unknown” if replication is broken.

Name Type State PID Last_good Time I/U/D Last_bad Time
my_sync| P |WAIT:35s|7620| unknown | | |36s |1m58s

       In circonus, you could setup rules and relevant severity levels. The most important part is that if the query doesn’t return any row it should page(“unknown” condition) . Circonus provides rule for alerts in case of a metric is absent. Now, I am all set with alerts as well.

Yay! bucardo replication is under monitoring and trending without any hassle! Hopefully, this post will help you next time when you try to put bucardo replication lag under monitoring.