Externally Monitoring Services with Librato and Ansible

14 April 2015, Rhodri Pugh

We use Librato’s excellent Metrics service to monitor and visualise our infrastructure. Recently, though, I saw a presentation from Dataloop and loved their dynamic check script feature. I instantly saw a use for it in our environment: ensuring services are responding correctly.

Inside Out

Currently all our applications emit metrics to StatsD, which then find their way to Librato. These metrics are internal to the application, though: they report what it has done, rather than how it is actually behaving. I think that is a subtle but important difference.

What we’d like to be able to do (as well as applications emitting their own metrics) is visualise this outside view of our services.

Outside In

This is exactly what the classic check script solution does. Something like Nagios runs these scripts, which probe your infrastructure and report back their findings. Librato doesn’t provide this out of the box, though, as it sits outside the area they cover. So I jumped onto their live chat and asked for recommendations on how to implement this kind of thing.

They suggested running a script from cron to send a metric back to Librato, then using their alerting system to notify our team when bad things happen.

Our Solution

We use Ansible to provision our servers, so it was pretty easy to add a task to various service roles to add a monitoring script.

---

- name: Add Monitor
  template: src=monitor.sh.j2 dest=/usr/local/bin/foo-monitor
            owner=root group=root mode=0755

- name: Configure Cron
  copy: src=crontab dest=/etc/cron.d/foo

The crontab is pretty obvious…

* * * * * root /usr/local/bin/foo-monitor

Then the monitoring script itself…

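#!/bin/sh
# vim: filetype=sh
# Ask the service for its headers and pull the status code off the first line.
STATUS=$(curl -sI "http://127.0.0.1:{{ foo_port }}/" | head -n 1 | cut -d' ' -f2)
# Only send a heartbeat gauge to the local StatsD when the service returned 200.
if [ "$STATUS" = "200" ]; then
    echo "foo.ping:1|g" | nc -u -w1 127.0.0.1 8125
fi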

Breaking it down, line 4 uses curl to send a request to the service being monitored, then chops up the response to extract the status code. Line 6 checks whether it’s a success, and if it is, line 7 sends a gauge to the local StatsD server indicating the service seems to be responding OK.
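The parsing step can be tried on its own, without a running service, by feeding it a canned response. This is only a minimal sketch: the helper name status_from_headers is hypothetical and not part of our actual script, and the tr call is an extra assumption to guard against a trailing carriage return.

```shell
#!/bin/sh
# Hypothetical helper (not in the original script): reads HTTP response
# headers on stdin and prints the status code found on the first line.
status_from_headers() {
    # "HTTP/1.1 200 OK" -> second space-separated field -> "200".
    # tr strips a stray carriage return if the status line has no reason phrase.
    head -n 1 | cut -d' ' -f2 | tr -d '\r'
}

# Feed it a canned response instead of calling curl.
printf 'HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n' | status_from_headers
```

Run as-is, this prints 200. In the real monitor the input would come from curl -sI rather than printf.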

Alerting

Finally, we created an alert in Librato that notifies us if the metric foo.ping stops reporting (i.e. success is not being reported). We could have gone the other way and only sent errors to Librato, alerting when they are received. But I prefer the “everything is OK” alarm, as I feel it’s safer to try to prove something is working than to detect that it has failed.
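To illustrate why the heartbeat approach feels safer, here is a rough sketch of the decision the monitor makes. The helper name is_healthy is hypothetical, and the echo lines stand in for the real StatsD call. The point is that every failure mode, including an empty status when the service can’t be reached at all, ends the same way: no heartbeat is sent, and the alert fires on its absence.

```shell
#!/bin/sh
# Hypothetical helper: succeeds only when the status code is exactly 200.
is_healthy() {
    [ "$1" = "200" ]
}

# Try a success, a server error, and the empty status we would get
# if curl could not connect at all.
for status in "200" "500" ""; do
    if is_healthy "$status"; then
        echo "status='$status' -> send foo.ping"
    else
        echo "status='$status' -> send nothing"
    fi
done
```

An error-only design would need to recognise each of those failure cases explicitly, and would stay silent if the monitor itself stopped running.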

Summary

Got a better way of doing this? Using an awesome tool that we’re missing out on? Please let us know!