I've written before about the importance of being the first to know when you're having issues. Our setup here at Sorry includes various monitoring services like Pingdom and New Relic, as well as alerting tools like PagerDuty, to ensure that when things kick off we get notified.
Since we launched our new monitoring automations feature, we also have our status page hooked up, so we can get the word out to you, our customers.
So how do we ensure that this process works properly on those occasions when things go down? Well, we run fire drills.
The test button isn't enough
Most monitoring providers have a "Test" button in their UI. However, I've found that when you have multiple alerting and status page services connected, this usually isn't enough to test things thoroughly.
A better approach is to simulate downtime in a way the monitoring service can't tell apart from the real thing, so that it behaves in the same manner as it would during a genuine outage.
Assuming you're not mad enough to take your service offline (I'm looking at you, Netflix), we need to find a simple way to play pretend.
S3 Websites to the Rescue
We have an AWS S3 website with various directories and pages, each of which is connected to one of our monitoring tools.
By renaming or deleting these pages we're able to trigger anything from a single alert on a single provider, through to a cross-platform storm.
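As a rough illustration of the renaming step, here is a minimal Python sketch. It is not our actual tooling: the `.drill` suffix, the function names, and the bucket and key in the usage note are all made up for the example. S3 has no rename operation, so the sketch copies the object to a new key and deletes the original, using the `copy_object` and `delete_object` calls of a boto3 S3 client that you pass in.

```python
# Hypothetical convention: a page under drill is parked at "<key>.drill".
DRILL_SUFFIX = ".drill"


def _move(s3, bucket, src_key, dst_key):
    """'Rename' an object. S3 has no rename, so copy it, then delete the original."""
    s3.copy_object(
        Bucket=bucket,
        Key=dst_key,
        CopySource={"Bucket": bucket, "Key": src_key},
    )
    s3.delete_object(Bucket=bucket, Key=src_key)


def start_drill(s3, bucket, key):
    """Hide the monitored page so uptime checks against it start failing.

    Returns the key the page was parked under.
    """
    hidden_key = key + DRILL_SUFFIX
    _move(s3, bucket, key, hidden_key)
    return hidden_key


def end_drill(s3, bucket, key):
    """Put the page back under its original key so the checks recover."""
    _move(s3, bucket, key + DRILL_SUFFIX, key)

# Caveat: a copy does not carry over object ACLs. This sketch assumes the
# site is made public by a bucket policy rather than per-object ACLs.
```

With boto3 installed and credentials configured, something like `start_drill(boto3.client("s3"), "example-drill-site", "pingdom/api.html")` would start a drill for one page, and `end_drill` with the same arguments would end it. Passing the client in keeps the functions easy to try against a stub before pointing them at a real bucket.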
From this point onwards the monitoring does its thing, passing alerts to PagerDuty so everyone gets notified, and our Pingdom integration posts something to a duplicate status page which we've set up for drills (so we're not spamming our customers each time).
You too can create additional status pages in your own account.
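Before sitting back to wait for alerts during a drill, it can help to confirm that the page really does look down from the outside. The sketch below is a hedged approximation of a basic uptime check, not how Pingdom or any other provider actually implements theirs: it assumes that any failed request or non-2xx response counts as "down". The function name is hypothetical and only the Python standard library is used.

```python
import urllib.error
import urllib.request


def looks_down(url, timeout=10.0):
    """Return True if a plain HTTP GET of `url` fails or gives a non-2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return not (200 <= response.status < 300)
    except urllib.error.HTTPError:
        # Error statuses such as the one served for a renamed or deleted page.
        return True
    except (urllib.error.URLError, OSError):
        # DNS failure, refused connection, timeout and similar.
        return True
```

Calling `looks_down` on a drill page's URL should return `False` before the page is renamed and `True` afterwards. If it still returns `False` after the rename, the monitors won't see an outage either.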
How often do we drill?
Probably not as often as we should. Once set up and tested, things don't change a great deal; however, we always retest when making changes.
Where to next?
The big gap in this process right now is that it's only effective for simple uptime checks. Anything more fine-grained, such as browser performance or memory consumption on background processes, can't be simulated.
Quite how we go about this I'm not sure, but I have a few ideas bubbling away...
How do you drill your setup?
I'd love to share ideas and experience. Come find me @SirRawlins on Twitter, or get in touch via the website chat in the bottom corner.