Monday, March 21, 2022

Quis Custodiet Ipsos Custodes?


Here is an interesting take on detecting failures in the monitoring service itself and alerting the on-call person. Miedwar explained that their original method was simple and good enough: an AWS Lambda function runs periodically and sends an HTTP health request to Grafana via a proxy. When the health check fails, it triggers an incident in PagerDuty. Elegant, independent, simple, pretty good. Why change? It cannot see failures that start and clear up between polling intervals, and the proxy is a single point of failure (SPOF). Their new "trigger unless the system claims it is healthy" design resolves both problems and is just as simple.
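
The post gives no implementation details for the new design, so the following is only a minimal sketch of the general "trigger unless told otherwise" idea (often called a dead man's switch), not Miedwar's actual code. The class name, method names, and the 60-second timeout are all hypothetical, and timestamps are passed in explicitly so the logic can be exercised without waiting.

```python
import time


class DeadMansSwitch:
    """Hypothetical sketch: fires unless a heartbeat arrived recently."""

    def __init__(self, timeout_s, now=None):
        self.timeout_s = timeout_s
        # Treat creation as the first heartbeat so a new switch
        # does not fire immediately.
        self.last_heartbeat = time.monotonic() if now is None else now

    def heartbeat(self, now=None):
        """Called whenever the monitored system reports that it is healthy."""
        self.last_heartbeat = time.monotonic() if now is None else now

    def should_trigger(self, now=None):
        """True when no heartbeat has arrived within the timeout window."""
        now = time.monotonic() if now is None else now
        return now - self.last_heartbeat > self.timeout_s


# Usage with explicit timestamps (seconds); the values are illustrative only.
switch = DeadMansSwitch(timeout_s=60, now=0.0)
switch.heartbeat(now=50.0)                  # system claims it is healthy
quiet = switch.should_trigger(now=100.0)    # 50 s since last heartbeat
fired = switch.should_trigger(now=111.0)    # 61 s since last heartbeat
```

In a sketch like this, the default is to alert: silence from the monitored system, for whatever reason, eventually causes a trigger, whereas a poller only alerts when its own request happens to fail.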

For the Latin-impaired, the title means "Who watches the watchmen?"
