Some teams invest significant effort in making their solution transparent, stable and easy to troubleshoot by instrumenting their product and technology and connecting that instrumentation to a monitoring system. One question remains: who monitors the monitoring? If the monitoring is down, you won’t get any alarms about a malfunction of the product, and you are back at the beginning. So you add monitoring of the monitoring system, and the tough availability game starts. If both components have 95% availability, the chance that both are up at the same time is effectively only about 90% (0.95 × 0.95 ≈ 0.90). I believe you get the rules of the game we play.
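The arithmetic behind that 90% can be checked with a minimal sketch. It assumes the two components fail independently and that both must be up for an alarm to reach you; the function name `serial_availability` is only illustrative.

```python
from math import prod


def serial_availability(*availabilities: float) -> float:
    """Availability of a chain in which every component must be up.

    Assumes the components fail independently of each other.
    """
    return prod(availabilities)


# Product at 95% watched by a monitoring system at 95%.
combined = serial_availability(0.95, 0.95)
print(f"{combined:.4f}")  # 0.9025, i.e. roughly 90%
```

Every extra component in the alert path multiplies in the same way, which is why a long chain of watchers does not come for free.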
How is monitoring of the monitoring set up in practice? Well, the implementation differs, but the core idea is the same and works well even in the army! It is called a dead man's switch. The idea is this: if an alarm is supposed to be triggered by a signal that arrives at an unknown moment, we need a guarantee that the signal can get through at any time. Reversing the logic gives us that guarantee: we receive a signal constantly, and the alarm is triggered when the signal stops arriving. So simple! This principle (a heartbeat) is used in many places, e.g. clustering.
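The reversed logic can be sketched in a few lines. This is a toy illustration, not how any particular service implements it: the class `DeadMansSwitch`, its method names and the 300-second timeout are all made up for the example, and the clock is injectable so the example runs instantly.

```python
import time
from typing import Callable


class DeadMansSwitch:
    """Triggers when heartbeats stop arriving (hypothetical helper)."""

    def __init__(self, timeout_s: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.timeout_s = timeout_s
        self._clock = clock
        self._last_beat = clock()

    def check_in(self) -> None:
        """Record a heartbeat; called by the system being watched."""
        self._last_beat = self._clock()

    def is_triggered(self) -> bool:
        """True once the silence has lasted longer than the timeout."""
        return self._clock() - self._last_beat > self.timeout_s


# Simulated clock (seconds) instead of real waiting.
now = [0.0]
switch = DeadMansSwitch(timeout_s=300, clock=lambda: now[0])

now[0] = 240
switch.check_in()             # heartbeat arrives in time

now[0] = 500
print(switch.is_triggered())  # False: 260 s since the last heartbeat

now[0] = 600
print(switch.is_triggered())  # True: 360 s of silence
```

Note that the alarm needs no message from the failing side at all; the silence itself is the signal.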
Some monitoring tools have this capability built in, but Prometheus doesn’t. So how do we achieve it and sleep well, knowing that the watcher is watching? We need to set up a rule in Prometheus that is constantly firing. The rule can look like this:
- name: monitoring-dead-man
  rules:
  - alert: "Monitoring_dead_man"
    expr: vector(1)
    labels:
      service: deadman
    annotations:
      summary: "Monitoring dead man switch should always fire alert"
      description: "Monitoring dead man switch for probing alert path"
Now we need to create the heartbeat. On its own, the rule would fire once, the alert would be propagated to the Prometheus Alertmanager (the component responsible for managing alerts), and that would be all. We need check-ins at a regular interval. The interval follows from the availability you want to achieve, because it adds to your reaction time. You can get this behaviour with a special route for the alert in Alertmanager:
- receiver: 'DEAD-MAN-SNITCH'
  match:
    service: deadman
  repeat_interval: 5m
Now we need the reversed alarm logic. In our case, we use Dead Man's Snitch, which is great for monitoring batch jobs, e.g. a data import into your database. It works like this: if you do not check in within the specified interval, it triggers, so the lead time is given by that interval. You can specify rules for when to trigger, but those are details of the service you use. All you need to add to the Alertmanager configuration is a receiver definition that checks in with your particular snitch, as follows:
- name: 'DEAD-MAN-SNITCH'
  webhook_configs:
  - url: 'https://nosnch.in/your_snitch_id'
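The same kind of check-in can also be sent by hand, for instance at the end of a batch job or to test the alert path before relying on it. The sketch below assumes the snitch accepts a plain HTTP request to its URL as a check-in; the function `check_in` and its `opener` parameter are hypothetical helpers, and `your_snitch_id` is the placeholder from the receiver definition above.

```python
import urllib.request
from typing import Callable

# Placeholder URL from the receiver definition; replace with a real snitch URL.
SNITCH_URL = "https://nosnch.in/your_snitch_id"


def check_in(url: str = SNITCH_URL, timeout_s: float = 10.0,
             opener: Callable = urllib.request.urlopen) -> bool:
    """Send one check-in request and return True on a 2xx response.

    `opener` is injectable so the function can be exercised without
    touching the network.
    """
    try:
        with opener(url, timeout=timeout_s) as response:
            return 200 <= response.status < 300
    except OSError:
        # URLError, HTTPError and socket timeouts are all OSError subclasses.
        return False


# Typical use at the end of a batch job (not executed here):
#     if not check_in():
#         print("check-in failed; the snitch will alert if this persists")
```

A failed check-in is deliberately not fatal: the whole point of the switch is that the missing signal raises the alarm on the other side.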
The last thing you need to do is integrate the trigger with the system you use for on-call rota management, e.g. PagerDuty. In this example, that is the integration between Dead Man's Snitch and PagerDuty.
As I am not an infra or DevOps guy at heart, it took me some time to figure this out and connect the bits. I hope you found it useful. If you have other experience or a different way to achieve this behaviour, let me know in the comment section below.