This is part of a series on monitoring that explores cultural obstacles to a productive monitoring culture.
Your monitoring system is a tireless, unwavering friend. It always has your back. All it asks in return is a little bit of respect. Keep the status clean and alarms quiet. That’s all you have to do. It’s a simple thing, but this is where many organizations start to unravel at a technical level.
When we fail to respect our monitoring system, alarm fatigue sets in and we take the alerts less seriously. Email filters get added to mask the problem. We all know how this worked out for The Boy Who Cried Wolf. Not good.
We’re all guilty of this to some degree, but that doesn’t make it right.
Alarms cannot be normal
Above all else, we must completely reject the idea that alarms are normal. Nothing will improve until we’re over this hump. Often, a lack of shared expectations between teams lies at the root of this: different teams have differing ideas about service levels and simply ignore the alerts they don’t agree with. Before you know it, the unresolved alarms have multiplied and everyone is ignoring everything. Ironically, when alerts become routine, they are usually first filtered out by the very people who should be resolving them.
Unresolved alarms tend to multiply quickly. A mature monitoring system will have a variety of alarms, ranging from the obvious “everything is down” to leading indicators that are not always immediately urgent. When alarms designed for prevention go unresolved, downstream alerts start to multiply simply because root causes aren’t being addressed.
As alerts multiply, the aggressiveness of filtering goes up even faster than the quantity of “normal” alerts. Eventually, everything of substance is filtered out and nobody even notices when critical failures happen!
Something is on Fire!
When the monitoring system is throwing alerts, that has to be priority number one. No meetings, no lunch, do not pass go, do not collect $200. Something is on fire!
Sadly, that’s usually not the case. It’s a cycle that must be broken. Failing to treat monitoring alerts like something is on fire means we’re turning our backs on one of the most powerful friends we have in the office. Who among us doesn’t need all the help we can get?
If something isn’t really on fire, it shouldn’t alert. The obvious answer is to adjust the alert threshold, but it’s really not a great idea to remove alerts for leading indicators of trouble. Instead, look for ways to push upstream to the true root cause and prevent the alert. Create and automate maintenance processes to deal with resource issues proactively. Empower your monitoring system to attempt to restart stuck processes before it starts alerting.
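The restart-before-alerting idea can be sketched as a thin wrapper around a health check. This is a minimal illustration, not any particular monitoring product’s API; the callables `is_healthy`, `restart`, and `alert` are hypothetical hooks you would wire up to your own checks, process manager, and paging system:

```python
import time

def check_with_remediation(is_healthy, restart, alert,
                           retries=1, settle_seconds=5):
    """Try automated remediation before paging a human.

    is_healthy: callable returning True when the service is OK.
    restart:    callable that attempts to restart the stuck process.
    alert:      callable invoked only if remediation fails.
    """
    if is_healthy():
        return "ok"
    for _ in range(retries):
        restart()
        time.sleep(settle_seconds)  # give the process time to come back
        if is_healthy():
            return "recovered"  # remediation worked; no human is paged
    # Remediation failed: now it really is on fire, so escalate.
    alert("service still unhealthy after automated restart")
    return "alerted"
```

The key property is that the human-facing alert fires only after the automated fix has been tried and has failed, so every page that does get through deserves the “something is on fire” treatment.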
We control what the monitoring checks and alerts on. We have the power to fix it when we get it wrong. But just ignoring it? That’s not just disrespecting the monitoring system, it’s disrespecting ourselves. We need to do better than that.