Shared Expectations for Monitoring
This is part of a series on monitoring that explores the cultural obstacles to productive monitoring.
They say the first step is admitting you have a problem, but when it comes to monitoring the first step is agreeing that you have a problem. When you get started, this is easy to define — is the system up? Over time, the world becomes less clear. Things get complicated quickly as responsibilities get spread around. The symptom and the disease often belong to different people. Or too often, one person owns a symptom and the other won’t even agree that they’re sick.
The road to hell is paved with business needs. Backups cause I/O performance problems. ETL jobs tie up the database, and before you know it your application tier is starting to crawl. Time passes and the overnight jobs take on a life of their own. The sluggishness gives way to total failure.
“That only happens outside business hours, it’s fine”: good idea, but somebody triggered those errors. “Just raise the limit so it doesn’t time out”: never mind that things are fine during the day. These are the “negotiations” that go on when people aren’t working to the same standards. Management asks everyone to work it out and compromise, but it doesn’t happen.
Compromise Doesn’t Work
Forget what you learned at your last team building session. Compromise doesn’t work when it comes to monitoring and alerting, and that’s a beautiful thing. Things are either working properly or they aren’t, and the monitoring system is just the referee to keep everyone honest.
The missing piece is actually a common definition of “working properly”. If you’re lucky, you have an SLA to point to that already sets these expectations. If not, you’ll need to run things up the food chain to get that settled.
Don’t escalate the issue in terms of the current technical problem; raise it in terms of the impact to your customers. This makes the decision less about picking sides, and more about managing the customer experience. It also sets a general precedent that can apply to similar situations in the future, rather than just your immediate problem.
If things are really bad, go all in and suggest that you just be honest and put up a maintenance page overnight instead of leading users to believe that things might work. That’s guaranteed to start a serious discussion!
Run these questions far enough up the food chain that the decider can back their decisions. That’s not to say expectations can’t change over time, but you’ve got to have somebody engaged who can enforce whatever rules are in effect. If you can’t identify that person, you’ve got bigger problems than your monitoring system.
An SLA With Yourself
Compile these expectations and publish them. In time, you’ll end up with an SLA of sorts for your internal expectations, hopefully with higher standards than you promise your customers. Cover the high-level expectations just like you’d cover them in a customer-facing SLA. You should also cover deeper technical issues, such as infrastructure responsiveness and how to be a good steward of shared resources. For example, recognize that there are limits on how long one process can hold an exclusive lock before it starts to cause problems elsewhere.
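One way to publish these expectations is as data that the monitoring system can check directly. What follows is only a minimal sketch in Python. The metric names, the limits, and the `Expectation`, `violations`, and `meets_standards` names are all hypothetical, chosen for illustration. Your own list would come out of the discussions described above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Expectation:
    """One published internal expectation: a named metric and its upper limit."""
    metric: str
    limit: float
    unit: str


# Hypothetical internal SLA. The metric names and limits are illustrative
# assumptions, not recommended values.
INTERNAL_SLA = [
    Expectation("page_response_seconds", 2.0, "s"),
    Expectation("exclusive_lock_hold_seconds", 30.0, "s"),
    Expectation("nightly_etl_runtime_minutes", 120.0, "min"),
]


def violations(measurements, sla=INTERNAL_SLA):
    """Return a message for every expectation that is missed or unmeasured."""
    problems = []
    for exp in sla:
        observed = measurements.get(exp.metric)
        if observed is None:
            # An expectation nobody measures cannot be enforced, so flag it.
            problems.append(f"{exp.metric}: no measurement")
        elif observed > exp.limit:
            problems.append(
                f"{exp.metric}: {observed}{exp.unit} exceeds limit of "
                f"{exp.limit}{exp.unit}"
            )
    return problems


def meets_standards(measurements, sla=INTERNAL_SLA):
    """Pass/fail verdict: True only when no expectation is violated."""
    return not violations(measurements, sla)
```

Calling `meets_standards` with last night’s measurements gives a yes or no answer, and `violations` names the expectation that was missed. There is deliberately no “close enough” result, and the overnight jobs are held to the same list as everything else.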
Once these standards are set, there’s no longer a need to compromise. You’re either meeting the standards or you aren’t. The key, of course, is having management that’ll support the standards they set when problems do arise. But that’s a bigger problem than I can solve here.