Monitoring isn't a Project, it's a Culture
Monitoring seems to be an impossible problem for a lot of IT shops. It shouldn’t be. Monitoring is something we all do in the real world - even your clueless coworkers and bosses. We monitor that there’s food in the fridge, gas in the tank, and money in the bank. Your credit card company alerts you to suspicious charges and overdue payments.
That’s not how it plays out at work. You get too many alerts, and they go unresolved. You look like an idiot when customers call before Nagios tells you there’s a problem. On good days you get alerted after things go down, but you get no warning before the wheels fall off. Even the best shops get caught up in some of these problems from time to time; it’s the nature of the beast. But this is no way to live for the long term.
Unless you’re living under a rock, monitoring and alerting are not foreign concepts. Why is it so damn hard when we go to work? Monitoring isn’t a project you finish and move on from; it’s never done. Yet too many shops treat it like that. Monitoring and alerting needs change over time as applications, infrastructure, and expectations change. Monitoring needs to be an ongoing concern; it has to be part of the culture.
Software development teams with a good testing culture move faster and get more done. The test suite is the wall at their back. A culture of quality monitoring and alerting provides that same backstop for operations. It removes the anxiety around day-to-day issues and intentional changes.
It helps you make informed decisions when you have to fight fires. As you level up, you spend less time fighting fires and more time preventing them. It gives you the insight to stay far enough ahead of problems to focus on actual improvements. You become confident that a fire won’t start if you take a vacation. Your family quits expecting you to come home angry. Life gets better.
What habits do we need to build (or break) to make this happen?
- Shared expectations. Achieving full coverage requires group involvement. You’ll forever be fighting fires if everyone isn’t working to provide the same level of service.
- Respect it. Your monitoring system can only protect you if you respect it. Unresolved alerts cannot be normal.
- Full coverage. It’s not good enough to monitor the front end. Your full infrastructure needs to be in scope so you can find and resolve root causes.
- Centralize it. Your monitoring tools all need to report back to a central point, and all alerts need to come from there.
- Set appropriate thresholds. Alerts that (usually) resolve themselves after a few minutes are a great way to form dangerous habits. Set thresholds so an alert fires only when intervention is required.
- Be proactive. It’s not good enough to get an alert when the system goes down. You need to find leading indicators and watch those in order to prevent downtime (see the sketch after this list).
- Continuous improvement. If your monitoring does let you down, adjust so it does better next time. Add alerts that push closer to root causes; iterate enough times and you can eliminate them.
- Use the scientific method. Make a hypothesis and monitor for it. Forge ahead with facts rather than conjecture.
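To make “appropriate thresholds” and “leading indicators” concrete, here’s a minimal sketch of the kind of check those habits produce, written as a Nagios service definition since Nagios is the tool mentioned above. The host name, mount point, command name, and percentages are all hypothetical; pick numbers based on how fast your own disks actually fill. The point is that the warning fires while there’s still time to act calmly, and the critical threshold sits where intervention is genuinely required.

```
# Minimal sketch (hypothetical host, mount point, and thresholds).
# check_disk warns while 20% of the disk is still free - a leading indicator you
# can act on during business hours - and only goes critical at 10% free, when
# intervention is actually required.

define command {
    command_name  check_local_disk
    command_line  $USER1$/check_disk -w $ARG1$ -c $ARG2$ -p $ARG3$
}

define service {
    use                   generic-service   ; inherit standard check/notification settings
    host_name             web01             ; hypothetical host
    service_description   Disk /var
    check_command         check_local_disk!20%!10%!/var
    max_check_attempts    3                 ; re-check before alerting so brief blips don't page anyone
}
```

The tool doesn’t matter; what matters is that the warning gives you room to respond before customers notice, and nothing pages a human unless a human actually needs to act.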
None of these habits are hard to adopt, and over time they pay huge dividends in saved time and reduced stress. Figuring these practices out made a huge impact on my life and my stress levels, and they can do the same for you. In future articles, I’ll expand on what these practices mean in the real world.
Stop thinking of monitoring as a project with a deadline and start making it part of your culture.