We all hate hero culture, but want to be heroes anyway. When there’s a problem in production that can be resolved with a “simple” tweak of the data, you get to be a hero. What’s not to like? Quite a bit.

We’ve all seen it, and let’s be honest, we’ve all done it. There’s a simple problem with a simple solution, and there’s just no point in running something so simple through the full normal process.

Allowing manual write access to production data is dangerous, and is terribly costly even when things don’t end in an obvious disaster.

Just like application code, anything that modifies data needs to be subject to proper version control, testing, and release processes.
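To make that concrete, here is a minimal sketch of what a data fix looks like when it is treated as code. Everything in it is hypothetical: the table, column, and file names are invented for illustration, and Python's standard-library sqlite3 module stands in for whatever database you actually run.

```python
# migrations/fix_orphaned_invoices.py  (hypothetical path and names, for illustration)
#
# A data fix written as code: it lives in version control, can be unit tested,
# and defaults to a dry run so reviewers can see exactly what it would change.
import sqlite3


def fix_orphaned_invoices(conn: sqlite3.Connection, dry_run: bool = True) -> int:
    """Mark invoices whose customer no longer exists as 'orphaned'.

    Returns the number of rows that were (or would be) updated.
    """
    cur = conn.execute(
        """
        UPDATE invoices
           SET status = 'orphaned'
         WHERE status != 'orphaned'
           AND customer_id NOT IN (SELECT id FROM customers)
        """
    )
    affected = cur.rowcount
    if dry_run:
        conn.rollback()   # report the impact without committing anything
    else:
        conn.commit()
    return affected


if __name__ == "__main__":
    import sys

    # Default to a dry run; require an explicit flag before writing anything.
    apply_changes = "--apply" in sys.argv
    connection = sqlite3.connect("app.db")  # placeholder database for the sketch
    rows = fix_orphaned_invoices(connection, dry_run=not apply_changes)
    mode = "updated" if apply_changes else "would update"
    print(f"{mode} {rows} invoice(s)")
```

Because the script defaults to a dry run, the version a reviewer reads in the pull request is the same version the release pipeline executes, first in report-only mode and then, after approval, with --apply.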

Human Error Should Happen

There’s the obvious risk of human error. People make mistakes, especially under pressure. And let’s be realistic, there’s always pressure when you’re editing data hot in production. Quite frankly, it’s not fair to put people in high-risk situations, whether they realize it or not. Shit happens, as they say, so why go looking for trouble? The sad reality is that most people in these situations see themselves as the savior rather than the victim. It’s no wonder this is such common practice.

It’s human nature to make mistakes. Don’t put people in positions where they can’t afford to make human errors.

At the same time, innovation requires experimentation. Experimentation is at its best when it’s OK to make mistakes along the way. It’s hard to shift gears between being a risk-taking innovator and being an infallible hero. People can’t be expected to do both well, so they become risk averse even when they don’t need to be. Nobody operates well at both extremes, and it’s innovation that suffers.

What Really Happened?

A few days later, there’s a new question about the data. Things don’t quite seem right, and the prior heroics are now in question. Did we fix the problem correctly, or did the problem return? What did we do last time this came up?

That ad hoc update wasn’t properly logged, and neither were the results. Sometimes you’re lucky and the query got saved, but how sure are you that what got saved is exactly what was run? Malice aside, there are a lot of ways to mix that up: saving the penultimate version of a script, not noticing the error that caused an incomplete execution, or running a different script than you thought you did. When you can’t trust your artifacts, any further investigation is dead in its tracks and the mystery goes unsolved.

If you follow your release process and let your automation run the updates, you know exactly what operations were performed on your data and what the results were. Most importantly, you can trust the artifacts. Jenkins makes records you can trust.
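What a trustworthy artifact might look like is sketched below. The wrapper is hypothetical (the file name and record fields are invented, not part of any particular tool), but the idea is one any CI job can implement: capture a hash of the exact script that ran, along with its output and exit code, and archive that record with the build.

```python
# run_data_fix.py  (hypothetical wrapper, for illustration)
#
# Runs a data-fix script and writes an execution record the pipeline can
# archive: a hash of the exact file that ran, the outcome, and timestamps.
import hashlib
import json
import subprocess
import sys
from datetime import datetime, timezone


def run_and_record(script_path: str, record_path: str = "execution_record.json") -> int:
    with open(script_path, "rb") as f:
        script_bytes = f.read()

    started = datetime.now(timezone.utc).isoformat()
    result = subprocess.run(
        [sys.executable, script_path, "--apply"],
        capture_output=True,
        text=True,
    )

    record = {
        "script": script_path,
        "sha256": hashlib.sha256(script_bytes).hexdigest(),  # proves which version ran
        "started_utc": started,
        "finished_utc": datetime.now(timezone.utc).isoformat(),
        "exit_code": result.returncode,
        "stdout": result.stdout,
        "stderr": result.stderr,
    }
    with open(record_path, "w") as f:
        json.dump(record, f, indent=2)
    return result.returncode


if __name__ == "__main__":
    sys.exit(run_and_record(sys.argv[1]))
```

Archived next to the build log, a record like this answers “what did we do last time?” without relying on anyone’s memory or scratch files.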

Getting Wasted at Work

Time and money, that is, when you don’t follow proper release processes for data changes. Anyone who does this will, of course, argue the opposite. They don’t have time to follow release protocols and neither does the customer. Rationalization like this neglects the fact that most support work doesn’t need to happen, and shouldn’t be happening. Support problems are often quality issues by another name.

There are very few truly one-off support problems. Rather, there are a lot of repeating problems that keep showing up and require repeated resolution. As luck would have it, computers are good at doing repetitive work for us! We just have to tell them how. When support issues go straight from the help desk to an ad hoc data edit, you don’t create process artifacts that allow improvements to happen. You take away the opportunity to look for root causes and never collect the information needed to improve the product. You deny your customers the opportunity to not need to contact the help desk at all!

Resolving issues quickly is good customer service, but preventing the issue entirely is better. Having the hero fix things without a root cause analysis ensures the issue will recur, which means repeating the cycle of risk and wasted time. The combination of wasted time and a broken feedback cycle depresses product quality. Letting this cycle continue leads to a state of constant crisis, a stale product, and a slow death.

Now What?

Automate your build, test, and deployment processes. Utilize a CI/CD server to further constrain and automate these processes.
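Continuing the hypothetical migration sketched earlier, here is what the “test” part might look like: a small test module the CI server runs on every change, against a throwaway in-memory database rather than production. The module path and schema are invented for the example.

```python
# tests/test_fix_orphaned_invoices.py  (hypothetical, pairs with the earlier sketch)
#
# Run by the CI server on every change, against an in-memory database,
# so the data fix is reviewed and proven before it ever touches production.
import sqlite3

from migrations.fix_orphaned_invoices import fix_orphaned_invoices  # hypothetical module


def make_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE customers (id INTEGER PRIMARY KEY);
        CREATE TABLE invoices (
            id INTEGER PRIMARY KEY,
            customer_id INTEGER,
            status TEXT DEFAULT 'open'
        );
        INSERT INTO customers (id) VALUES (1);
        INSERT INTO invoices (id, customer_id) VALUES (10, 1);  -- valid customer
        INSERT INTO invoices (id, customer_id) VALUES (11, 99); -- orphaned
        """
    )
    return conn


def test_dry_run_reports_but_does_not_write():
    conn = make_db()
    assert fix_orphaned_invoices(conn, dry_run=True) == 1
    statuses = [row[0] for row in conn.execute("SELECT status FROM invoices")]
    assert "orphaned" not in statuses  # nothing actually changed


def test_apply_marks_only_the_orphaned_invoice():
    conn = make_db()
    assert fix_orphaned_invoices(conn, dry_run=False) == 1
    status = conn.execute("SELECT status FROM invoices WHERE id = 11").fetchone()[0]
    assert status == "orphaned"
```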

Perhaps ironically, the problem processes that stand in the way of automation are themselves best kept in check by automation. Automation often forces the adoption of better practices, and it leaves behind logs that allow post-mortem analysis so you can improve.

Add procedures and controls to discourage doing things the wrong way. Make circumventing the release process more painful than following it.

There may still be the occasional call for heroics, but with proper discipline it will decrease. When heroics are truly needed, ensure all the details get logged before they start, so that corrective action can be taken. Quality goes up. Stress goes down. Customers take the help desk off their speed dial. Life gets better.