I've been thrown into an infrastructure role recently, and this of course has forced me to try to codify some of what I've been doing in a manner that I can pass on to others. The fact that I've been getting about 500 emails about “warnings”, “errors” and “business as usual” per day hasn't improved my mood about taking this new role on — although I expect it'll get to become quite enjoyable once the team and I start to get a proper handle on things.
There are so many ways of trying to predict what might happen with systems, but in the end there's only one thing that we can really deal with: a real server failure for a reason that we hope to be able to identify. So in order to handle this we put alerts on all sorts of things: memory usage, CPU usage, network usage, and any other kind of usage we can hook an alerting system up to.
Our intuition that a memory warning (say, more than 90% of RAM in use) means something isn't actually worth anything unless we can correlate it with at least a concrete service failure (what you classify as a service failure, well, that's between you and your SLAs).
What we miss here is that, sure, high RAM usage may well be a factor in a failure, but it's not predictive unless high memory usage actually means (in, let's say, 90% of cases) that we're going to suffer a service outage. If I get an email every day because a certain process running at off-peak time uses 90% of RAM and everything remains OK, then all that means is that I'm getting spammed by an email every day. The warning isn't useful if there are never any problems associated with RAM consumption. Equally, it doesn't mean that a warning at more than 80% usage during peak hours might not presage a complete outage. Our warnings need to be contextual, but above all they need to be predictive.
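For illustration, a contextual check might look something like this. The peak window and both thresholds are numbers I've invented for the sketch, not recommendations:

```python
from datetime import time

# Hypothetical peak window; real systems would derive this from traffic data.
PEAK_START, PEAK_END = time(9, 0), time(18, 0)

def ram_warning(ram_fraction, now):
    """Warn on RAM usage using a threshold that depends on time of day.

    Off-peak, a batch job eating 90% of RAM is business as usual, so the
    bar is higher; during peak, the same level is far more worrying.
    """
    in_peak = PEAK_START <= now <= PEAK_END
    threshold = 0.80 if in_peak else 0.95
    return ram_fraction > threshold

ram_warning(0.90, time(3, 0))   # 90% at 3am: no warning
ram_warning(0.85, time(12, 0))  # 85% at midday: warn
```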
That is, they need to be predictive in the scientific, Popperian sense of meaning “this alert has a very high correlation with a problem I care enough about that I'm going to have to fix it now”. In that case the alert in itself isn't actionable, but at least it primes you to be ready for the outage that will surely follow, a “heads up” if you will. If that correlation isn't at least 90% (a number I pulled from thin air, but it needs to be very high; I think 90% is the lowest number that might be useful) then people will learn to ignore the warnings. And once warnings can be ignored, you might as well not bother issuing them.
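One way to make “predictive” concrete is to measure, from your own history, what fraction of past firings of an alert were actually followed by an outage. A minimal sketch with made-up timestamps:

```python
from datetime import datetime, timedelta

def alert_precision(alert_times, outage_times, window=timedelta(hours=1)):
    """Fraction of alerts followed by an outage within `window`.

    A crude proxy for how predictive the alert is; the window length
    is a judgment call for each alert.
    """
    if not alert_times:
        return 0.0
    hits = sum(
        any(a <= o <= a + window for o in outage_times)
        for a in alert_times
    )
    return hits / len(alert_times)

# Hypothetical history: the RAM warning fired on four nights,
# but only one firing was followed by a real outage.
alerts = [datetime(2024, 1, d, 3, 0) for d in (1, 2, 3, 4)]
outages = [datetime(2024, 1, 3, 3, 30)]

precision = alert_precision(alerts, outages)
# 0.25: far below the ~0.9 bar, so this warning will be learned and ignored.
```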
The basic premise here is that every alert must be actionable in some way. If it's not actionable then it's purely diagnostic, and there's nothing wrong with that: we need a huge amount of data from before and after failures in order to work out what factors caused them. Once we know that, we can hopefully codify it into a predictive warning that tells us what we need to do (or, even better, an automated warning that tells us what was reconfigured to ensure the problem doesn't arise). Of course, if the correlation is high enough, and the downside low enough, then we should just automate the mitigation so that solving the problem becomes part of normal system operation.
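As a sketch of that last step: conditions whose fix we trust get an automated mitigation, and everything else still goes to a human. All the condition names and handlers here are hypothetical:

```python
def rotate_logs():
    # Stand-in for a real, low-risk, well-tested mitigation.
    return "rotated logs"

def page_oncall(condition):
    # Stand-in for whatever paging system you actually use.
    return f"paged on-call about {condition}"

# Conditions with a known, safe, automated fix. Only conditions with a very
# high failure correlation and a low-downside remedy should ever end up here.
AUTOMATED = {
    "log_partition_full": rotate_logs,
}

def handle(condition):
    fix = AUTOMATED.get(condition)
    if fix is not None:
        return fix()               # mitigation is just normal operation now
    return page_oncall(condition)  # everything else still needs a human

handle("log_partition_full")
handle("db_replica_lag")
```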
Diagnostic alerting that isn't actionable, however, is just a distraction; we end up ignoring it all. What I want is for post mortems of real world failures to show a strong enough correlation that we can issue a warning when certain things happen, and to use that information to try to pre-empt a failure. If we can always pre-empt the failure then doing so should become part of our normal operating procedures (and be fully automated), and no longer require a warning.
Now, of course, none of this should detract from “error” reporting — that being reporting that something is broken. The big problem with the technologies that we use today (especially at scale) is that “errors” aren't hard and fast. Transient problems come and go all the time, and we have to deal with that as a reality.
With a simple system it's possible to say it either works or it doesn't, but once you have more than a couple of services working together to serve results, and you have a bit of fault tolerance thrown in, you can't talk about the system being up or down, only parts of it being up or down. What to do about this is something I think we still need to work out on a case by case basis.
I hate taxonomising, but still… The essential tl;dr is here:
An error needs to be reserved for an actual system outage. There is something concretely wrong that needs immediate action by somebody smart enough to look at the full context and decide on the acceptable response, and who has the authority to take it (this last bit seems to be missing far too often in the case studies you see on the internet).
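A sketch of what such an error might carry: enough context for a smart responder to decide, plus an explicit owner with the authority to act. The field names here are my own invention, not from any particular tool:

```python
from dataclasses import dataclass, field

@dataclass
class ErrorAlert:
    service: str        # what is broken
    summary: str        # what "broken" means right now
    owner: str          # who is both able and authorised to respond
    context: dict = field(default_factory=dict)  # metrics/log snapshot

alert = ErrorAlert(
    service="checkout",
    summary="payment gateway returning 5xx for all requests",
    owner="payments-oncall",
    context={"error_rate": 1.0, "duration_s": 120},
)
```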
I'm still less sure about warnings, but some ideas are:
The idea here is that even warnings are actionable. Non-actionable messages hitting my email, pager, phone or anything else are essentially spam, and I really don't want to set our systems up to spam myself or anybody else.
I'm hoping that after working this through for a month or so on all of our systems, I can go from more than 500 emails per day saying something might be worth looking into, down to about 3 per week flagging something that certainly needs doing.
If we can be clever enough in how we deal with all of this my plan is to bring it down to around 3 per year. I'll live in hope :)