Lessons from DynDNS's approach to problems and systems failures

Created 8th December, 2005 02:44 (UTC), last edited 22nd October, 2006 07:19 (UTC)

I've always felt that the real mark of intelligence is not just learning from your own mistakes but, more importantly, learning from those of others. To learn from those mistakes, though, we have to know what they are.

I got this email from DynDNS on December 7th 2005:

Due to a configuration error during some normal maintenance activities, all e-mail into our ticketing systems, including abuse@dyndns.xxx, billing@dyndns.xxx, sales@dyndns.xxx, and support@dyndns.xxx, was lost, beginning at approximately 17:05 EST (GMT-0500), and ending at 19:55 EST today. If you sent a new e-mail to any of our queues during that time, and did not recieve [sic] an autoresponse or actual reply, you will need to re-send your e-mail. Also, if you sent a follow-up to an existing ticket during that time, you will need to re-send your message. At this time, all messages are again being received and processed correctly. We sincerely apologize for this issue, and are working to ensure that this type of misconfiguration cannot happen again, or that, if it does, it will be more quickly discovered and corrected.

The lesson that we should take from this isn't that we need to be careful about reconfiguring servers in case we make a mistake.* The lesson is that when we make a mistake, the first people we should tell are those it affects. In many cases (and DynDNS do this) those people will be our customers.

[*Every systems administrator is already careful (and if they're not, then fire them), but mistakes invariably happen. It just isn't possible to test every configuration change, and some that work in the test lab don't work in a live environment.]

Within any company that makes its living through technology there is a tension between the Engineers and the Marketeers (for these purposes the company's lawyers fall into the same group as the Marketeers). The Engineers know that everything fails sooner or later. A system with a 100% up-time record is just one that hasn't failed yet. Marketeers, on the other hand, want to present a shiny and unblemished view of their company to the world, and this normally means never admitting to failure.

Engineers have always learnt the most valuable lessons from studying system failures. This is partly why I started the bugs category. Failures need to be published so that there is at least a chance that others can learn from them. But I don't think it's enough to go out and find other people's failures, so I'm also going to spend more time documenting the errors that I make in my own software. I'm not going to spend much time on every little glitch or typo, but I do want to discuss why major changes are needed and when a mistake gets locked into the system design.
