Avoiding Catastrophic Failure

You may have already heard the news about Delta Airlines' catastrophic failure. Ars Technica reports the true cause of the outage – routine maintenance of the power generators. Housing your entire infrastructure in a single datacenter may seem a little presumptuous, or high on the bragging scale, but it is not the best method. The blame is often placed on IT personnel when computer systems go down, but in this case the error is shared: a maintenance individual did not spot the potential for a fire, the building planning committee placed the power sources too close together, the IT budgeting team did not fund an off-site solution, and the CTO was misinformed about the infrastructure needs of a worldwide company. A catastrophic failure is anything that damages a company's reputation.

I can understand the single point of failure – it is often found in SMB and non-profit environments. The odds of that single point actually failing seem marginal at best, so it gets overlooked time and again, with everyone hoping the scenario never comes up. Budgetary constraints are the first roadblock, time to implement is the second, internal security practices around customer data are the third, and the assumption that a restore after a catastrophic failure would take less than 24 hours is the fourth – all of these minimize the single point of failure in our minds. We minimize it so often that it slides from the #1 concern to #100 on the "do someday" list.

We live in the best computer age yet, and catastrophic failures can be avoided. Here are a few ways to prevent them.

  1. Keep up-to-date on current software trends/best practices.
  2. Don’t keep all of your eggs in one basket. Keep your services separated.
  3. Research and obtain certifications relevant to IT, but not necessarily directly related to your field of work. For instance, a network engineer achieving a Linux certification or a SysAdmin getting a CCNA.
  4. Get IT audited. If the audit returns no suggestions, get audited by another company within 3 months.
  5. Simulate all of your disaster recovery scenarios at least twice every 5 years. Have different managers perform the restore, meet to review their findings (how long it took, what was needed, budget), and record them in documentation.
  6. Ease the restore time by having everything in a configuration management engine (like Puppet, Chef, Ansible, Salt, and others) – see the sketch after this list.
  7. Reward IT employees if there have been no service disruptions in the past month. Yes, do it regularly. It can be as simple as a thank-you card, gift card, or the ability to leave early one day in the next week.
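
To make item 6 concrete, here is a toy sketch of the idea behind those engines: you declare the state a server should be in, and the tool converges the machine to that state idempotently, so a rebuilt box comes back identical every time. This is plain Python with placeholder package and file names for illustration, not a real Puppet/Chef/Ansible/Salt manifest.

```python
# Toy illustration of what a configuration management engine automates:
# declare desired state, converge idempotently. Package/file names are
# placeholders, and a Debian/Ubuntu host with root access is assumed.
import shutil
import subprocess
from pathlib import Path


def ensure_package(name: str) -> None:
    """Install a package only if its binary is not already on the PATH."""
    if shutil.which(name) is None:
        subprocess.run(["apt-get", "install", "-y", name], check=True)


def ensure_file(path: str, content: str) -> None:
    """Write a config file only if it is missing or differs from the desired content."""
    target = Path(path)
    if not target.exists() or target.read_text() != content:
        target.write_text(content)


if __name__ == "__main__":
    ensure_package("nginx")
    ensure_file("/etc/motd", "Managed by configuration management -- do not edit by hand.\n")
```

Run it once or run it fifty times and the machine ends up in the same state – that idempotency is what makes restores after a disaster fast and predictable.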

I do not know Delta Airlines' software/hardware stack, but I would estimate that they have at least a database and a front end that communicates with it. With that estimate, it would be straightforward to set up an AWS environment where an auto scaling group contains a replica of the datacenter database. Even if it required a manual switch from the on-premises database to the cloud database, the total downtime would have been less than 20 minutes (10-15 minutes for the IT manager to find the wiki entry and delegate the task to the right person, 5 minutes for DNS propagation). Maintaining something like this would run around $500/month for the AWS VPN connection and database servers. Whether they go this route or not, only time will tell. One thing is certain: changes are coming to their infrastructure.
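
Since I am only guessing at the stack, here is a minimal sketch of what that manual switch could look like if the cloud copy were an RDS read replica and the front end reached the database through a Route 53 CNAME. Every identifier below (replica name, hosted zone, record name) is hypothetical.

```python
# Hypothetical failover script: promote a cloud read replica and point DNS at it.
# Assumes an existing RDS read replica and a Route 53 CNAME the application
# uses to reach its database. All identifiers are placeholders.
import boto3

REPLICA_ID = "booking-db-replica"           # hypothetical RDS replica identifier
HOSTED_ZONE_ID = "Z123EXAMPLE"              # hypothetical Route 53 hosted zone
DB_RECORD_NAME = "db.internal.example.com"  # CNAME the front end connects to

rds = boto3.client("rds")
route53 = boto3.client("route53")

# 1. Promote the read replica so it accepts writes on its own.
rds.promote_read_replica(DBInstanceIdentifier=REPLICA_ID)

# 2. Wait until the promoted instance is available again.
rds.get_waiter("db_instance_available").wait(DBInstanceIdentifier=REPLICA_ID)

# 3. Look up the promoted instance's endpoint address.
endpoint = rds.describe_db_instances(DBInstanceIdentifier=REPLICA_ID)[
    "DBInstances"][0]["Endpoint"]["Address"]

# 4. Point the application's database CNAME at the cloud endpoint.
#    A low TTL (60s) keeps DNS propagation inside the 5-minute estimate above.
route53.change_resource_record_sets(
    HostedZoneId=HOSTED_ZONE_ID,
    ChangeBatch={
        "Comment": "Fail over to cloud database replica",
        "Changes": [{
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": DB_RECORD_NAME,
                "Type": "CNAME",
                "TTL": 60,
                "ResourceRecords": [{"Value": endpoint}],
            },
        }],
    },
)
print(f"Failover complete: {DB_RECORD_NAME} -> {endpoint}")
```

A runbook like this is exactly the kind of thing the IT manager would be digging out of the wiki in those first 10-15 minutes – which is also an argument for rehearsing it, per item 5 above.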