One of my favorite things to do is take a case study or real-world situation and apply it to our industry or my job.  The first time I did this in earnest, I wrote Data Flows Made Easy. I was inspired by an article published in the Harvard Business Review that described the disconnect between different groups of designers and engineers ((Sosa, Manuel E., Steven D. Eppinger, and Craig M. Rowles. “Are Your Engineers Talking to One Another When They Should?” Harvard Business Review, Volume 85, Number 11 (November 2007): 133-142.)).  I was somewhere on a plane (SURPRISED!?!?), and as I read through the article, it struck me that this method could be directly applied to data security and the challenges that my clients lived through.

Oil lamp, by ralphunden

When there is a major event not directly related to information security, I like to think about what types of things I can learn from it.  In case you have been living under a rock, a major accident happened at an oil rig in the Gulf of Mexico four weeks ago. Stopping the leak has been challenging, and some fear that we may be looking at major ecological impact for years to come.

The problem began with a blowout on the Deepwater Horizon rig off the coast of Louisiana, resulting in an explosion that left eleven workers missing. The explosion fueled a fire that consumed the rig over a period of 24 hours, causing it to sink in 5,000 feet of water.  Now oil is leaking into the Gulf of Mexico at an alarming rate, in a place where it shouldn’t be.

Deepwater wells like this one are outfitted with a blowout preventer (BOP). For many reasons, this BOP has not engaged, and oil is leaking from the well through some of the drill pipe “north” of the BOP that was bent and severely damaged during the sinking of Deepwater Horizon.  If the BOP were functioning correctly, the leak would be significantly reduced, if not completely sealed.

This crisis for Transocean (operator of the rig) and BP is an extreme example of something going wrong.  Is it a Black Swan?  I think time will tell, but as with most crisis situations, it wasn’t one single failure that caused the issue.  Several failures, happening in a relatively specific order, caused this massive tragedy.

While information security professionals often prevent massive breaches from occurring, sometimes a specific series of events happens that discloses data.  It’s up to us to play our role in stopping the breach and containing the loss when that happens. But the last thing we want to do is cause problems for “future Branden” by shortcutting security today ((You know, like when Future Marshall and Future Ted have to figure out who gets the apartment when Lily and Marshall get married?)).

According to reports, the BOP failed in several ways—one of which involved modifications that were undocumented or unexpected ((For the record, I am by no means an expert on deep water drilling, or oil rigs in general. I’m just collecting from public information and inferring relevance to information security.)).  This is a perfect example of how the employee of today can ruin the day for the employee of tomorrow.

While blowouts are not rare, massive failures like this one are.  What can information security professionals do to prevent this type of catastrophic failure in their own businesses?  Here are a few specific things we can learn from this tragedy and apply to our daily routine:

  • Innovative solutions (especially in break-fix scenarios) can save the day, but you can’t stop there. You need to 1) go back and DOCUMENT all aspects of the solution, and 2) venture to upgrade, replace, or actively support extreme customizations ((That’s for you, Springfield.  Go UTILIZE that hammer to give people headaches.)) so that Future Employee can more quickly diagnose a problem and come to a workable solution.
  • Avoid stretching the technical limitations of hardware, software, and security solutions for any extended period of time.  Sure, everyone has pushed some system past its limits a time or two, but you can’t do that forever.  Limits imposed by engineers exist for a reason. Don’t rely on Lady Luck to get you through the day.
  • Regularly test your detection systems to ensure they are operational. If you are counting on your logging subsystem to tell you when certain conditions exist that could lead to a failure, you probably want to make sure that it works!  This means testing that logs actually appear when the triggering conditions exist, making sure you have ample space on your logging device, and confirming that the information is fed into a system that can get it to the right people in time to act on it (see the first sketch after this list).
  • Try not to shoot yourself in the foot.  This is a hard one because market conditions, corporate culture, and budgetary constraints can dictate it.  Ideally, you don’t want to do something today that will make a crisis an order of magnitude worse.  Example? Neglecting a data file on a FAT file system until it grows to the four-gigabyte limit.  Seriously, when it hit three, you should have either converted the file system or found a way to break the file into smaller chunks (see the second sketch below).
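
To make the log-testing point concrete, here is a minimal sketch of a self-test: inject a uniquely tagged event, confirm it actually lands where your monitoring reads it, and check that the logging volume has headroom. The file paths, thresholds, and timeout here are my own assumptions for illustration, not a prescription for your environment.

```python
#!/usr/bin/env python3
"""Minimal sketch of a logging-pipeline self-test (Unix only, since it
uses syslog). Paths and thresholds are illustrative assumptions."""

import shutil
import syslog
import time
import uuid

LOG_FILE = "/var/log/syslog"      # hypothetical: wherever your events land
LOG_VOLUME = "/var/log"           # the volume that must not fill up
MIN_FREE_BYTES = 5 * 1024**3      # assumed 5 GB headroom requirement
TIMEOUT_SECONDS = 30

def inject_test_event() -> str:
    """Emit a uniquely tagged event so we can look for it downstream."""
    marker = f"logpipeline-selftest-{uuid.uuid4()}"
    syslog.syslog(syslog.LOG_WARNING, marker)
    return marker

def event_arrived(marker: str) -> bool:
    """Confirm the marker shows up where the monitoring system reads it."""
    deadline = time.time() + TIMEOUT_SECONDS
    while time.time() < deadline:
        with open(LOG_FILE, errors="replace") as fh:
            if any(marker in line for line in fh):
                return True
        time.sleep(2)
    return False

def volume_has_headroom() -> bool:
    """Make sure the logging device is not about to run out of space."""
    return shutil.disk_usage(LOG_VOLUME).free >= MIN_FREE_BYTES

if __name__ == "__main__":
    marker = inject_test_event()
    ok_event = event_arrived(marker)
    ok_space = volume_has_headroom()
    print(f"test event delivered: {ok_event}, disk headroom ok: {ok_space}")
    raise SystemExit(0 if (ok_event and ok_space) else 1)
```

Run something like this out of cron so that the day you actually need the logs is not the day you discover the pipeline broke.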
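
And for the FAT file system example, here is a minimal sketch of the “break it into smaller chunks” idea: when a growing data file approaches the four-gigabyte ceiling, roll it into numbered pieces before the file system stops you. The file name, thresholds, and chunk size are hypothetical.

```python
#!/usr/bin/env python3
"""Minimal sketch: split a data file into numbered chunks before it hits
the FAT32 4 GB ceiling. Names and thresholds are illustrative only."""

import os

DATA_FILE = "transactions.dat"    # hypothetical growing data file
FAT32_LIMIT = 4 * 1024**3         # FAT32 caps files just under 4 GB
WARN_AT = 3 * 1024**3             # start acting at 3 GB, per the post
CHUNK_SIZE = 1 * 1024**3          # roll into 1 GB pieces

def split_file(path: str, chunk_size: int,
               buffer_size: int = 64 * 1024**2) -> list[str]:
    """Copy the file into numbered chunks (buffered reads), then remove
    the oversized original."""
    chunks = []
    with open(path, "rb") as src:
        index = 0
        while True:
            written = 0
            chunk_name = f"{path}.{index:03d}"
            with open(chunk_name, "wb") as dst:
                while written < chunk_size:
                    data = src.read(min(buffer_size, chunk_size - written))
                    if not data:
                        break
                    dst.write(data)
                    written += len(data)
            if written == 0:
                os.remove(chunk_name)   # nothing left to copy; drop empty chunk
                break
            chunks.append(chunk_name)
            index += 1
    os.remove(path)
    return chunks

if __name__ == "__main__":
    size = os.path.getsize(DATA_FILE)
    if size >= WARN_AT:
        print(f"{DATA_FILE} is {size / 1024**3:.2f} GB; splitting before the "
              f"{FAT32_LIMIT / 1024**3:.0f} GB FAT32 limit bites")
        for chunk in split_file(DATA_FILE, CHUNK_SIZE):
            print(f"wrote {chunk}")
```

Converting the volume to a file system without the 4 GB cap is obviously the cleaner fix; the splitter just buys you time.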

While this is one of those excellent examples of how risk management may have failed us (or maybe worked perfectly, given the assumptions fed into the equation), following the steps above will reduce the probability of a catastrophic failure that makes recovery expensive and challenging.

This post originally appeared on BrandenWilliams.com.
