Failures in the new Digital Infrastructure

System failure in the wild at Westfield Shopping Centre, acknowledgement to Martin Clinton

Just a quick post to get me back in the groove. Has it really been 6 months? As we increase the size and complexity of the digital infrastructures that deliver the services upon which we depend, outages like that at O2 will have bigger and bigger effects on society. Simply handling the complaints well does not do the business. Witness the recent challenges at RBS, which from this report look like poor control in systems development and implementation. Amazon Web Services have occasional outages owing to the usual factors, but with disproportionate effects. Netflix's rather honest appraisal of the failings of their own system design can be found via this report, which reveals the challenges in designing distributed systems to cope with all failure modes.

The point is that we need to:

  • be able to put such failures in perspective. We will always suffer power outages, lightning strikes and flooding, as well as the effects of human error and equipment failure. When you compare the number of failures suffered by Amazon in a year with the aggregated number of systems that they are operating, the failure rate is probably much lower than that of the best data centres in operation.
  • help service providers understand the proven strategies to minimise the risk and impact of such failures – they cannot be completely avoided, only mitigated. For example, Amazon have a very sophisticated set of services that allow their customers to manage failures and deploy their systems across different service centres, but are these services understood and used properly by their customers? While Netflix is pretty slick at deploying its services, including its novel Chaos Monkey service aimed at disrupting live services to demonstrate resilience, it still has stuff to learn about failure modes and their effects, hence their need for more technical staff!
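To make the second point a little more concrete, here is a toy sketch in Python of why spreading a deployment across service centres matters, and what a Chaos-Monkey-style disruption is testing. It is only an illustration under simple assumptions: the zone names, the `deploy`, `fail_zone` and `kill_random_instance` helpers, and the rule that the service is "up" while any one instance survives are all made up for this example, and none of it is Amazon's or Netflix's actual tooling.

```python
import random

def deploy(zones, per_zone):
    """Create a hypothetical deployment: per_zone instances in each zone."""
    return [{"zone": z, "running": True} for z in zones for _ in range(per_zone)]

def service_available(instances):
    """Assumption: the service is up while at least one instance is running."""
    return any(inst["running"] for inst in instances)

def fail_zone(instances, zone):
    """Simulate losing a whole service centre (power cut, flood, ...)."""
    for inst in instances:
        if inst["zone"] == zone:
            inst["running"] = False

def kill_random_instance(instances, rng):
    """Chaos-Monkey-style disruption: stop one running instance at random."""
    running = [inst for inst in instances if inst["running"]]
    if running:
        rng.choice(running)["running"] = False

# Same number of instances, deployed two different ways.
single_zone = deploy(["zone-a"], per_zone=4)
two_zones = deploy(["zone-a", "zone-b"], per_zone=2)

# The same outage hits both deployments.
fail_zone(single_zone, "zone-a")
fail_zone(two_zones, "zone-a")

single_zone_up = service_available(single_zone)
two_zones_up = service_available(two_zones)

# Further random disruption on the surviving deployment.
rng = random.Random(0)
kill_random_instance(two_zones, rng)
still_up_after_one_kill = service_available(two_zones)
kill_random_instance(two_zones, rng)
still_up_after_two_kills = service_available(two_zones)
```

With four instances in one zone, a single zone outage takes the whole service down; with the same four instances split across two zones, the service survives that outage and one further random kill, but not two. Real systems have many more failure modes than this, which is rather the point of the Netflix appraisal.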

There are other dimensions to this challenge, of course. The Large Scale Complex IT Systems research project in the UK is looking at the various factors involved in our increasingly layered, distributed systems designs. But there are no easy answers, just a better understanding of the challenge. We still need those talented engineers that Netflix and the other Cloud leaders are looking for.