Major IT outages now make big news, and unwelcome headlines for those involved. They seem a relatively new phenomenon, but one that looks unlikely to abate. Many large enterprises are built on a foundation of aging IT systems, held together by bespoke software designed many years ago – the proverbial sticking plaster. Their IT estates have been built up over many years, with legacy systems often outliving the IT managers and teams that developed and installed them.
Boards driven by intense M&A activity during the 1990s and 2000s have left a legacy of further IT sprawl. In these cases it is often only a matter of time before the sticking plaster comes loose and a major IT outage sends tides and floods through the business, affecting customers and counterparties and, worse still, reputation and stock price.
In tough economic times many boards see little need or desire to invest in integrating and rationalising these sprawling IT systems, even seeking comfort in the hope that siloed systems will mean ring-fenced repercussions if they do fail. However, as recent news headlines have highlighted, out-of-date estates and broader networks can fail, with serious implications when they do…
High profile outages
American Airlines recently suffered a ‘mammoth technology meltdown’ that grounded its entire fleet and left the airline unable to book flights, track bags or even load and fuel planes.
"As you’d imagine, we do have redundancies in our system," Tom Horton, chief executive of parent company AMR Corp, said in an apology posted on YouTube. "But unfortunately in this case, we had a software issue that impacted both our primary and backup systems."
Flight cancellations and delays continued even after the IT issues were rectified, and the aftermath of such a failure will continue to resonate for many months. Complexity of systems and vast geographic scope cannot act as excuses for poorly constructed or aging IT systems, especially when so much is at stake. American Airlines’ intended merger with US Airways raises the question of whether its technology is ready for it to become the world’s largest airline.
The banking industry is also well known for having large and aging IT estates, which are beginning to show signs of distress and have caused some major and highly publicised IT failures. These failures have affected customers and reputations in an industry that needs to demonstrate strong organisational competence to restore its battered reputation. In our age of social media and "bash the bankers" journalism, these back office issues are becoming front page news too frequently for comfort.
In response to recent unforeseen failings in the banking industry, the banking regulator, the FCA (previously the FSA), wrote to all UK retail banks in September 2012. It required each bank to submit a written account of what it had done to ensure the overall resilience of critical infrastructure and banking processes, and of what contingency plans were in place to restore service within an acceptable timeframe if a failure did occur. This has been followed by a specific FCA investigation, commencing 12 April 2013, into RBS’ IT problems during June and July 2012.
The consequences of an IT outage
A catastrophic IT failure affects not just systems and employees but customers and counterparties too, turning a back office issue into public damage to customer satisfaction and loyalty.
The consequences of a large scale IT failure also have the potential to resonate long after the initial incident. From a business perspective, the initial hit to brand reputation, stock price and productivity can be almost instantaneous, but it is how the business responds to the outage that will impact how far the repercussions spread.
Outages that impact the end-user, if not managed carefully, can lead to anger directed towards the brand and subsequent loss of custom. In an age where customer loyalty is wearing thin, how many brands can leave their valued customers without a service for any meaningful period and expect their reputation and trading to remain immune?
Businesses need to realise that there is now a tight correlation between technology and brand reputation. In today’s ‘always on’ culture, being forced offline leads not only to a very public backlash but also to the loss of invaluable customer and market confidence.
Operational and infrastructure stability is the order of the day for CIOs, so their credibility can take a major hit when disaster strikes. Not only can they find themselves hauled in front of the board to deliver a comprehensive explanation of the situation, but they will also be judged by the board’s confidence in their recovery plan. A major IT failure does not look good on the CV: a poor recovery is worse.
The loss and damage arising from an outage can be considerable, but quantification can be difficult. Some losses are relatively easy to ascertain, for example where third party contractors are required and direct costs are incurred, but others, such as reputational damage and loss of business, can be more difficult to capture in monetary terms.
Where the blame lies
The perfect storm for an outage often involves a combination of failed or failing processes, technology and human error, sometimes exacerbated by the hour of the day or day of the week. Contrast a failure taking place in a batch run on a Sunday evening which rolls into Monday morning, with a failure mid-week during the online day.
An understanding of where lines of responsibility sit is crucial from the moment an outage is identified. With many IT services being outsourced to a range of suppliers, risk and responsibility need to be accurately identified. This is more easily said than done. When systems seriously underperform or fail, it is often very difficult to identify where the problem resides. Enterprise IT systems and platforms are interconnected in complex networks both within and often outside the entity, for example with a cloud service. This does not make diagnosis quick or straightforward, and the remedy can often involve a series of trial-and-error fixes with rounds of remedial testing.
In the heat of the moment, things are said and emails written which can help or hinder the recovery and remedial process, but which might also compromise or prejudice subsequent claims and cross claims.
Where responsibility appears to lie elsewhere and the scale of the loss and damage is substantial, it is inevitable that parties will look to their contracts and their contractors for recompense. Most enterprises no longer hold longstanding contracts with an exclusive supplier. The need for agility, flexibility, scalability and cost reduction has driven multi-service and multi-vendor relationships, which present a complex legal picture when added to the uncertainties of cause and effect.
IT contracts often have reasonably standard provisions limiting or excluding liability for losses flowing from a breach. It would be extremely rare to find a contract where the supplier indemnifies an organisation – be it bank, telecom provider, or airline – for its whole business going down. Given recent failures, however, it is perhaps not surprising that major customers with strong negotiating positions are looking to hedge their positions by extending their suppliers’ remedial obligations and those suppliers’ potential exposure to customer claims for losses arising from a major breakdown.
Insurers also have a part to play where business interruption, errors and omissions, and cyber liability policies potentially cover an event. This can lead to additional complexity due to the terms of the insurance cover and the insurer’s rights of subrogation. An insurer’s interests are not always aligned with those of a business suffering the fall-out of an IT outage. An organisation will generally want to wrap up the situation, seek recompense quickly, and move on. However, an insurance company will want to minimise the money paid out under the policy, which may require the insured to play a longer game.
Preparing for the inevitable?
The way a business responds to an outage is critical to system recovery, the financial outlay and the rebuilding of its reputation. The right strategy, plan, processes and people can mitigate the impact and determine whether it is a ripple, a tide or a flood.
Capturing evidence, identifying the chain of causation and understanding the losses and costs involved are crucial. Crisis management protocols and a positive PR campaign will always take priority in an outage situation. However, care needs to be taken during the recovery period to ensure reasonable positions are maintained which align with contractual obligations and responsibilities.
Sticking plasters will always fall off over time, so organisations need to start making concrete plans to prevent and mitigate the effects of IT failure. Getting the right technology in place is only half the story. Ensuring process clarity and a distinct chain of decision-making is equally crucial, as is expert advice and support.
With these factors in place, a business can create a strong foundation for contingency planning and a strong recovery without substantial loss, thus managing the ripple effect if an outage does occur and avoiding a very damaging flood.