The computer failure that brought US flights to a stop

Hello everyone and welcome once again to the Cognixia podcast. We have been around for close to a year now and we are floored by the love and support we get from our listeners. On the Cognixia podcast, every week we bring you a bite-sized dose of the latest happenings, discussions, guides, and a lot more from the world of emerging digital technologies. It has been a great journey so far, and we look forward to many, many more amazing episodes that will inspire you to take the next big leap in your career by taking the learning and upskilling route.

This week we talk about something major that happened earlier in January 2023. A computer failure at the Federal Aviation Administration brought flights across the United States to a complete standstill, with hundreds of delays quickly cascading through the systems at airports worldwide. The US government confirmed that there was no evidence of a cyberattack. In this episode, we will delve deeper into what happened that day.

The Federal Aviation Administration, or the FAA, has a system that sends vital communications to pilots. This system experienced a massive outage, causing thousands of flights to be grounded. The system is called NOTAM, or Notice to Air Missions. Before a flight takes off, pilots and airline dispatchers in the US are required to review the notices in NOTAM. These include important information about the weather, runway closures, ongoing construction work, and so on, all of which is needed for the flight’s smooth journey. NOTAM used to be telephone-based, but over time it was moved online.

The NOTAM system broke down earlier in January 2023, and it was a few hours before the system came back up, but not before 1200 flight cancelations and more than 7800 delays on the East Coast alone. Some of the busiest airports in the world, such as Chicago, New York, Los Angeles, and Atlanta, are in the United States, and they saw between 30 and 40% of flights delayed due to this outage.

Over the years, experts have often opined that a large chunk of FAA’s systems relies on old mainframe systems which are generally reliable but still out-of-date, requiring modernization and updating. Some pieces of the NOTAM system are about 30 years old, which in the technology space is very, very outdated.

A few days after the incident, it was revealed that unspecified personnel were responsible for corrupting a file in the system, which caused the outage of the FAA’s NOTAM computer system. According to a Fortune news report, the FAA does have stringent procedures in place to ensure that data doesn’t get damaged by technicians when they are working on the systems. However, the files got altered despite rules that prohibit these kinds of changes on live systems. Investigations are underway to determine whether the alterations were accidental or made with malicious intent.

Reportedly, when the system began having issues, the technicians working on the NOTAM system switched to the backup systems. Unfortunately, the backup systems were trying to access the same damaged files as the original systems, which led to the system breakdown. To restore the system, a complete shutdown became essential. This, in turn, required the FAA to announce a complete halt on all flight departures for about 90 minutes.
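To make the failover problem concrete, here is a minimal sketch in Python of one way a standby could avoid loading the same damaged file as the primary. This is purely our illustration and not how the FAA's systems work: the function names, the idea of keeping independent copies, and the use of a stored SHA-256 digest from the last known-good state are all assumptions.

```python
import hashlib
from pathlib import Path
from typing import Optional, Sequence


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def choose_data_file(
    candidates: Sequence[Path], known_good_digest: str
) -> Optional[Path]:
    """Return the first candidate whose digest matches the known-good value.

    Hypothetical helper: the candidates are assumed to be independent
    copies (primary, standby, offline snapshot) in order of preference.
    """
    for path in candidates:
        if path.exists() and sha256_of(path) == known_good_digest:
            return path
    # No intact copy found: stop and escalate rather than load bad data.
    return None
```

The design point is that the standby only helps if it reads from a copy that is verified independently of the primary. If both systems point at one shared file, a single corruption takes them both down.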

Huge losses have been incurred due to this system outage, affecting not just the flights and airports in the United States but in other countries of the world as well.

So, what does it all boil down to? We would say system resilience. Today, businesses are moving ahead at top speed to deliver innovative services while keeping up their high level of operational stability. This often leads to the system’s resilience taking a hit. A lot of traditional enterprises, including the FAA, have lagged in updating their systems to build much-needed resilience. This has led to a steep spike in system crashes, outages, and service-level disruptions as everyone continues to embrace new technologies, speed up IT application development, and work to scale up as quickly as possible. Many of these enterprises would have spent a good deal of resources on building service redundancy, having in place hot standbys, real-time backup-and-restore mechanisms, stronger disaster-recovery planning, and system self-healing capabilities.

Such outages lead to losses amounting to millions of dollars. Reports say a large retailer lost about $5 million in sales when its systems went down for several hours on a peak shopping day. A similar outage cost a software company close to 8% of its revenue.

The Ponemon Institute estimates that, on average, the cost of an unplanned outage like the one NOTAM experienced runs to about $9000 per minute, which works out to $540,000 per hour. There would be many other indirect costs over and above this as well. Additionally, customer satisfaction and employee productivity take a beating too.
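The per-hour figure is simply the per-minute figure multiplied by 60. As a quick sketch of that arithmetic in Python (the function name is ours, and the rate is only the average quoted above, not a measured cost for any particular outage):

```python
COST_PER_MINUTE_USD = 9_000  # average per-minute figure quoted above


def outage_cost(minutes: float, cost_per_minute: float = COST_PER_MINUTE_USD) -> float:
    """Estimate the direct cost of an outage lasting the given number of minutes."""
    if minutes < 0:
        raise ValueError("minutes must be non-negative")
    return minutes * cost_per_minute


print(outage_cost(60))  # 540000, i.e. $540,000 for one hour
```

Indirect costs, such as lost customer goodwill, are not captured by this kind of estimate.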


So what can an enterprise do?

Here are four ways recommended by McKinsey to improve technology service resilience.

First, McKinsey recommends going beyond the triggers to look for root causes and patterns instead of looking at things superficially. For instance, it is often surge events that lead to outages and breakdowns. So, don’t just treat an incident as a system overload; dig deeper to find the root cause of the problem. Was it poor capacity planning? Was it insufficient load or performance testing? Was it a rigid architecture that was too easy to break? To build resilience, these triggers need to be carefully studied and analyzed.

When the enterprise has a clear understanding of the root cause, systemic improvements can be implemented to build system resilience, which would significantly help reduce such incidents in the future.

The second way that McKinsey recommends is integration and automation to prevent and detect issues more quickly. A lot of enterprises have separate development, testing, and production environments. Monitoring these environments can be just as fragmented and challenging. To overcome these limitations, it is recommended to identify the most critical customer journeys and then modernize and automate the underlying processes end-to-end. Solutions could range from building a single management console that tracks and aggregates alerts across all possible customer journeys to building self-healing systems that execute automated scripts whenever there is an anomaly. Improvements in how change requests are handled would also help enterprises.
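To give a feel for what a self-healing step might look like, here is a minimal sketch in Python. It is our own illustration, not something McKinsey prescribes: the function name `self_heal`, the health-check and remediation callbacks, and the retry limit are all hypothetical.

```python
import time
from typing import Callable


def self_heal(
    is_healthy: Callable[[], bool],
    remediate: Callable[[], None],
    max_attempts: int = 3,
    wait_seconds: float = 0.0,
) -> bool:
    """Run a remediation script until the health check passes.

    Returns True if the service is healthy (possibly after remediation),
    or False if it is still unhealthy after max_attempts, at which point
    a human should be alerted.
    """
    if is_healthy():
        return True  # nothing to do
    for _ in range(max_attempts):
        remediate()               # e.g. restart a service or clear a queue
        time.sleep(wait_seconds)  # give the fix time to take effect
        if is_healthy():
            return True
    return False
```

In practice, `is_healthy` might poll a status endpoint and `remediate` might restart a service. Capping the number of attempts matters, because an automated fix that loops forever can hide a problem that needs a person to investigate.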

The third way recommended by McKinsey is to develop tools and expert networks to speed up the incident response by the enterprise. This aims to reduce the frequency and severity of outages and breakdowns, as well as minimize their impact. For this, enterprises would need access to expertise and also maintain clear, ongoing communication with all the stakeholders, including the customers. Knowledge repositories need to have user-friendly tools to help concerned teams implement troubleshooting measures quickly. McKinsey says, “Cataloging relevant subject-matter experts—both inside and outside the IT function—and bringing groups together for occasional brown-bag discussions and tabletop exercises have helped leaders create stronger networks of experts, which has led to faster and more effective responses to resiliency issues.”

The fourth means to achieve system resilience is to make sure problem management has structure and, more importantly, teeth. If that sounds confusing, let us explain. Enterprises need to have problem-management teams in place. The problem-management team would carefully analyze the incident and the response to it. Based on their findings, they would report on what went wrong and recommend what could be done better. These suggestions should not be ignored and should be carefully implemented as soon as possible. Enterprises need clearly defined service-level agreements, or SLAs, to drive accountability for the problem-management team as well as for the implementation of the recommendations they make. Having some institutional clout, such as strong CIO engagement, would also help the cause of the problem-management team.

Building system resilience is extremely important. Every enterprise today faces enough risk from the outside world. The least it can do is minimize the risk from internal factors, such as outages and breakdowns caused by systems that were not resilient enough. These measures can go a long way in safeguarding an organization’s interests and reputation while reducing the possibility of incurring tremendous losses. Needless to say, every enterprise must take this very seriously and act to mitigate this risk as soon as possible.

With that, we come to the end of this week’s podcast episode. Hope you enjoyed listening to us. To learn more about our courses as well as the ongoing promotions and offers, visit our website at www.cognixia.com. You can talk to us through the chat function on the website and get an immediate response to your queries too. Our live online instructor-led training and certifications are designed to help you imbibe all the important skills and knowledge essential to take the next big leap in your career & achieve your professional goals.

Until next week then, happy learning!