On January 25th, between 07:15 and 11:00 UTC, access to our application was affected by Azure’s global WAN issues.
During this time, users may have experienced difficulties accessing our application due to high network latency and/or timeouts.
For more details about the incident and its resolution, see the post-incident review below, published by our cloud provider, Microsoft Azure.
What happened?
Between 07:08 UTC and 12:43 UTC on 25 January 2023, customers experienced issues with network connectivity, manifesting as long network latency and/or timeouts when attempting to connect to resources hosted in Azure regions, as well as other Microsoft services including Microsoft 365 and Power Platform. This incident also impacted Azure Government cloud services that were dependent on Azure public cloud. While most regions and services had recovered by 09:05 UTC, intermittent packet loss issues caused some customers to continue seeing connectivity issues due to two routers not being able to recover automatically. All issues were fully mitigated by 12:43 UTC.
What went wrong and why?
At 07:08 UTC a network engineer was performing an operational task to add network capacity to the global Wide Area Network (WAN) in Madrid. The task included steps to modify the IP address for each new router and to integrate it into the IGP (Interior Gateway Protocol, a protocol used for connecting all the routers within Microsoft’s WAN) and BGP (Border Gateway Protocol, a protocol used for distributing Internet routing information into Microsoft’s WAN) routing domains.
Microsoft’s standard operating procedure (SOP) for this type of operation follows a 4-step process that involves: [1] testing in our Open Network Emulator (ONE) environment for change validation; [2] testing in the lab environment; [3] a Safe-Fly Review documenting steps 1 and 2, as well as roll-out and roll-back plans; and [4] Safe-Deployment, which allows access to only one device at a time to limit impact.

In this instance, the SOP was changed prior to the scheduled event to address issues experienced in previous executions of the SOP. Critically, our process was not followed: the change was not re-tested and did not include proper post-checks per steps 1-4 above. This unqualified change led to a chain of events which culminated in the widespread impact of this incident.

The change added a command to purge the IGP database. However, the command operates differently depending on the router manufacturer. Routers from two of our manufacturers limit execution to the local router, while those from a third manufacturer execute across all IGP-joined routers, ordering them all to recompute their IGP topology databases. While Microsoft has a real-time Authentication, Authorization, and Accounting (AAA) system that must approve each command run on each router, including a list of blocked commands that have global impact, the command’s different global default action on the router platform being changed was not discovered during the high-impact command evaluation for this router model and, therefore, had not been added to the block list.
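The gap described above can be illustrated with a minimal sketch of a per-platform command block list. Everything here is hypothetical: the command name, platform labels, and function are illustrative stand-ins, not real vendor syntax or Microsoft's actual AAA system.

```python
# Hypothetical sketch of an AAA-style command gate with a per-platform
# block list. Command and platform names are illustrative only.

BLOCKED_COMMANDS = {
    "vendor_a": {"purge-igp-database"},  # command was evaluated and blocked
    "vendor_b": {"purge-igp-database"},
    # For the third platform, the command's global default scope was not
    # discovered during evaluation, so it was never added to the block list.
    "vendor_c": set(),
}

def aaa_approve(platform: str, command: str) -> bool:
    """Return True if the command may run on a router of this platform."""
    return command not in BLOCKED_COMMANDS.get(platform, set())

# The same command is denied on two platforms but permitted on the third,
# where its default action happens to be global.
assert not aaa_approve("vendor_a", "purge-igp-database")
assert aaa_approve("vendor_c", "purge-igp-database")
```

The point of the sketch is that a block list is only as good as the per-platform evaluation that populates it: identical command strings can carry very different blast radii.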
Azure Networking implements a defense-in-depth approach to maintenance operations which allows access to only one device at a time to ensure that any change has limited impact. In this instance, even though the engineer only had access to a single router, it was still connected to the rest of the Microsoft WAN via the IGP protocol. Therefore, the change resulted in two cascading events. First, routers within the Microsoft global network started recomputing IP connectivity throughout the entire internal network. Second, because of the first event, BGP routers started to readvertise and validate prefixes that we receive from the Internet. Due to the scale of the network, it took approximately 1 hour and 40 minutes for the network to restore connectivity to every prefix.
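A toy model can show why single-device access control does not contain a command whose scope is "all IGP-joined routers". This is not real router behavior, just a minimal graph-propagation sketch under that assumption.

```python
# Minimal illustrative model (not real router behavior): a command issued on
# one router propagates over IGP adjacencies and forces every reachable
# router to recompute its topology database.

class Router:
    def __init__(self, name):
        self.name = name
        self.peers = []          # IGP adjacencies
        self.recomputed = False

    def purge_igp(self, scope="global", seen=None):
        seen = seen if seen is not None else set()
        if self.name in seen:
            return seen
        seen.add(self.name)
        self.recomputed = True   # router rebuilds its IGP topology database
        if scope == "global":    # the third manufacturer's default behavior
            for p in self.peers:
                p.purge_igp(scope, seen)
        return seen

# Build a small WAN: the engineer has access only to r0,
# but IGP connects r0 to everything else.
routers = [Router(f"r{i}") for i in range(5)]
for a, b in [(0, 1), (1, 2), (2, 3), (3, 4)]:
    routers[a].peers.append(routers[b])
    routers[b].peers.append(routers[a])

affected = routers[0].purge_igp(scope="global")
assert len(affected) == 5      # one command on one device, network-wide effect
```

With `scope="local"` (the other two manufacturers' behavior), only `r0` would recompute, which is why the single-device guardrail normally suffices.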
Issues in the WAN were detected by monitoring, and alerts to the on-call engineers were generated within 5 minutes of the command being run. However, because of the unqualified changes to the SOP, the engineer making the changes was not informed. As a result, the same operation was performed again on the second Madrid router 33 minutes after the first change, creating two waves of connectivity issues throughout the network and impacting Microsoft customers.
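One common mitigation for this failure mode is a pre-change gate that refuses further maintenance while related alerts are firing. The sketch below is our own illustration of that idea, not Microsoft's tooling; all names and timestamps are made up to match the incident timeline.

```python
# Hedged sketch: a pre-change gate that blocks maintenance on a service
# with a recent alert. Service names and timestamps are illustrative.
import datetime as dt

active_alerts = [
    # WAN monitoring fired within 5 minutes of the 07:08 UTC change.
    {"service": "wan", "fired_at": dt.datetime(2023, 1, 25, 7, 13)},
]

def change_allowed(service: str, now: dt.datetime,
                   freeze_minutes: int = 60) -> bool:
    """Deny changes while an alert for the service is inside the freeze window."""
    for alert in active_alerts:
        recent = (now - alert["fired_at"]) < dt.timedelta(minutes=freeze_minutes)
        if alert["service"] == service and recent:
            return False
    return True

# The second Madrid change came 33 minutes after the first; a gate like this
# would have rejected it while WAN alerts were still active.
assert not change_allowed("wan", dt.datetime(2023, 1, 25, 7, 41))
```

The design choice here is to make the freeze automatic rather than relying on the on-call engineers reaching the person mid-change.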
This event caused widespread routing instability affecting Microsoft customers and their traffic flows: to/from the Internet, inter-region traffic, cross-premises traffic via ExpressRoute or VPN/vWAN, and US Gov Cloud services using commercial/public cloud services. During the time it took for routing to automatically converge, customer impact dynamically changed as the network completed its convergence. Some customers experienced intermittent connectivity, some saw connections timeout, and others experienced long latency or in some cases even a complete loss of connectivity.
How did we respond?
Our monitoring detected DNS and WAN issues starting at 07:11 UTC. We began investigating by reviewing all recent changes. By 08:20 UTC, as the automatic recovery was happening, we identified the problematic command that triggered the issue. Networking telemetry shows that nearly all network devices had recovered by 09:05 UTC, by which point most regions and services had recovered. Final networking equipment recovered by 09:25 UTC.
After routing in the WAN fully converged and recovered, there was still above normal packet loss in localized parts of the network. During this event, our automated systems for maintaining the health of the WAN were paused, including the systems for identifying and removing unhealthy devices and the traffic engineering system for optimizing the flow of data across the network. Due to the pause in these systems, some paths in the network were not fully optimized and, therefore, experienced increased packet loss from 09:25 UTC until those systems were manually restarted, restoring the WAN to optimal operating conditions. The recovery was ultimately completed at 12:43 UTC and explains why customers in different geographies experienced different recovery times. The long poles were traffic traversing our regions in India and parts of North America.
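Why a paused traffic-engineering system leaves residual packet loss can be shown with a two-link toy model: without rebalancing, flows stay pinned to an overloaded link even when a spare path exists. The numbers and link names below are invented for illustration.

```python
# Illustrative two-link model of a paused traffic-engineering (TE) system.
# Capacities and loads are made-up units, not real WAN figures.

links = {
    "primary":   {"capacity": 100, "load": 140},  # overloaded post-convergence
    "alternate": {"capacity": 100, "load": 20},   # spare capacity available
}

def loss_fraction(link):
    """Fraction of offered traffic dropped when load exceeds capacity."""
    over = max(0, link["load"] - link["capacity"])
    return over / link["load"]

def rebalance(links):
    """What a restarted TE system does: shift excess load onto spare paths."""
    over = max(0, links["primary"]["load"] - links["primary"]["capacity"])
    spare = links["alternate"]["capacity"] - links["alternate"]["load"]
    shift = min(over, spare)
    links["primary"]["load"] -= shift
    links["alternate"]["load"] += shift

assert loss_fraction(links["primary"]) > 0   # TE paused: sustained loss
rebalance(links)
assert loss_fraction(links["primary"]) == 0  # TE restarted: loss cleared
```

This mirrors the timeline above: routing had converged by 09:25 UTC, but some paths stayed unoptimized and lossy until the automated systems were manually restarted.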
How are we making incidents like this less likely and less impactful?
Two main factors contributed to the incident: first, the SOP was modified without being re-qualified, so the change was executed without the required re-testing and post-checks; and second, the purge command’s different default behavior on one router platform, global rather than local, had not been identified and added to the AAA block list.
As such, our repair items include the following: