On January 25th, between 07:15 and 11:00 UTC, access to our application was affected by Azure’s global WAN issues.
During this time, users may have experienced difficulties accessing our application due to high network latency and/or timeouts.
For more details about the incident and its resolution, see the post-incident review below, published by our cloud provider, Microsoft Azure.
What happened?
Between 07:08 UTC and 12:43 UTC on 25 January 2023, customers experienced issues with network connectivity, manifesting as long network latency and/or timeouts when attempting to connect to resources hosted in Azure regions, as well as other Microsoft services including Microsoft 365 and Power Platform. This incident also impacted Azure Government cloud services that were dependent on Azure public cloud. While most regions and services had recovered by 09:05 UTC, intermittent packet loss issues caused some customers to continue seeing connectivity issues due to two routers not being able to recover automatically. All issues were fully mitigated by 12:43 UTC.
What went wrong and why?
At 07:08 UTC a network engineer was performing an operational task to add network capacity to the global Wide Area Network (WAN) in Madrid. The task included steps to modify the IP address for each new router and to integrate it into the IGP (Interior Gateway Protocol, a protocol used for connecting all the routers within Microsoft’s WAN) and BGP (Border Gateway Protocol, a protocol used for distributing Internet routing information into Microsoft’s WAN) routing domains.
Microsoft’s standard operating procedure (SOP) for this type of operation follows a 4-step process that involves: [1] testing in our Open Network Emulator (ONE) environment for change validation; [2] testing in the lab environment; [3] a Safe-Fly Review documenting steps 1 and 2, as well as roll-out and roll-back plans; and [4] Safe-Deployment, which allows access to only one device at a time to limit impact.

In this instance, the SOP was changed prior to the scheduled event to address issues experienced in previous executions of the SOP. Critically, our process was not followed: the change was not re-tested and did not include proper post-checks per steps 1-4 above. This unqualified change led to a chain of events which culminated in the widespread impact of this incident.

The change added a command to purge the IGP database. However, the command operates differently depending on the router manufacturer. Routers from two of our manufacturers limit execution to the local router, while those from a third manufacturer execute across all IGP-joined routers, ordering them all to recompute their IGP topology databases. While Microsoft has a real-time Authentication, Authorization, and Accounting (AAA) system that must approve each command run on each router, including a list of blocked commands that have global impact, the command’s different global default action on the router platform being changed was not discovered during the high-impact command evaluation for this router model and, therefore, had not been added to the block list.
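The gap described above can be illustrated with a minimal sketch of a per-platform command block list. Everything here is hypothetical: the command name, platform labels, and function are illustrative stand-ins, not real vendor syntax or Microsoft's actual AAA system.

```python
# Hypothetical sketch of an AAA-style command gate with a per-platform
# block list. Command and platform names are illustrative only.

BLOCKED_COMMANDS = {
    "vendor_a": {"purge-igp-database"},  # command was evaluated and blocked
    "vendor_b": {"purge-igp-database"},
    # For the third platform, the command's global default scope was not
    # discovered during evaluation, so it was never added to the block list.
    "vendor_c": set(),
}

def aaa_approve(platform: str, command: str) -> bool:
    """Return True if the command may run on a router of this platform."""
    return command not in BLOCKED_COMMANDS.get(platform, set())

# The same command is denied on two platforms but permitted on the third,
# where its default action happens to be global.
assert not aaa_approve("vendor_a", "purge-igp-database")
assert aaa_approve("vendor_c", "purge-igp-database")
```

The point of the sketch is that a block list is only as good as the per-platform evaluation that populates it: identical command strings can carry very different blast radii.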
Azure Networking implements a defense-in-depth approach to maintenance operations which allows access to only one device at a time to ensure that any change has limited impact. In this instance, even though the engineer only had access to a single router, it was still connected to the rest of the Microsoft WAN via the IGP protocol. Therefore, the change resulted in two cascading events. First, routers within the Microsoft global network started recomputing IP connectivity throughout the entire internal network. Second, because of the first event, BGP routers started to readvertise and validate prefixes that we receive from the Internet. Due to the scale of the network, it took approximately 1 hour and 40 minutes for the network to restore connectivity to every prefix.
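A toy model can show why single-device access control does not contain a command whose scope is "all IGP-joined routers". This is not real router behavior, just a minimal graph-propagation sketch under that assumption.

```python
# Minimal illustrative model (not real router behavior): a command issued on
# one router propagates over IGP adjacencies and forces every reachable
# router to recompute its topology database.

class Router:
    def __init__(self, name):
        self.name = name
        self.peers = []          # IGP adjacencies
        self.recomputed = False

    def purge_igp(self, scope="global", seen=None):
        seen = seen if seen is not None else set()
        if self.name in seen:
            return seen
        seen.add(self.name)
        self.recomputed = True   # router rebuilds its IGP topology database
        if scope == "global":    # the third manufacturer's default behavior
            for p in self.peers:
                p.purge_igp(scope, seen)
        return seen

# Build a small WAN: the engineer has access only to r0,
# but IGP connects r0 to everything else.
routers = [Router(f"r{i}") for i in range(5)]
for a, b in [(0, 1), (1, 2), (2, 3), (3, 4)]:
    routers[a].peers.append(routers[b])
    routers[b].peers.append(routers[a])

affected = routers[0].purge_igp(scope="global")
assert len(affected) == 5      # one command on one device, network-wide effect
```

With `scope="local"` (the other two manufacturers' behavior), only `r0` would recompute, which is why the single-device guardrail normally suffices.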
Issues in the WAN were detected by monitoring, and alerts to the on-call engineers were generated within 5 minutes of the command being run. However, because of the unqualified changes to the SOP, the engineer making the changes was not informed. As a result, the same operation was performed again on the second Madrid router 33 minutes after the first change, creating two waves of connectivity issues throughout the network and impacting Microsoft customers.
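One common mitigation for this failure mode is a pre-change gate that refuses further maintenance while related alerts are firing. The sketch below is our own illustration of that idea, not Microsoft's tooling; all names and timestamps are made up to match the incident timeline.

```python
# Hedged sketch: a pre-change gate that blocks maintenance on a service
# with a recent alert. Service names and timestamps are illustrative.
import datetime as dt

active_alerts = [
    # WAN monitoring fired within 5 minutes of the 07:08 UTC change.
    {"service": "wan", "fired_at": dt.datetime(2023, 1, 25, 7, 13)},
]

def change_allowed(service: str, now: dt.datetime,
                   freeze_minutes: int = 60) -> bool:
    """Deny changes while an alert for the service is inside the freeze window."""
    for alert in active_alerts:
        recent = (now - alert["fired_at"]) < dt.timedelta(minutes=freeze_minutes)
        if alert["service"] == service and recent:
            return False
    return True

# The second Madrid change came 33 minutes after the first; a gate like this
# would have rejected it while WAN alerts were still active.
assert not change_allowed("wan", dt.datetime(2023, 1, 25, 7, 41))
```

The design choice here is to make the freeze automatic rather than relying on the on-call engineers reaching the person mid-change.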
This event caused widespread routing instability affecting Microsoft customers and their traffic flows: to/from the Internet, inter-region traffic, cross-premises traffic via ExpressRoute or VPN/vWAN, and US Gov Cloud services using commercial/public cloud services. During the time it took for routing to automatically converge, customer impact dynamically changed as the network completed its convergence. Some customers experienced intermittent connectivity, some saw connections timeout, and others experienced long latency or in some cases even a complete loss of connectivity.
How did we respond?
Our monitoring detected DNS and WAN issues starting at 07:11 UTC. We began investigating by reviewing all recent changes. By 08:20 UTC, as the automatic recovery was happening, we identified the problematic command that triggered the issue. Networking telemetry shows that nearly all network devices had recovered by 09:05 UTC, by which point most regions and services had recovered. Final networking equipment recovered by 09:25 UTC.
After routing in the WAN fully converged and recovered, there was still above normal packet loss in localized parts of the network. During this event, our automated systems for maintaining the health of the WAN were paused, including the systems for identifying and removing unhealthy devices and the traffic engineering system for optimizing the flow of data across the network. Due to the pause in these systems, some paths in the network were not fully optimized and, therefore, experienced increased packet loss from 09:25 UTC until those systems were manually restarted, restoring the WAN to optimal operating conditions. The recovery was ultimately completed at 12:43 UTC and explains why customers in different geographies experienced different recovery times. The long poles were traffic traversing our regions in India and parts of North America.
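Why a paused traffic-engineering system leaves residual packet loss can be shown with a two-link toy model: without rebalancing, flows stay pinned to an overloaded link even when a spare path exists. The numbers and link names below are invented for illustration.

```python
# Illustrative two-link model of a paused traffic-engineering (TE) system.
# Capacities and loads are made-up units, not real WAN figures.

links = {
    "primary":   {"capacity": 100, "load": 140},  # overloaded post-convergence
    "alternate": {"capacity": 100, "load": 20},   # spare capacity available
}

def loss_fraction(link):
    """Fraction of offered traffic dropped when load exceeds capacity."""
    over = max(0, link["load"] - link["capacity"])
    return over / link["load"]

def rebalance(links):
    """What a restarted TE system does: shift excess load onto spare paths."""
    over = max(0, links["primary"]["load"] - links["primary"]["capacity"])
    spare = links["alternate"]["capacity"] - links["alternate"]["load"]
    shift = min(over, spare)
    links["primary"]["load"] -= shift
    links["alternate"]["load"] += shift

assert loss_fraction(links["primary"]) > 0   # TE paused: sustained loss
rebalance(links)
assert loss_fraction(links["primary"]) == 0  # TE restarted: loss cleared
```

This mirrors the timeline above: routing had converged by 09:25 UTC, but some paths stayed unoptimized and lossy until the automated systems were manually restarted.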
How are we making incidents like this less likely and less impactful?
Two main factors contributed to the incident: first, the SOP was modified without being re-qualified, so the change was executed without the required re-testing and post-checks; and second, the purge command’s different default behavior on one router platform, global rather than local, had not been identified and added to the AAA block list.
As such, our repair items include the following: