On November 26th, during our scheduled maintenance (Infrastructure upgrades in Australia, Canada, Germany, Europe, and US), we applied security updates, including a system kernel upgrade, across our infrastructure.
On November 28th at around 01:10 UTC, our on-call engineers were notified about intermittent network connection issues on one of our instances. This did not result in downtime because the system recovered on its own.
Around 05:30 UTC the intermittent network connection issue manifested itself on another instance and resulted in a brief downtime that was communicated via an incident on our Statuspage at 06:00 UTC (Intermittent network connection issues in Europe and Canada).
We applied mitigation steps and saw full recovery at around 06:21 UTC.
Once we found the root cause, we started another mitigation at 10:55 UTC by downgrading the affected Ubuntu kernel from version 5.4.0-1095-azure to 5.4.0-1094-azure. At 11:30 UTC we received confirmation from Azure support that the kernel downgrade was the correct solution. Throughout the day, we executed mitigation steps across our infrastructure, which resulted in a brief service disruption.
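As an illustration only, a check for whether a host is running the affected kernel release could look like the sketch below. The two version strings come from this report. The function name `is_affected` and the constant names are hypothetical, and this does not describe the tooling actually used during the incident.

```python
import platform

# Kernel releases named in this report.
AFFECTED_RELEASE = "5.4.0-1095-azure"
DOWNGRADE_TARGET = "5.4.0-1094-azure"


def is_affected(release: str) -> bool:
    """Return True if the given kernel release string is the affected one."""
    return release.strip() == AFFECTED_RELEASE


if __name__ == "__main__":
    # platform.release() reports the running kernel release on Linux.
    current = platform.release()
    if is_affected(current):
        print(f"{current}: affected, downgrade target is {DOWNGRADE_TARGET}")
    else:
        print(f"{current}: not the affected release")
```

The check is an exact string match, so it flags only the single release identified in this incident and makes no claim about other kernel versions.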
Our root cause analysis led us to an Ubuntu bug report that exactly described the problem we were facing.
Due to the kernel bug, containerd's process communication ran into timeouts that caused unexpected behavior, such as intermittent network connection issues or unavailability of the instances themselves.
Going forward, we will roll out similar planned upgrades to a subset of our infrastructure first, which will reduce the likelihood of impact for our customers.
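A minimal sketch of what rolling out to a subset first can look like, purely illustrative: hosts are split into a small first wave and a remainder, and the rollout stops if any host in a wave fails its health check. The names `plan_waves`, `staged_rollout`, `upgrade`, `is_healthy`, and `canary_fraction` are hypothetical, and the 10% default is an assumption, not a figure from this report.

```python
from typing import Callable, List, Sequence


def plan_waves(hosts: Sequence[str], canary_fraction: float = 0.1) -> List[List[str]]:
    """Split hosts into a small first wave followed by the remainder."""
    if not hosts:
        return []
    # Always include at least one host in the first wave.
    canary_count = max(1, int(len(hosts) * canary_fraction))
    waves = [list(hosts[:canary_count])]
    if canary_count < len(hosts):
        waves.append(list(hosts[canary_count:]))
    return waves


def staged_rollout(
    hosts: Sequence[str],
    upgrade: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    canary_fraction: float = 0.1,
) -> List[str]:
    """Upgrade wave by wave and return the hosts that were upgraded.

    The rollout stops before the next wave if any host in the
    current wave is unhealthy after its upgrade.
    """
    upgraded: List[str] = []
    for wave in plan_waves(hosts, canary_fraction):
        for host in wave:
            upgrade(host)
            upgraded.append(host)
        if not all(is_healthy(host) for host in wave):
            break
    return upgraded
```

In practice, the health check would likely need an observation window between waves, since the issue in this incident surfaced about two days after the upgrade was applied.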
We sincerely apologize for any inconvenience this incident may have caused and appreciate your patience.