On August 26th, between 08:45 UTC and 13:32 UTC, the entire LeanIX US region was unusable because a central component became unavailable, caused by a capacity outage at our hyperscaler.
No customer data was lost, and the region was immediately usable again after the central component came back online.
We immediately engaged the hyperscaler's support team to help us identify the root cause and work with us on a resolution. Multiple workstreams were executed in parallel to restore the central component, including our disaster recovery process. We stopped the disaster recovery process once the central component came back online, after the hyperscaler's on-call engineer manually unblocked the stuck workflow.
During a routine scaling operation, executed in a standard maintenance window to provide more compute capacity to the central component, we were hit by the capacity outage at our hyperscaler.
The scaling operation requires a reboot to apply the new configuration. Due to the capacity outage, the scaling workflow got stuck mid-operation and left the central component turned off, which rendered the LeanIX US region unusable.
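To illustrate the failure mode, here is a minimal, hypothetical sketch (not our actual automation; the component interface, function names, and timeout values are assumptions): a scale-up workflow that shuts a component down, requests larger capacity, and then waits for the allocation will leave the component offline indefinitely if the capacity request never completes. A deadline with a rollback to the previous configuration limits the blast radius of such a capacity outage.

```python
# Hypothetical sketch of a scaling workflow with a capacity deadline and
# rollback. The `component` interface, sizes, and timeouts are illustrative
# assumptions, not LeanIX's actual automation or the hyperscaler's API.
import time


class CapacityError(Exception):
    """Raised when the requested capacity cannot be allocated in time."""


def scale_with_rollback(component, new_size, old_size,
                        deadline_s=600, poll_s=15):
    """Resize a component; roll back if capacity never becomes available."""
    component.shutdown()                      # the resize requires a reboot
    request = component.request_resize(new_size)
    start = time.monotonic()
    try:
        # Poll the resize request instead of waiting indefinitely: this is
        # where an unguarded workflow gets stuck during a capacity outage,
        # leaving the component in a turned-off state.
        while not request.fulfilled():
            if time.monotonic() - start > deadline_s:
                raise CapacityError("no capacity allocated before deadline")
            time.sleep(poll_s)
    except CapacityError:
        # Roll back to the previous size and restart, so the component does
        # not stay offline while the capacity outage is being resolved.
        component.request_resize(old_size)
        component.start()
        raise
    component.start()
```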
We identified several areas for improvement: