Service Disruption in US

Incident Report for SAP LeanIX

Postmortem

Incident Description

On August 26th, between 08:45 UTC and 13:32 UTC, the entire LeanIX US region was unusable due to the unavailability of a central component, caused by a capacity outage at our hyperscaler.

No customer data was lost, and the region was immediately usable again after the central component came back online.

Incident Resolution

We immediately engaged the hyperscaler's support team to help identify the root cause and work with us on a resolution. Multiple workstreams were executed in parallel to restore the central component, including our disaster recovery process. We stopped the disaster recovery process when the central component came back online after the hyperscaler's on-call engineer manually unblocked the stuck workflow.

Root Cause Analysis

During a routine scaling operation, executed in a standard maintenance window to provide more compute capacity to a central component, we were hit by a capacity outage at our hyperscaler.

The scaling operation requires a reboot to apply the new configuration. Due to the capacity outage, the workflow got stuck and left the central component in a powered-off state, which rendered the LeanIX US region unusable.

Preventative Measures

We identified several areas for improvement:

  • Reducing the risk of an unusable region by introducing an additional caching layer in front of the central component
  • Improving the disaster recovery process to speed up the resolution of similar incidents
  • Running capacity checks with our hyperscaler in advance of planned scaling operations
Posted Sep 04, 2025 - 10:37 UTC

Resolved

This incident has been resolved.
Posted Aug 26, 2025 - 14:00 UTC

Monitoring

Our cloud provider has restored the database systems, and we are monitoring the results. The system should now be back online and accessible again.
Posted Aug 26, 2025 - 13:32 UTC

Update

Our cloud provider is still working on a mitigation to restore database systems that are currently unavailable due to capacity constraints.
We are in close contact with the cloud provider and will update the status as information arrives.

We will send an additional update in 4 hours.
Posted Aug 26, 2025 - 11:55 UTC

Identified

The issue has been acknowledged by our cloud provider, and they are working on a mitigation. The cloud provider has confirmed capacity issues.

We will send an additional update in 60 minutes.
Posted Aug 26, 2025 - 10:18 UTC

Investigating

We are currently experiencing a service disruption in the US region. Our team is working with our cloud provider to identify the root cause and implement a solution.

We will send an additional update in 60 minutes.
Posted Aug 26, 2025 - 09:07 UTC
This incident affected: US Instances.