Degraded performance in EAM
Incident Report for SAP LeanIX
Postmortem

Incident Description

On January 21st, between 16:06 and 17:19 UTC, one of our database management systems (DBMS) in the westeurope region experienced multiple failovers due to high load. The load was caused by an event replay in our event-carried state transfer system: several services executed the replay process simultaneously, inadvertently placing excessive pressure on the DBMS. The repeated failovers led to a brief downtime of the DBMS.

The incident caused degraded performance and temporary service disruptions for our customers in the following business capabilities:

  • diagrams
  • storage
  • todos
  • transformations
  • automations

Incident Resolution

To address the issue, we scaled the affected DBMS up to a higher-capacity SKU, providing additional resources to handle the increased load during event replay scenarios. This adjustment immediately stabilized the system and prevented further failovers.

Root Cause Analysis

An event replay reprocesses historical events from an event log to rebuild the current state of a system. It is commonly used for synchronization in asynchronous systems. In this particular case, the event replay was necessary to enable new features in our product. However, we did not anticipate the capacity requirements ahead of time, as previous replay runs for partitions in the same region had not caused similar levels of load. The replay introduced unexpected pressure due to differences in partition size, leading to this incident.
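To make the mechanism concrete, here is a minimal sketch of what an event replay does in an event-carried state transfer setup. All names (`Event`, `replay`, the entity payloads) are illustrative, not our actual service code: the point is that the full history is re-read and folded into a current-state projection, which is why a replay can generate far more read and write load than steady-state operation.

```python
from dataclasses import dataclass


@dataclass
class Event:
    """One entry in the event log (illustrative shape)."""
    entity_id: str
    payload: dict


def replay(event_log):
    """Rebuild current state by reprocessing every historical event in order."""
    state = {}
    for event in event_log:
        # Each event merges into the projected state of its entity.
        # A replay touches the whole log, so load scales with partition size.
        state.setdefault(event.entity_id, {}).update(event.payload)
    return state


log = [
    Event("todo-1", {"title": "Review diagram"}),
    Event("todo-1", {"status": "done"}),
    Event("todo-2", {"title": "Run transformation"}),
]
current = replay(log)
```

Because the cost of `replay` grows with the number of historical events, a partition with an unexpectedly large log produces proportionally more pressure on the backing DBMS, which is what happened here.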

Preventative Measures

To prevent similar incidents in the future, we aim to improve in the following areas:

  1. Capacity Management: Better forecasting to ensure the DBMS can handle increased loads.
  2. Visibility: Enhanced monitoring to detect potential issues earlier.
  3. Service Distribution: More even distribution of load across DBMS instances.
  4. Replay Orchestration: Smarter scheduling to avoid concurrent high-load events.

These actions will help us build a more resilient system and ensure reliable performance as we continue to scale.

Posted Jan 24, 2025 - 14:03 UTC

Resolved
This incident has been resolved.
Posted Jan 21, 2025 - 19:05 UTC
Monitoring
Users may experience degraded performance in EAM. Our team is working to identify the root cause and implement a solution.

We will send an additional update in 60 minutes.
Posted Jan 21, 2025 - 17:21 UTC
This incident affected: EU Instances (EAM).