Diagram editor unavailable in EU region
Incident Report for SAP LeanIX
Postmortem

Incident Description

On 4 September 2024, the Diagram back-end service was unavailable from 15:09 UTC to 15:26 UTC. During this time, the diagram editor could not be loaded. Diagrams that were already open could still be edited and saved, but no new images could be inserted into them and existing images may not have loaded correctly. No data was lost during this incident.

Incident Resolution

A manual restart of the Diagram back-end service resolved the downtime.

Root Cause Analysis

The root cause was a token-refresh call made at 15:02:59 UTC from the Diagram service in the West Europe region to an external system. The call never received a response and never timed out. This token refresh is intentionally wrapped in a database transaction that locks the corresponding database table row. Because the request neither completed nor timed out, this lock was never released.
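
The failure mode described above can be illustrated with a minimal simulation. This is only a sketch under stated assumptions, not the Diagram service's code: `row_lock` stands in for the database row lock, `call_token_endpoint` is a hypothetical external call that never answers on its own, and `refresh_token` is a hypothetical wrapper that bounds how long the lock can be held.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

row_lock = threading.Lock()            # stand-in for the database row lock (assumption)
release_hung_call = threading.Event()  # lets this demo unblock the simulated hung call


def call_token_endpoint():
    """Hypothetical external call that does not answer until released."""
    release_hung_call.wait()
    return "new-token"


def refresh_token(timeout_s):
    """Hold the row lock no longer than the external call is allowed to take."""
    executor = ThreadPoolExecutor(max_workers=1)
    with row_lock:  # released on every exit path, including the timeout path
        future = executor.submit(call_token_endpoint)
        try:
            return future.result(timeout=timeout_s)
        except FutureTimeout:
            return None  # caller treats this as "refresh failed, roll back"
        finally:
            executor.shutdown(wait=False)


result = refresh_token(timeout_s=0.1)
release_hung_call.set()  # let the simulated hung call finish so the demo can exit
print(result, row_lock.locked())  # prints "None False"
```

Without the `timeout` argument, `future.result()` would block indefinitely and the lock would stay held, which mirrors the situation in this incident.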

At 15:05:55 UTC, an unrelated job started in the same region and requested a lock on the whole of the same database table. This lock request never completed, but while it was pending it already blocked access to the table for all other processes and requests.

At 15:09:31 UTC, the Diagram service's internal database connection pool was exhausted, and all subsequent requests to the Diagram service received a 503 Service Unavailable error.
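
How a full connection pool turns into 503 responses can be sketched as follows. The `ConnectionPool` class, the `handle_request` function, and the pool size are hypothetical and chosen only for illustration; the actual pool implementation used by the Diagram service is not described in this report.

```python
import threading


class ConnectionPool:
    """Hypothetical fixed-size pool; each slot represents one database connection."""

    def __init__(self, size):
        self._slots = threading.BoundedSemaphore(size)

    def acquire(self, timeout_s):
        # Returns False if no connection became free within timeout_s.
        return self._slots.acquire(timeout=timeout_s)

    def release(self):
        self._slots.release()


def handle_request(pool, acquire_timeout_s=0.05):
    """Return an HTTP status code: 503 if no connection is available."""
    if not pool.acquire(acquire_timeout_s):
        return 503
    try:
        return 200  # the real handler would run its query here
    finally:
        pool.release()


pool = ConnectionPool(size=2)
# Simulate two requests whose connections are stuck waiting for table access.
pool.acquire(timeout_s=0.05)
pool.acquire(timeout_s=0.05)
status_when_full = handle_request(pool)
pool.release()  # one stuck request finally ends
status_after_release = handle_request(pool)
print(status_when_full, status_after_release)  # prints "503 200"
```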

Until recently, a complete outage of the Diagram service would not have prevented the diagram editor from loading in the front-end, or users from editing and saving diagrams, because the service provides only some of the diagram features (such as saving and storing images included in diagrams). Due to a recent change in the Diagram front-end code, however, the failed Diagram service requests now prevented the diagram editor from loading at all.
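
The intended degradation behaviour can be sketched in a language-neutral way. The example below is written in Python purely for illustration and is not the front-end code; `load_editor`, `ServiceUnavailable`, and the feature flags are hypothetical names. The point is that a failure of the optional back-end call disables image features instead of aborting the whole editor load.

```python
class ServiceUnavailable(Exception):
    """Hypothetical error raised when the back-end answers with a 503."""


def load_editor(fetch_image_features):
    """Load the editor; treat the image back-end as optional."""
    editor = {"can_edit": True, "can_save": True, "images_enabled": False}
    try:
        fetch_image_features()
        editor["images_enabled"] = True
    except ServiceUnavailable:
        # Degrade: keep the editor usable, only image features are switched off.
        pass
    return editor


def failing_backend():
    raise ServiceUnavailable("503 Service Unavailable")


degraded = load_editor(failing_backend)
healthy = load_editor(lambda: None)
print(degraded["can_edit"], degraded["images_enabled"])  # prints "True False"
print(healthy["can_edit"], healthy["images_enabled"])    # prints "True True"
```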

Preventative Measures

We have undertaken and planned the following measures to prevent such an incident from happening again:

  • Ensure that the diagram editor in the front-end still loads and is usable, albeit with a slightly reduced feature set, when the Diagram service is unavailable. (Completed)
  • Add timeouts to all calls to external systems to prevent long-running database transactions and locks. (Completed)
  • Ensure that all pending database transactions are completed/rolled back when the Diagram service containers are shut down during deployments.
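
The last measure, rolling back pending transactions on shutdown, might look roughly like the sketch below. It uses an in-memory SQLite database only so that the example is self-contained; the database actually used by the Diagram service is not named in this report, and `TransactionRegistry` is a hypothetical helper. In a real service, `rollback_all` would be called from the container's shutdown hook, for example a SIGTERM handler.

```python
import sqlite3


class TransactionRegistry:
    """Hypothetical helper that tracks connections with possibly open transactions."""

    def __init__(self):
        self._open = set()

    def begin(self, conn):
        self._open.add(conn)

    def finish(self, conn):
        self._open.discard(conn)

    def rollback_all(self):
        """Roll back every transaction still open; return how many were rolled back."""
        rolled_back = 0
        for conn in list(self._open):
            if conn.in_transaction:
                conn.rollback()  # also releases any locks the transaction held
                rolled_back += 1
            self._open.discard(conn)
        return rolled_back


registry = TransactionRegistry()
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tokens (id INTEGER PRIMARY KEY, value TEXT)")

registry.begin(conn)
conn.execute("INSERT INTO tokens (value) VALUES ('pending')")  # opens a transaction

# Simulated shutdown: nothing was committed, so the pending insert is undone.
rolled_back = registry.rollback_all()
remaining = conn.execute("SELECT COUNT(*) FROM tokens").fetchone()[0]
print(rolled_back, remaining)  # prints "1 0"
```
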
Posted Sep 12, 2024 - 10:38 UTC

Resolved
The diagram editor and the corresponding back-end service were unavailable from 15:09 UTC to 15:26 UTC.
Posted Sep 04, 2024 - 13:00 UTC