On 4 September 2024, the Diagram back-end service was unavailable from 15:09 UTC to 15:26 UTC. During this time, the diagram editor could not be loaded. Diagrams that were already open could still be edited and saved, but no new images could be inserted, and existing images in diagrams may not have loaded correctly. No data was lost during this incident.
A manual restart of the Diagram back-end service resolved the downtime.
The root cause was identified as a token-refresh call made at 15:02:59 UTC from the Diagram service in the West Europe region to an external system. The call never received a response, yet also never timed out. This token refresh is intentionally wrapped in a database transaction that locks the corresponding database table row. Because the request neither completed nor timed out, this lock was never released.
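The failure mode above can be illustrated with a minimal Python sketch. It is not the Diagram service's actual code: `row_lock` merely stands in for the database row lock, and `refresh_token`, `call_external` and `timeout_s` are hypothetical names chosen for illustration. The point it shows is that bounding the wait on the external call guarantees the lock is released even when the external system never answers.

```python
import threading

# Stand-in for the database row lock held by the transaction (assumption).
row_lock = threading.Lock()


def refresh_token(call_external, timeout_s):
    """Run call_external while holding row_lock, but wait at most timeout_s."""
    outcome = {}

    def run():
        # Capture either the result or the error so the caller can inspect it.
        try:
            outcome["token"] = call_external()
        except Exception as exc:
            outcome["error"] = exc

    with row_lock:
        worker = threading.Thread(target=run, daemon=True)
        worker.start()
        worker.join(timeout_s)  # bounded wait instead of waiting forever
        if worker.is_alive():
            # Leaving the "with" block releases the lock despite the hang.
            raise TimeoutError("token refresh exceeded its deadline")
        if "error" in outcome:
            raise outcome["error"]
        return outcome["token"]
```

Without the bounded `join`, a call that never returns would keep `row_lock` held indefinitely, which corresponds to the situation described above.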
At 15:05:55 UTC, an unrelated job started in the same region that required a lock on the entire database table. Because the row lock was still held, this lock request never completed, but while pending it already blocked access to the table for all other processes and requests.
At 15:09:31 UTC, the Diagram service's internal database connection pool was full, and all subsequent requests to the Diagram service received a 503 Service Unavailable response.
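How a full connection pool turns into 503 responses can be sketched as follows. This is a simplified model under stated assumptions, not the service's implementation: `ConnectionPool`, `handle_request` and the timeout value are hypothetical, and a semaphore stands in for real database connections.

```python
import threading


class ConnectionPool:
    """Toy pool: a bounded semaphore stands in for real connections."""

    def __init__(self, size):
        self._slots = threading.BoundedSemaphore(size)

    def acquire(self, timeout_s):
        # Returns False if no connection became free within timeout_s.
        return self._slots.acquire(timeout=timeout_s)

    def release(self):
        self._slots.release()


def handle_request(pool, acquire_timeout_s=0.01):
    """Return an HTTP status code for one request (hypothetical handler)."""
    if not pool.acquire(acquire_timeout_s):
        return 503  # pool exhausted: every connection is stuck on a query
    try:
        return 200  # the real handler would run its database query here
    finally:
        pool.release()
```

In this model, once every slot is taken by a request waiting on the blocked table, each further request fails fast with 503 until a slot is released.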
Until recently, a complete outage of the Diagram service would not have prevented the diagram editor in the front-end from loading, nor users from editing and saving diagrams, because the service provides only some diagram features (such as saving and storing images included in diagrams). However, due to a recent change in the Diagram front-end code, the failed Diagram service requests now prevented the diagram editor from loading at all.
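The earlier, more tolerant behaviour corresponds to a graceful-degradation pattern, sketched below in Python for illustration only. The actual front-end code is not shown in this report; `load_editor`, `fetch_image_features` and the state keys are hypothetical names.

```python
def load_editor(fetch_image_features):
    """Load the editor; optional image features degrade instead of blocking."""
    state = {"editor_loaded": True, "images_enabled": False}
    try:
        fetch_image_features()  # call to the back-end service (assumption)
        state["images_enabled"] = True
    except Exception:
        # Service unavailable: keep editing and saving available and
        # leave only the image features disabled.
        pass
    return state
```

With this structure, a failing service call reduces functionality but does not prevent the editor from loading.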
We have undertaken and planned the following measures to prevent such an incident from happening again: