On January 21st, between 16:06 and 17:19 UTC, one of our database management systems (DBMS) in westeurope experienced multiple failovers due to high load. The load was caused by an event replay of our event-carried state transfer system. The repeated failovers led to a brief downtime of the DBMS. Several services simultaneously executed the replay process, inadvertently placing excessive pressure on the DBMS.
The incident caused degraded performance and temporary service disruptions for our customers in the following business capabilities:
diagrams
storage
todos
transformations
automations
To address the issue, we increased the SKU of the affected DBMS to a higher capacity tier, providing additional resources to handle the increased load during event replay scenarios. This adjustment immediately stabilized the system and prevented further failovers.
An event replay reprocesses historical events from an event log to rebuild the current state of a system. It is commonly used for synchronization in asynchronous systems. In this case, the event replay was necessary to enable new features in our product. We did not anticipate its capacity requirements, because previous replay runs for partitions in the same region had not caused similar levels of load. This time, differences in partition size produced more load than the DBMS could absorb, which led to the incident.
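The replay mechanism described above can be illustrated with a minimal sketch. It is not our production code: the `Event` type, the `upsert`/`delete` event kinds, and the in-memory `state` dictionary are assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass
class Event:
    # Hypothetical event shape, for illustration only.
    entity_id: str
    kind: str      # "upsert" or "delete"
    payload: dict  # the state carried by the event


def replay(event_log):
    """Rebuild the current state by applying every event in log order."""
    state = {}
    for event in event_log:
        if event.kind == "upsert":
            # Merge the carried state into whatever is already known.
            current = state.get(event.entity_id, {})
            state[event.entity_id] = {**current, **event.payload}
        elif event.kind == "delete":
            state.pop(event.entity_id, None)
        else:
            raise ValueError(f"unknown event kind: {event.kind}")
    return state


log = [
    Event("todo-1", "upsert", {"title": "write report", "done": False}),
    Event("todo-2", "upsert", {"title": "review PR", "done": False}),
    Event("todo-1", "upsert", {"done": True}),
    Event("todo-2", "delete", {}),
]

state = replay(log)
print(state)
```

In this sketch every event results in one state change. If each of those changes is a write against a database, the load of a replay grows with the number of events in the replayed partition, which is why partition size matters for capacity planning.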
To prevent similar incidents in the future, we aim to improve in the following areas:
These actions will help us build a more resilient system and ensure reliable performance as we continue to scale.