Beginning at 23:36 UTC on November 16th, customers on au.leanix.net faced an unexpected disruption in service. We successfully mitigated the incident by 00:05 UTC on November 17th.
In our post-mortem analysis of this event, we found that our alerting system failed to trigger the necessary alerts and notifications about the emerging operational irregularities. As a result, these anomalies in our services went unnoticed.
This incident interrupted various pre-flight urgency levels, which delayed detection of the issue. After affected customers reported issues through our Zendesk support system, normal operations were re-established.
We sincerely regret any inconvenience caused by this service disruption and assure you of our commitment to preventing similar issues in the future.
Our team is currently conducting a root cause analysis to identify what led to the underlying technical outage during that time frame. We will update this status page with further details.
For the alerting system, we have identified the root cause and already rolled out a fix. The APM system provider discontinued a framework version to which we had recently upgraded, but there has not yet been any customer-facing communication about this change. As a result, our alerts stopped triggering.
To ensure this type of outage does not recur, we are going to enhance our monitoring process and add redundancy to our alerting system.