On September 19, between 01:46 PM UTC and 02:41 PM UTC, customers experienced failing requests from out-of-the-box Integrations.
As part of our ongoing work to improve the stability mechanisms and standards of our services, we updated our GraphQL API rate-limiting mechanism.
During this timeframe, Integration runs failed because a bug in the changed code caused our GraphQL API to wrongly reject their requests.
Our monitoring systems alerted the team immediately after the release went live on production and calls from our Integration services were being blocked. The team created and released a fix, which resolved the situation.
After the mitigation, we analyzed in depth why our CI/CD pipeline did not fail when these changes were introduced, and we added test cases to prevent this behavior from breaking again.
Even though our monitoring system detected the issue immediately, we are going to further improve our alerting and checks so that we can recover faster from such a scenario.