On Friday, Nov. 21, we experienced a major outage of our intake and processing infrastructure.
The outage lasted from 19:42 to 21:45 (all times UTC): we were unable to receive or process errors and releases between 19:42 and 20:53, and lost about 50% of all tracked errors and releases between 20:53 and 21:45.
- 19:42 outage begins
- 19:45 alerts begin to trigger
- 19:55 the problem is identified as the RabbitMQ cluster being unresponsive
- 20:17 it becomes clear that we have to re-provision the cluster
- 20:53 new cluster is provisioned and intake re-configured to write to new cluster
- 20:55 monitoring reports error and release tracking functioning
- 21:00 alert systems still report elevated error levels
- 21:30 problem is identified as faulty configuration
- 21:45 systems fully operational
Our end-to-end monitoring shows the complete outage from 19:42 to 20:53, and erratic behaviour until 21:45.
The root cause of the outage was the complete loss of our Multi-AZ RabbitMQ cluster, which is operated by a third-party service. The entire cluster was terminated due to a bug in a provisioning script.
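We don't know the internals of the provider's provisioning script, but the failure mode suggests a missing safety check. A minimal sketch of such a guard, with entirely hypothetical names, might look like this:

```python
# Hypothetical guard for a cluster-termination step in a provisioning
# script: refuse to proceed when the requested set of nodes would wipe
# out the whole cluster. All names here are illustrative; this is not
# the provider's actual API.

def safe_to_terminate(cluster_nodes, nodes_to_terminate, min_remaining=1):
    """Return True only if terminating the given nodes leaves at least
    `min_remaining` nodes of the cluster running."""
    remaining = set(cluster_nodes) - set(nodes_to_terminate)
    return len(remaining) >= min_remaining


# A termination routine would check the guard before acting:
# if not safe_to_terminate(cluster, doomed):
#     raise RuntimeError("refusing to terminate the entire cluster")
```

A check like this turns a destructive bug into a loud, recoverable error.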
Resolution and Recovery
After identifying the root problem, we immediately got in touch with the third-party provider. It quickly became clear that our issue was part of a larger outage. A first assessment painted a bleak picture: restoring our cluster could take hours or even days. At this point, we decided to provision a new cluster, thereby accepting the complete loss of the original one.
As soon as the third-party provider re-enabled provisioning, we set up a new cluster, which became operational at 20:40. After re-deploying the intake and our jobs servers with the new configuration, error and release tracking became operational again at 20:53.
At this point, we started monitoring the queues, as well as logs from the intake. We soon realized that we were not out of the woods quite yet. The intake still dropped about 50% of all requests. After further analysis, we discovered a faulty configuration, which resulted in some intake machines still trying to write to the old RabbitMQ cluster. After correcting this, the intake immediately became fully operational.
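The lingering 50% drop came down to some intake machines still pointing at the old cluster. A simple consistency check across hosts would have caught this faster; here is a minimal sketch, with made-up host names and URLs:

```python
# Illustrative consistency check: flag any intake host whose configured
# AMQP URL does not target the expected (new) RabbitMQ broker. Host
# names and URLs below are hypothetical examples.
from urllib.parse import urlparse

def misconfigured_hosts(host_configs, expected_broker):
    """Return the hosts whose AMQP URL targets a different broker
    hostname than `expected_broker`."""
    return [host for host, url in host_configs.items()
            if urlparse(url).hostname != expected_broker]
```

Run against the fleet's configs right after a cut-over, this immediately surfaces stragglers still writing to the old cluster.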
Unfortunately, it was not possible to restore errors and releases that were tracked during the outage. We are very sorry for this!
We are still analyzing the situation and discussing how we can prevent outages of this sort in the future. We will share more when we have implemented a solution.
As this was our largest outage in a long time, there are a lot of lessons we can take away from it. Foremost, we have to communicate faster about ongoing incidents. It took us almost an hour between noticing the outage and updating our status page. This delay was in part due to the (embarrassing) fact that the on-call developer did not have access credentials to the status page. This has been fixed, and all developers now have access rights to update the status page.
Furthermore, the continued problems after re-provisioning the RabbitMQ cluster were unfortunate. The main take-away message for us here is that the crisis is not over until all systems are back in the green.
On a more positive note, we learned how useful it is to document every step we take during an incident in our ops room in Slack. Not only does this keep teammates up to date, it also serves as a permanent record for retracing everything once the dust settles.