TL;DR
We had a few brief outages over the past couple of weeks. This post-mortem details what happened, how we responded, and what we've done to prevent similar incidents in the future.
What happened
Due to the nature of configuration management and automation, our hosted package repositories regularly get hit with traffic spikes 200-300% above average. A few weeks ago, however, one of these spikes started reaching 600% above average, making the site unavailable for the duration of the spike, usually 5-10 minutes.
Initial Response
When the site became unreachable, our external monitoring system kicked in and paged the primary on-call team. After analyzing graphs and logs, we determined that a traffic spike was the culprit. Capacity was shifted from the worker tier to the front-end tier to help absorb the impact, and after the traffic spike subsided, the site returned to normal.
Remediation attempt #1: Increase max connections
Looking at the graphs afterwards, it became evident that we were hitting a connection limit in our load balancer. We raised this limit and deployed the change.
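For illustration only (the exact directive depends on the load balancer in use), raising this kind of limit might look like the following in an HAProxy-style configuration:

    # Hypothetical snippet: raise the global and per-frontend connection caps
    # so spike traffic queues at the proxy instead of being refused outright.
    global
        maxconn 20000              # total concurrent connections the process accepts

    frontend www
        bind *:80
        maxconn 18000              # per-frontend cap, kept below the global limit
        default_backend app_servers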
Twenty-three hours later, the primary on-call team was preemptively alerted ahead of the expected spike, and, as predicted, the traffic spike arrived at the same time. With the connection limit raised, the bottleneck simply moved down into the front-end tier, and the site was again unavailable until the spike subsided.
Remediation attempt #2: Add Capacity
The next morning, it was decided that more front-end capacity was needed, so a new application server was provisioned and brought online that evening.
As usual, the traffic spike showed up at its due time. However, the site still became unavailable, even with the increased capacity. We then noticed that the load on the new application server was strangely low, despite its significantly increased weight in the load balancer. Sampling a random application process with strace revealed that it was mostly stuck waiting on blocked connect() calls to our datacenter's DNS server.
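For reference, that kind of sampling looks roughly like this (PID being the process ID of one of the application workers):

    # Attach to a running application process and trace network syscalls;
    # -f follows forked children, -T shows how long each call blocked.
    strace -f -T -e trace=network -p <PID>

In our case, the time spent blocked in connect() calls to the resolver was what stood out.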
Remediation attempt #3: Cache DNS Lookups
It seemed, then, that we were being rate limited by our datacenter's DNS server during these traffic spikes, presumably to keep the service responsive for its other customers. So, before the spike hit that night, we deployed a local DNS lookup cache to all of the application servers. Finally, the site was able to stay online for the full duration of one of these spikes.
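As a sketch of what a setup like this can look like (the specific caching software is beside the point), a minimal dnsmasq configuration that answers repeat lookups locally and forwards misses to the datacenter resolver might be:

    # /etc/dnsmasq.conf -- hypothetical local caching resolver on an app server
    listen-address=127.0.0.1     # only answer lookups from this host
    bind-interfaces
    no-resolv                    # ignore /etc/resolv.conf for upstream servers
    server=10.0.0.2              # placeholder address for the datacenter's DNS server
    cache-size=10000             # number of names to keep cached

With /etc/resolv.conf on each application server pointed at 127.0.0.1, repeat lookups are served from memory and only cache misses reach the upstream resolver.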
Outage definitions and Status Page updates
During these incidents, the site was unavailable to most users for anywhere from 5 to 20 minutes. Moreover, the status page was not updated during these windows. From now on, we'll update the status page whenever an outage lasts more than 10 minutes.
Conclusion
As an infrastructure company, any kind of outage is unacceptable to us. We sincerely apologize for any inconvenience this may have caused. We hope our transparency inspires trust in our commitment to providing an excellent package hosting experience.