This past Thursday (November 5, 2015), our primary database server showed elevated error rates and was taken offline. We determined that the elevated error rates were caused by some of the RAM on that machine going bad.
The packagecloud website was taken down for approximately 35 minutes, during which time failover to our secondary database server took place.
When the site was restored, customers experienced some service degradation in the form of increased latency and response times until our primary database server could be brought back online.
Our primary database server was available again approximately 3 hours after access to the site had been restored. Once the primary database server became available, response times and latency fell back within the normal operational range.
We’re adding monitoring, scripting, and documentation to facilitate faster failover and recovery in the future.
Primary database RAM failure
We were alerted by our monitoring services at 07:32 AM PST on November 5, 2015 that some of our frontend applications were unable to establish a connection to our primary database server.
We immediately began investigating the issue and found that the kernel log on the machine had several error messages indicating that the RAM on the machine was failing.
Escalation, failover, and recovery
We escalated the issue and paged other members of the team for assistance. Once the team had confirmed the issue, we updated packagecloudstatus.io, our user Slack channel, and our IRC channel.
After alerting our users, we stopped the major services powering the website to allow the database to quiesce.
We then began taking steps to fail over to the secondary database server. Once the secondary machine had been promoted to primary, we pushed an application change to point the applications at the new primary database.
We verified that the site was working, updated our status page, and alerted our users.
Increased latency and response times followed by full recovery
Starting at 08:37 AM PST, access to the service was restored and we began working with our hardware provider to swap the RAM on the machine that was taken offline.
The RAM swap was completed approximately 1 hour and 15 minutes later. We then ran a series of tests using memtest86, which took several hours to complete.
Once the tests were completed, the machine was brought back online. Latency and response times began to drop back to normal operational ranges.
We have begun investigating how to configure better alerting for low-level hardware errors, so that we are notified of them immediately rather than waiting for alerts from failing applications.
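To illustrate the kind of check this alerting could be built on, here is a minimal Python sketch that scans kernel log lines for hardware error messages. It is not the monitoring we run: the function names, the threshold, and the patterns are all assumptions chosen for illustration, and the patterns are only examples of messages a Linux kernel may log for memory faults, not the messages we saw during this incident.

```python
import re

# Illustrative patterns only (an assumption, not taken from the incident):
# examples of kernel messages that may accompany memory or other hardware faults.
HARDWARE_ERROR_PATTERNS = [
    re.compile(r"\[Hardware Error\]"),
    re.compile(r"\bEDAC\b.*\b(CE|UE)\b"),
    re.compile(r"machine check", re.IGNORECASE),
]


def find_hardware_errors(lines):
    """Return the log lines that match any of the hardware error patterns."""
    return [
        line
        for line in lines
        if any(pattern.search(line) for pattern in HARDWARE_ERROR_PATTERNS)
    ]


def should_alert(lines, threshold=1):
    """Return True when at least `threshold` matching lines are present."""
    return len(find_hardware_errors(lines)) >= threshold
```

A check like this could be fed lines from the kernel log on a schedule and page someone when `should_alert` returns true, which would surface a failing component before applications start reporting connection errors.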
We’ve also started revamping our documentation and runbooks so that database failover can be performed more quickly and easily in the future.
Hardware errors are impossible to prevent, but the steps we’ve taken will help protect our customers from future events and also help us resolve those events more quickly when they occur.
We sincerely apologize for the outage our customers experienced and hope that the changes we made to our infrastructure help protect customers against future outages of this nature.
If you have any questions, please feel free to email us at firstname.lastname@example.org.