No one who works in the tech industry should feel any schadenfreude over GitLab’s outage yesterday, as reported by Business Insider and TechCrunch.
According to the incredibly open notes that GitLab published while the incident was still being worked on, the initial trigger of the problem was a:
Spike in database load due to spam users
In response, they took a series of actions to try to resolve the spam problem, but at 11pm an admin referred to as team-member-1 made a mistake and ran an rm -rf command on the wrong machine, deleting a live production PostgreSQL data directory. By the time the mistake was noticed, only about 1.5% of the approximately 300GB of data remained.
The situation was further compounded by a series of problems with their backups. According to an update they posted, some of the backups did not appear to have worked, producing:
files only a few bytes in size
They have since managed to restore their service, but with six hours of data lost. They have promised to publish a 5 whys analysis of the cause of the incident and the steps they will implement to prevent this from happening again.
In another interesting blog post, 2ndQuadrant, the original author of PostgreSQL’s core backup technologies, responded to the incident with their observations and suggestions for tools to consider. Well worth a read.
As we said at the beginning, there is no room for any schadenfreude. Today this is GitLab; tomorrow it could be anybody. Admins are people, and people make mistakes. The only real defence is to make mistakes that put production data at risk as hard as possible to make: through scripting and automation, by regularly checking that backups are completing successfully, and by verifying that those backups will actually restore in practice. A rough sketch of that kind of check follows.
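To make that last point concrete, here is a minimal sketch of an automated backup sanity check, assuming pg_dump custom-format archives written to a known directory. The backup path, size and age thresholds, and the bare exit-on-failure behaviour are illustrative placeholders, not GitLab’s actual setup.

```python
#!/usr/bin/env python3
"""Minimal sketch of a nightly backup sanity check for PostgreSQL dumps.

Assumes pg_dump custom-format archives (*.dump) land in BACKUP_DIR; the
paths and thresholds below are illustrative, not anyone's real config.
"""

import subprocess
import sys
import time
from pathlib import Path

BACKUP_DIR = Path("/var/backups/postgres")   # assumed backup location
MIN_SIZE_BYTES = 100 * 1024 * 1024           # anything under ~100MB is suspicious
MAX_AGE_SECONDS = 24 * 60 * 60               # expect at least one backup per day


def latest_backup() -> Path:
    """Return the most recently modified dump file, or fail loudly."""
    dumps = sorted(BACKUP_DIR.glob("*.dump"), key=lambda p: p.stat().st_mtime)
    if not dumps:
        sys.exit("FAIL: no backup files found in " + str(BACKUP_DIR))
    return dumps[-1]


def main() -> None:
    dump = latest_backup()
    stat = dump.stat()

    # A backup that is "only a few bytes in size" should fail this check.
    if stat.st_size < MIN_SIZE_BYTES:
        sys.exit(f"FAIL: {dump} is only {stat.st_size} bytes")

    # A backup that silently stopped running should fail this check.
    if time.time() - stat.st_mtime > MAX_AGE_SECONDS:
        sys.exit(f"FAIL: {dump} is older than 24 hours")

    # pg_restore --list reads the archive's table of contents without
    # touching any database, so it catches truncated or corrupt dumps.
    result = subprocess.run(
        ["pg_restore", "--list", str(dump)],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        sys.exit(f"FAIL: pg_restore cannot read {dump}: {result.stderr.strip()}")

    print(f"OK: {dump} ({stat.st_size} bytes) looks restorable")


if __name__ == "__main__":
    main()
```

A check like this catches empty or truncated dumps early, but it is no substitute for periodically restoring a backup into a scratch database and running real queries against it; only an actual restore proves the backup works.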
One positive thing to come out of this is that lots of people in the tech industry will be checking their backups today (I know we are). Another was the #HugOps hashtag, where people sent their best wishes to GitLab on Twitter. We certainly echo that sentiment.