GitLab faces backup failure after accidentally deleting data.
GitLab has been taken offline after a major backup restoration failure that followed an accidental deletion of data.
The source-code hub posted a series of tweets following the incident, one of which confirmed the deletion: “We accidentally deleted production data and might have to restore from backup.” The tweet linked to a Google Doc with live notes.
The data loss took place when a system administrator accidentally deleted a directory on the wrong server during a database replication process. A folder containing 300GB of live production data was completely wiped.
GitLab said: “This incident affected the database (including issues and merge requests) but not the git repos (repositories and wikis).”
It emerged that, of the five backup techniques deployed, none was working reliably or had been set up properly in the first place. The last potentially useful backup was taken six hours before the issue occurred.
Even so, this is of limited help: snapshots are normally taken only every 24 hours, and because the deletion occurred six hours after the previous snapshot, six hours of data was lost.
David Mytton, founder and CEO of Server Density, said: “This unfortunate incident at GitLab highlights the urgent need for businesses to review and refresh their backup and incident handling processes to ensure data loss is recoverable, and teams know how to handle the procedure.
“This particular accident shows that any business, no matter how technical or experienced in data management, can become a victim of accidental and catastrophic data loss.”
Among the mistakes that led to the backup restoration failure, disk snapshots in Azure are normally enabled for the NFS server, but not for the DB servers used by GitLab.
GitLab said that, during its efforts to restore the data, it found the replication procedure to be very fragile, prone to error, reliant on a handful of random shell scripts and badly documented. The company then realised that all of its backups to S3 had also failed.
GitLab has confirmed that the disruption affects only the website, and that customers who run the platform on premises are not affected.