Lessons Learned from the Dropbox Outage
January 13, 2014 | DevOps | Kevin Gilpin
Last Friday, January 10, Dropbox experienced an outage that wasn’t fully rectified until late Sunday afternoon, nearly 48 hours later. As explained in a very candid post-mortem on the Dropbox Tech blog, during planned maintenance to upgrade the OS on some of the company’s machines, a subtle bug in the upgrade script caused the command to reinstall a small number of active machines. Unfortunately, some master-slave pairs were affected, which took the site down.
The Outage Explained
Dropbox uses many machines that are constantly repurposed to do different things. Sometimes a machine might be a non-critical dev box, but later on it might get turned into a database-serving production box.
Dropbox runs scripts and jobs that perform tasks like OS patching and restarts, and some of those scripts are not supposed to run on machines that serve critical functions, like hosting production data.
But because Dropbox constantly changes the roles and configuration of its boxes without keeping their identities up to date, scripts that assumed they were on development boxes started running on production boxes that were serving databases. The scripts began shutting down and restarting the databases in order to apply OS patches, disrupting the architecture and making the Dropbox site unavailable for nearly two days, which is eons in the digital era.
So Dropbox decided that the boxes themselves need to be able to determine whether or not to run a job.
How to Prevent the Same Problem from Recurring
When a machine goes to pull a script, it should not be able to access anything that was not designed to run on it. But as the Dropbox outage shows, it can. So how do you make the machines “self-aware” in a way that is powerful, yet safe? How could Dropbox have given each machine an identity, and linked that identity both to the environment (development, test, production, etc.) the machine is assigned to and to the ecosystem it belongs to?
The right way to manage this virtual environment is to use host identity, as follows. Whenever a host is configured, it should obtain a new identity, and it should join the right layer for its role. When a maintenance script is created, it should be placed into an access-controlled script repository. Permission rules should define which layers are allowed to access which scripts.
Of course, a real infrastructure will have many more categories of hosts and data, but this gives you a basic sense of script access control.
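The host, layer, and permission rules described above can be sketched in a few lines of Python. This is only an illustration of the idea, not Conjur's API: the names Host, ScriptRepository, add_script, and fetch, along with the layer and script names, are all made up for this example.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass(frozen=True)
class Host:
    """A machine identity (hypothetical). A repurposed box gets a new Host."""
    host_id: str
    layer: str  # the role the host joined, e.g. "development"


@dataclass
class ScriptRepository:
    """Access-controlled script store (hypothetical, in-memory)."""
    scripts: Dict[str, str] = field(default_factory=dict)              # name -> body
    allowed_layers: Dict[str, Set[str]] = field(default_factory=dict)  # name -> layers

    def add_script(self, name: str, body: str, layers: Set[str]) -> None:
        """Store a script together with the layers permitted to fetch it."""
        self.scripts[name] = body
        self.allowed_layers[name] = set(layers)

    def fetch(self, host: Host, name: str) -> str:
        """Return the script body, or refuse if the host's layer is not listed."""
        # Deny by default: an unknown script or an unlisted layer gets nothing.
        if host.layer not in self.allowed_layers.get(name, set()):
            raise PermissionError(
                f"{host.host_id} (layer {host.layer}) may not fetch {name}"
            )
        return self.scripts[name]


repo = ScriptRepository()
repo.add_script("reinstall_os.sh", "#!/bin/sh\n# reinstall steps go here\n",
                layers={"development"})

dev_box = Host("host-001", "development")
db_box = Host("host-002", "production-db")

repo.fetch(dev_box, "reinstall_os.sh")      # the development layer is permitted
try:
    repo.fetch(db_box, "reinstall_os.sh")   # the production database layer is not
except PermissionError as err:
    print("denied:", err)
```

The point of the design is that the decision is made from the identity of the host asking for the script, not from a list of machines maintained somewhere else, so a box that moves from development to production stops receiving development-only scripts as soon as its layer changes.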
This is exactly what Conjur Host identity and Layer membership are designed to do, and of course the approach can be extended beyond scripts to any kind of data, including keys and credentials.
It’s Essential to Capture a Rigorous Audit Trail for Future Troubleshooting of Complex Infrastructures
When it’s time to update a host, the host receives a message to run a certain script and fetches that script from the script repository. With Conjur, an audit record of the request is generated when the host fetches and runs the script, and access to the script is denied if the host does not have permission. The result is a foolproof, audited system for running scripts on hosts that does not compromise the ability of Dropbox to re-purpose its boxes as needed.
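To make the audit step concrete, here is a minimal Python sketch in which every fetch request is recorded, whether it is allowed or denied. It is not Conjur's implementation or API: PERMISSIONS, AUDIT_LOG, fetch_script, and the record fields are assumptions made for this example, and a real audit trail would live in append-only storage away from the host.

```python
import json
from datetime import datetime, timezone

# Hypothetical permission table: script name -> layers allowed to fetch it.
PERMISSIONS = {"os_patch.sh": {"development"}}

# Hypothetical audit trail, kept in memory here for simplicity.
AUDIT_LOG = []


def fetch_script(host_id, layer, script):
    """Check permission and record the request, allowed or not."""
    allowed = layer in PERMISSIONS.get(script, set())
    # The record is written before the decision is enforced, so denied
    # requests leave a trace as well.
    AUDIT_LOG.append({
        "time": datetime.now(timezone.utc).isoformat(),
        "host": host_id,
        "layer": layer,
        "script": script,
        "result": "allowed" if allowed else "denied",
    })
    if not allowed:
        raise PermissionError(f"{host_id} may not fetch {script}")
    return f"<contents of {script}>"  # placeholder for the real script body


fetch_script("host-001", "development", "os_patch.sh")
try:
    fetch_script("host-002", "production-db", "os_patch.sh")
except PermissionError:
    pass

for record in AUDIT_LOG:
    print(json.dumps(record))
```

When something does go wrong, a log like this answers the first troubleshooting questions directly: which host asked for which script, what role it held at the time, and whether the request went through.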
Conjur’s software, delivered as SaaS or as a virtual appliance, takes care of getting scripts deployed into the correct environment, which helps you avoid outages caused by human error like the one at Dropbox.