Secrets Management at Cloud Scale – A Benchmark
March 17, 2017 | DevOps | Ranjeet Vidwans
Secrets Management, a concept that overlaps somewhat with Privileged Access Management (PAM), has long been a mainstay of enterprise security. However, the original use cases around which many of the incumbent tools in this space were built were human-centric: a canonical example is a DBA needing privileged access to update a production database containing customer information. This human-centric approach, completely valid for its intended purpose, was designed to scale accordingly: hundreds, or at most thousands, of administrative users of various stripes making interactive requests to manage servers, databases, and the like.
In the Cloud, Secrets Management has to scale at an entirely different level. With automated build, integration, and Configuration Management (CM) tools spinning up VMs and microservices, we now have to contend with scale multiple orders of magnitude beyond those original use cases. In a previous blog post we shared that we had a customer that has to, from time to time, "reboot" their cloud, for lack of a better phrase. That event cycles tens of thousands of VMs simultaneously, generating requests for over 4 million secrets per minute!
We don't think this is unusual for organizations moving to the cloud. Even small and mid-sized institutions are delivering customer- and consumer-facing applications in healthcare, financial services, media, entertainment, and other industries that will need this type of scale.
Unfortunately, a web search for any combination of "secrets management" and "scale" or "scalability" doesn't yield great results. The topic tends to get confused or conflated with other considerations. Those considerations are related and important, but the crucial point, delivering secrets management at cloud scale, gets lost in the noise. Sometimes it's discussed in the context of administrative scale (an important issue we will cover in a future blog post): the ability to manage key and secret distribution, user administration, and the like across a large, distributed set of people. Other times it's treated as a high-availability conversation (again, important, but not the same thing). Sometimes it's simply left as an exercise in scaling whatever back-end you use for secrets management.
At the request of several of our prospects, we decided to run a benchmark of our performance at massive scale, and we wanted to share the results with you.
To demonstrate the scalability of the Conjur platform, we created an AWS Auto Scaling Group (ASG) containing Conjur Followers (learn more about Conjur's Master / Follower architecture here). The ASG was created from the Conjur CloudFormation template and modified to use an AMI that contains a Conjur container configured as a Follower. The ASG uses m4.large instances, each of which has 2 vCPUs, and has a scaling policy that scales out when average CPU utilization exceeds 50%, adding a new Conjur Follower instance to the cluster.
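For illustration, a scale-out rule like the one described above could be expressed as a target-tracking policy via boto3. This is a hedged sketch only: the group and policy names are ours, not from the actual Conjur CloudFormation template, and the real deployment may have defined its policy differently (for example, as a CloudWatch-alarm-based simple scaling policy).

```python
# Hypothetical sketch: resource names are illustrative, and the actual
# Conjur CloudFormation template may define its scaling policy differently.
policy = {
    "AutoScalingGroupName": "conjur-followers",   # assumed ASG name
    "PolicyName": "follower-cpu-scale-out",       # assumed policy name
    "PolicyType": "TargetTrackingScaling",
    "TargetTrackingConfiguration": {
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ASGAverageCPUUtilization"
        },
        # Add a Follower when average CPU across the group exceeds 50%
        "TargetValue": 50.0,
    },
}

# With AWS credentials configured, this would register the policy:
# import boto3
# boto3.client("autoscaling").put_scaling_policy(**policy)
```

Target tracking lets AWS handle both scale-out and scale-in around the 50% CPU target, which matches the behavior described in the charts below.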
Load Test Execution:
The load generator ran in 30-second bursts, issuing Conjur API requests for the values of 20 variables in a single batch. We started the test at 8 requests per second (fetching 20 variables in each request) and scaled it up to 128 requests per second, generating load at each rate for 10 minutes.
The grey line is requests/sec, the ever-increasing load on the system, and you can see how it relates to request latency (the orange line). Remember that each "hit" on the system is actually a batch lookup of 20 secrets, so by the end of the test (when we're running 128 req/sec) we're actually looking up 2,560 secrets each second, applying full authentication, authorization, and auditing to every lookup. In other words, these are expensive operations! That's a sustained load of 153,600 secrets looked up each minute, or 9.2M each hour. And that's not even the ceiling, not even close. Read on…
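The arithmetic behind those figures is straightforward; here it is as a quick sanity check:

```python
requests_per_sec = 128     # peak request rate in the test
secrets_per_request = 20   # each request is a batch of 20 variables

secrets_per_sec = requests_per_sec * secrets_per_request   # 2,560 secrets/sec
secrets_per_min = secrets_per_sec * 60                     # 153,600 secrets/min
secrets_per_hour = secrets_per_sec * 3600                  # 9,216,000 (~9.2M) secrets/hr
```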
The important part is that as the Amazon ASG adds more Conjur Follower nodes to the cluster to handle the increasing load, request latency always stays predictable, hovering around 100ms.
The blue line shows the average CPU utilization of the whole cluster. The red line shows the size of the cluster group, and it steps up whenever AWS determines that the group is too heavily loaded for the current traffic levels. You can see that whenever the red line steps up (when a node is added), both latency and average CPU utilization drop thanks to the extra horsepower. Then, as the load increases, they track back up.
Conjur can handle ever-increasing load, and its latency never falters. And this is while performing expensive authentication, authorization, and audit functions on each operation. You can scale it to your heart's content, and it remains a predictably behaving system. We only stopped the test because we'd proven our point; theoretically, we could keep going indefinitely.
There are many aspects to properly architecting and deploying an enterprise-grade cloud security architecture. Scale and throughput, without sacrificing audit and security requirements, should be one of the key considerations when you design yours.