BLOG POST

Scaling security: 4 million secrets per minute

 

September 30, 2016 | DevOps | Andy Ellicott

 

What’s your cloud reboot preparedness? Whether it’s a disaster recovery scenario or applying a critical update, restarting your cloud infrastructure will need to happen fast, and it will require access to security credentials, possibly millions of them. You don’t want your security architecture to fail under the workload.

Conjur’s security software helps automate the identity, access management and secrets systems for cloud infrastructure, and we’ve put a lot of thought into scalability over the years. Scaling secrets management to work fast in the cloud restart scenario is hard because:

  • There are so many types of secrets (passwords, API keys, SSH keys, etc.), with different rules and permissions attached to them.
  • Permissions have to be checked to ensure secrets go out to the right places, and not the wrong ones.
  • An audit trail of all requests (successful or not) must be logged.
  • The requests for credentials often come from software that is distributed across multiple cloud data centers, platforms, or regions.
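To make the middle two requirements concrete, here is a minimal sketch in Python of a fetch path that checks permissions and writes an audit record for every request, whether it succeeds or not. It is an illustration only, not Conjur's API: `SecretStore`, `grants`, `audit_log`, and the host and secret names are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class SecretStore:
    """Hypothetical in-memory store; not Conjur's actual data model."""
    secrets: dict                      # secret id -> secret value
    grants: set                        # (host id, secret id) pairs allowed to read
    audit_log: list = field(default_factory=list)

    def fetch(self, host: str, secret_id: str) -> str:
        # Check permission first, then record the attempt either way.
        allowed = (host, secret_id) in self.grants
        self.audit_log.append(
            {"host": host, "secret": secret_id, "allowed": allowed}
        )
        if not allowed:
            raise PermissionError(f"{host} may not read {secret_id}")
        return self.secrets[secret_id]


store = SecretStore(
    secrets={"db/password": "s3cr3t"},
    grants={("web-01", "db/password")},
)
value = store.fetch("web-01", "db/password")   # permitted, and audited
try:
    store.fetch("web-02", "db/password")       # denied, and still audited
except PermissionError:
    pass
```

The point of the sketch is that the audit write happens before the permission decision is enforced, so denied requests leave a record too.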

Here’s a customer use case that puts the challenges and solution in perspective for security architects.

The customer use case: “1.5 million secrets, please”

The customer is the developer of an online system with millions of simultaneous, interactive users. Their cloud restart involves simultaneously re-authenticating and re-authorizing 10,000 secure hosts, each of which must re-fetch all the “secrets” (e.g. API keys, SSH keys, passwords, etc.) that it needs: about 150 secrets per host, or 1.5 million secrets in all. The requests are all issued at the same time by their configuration management system (Salt), and the credentials are used by those hosts to connect to the services and data sources across their infrastructure.

A distributed approach to scaling secrets management

[Figure: Master-follower architecture]

Traditionally, secrets are kept and accessed via a centralized vault. In a situation like the one our customer faces, this can become a scalability bottleneck or a single point of failure.

Conjur takes an alternative approach, and runs as a tree of distributed, replicating servers. A Conjur “master” is updated with new identities, roles, permissions and data, which it replicates in real time to one or more “followers.” Each Conjur follower provides read-only API access to services, such as permission checks, distributing public keys, and granting access to secrets. Each follower also continuously generates audit records of system activity, which it transfers back up to the master. Followers are deployed as widely as the overall infrastructure: across multiple data centers, cloud providers, and cloud regions. This improves reliability and network latency by co-locating followers with client machines.

Because each follower has a complete copy of the data, followers can be used to very efficiently “scale out” the overall capacity. Load balancers can be deployed in front of the followers for efficient routing and health checks.
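As a rough illustration of the routing idea, the sketch below shows round-robin selection across followers that skips any follower marked unhealthy. It is a toy model in Python, not how Conjur or any particular load balancer is implemented; the `Follower` and `RoundRobinBalancer` names and the `healthy` flag are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Follower:
    """Hypothetical stand-in for one read-only follower."""
    name: str
    healthy: bool = True


class RoundRobinBalancer:
    """Cycle through followers in order, skipping unhealthy ones."""

    def __init__(self, followers):
        self._followers = list(followers)
        self._next = 0  # index of the follower to try first

    def pick(self) -> Follower:
        count = len(self._followers)
        for offset in range(count):
            index = (self._next + offset) % count
            candidate = self._followers[index]
            if candidate.healthy:
                # Start the next search just past the one we chose.
                self._next = (index + 1) % count
                return candidate
        raise RuntimeError("no healthy followers available")


followers = [Follower("follower-a"), Follower("follower-b"), Follower("follower-c")]
balancer = RoundRobinBalancer(followers)
first_round = [balancer.pick().name for _ in range(4)]

followers[1].healthy = False  # simulate a failed health check
after_failure = [balancer.pick().name for _ in range(3)]
```

Because every follower holds a complete copy of the data, any healthy follower can serve any request, which is what makes this kind of stateless routing possible.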

Customer benchmarks

Our customer ran benchmarks of their restart scenario. The results of the benchmarks, running a cluster of 3 Conjur followers on bare-metal servers (24 cores each), were as follows:

  • Each Conjur follower handled 218 authentication requests and 218 simultaneous batch secret fetches (150 secrets per batch), which is 32,700 secrets served at once per follower.
  • Average response latency was about 5 seconds, so the auth+fetch cycle can be repeated every 5 seconds (12 times per minute), serving about 392,000 secrets per minute per Conjur follower.
  • During the benchmark, access permissions for each secret request were validated and an audit trail kept.
  • Since Conjur is horizontally scalable, they plan to simply add 7 more followers in production. 10 Conjur followers will provide fault tolerance and a total capacity of about 327,000 secrets every 5 seconds, which comes to ~4 million secrets per minute.
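The arithmetic behind these figures can be reproduced with a few lines of Python. The constants come from the benchmark numbers above; the function names are ours, and the restart-time estimate is an extrapolation that assumes throughput scales linearly with follower count and that latency stays at about 5 seconds under production load, neither of which the benchmark itself demonstrates.

```python
import math

HOSTS = 10_000                 # hosts re-authenticating on restart
SECRETS_PER_HOST = 150         # secrets fetched per host (one batch)
BATCHES_PER_FOLLOWER = 218     # simultaneous auth + batch fetches per follower
CYCLE_SECONDS = 5              # average auth + fetch latency


def secrets_per_cycle(followers: int) -> int:
    """Secrets served in one 5-second cycle by the whole cluster."""
    return followers * BATCHES_PER_FOLLOWER * SECRETS_PER_HOST


def secrets_per_minute(followers: int) -> int:
    """Secrets served per minute, assuming back-to-back cycles."""
    return secrets_per_cycle(followers) * 60 // CYCLE_SECONDS


def restart_seconds(followers: int) -> int:
    """Estimated time to serve every host once (linear-scaling assumption)."""
    hosts_per_cycle = followers * BATCHES_PER_FOLLOWER
    return math.ceil(HOSTS / hosts_per_cycle) * CYCLE_SECONDS


print(secrets_per_cycle(1))     # 32700
print(secrets_per_minute(1))    # 392400
print(secrets_per_cycle(10))    # 327000
print(secrets_per_minute(10))   # 3924000
print(restart_seconds(10))      # 25
```

Under those assumptions, 10 followers would work through all 10,000 hosts (1.5 million secrets) in five cycles, or roughly 25 seconds.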

On top of database, app logic, and networking, secrets management is one more scalability concern to tackle in the cloud. It’s a hard challenge, but highly rewarding when done right.

 

 
