Scaling Jenkins with Machine Identity
Today I’d like to share how we scale our build infrastructure at Conjur. We build and test the software we release on a Jenkins cluster running in AWS. Right now, we have nearly 100 different jobs defined and several different pipelines. We’ve been able to cut down on job maintenance by defining jobs in the Jenkins Job DSL (all our job definitions are open source). But wasted time still adds up if you can’t automatically launch and configure executor nodes as needed. In this post I’ll describe in detail how we scale Jenkins executors safely and securely.
Before diving into technical details, I want to briefly cover our goals: the reasons we spent time automating our build infrastructure.
- Demand is variable. When putting the finishing touches on a new release, we run more jobs than usual. Pipelines are run more often and at the same time. We need to be able to scale executor capacity up and down easily.
- No human in the loop. Launching and configuring nodes by hand distracts engineers from the creative work that only people can do. The only decision someone should have to make is how much capacity they need.
- Nodes should automatically bootstrap with a machine identity. Different jobs need different credentials to build and deploy artifacts. Executor nodes need to receive an identity that lets them fetch these credentials.
By automating our build infrastructure, engineers can focus on valuable projects and initiatives. This means less time from idea to implementation of new features. It also keeps the team happier and more engaged. Configuring servers by hand is not exciting work.
1. Define infrastructure as code
Since our infrastructure is in AWS, we use CloudFormation to describe it declaratively. Unfortunately, CloudFormation’s template language is JSON. JSON is great for machines, but not so great to write and edit by hand. To avoid some frustration, we use the CloudFormation Ruby DSL gem to write templates in Ruby. Running the Ruby file generates the JSON for you. We’ve found this to be a more flexible and less error-prone way to define templates.
For Jenkins executors, we define an autoscaling group with the desired capacity as a parameter. Our engineers can change this parameter whenever they need to. The template takes several other parameters, including the Host Factory token, which we’ll cover in step 3. The full template for our Jenkins executors is available on GitHub.
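To make the idea concrete, here is a minimal sketch in plain Ruby that uses only the standard `json` library. It does not use the CloudFormation Ruby DSL gem's syntax. The resource names, the `ImageId` parameter, the instance type, and the size limits are illustrative assumptions, not values from the published template.

```ruby
require 'json'

# Builds a CloudFormation template for an executor pool as a plain Ruby Hash.
# Names and values here are assumptions made for illustration only.
def executor_template
  {
    'AWSTemplateFormatVersion' => '2010-09-09',
    'Description' => 'Jenkins executor pool (illustrative sketch)',
    'Parameters' => {
      # The one knob engineers are expected to turn.
      'DesiredCapacity' => { 'Type' => 'Number', 'Default' => 2 },
      'NodeName' => { 'Type' => 'String', 'Default' => 'executor' },
      # NoEcho keeps the token out of console and API output.
      'HostFactoryToken' => { 'Type' => 'String', 'NoEcho' => true },
      'ImageId' => { 'Type' => 'AWS::EC2::Image::Id' }
    },
    'Resources' => {
      'ExecutorLaunchConfig' => {
        'Type' => 'AWS::AutoScaling::LaunchConfiguration',
        'Properties' => {
          'ImageId' => { 'Ref' => 'ImageId' },
          'InstanceType' => 'm4.large' # assumed instance type
        }
      },
      'ExecutorGroup' => {
        'Type' => 'AWS::AutoScaling::AutoScalingGroup',
        'Properties' => {
          'AvailabilityZones' => { 'Fn::GetAZs' => '' },
          'LaunchConfigurationName' => { 'Ref' => 'ExecutorLaunchConfig' },
          'MinSize' => '0',
          'MaxSize' => '10',
          # Capacity comes from the parameter, not a hard-coded number.
          'DesiredCapacity' => { 'Ref' => 'DesiredCapacity' }
        }
      }
    }
  }
end

# Running the file prints the JSON that CloudFormation consumes.
puts JSON.pretty_generate(executor_template) if __FILE__ == $PROGRAM_NAME
```

Because the template is ordinary Ruby data, shared fragments can be pulled into methods and reused across templates, which is much of the appeal of generating the JSON instead of writing it by hand.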
2. Allow executor nodes to register themselves
When Jenkins executor instances first start, they should register themselves with the Jenkins master. Adding and removing nodes manually in the Jenkins UI is a distraction from real work. Registering as an executor via JNLP is not an automatic process. You need a node name to register, which means that the node must already exist in Jenkins. That means you have to create one via the Jenkins UI or API before your newly-launched executor can join your Jenkins node pool. Bummer.
Thankfully, the Jenkins Swarm plugin exists. Using this plugin you can register executor nodes with the Jenkins master without manual intervention. Once the plugin is installed, you can use the swarm CLI to connect to your Jenkins master when an executor starts up. The CLI should be started as the ‘jenkins’ user; otherwise you’ll have issues with permissions in your builds. We created a single-purpose user for authenticating swarm nodes.
Bootstrapping into a swarm looks like this in a userdata.sh script, a script that runs as root when the machine boots up:
One neat thing about CloudFormation parameters is that you can reference them in userdata.sh. You can see this above with ref('NodeName'). We set the default for that parameter to ‘executor’. The Swarm CLI appends a random string to the node name, so the executor name you’ll see in Jenkins is something like ‘executor-3d5fe07e’. You can see in the script above that we use the Conjur CLI to fetch the Swarm password. I’ll explain how this works in the next section.
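As a rough illustration of this flow, the Ruby helper below renders a userdata.sh body as a string. It is a sketch under stated assumptions, not the script from our repository. The helper name `render_userdata`, the jar path, the master URL, and the `swarm` username are all hypothetical. The Swarm client flags (`-master`, `-username`, `-password`, `-name`) and the `conjur variable value` command should be checked against the versions you have installed. In a real template the node name would come from ref('NodeName') instead of a method argument.

```ruby
require 'shellwords'

# Hypothetical helper: renders the text of a userdata.sh script for one executor.
# All argument values are placeholders supplied by the caller.
def render_userdata(node_name:, master_url:, swarm_user:, swarm_jar: '/opt/swarm-client.jar')
  <<~SCRIPT
    #!/bin/bash
    set -euo pipefail

    # Fetch the Swarm password using the node's Conjur machine identity.
    SWARM_PASSWORD="$(conjur variable value jenkins/swarm/password)"

    # Start the Swarm client as the jenkins user to avoid permission issues in builds.
    sudo -u jenkins java -jar #{Shellwords.escape(swarm_jar)} \\
      -master #{Shellwords.escape(master_url)} \\
      -username #{Shellwords.escape(swarm_user)} \\
      -password "$SWARM_PASSWORD" \\
      -name #{Shellwords.escape(node_name)}
  SCRIPT
end

if __FILE__ == $PROGRAM_NAME
  puts render_userdata(node_name: 'executor',
                       master_url: 'https://jenkins.example.com',
                       swarm_user: 'swarm')
end
```

The password is fetched at boot and never written into the template, so rotating it in Conjur requires no change to the infrastructure definition.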
3. Apply machine identity on launch
Our Jenkins executors need access to fetch credentials to run jobs. They also need access to credentials during their bootstrap process. We dogfood here, using Conjur’s Host Factory to apply a unique machine identity and add the node to a build layer. The build layer is granted access to a set of credentials with Conjur policy, declared in YAML.
We created a Host Factory token that allows nodes to bootstrap themselves into the build layer. This token is passed as a CloudFormation parameter and used in ‘userdata.sh’ to run Chef cookbooks that apply machine identity and set up SSH access. There are several ways to bootstrap identity; passing CloudFormation parameters is the best fit for us at the moment.
This step happens before running swarm in step 2. The build layer has fetch access to the credential ‘jenkins/swarm/password’. Once the node is a member of that layer, it can fetch the password. Every access to this credential is also recorded in an immutable audit log. Here’s an example log from the Conjur UI:
Our full templates and userdata script for Jenkins can be found on GitHub. By implementing the steps above, our Jenkins executor pool can be scaled up and down by changing the Desired Capacity parameter defined in our CloudFormation template. Adding nodes via Jenkins Swarm just works. When nodes are terminated, they are removed from the Jenkins UI as well. Managing access to credentials can be done at any time without needing to destroy or recreate infrastructure. Our engineers can spend more time working on all of the projects that comprise Conjur and less time managing build infrastructure.