One of the most important things to happen in the evolution of development over the past many years is the widespread adoption of continuous integration and continuous deployment, or CI/CD. (Sometimes the “CD” stands for “continuous delivery,” depending on who you’re talking to.)
It’s a concept that jettisons a lot of older ideas about how systems should be managed and instead gives you a way to update code and integrate changes as live rolling deployments while ensuring that the new code is tested and slots in smoothly with stuff that’s already running. A properly architected CI/CD pipeline means you can get code changes into production faster and with fewer errors. But what does that look like in practice?
It looks like Ars Technica, because we’ve adopted a CI/CD workflow to take full advantage of the flexibility afforded us by serverless cloud hosting. Welcome to part three of our four-part series on how we host Ars—here, we’re going to swing away from the “ops” side of “DevOps” and peer more closely at the “dev” part instead. Join us for a look behind the curtain at how Ars uses CI/CD in both our deployed applications and our infrastructure management!
Version control is not optional
For the benefit of folks who only do the “ops” part of DevOps, let’s get a working definition going for “version control,” as the term underpins our entire approach toward maintaining code. When we say “version control” in this context, we’re talking about a method by which we’re able to track changes made to our production codebase—that is, the repository of files that makes Ars function.
Ensuring that production codebase is subject to some form of version control is a lot like turning on “track changes” in Word: the version control system keeps a record of every change made to every file, along with a correlated list of who made the change and when it happened. Version control is a critical component of most large-scale IT projects, and in some cases, it’s even a mandated regulatory requirement.
But version control is a hard problem to solve, and many of the solutions that are common now (Git especially) are still relatively young. Not that long ago, there was a time I don’t remember with fondness: a time in which you edited code in a text editor and then FTP’d it to a production server. This was a low and filthy era, rife with lost changes, production crashes, and ad-hoc backups with names like oops-1997-05-21.tar.gz. To be sure, even back in those primitive days, there were bearded wizards who spoke of inscrutable technologies like CVS (Concurrent Versions System, not the pharmacy), but such things were conspicuously absent from most folks’ experience within the burgeoning universe of web development.
While many of us probably recall SVN repositories (and with many thoughts and prayers to those of you still dealing with SVN), it wasn’t until Git that version control gained massive popularity. Why? In part because tools like Git and GitHub made it dead simple to create and maintain repositories—just type a few commands, and you’re up and running. Most of the world’s developers now maintain code repositories on GitHub, and Ars is no exception.
How do changes flow from GitHub into deployed applications?
Well, we start by firing up our favorite FTP client, Transmit. I kid, I kid—but Transmit was (and apparently still is) an awesome app. For real, now: We start by working on a particular branch in one of our repositories. Remember from our previous installments that Ars is composed of four main applications, each running in its own container inside AWS ECS tasks:
- Arx: Our local Docker Compose development setup and Nginx server container
- Acta: The main WordPress application
- Civis: Our discussion forum software
- Taberna: Our e-commerce and subscription system
Each of these applications has its own repository on GitHub. When large changes are made, a new branch is created. Eventually, that new feature branch will be merged into a staging branch via a pull request. After testing, staging will be merged into the main branch with another pull request.
Branches, merges, and pull requests, oh my!
In version control terms, a “branch” is simply a named deviation from the main code repository. For example, if we decided to replace the Ars logo with the letter “X”—wait, that’s too relevant, let’s say the letter “Y”—we would start by checking out a new branch along the lines of this:
git checkout -b feature-l33t-new-ars-logo
That command creates a new branch that starts as a clean copy of whatever we currently have checked out (here, the main code), one we can start messing with in our local development environment without having to worry about stepping on anything in production. Once we’ve completed an exhaustive find-and-replace so that every logo reference in the code says y-technica-logo.png, we’re ready to “commit” the change to the branch. Committing does not make your changes live in production; it records them in the branch’s history along with a message describing them. In this case, our commit might be something like:
git commit -m "Sweet new logo for Ars - Y is the future"
Next up, we need to actually get the change from our local development environment up into the GitHub repository. In Git terms, we need to “push” our changes. This is accomplished with, perhaps unsurprisingly, the git push command. After that, the code exists in the GitHub repo, and the next step (if we feel good about what we’ve coded) is to test it in a staging environment. Doing this means we need to get our changes from the feature-l33t-new-ars-logo branch and put them somewhere, like our staging branch. There are many ways to integrate code between branches, but we like the formality of doing things with pull requests, which let us merge our new code into an existing branch and also leave us with good documentation around every change. A pull request could consist of one or hundreds of commits that took place during the development of the feature branch.
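If it helps to see the branch, commit, and push steps as data rather than as commands, here is a minimal Python sketch. It is a toy model, not how Git is implemented: the Repo class and its method names are invented for illustration, and the only things taken from the walkthrough above are the branch name and the commit message.

```python
class Repo:
    """Toy model: a branch is just a name pointing at its newest commit."""

    def __init__(self):
        self.commits = {}               # commit id -> (parent id, message)
        self.branches = {"main": None}  # branch name -> newest commit id
        self.current = "main"

    def commit(self, message):
        # Record a new commit on top of the current branch and move the branch to it.
        commit_id = len(self.commits) + 1
        self.commits[commit_id] = (self.branches[self.current], message)
        self.branches[self.current] = commit_id
        return commit_id

    def checkout_new_branch(self, name):
        # The new branch starts at the same commit as the branch we were on.
        self.branches[name] = self.branches[self.current]
        self.current = name

    def push(self, remote):
        # Simplification: copy every known commit, then publish the current branch pointer.
        remote.commits.update(self.commits)
        remote.branches[self.current] = self.branches[self.current]


local, github = Repo(), Repo()
local.commit("Initial site")                               # commit 1, on main
local.push(github)
local.checkout_new_branch("feature-l33t-new-ars-logo")
local.commit("Sweet new logo for Ars - Y is the future")   # commit 2, on the feature branch
local.push(github)

print(github.branches)
```

Note that main still points at the first commit after all of this. Nothing reaches staging or main until a merge (in our case, a pull request) moves those branches forward.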
Git will automatically flag any “conflicts” when you merge branches. A conflict usually arises when someone has committed newer changes to the same files affected by the new feature branch. Working through large conflicts can be a nightmare, which is why it’s important to work as granularly as possible when creating new features—only check out what you need, when you need it, and be mindful of where others are working. Ideally, you introduce as little new and disruptive code as possible—and, fortunately, our new Y logo fits the bill.
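To make the idea of a conflict concrete, here is a small Python sketch of the underlying three-way comparison. It is deliberately coarse: real Git compares changes within files, hunk by hunk, while this version only asks whether two branches changed the same file in different ways. The function name, the file names, and the file contents are all made up for the example.

```python
def find_conflicts(base, ours, theirs):
    """Return the paths that both branches changed, in different ways, since `base`.

    Each argument maps a file path to that file's content.
    A missing key means the file does not exist on that side.
    """
    conflicts = []
    for path in sorted(set(base) | set(ours) | set(theirs)):
        original = base.get(path)
        mine = ours.get(path)
        other = theirs.get(path)
        changed_by_us = mine != original
        changed_by_them = other != original
        # Only a problem when both sides touched the file and disagree on the result.
        if changed_by_us and changed_by_them and mine != other:
            conflicts.append(path)
    return conflicts


# Hypothetical snapshots: the common ancestor, our feature branch, and staging.
base = {"header.php": "logo=ars-logo.png", "footer.php": "copyright 2023"}
feature = {"header.php": "logo=y-technica-logo.png", "footer.php": "copyright 2023"}
staging = {"header.php": "logo=ars-logo-retina.png", "footer.php": "copyright 2024"}

print(find_conflicts(base, feature, staging))  # ['header.php']
```

The footer changed on only one side, so it merges cleanly. The header changed on both sides, so a human has to decide which version wins, which is the argument for keeping feature branches small.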
When code is added to key branches we have set up, like main, a series of automated events is kicked off. These all start with linting and testing. Linting (a term that refers to a static check of one’s code and configuration files to make sure there aren’t any typos or other issues) ensures that submitted code conforms to our internal styles and practices. After linting, testing ensures that the major features and functions of the applications are all continuing to produce the desired results so that you don’t introduce unanticipated problems in other areas of your software when you push new code.
On each push, GitHub fires up a build system that runs the linting and testing commands you specify, using a simple YAML configuration file and GitHub Actions. It’s worth noting that these tests could also run inside AWS CodeBuild, but GitHub’s build system is very fast and automatically integrated with the repository user interface, so we can see at a glance what horrors we’ve introduced with our latest pulls.
Once our tests have passed, GitHub uses a webhook to notify AWS that we’ve got some code to deploy to production, and this is where the real fun begins.
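The important property of that sequence is the gating: a later stage only runs if every earlier one passed. A minimal Python sketch of the idea follows. The stage names and the three placeholder check functions are hypothetical stand-ins, and this is not GitHub Actions configuration or any real API.

```python
from typing import Callable, List, Tuple

Stage = Tuple[str, Callable[[], bool]]


def run_pipeline(stages: List[Stage]) -> List[str]:
    """Run stages in order and stop at the first failure. Returns a log of what ran."""
    log = []
    for name, check in stages:
        if check():
            log.append(f"{name}: passed")
        else:
            log.append(f"{name}: failed")
            return log  # later stages never run
    return log


# Placeholder checks; a real setup would shell out to a linter and a test runner.
def lint_ok() -> bool:
    return True


def tests_ok() -> bool:
    return False  # pretend a test broke


def notify_deploy() -> bool:
    return True


result = run_pipeline([
    ("lint", lint_ok),
    ("test", tests_ok),
    ("notify-deploy", notify_deploy),
])
print(result)  # ['lint: passed', 'test: failed']
```

Because the test stage failed, the deploy notification is never sent, which is the behavior we want from the real pipeline.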
AWS CodePipeline and CodeBuild
CodePipeline is an AWS tool that allows you to take code from a repository and pass it around to different services within AWS (hence the “pipeline” part!) with an eye toward helping out with building and deployment tasks. Additionally, at least in our case, changes to some key GitHub repositories and branches will also kick off a CodeBuild process.
CodeBuild is a service that actually fires up an instance of a preconfigured build environment, inside of which code will be compiled to your specs. There are many such environments to choose from. The image and build instructions are read from a buildspec.yml file in each application’s root directory. (Note that when we say “root directory” here, we mean that the buildspec.yml file is stored in the root of the GitHub repo.)
This build file tells CodeBuild what runtimes we need (e.g., “Please provide PHP 8.1 and Node.js 18”) and breaks down into a series of build phases that are fairly easy to understand if you’ve seen YAML before. These buildspec.yml files can become extremely complex and even include other build files, but our setup for the main WordPress application is fairly straightforward:
- Log in to our AWS ECR (Elastic Container Registry)
- Use Composer to install any defined PHP packages we need
- Delete a lot of unnecessary files (we don’t need that README in production, do we?)
- Run the docker build command, which will create our image per a project-specific Dockerfile
- And finally, push this new image to the container repository, where it can be later pulled down by ECS
Whew! That sounds like a lot (and it is!), but it’s mostly a “design once and rarely edit” setup. There’s more involved with tagging and storing Docker images in ECR, but this gives you the general overview. (You can think of ECR like our own isolated version of DockerHub.) Once CodeBuild is done, it returns its results back to CodePipeline, which will continue its execution.
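For a rough feel of what those five build steps boil down to, here is a Python sketch that assembles the shell commands without running them. This is not our actual buildspec.yml. The registry address, region, image tag, README.md filename, and the exact flags chosen are all assumptions made for illustration, though the aws, composer, and docker subcommands themselves are standard.

```python
from typing import List


def build_commands(registry: str, repository: str, tag: str, region: str) -> List[str]:
    """Return the build steps, in order, as shell command strings (nothing is executed)."""
    image = f"{registry}/{repository}:{tag}"
    return [
        # 1. Log in to ECR so Docker is allowed to push to it.
        f"aws ecr get-login-password --region {region}"
        f" | docker login --username AWS --password-stdin {registry}",
        # 2. Install the PHP packages the project defines (assumed: skip dev-only ones).
        "composer install --no-dev",
        # 3. Remove files that have no business in production (filename is an assumption).
        "rm -f README.md",
        # 4. Build the image from the project's Dockerfile.
        f"docker build -t {image} .",
        # 5. Push the image to ECR, where ECS can pull it later.
        f"docker push {image}",
    ]


# Hypothetical values: a made-up account ID, region, and tag.
commands = build_commands(
    registry="123456789012.dkr.ecr.us-east-1.amazonaws.com",
    repository="acta",
    tag="abc123",
    region="us-east-1",
)
for command in commands:
    print(command)
```

In a real buildspec.yml, commands like these would be spread across the file’s build phases rather than living in one flat list.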
So now we’ve committed code, tested it, built our final Docker image with all our application files copied over, and pushed that image to ECR, where it can be accessed by a simple docker pull command. What’s next? From our first installment, we know our newly built Docker images need to eventually end up living on one of our Fargate ECS clusters as containers in a “task,” but how, exactly?
Enter AWS CodeDeploy and blue/green deployments
If nothing else has piqued your interest about this approach to CI/CD, I think blue/green deployments may be the “Excelsior!” moment. It certainly was for me. Imagine: Even after running successful tests and builds, you eventually get to a point where you must flip a switch, finally saying goodbye to your stable production environment and puckering all orifices while you wait for the new deployment to come online and take over production work.
This was the way things were for us a few years ago, when production changes were made by deploying code using a makeshift Debian repository (which was kind of a genius approach; hat tip to Ars developer emeritus Lee Aylward). Even with Fargate managing the tasks, a lot can go wrong. An errant environment variable, something wrong with the underlying provisioning service, an unintended change that passed all the tests but still breaks the site: there are a million potential problems.
Blue/green is one of multiple strategies provided out-of-the-box by AWS CodeDeploy, and this is where it saves the day. With blue/green deployment, a completely new target group containing your updated container(s) will be spun up in parallel to your existing production application, containing an identical number of tasks but accessible through an alternate port on the application load balancer. Thus, you can peruse the new stuff on the alternate port to your heart’s content until you’re completely satisfied that it’s ready for public traffic. From here, it’s a simple flip of the switch to swap public traffic to the new target group. This is one part of our CI/CD process that we do manually, for obvious reasons.
For folks who require a car analogy to properly grok a tech concept (I see you raising your hand back there, Lee!), think of blue/green deployments as being a bit like the DevOps take on a dual-clutch transmission. The transition between gears goes a lot faster in a DCT because the next gear change is already being handled by the other clutch while you’re still accelerating, just as we get our replacement environment built and brought to hot standby while the current environment runs. When it’s time to shift (gears or production containers!), the changeover happens much more quickly because all the work to accomplish the transition, along with the validation that it will be successful, has already happened. You just fail over between clutches in the DCT and between “blue” and “green” environments for Ars.
There are competing deployment strategies, to be sure—such as slowly shifting traffic to your new setup over time. Personally, I love blue/green for high-volume websites because your commitment is zero until you’re happy with how everything is running. Even then, you have the option to roll your deployment back to the previous version, which—as long as you haven’t terminated its tasks—is still running in parallel. Failing back, if one needs to do it, takes only seconds. The entire process greatly reduces anxiety associated with deploying large changes, which any developer will be able to relate to.
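The swap-and-roll-back mechanics are simple enough to model in a few lines of Python. Treat this as a sketch of the concept only. The LoadBalancer class and its field names are invented for the example and are not the CodeDeploy or load balancer API.

```python
from dataclasses import dataclass


@dataclass
class LoadBalancer:
    production: str  # target group receiving public traffic
    test: str        # target group reachable only on the alternate port

    def swap(self) -> None:
        # Both groups are already running, so switching is just a pointer swap.
        self.production, self.test = self.test, self.production


lb = LoadBalancer(production="blue", test="green")

# Poke at "green" on the alternate port until we're happy, then promote it.
lb.swap()
print(lb.production)  # green

# Something looks off? As long as the old tasks are still running, swap back.
lb.swap()
print(lb.production)  # blue
```

Promotion and rollback are the same cheap operation here, which is why failing back takes seconds rather than a fresh deployment.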
While we’re talking anxiety, nothing has caused more insomnia among developers than the state of one’s infrastructure. (“It’s 10 pm—do you know where your servers are?”) We’ll discuss next how we’ve managed to reduce stress on that front.
IaC (“infrastructure as code”)
“Infrastructure as code” is an idea that has been around for a long time in one form or another. Think about all the wild things we have to do to get a single web server up and running from scratch. Maybe, for annoying legacy reasons, you need to edit /etc/hosts to include local machines in a cluster. You have to use yum or whatever to install all the right software, then add specific configurations to Apache or Nginx.
If you stand up enough servers, you’ll likely end up with your own runbook (mental or physical), but the steps, and all the variations thereof that you likely have to remember for edge cases, can be a lot to keep track of. The problem gets worse if you’re standing up, say, fifty web servers instead of just one. And what if you also needed to initialize remote databases and configure routers, cache servers, and search appliances? And what if you needed to do the same thing again in an isolated testing environment, too?
This is why the concept of “infrastructure as code” exists. The idea is to take all the crazy things we do to get infrastructure up and running and reduce them to abstracted, readable code that can be used again and again. And, if you’re feeling particularly jaunty, you keep that abstracted, readable code under version control so you can see exactly how it changes over time—and so you can roll back to a previous revision when something goes wrong.
Of course, as you might anticipate, changing our approach from “infrastructure as infrastructure” to “infrastructure as code” also requires changing our toolset. I’ve done plenty of work in the past with tools like Puppet, Ansible, and Chef (now Progress Chef, which doesn’t quite have the same ring to it), the last of which ran Ars Technica’s infrastructure for many years. While the AWS console is quite lovely to look at, and I do spend a great deal of time staring at it, I would not want to use it alone to configure a complex infrastructure. And even if a single environment might be manageable that way, that flies out the window when you add two, three, or a hundred more. That’s where IaC becomes a necessity: it allows you to create infrastructure in a repeatable way in the cloud or otherwise.
The tool we use at Ars to manage all our infrastructure is called Terraform. Anyone doing web development from the mid-2000s on has undoubtedly spent time with an excellent piece of software called Vagrant, which made dealing with different virtual machine providers like VirtualBox or Parallels a much simpler prospect. Simply run vagrant up with your configuration file, and voilà: no need to struggle with VirtualBox’s weird GUI. Vagrant was (and still is) a product from HashiCorp, so named for its co-founder Mitchell Hashimoto. And perhaps unsurprisingly, HashiCorp is also the creator of Terraform.
Terraform is a tool that takes a series of simple, descriptive configuration files, built in a language called HCL (HashiCorp Configuration Language), and turns them into instructions for erecting infrastructure on many different cloud providers (Alibaba Cloud, anyone?). One of the key features of Terraform is that this instruction set is idempotent, which means—practically speaking—you can execute it against an infrastructure and expect it to only make changes if you’ve really altered something, no matter how many times you run it. That’s a relief when small alterations can wreak havoc.
When you run the command terraform plan, you’ll be told precisely what resources will be created, destroyed, or modified in place before any commands are issued to AWS. This is incredibly useful when your cloud environment contains hundreds of resources. And indeed, in absolute terms, the Ars infrastructure has 295 managed AWS resources in it, despite the simplicity of the overview charts we’ve shared. Keeping track of those components and their interconnectedness using only the AWS console can be challenging, and it gets extremely challenging if you’ve got a hundred or more environments, as many DevOps pros do.
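The plan-then-apply idea, and the idempotence that goes with it, can be sketched in plain Python. To be clear, this is not how Terraform is implemented, and it is not HCL. The plan and apply functions below are toy stand-ins, and the resource names and settings are invented for the example.

```python
import copy


def plan(desired, actual):
    """Compare desired config to actual state and list what would change."""
    changes = []
    for name in sorted(set(desired) | set(actual)):
        if name not in actual:
            changes.append(("create", name))
        elif name not in desired:
            changes.append(("destroy", name))
        elif desired[name] != actual[name]:
            changes.append(("update", name))
    return changes


def apply(desired, actual):
    """Make `actual` match `desired` and return the changes that were needed."""
    changes = plan(desired, actual)
    actual.clear()
    actual.update(copy.deepcopy(desired))
    return changes


# Hypothetical resources: what the config files say vs. what exists in the cloud.
desired = {"vpc": {"cidr": "10.0.0.0/16"}, "ecs_service": {"desired_count": 4}}
actual = {
    "vpc": {"cidr": "10.0.0.0/16"},
    "ecs_service": {"desired_count": 2},
    "old_bucket": {},
}

print(plan(desired, actual))   # [('update', 'ecs_service'), ('destroy', 'old_bucket')]
apply(desired, actual)
print(plan(desired, actual))   # []
```

The second plan comes back empty because nothing has changed since the apply. That is the practical meaning of idempotent here: running it again is a no-op until someone edits the configuration.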
There are multiple ways to work with Terraform, and while we do use their CLI tool to validate code, we primarily use a managed application called Terraform Cloud to handle actual deployments. Much like the application CI/CD process described above, we have a very similar setup for managing infrastructure. Once again, it starts from a GitHub repository that stores all the configuration files that describe our infrastructure. From the VPC subnets to the scaling parameters on the serverless products we use, every detail is stored in a versioned repository.
Depending on the GitHub repository and branch we’ve pushed changes to, a webhook from GitHub triggers a planning process in Terraform Cloud. Terraform will automatically generate a set of proposed changes, which can be applied to your cloud environment instantly or manually. Like with production code deployments, we rely on a manual switch here because even with all the safeguards in the world, it’s possible to destroy your entire infrastructure with the click of a button. (I’m not going to say anyone named Lee has ever done this, but I’m not going to say anyone named Lee has ever not done this, either. It’s why we have backups, right?)
If Terraform HCL and template files aren’t your jam, there are many other tools to accomplish the same thing. Amazon has multiple options like CDK or CloudFormation if you like the visual approach. There’s another up-and-coming tool in Pulumi that I’ve really enjoyed working with. Like Terraform, Pulumi is a multi-cloud tool, but it allows you to write your infrastructure in a number of programming languages. There are lots of options out there!
Planning, documentation, and version-controlled code—when you come down to it, that’s what sits underneath the Ars Technica front page. All the ECS tasks and Lambdas and serverless Aurora databases in the world won’t do a thing to help you if you can’t direct them properly, and we’ve put a lot of time and effort into a setup that is both resilient and flexible—and one that can change and grow with us. Tying together CI/CD and IaC and implementing them in an API-driven cloud environment is a bit like having a magic wand that you can point at the sky and cause fully realized system designs to materialize directly from the cosmic aether—it’s a powerful operating methodology enabled by powerful tools.
“AWS offers customers a comprehensive set of CI/CD services to support their people and processes,” said AWS senior solutions architecture lead William Torrealba during an architecture discussion for this piece. “This allows our customers to easily and quickly implement CI/CD processes in a very cost-effective way without complex hardware management. But it is also flexible enough to extend the functionality using third party partners and tools.”
We’re small potatoes in the grand scheme of things, too. Sites like Netflix can see our monthly traffic volume in a single day (or less!), and similar CI/CD practices serve them at that scale just as well as they do for us. I can’t imagine going back to a time before these tools and processes were in place—at this point, it would be like abandoning fire. We’ve adapted our entire business around workflows enabled by these tools, and we wouldn’t be Ars Technica without them. (We would be remiss in not giving credit to former lead developer Steven Klein for helping to pull much of the Ars IaC project together. We miss you, Steve!)
Coming up next
Part four, dear readers, will be our final installment. We have a grab bag of topics to cover, including how we do DNS, a bit more about our content delivery network, and a short discussion on architectural decisions—there are lower-cost 64-bit ARM offerings available on AWS, and although moving parts of one’s infrastructure from x86-64 to ARM isn’t necessarily easy, we’ve done some preliminary investigation, and it’s not exactly difficult, either. What things lurk in our 64-bit future? Tune in next Wednesday for our series finale and see!