Scaling Cloud Infrastructure with Terraform

By Pratik Mallya, Julian Modesto, Aaron Sproul, and Joshua Kwan

In previous posts, we have talked about the technologies that enabled LiveRamp’s journey to the cloud. In this article, we describe in some detail a key enabler of that journey: our automation around Terraform.

Infrastructure as Code (IaC) automates infrastructure lifecycle management and makes infrastructure configuration discoverable. Having every team’s infrastructure configuration accessible company-wide makes it easy to share best practices and code, and to organize and track updates. Once a team sets up infrastructure, their work becomes even more valuable by being visible to anyone who wants to reuse it. IaC captures infrastructure toil in reusable code. What was once created manually by pointing and clicking, or with ad hoc shell scripts, is now abstracted away into a single source of truth.

Even when infrastructure is managed in code, spreading that code across multiple repositories leads to inconsistencies and unreliable IaC configurations, which reduces trust in these tools. Most of these problems can be traced to infrastructure drift, where the IaC no longer matches the infrastructure that it’s supposed to manage. Dedicated teams are then required to keep such systems functioning.

The problem specific to LiveRamp was how to scale infrastructure without the number of people that the projected maintenance toil would otherwise require. The solution was to enable self-service where possible. The focus on self-service created a desire to build tools that minimize the impact on developer productivity. If you build tools that people don’t want to use, only mandates can force their usage. Build tools that provide demonstrable value, and teams will want to adopt them.

Terraform is a mature platform that captures infrastructure configuration in static declarations. Changes to infrastructure happen via a structured process: at the end of a successful change, the infrastructure matches the declaration. Changes made to infrastructure that are not in code (e.g. by user modification through the UI) show up as drift during the plan stage. Terraform removes these out-of-band changes as it brings the actual infrastructure back in line with the config. The changed declaration is then committed to the version control system (VCS, e.g. Git) and thus tracked. The current “state” of infrastructure is also stored in a so-called “statefile”. If the change is not committed, the infrastructure is no longer accurately described by the configs. Thus, automation around Terraform is critical to its usefulness. In this post, we describe LiveRamp’s automation around Terraform and how it has scaled to meet our need to manage cloud infrastructure across multiple teams and time zones.

The first solutions we tried were:

Every team would manage Terraform in their repository of choice (i.e. wherever their apps already lived)
The infrastructure team would build and distribute “golden modules” (i.e. Terraform modules that had been built and tested by the infrastructure team)
Teams would use whatever workflow they liked for their repositories

In practice, this led to issues with infrastructure drift, as changes would be made to infrastructure without corresponding changes to the code. Every repo was structured differently, making it harder for the infrastructure team to assist quickly. Additionally, many teams didn’t want to devise their own workflow, but simply wanted a sane default.

Because we used Terraform fairly often, we used Atlantis to automate Terraform workflows for our own team. We organized all of the team’s Terraform configuration in a single repository. The ease of working with Atlantis greatly reduced toil and we realized it would be useful to have similar automation for every team.

Atlantis uses configured credentials when running Terraform. We wanted to give every team access to Atlantis without giving them access to the powerful credentials held on the Atlantis server. To enable this, we used the following process:

All infrastructure configurations live in a single centralized repo supervised by our team, the infrastructure monorepo
Every team received a CODEOWNERS-owned folder in the monorepo; any change to a team’s folder required a code review from that team
Every team defined a custom workflow for Atlantis; assignment of teams to workflows required a review by the infrastructure team
Each custom workflow used different credentials, limiting its permissions according to the team; thus team A’s infrastructure changes could not affect team B’s infrastructure. For example, for Google Cloud Platform (GCP) infrastructure, we used Terraform to create and install a different service account per team to limit permissions.

We now give an example of a concrete configuration. The overall organization of the “backend” team’s folder looks like so:

[Figure: Backend Team's Folder]

It can be seen that the team has different kinds of resources but all of them are defined under the team’s folder.

The CODEOWNERS configuration to give the backend team ownership over the folder would look like so:

backend/** @LiveRamp/backend

(where @LiveRamp/backend is the GitHub team representing the backend team). This configuration forces a review from members of the backend team for any changes made in the backend folder.

The Atlantis repo configuration for the backend team’s dev environment in GCP would look like:

version: 2
automerge: true
projects:
- name: backend-dev
  dir: backend/gcp/dev
  workspace: default
  workflow: workflow-backend
  terraform_version: v0.12.20

where the Atlantis workflow ‘workflow-backend’ is defined in an Atlantis server-side config and uses a GCP service account specific to the backend team.
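To make the server-side half of this setup more concrete, here is a minimal sketch of what such a config could look like. It is not our actual configuration: the repo ID and the credential file path are hypothetical, and it assumes the team's service account key is exposed to Terraform through the `GOOGLE_APPLICATION_CREDENTIALS` environment variable, set with an Atlantis `env` step.

```yaml
# Server-side repo config, passed to the server with `--repo-config`.
# Sketch only: the repo ID and credential path are hypothetical.
repos:
- id: github.com/LiveRamp/infrastructure-monorepo   # hypothetical repo name
  # atlantis.yaml may select a workflow by name...
  allowed_overrides: [workflow]
  # ...but may not define new workflows of its own.
  allow_custom_workflows: false

workflows:
  workflow-backend:
    plan:
      steps:
      # Point the Google provider at the backend team's service account key.
      - env:
          name: GOOGLE_APPLICATION_CREDENTIALS
          value: /etc/atlantis/credentials/backend.json   # hypothetical path
      - init
      - plan
    apply:
      steps:
      - env:
          name: GOOGLE_APPLICATION_CREDENTIALS
          value: /etc/atlantis/credentials/backend.json   # hypothetical path
      - apply
```

Because the workflow, and therefore the credential, is chosen by name in the repo-level config, the choice of workflow for each project still needs to be reviewed, which is why assignment of teams to workflows goes through the infrastructure team.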

The benefits and lessons learned from this process are as follows:

Discoverability: every team’s infrastructure is visible, including their best practices. This allows teams to reuse infrastructure code, and also to help each other when debugging infrastructure issues.
Auditability: complex infrastructure changes, which were previously hard to correlate from cloud audit logs, now leave an audit trail as changes captured in VCS commits.
Scalability: new teams get their own folder and workflow in the monorepo; adding new teams is easy and does not interfere with others.
Self-Service Infrastructure: the monorepo’s self-governance using CODEOWNERS reduces the gating that an infrastructure team needs to exercise over the creation of infrastructure. Developer teams are empowered to provision the infrastructure they need, requiring only their own team’s approval. This reduces the overall time spent to get to usable infrastructure and allows fast iteration within a team.
Infrastructure Quality: code review improves the quality of application code; it does so for infrastructure code as well. Additionally, static analysis can be cheaply performed on IaC to suggest improvements for efficiency and security.
Universality: Terraform is the common language of infrastructure management. Terraform resources exist for every major cloud platform and also for provisioning resources not canonically thought of as infrastructure (e.g. PagerDuty monitors, Datadog dashboards).
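As an illustration of the scalability point, onboarding a new team could amount to adding one more project entry to the repo-level Atlantis config, alongside a new folder and CODEOWNERS line. The sketch below assumes a hypothetical “frontend” team; the project name, directory, and workflow name are illustrative and not taken from our repository.

```yaml
version: 2
automerge: true
projects:
# Existing entry for the backend team.
- name: backend-dev
  dir: backend/gcp/dev
  workspace: default
  workflow: workflow-backend
  terraform_version: v0.12.20
# New entry for a hypothetical "frontend" team (all names illustrative).
- name: frontend-dev
  dir: frontend/gcp/dev
  workspace: default
  # Needs a matching workflow in the server-side config,
  # bound to that team's own credentials.
  workflow: workflow-frontend
  terraform_version: v0.12.20
```

The existing backend entry is untouched, which is what keeps one team's onboarding from interfering with another's.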

Today at LiveRamp, we use a monorepo with Atlantis to manage the majority of infrastructure automation. While we started with GCP resources, Terraform providers are available for most services. Thus, we also use it to manage GitHub repos, Okta configs, etc. Any resource that can be managed with Terraform can be included in the monorepo.

Some stats on usage:

>65% of engineers across the company have contributed to the repo
~10 commits every working day
200+ Terraform workspaces within the repository

Every tool comes with its challenges; we mention them here both as challenges that we wish to address and as something that infrastructure teams should be aware of when using this process:

Unfamiliarity with Terraform: teams that are unfamiliar with Terraform have a hard time navigating the repository and contributing. This problem is especially pronounced when dealing with complex state operations.
Stale Master Branch: changes must be brought up to date with the master branch before being applied; this has led to issues when a number of teams try to apply their changes at the same time (GitHub issue)
Lack of simple validation: many resources will `plan` successfully but fail on `apply`, leading to much developer frustration

Our immediate next concern is to make the system easier to use for developers who are unfamiliar with Terraform by adding more validations around the Terraform plan stage (e.g. with GoogleCloudPlatform/terraform-validator); this would increase confidence in the accuracy of Terraform code. Another area of concern is merge conflicts: every time a PR is merged, other PRs are required to pull in the changes from master. While this was a minor annoyance at first, it has become a major concern as more users have started using the repo, and it is something we seek to address.
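One way such plan-stage validation could be wired in is with extra `run` steps in a team's server-side workflow. This is a sketch of the idea rather than something we have deployed: the wrapper script `/usr/local/bin/validate-plan.sh` is hypothetical (it stands in for a call to a validator such as terraform-validator), and the example assumes the `terraform` binary on the server's PATH matches the project's Terraform version.

```yaml
workflows:
  workflow-backend:
    plan:
      steps:
      - init
      - plan
      # Render the binary plan as JSON so that a policy tool can read it.
      # Assumes `terraform` on PATH matches the project's Terraform version.
      - run: terraform show -json $PLANFILE > plan.json
      # Hypothetical wrapper around a validator; a non-zero exit code
      # fails the plan stage and surfaces the error on the pull request.
      - run: /usr/local/bin/validate-plan.sh plan.json
```

Failing at plan time, rather than at apply time, is the point: the developer sees the problem before the change is approved and merged.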

We hope that what we have learned here helps you organize IaC in your own organization. We plan to lean on our current architecture to continue building our global infrastructure management process in a scalable, visible, and auditable manner.

We would like to acknowledge the efforts of LiveRamp Engineering in the success of this project. Our early adopters were instrumental in informing us of the pain points involved in using the tooling and in helping us to continuously improve and refine the process to make it easier to use. Special thanks to: Peter Hu, Harrison Wang, Aditya Sarang, Renaud Pere, and Forest Gagnon.
