Migrating a Big Data Environment to the Cloud, Part 2

Starting the journey

In the last post, we discussed why we were migrating to the cloud, and to GCP in particular. Once we knew we were migrating, we started by asking ourselves three questions:

What will our cloud architecture look like on Day 1? We know there’s a lot of exciting stuff we could do in the cloud — but what do we want our MVP to look like?

How do we get there? Building from scratch in the cloud is “easy”. Getting our infrastructure there without disruption? Not as much.

What do we want our environment to look like a year later? We accept that our infrastructure isn’t going to be perfect on day 1, and that’s fine. But we want to set ourselves up for success later.

The first question is what I’ll discuss in more detail today.

MVP Architecture

Asking a development team to migrate into the cloud is already hard. Asking them to re-engineer their applications while they do it introduces a huge amount of uncertainty into the process. Where there weren’t drop-in replacements for our infrastructure on GCP, we generally just lifted and shifted.

That said, some of the offerings on GCP were so compelling, and provided a direct enough translation of our infrastructure, that we felt it was appropriate to switch while migrating.

So first, what did we not change?

Our on-premise environment had a single logical internal network. Internal services communicated via private IPs, mostly orchestrated via HashiCorp Consul. We felt it was critical to keep this the same for app teams, at least for the duration of the migration. By using a Cloud Interconnect and a Shared VPC Network, we gave our developers a cloud which acted just like an extension of our datacenter.

Our big data processing (the ETL and joining pipeline that is the core of what LiveRamp delivers) happens on Cloudera Hadoop. That isn’t changing, at least on day 1.

The focus of this post is not on our security and data privacy decisions, but they were part of everything we did. Our ops team remains in control of data permissioning and network rules. The cloud empowers developers, but it also empowers them to make really stupid decisions. It’s easiest to develop quickly and safely when you know you can’t accidentally leak customer data.

So what are we changing? A lot, but I’ll focus on three technologies:

On-premise, we naturally stored all our persistent data on Hadoop HDFS. While our HDFS cluster was actually a very well-oiled machine by the time we started this migration, it was stressful to maintain and upgrade without downtime or interruption. As our company grew, the downtime windows we could negotiate with the product team got shorter and shorter, until upgrading became functionally impossible. We knew we wanted to use GCS so we could stay nimble as a dev team.

On-premise, we used Chef to provision all of our VMs. We had a lot of logic built into Chef, and we did actually try to provision VMs in the cloud with Chef, but it went poorly. Our experiences with Docker and Kubernetes have been great, so we are completely decommissioning Chef in our new environment.

Last but not least, we decided that Google Cloud Bigtable was an easy drop-in replacement for our homegrown key-value datamarketplace. It’s a bit sad to decommission a tool we’ve used for so long, but it lets us focus on new and exciting challenges.

Hadoop in the cloud

Our Hadoop infrastructure, both what it was and what it is now, will be the topic of future blog posts. Here I’ll just cover the basics of what we had and what we built on GCP.

A highly simplified view of our on-premise Hadoop cluster looks like this:

For simplicity, this diagram omits JournalNodes, ZooKeeper and Cloudera management roles. The important notes are that this production cluster:

Was shared between development teams
Ran on physical nodes which did not scale with load
Launched jobs from gateway Virtual Machines
Stored all data on a federated HDFS (4 HA NameNode pairs)
Has run continuously (except for upgrades) since 2009

HDFS is without question the most scalable on-premise filesystem, but it has drawbacks compared to cloud-native object storage (S3, GCS). Namely, if you destroy your instances, you lose the data (unless you also do Persistent Disk bookkeeping). When designing our cloud clusters, we knew we wanted:

Long-lived but ephemeral clusters (Problems? Blow the cluster away and start over)
All important data on GCS
Separate clusters per application team
Fast autoscaling
Jobs launched from GKE

So, we ended up with something like this:

Lots of lines, but I’ll break down the changes:

Each application team gets its own cluster, running in its own subnet
Jobs are launched from GKE, not from VMs. Each pod contains only one application — no more manual bin-packing of applications onto VMs!
HDFS still exists, but only vestigially — YARN uses HDFS for JobConf storage, the DistributedCache and application logs. But all application data is stored on GCS
Since HDFS barely does any work, we only need a couple of DataNodes. Most worker nodes are NodeManager-only. These can quickly scale up and down with application load
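To make the "one application per pod" idea concrete, here is a minimal sketch of a Kubernetes Job manifest built in Python. It is an illustration only, not our actual launcher: the function name build_job_manifest, the image URL, the per-team namespace, and the resource requests are all hypothetical. Only the manifest fields themselves (batch/v1 Job, restartPolicy, backoffLimit) are standard Kubernetes.

```python
import json


def build_job_manifest(app_name: str, image: str, team: str) -> dict:
    """Build a Kubernetes batch/v1 Job that runs a single application in a single pod.

    All argument values are illustrative; nothing here reflects a real deployment.
    """
    labels = {"app": app_name, "team": team}
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {
            "name": f"{app_name}-launcher",
            # Assumption: each application team has its own namespace.
            "namespace": team,
            "labels": labels,
        },
        "spec": {
            # Assumption: a failed job launcher should not be retried automatically.
            "backoffLimit": 0,
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "restartPolicy": "Never",
                    # Exactly one container: one application per pod.
                    "containers": [
                        {
                            "name": app_name,
                            "image": image,
                            # Hypothetical sizing; the scheduler does the bin-packing.
                            "resources": {
                                "requests": {"cpu": "500m", "memory": "1Gi"}
                            },
                        }
                    ],
                },
            },
        },
    }


if __name__ == "__main__":
    manifest = build_job_manifest(
        "example-etl", "registry.example.com/example-etl:latest", "team-a"
    )
    # kubectl accepts JSON as well as YAML, so this output could be piped to
    # `kubectl apply -f -`.
    print(json.dumps(manifest, indent=2))
```

Because each application declares its own resource requests, the Kubernetes scheduler decides where the pod lands, which is what replaces the manual bin-packing onto gateway VMs.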

I’ll discuss this in more detail in a separate post, but the main takeaway — ephemeral data-less infrastructure has let us iterate on configurations and machine types 1000x faster than we could on physical machines.
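As a rough illustration of why NodeManager-only workers are easy to scale, here is a sketch of a sizing rule in Python. This is not the autoscaler we run, and it is not a Cloudera or GCP API: the function name, the inputs, and the min/max bounds are all assumptions. The point is only that when workers hold no data, the target size can be a pure function of current demand.

```python
import math


def target_node_managers(
    pending_memory_mb: int,
    allocated_memory_mb: int,
    memory_per_node_mb: int,
    min_nodes: int = 2,
    max_nodes: int = 100,
) -> int:
    """Return how many NodeManager-only workers would cover current YARN demand.

    Hypothetical sizing rule: demand is memory already allocated to containers
    plus memory requested but not yet scheduled. The bounds are made-up defaults.
    """
    if memory_per_node_mb <= 0:
        raise ValueError("memory_per_node_mb must be positive")
    if min_nodes > max_nodes:
        raise ValueError("min_nodes must not exceed max_nodes")

    demand_mb = pending_memory_mb + allocated_memory_mb
    # Round up so that a partially used node still counts as a whole node.
    needed = math.ceil(demand_mb / memory_per_node_mb)
    # Clamp to the configured floor and ceiling.
    return max(min_nodes, min(max_nodes, needed))


if __name__ == "__main__":
    # An idle cluster shrinks to the floor; a busy one grows, up to the ceiling.
    print(target_node_managers(0, 0, 65536))
    print(target_node_managers(300_000, 200_000, 65536))
```

A rule like this only works because the workers are stateless. Removing a DataNode means re-replicating its blocks first, while removing a NodeManager-only node means, at worst, rerunning the containers that were on it.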

These decisions gave us a starting point for our migration. In our next post, we’ll discuss the logistics of our migration, with an emphasis on how we handled data replication given limited throughput.
