Migrating a Big Data Environment to the Cloud, Part 3

Engineering


How do we get there?

In part 2 of this series, we discussed what we wanted our cloud MVP to look like.  The next question was: how do we get there without turning the company off for a month?

We started with what we knew we needed.  For at least a few months, our application teams needed to be able to split internal services across GCP and our data center, so we needed a single logical network between the two.  We spun up a Cloud Interconnect between us-central1 and our data center to bridge our environments:


To avoid dumping our entire company into a single GCP project, we used a Shared VPC network, and each team got a subnetwork in their own project.  This let us spin up services in GCP that communicated in real time with our data center services.  Likewise, we didn’t need any crazy database replication setups: services in GCP could communicate with databases still on-premise.
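The post doesn’t show the exact commands, but the Shared VPC layout described above looks roughly like this with the gcloud CLI.  The project, network, and range names here are hypothetical placeholders, not our real ones:

```shell
# Sketch of a Shared VPC setup: one host project owns the network,
# and each team's service project gets its own subnetwork.
# All names below are illustrative.

# Enable Shared VPC on the host project that owns the network
gcloud compute shared-vpc enable host-networking-project

# Attach a team's service project to the host project
gcloud compute shared-vpc associated-projects add team-a-project \
    --host-project host-networking-project

# Carve out a subnetwork for that team in the shared network
gcloud compute networks subnets create team-a-subnet \
    --network shared-vpc-network \
    --region us-central1 \
    --range 10.1.0.0/20 \
    --project host-networking-project
```

With Cloud Interconnect advertising routes between this network and the data center, a VM in `team-a-subnet` can reach on-premise services directly, which is what makes the single logical network possible.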

This shared logical network was absolutely critical to being able to migrate incrementally.  If our service web couldn’t bridge our networks, this migration would have been impossible.

Data Transfer

This brings us to one of our core challenges: dealing with the limited egress we had from our data center to GCP.  Within our data center we have enormous amounts of bandwidth, but we had never needed to build infrastructure optimized for massive data egress; we were limited to a modest 50 Gbps to GCP.  We could have upgraded this, but we did not want to spend millions of dollars upgrading a data center we were actively decommissioning.

50 Gbps was a reasonable number for our input and output data delivery.  The real challenge came in the middle of our data pipeline: the distributed joins there produced wildly larger amounts of data than we send or receive.  If we weren’t careful about which teams migrated when, we could easily saturate our connection.
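To put the constraint in concrete terms, here is the back-of-envelope arithmetic for a 50 Gbps link (using decimal units, as link speeds are quoted):

```python
# How much data a 50 Gbps interconnect can move, and how long a
# one-off bulk copy would take if it saturated the link completely.

LINK_GBPS = 50
SECONDS_PER_DAY = 24 * 60 * 60

# Bits per day -> terabytes per day
bits_per_day = LINK_GBPS * 1e9 * SECONDS_PER_DAY
tb_per_day = bits_per_day / 8 / 1e12
print(f"{tb_per_day:.0f} TB/day at full saturation")  # 540 TB/day

# Days to move one petabyte if the link did nothing else
pb_bits = 1e15 * 8
days_for_pb = pb_bits / (LINK_GBPS * 1e9) / SECONDS_PER_DAY
print(f"{days_for_pb:.1f} days per PB")
```

Roughly half a petabyte per day sounds like a lot, until a mid-pipeline join fans out several times more intermediate data than the raw inputs, and until you remember that the same link also carries live service and database traffic.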

A note on transfer devices: transfer devices don’t work well for us because only a small portion of the data we process is cold data; since we were migrating incrementally, we would end up moving a much larger amount of data than we actually have at any snapshot in time.  Plus, cloud providers price like drug dealers: data ingress is free, but if you ever want to move data out of the cloud, egress for large volumes of data is expensive.

To keep things running smoothly, we needed a service which could:

  • Prioritize data transfers
  • Avoid copying the same file to GCS multiple times
  • Provide visibility into bandwidth utilization
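The three requirements above fit together naturally in a single scheduler.  This is not LiveRamp’s actual Data Replicator (that’s the next post), just a minimal sketch of the idea, with names of our own invention:

```python
import heapq
from dataclasses import dataclass, field

# Sketch of a prioritized, dedup-aware transfer queue: illustrative
# only, not the real Data Replicator service.

@dataclass(order=True)
class TransferRequest:
    priority: int                        # lower = more urgent (production first)
    path: str = field(compare=False)     # source file path
    size_bytes: int = field(compare=False)

class ReplicationQueue:
    def __init__(self):
        self._heap = []
        self._already_copied = set()   # dedup: never ship the same file twice
        self.bytes_by_priority = {}    # visibility: where the bandwidth went

    def submit(self, req: TransferRequest):
        if req.path in self._already_copied:
            return  # avoid copying the same file to GCS multiple times
        heapq.heappush(self._heap, req)

    def next_transfer(self):
        """Pop the most urgent transfer that hasn't already been copied."""
        while self._heap:
            req = heapq.heappop(self._heap)
            if req.path in self._already_copied:
                continue  # a duplicate snuck into the heap before the copy ran
            self._already_copied.add(req.path)
            self.bytes_by_priority[req.priority] = (
                self.bytes_by_priority.get(req.priority, 0) + req.size_bytes)
            return req
        return None
```

A production SLA transfer submitted with priority 0 always jumps ahead of a cold-data backfill at priority 9, resubmitting an already-copied file is a no-op, and `bytes_by_priority` is the kind of counter you would export to a metrics dashboard.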

A detailed description of our Data Replicator service will be the topic of the next post.  The short version is — we built an internal service which copies data in a prioritized order.  Teams were free to use the interconnect for service and database calls, but the data infrastructure team handled all file copies.

This let us prioritize production data transfers with short SLAs and de-prioritize cold data replication.  Just as critically, it gave us insight into where our bandwidth was going: instead of mucking around with tcpdump to figure out who was saturating the interconnect, we could just pull up a Datadog dashboard and check:

These constraints dictated the overall structure of our data migration.  A team at the end of our data pipeline would migrate an application to GCP.  That app would request that its inputs be mirrored from HDFS to GCS, and the data replication service handled those file copies.  Later, when the upstream application migrated to GCP, its output would exist natively on GCS, no copying needed:

While we’re quite proud of our Data Replication service, our hope is that we won’t need it for more than another 6 months — once all our data is natively on GCS, we won’t have anything we need to replicate.

Service migrations

One aspect of LiveRamp’s data-flow pipeline has been critical to helping us migrate without application disruption: the vast majority of our applications use a request-based, service-oriented architecture.

It is important to understand how a request-based architecture differs from a “traditional” Hadoop architecture.  A traditional Hadoop data pipeline is a series of (often daily) batch jobs which process entire datasets at once:

LiveRamp does not use this architecture.  Instead, data sets are processed as independent units:

Of course, running a hundred jobs instead of a single batch job requires more tracking infrastructure.  We use our open-source daemon_lib framework to coordinate service requests.  The request flow looks like this:

  • Service endpoints register themselves in Hashicorp Consul as available to serve requests
  • When a service client wants to submit a unit of work (for example, a newly imported file from a single customer), it finds a live server in Consul
  • The client submits the unit of work.  The server translates this into a request, which is inserted into a queue (queueing behavior differs per application, but the simplest version is a priority queue).
  • Daemon workers pull units of work from the queue.  In data applications, units of work are usually processed as Hadoop jobs

When migrating to GCP, application teams could provision workers on GKE which pulled from the same work queues as the on-premise workers (remember, we are working in a single logical network — workers all talk to the same database!):

Because daemon workers can decide for themselves which units of work they are able to process, teams could incrementally migrate a service to GCP:

  • First, by directing the GKE workers to only process test or staging data
  • Second, by directing GKE workers to only process a small percent of production work
  • Last, by scaling up the GKE worker replicas to 100% and decommissioning all on-premise workers
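One simple way to implement the middle step, routing a fixed percentage of production work to GCP deterministically, is hash-based bucketing.  The helper below is hypothetical (daemon_lib’s actual mechanism may differ), but it shows why ramping from 1% to 100% can be a config change rather than a redeploy:

```python
import hashlib

# Hypothetical routing helper: GKE workers claim a deterministic
# slice of production work based on a configured percentage.

def should_process_on_gcp(unit_id: str, gcp_percent: int) -> bool:
    """Deterministically route gcp_percent% of units to GCP workers.

    Hashing the unit id (rather than picking randomly) means the same
    unit always lands on the same side, so retries stay consistent.
    """
    digest = hashlib.sha256(unit_id.encode()).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return bucket < gcp_percent
```

At `gcp_percent=0` the GKE workers process nothing, at 100 they process everything, and dialing the number back instantly shifts work to the on-premise workers again, which is what made the rapid-rollback story below possible.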

By testing applications with staging data and then incrementally ramping up load, application teams could:

  • Gently hit GCP scaling problems (for example, hitting quotas we didn’t know existed, or exceeding our interconnect bandwidth)
  • Rapidly shift work back to our on-premise environment if problems could not be resolved within a couple of hours

This is not to trivialize the effort involved: even with the tools described here, the migration has been a huge undertaking for all of our application teams.  But thanks to our live data replication and service architecture, we’ve done it with a minimum of application downtime.

In the next post we will focus on the Data Replication service we briefly mentioned here, and discuss how we built a service which could handle both millions of requests and transfer petabytes of data during our 6-month migration.

If these challenges seem exciting, we’d love to talk — we’re always hiring great engineers!