Initial Release of Workflow2 – LiveRamp’s Big-Data Workflow Orchestrator

We are excited to announce the initial Open-Source release of Workflow2, LiveRamp’s big data pipeline orchestrator. Workflow2 has been developed internally for many years, and we are now releasing it to the big-data community.

Why do I need a pipeline orchestrator? Why can’t I just write code?

You can! But if your code launches a series of long-running tasks (for LiveRamp, this usually means big-data jobs like Spark or MapReduce), stringing those applications together in a main method has drawbacks. Running your code with a pipeline orchestrator provides a lot of features out of the box:

  • System visibility: Humans can look at a UI to quickly see the status of the application.
  • Restartability: If the application fails halfway through, you don’t want to end in an inconsistent state or lose progress.
  • Reusability: Sub-components of an application should be easy to share.
  • Analytics/history: DAG frameworks capture historical timing info and failure histories.
  • Data framework integration: Gather statistics from launched Hadoop applications and easily click through to their UIs.
  • Alerting: Get automatically alerted if the application fails or succeeds.

When a simple script turns into a long-running, multi-step application, it’s probably time to use a processing framework.

At any given time at LiveRamp, we are running 500-1000 independent data workflows, which run up to 1500 concurrent Hadoop applications — in a given day, over 100,000 data pipelines start and finish. We needed both to give engineers tools to track individual data applications and to aggregate statistics about our Hadoop applications across the organization. To solve these problems, we built Workflow2.

Workflow2 is heavily documented on GitHub, where the project is hosted, so this post will just touch on the highlights.

Workflow2 features

System visibility: It’s important to provide visibility into the state of a data processing pipeline for tracking and debugging. The Workflow2 UI makes it easy to see the status of current and past data applications.

Restartability: When running thousands of jobs a day, failures will happen; it’s important to be able to resume applications without losing progress. Workflow2 makes it easy to pick up data pipelines from the last point of failure.
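The resume-from-failure idea can be sketched in plain Java. This is a toy illustration, not Workflow2’s API: a hypothetical in-memory checkpoint store marks a step complete only after it succeeds, so a re-run skips finished work and resumes at the failed step (a real orchestrator persists the checkpoints in a database).

```java
import java.util.*;

public class CheckpointedRun {
    // Steps already marked complete survive across runs (in memory here;
    // a real orchestrator persists this state so restarts can resume).
    static Set<String> completed = new HashSet<>();
    static List<String> executed = new ArrayList<>();

    static void runStep(String name, Runnable body) {
        if (completed.contains(name)) return; // resume: skip finished work
        body.run();
        executed.add(name);
        completed.add(name); // checkpoint only after success
    }

    public static void main(String[] args) {
        // First run: step2 fails midway, so step3 never starts.
        try {
            runStep("step1", () -> {});
            runStep("step2", () -> { throw new RuntimeException("transient failure"); });
            runStep("step3", () -> {});
        } catch (RuntimeException e) {
            System.out.println("run 1 failed at step2");
        }
        // Second run: step1 is skipped; the pipeline resumes from step2.
        runStep("step1", () -> {});
        runStep("step2", () -> {});
        runStep("step3", () -> {});
        System.out.println(executed); // prints [step1, step2, step3]
    }
}
```

Each unit of work runs exactly once across both attempts, which is the property that matters when a “step” is an expensive Hadoop job.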

Reusability: Good data engineering means not re-inventing the same wheel over and over; sub-components of an application should be easy to re-use and share. Workflow2’s logical components are easily re-usable across different data applications.

Analytics/history: Workflow2 captures historical timing info and failure histories, to make it easy to track performance changes over time.

Data framework integration: Workflow2 gathers statistics from launched Hadoop applications and exposes links to their web UIs.

By tracking application counters, Workflow2 provides global views of resource utilization.

Hadoop-aware alerting: Workflow2 integrates deeply with Hadoop, and can alert about poorly performing or poorly optimized applications.

Alerting: Get automatically alerted when an application fails or succeeds.

All native Java: Workflow2 is written in Java, so data pipelines can be defined natively in Java alongside the JVM-based Hadoop applications they launch:

    // step3 runs only after step1 and step2 complete
    Step step1 = new Step(new NoOpAction("step1"));
    Step step2 = new Step(new NoOpAction("step2"));
    Step step3 = new Step(new WaitAction("step3", 180_000), step1, step2);

    // launch the DAG; run state is persisted so a failed run can resume
    WorkflowRunners.dbRun(
        SimpleWorkflow.class.getName(),
        HadoopWorkflowOptions.test(),
        dbHadoopWorkflow -> Sets.newHashSet(step3)
    );

High performance: Workflow2 is built for scale: it runs hundreds of thousands of applications per day, and individual applications may run hundreds of concurrent steps. Workflow2 also makes it easy to run concurrent instances of the same data pipeline.
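Running independent steps concurrently is the core scheduling trick in any DAG orchestrator. As an illustrative sketch (not Workflow2’s actual internals), the three-step DAG from the example above can be driven by a thread pool, launching each step as soon as every dependency has finished:

```java
import java.util.*;
import java.util.concurrent.*;

public class ConcurrentDag {

    static List<String> run() {
        // step3 depends on step1 and step2, which are free to run concurrently
        Map<String, List<String>> deps = Map.of(
            "step1", List.of(),
            "step2", List.of(),
            "step3", List.of("step1", "step2"));

        ExecutorService pool = Executors.newFixedThreadPool(4);
        Map<String, CompletableFuture<Void>> done = new HashMap<>();
        List<String> order = Collections.synchronizedList(new ArrayList<>());

        // iterate in topological order so each step's dependencies are
        // already registered before the step itself is scheduled
        for (String step : List.of("step1", "step2", "step3")) {
            CompletableFuture<Void> ready = CompletableFuture.allOf(
                deps.get(step).stream().map(done::get)
                    .toArray(CompletableFuture[]::new));
            done.put(step, ready.thenRunAsync(() -> order.add(step), pool));
        }
        done.get("step3").join(); // wait for the sink step
        pool.shutdown();
        return order;
    }

    public static void main(String[] args) {
        System.out.println(run()); // step3 is always last
    }
}
```

step1 and step2 may complete in either order, but step3 is only scheduled once both of their futures have resolved.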

Using and contributing

Workflow2 is fully documented on GitHub.

All Maven artifacts are published to Maven Central:

  <dependency>
    <groupId>com.liveramp.workflow2</groupId>
    <artifactId>workflow_hadoop</artifactId>
    <version>1.0</version>
  </dependency>

All monitor and UI components are published as containers on Docker Hub.

Workflow2 is under active development, and iterations of this framework have been used internally at LiveRamp for many years. LiveRamp has benefited massively from the Hadoop open-source ecosystem; our hope is that by contributing Workflow2 back to the community, we can help others build high-performance data workflows.

We’d love to hear feedback, bug reports, or suggestions; the best way to get in touch is to open an issue on the GitHub repository.

If building high-performance data applications and frameworks sounds exciting to you, remember we are always hiring passionate engineers.