
LiveRamp Clean Room Architecture Deep Dive: Interoperable, Flexible, and Actionable

  • Siddharth Sharma
  • 8 min read

Data collaboration is a cornerstone of modern business intelligence, allowing organizations to unlock insights from shared data sets while working responsibly with data. With the increasing adoption of data clean rooms, enterprises are accelerating their collaboration efforts to extract value from data and drive better business results.

At LiveRamp, we’ve developed an interoperable clean room collaboration platform that enables secure, scalable analytics and machine learning across AWS, GCP, Azure, Snowflake, and Databricks. Our platform’s fully interoperable design ensures portability and repeatability, enabling seamless operation across cloud environments without compromising functionality. It integrates with diverse cloud data sources, allowing organizations to analyze and share data without moving it – a zero-copy principle critical for security, compliance, and operational efficiency.

This model allows data partnerships to focus on business outcomes (insights, audience activation) while the platform seamlessly manages the underlying technology. Our customers’ product and engineering teams no longer need to worry about infrastructure management and can instead focus on the higher-level productization of their data and collaboration models.

Platform Architecture

Our platform is built on a multi-plane architecture, with a single control plane coordinating operations across multiple data planes. This design enables scalability and multi-cloud compatibility for secure data collaboration across a global footprint spanning US, EU, APAC, and LATAM.

Multiple data planes operate independently across different cloud providers (AWS, Azure, GCP, etc.) or regions within the same cloud, processing data locally to comply with data sovereignty requirements and to minimize latency. Each data plane is fully isolated, ensuring secure collaboration.

This architecture decouples the management and orchestration layer from the data processing layer, offering the following benefits:

  • Scalability: Data planes can be scaled independently based on regional or workload-specific demands.
  • Flexibility: Multiple data planes enable seamless multi-cloud support for diverse client requirements, so collaborators do not have to use the same provider in order to work together.
  • Security and governance: Isolation of data planes enables compliance with regional regulations and prevents data from leaving the local environment.

  • Fault tolerance: Distributed data planes ensure high availability, even in the case of localized failures.

Let’s now explore the key components of LiveRamp’s Clean Room architecture.

Control plane: The coordinator

The control plane serves as the central orchestrator, managing tenant configurations, user onboarding, query routing, and governance. It provides both a unified user interface and API access for users while maintaining strict control over policies and access. 

Control plane key responsibilities

  • Authentication and authorization: Enforces multi-tenant access controls, segregating data and permissions by organization. Manages identities and roles, integrating with identity providers (e.g., Okta) for SSO.
  • Tenant and user onboarding: Automates tenant onboarding, enabling quick clean room setup. Admins can configure data connections, privacy policies, and collaboration settings via a user-friendly UI.
  • Data collaboration and governance: Defines and enforces data access policies, query controls, and governance standards for clean room configurations.
  • Query orchestration and routing: Validates and routes SQL queries or Python scripts to the correct data plane, optimizing performance and ensuring compliance with resource constraints.
  • Privacy controls: Implements responsible data controls, such as aggregation thresholds and differential privacy. Prevents unauthorized PII access with data owner-defined constraints.
  • Monitoring and reporting: Aggregates logs and metrics from distributed data planes for performance monitoring and compliance reporting.
  • Integration APIs: Provides APIs for seamless integration, enabling programmatic control and embedding clean room features into existing workflows.
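
As an illustration of the integration APIs, here is a minimal sketch of what submitting a clean room query programmatically might look like. The base URL, endpoint path, payload fields, and identifiers are hypothetical placeholders, not the platform’s actual API surface.

```python
# Hypothetical sketch of submitting a clean room query through an integration API.
# Endpoint, payload fields, and identifiers below are illustrative placeholders.
import requests

API_BASE = "https://api.example-cleanroom.com/v1"   # placeholder base URL
TOKEN = "..."                                        # SSO/OAuth-issued token (placeholder)

payload = {
    "cleanRoomId": "cr-1234",                        # placeholder clean room identifier
    "queryType": "sql",
    "statement": """
        SELECT advertiser_segment, COUNT(*) AS matched_users
        FROM partner_a.crm_audience a
        JOIN partner_b.exposure_log b USING (ramp_id)
        GROUP BY advertiser_segment
    """,
    "tags": {"priority": "standard", "region": "eu-west-1"},
}

resp = requests.post(
    f"{API_BASE}/queries",
    json=payload,
    headers={"Authorization": f"Bearer {TOKEN}"},
    timeout=30,
)
resp.raise_for_status()
print(resp.json()["queryId"])   # placeholder field: an ID to poll for status/results
```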

Data plane: The executor

The data plane is where the actual processing happens. It operates in the client’s chosen cloud environment, ensuring data remains localized and secure.

Each data plane operates within a private virtual network in a region, spanning three availability zones (AZs) to ensure high availability and minimized downtime. These data planes execute data analysis and machine learning workloads on cloud-native infrastructure with Kubernetes (EKS, AKS, or GKE) serving as the compute engine for Apache Spark workloads. The cloud-native Spark on Kubernetes infrastructure is optimized for big data analytics, enabling Spark to process large data sets across distributed nodes. Kubernetes orchestrates the Spark jobs, dynamically scaling resources based on workload demand.
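
To make the Spark-on-Kubernetes model concrete, here is a hedged sketch of how a query workload could be expressed as a SparkApplication custom resource and handed to the Spark Operator using the Kubernetes Python client. The names, namespace, image, and resource values are illustrative assumptions, not the platform’s actual specs.

```python
# Illustrative sketch: expressing a clean room query as a SparkApplication custom
# resource and submitting it to the Spark Operator in an EKS/AKS/GKE cluster.
# Names, namespaces, images, and resource values are placeholders.
from kubernetes import client, config

spark_app = {
    "apiVersion": "sparkoperator.k8s.io/v1beta2",
    "kind": "SparkApplication",
    "metadata": {"name": "cleanroom-query-42", "namespace": "data-plane"},
    "spec": {
        "type": "Python",
        "mode": "cluster",
        "image": "example.registry/cleanroom-spark:3.5",      # placeholder image
        "mainApplicationFile": "local:///opt/jobs/run_query.py",
        "sparkVersion": "3.5.0",
        "driver": {"cores": 2, "memory": "4g", "serviceAccount": "spark"},
        "executor": {"cores": 4, "instances": 8, "memory": "16g"},
    },
}

config.load_kube_config()   # or config.load_incluster_config() when running inside the cluster
client.CustomObjectsApi().create_namespaced_custom_object(
    group="sparkoperator.k8s.io",
    version="v1beta2",
    namespace="data-plane",
    plural="sparkapplications",
    body=spark_app,
)
```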

Data plane key responsibilities

  • Query execution: Executes SQL queries and distributed computations (e.g., Spark jobs) using in-memory processing and partitioning for scalable performance.
  • Data confidentiality: Employs confidential computing (e.g., AMD SEV-SNP on Azure) to keep data encrypted in transit, at rest, and during use. Integrates Azure Key Vault for secure key management and Azure Attestation Service for trusted execution environment verification, ensuring controlled decryption.
  • Data source connectivity: Offers connectors for cloud storage (e.g., S3, GCS, ADLS), data warehouses (e.g., Snowflake, BigQuery), and table formats (e.g., Apache Iceberg, Delta Lake). Supports schema management via AWS Glue and Databricks Unity Catalog (see the sketch after this list).
  • Dynamic resource allocation: Auto-scales resources per query complexity, using spot instances for cost efficiency while isolating tenant workloads to ensure performance.
  • Local data processing: Processes data within the same cloud region to reduce egress costs and to ensure trustworthy data management. 
  • Privacy enforcement: Implements clean room privacy measures like noise injection and suppression for small data sets based on customer action.
  • High availability and fault tolerance: Ensures reliability with multi-AZ deployments and Kubernetes for workload distribution and node recovery.
  • Execution logs and metrics: Captures detailed query logs, Spark metrics, and utilization data, centralizing them via Prometheus and the Spark History Server for diagnostics.
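
The sketch below illustrates the data source connectivity and local processing described above, assuming a PySpark session reading a partitioned data set in place from object storage. The bucket, paths, and column names are placeholders.

```python
# Illustrative sketch of in-place reads with pushdown: Spark scans only the
# filtered partitions and selected columns at the source -- no copy into the
# clean room. Paths and column names are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("cleanroom-connectivity-sketch").getOrCreate()

# Read a partitioned data set directly from cloud object storage.
exposures = spark.read.parquet("s3a://partner-bucket/exposure_log/")   # placeholder path

daily_reach = (
    exposures
    .where(F.col("event_date") == "2024-06-01")      # predicate pushdown / partition pruning
    .select("campaign_id", "ramp_id")                 # projection pushdown
    .groupBy("campaign_id")
    .agg(F.countDistinct("ramp_id").alias("reach"))
)

daily_reach.show()
```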

Here’s how these key elements work together.

Collaboration between control and data planes

  • Query lifecycle (sketched below)
    • The control plane receives the query, validates it against policies, and routes it to the appropriate data plane.
    • The data plane executes the query, processes the results, and returns them to the control plane for further actions like visualization or export.
  • Secure communication
    • All interactions between the control and data planes occur over encrypted channels using secure protocols such as TLS.
    • Trusted IP CIDRs ensure only authorized traffic can traverse between the planes.
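
The following minimal sketch summarizes that lifecycle in code. The policy engine, data plane registry, and execution call are illustrative stand-ins supplied by the caller, not actual platform components.

```python
# Simplified skeleton of the query lifecycle between the planes.
# All objects passed in are illustrative stand-ins for platform services.
def run_clean_room_query(query, policy_engine, data_planes):
    # Control plane: validate the query against clean room policies before anything runs.
    spec = policy_engine.validate_and_build_spec(query)   # raises if the query violates policy

    # Control plane: pick the data plane that holds the data (region / cloud / tenant).
    plane = data_planes[spec.target_plane]

    # Data plane: execute locally; only policy-compliant, aggregated results come back,
    # transported over TLS and restricted to trusted IP CIDRs.
    results = plane.execute(spec)

    # Control plane: hand results to visualization, download, or export flows.
    return results
```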

The Query Gateway is the key component of the control plane that orchestrates query execution across the distributed data planes. Acting as the brain of the system, it seamlessly connects the control plane and data planes, ensuring efficient, secure, and dynamic query management.

Key responsibilities of the Query Gateway

  • Query routing: Routes queries to the appropriate data plane based on workload requirements, client configurations, and query tags (e.g., client ID, priority, data sensitivity); a simplified routing sketch follows this list.
  • Load balancing and throttling: Balances workloads across multiple data planes, ensuring even resource utilization and preventing overloading of any single plane.
  • Query validation and optimization: Analyzes incoming queries for correctness, enforces syntax rules, and applies optimizations like predicate pushdown and partition pruning for efficient execution.
  • Dynamic resource allocation: Allocates resources dynamically, adjusting compute and storage based on the complexity and requirements of each query.
  • Retry handling: Automatically retries failed queries, distinguishing between transient infrastructure issues and persistent errors with retries routed to alternate resources if necessary.
  • Metrics and monitoring: Collects and emits telemetry data for real-time visibility into query performance, resource utilization, and errors.
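
A simplified sketch of tag-based routing and load balancing is shown below. The data structures, constraints, and thresholds are assumptions for illustration, not the production routing logic.

```python
# Illustrative sketch of tag-based routing in a query gateway: choose a data plane
# from query metadata plus simple load signals. All structures are assumptions.
from dataclasses import dataclass

@dataclass
class DataPlane:
    name: str
    cloud: str            # "aws" | "azure" | "gcp"
    region: str
    tenant_mode: str      # "single" | "multi"
    active_queries: int
    max_concurrency: int

def route(query_tags: dict, planes: list[DataPlane]) -> DataPlane:
    # Keep only planes that satisfy the hard constraints carried by the query tags.
    candidates = [
        p for p in planes
        if p.region == query_tags["region"]
        and p.tenant_mode == query_tags.get("tenant_mode", "multi")
        and p.active_queries < p.max_concurrency          # throttle overloaded planes
    ]
    if not candidates:
        raise RuntimeError("no eligible data plane; queue or reject the query")
    # Least-loaded plane wins -- a simple form of load balancing.
    return min(candidates, key=lambda p: p.active_queries / p.max_concurrency)

plane = route(
    {"client_id": "acme", "region": "eu-west-1", "priority": "high"},
    [DataPlane("dp-eu-1", "aws", "eu-west-1", "multi", 12, 50),
     DataPlane("dp-eu-2", "gcp", "eu-west-1", "multi", 40, 50)],
)
print(plane.name)   # -> dp-eu-1
```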

Query execution flow

  • Query submission and spec generation: Users submit queries to the Query Gateway, which validates them against privacy and governance policies. A query specification (spec) is generated, detailing the target data plane, required resources, and metadata (e.g., client ID, priority, workload type) for proper routing.
  • Airflow DAG triggering: The platform uses Apache Airflow to manage query lifecycles. The Query Gateway triggers an Airflow DAG, dynamically generating a Spark Application CRD spec with configurations for workload, resource scaling, and instance types.
  • Spark operator deployment: The CRD spec is submitted to the Spark Operator in the data plane (EKS, AKS, or GKE). The operator deploys Spark applications, allocating resources and scaling dynamically based on workload complexity.
  • Execution in the data plane: The Spark Operator schedules and executes jobs across Kubernetes nodes, leveraging the data plane’s Spark infrastructure for dynamic resource scaling.
  • Monitoring & failure handling: Custom Spark Sensors in Airflow monitor job status, tracking progress and identifying issues. Failures trigger retries or escalate for manual intervention. Timeouts prevent indefinite delays, and the Query Gateway sanitizes error messages for clarity, enabling self-healing.
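
The sketch below shows the general orchestration pattern with Airflow’s Spark-on-Kubernetes operator and sensor, assuming the CRD spec has already been rendered to a file. DAG IDs, namespaces, file paths, and connection IDs are placeholders, and the exact operator interfaces vary by provider version.

```python
# Hedged sketch of the orchestration pattern: an Airflow DAG submits a
# SparkApplication to the Spark Operator, then a sensor watches it to completion.
# IDs, paths, and connection names are placeholders.
import pendulum
from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.spark_kubernetes import (
    SparkKubernetesOperator,
)
from airflow.providers.cncf.kubernetes.sensors.spark_kubernetes import (
    SparkKubernetesSensor,
)

with DAG(
    dag_id="cleanroom_query_execution",
    start_date=pendulum.datetime(2024, 1, 1, tz="UTC"),
    schedule=None,          # triggered on demand by the query gateway
    catchup=False,
) as dag:
    # Submit the dynamically generated SparkApplication CRD spec to the data plane.
    submit = SparkKubernetesOperator(
        task_id="submit_spark_application",
        namespace="data-plane",
        application_file="specs/query-42-sparkapp.yaml",   # rendered per query
        kubernetes_conn_id="data_plane_k8s",
    )

    # Poll the Spark Operator until the application succeeds or fails.
    watch = SparkKubernetesSensor(
        task_id="watch_spark_application",
        namespace="data-plane",
        application_name="cleanroom-query-42",             # matches metadata.name in the spec
        kubernetes_conn_id="data_plane_k8s",
        attach_log=True,
    )

    submit >> watch
```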

Interacting with output results

The platform provides secure, flexible options for accessing aggregated query results, ensuring trustworthy usability:

  • Differential privacy: Aggregated results are protected with differential privacy to prevent re-identification of individual data points while enabling insightful analysis.
  • Access via Control Plane: Results are accessible through:
    • Native UI charting: Built-in charting tools for customizable visualizations.
    • Downloads: Direct result downloads for offline analysis.
    • Exports: Seamless export to S3, GCS, ADLS, Snowflake, or BigQuery for further workflows.
    • BI tool integration: Embedded tools for creating interactive dashboards and reports.
  • Zero-copy querying with DuckDB: DuckDB enables efficient, in-memory querying of results directly within the control plane, reducing data transfer costs and accelerating analysis.
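
Here is a minimal sketch of that zero-copy access pattern, assuming aggregated results have landed as Parquet files; the result path and column names are placeholders.

```python
# Minimal sketch of zero-copy result access: DuckDB queries the aggregated
# result files in place, without loading them into a separate warehouse first.
import duckdb

con = duckdb.connect()   # in-memory database

top_segments = con.execute(
    """
    SELECT advertiser_segment, SUM(matched_users) AS total_matched
    FROM read_parquet('results/query-42/*.parquet')   -- placeholder result path
    GROUP BY advertiser_segment
    ORDER BY total_matched DESC
    LIMIT 10
    """
).df()   # returns a pandas DataFrame ready for charting or export

print(top_segments)
```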

Performance efficiency and optimization

The platform balances high performance and optimization through dynamic scaling, efficient query execution, and resource management:

  • Automatic cluster scaling: Utilizes Karpenter (AWS) and Cluster Autoscaler (Azure, GCP) for Kubernetes cluster scaling. A mix of spot and on-demand instances maximizes cost-efficiency, with spot instances for fault-tolerant Spark executors and on-demand for stable Spark drivers. Instance types (Graviton, AMD/Intel) are dynamically selected for price-performance optimization, and NVMe SSDs and Persistent Volumes are used for high-throughput, low-latency storage.
  • Federated querying and zero-copy principle: Ensures data remains at the source (S3, GCS, BigQuery, Snowflake) with predicate and projection pushdown and partition pruning, minimizing data movement and reducing costs.
  • Dynamic resource allocation: Allocates Spark resources (memory, CPU, or scratch space) based on query requirements. The Query Gateway dynamically adjusts resource allocation in real time.
  • Predefined warehouse capacities: Offers predefined warehouse sizes (S, M, L, XL) based on workload complexity, optimizing resource allocation and preventing over-provisioning; an illustrative sizing sketch follows this list.
  • Intra-AZ optimization: Schedules Spark jobs within the same availability zone (AZ) to minimize network latency and inter-AZ data transfer costs, improving performance and reducing charges.
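
The sketch below illustrates how predefined warehouse sizes might map to Spark resource settings; the tier values are assumptions for illustration, not the platform’s actual sizing.

```python
# Illustrative mapping of warehouse sizes to Spark resource settings.
# The concrete numbers are assumptions, not the platform's actual tiers.
WAREHOUSE_SIZES = {
    "S":  {"spark.executor.instances": "4",  "spark.executor.cores": "2", "spark.executor.memory": "8g"},
    "M":  {"spark.executor.instances": "8",  "spark.executor.cores": "4", "spark.executor.memory": "16g"},
    "L":  {"spark.executor.instances": "16", "spark.executor.cores": "4", "spark.executor.memory": "32g"},
    "XL": {"spark.executor.instances": "32", "spark.executor.cores": "8", "spark.executor.memory": "64g"},
}

def spark_conf_for(size: str, dynamic: bool = True) -> dict:
    """Build a Spark configuration for the requested warehouse size."""
    conf = dict(WAREHOUSE_SIZES[size])
    if dynamic:
        # Let Spark scale executors up and down within the tier's ceiling.
        conf.update({
            "spark.dynamicAllocation.enabled": "true",
            "spark.dynamicAllocation.shuffleTracking.enabled": "true",
            "spark.dynamicAllocation.maxExecutors": conf["spark.executor.instances"],
        })
    return conf

print(spark_conf_for("M"))
```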

These strategies keep Spark workloads fast and reliable while making efficient use of resources and controlling cost.

Operational efficiency and data plane provisioning

The platform enhances efficiency with automated data plane provisioning for rapid deployment and scalability. Here’s how:

  • Provisioning with Terraform
    Automates infrastructure setup in under a day, including virtual networks, Kubernetes clusters, and storage integrations. Configurations are tailored for:

    • Single-tenant: Dedicated environments for isolated, compliant use.
    • Multi-tenant: Shared environments for secure collaboration with logical isolation.
  • Dynamic query routing via Query Gateway
    The Query Gateway tags queries with metadata (e.g., client ID, workload type, priority) and routes them to the appropriate single-tenant or multi-tenant data plane, ensuring scalability and resource optimization.

Monitoring and profiling Spark Jobs

The platform ensures optimal performance of Spark jobs with comprehensive monitoring and profiling through the following methods:

  • Prometheus integration: Spark events are captured and sent to Prometheus for detailed performance metrics, visualized in Grafana for real-time monitoring and issue identification (a configuration sketch follows this list).
  • Spark history server: Aggregates all Spark events, offering a central location for job profiling and analysis.
  • Airflow and alerting: Airflow detects query failures, enriching alerts with error logs and execution data, and routes them to Slack via Robusta for rapid response. Automated mitigations like retries or resource adjustments are triggered to minimize downtime.
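
The sketch below shows one way to wire a Spark session for this kind of monitoring, using Spark’s built-in Prometheus endpoints and the event log consumed by the Spark History Server. The application name and event log path are placeholders, and the platform’s exact wiring may differ.

```python
# Hedged sketch: enable Spark's built-in Prometheus endpoints and event logging
# so metrics can be scraped into Prometheus/Grafana and jobs profiled in the
# Spark History Server. Standard Spark 3.x settings; paths are placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("cleanroom-metrics-sketch")
    # Expose driver/executor metrics in Prometheus format.
    .config("spark.ui.prometheus.enabled", "true")
    .config("spark.metrics.conf.*.sink.prometheusServlet.class",
            "org.apache.spark.metrics.sink.PrometheusServlet")
    .config("spark.metrics.conf.*.sink.prometheusServlet.path",
            "/metrics/prometheus")
    # Write the event log so the Spark History Server can aggregate and profile jobs.
    .config("spark.eventLog.enabled", "true")
    .config("spark.eventLog.dir", "s3a://example-bucket/spark-events/")   # placeholder path
    .getOrCreate()
)
```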

Handling high-volume, complex queries

The platform is designed to handle terabytes of data and can execute hundreds of queries per hour, with each query processing billions of rows. Spark jobs are optimized for high-volume data operations, ensuring efficient resource allocation during complex queries that involve modeling and machine learning algorithms.

Conclusion

Building cloud-agnostic distributed data planes across multiple cloud providers is a complex challenge due to the distinct architectures, APIs, and operational models of each cloud. LiveRamp’s platform design accounts for these differences in networking, security, resource scaling, and data management across each cloud environment, creating an interoperable and scalable solution. 

If you’re ready to get started, reach out to one of our experts.