Deploying the OpenTelemetry (OTEL) Agent to Your GCE Instances

  • Ryan Sun
  • 5 min read

In the dynamic landscape of cloud infrastructure, we’ve made significant strides in orchestrating observability on Kubernetes clusters using Helm-managed Terraform modules for OpenTelemetry agent deployment. However, a critical gap persists—there is no parallel approach for installing OpenTelemetry agents on Google Compute Engine (GCE) instances, whether they are running Windows or Linux.

While Terraform modules prove effective in enhancing observability within Kubernetes, extending this capability to individual GCE instances, such as those running JupyterHub, Citrix, and Tableau, is still a challenge. To tackle this issue, we suggest a Terraform-based approach for installing the OTEL agent on GCE.

The Solution: Leveraging Terraform + VM Manager

We chose Terraform + VM Manager over Ansible, prioritizing ease of use and a gentler learning curve.

Terraform + VM Manager
Pros:
  • Native GCP Integration: GCP VM Manager is designed specifically for managing virtual machines on Google Cloud, providing a seamless integration with GCP resources.
  • Consistency: Ensures consistent infrastructure deployment across different projects and environments.
  • Learning Curve: We aim to empower the Dev team to take ownership of installations, so we want to minimize any steep learning curves.
Cons:
  • Limited Configuration Management: Terraform focuses more on infrastructure provisioning than configuration management, which might be a limitation for complex configuration tasks.
  • Lack of Cross-Cloud Compatibility: It’s specifically designed for managing VMs on GCP and may not be suitable for environments that span multiple cloud providers.

Terraform + Ansible
Pros:
  • Configuration Management: Ansible excels in configuration management, making it suitable for more complex setup and fine-grained control over configurations.
  • Multi-Cloud Support: Ansible is cloud-agnostic, allowing you to manage instances across multiple cloud providers with the same playbook.
Cons:
  • Additional Setup: Setting up an Ansible server introduces an additional component, which might increase complexity and potential security risks.
  • Learning Curve: Ansible might have a steeper learning curve for those not familiar with its playbook syntax and concepts.
  • Security: Ansible may raise security concerns due to the need for the Ansible control nodes. Unauthorized access to these nodes can lead to potential security breaches.

Choosing VM Manager confines this approach to GCP. Fortunately, this doesn’t impact the majority of scenarios within LiveRamp.

What is Google VM Manager?

Google VM Manager is a suite of tools that can be used to manage operating systems for large virtual machine (VM) fleets running Windows and Linux on Compute Engine.

Image 1: VM Manager architecture overview

Architecting OTEL Installation

Image 2: The workflow

Prerequisites:

  • os-config agent: The os-config agent must be installed and running on the target machine. Note: OS images provided by Google come with the os-config agent pre-installed and set to start automatically.

Process:

Installing the OTEL agent on a target GCE instance consists of two primary steps:

  1. Preparation – Establish the infrastructure resources using our Terraform module. 
    1. Enabling the os-config service API
    2. Creating a source bucket
    3. Granting necessary permissions to service accounts. 
  2. Process – Use the vm-manager module to:
    1. Generate configuration files
    2. Download required installation files
    3. Upload components to the source bucket
    4. Set policy assignments for installing the OTEL agent, along with node_exporter/windows_exporter.

The diagram below describes how the Terraform module vm-manager works.

Image 3: vm-manager

Sub-Modules

Data: This module generates the essential configuration artifacts (otel_conf, startup_script, otel-ca), calculates the files’ sha1sum, and builds the corresponding tarballs. Importantly, it ensures that generating the conf tarball does not cause Terraform drift.
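One common way to keep a generated tarball from causing drift is to make its bytes depend only on the file names and contents, so that its sha1sum stays stable between runs. The following is a minimal sketch of that idea in Python, not the module's actual implementation; the function names and the sample file names are hypothetical.

```python
import gzip
import hashlib
import io
import tarfile


def build_tarball(files: dict[str, bytes]) -> bytes:
    """Pack files into a tar.gz whose bytes depend only on names and contents."""
    raw = io.BytesIO()
    # mtime=0 keeps the current time out of the gzip header.
    with gzip.GzipFile(fileobj=raw, mode="wb", mtime=0) as gz:
        with tarfile.open(fileobj=gz, mode="w") as tar:
            for name in sorted(files):  # fixed member order
                info = tarfile.TarInfo(name=name)
                info.size = len(files[name])
                info.mtime = 0  # fixed timestamp instead of "now"
                info.mode = 0o644
                tar.addfile(info, io.BytesIO(files[name]))
    return raw.getvalue()


def sha1_hex(data: bytes) -> str:
    """Return the sha1sum of data as a hex string."""
    return hashlib.sha1(data).hexdigest()


if __name__ == "__main__":
    # Hypothetical artifact names and contents, for illustration only.
    conf = {"otel_conf.yaml": b"receivers: {}\n", "startup_script.sh": b"#!/bin/sh\n"}
    print(sha1_hex(build_tarball(conf)))
```

Running the script twice with the same inputs prints the same checksum, which is the property that lets a checksum comparison decide whether anything needs to be regenerated or re-uploaded.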

Check: This module verifies the existence of key resources, including OTEL agent, node exporter, and OTEL conf tarball. It performs necessary checks to ensure proper resource availability before proceeding. This module helps maintain resource consistency.

Download: The download module handles the download of the OTEL agent and node exporter tarballs. It only downloads these resources if they do not already exist, which avoids redundant downloads.
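The download-only-if-missing behaviour can be sketched as follows. This is an illustrative Python version under our own assumptions, not the module's code: `ensure_downloaded`, `http_fetch`, and the `.part` temporary suffix are hypothetical names chosen for the example.

```python
import urllib.request
from pathlib import Path
from typing import Callable


def http_fetch(url: str) -> bytes:
    """Fetch a URL over HTTP(S) and return the response body."""
    with urllib.request.urlopen(url) as resp:
        return resp.read()


def ensure_downloaded(url: str, dest: Path,
                      fetch: Callable[[str], bytes] = http_fetch) -> bool:
    """Download url into dest unless dest already exists.

    Returns True if a download happened, False if it was skipped.
    """
    if dest.exists():
        return False
    dest.parent.mkdir(parents=True, exist_ok=True)
    tmp = dest.with_name(dest.name + ".part")
    tmp.write_bytes(fetch(url))
    # Rename only after the write finished, so a partial file never
    # looks like a completed download on the next run.
    tmp.replace(dest)
    return True
```

Passing the fetch function as a parameter keeps the existence check easy to exercise without network access.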

gsutil: The gsutil module copies *.tar.gz files from the temporary folder to the specified source bucket. It skips uploading the OTEL agent conf if it already exists in the bucket, which avoids unnecessary uploads.
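As a rough illustration of this upload step, the sketch below builds a `gsutil cp` command for every *.tar.gz file in a folder. It relies on gsutil's `-n` (no-clobber) flag to skip objects that already exist, which is one possible way to get the skip behaviour and may differ from how the module decides; the function name and bucket name are hypothetical.

```python
from pathlib import Path


def gsutil_upload_command(tmp_dir: str, bucket: str) -> list[str]:
    """Build a gsutil command that uploads all *.tar.gz files in tmp_dir.

    Returns an empty list when there is nothing to upload.
    """
    tarballs = sorted(str(p) for p in Path(tmp_dir).glob("*.tar.gz"))
    if not tarballs:
        return []
    # -m runs the copy in parallel; -n (no-clobber) skips objects that
    # already exist in the destination bucket.
    return ["gsutil", "-m", "cp", "-n", *tarballs, f"gs://{bucket}/"]
```

The returned list can be passed to `subprocess.run(cmd, check=True)` on a machine where gsutil is installed and authenticated.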

Policy-assignment: The module fetches OTEL agent installation and configuration files from the designated source bucket, facilitating the setup of OTEL agent, node_exporter, or windows_exporter.
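To give a feel for what such a policy assignment contains, here is a hedged Python sketch that builds a request body shaped like the OS Config OSPolicyAssignment resource. The field names follow that REST resource as we understand it and should be checked against the API reference; the policy id, the install script object name, the `otelcol` service name, and the label value are all hypothetical placeholders rather than values from our module.

```python
import json


def otel_policy_assignment(bucket: str, install_object: str,
                           label_key: str = "otel_agent",
                           label_value: str = "true") -> dict:
    """Build an OSPolicyAssignment-shaped body that installs an agent from GCS."""
    return {
        "osPolicies": [{
            "id": "install-otel-agent",  # hypothetical policy id
            "mode": "ENFORCEMENT",
            "resourceGroups": [{
                "resources": [{
                    "id": "otel-agent",
                    "exec": {
                        # validate: exit 100 means "already in the desired
                        # state", exit 101 means "enforce must run".
                        "validate": {
                            "interpreter": "SHELL",
                            "script": ("if systemctl is-active --quiet otelcol; "
                                       "then exit 100; else exit 101; fi"),
                        },
                        # enforce: run an install script fetched from the
                        # source bucket (assumed to exit 100 on success).
                        "enforce": {
                            "interpreter": "SHELL",
                            "file": {"gcs": {"bucket": bucket,
                                             "object": install_object}},
                        },
                    },
                }],
            }],
        }],
        # Only VMs carrying this label are targeted (assumed label value).
        "instanceFilter": {
            "inclusionLabels": [{"labels": {label_key: label_value}}],
        },
        "rollout": {
            "disruptionBudget": {"percent": 10},
            "minWaitDuration": "60s",
        },
    }


if __name__ == "__main__":
    body = otel_policy_assignment("my-source-bucket", "install_otel.sh")
    print(json.dumps(body, indent=2))
```

The validate/enforce split is what makes the assignment safe to re-apply: enforcement only runs on instances where validation reports that the agent is not yet in the desired state.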

Here is a complete example, with default values assigned to most of the variables.

Outcomes

Verifying from GCP

Once the Terraform changes are merged, all policies shift to a compliant state, indicating that they were applied successfully to the target GCE instances.

Grafana Prom Metrics

In Grafana, you can then use the label key “otel_agent” to filter the VMs, both Linux and Windows.
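As an illustration of filtering on that label, the sketch below builds a PromQL expression for CPU usage per instance. It assumes the standard idle-CPU counters exposed by node_exporter (`node_cpu_seconds_total`) and windows_exporter (`windows_cpu_time_total`); because the label's values are not specified here, the selector only requires the label to be present and non-empty. The helper name is hypothetical.

```python
# Idle-CPU counters for each OS family (assumed exporter defaults).
IDLE_CPU_METRIC = {
    "linux": "node_cpu_seconds_total",
    "windows": "windows_cpu_time_total",
}


def cpu_busy_percent(os_family: str, label: str = "otel_agent") -> str:
    """Return a PromQL expression for CPU busy % on labelled VMs."""
    metric = IDLE_CPU_METRIC[os_family]
    # label!="" keeps only series where the label is set to some value.
    selector = f'{metric}{{mode="idle",{label}!=""}}'
    return f"100 * (1 - avg by (instance) (rate({selector}[5m])))"


if __name__ == "__main__":
    print(cpu_busy_percent("linux"))
    print(cpu_busy_percent("windows"))
```

The resulting strings can be pasted into a Grafana panel query or the Explore view against the Prometheus data source.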

Grafana Loki Logs

Grafana Dashboards

Once the logs and metrics are available in Grafana, you can create dashboards to monitor your GCE instances.

You can tailor these dashboards to your specific needs, for example CPU, system, disk, and memory consumption.

Linux Dashboard Example:

Windows Dashboard Example:

Key Takeaways of the Deployment of the OTEL Agent onto GCE Instances

Deploying the OpenTelemetry (OTEL) Agent onto GCE instances extends our observability coverage from Kubernetes clusters to individual VMs. Our strategy relies on Terraform and Google VM Manager for infrastructure orchestration, and the key takeaways below summarize what the OTEL installation gave us.

Unified Observability Stack

Deploying the OTEL Agent alongside Terraform and Google VM Manager in infrastructure provisioning establishes a cohesive observability stack. This integration not only facilitates the correlation of metrics, traces, and logs (Grafana+Prom+Loki) but also emphasizes our commitment to a consistent monitoring approach across both GKE and GCE.

Cost Savings through Open Source Adoption

Opting for the OTEL Agent and Grafana has allowed us to replace expensive third-party observability tools, achieving substantial cost savings without compromising functionality. This strategic move aligns with the principle of maximizing resource efficiency without sacrificing the quality of insights derived from our observability stack.

Learning and Insights from OTEL Installation

The deployment of the OTEL Agent using Terraform and Google VM Manager proved to be a valuable learning experience for our team. It provided insights into making applications observable, including how to gather the information needed for better monitoring.

Integrating OTEL into GCE instances not only improved our technical skills but also emphasized the vital role of telemetry in managing modern infrastructure. This journey broadened our skills and deepened our understanding of effective system monitoring practices.

In a nutshell, using Terraform and VM Manager not only widened our monitoring view to include VM instances but also showcased how SRE practices make things work smoothly. This approach ensures we monitor everything seamlessly and highlights the strength of SRE methods in creating a well-organized monitoring setup for the whole organization.