We have made significant strides in orchestrating observability on Kubernetes clusters, using Helm-managed Terraform modules to deploy OpenTelemetry agents. However, a critical gap persists: there is no parallel approach for installing OpenTelemetry agents on Google Compute Engine (GCE) instances, whether they run Windows or Linux.
While Terraform modules have proved effective for observability within Kubernetes, extending this capability to individual GCE instances, such as those running JupyterHub, Citrix, and Tableau, is still a challenge. To tackle this issue, we propose a Terraform-based approach for installing the OpenTelemetry (OTEL) agent on GCE.
The Solution: Leveraging Terraform + VM Manager
We chose Terraform + VM Manager over Terraform + Ansible, prioritizing ease of use and a smoother learning curve.
Table: Terraform + VM Manager compared with Terraform + Ansible
Using VM Manager confines this approach to GCP. Fortunately, this limitation does not affect the majority of scenarios within LiveRamp.
What is Google VM Manager?
Google VM Manager is a suite of tools that can be used to manage operating systems for large virtual machine (VM) fleets running Windows and Linux on Compute Engine.
Image 1: VM Manager architecture overview
Architecting OTEL Installation
Image 2: The workflow
- os-config agent: The os-config agent must be installed and running on the target machine. Note: OS images provided by Google come with the os-config agent pre-installed and set to start automatically.
Installing the OTEL agent on a target GCE instance consists of two primary steps:
- Preparation – Establish the infrastructure resources using our Terraform module:
  - Enable the os-config service API
  - Create a source bucket
  - Grant the necessary permissions to service accounts
- Process – Use the vm-manager module to:
  - Generate configuration files
  - Download the required installation files
  - Upload components to the source bucket
  - Set policy assignments for installing the OTEL agent, along with node_exporter/windows_exporter
The diagram below describes how the vm-manager Terraform module works.
Image 3: vm-manager
Data: This module generates the essential configuration artifacts (otel_conf, startup_script, and otel-ca), calculates their sha1sum, and builds the corresponding tarballs. Importantly, it ensures that generating the conf tarball does not cause Terraform drift.
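One way to keep a generated tarball stable across runs is to strip everything that varies between builds (timestamps, ownership, file order), so that its sha1sum changes only when the content changes. The Python sketch below illustrates that idea only; it is not the module's actual implementation, and the names `build_tarball` and `sha1sum` and the sample file contents are hypothetical.

```python
import gzip
import hashlib
import io
import tarfile
from typing import Mapping


def build_tarball(files: Mapping[str, bytes]) -> bytes:
    """Build a .tar.gz whose bytes depend only on file names and contents."""
    buf = io.BytesIO()
    # mtime=0 and an empty filename keep the gzip header identical on every run.
    with gzip.GzipFile(filename="", fileobj=buf, mode="wb", mtime=0) as gz:
        with tarfile.open(fileobj=gz, mode="w") as tar:
            # Sorted order makes the archive independent of dict ordering.
            for name in sorted(files):
                data = files[name]
                info = tarfile.TarInfo(name=name)
                info.size = len(data)
                info.mtime = 0        # fixed timestamp
                info.mode = 0o644     # fixed permissions
                info.uid = info.gid = 0
                info.uname = info.gname = ""
                tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def sha1sum(data: bytes) -> str:
    """Return the hex SHA-1 digest of the given bytes."""
    return hashlib.sha1(data).hexdigest()


if __name__ == "__main__":
    # Hypothetical artifact contents, for illustration only.
    conf = build_tarball({
        "otel_conf": b"receivers: {}\n",
        "startup_script": b"#!/bin/sh\n",
    })
    print(sha1sum(conf))
```

Because the digest is a pure function of the inputs, a plan that compares digests sees a change only when a configuration file actually changed.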
Check: This module verifies that key resources exist, including the OTEL agent, node exporter, and OTEL conf tarballs, before the later steps proceed. This helps keep resources consistent.
Download: The download module fetches the OTEL agent and node exporter tarballs. It downloads them only if they do not already exist, which avoids redundant downloads.
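The "download only if missing" behavior can be sketched as a small idempotent helper. This Python example is an illustration under assumptions, not the module's code: `download_if_missing` is a hypothetical name, and the `fetch` parameter exists only so the transfer step can be swapped out (it defaults to the standard library's `urllib.request.urlretrieve`).

```python
import os
import urllib.request
from typing import Callable


def download_if_missing(
    url: str,
    dest: str,
    fetch: Callable[[str, str], object] = urllib.request.urlretrieve,
) -> bool:
    """Download url to dest unless dest already exists.

    Returns True if a download happened, False if it was skipped.
    """
    if os.path.exists(dest):
        return False  # already present: nothing to do
    os.makedirs(os.path.dirname(dest) or ".", exist_ok=True)
    # Write to a temporary name first so that an interrupted download
    # is never mistaken for a complete file on the next run.
    tmp = dest + ".part"
    fetch(url, tmp)
    os.replace(tmp, dest)
    return True
```

Calling the helper twice with the same destination performs the transfer once; the second call returns `False` without touching the network.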
gsutil: The gsutil module copies *.tar.gz files from the temporary folder to the specified source bucket. It skips uploading the OTEL agent conf tarball if it already exists in the bucket.
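The upload step can be pictured as planning a list of `gsutil cp` commands and leaving out the conf tarball when the bucket already holds it. The Python sketch below is an assumption-laden illustration: `plan_uploads`, the bucket name, and the file names are hypothetical, and the sketch assumes the conf tarball's object name changes whenever its content changes (for example by embedding the sha1sum), which the text above does not state.

```python
import os
from typing import Iterable, List, Set


def plan_uploads(
    local_tarballs: Iterable[str],
    existing_objects: Set[str],
    bucket: str,
    conf_name: str,
) -> List[List[str]]:
    """Return the gsutil cp commands needed to sync tarballs to the bucket.

    The conf tarball (conf_name) is skipped when an object with the same
    name is already present; every other tarball is always copied.
    """
    commands: List[List[str]] = []
    for path in sorted(local_tarballs):
        name = os.path.basename(path)
        if name == conf_name and name in existing_objects:
            continue  # conf already uploaded: avoid a redundant copy
        commands.append(["gsutil", "cp", path, f"gs://{bucket}/{name}"])
    return commands
```

Each returned command is an argument list that could be handed to `subprocess.run`; separating planning from execution makes the skip logic easy to check without touching a real bucket.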
Policy-assignment: This module sets the policy assignments that fetch the OTEL agent installation and configuration files from the designated source bucket and set up the OTEL agent, along with node_exporter or windows_exporter.
Here is a complete example, with default values assigned to most of the variables.
Verifying from GCP
Once the Terraform changes are merged, all policies will show as compliant, indicating that they were applied successfully to the target GCE instances.
Grafana Prom Metrics
In Grafana, you can now use the label “otel_agent” to filter VMs, both Linux and Windows.
Grafana Loki Logs
Once the logs and metrics are available in Grafana, you can create dashboards to monitor your GCE instances.
You can tailor these dashboards to specific needs, for example CPU, system, disk, and memory consumption.
Linux Dashboard Example:
Windows Dashboard Example:
Key Takeaways of the Deployment of the OTEL Agent onto GCE Instances
Deploying the OpenTelemetry (OTEL) Agent onto GCE instances is central to our observability solution. While our strategy relies on Terraform and Google VM Manager for infrastructure orchestration, the following takeaways highlight why the OTEL installation matters.
Unified Observability Stack
Deploying the OTEL Agent alongside Terraform and Google VM Manager in infrastructure provisioning establishes a cohesive observability stack. This integration not only facilitates the correlation of metrics, traces, and logs (Grafana+Prom+Loki) but also emphasizes our commitment to a consistent monitoring approach across both GKE and GCE.
Cost Savings through Open Source Adoption
Opting for the OTEL Agent and Grafana has allowed us to replace expensive third-party observability tools, achieving substantial cost savings without compromising functionality. This strategic move aligns with the principle of maximizing resource efficiency without sacrificing the quality of insights derived from our observability stack.
Learning and Insights from OTEL Installation
Deploying the OTEL Agent using Terraform and Google VM Manager proved to be a valuable learning experience for our team. It taught us how to make applications observable, including how to gather the information needed for effective monitoring.
Integrating OTEL into GCE instances not only improved our technical skills but also emphasized the vital role of telemetry in managing modern infrastructure, and it deepened our understanding of effective system monitoring practices.
In a nutshell, using Terraform and VM Manager widened our monitoring view to include VM instances and showed how SRE practices help create a well-organized monitoring setup for the whole organization.