At LiveRamp, monitoring is a critical part of our infrastructure because it ensures the availability of our services. At the system level, we have used several monitoring solutions, such as Zabbix and Nagios, to keep track of the health of our servers. However, we also want our monitoring tools to take appropriate action during failures. Our internal services are developed as Thrift services, and each service is essentially a background process (daemon). They act as cross-team communication channels. When a service goes down, we need a way to restart it automatically and alert the corresponding team so they can investigate further.
Moreover, we want to empower our engineers to monitor the services they’ve built, because this gives them more ownership. When things fail, they are notified and understand how to resolve the problem quickly. We also want monitoring alerts to be actionable for on-call engineers.
Monit is Awesome
For these reasons, we looked for a process monitoring tool that offers simplicity, stability, and extensibility. After comparing several tools, including supervisord and runit, we decided to use monit, because:
- Monit’s DSL syntax is easy to read. You can restart a service based on system-level conditions. For example, when the memory consumption of a service exceeds a certain limit, we can have monit restart it using this syntax:
if total memory > 4 GB for 5 cycles then restart
- Monit is extensible. It supports running custom scripts, so you can write your own checks to verify that a service is healthy. This helps us detect instability early and gives us some buffer to investigate an issue while maintaining availability.
- We already use it to monitor some third-party applications, such as Apache and Elasticsearch, so it makes sense to leverage the team's existing knowledge.
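To make the first two points concrete, here is a minimal sketch of what a monit config for one service might look like. The service name, pidfile, init script, and health check script path are all hypothetical placeholders, not our actual configuration; only the memory rule is taken from the example above.

```
# Hypothetical service: name, pidfile, and start/stop commands are assumptions.
check process my_service with pidfile /var/run/my_service.pid
  start program = "/etc/init.d/my_service start"
  stop program  = "/etc/init.d/my_service stop"
  # Restart when total memory (process plus children) stays above 4 GB
  # for 5 consecutive polling cycles.
  if total memory > 4 GB for 5 cycles then restart

# Hypothetical custom health check: monit runs the script on every cycle
# and looks at its exit status.
check program my_service_health with path "/opt/my_service/bin/health_check.sh"
  if status != 0 for 3 cycles then alert
```

The `for N cycles` qualifier is what keeps a single transient spike or one failed check from triggering an action; the thresholds shown are illustrative only.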
How we integrate monit
To make integration easy, we developed a wrapper script called service launcher. It relies on a few conventions to deploy monit config files, and it lets us keep these config files alongside the project code. When the project is deployed, the files are copied to a folder shared between releases, and the deploy reloads the monit daemon and restarts the service. Service launcher hides this complexity from our engineers. When we develop a new service, we just need to follow the naming conventions and put the monit config file into the proper folder.
At a high level, service launcher uses the following steps to restart a service.
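As a rough illustration of how a shared config folder can be wired into monit, the main monit control file can pull in every per-service file with an `include` directive. The folder path, file suffix, and polling interval below are assumptions made for this sketch, not the conventions service launcher actually uses.

```
# Main monit control file (sketch; path and interval are assumptions).
set daemon 60                          # poll services every 60 seconds

# Load every per-service config that a deploy copied into the shared folder.
include /apps/shared/monit/*.monitrc
```

With a layout like this, a deploy only needs to copy the project's config file into the shared folder, run `monit reload` so the daemon re-reads its configuration, and then run `monit restart` with the service name.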
While developing service launcher, we found that monit cannot reliably reload daemons. Because monit uses an event-based execution model, the issue appears when we restart the service immediately after reloading the monit daemon. Fortunately, we were able to find some workarounds. Apart from that, monit has been very stable in production, and it now acts as a safeguard for our services.