While we take great pride in testing and QAing our engineering systems at LiveRamp, things do occasionally break in production. When they do, it’s essential that an engineer can diagnose the problem and deploy a fix or workaround in a timely fashion. To this end, each of our engineering teams has an on-call rotation in which each engineer takes on these important responsibilities in turn. Whether an application is down or we’re at risk of missing a data delivery SLA, our on-call process ensures there is someone on the team to handle it, while other team members can work as planned without interruption.
As our team has expanded, we’ve adapted our on-call process to accommodate the growth. In particular, we’ve changed both the way we manage issues and the way we triage them.
Previously, we relied heavily on email to manage on-call issues. Email is a common and easy form of communication, but it has its limits. Issues reported by email are difficult to track, discover, and categorize. There’s no clear way to assign an issue to someone. What’s more, email gives no sense of priority or urgency across different issues.
To mitigate these weaknesses, we started managing our on-call issues using JIRA Service Desk. Users report issues by creating JIRA tickets, and a simple, centralized dashboard makes it easy for both engineers and non-engineers to manage reported issues, track progress, and communicate. Each ticket is prioritized by urgency, carries a corresponding SLA, and can be assigned to an engineer to indicate responsibility. Each ticket is also a central hub for communication, where everyone, including the reporter and the assignee, can add comments, links, screenshots, files, etc. In addition, JIRA labels let us categorize on-call issues and systematically work on improving the underlying systems.
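To make the ticket model concrete, here is a minimal sketch of the information each ticket carries: a priority, an SLA derived from it, an optional assignee, and a set of labels. This is an illustration only, not JIRA’s API or our actual configuration; the `Ticket` class, the priority names, and the SLA windows are all assumptions chosen for the example.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from enum import Enum
from typing import Optional, Set


class Priority(Enum):
    # Hypothetical priority levels; real ones would come from the JIRA project.
    CRITICAL = "critical"
    HIGH = "high"
    NORMAL = "normal"


# Hypothetical SLA windows per priority (assumed values, for illustration).
SLA_BY_PRIORITY = {
    Priority.CRITICAL: timedelta(hours=1),
    Priority.HIGH: timedelta(hours=8),
    Priority.NORMAL: timedelta(days=3),
}


@dataclass
class Ticket:
    summary: str
    reporter: str
    priority: Priority
    created_at: datetime
    assignee: Optional[str] = None                   # unassigned until triaged
    labels: Set[str] = field(default_factory=set)    # used to categorize issues

    @property
    def due_at(self) -> datetime:
        """Deadline implied by the ticket's priority."""
        return self.created_at + SLA_BY_PRIORITY[self.priority]

    def is_overdue(self, now: datetime) -> bool:
        """True once the SLA window has elapsed."""
        return now > self.due_at


ticket = Ticket(
    summary="Data delivery job is stuck",
    reporter="implementation-team",
    priority=Priority.HIGH,
    created_at=datetime(2020, 1, 1, 9, 0),
    labels={"data-delivery"},
)
```

Unlike an email thread, a record like this answers "who owns it?", "how urgent is it?", and "what kind of issue is it?" directly from its fields.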
We also use Flowdock as our online chat platform when we want to ping someone instantly. Because it integrates with both our email inboxes and JIRA, Flowdock is quite useful for cross-team communication and issue-oriented problem solving.
Back in the days when we had only a few engineering teams, simpler systems, and fewer use cases, little communication was needed to report and triage production issues. That is no longer the case: our teams have grown, and our systems have become more complex. Our product implementation team is finding it increasingly difficult to figure out which engineering team is responsible for a given product issue. As a result, they are more likely to turn to the wrong team. Engineers get interrupted unnecessarily, and our SLAs suffer while a problem sits with the wrong team.
To address the problem, we changed our issue triaging process: our web development team, which serves as the interface between non-engineering teams and the other engineering teams, is now in charge of triaging on-call issues reported by non-engineers. The new flow reduces communication overhead by replacing 1-to-many communication with 1-to-1 communication. It also reduces unnecessary interruptions, since engineers, who are better equipped and informed to tell where an issue lies, are now responsible for triaging on-call issues instead of the users of the system.
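The single-entry-point idea can be sketched in a few lines. In practice an engineer on the web development team makes the call; the function below only illustrates the shape of the flow, in which reporters always hand an issue to one team and that team either forwards it to a known owner or keeps it. The team names, labels, and the `route` function are hypothetical.

```python
from typing import Dict, Iterable

# Hypothetical name for the team that receives every non-engineer report.
TRIAGE_TEAM = "web-dev"

# Hypothetical mapping from issue label to the owning engineering team.
OWNERS_BY_LABEL: Dict[str, str] = {
    "data-delivery": "delivery-team",
    "ingestion": "ingestion-team",
}


def route(labels: Iterable[str], owners: Dict[str, str] = OWNERS_BY_LABEL) -> str:
    """Return the team that should handle an issue with the given labels.

    Labels are checked in sorted order so the result is deterministic.
    If no label has a known owner, the issue stays with the triage team
    instead of being sent to a guessed (and possibly wrong) team.
    """
    for label in sorted(labels):
        if label in owners:
            return owners[label]
    return TRIAGE_TEAM
```

The fallback is the important part of the design: an issue with no clear owner stays with the triaging engineers rather than interrupting another team.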
Technical problems aside, maximizing the efficiency of the on-call process itself is an interesting optimization problem. The fact that we went out of our way to solve it reminds me why I love working here: everything can be an opportunity for improvement, and everyone is trying to seize it.