Friday thoughts: fail, fast and furiously

This post is by John Graettinger, lead engineering architect based in our New York office. This is part of Friday Thoughts, a post series on improving best practices throughout LiveRamp’s engineering organization. Do you like engineering teams that continuously seek to improve themselves? We’re always hiring.

——————————————————————————————————————

tl;dr: When implementing a service or API, if you get a request you don’t quite understand, the kindest thing you can do is to return a noisy error.

Let’s consider an API like:

  GET /mySum?num=3&num=42

Pretty trivial, eh? I might implement this with something like:

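  // Deliberately naive. Needs "fmt", "net/http", "net/url", and "strconv".
  func mySum(args url.Values, w http.ResponseWriter) {
    lhs, _ := strconv.Atoi(args["num"][0]) // parse errors ignored
    rhs, _ := strconv.Atoi(args["num"][1])
    fmt.Fprint(w, lhs+rhs)
  }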

Wait a tick: what if an extra `foo` argument is provided? A very common answer is that it is ignored: we often try to make our APIs helpful by having them do as much as they can with the parts of the request that they understand. In this toy example, ignoring an extra &foo=bar argument doesn’t seem like such a big deal.

But what if a third `num` argument is provided? Now that’s more of a gray area: as implemented it’d be ignored, though the caller might very reasonably expect it to be included in the sum. By failing to verify our expectations about the arguments within the service, we’ve introduced the worst kind of subtle bug for the caller: a silent failure.
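
To make that concrete, here’s a minimal sketch in Go. It assumes the standard net/url package for query parsing, and `naiveSum` is a hypothetical stand-in for the handler above rather than code from a real service.

```go
package main

import (
	"fmt"
	"net/url"
	"strconv"
)

// naiveSum is a hypothetical stand-in for the mySum handler: it adds the
// first two `num` arguments and never looks at anything else.
func naiveSum(args url.Values) int {
	nums := args["num"]
	lhs, _ := strconv.Atoi(nums[0]) // parse errors dropped on purpose
	rhs, _ := strconv.Atoi(nums[1])
	return lhs + rhs
}

func main() {
	// The caller sends three numbers and presumably expects 145.
	args, err := url.ParseQuery("num=3&num=42&num=100")
	if err != nil {
		panic(err)
	}
	fmt.Println(naiveSum(args)) // prints 45: the third num vanished without a trace
}
```

Nothing in the response tells the caller that their third argument was dropped, which is exactly what makes this kind of bug so hard to spot.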

Silent failures are terrifying. They give me night sweats. They can go for long, even indefinite periods of time without detection: a time during which we think everything is hunky-dory. They are extremely tough to debug since they’re, well, silent. They have high opportunity costs, which can add up to real money.

Even worse, as service API complexity rises, and especially as services rely on other services, the impacts of silent failures become harder to understand and more widespread. It’s a compounding problem. We should prefer a slew of error logs, segfaults, exceptions, and even customer-facing errors to a single silent failure.

What to do about it? I want to introduce a pattern which I’ve found to be super helpful in building reliable services which avoid silent failures:

(1) Give your arguments an explicit representation. Often, Thrift RPC or gRPC will helpfully do this for you. For our mySum API, we have:

  type MySumParams struct {
    Lhs, Rhs int
  }

(2) Write validators that assert the expectations the type has about its semantics. E.g., if MySumParams expects only positive arguments, it should define a Validate() function which returns a meaningful error if they’re not. In the Go codebase, we have a prescribed shape for these validators which is used universally: func Validate() error
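
As a rough sketch of what such a validator might look like for MySumParams, assuming the "positive arguments only" rule used as the example above (the error wording is my own invention, not a house convention):

```go
package main

import "fmt"

// MySumParams mirrors the struct from step (1).
type MySumParams struct {
	Lhs, Rhs int
}

// Validate returns a descriptive error if either operand is not positive.
// The positivity rule is only the illustrative expectation from the text.
func (p MySumParams) Validate() error {
	if p.Lhs <= 0 {
		return fmt.Errorf("MySumParams.Lhs must be positive, got %d", p.Lhs)
	}
	if p.Rhs <= 0 {
		return fmt.Errorf("MySumParams.Rhs must be positive, got %d", p.Rhs)
	}
	return nil
}

func main() {
	fmt.Println(MySumParams{Lhs: 3, Rhs: 42}.Validate())
	fmt.Println(MySumParams{Lhs: 3, Rhs: -1}.Validate())
}
```

Naming the offending field and value in the error is what makes the failure noisy in a useful way: the caller can see what was rejected and why.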

Seriously consider having a validator defined for each and every Thrift RPC or gRPC type. It adds a huge amount of safety by providing strong verification that the client & server don’t have mismatched expectations, which tends to come up often as services are modified or updated over time (e.g., if we later change mySum to allow negative integers).

For REST APIs, validators should be decoupled from argument parsing and extraction. They’re separate concerns, and often easier to test independently.

(3) Consume arguments as they’re extracted into the request representation. Thrift & gRPC tend to manage this for you, but for REST APIs you might write, e.g.:

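  func extractMySumParams(args url.Values) (MySumParams, error) {
    nums := args["num"]
    if len(nums) < 2 {
      return MySumParams{}, fmt.Errorf("want 2 num arguments, got %d", len(nums))
    }
    lhs, errL := strconv.Atoi(nums[0])
    rhs, errR := strconv.Atoi(nums[1])
    if errL != nil || errR != nil {
      return MySumParams{}, fmt.Errorf("num arguments must be integers: %q", nums[:2])
    }
    // Strip consumed `num` arguments; extras stay behind for the final check.
    if args["num"] = nums[2:]; len(nums) == 2 {
      delete(args, "num")
    }
    return MySumParams{Lhs: lhs, Rhs: rhs}, nil
  }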

I might want to use extractMySumParams in combination with other extractors it knows nothing about, which also pop arguments from `args`. However, when my API has extracted and validated all its various parameters, I get to trivially assert that `args` is empty (and should return an error if not).
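
Here’s one way the three steps might fit together, as a self-contained sketch rather than a definitive implementation. `strictSum` is a hypothetical wrapper I’m introducing for illustration, the extractor and validator are restated so the example runs on its own, and the positivity rule is still just the example expectation from step (2).

```go
package main

import (
	"fmt"
	"net/url"
	"strconv"
)

type MySumParams struct {
	Lhs, Rhs int
}

// Validate enforces the example rule that both operands are positive.
func (p MySumParams) Validate() error {
	if p.Lhs <= 0 || p.Rhs <= 0 {
		return fmt.Errorf("operands must be positive, got %d and %d", p.Lhs, p.Rhs)
	}
	return nil
}

// extractMySumParams consumes the first two `num` arguments from args and
// leaves everything it did not use in place.
func extractMySumParams(args url.Values) (MySumParams, error) {
	nums := args["num"]
	if len(nums) < 2 {
		return MySumParams{}, fmt.Errorf("want 2 num arguments, got %d", len(nums))
	}
	lhs, errL := strconv.Atoi(nums[0])
	rhs, errR := strconv.Atoi(nums[1])
	if errL != nil || errR != nil {
		return MySumParams{}, fmt.Errorf("num arguments must be integers: %q", nums[:2])
	}
	if args["num"] = nums[2:]; len(nums) == 2 {
		delete(args, "num") // nothing left over, so drop the key entirely
	}
	return MySumParams{Lhs: lhs, Rhs: rhs}, nil
}

// strictSum (hypothetical) runs extraction, validation, and the final
// assertion that no arguments were left unconsumed.
func strictSum(rawQuery string) (int, error) {
	args, err := url.ParseQuery(rawQuery)
	if err != nil {
		return 0, err
	}
	params, err := extractMySumParams(args)
	if err != nil {
		return 0, err
	}
	if err := params.Validate(); err != nil {
		return 0, err
	}
	if len(args) != 0 {
		return 0, fmt.Errorf("unrecognized arguments: %v", args)
	}
	return params.Lhs + params.Rhs, nil
}

func main() {
	for _, q := range []string{"num=3&num=42", "num=3&num=42&num=100", "num=3&num=42&foo=bar"} {
		sum, err := strictSum(q)
		fmt.Println(q, "->", sum, err)
	}
}
```

With this shape, both the stray `foo` and the third `num` from the earlier examples come back as errors instead of being quietly dropped.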

Factoring argument handling into these separate extraction, validation, and assertion steps has proven enormously helpful in catching problems early and root-causing them quickly.