Striving for zero copies with Thrift 0.5

Engineering


“Zero copies” is a common optimization principle in high-performance applications. The gist of the technique is to perform the smallest number of byte array copies necessary for the server to do its job. Byte array copies are one of those insidious time-wasters that are hard to understand or even detect until you start looking for them. It seems intuitive to use a perfectly-sized byte array for everything you do: it’s straightforward, it reduces the number of arguments you have to pass to each method, and most of all, it’s just simple. However, you pay a steep price every time you copy a byte array – the CPU spins away shuffling bytes from one memory location to another. It’s even worse in Java: every time you create a new byte[], you’re both allocating memory and looping over it to zero out each position. That means you pay a price now, as the runtime zeroes every position, and a price later, when you ultimately have to garbage collect the byte[] you threw away. An ideal server never copies a byte[] unnecessarily, preferring to reuse the same one over and over again.
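To make the cost concrete, here’s a minimal, illustrative sketch (not a rigorous benchmark; the class name and sizes are made up for this post):


public class AllocVsReuse {
  private static final int ITERATIONS = 100000;
  private static final int SIZE = 64 * 1024;

  public static void main(String[] args) {
    // Pattern 1: a fresh, perfectly-sized array per call. Each new byte[]
    // costs an allocation, an implicit zeroing pass over all 64 KB, and
    // a garbage collection later.
    for (int i = 0; i < ITERATIONS; i++) {
      byte[] scratch = new byte[SIZE];
      scratch[0] = 1; // stand-in for real work
    }

    // Pattern 2: one allocation, reused for every call. No per-call
    // allocation, zeroing, or garbage.
    byte[] reused = new byte[SIZE];
    for (int i = 0; i < ITERATIONS; i++) {
      reused[0] = 1;
    }
  }
}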

Before Thrift 0.4, no matter how much you might want to, there was no way to avoid doing an extra byte[] copy for each binary field that you deserialized, despite the fact that virtually all deserialization happens directly from an in-memory byte[] buffer. Thrift 0.4 changed that by switching the underlying type of binary fields from byte[] to the Java NIO construct ByteBuffer; Thrift 0.5 elaborated on this theme by making it easier to get the byte[]s that everyone expected while still offering access to the ByteBuffer for more advanced operations.
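To give a feel for the change, here’s a rough sketch of the shape of the accessors generated for a binary field (illustrative only; the real generated code differs in the details):


// Inside the generated struct, foo is stored as a ByteBuffer.
private ByteBuffer foo;

// getFoo() preserves the old byte[] contract, but at the cost of a copy.
public byte[] getFoo() {
  byte[] copy = new byte[foo.remaining()];
  foo.duplicate().get(copy); // duplicate() so we don't disturb foo's position
  return copy;
}

// bufferForFoo() exposes the underlying buffer directly: no copy.
public ByteBuffer bufferForFoo() {
  return foo;
}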

So how do you actually use this feature to speed up your servers? Let’s take a look at a pair of examples. In both examples, we’ll use the following Thrift file as our base:


struct A {
  1: required binary foo;
}

service SomeService {
  A read();
  void logFoo(1: A a);
}

The logFoo method

Let’s pretend that your objective is to log the contents of the foo field to some stream. Here’s how you might do it naively:


private DataOutputStream out;

public void logFoo(A a) throws TException {
  byte[] value = a.getFoo();
  try {
    out.writeInt(value.length);
    out.write(value);
  } catch (IOException e) {
    // the stream methods throw IOException, which the Thrift-generated
    // interface doesn't declare, so wrap it
    throw new TException(e);
  }
}

Seems simple enough. So what’s wrong here? The problem is that calling getFoo() causes a byte[] copy. It’s hidden from you by the method, but it’s happening nonetheless. The copy you create is only used for an instant, becoming garbage after you pass it to write(), and then the entire A object becomes garbage.

Here’s the right way to do it:


private DataOutputStream out;

public void logFoo(A a) throws TException {
  ByteBuffer value = a.bufferForFoo();
  try {
    out.writeInt(value.remaining());
    out.write(value.array(), value.arrayOffset() + value.position(), value.remaining());
  } catch (IOException e) {
    throw new TException(e);
  }
}

There’s a lot here, so let’s break it down. First, notice that we called bufferForFoo() instead of getFoo(). This returns a ByteBuffer instead of a byte[]. Then, we use the remaining() method to get the number of bytes in the buffer that belong to this value. Finally, we get to the write() call, but this time using the “array, offset, length” version of write. This lets us reference a subarray of the array that backs value without any intermediate copying. There’s some trickiness in understanding why the first element of our value lives at index arrayOffset() + position() of the backing array: arrayOffset() tells you where the buffer’s content begins within the backing array, and position() tells you how far into the buffer the unread bytes begin.
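A minimal, self-contained demonstration (plain NIO, nothing Thrift-specific; the class name is ours) shows both terms at work:


import java.nio.ByteBuffer;

public class OffsetDemo {
  public static void main(String[] args) {
    byte[] backing = new byte[16];

    // wrap() keeps arrayOffset() == 0 and records the start as position().
    ByteBuffer wrapped = ByteBuffer.wrap(backing, 4, 8);
    // slice() records the start as arrayOffset() and resets position() to 0.
    ByteBuffer sliced = wrapped.slice();

    // Either way, the first readable byte is backing[arrayOffset() + position()].
    System.out.println(wrapped.arrayOffset() + wrapped.position()); // 4
    System.out.println(sliced.arrayOffset() + sliced.position());   // 4
  }
}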

It’s a small difference with a bit more code, but depending on the size of foo, you could see a substantial boost in performance.

The read method

Now let’s look at things from the other side of the equation. Let’s say that the objective of the read() method is to read the bytes of foo from an input stream and return them wrapped in an instance of A. Here’s what the naive approach might look like:


public A read() throws TException {
  // assume that "in" is a DataInputStream
  try {
    int fooLength = in.readInt();
    byte[] value = new byte[fooLength];
    in.readFully(value);
    A result = new A();
    result.setFoo(value);
    return result;
  } catch (IOException e) {
    throw new TException(e);
  }
}

The problem with this method is that every call leads to a new short-lived instance of A and a brand-new, perfectly-sized byte[]. Both of these will become garbage very soon, and allocating the new byte[] every time is a drag on your CPU.

Let’s focus on how we can reuse the byte[] for now and think about the A instance some other time. There are many possible strategies for caching your buffers, but here’s a simple one:


private final ThreadLocal<byte[]> bufferCache = new ThreadLocal<byte[]>() {
  @Override
  protected byte[] initialValue() {
    return new byte[0]; // grows on first use
  }
};

public A read() throws TException {
  try {
    int fooLength = in.readInt();
    byte[] value = bufferCache.get();
    if (value.length < fooLength) {
      value = new byte[fooLength];
      bufferCache.set(value);
    }
    in.readFully(value, 0, fooLength);
    A result = new A();
    result.setFoo(ByteBuffer.wrap(value, 0, fooLength));
    return result;
  } catch (IOException e) {
    throw new TException(e);
  }
}

There’s a good bit more to this version. First, note that we’re using Java’s ThreadLocal to keep a single byte[] per active thread, starting from an empty array that grows on demand. This ensures that each thread servicing a client won’t interfere with any other, and there’s no contention (synchronization) on the thread-local buffer. Next, after we figure out how much we need to read, we check whether we have enough buffer space to complete the read; if not, we replace our buffer with a new, bigger one. Then we complete the read into the buffer – this time specifying the length we want to read, rather than letting the length of the buffer imply the size of the read. This ensures that on subsequent reads, when fooLength is less than value.length, we don’t read more than we wanted to. Finally, instead of passing the entire buffer into the foo field, we pass in a ByteBuffer that wraps just the portion containing the value we read this time.
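One consequence worth spelling out: because each returned A wraps the thread-local scratch buffer, a second read() on the same thread overwrites the contents of the first. A quick usage sketch (handler is a hypothetical instance of the class above):


// Hypothetical usage: "handler" implements the read() above.
A first = handler.read();
logFoo(first);             // use foo right away, before the buffer is recycled...
A second = handler.read(); // ...because this call overwrites first's backing array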

By using this technique, we’ve avoided a brand-new byte[] allocation (and the zeroing pass that comes with it) on nearly every call. If your record sizes vary a lot, it will take a few calls before the buffer has grown to accommodate the biggest one, but after that, you won’t need any more allocations. If your records are a fixed size, you should reach that point immediately.