Recently we noticed some intermittent performance issues with our APIs and traced the problem back to garbage collection pauses. This post describes how we identified the problem, what it taught us about Java’s concurrent mark-and-sweep garbage collection, and how we resolved it with just a few tweaks to our JVM parameters.
We monitor our API request times very carefully, and while the average is ~5ms, there have always been a small number of outliers that take hundreds of milliseconds. Recently, though, the number of outliers has grown, and the timing info in our logs gave us reason to believe it had to do with garbage collection. The main sign was that calls which should take virtually no time (like memcached) were sometimes taking hundreds of milliseconds even when load was normal.
To confirm our theory, we turned on verbose GC logs with the following flags:
-XX:+PrintGCDetails -XX:+PrintGCTimeStamps -verbose:gc -Xloggc:gc-`date +%s`.log
date to generate a dynamic name for the log so the previous one wouldn’t get overwritten when restarting the server. And to be clear, we were already using
-XX:+UseConcMarkSweepGC. If you are concerned about pauses and aren’t using this collector, you should start using it.
As soon as we looked at the resulting logs it was clear GC was a problem. The garbage collector was running pretty much all the time with minor collections happening at least once a second, if not more frequently, and major collections happening about every 3 minutes. This, all in an app that creates no long lived-objects after startup, so there shouldn’t really be the need for major collections (which are what pause the entire VM).
To make interpreting the GC logs easier we used GCViewer. It was great to be able to visualize our memory usage and compare tweaks we made with our baseline. When we viewed the data as a graph, we noticed a really interesting pattern: after minor collections (the little drops), total memory usage would go up, but after major collections (the big drops), total memory usage would return to the same low level as before. Even though we weren’t leaking memory overall, on a short time scale, it looked like we were.
So the key question is: since all of our objects are short lived, why were the minor collections leaving memory around? This type of situation is what a generational garbage collector is ideally suited for, right? It doesn’t take a deep understanding of Java GC to realize the problem was that our young generation was too small, which caused even some short lived objects to be promoted to the tenured generation.
To understand why this was happening requires a bit more understanding, though. The young generation is split up into three parts: Eden space, and two Survivor spaces. New objects are allocated in Eden, and when minor collections occur the objects that are still live are copied to one of the Survivor spaces. At any given time one of the Survivor spaces is empty and is the destination of the next minor collection. The other Survivor space holds the live objects from the previous minor collection, and gets collected during the next minor collection. Since it’s assumed most objects die young, the Survivor spaces are relatively small.
So what happens when the live objects from a minor collection can’t fit into a survivor space? The leftover objects are moved to the tenured generation which is only cleaned by major collections. In our case the problem was that the minor collections were so frequent that a decent percentage of objects were still live at each run—enough that they couldn’t all fit in the survivor space, causing even relatively short lived objects to be tenured. We realized we had to get minor collections to happen less often and have more space for surviving objects so they aren’t prematurely tenured.
We played around with some GC parameters to find the right balance for our application and arrived at this:
Which resulted in a much better GC graph:
Hooray! No more trending up; no more major collections! Problem solved, right?
Well mostly. But there are still special circumstances in which memory usage is not as friendly-looking as in the graph above. At the core of our problems with garbage collection is that we’re allocating a lot of memory in a very short amount of time. Addressing that issue will be the topic of a follow-up post. (Spoiler: We significantly decrease our memory usage.)