Saturday, February 16, 2013

The impact of #JVM garbage collection on response time

When analyzing the responsiveness of Java applications, it's always good to start by looking at GC activity.

It makes sense to look at JMX GC stats first, but if you don't see anything obvious, don't discard GC as the potential culprit just yet! JMX GC stats might simply not be granular enough for the impact to show up on a 1-minute time series. Only digging into the GC logs will give a conclusive answer.
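If you want to see what those JMX stats actually contain, here's a minimal sketch that polls the standard GarbageCollectorMXBean counters in-process. Note that the counters are cumulative, so a single long pause can hide inside an otherwise quiet-looking minute:

import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Minimal sketch: poll cumulative GC counters exposed over JMX.
// Deltas between polls only show per-interval totals, so one long
// pause can disappear into an otherwise quiet minute.
public class GcStatsProbe {
    public static void main(String[] args) throws InterruptedException {
        while (true) {
            for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
                System.out.printf("%s: collections=%d, totalTimeMs=%d%n",
                        gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
            }
            Thread.sleep(60_000); // the 1-minute granularity discussed above
        }
    }
}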

Now, GC activity timestamps aren't easy to work with, as they correspond to JVM uptime rather than a standard Unix timestamp or formatted date, as is the case with pretty much every other log timestamp format.

An example entry looks as follows (read Poonam Bajaj's post on how to decipher them):
 
39.910: [GC 39.910: [ParNew: 261760K->0K(261952K), 0.2314667 secs] 262017K->26386K(1048384K), 0.2318679 secs]
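For completeness, here's a rough sketch of pulling the uptime offset (the first number, in seconds since JVM start) and the total pause duration (the last "N secs") out of such a line with a regex. The pattern only covers ParNew lines shaped like the one above, so treat it as a starting point, not a general GC log parser:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Rough sketch: extract the uptime offset and the total pause
// from a ParNew log line like the one quoted above.
public class GcLineParser {
    private static final Pattern OFFSET = Pattern.compile("^(\\d+\\.\\d+): \\[GC");
    private static final Pattern SECS = Pattern.compile("(\\d+\\.\\d+) secs");

    public static void main(String[] args) {
        String line = "39.910: [GC 39.910: [ParNew: 261760K->0K(261952K), "
                + "0.2314667 secs] 262017K->26386K(1048384K), 0.2318679 secs]";
        Matcher offset = OFFSET.matcher(line);
        if (!offset.find()) return;
        double uptimeSecs = Double.parseDouble(offset.group(1));
        double pauseSecs = 0;
        Matcher secs = SECS.matcher(line);
        while (secs.find()) {           // keep the last match: the total pause
            pauseSecs = Double.parseDouble(secs.group(1));
        }
        System.out.printf("offset=%.3f s, pause=%.7f s%n", uptimeSecs, pauseSecs);
    }
}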

Finding out the creation date of a process on Linux is simple (stat /proc/<jvm pid>), but the JVM takes some time to bootstrap itself, so its internal timer starts about 100 ms later than the creation date on the /proc filesystem. This means GC logstamps need to be adjusted accordingly to reflect timing more accurately; in my case the offset was about 138 ms.

precise GC time = GC logstamp + JVM process start + JVM bootstrap overhead
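In code the adjustment is a one-liner. Here's a sketch assuming processStartMillis has already been obtained from stat /proc/<jvm pid> and the 138 ms figure from above is used as the bootstrap overhead (both values are specific to my setup):

// Sketch of the adjustment above. processStartMillis would come from
// stat /proc/<jvm pid> converted to epoch millis; BOOTSTRAP_OVERHEAD_MS
// is the empirically measured ~138 ms gap described in the text.
public class GcTimeAdjuster {
    static final long BOOTSTRAP_OVERHEAD_MS = 138;

    static long preciseGcTimeMillis(long processStartMillis, double gcLogstampSecs) {
        return processStartMillis + BOOTSTRAP_OVERHEAD_MS + Math.round(gcLogstampSecs * 1000);
    }

    public static void main(String[] args) {
        long processStart = 1_361_000_000_000L; // hypothetical epoch millis
        System.out.println(preciseGcTimeMillis(processStart, 39.910));
    }
}

(If you can run code inside the JVM in question, RuntimeMXBean.getStartTime() returns the start time in epoch milliseconds directly and sidesteps the /proc measurement altogether.)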

At millisecond granularity, correlating precise GC times with an application's request logs makes it possible to assess the extent of the impact the garbage collector has on the app.

The following is an example of a millisecond-granularity breakdown of open requests to a high-throughput application during a sample one-minute period. During normal operation there are no more than 5 active requests being served at the same time.
Request data can be generated by extracting the start time of each request and the corresponding request duration from the logs. Data on when and for how long GC took place can be put together the same way.
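As a rough illustration, here's a hypothetical sketch (the sample requests are made up) of turning such (start, duration) pairs into a per-millisecond count of open requests over a one-minute window:

// Hypothetical sketch: turn (start, duration) pairs parsed from the
// request log into a per-millisecond count of open requests across
// a one-minute window. Sample values are made up.
public class OpenRequestCounter {
    public static void main(String[] args) {
        long windowStartMillis = 0;           // epoch millis of window start
        int[] open = new int[60_000];         // one slot per millisecond
        long[][] requests = {                 // {startMillis, durationMillis}
                {14_200, 1}, {14_310, 305}};
        for (long[] r : requests) {
            int from = (int) Math.max(r[0] - windowStartMillis, 0);
            int to = (int) Math.min(from + Math.max(r[1], 1), open.length);
            for (int t = from; t < to; t++) open[t]++;
        }
        // open[] can now be plotted next to the GC on/off intervals
        System.out.println("open at t=14310ms: " + open[14_310]);
    }
}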

Looking at the plot of GC activity (pink = activity, white = no activity), the spikes seem to roughly correspond to the collector being at work.
The correlation becomes clearly visible when the GC and request data are overlaid and zoomed in on, as below, to the six-second period between seconds 14 and 20.
Look what happens: requests are enqueued but not processed while the collector is running. When they are finally processed, each takes less than a millisecond. As a result, one might feel pretty comfortable about a low average response time if that average is calculated from request durations in the logs (the application genuinely believes each request took <1 ms), while in reality some requests are held up for about 300 milliseconds, which is how long it takes an external caller to get an answer.
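The arithmetic behind that effect is easy to sketch: a request that arrives mid-pause only starts being timed by the application once the pause ends, so the caller-observed latency is the remainder of the pause plus the logged duration. The numbers below are hypothetical:

// Hypothetical sketch of the effect described above: the caller sees
// the remaining GC pause as queueing time on top of the logged duration.
public class ObservedLatency {
    static long callerObservedMillis(long arrivalMillis, long loggedDurationMillis,
                                     long pauseStartMillis, long pauseEndMillis) {
        long queued = (arrivalMillis >= pauseStartMillis && arrivalMillis < pauseEndMillis)
                ? pauseEndMillis - arrivalMillis : 0;
        return queued + loggedDurationMillis;
    }

    public static void main(String[] args) {
        // Request arrives 10 ms into a ~300 ms pause, logs <1 ms of work:
        System.out.println(callerObservedMillis(14_310, 1, 14_300, 14_600)); // 291
    }
}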
