Microbenchmarking in Java with JMH: Common Flaws of Handwritten Benchmarks

Written on May 20, 2014

This is the third post in a series about microbenchmarking on the JVM with the Java Microbenchmark Harness (JMH).

Part 1: Microbenchmarking in Java with JMH: An Introduction

Part 2: Microbenchmarks and their environment

In the previous post I showed typical issues that have to be considered when executing microbenchmarks, such as the behavior of the JIT compiler or background system load. In this post we'll look at the problems you'll encounter when writing microbenchmarks on the JVM.

Flaws and Dangers of Java Microbenchmarks

We write microbenchmarks because we want to know about the performance characteristics of a piece of code. In an ideal world, we would like to argue:

For given microbenchmark candidates X and Y: If X performs better than Y in a given microbenchmark, X will perform better than Y in any "similar" situation.

But Sheldon Cooper knows better: That’s just superstitious hokum.

What could possibly go wrong with this innocent argument? For one thing, the microbenchmark itself could be flawed, which renders the premise invalid. Here are some examples:

No warmup phase: Every method is executed in interpreted mode at first. The JVM counts how many times a method is invoked and requests JIT compilation once a threshold is exceeded. Loosely speaking, this happens after a method has been called 10,000 times (see an article about tiered compilation by Dr. Cliff Click for more details on how the process works and the Oracle documentation on HotSpot performance options). Consequently, we have to run the benchmarked code often enough before the actual measurement starts to ensure that all benchmarked code has been JIT-compiled beforehand. This can easily be verified by providing -XX:+PrintCompilation: you should not see any JIT-compiler activity after the warmup phase.
Benchmark code falls victim to dead code elimination: In certain circumstances the JIT compiler may detect that the benchmark computes nothing that is ever used and eliminate large parts of the benchmark code, or even all of it. The performance looks astonishing, but unfortunately the benchmark developer has just been fooled by the JIT compiler. Stuart Sierra provides an example of dead code elimination. Although I wasn't able to demonstrate the problem in Oracle JDK 8, it is theoretically possible that a smart JIT compiler detects that the measurement loop in FlawedSetMicroBenchmark (see my previous post) also does no real work and eliminates it.
Deoptimization: FlawedSetMicroBenchmark also suffers from so-called deoptimization. The microbenchmark measures the performance of all three benchmark candidates by passing an instance to the measurement method #doBenchmark(Set). When the method is invoked with a HashSet, the JIT compiler makes assumptions and optimizations based on the currently loaded classes. However, when it is then called with an instance of the newly loaded class TreeSet, it has to revoke some of those assumptions. By starting the microbenchmark with -XX:+PrintCompilation we can observe this behavior: The JIT compiler will output "made not entrant", which is HotSpot speak for "I'll deoptimize this method". Deoptimization can occur on different occasions; for more information please read this great guide about the output of -XX:+PrintCompilation. Deoptimization can skew results, as the JIT compiler might have been able to optimize more aggressively on the first benchmark invocation (see the slides about method calls in Christian Wirth on Performance Paradoxes). Deoptimization itself can be mitigated by a proper warmup phase and a close look at the output of -XX:+PrintCompilation. The effects of different optimizations for different benchmark candidates can be avoided by running the microbenchmark for each benchmark candidate separately.
False sharing: In multithreaded microbenchmarks, false sharing can severely affect measured performance. See my earlier blog post on false sharing where I describe the issue in more detail.
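
To make the first two items more concrete, here is a minimal sketch of a handwritten measurement loop that at least has a warmup phase and consumes its results. It is not the FlawedSetMicroBenchmark from the previous post: the class WarmupSketch, the methods fill and run, the field sink and the round counts are all made up for illustration, and this sketch is still vulnerable to the other flaws listed above.

```java
import java.util.HashSet;
import java.util.Set;

public class WarmupSketch {
    // Hypothetical sink: every result is written to a volatile field so the
    // JIT compiler cannot treat the measured work as unused (dead) code.
    static volatile int sink;

    // The code under test: adds `count` integers and returns the set size.
    static int fill(Set<Integer> set, int count) {
        for (int i = 0; i < count; i++) {
            set.add(i);
        }
        return set.size();
    }

    // Runs the workload `rounds` times and returns the elapsed nanoseconds.
    static long run(int rounds, int count) {
        long start = System.nanoTime();
        for (int r = 0; r < rounds; r++) {
            sink = fill(new HashSet<Integer>(), count);
        }
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        int rounds = 20_000; // assumed to be enough to trigger JIT compilation

        run(rounds, 100);               // warmup phase: timing is thrown away
        long nanos = run(rounds, 100);  // measurement phase

        System.out.println("ns per round: " + nanos / rounds);
    }
}
```

Running this with -XX:+PrintCompilation should show whether the compiler is still busy during the second call to run. If it is, the assumed number of warmup rounds was too low.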

These examples should demonstrate that a vast number of things can go wrong. However, even in the unlikely event that we mere mortals get a microbenchmark right and the premise is valid, the conclusion could still be wrong. Let's consider some examples (non-exhaustive):

Non-realistic memory usage: In a microbenchmark all relevant data might fit into the L1 cache, which gives one benchmark candidate an advantage, whereas in an application performance may suffer due to a larger working set.
Non-realistic usage of a component: Consider a situation where a microbenchmark is executed with one writer and multiple readers whereas the real application has multiple writers. It is very unlikely that the benchmark results are of any use to the application developers; worse, they might even be misleading.
Non-realistic inheritance hierarchy: It is quite common that classes in a Java program form an inheritance relationship. While you microbenchmark one specific implementation (i.e. subclass), a real program might use multiple subclasses. How can this possibly affect the validity of a microbenchmark? Well, the HotSpot JIT compiler is a pretty smart piece of software. As Java programmers we may decide that a method should be early-bound by declaring it final; otherwise it is late-bound. The latter involves an additional indirection at runtime (remember vtable lookups from C++?). Now suppose that for a given class "A" no subclasses are loaded when a non-final method of "A" is JIT-compiled. Calls to such a method are called "monomorphic" (in contrast to "polymorphic"). The HotSpot JIT compiler can detect monomorphic calls and compile them as direct method calls. This optimization is called monomorphic call transformation and has two consequences: First, it improves the performance of monomorphic calls, and second, which is even more important, it opens the gate for other optimizations such as method inlining. This, in turn, could lead to yet other optimizations, … you get the point. The problem is that all these optimizations might not be possible in a real program due to polymorphic calls. In short: The microbenchmark might not demonstrate a realistic performance profile and you are screwed.
Reliance on a specific environment: The JVM version, the OS and the hardware could differ between the microbenchmark and the application, as I've already described earlier.
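
The inheritance example can be sketched in a few lines. The classes A and B and the method sum are hypothetical and only illustrate the shape of the problem. The code contains a single call site, item.value(), that sees only one receiver type in the first call to sum and two in the second. Whether and how HotSpot optimizes either case depends on the JVM version and on what has been loaded and profiled so far, so this is an illustration of the setup, not a measurement.

```java
public class CallSiteSketch {
    // Hypothetical base class with a non-final (late-bound) method.
    static class A {
        int value() {
            return 1;
        }
    }

    // Hypothetical subclass; once instances of it reach the call site below,
    // the call can no longer be treated as monomorphic.
    static class B extends A {
        @Override
        int value() {
            return 2;
        }
    }

    static int sum(A[] items) {
        int total = 0;
        for (A item : items) {
            total += item.value(); // the call site of interest
        }
        return total;
    }

    public static void main(String[] args) {
        // What a microbenchmark of "A" typically exercises: one receiver type.
        A[] onlyA = { new A(), new A(), new A() };
        System.out.println(sum(onlyA)); // prints 3

        // What a real program may do: several receiver types at the same site.
        A[] mixed = { new A(), new B(), new A() };
        System.out.println(sum(mixed)); // prints 4
    }
}
```

A microbenchmark that only ever passes onlyA measures the first situation, while the application may live in the second one.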

What’s next?

In this article, we have seen that we can fail in a lot of ways when trying to measure the performance of a Java component with a microbenchmark. However, all hope is not lost. In the next part I'll introduce the Java Microbenchmark Harness. Although it does not prevent all issues, it goes to great lengths to eliminate a lot of them upfront.

Many thanks to @mmitterdorfer and @steve0392 for reading draft versions of this article.