Daniel Mitterdorfer

Microbenchmarking in Java with JMH: Microbenchmarks and their environment

CPU Pins (image by Eduardo Diez Viñuela; license: CC)

This is the second post in a series about microbenchmarking on the JVM with the Java Microbenchmark Harness (JMH).

part 1: Microbenchmarking in Java with JMH: An Introduction

part 3: Common Flaws of Handwritten Benchmarks

part 4: Hello JMH

part 5: Digging Deeper

In the previous post I introduced microbenchmarking. In this post we’ll look at the problems that arise when writing microbenchmarks on the JVM and the precautions we should take.

Is it really that hard?

How hard can it be to write a microbenchmark? Just invoke the code that should be benchmarked, measure the elapsed time and we are done. Hold your horses, it’s not that easy. The class below - also available in the accompanying project for this series on GitHub - is a microbenchmark that attempts to determine the performance of Collection#add():

package name.mitterdorfer.benchmark.plain;

import java.util.*;
import java.util.concurrent.ConcurrentSkipListSet;

public class FlawedSetMicroBenchmark {
    private static final int MAX_ELEMENTS = 10_000_000;

    public static void main(String[] args) {
        List<? extends Set<Integer>> testees =
                Arrays.asList(
                        new HashSet<Integer>(),
                        new TreeSet<Integer>(),
                        new ConcurrentSkipListSet<Integer>());
        for (Set<Integer> testee : testees) {
            doBenchmark(testee);
        }
    }

    private static void doBenchmark(Set<Integer> testee) {
        long start = System.currentTimeMillis();
        for (int i = 0; i < MAX_ELEMENTS; i++) {
            testee.add(i);
        }
        long end = System.currentTimeMillis();
        System.out.printf("%s took %d ms%n", testee.getClass(), (end - start));
    }
}

This microbenchmark looks simple but it does not measure what we think it does. Just consider two apparent issues:

  1. The program will start in interpreted mode. After some time the runtime detects that #doBenchmark() is a so-called hot method and the Just-In-Time (JIT) compiler kicks in and compiles it. Thanks to a technique called on-stack replacement (OSR), the runtime can switch from interpreted to compiled mode in the middle of executing a method. As a consequence, the microbenchmark measures the performance of a mixture of both modes. Dr. Cliff Click wrote an in-depth article about on-stack replacement in which he notes that OSRed code may not be optimal from a performance perspective.
  2. On the first invocation of #doBenchmark(Set), the microbenchmark continuously inserts elements into a HashSet. Its internal array has to be resized repeatedly, which produces garbage. This in turn triggers the garbage collector and distorts the measurement results (a sketch of crude countermeasures for both issues follows this list).
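
The hypothetical helper method #doMitigatedBenchmark() below sketches such countermeasures; it is not part of the accompanying project and the number of warmup rounds is arbitrary. The idea is to run a few untimed warmup rounds so that the JIT compiler gets a chance to compile #doBenchmark() before measurement, and to presize the HashSet so that its internal array is not resized during the timed run. Even with these band-aids the benchmark is still far from correct, which is exactly why a benchmark harness is needed:

    // Hypothetical addition to FlawedSetMicroBenchmark; a sketch, not a correct benchmark.
    private static void doMitigatedBenchmark() {
        // Warmup: run the workload a few times and ignore the printed results so that
        // the JIT compiler gets a chance to compile #doBenchmark() before the timed run.
        for (int round = 0; round < 5; round++) {
            doBenchmark(new HashSet<Integer>());
        }
        // Presize the HashSet so its internal array is not resized during the timed run.
        // This reduces (but does not eliminate) garbage collection during measurement.
        int initialCapacity = (int) (MAX_ELEMENTS / 0.75f) + 1;
        doBenchmark(new HashSet<Integer>(initialCapacity));
    }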

To summarize, we have identified two common sources of problems in Java microbenchmarks: the JIT-compiler and the garbage collector. This microbenchmark contains a few more surprises that we’ll cover in a later article. Now, let’s take a step back and think first about the ideal environment for microbenchmarks.

Environmental Considerations

Before running any benchmark, we have to consider these goals:

  1. We want to stay as close as possible to the target system configuration, provided it is known in advance.
  2. We do not want to introduce any unintended bottlenecks due to the execution environment.

Therefore, we should think about various system components and their configuration. Of particular importance for microbenchmarks are:

  • CPU: A multithreaded microbenchmark should be run on a multicore machine. Otherwise many effects and problems, such as false sharing, may never show up in the microbenchmark.
  • Operating System: An application may exhibit very different performance characteristics on different operating systems. The same applies to the JVM, as it is the execution environment of Java programs: performance-wise, a system will also behave differently on different JVM implementations or even on different versions of the same JVM (the sketch after this list records exactly this kind of information).
  • System load: It is obviously not a good idea to encode videos while running a benchmark. Watch out for CPU or memory hogs and ensure that only essential processes are running when the benchmark is executed.
  • Energy Saving Settings: If a notebook runs on battery, the operating system can dynamically reduce clock frequency to save energy. This renders the benchmark results worthless.
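
A simple precaution that covers several of the points above is to record the environment next to every benchmark result so that numbers taken on different machines, operating systems or JVMs are never mixed up by accident. The minimal sketch below is my own illustration (the class name EnvironmentInfo is made up) and uses only standard system properties:

// Sketch: print basic environment information so it can be stored alongside benchmark results.
public class EnvironmentInfo {
    public static void main(String[] args) {
        System.out.printf("OS:   %s %s (%s)%n",
                System.getProperty("os.name"),
                System.getProperty("os.version"),
                System.getProperty("os.arch"));
        System.out.printf("JVM:  %s %s%n",
                System.getProperty("java.vm.name"),
                System.getProperty("java.vm.version"));
        System.out.printf("CPUs: %d logical cores%n",
                Runtime.getRuntime().availableProcessors());
    }
}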

We can conclude that we need to select hardware and system software carefully and configure it properly. Depending on the requirements, the same microbenchmark should even be run on different systems (hardware, OS, JVM) to get a grasp of the performance characteristics. These considerations affect all benchmarks, regardless of underlying technology. Microbenchmarks running on virtual machines face additional challenges.

The JVM’s Dynamic Nature

At a very high level, the JVM consists of three main components that work together: the runtime (including the interpreter), the garbage collector and the JIT compiler. Due to these components, we neither know in advance which machine code will be executed nor how exactly it behaves at runtime. Contrast this behavior with a regular C program, where optimization happens at compile time:

Comparison of optimizations that happen at compile time and optimizations in JITed code

Oracle’s HotSpot JVM applies a vast number of optimizations to Java code: an outdated page on the OpenJDK Wiki lists close to 70 optimization techniques, and I suspect HotSpot applies a lot more today. This makes it impossible to reason from the source code or the bytecode about which machine code is finally executed. For the curious, HotSpot provides a way to look at the assembly code generated by the JIT compiler. The easy part: add a disassembler library to the Java library path and provide the proper JVM flags: -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly -XX:PrintAssemblyOptions=intel (see also the OpenJDK documentation on PrintAssembly). The hard part: understanding the output. For x86 architectures, the instruction set references (mnemonics A-M, mnemonics N-Z) are a start.
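
For instance, assuming the hsdis disassembler library has been placed on the Java library path and the compiled classes of the accompanying project are on the classpath, the flawed benchmark from above could be started roughly like this to dump the generated assembly:

java -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly -XX:PrintAssemblyOptions=intel \
     name.mitterdorfer.benchmark.plain.FlawedSetMicroBenchmark

To sum up, the dynamic behavior of the JVM makes it even harder to write correct microbenchmarks.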

What’s next?

In this article, we have seen that a lot of factors influence the runtime behavior of microbenchmarks. In the next article, we’ll take a deep dive and look at specific flaws that can happen in handwritten microbenchmarks.

Questions or comments?

Just ping me on Twitter

Many thanks to @mmitterdorfer and @steve0392 for reading draft versions of this article.