Daniel Mitterdorfer

Microbenchmarking in Java with JMH: Microbenchmarks and their environment

CPU Pins

Image by Eduardo Diez Viñuela; license: CC

This is the second post in a series about microbenchmarking on the JVM with the Java Microbenchmark Harness (JMH).

part 1: Microbenchmarking in Java with JMH: An Introduction

part 3: Common Flaws of Handwritten Benchmarks

part 4: Hello JMH

part 5: Digging Deeper

In the previous post I introduced microbenchmarking. In this post we'll explore the problems that arise and the precautions that should be taken when writing microbenchmarks on the JVM.

Is it really that hard?

How hard can it be to write a microbenchmark? Just invoke the code that should be benchmarked, measure the elapsed time, and we are done. Hold your horses: it's not that easy. The class below - also available in the accompanying project for this series on GitHub - is a microbenchmark that attempts to determine the performance of Collection#add():

package name.mitterdorfer.benchmark.plain;

import java.util.*;
import java.util.concurrent.ConcurrentSkipListSet;

public class FlawedSetMicroBenchmark {
    private static final int MAX_ELEMENTS = 10_000_000;

    public static void main(String[] args) {
        List<? extends Set<Integer>> testees =
                Arrays.asList(
                        new HashSet<Integer>(),
                        new TreeSet<Integer>(),
                        new ConcurrentSkipListSet<Integer>());
        for (Set<Integer> testee : testees) {
            doBenchmark(testee);
        }
    }

    private static void doBenchmark(Set<Integer> testee) {
        long start = System.currentTimeMillis();
        for (int i = 0; i < MAX_ELEMENTS; i++) {
            testee.add(i);
        }
        long end = System.currentTimeMillis();
        System.out.printf("%s took %d ms%n", testee.getClass(), (end - start));
    }
}

This microbenchmark looks simple, but it does not measure what we think it does. Consider just two apparent issues:

  1. The program will start in interpreted mode. After some time the runtime detects that #doBenchmark() is a so-called hot method, and the Just-In-Time (JIT) compiler kicks in and compiles it. Thanks to a technique called on-stack replacement (OSR), the runtime can switch from interpreted to compiled mode in the middle of executing a method. Thus, the microbenchmark measures the performance of a mixture of both modes. Dr. Cliff Click wrote an in-depth article about on-stack replacement in which he notes that OSRed code may not be optimal from a performance perspective.
  2. On the first invocation of #doBenchmark(Set), the microbenchmark continuously inserts elements into a HashSet. Its internal array gets resized repeatedly, which produces garbage. This in turn triggers the garbage collector and distorts the measurement results.
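Two first-aid measures suggest themselves: warm up the benchmark so the JIT compiler has compiled the hot method before measurement starts, and presize the collection so resizing does not happen inside the measured loop. A rough sketch (the class name, warmup round count and element count are my own choices, and even with these fixes the benchmark is still flawed, as we'll see):

```java
import java.util.HashSet;
import java.util.Set;

public class WarmedUpSetMicroBenchmark {
    private static final int WARMUP_ROUNDS = 10;
    private static final int MAX_ELEMENTS = 1_000_000;

    public static void main(String[] args) {
        // Warm up: let the JIT compiler fully compile #run() before we
        // start the measurement, so we do not time interpreted code.
        for (int i = 0; i < WARMUP_ROUNDS; i++) {
            run();
        }
        long start = System.currentTimeMillis();
        int size = run();
        long end = System.currentTimeMillis();
        System.out.printf("HashSet: %d elements in %d ms%n", size, end - start);
    }

    static int run() {
        // Presize the set so the measured loop does not trigger resizing
        // (and hence garbage collection) in the middle of the measurement.
        Set<Integer> set = new HashSet<>(2 * MAX_ELEMENTS);
        for (int i = 0; i < MAX_ELEMENTS; i++) {
            set.add(i);
        }
        return set.size();
    }
}
```

Note that this only papers over the two issues above; it does nothing about the other surprises discussed in this series.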

To summarize, we have identified two common sources of problems in Java microbenchmarks: the JIT-compiler and the garbage collector. This microbenchmark contains a few more surprises that we'll cover in a later article. Now, let's take a step back and think first about the ideal environment for microbenchmarks.

Environmental Considerations

Before running any benchmark, we have to consider these goals:

  1. We want to stay as close as possible to the target system configuration, provided it is known in advance.
  2. We do not want to introduce any unintended bottlenecks due to the execution environment.

Therefore, we should think about the various system components - hardware, operating system and JVM - and their configuration, as all of them can influence microbenchmark results.

We can conclude that we need to select hardware and system software carefully and configure it properly. Depending on the requirements, the same microbenchmark should even be run on different systems (hardware, OS, JVM) to get a grasp of the performance characteristics. These considerations affect all benchmarks, regardless of underlying technology. Microbenchmarks running on virtual machines face additional challenges.

The JVM's Dynamic Nature

At a very high level, the JVM consists of three main components that work together: the runtime (including the interpreter), the garbage collector and the JIT compiler. Because of these components, we neither know in advance which machine code will be executed nor how exactly it behaves at runtime. Contrast this behavior with a regular C program:

Comparison of optimizations that happen at compile time and optimizations in JITed code

Oracle's HotSpot JVM applies a vast amount of optimizations to Java code: an outdated page on the OpenJDK Wiki lists close to 70 optimization techniques, and I suspect HotSpot applies a lot more today. This makes it impossible to reason about which machine code is finally executed based on the source code or the bytecode. For the curious, HotSpot provides a way to look at the assembly code generated by the JIT compiler. The easy part: add a disassembler library to the Java library path and provide the proper JVM flags: -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly -XX:PrintAssemblyOptions=intel (see also the OpenJDK documentation on PrintAssembly). The hard part: understanding the output. For x86 architectures, the instruction set references (mnemonics A-M, mnemonics N-Z) are a start. To sum up, the dynamic behavior of the JVM makes it even harder to write correct microbenchmarks.
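To see PrintAssembly in action, you need a method that actually gets JIT-compiled. Here is a minimal, hypothetical subject (class and method names are mine): a loop hot enough to cross the default compilation threshold:

```java
// Run with:
//   java -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly HotMethod
// (the hsdis disassembler library must be on the Java library path)
public class HotMethod {
    public static void main(String[] args) {
        long result = 0;
        // Invoke #sum() often enough to cross the JIT compilation
        // threshold so its generated assembly appears in the output.
        for (int i = 0; i < 100_000; i++) {
            result += sum(i % 100);
        }
        // Print the result so the JIT compiler cannot discard the loop
        // as dead code.
        System.out.println(result);
    }

    static long sum(int n) {
        long total = 0;
        for (int i = 0; i <= n; i++) {
            total += i;
        }
        return total;
    }
}
```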

What's next?

In this article, we have seen that a lot of factors influence the runtime behavior of microbenchmarks. In the next article, we'll take a deep dive and look at specific flaws that can happen in handwritten microbenchmarks.

Questions or comments?

Just ping me on Twitter

Many thanks to @mmitterdorfer and @steve0392 for reading draft versions of this article.


False Sharing

Normally, Java programmers are not too concerned about the hardware on which their beautiful software runs, as long as it provides loads of memory. Most of the time this is a good thing, as software should solve a business problem rather than satisfy a machine. The JVM does a decent job of hiding the underlying platform, but as we know, abstractions are leaky. Sometimes we have to peek under the hood, especially when we're concerned about performance. One such topic is false sharing, where a very performance-critical component might not perform as well as we'd expect.

Consider this class:

public final class X {
    public volatile int f1;
    public volatile int f2;
}

On Oracle JDK 1.8, instances of this class are laid out in memory as shown below:

Object Layout of sample class X

Note: I have determined the layout on OpenJDK 1.8.0-b132 (64-Bit) using JOL, see also my accompanying project on false sharing on Github.

We have declared all fields as volatile indicating that these fields may be used by different threads and we want to ensure writes are visible to all of them. As a result, the runtime will emit code to ensure that writes to a field are visible across CPU cores. But how does this work?

Caching on Modern CPUs in less than 100 words

CPUs don't care about classes or objects; they are concerned with reads and writes on memory cells. To operate on data efficiently, it is fetched from main memory into a CPU cache at the granularity of a cache line, a block of memory in the CPU cache of, say, 64 bytes. If data in a cache line is modified, the cores exchange messages across the memory bus according to a well-defined cache coherence protocol (the most common one being the MESI protocol) so all cores can reconcile their view of memory.

False Sharing

In this example I assume a hypothetical CPU with one cache layer (L1 cache), where each cache belongs to one core. What happens if one thread running on Core0 is constantly writing to X.f1 (depicted in red below) and another thread on Core1 is constantly reading from X.f2 (depicted in green below)?

Cache coherence with false sharing

Core0 knows that it does not own the cache line "n" exclusively, so it has to broadcast an "Invalidate" message across the bus after changing the cache line to notify other cores that their caches are stale. Core1 is listening on the bus and invalidates the corresponding cache line. Consequently, this produces a lot of unnecessary bus traffic although both cores operate on different fields. This phenomenon is known as false sharing.
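The scenario above can be sketched in plain Java. This is only an illustration of the access pattern, not a proper measurement; the field names are mine, and whether the two fields really end up on the same cache line depends on the JVM's layout decisions and the hardware:

```java
public class FalseSharingDemo {
    private static final long ITERATIONS = 10_000_000L;

    // Like f1 and f2 in class X, these two fields will typically land on
    // the same cache line - but the JVM makes no guarantee about layout.
    static volatile long f1;
    static volatile long f2;

    public static void main(String[] args) throws InterruptedException {
        // "Core0": constantly writes to f1, invalidating the cache line.
        Thread writer = new Thread(() -> {
            for (long i = 0; i < ITERATIONS; i++) {
                f1 = i;
            }
        });
        // "Core1": constantly reads f2; a volatile read cannot be hoisted
        // out of the loop, so every iteration touches the cache line.
        Thread reader = new Thread(() -> {
            long sink = 0;
            for (long i = 0; i < ITERATIONS; i++) {
                sink += f2;
            }
            System.out.println("reader done, sum: " + sink);
        });
        long start = System.nanoTime();
        writer.start();
        reader.start();
        writer.join();
        reader.join();
        System.out.printf("took %d ms%n", (System.nanoTime() - start) / 1_000_000);
    }
}
```

A proper comparison requires a benchmark harness; the JMH benchmark discussed below does exactly that.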

Countermeasures to False Sharing

This is a rather unsatisfying state of affairs: although each thread operates on a different field, we produce a lot of unnecessary bus traffic, which is ultimately caused by the memory layout of our Java objects. If we are able to influence the memory layout, we can avoid false sharing. It turns out there are a few ways around this problem. A word of warning though: the following techniques depend on the JVM implementation or the underlying hardware, so take my words with a grain of salt. The techniques are taken from the JMH example class JMHSample_22_FalseSharing and a presentation on @Contended, which inspired me to write this blog post. I'll describe just a few of them briefly here, as the JMH example class already documents these and more techniques very clearly:
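One classic countermeasure is manual field padding: declare enough unused fields between f1 and f2 that the two land on different cache lines. A sketch of the idea (the class and field names are mine; as noted above, this relies on unspecified JVM behavior):

```java
// Hypothetical manual padding; the JVM makes no layout guarantees, so
// this may or may not be effective on a given JVM version.
public final class PaddedX {
    public volatile int f1;
    // Seven unused long fields add 56 bytes between f1 and f2, enough to
    // push f2 onto a different 64-byte cache line - assuming the JVM
    // keeps declaration order and does not eliminate the unused fields.
    long p1, p2, p3, p4, p5, p6, p7;
    public volatile int f2;
}
```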

To properly pad f1, annotate it with @Contended:

import sun.misc.Contended;

public final class X {
    @Contended
    public volatile int f1;
    public volatile int f2;
}

Which will result in the object layout below:

Object Layout of sample class X with @Contended

False Sharing vs. @Contended: A Microbenchmark

To demonstrate the effects of false sharing, I have written a small JMH microbenchmark, which is available on GitHub. Below is the result of this benchmark comparing false sharing and the @Contended approach, using three reader threads and one writer thread. It was run on an Intel Core i7-2635QM with 4 physical cores.

Benchmark results comparing false sharing and @Contended

While write throughput is roughly equivalent (mean throughput 373 ops/µs for false sharing and 371 ops/µs for the solution with @Contended), the mean read throughput is around three times higher with @Contended (2338 ops/µs) than with false sharing (813 ops/µs). This clearly demonstrates the influence of false sharing in this scenario.

However, don't worry: @Contended is definitely not meant to be sprinkled across the entire codebase. In fact, it should be used very, very sparingly, and people who want to use @Contended probably already use one of the previously mentioned field padding approaches. In JDK 8, @Contended is used in just 5 classes (Thread, Striped64, ConcurrentHashMap, Exchanger and ForkJoinPool). False sharing is really a problem only in limited circumstances, and bottlenecks should be identified beforehand. But that may be the topic of another article some day...

Questions or comments?

Just ping me on Twitter


Microbenchmarking in Java with JMH: An Introduction

Fire Breathers

Image by Zach Dischner; license: CC

This is the first post in a series about microbenchmarking on the JVM with the Java Microbenchmark Harness (JMH).

part 2: Microbenchmarks and their environment

part 3: Common Flaws of Handwritten Benchmarks

part 4: Hello JMH

part 5: Digging Deeper

In this post I'll introduce benchmarking conceptually and describe its specific flavor, microbenchmarking.

What is Benchmarking?

When software engineers are concerned with the performance of a system, they can resort to a rich variety of practices; benchmarking is one of them.

In this series of blog posts we take a deeper look at benchmarking. As software engineers, we are interested in answers to concrete performance questions about our code.

Such questions already expose some characteristics of software benchmarks.

This leads to two commonly known types of benchmarks: macrobenchmarks, which measure the performance of a whole system, and microbenchmarks, which measure small, isolated units of code.

You should now have a rough understanding of microbenchmarking and where this series of blog posts is headed. In the next post, we'll discuss common flaws of microbenchmarks on the JVM and find out why writing a correct one is not a piece of cake. Stay tuned.

Questions or comments?

Just ping me on Twitter


Using Perlock with Spring

Perlock - a simple Java path watching library I have written - has been available in Maven Central for a few weeks now. I thought it would be nice to provide another demo application to show how it can be used with Spring. It is provided along with Perlock on GitHub in examples/perlock-spring-demo.

Demo Scenario

The demo application implements an XML file processing scenario. Clients put XML files into an input directory (in the example, it's just the /tmp directory); a MessageDispatcher listens for new XML files in that directory with Perlock and dispatches the file processing to a MessageProcessor, which simulates some work. Below is an overview of the involved classes:

Key Classes of the Spring Perlock demo application

The most interesting part of the demo application is MessageDispatcher (slightly simplified below):

public class MessageDispatcher extends AbstractPathChangeListener {
    private final Executor executor;

    public MessageDispatcher(Executor executor) {
        this.executor = executor;
    }

    @Override
    public void onPathCreated(Path path) {
        if (path.toString().endsWith("xml")) {
            executor.execute(new MessageProcessor(path));
        }
    }
}

It is notified by Perlock whenever a path is created (callback #onPathCreated(Path)) and decides whether a corresponding message needs to be processed. As message processing might be very time-consuming, the MessageDispatcher executes each MessageProcessor in an Executor. Thus, the MessageDispatcher returns very quickly, allowing the application to stay responsive.

Bootstrapping Perlock with Spring

The Perlock API provides three key elements. We have seen the first one already: PathChangeListener, which listens for changes on a file system. It is always associated with exactly one PathWatcher, which allows applications to control the path watching life cycle. A PathWatcher is created by a PathWatcherFactory. For details, refer to the README on GitHub or my introductory post on Perlock. The Perlock-Spring integration consists of two FactoryBean implementations: one to create the PathWatcherFactory and another one to create PathWatcher instances:

Key classes of the Spring-Perlock Integration

This might look verbose at first glance, but keep in mind that an application should create only one instance of PathWatcherFactory while it may create multiple instances of PathWatcher to watch different paths, so the separation into two FactoryBeans makes sense.

Trying the Spring Demo

To try the demo yourself, follow these steps:

  1. Clone the repo: git clone https://github.com/danielmitterdorfer/perlock.git
  2. Create a JAR with all dependencies: gradle fatJar
  3. Issue: java -jar examples/perlock-spring-demo/build/libs/perlock-spring-demo-0.2.0.jar

You'll now see this output:

TRACE 2014-02-19 21:04:15,298 [main] (PathWatcherFactory.java:114) - Submitting 'PathWatcher for '/tmp'' to executor service.
 INFO 2014-02-19 21:04:15,302 [main] (SpringPathWatcherDemo.java:29) - Press any key to stop the demo
TRACE 2014-02-19 21:04:15,302 [pathWatchExecutor-1] (PathWatcherFactory.java:140) - About to run 'PathWatcher for '/tmp''.
TRACE 2014-02-19 21:04:15,308 [pathWatchExecutor-1] (AbstractRegistrationStrategy.java:33) - Registering Path: '/tmp'
TRACE 2014-02-19 21:04:15,309 [pathWatchExecutor-1] (WatchServicePathWatcher.java:91) - Waiting for file system events

cd /tmp in a new terminal window and issue touch hello{1,2}.xml && sleep 2 && touch hello{3,4}.xml:

You'll now see output like:

TRACE 2014-02-19 21:07:01,425 [pathWatchExecutor-1] (WatchServicePathWatcher.java:124) - Received watch event with kind 'ENTRY_CREATE'
 INFO 2014-02-19 21:07:01,426 [pathWatchExecutor-1] (MessageDispatcher.java:31) - Starting handling path '/tmp/hello1.xml'
TRACE 2014-02-19 21:07:01,428 [pathWatchExecutor-1] (WatchServicePathWatcher.java:124) - Received watch event with kind 'ENTRY_CREATE'
TRACE 2014-02-19 21:07:01,428 [messageHandlingExecutor-1] (MessageProcessor.java:24) - Simulating work on '/tmp/hello1.xml'
 INFO 2014-02-19 21:07:01,428 [pathWatchExecutor-1] (MessageDispatcher.java:31) - Starting handling path '/tmp/hello2.xml'
TRACE 2014-02-19 21:07:01,428 [messageHandlingExecutor-1] (MessageProcessor.java:27) - Processing '/tmp/hello1.xml'. 5 seconds left...
TRACE 2014-02-19 21:07:01,429 [pathWatchExecutor-1] (WatchServicePathWatcher.java:124) - Received watch event with kind 'ENTRY_CREATE'
TRACE 2014-02-19 21:07:01,429 [messageHandlingExecutor-2] (MessageProcessor.java:24) - Simulating work on '/tmp/hello2.xml'
TRACE 2014-02-19 21:07:02,429 [messageHandlingExecutor-1] (MessageProcessor.java:27) - Processing '/tmp/hello1.xml'. 4 seconds left...
# ... more output here ...
TRACE 2014-02-19 21:07:05,438 [messageHandlingExecutor-3] (MessageProcessor.java:27) - Processing '/tmp/hello3.xml'. 1 seconds left...
TRACE 2014-02-19 21:07:06,433 [messageHandlingExecutor-1] (MessageProcessor.java:34) - Done. Cleaning up... '/tmp/hello1.xml'
TRACE 2014-02-19 21:07:06,434 [messageHandlingExecutor-2] (MessageProcessor.java:34) - Done. Cleaning up... '/tmp/hello2.xml'
TRACE 2014-02-19 21:07:06,439 [messageHandlingExecutor-3] (MessageProcessor.java:34) - Done. Cleaning up... '/tmp/hello3.xml'
TRACE 2014-02-19 21:07:06,439 [messageHandlingExecutor-4] (MessageProcessor.java:34) - Done. Cleaning up... '/tmp/hello4.xml'

Note how all MessageProcessors run interleaved in their own thread. This allows the application to receive new events while still processing older messages.

What have we done?

The takeaway of this post: with two small FactoryBean implementations, Perlock integrates neatly into a Spring application, and executing each MessageProcessor in an Executor keeps the application responsive while messages are processed.

If you need path watching in a Spring-based application, go and get the new Perlock-Spring integration! In Gradle it is as easy as:

dependencies {
    compile group: 'name.mitterdorfer.perlock', name: 'perlock-spring', version: '0.2.0'
}

or alternatively via Maven:

<dependency>
    <groupId>name.mitterdorfer.perlock</groupId>
    <artifactId>perlock-spring</artifactId>
    <version>0.2.0</version>
</dependency>

Questions or comments?

Just ping me on Twitter


Perlock - Path Watching without Headaches

Sherlock

Image by Sandra; license: CC

As I have written in my previous post, the JDK 7 WatchService API is too low-level to be used directly in an application. To my surprise, to this day no Java library exists that abstracts the JDK 7 WatchService API and provides an easy, high-level API on top.

A Simplified API: Perlock

Therefore I created Perlock (short for Path-Sherlock). It supports the standard use case of watching files or directories for modifications and then performing some action. It consists of a very small API with two interfaces and two classes, built around PathChangeListener (with the convenience base class AbstractPathChangeListener), PathWatcher and PathWatcherFactory.

Suppose you want to print a message every time a file is created in the path /tmp. First, implement the callback:

public class SamplePathLogger extends AbstractPathChangeListener {
    @Override
    public void onPathCreated(Path path) {
        System.out.println("Created: '" + path + "'");    
    }
}

Next, create a PathWatcher and start watching:

Path pathToWatch = Paths.get("/tmp");
ExecutorService executorService = Executors.newFixedThreadPool(1);
PathWatcherFactory pathWatchers = new PathWatcherFactory(executorService);
PathWatcher watcher = pathWatchers.createNonRecursiveWatcher(pathToWatch, new SamplePathLogger());
watcher.start();

That's it. When you are done, call watcher.stop() and shut down the thread pool. You might wonder why the PathWatcherFactory needs an ExecutorService: path watching is always performed in a background thread (one per PathWatcher instance); otherwise it would block your entire program.

Using Perlock

If you want to try Perlock, clone the GitHub repo, try the example application or build the binary yourself with Gradle.

Questions or comments?

Just ping me on Twitter