Daniel Mitterdorfer

False Sharing

Normally, Java programmers are not too concerned about the hardware on which their beautiful software runs as long as provides loads of memory. Most of the time this is a good thing as software should solve a business problem rather than satisfying a machine. The JVM does a decent job hiding the underlying platform but as we know, abstractions are leaky. Sometimes we have to peek under hood, especially when we're concerned about performance. One such topic is false sharing, were a very performance-critical component might not perform as well as we'd expect.

Consider this class:

public final class X {
    public volatile int f1;
    public volatile int f2;
}

On Oracle JDK 1.8, instances of this class are laid out in memory as shown below:

Object Layout of sample class X

Note: I have determined the layout on OpenJDK 1.8.0-b132 (64-Bit) using JOL, see also my accompanying project on false sharing on Github.

We have declared all fields as volatile indicating that these fields may be used by different threads and we want to ensure writes are visible to all of them. As a result, the runtime will emit code to ensure that writes to a field are visible across CPU cores. But how does this work?

Caching on Modern CPUs in less than 100 words

CPUs don't care about classes or objects; they are concerned with reads and writes on memory cells. To efficiently operate on data, it is fetched from main memory into a CPU cache at the granularity of a cache line. A cache line is a block of memory in the CPU cache, say 64 bytes. If data in a cache line is modified, each core sends messages across the memory bus according to a well-defined cache coherence protocol (the most common one being the MESI protocol) so all cores can reconcile.

False Sharing

In this example I assume a hypothetical CPU with one cache layer (L1 cache) where each cache belongs to one core. What happens if one thread, running on Core0 is constantly writing to X.f1 (depicted in red below) and another thread on Core1 is constantly reading from f2 (depicted in green below)?

Cache coherence with false sharing

Core0 knows that it does not own the cache line "n" exclusively, so it has to broadcast an "Invalidate" message across the bus after changing the cache line to notify other cores that their caches are stale. Core1 is listening on the bus and invalidates the corresponding cache line. Consequently, this produces a lot of unnecessary bus traffic although both cores operate on different fields. This phenomenon is known as false sharing.

Countermeasures to False Sharing

This is a rather unsatisfying state of affairs: Although each thread operates on different fields, we produce a lot of unnecessary bus traffic which is ultimately caused by the memory layout of our Java objects. If we were able to influence the memory layout we can avoid false sharing. It turns out there are a few ways around this problem. A word of warning though: The following techniques depend on the JVM implementation or the underlying hardware so take my words with a grain of salt. The techniques are taken from the JMH example class JMHSample_22_FalseSharing and a presentation on @Contended which inspired me to write this blog post. I'll just describe three of them shortly here as the example class of JMH documents these and more techniques already very clearly:

To properly pad f1, annotate it with @Contended:

import sun.misc.Contended;

public final class X {
    @Contended
    public volatile int f1;
    public volatile int f2;
}

Which will result in the object layout below:

Object Layout of sample class X with @Contended

False Sharing vs. @Contended: A Microbenchmark

To demonstrate the effects of false sharing, I have written a small JMH microbenchmark which is available on Github. Below is the result of this benchmark comparing false sharing and the @Contended approach using three reader threads and one writer thread. It has been run on an Intel Core i7-2635QM with 4 physical cores.

Cache coherence with false sharing

While write throughput is roughly equivalent (mean throughput 373 ops/µs for false sharing and 371 ops/µs for the solution with @Contended), the mean read throughput is around three times higher with @Contended (2338 ops/µs) than with false sharing (813 ops/µs). This clearly demonstrates the influence of false sharing in this scenario.

However, don't worry: @Contended is definitely not meant to be sprinkled across the entire codebase. In fact, it should be used very, very sparingly and people wanting to use @Contended probably already use one of the previously-mentioned field padding approaches. In JDK 8, @Contended is just used in 5 classes (Thread, Striped64, ConcurrentHashMap, Exchanger and ForkJoinPool). There are limited circumstances where false sharing is really a problem and bottlenecks should be identified before-hand. But that may be the topic of another article some day...

Questions or comments?

Just ping me on Twitter


Microbenchmarking in Java with JMH: An Introduction

Fire Breathers

Image by Zach Dischner; license: CC

This is the first post in a series about microbenchmarking on the JVM with the Java Microbenchmarking Harness (JMH).

part 2: Microbenchmarks and their environment

part 3: Common Flaws of Handwritten Benchmarks

part 4: Hello JMH

part 5: Digging Deeper

In this post I'll introduce benchmarking conceptually and describe the specific flavor microbenchmarking.

What is Benchmarking?

When software engineers are concerned with the performance of a system, they can resort to a rich variety of practices, for example:

In this series of blog posts we take a deeper look at benchmarking. As software engineers, we are interested in answers to questions like:

These examples already expose some characteristics of software benchmarks:

This leads to two commonly known types of benchmarks:

You should now have a rough understanding of microbenchmarking and where this series of blog posts is headed. In the next post, we'll discuss common flaws of microbenchmarks on the JVM and find out why writing a correct one is not a piece of cake. Stay tuned.

Questions or comments?

Just ping me on Twitter


Using Perlock with Spring

Perlock - a simple Java path watching library I have written - is available in Maven Central since a few weeks now. I thought it would be nice to provide another demo application to show how it can be used with Spring. It is provided along with Perlock on Github in examples/perlock-spring-demo.

Demo Scenario

The demo application implements a XML file processing application. Clients put XML files in an input directory (in the example, it's just the /tmp directory), a MessageDispatcher listens for new XML files in the directory with Perlock, and dispatches the file processing to a MessageProcessor which simulates some work. Below is an overview of the involved classes:

Key Classes of the Spring Perlock demo application

The most interesting part of the demo application is MessageDispatcher (slightly simplified below):

public class MessageDispatcher extends AbstractPathChangeListener {
    private final Executor executor;

    public MessageDispatcher(Executor executor) {
        this.executor = executor;
    }

    @Override
    public void onPathCreated(Path path) {
        if (path.toString().endsWith("xml")) {
            executor.execute(new MessageProcessor(path));
        }
    }
}

It is notified by Perlock whenever a path is created (callback #onPathCreated(Path)) and decides whether a corresponding message needs to be processed. As message processing might by very time-consuming, the MessageDispatcher executes each MessageProcessor in an Executor. Thus, the MessageDispatcher returns very quickly allowing the application to stay responsive.

Bootstraping Perlock with Spring

The Perlock API provides three key API elements: We have seen the first one already: PathChangeListener, which listens for changes on a file system. It is always associated to one PathWatcher, which allows applications to control the path watching life cycle. A PathWatcher is created by a PathWatcherFactory. For details refer to the README on Github or my introductory post on Perlock. The Perlock-Spring integration consists of two FactoryBean implementations; one to create the PathWatcherFactory and another one to create PathWatcher instances:

Key classes of the Spring-Perlock Integration

This might look verbose at first glance, but keep in mind that an application should only create one instance of PathWatcherFactory but may create multiple instances of PathWatcher to watch different paths, so the separation of the FactoryBeans makes sense.

Trying the Spring Demo

To try the demo yourself, follow these steps:

  1. Clone the repo: git clone https://github.com/danielmitterdorfer/perlock.git
  2. Create a JAR with all dependencies: gradle fatJar
  3. Issue: java -jar examples/perlock-spring-demo/build/libs/perlock-spring-demo-0.2.0.jar

You'll now see this output:

TRACE 2014-02-19 21:04:15,298 [main] (PathWatcherFactory.java:114) - Submitting 'PathWatcher for '/tmp'' to executor service.
 INFO 2014-02-19 21:04:15,302 [main] (SpringPathWatcherDemo.java:29) - Press any key to stop the demo
TRACE 2014-02-19 21:04:15,302 [pathWatchExecutor-1] (PathWatcherFactory.java:140) - About to run 'PathWatcher for '/tmp''.
TRACE 2014-02-19 21:04:15,308 [pathWatchExecutor-1] (AbstractRegistrationStrategy.java:33) - Registering Path: '/tmp'
TRACE 2014-02-19 21:04:15,309 [pathWatchExecutor-1] (WatchServicePathWatcher.java:91) - Waiting for file system events

cd /tmp in a new terminal window and issue touch hello{1,2}.xml && sleep 2 && touch hello{3,4}.xml:

You'll now see output like:

TRACE 2014-02-19 21:07:01,425 [pathWatchExecutor-1] (WatchServicePathWatcher.java:124) - Received watch event with kind 'ENTRY_CREATE'
 INFO 2014-02-19 21:07:01,426 [pathWatchExecutor-1] (MessageDispatcher.java:31) - Starting handling path '/tmp/hello1.xml'
TRACE 2014-02-19 21:07:01,428 [pathWatchExecutor-1] (WatchServicePathWatcher.java:124) - Received watch event with kind 'ENTRY_CREATE'
TRACE 2014-02-19 21:07:01,428 [messageHandlingExecutor-1] (MessageProcessor.java:24) - Simulating work on '/tmp/hello1.xml'
 INFO 2014-02-19 21:07:01,428 [pathWatchExecutor-1] (MessageDispatcher.java:31) - Starting handling path '/tmp/hello2.xml'
TRACE 2014-02-19 21:07:01,428 [messageHandlingExecutor-1] (MessageProcessor.java:27) - Processing '/tmp/hello1.xml'. 5 seconds left...
TRACE 2014-02-19 21:07:01,429 [pathWatchExecutor-1] (WatchServicePathWatcher.java:124) - Received watch event with kind 'ENTRY_CREATE'
TRACE 2014-02-19 21:07:01,429 [messageHandlingExecutor-2] (MessageProcessor.java:24) - Simulating work on '/tmp/hello2.xml'
TRACE 2014-02-19 21:07:02,429 [messageHandlingExecutor-1] (MessageProcessor.java:27) - Processing '/tmp/hello1.xml'. 4 seconds left...
# ... more output here ...
TRACE 2014-02-19 21:07:05,438 [messageHandlingExecutor-3] (MessageProcessor.java:27) - Processing '/tmp/hello3.xml'. 1 seconds left...
TRACE 2014-02-19 21:07:06,433 [messageHandlingExecutor-1] (MessageProcessor.java:34) - Done. Cleaning up... '/tmp/hello1.xml'
TRACE 2014-02-19 21:07:06,434 [messageHandlingExecutor-2] (MessageProcessor.java:34) - Done. Cleaning up... '/tmp/hello2.xml'
TRACE 2014-02-19 21:07:06,439 [messageHandlingExecutor-3] (MessageProcessor.java:34) - Done. Cleaning up... '/tmp/hello3.xml'
TRACE 2014-02-19 21:07:06,439 [messageHandlingExecutor-4] (MessageProcessor.java:34) - Done. Cleaning up... '/tmp/hello4.xml'

Note how all MessageProcessors run interleaved in their own thread. This allows the application to receive new events while still processing older messages.

What have we done?

The takeaways of this post are:

If you need path watching in a Spring-based application go and get the new Perlock-Spring integration! In Gradle it is as easy as:

dependencies {
    compile group: 'name.mitterdorfer.perlock', name: 'perlock-spring', version: '0.2.0'
}

or alternatively via Maven.

<dependency>
    <groupId>name.mitterdorfer.perlock</groupId>
    <artifactId>perlock-spring</artifactId>
    <version>0.2.0</version>
</dependency>

Questions or comments?

Just ping me on Twitter


Perlock - Path Watching without Headaches

Sherlock

Image by Sandra; license: CC

As I have written in my previous post the JDK 7 WatchService API is too low level to be used directly in an application. To my surprise, to this day no Java library has existed that abstracts the JDK 7 WatchService API and provides an easy high level API on top.

A Simplified API: Perlock

Therefore I created perlock (short for Path-Sherlock). It supports the standard use case of watching files or directories for modifications and performing some action then. It consists of a very small API with two interfaces and two classes:

Suppose, you want to print a message every time a file is created in the path /tmp. First, implement the callback:

public class SamplePathLogger extends AbstractPathChangeListener {
    @Override
    public void onPathCreated(Path path) {
        System.out.println("Created: '" + path + "'");    
    }
}

Next, create a PathWatcher and start watching:

Path pathToWatch = Paths.get("/tmp");
ExecutorService executorService = Executors.newFixedThreadPool(1);
PathWatcherFactory pathWatchers = new PathWatcherFactory(executorService);
PathWatcher watcher = pathWatchers.createNonRecursiveWatcher(pathToWatch, new SamplePathLogger());
watcher.start();

That's it. When you are done, call watcher.stop() and shutdown the thread pool. You might wonder why the PathWatcherFactory needs an ExecutorService. Path watching is always performed in a background thread (one per PathWatcher instance) otherwise path watching would block your entire program.

Using Perlock

If you want to try perlock, clone the Github repo, try the example application or build the binary yourself with Gradle.

Questions or comments?

Just ping me on Twitter


Watching File Changes with the JDK 7 WatchService

Have you ever needed to watch the file system for changes in a Java application? Java 7 ships with the WatchService API that is suited exactly for this use case as you might know. My journey with the WatchService API began a few months ago when I stumbled across this code I have originally written a few years ago:

//TODO: Update to JDK 7 and replace with native watcher
FileSystemManager fsManager = VFS.getManager();
watchedDirectory = fsManager.resolveFile(queueDirectory.getAbsolutePath());

DefaultFileMonitor fileMonitor = new DefaultFileMonitor(this);
fileMonitor.setRecursive(false);
fileMonitor.addFile(watchedDirectory);

In the snippet above I setup a Apache Commons VFS DefaultFileMonitor which checks for modification within the queueDirectory. The code is part of a Dropbox-based data synchronization I use for myself. Although there is nothing wrong with Apache Commons VFS, I use only a small portion of a quite large library and wanted to reduce the application's footprint by switching to native JDK facilities. Furthermore, Apache Commons VFS relies on polling whereas the Java 7 WatchService can use native kernel facilities if they are available on the platform. As Java 7 is available for quite some time, I decided to finally have a look at the WatchService API. I was not amused.

The Pain

As you can see above, the DefaultFileMonitor in Apache Commons VFS has a very straightforward API: You register for a directory to watch and receive change events afterwards. However, a look at the Javadoc of WatchService shows that Oracle has chosen a totally different approach. The code snippet below demonstrates how to watch for changes in a directory with the WatchService API. I simplified the example to its bare essentials: the code contains no precondition checks and no exception handling. Do not use this code for anything else than playing around with the API, I have warned you:

import java.io.File;
import java.nio.file.*;
import static java.nio.file.StandardWatchEventKinds.*;

public class BasicPathWatcher {
    public static void main(String[] args) throws Exception {
        //the path to watch; we assume no recursive watching (would be even more complicated)
        Path pathToWatch = Paths.get(args[0]);
        WatchService ws = pathToWatch.getFileSystem().newWatchService(); // 1.
        // we do not care about the returned WatchKey in this simple example
        pathToWatch.register(ws, ENTRY_CREATE, ENTRY_MODIFY, ENTRY_DELETE); //2.
        boolean valid = true;
        while(valid) {
            WatchKey key = ws.take(); //3.
            for (WatchEvent &lt;?&gt; e: key.pollEvents()) { //4.
                WatchEvent.Kind&lt;?&gt; kind = e.kind();
                if (kind != OVERFLOW) {
                    WatchEvent&lt;Path&gt; event = (WatchEvent&lt;Path&gt;)e;
                    Path filename = event.context();
                    Path child = pathToWatch.resolve(filename);
                    System.out.println("Got event '" + kind + "' for path '" + child + "'");
                }
            }
            valid = key.reset(); //5.
        }
    }
}

Here is what the code does:

  1. Get the right WatchService. Many code examples get this step wrong by always obtaining the WatchService from the "default" file system. If the provided path is not on the "default" file system, the wrong WatchService is obtained. The correct way is to use Path#getFileSystem()#newWatchService() as shown in the example.
  2. Register a Path at the WatchService. You'll then get a WatchKey, which is a handle for events that you'll receive later on.
  3. Next, call WatchService#take() in a loop to get notified on file system changes. This call blocks until an event occurs.
  4. The call returns a WatchKey, which you then have to poll for events.
  5. Now you can handle the events and finally reset the WatchKey (or you won't get any further events). If you want to watch recursively and a directory has been created, you have to register the new directory again with the watch service.

Even for the simplest case, these are a lot of details that you have to get right in order to use the API correctly. A more robust implementation is as least twice as long.

There is a reason why the API is designed this way. Oracle engineer Alan Bateman gave an introductory talk on the WatchService API at Devoxx 2011 where he also described the API design (starts at around 45 minutes into the video).

This part of the API is deliberately a low-level interface

Alan Bateman at Devoxx 2011

In the video, he describes that the key decision was to design the interface very low-level to allow for many use-cases such as server applications or graphical applications.

Others must experience this pain too...

Although the API is hard to use, I still wanted to integrate it albeit not directly in the application. So I searched for libraries that abstract all the gory details. To my surprise I couldn't find a single one. These are the file watching libraries in Java I am aware of:

There also exist a few Scala libraries but nobody seems to provide an abstraction library in plain Java.

Conclusion

I refuse to use such a complex low-level API directly in my application. Since Oracle designed the WatchService API as a low level API I would have expected that someone has written a library by now that abstracts the details. It's time to do something about that ...

Questions or comments?

Just ping me on Twitter