Daniel Mitterdorfer

False Sharing

Normally, Java programmers are not too concerned about the hardware their beautiful software runs on, as long as it provides loads of memory. Most of the time this is a good thing, as software should solve a business problem rather than satisfy a machine. The JVM does a decent job of hiding the underlying platform, but as we know, abstractions are leaky. Sometimes we have to peek under the hood, especially when we are concerned about performance. One such topic is false sharing, where a very performance-critical component might not perform as well as we'd expect.

Consider this class:

public final class X {
    public volatile int f1;
    public volatile int f2;
}

On Oracle JDK 1.8, instances of this class are laid out in memory as shown below:

Object Layout of sample class X

Note: I have determined the layout on OpenJDK 1.8.0-b132 (64-bit) using JOL; see also my accompanying project on false sharing on GitHub.

We have declared all fields as volatile, indicating that these fields may be used by different threads and that we want to ensure writes are visible to all of them. As a result, the runtime will emit code that ensures writes to a field are visible across CPU cores. But how does this work?
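As a minimal sketch of that visibility guarantee (the class and method names here are mine, not from the original project): a writer thread publishes a value through a volatile flag, and a reader spins on the flag until the write becomes visible. The Java memory model's happens-before rule guarantees that once the reader observes the volatile flag as true, it also sees the plain write that preceded it.

```java
// Hedged sketch (class/method names are mine): a writer thread publishes a
// value via a volatile flag; the reader spins until the volatile write is
// visible. The happens-before rule guarantees it then also sees payload.
public final class VolatileVisibility {
    static volatile boolean ready = false;
    static int payload = 0;

    static int publishAndRead() {
        Thread writer = new Thread(() -> {
            payload = 42;  // plain write, ordered before the volatile write below
            ready = true;  // volatile write: publishes payload to other threads
        });
        writer.start();
        while (!ready) { } // volatile read: spin until the write becomes visible
        return payload;    // guaranteed to be 42 once ready is observed as true
    }

    public static void main(String[] args) {
        System.out.println(publishAndRead());
    }
}
```

Without volatile, neither the spin loop's termination nor the value of payload would be guaranteed.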

Caching on Modern CPUs in less than 100 words

CPUs don’t care about classes or objects; they are concerned with reads and writes on memory cells. To operate on data efficiently, it is fetched from main memory into a CPU cache at the granularity of a cache line. A cache line is a block of memory in the CPU cache, say 64 bytes. If data in a cache line is modified, the cores exchange messages across the memory bus according to a well-defined cache coherence protocol (the most common one being the MESI protocol) so all cores can reconcile their view of memory.
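To make the granularity concrete, here is a toy model (the 64-byte line size and the field offsets are assumptions; real values depend on the CPU and the JVM): an address maps to a cache line by integer division, so two fields that sit only a few bytes apart land on the same line.

```java
// Toy model of cache-line granularity. The 64-byte line size and the field
// offsets (12 and 16, as a tool like JOL might report for class X on a
// 64-bit JVM) are assumptions, not measured values.
public final class CacheLines {
    static final long LINE_SIZE = 64; // assumed cache line size in bytes

    static long lineIndex(long address) {
        return address / LINE_SIZE; // which cache line this address falls into
    }

    public static void main(String[] args) {
        long f1Offset = 12; // assumed offset of X.f1 within the object
        long f2Offset = 16; // assumed offset of X.f2, only 4 bytes later
        // Both offsets map to line 0: the two fields share one cache line.
        System.out.println(lineIndex(f1Offset) == lineIndex(f2Offset));
    }
}
```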

False Sharing

In this example I assume a hypothetical CPU with one cache layer (L1 cache) where each cache belongs to one core. What happens if one thread, running on Core0, is constantly writing to X.f1 (depicted in red below) and another thread on Core1 is constantly reading from X.f2 (depicted in green below)?

Cache coherence with false sharing

Core0 knows that it does not own the cache line “n” exclusively, so it has to broadcast an “Invalidate” message across the bus after changing the cache line to notify other cores that their caches are stale. Core1 is listening on the bus and invalidates the corresponding cache line. Consequently, this produces a lot of unnecessary bus traffic although both cores operate on different fields. This phenomenon is known as false sharing.
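The access pattern described above can be sketched in plain Java (the class and method names are mine): one thread writes f1 in a loop while another reads f2. Note that the results are exact either way; false sharing costs throughput, not correctness, which is why it is easy to miss without a benchmark.

```java
// Sketch of the access pattern that triggers false sharing (names are mine):
// one thread keeps writing f1 while another keeps reading f2. Both fields
// live in the same object and, without padding, typically on the same cache
// line, so the writer's "Invalidate" messages slow down the reader, too.
public final class FalseSharingPattern {
    public volatile int f1;
    public volatile int f2;

    public static void run(FalseSharingPattern x, int iterations) {
        Thread writer = new Thread(() -> {
            for (int i = 0; i < iterations; i++) {
                x.f1++; // single writer, so the increment count stays exact
            }
        });
        Thread reader = new Thread(() -> {
            int sum = 0;
            for (int i = 0; i < iterations; i++) {
                sum += x.f2; // reads remain correct; only throughput suffers
            }
        });
        writer.start();
        reader.start();
        try {
            writer.join();
            reader.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```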

Countermeasures to False Sharing

This is a rather unsatisfying state of affairs: although each thread operates on a different field, we produce a lot of unnecessary bus traffic, which is ultimately caused by the memory layout of our Java objects. If we can influence the memory layout, we can avoid false sharing. It turns out there are a few ways around this problem. A word of warning though: the following techniques depend on the JVM implementation or the underlying hardware, so take my words with a grain of salt. The techniques are taken from the JMH example class JMHSample_22_FalseSharing and a presentation on @Contended which inspired me to write this blog post. I’ll describe three of them briefly here, as the JMH example class already documents these and more techniques very clearly:

  • Field padding within a class: The idea is to stuff enough fields between f1 and f2 so that they end up on different cache lines. Depending on the JVM implementation this may not work, as a JVM is free to lay out fields in memory as it sees fit.
  • Field padding across the class hierarchy: A JVM implementation might cleverly rearrange fields, so in-class field padding can fail. However, if f1 and the padding fields are placed in a dedicated class and f2 in a subclass of it, currently no JVM implementation will rearrange the padding fields across the class hierarchy. This is quite quirky; fortunately, there is a better solution in Java 8.
  • @Contended: @Contended was introduced in Java 8 with JEP-142. With this annotation, fields can be declared as contended. The current OpenJDK implementation will then pad the field appropriately, inserting 128 bytes of padding after each annotated field. 128 bytes is twice the typical cache line size and was chosen to account for cache prefetching algorithms, specifically algorithms which fetch two adjacent cache lines. Consideration of @Contended by the JVM must be explicitly enabled with -XX:-RestrictContended. As it is likely that cache line sizes or cache prefetching algorithms will change over time, the JVM flag -XX:ContendedPaddingWidth (previously known as -XX:FieldPaddingWidth) allows controlling the padding size. If this all sounds a bit scary to you, you are not alone. In the introductory post on the OpenJDK mailing list the annotation was controversially discussed. It is likely that you’ll almost never encounter the annotation in an application, but parts of the JDK such as ForkJoinPool already take advantage of it.
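The second technique, field padding across the class hierarchy, can be sketched as follows (the class names are mine; the seven padding longs assume a 64-byte cache line, in the spirit of the JMH sample):

```java
// Sketch of the "field padding across the class hierarchy" technique (class
// names are mine; seven padding longs assume a 64-byte cache line). f1 plus
// the padding in the superclass push f2, declared in the subclass, onto a
// different cache line, and current JVMs do not rearrange fields across the
// class hierarchy.
class PaddedF1 {
    public volatile int f1;
    // Padding fields: never read or written, they only occupy space so that
    // f2 ends up at least one cache line away from f1.
    long p1, p2, p3, p4, p5, p6, p7;
}

class PaddedX extends PaddedF1 {
    public volatile int f2;
}
```

The padding fields are long rather than int so that fewer of them are needed, and they are plain fields since they are never accessed.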

To properly pad f1, annotate it with @Contended:

import sun.misc.Contended;

public final class X {
    @Contended
    public volatile int f1;
    public volatile int f2;
}

Which will result in the object layout below:

Object Layout of sample class X with @Contended

False Sharing vs. @Contended: A Microbenchmark

To demonstrate the effects of false sharing, I have written a small JMH microbenchmark which is available on GitHub. Below is the result of this benchmark comparing false sharing and the @Contended approach using three reader threads and one writer thread. It was run on an Intel Core i7-2635QM with 4 physical cores.

Benchmark results: false sharing vs. @Contended

While write throughput is roughly equivalent (mean throughput 373 ops/µs for false sharing and 371 ops/µs for the solution with @Contended), the mean read throughput is around three times higher with @Contended (2338 ops/µs) than with false sharing (813 ops/µs). This clearly demonstrates the influence of false sharing in this scenario.

However, don’t worry: @Contended is definitely not meant to be sprinkled across the entire codebase. In fact, it should be used very, very sparingly, and people wanting to use @Contended probably already use one of the previously-mentioned field padding approaches. In JDK 8, @Contended is used in just 5 classes (Thread, Striped64, ConcurrentHashMap, Exchanger and ForkJoinPool). There are limited circumstances where false sharing is really a problem, and bottlenecks should be identified beforehand. But that may be the topic of another article some day…

Questions or comments?

Just ping me on Twitter