Daniel Mitterdorfer

Cheap Read-Write Lock Explained

I have recently stumbled across a nice idiom called the cheap read-write lock. It is intended for very frequent concurrent reads of a field where synchronization would lead to too much contention. I think that the idiom is a bit odd and begs for an explanation.

Usage Scenario

For the sake of demonstration consider the following situation: Our system has a globally available calendar that holds the current day within the year. Every part of the system may read the current day. Once a day, a background thread will update it. Note that this scenario might not justify using the idiom in practice, but the example is sufficient for illustrating it.

@NotThreadSafe
public final class SystemCalendar {
  // may be a value between 0 and 364
  private int currentDayOfYear;
  
  public int getCurrentDayOfYear() {
    return currentDayOfYear;
  }
  
  public void advanceToNextDay() {
    // let's ignore leap years for this example...
    this.currentDayOfYear = (this.currentDayOfYear + 1) % 365;
  }
}
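That looks innocent enough, but consider a hypothetical usage sketch (illustrative only; whether it misbehaves depends on the JIT compiler and the hardware):

SystemCalendar calendar = new SystemCalendar();
new Thread(() -> {
  // nothing guarantees that this thread ever sees the writer's update;
  // the runtime may even hoist the read out of the loop
  while (calendar.getCurrentDayOfYear() == 0) {
    // busy-wait for the next day
  }
  System.out.println("a new day has begun");
}).start();
calendar.advanceToNextDay(); // may never become visible to the reader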

The reader may indeed spin forever: this code is not thread safe. To demonstrate why, follow me on a short detour. The picture below illustrates what could happen to currentDayOfYear on a hypothetical and very simple multicore system. We assume that the updating thread always runs on CPU1 and reader threads run on CPU2. On initial access, the value of currentDayOfYear is read from main memory and cached on each CPU separately. The yellow circle indicates the value of currentDayOfYear as it is currently known to each part of the system.

Memory Hierarchy with missing happens-before order

As you can see, the value has been updated on CPU1, but it has never been reconciled with main memory and thus never reached CPU2. How can this happen? Java 5 brought us JSR 133, better known as the Java Memory Model, which defines a happens‑before order between memory accesses. I won't go into the details here, but we failed to establish a happens‑before order between writes and reads of currentDayOfYear, which allows the runtime to cache its value on each processor separately. Let's fix it:

@ThreadSafe
public final class SystemCalendar {
  private int currentDayOfYear;
  
  public synchronized int getCurrentDayOfYear() {
    return currentDayOfYear;
  }
  
  public synchronized void advanceToNextDay() {
    this.currentDayOfYear = (this.currentDayOfYear + 1) % 365;
  }
}

By marking both methods as synchronized, we have established a happens‑before order. Writes will now be flushed to main memory, and the value will be fetched from main memory before reads. (This is not entirely correct and the reality is more complicated, but this mental model is sufficient for understanding the idiom. If you are interested in the details, have a look at the MESI protocol.)

Memory Hierarchy with established happens-before order

Problem solved? You bet! But let's suppose that there are thousands of concurrent readers of currentDayOfYear. As a result, #getCurrentDayOfYear() will suffer from heavy contention. Each reader has to obtain the monitor on SystemCalendar.this just to ensure it gets the most recent value of currentDayOfYear.

Removing the synchronization on #getCurrentDayOfYear() in this situation is fatal, as the class would no longer be thread safe. JLS §17.4.4 states: "An unlock action on monitor m synchronizes‑with all subsequent lock actions on m [...]" (for our purposes, you can substitute synchronizes‑with with happens‑before). The "subsequent lock action" we want is exactly the acquisition of the monitor lock by #getCurrentDayOfYear() after #advanceToNextDay() has been called. Only then have we established a happens‑before order and are thus able to see the new value. Too bad: it seems we are stuck with the lock.

It turns out we can relax the constraint by declaring currentDayOfYear as volatile instead of marking #getCurrentDayOfYear() as synchronized. That's essentially the cheap read-write lock:

@ThreadSafe
public final class SystemCalendar {
  private volatile int currentDayOfYear;
  
  public int getCurrentDayOfYear() {
    return currentDayOfYear;
  }
  
  public synchronized void advanceToNextDay() {
    this.currentDayOfYear = (this.currentDayOfYear + 1) % 365;
  }
}

Why does this work? Both synchronized and volatile establish a happens‑before order, but only synchronized guarantees that the guarded block is executed atomically. The setter performs multiple operations (a read, an addition, a modulo and a write) that have to appear atomic, so synchronizing the method is the correct choice. The getter, however, performs only one action, namely a read of the field, which is atomic as guaranteed by the JLS. This allows us to use volatile instead of synchronized to establish a happens‑before order.
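To see why the writer keeps its lock, consider a hypothetical variant in which advanceToNextDay() is not synchronized (a sketch for illustration only):

// BROKEN with more than one writer: both could read the same old value
// and one update would be lost, because this read-modify-write sequence
// is not atomic even on a volatile field
public void advanceToNextDay() {
  this.currentDayOfYear = (this.currentDayOfYear + 1) % 365;
}

In our scenario a single background thread is the only writer, so this would happen to work, but the synchronized setter keeps the class correct even if that assumption ever changes.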

When to use the cheap read-write lock?

This idiom may be applied when reads far outweigh writes and synchronization would lead to too much contention. You should not just assume that there is a bottleneck but actually measure it before and after applying this idiom.

When to avoid the cheap read-write lock?

In short: almost always. In the original article, Brian Goetz warns firmly that you should not use this idiom lightly. First, you should empirically demonstrate that there is a performance problem and using the cheap read-write lock eliminates it. The code should be properly encapsulated and be documented very well. You must not use the cheap read-write lock if you deviate from the standard use case above. For example, the idiom is obviously insufficient if your getter consists of multiple statements.
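To illustrate that last point, consider a hypothetical addition to our calendar (a sketch, not part of the original example):

public boolean isBetween(int firstDay, int lastDay) {
  // two separate volatile reads: the writer could advance the day in
  // between, so the two comparisons may observe different days
  return currentDayOfYear >= firstDay && currentDayOfYear <= lastDay;
}

Each individual read is visible and atomic, but volatile cannot make a sequence of reads atomic as a whole; for that, the getter needs a lock.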

Alternatives

The safer alternative in most similar situations that occur in practice will be ReadWriteLock, which is provided by the JDK since Java 5:

@ThreadSafe
public final class SystemCalendar {
  private final ReadWriteLock lock = new ReentrantReadWriteLock();
  private final Lock readLock = lock.readLock();
  private final Lock writeLock = lock.writeLock();
  
  private int currentDayOfYear;
  
  public int getCurrentDayOfYear() {
    readLock.lock();
    try {
      return currentDayOfYear;
    } finally {
      readLock.unlock();
    }
  }
  
  public void advanceToNextDay() {
    writeLock.lock();
    // The try-finally block is not strictly necessary in this case,
    // but it is the common idiom when using explicit locks and you
    // almost always want to use it.
    try {
      this.currentDayOfYear = (this.currentDayOfYear + 1) % 365;
    } finally {
      writeLock.unlock();
    }
  }
}

Since Java 8, you can also use StampedLock, which additionally offers optimistic reads.
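A minimal sketch of the calendar on top of StampedLock's optimistic reads (for a single int field this is overkill, but it illustrates the pattern, which also generalizes to multi-field state):

import java.util.concurrent.locks.StampedLock;

@ThreadSafe
public final class StampedSystemCalendar {
  private final StampedLock lock = new StampedLock();
  private int currentDayOfYear;
  
  public int getCurrentDayOfYear() {
    long stamp = lock.tryOptimisticRead();
    int day = currentDayOfYear;
    if (!lock.validate(stamp)) {
      // a write happened in the meantime; fall back to a full read lock
      stamp = lock.readLock();
      try {
        day = currentDayOfYear;
      } finally {
        lock.unlockRead(stamp);
      }
    }
    return day;
  }
  
  public void advanceToNextDay() {
    long stamp = lock.writeLock();
    try {
      this.currentDayOfYear = (this.currentDayOfYear + 1) % 365;
    } finally {
      lock.unlockWrite(stamp);
    }
  }
}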

Conclusion

With careful thought and a very good understanding of the runtime behavior of our system, we can reduce the cost of synchronized access on highly contended, read-heavy code paths. Which technique is sufficient to solve a performance problem is best determined by profiling your system intensively in different usage scenarios. While the cheap read-write lock is certainly not something you'll pull out every day, it is a nice technique for your concurrency toolbox.

If you have any questions or comments just ping me on Twitter.


Microbenchmarking in Java with JMH: Hello JMH

This is the fourth post in a series about microbenchmarking on the JVM with the Java Microbenchmarking Harness (JMH).

part 1: Microbenchmarking in Java with JMH: An Introduction

part 2: Microbenchmarks and their environment

part 3: Common Flaws of Handwritten Benchmarks

part 5: Digging Deeper

In the previous post I have shown different problems that can easily slip into microbenchmarks written from scratch. In this blog post I'll introduce JMH and show how it helps us to avoid these problems.

Java Microbenchmarking Harness: Hello World Walkthrough

By now you might think there is no way to write a correct microbenchmark on the JVM without being an engineer working on HotSpot. Fortunately, some people on the OpenJDK team, most prominently Aleksey Shipilёv, have written the Java Microbenchmarking Harness, or JMH for short. JMH takes all sorts of countermeasures to eliminate or reduce the problems I have described earlier and helps you concentrate on writing the microbenchmark instead of satisfying the JVM. To get a grasp of JMH's approach to microbenchmarking, let's write a hello world benchmark, which you can also find in the accompanying project on Github:

package name.mitterdorfer.benchmark.jmh;

import org.openjdk.jmh.annotations.Benchmark;

public class HelloJMHMicroBenchmark {
    @Benchmark
    public void benchmarkRuntimeOverhead() {
        //intentionally left blank
    }
}

A JMH microbenchmark is a plain Java class. Each microbenchmark is implemented as a method that is annotated with @Benchmark (in earlier versions of JMH the annotation was called @GenerateMicroBenchmark). But how do we run it? Before we can run the microbenchmark, we have some work to do. To see why, let's have a look at the basic workflow with JMH:

The basic JMH workflow

This workflow might strike you as a bit odd at first. Why is JMH generating code? Why do we have to create a shaded JAR? Wouldn't it be easier to run a microbenchmark just like a JUnit test? Let's go through this process step by step.

We have already completed the first step by annotating a method with @Benchmark. The second step is carried out when the microbenchmark class is compiled: JMH implements multiple annotation processors that generate the final microbenchmark class. This generated class contains setup and measurement code as well as code that's required to minimize unwanted optimizations of the JIT compiler in the microbenchmark. The generated class for name.mitterdorfer.benchmark.jmh.HelloJMHMicroBenchmark is name.mitterdorfer.benchmark.jmh.generated.HelloJMHMicroBenchmark_benchmarkRuntimeOverhead and can be found in the corresponding .java file below build/classes/main if you're curious. As you can see, JMH generates one class per method that is annotated with @Benchmark, but that is transparent to JMH users.

JMH contains a Runner class, somewhat similar to JUnit's, so it is possible to run microbenchmarks embedded via the JMH Java API (a sketch follows below). JMH also allows multiple microbenchmark classes in the same JAR and can run all of them in the same microbenchmarking run.
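As a hedged sketch, running our hello world benchmark embedded could look like this (the runner class is made up for this example; the options mirror JMH's defaults shown in the output below):

package name.mitterdorfer.benchmark.jmh;

import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

public class HelloJMHRunner {
    public static void main(String[] args) throws RunnerException {
        Options options = new OptionsBuilder()
                // run all benchmarks whose class name matches this pattern
                .include(HelloJMHMicroBenchmark.class.getSimpleName())
                .forks(10)
                .warmupIterations(20)
                .measurementIterations(20)
                .build();
        new Runner(options).run();
    }
}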

However, let's use the JAR-based workflow for now. To run the microbenchmark we will tackle step three and create a shaded JAR. I'll use Gradle for that as I prefer it over Maven; if you want or have to use Maven, just look at the JMH example POM or at the sample POM of my benchmarking project. Just type gradle shadow to create the shaded JAR, i.e. a single JAR that contains your microbenchmark and all of its dependencies. When you type java -jar build/libs/benchmarking-experiments-0.1.0-all.jar, JMH runs the microbenchmarks contained in the JAR and prints something similar to this:

# Run progress: 0,00% complete, ETA 00:06:40
# Warmup: 20 iterations, 1 s each
# Measurement: 20 iterations, 1 s each
# Threads: 1 thread, will synchronize iterations
# Benchmark mode: Throughput, ops/time
# Benchmark: name.mitterdorfer.benchmark.jmh.HelloJMHMicroBenchmark.benchmarkRuntimeOverhead
# VM invoker: /Library/Java/JavaVirtualMachines/jdk1.7.0_10.jdk/Contents/Home/jre/bin/java
# VM options: -Dfile.encoding=UTF8
# Fork: 1 of 10
# Warmup Iteration   1: 1442257053,080 ops/s
# Warmup Iteration   2: 1474088913,188 ops/s
[...]
# Warmup Iteration  19: 435080374,496 ops/s
# Warmup Iteration  20: 436917769,398 ops/s
Iteration   1: 1462176825,349 ops/s
Iteration   2: 1431427218,067 ops/s
[...]

# Run complete. Total time: 00:08:06

Benchmark                                                   Mode   Samples        Score  Score error    Units
n.m.b.j.HelloJMHMicroBenchmark.benchmarkRuntimeOverhead    thrpt       200 1450534078,416 29308551,722    ops/s

You can see that JMH creates multiple JVM forks. For each fork, it runs n warmup iterations (shown in blue in the picture below), which are not measured and are just needed to reach steady state, before m measurement iterations are run (shown in red in the picture below). In this example, both n and m are 20, but you can change this with command line parameters.
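For example, the following invocation would run 5 forks with 10 warmup and 10 measurement iterations each (a sketch using JMH's standard command line flags and the JAR name from above):

java -jar build/libs/benchmarking-experiments-0.1.0-all.jar -f 5 -wi 10 -i 10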

Runtime diagram of the forked JVM runs of JMH

At the end, JMH summarizes the results of all microbenchmarking runs. The two most important measures are the "score" (which is the mean for the throughput benchmark mode), which allows you to estimate the performance of the benchmarked code, and the "score error", which allows you to estimate the noisiness of the measurements taken by the microbenchmark. In the run above, the score error is about 2% of the score, so the measurements are quite stable. As this post is not intended to give an introduction to statistics, I suggest "Explained: Key Mathematic Principles for Performance Testers", written by Microsoft's patterns & practices group. If you are more the intuitive type, you'll like the articles from Kalid Azad very much, especially "How To Analyze Data Using the Average".

That's basically the whole microbenchmarking process with JMH. Congratulations, you mastered the first step of writing microbenchmarks with JMH! In the next post we'll get to know more concepts of JMH.

Questions or comments?

Just ping me on Twitter


Microbenchmarking in Java with JMH: Common Flaws of Handwritten Benchmarks

Fail Road

Image by Sarah; license: CC

This is the third post in a series about microbenchmarking on the JVM with the Java Microbenchmarking Harness (JMH).

part 1: Microbenchmarking in Java with JMH: An Introduction

part 2: Microbenchmarks and their environment

part 4: Hello JMH

part 5: Digging Deeper

In the previous post I have shown typical issues that have to be considered when executing microbenchmarks, such as the behavior of the JIT compiler or background system load. In this blog post we'll discover different problems that you'll encounter when writing microbenchmarks on the JVM.

Flaws and Dangers of Java Microbenchmarks

We write microbenchmarks because we want to know about the performance characteristics of a piece of code. In an ideal world, we would like to argue:

For given microbenchmark candidates X and Y: If X performs better than Y in a given microbenchmark, X will perform better than Y in any "similar" situation.

But Sheldon Cooper knows better: That's just superstitious hokum.

Sheldon Cooper says 'Superstitious Hokum'

What could possibly go wrong with this innocent argument? For one thing the microbenchmark could be flawed, which renders the premise invalid. Here are some examples:

  1. The JIT compiler could detect that the result of the benchmarked code is never used and eliminate it as dead code, so we end up measuring an empty shell.
  2. The benchmark could measure a mixture of interpreted and compiled code because it did not warm up properly (see the previous post).
  3. Garbage collection triggered by the benchmark itself could distort the measured times.

These examples should demonstrate that there is a vast number of things that can go wrong. However, in the unlikely event that we mere mortals get a microbenchmark right and the premise is valid, the conclusion could still be wrong. Let's consider some examples (non-exhaustive):

  1. The microbenchmark could have been run with input sizes or data distributions that never occur in the real system.
  2. The "similar" situation could involve different hardware, a different OS or a different JVM, where other optimizations apply.
  3. In the real system, the measured code interacts with other components, so effects such as cache pressure or GC pressure could change its performance characteristics completely.

What's next?

In this article, we have seen that we can fail in a lot of ways when trying to measure the performance of a Java component with a microbenchmark. However, all hope is not lost. In the next part I'll introduce the Java Microbenchmarking Harness. Although it does not prevent all issues, it goes to great lengths to eliminate a lot of them upfront.

Questions or comments?

Just ping me on Twitter

Many thanks to @mmitterdorfer and @steve0392 for reading draft versions of this article.


Microbenchmarking in Java with JMH: Microbenchmarks and their environment

CPU Pins

Image by Eduardo Diez Viñuela; license: CC

This is the second post in a series about microbenchmarking on the JVM with the Java Microbenchmarking Harness (JMH).

part 1: Microbenchmarking in Java with JMH: An Introduction

part 3: Common Flaws of Handwritten Benchmarks

part 4: Hello JMH

part 5: Digging Deeper

In the previous post I have introduced microbenchmarking. In this blog post we'll discover different problems and precautions that should be taken when writing microbenchmarks on the JVM.

Is it really that hard?

How hard can it be to write a microbenchmark? Just invoke the code that should be benchmarked, measure the elapsed time and we are done. Hold your horses, it's not that easy. The class below - also available in the accompanying project for this series on Github - is a microbenchmark that attempts to determine the performance of Collection#add():

package name.mitterdorfer.benchmark.plain;

import java.util.*;
import java.util.concurrent.ConcurrentSkipListSet;

public class FlawedSetMicroBenchmark {
    private static final int MAX_ELEMENTS = 10_000_000;

    public static void main(String[] args) {
        List<? extends Set<Integer>> testees =
                Arrays.asList(
                        new HashSet<Integer>(),
                        new TreeSet<Integer>(),
                        new ConcurrentSkipListSet<Integer>());
        for (Set<Integer> testee : testees) {
            doBenchmark(testee);
        }
    }

    private static void doBenchmark(Set<Integer> testee) {
        long start = System.currentTimeMillis();
        for (int i = 0; i < MAX_ELEMENTS; i++) {
            testee.add(i);
        }
        long end = System.currentTimeMillis();
        System.out.printf("%s took %d ms%n", testee.getClass(), (end - start));
    }
}

This microbenchmark looks simple but it does not measure what we think it does. Just consider two apparent issues:

  1. The program will start in interpreted mode. After some time the runtime detects that #doBenchmark() is a so-called hot method and the Just-In-Time (JIT) compiler kicks in and compiles it. Due to a technique called on-stack-replacement (OSR), the runtime is able to switch from interpreted to compiled mode in the middle of executing a method. Thus, the microbenchmark will measure the performance of a mixture of both modes (a naive warmup workaround is sketched after this list). Dr. Cliff Click wrote an in-depth article about on-stack replacement in which he notes that OSRed code may not be optimal from a performance perspective.
  2. On the first invocation of #doBenchmark(Set), the microbenchmark continuously inserts elements into a HashSet. Therefore, the internal array will get resized, which produces garbage. This will in turn trigger the garbage collector and distort measurement results.
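A naive countermeasure for the first issue is to run the workload a few times before measuring, so the JIT compiler gets a chance to compile the hot path. A hedged sketch (this mitigates the warmup problem at best and does nothing about garbage collection or the other surprises mentioned below):

private static void doBenchmarkWithWarmup(Set<Integer> testee) {
    // warmup rounds: results are discarded, they only exist to let the
    // JIT compiler reach steady state
    for (int round = 0; round < 10; round++) {
        testee.clear();
        for (int i = 0; i < MAX_ELEMENTS; i++) {
            testee.add(i);
        }
    }
    testee.clear();
    // the measured round
    doBenchmark(testee);
}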

To summarize, we have identified two common sources of problems in Java microbenchmarks: the JIT-compiler and the garbage collector. This microbenchmark contains a few more surprises that we'll cover in a later article. Now, let's take a step back and think first about the ideal environment for microbenchmarks.

Environmental Considerations

Before running any benchmark, we have to consider these goals:

  1. We want to stay as close as possible to the target system configuration, provided it is known in advance.
  2. We do not want to introduce any unintended bottlenecks due to the execution environment.

Therefore, we should think about various system components and their configuration. Of particular importance for microbenchmarks are:

  1. the hardware, especially the CPU and memory configuration (features such as frequency scaling or power management can skew results),
  2. the operating system (e.g. scheduler decisions and background load),
  3. the JVM (its version and configuration flags).

We can conclude that we need to select hardware and system software carefully and configure it properly. Depending on the requirements, the same microbenchmark should even be run on different systems (hardware, OS, JVM) to get a grasp of the performance characteristics. These considerations affect all benchmarks, regardless of underlying technology. Microbenchmarks running on virtual machines face additional challenges.

The JVM's Dynamic Nature

On a very high level, the JVM consists of three main components that work together: the runtime including the interpreter, the garbage collector and the JIT-compiler. Due to these components, we neither know in advance which machine code will be executed nor how it behaves exactly at runtime. Contrast this behavior with a regular C program:

Comparison of optimizations that happen at compile time and optimizations in JITed code

Oracle's HotSpot JVM applies a vast number of optimizations to Java code: an outdated page on the OpenJDK Wiki lists close to 70 optimization techniques, and I suspect HotSpot applies a lot more today. This makes it impossible to reason about which machine code is finally executed based on the source code or the bytecode alone. For the curious, HotSpot provides a way to look at the assembly code generated by the JIT compiler. The easy part: add a disassembler library to the Java library path and provide the proper JVM flags: -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly -XX:PrintAssemblyOptions=intel (see also the OpenJDK documentation on PrintAssembly). The hard part: understanding the output. For x86 architectures, the instruction set references (mnemonics A-M, mnemonics N-Z) are a start. To sum up, the dynamic behavior of the JVM makes it even harder to write correct microbenchmarks.

What's next?

In this article, we have seen that a lot of factors influence the runtime behavior of microbenchmarks. In the next article, we'll take a deep dive and look at specific flaws that can happen in handwritten microbenchmarks.

Questions or comments?

Just ping me on Twitter

Many thanks to @mmitterdorfer and @steve0392 for reading draft versions of this article.


False Sharing

Normally, Java programmers are not too concerned about the hardware on which their beautiful software runs, as long as it provides loads of memory. Most of the time this is a good thing: software should solve a business problem rather than satisfy a machine. The JVM does a decent job hiding the underlying platform, but as we know, abstractions are leaky. Sometimes we have to peek under the hood, especially when we're concerned about performance. One such topic is false sharing, where a very performance-critical component might not perform as well as we'd expect.

Consider this class:

public final class X {
    public volatile int f1;
    public volatile int f2;
}

On Oracle JDK 1.8, instances of this class are laid out in memory as shown below:

Object Layout of sample class X

Note: I have determined the layout on OpenJDK 1.8.0-b132 (64-Bit) using JOL, see also my accompanying project on false sharing on Github.
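If you want to inspect the layout yourself, a minimal sketch using JOL could look like this (assuming the jol-core dependency is on the classpath; the class name is made up for this example):

import org.openjdk.jol.info.ClassLayout;

public class XLayoutInspector {
    public static void main(String[] args) {
        // prints the field offsets, padding and object header of X
        System.out.println(ClassLayout.parseClass(X.class).toPrintable());
    }
}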

We have declared all fields as volatile indicating that these fields may be used by different threads and we want to ensure writes are visible to all of them. As a result, the runtime will emit code to ensure that writes to a field are visible across CPU cores. But how does this work?

Caching on Modern CPUs in less than 100 words

CPUs don't care about classes or objects; they are concerned with reads and writes on memory cells. To operate on data efficiently, it is fetched from main memory into a CPU cache at the granularity of a cache line. A cache line is a block of memory in the CPU cache, say 64 bytes. If data in a cache line is modified, the cores exchange messages across the memory bus according to a well-defined cache coherence protocol (the most common one being the MESI protocol) so that all caches stay consistent.

False Sharing

In this example I assume a hypothetical CPU with one cache layer (L1 cache) where each cache belongs to one core. What happens if one thread, running on Core0, is constantly writing to X.f1 (depicted in red below) and another thread, running on Core1, is constantly reading from X.f2 (depicted in green below)?

Cache coherence with false sharing

Core0 knows that it does not own the cache line "n" exclusively, so it has to broadcast an "Invalidate" message across the bus when it changes the cache line to notify other cores that their caches are stale. Core1 is listening on the bus and invalidates the corresponding cache line. Consequently, this produces a lot of unnecessary bus traffic although both cores operate on different fields. This phenomenon is known as false sharing.

Countermeasures to False Sharing

This is a rather unsatisfying state of affairs: although each thread operates on different fields, we produce a lot of unnecessary bus traffic, which is ultimately caused by the memory layout of our Java objects. If we can influence the memory layout, we can avoid false sharing. It turns out there are a few ways around this problem. A word of warning though: the following techniques depend on the JVM implementation or the underlying hardware, so take my words with a grain of salt. The techniques are taken from the JMH example class JMHSample_22_FalseSharing and a presentation on @Contended which inspired me to write this blog post. I'll just describe three of them shortly here (one of which, manual field padding, is sketched right below), as the example class of JMH already documents these and more techniques very clearly:
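Manual field padding inserts enough unused fields between f1 and f2 that they end up on different cache lines. A hedged sketch, assuming 64-byte cache lines (the JVM is in principle free to reorder fields, which is why the JMH sample also shows sturdier variants such as padding via the class hierarchy):

public final class PaddedX {
    public volatile int f1;
    // 7 * 8 = 56 bytes of padding intended to push f2 onto another cache line
    long p01, p02, p03, p04, p05, p06, p07;
    public volatile int f2;
}

Note that padding trades memory for speed: every instance gets noticeably larger.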

To properly pad f1, annotate it with @Contended:

import sun.misc.Contended;

public final class X {
    // note: for classes outside the JDK, @Contended only takes effect if
    // the JVM is started with -XX:-RestrictContended
    @Contended
    public volatile int f1;
    public volatile int f2;
}

This will result in the object layout below:

Object Layout of sample class X with @Contended

False Sharing vs. @Contended: A Microbenchmark

To demonstrate the effects of false sharing, I have written a small JMH microbenchmark which is available on Github. Below is the result of this benchmark comparing false sharing and the @Contended approach using three reader threads and one writer thread. It has been run on an Intel Core i7-2635QM with 4 physical cores.

Benchmark results: false sharing vs. @Contended

While write throughput is roughly equivalent (mean throughput 373 ops/µs for false sharing and 371 ops/µs for the solution with @Contended), the mean read throughput is around three times higher with @Contended (2338 ops/µs) than with false sharing (813 ops/µs). This clearly demonstrates the influence of false sharing in this scenario.

However, don't worry: @Contended is definitely not meant to be sprinkled across the entire codebase. In fact, it should be used very, very sparingly, and people wanting to use @Contended probably already use one of the previously-mentioned field padding approaches. In JDK 8, @Contended is used in just 5 classes (Thread, Striped64, ConcurrentHashMap, Exchanger and ForkJoinPool). There are limited circumstances where false sharing really is a problem, and bottlenecks should be identified beforehand. But that may be the topic of another article some day...

Questions or comments?

Just ping me on Twitter