- Find performance regressions in critical components
- Compare alternative implementations or system configurations
- Understand the low-level behavior of system components
- Ultimate purpose: derive a performance model for a component
Example: How long does it take to calculate the sum of an array?
public class SumBenchmark {
    public static double sum(double[] array) {
        double total = 0.0d;
        for (int i = 0; i < array.length; i++) {
            total += array[i];
        }
        return total;
    }
}
public class SumBenchmark {
    private static final int BATCH_SIZE = 15000;

    private static void benchmarkSum(double[] array) {
        long start = System.nanoTime();
        for (int j = 0; j < BATCH_SIZE; j++) {
            sum(array);
        }
        long stop = System.nanoTime();
        System.out.printf("Computation finished in %d ns.%n",
                ((stop - start) / BATCH_SIZE));
    }
}
Benchmarking scenario: benchmark with 10,000 array elements
public class SumBenchmark {
    public static void main(String[] args) {
        double[] array = new double[10000];
        // initialize array with some values
        for (int i = 0; i < array.length; i++) {
            array[i] = (double) i;
        }
        // perform actual benchmark
        for (int iteration = 0; iteration < 10; iteration++) {
            benchmarkSum(array);
        }
    }
}
Computation finished in 11561 ns.
Computation finished in 447 ns.
Computation finished in 0 ns.
Computation finished in 0 ns.
[...]
Computation finished in 0 ns.
Rerun with -XX:+PrintCompilation
[...]
123 7 name.mit[...].SumBenchmark::sum (24 bytes)
127 1 % name.mit[...].SumBenchmark::sum @ 4 (24 bytes)
293 2 % name.mit[...].SumBenchmark::benchmarkSum @ 6 (51 bytes)
306 8 java.lang.String::indexOf (166 bytes)
Computation finished in 11561 ns.
313 9 name.mit[...].SumBenchmark::benchmarkSum (51 bytes)
319 2 % name.mit[...].SumBenchmark::benchmarkSum @ -2 (51 bytes) made not entrant
Computation finished in 447 ns.
Computation finished in 0 ns.
Computation finished in 0 ns.
[...]
Computation finished in 0 ns.
The JIT compiler kicks in and eliminates the benchmark loop
private static void benchmarkSum(double[] array) {
    long start = System.nanoTime();
    for (int j = 0; j < BATCH_SIZE; j++) {
        // (1) The return value is never used, let's eliminate the call
        sum(array);
    }
    long stop = System.nanoTime();
    System.out.printf("Computation finished in %d ns.%n",
            ((stop - start) / BATCH_SIZE));
}
Only illustrative: HotSpot may implement this differently
private static void benchmarkSum(double[] array) {
    long start = System.nanoTime();
    for (int j = 0; j < BATCH_SIZE; j++) {
        // (2) The loop body is empty, let's eliminate the loop
    }
    long stop = System.nanoTime();
    System.out.printf("Computation finished in %d ns.%n",
            ((stop - start) / BATCH_SIZE));
}
Only illustrative: HotSpot may implement this differently
private static void benchmarkSum(double[] array) {
    long start = System.nanoTime();
    // (3) Huh, where is the benchmark?
    long stop = System.nanoTime();
    System.out.printf("Computation finished in %d ns.%n",
            ((stop - start) / BATCH_SIZE));
}
“Without exception every microbenchmark I've seen has had serious flaws. Except those I've had a hand in correcting.”
import org.openjdk.jmh.annotations.Benchmark;

public class HelloJMHMicroBenchmark {
    @Benchmark
    public void hello() {
        // intentionally left blank
    }
}
mvn clean install
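Assuming the project follows the standard JMH Maven archetype layout (an assumption; the project structure isn't shown here), the build produces a self-executing benchmark JAR that is then started directly:

```shell
# Assumes the standard JMH Maven archetype, where "mvn clean install"
# produces a self-executing JAR under target/.
java -jar target/benchmarks.jar -l   # list available benchmarks without running them
java -jar target/benchmarks.jar      # run all benchmarks
```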
Each @Benchmark-annotated method => one benchmark class
# Run progress: 0,00% complete, ETA 00:06:40
[...]
# Fork: 1 of 10
# Warmup Iteration 1: 1442257053,080 ops/s
[...]
# Warmup Iteration 20: 436917769,398 ops/s
Iteration 1: 1462176825,349 ops/s
Iteration 2: 1431427218,067 ops/s
[...]
# Run complete. Total time: 00:08:06
Benchmark Mode Samples Score Score error Units
n.m.b.j.H.hello thrpt 200 1450534078,416 29308551,722 ops/s
import org.openjdk.jmh.annotations.*;

@State(Scope.Benchmark)
public class SumBenchmark {
    private double[] values;

    @Setup
    public void setup() {
        this.values = new double[10000];
        for (int i = 0; i < values.length; i++) {
            values[i] = (double) i;
        }
    }

    @Benchmark
    public double calcSum() {
        // Returning the result marks it as used, so the JIT
        // compiler cannot eliminate the computation as dead code.
        return sum(values);
    }

    // Same summing logic as before
    private static double sum(double[] array) {
        double total = 0.0d;
        for (int i = 0; i < array.length; i++) {
            total += array[i];
        }
        return total;
    }
}
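Returning the result is one way to defeat dead-code elimination. When a result cannot conveniently be returned, JMH also offers the Blackhole infrastructure class. A minimal sketch (not from the original text; the state and summing logic mirror the benchmark above):

```java
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.infra.Blackhole;

@State(Scope.Benchmark)
public class BlackholeSumBenchmark {
    private double[] values;

    @Setup
    public void setup() {
        this.values = new double[10000];
        for (int i = 0; i < values.length; i++) {
            values[i] = (double) i;
        }
    }

    @Benchmark
    public void calcSum(Blackhole bh) {
        double total = 0.0d;
        for (double v : values) {
            total += v;
        }
        // consume() tells JMH the value is "used", so the JIT
        // compiler cannot discard the computation.
        bh.consume(total);
    }
}
```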
# Run progress: 0,00% complete, ETA 00:06:40
# Warmup: 20 iterations, 1 s each
# Measurement: 20 iterations, 1 s each
# Threads: 1 thread, will synchronize iterations
# Benchmark mode: Throughput, ops/time
# Benchmark: name.mitterdorfer.benchmark.jmh.SumBenchmark.calcSum
[...]
# Fork: 1 of 10
# Warmup Iteration 1: 89162,938 ops/s
# Warmup Iteration 2: 91655,330 ops/s
[...]
# Run complete. Total time: 00:08:04
Benchmark Mode Samples Score Score error Units
n.m.b.j.SumBenchmark.calcSum thrpt 200 92684,491 395,882 ops/s
The score depends on the array size (10,000 elements). Use @OperationsPerInvocation
to normalize the reported throughput if needed.
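A sketch of this normalization (a hypothetical variant of the benchmark above, not shown in the original): since one calcSum invocation processes 10,000 elements, @OperationsPerInvocation(10000) makes JMH report per-element throughput instead of per-invocation throughput.

```java
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.OperationsPerInvocation;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;

@State(Scope.Benchmark)
public class NormalizedSumBenchmark {
    private double[] values;

    @Setup
    public void setup() {
        this.values = new double[10000];
        for (int i = 0; i < values.length; i++) {
            values[i] = (double) i;
        }
    }

    // One invocation processes 10,000 elements, so JMH scales the
    // measured score to report per-element throughput.
    @Benchmark
    @OperationsPerInvocation(10000)
    public double calcSum() {
        double total = 0.0d;
        for (double v : values) {
            total += v;
        }
        return total;
    }
}
```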
Other useful JMH annotations include @State, @Threads and @CompilerControl. For more information please study the official JMH samples.
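As a rough illustration of how these annotations fit together (class and method names are invented for this sketch):

```java
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.CompilerControl;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.annotations.Threads;

// Scope.Thread creates one state instance per benchmark thread.
@State(Scope.Thread)
public class AnnotationShowcase {
    private long localCounter;

    @Benchmark
    @Threads(4) // run this benchmark with four concurrent threads
    @CompilerControl(CompilerControl.Mode.DONT_INLINE) // keep the JIT from inlining this method
    public long increment() {
        return ++localCounter; // per-thread state, so no data race
    }
}
```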
The CPU speculatively loads data based on memory access patterns (prefetching). How do common Java data structures fare?

- int[] — contiguous array: linear memory access pattern for traversal
- ArrayList — linear memory access pattern for array traversal; pointer chasing for elements
- LinkedList — nonlinear memory access pattern for traversal and elements

Setup for the LinkedList variant:
@State(Scope.Benchmark)
public class PointerChasingBenchmark {
    @Param({"1024", "2048", "4096", "8192", "16384", "32768"})
    public int problemSize;

    private final List<Integer> linkedList = new LinkedList<>();

    @Setup
    public void setUp() {
        for (int idx = 0; idx < problemSize; idx++) {
            linkedList.add(idx);
        }
    }
    // ...
}
Note: the other setup methods are identical except for their type.
Benchmark for the LinkedList variant:
@State(Scope.Benchmark)
public class PointerChasingBenchmark {
    // .. Setup ..

    @Benchmark
    public long sumLinkedList() {
        long sum = 0;
        for (int val : linkedList) {
            sum += val;
        }
        return sum;
    }
}
Note: the other benchmark methods are identical except for their type
Read CPU performance monitoring data with JMH's perf profiler:
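On Linux, with the perf tool installed, the profiler can be attached on the JMH command line (the JAR name assumes the standard JMH archetype layout):

```shell
# Attach Linux perf counters to the forked benchmark JVMs.
java -jar target/benchmarks.jar PointerChasingBenchmark -prof perf
# perfnorm normalizes the counter values per benchmark operation.
java -jar target/benchmarks.jar PointerChasingBenchmark -prof perfnorm
```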
| Metric | int[] | ArrayList | LinkedList |
|---|---|---|---|
| L1-dcache-loads | 61 × 10⁹ | 58 × 10⁹ | 21 × 10⁹ |
| L1-dcache-load-misses (relative to L1 cache hits) | 6 % | 10 % | 22 % |
Pointer indirection renders prefetching ineffective
Microbenchmarks are not the solution to every performance problem