Daniel Mitterdorfer

Latency Analysis of Elasticsearch

I currently work on the backend of Elastic Universal Profiling, which is a fleet-wide continuous profiler. In our benchmarks we have observed higher-than-desired latency in a query that retrieves data for flamegraphs. As this is a key aspect of our product, we wanted to understand the root cause better.

CPU Profiling

We started to gather CPU profiles with async-profiler in wall-clock profiling mode on the Elasticsearch cluster while it was processing a query:

CPU profile showing a timeline view where the CPU is mostly idle

The diagram above shows an activity timeline for one of the threads involved in query processing. The thread is only active for a brief period around the middle of the timeline but is otherwise idle. Other involved threads show a similar pattern. In other words, this query is not CPU bound. We therefore analyzed disk utilization, but disk I/O was also far below the supported IOPS. So if the problem is neither time spent on the CPU nor disk I/O, what should we do to get to the bottom of this mystery?
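
For reference, the wall-clock profiles were captured with async-profiler roughly as follows. The exact flags and output format are illustrative rather than the precise command we used (see the async-profiler documentation for the details), and ES_PID stands for the Elasticsearch process id:

# sample all threads in wall-clock mode, i.e. both on-CPU and blocked time,
# for 60 seconds and write the result as a JFR recording
./profiler.sh -e wall -t -d 60 -f profile.jfr $ES_PID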

Off-CPU Analysis

Time for off-CPU analysis! Put bluntly, we want to see why the CPU is not doing anything. We will use perf sched record to do this. Around the time the query is executed, we’ll run the following command:

perf sched record -g -- sleep $PROFILING_DURATION

This traces CPU scheduler events and creates a perf.data file that we can analyze, e.g. using perf sched timehist -V. The data in this file tells us when, for how long, and why a thread went off-CPU (output slightly edited for brevity):

           time task name                       wait time  sch delay   run time
                [tid/pid]                          (msec)     (msec)     (msec)
--------------- ------------------------------  ---------  ---------  ---------
   15589.656599 elasticsearch[e[8937/8881]          0.000      0.000      0.020    [unknown] <- [unknown] <- [unknown] <- [...]
   15589.656601 Attach Listener[9157/8881]          0.162      0.000      0.053    [unknown] <- [unknown] <- [unknown] <- [...]
   15589.656601 <idle>                              0.023      0.000      0.231 
   15589.656605 elasticsearch[e[8936/8881]          0.000      0.000      0.022    [unknown] <- [unknown] <- [unknown] <- [...]
   15589.656609 <idle>                              0.009      0.000      0.433 
   15589.656616 elasticsearch[d[8953/8881]          0.000      0.000      0.014    [unknown] <- [unknown] <- [unknown] <- [...]
   15589.656623 elasticsearch[e[8956/8881]          0.000      0.000      0.014    [unknown] <- [unknown] <- [unknown] <- [...]
   15589.656753 <idle>                              0.053      0.000      0.152 
   15589.656762 <idle>                              0.015      0.000      0.389 
   15589.656765 <idle>                              0.015      0.000      0.173 

That doesn’t look too useful because kernel symbols are missing, so make sure to install them first (check your distro’s docs for how to do this).
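
As an illustration only (package names and repositories differ between distributions and kernel versions, so treat this as a sketch rather than a recipe), installing kernel debug symbols can look roughly like this:

# Ubuntu: requires the debug symbol (ddebs) archive to be enabled
sudo apt-get install linux-image-$(uname -r)-dbgsym
# Fedora / RHEL
sudo dnf debuginfo-install kernel

If kernel symbols are available, the output looks as follows (slightly edited for brevity):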

         time task name                   wait time  sch delay   run time
              [tid/pid]                      (msec)     (msec)     (msec)
------------- --------------------------  ---------  ---------  ---------
 17452.955493 <idle>                          0.165      0.000      0.018 
 17452.955494 elasticsearch[i[9782/8239]      0.000      0.000      0.044    futex_wait_queue_me <- futex_wait <- do_futex <- __x64_sys_futex
 17452.955507 elasticsearch[i[9388/8239]      0.000      0.000      0.036    futex_wait_queue_me <- futex_wait <- do_futex <- __x64_sys_futex
 17452.955509 <idle>                          0.041      0.000      0.464 
 17452.955639 <idle>                          0.068      0.000      0.168 
 17452.955682 <idle>                          0.027      0.000      0.208 
 17452.955684 <idle>                          0.021      0.000      0.194 
 17452.955687 <idle>                          0.024      0.000      0.170 
 17452.955689 elasticsearch[i[5892/4393]      0.558      0.000      0.180    io_schedule <- __lock_page_killable <- filemap_fault <- __xfs_filemap_fault

That’s much better. Looking at the last line in the output, we can infer that an Elasticsearch thread was running for 180µs and was then blocked for 558µs because it caused a page fault; the kernel stack (io_schedule <- __lock_page_killable <- filemap_fault) tells us the thread was waiting in the page fault handler for a page of a memory-mapped file. While this is useful, our ultimate goal is to analyze and eliminate major latency contributors. Just looking at this output does not help us understand what piece of code in the application has caused that page fault. Ideally, we’d like to see the off-CPU profile integrated with the on-CPU profile. Then we’d be able to see what piece of code has caused the application to block and what it is blocked on (e.g. page fault handling, locking, blocking disk I/O).

Combining Both Profiles

I wrote a program that parses the perf output file and injects the stacktraces into the on-CPU profile to provide a complete picture.
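
The program itself is beyond the scope of this post, but a much-simplified sketch of its first step, turning the off-CPU events and their kernel stacks into folded-stack format, might look like the following. Treat it as an illustration: the actual program does quite a bit more, and the field handling assumes the output layout shown above, which may differ between perf versions.

perf sched timehist > timehist.txt
awk '
  {
    # locate the task-name column; this copes with an optional cpu column
    # that some perf versions print between the timestamp and the task name
    t = 0
    for (i = 1; i <= NF; i++) if ($i ~ /^elasticsearch\[/) { t = i; break }
    # skip idle entries, other threads and events without a kernel stack
    if (t == 0 || NF <= t + 3) next
    wait = $(t + 1)                    # wait time in msec
    if (wait + 0 == 0) next
    # collect the kernel frames after the run-time column, dropping the "<-" separators
    n = 0
    for (i = t + 4; i <= NF; i++) if ($i != "<-") frames[++n] = $i
    if (n == 0) next
    # folded-stack format: root frame first, frames joined with ";"
    stack = ""
    for (i = n; i >= 1; i--) stack = stack (i == n ? "" : ";") frames[i]
    # weight each stack by its wait time, converted to microseconds
    printf "%s %d\n", stack, wait * 1000
  }
' timehist.txt > offcpu-folded.txt

Each output line is an off-CPU kernel stack weighted by the time the thread spent waiting. Injecting these stacks into the on-CPU profile gives the following combined view: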

CPU profile showing a combined timeline view of on-CPU and off-CPU stacktraces

While this looks promising at first glance, something is off. The on-CPU profile at the beginning indicates that the thread pool is waiting for new tasks (i.e. it’s idle), yet we see several page faults occurring at the same time, which can’t be right. The two profiles must therefore be misaligned on the time axis. So, how exactly do both profilers get their timestamps?

perf record allows controlling the clock with the --clockid parameter. Several clocks are supported, among which we picked CLOCK_MONOTONIC. However, the clock in async-profiler is based on the rdtsc instruction on Intel CPUs and is not configurable. rdtsc is basically a hardware counter on the CPU and an interesting topic in its own right. For our purposes it’s sufficient to know that the two profilers use different clock sources. Therefore, I’ve raised an issue to add configurable clock sources to async-profiler, which Andrei Pangin was kind enough to implement. With async-profiler 2.10 (not yet released as of October 2023) we can use --clock monotonic and force the same clock source in both profilers.
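
Putting it all together, a profiling session with both tools pinned to the same clock looks roughly like this. As before, the async-profiler flags are illustrative, and I’m assuming that perf sched record forwards record options such as -k to perf record:

# trace scheduler events with CLOCK_MONOTONIC timestamps
perf sched record -k monotonic -g -- sleep $PROFILING_DURATION
# sample the JVM in wall-clock mode with the same clock source (async-profiler >= 2.10)
./profiler.sh -e wall -t --clock monotonic -d $PROFILING_DURATION -f profile.jfr $ES_PID

This gives us aligned stack traces that combine both on-CPU and off-CPU states: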

CPU profile showing a timeline view that correctly combines on-CPU and off-CPU stacktraces

For example, in the picture above we see heavy page-faulting when accessing doc values in Lucene, the underlying search engine library that Elasticsearch uses. Being able to visualize this behavior allows us to provide an accurate diagnosis and then implement appropriate mitigations. Note that some stacktraces might still not be perfectly aligned because perf is tracing whereas async-profiler is sampling. Despite this caveat, seeing both the on-CPU and off-CPU behavior of an application has proven extremely valuable when analyzing latency.