Daniel Mitterdorfer

Recent Posts

Life as a remote worker: Organization

I work for more than two years at Elastic where I focus on benchmarking, performance analysis and performance tuning for Elasticsearch. In my daily work, I have to deal with many people: our users, various of our engineering teams, support, consulting, marketing, sales, product management and our developer relations team. As you might imagine, these are a lot of balls to juggle.

Structuring Work

Often, I get requests that I cannot immediately work on. Therefore, I need a way to organize the incoming stream of tasks. I use a variation of Personal Kanban that I have implemented in Trello.

Overview of my Trello Board

You can see several columns (from left to right):

Handling Incoming Tasks

I think the structure is pretty simple to grasp but another important part is how I interact with the board.

There are several streams from where I get new tasks:

Whenever this happens, I create a new Trello card. Depending on the urgency, the card either goes into the backlog, the “Next” column or I add it to “Doing” and start to work on it immediately. Thanks to my external memory I will never forget anything, even if I will not work on it for a long time.

Suppose, for example that I get an idea for a feature that I could demo for Rally (our macrobenchmarking tool for Elasticsearch):

Task Detail: Prepare a demo for Rally

As you can see from the screenshot, a card does not only consist of its title. Using comments helps me better organize my thoughts around that topic and document areas that I need to look into.

There are several more features in Trello that I use, e.g.:

Progress Reporting

At the end of each week, we send a brief update to the rest of the team for which I have setup a notification. I then kick off a script that reads my board and provides a draft report within seconds. Before sending, I just have to prune the list so it includes only tasks that are relevant for the rest of the team. Finally, I archive all cards in the “Done this week” column. That whole process takes at most one minute per week. Another important aspect: Due to externalizing everything again, I don’t need to spend any brain cycles on it.


Xprof, the forgotten HotSpot profiler

Did you know that every HotSpot-based JVM includes a profiler? You can activate it on the command line with -Xprof and it dumps its output to stdout.

The information about it is pretty sparse, so I decided to write a short blog post about it.

Getting Started

Let’s dive right in with an example:

public class Fibonacci {
    private static long fib(long n) {
        if (n <= 0) {
            throw new IllegalArgumentException("n must be a natural number");
        } else if (n == 1 || n == 2) {
            return 1;
        } else {
            return fib(n - 2) + fib(n - 1);
        }
    }

    public static void main(String[] args) {
        System.out.println(fib(50));
    }
}

If you compile and run this code with java Fibonacci -Xprof you’ll get the following output after the program has finished:

Flat profile of 45.00 secs (3664 total ticks): main

     Compiled + native   Method                        
100.0%  3664  +     0    Fibonacci.fib
100.0%  3664  +     0    Total compiled

Flat profile of 0.00 secs (1 total ticks): DestroyJavaVM

  Thread-local ticks:
100.0%     1             Unknown: no last frame

This output reveals already a few things about the profiler:

  1. Why is main() not showing up in the profile? The reason is that Xprof is a “flat” profiler. We are normally used to hierarchical profilers that show complete stack traces. However, this profiler will only show the top-most method to avoid the overhead of stack-walking. As this example is only very simple, this is not immediately clear, but we’ll see this in a more complex example later.
  2. It is a sampling profiler. Samples are captured at each “tick”, and a tick occurs every 10 milliseconds (see the creation of FlatProfilerTask and the definition of the delay interval in the JDK sources)
  3. It prints the captured information only when a thread exits (see before_exit(JavaThread* thread) in the JDK sources)

If you pay close attention, you’ll see that the number of ticks does not match the reported time period. For 45 seconds, it should capture 4,500 samples, given a sampling frequency of 10 milliseconds. A quick experiment (just running Thread#sleep()) shows that the sampling interval (on my machine) is actually 12 milliseconds.

Sleep duration [ms] Total samples
5,000 421
10,000 830
20,000 1,665
40,000 3,328

Before we move on, let’s see what additional information we can get from Xprof.

We can see more information about JVM threads with -XX:+ProfileVM:

Flat profile of 0.00 secs (3665 total ticks): VM Thread

     Compiled + native   Method                        
100.0%     0  +  3665    __psynch_cvwait
100.0%     0  +  3665    Total compiled

We can see the following with -XX:+ProfilerPrintByteCodeStatistics (this will get more interesting with a complex example):

 Bytecode ticks:


Global summary of 45.00 seconds:
100.0%  3665             Received ticks
  0.0%     1             Unknown code

And finally we can get even low-level info about hot methods with -XX:+ProfilerRecordPC:

Printing compiled methods with PC buckets having more than 15 ticks
===================================================================

[CodeBlob (0x000000010c4b7890)]
Framesize: 8
Could not load hsdis-amd64.dylib; library not loadable; PrintAssembly is disabled

Apparently we need the Java disasssembler library hsdis, which you need to build yourself (see the build instructions for your platform in the JITWatch wiki). With: -XX:+ProfilerRecordPC and export LD_LIBRARY_PATH=/path/to/hsdis we get indeed some assembly. If you prefer Intel assembly syntax instead of AT&T, just add -XX:+UnlockDiagnosticVMOptions -XX:PrintAssemblyOptions=intel.

Here is a shortened version of the output:

Printing compiled methods with PC buckets having more than 15 ticks
===================================================================

[CodeBlob (0x000000010bf09650)]
Framesize: 8
Loaded disassembler from hsdis-amd64.dylib
Decoding compiled method 0x000000010bf09650:
Code:
[Disassembling for mach='i386:x86-64']
[Entry Point]
[Verified Entry Point]
[Constants]
  # {method} {0x0000000124b60b70} 'fib' '(J)J' in 'Fibonacci'
  # parm0:    rsi:rsi   = long
  #           [sp+0x40]  (sp of caller)
  0x000000010bf097c0: mov    DWORD PTR [rsp-0x14000],eax
  0x000000010bf097c7: push   rbp
  0x000000010bf097c8: sub    rsp,0x30
  0x000000010bf097cc: mov    r10,rsi
  0x000000010bf097cf: test   rsi,rsi
  0x000000010bf097d2: jle    0x000000010bf0989e
  0x000000010bf097d8: cmp    rsi,0x1
  0x000000010bf097dc: je     0x000000010bf097e4
  0x000000010bf097de: cmp    rsi,0x2
  0x000000010bf097e2: jne    0x000000010bf09844
  0x000000010bf097e4: mov    eax,0x1
  0x000000010bf097e9: jmp    0x000000010bf09838
  0x000000010bf097eb: mov    eax,0x1
  0x000000010bf097f0: jmp    0x000000010bf09833
  0x000000010bf097f2: mov    r11d,0x1
  0x000000010bf097f8: mov    QWORD PTR [rsp+0x10],r11
  0x000000010bf097fd: dec    r10
  0x000000010bf09800: test   r10,r10
  0x000000010bf09803: jle    0x000000010bf098f9
  0x000000010bf09809: cmp    r10,0x1
  0x000000010bf0980d: je     0x000000010bf097eb
  0x000000010bf0980f: cmp    r10,0x2
  0x000000010bf09813: je     0x000000010bf097eb
  0x000000010bf09815: mov    rbp,r10
  0x000000010bf09818: data32 xchg ax,ax
  0x000000010bf0981b: call   0x000000010bf097c0  ; OopMap{off=96}
                                                ;*invokestatic fib
                                                ; - Fibonacci::fib@37 (line 10)
                                                ; - Fibonacci::fib@43 (line 10)
                                                ;   {static_call}
  0x000000010bf09820: mov    rbp,rax
  0x000000010bf09823: mov    rsi,QWORD PTR [rsp+0x8]
  0x000000010bf09828: data32 xchg ax,ax
  0x000000010bf0982b: call   0x000000010bf097c0  ; OopMap{off=112}
                                                ;*invokestatic fib
                                                ; - Fibonacci::fib@43 (line 10)
                                                ; - Fibonacci::fib@43 (line 10)
                                                ;   {static_call}
36.1% [1321]

  0x000000010bf09830: add    rax,rbp
  0x000000010bf09833: add    rax,QWORD PTR [rsp+0x10]
  0x000000010bf09838: add    rsp,0x30
  0x000000010bf0983c: pop    rbp
  0x000000010bf0983d: test   DWORD PTR [rip+0xfffffffffded97bd],eax        # 0x0000000109de3000
                                                ;   {poll_return}
27.0% [989]

  0x000000010bf09843: ret    
  0x000000010bf09844: mov    r11,rsi
  0x000000010bf09847: add    r11,0xfffffffffffffffe
  0x000000010bf0984b: mov    QWORD PTR [rsp+0x8],r11
  0x000000010bf09850: test   r11,r11
  0x000000010bf09853: jle    0x000000010bf098c9
  0x000000010bf09855: add    rsi,0xfffffffffffffffd
  0x000000010bf09859: cmp    r11,0x1
  0x000000010bf0985d: je     0x000000010bf097f2
  0x000000010bf0985f: cmp    r11,0x2
  0x000000010bf09863: je     0x000000010bf097f2
  0x000000010bf09865: mov    QWORD PTR [rsp],rsi
  0x000000010bf09869: mov    rbp,r10
  0x000000010bf0986c: mov    rsi,r10
  0x000000010bf0986f: add    rsi,0xfffffffffffffffc
  0x000000010bf09873: call   0x000000010bf097c0  ; OopMap{off=184}
                                                ;*invokestatic fib
                                                ; - Fibonacci::fib@37 (line 10)
                                                ; - Fibonacci::fib@37 (line 10)
                                                ;   {static_call}
  0x000000010bf09878: mov    QWORD PTR [rsp+0x10],rax
  0x000000010bf0987d: mov    rsi,QWORD PTR [rsp]
[...]

A more complex example

For this example I have started Elasticsearch 5.5.0 with -Xprof -XX:+ProfileVM -XX:+ProfilerPrintByteCodeStatistics and ran a macrobenchmark that stresses indexing. Here is one example:

Flat profile of 573.60 secs (31156 total ticks): elasticsearch[vWpZ2JV][bulk][T#2]

  Interpreted + native   Method
  0.0%     0  +    10    java.lang.Thread.yield
  0.0%     9  +     0    org.apache.lucene.index.DefaultIndexingChain$PerField.invert
  0.0%     0  +     6    sun.nio.ch.FileDispatcherImpl.force0
  0.0%     5  +     0    org.apache.lucene.document.Field.tokenStream
  0.0%     4  +     0    org.elasticsearch.index.shard.IndexShard.shouldFlush
  0.0%     4  +     0    org.elasticsearch.common.util.concurrent.KeyedLock$ReleasableLock.<init>
  0.0%     3  +     0    org.apache.lucene.util.BytesRef.<init>
  0.0%     3  +     0    org.apache.lucene.index.DefaultIndexingChain.getOrAddField
  0.0%     3  +     0    org.apache.lucene.index.DefaultIndexingChain.processDocument
  0.0%     3  +     0    java.util.function.BinaryOperator$$Lambda$1634.<init>
  0.0%     3  +     0    org.elasticsearch.action.support.replication.TransportWriteAction$WritePrimaryResult.respond
  0.0%     0  +     2    sun.nio.ch.FileChannelImpl.unmap0
  0.0%     2  +     0    org.elasticsearch.indices.IndexingMemoryController.getIndexBufferRAMBytesUsed
  0.0%     2  +     0    org.apache.lucene.codecs.blocktree.BlockTreeTermsWriter$PendingTerm.<init>
  0.0%     2  +     0    org.apache.lucene.index.FreqProxTermsWriterPerField$FreqProxPostingsArray.<init>
  0.0%     2  +     0    java.util.concurrent.locks.AbstractQueuedSynchronizer.unparkSuccessor
  0.0%     2  +     0    org.elasticsearch.action.bulk.BulkResponse.toXContent
  0.0%     2  +     0    org.elasticsearch.common.util.concurrent.AsyncIOProcessor.put
  0.0%     2  +     0    org.elasticsearch.index.mapper.FieldNamesFieldMapper.parseCreateField
  0.0%     2  +     0    java.util.concurrent.ThreadPoolExecutor.runWorker
  0.0%     2  +     0    org.elasticsearch.index.mapper.BaseGeoPointFieldMapper.parse
  0.0%     2  +     0    org.apache.lucene.index.DefaultIndexingChain.writeDocValues
  0.0%     0  +     1    sun.nio.fs.UnixNativeDispatcher.open0
  0.0%     1  +     0    org.apache.lucene.util.packed.PackedInts$Format.longCount
  0.0%     1  +     0    org.elasticsearch.index.mapper.DocumentParser.parsedDocument
  0.4%   101  +    22    Total interpreted (including elided)

     Compiled + native   Method
  6.0%  1695  +     1    org.apache.lucene.index.DefaultIndexingChain.processField
  5.2%  1476  +     0    org.apache.lucene.index.TermsHashPerField.add
  5.1%  1460  +     0    org.apache.lucene.util.BytesRefHash.add
  4.6%  1308  +     0    org.apache.lucene.index.DefaultIndexingChain$PerField.invert
  2.2%   624  +     0    org.apache.lucene.codecs.compressing.LZ4.compress
  1.8%   525  +     0    org.apache.lucene.document.Field$StringTokenStream.incrementToken
  1.7%   478  +     0    org.apache.lucene.index.DefaultIndexingChain.processDocument
  1.3%   358  +     0    org.apache.lucene.util.BytesRefHash.findHash
  1.2%   335  +    20    org.elasticsearch.index.mapper.FieldNamesFieldMapper.parseCreateField
  1.1%   324  +     0    org.apache.lucene.util.MSBRadixSorter.radixSort
  1.1%   309  +     0    org.apache.lucene.util.IntroSorter.quicksort
  1.1%   302  +     3    org.apache.lucene.codecs.PushPostingsWriterBase.writeTerm
  1.0%   280  +     2    org.apache.lucene.util.BytesRefHash.rehash
  1.0%   273  +     0    org.apache.lucene.analysis.standard.StandardTokenizerImpl.getNextToken
  0.9%   242  +     0    org.apache.lucene.index.TermVectorsConsumerPerField.start
  0.8%   216  +     0    org.elasticsearch.index.mapper.DocumentParser.internalParseDocument
  0.8%   214  +     0    org.apache.lucene.analysis.standard.StandardTokenizer.incrementToken
  0.7%   202  +     0    org.apache.lucene.analysis.FilteringTokenFilter.incrementToken
  0.7%   194  +     8    org.elasticsearch.index.mapper.FieldMapper.parse
  0.7%   195  +     1    org.apache.lucene.document.Field.tokenStream
  0.7%   161  +    32    org.apache.lucene.index.DocumentsWriter.updateDocument
  0.6%   176  +     0    org.apache.lucene.index.DefaultIndexingChain.getOrAddField
  0.6%   158  +     1    java.util.DualPivotQuicksort.sort
  0.6%   156  +     1    org.apache.lucene.codecs.lucene54.Lucene54DocValuesConsumer.addNumericField
  0.5%   138  +     1    org.apache.lucene.index.FreqProxTermsWriterPerField.addTerm
 55.2% 15335  +   364    Total compiled (including elided)

         Stub + native   Method
 20.8%     0  +  5906    java.lang.Object.notifyAll
  9.3%     0  +  2643    sun.reflect.Reflection.getCallerClass
  5.5%     0  +  1573    sun.reflect.Reflection.getCallerClass
  1.6%     0  +   458    java.lang.Thread.yield
  1.5%     0  +   418    sun.nio.ch.FileDispatcherImpl.write0
  1.1%     0  +   308    java.lang.Thread.isAlive
  1.1%     0  +   306    sun.nio.ch.NativeThread.current
  0.8%     0  +   214    java.lang.String.intern
  0.7%     0  +   213    sun.nio.ch.FileDispatcherImpl.force0
  0.6%     0  +   167    sun.nio.fs.UnixNativeDispatcher.open0
  0.5%     0  +   140    java.security.AccessController.getStackAccessControlContext
  0.3%    24  +    66    java.security.AccessController.doPrivileged
  0.2%     0  +    57    sun.nio.ch.FileDispatcherImpl.close0
  0.1%     0  +    37    java.io.UnixFileSystem.canonicalize0
  0.1%     0  +    27    sun.nio.ch.FileChannelImpl.unmap0
  0.1%     0  +    16    sun.nio.fs.UnixNativeDispatcher.stat0
  0.0%     0  +    11    sun.nio.fs.UnixNativeDispatcher.unlink0
  0.0%     0  +     8    sun.nio.ch.FileDispatcherImpl.size0
  0.0%     0  +     5    java.lang.Object.hashCode
  0.0%     0  +     4    java.lang.System.arraycopy
  0.0%     0  +     4    java.lang.Thread.isInterrupted
  0.0%     0  +     4    sun.misc.Unsafe.copyMemory
  0.0%     0  +     3    sun.nio.ch.FileChannelImpl.map0
  0.0%     0  +     2    sun.nio.ch.KQueueArrayWrapper.interrupt
  0.0%     0  +     2    java.lang.Object.clone
 44.4%    24  + 12598    Total stub (including elided)

  Thread-local ticks:
  8.7%  2712             Blocked (of total)

Remember that this shows a flat profile. So every method printed here was at some point in time the top-most method on the stack. We do not see anything about the caller-callee relationship of these methods.

We can see that the profiler differentiates between interpreted, compiled and JVM stub samples. For an application in steady state, we’d expect the majority of samples in compiled code (which is the case here). Most samples also show Lucene, which makes sense considering that Lucene is responsible for indexing the data in the end.

When we terminate the application, we’ll see a global summary. Also the bytecode ticks summary shows some statistics now:

 Bytecode ticks:
    47    0 = ldc
     1    0 = ldc_w
     1    0 = ifgt
     3    0 = if_icmplt
    46    0 = goto
     2    0 = ireturn
     2    0 = areturn
     2    0 = return
     8    0 = getstatic
     1    0 = putstatic
     3    0 = getfield
     3    0 = putfield
     5    0 = invokevirtual
     2    0 = invokespecial
     6    0 = invokestatic
     2    0 = invokeinterface
     4    0 = invokedynamic
    29    1 = new
    91    0 = newarray
   106    0 = anewarray
     2    0 = checkcast
     2    0 = instanceof
     2    0 = monitorenter
     2    0 = monitorexit

In order to understand the output, you need to glance at the source code. The left column shows the number of ticks outside of stub code and the right column the number of ticks in stub code.

However, you may find the VM summary more useful as we can get a feeling how much time the JVM has spent for compilation, GC and other internal operations.

Global summary of 592.86 seconds:
100.0% 34399             Received ticks
  4.6%  1571             Received GC ticks
 21.1%  7253             Compilation
  0.3%    86             Other VM operations
  0.0%     1             Class loader


Flat profile of 0.00 secs (34300 total ticks): VM Thread
  0.3%    99             Lost ticks

     Compiled + native   Method
 99.2%     0  + 34034    __psynch_cvwait
  0.2%     0  +    76    SafepointSynchronize::begin()
  0.2%     0  +    60    PromotionInfo::promoted_oops_iterate_nv(ParScanWithoutBarrierClosure*)
  0.0%     0  +    15    ThreadSafepointState::examine_state_of_thread()
  0.0%     0  +    13    PromotionInfo::nextDisplacedHeader()
  0.0%     0  +    11    SymbolTable::buckets_unlink(int, int, int*, int*, unsigned long*)
  0.0%     0  +     9    __psynch_cvsignal
  0.0%     0  +     8    RelocIterator::next()
  0.0%     0  +     5    JNIHandleBlock::weak_oops_do(BoolObjectClosure*, OopClosure*)
  0.0%     0  +     5    __psynch_mutexwait
  0.0%     0  +     4    ObjectSynchronizer::deflate_idle_monitors()
  0.0%     0  +     4    pthread_getspecific
  0.0%     0  +     3    Monitor::IWait(Thread*, long)
  0.0%     0  +     3    CodeHeap::find_start(void*) const
  0.0%     0  +     3    nmethod::get_deopt_original_pc(frame const*)
  0.0%     0  +     2    __vfprintf
  0.0%     0  +     2    RelocIterator::advance_over_prefix()
  0.0%     0  +     2    __mmap
  0.0%     0  +     2    ThreadSafepointState::roll_forward(ThreadSafepointState::suspend_type)
  0.0%     0  +     2    _platform_memmove$VARIANT$Haswell
  0.0%     0  +     2    ReceiverTypeData::clean_weak_klass_links(BoolObjectClosure*)
  0.0%     0  +     2    __mprotect
  0.0%     0  +     2    _pthread_cond_wait
  0.0%     0  +     2    TemplateInterpreter::notice_safepoints()
  0.0%     0  +     2    OopMapSet::find_map_at_offset(int) const
100.0%     0  + 34300    Total compiled (including elided)

Summary

So when should you use -Xprof? It has a couple of advantages:

However, I think the lack of information about it on the Internet is telling. In my opinion it has serious disadvantages:

I’d always prefer a “proper” profiler like Java Flight recorder, Honest Profiler or Yourkit during development. If you need to quickly analyze a problem in production, periodically taking thread dumps is usually helpful (either manually or automated, e.g. with stcap). An alternative to the latter may be -Xprof, especially if you are in an extremely limited environment and want to quickly get at least some information out of your application.


Simplicity

I admit it: I use noscript, I block trackers like Google Analytics and all sorts of ads. If you browse the web this way, you see strange things: in the best case layouts are completely broken but in some cases pages even do not show up. And no, I am not talking about interactive pages like Google Maps. I am talking about pages that are content heavy. To me these people are just downright sloppy.

As a counter-statement I have eliminated most waste on my own page now (“the one who is without sin is the one who should cast the first stone”). From today on, this page does NOT use:

What is left? The essentials: HTML, CSS and images. There are some old presentations that I made with reveal.js but the rest of this page will not see Javascript any time soon.


All Posts