HotSpot JVM JIT optimisation techniques

After a JVM JIT overview, I put together a list of important techniques used by the HotSpot JVM JIT compiler as I discovered them during my study in this area. It has no ambition to be an exhaustive list of techniques, nor to provide deep expertise on any particular topic; the level was set to just understand each technique. Each section has links I found pretty useful in my study of the subject.
The techniques described in this article are based on the Oracle HotSpot JVM implementation, but they should also be valid, at least to a limited extent, for OpenJDK and VMs based on it.
The majority of the described methods depend on runtime information collected by the VM during execution and stored in ephemeral method data objects.

  • method inlining – a base optimisation technique which can enable more sophisticated and complex techniques later on, as it brings related code closer together. It eliminates the cost associated with calling a method by copying the body of the called method into the place from where it was called. HotSpot places some restrictions on which methods can be inlined, e.g. the bytecode size of the inlined method, the depth of the method in the current call chain, etc. For the exact list see the HotSpot VM options. An earlier post contains a detailed method inlining example, including an evaluation of the performance gain.
  • loop unrolling – reduces the number of jumps back to the beginning of the loop body. Each jump can have an adverse effect on the CPU as it flushes the pipeline of incoming instructions. It eliminates the cost associated with branch prediction and removes the safepoint polls which are typically added at the end of each loop iteration. The shorter the loop body, the higher the relative penalty for the back jump. HotSpot decides on unrolling based on various criteria, like the type of the loop variable, the number of loop exit points, etc. (a typical unrolling candidate is sketched after this list). An available benchmark showed that a loop with an int loop variable was two times faster than one with a long. This cannot be taken as a rule of thumb, but rather as an idea of the possible impact of this optimisation. The behaviour differs between HotSpot versions and depends on the CPU architecture and the available instruction set. For more details check loop optimisation in the HotSpot VM compiler.
  • escape analysis – the sole purpose of this technique is to decide whether work done in a method is visible outside of the method or has any side effects. This analysis is performed after any inlining has completed. Such knowledge is utilised in eliminating unnecessary heap allocations via an optimisation called scalar replacement: object fields become scalar values, as if they had been allocated as local variables instead (see the sketch after this list). This reduces the object allocation rate, reduces memory pressure, and in the end results in fewer GC cycles. This effect is nicely demonstrated in the “Automatic stack allocation in the java virtual machine” blog post. Details can also be found in OpenJDK Escape Analysis.
    Escape analysis is also utilised when optimising the performance of intrinsic locks (those using synchronized). The following lock optimisations can essentially eliminate the lock overhead (the first two are sketched after this list):

    • lock elision – removes locks on an object which doesn’t escape the given scope. A great blog post from Aleksey Shipilёv on lock elision demonstrates the performance effects.
    • lock coarsening – merges sequential lock regions that share the same lock. More detailed information can be found in the post dedicated to lock coarsening.
    • nested locks – detects blocks of code where the same lock is acquired repeatedly without being released and eliminates the redundant acquisitions.
  • monomorphic dispatch – this optimisation relies on the empirical fact that, quite often, only one receiver type is ever observed at a given call site. It essentially eliminates the cost of looking the method up in vtables and removes the invokevirtual bytecode instruction (see the sketch after this list). The optimised code is protected by a guard to ensure that no wrong code is ever invoked. HotSpot also has bimorphic dispatch – the same idea, just done with two observed classes. If there is a need to regain some performance, it is possible to transform the code so that a call site sees just two types in order to take advantage of this optimisation.
  • intrinsics – highly performant native implementations of methods known to the JVM in advance, rather than generated by the JIT. They are used for performance-critical methods and are backed either by special instructions provided by the CPU or by the operating system. That means they are platform dependent and some of them may not be available on all platforms. The decision about which implementation to use might be deferred until runtime. Intrinsics are contained in the .ad files of the HotSpot OpenJDK source code, for example the intrinsics for the x86 64-bit architecture. The list of available HotSpot intrinsics is in vmSymbols.
    Someone who would like to see more detail on intrinsics could enjoy the blog post about the POPCNT CPU instruction (x86 and amd64), which is utilised by Integer.bitCount, including a benchmark that gives an idea of the performance impact. This presentation gives great insight into what it means to add a new intrinsic to HotSpot. The obvious fact is that you have to implement it in C++, with all the risks that brings (manual memory management, and a single issue can bring the whole VM down). Compare that to Graal, a new JIT compiler for the JVM written in Java and integrated via the JVM Compiler Interface (JVMCI). As a JIT is nothing more than a transformation from bytecode to machine code, writing it in Java brings a lot of benefits, e.g. the use of standard Java tooling. Graal dramatically simplifies the development of custom intrinsics for specific hardware. For a better understanding of Graal I found Chris Seaton’s presentation pretty useful.
  • on stack replacement (OSR) – another term you might encounter when you start reading about JIT optimisations. OSR is a technique for switching between two different implementations of a method while it is running. It comes in handy in situations where a method is invoked just once but contains a hot loop that would benefit from an optimised version. When OSR happens, the VM needs to be paused so that the current stack frame can be replaced with a new one, which may keep its variables in different locations. For further info, I refer to an interesting post on whether On Stack Replacement is good or bad.
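
A rough sketch of a typical unrolling candidate from the loop unrolling bullet above (illustrative code, not taken from the referenced benchmark): a counted loop over an int index, which HotSpot may compile as several iterations of straight-line code, reducing back jumps and safepoint polls.

object UnrollDemo {
  def sum(values: Array[Int]): Int = {
    var total = 0
    var i = 0 // int loop variable – a prime unrolling candidate
    while (i < values.length) {
      total += values(i)
      i += 1
    }
    total
  }
}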
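
A minimal sketch of the scalar replacement case (the Point class and method here are hypothetical): the object never escapes the method, so after inlining, escape analysis lets the JIT keep the fields in registers and skip the heap allocation entirely.

final case class Point(x: Double, y: Double)

object ScalarReplacementDemo {
  def distanceFromOrigin(x: Double, y: Double): Double = {
    val p = Point(x, y)              // allocation candidate for elimination
    math.sqrt(p.x * p.x + p.y * p.y) // only the scalar fields are needed
  }
}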
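
The lock optimisations can be sketched in the same spirit (again hypothetical code): StringBuffer has synchronized methods, but the instance below never escapes, so its locks can be elided; and the two adjacent synchronized blocks on the same monitor are a candidate for coarsening.

object LockDemo {
  // lock elision: sb is local and is never shared with another thread
  def concat(a: String, b: String): String = {
    val sb = new java.lang.StringBuffer()
    sb.append(a) // the internal locks taken by append can be elided
    sb.append(b)
    sb.toString
  }

  // lock coarsening: adjacent regions on the same lock may be merged
  private val lock = new Object
  private var counter = 0
  def bump(): Unit = {
    lock.synchronized { counter += 1 }
    lock.synchronized { counter += 1 }
  }
}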
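
Finally, a sketch of a call site that can profit from monomorphic dispatch (the types are made up for illustration): if only one concrete Shape is ever observed at the call site, the JIT can replace the virtual lookup with a guarded direct call.

trait Shape { def area: Double }
final class Circle(r: Double) extends Shape { def area: Double = math.Pi * r * r }
final class Square(a: Double) extends Shape { def area: Double = a * a }

object DispatchDemo {
  def totalArea(shapes: Seq[Shape]): Double =
    shapes.map(_.area).sum // monomorphic if shapes only ever holds Circles
}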

To see what the JIT compilers do during application runtime, use JITWatch. With a newly gained understanding of JIT optimisations, what is going on there should make a lot more sense.

References for further study:
Useful JIT optimisation techniques
Java HotSpot performance engine architecture 

How to optimise code to be JIT friendly

In the previous blog post, we measured the effect of a basic JIT optimisation technique – method inlining (one among the other JIT optimisation techniques). The code example was a bit unnatural, as it was super simple Scala code written just to demonstrate method inlining. In this post, I would like to share the general approach I use when I want to check how the JIT treats my code, or whether there is some possibility to improve the code's performance with regard to the JIT. Even method inlining requires the code to meet certain criteria, such as the bytecode length of the inlined method. For this purpose I regularly use a great OpenJDK project called JITWatch, which comes with a bunch of handy JIT-related tools. I am pretty sure there are more tools out there, and I will be more than happy if you share your approaches to dealing with the JIT in the comment section below the article.
Java HotSpot is able to produce a very detailed log of what the JIT compiler is doing and why. Unfortunately, the resulting log is very complex and difficult to read; reading it requires an understanding of the techniques and theory that underlie JIT compilation. A free tool like JITWatch processes those logs and abstracts this complexity away from the user.

In order to produce a log suitable for JITWatch investigation, the tested application needs to be run with the following JVM flags:
-XX:+UnlockDiagnosticVMOptions
-XX:+LogCompilation
-XX:+TraceClassLoading
Those settings will produce a log file hotspot_pidXXXXX.log. For the purposes of this article, I re-used the code from the previous blog post, located on my GitHub account, with the JVM flags enabled in build.sbt.
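For reference, a build.sbt fragment enabling these flags could look roughly like this (a sketch; it assumes the application is run in a forked JVM so that the options reach it):

fork in run := true
javaOptions in run ++= Seq(
  "-XX:+UnlockDiagnosticVMOptions",
  "-XX:+LogCompilation",
  "-XX:+TraceClassLoading"
)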
In order to look into the generated machine code in JITWatch, we need to install the HotSpot Disassembler (HSDIS) into $JAVA_HOME/jre/lib/server/. For Mac OS X a build can be obtained from here; rename it to hsdis-amd64.dylib. In order to include the machine code in the generated JIT log, we also need to add the JVM flag -XX:+PrintAssembly.
[info] 0x0000000103e5473d: test %r13,%r13
[info] 0x0000000103e54740: jne 0x0000000103e5472a
[info] 0x0000000103e54742: mov $0xfffffff6,%esi
[info] 0x0000000103e54747: mov %r14d,%ebp
[info] 0x0000000103e5474a: nop
[info] 0x0000000103e5474b: callq 0x0000000103d431a0 ; OopMap{off=112}
[info] ;*invokevirtual inc
[info] ; - com.jaksky.jvm.tests.jit.IncWhile::testJit@12 (line 19)
[info] ; {runtime_call}
[info] 0x0000000103e54750: callq 0x0000000102e85c18 ;*invokevirtual inc
[info] ; - com.jaksky.jvm.tests.jit.IncWhile::testJit@12 (line 19)
[info] ; {runtime_call}
[info] 0x0000000103e54755: xor %r13d,%r13d
We run JITWatch via ./launchUI.sh
JITWATCH_config
and configure the source files and the generated class files:
JITWatch_configuration

And finally, we open the prepared JIT log and hit Start.

The most interesting view from our perspective is the TriView, where we can see the source code, JVM bytecode and native code side by side. For this particular example we disabled method inlining via the JVM flag -XX:CompileCommand=dontinline,com/jaksky/jvm/tests/jit/IncWhile.inc

JITWatch_notinlined
Compare this with the case when the method body of IncWhile.inc is inlined – the native code size is greater, 216 bytes compared to 168, while the bytecode size stays the same.
JITWatch-inlined
The Compile Chain view also provides a great picture of what is happening with the code:
JITWatch_compileChain
The inlining report provides a great overview of what happened with each call site:
JITWatch-inlining
Here we can see the effect of tiered compilation, as described in the JIT compilation post: it starts with client C1 JIT compilation and then switches to server C2 compilation. The same or an even better view can be found in the Compiler Thread activity, which provides a timeline view. To refresh your memory, check the overview of JVM threads. Note: standard Java library code is subject to JIT optimisations too, which is why there is so much compilation activity here.
JITWatch_compilerThreads
JITWatch is a really awesome tool and provides many other views which it doesn't make sense to screenshot here, e.g. code cache allocation, nmethods etc. For detailed information I really suggest reading the JITWatch wiki pages. Now the question is: how do we write JIT-friendly code? Here the pure jewel of JITWatch comes in: the Suggestion Tool. That is why I like JITWatch so much. For the demonstration I selected a somewhat more complex problem – the N queens problem.
JITWatch_suggestion
The Suggestion Tool clearly describes why certain compilations failed and what the exact reason was. It is a coincidence that in this example we hit inlining again; there is definitely more going on in the JIT, but this window provides a clear view of how we can possibly help the JIT.
Another great tool which is also a part of JITWatch is the JarScan Tool. This utility scans a list of jars and counts the bytecode size of every method and constructor. Its purpose is to highlight the methods that are bigger than the HotSpot threshold for inlining hot methods (35 bytes by default), so it provides hints on where to focus benchmarking to see whether decomposing the code into smaller methods brings a performance gain. The hotness of a method is determined by a set of heuristics including call frequency etc., but what can exclude a method from inlining is its size. Of course, a method merely exceeding the inlining size limit doesn't automatically mean it is a performance bottleneck. JarScan is a static analysis tool with no knowledge of runtime statistics, hence of real method hotness.
jakub@MBook ~/Development/GIT_REPO (master) $ ./jarScan.sh --mode=maxMethodSize --limit=35 ./chess-challenge/target/scala-2.12/classes/
"cz.jaksky.chesschallenge","ChessChallange$","delayedEndpoint$cz$jaksky$chesschallenge$ChessChallange$1","",1281
"cz.jaksky.chesschallenge.solver","ChessBoardSolver$","placeFigures$1","scala.collection.immutable.List,scala.collection.immutable.Set",110
"cz.jaksky.chesschallenge.solver","ChessBoardSolver$","visualizeSolution","scala.collection.immutable.Set,int,int",102
"cz.jaksky.chesschallenge.domain","Knight","check","cz.jaksky.chesschallenge.domain.Position,cz.jaksky.chesschallenge.domain.Position",81
"cz.jaksky.chesschallenge.domain","Queen","equals","java.lang.Object",73
"cz.jaksky.chesschallenge.domain","Rook","equals","java.lang.Object",73
"cz.jaksky.chesschallenge.domain","Bishop","equals","java.lang.Object",73
"cz.jaksky.chesschallenge.domain","Knight","equals","java.lang.Object",73
"cz.jaksky.chesschallenge.domain","King","equals","java.lang.Object",73
"cz.jaksky.chesschallenge.domain","Position","Position","int,int",73
"cz.jaksky.chesschallenge.domain","Position","equals","java.lang.Object",72
To wrap up, JITWatch is a great tool which provides insight into the HotSpot JIT optimisations happening during program execution, and it can help you understand how a decision made at the source code level can affect the performance of the program.

JVM JIT compilation as a way of performance optimisation

The previous article, structure of JVM – java memory model, briefly mentions bytecode execution modes, and the article on JVM internal threads provides additional insight into the internal architecture of JVM execution. In this article, we focus on Just-In-Time compilation and on some of its basic optimisation techniques. We also discuss the performance impact of one JIT optimisation technique, namely method inlining. In the remainder of this article we focus solely on the HotSpot JVM; however, the principles are valid in general.
HotSpot JVM is a mixed-mode VM, which means that it starts off interpreting the bytecode but can compile code into very highly optimised native machine code for faster execution. This optimised code runs extremely fast and its performance can be compared with C/C++ code. JIT compilation happens on a per-method basis during runtime, after the method has been run a number of times and is considered hot. The compilation into machine code happens on a separate JVM thread and does not interrupt the execution of the program. While the compiler thread is compiling a hot method, the JVM keeps on using the interpreted version of the method until the compiled version is ready. Thanks to the code's runtime characteristics, HotSpot JVM can make sophisticated decisions about how to optimise the code.
Java HotSpot VM is capable of running in two separate modes (C1 and C2), and each mode is preferred in different situations:
  • C1 (-client) – used for applications where quick startup and solid optimisation are needed; typically GUI applications are good candidates.
  • C2 (-server) – for long-running server applications.
Those two compiler modes use different techniques for JIT compilation, so it is possible to get very different machine code for the same method. Modern Java applications can take advantage of both compilation modes, and starting from Java SE 7 a feature called tiered compilation is available. An application starts with C1 compilation, which enables a fast startup, and once the application is warmed up the C2 compiler takes over. Since Java SE 8, tiered compilation is the default. Server optimisation is more aggressive and based on assumptions which may not always hold, so these optimisations are always protected with a guard condition checking that the assumption is correct. If an assumption is not valid, the JVM reverts the optimisation and drops back to interpreted mode. In server mode, HotSpot VM runs a method in interpreted mode 10,000 times before compiling it (this can be adjusted via -XX:CompileThreshold=5000). Changing this threshold should be considered thoroughly, as HotSpot works best when it can accumulate enough statistics to make an intelligent decision about what to compile. If you want to inspect what is being compiled, use -XX:+PrintCompilation.
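The output of -XX:+PrintCompilation looks roughly like the lines below (made-up values for illustration; the exact columns vary between JVM versions): a timestamp in milliseconds, a compilation id, attribute flags (% marks an on-stack-replacement compilation), the tier, and the method with its bytecode size.

    112    1       3       java.lang.String::hashCode (55 bytes)
    140   12 %     4       com.example.Loop::main @ 2 (42 bytes)
    153   13       4       com.example.Loop::inc (9 bytes)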
Among the most common JIT compilation techniques used by HotSpot VM is method inlining, which is the practice of substituting the body of a method into the places from where it is called. This technique saves the cost of calling the method. In HotSpot, there is a limit on the size of the methods that can be substituted. The next commonly used technique is monomorphic dispatch, which relies on the fact that a path through the method code usually belongs to one reference type most of the time, while other paths belong to other types. Thanks to this observation, the exact method definition is known without checking, so the overhead of the virtual method lookup can be eliminated and the JIT compiler can emit faster, optimised machine code. There are many other optimisation techniques, like loop optimisation, dead code elimination, intrinsics and others.
The performance gain from the inlining optimisation can be demonstrated on simple Scala code:
class IncWhile {

  def main(): Int = {
    var i: Int = 0
    var limit = 0

    while (limit < 1000000000) {
      i = inc(i)
      limit = limit + 1
    }
    i
  }

  def inc(i: Int): Int = i + 1 // just 9 bytes of bytecode – small enough to be inlined
}
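
The timings shown later in this post come from a JMH benchmark run. A minimal harness for the class above, assuming the sbt-jmh plugin (the annotations and naming here are a sketch, not the exact project setup), could look roughly like this:

import java.util.concurrent.TimeUnit
import org.openjdk.jmh.annotations._

@BenchmarkMode(Array(Mode.AverageTime))
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@State(Scope.Benchmark)
class IncWhileBenchmark {
  private val incWhile = new IncWhile

  @Benchmark
  def main(): Int = incWhile.main() // returning the value prevents dead code elimination
}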

The method inc is eligible for inlining, as its body is smaller than 35 bytes of JVM bytecode (the actual size of the inc method is 9 bytes). The inlining optimisation can be verified by looking into the JIT-optimised machine code.

IncWhile-inlined

The difference is obvious when compared to the machine code produced when inlining is disabled via -XX:CompileCommand=dontinline,com/jaksky/jvm/tests/jit/IncWhile.inc

IncWhile-dontinline

The difference in runtime characteristics is also significant, as the benchmark results show. With inlining disabled:

[info] Result "com.jaksky.jvm.tests.jit.IncWhile.main":
[info] 2112778741.540 ±(99.9%) 9778298.985 ns/op [Average]
[info] (min, avg, max) = (2040573480.000, 2112778741.540, 2192003946.000), stdev = 28831537.237
[info] CI (99.9%): [2103000442.555, 2122557040.525] (assumes normal distribution)
[info] # Run complete. Total time: 00:08:03
[info] Benchmark Mode Cnt Score Error Units
[info] IncWhile.main avgt 100 2112778741.540 ± 9778298.985 ns/op

With inlining enabled, the JVM JIT is also able to apply further optimisations, like loop optimisations, which might cause our whole loop to be eliminated, as its result is easily predictable. We would then get a time of around 3 ns, which would be unreal for a 1 GHz processor performing a billion operations. To disable most of the loop optimisations, use the -XX:LoopOptsCount=0 JVM option.

[info] Result "com.jaksky.jvm.tests.jit.IncWhile.main":
[info] 332699064.778 ±(99.9%) 3485503.823 ns/op [Average]
[info] (min, avg, max) = (316312877.000, 332699064.778, 358738827.000), stdev = 10277087.396
[info] CI (99.9%): [329213560.955, 336184568.600] (assumes normal distribution)
[info] # Run complete. Total time: 00:04:55
[info] Benchmark Mode Cnt Score Error Units
[info] IncWhile.main avgt 100 332699064.778 ± 3485503.823 ns/op
So the performance gain from inlining a method body can be quite significant: roughly 2 seconds vs 330 milliseconds.
In this post, we discussed the mechanics of Java JIT compilation and some of the optimisation techniques it uses. We particularly focused on one of the simplest optimisation techniques, called method inlining. We demonstrated the performance gain brought by eliminating a method call represented by an invokevirtual bytecode instruction. Scala also offers a special annotation, @inline, which should help with the performance aspects of the code under development. All the code for running the experiments is available online on my GitHub account.