HotSpot JVM JIT optimisation techniques

After a JVM JIT overview, I put together a list of important techniques used by HotSpot JVM JIT compiler as I discovered them during my study in this area. Neither it has the ambition to provide an exhaustive list of techniques nor providing a deep expertise on a particular topic. The level was set to just understand the technique. Each section has links I found pretty useful in my study of the subject.
Described techniques in this article are based on Oracle HotSpot JVM implementation, but it should also be valid for OpenJDK and VMs based on that at least in a limited scope.
Majority of the described methods are dependant on runtime information collected by VM during execution and stored in ephemeral method data objects.

  • method inlining – is a base optimisation technique which can enable more sophisticated and complex techniques later on as it brings related code closer together. It eliminates the cost associated with calling a method as it copies a body of the called method in a place from where it was called. HotSpot places some restrictions on which methods can be inlined as, e.g. bytecode size of the inlined method, the depth of the method in the current call chain, etc. For the exact list see HotSpot VM options. An earlier post contains detailed method inlining example including evaluation of the performance gain.
  • loop unrolling – reduces the number of jumps back to the loop body beginning. Each jump can have an adverse effect on the CPU as it dumps a pipeline of incoming instructions. It eliminates the cost associated with branch prediction and removes safe point polls which are typically added at the end of each loop iteration. The shorter the loop body the higher the penalty for the back jump. This HotSpot performs decision on various criteria like type of loop variable, number of loop exit points, etc. Available benchmark showed that loop with integer loop variable was two times faster than with long. This cannot be taken as a rule of thumb but rather an idea about the possible impact of this optimisation. The behaviours differ between HotSpot versions, is dependant on CPU architecture and available instruction set. For more details check loop optimisation in HotSpot VM compiler
  • escape analysis – the sole purpose of this technique is to decide whether work done in a method is visible outside of the method or has any side effects. This optimisation is performed after any inlining has completed. Such knowledge is utilised in eliminating unnecessary heap allocation via optimisation called scalar replacement. In principle, object fields become scalar values as if they had been allocated as local variables instead. This reduces the object allocation rate and reduces memory pressure and in the end results in fewer GC cycles. This effect is nicely demonstrated in “Automatic stack allocation in the java virtual machine” blog post. Details can be also found in OpenJDK Escape Analysis.
    Escape analysis is also utilised when optimizing a performance of intrinsic locks (those using synchronized). There are possible lock optimisations which and essentially eliminates lock overhead:

    • lock elision – removing locks on an object which doesn’t escape given scope. A great blog post from Aleksey Shipilёv on lock elision demonstrates the performance effects.
    • lock coarsening – merges sequential lock regions that share the same lock. More detailed information can be found in the post dedicated to lock coarsening
    • nested locks – detects blocks of code where the same lock is acquired without releasing
  • monomorphic dispatch – this optimisation relies on the empirical fact that in the authored code it is quite often that only one type is observed in receiver object. This optimisation essentially eliminates the cost of looking up method in vtables – removes invokevirtual bytecode instruction. This optimisation is also protected by a guard to ensure that no wrong code ever invoked. HotSpot also has bimorphic dispatch – pretty similar stuff just done with two classes. If there is a need to regain some performance it is possible to transform the code in a way that call contains just two types to take advantage of this optimisation.
  • intrinsic – is the highly performant native implementation of the method known to JVM in advance rather than generated by JIT. They are used for performance critical methods and are backed by either special instruction provided by CPU or operating system. That means that they are platform dependent and some of them may not be available on all platforms. A decision about which optimisation to use might be deferred until runtime. Intrinsics are contained in .ad files of HotSpot OpenJDK source codes for example intrinsics x86 64-bit architecture. List of HotSpot available intrinsics is listed in vmSymbols.
    Someone who would like to see more detail on intrinsic could find nice this blog post about POPCNT CPU instruction(x86 and amd64) utilised when Integer.bitcount including benchmark to get an idea about the performance impact. This presentation gives great insight into what does it mean to add a new intrinsic to HotSpot. The obvious fact is that you have to implement it in C++ with all the risks it brings it (no memory management and an issue can bring the whole VM down). Compared to Graal a new JIT compiler for JVM written in Java integrated via JVM Compiler Interface (JVMCI). As JIT is nothing more than just the transformation from bytecode to machine code. It brings a lot of benefits e.g. using standard Java tooling. Graal dramatically simplifies developing of custom intrinsics for specific hardware. For better understanding Graal I found Chris Seaton presentation pretty useful.
  • on stack replacement (OSR) – is another term you might find when you start reading about JIT optimisations. OSR is just a technique for switching between two different implementations while it is running. This comes in handy in situations when some functions are invoked just once and could benefit from optimised version. When OSR happens a VM needs to be paused and replace the current frame with the new one which may have variables in different locations. For further info, I refer to an interesting post on On Stack Replacement whether is good or bad.

To see what JIT compilers perform during application runtime use JITWatch. Thanks to a newly gained understanding of JIT optimisations is should make a lot more sense what’s going on.

References for further study:
Useful JIT optimisation techniques
Java HotSpot performance engine architecture