HotSpot JVM JIT optimisation techniques

After a JVM JIT overview, I put together a list of important techniques used by the HotSpot JVM JIT compiler as I discovered them during my study of this area. The list has no ambition to be exhaustive, nor to provide deep expertise on any particular topic; the level is set to just understanding each technique. Each section contains links I found pretty useful in my study of the subject.
The techniques described in this article are based on the Oracle HotSpot JVM implementation, but they should also be valid, at least to a limited extent, for OpenJDK and VMs based on it.
The majority of the described methods depend on runtime information collected by the VM during execution and stored in ephemeral method data objects.

  • method inlining – a foundational optimisation technique which can enable more sophisticated and complex techniques later on, as it brings related code closer together. It eliminates the cost associated with calling a method by copying the body of the called method into the place from which it was called. HotSpot places some restrictions on which methods can be inlined, e.g. the bytecode size of the inlined method, the depth of the method in the current call chain, etc. For the exact list see HotSpot VM options. An earlier post contains a detailed method inlining example, including an evaluation of the performance gain.
  • loop unrolling – reduces the number of jumps back to the beginning of the loop body. Each jump can have an adverse effect on the CPU as it flushes the pipeline of incoming instructions. It eliminates the cost associated with branch prediction and removes the safepoint polls typically added at the end of each loop iteration. The shorter the loop body, the higher the penalty for the back jump. HotSpot decides whether to unroll based on various criteria, like the type of the loop variable, the number of loop exit points, etc. An available benchmark showed that a loop with an int loop variable was two times faster than the same loop with a long. This cannot be taken as a rule of thumb but rather as an idea of the possible impact of this optimisation. The behaviour differs between HotSpot versions and depends on the CPU architecture and the available instruction set. For more details check loop optimisation in the HotSpot VM compiler.
  • escape analysis – the sole purpose of this technique is to decide whether work done in a method is visible outside of the method or has any side effects. This analysis is performed after inlining has completed. Such knowledge is utilised to eliminate unnecessary heap allocations via an optimisation called scalar replacement: in principle, object fields become scalar values, as if they had been allocated as local variables instead. This reduces the object allocation rate, reducing memory pressure and ultimately resulting in fewer GC cycles. The effect is nicely demonstrated in the “Automatic stack allocation in the java virtual machine” blog post. Details can also be found in OpenJDK Escape Analysis. A minimal sketch of a scalar-replaceable allocation follows this list.
    Escape analysis is also utilised when optimising the performance of intrinsic locks (those using synchronized). The following lock optimisations can essentially eliminate the lock overhead:

    • lock elision – removes locks on an object which doesn’t escape a given scope. A great blog post from Aleksey Shipilёv on lock elision demonstrates the performance effects (see also the sketch after this list).
    • lock coarsening – merges sequential lock regions that share the same lock. More detailed information can be found in the post dedicated to lock coarsening.
    • nested locks – detects blocks of code where the same lock is acquired repeatedly without being released.
  • monomorphic dispatch – this optimisation relies on the empirical fact that in authored code it is quite common for only one type to be observed as the receiver object at a given call site. It essentially eliminates the cost of looking up the method in vtables – it removes the invokevirtual bytecode instruction. The optimised code is protected by a guard to ensure that no wrong code is ever invoked. HotSpot also has bimorphic dispatch – pretty similar, just done with two classes. If there is a need to regain some performance, it is possible to transform the code so that a call site sees just two types and can take advantage of this optimisation. A sketch of a monomorphic call site follows this list.
  • intrinsic – a highly performant native implementation of a method, known to the JVM in advance rather than generated by the JIT. Intrinsics are used for performance-critical methods and are backed by either a special instruction provided by the CPU or by the operating system. That means they are platform dependent and some of them may not be available on all platforms. The decision about which implementation to use might be deferred until runtime. Intrinsics are contained in the .ad files of the HotSpot OpenJDK source code, for example the intrinsics for the x86 64-bit architecture. The list of available HotSpot intrinsics is in vmSymbols.
    Someone who would like more detail on intrinsics might find this blog post about the POPCNT CPU instruction (x86 and amd64), utilised by Integer.bitCount, a nice read, including a benchmark to get an idea of the performance impact. This presentation gives great insight into what it means to add a new intrinsic to HotSpot. The obvious fact is that you have to implement it in C++, with all the risks that brings (no memory management, and an issue can bring the whole VM down). Compare that to Graal, a new JIT compiler for the JVM written in Java and integrated via the JVM Compiler Interface (JVMCI). As a JIT is nothing more than a transformation from bytecode to machine code, writing it in Java brings a lot of benefits, e.g. the use of standard Java tooling. Graal dramatically simplifies the development of custom intrinsics for specific hardware. For a better understanding of Graal I found Chris Seaton’s presentation pretty useful.
  • on stack replacement (OSR) – another term you might encounter when you start reading about JIT optimisations. OSR is a technique for switching between two different implementations of a method while it is running. It comes in handy when a method is invoked just once, e.g. it contains a long-running loop, and could benefit from an optimised version before it ever returns. When OSR happens the VM needs to be paused and the current stack frame replaced with a new one, which may have its variables in different locations. For further info, I refer to an interesting post on whether On Stack Replacement is good or bad.
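To make the escape analysis and lock optimisation points above more concrete, here is a minimal, purely illustrative sketch (the class and method names are mine, not anything from HotSpot):

    public class EscapeAnalysisSketch {

        static final class Point {
            final int x;
            final int y;
            Point(int x, int y) { this.x = x; this.y = y; }
        }

        // The Point never leaves this method, so after inlining HotSpot may
        // scalar-replace it: no heap allocation, x and y live in registers.
        static int sum(int a, int b) {
            Point p = new Point(a, b);
            return p.x + p.y;
        }

        // sb never escapes either, so the monitors taken by StringBuffer's
        // synchronized methods can be elided, and the repeated lock/unlock
        // pairs inside the loop are candidates for lock coarsening.
        static String join(String[] parts) {
            StringBuffer sb = new StringBuffer();
            for (String part : parts) {
                sb.append(part);
            }
            return sb.toString();
        }
    }

Benchmarking such code with the optimisation switched off via -XX:-DoEscapeAnalysis or -XX:-EliminateLocks (both are real HotSpot flags) gives a rough idea of what it buys.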
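And a similarly hedged sketch of a call site that stays monomorphic (the types are invented for illustration):

    interface Shape { double area(); }

    final class Circle implements Shape {
        private final double r;
        Circle(double r) { this.r = r; }
        @Override public double area() { return Math.PI * r * r; }
    }

    public class DispatchSketch {
        // If the profile collected in the method data objects shows that only
        // Circle ever reaches this call site, HotSpot can devirtualise
        // shape.area() into a direct, inlinable call protected by a cheap
        // type-check guard (deoptimising if a different Shape shows up).
        static double totalArea(Shape[] shapes) {
            double total = 0.0;
            for (Shape shape : shapes) {
                total += shape.area(); // monomorphic call site
            }
            return total;
        }
    }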

To see what the JIT compilers do during application runtime, use JITWatch. With the newly gained understanding of JIT optimisations, its output should make a lot more sense.
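JITWatch works off the HotSpot compilation log. A typical JDK 8 invocation to produce one looks like this (the flags are real HotSpot switches; MyApp and the log file name are placeholders):

    java -XX:+UnlockDiagnosticVMOptions -XX:+TraceClassLoading \
         -XX:+LogCompilation -XX:LogFile=hotspot.log MyApp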

References for further study:
Useful JIT optimisation techniques
Java HotSpot performance engine architecture 


HotSpot JVM internal threads

In the Structure of Java Virtual Machine we scratched the surface of the class file structure and how it is connected to the java memory model via the class loading process. We also briefly discussed the bytecode structure and its execution, including a short introduction to Just In Time runtime optimisation. In this post we look more at the internals of the execution engine; however, there is no ambition to substitute the detailed VM implementation documentation for the HotSpot JVM, just to provide enough detail to gain the bigger picture.

The basic threading model in HotSpot JVM is a one-to-one mapping between Java threads (instances of java.lang.Thread) and native operating system threads. The native thread is created when the Java thread is started and reclaimed once it terminates. The operating system is responsible for scheduling all threads and dispatching them to any available CPU. The relationship between Java thread priorities and operating system thread priorities varies across operating systems.

HotSpot provides monitors by which threads running application code may participate in a mutual exclusion (mutex) protocol. A monitor is either locked or unlocked, and only one thread may own it at any time. Only after acquiring ownership of the monitor may a thread enter a critical section protected by it. Critical sections are referred to as synchronised blocks, delineated by the synchronized keyword.
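As a minimal illustration (the class and field names are mine):

    public class Counter {
        private final Object lock = new Object();
        private int value;

        // The synchronized block is the critical section: a thread must own
        // the monitor of `lock` to enter it and releases the monitor on exit.
        public void increment() {
            synchronized (lock) {
                value++;
            }
        }
    }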

Apart from application threads, the JVM also contains internal threads, which can be categorised into the following groups:

  • VM Thread – responsible for executing VM operations
  • Periodic task thread – thread executing periodic operations within the VM (singleton instance of WatcherThread)
  • GC threads – threads of different types to support parallel and concurrent garbage collection
  • Compiler threads – perform JIT compilation and optimisation of bytecode to native code at runtime (C1 and C2 JIT compilation threads)
  • Signal dispatcher thread – waits for directed signals and dispatches them to a java signal handling method

[Figure: JVM internal threads]
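These threads show up in a full thread dump, e.g. via jstack <pid> or jcmd <pid> Thread.print. An abridged, illustrative JDK 8 listing (exact names and fields vary by JVM version and the GC in use):

    "C2 CompilerThread0" #6 daemon prio=9 os_prio=0 ... waiting on condition
    "Signal Dispatcher" #4 daemon prio=9 os_prio=0 ... runnable
    "VM Thread" os_prio=0 ... runnable
    "GC task thread#0 (ParallelGC)" os_prio=0 ... runnable
    "VM Periodic Task Thread" os_prio=0 ... waiting on condition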

The VM thread spends its time waiting for requested operations to appear in the operation queue (VMOperationQueue). Operations are typically passed to the VM Thread because they require the VM to reach a safepoint before they can be executed. When the VM is at a safepoint, all threads inside the VM are blocked, and any threads executing native code are prevented from returning to the VM while the safepoint is in progress. This means that a VM operation can be executed knowing that no thread is in the middle of modifying the heap; all threads are in a state such that their Java stacks are unchanging and can be examined.

The most familiar VM operations are related to garbage collection, particularly the stop-the-world phase common to many garbage collection algorithms. Other VM operations include thread stack dumps, thread suspension or stopping, inspection or modification via JVMTI, etc. A VM operation can be synchronous or asynchronous.

Safepoints are initiated using a cooperative, polling-based mechanism. A thread effectively asks: “Should I block for a safepoint?” A moment when this question is often asked is during a thread state transition. Threads executing interpreted code don’t usually ask the question; instead, when a safepoint is requested, the interpreter switches to a different dispatch table which includes the question, and when the safepoint is over the dispatch table is switched back. Once a safepoint has been requested, the VM Thread must wait until all threads are known to be in a safepoint-safe state before proceeding with the operation. During the safepoint the Threads_lock is used to block any threads that were running, and the VM Thread releases it once the operation has completed.
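Safepoint pauses can be observed with real HotSpot flags; which form applies depends on the JDK version (MyApp is a placeholder):

    # JDK 9+ unified logging
    java -Xlog:safepoint MyApp
    # JDK 8 and earlier
    java -XX:+PrintSafepointStatistics -XX:PrintSafepointStatisticsCount=1 MyApp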

Structure of Java Virtual Machine (JVM)

Java-based applications run in the Java Runtime Environment (JRE), which consists of a set of Java APIs and the Java Virtual Machine (JVM). The JVM loads an application via class loaders and runs it in the execution engine.

The JVM runs on all kinds of hardware, executing the java bytecode without any change to the java execution code. The VM implements the Write Once, Run Anywhere principle, also called the platform independence principle. Just to sum up the JVM key design principles:
  • Platform independence
    • Clearly defined primitive data types – in languages like C or C++ the size of primitive data types depends on the platform; Java is unified in that matter.
    • Fixed byte order: big endian (network byte order) – Intel x86 uses little endian, while RISC processors typically use big endian. Java uses big endian.
  • Automatic memory management – class instances are created by the user and automatically removed by garbage collection.
  • Stack-based VM – typical computer architectures such as Intel x86 are based on registers, whereas the JVM is based on a stack.
  • Symbolic references – all types except primitive ones are referred to via symbolic names instead of direct memory addresses.

Java uses bytecode as an intermediate representation between source code and the machine code which runs on the hardware. A bytecode instruction is represented by a one-byte number, e.g. getfield 0xb4, invokevirtual 0xb6, hence there is a maximum of 256 instructions. If an instruction doesn’t need operands, the next instruction immediately follows; otherwise operands follow the instruction according to the instruction set specification. These instructions are contained in class files produced by java compilation. The exact structure of the class file is defined in the “Java Virtual Machine specification”, section 4 – the class file format. After some version information there are sections like the constant pool, access flags, fields info, this and super info, methods info, etc. See the spec for the details.
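The bytecode of a compiled class can be inspected with the JDK’s javap tool. For a trivial instance method int add(int a, int b) { return a + b; } in a hypothetical class Adder, javap -c prints something like this (output abridged, comments mine):

    $ javap -c Adder
    int add(int, int);
      Code:
         0: iload_1   // push local variable 1 (a) onto the operand stack
         1: iload_2   // push local variable 2 (b)
         2: iadd      // pop both ints, push their sum
         3: ireturn   // return the int on top of the operand stack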

A class loader loads the compiled java bytecode into the Runtime Data Areas, and the execution engine executes it. A class is loaded when it is used for the first time in the JVM. Class loading works in a dynamic fashion on the parent-child (hierarchical) delegation principle. Explicitly unloading a single class is not allowed; classes are unloaded only when their defining class loader becomes garbage collectable. Some time ago I wrote an article about class loading on an application server. The detailed mechanics of class loading are out of scope for this article.

Runtime data areas are used during execution of the program. Some of these areas are created when the JVM starts and destroyed when the JVM exits. Other data areas are per-thread – created on thread creation and destroyed on thread exit. The following picture is based mainly on JVM 8 internals (it doesn’t include the segmented code cache and the dynamic linking of language-defined object models introduced by JVM 9).

[Figure: JVM runtime data areas]

  • Program counter – exists per thread and holds the address of the currently executed instruction; if the current method is native, the PC is undefined. The PC in fact points to a memory address in the Method Area.
  • Stack – a JVM stack exists per thread and holds one frame for each method executing on that thread. It is a LIFO data structure. Each stack frame has references to the local variable array, the operand stack, and the runtime constant pool of the class whose code is being executed.
  • Native stack – not supported by all JVMs. If the JVM is implemented using the C-linkage model for JNI, then this stack will be a C stack (the order of arguments and the return convention will be identical to a typical C program). Native methods can call back into the JVM and invoke Java methods.
  • Stack frame – stores references that point to the objects or arrays on the heap.
    • Local variable array – all the variables used during execution of the method: all method parameters and locally defined variables.
    • Operand stack – used during execution of bytecode instructions; most of the bytecode manipulates the operand stack, moving values to and from the local variable array.
  • Heap – an area shared by all threads, used to allocate class instances and arrays at runtime. The heap is the subject of garbage collection, java’s way of automatic memory management, and is the space most often mentioned in JVM performance tuning.
  • Non-heap memory areas
    • Method area – shared by all threads. It stores runtime constant pool information, field and method information, static variables, and method bytecode for each class loaded by the JVM. The details of this area depend on the JVM implementation.
      • Runtime constant pool – this area corresponds to the constant pool table in the class file format. It contains all references to methods and fields; when a method or field is referred to, the JVM searches for its actual address in memory by using the constant pool.
      • Method and constructor code
    • Code cache – used for the storage of methods compiled to native code by JIT compilation
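The maximum operand stack depth and the size of the local variable array are computed at compile time and stored per method in the class file; javap -v shows them. For the add method from the earlier hypothetical Adder class (output abridged, comment mine):

    $ javap -v Adder
    int add(int, int);
      descriptor: (II)I
      Code:
        stack=2, locals=3, args_size=3
        // operand stack depth 2; 3 local slots: this, a, b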
Bytecode assigned to the runtime data areas in the JVM via the class loader is executed by the execution engine. The engine reads the bytecode one instruction at a time and must change it into a form that can be executed by the machine. This can happen in one of two ways:
  1. Interpreter – reads, interprets and executes bytecode instructions one by one.
  2. JIT (Just In Time) compiler – compensates for the disadvantages of interpretation. Execution starts in interpreted mode, and the JIT compiler compiles frequently executed bytecode to native code. Execution is then switched from interpretation to native code, which is much faster thanks to the various optimisation techniques the JIT performs. The native code is stored in the code cache. Since compilation to native code takes time, the JVM uses various metrics to decide whether to JIT-compile a piece of bytecode.
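This decision process can be watched with the real HotSpot flag -XX:+PrintCompilation, which logs each method as it is compiled (illustrative, abridged output; the columns are roughly timestamp, compilation id, tier and method):

    $ java -XX:+PrintCompilation MyApp
        110    1       3       java.lang.String::hashCode (55 bytes)
        115    2       4       java.lang.String::equals (81 bytes)
        ...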

How the JVM execution engine runs is not defined by the JVM specification, so vendors are free to improve their JVM engines with various techniques. The HotSpot JVM is described above in more detail, and JITWatch gives insight into its runtime characteristics.

More details can be found in The Java® Virtual Machine Specification – Java SE 9 Edition.

Weblogic classloading

Getting a java.lang.NoSuchMethodError is usually the beginning of a great exploration of your platform – in this case weblogic. The Javadoc says:

Thrown if an application tries to call a specified method of a class (either static or instance), and that class no longer has a definition of that method. Normally, this error is caught by the compiler; this error can only occur at run time if the definition of a class has incompatibly changed.

What the heck is going on here! The libraries used are embedded in the final archive – I did verify that! If you don’t know, simply suspect classloaders, the publicly known enemies of java developers 🙂 Rule no. 1 says: “Verify your assumptions.” The fact that a class is in the archive doesn’t necessarily mean that it gets loaded, so to verify that, simply pass the -verbose or -verbose:class argument to weblogic’s JVM in startUp.sh/bin and you will get the origin of the loaded classes.
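On a pre-JDK 9 HotSpot the output takes the form of [Loaded …] lines; an illustrative example (the exact path is whatever your installation resolves):

    $ java -verbose:class ...
    [Loaded org.apache.log4j.Logger from file:.../WL_HOME/modules/...]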

A class loaded from WL_HOME/modules – how’s that possible? To understand that, a general understanding of classloading is essential, followed by the specifics of your J2EE implementation, e.g. Weblogic, JBoss, … This post does not pretend to expert-level detail on this topic, so I will rather stay with the general principles and refer to the detailed documentation.

Java has several class loaders (bootstrap, extension, …); the important fact is that they work in a hierarchy (parent-child relationship) with a delegation scheme which says when to load a class and from where. The elementary java delegation principle says: delegate the finding of classes and resources to your parent before searching your own classpath; only if the parent cannot find it is the child allowed to load it. So far so good. To complicate the matter a bit more, the java servlet specification recommends looking at the child classloader before delegating to the parent (whether this recommendation was taken up you need to check in the documentation of the J2EE implementation you are using – as you can see, based on those rules alone you know nothing 🙂 ). So to my case of the Weblogic J2EE implementation:

As you can see, the system classloader is the parent of all the application’s classloaders; details can be found here. So how did the class get loaded from WL_HOME/modules? The framework library must be on the system classpath – but on the system classpath there is just weblogic.jar, not my framework library?
Weblogic 10, for better modularity, placed components under WL_HOME/modules, and weblogic.jar now refers to these components in the modules directory from its manifest classpath. That means another version of the library sits on the system classloader – the parent of all the application classloaders – and therefore the libraries included in the application archives will be ignored, based on the delegation scheme. (That was probably the idea behind the child-first recommendation in the J2EE classloading delegation scheme.) However, weblogic does offer another way to solve this case: so-called classloader filters/interceptors, defined in a weblogic-specific deployment descriptor, either at the ear level or the war level.
weblogic-application.xml (ear level):

    <weblogic-application>
      <prefer-application-packages>
        <package-name>org.apache.log4j.*</package-name>
        <package-name>antlr.*</package-name>
      </prefer-application-packages>
    </weblogic-application>

weblogic.xml (war level):

    <weblogic-web-app>
      <container-descriptor>
        <prefer-web-inf-classes>true</prefer-web-inf-classes>
      </container-descriptor>
    </weblogic-web-app>