HotSpot JVM JIT optimisation techniques

After a JVM JIT overview, I put together a list of important techniques used by the HotSpot JVM JIT compiler as I discovered them during my study of this area. The list has no ambition to be exhaustive, nor to provide deep expertise on any particular topic; the level is set to simply understand each technique. Each section has links I found pretty useful in my study of the subject.
The techniques described in this article are based on the Oracle HotSpot JVM implementation, but they should also be valid, at least to a limited extent, for OpenJDK and VMs based on it.
The majority of the described methods depend on runtime information collected by the VM during execution and stored in ephemeral method data objects.

  • method inlining – a base optimisation technique which can enable more sophisticated and complex techniques later on, as it brings related code closer together. It eliminates the cost associated with calling a method by copying the body of the called method into the place from which it was called. HotSpot places some restrictions on which methods can be inlined, e.g. the bytecode size of the inlined method, the depth of the method in the current call chain, etc. For the exact list see HotSpot VM options. An earlier post contains a detailed method inlining example including an evaluation of the performance gain.
  • loop unrolling – reduces the number of jumps back to the beginning of the loop body. Each jump can have an adverse effect on the CPU as it flushes the pipeline of incoming instructions. It eliminates the cost associated with branch prediction and removes the safepoint polls which are typically added at the end of each loop iteration. The shorter the loop body, the higher the penalty for the back jump. HotSpot decides whether to unroll based on various criteria like the type of the loop variable, the number of loop exit points, etc. An available benchmark showed that a loop with an integer loop variable was two times faster than one with a long. This cannot be taken as a rule of thumb, but rather as an idea of the possible impact of this optimisation. The behaviour differs between HotSpot versions and depends on the CPU architecture and available instruction set. For more details check loop optimisation in the HotSpot VM compiler.
  • escape analysis – the sole purpose of this technique is to decide whether work done in a method is visible outside of the method or has any side effects. This optimisation is performed after any inlining has completed. Such knowledge is utilised to eliminate unnecessary heap allocation via an optimisation called scalar replacement. In principle, object fields become scalar values as if they had been allocated as local variables instead. This reduces the object allocation rate, reduces memory pressure and in the end results in fewer GC cycles. This effect is nicely demonstrated in the “Automatic stack allocation in the java virtual machine” blog post. Details can also be found in OpenJDK Escape Analysis. A small sketch of a typical scalar-replacement candidate follows after this list.
    Escape analysis is also utilised when optimising the performance of intrinsic locks (those using synchronized). There are several lock optimisations which essentially eliminate lock overhead:

    • lock elision – removing locks on an object which doesn’t escape given scope. A great blog post from Aleksey Shipilёv on lock elision demonstrates the performance effects.
    • lock coarsening – merges sequential lock regions that share the same lock. More detailed information can be found in the post dedicated to lock coarsening
    • nested locks – detects blocks of code where the same lock is acquired repeatedly without being released
  • monomorphic dispatch – this optimisation relies on the empirical fact that, in authored code, quite often only one type is ever observed as the receiver object at a given call site. It essentially eliminates the cost of looking up the method in vtables, effectively removing the cost of the invokevirtual bytecode instruction. The optimisation is protected by a guard to ensure that no wrong code is ever invoked. HotSpot also has bimorphic dispatch – pretty much the same thing, just done with two classes. If there is a need to regain some performance it is possible to transform the code in such a way that a call site sees just two types and takes advantage of this optimisation.
  • intrinsic – a highly performant native implementation of a method known to the JVM in advance, rather than generated by the JIT. Intrinsics are used for performance-critical methods and are backed either by a special instruction provided by the CPU or by the operating system. That means they are platform dependent and some of them may not be available on all platforms. The decision about which optimisation to use might be deferred until runtime. Intrinsics are contained in the .ad files of the HotSpot OpenJDK source code, for example the intrinsics for the x86 64-bit architecture. The list of intrinsics available in HotSpot can be found in vmSymbols.
    Anyone who would like to see more detail on intrinsics may enjoy this blog post about the POPCNT CPU instruction (x86 and amd64) utilised by Integer.bitCount, including a benchmark that gives an idea of the performance impact. This presentation gives great insight into what it means to add a new intrinsic to HotSpot. The obvious fact is that you have to implement it in C++ with all the risks that brings (no memory management and a single issue can bring the whole VM down). Compare that to Graal, a new JIT compiler for the JVM written in Java and integrated via the JVM Compiler Interface (JVMCI); as a JIT is nothing more than a transformation from bytecode to machine code, writing it in Java brings a lot of benefits, e.g. using standard Java tooling. Graal dramatically simplifies developing custom intrinsics for specific hardware. For a better understanding of Graal I found Chris Seaton’s presentation pretty useful.
  • on stack replacement (OSR) – another term you might encounter when you start reading about JIT optimisations. OSR is a technique for switching between two different implementations of a method while it is running. It comes in handy in situations when a function is invoked just once but contains a long-running loop that could benefit from an optimised version. When OSR happens the VM needs to be paused and the current frame replaced with a new one, which may have its variables in different locations. For further info, I refer to an interesting post on whether On Stack Replacement is good or bad.
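To make the scalar replacement case mentioned above more concrete, here is a minimal Scala sketch (the class and method names are mine, not taken from any of the linked posts): the temporary Point never escapes the method, so after inlining and escape analysis the JIT may keep its fields in registers instead of allocating the object on the heap.

// Hypothetical example: `Point` does not escape `distanceFromOrigin`,
// so HotSpot may apply scalar replacement and avoid the heap allocation.
class Point(val x: Double, val y: Double)

object EscapeAnalysisDemo {
  def distanceFromOrigin(x: Double, y: Double): Double = {
    val p = new Point(x, y)          // candidate for scalar replacement
    math.sqrt(p.x * p.x + p.y * p.y) // only scalar values leave the method
  }
}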

To see what the JIT compilers do during application runtime, use JITWatch. Thanks to a newly gained understanding of JIT optimisations, it should make a lot more sense what’s going on.

References for further study:
Useful JIT optimisation techniques
Java HotSpot performance engine architecture 


Stress testing asynchronous REST service with Gatling

In this blog post, I am not going to deep dive into application performance testing theory or methodology, but rather focus on explaining the more advanced features of the Gatling performance testing tool on a relatively common scenario: testing an asynchronous service. The asynchronous service we are going to test in this post, a report generation service, has the following API specification:

  1. A report is requested for a given userId via a POST method on the /report endpoint, and the response returns a request-tracking reportId.
  2. Report status is tracked via a GET method on /report/{reportId}. A 202 Accepted response status code means report generation is still in progress, a 5xx status code means an error occurred during generation, and when the report is done the service returns 200 OK together with the generated report.

The Gatling testing tool allows you to create almost any testing scenario. Testing a synchronous service is pretty simple – call the service with the request payload, get the response and assert the content. The performance metrics in this case are relatively simple as well: response time of a single call, error rates, etc. However, testing an asynchronous service is a bit more difficult as it cannot be done via a single call. The testing scenario needs to follow this structure:

  1. request a service with given parameters and store id for this particular request
  2. poll the request status for tracing the progress
  3. download result when the request is completed

Performance of the service cannot be measured directly as processing is no longer bounded by a single call. What we would like to measure is the processing time from the moment the report is requested until processing is complete.

The test scenario can be expressed quite naturally in the Gatling DSL:

val generateReport = scenario("Generate report")
  .exec(requestReport)
  .exec(pollReportGenerationState)
  .exec(getGeneratedReport)

Each test step contains validations that assure the correctness of the test scenario for a given virtual user. The test scenario relies on Gatling sessions which allow us to carry user-specific data between test steps. In this particular case, we need the user-specific reportId in order to track request progress and finally collect the report. The reportId is parsed from the JSON report generation response:

private val requestReport: HttpRequestBuilder = http("Request report generation")
  .post("/report")
  .queryParamMap(withApiKey)
  .body(StringBody(reportRequestBody(userId, reportParameter)))
  .asJSON
  .check(status is 202)
  .check(jsonPath("$..reportId").ofType[String].exists.saveAs("reportId"))

Probably the most interesting part is polling the report generation state, where we need to keep a virtual user in polling mode until the report is ready. Again we are going to use the Session in combination with the asLongAs looping operator:

private def notGenerated(sess: Session): Validation[Boolean] = {
  val reportInProgress = 202
  val generationStatus = sess("generationStatus").asOption[Int].getOrElse(reportInProgress)
  logger.debug(s"Report generation status: $generationStatus")
  generationStatus == reportInProgress
}

private val pollReportGenerationState = asLongAs(notGenerated)(
  pause(100.millis)
    .exec(
      http("Poll report generation state")
        .get("/report/${reportId}")
        .queryParamMap(withApiKey)
        .asJSON
        .check(status saveAs ("generationStatus"))
    )
)

When working with an asynchronous service, another operator worth considering is tryMax. The disadvantage of this operator is that failed attempts are counted towards failed requests, which in our case would drastically falsify the results. Silencing the request, on the other hand, would remove this test step from the report completely.

Collecting the report once it is generated is pretty straightforward. In order to collect statistics on the report generation scenario as a whole – the time it takes to get the report – we need to wrap that part of the scenario in the group combinator, which produces separate statistics for the group.

val generateReport = scenario("Generate report").group("Generation report completion") {
  exec(requestReport)
    .exec(pollReportGenerationState)
    .exec(getGeneratedReport)
}
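The final download step is not shown in the snippets above; a minimal sketch of how getGeneratedReport might look is below (the endpoint and checks are assumptions based on the API description, not taken from the original gist):

// Hypothetical final step: download the finished report once polling ends.
private val getGeneratedReport: HttpRequestBuilder = http("Get generated report")
  .get("/report/${reportId}")
  .queryParamMap(withApiKey)
  .asJSON
  .check(status is 200) // report generation has finished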

In this blog post, we used the more advanced features of the Gatling performance testing tool to test an asynchronous service, with virtual users sharing important data between test steps via the Session. The complete code snippet can be found on my GitHub gist. If you test asynchronous services in a different way, or have any suggestion on how to support custom metrics in the Gatling report, let me know in the comment section below this blog post.

Communication in distributed teams using Slack

Some time ago I wrote a post about tools I found useful when working in a distributed team. Working habits are changing, and companies which grasp this will benefit from access to a broader talent pool and from other advantages when managed well. However, managing remote teams poses significant challenges. This post isn’t going to discuss how to structure teams or how to address the arising challenges, but rather focuses on things I found useful when working in a remote distributed team.

In the past years, Slack became the de facto standard in this area, even though I keep hearing from various teams that it didn’t quite meet their needs. If you have experience with a different tool, I will be happy if you share it in the comment section below the post.

The following section lists features I found really useful while working in a remote distributed development team.

  1. Snooze notifications – protect your focus time and boost your productivity. Break the day into focus units where all distraction is disabled.
  2. Highlight words – can save you time when skimming through unread messages; available via Slack > Preferences > Notifications > My Keywords.
  3. Reminder tool – on each message you can set a reminder.
  4. To quickly start a Google Hangout meeting – use the /hangout command and the call will be initiated.
  5. If you want to append ¯\_(ツ)_/¯ to a channel – type the /shrug command.
  6. Use advanced Slack search – start in the search box with a plus sign.
  7. Create a poll – use the command: /poll “Question?” “Option1” “Option2” “Option3” …
  8. Convert direct communication to a private channel. The direct conversation must involve more than one other party, and the resulting channel is always private with no option to make it public. You can later invite new participants, with the option to let them see the channel history.
  9. Channels for dedicated groups or around topics of interest are standard, but what I found really useful is the User Groups feature, which allows communicating with a group of people across channels.
  10. Use stars to create an instant to-do list. You can mark any message or phrase with a star next to the timestamp. It’s almost the same as bookmarking. Hint: if you mark messages containing tasks, you’ll get a personal to-do list which is really convenient to build during a conversation.

This concludes my top 10 Slack features I found really useful while working in a remote distributed team. If you have other tips and tricks please let me know. Also, if you have found other useful tools, please share that experience in the comment section below.

How to optimise a code to be JIT friendly

In the previous blog post, we measured the effect of a basic JIT optimisation technique – method inlining (one of several JIT optimisation techniques). The code example was a bit unnatural, as it was super simple Scala code just for demonstrating method inlining. In this post, I would like to share the general approach I use when I want to check how the JIT treats my code, or whether there is some possibility to improve the code’s performance with respect to the JIT. Even method inlining requires the code to meet certain criteria, such as the bytecode length of the inlined methods. For this purpose, I regularly use a great OpenJDK project called JITWatch, which comes with a bunch of handy JIT-related tools. I am pretty sure there are more tools out there, and I will be more than happy if you share your approaches for dealing with the JIT in the comment section below the article.
Java HotSpot is able to produce a very detailed log of what the JIT compiler is doing and why. Unfortunately, the resulting log is very complex and difficult to read, and reading it requires an understanding of the techniques and theory that underlie JIT compilation. A free tool like JITWatch processes those logs and abstracts this complexity away from the user.

In order to produce a log suitable for JITWatch investigation, the tested application needs to be run with the following JVM flags:

-XX:+UnlockDiagnosticVMOptions
-XX:+LogCompilation
-XX:+TraceClassLoading

Those settings will produce a log file named hotspot_pidXXXXX.log. For the purpose of this article, I re-used code from the previous blog post, located on my GitHub account, with the JVM flags enabled in build.sbt.
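A minimal sketch of how those flags might be wired into build.sbt when running the code through sbt is shown below (the exact setup in the repository may differ; forking is needed so the options are applied to the JVM actually running the code):

// build.sbt (sketch): pass JIT logging flags to the forked JVM
fork in run := true

javaOptions in run ++= Seq(
  "-XX:+UnlockDiagnosticVMOptions",
  "-XX:+LogCompilation",
  "-XX:+TraceClassLoading"
)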
In order to look at the generated machine code in JITWatch, we need to install the HotSpot Disassembler (HSDIS) into $JAVA_HOME/jre/lib/server/. For Mac OS X a build can be obtained from here; try renaming it to hsdis-amd64.dylib. In order to include machine code in the generated JIT log we need to add the JVM flag -XX:+PrintAssembly.
[info] 0x0000000103e5473d: test %r13,%r13
[info] 0x0000000103e54740: jne 0x0000000103e5472a
[info] 0x0000000103e54742: mov $0xfffffff6,%esi
[info] 0x0000000103e54747: mov %r14d,%ebp
[info] 0x0000000103e5474a: nop
[info] 0x0000000103e5474b: callq 0x0000000103d431a0 ; OopMap{off=112}
[info] ;*invokevirtual inc
[info] ; - com.jaksky.jvm.tests.jit.IncWhile::testJit@12 (line 19)
[info] ; {runtime_call}
[info] 0x0000000103e54750: callq 0x0000000102e85c18 ;*invokevirtual inc
[info] ; - com.jaksky.jvm.tests.jit.IncWhile::testJit@12 (line 19)
[info] ; {runtime_call}
[info] 0x0000000103e54755: xor %r13d,%r13d
We run JITWatch via ./launchUI.sh

JITWATCH_config

and configure the source files and the target directories with generated class files:

JITWatch_configuration

And finally, open the prepared JIT log and hit Start.

The most interesting view from our perspective is the TriView, where we can see the source code, JVM bytecode and native code side by side. For this particular example we disabled method inlining via the JVM flag -XX:CompileCommand=dontinline,com/jaksky/jvm/tests/jit/IncWhile.inc

JITWatch_notinlined
Compare this with the case when the method body of IncWhile.inc is inlined – the native code size is greater (216 bytes compared to 168) with the same bytecode size.
JITWatch-inlined
The Compile Chain view also provides a great picture of what is happening with the code
JITWatch_compileChain
The Inlining report provides a great overview of what is happening with the code
JITWatch-inlining
Here we can also see the effect of tiered compilation, as described in the JIT compilation post: compilation starts with the client C1 JIT compiler and then switches to the server C2 compiler. The same or an even better view can be found in Compiler Thread activity, which provides a timeline view. To refresh your memory, check the overview of JVM threads. Note: standard Java library code is subject to JIT optimisations too, which is why there is so much compilation activity here.
JITWatch_compilerThreads
JITWatch is a really awesome tool and provides many other views which don’t make sense to screenshot here, e.g. code cache allocation, nmethods etc. For detailed information, I really suggest reading the JITWatch wiki pages. Now the question is: how do you write JIT-friendly code? Here the pure jewel of JITWatch comes in: the Suggestion Tool. That is why I like JITWatch so much. For demonstration, I selected a somewhat more complex problem – the N Queens problem.
JITWatch_suggestion
The Suggestion Tool clearly describes why certain compilations failed and what the exact reason was. It is a coincidence that in this example we again hit only inlining – there is definitely more going on in the JIT – but this window provides a clear view of how we can possibly help the JIT.
Another great tool which is part of JITWatch is the JarScan Tool. This utility scans a list of jars and counts the bytecode size of every method and constructor. Its purpose is to highlight the methods that are bigger than the HotSpot threshold for inlining hot methods (35 bytes by default), so it provides hints about where to focus benchmarking to see whether decomposing code into smaller methods brings a performance gain. The hotness of a method is determined by a set of heuristics including call frequency etc., but what can exclude a method from inlining is its size. Of course, a method size breaching the inlining limit doesn’t automatically mean that the method is a performance bottleneck. JarScan is a static analysis tool which has no knowledge of runtime statistics, and hence no knowledge of real method hotness.
jakub@MBook ~/Development/GIT_REPO (master) $ ./jarScan.sh --mode=maxMethodSize --limit=35 ./chess-challenge/target/scala-2.12/classes/
"cz.jaksky.chesschallenge","ChessChallange$","delayedEndpoint$cz$jaksky$chesschallenge$ChessChallange$1","",1281
"cz.jaksky.chesschallenge.solver","ChessBoardSolver$","placeFigures$1","scala.collection.immutable.List,scala.collection.immutable.Set",110
"cz.jaksky.chesschallenge.solver","ChessBoardSolver$","visualizeSolution","scala.collection.immutable.Set,int,int",102
"cz.jaksky.chesschallenge.domain","Knight","check","cz.jaksky.chesschallenge.domain.Position,cz.jaksky.chesschallenge.domain.Position",81
"cz.jaksky.chesschallenge.domain","Queen","equals","java.lang.Object",73
"cz.jaksky.chesschallenge.domain","Rook","equals","java.lang.Object",73
"cz.jaksky.chesschallenge.domain","Bishop","equals","java.lang.Object",73
"cz.jaksky.chesschallenge.domain","Knight","equals","java.lang.Object",73
"cz.jaksky.chesschallenge.domain","King","equals","java.lang.Object",73
"cz.jaksky.chesschallenge.domain","Position","Position","int,int",73
"cz.jaksky.chesschallenge.domain","Position","equals","java.lang.Object",72
To wrap up, JITWatch is a great tool which provides insight into the HotSpot JIT optimisations happening during program execution, and it can help you understand how a decision made at the source code level can affect the performance of the program.

JVM JIT compilation as a way of performance optimisation

The previous article on the structure of the JVM – the java memory model – briefly mentioned bytecode execution modes, and the article on JVM internal threads provided additional insight into the internal architecture of JVM execution. In this article, we focus on Just In Time compilation and on some of its basic optimisation techniques. We also discuss the performance impact of one JIT optimisation technique, namely method inlining. In the remainder of this article we focus solely on the HotSpot JVM; however, the principles are valid in general.
HotSpot JVM is a mixed-mode VM, which means that it starts off interpreting the bytecode but can compile code into highly optimised native machine code for faster execution. This optimised code runs extremely fast and its performance can be compared with C/C++ code. JIT compilation happens on a per-method basis at runtime, after a method has been run a number of times and is considered hot. Compilation into machine code happens on a separate JVM thread and does not interrupt the execution of the program. While the compiler thread is compiling a hot method, the JVM keeps using the interpreted version of the method until the compiled version is ready. Thanks to the runtime characteristics of the code, the HotSpot JVM can make sophisticated decisions about how to optimise it.
Java HotSpot VM is capable of running in two separate modes (C1 and C2), and each mode is preferred in different situations:
  • C1 (-client) – used for applications where quick startup and solid optimisation are needed; typically GUI applications are good candidates.
  • C2 (-server) – for long-running server applications
Those two compiler modes use different techniques for JIT compilation, so it is possible to get very different machine code for the same method. Modern Java applications can take advantage of both compilation modes: starting from Java SE 7, a feature called tiered compilation is available. An application starts with C1 compilation, which enables fast startup, and once the application is warmed up the C2 compiler takes over. Since Java SE 8, tiered compilation is the default. Server optimisation is more aggressive and based on assumptions which may not always hold, so these optimisations are always protected with a guard condition that checks whether the assumption is still correct. If an assumption is not valid, the JVM reverts the optimisation and drops back to interpreted mode. In server mode HotSpot VM runs a method in interpreted mode 10 000 times before compiling it (this can be adjusted via -XX:CompileThreshold=5000). Changing this threshold should be considered thoroughly, as HotSpot VM works best when it can accumulate enough statistics to make an intelligent decision about what to compile. If you want to inspect what is compiled, use -XX:+PrintCompilation.
Among the most common JIT compilation techniques used by HotSpot VM is method inlining, which is the practice of substituting the body of a method into the places where the method is called. This technique saves the cost of calling the method. In HotSpot, there is a limit on the size of methods that can be substituted. The next commonly used technique is monomorphic dispatch, which relies on the fact that most of the time only one receiver type is observed on a given path through the code. Thanks to this observation the exact method definition is known without checking, so the overhead of the virtual method lookup can be eliminated and the JIT compiler can emit faster, optimised machine code. There are many other optimisation techniques like loop optimisation, dead code elimination, intrinsics and others. A small sketch of the monomorphic dispatch case follows below.
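Here is a minimal Scala sketch of a call site that stays monomorphic (the types and names are mine, purely illustrative): although greet is a virtual call, only one concrete receiver type is ever observed, so the JIT can devirtualise it, guard the assumption with a type check and then inline it.

// Hypothetical example: the call site `g.greet()` only ever sees `English`,
// so HotSpot can replace the virtual dispatch with a guarded direct call.
trait Greeter { def greet(): String }
class English extends Greeter { def greet(): String = "hello" }

object MonomorphicDemo {
  def run(times: Int): Int = {
    val g: Greeter = new English // only one receiver type observed at runtime
    var count = 0
    var i = 0
    while (i < times) {
      count += g.greet().length   // monomorphic call site
      i += 1
    }
    count
  }
}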
The performance gain by inlining optimisation can be demonstrated on a simple Scala code:
class IncWhile {

  def main(): Int = {
    var i: Int = 0
    var limit = 0

    while (limit < 1000000000) {
      i = inc(i)
      limit = limit + 1
    }
    i
  }

  def inc(i: Int): Int = i + 1
}

The method inc is eligible for inlining as its body is smaller than 35 bytes of JVM bytecode (the actual size of the inc method is 9 bytes). The inlining optimisation can be verified by looking at the JIT-optimised machine code.
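The runtime numbers shown below come from a JMH benchmark. A minimal sketch of what such a benchmark might look like (assuming the sbt-jmh plugin; the actual harness in the repository may be configured differently) is:

// Sketch of a JMH benchmark measuring IncWhile.main (assumes sbt-jmh)
import java.util.concurrent.TimeUnit
import org.openjdk.jmh.annotations._

@State(Scope.Benchmark)
@BenchmarkMode(Array(Mode.AverageTime))
@OutputTimeUnit(TimeUnit.NANOSECONDS)
class IncWhileBenchmark {
  val incWhile = new IncWhile

  @Benchmark
  def main(): Int = incWhile.main() // JMH consumes the return value to prevent dead-code elimination
}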

IncWhile-inlined

The difference is obvious when compared to the machine code when inlining is disabled using -XX:CompileCommand=dontinline,com/jaksky/jvm/tests/jit/IncWhile.inc

IncWhile-dontinline

The difference in runtime characteristics is also significant, as the benchmark results show. With inlining disabled:

[info] Result "com.jaksky.jvm.tests.jit.IncWhile.main":
[info] 2112778741.540 ±(99.9%) 9778298.985 ns/op [Average]
[info] (min, avg, max) = (2040573480.000, 2112778741.540, 2192003946.000), stdev = 28831537.237
[info] CI (99.9%): [2103000442.555, 2122557040.525] (assumes normal distribution)
[info] # Run complete. Total time: 00:08:03
[info] Benchmark Mode Cnt Score Error Units
[info] IncWhile.main avgt 100 2112778741.540 ± 9778298.985 ns/op

With inlining enabled, the JVM JIT is also able to apply further optimisations like loop optimisations, which might cause our whole loop to be eliminated because its result is easily predictable. We would then get a time of around 3 ns, which on a 1 GHz processor is unrealistically low for performing a billion operations. To disable most of the loop optimisations, use the -XX:LoopOptsCount=0 JVM option.

[info] Result "com.jaksky.jvm.tests.jit.IncWhile.main":
[info] 332699064.778 ±(99.9%) 3485503.823 ns/op [Average]
[info] (min, avg, max) = (316312877.000, 332699064.778, 358738827.000), stdev = 10277087.396
[info] CI (99.9%): [329213560.955, 336184568.600] (assumes normal distribution)
[info] # Run complete. Total time: 00:04:55
[info] Benchmark Mode Cnt Score Error Units
[info] IncWhile.main avgt 100 332699064.778 ± 3485503.823 ns/op
So the performance gain from inlining a method body can be quite significant: roughly 2 seconds vs 330 milliseconds.
In this post, we discussed the mechanics of Java JIT compilation and some of the optimisation techniques it uses. We particularly focused on one of the simplest optimisation techniques, called method inlining, and demonstrated the performance gain brought by eliminating a method call represented by the invokevirtual bytecode instruction. Scala also offers a special annotation @inline which should help with the performance aspects of code under development. All the code for running the experiments is available online on my GitHub account.

 

HotSpot JVM internal threads

In the Structure of Java Virtual Machine post we scratched the surface of the class file structure and how it is connected to the java memory model via the class loading process. We also briefly discussed the bytecode structure and its execution, including a short introduction to Just In Time runtime optimisation. In this post we look more at the internals of the execution engine; however, there is no ambition to substitute the detailed VM implementation documentation for the HotSpot JVM, just to provide enough detail to gain the bigger picture.

The basic threading model in HotSpot JVM is a one-to-one mapping between Java threads (instances of java.lang.Thread) and native operating system threads. A native thread is created when the Java thread is started and reclaimed once it terminates. The operating system is responsible for scheduling all threads and dispatching them to any available CPU. The relationship between Java thread priorities and operating system thread priorities varies across operating systems.

HotSpot provides monitors by which threads running application code may participate in a mutual exclusion (mutex) protocol. A monitor is either locked or unlocked, and only one thread may own it at any time. Only after acquiring ownership of the monitor may a thread enter the critical section protected by it. Critical sections are referred to as synchronised blocks, delineated by the synchronized keyword; a tiny example follows below.
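For illustration, a minimal Scala sketch of a critical section guarded by an intrinsic lock (the class is mine, purely illustrative; each call acquires and releases the monitor of the Counter instance):

// Hypothetical example: `this` acts as the monitor; only one thread at a time
// may execute the synchronized block, so the increments are not interleaved.
class Counter {
  private var value = 0

  def increment(): Int = this.synchronized {
    value += 1
    value
  }
}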

Apart from application threads, JVM contains also internal threads which can be categorised into following groups:

  • VM Thread – responsible for executing VM operations
  • Periodic task thread – thread executing periodic operations within the VM (singleton instance of WatcherThread)
  • GC threads – threads of different types to support parallel and concurrent garbage collection
  • Compiler threads – performs a JIT compilation and optimisation of bytecode to native code at runtime (C1 and C2 JIT compilation threads)
  • Signal dispatcher thread – waits for externally directed signals and dispatches them to a java signal handling method

JVM_compiler_threads

The VM thread spends its time waiting for requested operations to appear in the operation queue (VMOperationQueue). An operation is typically passed to the VM Thread because it requires the VM to reach a safepoint before it can be executed. When the VM is at a safepoint, all threads inside the VM are blocked and any threads executing in native code are prevented from returning to the VM while the safepoint is in progress. This means that a VM operation can be executed knowing that no thread is in the middle of modifying the heap, and all threads are in a state such that their Java stacks are unchanging and can be examined.

The most familiar VM operations are related to garbage collection, particularly the stop-the-world phase that is common to many garbage collection algorithms. Other VM operations are: thread stack dumps, thread suspension or stopping, inspection or modification via JVMTI etc. A VM operation can be synchronous or asynchronous.

Safepoints are initiated using a cooperative, polling-based mechanism: a thread asks “Should I block for a safepoint?”, typically at a thread state transition. Threads executing interpreted code don’t usually ask the question; instead, when a safepoint is requested, the interpreter switches to a different dispatch table which includes that check, and when the safepoint is over the dispatch table is switched back. Once a safepoint has been requested, the VM Thread must wait until all threads are known to be in a safepoint-safe state before proceeding with the operation. During a safepoint the thread lock is used to block any threads that were running, and the lock is released when the operation has completed.

Structure of Java Virtual Machine (JVM)

Java-based applications run in the Java Runtime Environment (JRE), which consists of a set of Java APIs and the Java Virtual Machine (JVM). The JVM loads an application via class loaders and runs it using the execution engine.

The JVM runs on all kinds of hardware and executes the same java bytecode without any change to the compiled code. The VM implements the Write Once Run Everywhere principle, also known as the platform independence principle. Just to sum up the JVM key design principles:
  • Platform independence
    • Clearly defined primitive data types – languages like C or C++ have primitive data types whose size depends on the platform; Java is unified in that matter.
    • Fixed byte order, Big Endian (network byte order) – Intel x86 uses little endian while RISC processors typically use big endian; Java always uses big endian.
  • Automatic memory management – class instances are created by the user and automatically removed by Garbage Collection.
  • Stack-based VM – typical computer architectures such as Intel x86 are based on registers, however the JVM is based on a stack.
  • Symbolic references – all types except primitive ones are referred to via symbolic names instead of direct memory addresses.

Java uses bytecode as an intermediate representation between source code and the machine code which runs on the hardware. A bytecode instruction is represented as a 1-byte number, e.g. getfield 0xb4, invokevirtual 0xb6, hence there is a maximum of 256 instructions. If an instruction doesn’t need an operand, the next instruction immediately follows; otherwise the operands follow the instruction according to the instruction set specification. The instructions are contained in class files produced by java compilation. The exact structure of the class file is defined in the “Java Virtual Machine specification”, section 4 – the class file format. After some version information there are sections like the constant pool, access flags, fields info, this and super info, methods info etc. See the spec for the details.
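To make that concrete, here is a tiny method and, as a comment, roughly the bytecode a compiler emits for it (the exact listing can be inspected with javap -c; the instruction sequence shown is a simplified sketch, not copied from an actual disassembly):

// A trivial method and an approximate view of its bytecode.
class BytecodeDemo {
  def inc(i: Int): Int = i + 1
  // javap -c would show something along the lines of:
  //   iload_1   // push the argument i onto the operand stack
  //   iconst_1  // push the constant 1
  //   iadd      // add the two top stack values
  //   ireturn   // return the int on top of the stack
}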

A class loader loads the compiled java bytecode into the Runtime Data Areas, and the execution engine executes it. A class is loaded when it is used for the first time in the JVM. Class loading works in a dynamic fashion on the parent-child (hierarchical) delegation principle; class unloading is not allowed. Some time ago I wrote an article about class loading on an application server. The detailed mechanics of class loading are out of scope for this article.

Runtime data areas are used during execution of the program. Some of these areas are created when the JVM starts and destroyed when the JVM exits; other data areas are per-thread – created on thread creation and destroyed on thread exit. The following picture is based mainly on JVM 8 internals (it doesn’t include the segmented code cache and dynamic linking of languages introduced by JVM 9).


  • Program counter – exists per thread and holds the address of the currently executed instruction; if the method is native, the PC is undefined. The PC is, in fact, pointing at a memory address in the Method Area.
  • Stack – a JVM stack exists per thread and holds one frame for each method executing on that thread. It is a LIFO data structure. Each stack frame holds references to the local variable array, the operand stack, and the runtime constant pool of the class whose code is being executed.
  • Native stack – not supported by all JVMs. If the JVM is implemented using the C-linkage model for JNI, then the stack will be a C stack (the order of arguments and return values will be identical to a typical native C program). Native methods can call back into the JVM and invoke Java methods.
  • Stack frame – stores references that point to the objects or arrays on the heap.
    • Local variable array – all the variables used during execution of the method, including all method parameters and locally defined variables.
    • Operand stack – used during execution of bytecode instructions; most of the bytecode manipulates the operand stack, moving values to and from the local variable array.
  • Heap – area shared by all threads used to allocate class instances and arrays in runtime. Heap is the subject of Garbage Collection as a way of automatic memory management used by java. This space is most often mentioned in JVM performance tuning.
  • Non-Heap memory areas
    • Method area – shared by all threads. It stores runtime constant pool information, field and method information, static variables, and the method bytecode for each class loaded by the JVM. Details of this area depend on the JVM implementation.
      • Runtime constant pool – this area corresponds to the constant pool table in the class file format. It contains all references to methods and fields. When a method or field is referred to, the JVM searches for the actual address of the method or field in memory by using the constant pool.
      • Method and constructor code
    • Code cache – used to store methods compiled to native code by the JIT compiler
Bytecode loaded into the runtime data areas of the JVM via the class loader is executed by the execution engine. The engine reads the bytecode instruction by instruction and must change it into a form that can be executed by the machine. This can happen in one of two ways:
  1. Interpreter – reads, interprets and executes bytecode instructions one by one.
  2. JIT (Just In Time) compiler – compensates for the disadvantages of interpretation. Execution starts in interpreted mode, and the JIT compiler compiles the bytecode to native code. Execution is then switched from interpretation to execution of native code, which is much faster thanks to the various optimisation techniques the JIT performs. Native code is stored in the code cache. Compilation to native code takes time, so the JVM uses various metrics to decide whether to JIT-compile the bytecode.

How the JVM execution engine runs is not defined by the JVM specification, so vendors are free to improve their JVM engines with various techniques. The HotSpot JVM is described in more detail in the articles above, and JITWatch gives insight into its runtime characteristics.

More details can be found in The Java® Virtual Machine Specification – Java SE 9 Edition

 

 

Online collaboration tools for distributed teams

During the past several years, working habits and working styles have been rapidly changing, and this trend will surely continue – just visit Google Trends and search for “Digital nomad” or “Remote work”. Some professions undergo this change more easily than others, but it is clear that companies that understand the trend benefit from it.

Working in a different style requires a brand new set of tools and approaches which provide working conditions similar to when people are co-located in the same office. Video conferencing and phone or skype calls are just the beginning and don’t cover all aspects.

In the following paragraphs, I am going to summarise tools I found useful while working remotely as a software developer in a fully distributed team. Those tools are either free or offer some free functionality, and I still consider them very useful in various situations. The spectrum ranges from project management and planning tools to communication tools.

For communication – chat and calls – Slack has become the standard, widely adopted tool. It allows you to freely organise your teams and lets them create the channels they need. It supports a wide range of plugins, e.g. chatbots, it is well integrated with other tools, and it provides applications for all desktop and mobile platforms.

slack

When solving an issue, or when you just want to present something to an audience, screen sharing becomes a very handy tool. I found Join.me pretty handy; the free plan with a limited audience size was just big enough. It works well on Mac OS and Windows; the Linux platform I haven’t tried yet.

joinme

When it comes to pure conference phone calls or chat, Discord recently took my breath away with its awesome sound quality. Again it offers desktop and mobile clients, plus you can use just the browser version if you do not wish to install anything on your PC.

discord

Now I move slightly towards planning and design tools used during the software development process – it doesn’t matter whether you use Scrum or Kanban, these have their place there. First, a shared task board with post-it notes: the one I found useful and free is Scrumblr. The only disadvantage is that it is public. It allows you to design the number of sections, change the colours of the notes, add markers to them etc.

scrumblr

Once we touch on agile development methodology, there is no planning and estimation without planning poker. I found BitPoints useful: a simple, free online tool meeting all our needs, where you invite all participants to the game. It allows various settings like the type of deck etc.

bitpoints

When the design phase is reached, the shared online diagramming tool we found really useful is Sketchboard. It offers a wide range of diagram types and shapes, including traditional UML diagrams. The free version offers a few private diagrams; otherwise you go public with your designs. It allows comments and team discussion.

sketchboard

Sometimes we just missed a traditional whiteboard session for brainstorming. The web whiteboard tool AWW met our needs: simple yet powerful.

aww

This concludes the set of tools I found useful during the past year while working remotely in a distributed team. I hope you found at least one of them useful or new. Do you have other tools you find useful, or better variants of those mentioned above? Please share them in the comment section!

Apache Kafka foundation of modern data stream processing

Working on the next project with the awesome Apache Kafka, I am again fighting against a fundamental misunderstanding of the philosophy of this technology, which usually comes from previous experience with traditional messaging systems. This blog post aims to make that mindset switch as easy as possible, to explain where this technology fits in, what pitfalls to be aware of, and how to avoid them. On the other hand, this article doesn’t try to cover everything or go into much detail.

Apache Kafka is a system optimised for writes – essentially to keep up with whatever speed or volume producers send at – and it can be configured to meet the required parameters. That is said to be one of the motivations behind naming the technology after the famous writer Franz Kafka. If you want to understand its philosophy you have to look at it with fresh eyes: forget what you know from JMS, RabbitMQ, ZeroMQ, AMQP and others. Even though the usage patterns are similar, the internal workings are completely different – in fact the opposite. The following table provides a quick comparison:

JMS, RabbitMQ, …                | Apache Kafka
Push model                      | Pull model
Persistent message with TTL     | Retention policy
Guaranteed delivery             | Guaranteed “consumability”
Hard to scale                   | Scalable
Fault tolerance: active-passive | Fault tolerance: ISR (In Sync Replicas)

The core ideas in Apache Kafka come from RDBMS. I wouldn’t describe Kafka as a messaging system but rather as a distributed database commit log which can be partitioned in order to scale. Once the information is written to the commit log, everybody interested can read it at their own pace and responsibility. It is the consumer’s responsibility to read it, not the system’s responsibility to deliver the information to the consumer. This is the fundamental twist. Information stays in the commit log for a limited time given by the retention policy applied, and during this period it can be consumed multiple times by consumers. Because the system has a reduced set of responsibilities it is much easier to scale. It is also really fast, as sequential reads from disk approach the speed of random access memory reads thanks to effective file system caching.

kafkaoffsets

A topic partition is the basic unit of scalability when scaling out Kafka. A message in Kafka is a simple key-value pair represented as byte arrays. When a producer sends a message to a Kafka topic, a client-side partitioner decides which topic partition the message is persisted to, based on the message key. It is a best practice that messages that belong to the same logical group are sent to the same partition, as that guarantees clear ordering. On the client side, the exact position of the client is maintained per topic partition for the assigned consumer group. Point-to-point communication is therefore achieved by clients reading from the topic partition with exactly the same consumer group id, while publish-subscribe is achieved by giving each client a distinct consumer group id for the topic partition. The offset is maintained per consumer group id and topic partition and can be reset if needed; a small consumer sketch follows below.
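As a rough illustration (a sketch using the standard Java consumer API from Scala; the topic name and group id are made up): two consumers sharing the group id below split the partitions between them (point-to-point), while a consumer with a different group id would receive every message again (publish-subscribe).

// Minimal consumer sketch: offsets are tracked per (group.id, topic partition).
import java.util.Properties
import scala.collection.JavaConverters._
import org.apache.kafka.clients.consumer.KafkaConsumer

val props = new Properties()
props.put("bootstrap.servers", "localhost:9092")
props.put("group.id", "report-consumers") // same id => queue semantics, distinct id => pub/sub
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")

val consumer = new KafkaConsumer[String, String](props)
consumer.subscribe(java.util.Collections.singletonList("reports"))

val records = consumer.poll(1000L) // fetch a batch; offsets are committed for the group
records.asScala.foreach(r => println(s"partition=${r.partition()} offset=${r.offset()} value=${r.value()}"))
consumer.close()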

kafkacommunication

Topic partitions can be replicated zero or n times and distributed across the Kafka cluster. Each topic partition has one leader and zero or n followers, depending on the replication factor. The leader maintains the so-called In Sync Replicas (ISR), defined as the replicas whose delay behind the partition leader is lower than replica.lag.time.max.ms. Apache Zookeeper is used for keeping metadata and offsets.

kafkacluster

Kafka defines fault tolerance in the following terms:
  • acknowledged – the broker acknowledges the message write to the producer
  • committed – the message is written to all ISR and consumers can read it
When a producer sends messages to Kafka it can require different levels of consistency:
  • 0 – the producer doesn’t wait for confirmation
  • 1 – wait for an acknowledgement from the leader
  • ALL – wait for an acknowledgement from all ISR ~ message commit

Apache Kafka is quite flexible in configuration and as such it can meet many different requirements in terms of throughput, consistency and scalability. Replication of a topic partition brings read scalability on the consumer side, but it also poses some risk as it adds a level of complexity. If you are unaware of the corner cases it might lead to nasty surprises, especially for newcomers. So let’s take a closer look at the following scenario.

We have a topic partition with a replication factor of 3. The producer requires the highest consistency level, acks = all. Replica 1 is currently the leader. Message 10 is committed and hence available to clients. Message 11 is neither acknowledged nor committed due to the failure of replica 3. Replica 3 will then be eliminated from the ISR or taken offline, which causes message 11 to become acknowledged and committed.

kafka_uc1

The next time we lose Replica 2, it is eliminated from the ISR and the same situation repeats for messages 12 and 13.
kafka_uc2.png
The situation gets a lot worse if the cluster loses the current partition leader – Replica 1 is down now.
kafka_uc3
What happens if Replica 2 or Replica 3 comes back online before Replica 1? One of them becomes the new partition leader and we have lost messages 12 and 13 for sure!
kafka_uc4

Is that a problem? Well, the correct answer is: it depends. There are scenarios where this behaviour is perfectly fine, for example collecting logs from all machines by sending them through Kafka. On the other hand, if we implement event sourcing and we lose some events, we cannot recreate the application state correctly – and then we have a problem! Unfortunately, unless that has changed in the latest releases, this is the default configuration if you just install a fresh new Kafka cluster; it is a setup which favours availability and throughput over other factors. But Kafka allows you to set it up so that it meets your consistency requirements as well, sacrificing some availability in order to achieve that (CAP theorem). To avoid the described scenario you should use the following configuration: the producer should require acknowledgement level ALL; do not allow Kafka to perform a new leader election for dirty replicas – set unclean.leader.election.enable = false; use a replication factor of 3 (default.replication.factor = 3) and require the minimal number of in-sync replicas to be higher than 1 (min.insync.replicas = 2).
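A sketch of what that looks like in practice (broker settings go in server.properties or as topic-level overrides; the producer snippet uses the standard Java producer API from Scala, with made-up topic and broker addresses):

// Broker / topic level (server.properties or per-topic override), as discussed above:
//   unclean.leader.election.enable=false
//   default.replication.factor=3
//   min.insync.replicas=2

// Producer side: require acknowledgement from all in-sync replicas.
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

val props = new Properties()
props.put("bootstrap.servers", "localhost:9092")
props.put("acks", "all") // wait for the full ISR ~ message commit
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

val producer = new KafkaProducer[String, String](props)
producer.send(new ProducerRecord[String, String]("reports", "user-42", "report payload"))
producer.close()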

We already quickly touched on the topic of message delivery to the consumer. Kafka doesn’t guarantee that a message was delivered to all consumers; it is the responsibility of the consumers to read messages. So there is no semantics of a persistent message as known from traditional messaging systems. All messages sent to Kafka are persistent in the sense of being available for consumption by clients according to the retention policy. The retention policy essentially specifies how long a message will be available in Kafka. Currently, there are two basic concepts – a limit on the space used for keeping messages, and a minimum time for which a message should be available. The one which gets violated first wins.

When data needs to be cleaned from Kafka (triggered by the retention policy) there are two options. The simplest one just deletes the messages. Alternatively, messages can be compacted: compaction is a process which keeps just one message per message key, usually the latest one. That is actually the second semantic meaning of the key used in a message; a small sketch of creating a compacted topic follows below.
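For illustration, a hedged sketch of creating a compacted topic with the Java admin client from Scala (the topic name and sizing are made up; the same effect can be achieved with the kafka-topics command-line tool):

// Create a topic whose retention is based on log compaction rather than deletion.
import java.util.Properties
import scala.collection.JavaConverters._
import org.apache.kafka.clients.admin.{AdminClient, NewTopic}

val props = new Properties()
props.put("bootstrap.servers", "localhost:9092")

val admin = AdminClient.create(props)
val compactedTopic = new NewTopic("user-latest-state", 3, 3.toShort)
  .configs(Map("cleanup.policy" -> "compact").asJava) // keep only the latest message per key

admin.createTopics(java.util.Collections.singletonList(compactedTopic)).all().get()
admin.close()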

What features can you not find in Apache Kafka compared to traditional messaging technologies? Probably the most significant is the absence of any selector in combination with blocking listening (wake me on receive). It can of course be implemented via a correlation id, but the efficiency is on a completely different level: you have to read all messages, deserialise them and filter, compared to a traditional selector which uses a custom field in the message header and doesn’t even need to deserialise the message payload. Monitoring Kafka in a production environment essentially boils down to one elementary question: are the consumers fast enough? Hence, monitor consumer offsets with respect to the retention policy.

Kafka was created at LinkedIn to solve a specific problem of modern data-driven applications: to fill the gap in traditional ETL processes, which usually work with flat files and DB dumps. It is essentially an enterprise service bus for data, for software components which need to exchange data heavily. It unifies and decouples data exchange among components. Typical uses are in “BigData” pipelines together with Hadoop and Spark in lambda or kappa architectures. It lays down the foundations of modern data stream processing.

This post just scratches the basic concepts of Apache Kafka. If you are interested in the details, I really suggest reading the following sources, which I found quite useful on my way while learning Kafka:

Hadoop IO and file formats

In this post dedicated to Big Data I would like to summarise hadoop file formats and provide a brief introduction to this topic. As things are constantly evolving, especially in the big data area, I will be glad for comments in case I missed something important. Big Data frameworks change, but InputFormat and OutputFormat stay the same: it doesn’t matter which big data technology is in use – hadoop, spark or …

Let’s start with some basic terminology and general principles. The key term in the mapreduce paradigm is the split, which defines a chunk of the data processed by a single map. A split is further divided into records, where every record is represented as a key-value pair. That is what you actually see in the mapper API as your input. The number of splits gives you essentially the number of map tasks necessary to process the data, which is not in conflict with the number of map slots available – it just means that some map tasks need to wait until a map slot is available. This abstraction is hidden in the IO layer, particularly in the InputFormat or OutputFormat class, which contains the RecordReader or RecordWriter class responsible for the further division into records. Hadoop comes with a bunch of pre-defined file format classes, e.g. TextInputFormat, DBInputFormat, CombinedInputFormat and many others. Needless to say, nothing prevents you from coming up with your own custom file formats; a small sketch of where these classes plug in follows below.
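As a quick illustration, here is a minimal job setup using the standard MapReduce API (paths are placeholders, and the mapper/reducer classes are omitted; this is a sketch, not a complete job):

// Sketch: the InputFormat/OutputFormat classes decide how splits and records are produced.
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.mapreduce.Job
import org.apache.hadoop.mapreduce.lib.input.{FileInputFormat, TextInputFormat}
import org.apache.hadoop.mapreduce.lib.output.{FileOutputFormat, TextOutputFormat}

val job = Job.getInstance(new Configuration(), "word count sketch")
job.setInputFormatClass(classOf[TextInputFormat])   // key = byte offset, value = line content
job.setOutputFormatClass(classOf[TextOutputFormat[_, _]])
FileInputFormat.addInputPath(job, new Path("/data/input"))
FileOutputFormat.setOutputPath(job, new Path("/data/output"))
// job.setMapperClass(...) / job.setReducerClass(...) would go here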

The described abstraction model is closely related to the mapreduce paradigm, but what is the relation to the underlying storage like HDFS? First of all, mapreduce and the distributed file system (DFS) are two core hadoop concepts which are “independent”, and the relation is defined just through the API between those components. The well-known DFS implementation is HDFS, but there are several other possibilities (s3, azure blob, …). DFS is constructed for large datasets. The core concept in DFS is the block, which represents a basic unit of the original dataset for manipulation and processing, e.g. replication. This fact puts an additional requirement on the dataset file format: it has to be splittable – that means that you can process a given block independently from the rest of the dataset. If the file format is not splittable and you run a mapreduce job, you won’t get any level of parallelism and the dataset will be processed by a single mapper. The splittability requirement also applies if compression is desired.

What is the relation between a block from DFS and a split from mapreduce? Both of them are essentially the key abstractions for parallelisation, just in different frameworks, and in the ideal case they are aligned. If they are perfectly aligned, hadoop can take full advantage of the so-called data locality feature, which runs the map or reduce tasks on a cluster node where the data resides and minimises additional network traffic. In the case of imprecise alignment, remote reads will happen for the records missing from a given split. For that reason file formats include sync markers or points.

To take advantage of the full power of hadoop, you design your system for big files. The typical DFS block size is 64 MB but it can be bigger. That means that hadoop’s biggest enemy is the small file: the number of files which live in DFS is limited by the size of the Name Node memory, as all the dataset metadata is kept in memory. Hadoop offers several file formats that help to avoid this bad scenario. Let’s go through those file formats.

HAR file (stands for Hadoop archive) – a specific file format which essentially packs a bunch of files into a single logical unit which is kept on the name node. HAR files don’t support additional compression and, as far as I know, are transparent to mapreduce. They can help if name nodes are running out of memory.

Sequence file is a kind of file-based data structure. This file format is splittable as it contains a sync point after every several records. A record consists of a key, a value and metadata, where the key and value are serialised via classes whose names are kept in the metadata. The classes used for serialisation need to be on the CLASSPATH. A small writer sketch follows below.
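A minimal sketch of writing a sequence file with the Hadoop API (the path and record values are made up for illustration):

// Write a couple of key-value records into a SequenceFile; the key/value class
// names end up in the file metadata so readers know how to deserialise records.
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{IntWritable, SequenceFile, Text}
import org.apache.hadoop.io.SequenceFile.Writer

val conf = new Configuration()
val writer = SequenceFile.createWriter(
  conf,
  Writer.file(new Path("/tmp/example.seq")),
  Writer.keyClass(classOf[IntWritable]),
  Writer.valueClass(classOf[Text])
)
try {
  writer.append(new IntWritable(1), new Text("first record"))
  writer.append(new IntWritable(2), new Text("second record"))
} finally {
  writer.close()
}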

Map file is again a kind of hadoop file-based data structure, and it differs from a sequence file in the matter of ordering: a map file is sorted and you can perform a lookup on it. Its behaviour is pretty similar to the java.util.Map class.

Avro data file is based on the avro serialisation framework which was primarily created for Hadoop. It is a splittable file format with a metadata section at the beginning followed by a sequence of Avro-serialised objects. The metadata section contains the schema used for the Avro serialisation. This format allows a comparison of data without deserialisation.

Google Protocol Buffers are not natively supported by Hadoop, but you can plug in the support via libraries such as elephant-bird from Twitter.

So what about file formats like XML and JSON? They are not natively splittable and are therefore “hard” to deal with. A common practice is to store them in a text file with a single message per line.

As for textual files, needless to say those are first-class citizens in Hadoop. TextInputFormat and TextOutputFormat deal with them: the byte offset is used as the key and the value is the line content.

This blog post just scratches the surface of Hadoop file formats, but I hope it provides a good introduction and explains the connection between two essential concepts – mapreduce and DFS. For further reference, the book Hadoop: The Definitive Guide goes into great detail.