Who monitors Prometheus?

Of course, I am not talking about Greek mythology but about the popular monitoring tool Prometheus, which is based on Google's internal monitoring tool called Borgmon.

As some teams invest significant effort in making their solution transparent, easy to troubleshoot and stable by instrumenting their product and technology and connecting it to their monitoring system, only one question remains: Who monitors the monitoring? If the monitoring is down, you won't get any alarms about a malfunction of the product and you are back at the beginning. So you add monitoring of the monitoring system, and a tough availability game starts. If both components have 95% availability, you effectively achieve 90% (0.95 × 0.95 ≈ 0.90). I believe you get the rules of the game we play.

How is monitoring of the monitoring set up in practice? Well, the implementation differs, but the core idea is the same and works well even in the army! It is called a dead man's switch. It is based on the idea that if we are supposed to receive a signal triggering an alarm at an unknown moment, we need to guarantee that the signal can trigger the alarm at any time. Reversing the logic for triggering the alarm gives us that guarantee: we use a signal which we receive constantly, and the alarm is triggered when we stop receiving it. So simple! This principle (heartbeat) is used in multiple places, e.g. clustering. In the army, they use it as well.


Some monitoring tools have this capability built-in, but Prometheus doesn't. So how do we achieve this in order to sleep well, knowing that the watcher is watching? We need to set up a rule that is constantly firing in Prometheus. It can be a rule like this:

    - name: monitoring-dead-man
      rules:
      - alert: "Monitoring_dead_man"
        expr: vector(1)
        labels:
          service: deadman
        annotations:
          summary: "Monitoring dead man switch should always fire alert"
          description: "Monitoring dead man switch for probing alert path"

Now we need to create a heartbeat. The rule on its own would fire once, the alert would be propagated to the Prometheus Alertmanager (the component responsible for managing alerts), and all would be over. We need a regular interval for our check-ins. The interval is determined by the availability you want to achieve, as it all adds to the reaction time you need. You can achieve this behaviour with a special route for your alert in the Alertmanager:

      - receiver: 'DEAD-MAN-SNITCH'
        match:
          service: deadman
        repeat_interval: 5m

Now we need the reversed alarm trigger logic. In our particular case, we use Dead Man's Snitch, which is great for monitoring batch jobs, e.g. data imports to your database. It works in such a way that if you do not check in within the specified interval, it triggers, with a lead time given by the interval. You can specify the rules for when to trigger, but those are details of the service you use. All you need to add to Prometheus is a receiver definition checking in to the particular snitch as follows:

       - name: 'DEAD-MAN-SNITCH'
          webhook_configs:
            -  url: 'https://nosnch.in/your_snitch_id'
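For orientation, here is a minimal sketch of how the route and receiver above might sit together in a complete alertmanager.yml (the default receiver and its notification config are placeholders):

    route:
      receiver: 'default-receiver'    # your regular alerting path
      routes:
        # route the dead man switch alert to the snitch
        - receiver: 'DEAD-MAN-SNITCH'
          match:
            service: deadman
          repeat_interval: 5m
    receivers:
      - name: 'default-receiver'
        # e.g. email_configs, slack_configs, ...
      - name: 'DEAD-MAN-SNITCH'
        webhook_configs:
          - url: 'https://nosnch.in/your_snitch_id'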

The last thing you need to do is integrate the trigger with the system you use for on-call rota management, e.g. PagerDuty. For this example, that is the integration between Dead Man's Snitch and PagerDuty.

As I am not an infra or DevOps guy by heart, it took me some time to figure it out and connect the bits. I hope you found it useful. If you have other experiences or a different way of achieving this behaviour, let me know in the comment section below.

Kubernetes Helm features I would wish to know from day one

Kubernetes Helm is a package manager for Kubernetes deployments. It is one of the possible tools for deployment management on the Kubernetes platform. You can imagine it as RPM in the Linux world, with package management on top of it like the apt-get utility.

Helm release management, the ability to install or roll back to a previous revision, is one of the strongest selling points of Helm, and together with strong community support it makes Helm an exciting option. Especially the number of prepared packages is amazing and makes it extremely easy to bootstrap a tech stack on a Kubernetes cluster. But this article is not supposed to be a comparison of Kubernetes tools; instead it describes the experience I've had while working with Helm and the limitations I found, which for someone else might be quite acceptable, though being able to use the tool in those scenarios would be an additional benefit.
Helm is written in Go with the usage of Go templates, which brings some limitations to the tool. Helm works on the level of string literals, and you need to take care of quotation, indentation etc. to form a valid Kubernetes deployment manifest. This is a strong design decision from the creators of Helm, and it is good to be aware of it. Secondly, Helm merges two responsibilities: to render a template and to provide a Kubernetes manifest. Though you can nest templates or create a template hierarchy, the rendered result must be a Kubernetes manifest, which somewhat limits the possible use cases for the tool. Having the ability to render any template would extend the tool's capabilities, as quite often the Kubernetes deployment descriptors contain some configuration where you would appreciate type validation or the possibility to render it separately. At the time of writing this article, Helm version 2.11 didn't allow that. To achieve this, you can combine Helm with other tools like Jsonnet and tie them together.
While combining different tools, I found the following advanced Helm features quite useful; they greatly simplified and gave some structure to the resulting Helm templates.

Template nesting (named templates)
Sub-templates can be structured in helper files starting with an underscore, e.g. `_configuration.tpl`, as files starting with an underscore don't need to contain a valid Kubernetes manifest.

{{- define "template.to.include" -}}
The content of the template
{{- end -}}

To use the sub-template, you can use

{{ include "template.to.include" .  -}}

Where `.` passes the current context. When there are problems with the context, e.g. inside a loop where `.` is rebound, the trick with `$`, which always refers to the root context, solves the issue.
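For illustration, a minimal sketch of the `$` trick inside a range loop (the values structure is hypothetical):

{{- range .Values.components }}
name: {{ .name }}
{{- /* inside range "." is the current list item, so use $ for the root context */}}
release: {{ $.Release.Name }}
{{- end }}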

To include a file, all you need is to specify a path:

{{ $.Files.Get "application.conf" -}}

Embedding file as configuration

{{ (.Files.Glob "application.conf").AsConfig  }}

It generates the key-value pairs automatically, which is great when declaring a ConfigMap.
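For illustration, a minimal ConfigMap sketch using this function might look as follows (the resource name is hypothetical):

apiVersion: v1
kind: ConfigMap
metadata:
  name: {{ .Release.Name }}-config
data:
{{ (.Files.Glob "application.conf").AsConfig | indent 2 }}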

To provide a reasonable error message and make some config values mandatory, use the required function:

{{ required "Error message if value not specified" .Values.component.port }}

Accessing a key from the values file which contains a dot, e.g. application.conf:

{{- index .Values.configuration "application.conf" }}

If statements combining multiple values

{{- if or (.Values.boolean1) (.Values.boolean2) (.Values.boolean3) }}

Wrapping each reference to a value in parentheses solves the issue.

I hope that you found these tips useful, and if you have any suggestions, let me know in the comment section below.

HotSpot JVM JIT optimisation techniques

After a JVM JIT overview, I put together a list of important techniques used by the HotSpot JVM JIT compiler as I discovered them during my study in this area. It has neither the ambition to provide an exhaustive list of techniques nor to provide deep expertise on any particular topic. The level was set to just understand each technique. Each section has links I found pretty useful in my study of the subject.
The techniques described in this article are based on the Oracle HotSpot JVM implementation, but they should also be valid for OpenJDK and VMs based on it, at least in a limited scope.
The majority of the described methods are dependent on runtime information collected by the VM during execution and stored in ephemeral method data objects.

  • method inlining – a base optimisation technique which can enable more sophisticated and complex techniques later on, as it brings related code closer together. It eliminates the cost associated with calling a method, as it copies the body of the called method into the place from which it was called. HotSpot places some restrictions on which methods can be inlined, e.g. the bytecode size of the inlined method, the depth of the method in the current call chain, etc. For the exact list see the HotSpot VM options. An earlier post contains a detailed method inlining example including an evaluation of the performance gain.
  • loop unrolling – reduces the number of jumps back to the beginning of the loop body. Each jump can have an adverse effect on the CPU as it dumps the pipeline of incoming instructions. It eliminates the cost associated with branch prediction and removes safepoint polls which are typically added at the end of each loop iteration. The shorter the loop body, the higher the penalty for the back jump. HotSpot makes the unrolling decision based on various criteria like the type of the loop variable, the number of loop exit points, etc. An available benchmark showed that a loop with an integer loop variable was two times faster than one with a long. This cannot be taken as a rule of thumb but rather as an idea of the possible impact of this optimisation. The behaviour differs between HotSpot versions and depends on the CPU architecture and the available instruction set. For more details check loop optimisation in the HotSpot VM compiler.
  • escape analysis – the sole purpose of this technique is to decide whether work done in a method is visible outside of the method or has any side effects. This optimisation is performed after any inlining has completed. Such knowledge is utilised in eliminating unnecessary heap allocation via an optimisation called scalar replacement (see the sketch after this list). In principle, object fields become scalar values as if they had been allocated as local variables instead. This reduces the object allocation rate, reduces memory pressure and in the end results in fewer GC cycles. This effect is nicely demonstrated in the “Automatic stack allocation in the java virtual machine” blog post. Details can also be found in OpenJDK Escape Analysis.
    Escape analysis is also utilised when optimising the performance of intrinsic locks (those using synchronized). There are several possible lock optimisations which essentially eliminate the lock overhead:

    • lock elision – removing locks on an object which doesn’t escape given scope. A great blog post from Aleksey Shipilёv on lock elision demonstrates the performance effects.
    • lock coarsening – merges sequential lock regions that share the same lock. More detailed information can be found in the post dedicated to lock coarsening
    • nested locks – detects blocks of code where the same lock is acquired without releasing it.
  • monomorphic dispatch – this optimisation relies on the empirical fact that in authored code it is quite common that only one type is observed for a receiver object. The optimisation essentially eliminates the cost of looking up the method in vtables – it removes the invokevirtual bytecode instruction. It is protected by a guard to ensure that no wrong code is ever invoked. HotSpot also has bimorphic dispatch – pretty similar, just done with two classes. If there is a need to regain some performance, it is possible to transform the code in such a way that a call site sees just two types to take advantage of this optimisation.
  • intrinsic – a highly performant native implementation of a method known to the JVM in advance, rather than generated by the JIT. Intrinsics are used for performance-critical methods and are backed by either special instructions provided by the CPU or by the operating system. That means they are platform-dependent, and some of them may not be available on all platforms. The decision about which optimisation to use might be deferred until runtime. Intrinsics are contained in the .ad files of the HotSpot OpenJDK source code, for example the intrinsics for the x86 64-bit architecture. The list of available HotSpot intrinsics is in vmSymbols.
    Someone who would like to see more detail on intrinsics could enjoy this blog post about the POPCNT CPU instruction (x86 and amd64) utilised by Integer.bitCount, including a benchmark giving an idea of the performance impact. This presentation gives great insight into what it means to add a new intrinsic to HotSpot. The obvious fact is that you have to implement it in C++ with all the risks that brings (no memory management, and an issue can bring the whole VM down). Compare that to Graal, a new JIT compiler for the JVM written in Java and integrated via the JVM Compiler Interface (JVMCI); as a JIT is nothing more than a transformation from bytecode to machine code, writing it in Java brings a lot of benefits, e.g. the use of standard Java tooling. Graal dramatically simplifies the development of custom intrinsics for specific hardware. For a better understanding of Graal I found Chris Seaton's presentation pretty useful.
  • on stack replacement (OSR) – another term you might find when you start reading about JIT optimisations. OSR is a technique for switching between two different implementations of a method while it is running. It comes in handy in situations where some functions are invoked just once and could benefit from an optimised version. When OSR happens, the VM needs to be paused and the current frame replaced with a new one which may have its variables in different locations. For further info, I refer to an interesting post on whether On Stack Replacement is good or bad.
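To make scalar replacement more tangible, here is a minimal Scala sketch of a method where the allocated object never escapes (class and method names are made up for illustration):

class Vec(val x: Int, val y: Int)

class EscapeDemo {
  // `v` never escapes this method: it is not stored in a field, returned
  // or passed to another thread, so after inlining the JIT can replace it
  // with two scalar locals and skip the heap allocation entirely.
  def lengthSquared(a: Int, b: Int): Int = {
    val v = new Vec(a, b)
    v.x * v.x + v.y * v.y
  }
}

If the JIT proves that v never escapes lengthSquared, the allocation can be elided and v.x and v.y kept in registers.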

To see what the JIT compilers do during application runtime, use JITWatch. Thanks to the newly gained understanding of JIT optimisations, it should make a lot more sense what's going on.

References for further study:
Useful JIT optimisation techniques
Java HotSpot performance engine architecture 

Stress testing asynchronous REST service with Gatling

In this blog post, I am not going to deep dive into application performance testing theory or methodology, but rather focus on explaining the more advanced features of the Gatling performance testing tool on a relatively common scenario: testing an asynchronous service. The asynchronous service we are going to test in this post, a report generation service, has the following API specification:

  1. A report is requested for a given userId via the POST method on the /report endpoint, and the request tracking reportId is returned in the response.
  2. Report status is tracked via the GET method on /report/{reportId}. A 202 Accepted response status code means that report generation is still in progress. A 5xx status code means an error during generation. When the report is done, the service returns 200 OK and the generated report.

The Gatling testing tool allows you to create any testing scenario. Testing a synchronous service is pretty simple: call the service with the request payload, get the response and assert the content. Performance metrics, in this case, are relatively simple: response time of a single service call, error rates etc. However, testing an asynchronous service is a bit more difficult as it cannot be done via a single call. The testing scenario needs to follow this structure:

  1. request the service with the given parameters and store the id of this particular request
  2. poll the request status to trace the progress
  3. download the result when the request is completed

The performance of the service cannot be measured directly as processing is no longer bounded by a single call. What we would like to measure is the processing time from the moment the report is requested until processing is complete.

This test scenario can be nicely transformed into Gatling DSL:

 val generateReport = scenario("Generate report")
   .exec(requestReport)
   .exec(pollReportGenerationState)
   .exec(getGeneratedReport)

Each test step contains validations that assure the correctness of the test scenario for a given virtual user. The test scenario relies on Gatling sessions, which allow us to carry user-specific data between test steps. In this particular case, we need a user-specific reportId in order to track request progress and finally collect the report. The reportId is parsed from the JSON report generation response:

 private val requestReport: HttpRequestBuilder = http("Request report generation")
   .post("/report")
   .queryParamMap(withApiKey)
   .body(StringBody(reportRequestBody(userId, reportParameter)))
   .asJSON
   .check(status is 202)
   .check(jsonPath("$..reportId").ofType[String].exists.saveAs("reportId"))

Probably the most interesting part is polling the report generation state, where we need to keep a virtual user in polling mode until the report is ready. Again, we are going to use the Session in combination with the asLongAs looping operator:

 private def notGenerated(sess: Session): Validation[Boolean] = {
   val reportInProgress = 202
   val generationStatus = sess("generationStatus").asOption[Int].getOrElse(reportInProgress)
   logger.debug(s"Report generation status: $generationStatus")
   generationStatus == reportInProgress
 }

 private val pollReportGenerationState = asLongAs(notGenerated)(
   pause(100.millis)
     .exec(
       http("Poll report generation state")
         .get("/report/${reportId}")
         .queryParamMap(withApiKey)
         .asJSON
         .check(status saveAs ("generationStatus"))
     )
 )

When working with an asynchronous service, another operator worth considering is tryMax. The disadvantage of this operator is that failed requests are counted towards the failed request statistics, which in our case would drastically distort the results. Silencing the request, on the other hand, would erase this test step from the report completely.
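Collecting the report once it is generated is pretty straightforward. A minimal sketch of this final step might look as follows (the exact checks are an assumption; the reportId comes from the session value saved in the first step):

 private val getGeneratedReport: HttpRequestBuilder = http("Get generated report")
   .get("/report/${reportId}")
   .queryParamMap(withApiKey)
   .check(status is 200)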

In order to collect statistics for the whole report generation scenario – the time from requesting the report until it is collected – we need to wrap that part of the scenario in the group combinator, which produces group statistics:

val generateReport = scenario("Generate report").group("Generation report completion"){
exec(requestReport)
.exec(pollReportGenerationState)
.exec(getGeneratedReport)
}

In this blog post, we used the more advanced features of the Gatling performance testing tool to test an asynchronous service, using virtual users that share important data between test steps via the Session. The complete code snippet can be found in my GitHub gist. If you test asynchronous services in a different way, or have any suggestion on how to support custom metrics in the Gatling report, let me know in the comment section below this blog post.

Communication in distributed teams using Slack

Some time ago I wrote a post about tools I found useful when working in a distributed team. Working habits are changing, and companies which grasp this will benefit from access to a broader talent pool and further benefits when managed rightly. However, managing remote teams poses significant challenges. This post isn't going to discuss how to structure teams or how to address the arising challenges, but rather focuses on things I found useful when working in a remote distributed team.

In the past years, Slack has become the de facto standard in this area, even though I keep hearing from various teams that it didn't quite meet their needs. If you have experience with a different tool, I will be happy if you share your experience in the comment section below the post.

The following list enumerates the features I found really useful while working in a remote distributed development team.

  1. Snooze notifications – to protect your focus time and boost your productivity. Break the day into focus units where all distractions are disabled.
  2. Highlight words – can save you time when skimming through unread messages, available via Slack > Preferences > Notifications > My Keywords.
  3. Reminder tools – on each message you can set a reminder.
  4. To quickly start a Google Hangouts meeting – use the /hangout command and the call will be initiated.
  5. If you want to append ¯\_(ツ)_/¯ to a channel – type the /shrug command.
  6. Use advanced Slack search – start in the search box with a plus sign.
  7. Create a poll – use the command: /poll “Question?” “Option1” “Option2” “Option3” …
  8. Convert direct communication to a private channel. The direct communication must include more than one other party, and the resulting channel is always private with no option to make it public. You can later invite new participants, with the option to let them see the channel history.
  9. Channels for dedicated groups or around topics of interest are standard, but what I found really useful is the User Groups feature, which allows communicating with a group of people across channels.
  10. Use stars to create an instant to-do list. You can mark any message or phrase with a star next to the timestamp. It's almost the same as bookmarking. Hint: if you mark messages containing tasks, you'll get a personal to-do list, which is really convenient to build during a conversation.

This concludes my top 10 Slack features that I found really useful while working in a remote distributed team. If you have other tips and tricks, please let me know. Also, if you have found other useful tools, please share that experience in the comment section below.

How to optimise a code to be JIT friendly

In the previous blog post, we measured the effect of a basic JIT optimisation technique – method inlining (one of many JIT optimisation techniques). The code example was a bit unnatural, as it was super simple Scala code written just to demonstrate method inlining. In this post, I would like to share a general approach I use when I want to check how the JIT treats my code, or whether there is some possibility to improve the code performance with regard to the JIT. Even method inlining requires the code to meet certain criteria, such as the bytecode length of the inlined method. For this purpose, I regularly use the great OpenJDK project called JITWatch, which comes with a bunch of handy JIT-related tools. I am pretty sure there are more tools out there, and I will be more than happy if you share your approaches for dealing with the JIT in the comment section below the article.
Java HotSpot is able to produce a very detailed log of what the JIT compiler is doing and why. Unfortunately, the resulting log is very complex and difficult to read; reading it requires an understanding of the techniques and theory that underlie JIT compilation. A free tool like JITWatch processes those logs and abstracts this complexity away from the user.

In order to produce a log suitable for JITWatch investigation, the tested application needs to be run with the following JVM flags:

-XX:+UnlockDiagnosticVMOptions
-XX:+LogCompilation
-XX:+TraceClassLoading

Those settings will produce a log file hotspot_pidXXXXX.log. For the purposes of this article, I re-used the code from the previous blog post, located on my GitHub account, with the JVM flags enabled in build.sbt.
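A minimal sketch of how those flags might be enabled in build.sbt (the exact settings in the referenced project are an assumption):

// fork the JVM so the flags apply to the application, not to sbt itself
fork in run := true
javaOptions in run ++= Seq(
  "-XX:+UnlockDiagnosticVMOptions",
  "-XX:+LogCompilation",
  "-XX:+TraceClassLoading"
)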
In order to look into the generated machine code in JITWatch, we need to install the HotSpot Disassembler (HSDIS) into $JAVA_HOME/jre/lib/server/. For Mac OS X a build can be downloaded here; rename it to hsdis-amd64.dylib. In order to include machine code in the generated JIT log, we need to add the JVM flag -XX:+PrintAssembly.
[info] 0x0000000103e5473d: test %r13,%r13
[info] 0x0000000103e54740: jne 0x0000000103e5472a
[info] 0x0000000103e54742: mov $0xfffffff6,%esi
[info] 0x0000000103e54747: mov %r14d,%ebp
[info] 0x0000000103e5474a: nop
[info] 0x0000000103e5474b: callq 0x0000000103d431a0 ; OopMap{off=112}
[info] ;*invokevirtual inc
[info] ; - com.jaksky.jvm.tests.jit.IncWhile::testJit@12 (line 19)
[info] ; {runtime_call}
[info] 0x0000000103e54750: callq 0x0000000102e85c18 ;*invokevirtual inc
[info] ; - com.jaksky.jvm.tests.jit.IncWhile::testJit@12 (line 19)
[info] ; {runtime_call}
[info] 0x0000000103e54755: xor %r13d,%r13d
We run JITWatch via ./launchUI.sh and configure the source files and the target generated class files.

And finally, open the prepared JIT log and hit Start.

The most interesting view from our perspective is the TriView, where we can see the source code, JVM bytecode and native code side by side. For this particular example we disabled method inlining via the JVM flag -XX:CompileCommand=dontinline,com/jaksky/jvm/tests/jit/IncWhile.inc

Comparing with the case when the method body of IncWhile.inc is inlined: the native code size is greater, 216 bytes compared to 168 bytes, with the same bytecode size.
The Compile Chain also provides a great view of what is happening with the code.
The inlining report provides a great overview of what happened with the code.
Here the effect of tiered compilation, described in the JIT compilation post, can be seen: execution starts with client C1 JIT compilation and then switches to server C2 compilation. The same or an even better view can be found in the Compiler Thread activity, which provides a timeline view. To refresh your memory, check the overview of JVM threads. Note: standard Java library code is subject to JIT optimisations too, which is why there is so much compilation activity here.
JITWatch is a really awesome tool and provides many other views which it doesn't make sense to screenshot here, e.g. code cache allocation, nmethods etc. For detailed information, I really suggest reading the JITWatch wiki pages. Now the question is: how to write JIT-friendly code? Here the pure jewel of JITWatch comes in: the Suggestion Tool. That is why I like JITWatch so much. For the demonstration, I selected a somewhat more complex problem – the N Queens problem.
The Suggestion Tool clearly describes why certain compilations failed and what the exact reason was. It is a coincidence that in this example we hit inlining again; there is definitely more going on in the JIT, but this window provides a clear view of how we can possibly help the JIT.
Another great tool which is part of JITWatch is the JarScan Tool. This utility scans a list of jars and counts the bytecode size of every method and constructor. Its purpose is to highlight the methods that are bigger than the HotSpot threshold for inlining hot methods (35 bytes by default), so it provides hints on where to focus benchmarking to see whether decomposing code into smaller methods brings some performance gain. The hotness of a method is determined by a set of heuristics including call frequency etc., but what can eliminate a method from inlining is its size. Of course, the mere fact that a method breaches the size limit for inlining doesn't automatically mean that the method is a performance bottleneck. JarScan is a static analysis tool which has no knowledge of runtime statistics, hence no knowledge of real method hotness.
jakub@MBook ~/Development/GIT_REPO (master) $ ./jarScan.sh --mode=maxMethodSize --limit=35 ./chess-challenge/target/scala-2.12/classes/
"cz.jaksky.chesschallenge","ChessChallange$","delayedEndpoint$cz$jaksky$chesschallenge$ChessChallange$1","",1281
"cz.jaksky.chesschallenge.solver","ChessBoardSolver$","placeFigures$1","scala.collection.immutable.List,scala.collection.immutable.Set",110
"cz.jaksky.chesschallenge.solver","ChessBoardSolver$","visualizeSolution","scala.collection.immutable.Set,int,int",102
"cz.jaksky.chesschallenge.domain","Knight","check","cz.jaksky.chesschallenge.domain.Position,cz.jaksky.chesschallenge.domain.Position",81
"cz.jaksky.chesschallenge.domain","Queen","equals","java.lang.Object",73
"cz.jaksky.chesschallenge.domain","Rook","equals","java.lang.Object",73
"cz.jaksky.chesschallenge.domain","Bishop","equals","java.lang.Object",73
"cz.jaksky.chesschallenge.domain","Knight","equals","java.lang.Object",73
"cz.jaksky.chesschallenge.domain","King","equals","java.lang.Object",73
"cz.jaksky.chesschallenge.domain","Position","Position","int,int",73
"cz.jaksky.chesschallenge.domain","Position","equals","java.lang.Object",72
To wrap up, JITWatch is a great tool which provides insight into the HotSpot JIT optimisations happening during program execution, and it can help you understand how a decision made at the source code level can affect the performance of the program.

JVM JIT compilation as a way of performance optimisation

The previous article, Structure of JVM – Java memory model, briefly mentions bytecode execution modes, and the article on JVM internal threads provides additional insight into the internal architecture of JVM execution. In this article, we focus on Just In Time compilation and some of its basic optimisation techniques. We also discuss the performance impact of one JIT optimisation technique, namely method inlining. In the remainder of this article we focus solely on the HotSpot JVM; however, the principles are valid in general.
HotSpot JVM is a mixed-mode VM, which means that it starts off interpreting the bytecode but can compile code into highly optimised native machine code for faster execution. This optimised code runs extremely fast and its performance can be compared with C/C++ code. JIT compilation happens on a per-method basis at runtime, after the method has been run a number of times and is considered hot. Compilation into machine code happens on a separate JVM thread and does not interrupt the execution of the program: while the compiler thread is compiling a hot method, the JVM keeps using the interpreted version of the method until the compiled version is ready. Thanks to the runtime characteristics of the code, the HotSpot JVM can make sophisticated decisions about how to optimise it.
The Java HotSpot VM is capable of running in two separate modes (C1 and C2), and each mode is preferred in different situations:
  • C1 (-client) – used for applications where quick startup and solid optimisation are needed; typically GUI applications are good candidates.
  • C2 (-server) – for long-running server applications.
The two compiler modes use different techniques for JIT compilation, so it is possible to get very different machine code for the same method. Modern Java applications can take advantage of both compilation modes, and starting from Java SE 7 a feature called tiered compilation is available. An application starts with C1 compilation, which enables a fast startup, and once the application is warmed up, the C2 compiler takes over. Since Java SE 8, tiered compilation is the default. Server optimisations are more aggressive and based on assumptions which may not always hold, so they are always protected with a guard condition to check whether the assumption is still correct. If an assumption is not valid, the JVM reverts the optimisation and drops back to interpreted mode. In server mode, the HotSpot VM runs a method in interpreted mode 10,000 times before compiling it (this can be adjusted via -XX:CompileThreshold=5000). Changing this threshold should be considered thoroughly, as the HotSpot VM works best when it can accumulate enough statistics to make an intelligent decision about what to compile. If you want to inspect what is compiled, use -XX:+PrintCompilation.
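As a quick illustration, the compilation activity of a hypothetical main class com.example.MyApp could be observed like this:

java -XX:+PrintCompilation com.example.MyApp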
Among the most common JIT compilation techniques used by the HotSpot VM is method inlining, which is the practice of substituting the body of a method into the places where the method is called. This technique saves the cost of calling the method. In HotSpot, there is a limit on the size of methods that can be substituted. The next commonly used technique is monomorphic dispatch, which relies on the fact that most of the time the paths through method code belong to one receiver type. Thanks to this observation, the exact method definition is known without checking, so the overhead of virtual method lookup can be eliminated and the JIT compiler can emit optimised machine code which is faster. There are many other optimisation techniques, like loop optimisation, dead code elimination, intrinsics and others.
The performance gain by inlining optimisation can be demonstrated on a simple Scala code:
class IncWhile {

  def main(): Int = {
    var i: Int = 0
    var limit = 0

    while (limit < 1000000000) {
      i = inc(i)
      limit = limit + 1
    }
    i
  }

  def inc(i: Int): Int = i + 1
}

The method inc is eligible for inlining as its body is smaller than 35 bytes of JVM bytecode (the actual size of the inc method is 9 bytes). The inlining optimisation can be verified by looking at the JIT-optimised machine code.


The difference is obvious when compared to the machine code produced when inlining is disabled using -XX:CompileCommand=dontinline,com/jaksky/jvm/tests/jit/IncWhile.inc

The difference in runtime characteristics is also significant, as the benchmark results show. With inlining disabled:

[info] Result "com.jaksky.jvm.tests.jit.IncWhile.main":
[info] 2112778741.540 ±(99.9%) 9778298.985 ns/op [Average]
[info] (min, avg, max) = (2040573480.000, 2112778741.540, 2192003946.000), stdev = 28831537.237
[info] CI (99.9%): [2103000442.555, 2122557040.525] (assumes normal distribution)
[info] # Run complete. Total time: 00:08:03
[info] Benchmark Mode Cnt Score Error Units
[info] IncWhile.main avgt 100 2112778741.540 ± 9778298.985 ns/op

When inlining is enabled, the JVM JIT is also able to apply further optimisations like loop optimisations, which might cause our whole loop to be eliminated as its result is easily predictable. We would then get times around 3 ns, which would be unreal for a 1 GHz processor supposedly performing a billion operations. To disable most loop optimisations, use the -XX:LoopOptsCount=0 JVM option.

[info] Result "com.jaksky.jvm.tests.jit.IncWhile.main":
[info] 332699064.778 ±(99.9%) 3485503.823 ns/op [Average]
[info] (min, avg, max) = (316312877.000, 332699064.778, 358738827.000), stdev = 10277087.396
[info] CI (99.9%): [329213560.955, 336184568.600] (assumes normal distribution)
[info] # Run complete. Total time: 00:04:55
[info] Benchmark Mode Cnt Score Error Units
[info] IncWhile.main avgt 100 332699064.778 ± 3485503.823 ns/op
So the performance gain from inlining a method body can be quite significant: roughly 2 seconds vs 330 milliseconds.
In this post, we discussed the mechanics of Java JIT compilation and some of the optimisation techniques used. We particularly focused on one of the simplest optimisation techniques, called method inlining, and demonstrated the performance gain brought by eliminating a method call represented by the invokevirtual bytecode instruction. Scala also offers a special annotation @inline which should help with the performance aspects of code under development. All the code for running the experiments is available online on my GitHub account.

 

HotSpot JVM internal threads

In the Structure of Java Virtual Machine post we scratched the surface of the class file structure and how it is connected to the Java memory model via the class loading process. We also briefly discussed the bytecode structure and its execution, including a short introduction to Just In Time runtime optimisation. In this post we look more at the internals of the execution engine; however, there is no ambition to substitute the detailed VM implementation documentation for the HotSpot JVM, just to provide enough details to gain a bigger picture.

The basic threading model in the HotSpot JVM is a one-to-one mapping between Java threads (instances of java.lang.Thread) and native operating system threads. A native thread is created when the Java thread is started and reclaimed once it terminates. The operating system is responsible for scheduling all threads and dispatching them to any available CPU. The relationship between Java thread priorities and operating system thread priorities varies across operating systems.

HotSpot provides monitors by which threads running application code may participate in a mutual exclusion (mutex) protocol. A monitor is either locked or unlocked, and only one thread may own the monitor at any time. Only after acquiring ownership of the monitor may a thread enter a critical section protected by it. Critical sections are referred to as synchronised blocks, delineated by the synchronized keyword.
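As a minimal illustration of the monitor protocol from application code (a Scala sketch, not HotSpot internals):

class Counter {
  private var count = 0

  // the monitor of `this` is acquired on entry and released on exit,
  // so only one thread at a time can execute the critical section
  def increment(): Unit = this.synchronized {
    count += 1
  }
}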

Apart from application threads, the JVM also contains internal threads, which can be categorised into the following groups:

  • VM Thread – responsible for executing VM operations
  • Periodic task thread – thread executing periodic operations within the VM (singleton instance of WatcherThread)
  • GC threads – threads of different types to support parallel and concurrent garbage collection
  • Compiler threads – perform JIT compilation and optimisation of bytecode to native code at runtime (C1 and C2 JIT compilation threads)
  • Signal dispatcher thread – waits for directed signals and dispatches them to a Java-level signal handling method


The VM thread spends its time waiting for requested operations to appear in the operation queue (VMOperationQueue). Operations are typically passed to the VM Thread because they require the VM to reach a safepoint before they can be executed. When the VM is at a safepoint, all threads inside the VM have been blocked, and any threads executing in native code are prevented from returning to the VM while the safepoint is in progress. This means that a VM operation can be executed knowing that no thread is in the middle of modifying the heap. All threads are in a state such that their Java stacks are unchanging and can be examined.

The most familiar VM operation is related to garbage collection, particularly the stop-the-world phase that is common to many garbage collection algorithms. Other VM operations are thread stack dumps, thread suspension or stopping, inspection or modification via JVMTI, etc. A VM operation can be synchronous or asynchronous.

Safepoints are initiated using a cooperative, polling-based mechanism: threads ask "Should I block for a safepoint?" A moment when this often happens is during a thread state transition. Threads executing interpreted code don't usually ask the question; instead, when a safepoint is requested, the interpreter switches to a different dispatch table which includes that question, and when the safepoint is over, the dispatch table is switched back. Once a safepoint has been requested, the VM Thread must wait until all threads are known to be in a safepoint-safe state before proceeding with the operation. During the safepoint, the threads lock is used to block any threads that were running, and it is released when the operation has completed.

Structure of Java Virtual Machine (JVM)

Java-based applications run in the Java Runtime Environment (JRE), which consists of a set of Java APIs and the Java Virtual Machine (JVM). The JVM loads an application via class loaders and runs it with the execution engine.

The JVM runs on all kinds of hardware, where it executes Java bytecode without changing the execution code. The VM implements the Write Once, Run Anywhere principle, also called the platform independence principle. Just to sum up the JVM key design principles:
  • Platform independence
    • Clearly defined primitive data types – languages like C or C++ have sizes of primitive data types that depend on the platform. Java is unified in that matter.
    • Fixed byte order, big endian (network byte order) – Intel x86 uses little endian while RISC processors typically use big endian. Java uses big endian.
  • Automatic memory management – class instances are created by the user and automatically removed by Garbage Collection.
  • Stack-based VM – typical computer architectures such as Intel x86 are based on registers; the JVM, however, is based on a stack.
  • Symbolic references – all types except primitive ones are referred to via symbolic names instead of direct memory addresses.

Java uses bytecode as an intermediate representation between source code and the machine code which runs on hardware. A bytecode instruction is represented as a one-byte opcode, e.g. getfield 0xb4, invokevirtual 0xb6, hence there is a maximum of 256 instructions. If an instruction doesn't need operands, the next instruction immediately follows; otherwise, the operands follow the instruction according to the instruction set specification. The instructions are contained in class files produced by Java compilation. The exact structure of the class file is defined in the "Java Virtual Machine specification", section 4 – the class file format. After some version information there are sections like the constant pool, access flags, fields info, this and super info, methods info etc. See the spec for the details.
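For a feel of what bytecode looks like, compiling a trivial method such as int inc(int i) { return i + 1; } and disassembling it with javap -c yields roughly the following instructions (a sketch; the exact listing depends on the compiler version, and the comments are mine):

int inc(int);
  Code:
     0: iload_1    // push the argument i onto the operand stack
     1: iconst_1   // push the constant 1
     2: iadd       // add the two top stack values
     3: ireturn    // return the top of the stack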

A class loader loads the compiled Java bytecode into the Runtime Data Areas, and the execution engine executes it. A class is loaded when it is used for the first time in the JVM. Class loading works in a dynamic fashion on the parent-child (hierarchical) delegation principle. Class unloading is not allowed. Some time ago I wrote an article about class loading on an application server. The detailed mechanics of class loading are out of scope for this article.

Runtime data areas are used during the execution of the program. Some of these areas are created when the JVM starts and destroyed when the JVM exits. Other data areas are per-thread – created on thread creation and destroyed on thread exit. The following picture is based mainly on JVM 8 internals (it doesn't include the segmented code cache and dynamic linking of languages introduced in JVM 9).


  • Program counter – exists per thread and holds the address of the currently executed instruction; if the method is native, the PC is undefined. The PC is, in fact, pointing at a memory address in the Method Area.
  • Stack – a JVM stack exists per thread and holds one frame for each method executing on that thread. It is a LIFO data structure. Each stack frame has references to the local variable array, the operand stack and the runtime constant pool of the class whose code is being executed.
  • Native stack – not supported by all JVMs. If the JVM is implemented using the C-linkage model for JNI, then the stack will be a C stack (the order of arguments and returns in the native stack will be identical to a typical C program). Native methods can call back into the JVM and invoke Java methods.
  • Stack Frame – stores references that point to the objects or arrays on the heap.
    • Local variable array – all the variables used during execution of the method, all method parameters and locally defined variables.
    • Operand stack – used during the execution of bytecode instructions. Most bytecode instructions manipulate the operand stack, moving values between it and the local variable array.
  • Heap – an area shared by all threads, used to allocate class instances and arrays at runtime. The heap is the subject of garbage collection as Java's way of automatic memory management. This space is most often mentioned in JVM performance tuning.
  • Non-Heap memory areas
    • Method area – shared by all threads. It stores runtime constant pool information, field and method information, static variable, method bytecode for each class loaded by the JVM. Details of this area depend on JVM implementation.
      • Runtime constant pool – this area corresponds to the constant pool table in the class file format. It contains all references to methods and fields. When a method or field is referred to, the JVM searches for its actual address in memory by using the constant pool.
      • Method and constructor code
    • Code cache – used for the storage of methods compiled to native code by JIT compilation
The bytecode assigned to the runtime data areas via the class loader is executed by the execution engine. The engine reads the bytecode one instruction at a time. The execution engine must transform the bytecode into a language that can be executed by the machine. This can happen in one of two ways:
  1. Interpreter – read, interpret and execute bytecode instructions one by one.
  2. JIT (Just In Time) compiler – compensates for the disadvantages of interpretation. The engine starts executing the code in interpreted mode, and the JIT compiler compiles hot bytecode to native code. Execution is then switched from interpretation to the execution of native code, which is much faster thanks to the various optimisation techniques the JIT performs. The native code is stored in a cache. Since compilation to native code takes time, the JVM uses various metrics to decide whether to JIT-compile the bytecode.

How the JVM execution engine runs is not defined by the JVM specification, so vendors are free to improve their JVM engines with various techniques; the HotSpot JVM is described in more detail in the posts above, and JITWatch gives insight into its runtime characteristics.

More details can be found in The Java® Virtual Machine Specification – Java SE 9 Edition

 

 

Online collaboration tools for distributed teams

During the past several years, working habits and working styles have been rapidly changing. And this trend will surely continue; just visit Google Trends and search for "digital nomad" or "remote work". However, some professions undergo this change more easily than others. But it is clear that companies that understand the trend benefit from it.

Working in a different style requires a brand new set of tools and approaches which provide you with similar working conditions as when people are co-located in the same office. Video conferencing and phone or Skype are just the beginning and don't cover all aspects.

In the following paragraphs, I am going to summarise the tools I found useful while working remotely as a software developer in a fully distributed team. Those tools are either free or offer some free functionality, and I consider them very useful in various situations. The spectrum of tools ranges from project management and planning tools to communication tools.

For communication – chat and calls – Slack has become the standard, widely adopted tool. It allows you to freely organise your teams and lets them create the channels they need. It supports a wide range of plugins, e.g. chatbots, and it is well integrated with other tools. It provides applications for all desktop and mobile platforms.


When solving an issue, or when you just want to present something to an audience, screen sharing becomes a very handy tool. I found Join.me pretty handy; the free plan with a limited audience size was just big enough. It works well on Mac OS and Windows; the Linux platform I haven't tried yet.


When it comes to pure conference phone calls or chat, Discord recently took my breath away with its awesome sound quality. Again, it offers desktop and mobile clients, plus you can use the browser version if you do not wish to install anything on your PC.


Now I move slightly towards planning and design tools used during the software development process; it doesn't matter whether you use Scrum or Kanban, these have their place in both. First, a shared task board with post-it notes: the one I found useful and free is Scrumblr. The only disadvantage is that it is public. It allows you to design the number of sections, change the colours of the notes, add markers to them, etc.


Since we have touched on agile development methodology: there is no planning and estimation without planning poker. I found BitPoints useful – a simple, free online tool meeting all our needs, where you invite all participants to the game. It allows various settings, like the type of deck, etc.


When the design phase is reached, the shared online diagramming tool we found really useful is Sketchboard. It offers a wide range of diagram types and shapes, including, of course, traditional UML diagrams. The free version offers a few private diagrams; otherwise you go public with your designs. It allows comments and team discussion.


Sometimes we just missed a traditional whiteboard session and wanted to brainstorm, so the web whiteboard tool AWW met our needs. Simple yet powerful.


This concludes the set of tools I have found useful during the past year while working remotely in a distributed team. I hope that you found at least one of them useful or new. Do you have other tools you have found useful, or better variants of those mentioned above? Please share them in the comment section!