CI/CD tools landscape

For any software development team, it is critical to deliver value as quickly as possible, safely and reliably. It is proven that speed of delivery is directly correlated with organisation performance (see, e.g. the State of DevOps report). So the delivery process influences company valuation and is critical for scaling the engineering effort while keeping the desired quality of the product. How to achieve this is one of the cornerstones of DevOps and SRE. To get an idea of how modern software delivery works in a successful company, see how the delivery pipeline works in AWS.
This blog post is by no means a replacement for deep-dive specialised or best-practice literature, e.g. Continuous Delivery or Continuous Integration, but rather an evaluation of the current tooling landscape, which can help in achieving project goals. Needless to say, a tool alone won’t make the magic happen without correct delivery pipeline design. But in this post, we will focus solely on the tooling.

Firstly, let’s clarify the essential terms (for reference, see a good article from Atlassian).

Continuous integration (CI) is the practice of automating the integration of code changes from multiple contributors into a single software project.

Continuous Delivery (CD) is the ability to get changes of all types—including new features, configuration changes, bug fixes and experiments— into production, or into the hands of users, safely and quickly in a sustainable way.

Continuous deployment (CD) is a strategy for software releases wherein any code commit that passes the automated testing phase is automatically released into the production environment. It is paramount to software delivery processes. 

The delivery process is critical in any software company. From my perspective and experience, the current state is far from being “solved”, and the number of tools appearing every year confirms that, as does the amount of money spent by VCs. The majority of tools are imperative, while the next big trend seems to be “declarative” CI/CD tooling. Curious what the future will bring.

There is a wide variety of tools available (by no means is this list exhaustive):

Our selection and evaluation criteria are based on our current and future needs:

  • cost-effective (auto-scaling workers, etc.)
  • cost of maintenance
  • speed of development/ability to contribute
  • manual approval stage
  • ability to pass certification (audit-ability, permission and roles, etc.)
  • multi-cloud support
  • support VMs + kubernetes deployments + potentially serverless
  • ability to integrate Infrastructure as Code to delivery pipeline
  • do not scrap all of our existing development infra (keep the cost/benefit ratio in mind)
  • majority of our workloads are running in GCP
  • deals with mono-repo
  • support for long term support (LTS) branches

The following tools made it into the shortlist for evaluation and a deep dive. See the dedicated post for each of those:

Summary:
Our ideal solution would be tooling provided by our primary cloud provider which meets our current and near-future needs and is fully managed. We partially matched that with a combination of Cloud Build and Spinnaker for GCP, based on a tutorial provided by GCP.
Generally, my impression from the study and evaluation of the tools listed is that those claiming “full CI/CD” support are neither great at CI nor at CD and lie somewhere in the middle. They provide a platform and let you code the rest. Another pain point is tackling the monorepo and providing the means to be efficient with it. The platforms seem to be somewhat pricey, and the amount of infra work still needed to get all the necessary features is not low enough to justify the price. Curious what Harness will provide in this space.
I am not promoting the combination we ended up with, but moving away from Concourse CI was a clear win. The missing resource management for stages was a total killer, and insufficient authorisation and role management together with the absence of manual steps made it clear not to continue that journey. For a fresh new project, GitLab would be a no-brainer to start with. It provides everything needed for development, but when the project grows significantly, it can become pricey, and you are motivated, even by GitLab itself, to move partially to your own infrastructure. Needless to say, that setup requires some amount of work, especially proxying and creating network waypoints.
If you have experience with the tools evaluated or disagree with any of the points, please use the comment section to share your view, and don’t forget to like and follow me on Twitter!


CD with Spinnaker – evaluation

Spinnaker is one of the popular continuous delivery platforms, originally developed at Netflix. I am evaluating version 1.23.5. Spinnaker is a multi-cloud continuous delivery platform supporting VM- and Kubernetes-based deployments (serverless is under development). It is an extensible platform with an HA setup possible. This post is part of a bigger series with a unified structure.

Overview:
Spinnaker Architecture
Spinnaker basic concepts (Spinnaker started with VM deployments; Kubernetes concepts are mapped to it in the provider)
Pipeline stages
– Support for a manual judgement stage, though no detailed permission model for actions (non-OSS plugins exist, e.g. Armory); see the sketch after this list
– Nesting pipelines supported (either fire-and-forget or wait for completion)
Custom stage development (REST call, Kubernetes job or Jenkins job, …)
– Development of new stages
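For illustration only, a manual judgement stage inside the pipeline JSON looks roughly like the following sketch. Stage names, refIds and instructions are made up, and JSON allows no comments, so treat every value here as a placeholder and consult the Spinnaker documentation for the authoritative schema.

{
  "type": "manualJudgment",
  "name": "Approve production rollout",
  "refId": "3",
  "requisiteStageRefIds": ["2"],
  "instructions": "Check the canary dashboards before approving.",
  "judgmentInputs": [],
  "notifications": [],
  "failPipeline": true
}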

Authentication & Authorisation (Spinnaker security concepts):
Spinnaker Authentication
Spinnaker Authorisation with Role Based Access
– Spinnaker can be accessed through GCP Identity-Aware Proxy (or other services on different cloud providers)
– Authentication via the G Suite identity provider or GitHub teams. Other options exist as well; see the overview here.
– Authorisation with Google Groups (only supports a flat structure, role = name of the group), GitHub teams, raw mapping or others
Pipelines are versioned automatically
Pipeline triggers
– Concept of providers, which integrate pipelines with the target platform or cloud provider, e.g. the Kubernetes provider v2
– Support for complex deployment strategies
– Management CLI – Halyard (Spinnaker configuration) and spin for pipeline management
– Deployment to Kubernetes in the form of native manifests; Helm packages are transformed in the Helm Bake Stage into native manifests (using native Helm support for templating)
– Terraform stage as a custom stage, e.g. an OSS implementation
– Wide variety of notification options
– Monitoring support via Prometheus
Backup configuration to storage

Pricing:
– There is no price for Spinnaker itself, only for the resources consumed when deployed
– Requires VMs and Redis or Cloud SQL (Postgres)
– Load balancer
Spinnaker for GCP if you are running on GCP, where you pay only for the resources needed.

Resources:
https://spinnaker.io/
https://www.slideshare.net/Pivotal/modern-devops-with-spinnaker-olga-kundzich
https://spinnaker.io/concepts/ebook/

Summary:
A tool focused on CD with manual approval stages and a security model which makes it SOC 2 compliant. Good audit-ability is in place (possible to integrate with the GCP audit log). For scripted stages and the manual approval stage it is only possible to specify a group, and this is done at the application/pipeline level. The tool eliminates Helm from the kubernetes cluster as it works with native Kubernetes manifests. It promotes immutable infrastructure, as those artefacts are stored for possible rollbacks. Authorisation/authentication seems to be complex but flexible enough to integrate with a wide variety of systems. Pretty active user group, offering help. Pricing is based on resources used.

CI/CD with GitLab – evaluation

GitLab is one of the popular DevOps platforms out there currently. I am evaluating GitLab 13.7 pre-release features. This post is part of a bigger series with a unified structure. The evaluation is in the context of our existing infrastructure: GitHub + Prometheus + Grafana.

High level overview: 

Authentication/Authorisation:

CI/CD capabilities:

Pricing:

  • Has the concept of minutes included in the plan + buying extra ($10 per 1000 minutes)
  • Pay for storage, $60/10GB; see details
  • Based on my understanding, we need at least Premium $19/user/month.
  • GitLab pricing

I haven’t studied the GitLab offering super profoundly, but for building a new project, I would consider starting with it as it provides complete SDLC support (compared to Spinnaker, which is CI + CD). It acts as SDLC management on top of the cloud provider, providing an easy way to comply with the majority of measures from certifications, e.g. SOC 2, but those are Gold plan features ($99/user/month). This might be pricey, but if you use its ticket management, documentation (instead of, e.g., Jira), roadmap tooling, release notes management and Terraform stage, it seems like a no-brainer!

I see the following challenges:

  • Pipeline deployment ordering, as pipelines run in parallel
  • Shared runners are small machines; the step to registered runners adds admin/infra work
  • The security model is similar to Spinnaker; additionally, it doesn’t allow custom groups, but I guess that you can create custom apps (users)
  • Pricing seems scary; in the end, runners will probably run on your infra and be registered to the platform. OTOH, if you manage to stay on shared runners, you need to buy a lot of build minutes.
  • Storage cost seems high
  • The Docker registry has a 30-day expiry (which can probably be extended) => you will be uploading to your own GCR

I haven’t studied the deployment capabilities in depth:

  • Integration with Helm – probably rendering via helm template and then deploying (see the sketch after this list)
  • Support for deployment strategies – requires appropriate kubernetes object manifests, as everywhere
  • Registered kubernetes clusters seem to have an agent running in them
  • Has more or less all the concepts from Spinnaker
  • Has initial support for Terraform in alpha
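As a rough illustration of the Helm-rendering idea mentioned above, a deploy job in .gitlab-ci.yml might look something like the sketch below. The job name, image and chart paths are hypothetical, and a `deploy` stage is assumed to be declared in the pipeline’s stages.

deploy-staging:
  stage: deploy
  image: registry.example.com/ci/helm-kubectl:latest   # hypothetical image with helm and kubectl installed
  script:
    # render the chart to plain manifests, then apply them to the cluster
    - helm template my-app ./charts/my-app -f values-staging.yaml > manifests.yaml
    - kubectl apply -f manifests.yaml
  environment:
    name: staging
  when: manual   # manual approval gate before the job runs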

Potential pain points:

  • Having the whole pipeline in git (including deployment strategy configurations and approvals) might pose challenges when there is no pure trunk-based development – it requires backporting and is harder to oversee.

GitLab is built on top of plenty of OSS projects, where I can imagine that the integration between your infrastructure and GitLab might be extensive.

The only reasonable scenario is that you fully migrate to GitLab and reduce extra tooling like Asana, GitHub, Confluence, …; for new projects that might be a no-brainer. The migration can be pretty heavy, but you might get some compliance checks for that in a single workspace.

Resources:

CI with GCP Cloud Build – evaluation

Cloud Build is one of the services available on Google Cloud Platform. The evaluation happened in January 2021, and I believe that it is still improving. This post is part of a bigger series with a unified structure.

Overview:

  • Even though Cloud Build labels itself as a CI/CD tool, it lacks the CD features (e.g. deployment strategies, manual approval stages, etc.) – though nobody prevents you from developing those
  • Runs in GCP, with some support for local execution as well
  • Builds are defined by wiring Docker containers together. Execution happens on a single VM; you can upscale the VM to high-CPU machines, up to 32 vCPUs (see the sketch below).
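A minimal cloudbuild.yaml sketch of the container-wiring idea; the image name is illustrative, and the machine type is just one of the high-CPU options Cloud Build offers.

steps:
  # each step is a container image run on the same build VM, sharing /workspace
  - name: 'gcr.io/cloud-builders/docker'
    args: ['build', '-t', 'gcr.io/$PROJECT_ID/my-app:$SHORT_SHA', '.']
images:
  - 'gcr.io/$PROJECT_ID/my-app:$SHORT_SHA'   # pushed to the registry after the steps succeed
options:
  machineType: 'N1_HIGHCPU_32'   # the upscaled 32-vCPU machine mentioned above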

Continuous Integration features:

Pricing:

Summary:

A purely CI system with the capability to build (~ Cloud Build). No triggers for time-based events, so either event-based (commit, tag, …) or manual triggers; a time-based trigger could probably be emulated via a Cloud Function. It has the ability to run locally, which is nice. It scales up to 32-CPU machines. Priced based on build time (clock time). It doesn’t offer approval stages; the security model is based on IAM, and it seems that you cannot grant permissions on a particular configuration/build. It doesn’t have the concept of a pipeline – rather a set of task steps (stages). The definition lives in Git, so LTS branches should be buildable. To have full end-to-end deployment, you need a CD system; this system manages just the “build artefact”.

CI/CD with Jenkins – evaluation

The Jenkins evaluation happened in January 2021, and I believe that Jenkins is still improving. This post is part of a bigger series with a unified structure.

Overview:
Pipeline definition completely lives in Git together with the code ~> Jenkinsfile
– Support for the Jenkinsfile via a Groovy DSL
You can chain the pipelines
– Single pipeline triggered on various branches ~> Multi-branch pipelines (tutorial)
Parallel pipeline stages
– Access to build metadata (e.g. build number, commit hash, …)
Jenkins Configuration as Code plugin
– Managing secrets via the secrets plugin
Audit trail plugin
Try notifier
– Better UI with Blue Ocean
– Tooling – Jenkins Job Builder (Job builder tutorial)
Pull-request Jenkins pipeline
– Deployment topology – master x slave/agent
Jenkins Helm deployment – seems to have autoscaling agents, based on the Configuration as Code plugin
– Manual approvals – seem not so straightforward, via the input option (see the Jenkinsfile sketch after this list)
Jenkins on Google Kubernetes Engine
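For reference, a declarative Jenkinsfile sketch using the input directive as a manual approval gate; stage names, scripts and the submitter group are made up.

pipeline {
    agent any
    stages {
        stage('Build') {
            steps {
                sh './build.sh'               // hypothetical build script
            }
        }
        stage('Deploy to production') {
            // manual approval gate; only members of the given group may approve
            input {
                message 'Deploy to production?'
                submitter 'release-managers'  // hypothetical approver group
            }
            steps {
                sh './deploy.sh production'   // hypothetical deploy script
            }
        }
    }
}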

Security model:
– By default there are no roles – everything is in a single view -> plugins needed
GitHub OAuth and here
Role-based authorisation plugin – (strategy plugin – role) – that probably doesn’t work together with GitHub OAuth, but it can work with matrix access

Resources:
Jenkins for beginners

Summary:
Jenkins is one of the most popular open-source CI/CD systems. It needs to be self-hosted, but even the Kubernetes plugin seems to have agent autoscaling capabilities, which should be cost-effective. It seems that the whole Jenkins configuration can be bootstrapped from code.

The security model has various options; I am not sure how they all fit together, e.g. GitHub OAuth + roles and securities, but there are multiple ways, e.g. the control matrix.

It has the concepts of pipelines and jobs. Pipelines are the next generation, where they live completely in the code base ~> LTS should be OK. It seems to have some basic manual approval stages; the question is how that goes together with auth. It has the concept of multi-branch jobs/pipelines = a single definition for a whole bunch of branches, where the definition is dynamically taken from the source.

CD capabilities are somewhat simplistic – no advanced release strategies like rollback, monitoring, etc. Those would probably need to be scripted.

Kubernetes Helm operational models

Helm, a package manager for kubernetes, went through some evolution in the past several years. It evolved from Helm 2 to Helm 3, where Helm 2 reached end-of-life nearly a year ago, so I would be pretty late to the party. Without going too deep into Helm internals, I would mention just the main feature: the removal of Tiller, a component that acted as a middle man and caused many troubles (requiring a cluster around for many helm commands, security issues as Tiller ran with the Kubernetes RBAC cluster-admin role, etc.). And there is much more; if interested, the official pages provide a good summary of Helm 3. In this short blog post, I would like to give a quick overview of how Helm 3 works from a high-level perspective, what the potential Helm operational modes are and the risks associated with them.
The Helm 3 architecture is lightweight (compared to Helm 2), schematically described in the following picture.

  • Helm binary installed on the client machine, interacting with the kubernetes cluster API.
  • Helm metadata objects stored either as ConfigMap or Secret kubernetes objects (depending on the configuration options)

The Helm binary provides a CLI for Helm. Its main function is to render manifests based on the Helm manifest templates and apply them to the kubernetes cluster, preserving the revision history for possible rollbacks via the helm CLI. In addition, the helm metadata objects, whose payload is a serialized protocol buffer, contain all the data needed to render the requested kubernetes manifests based on the helm package specification and the provided values file, which acts as variables to the helm package. You can list the history of a helm release via (e.g. the prometheus deployment in the prometheus namespace):

$ helm history prometheus -n prometheus
REVISION	UPDATED                 	STATUS    	CHART            	APP VERSION	DESCRIPTION
113     	Fri Apr 23 12:55:11 2021	superseded	prometheus-11.0.0	2.16.0     	Upgrade complete
114     	Wed Apr 28 13:18:29 2021	superseded	prometheus-11.0.0	2.16.0     	Upgrade complete
115     	Wed Apr 28 13:49:13 2021	superseded	prometheus-11.0.0	2.16.0     	Upgrade complete
116     	Wed Apr 28 15:23:38 2021	superseded	prometheus-11.0.0	2.16.0     	Upgrade complete
117     	Wed Apr 28 17:03:15 2021	superseded	prometheus-11.0.0	2.16.0     	Upgrade complete
118     	Fri Apr 30 16:50:13 2021	superseded	prometheus-11.0.0	2.16.0     	Upgrade complete
119     	Mon May  3 16:10:01 2021	superseded	prometheus-11.0.0	2.16.0     	Upgrade complete
120     	Fri May  7 11:49:50 2021	superseded	prometheus-11.0.0	2.16.0     	Upgrade complete
121     	Fri May 14 15:06:13 2021	superseded	prometheus-11.0.0	2.16.0     	Upgrade complete
122     	Thu May 20 10:45:56 2021	deployed  	prometheus-11.0.0	2.16.0     	Upgrade complete

and find the corresponding secrets in the namespace where the package is applied:

$ kubectl get secrets -n prometheus
NAME                                                  TYPE                                  DATA   AGE
sh.helm.release.v1.prometheus.v113                    helm.sh/release.v1                    1      52d
sh.helm.release.v1.prometheus.v114                    helm.sh/release.v1                    1      47d
sh.helm.release.v1.prometheus.v115                    helm.sh/release.v1                    1      47d
sh.helm.release.v1.prometheus.v116                    helm.sh/release.v1                    1      47d
sh.helm.release.v1.prometheus.v117                    helm.sh/release.v1                    1      47d
sh.helm.release.v1.prometheus.v118                    helm.sh/release.v1                    1      45d
sh.helm.release.v1.prometheus.v119                    helm.sh/release.v1                    1      42d
sh.helm.release.v1.prometheus.v120                    helm.sh/release.v1                    1      38d
sh.helm.release.v1.prometheus.v121                    helm.sh/release.v1                    1      31d
sh.helm.release.v1.prometheus.v122                    helm.sh/release.v1                    1      25d

When you upgrade or install a package, Helm renders the manifests (performing manifest API validation) and applies them to the kubernetes cluster API. The Kubernetes API performs the upgrades, fills in missing default options, and may silently ignore unrecognised or wrong settings (depending on configuration). Lower-level kubernetes objects are derived from higher-level objects, e.g. replica sets from deployments. All this results in a manifest that is actually running in the kubernetes cluster, as depicted in the picture. When you ask Helm to provide a manifest via

helm get manifest release-name -n namespace

it will provide the requested kubernetes manifest that resides in the kubernetes Helm secret metadata. That is not exactly what is running in the cluster. To get the manifests that are actually running in the cluster:

kubectl get all -n namespace

This command will provide all kubernetes objects running in the cluster, including derived ones. By comparing those two, you can see the differences. If you consider that the kubernetes cluster is upgraded over time as well, you realise that what is actually running in the cluster can be quite surprising. That surprise usually manifests during disaster recovery. That is the reason why it is highly preferable to eliminate Helm abstractions from the deployment chain. Kubernetes natively supports versioning via the revision history feature, which is specific to a kubernetes object, e.g. a deployment.

kubectl rollout history deployment deployment_name

This command captures only actual differences in the deployment manifest, so the revision count might not be the same as the number of revisions in Helm. Also, the Helm revision history bundles all the kubernetes objects into a single revision.
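A native rollback to one of those revisions can then be done without Helm, for example (the revision number is illustrative):

kubectl rollout undo deployment deployment_name --to-revision=2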
To move to deployments of native kubernetes manifests, Helm offers a feature to render the manifests using a package and values file via the helm template command. That completes the picture of how Helm operates. A simplified view could be summarised as (a minimal sketch of this flow follows the list):
1. render the manifests
2. kubectl apply the rendered manifests
3. store the helm revision metadata
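A minimal sketch of the first two steps without keeping the Helm metadata around could look like this (release and chart names are illustrative):

helm template my-release ./chart -f values.yaml > manifests.yaml
kubectl apply -f manifests.yaml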

Helm 3 provides good flexibility for deployments and leaves important decisions to SREs while keeping access to community-maintained packages.

TIP: If you develop some helm packages in-house, adding a helm lint command to your PR checks allows discovering issues during the PR review process. Don’t forget to check the Helm advanced features.

If you like the content or have some experiences or questions, don’t forget to leave a comment below and follow me on Twitter.


Developer's path to seniority

What does seniority mean for a software engineer? Is it familiarity with frameworks and libraries, or something different? In this post, I will try to summarise my perspective and how it has been changing for me. I got a lot of views during the last three years, when I led or participated in a decent number of interviews for engineering roles, including C-level executives. Let’s state the obvious: fair and good hiring is super hard. During the hiring process, you try to assess the seniority of the candidate (among other qualities, for sure) according to the company career ladder. As a reference, an industry-respected and well-known ladder description can be used: the Google career ladder. If you haven’t had an opportunity to familiarise yourself with it, take the time and do so; it is definitely worth understanding. The Google career ladder construction is anchored in Google’s perfect team study. This post doesn’t attempt to mimic Google’s research but rather provides a starting point and raises curiosity. Excellent management is where the resulting outcome of the team is far greater than the pure sum of individual contributions.

Seniority levels are usually a combination of technical “hard” skills and other skills collectively referred to as “soft” skills. Unfortunately, relatively few companies go beyond this level of explanation. 

The junior level is characterised by a strong desire to learn new tools, frameworks and techniques. It is often connected with a black-and-white view of problems; when solving a task, junior developers come up with a single possible solution, or more solutions with only slight modifications. There is nothing wrong with being at that stage, and healthy teams often contain some junior developers, as they bring fresh trends and passion to the team. However, having more than 25% junior developers in the team is challenging, as they require more attention and slow the team down or affect quality. Be careful with this mix.

Gaining seniority then proceeds along a few axes – technical skills, the context a developer uses when evaluating possible solutions, and the ability to communicate effectively. A junior developer has limited technical skills, and their context is limited to the code they are writing; they communicate solely in terms of code, requirements or tickets. Context is a vital vehicle for guiding a developer along the two remaining axes. As technical skills grow, the number of possible solutions grows as well. Context can be divided into, but is not limited to, the following scopes:

Codebase – Limits the knowledge to the current codebase, the technologies and design principles used, and thinking in those low-level terms. This definition can be confusing, as the project codebase is involved every time as the ultimate source of truth, but what differs is the level of abstraction. Developers grow by operating on bigger codebases and gaining experience with different project codebases.

Development habits and practices – This area covers topics like where to apply which kind of tests. Where are the areas where you can compromise when chasing time, and what are the real costs? Tech debt management. Essential features for long-term supportability. These items can be categorised under risk management and damage control practices.

System design – Designing a system for performance, scalability, security and similar system-wide aspects. Defining threat zones and expected failure scenarios. Thinking in terms of consistency and the ability to enforce those policies at the platform level. What can be handled at the infrastructure level vs what is necessary to address in the application code?

Different mindsets/roles – Understanding of other functions and the associated mindsets of people participating in the development process (system engineers, quality assurance, machine learning engineers, project managers, etc.). What helps the most in this area is trying out the majority of those roles. Getting the role mindset will simplify cross-team communication.

Software development life cycle – The ability to foresee and expect requirements which might not be implicitly obvious or stated in the desired functionality description. Understanding of product lifecycle stages, starting from ideation till end-of-life. Items like supportability, troubleshooting, etc. What helps with these aspects is to take the perspective of technical support personnel and think in terms of what-if: how difficult would it be to discover that component/feature X doesn’t work as expected?

Technology overview – Knowledge of technologies and their alternatives gives you the lego blocks for building a final solution architecture. This is especially important for cloud-based products, as you are trading cost for speed of delivery. Similarly, the same applies to programming languages, which have specific features, design patterns, etc.

Cross disciplines – This context is pretty similar to `Different mindsets/roles`, with the only difference that it is focused on the actual skills necessary in those roles and the effects the tools have on team dynamics. Disciplines like development, operations, infrastructure, marketing and similar.

Cross-level and cross-department awareness – Understanding the main objectives, areas of competence and levels of abstraction those people operate on. Unfortunately, even though role titles repeat, competencies and responsibilities differ significantly between companies. That is what makes execution different for each company and is one of the secret sauces of success. The effects are visible in accounting and balance sheets.

Management and leadership skills – Guiding the team without a command-and-control approach. Leading change throughout the team, getting the buy-in to support the change and move things forward.

Business domain – Understanding the industry you operate in helps you drive better value for your customers by combining your technical abilities with domain knowledge.

Business models – Understanding how the business makes money on the product you build helps you to understand key aspects of the technical solution in terms of desired points of flexibility, scalability and cost management.

Knowledge and experience in those scopes provide a better context for making decisions in the bigger picture and finding a more strategic solution for the given conditions. They help you drive communication more effectively, as you can think more like your counterpart and understand their point of view. Lastly, you should be able to come up with a plan of delivery iterations, with value added in each iteration and mitigated risk.

The more senior you get in technical skills and the further you proceed on the career ladder, the more the role starts to rely on your leadership skills. A vast number of companies then force people into manager roles, but some (including Google, Facebook, etc.) allow continuing on the technical track and label those positions as Principal or Staff Engineers. The role of those engineers is usually to improve the engineering culture and best practices of the teams, apart from solving complex engineering tasks.

I neglected at least one aspect in my writing: the effect of personal characteristics and perspectives on particular roles. I am not going into those but instead refer to the great blog by Neil on people topologies.

What is your perspective on the “developer’s” seniority? Let me know in the comment section below, or you can reach me on Twitter.

Resources I found the best on these topics, apart from those already mentioned in the text:

Building Secure and Reliable Systems: Best Practices for Designing, Implementing, and Maintaining Systems

Accelerate: The Science of Lean Software and DevOps: Building and Scaling High Performing Technology Organizations

Team Topologies: Organizing Business and Technology Teams for Fast Flow

The Five Dysfunctions of a Team

Managing Humans

The Invincible Company

Value Proposition Design: How to Create Products and Services Customers Want

The Business Model Navigator

If you like the content or have some experiences or questions, don’t forget to leave a comment below and follow me on Twitter.


Scaling Terraform (IaC) across the team

HashiCorp Terraform is a popular tool for managing your cloud infrastructure as code (IaC) in a cloud-agnostic way (the same tool for various cloud platforms). Instead of unifying all capabilities for different cloud platforms, the core concepts are exposed to the end user via the Terraform provider concept. Terraform offers providers for all major cloud vendors and for other cloud services and technologies as well, e.g. Kubernetes.

This blog post doesn’t aim to be an introduction to Terraform concepts (the official documentation is quite OK) but instead shares experience with using Terraform in a distributed team, tools that come in handy, and all the things that make life easier. Even though HashiCorp offers Terraform Enterprise, this option is used quite rarely, at least on small/er projects, so we won’t be discussing it here. I openly admit that I have zero experience with this service, so I cannot objectively compare. I will solely focus on using the open-sourced part of the project and Terraform version 0.12.x and higher.

Terraform maintains the state of the infrastructure it manages in the state file. The format of the Terraform state file is version-dependent, without strict rules on version compatibility (at least I wasn’t able to find one that was reliably followed and guaranteed). Managing the state file poses two main challenges:
1) manage/share the state across the team
2) control/align the version used across the team

Different teams solve the first aspect differently. Some commit the state file alongside the configuration to the version control system, which is far from ideal, as there might be multiple copies of such a resource across the team and it requires some team coordination. On top of that, the state file contains sensitive information which is impossible to mask and as such doesn’t belong in the source control system. A much better approach is using a Terraform remote backend, which allows a truly concurrent approach; capabilities depend on the concrete implementation used. The backend can be changed from local to remote easily, as the state is migrated automatically. The only limitation is that merging and splitting a state file is allowed only for a locally managed state.
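Since the majority of our workloads run in GCP, a remote backend definition could look like this sketch (the bucket name and prefix are placeholders):

terraform {
  backend "gcs" {
    bucket = "my-company-terraform-state"   # placeholder bucket name
    prefix = "prod/network"                 # one prefix per configuration
  }
}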

Terraform version management is centred around providing frictionless version upgrades for different Terraform configurations, aligned across the team, while assuring that the state file won’t get upgraded accidentally. To make sure that your state file won’t get upgraded accidentally, put a version restriction into every configuration managed, e.g.

terraform {
  required_version = "0.12.20"
}

To align the team on a uniform Terraform version for every single configuration managed, use a tool for Terraform version management, e.g. tfenv. Put the desired version into a .terraform-version file located in the folder together with the configuration. Tfenv automatically switches to the appropriate version as needed; when a new version is encountered, you need to run tfenv install to download it. If you want to check the versions available:

$ tfenv list
* 0.12.20 (set by /Users/jakub/test/terraform/.terraform-version)
  0.12.19
  0.12.18

As the number of resources or the organisation grows, so does the state file, which leads to increased time for configuration synchronisation and competition for the lock on the remote state file. To increase throughput and allow a team DevOps mode (clear ownership of the solution from end to end), you might want to divide the infrastructure and associated state files into smaller chunks with clear boundaries. To keep your configuration DRY, hierarchical configuration tools like Terragrunt come to the rescue and reduce repetition; a rough sketch follows.
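To give an idea of the shape this takes, a child terragrunt.hcl typically just includes the parent configuration (which holds the shared remote state definition) and points to a module. The module path and inputs below are purely illustrative.

# child terragrunt.hcl
include {
  path = find_in_parent_folders()      # inherit remote state and provider settings
}

terraform {
  source = "../../modules/network"     # illustrative module path
}

inputs = {
  environment = "prod"                 # illustrative input variable
}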

A growing number of users poses challenges as well as benefits, which are the same as for application code written in, e.g., Java or any other programming language. That is the real motivation for Infrastructure as Code (IaC). How do you set up standards and best practices on the project? Terraform offers a bunch of embedded tools. To make sure that code is properly formatted according to the standards, use the fmt utility, which is pluggable into your CI/CD pipeline or pre-commit hooks.

terraform fmt --recursive --check --diff

For your reusable Terraform modules, it is good to make sure they are valid, though this doesn’t catch all the bugs, as it doesn’t check against the cloud APIs, so it doesn’t replace integration tests.

terraform validate

Getting an idea of what will change, a diff of your current infrastructure against the proposed changes, can be easily achieved via the generated plan:

terraform plan

Enforcing security and standards is a lot easier with IaC, as you can use tools like tflint or checkov, which allow writing custom policies. We conclude the tool section with the awesome Terraform list, which provides a great source if you are looking for something specific.
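Both tools can be wired into the same CI step as fmt and validate; basic invocations look roughly like this (run from the root of a configuration):

tflint          # lint the Terraform configuration in the current directory
checkov -d .    # run static policy checks over the configuration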

In this blog post, we just scratched the surface of Terraform tooling and completely skipped design and testing, which are topics for separate posts. What are your favourite tools? What did you find really handy? Leave a comment, share your tips, or you can ask me on Twitter.


Who monitors Prometheus?

Of course, I am not talking about Greek mythology but about the popular monitoring tool Prometheus, which is based on Google’s internal monitoring tool called Borgmon.

As some teams invest significant effort in making their solution transparent, easy to troubleshoot and stable by instrumenting their product and technology, which is then connected to their monitoring system, only one question remains: who monitors the monitoring? If the monitoring is down, you won’t get any alarms about a malfunction of the product, and you are back at the beginning. So you add monitoring of the monitoring system. The tough availability game starts: if both components have 95% availability, you effectively achieve 90%. I believe you get the rules of the game we play.

How is monitoring of monitoring set up in practice? Well, implementations differ, but the core idea is the same and works well even in the army! It is called a dead man’s switch. It is based on the idea that if we are supposed to receive a signal to trigger an alarm at an unknown moment, we need to guarantee that the signal can trigger the alarm at any time. Reversing the logic for triggering the alarm gives us that guarantee: we have a signal which we receive constantly, and the alarm is triggered when we stop receiving it. So simple! This principle (heartbeat) is used in multiple places, e.g. clustering. In the army, they use it as well.

(Figure: dead man's switch)

Some monitoring tools have this capability built in, but Prometheus doesn’t. So how do we achieve this in order to sleep well, knowing the watcher is watching? We need to set up a rule that is constantly firing in Prometheus. It can be a rule like this:

    - name: monitoring-dead-man
      rules:
      - alert: "Monitoring_dead_man"
        expr: vector(1)
        labels:
          service: deadman
        annotations:
          summary: "Monitoring dead man switch should always fire alert"
          description: "Monitoring dead man switch for probing alert path"

Now we need to create the heartbeat. The rule on its own would fire once, then it would be propagated to the Prometheus Alertmanager (the component responsible for managing alerts), and all would be over. We need a regular interval for our check-ins. The interval is given by the availability you want to achieve, as it all adds to the reaction time you need. You can achieve this behaviour with a special route for your alert in the Alertmanager:

      - receiver: 'DEAD-MAN-SNITCH'
        match:
          service: deadman
        repeat_interval: 5m

Now we need to achieve the reverse alarm trigger logic. In our particular case, we use Dead Man’s Snitch, which is great for monitoring batch jobs, e.g. a data import to your database. It works in such a way that if you do not check in within the specified interval, it triggers, with the lead time given by the interval. You can specify the rules for when to trigger, but those are the details of the service you use. All you need to add to the Alertmanager configuration is a receiver definition checking in to the particular snitch, as follows:

      - name: 'DEAD-MAN-SNITCH'
        webhook_configs:
          - url: 'https://nosnch.in/your_snitch_id'

The last thing you need to do is integrate the trigger with the system you use for on-call rota management, e.g. PagerDuty. For this example, there is an integration between Dead Man’s Snitch and PagerDuty.
As I am not an infra or DevOps guy at heart, it took me some time to figure it out and connect the bits. I hope you found it useful. If you have other experiences or a different way to achieve this behaviour, leave a comment below or you can reach me on Twitter.


Kubernetes Helm features I would wish to know from day one

Kubernetes Helm is a package manager for Kubernetes deployments. It is one of the possible tools for deployment management on the Kubernetes platform. You can imagine it as RPM in the Linux world, with package management on top of it like the apt-get utility.

Helm release management, the ability to install or roll back to a previous revision, is one of the strongest selling points of Helm and, together with strong community support, makes it an exciting option. Especially the number of prepared packages is amazing and makes it extremely easy to bootstrap a tech stack on a kubernetes cluster. But this article is not supposed to be a comparison between kubernetes tools; it rather describes the experience I’ve had while working with Helm and the limitations I found, which for someone else might be quite OK, but having the ability to use the tool in those scenarios would be an additional benefit.
Helm is written in Go with the usage of Go templates, which brings some limitations to the tool. Helm works on the level of string literals, and you need to take care of quotation, indentation, etc. to form a valid kubernetes deployment manifest. This is a strong design decision from the creators of Helm, and it is good to be aware of it. Secondly, Helm merges two responsibilities: rendering a template and providing a kubernetes manifest. Though you can nest templates or create a template hierarchy, the rendered result must be a Kubernetes manifest, which somewhat limits the possible use cases for the tool. Having the ability to render any template would extend the tool’s capabilities, as quite often the kubernetes deployment descriptors contain some configuration where you would appreciate type validation or the possibility of rendering it separately. At the time of writing this article, Helm version 2.11 didn’t allow that. To achieve this, you can combine Helm with other tools like Jsonnet and tie those together.
While combining different tools, I found the following advanced Helm features quite useful; they greatly simplified and provided some structure to the resulting Helm templates.

Template nesting (named templates)
Sub-templates can be structured in helper files starting with an underscore, e.g. `_configuration.tpl`, as files starting with an underscore don’t need to contain a valid kubernetes manifest.

{{- define "template.to.include" -}}
The content of the template
{{- end -}}

To use the sub-template, you can use

{{ include "template.to.include" .  -}}

Where “.” passes the actual context. When there are problems with the context, the trick with `$` will solve the issue (see the example below).
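A typical place where this helps is inside a range loop, where “.” is rebound to the loop item while `$` still refers to the root context. The value names below are made up for illustration.

{{- /* .Values.components and .Values.global.registry are hypothetical values */ -}}
{{- range .Values.components }}
image: {{ $.Values.global.registry }}/{{ .name }}
{{- end }}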

To include a file, all you need is to specify a path:

{{ $.Files.Get "application.conf" -}}

Embedding a file as configuration

{{ (.Files.Glob "application.conf").AsConfig  }}

It generates key-value pairs automatically, which is great when declaring a ConfigMap; for example:
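A sketch of such a ConfigMap template; the release-based name is illustrative, and the indent filter keeps the generated keys valid YAML.

apiVersion: v1
kind: ConfigMap
metadata:
  name: {{ .Release.Name }}-config   # illustrative name
data:
{{ (.Files.Glob "application.conf").AsConfig | indent 2 }}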

To provide a reasonable error message and make some config values mandatory, use the required function:

{{ required "Error message if value not specified" .Values.component.port }}

Accessing a key from the values file which contains a dot, e.g. application.conf:

{{- index .Values.configuration "application.conf" }}

IF statements combining multiple values

{{- if or (.Values.boolean1) (.Values.boolean2) (.Values.boolean3) }}

Wrapping the references to values into braces solves the issue.

I hope that you found these tips useful. If you have any suggestions, leave a comment below, or you can reach me with further questions on Twitter.
