Desire for high availability
Scalability, Availability, Resilience – those are just common examples of computer system requirements which forms an overall application architecture very strongly and have a direct impact to “indicators” such as Customer Satisfaction Ratio, Revenue, Cost, etc. The weakest part of the system has the major impact on those parameters. The topic of this post availability is defined as the percentage of time that a system is capable of serving its intended function.
In BigData era Apache Hadoop is a common component of nearly every solution. As the system requirements are shifting from purely batch-oriented systems to near-to-real-time systems this just adds pressure on systems availability. Clearly, if the system in batch mode runs every midnight than 2 hours downtime is not such a big deal as opposed to near-to-real-time systems where result delayed by 10 min is pointless.
I this post I will try to summarize Hadoop high availability strategies as a complete and ready to use solutions I encountered during my research on this topic.
In Hadoop 1.x the well-known fact is that the Name Node is a single point of failure and as such all high availability strategies try to cope with that – strengthen the weakest part of the system. Just to clarify widely spread myth – Secondary Name Node isn’t a backup or recovery node by nature. It has different tasks than Name Node BUT with some changes, Secondary Name Node can be started in the role of Name Node. But neither this doesn’t work automatically nor that wasn’t the original role for SNN.
High Availablity strategies
High availability strategies can be categorized by the state of standby: Hot/Warm Standby or Cold Standby. This has a direct correlation to failover(start up) time. To give a raw idea(according to doc): Cluster with 1500 nodes with PB capacity – the startup time is close to one hour. Startup consists of two major phases: restoring the metadata and then every node in HDFS cluster need to report block location.
Hadoop 1.x HA strategies
The typical solution for Hadoop 1.x which makes use of NFS and logical group of name nodes. Some resources claim that in case of NFS unavailability the name node process aborts what would effectively stop the cluster. I couldn’t verify that fact in different sources of information but I feel important to mention that. Writing name node metadata to NFS need to be exclusive to a single machine in order to keep metadata consistent. To prevent collisions and possible data corruption a fencing method needs to be defined. Fencing method assures that if the name node isn’t responsive that he is really down. In order to have a real confidence, a sequence of fencing strategies can be defined and they are executed in order. Strategies range from simple ssh call to power supply controlled over the network. This concept is sometimes called shot me in the head. The failover is usually manual but can be automated as well. This strategy works as a cold standby and Hadoop providers typically provide this solution in their High Availability Kits.
Because of the relatively long, start up time of back up name node some companies (e.g. Facebook) developed their own solutions which provide hot or warm standby. Facebook’s solution to this problem is called avatar node. The idea behind is relatively simple: Every node is wrapped to so-called avatar(no change to original code base needed!). Primary avatar name node writes to shared NFS filer. Standby avatar node consists of secondary name node and back up name node. This node continuously reads HDFS transaction logs and keeps feeding those transactions to encapsulated name node which is kept in safe mode which prevents him from performing any active duties. This way all name node metadata are kept hot. Avatar in standby mode performs duties of secondary name node. Data nodes are wrapped to avatar data nodes which send block reports to both primary and standby avatar node. Failover time is about a minute. More information can be found here.
Another attempt to create a Hadoop 1.x hot standby coming to form China Mobile Research Institute is based on running synchronization agents and sync master. This solution brings another question and it seems to me that it isn’t so mature and clear as the previous one. Details can be found here.
Hadoop 2.x HA strategies
An ultimate solution to high availability brings Hadoop 2.x which removes a single point of failure from a different architecture. YARN (Yet Another Resource Negotiator) also called MapReduce 2. And for HDFS there is another concept called Quorum Journal Manager (QJM) which can use NFS or Zookeeper as synchronization and coordination framework. Those architectural changes provide the option of running two redundant NameNodes in the same cluster in an Active/Passive configuration with a hot standby.
This post just scratches the surface of Hadoop High Availability and doesn’t go deep in detail daemon but I hope that it is a good starting point. If someone from the readers is aware of some other possibility I am looking forward to seeing that in the comment section.