HDFS Architecture Explained

Inspired from Google File System which was developed using C++ during 2003 by Google to enhance its search engine, Hadoop Distributed File System (HDFS), a Java based file system, becomes the core components of Hadoop. With its fault tolerant and self healing features, HDFS enables Hadoop to harness the true capability of distributed processing techniques by turning a cluster of industry standard servers or commodity servers into massively scalable pool of storage. Just to add another feather in its cap, HDFS can store structured, semi-structured or unstructured data in any format regardless of schema and is specially designed to work in an environment where scalability and throughput is critical.

The entire Hadoop HDFS cluster is divided into two parts.

  1. Master:  This part of cluster contains a node which is termed as Name Node.  There can be one or two optional Node which can be used for fault tolerance purpose which is termed as Secondary / Standby Name Node.
  2. Slave:  All servers which are a slave are called Data Node.  The no of data node can be anything which suits the business requirements.  The size of data which can be hold by HDFS in a cluster is decided by the data nodes disk size which is allocated in HDFS.  Any additional storage requirement can be fulfilled by just adding one or more node of the required disk size.

Because Name Node is the central piece of the entire HDFS cluster, it is essential to take a good care of this node.  This machine is recommended to be a good quality server with having lot of RAM.  This is the node which keeps mapping of “files to block” and “blocks to the data nodes”.  In short, Name node stores the file system metadata.

This task is being done on Name nodes using two tables maintained in memory.  One of which is maintained for “Files to block mapping”.  If there are any block corruption happens at data node this table in memory gets updated on Name node and an action is taken accordingly.  Second table is maintained to track the mapping of “blocks to data nodes”.  If any data node is not responding within specified time, name node update the second table and an action is taken based on this information.  Any communication by client application to this cluster is always done using name node.

On the one hand, this machine is the heart of this entire cluster, on the other hand this becomes the Achilles’ heel of the entire cluster too as if this machine got any failure, entire cluster does not have any clue where to go and what action needs to be taken.  In simple word, everything goes down with this one machine. This is termed as single point of failure (SPOF) . Further, even if the cluster can accommodate more machines, addition of machine is somewhat limited by memory of the Name Node and may be a bottleneck beyond 4000  odd machines because name node stores all metadata in its memory.

Very soon, starting with version 2.0 and beyond, Hadoop came up with solutions to this.  To provide high availability of HDFS cluster, Hadoop devise the concept of active/passive configuration with a hot standby and thus allows adding additional Name nodes in the same cluster to quickest possible fail over.


To understand the concept of Name Node (Active), Secondary Name Node and Standby Name Node (Passive), let’s understand the purpose of the secondary name node first.

There is one redundant node available which was known as secondary name node in both version of Hadoop (1.0 and 2.0).  Because this is the only redundant node available in the Hadoop 1.0, this node was used in fail over. But fail over could be possible with this structure only with interfere of Hadoop administrator manually.

The file system metadata is maintained by HDFS on Name node with the help of two files.

  1. editlog:  This file is getting updated by Hadoop about each and every action taken by HDFS for example adding a new file, deleting a file, moving it between folders.  Because it records all activity of HDFS, this file quickly grows a lot.
  2. fsimage:  HDFS read this file at the time of boot to understand the checkpoint status of the cluster.

For successful startup of HDFS cluster these two files has to be in sync, hence, Hadoop writes all the new records of editlogs to fsimage first, just before starting up cluster. Once this process completed, HDFS reads the status of cluster from fsimage and start the cluster. HDFS does synchronization of only at startup time by Name node.

Now consider a scenario where the name node is running for month before we brought it down for maintenance.  Once maintenance completed and we try to start name node, the editlog and fsimage synchronization process starts.  Since name node was running for a month, editlog size is very high and thus, it will require a lot of downtime to start the HDFS even if the maintenance time is very less.

Concept of secondary name node comes into picture to avoid this increased downtime.  The function of secondary name node is to wake up and read the editlog from name nodes for new records created in the last one hour.  If changes found, it asks for existing fsimage from name node, reload it in its own namespace, update it and send it back to name node to reload it in the name node namespace.

In the new high availability architecture, HDFS allowed us to add another name node which it calls as passive/standby name node.  Where task of secondary name node is to update the fsimage of the Name node, task of standby name node is to reload fsimage into its own name space as and when the fsimage of Name node got updated.  Data nodes are communicating to both nodes using heartbeat instead of only name node in case of SPOF architecture.

The restriction, here, is an only one name node can be active at a time.  This restriction is imposed simply to split-brain scenario as the namespace state would quickly diverge between the two and if both will be active it will be risking data loss or other incorrect results. HDFS does this with the help of “Journal Nodes” which allow only one name node to write capability at a time.  File editlog is stored on shared storage to get accessed by standby name node at every point of time.  As soon as failure occurs at Name node, standby name node took over the functionality of “Journal Nodes” and become active.  Rests are already in sync by standby name node, hence, the cluster proceeds with the current processing without any loss of data or incorrect result.


Leave a Reply

Your email address will not be published. Required fields are marked *