Because of the limitations of traditional enterprise data-warehousing tools, organizations could not consolidate their data in one place while maintaining fast processing. Traditional ETL jobs may take hours, days, or sometimes even weeks to run. The performance of these tools is constrained by two hardware limitations.
- Vertical hardware scalability: Hardware can be scaled vertically only up to a point, beyond which no further addition is possible. For example, a machine's RAM may be expandable to 64 GB or so, but beyond that limit no more RAM can be added.
- Hardware failure: In such an architecture, the failure of even a small component can bring down the entire processing pipeline.
Hadoop overcomes these limitations with a distributed parallel-processing design that can be built from low-cost commodity machines. This design relies on horizontal scaling: capacity grows simply by adding new commodity machines, with no practical limit. Best of all, such a cluster is fault tolerant and can accommodate one or more machine failures without much of a performance hit.
By harnessing the true capability of distributed parallel processing, Hadoop not only enables an organization to decentralize its data and capture it at a very fine granularity, but also dramatically speeds up processing on that decentralized data. With Hadoop, an organization with huge data sets can obtain in minutes analyses that traditional ETL tools took hours to produce on far less data.
On the one hand, Hadoop simplifies an organization's life by letting it collect all of its data, for all time, without worrying about size or processing speed; on the other hand, it gives the organization the power to analyze that data quickly, in whatever way best improves its performance.
Hadoop accomplishes all of this through four core components, which together give its architecture an elegant design:
- HDFS (Hadoop Distributed File System): This file system lays the foundation of Hadoop. It provides reliable data storage across all the nodes of a Hadoop cluster, linking them together into a single logical file system. Adding a new machine to the cluster simply increases the available storage.
- YARN (Yet Another Resource Negotiator): Just as HDFS unites the storage spread across all the nodes of a cluster into one file system, YARN does the same with CPU and memory. It acts like Hadoop's operating system: it monitors and manages workloads, implements security controls, and manages Hadoop's high-availability features.
- MapReduce: This programming paradigm is the heart of Hadoop, enabling tremendous scalability across hundreds or thousands of servers in a cluster. It combines two separate, distinct tasks (Map and Reduce) that let Hadoop take a large data set, spread it across multiple nodes, process the pieces in parallel, and combine the results for the user as quickly as possible.
- Hadoop Common: This module contains the utilities that support the other Hadoop components and allow them to interact with each other.
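To make the Map and Reduce phases concrete, here is a minimal sketch of the MapReduce paradigm as a plain word-count in Python. This is not the Hadoop API itself; the function names (`map_phase`, `shuffle`, `reduce_phase`) and the sample input are illustrative assumptions, used only to show how work splits into per-record mapping, grouping by key, and per-key reduction, the same three steps a Hadoop cluster distributes across nodes.

```python
from collections import defaultdict

def map_phase(document):
    # MAP: emit a (word, 1) pair for every word in this input split.
    for word in document.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    # SHUFFLE: group emitted values by key, as the framework does
    # between the map and reduce phases.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    # REDUCE: combine all values emitted for one key into a single result.
    return (key, sum(values))

# Two "splits" standing in for data blocks stored on different nodes.
splits = ["Hadoop stores data", "Hadoop processes data"]
mapped = [pair for doc in splits for pair in map_phase(doc)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(counts)  # {'hadoop': 2, 'stores': 1, 'data': 2, 'processes': 1}
```

In a real cluster, each `map_phase` call would run on the node holding that data block, and reducers for different keys would run in parallel on different machines; the logic, however, is exactly this.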