Journey of Hadoop

At the outset of the twenty-first century, around 1999-2000, the internet was evolving faster than ever, driven by the growing popularity of XML and Java.  Although the search engine technologies of the day worked well enough, the World Wide Web was growing at a dizzying pace, and a better open source search engine was needed to cater to the internet's future growth.  Project Nutch at the Apache open source organization, started by Doug Cutting and Mike Cafarella, was a response to this growing need.  Doug was director of search at the Internet Archive, and Mike was a graduate student at the University of Washington.

It took the two of them a year of work before Nutch started working; hundreds of millions of pages had been indexed, but with the limitation that it could only run across a handful of machines.  Further, it required round-the-clock monitoring to avoid failure.

While the duo was hard at work, Google impressed them during 2003-2004 by publishing two papers, "The Google File System (GFS)" and "MapReduce", which led them to add a distributed file system (DFS) and MapReduce to Nutch.  The best part was that they could simply set it running on between 20 and 40 machines.  But there were still miles to go.

Meanwhile Yahoo!, equally impressed with GFS and MapReduce, wanted to build an open source technology around these two ideas.  Yahoo!'s Eric Baldeschwieler (E14) hired Doug along with many others, and a new project was split off from Nutch, with Doug responsible for the Apache open source liaison and E14 looking after engineers, clusters, users, and so on.  Doug named this new project Hadoop after his son's yellow toy elephant, which the boy called Hadoop.

Finally, with great leadership and the resources mobilized by Yahoo!, Hadoop reached web scale in 2008.

Since its inception, Hadoop has become one of the most talked-about technologies for its ability to store and process huge amounts of structured and unstructured data quickly.  It processes big data in a distributed fashion on large clusters of commodity hardware.  Essentially, it accomplishes two tasks:

  1. Massive data storage
  2. Faster processing
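The processing side of that pair is Hadoop's MapReduce model.  As a rough illustration (a plain-Python sketch of the idea, not the actual Hadoop API), here is the classic word-count job expressed as explicit map, shuffle, and reduce phases:

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in the input."""
    return [(word, 1) for word in document.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine the values for one key (here, sum the counts)."""
    return key, sum(values)

documents = ["hadoop stores big data", "hadoop processes big data"]

# On a real cluster the map and reduce tasks run in parallel across
# many machines; here we simply run them sequentially in one process.
pairs = [pair for doc in documents for pair in map_phase(doc)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts)
```

Because each map task only needs its own slice of the input and each reduce task only needs one key's values, the framework can spread both phases across thousands of machines, which is what lets Hadoop pair massive storage with fast processing.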

Because of the limits of vertical scalability, Hadoop is designed to scale horizontally from a single server to thousands of commodity machines, and of course with very high fault tolerance.  It introduced new dimensions in large-scale computing.  Its key features can be listed as follows:

  1. Scalable: A Hadoop cluster can be expanded simply by adding new servers or commodity machines, without complexity such as reformatting or moving data, and with little administration.
  2. Fault tolerant: Data and application processing are protected against hardware failure.  When the cluster loses a node, the system automatically redirects that node's work to another node so that computing does not fail.
  3. Cost effective: Being open source, it is free.  Further, it delivers massively parallel computing on ordinary commodity servers.  This reduces not only the cost of processing but also the cost per terabyte of storage.
  4. Storage flexibility: Unlike traditional relational databases, Hadoop is schema-less and can absorb any type of data, structured or unstructured, from any number of sources.  It can also join and aggregate data from different sources, which makes it a powerful analysis tool.
