Advances in technology have made storage and computing resources cheaper than before. This lets organizations store more data at lower cost, which in turn increases the amount of data they hold. Data sets have grown steadily, from Megabytes (MB) to Petabytes (1 PB = 1e+9 MB). Growth on this scale requires different processing and storage approaches, because traditional systems were not built for it. The concept of Big Data evolved in response.
Big Data is not a single technology or technique. It is a broad term for data sets so large or complex that traditional data processing applications cannot handle them. More specifically, Big Data refers to the creation, storage, retrieval and analysis of data that is remarkable in terms of four elements.
These four elements are known as the 4Vs:
- VOLUME: Remember the late 20th century, when 10 GB of storage was enough for a PC. Today a typical flight can generate 5-6 GB of data every second. At that rate, a trip of 10 to 12 hours generates roughly 180 to 260 Terabytes of data. Social websites such as Facebook, Twitter and YouTube add thousands of terabytes of data every day. 40 Zettabytes of data are expected to be created by 2020, 300 times the amount in 2005.
- VELOCITY: Data generated at high speed must also be processed at the same speed, and without errors, to be useful for analysis. For example, flight data has to be analyzed as fast as it is produced and without faults; a delay of even a millisecond can put passengers' lives at risk. Similarly, high-frequency stock trading reflects market changes within milliseconds, and a single error can severely damage the economy. A modern car has close to 100 sensors that monitor items such as fuel level and tire pressure.
- VARIETY: Traditional database systems were designed for smaller volumes of structured data with a predictable and consistent structure, whereas Big Data is not just about structured data. It also includes geospatial data, 3-D data, audio and video, and unstructured data, including log files from systems such as flights, exchanges and social media. 400 million tweets per day, 30 billion pieces of content every month, and 4 billion hours of video watched each month on YouTube are examples of these different forms of data.
- VERACITY: This is the fourth V in Big Data. It refers to biases, noise and abnormality in data. In analytic processing, 40-60% of the time is spent on the “data preparation” phase: removing duplicates, partial entries and null or blank values, aggregating results, and so on. Uncertain data (poor data quality) is estimated to cost the US economy $1.3 trillion. One in three business leaders does not trust the data they use to make decisions. This V is therefore very important, both for saving processing time and for the accuracy of the processed data.
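The data preparation steps named under VERACITY (removing duplicates, partial entries and null or blank values, then aggregating) can be illustrated with a minimal sketch in plain Python. The record layout, the sensor names and the helper functions `clean` and `average_by_sensor` are all hypothetical and chosen only for illustration; a real pipeline would apply the same ideas at much larger scale with a dedicated tool.

```python
from collections import defaultdict

# Hypothetical raw sensor readings; field names are assumptions for this sketch.
raw_records = [
    {"sensor": "fuel", "ts": 1, "value": 41.0},
    {"sensor": "fuel", "ts": 1, "value": 41.0},           # exact duplicate
    {"sensor": "tire_pressure", "ts": 1, "value": None},  # null value
    {"sensor": "tire_pressure", "ts": 2, "value": 32.5},
    {"sensor": "", "ts": 2, "value": 7.0},                # blank sensor name
    {"ts": 3, "value": 3.0},                              # partial entry
    {"sensor": "fuel", "ts": 2, "value": 39.0},
]


def clean(records):
    """Drop partial, null or blank entries and exact duplicates."""
    seen = set()
    cleaned = []
    for rec in records:
        sensor = rec.get("sensor")
        ts = rec.get("ts")
        value = rec.get("value")
        # Skip records with a missing or blank field.
        if not sensor or ts is None or value is None:
            continue
        key = (sensor, ts, value)
        # Skip records already seen with identical content.
        if key in seen:
            continue
        seen.add(key)
        cleaned.append({"sensor": sensor, "ts": ts, "value": value})
    return cleaned


def average_by_sensor(records):
    """Aggregate the cleaned records into one average per sensor."""
    totals = defaultdict(lambda: [0.0, 0])  # sensor -> [sum, count]
    for rec in records:
        entry = totals[rec["sensor"]]
        entry[0] += rec["value"]
        entry[1] += 1
    return {sensor: total / count for sensor, (total, count) in totals.items()}


cleaned = clean(raw_records)
averages = average_by_sensor(cleaned)
print(len(raw_records), "raw records ->", len(cleaned), "clean records")
print(averages)
```

In this toy input, four of the seven records are discarded before any analysis happens, which gives a feel for why preparation takes such a large share of the processing time.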
Because Big Data involves multiple technologies and techniques, numerous tools are available, each built on different technologies. For example, Giraph, Hama and GraphX use the graph model; Hadoop, HaLoop and Twister use MapReduce; and Pig, Tez and Hive use MapReduce as well as the DAG model. Other tools and technologies with the same objective are also on the market. An organization should therefore take great care to select a Big Data technology based on its data processing requirements.
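To make the MapReduce model mentioned above more concrete, here is a minimal single-process sketch of its three stages (map, shuffle, reduce) applied to a word count. It is plain Python and does not use Hadoop or any of the other tools named here; the function names `map_phase`, `shuffle`, `reduce_phase` and `word_count` are hypothetical. A real framework would run the map and reduce steps in parallel across many machines.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence over the input lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    groups = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in groups.items())


lines = ["big data is big", "data is everywhere"]
counts = word_count(lines)
print(counts)
```

Because each call to `map_phase` depends only on its own line and each call to `reduce_phase` only on its own key, both stages can be distributed, which is the property the MapReduce-based tools rely on.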
An organization's data processing requirements can be:
- Operational (real-time, interactive workloads): low-latency responses on highly selective access criteria, supported by NoSQL databases such as MongoDB.
- Analytical (high throughput): complex queries that touch most of the data, supported by MPP database systems and MapReduce.
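The difference between the two workload types can be sketched with a small in-memory example. The `orders` data and the functions `get_order` and `revenue_by_customer` are hypothetical; the dictionary index stands in for the kind of keyed access an operational store provides, and the full scan stands in for an analytical query.

```python
# Hypothetical order records used only for this illustration.
orders = [
    {"order_id": 1, "customer": "alice", "amount": 120.0},
    {"order_id": 2, "customer": "bob", "amount": 75.5},
    {"order_id": 3, "customer": "alice", "amount": 30.0},
    {"order_id": 4, "customer": "carol", "amount": 210.0},
]

# Index by primary key, built once, so a lookup touches a single record.
orders_by_id = {order["order_id"]: order for order in orders}


def get_order(order_id):
    """Operational access: highly selective, returns one record quickly."""
    return orders_by_id.get(order_id)


def revenue_by_customer():
    """Analytical access: scans every record to compute an aggregate."""
    totals = {}
    for order in orders:
        customer = order["customer"]
        totals[customer] = totals.get(customer, 0.0) + order["amount"]
    return totals


print(get_order(3))
print(revenue_by_customer())
```

The lookup cost stays roughly constant as the data grows, while the aggregate must read everything, which is why the two workloads are usually served by differently designed systems.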
Decision makers should weigh several criteria before choosing a technology for their organization, including:
- Online vs. Offline Big Data
- Software Licensing Models
- Developer Appeal
- General Purpose vs. Niche Solutions
New technologies make it easier to choose an approach that gives an organization both operational and analytical capabilities. NoSQL, MPP databases and Hadoop have emerged to address Big Data challenges. One of the most common ways companies leverage the capabilities of both kinds of system is by integrating a NoSQL database such as MongoDB with Hadoop.
Hadoop has gradually become an important ingredient of Big Data technologies because of its usability, scalability and capabilities. Moreover, it is open source and easy to configure with other Big Data tools.