Let’s break down the concept of Big Data

In one sentence, Big Data is “extremely large data sets that may be analyzed computationally to reveal patterns, trends, and associations, especially relating to human behavior and interactions.” There are many misconceptions about Big Data: people think any huge data set that needs to be processed by a system is Big Data. Rather, data is called Big Data when it can neither be stored nor processed by the systems we already have.

Why this concept of Big Data?

This is 2019, and if we take all the data in the world as 100%, then roughly 90% of it has been produced in just the past two years, and the remaining 10% is everything that came before. So we can conclude that every year we are producing more and more data. This data can be anything: text messages, images, videos, etc. Every multinational company that deals with huge user interaction, like Facebook, Google Plus, LinkedIn, or Twitter, and companies like Google, Yahoo, Bing, or DuckDuckGo, which collect huge amounts of data about everything else, had to move toward the concept of Big Data to serve their customers at a fast processing speed.


Before Hadoop, processing was computation-bound. That means if we want to process some data, it has to be on that machine: if you cannot store your data, you cannot process that data.

Surveys say that Facebook handles about half a petabyte (PB) of data each day, Twitter handles about 500 million tweets per day, and Google receives 5.6 billion searches per day. So do those websites serve you slowly? No. How do they do that? This is where the concepts of Hadoop and Big Data come into the picture.

Break Down In Simple Words:

Well, let’s understand the concept of Big Data in simple words through two scenarios.

1. The Farmer Example:

Once there was a farmer. He had his crop field, and let’s assume that all the crop yielded in a year is consumed by his family, and that the farmer has a storeroom of capacity 100 units. The first year he yields 10 units of crop, which he stores in his storeroom and his family consumes. The second year he yields 50 units, stored and consumed the same way. The third year he yields 100 units, again stored and consumed. But in the fourth year he yields 200 units. Would he be able to store the crop in his storeroom? No. This means he has more data (the crop) to process (store in the storeroom) than his processing power allows. So the farmer, like a company in the same position, needs an alternative plan to store and process the data. The amount of crop, or data, that the farmer or the company cannot store or process is what counts as Big Data to them.

2. The Employer Example:

Let’s take me as an employee of some company who has to sign a lot of documents. I can sign at most 50 documents in a given amount of time. My colleague gives me 100 files at first; by the time I have finished signing 50 of them, he gives me another 100 documents, so I now have 150 documents to sign. Again, by the time I have finished signing another 50, he gives me 100 more files, so now 200 files are pending. Basically, every time I finish 50 documents, my pending count increases by 50. I am signing at my maximum speed, but the problem is that my processing speed is fixed while my load is continuously increasing. So if we consider the documents as data to be processed and my signing speed as my processing speed, then we must say that with the increasing load we need to increase our processing speed too. The data that we cannot finish processing becomes Big Data to us.
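The arithmetic in this example can be sketched in a few lines of Python. The numbers (100 documents arriving per interval, 50 signed per interval) are just the hypothetical figures from the story above:

```python
# Sketch of the employer example: documents arrive faster than they
# can be signed, so the backlog grows by a fixed amount every interval.

def backlog_after(intervals, arrival_rate=100, service_rate=50):
    """Unsigned documents left after the given number of intervals."""
    pending = 0
    for _ in range(intervals):
        pending += arrival_rate                 # colleague delivers new files
        pending -= min(pending, service_rate)   # sign as many as we can
    return pending

print(backlog_after(1))  # 50
print(backlog_after(2))  # 100
print(backlog_after(4))  # 200
```

Because arrivals exceed the service rate by 50 per interval, the backlog grows without bound no matter how long the worker keeps signing, which is exactly the point of the example.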

Now, if we cannot store our data on our local systems, we will go for solutions like data centers or external hard disks. A simple solution, right? But here too we have a problem. Say you have stored 10 GB of data in your data center and you are trying to access 3 GB of it for processing, and for that you have written 120 KB of code. Now you can go one of two possible ways.

  1. You can download all your 3GB of data from your data center to your local machine.
  2. You can send your 120KB code to the data center.

If you go for the 1st option, you will spend a lot of time just transferring those files. You can go for the 2nd option instead, but sending your code to a single remote machine still leaves all the processing on that one machine, which is not good practice either.


What can we do then? We can distribute the task across several systems, each with its own resources. Each system gets a fraction of the main data, processes it, and sends its result back to the main server. After all the tasks have been performed, the main server returns the response to the client system. By doing this, you can process all that huge data in much less time. This is what we know as the concept of Hadoop.
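The idea can be sketched in Python. This is illustrative only: the "cluster" here is a thread pool on one machine, whereas real Hadoop ships the job to the nodes that actually hold each block of the file, and the function names are my own:

```python
# Minimal sketch of the "move the code to the data" idea behind
# MapReduce: split the data into fractions, process each fraction
# in parallel, then combine the partial results.

from concurrent.futures import ThreadPoolExecutor

def count_words(split):
    """The 'map' step: runs on one fraction (split) of the data."""
    return sum(len(line.split()) for line in split)

def run_job(lines, workers=4):
    # Divide the data set into roughly equal splits, one per worker.
    splits = [lines[i::workers] for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(count_words, splits))  # parallel 'map'
    return sum(partials)  # the 'reduce' step: combine partial results

data = ["big data is large", "hadoop processes it", "in parallel"] * 1000
print(run_job(data))  # 9000 words counted across all splits
```

Each worker only ever touches its own fraction of the data, and only the small partial results travel back to be combined, which mirrors why shipping a 120 KB program to the data beats shipping gigabytes of data to the program.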

How Did Hadoop Come About?

After Google started, it began to receive a huge amount of data each day, and they had to come up with a solution to store that data efficiently. In 2003 they introduced GFS, the Google File System. In the following year, they also published MapReduce for efficient data processing.

After a while, Yahoo, then the 2nd most popular search engine after Google, also felt the need for efficient data storage and fast processing. Doug Cutting and Mike Cafarella took the concept of GFS, adapted it, and introduced HDFS, the Hadoop Distributed File System: a distributed file system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster. They also included MapReduce for improved processing efficiency. In February 2008, Yahoo moved its web index onto Hadoop.
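A quick sketch of how HDFS lays files out on those commodity machines. The numbers here (128 MB block size, replication factor 3) are Hadoop's documented defaults, not figures from this article; older releases defaulted to 64 MB blocks:

```python
# HDFS stores a file as fixed-size blocks, each replicated across
# several machines so a node failure does not lose data.

import math

def hdfs_blocks(file_size_mb, block_mb=128, replication=3):
    """Blocks and block replicas needed for a file of the given size.

    block_mb=128 and replication=3 are Hadoop's default settings;
    the last block may be smaller than block_mb.
    """
    blocks = math.ceil(file_size_mb / block_mb)
    return blocks, blocks * replication

print(hdfs_blocks(1024))  # (8, 24): a 1 GB file -> 8 blocks, 24 replicas stored
```

Because each block can live on a different machine, a MapReduce job can run one task per block in parallel, right where the block is stored.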

On 19 February 2008, Yahoo! Inc. launched what they claimed was the world’s largest Hadoop production application. The Yahoo! Search Webmap is a Hadoop application that runs on a Linux cluster with more than 10,000 cores and produced data that was used in every Yahoo! web search query. There are multiple Hadoop clusters at Yahoo! and no HDFS file systems or MapReduce jobs are split across multiple data centers. Every Hadoop cluster node bootstraps the Linux image, including the Hadoop distribution. Work that the clusters perform is known to include the index calculations for the Yahoo! search engine. In June 2009, Yahoo! made the source code of its Hadoop version available to the open-source community.

In 2010, Facebook claimed that they had the largest Hadoop cluster in the world with 21PB (Peta Bytes) of storage. In June 2012, they announced the data had grown to 100 PB and later that year they announced that the data was growing by roughly half a PB per day.

As of 2013, Hadoop adoption had become widespread: more than half of the Fortune 50 companies used Hadoop.

