What is Big Data?
Big Data is a term you will be reading and hearing about in the months and years to come, as it is slowly but steadily becoming part of our daily lives. Simply put, Big Data is the data, structured or unstructured, that is being collected from digital media around the world at an extraordinary rate. Its size ranges from hundreds of GBs upward. More specifically, when data grows to the point where processing it on a single computer becomes too time- and machine-intensive, that data is termed ‘Big Data’. Big Data used to be called Analytics/Business Intelligence before the industry felt the need for a change in the term.
To break that down in simple words, let’s say that Facebook wants to know which ads work best for people with college degrees. Let’s say there are 200,000,000 Facebook users with college degrees, and each of them has been served 100 ads. That’s 20,000,000,000 events of interest, and each “event” (an ad being served) contains several data points (features) about the ad: What was the ad for? Did it have a picture in it? Was there a man or a woman in the ad? How big was the ad? What was the most prominent color? Let’s say there are 50 “features” for each ad. This means you have 1,000,000,000,000 (one trillion) pieces of data to sort through. If each “piece” of data were only 100 bytes, you’d have about 100 TB (roughly 91 TiB) of data to parse. And this is just a small sample of Big Data.
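As a rough check of the arithmetic in the ad example, here is a minimal Python sketch. The counts come from the example itself; the helper name human_readable and the choice of decimal (SI) units are assumptions made only for illustration.

```python
# Back-of-the-envelope sizing for the ad example.
users = 200_000_000        # users with college degrees (from the example)
ads_per_user = 100         # ads served to each user
features_per_ad = 50       # data points recorded per ad
bytes_per_feature = 100    # assumed size of one piece of data

events = users * ads_per_user                   # ad impressions
data_points = events * features_per_ad          # individual pieces of data
total_bytes = data_points * bytes_per_feature   # raw size in bytes


def human_readable(num_bytes):
    """Format a byte count using decimal (SI) units; hypothetical helper."""
    units = ["B", "KB", "MB", "GB", "TB", "PB"]
    value = float(num_bytes)
    for unit in units:
        if value < 1000 or unit == units[-1]:
            return f"{value:.1f} {unit}"
        value /= 1000


print(f"{events:,} events, {data_points:,} data points")
print(f"total size: {human_readable(total_bytes)}")
```

With 100 bytes per data point, the total is 10^14 bytes, which is about 100 TB. That is far more than a typical single machine can comfortably store and process.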
The five characteristics that define Big Data are: Volume, Velocity, Variety, Veracity and Value.
VOLUME: Volume refers to the ‘amount of data’, which is growing day by day at a tremendous rate. The amount of data generated by humans, machines and their interactions on social media alone is gigantic. Researchers have predicted that 40,000 Exabytes will be generated by 2020, which is an increase of around 300 times from 2005.
VELOCITY: Velocity is defined as the speed at which different sources generate data every day. This flow of data is massive and continuous. There are 1.03 billion Daily Active Users on Facebook Mobile as of now, an increase of over 20% annually. This shows how fast the number of users is growing on social media and how fast data is being generated daily. By working with real-time data and handling the velocity of Big Data well, we can generate insights and take correct, effective decisions for the product.
VARIETY: Because many different sources contribute to Big Data, the types of data they generate differ. Data can be structured, semi-structured or unstructured, so a wide variety of data is generated every day. Earlier, we used to get data from Microsoft Excel and databases; now, data also arrives in the form of images, audio, video, sensor data and so on. This variety of unstructured data creates new problems in capturing, storing, mining and analyzing the data.
VERACITY: Veracity refers to doubt or uncertainty about the available data, caused by inconsistency and incompleteness. The data available can be messy and may be difficult to trust. With many forms of big data, quality and accuracy are difficult to control; Twitter posts, for example, are full of hashtags, abbreviations, typos and colloquial speech. The sheer volume is often the reason behind the lack of quality and accuracy in the data.
VALUE: It is all well and good to have access to big data, but unless we can turn it into value it is useless. By turning it into value, I mean: is it adding to the benefits of the organizations that are analyzing it? Is the organization working on Big Data achieving a high Return on Investment? Unless working on Big Data adds to their profits, it is useless.
What is Hadoop?
Now that we have some insight and basic knowledge about Big Data, let us look at what Hadoop is. Hadoop was developed because the existing data storage and processing tools proved inadequate for the large amounts of data that started to appear after the internet boom of the last decade. Over the years, Hadoop has grown into something like an operating system at a very large scale, focused on ‘distributed and parallel processing’ of the vast amounts of data created nowadays. As with any ‘normal’ operating system, Hadoop includes a file system, lets you write programs, manages the distribution of those programs and returns the results afterwards.
A Hadoop cluster is reliable and extremely scalable, and it can be used to query massive data sets. Hadoop is written in the Java programming language, so it can run on any platform with a Java runtime, and it is used by a global community of distributors and big data technology vendors who have built layers on top of Hadoop.
The feature that makes Hadoop so useful is the Hadoop Distributed File System (HDFS). This is Hadoop's storage system, which breaks the data it stores into smaller pieces called blocks. These blocks are then distributed throughout a cluster. Distributing the data this way allows the map and reduce functions to be executed on smaller subsets instead of on one large data set. This increases efficiency, reduces processing time and enables the scalability necessary for processing vast amounts of data.
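To make the idea of blocks concrete, here is a minimal Python sketch of splitting data into fixed-size blocks and spreading them across nodes. It is a toy model, not the HDFS API: the 16-byte block size, the node names and the round-robin placement are all assumptions for illustration. Real HDFS uses much larger blocks and also replicates each block across several nodes.

```python
from collections import defaultdict

BLOCK_SIZE = 16  # bytes; toy value for illustration (real HDFS blocks are far larger)
NODES = ["node-1", "node-2", "node-3"]  # hypothetical cluster node names


def split_into_blocks(data, block_size=BLOCK_SIZE):
    """Cut a byte string into consecutive blocks of at most block_size bytes."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_blocks(blocks, nodes=NODES):
    """Assign blocks to nodes round-robin (an assumed, simplified placement policy)."""
    placement = defaultdict(list)
    for index, block in enumerate(blocks):
        placement[nodes[index % len(nodes)]].append((index, block))
    return dict(placement)


data = b"the quick brown fox jumps over the lazy dog, again and again"
blocks = split_into_blocks(data)
placement = place_blocks(blocks)

# Show which block numbers each node holds.
for node, node_blocks in placement.items():
    print(node, [index for index, _ in node_blocks])
```

Because every node holds only some of the blocks, each node can work on its own share of the data independently, which is what makes parallel processing possible.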
MapReduce, created by Google to solve the problem of storing, handling and processing vast amounts of user data, is a software framework and programming model that processes and retrieves, in parallel, the vast amounts of data stored on the Hadoop system. MapReduce libraries have been written for many programming languages, so it can be used from any of them. MapReduce can work with both structured and unstructured data.
MapReduce works in two steps. The first step is the “Map-phase”, which divides the data into smaller subsets and distributes those subsets over the different nodes in a cluster. Nodes within the system can do this again, resulting in a multi-level tree structure that divides the data into ever-smaller subsets. At those nodes, the data is processed and the answer is passed back to the “master node”. The second step is the “Reduce-phase”, in which the master node collects all the returned data and combines it into an output that can be used again. The MapReduce framework manages all the various tasks in parallel across the system and forms the heart of Hadoop.
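The two phases can be sketched with the classic word-count example. This is a minimal single-machine illustration in Python, not Hadoop's actual API: the function names are hypothetical, a thread pool stands in for the cluster nodes, and the intermediate grouping step (often called the shuffle) is written out explicitly.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_phase(chunk):
    """Map: turn one subset of the input into (word, 1) pairs."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(mapped_chunks):
    """Group the intermediate pairs by key so each word's counts end up together."""
    grouped = defaultdict(list)
    for pairs in mapped_chunks:
        for key, value in pairs:
            grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: combine the values collected for each key into one result."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(chunks):
    # Each chunk is mapped independently; threads stand in for cluster nodes here.
    with ThreadPoolExecutor() as pool:
        mapped = list(pool.map(map_phase, chunks))
    return reduce_phase(shuffle(mapped))


chunks = ["big data is big", "hadoop stores big data", "data is everywhere"]
counts = word_count(chunks)
print(counts)
```

The map step never needs to see the whole data set, only its own chunk, which is why it can run on many nodes at once. Only the much smaller intermediate results have to be brought together for the reduce step.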
With the combination of these technologies, massive amounts of data can be stored, processed and analyzed far faster than on a single machine. In the past years, Hadoop has proven very successful in the Big Data ecosystem, and it looks like this will remain the case in the future. Hadoop, which dates back to 2005, is a powerful tool: over 25% of organizations currently use it to manage their data, up from 10% in 2012. There are several reasons why organizations use Hadoop.
It is used in almost every industry, from retail to government to finance. So it is definitely worth our time and effort to prepare for the Hadoop certification.