Managing and storing Big Data

The average consumer PC might contain 500 GB or 1 TB of storage; a mobile phone, 32–128 GB or more; and if that’s not enough, we simply add external storage devices. But how much storage do you think Google, Microsoft, or even Coca-Cola needs?

With more than 1.9 billion drinks consumed per day, Coca-Cola generates vast amounts of data every day, from production and distribution through to consumer feedback. Google handles over 2.5 trillion searches per year (roughly 79,000 queries per second) and, as a whole, processes over 20 petabytes of data per day through an average of 100,000 MapReduce jobs spread across its computing clusters.
1 petabyte (PB) = 1024 terabytes (TB)
What’s a MapReduce job? In simple terms, it is a way of splitting a large input data set into pieces, processing those pieces in parallel on multiple computing nodes (the map step), and combining the partial results into a new set of output (the reduce step).
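To make that concrete, here is a minimal single-machine sketch of the map/shuffle/reduce pattern in Python. It is not Hadoop itself, just the idea; the city/temperature records are made up for illustration, and a real job would spread the same three steps across many nodes.

```python
from collections import defaultdict

# Toy input: (city, temperature) records standing in for a large data set.
records = [
    ("delhi", 41), ("mumbai", 33), ("delhi", 38),
    ("mumbai", 35), ("chennai", 39), ("chennai", 42),
]

# Map: emit a (key, value) pair for every input record.
mapped = [(city, temp) for city, temp in records]

# Shuffle: group all values that share the same key.
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# Reduce: collapse each group into a single output value.
reduced = {city: max(temps) for city, temps in groups.items()}

print(reduced)  # {'delhi': 41, 'mumbai': 35, 'chennai': 42}
```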
But what is Big Data?
The need to handle, and extract information from, large and complex data sets (such as those of Coca-Cola and Google mentioned above) is what gives us the problem, and the field, of Big Data. Since such data sets are difficult or impossible to process using traditional methods, we need a different approach to dealing with them.

In the early 2000s, industry analyst Doug Laney articulated the now-mainstream definition of big data as the three V’s:
Volume: Within the social media space, for example, Volume refers to the amount of data generated through websites, portals and online applications. Consider this: Facebook and YouTube have over 2 billion users, and Instagram has 500 million+ daily active users. Every day, these users contribute billions of images, posts, videos, tweets and more. You can now imagine the insanely large amount, or Volume, of data generated every minute and every hour.
Velocity: Velocity refers to the speed at which data is generated. For example, each day hundreds of millions of posts, tweets and likes are created, and around 500,000 hours of video are uploaded. This flood of data has to be ingested and processed on time, without creating bottlenecks.
Variety: This refers to the many forms, structured and unstructured, that data generated by humans or machines can take. The most commonly shared data, texts, tweets, pictures and videos, are joined by other unstructured sources such as emails, voicemails, hand-written notes, ECG readings and audio recordings. Variety is all about the ability to classify incoming data into these various categories.
How does a company store/manage Big Data?
To discuss management, we need to note four key building blocks of a Big Data management solution:
Big Data Integration: Loading big data (large volumes of log files, plus data from operational systems, sensors, social media and other sources) into Hadoop via HDFS, HBase, Sqoop or Hive is an operational data integration problem; a small loading sketch follows this list.
Big Data Manipulation: A range of tools lets users take advantage of big data parallelization to perform transformations on large amounts of data. Scripting languages such as Apache Pig provide the means to clean, compare, compute and group data stored in an HDFS cluster.
Big Data Quality: Proper management offers data quality checks that take advantage of Hadoop’s parallel computing. These features provide explicit functions and tasks to profile data and identify duplicate records across huge data stores in moments rather than days.
Big Data Management and Governance: Big data projects need an explicit project management structure as they become part of the bigger system. Companies will therefore need to set standards and procedures around these projects, just as they have with data management projects in the past.
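As a hedged illustration of the integration step, the sketch below pushes a local log file into HDFS using the third-party Python hdfs package (HdfsCLI) over WebHDFS; the host, port, user and paths are made up, and in practice tools such as Sqoop or Hive would handle bulk loads from operational systems.

```python
# Sketch: land raw log files in HDFS as the first step of big data integration.
# Assumes the third-party `hdfs` package (HdfsCLI) and a reachable WebHDFS
# endpoint; namenode.example.com, the user and the paths are placeholders.
from hdfs import InsecureClient

client = InsecureClient("http://namenode.example.com:9870", user="etl")

# Create a landing directory in HDFS (no error if it already exists).
client.makedirs("/data/raw/weblogs")

# Upload a local log file; HDFS splits it into blocks and replicates them
# across DataNodes behind the scenes.
client.upload("/data/raw/weblogs/access.log", "access.log", overwrite=True)

# Quick sanity check: list what landed.
print(client.list("/data/raw/weblogs"))
```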
Hadoop? HDFS? … huh?
Let’s say we have an influx of items or documents that we need to store and process. Sending any one item to a machine takes some finite time t, so for n items a single machine needs roughly n*t units of time to receive the data. Now consider a set of n machines, each with its own storage, that can take in items in parallel. The same n items now need only about t units of time, because each item is sent to a different machine at the same time. And as the number of items grows, we can simply add more machines, or nodes, to the cluster to meet the storage need. This is known as distributed computing and storage.
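A quick back-of-the-envelope calculation, with made-up numbers, shows how that arithmetic plays out when there are more items than machines:

```python
import math

# Made-up numbers for illustration.
n_items = 10_000      # items/documents to ingest
t = 0.05              # seconds to send one item to one machine
n_machines = 50       # nodes in the cluster

# One machine receiving everything sequentially.
sequential_time = n_items * t

# With n_machines ingesting in parallel, each node handles a share of the
# items; the busiest node determines the total time.
parallel_time = math.ceil(n_items / n_machines) * t

print(f"sequential: {sequential_time:.0f} s")  # 500 s
print(f"parallel:   {parallel_time:.0f} s")    # 10 s
```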

Apache Hadoop is essentially a Big Data solution that implements the scenario described above. It is open-source software for reliable, scalable, distributed computing. HDFS, the Hadoop Distributed File System, is the component that decides where each piece of data is stored, and how it is accessed and managed, across the cluster of nodes.
HDFS
Hadoop Distributed File System (HDFS) distributes the data over the data nodes. Four types of nodes are involved in an HDFS cluster:
Name Node: a facilitator that provides information on the location of data. It knows which nodes are available, where in the cluster certain data resides, and which nodes have failed.
Secondary Node: a helper that periodically checkpoints the Name Node’s metadata; it is often described as a backup to the Name Node, though it is not a live standby.
Job Tracker: coordinates the processing of the data using MapReduce.
Slave Nodes: store data and take direction from the Job Tracker.
Applications that collect data in various formats place it into the Hadoop cluster through an API operation that connects to the NameNode. The NameNode tracks the file directory structure and the placement of each file’s blocks (“chunks”), which are replicated across DataNodes. To query the data, you submit a MapReduce job made up of many map and reduce tasks that run against the files in HDFS, spread across the DataNodes. Map tasks run on the nodes holding the input data, and reducers then aggregate and organize the final output.
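To make the job side concrete, here is a word-count sketch in the Hadoop Streaming style, where map and reduce tasks are plain scripts that read standard input and write tab-separated key/value lines to standard output. The file name and the local test pipeline below are illustrative; on a real cluster the script would be submitted with the streaming jar against input and output paths in HDFS.

```python
#!/usr/bin/env python3
# wordcount.py: word count in the Hadoop Streaming style. Test locally with:
#   cat some_text.txt | python3 wordcount.py map | sort | python3 wordcount.py reduce
import sys

def mapper():
    # Emit (word, 1) for every word in the input split.
    for line in sys.stdin:
        for word in line.strip().split():
            print(f"{word}\t1")

def reducer():
    # Streaming delivers reducer input sorted by key, so counts for a given
    # word arrive contiguously and can be summed with a running total.
    current_word, count = None, 0
    for line in sys.stdin:
        word, value = line.rstrip("\n").split("\t")
        if word != current_word:
            if current_word is not None:
                print(f"{current_word}\t{count}")
            current_word, count = word, 0
        count += int(value)
    if current_word is not None:
        print(f"{current_word}\t{count}")

if __name__ == "__main__":
    mapper() if sys.argv[1] == "map" else reducer()
```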
The Hadoop ecosystem has grown significantly over the years due to its extensibility. Today, the Hadoop ecosystem includes many tools and applications to help collect, store, process, analyze, and manage big data.