In the words of Pat Gelsinger, the CEO of VMware: “Data is the new science. Big Data holds the answers.” Going by this statement, data holds the key to today’s world. In the past, we had to rely on experienced professionals for critical decisions pertaining to business, marketing, shopping and the like. That experience was built on exposure to the many problems they had faced and whether they had been able to tackle them successfully; in effect, they had been unconsciously training their minds to judge the feasibility of certain decisions. Times have changed, and we now look towards data-driven decision making to provide more accurate decisions, minimize human error and maximize the efficiency of these industries.
To work with this concept, we first need to know how much data we have to handle. It is estimated that nearly 3 billion terabytes of data are generated by a single cross-country flight. Surprised? And that is just the volume of data generated by the airline industry; many other industries work on similar lines and generate staggering amounts of data.
Big data is an evolving term that describes any voluminous amount of structured, semi-structured and unstructured data that has the potential to be mined for information. To tackle generated data of such voluminous and baffling size, we use specific techniques to mine useful information, and these techniques need to be highly robust, accessible, scalable and simple. One such framework is Hadoop. It is built around a file system named HDFS (Hadoop Distributed File System), which combines distributed file system architecture with parallel programming to handle enormous amounts of data stored on commodity servers. Together, these techniques help in mining critical information reliably.
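To make this concrete, here is a minimal sketch of writing and reading a file on HDFS through the Hadoop Java API. It assumes a running Hadoop cluster whose address is already configured (e.g. via core-site.xml); the path /user/demo/sample.txt is purely illustrative.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public class HdfsHelloWorld {
    public static void main(String[] args) throws Exception {
        // Picks up fs.defaultFS and related settings from core-site.xml / hdfs-site.xml
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/user/demo/sample.txt"); // hypothetical path

        // Write: HDFS transparently splits the file into blocks and
        // distributes (and replicates) them across DataNodes.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.write("Hello, HDFS!\n".getBytes(StandardCharsets.UTF_8));
        }

        // Read the file back: the client streams the blocks from
        // whichever DataNodes currently hold them.
        try (FSDataInputStream in = fs.open(file);
             BufferedReader reader =
                 new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
            System.out.println(reader.readLine());
        }

        fs.close();
    }
}
```

Notice that the application code never mentions blocks or servers; the framework handles the distribution, which is exactly what makes it accessible and scalable.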
HDFS stores files in pieces named blocks. These blocks are distributed across the servers in the cluster, which reduces the seek time for these files. Secondly, duplicate copies (replicas) of each block are also stored, serving as backups that prevent loss of information and make the system robust. To locate these blocks, their metadata is kept on the NameNode, while the actual block data is stored on the various DataNodes spread across the cluster. The NameNode serves as a master to the DataNodes, which is why this is also called a master-slave architecture.
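The Hadoop API lets a client observe this architecture directly. The sketch below, again assuming a running cluster and the same hypothetical file path as above, asks the NameNode where the blocks of a file physically live; the NameNode answers from its metadata without touching any block data on the DataNodes.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsBlockReport {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/user/demo/sample.txt"); // hypothetical path

        // Per-file metadata held by the NameNode: replication factor
        // (how many duplicate copies exist) and block size.
        FileStatus status = fs.getFileStatus(file);
        System.out.println("Replication factor: " + status.getReplication());
        System.out.println("Block size: " + status.getBlockSize() + " bytes");

        // Ask the NameNode which DataNodes hold each block of the file.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("Block at offset " + block.getOffset()
                    + " stored on: " + String.join(", ", block.getHosts()));
        }

        fs.close();
    }
}
```

Each block typically appears on several hosts in the output, which is the replication described above doing its job: if one DataNode fails, the NameNode simply directs clients to another copy.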
“Torture the data, and it will confess to anything.” This quote, often attributed to the economist Ronald Coase, neatly sums up the points made in the preceding paragraphs. No wonder big data is called the hotcake for IT professionals, and it is likely to remain so for the next few decades.