Big Data deals with data sets that are too large or too complex to be handled by traditional data-processing application software. It is characterized by three key concepts: volume, variety, and velocity. Volume refers to the size of the data; variety refers to the type of data, such as images, PDFs, audio, and video; and velocity refers to the speed at which data is transferred, processed, and analyzed. Big data works on large data sets, which can be unstructured, semi-structured, or structured. Working with big data involves key activities such as capturing, searching, storing, sharing, transferring, analyzing, visualizing, and querying data. Analysis techniques include A/B testing, machine learning, and natural language processing; visualization relies on charts, graphs, and similar tools. Supporting technologies include business intelligence, cloud computing, and databases.
Some Popular Big Data Technologies:
Here, we will discuss each of these big data technologies in turn, focusing on a brief overview of each one listed below.
1. Apache Cassandra: It is a NoSQL database that is highly scalable and highly available, and it supports replicating data across multiple data centers. Fault tolerance is one of Cassandra's biggest strengths: failed nodes can be replaced without any downtime.
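As a minimal sketch of the multi-data-center replication idea, the snippet below uses the DataStax Python driver (cassandra-driver) to create a keyspace replicated across two data centers. The contact point, keyspace name, and data-center names (dc1, dc2) are assumptions, not values from the original text.

```python
from cassandra.cluster import Cluster  # pip install cassandra-driver

# Connect to a local Cassandra node (the contact point is an assumption).
cluster = Cluster(["127.0.0.1"])
session = cluster.connect()

# NetworkTopologyStrategy replicates the keyspace per data center,
# which is what enables Cassandra's multi-data-center availability.
# 'dc1' and 'dc2' are hypothetical data-center names.
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS demo
    WITH replication = {
        'class': 'NetworkTopologyStrategy',
        'dc1': 3,
        'dc2': 2
    }
""")

session.execute("""
    CREATE TABLE IF NOT EXISTS demo.users (
        user_id uuid PRIMARY KEY,
        name text
    )
""")
cluster.shutdown()
```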
2. Apache Hadoop: Hadoop is one of the most widely used big data technologies. It handles large-scale data and large file systems through the Hadoop Distributed File System (HDFS), and it provides parallel processing through its MapReduce framework. Hadoop scales out, which makes it a practical solution for very large capacities and workloads. For a real use case, NextBio uses Hadoop MapReduce and HBase to process multi-terabyte data sets of the human genome.
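To make the MapReduce idea concrete, here is a minimal word-count sketch written as two Python scripts for Hadoop Streaming, which lets any program that reads stdin and writes stdout act as a mapper or reducer. The file names mapper.py and reducer.py are placeholders of our choosing.

```python
#!/usr/bin/env python3
# mapper.py -- emits "word<TAB>1" for every word read from stdin.
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")

# --------------------------------------------------------------
#!/usr/bin/env python3
# reducer.py -- Hadoop sorts mapper output by key before the reduce
# phase, so identical words arrive on consecutive lines; sum them.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    word, count = line.rsplit("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)
if current_word is not None:
    print(f"{current_word}\t{current_count}")
```

A job like this would typically be launched with the hadoop-streaming jar, along the lines of `hadoop jar hadoop-streaming.jar -files mapper.py,reducer.py -mapper mapper.py -reducer reducer.py -input <hdfs-input> -output <hdfs-output>`, where the jar path and HDFS paths depend on the installation.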
3. Apache Hive: It is used for data summarization and ad hoc querying, which makes querying and analyzing big data easy. It is built on top of Hadoop and provides data summarization, ad hoc queries, and analysis of large datasets using an SQL-like language called HiveQL. It is not a relational database, and it is not designed for real-time queries. Its features include being designed for OLAP, an SQL-type language called HiveQL, and being fast, scalable, and extensible.
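The sketch below shows what an ad hoc HiveQL summarization query might look like from Python, using the PyHive client against HiveServer2. The host, port, username, and the page_views table with its columns are all assumptions for illustration.

```python
from pyhive import hive  # pip install 'pyhive[hive]'

# HiveServer2 commonly listens on port 10000; all connection
# details here are assumptions for the sketch.
conn = hive.Connection(host="localhost", port=10000, username="demo")
cursor = conn.cursor()

# An ad hoc HiveQL summarization: daily view counts from a
# hypothetical page_views table.
cursor.execute("""
    SELECT view_date, COUNT(*) AS views
    FROM page_views
    GROUP BY view_date
    ORDER BY views DESC
    LIMIT 10
""")
for row in cursor.fetchall():
    print(row)
conn.close()
```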
4. Apache Flume: It is a distributed and reliable system for collecting, aggregating, and moving large amounts of log data from many data sources to a centralized data store.
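A Flume agent itself is configured through a properties file rather than code, but as a sketch of how log events reach an agent, the snippet below posts events to a Flume HTTP source using its default JSON handler. It assumes an agent already configured with an HTTP source on port 44444; the port, header values, and event bodies are all assumptions.

```python
import json
import urllib.request

# Flume's HTTP source (with the default JSONHandler) accepts a JSON
# array of events, each with "headers" and a string "body". The agent
# configuration and port 44444 are assumptions, defined outside this code.
events = [
    {"headers": {"host": "web-01"}, "body": "GET /index.html 200"},
    {"headers": {"host": "web-02"}, "body": "GET /login 500"},
]
req = urllib.request.Request(
    "http://localhost:44444",
    data=json.dumps(events).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
urllib.request.urlopen(req)  # a 200 response means the events reached the channel
```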
5. Apache Spark: The main objective of Spark is to speed up Hadoop's computational processing; it was introduced by the Apache Software Foundation. Spark is not an updated or modified version of Hadoop: it has its own cluster management, so it can work independently, and running it with Hadoop is just one way to deploy it. Spark can use Hadoop in two ways, for storage and for processing; since Spark provides its own cluster-management computation, it typically uses Hadoop for storage purposes only. Its key features include interactive queries, stream processing, and in-memory cluster computing.
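Here is a minimal PySpark sketch of the in-memory, interactive-query style described above. It assumes a local standalone master (no Hadoop involved) and a hypothetical events.json input file with a status column.

```python
from pyspark.sql import SparkSession  # pip install pyspark

# Running with a local master shows Spark working independently of
# Hadoop; "events.json" and the "status" column are assumptions.
spark = SparkSession.builder.appName("demo").master("local[*]").getOrCreate()

df = spark.read.json("events.json")
df.cache()  # keep the dataset in memory so repeated queries avoid re-reading it

# Interactive-style queries reuse the cached, in-memory data.
df.groupBy("status").count().show()
df.filter(df.status == 500).show()

spark.stop()
```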
6. Apache Kafka: It is a distributed publish-subscribe messaging system; more specifically, it provides a robust queue that can handle a high volume of data and pass messages from one point to another, that is, from a sender to a receiver. It is suitable for both offline and online message consumption. To prevent data loss, Kafka messages are replicated within the cluster. For real-time streaming data analysis, it integrates with Apache Storm and Spark, and it is built on top of the ZooKeeper synchronization service.
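The publish-subscribe flow can be sketched with the kafka-python client: one side publishes to a topic, the other subscribes and reads. The broker address, the topic name "logs", and the message contents are assumptions.

```python
from kafka import KafkaProducer, KafkaConsumer  # pip install kafka-python

# Broker address and topic name are assumptions for this sketch.
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("logs", b"user signed in")  # publish a message to the topic
producer.flush()

consumer = KafkaConsumer(
    "logs",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",  # start from the beginning of the topic
    consumer_timeout_ms=5000,      # stop iterating if no new messages arrive
)
for message in consumer:
    print(message.value.decode("utf-8"))
```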
7. MongoDB: It is a cross-platform database that works on the concepts of collections and documents. It uses document-oriented storage, meaning data is stored as JSON-like documents, and any attribute can be indexed. Its features include high availability, replication, rich queries, auto-sharding, and fast in-place updates.
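The following PyMongo sketch touches each of those ideas: a JSON-like document, an index on an arbitrary attribute, a rich query, and an in-place update. The connection string, database, collection, and field names are all assumptions.

```python
from pymongo import MongoClient  # pip install pymongo

client = MongoClient("mongodb://localhost:27017")  # connection string is an assumption
db = client["demo"]

# Documents are JSON-like; no schema has to be declared up front.
db.articles.insert_one({"title": "Big Data", "tags": ["hadoop", "spark"], "views": 10})

# Index on any attribute, then run a rich query against it.
db.articles.create_index("tags")
for doc in db.articles.find({"tags": "spark", "views": {"$gt": 5}}):
    print(doc)

# Fast in-place update: increment a counter on the server side,
# without rewriting the whole document from the client.
db.articles.update_one({"title": "Big Data"}, {"$inc": {"views": 1}})
```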
8. Elasticsearch: It is a real-time, distributed, open-source full-text search and analytics engine. It is highly scalable, handling structured and unstructured data up to petabytes, and it can be used as a replacement for document-based stores such as MongoDB and RavenDB. To improve search performance, it uses denormalization. A real-world use case is as an enterprise search engine: big organizations such as Wikipedia and GitHub use it.
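As a final sketch, the snippet below indexes a document and runs a full-text match query using the official Python client (keyword arguments follow the 8.x client API). The URL, index name, and document fields are assumptions.

```python
from elasticsearch import Elasticsearch  # pip install elasticsearch

# The URL and the "articles" index are assumptions for this sketch.
es = Elasticsearch("http://localhost:9200")

es.index(index="articles", id=1, document={"title": "Intro to Big Data", "views": 42})
es.indices.refresh(index="articles")  # make the document searchable immediately

# Full-text search: match is analyzed, so "big data" finds "Big Data".
resp = es.search(index="articles", query={"match": {"title": "big data"}})
for hit in resp["hits"]["hits"]:
    print(hit["_source"])
```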