Big data refers to collections of large, complex datasets that are difficult to store and process using traditional databases and data processing tools. It is collected from both traditional and digital sources and, when refined properly, can be used for research and analysis. As organizations grow, the data they generate grows rapidly as well. One challenge is to build a platform that provides a single, consistent view of all of this data. Another is to organize the data so that it makes sense and can be used as information. Almost everything around us generates data continuously, and social media websites and other digital sources produce it in huge amounts. This data is transmitted through sensors, mobile devices and computer systems.
Where is this Big Data coming from?
- Social media: Companies like Facebook and Google collect data from the activities we perform on their platforms. Other examples are YouTube, Twitter, LinkedIn, blogs, SlideShare, Instagram, Chatter, WordPress, Jive, etc.
- Public Web: This includes data coming from Wikipedia, health care services, the World Bank, government, weather, traffic, etc.
- Archives: This includes archives of any data like medical records, customer correspondence, insurance forms, scanned documents, etc.
- Docs: Documents in any format, including HTML, CSV, PDF, XLS, Word, XML, etc., are sources of big data.
- Media: Images, video, audio, live streams, podcasts, etc.
- Data storage: The various databases and file systems used to store data also serve as sources of big data.
- Machine Log Data: Data coming from server logs, application logs, audit logs, call detail records (CDRs), various mobile apps, mobile location, etc.
- Sensor Data: Data from sensors connected to medical devices, road cameras, satellites, traffic surveillance devices, video games, household appliances, air conditioning units, office buildings etc.
Three Vs of Big data
Three Vs define big data: variety, volume and velocity.
- Variety: Data is stored in many formats, e.g., databases, MS Access, MS Excel, plain text and many more. It can also take the form of PDF, video or SMS. The challenge is to arrange this data so that it becomes meaningful, which is easier when the data is in a common format.
- Volume: The volume of data coming from multiple sources is huge. As this volume increases, organizations need to reevaluate their architecture and applications.
- Velocity: Velocity refers to the speed at which data is generated and has to be processed. In earlier days, yesterday’s data was considered recent, but today that holds only for newspapers. Almost everything else is updated within fractions of a second. News channels, radio, tweets, Facebook posts and comments update so fast that data from a few minutes ago is already considered old.
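The variety challenge described above, arranging data from different formats so that it becomes meaningful, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sample CSV and JSON inputs, the field names and the helper functions `from_csv` and `from_json` are all hypothetical and exist only to show two sources being mapped into one common record shape.

```python
import csv
import io
import json

# Hypothetical sample inputs: the same kind of customer record in two formats.
CSV_TEXT = "id,name,city\n1,Asha,Bangalore\n2,Ravi,Pune\n"
JSON_TEXT = '[{"id": 3, "full_name": "Meera", "location": {"city": "Delhi"}}]'


def from_csv(text):
    """Parse CSV rows into the common {id, name, city} shape."""
    reader = csv.DictReader(io.StringIO(text))
    return [
        {"id": int(row["id"]), "name": row["name"], "city": row["city"]}
        for row in reader
    ]


def from_json(text):
    """Parse JSON objects that use different field names into the same shape."""
    return [
        {"id": int(obj["id"]), "name": obj["full_name"], "city": obj["location"]["city"]}
        for obj in json.loads(text)
    ]


# Once both sources share one shape, they can be combined and analyzed together.
records = from_csv(CSV_TEXT) + from_json(JSON_TEXT)
for record in records:
    print(record)
```

Each source needs its own small adapter, but everything downstream only has to understand one format, which is why the data is easier to work with once it is in a common form.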
Big data is a mixture of unstructured, structured and multi-structured data.
- Structured data: Data that has a defined format and is organized in a predefined schema is called structured data. Data coming from traditional databases and repositories such as mainframes, SQL Server, Oracle, DB2, Sybase, Access, Excel, txt and Teradata are examples of structured data. A relational database management system (RDBMS) is designed primarily for this kind of data.
- Unstructured data: Data that is unorganized and not easy to interpret using traditional databases or data models is called unstructured data. Examples include data from social media such as Chatter, free text used for text analytics, blogs, tweets, comments, clicks, tags, etc.
- Multi-structured data: Multi-structured data is unmodelled and needs to be organized; a schema may exist, but it is ignored. It can be derived from interactions between humans and machines. This includes emerging market data, e-commerce data and other third-party data such as weather, currency conversion, demographic and panel data.
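The difference between the three kinds of data can be made concrete with a small Python sketch. It is a rough illustration, not a definitive classification: the `orders` table, the sample tweet and the JSON event lines are all made-up examples, and the interpretation of multi-structured data as records with differing structure is an assumption.

```python
import json
import re
import sqlite3

# Structured: rows fit a predefined schema and can be queried with SQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "asha", 250.0), (2, "ravi", 99.5), (3, "asha", 50.0)],
)
total_for_asha = conn.execute(
    "SELECT SUM(amount) FROM orders WHERE customer = ?", ("asha",)
).fetchone()[0]
conn.close()

# Unstructured: free text has no schema, so structure must be extracted from it.
tweet = "Loving the new phone! #bigdata #analytics"
hashtags = re.findall(r"#(\w+)", tweet)

# Multi-structured: each record carries some structure, but not the same one.
events = [
    '{"type": "click", "page": "/home"}',
    '{"type": "purchase", "amount": 19.99, "currency": "USD"}',
]
fields_seen = sorted({key for line in events for key in json.loads(line)})

print(total_for_asha)  # 300.0
print(hashtags)        # ['bigdata', 'analytics']
print(fields_seen)     # ['amount', 'currency', 'page', 'type']
```

The structured rows can be queried directly, while the text and the mixed event records need an extra extraction step before they can be analyzed.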
About the Author:
Vaishnavi Agrawal loves pursuing excellence through writing and has a passion for technology. She has successfully managed and run personal technology magazines and websites. She currently writes for Intellipaat. She is based in Bangalore and has 5 years of experience in content writing and blogging. Her work has been published on various sites related to Hadoop, Big Data, Business Intelligence, Cloud Computing, IT, SAP, Project Management and more.