Saturday, December 1, 2012

Big Data - 1: A big picture


Every day we are bombarded with big data terminology such as NoSQL, RavenDB, Hadoop, MongoDB and so on. Since this is an emerging area in IT, many of us struggle to understand what these terms mean and how the technologies behind them can help us. I am not an expert in big data, but I will share what I have learned so far.



The diagram above gives a high-level view of the various big data terms and how they relate to each other.
  • Big Data : 
    • Big data is data that is big in terms of
      • Volume (terabytes, petabytes and above)
      • Velocity (the rate at which data is generated, transmitted and received)
      • Variety (text, pictures, videos and other formats of varying complexity).
    • Big data by itself is messy and hard to make sense of because, unlike traditional data, it is unstructured or semi-structured. It only becomes useful once the relevant information is pulled out of it. Extracting relevant information from data at this scale is the real problem, and it is one that traditional systems cannot handle well. This is where big data technologies come into the picture.
There are three technology classes that support big data: Hadoop, NoSQL and Massively Parallel Processing (MPP) databases. I will start with MPP databases, as they have been around for decades.
  • MPP Databases: These databases distribute data across nodes that each have their own storage and CPU, so a query can run on all nodes in parallel, which gives them great processing capability. They are often delivered as a special combination of hardware and software built to achieve this parallelism. The big products in the market are IBM's Netezza, HP's Vertica and EMC's Greenplum. Microsoft's SQL Server Parallel Data Warehouse (PDW) is a more recent MPP offering.
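  • Hadoop : Hadoop is an open-source framework from the Apache Software Foundation. It includes an implementation of MapReduce, a programming model popularized by Google, together with a distributed file system (HDFS). In general, MapReduce is a parallel programming framework: a job is split into map tasks that process chunks of data independently and reduce tasks that combine their results.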
  • NoSQL : NoSQL is commonly read as "not only SQL". It represents a class of databases that do not follow traditional RDBMS principles (for example, many relax strict transactional integrity) and that mainly deal with big data. There are four categories of NoSQL databases. Please refer to the diagram above for the database products that fall under the categories below. A few of the NoSQL databases work in conjunction with Hadoop components.
    • Document store
    • Graph Database
    • Key-Value store
    • Wide Column store
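To make the MapReduce idea mentioned under Hadoop a little more concrete, here is a minimal word-count sketch in plain Python. It does not use Hadoop or any of its APIs; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are my own and purely illustrative. Everything runs sequentially on one machine, whereas a real cluster would run the map and reduce steps on many nodes at once.

```python
from collections import defaultdict


def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle step: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce step: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    # On a cluster, each document (or block of data) would be mapped on a
    # different node; this illustrative loop simply runs them one by one.
    pairs = []
    for doc in documents:
        pairs.extend(map_phase(doc))
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data", "big picture"]))
```

The point of the split is that the map step needs nothing but its own chunk of input, and the reduce step needs nothing but the values for its own key, so both can be spread across many machines.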
I hope this post was useful for understanding the basic terminology of big data and the related technologies.

In the next big data post, I will provide more details on the NoSQL database categories.
