In my previous post I mentioned how more data can bring success to any business. The more data an organization has about its customers and targeted customers, their preferences, search patterns, demands, and so on, the more rich information it can filter out of that data, and this information helps the organization serve its customers better, which is key to any business's success. There are two big challenges we have to face here:
[1] How to store such a big data set?
[2] How to analyze such a big data set?
There are various technology stacks used in the industry to solve the above problems, and Hadoop is one of the most widely used technologies for working with big data.
Hadoop provides a reliable shared storage and analysis system. There are two major components of Hadoop:
a) HDFS: The Hadoop Distributed File System provides the storage capability for big data.
It is a file system that spans all the nodes in a Hadoop cluster and links the file systems on many local nodes together into one big file system. HDFS assumes nodes will fail, so it achieves reliability by replicating data across multiple nodes.
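To make this concrete, here is a minimal sketch of writing a file to HDFS through Hadoop's Java FileSystem API. It assumes the usual core-site.xml / hdfs-site.xml are available on the classpath, and the path /user/demo/hello.txt is just a hypothetical example.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        // Picks up the cluster settings (NameNode address, replication factor, ...)
        // from core-site.xml / hdfs-site.xml on the classpath.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical path; HDFS presents one namespace spanning the whole cluster.
        Path path = new Path("/user/demo/hello.txt");

        // The blocks of this file are replicated across multiple DataNodes,
        // which is how HDFS stays reliable when individual nodes fail.
        try (FSDataOutputStream out = fs.create(path, true)) {
            out.writeUTF("hello hdfs");
        }

        System.out.println("Stored " + fs.getFileStatus(path).getLen() + " bytes");
    }
}

The same file could be copied in from the shell with hdfs dfs -put, but the point is the same: the client talks to one logical file system, and replication happens behind the scenes.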
b) MapReduce: This provides the capability to analyze the data stored in HDFS. It splits the processing into tasks and assigns work to the nodes in the cluster.
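To give a feel for how MapReduce jobs are written, below is a minimal sketch of the classic word-count job using the org.apache.hadoop.mapreduce API; the input and output directories passed on the command line are placeholders.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: emit (word, 1) for every word in the input split.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: sum all the 1s emitted for each word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // local pre-aggregation on the map side
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory (must not already exist)
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Each map task is ideally scheduled on a node that already holds its input split locally, and a failed task is simply rerun on another healthy node, which ties back to the data locality and shared-nothing points discussed further below.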
An obvious question at this point is: why can't we use an RDBMS for this? There are some clear reasons why an RDBMS is not the best solution for handling such a large data set:
- Traditional RDBMS systems are designed to handle gigabytes of data, whereas Hadoop was designed to solve the problem of handling petabytes of data.
- In an RDBMS the data set is structured (static schema), whereas Hadoop can also work with raw, unstructured, or semi-structured data (dynamic schema).
- Traditional RDBMS scaling is non-linear, while Hadoop was designed with the clear goal of linear scaling.
MapReduce tries to co-locate the data with the compute nodes, so data access is fast because it is local (the data locality feature), and this is one of the reasons for its good performance. Coordinating the processes in a large-scale distributed computation is a challenge. There are harder aspects as well, for example handling job failures, tracking the progress of tasks on individual nodes, re-running any failed task on another node, and monitoring the progress of individual jobs as well as the overall analysis task. MapReduce spares the programmer from having to think about failure, since the implementation detects failed map or reduce tasks and reschedules replacements on machines that are healthy. MapReduce is able to do this because it has a shared-nothing architecture, meaning the tasks have no dependency on one another. In my next blog I will try to give more details about the individual Hadoop components.