Saturday, January 5, 2013

Monotonically increasing sequence generator in a distributed environment !!!

For some time now at my office we have been working on moving away from our Oracle dependency, and we have already migrated many data-storage dependencies from Oracle DB to other NoSQL data stores. At this point we are facing one major challenge, and after scratching my head many times and doing a lot of googling I couldn't find any solution that we could implement quickly. So I feel it is worth pointing out why it is a real challenge to generate monotonically increasing sequences in a distributed environment. In our current system we use around 40 different Oracle DB instances (not any RAC) which give us monotonically increasing sequences, but now we need to replace this with a central solution which can serve monotonically increasing values for a thousand different sequences and handle 100,000 next-value requests for these sequences within 5 minutes (roughly 330 requests per second).

While going through some Oracle-related articles I found that even in an Oracle RAC cluster, if we need high performance and use the sequence caching mechanism, each node in the RAC cluster caches a block of sequence values, and the value we get for SEQUENCE.NEXTVAL depends on which node serves a particular request. So even with Oracle RAC we do not get monotonically increasing values; it only guarantees that the values returned stay within the allocated ranges and that no value is ever returned more than once.
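Just to make this caching behaviour concrete, here is a small toy Java sketch I wrote (it is only an illustration, not Oracle's actual implementation): each node grabs a cache of 20 values from a shared allocator, so the values are unique and stay within the allocated ranges, but interleaving requests across nodes breaks global ordering.

import java.util.concurrent.atomic.AtomicLong;

// Toy model of per-node sequence caching (illustration only, not Oracle's implementation).
class CachedSequenceNode {
    private static final AtomicLong globalAllocator = new AtomicLong(500); // last allocated value
    private static final int CACHE_SIZE = 20;

    private long next;      // next value this node will hand out
    private long rangeEnd;  // last value in this node's cached range

    synchronized long nextVal() {
        if (next == 0 || next > rangeEnd) {            // cache exhausted: grab a new range
            rangeEnd = globalAllocator.addAndGet(CACHE_SIZE);
            next = rangeEnd - CACHE_SIZE + 1;
        }
        return next++;
    }

    public static void main(String[] args) {
        CachedSequenceNode node1 = new CachedSequenceNode();
        CachedSequenceNode node2 = new CachedSequenceNode();
        System.out.println(node1.nextVal()); // 501
        System.out.println(node2.nextVal()); // 521 -- node2 cached the range 521..540
        System.out.println(node1.nextVal()); // 502 -- smaller than the previous value: not monotonic
    }
}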


If we want to generate monotonically increasing sequences in a distributed environment, we can think of the problem as a group of 3 people acting as 3 cluster nodes: "Ballu", "Chunky" and "Dee". Our client node "Tyagi" will keep asking these guys for the next value of a sequence named "LAMBU". At the start all 3 know that the current value of LAMBU is 500, and the scenario is as shown in the picture above.

Now suppose the first request comes from Tyagi and the load balancer redirects it to Ballu, who increases the current value of sequence LAMBU to 501 and returns it to Tyagi. But now there are other concerns:

Ballu needs to update Chunky and Dee to increase their current value to 501 as well, so that they never return the same value 501 for another of Tyagi's requests.

[1] There is still an issue. In a system where hundreds of requests arrive every second, it is quite possible that before Ballu updates Chunky and Dee about this new sequence value, one of them has already returned 501 for another of Tyagi's requests, as the sketch below shows.
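Here is a toy Java sketch of this race (the node class and method names are my own invention, purely for illustration): Ballu answers the client first and pushes the new value to his peers only afterwards, so a request that lands on Dee in between gets the same 501.

import java.util.ArrayList;
import java.util.List;

// Toy illustration of problem [1]: the peer update is asynchronous, so a request
// that lands on another node before the update arrives gets a duplicate value.
class SequenceNode {
    private final String name;
    private long current = 500;                        // everyone starts at 500
    private final List<SequenceNode> peers = new ArrayList<>();

    SequenceNode(String name) { this.name = name; }
    void addPeer(SequenceNode p) { peers.add(p); }

    long nextVal() {
        long value = ++current;
        System.out.println(name + " returns " + value);
        return value;                                  // the client is answered here
    }

    void propagate() {                                 // ... and peers hear about it only later
        for (SequenceNode p : peers) p.current = Math.max(p.current, current);
    }

    public static void main(String[] args) {
        SequenceNode ballu = new SequenceNode("Ballu");
        SequenceNode dee = new SequenceNode("Dee");
        ballu.addPeer(dee);
        dee.addPeer(ballu);

        ballu.nextVal();      // Ballu returns 501
        // the load balancer sends the next request to Dee before ballu.propagate() runs
        dee.nextVal();        // Dee returns 501 -- duplicate!
        ballu.propagate();    // too late, 501 has already been handed out twice
    }
}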

[2] Someone may suggest that we keep a shared storage system behind Ballu, Dee and Chunky. My answer is that it would again make the application a single point of failure and increase latency for the requests coming from Tyagi. If for every request I have to sync with a back-end data store, what benefit am I really getting by running Ballu, Chunky and Dee in a distributed way?

[3] Another solution I can think of is that before returning the value 501 for Tyagi's request, Ballu first tells Dee and Chunky that they should not use 501 any more. The problem here is that Ballu now has to wait until he gets an acknowledgement back from both Chunky and Dee. There can also be a scenario where Ballu sends the update "Ballu will use 501, please don't use it" to Chunky and Dee, and before this update reaches Dee, Dee gets another request from Tyagi, also wants to use 501, and sends the same update to Ballu and Chunky: "Dee will use 501, please don't use it". How can we avoid such collisions? What will happen to the nextValue request Tyagi had sent? And as the number of nodes in the system grows, won't all this sending of updates and waiting for acknowledgements become a bottleneck?

[4] One thing seems sure to me: there is no way a node can handle a request without talking to the other nodes in the cluster and still have the system return monotonically increasing sequences. The only optimization we can do is to reduce this communication while still keeping the promise of monotonically increasing sequences !!! I still need to think a lot about this and there seems to be no easy solution for now, but I would love it if someone could prove me wrong !!! One simple baseline is sketched below.
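The simplest baseline I can think of (again only a sketch under my own assumptions, not a finished design) is to funnel every nextValue call for a sequence through a single coordinator node; ordering is then trivial, but the coordinator becomes a bottleneck and a single point of failure unless leader election and replication are added on top.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Minimal single-coordinator sketch: all requests for a sequence go to one node,
// so monotonicity is trivial; availability and failover are NOT handled here.
class SequenceCoordinator {
    private final Map<String, AtomicLong> sequences = new ConcurrentHashMap<>();

    long nextVal(String sequenceName) {
        return sequences
                .computeIfAbsent(sequenceName, k -> new AtomicLong(500))
                .incrementAndGet();
    }

    public static void main(String[] args) {
        SequenceCoordinator coordinator = new SequenceCoordinator();
        System.out.println(coordinator.nextVal("LAMBU")); // 501
        System.out.println(coordinator.nextVal("LAMBU")); // 502
    }
}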








Sunday, December 16, 2012

Hadoop .. Working with Big Data.

In my previous post I mentioned how more data can bring success to any business: the more data an organization has about its customers / targeted customers, their preferences, search patterns, demands etc., the more rich information it can filter out, and this information helps the organization serve its customers better, which is key to any business's success. There are 2 big challenges we have to face here:
[1] How to store this big data set ?
[2] How to do analysis on such a big data set ?

 There are various technology stacks used in the industry to solve the above problems, and Hadoop is one of the most widely used technologies for working with big data.
Hadoop provides a reliable shared storage and analysis system. It has 2 major components:

a) HDFS : The Hadoop Distributed File System, which provides the capability to store big data.
It is a file system that spans all the nodes in a Hadoop cluster for data storage. It links together the file systems on many local nodes to make them into one big file system. HDFS assumes nodes will fail, so it achieves reliability by replicating data across multiple nodes. A small usage example follows below.
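To give a feel of how a client talks to HDFS, here is a small example using Hadoop's Java FileSystem API (the path and file content are made up); HDFS takes care of splitting the file into blocks and replicating them across nodes.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Write a small file into HDFS and read it back; HDFS handles block
// placement and replication across the cluster behind the scenes.
public class HdfsHello {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();        // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);

        Path path = new Path("/user/demo/hello.txt");    // made-up path
        try (FSDataOutputStream out = fs.create(path, true)) {
            out.writeUTF("hello hdfs");
        }
        try (FSDataInputStream in = fs.open(path)) {
            System.out.println(in.readUTF());
        }
    }
}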

b) MapReduce : This provides the capability to analyze the data stored in HDFS. It understands where the data lives and assigns the processing work to the nodes in the cluster accordingly; the classic word-count example below shows the programming model.
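To give a flavour of the programming model, here is the classic word-count job written against Hadoop's Java MapReduce API (more or less as it appears in most Hadoop tutorials; the input and output paths are placeholders).

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Classic word-count job: map tasks run on the nodes holding the data blocks,
// reduce tasks aggregate the per-word counts into the final result.
public class WordCount {

    public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);             // emit (word, 1)
                }
            }
        }
    }

    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            context.write(key, new IntWritable(sum));     // emit (word, total count)
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. /user/demo/input
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // e.g. /user/demo/output
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}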

An obvious question which comes to mind at this point is: why can't we use an RDBMS for this? There are some clear reasons why an RDBMS is not the best solution for handling such large data sets:
  • Traditional RDBMS systems are designed to handle gigabytes of data, whereas Hadoop was designed to solve the problem of handling petabytes of data.
  • In an RDBMS the data set is more structured (static schema); Hadoop can even work with raw / un- / less structured data (dynamic schema).
  • Scaling of a traditional RDBMS is non-linear, while Hadoop was designed with the clear goal of linear scaling.
  In layman's terms, Hadoop first replicates each data set across individual nodes (machines that both store and process data) using HDFS; to analyze the data set, each node runs the analysis / operation task on the data stored on that node. MapReduce then gives the capability to aggregate the data from these nodes by running reduce tasks and producing the final result set.

 MapReduce tries to collocate the computation with the data, so data access is fast since it is local (the data-locality feature), which is one of the reasons for its good performance. Coordinating the processes in a large-scale distributed computation is a challenge; the harder aspects include handling job failures, tracking the progress of jobs on individual nodes, rerunning any failed task on another node, and monitoring both the individual jobs and the overall analysis task. MapReduce spares the programmer from having to think about failure, since the implementation detects failed map or reduce tasks and reschedules replacements on machines that are healthy. MapReduce is able to do this because it uses a shared-nothing architecture, meaning the tasks have no dependency on one another. In my next blog I will try to give more details about the individual Hadoop components.




Tuesday, October 9, 2012

Big Data ! Big Challenge !! Big Win !!!

 While reading an article on a blog I found a very interesting statement: "Big data can beat a better algorithm" ... It was totally opposite to some of my college lectures, which said an O(n) algorithm is always better than an O(n * n) one. A few questions come to mind:
  •  What is this big data ?
  •  How is this getting generated ? 
  • What challenges come with it ?
  • How can this data be helpful for an organization ?
 Based on one IDC survey, the size of the "digital universe" was 0.18 zettabytes in 2006 and grew ten times by 2011 to 1.8 zettabytes (roughly equivalent to every person on earth having one disk drive full of data). The rate at which this digital data is generated is still increasing. We can say we have already crossed the time when a software programmer had to think about data in GB or even TB; now a programmer has to think about the every-second-growing tsunami of the digital universe known as "Big Data".

How is this data getting generated? Every human being with connectivity to the WWW, a telephone network or a pager is contributing to this pile of big data. We add to it by sending an email, clicking Like on Facebook, calling a friend, publishing a blog. We go to our favorite song site and listen to a few songs; the website keeps track of all these activities, adding them to this big data.

In my previous post I mentioned that there is a certain limit after which a single machine can't store or process the data! Then how will this large data be stored and processed? How will rich information be extracted out of it? The answer is: when one guy can't do it .. distribute it between 2 guys. So to store these large data sets we need distributed storage ... and distributed computation that operates on this distributed data. There are again challenges: what if I divide my data across 10 machines and 1 of them dies? How will I know on which machine my particular part of the data is available? Will my computation program have to go to each and every machine to search for this data? How will these multiple computation units coordinate? There are many questions to be answered and I will try to uncover them one by one in my upcoming posts, but before that let's jump to the really big question: is this big data really helpful for us?

Yes It Is !! Your favorite song website will only be able to give you nice suggestions if it has all your previous visit records ... suggestions from your friends .. your geographic location ... your recent likes ... your age ...... blah blah .... So without this big data the song website can't give you any valuable song suggestion, even if it is using the best data-analysis algorithm ....... so I guess the article was right: with big data we can get a Big Win ...

Sunday, September 23, 2012

Why Parallel and Distributed Programming ??

Before really commenting on "Parallel or Distributed Computing" I was curious to know why we really need it. I remember the famous quote loosely attributed to Moore:
"CPUs get twice as fast every 18 months ...."
(strictly, Moore's law is about transistor counts doubling, but this is the popular paraphrase). I believe this law still somehow holds true, but then the big question is: "why do we need anything called parallel or distributed computing when I know I will have a 2x faster CPU in the next 18 months?"

[1] The data on the web is more than doubling every few months .... In the last few years we have been adding more and more devices which put more and more data on the web ... iPhone, iPad, Kindle devices, all of these are new sources of data generation and access on the web.
[2] Moore's law might not keep holding true as time goes on: we are moving to a world where everyone prefers devices that are smaller and smaller in size and weight ... We are already reaching the point where chip features approach their minimum size, and getting more instruction executions per second out of a single chip will not be that easy.
[3] Even with a supercomputer it might not be easy or fast to create the 50 thousand animated soldiers in the battle scenes of "The Mummy" movie, where each animated soldier is composed of millions of frames and each frame has its own effects and attributes.

There are a hundred more reasons why computer scientists started looking into parallel and distributed systems around 20-30 years back, and today we have reached a stage where no software company in the world can breathe without distributed computing.

But first I have to understand what exactly distributed computing is.

I'll start with a very small example. Every software programmer starts his programming life with a for loop, so let's see how it works in different scenarios:

// each iteration loads a[i] and b[i] into memory, adds them, and stores the sum in c[i]
for (int i = 0; i < 100; i++)
{
    c[i] = a[i] + b[i];
}

A machine which has a single CPU core to execute this program:
For each iteration of i, the values of a[i] and b[i] are loaded into memory and then the addition happens. Suppose the time taken to load this data and add it is T; then it will take roughly 100 * T time to complete this execution ...

Parallel Programming:
Suppose you have an N-core CPU and your compiler (or runtime) is smart enough to exploit the parallel programming paradigm. It will keep loading data for N addition operations at a time and executing them in parallel, so the same for loop will take roughly ( 100 * T / N ) time.

In layman's terms, we can say that in parallel programming CPU instructions run in parallel and share the same in-memory data. There is no role of the network or data on the wire in it. A small example follows below.
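As an illustration (my own example, not part of the original loop), modern Java can express this shared-memory parallelism directly; the parallel stream below spreads the 100 additions across the available CPU cores.

import java.util.stream.IntStream;

// Same element-wise addition, but the iterations are spread across the CPU cores
// of a single machine; all threads share the same arrays in memory.
public class ParallelAdd {
    public static void main(String[] args) {
        int n = 100;
        int[] a = new int[n], b = new int[n], c = new int[n];
        for (int i = 0; i < n; i++) { a[i] = i; b[i] = 2 * i; }

        IntStream.range(0, n)
                 .parallel()                  // fork-join pool uses the available cores
                 .forEach(i -> c[i] = a[i] + b[i]);

        System.out.println(c[99]);            // 297
    }
}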

Now we move to Distributed Programming, where we have M machines, each with N CPU cores, and these machines are connected by a network. We again have the same task to run.
In this case each machine will be busy loading data into memory for N addition operations at a time and then doing the additions. This reduces the total time taken for our task to ( 100 * T / ( N * M ) ) + X ... (a partitioning sketch follows below).
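Here is a toy sketch of how the 100 iterations might be carved up across M machines; the shipping of the chunks over the network, which is where X comes from, is deliberately left out.

// Split 100 iterations across M machines; each machine then splits its chunk
// across its N cores. Network transfer and coordination (the "X") are omitted.
public class RangePartitioner {
    public static void main(String[] args) {
        int total = 100, machines = 4;                 // M = 4 in this toy example
        int chunk = (total + machines - 1) / machines; // ceiling division

        for (int m = 0; m < machines; m++) {
            int start = m * chunk;
            int end = Math.min(start + chunk, total);
            System.out.println("machine " + m + " handles indices [" + start + ", " + end + ")");
        }
    }
}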

What is this X? I will try to cover that in the next post of this blog.




Saturday, May 28, 2011

Facebook Design Technology Scaling

Today an idea just came to my mind: how is Facebook handling so many users ... what technologies are they using, and how? So I did a little bit of googling, and what I found I am just putting over here.

I found that developers at Facebook use a stack of open source technologies: LAMP (Linux, Apache, MySQL, PHP) with Memcache. I will give a brief introduction to each of these technologies. For now, Memcache is a distributed cache which makes Facebook users' requests fast.

MYSQL
For the database, Facebook utilizes MySQL because of its speed and reliability. MySQL is used primarily as a key-value store, with data randomly distributed amongst a large set of logical instances. These logical instances are spread out across physical nodes, and load balancing is done at the physical node level.
As far as customizations are concerned, Facebook has developed a custom partitioning scheme in which a global ID is assigned to every piece of data. They also have a custom archiving scheme based on how frequently and recently data is accessed on a per-user basis.
Most data is distributed randomly.
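As a rough illustration of the general idea (my own simplification, not Facebook's actual scheme), a global ID can be hashed to pick the logical MySQL instance that owns the row:

// Rough illustration of ID-based sharding: hash the global ID to pick a logical instance.
public class ShardPicker {
    private static final int LOGICAL_SHARDS = 4096;    // made-up number of logical instances

    static int shardFor(long globalId) {
        return (Long.hashCode(globalId) & 0x7fffffff) % LOGICAL_SHARDS;
    }

    public static void main(String[] args) {
        long userId = 123456789L;                       // hypothetical global ID
        System.out.println("user " + userId + " lives on logical shard " + shardFor(userId));
    }
}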

PHP
Facebook uses PHP because it is a good web programming language with extensive support and an active developer community, and it is good for rapid iteration. PHP is a dynamically typed / interpreted scripting language. Developers at Facebook have done a lot of customization here; they have written their own compiler (HipHop, which translates PHP into C++) to compile the code.

MemCache
This is the part I was most interested to know about. You should read the following journal article if you really want to know the details: http://www.linuxjournal.com/article/7451?page=0,0. I am giving a brief introduction here.

Memcached is a high-performance, distributed caching system. Although application-neutral, it's most commonly used to speed up dynamic Web applications by alleviating database load. Memcached is used on LiveJournal, Facebook, YouTube, Wikipedia and other high-traffic sites.

It is basically a collection of machines on which we cache data from the data source, so that most queries are answered from Memcache alone; because a query against this cached data is faster than querying the data source (the database), it makes the overall user experience faster.

Memcache stores data in the form of huge hash tables, so every query to Memcache is answered in O(1), or we can say constant, time. As already mentioned, all objects have a unique ID. Suppose you are on a page looking at your friend's pic and you click next: a query goes to Memcache requesting that pic's unique ID; if the object with the requested ID is in Memcache the query is answered from there, otherwise the application queries the database, caches the object in Memcache and replies back to the user. So we can see the benefit of Memcache is greatest when the maximum number of hits are served from Memcache itself. I don't want to go into much more depth, but I will be happy to discuss if you have any query and post it over here. A small sketch of this lookup flow is below.
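The lookup flow described above is the classic cache-aside pattern; here is a small Java sketch where the in-memory map and the loadFromDatabase method are just stand-ins for a real memcached client and the MySQL tier.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Cache-aside lookup: try the cache first, fall back to the database on a miss,
// then populate the cache so the next request for the same ID is served from memory.
public class PhotoService {

    // Stand-in for a memcached cluster (real deployments use a client library
    // that hashes keys across many cache machines).
    private final Map<String, byte[]> cache = new ConcurrentHashMap<>();

    // Stand-in for the MySQL tier.
    private byte[] loadFromDatabase(String photoId) {
        return ("photo-bytes-for-" + photoId).getBytes(); // pretend this is a slow DB query
    }

    byte[] getPhoto(String photoId) {
        byte[] cached = cache.get(photoId);
        if (cached != null) {
            return cached;                    // cache hit: no database round trip
        }
        byte[] fromDb = loadFromDatabase(photoId);
        cache.put(photoId, fromDb);           // populate the cache for next time
        return fromDb;
    }

    public static void main(String[] args) {
        PhotoService service = new PhotoService();
        service.getPhoto("42");               // miss: goes to the database
        service.getPhoto("42");               // hit: served from the cache
    }
}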