Cluster computing

Saturday, March 18, 2017

We started reading the paper "Big Data and Cloud Computing: A survey of the State-of-the-Art and Research Challenges" by Skourletopoulos et al. The paper compares data warehouses with big data as a cloud offering. IBM data scientists argue that the key dimensions of big data are volume, velocity, variety and veracity. The size and type of existing deployments show ranges along these dimensions. Many of these deployments get data from external providers. A Big Data as a service stack may get data from other big data sources, operational data stores, staging databases, data warehouses and data marts. Zheng et al. showed that service-generated big data includes service logs, service QoS and service relationships in the form of service identification and migration.

A cloud-based big data analytics service provisioning platform named CLAaaS is presented in the literature to help describe the significant features of workflow systems, such as multi-tenancy for a wide range of analytic tools and back-end data sources, user group customizations and web collaboration. This system consists of several layers, with the backend comprising data sources such as DBMSs, data warehouses and data streams. The data from these sources is mixed, preprocessed, filtered, aggregated, transformed and routinely operated on so that it can be staged for the next layer. The staged data is then modeled. The model may be imperfect, so it is estimated, validated and scored. With the model and the data together, results can be analyzed and visualized. With the results, the data may be further transformed or the model may be improved with feedback. The improvements come from result interpretations, predictions, prescriptions, action impact evaluation, visualization and so on.

This workflow may be familiar to many who work with graph databases because these are generally very large databases and require a similar drill. The difference is in the formalization of the data management and modeling steps. The algorithms and analysis can be similar to most NoSQL processing, whether batch or streaming. While graphs require their own query languages, it is likely that the industry will converge on a standard for these dialects (courtesy of Raghu Ramakrishnan).

#codingexercise

Get sum of all averages of subsequences

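// Requires: using System.Collections.Generic; using System.Linq;
// Collects every non-empty subsequence of A (chosen by position) into subsequences.
static void GetSumOfAllAverages(List<int> A, List<int> b, int start, int level, List<List<int>> subsequences)
{
    for (int i = start; i < A.Count; i++)
    {
        b.Add(A[i]);                          // b now holds level + 1 elements
        subsequences.Add(new List<int>(b));   // store a copy, not the shared buffer
        GetSumOfAllAverages(A, b, i + 1, level + 1, subsequences);
        b.RemoveAt(level);                    // backtrack before trying the next element
    }
}

// After calling with an empty b, start = 0 and level = 0:
double sumOfAllAverages = subsequences.Sum(x => x.Average());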
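
As a cross-check on the enumeration above, here is a minimal brute-force sketch in Python. It is not part of the original exercise; the function name is my own, and I use exact fractions only to avoid floating-point noise when comparing results.

```python
from fractions import Fraction
from itertools import combinations


def sum_of_all_averages_bruteforce(values):
    """Sum the average of every non-empty subsequence of values."""
    total = Fraction(0)
    for size in range(1, len(values) + 1):
        # combinations() picks by position, so repeated values stay distinct
        for subsequence in combinations(values, size):
            total += Fraction(sum(subsequence), size)
    return total


print(sum_of_all_averages_bruteforce([1, 2, 3]))  # prints 14
```

For [1, 2, 3] the singles contribute 6, the pairs 1.5 + 2 + 2.5 = 6 and the full list 2, which gives 14. The enumeration is exponential in the input size, so this is only useful for checking small cases.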

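// Closed form over the range A[start..end] with N elements: each element appears in
// C(N-1, n-1) subsequences of size n, so their averages add up to sum * C(N-1, n-1) / n.
static double GetSumOfAllAvgs(List<int> A, int start, int end)
{
    if (start > end) return 0;
    if (start == end) return A[start];
    int count = end - start + 1;
    int sum = A.GetRange(start, count).Sum();
    double sumOfAllAvgs = 0;
    for (int n = 1; n <= count; n++)
        sumOfAllAvgs += (double)sum * NChooseK(count - 1, n - 1) / n; // avoid integer division
    return sumOfAllAvgs;
}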

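static long NChooseK(int n, int k) // this can also be a dynamic programming method although not used below
{
    if (k < 0 || k > n) return 0; // k == n is valid: C(n, n) = 1
    // Multiplicative form instead of Factorial(n) / (Factorial(n - k) * Factorial(k)),
    // which overflows quickly; after step i the running value is C(n - k + i, i).
    long result = 1;
    for (int i = 1; i <= k; i++)
        result = result * (n - k + i) / i;
    return result;
}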
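
The comment on NChooseK mentions a dynamic programming alternative. The following is a rough Python sketch of that idea, assuming a Pascal's triangle table is acceptable for the input sizes at hand; the function names are mine and not from the exercise. It feeds the table into the same closed form, using exact fractions.

```python
from fractions import Fraction


def n_choose_k_table(n):
    """Build Pascal's triangle rows 0..n so that table[i][j] == C(i, j)."""
    table = [[0] * (n + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        table[i][0] = 1
        for j in range(1, i + 1):
            # C(i, j) = C(i-1, j-1) + C(i-1, j); entries past the diagonal stay 0
            table[i][j] = table[i - 1][j - 1] + table[i - 1][j]
    return table


def sum_of_all_averages(values):
    """Closed form: add total * C(N-1, n-1) / n over subsequence sizes n."""
    count = len(values)
    if count == 0:
        return Fraction(0)
    total = sum(values)
    choose = n_choose_k_table(count - 1)
    result = Fraction(0)
    for size in range(1, count + 1):
        result += Fraction(total * choose[count - 1][size - 1], size)
    return result


print(sum_of_all_averages([1, 2, 3]))  # prints 14
```

Since C(N-1, n-1) / n equals C(N, n) / N, the whole sum also collapses to total * (2^N - 1) / N, which for [1, 2, 3] is 6 * 7 / 3 = 14.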

Friday, March 17, 2017

Today we start reading the paper "Big Data and Cloud Computing: A survey of the State-of-the-Art and Research Challenges" by Skourletopoulos et al. The paper compares data warehouses with big data as a cloud offering. Gartner expects more than 20 billion connected devices by the year 2020, and the amount of data exchanged by sensors is going to far exceed the amount exchanged by human beings. The size of data is only growing, and many find it easier to work directly on such large-scale data. Big Data refers to very large and complex data sets that traditional data-processing systems are incapable of handling. For a more detailed comparison, I refer to an earlier blog post. The main takeaway is that Big Data is not only about storage but also about a different type of algorithm. These algorithms load, store and query data at massive scale in batches using a technique called MapReduce, and they can run in parallel across a distributed cluster. Social networks are one example of Big Data. Many cloud providers have established new datacenters for hosting social networking, business media content or scientific applications and services. In fact, storage from cloud providers is measured in gigabyte-months and compute is priced by the CPU-hour.
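
To make the MapReduce idea mentioned above a little more concrete, here is a toy, single-process word count in Python. It is only a sketch of the map, shuffle and reduce phases; a real framework would distribute each phase across the cluster, and the function names here are my own rather than any library's API.

```python
from collections import defaultdict


def map_phase(record):
    """Emit one (word, 1) pair per word in a line of text."""
    return [(word.lower(), 1) for word in record.split()]


def shuffle(pairs):
    """Group emitted values by key, as a framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values for one key into a single count."""
    return key, sum(values)


def word_count(records):
    pairs = [pair for record in records for pair in map_phase(record)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


print(word_count(["big data", "Big clusters"]))
# prints {'big': 2, 'data': 1, 'clusters': 1}
```

Because each record is mapped independently and each key is reduced independently, both phases can run in parallel, which is what lets the technique scale out.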

IBM data scientists argue that the key dimensions of big data are volume, velocity, variety and veracity. The size and type of existing deployments show ranges along these dimensions. Many of these deployments get data from external providers. A Big Data as a service stack may get data from other big data sources, operational data stores, staging databases, data warehouses and data marts. Typically the operational data stores, staging databases and warehouses hold relational data. Data marts allow analysis over dimensions along a cube. Big Data sources can include source systems in Compliance, Trading, CRM, Research, Finance, MDM, Pricing and other IoT data sources.

Zheng et al. described a Big Data as a service offering for service-generated data. They showed that the stack for this service includes all three layers of analytics, platform and infrastructure, in that hierarchy. The data feeding into this service comes from service-generated big data, which includes service logs, service quality of service (QoS) and service relationships. Log analysis is useful for visualization and diagnosis. QoS data supports fault tolerance and prediction. Service relationships support service identification and migration.

#codingexercise

Count all Palindromic subsequences in a given string

// Counts non-empty palindromic subsequences of A[start..end], distinguishing positions.
static int GetCountPalin(string A, int start, int end)
{
    // An empty range is reached when two adjacent characters differ.
    if (string.IsNullOrEmpty(A) || start > end) return 0;
    // Debug.Assert(start >= 0 && end < A.Length);
    if (start == end) return 1;
    int count = 0;
    if (A[start] == A[end])
    {
        count += GetCountPalin(A, start + 1, end);
        count += GetCountPalin(A, start, end - 1);
        count += 1;
    }
    else
    {
        count += GetCountPalin(A, start + 1, end);
        count += GetCountPalin(A, start, end - 1);
        count -= GetCountPalin(A, start + 1, end - 1); // counted twice above
    }
    return count;
}
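
The recursion above revisits the same (start, end) ranges many times, so it grows exponentially with the string length. Below is a minimal memoized sketch in Python of the same recurrence. The function name is my own, and I am assuming inputs short enough to stay within Python's default recursion limit.

```python
from functools import lru_cache


def count_palindromic_subsequences(text):
    """Count non-empty palindromic subsequences, distinguishing positions."""

    @lru_cache(maxsize=None)
    def count(start, end):
        if start > end:
            return 0  # empty range
        if start == end:
            return 1  # a single character is a palindrome
        if text[start] == text[end]:
            return count(start + 1, end) + count(start, end - 1) + 1
        # the inner range was counted by both calls, so subtract it once
        return (count(start + 1, end) + count(start, end - 1)
                - count(start + 1, end - 1))

    return count(0, len(text) - 1) if text else 0


print(count_palindromic_subsequences("aab"))  # prints 4
```

For "aab" the four palindromic subsequences are the two separate "a" characters, "b" and "aa". With the cache, each of the O(n^2) ranges is computed once.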

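// Requires: using System.Collections.Generic; using System.Linq; using System.Text;
// Brute force: builds every subsequence of A in b and keeps the palindromic ones.
static void Combine(string A, StringBuilder b, int start, int level, List<string> palindromecombinations)
{
    for (int i = start; i < A.Length; i++)
    {
        b.Append(A[i]);                       // b now holds level + 1 characters
        if (IsPalindrome(b.ToString()))
            palindromecombinations.Add(b.ToString());
        Combine(A, b, i + 1, level + 1, palindromecombinations);
        b.Length = level;                     // backtrack before trying the next character
    }
}

static bool IsPalindrome(string s)
{
    return s.SequenceEqual(s.Reverse());
}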