Wednesday, March 22, 2017

We continue reading the paper "Big Data and Cloud Computing: A survey of the State-of-the-Art and Research Challenges" by Skourletopoulos et al. The paper compares data warehouses with big data as a cloud offering. IBM data scientists argue that the key dimensions of big data are volume, velocity, variety and veracity. A Big Data as a service stack may get data from other big data sources, operational data stores, staging databases, data warehouses and data marts. Zheng et al. showed that service-generated big data includes service logs, service QoS and service relationships in the form of services identification and migration. A cloud-based big data analytics service provisioning platform named CLAaaS is presented in the literature to help describe the significant features of workflow systems, such as multi-tenancy for a wide range of analytic tools and back-end data sources, user group customizations and web collaboration. Big Data as a service was introduced in order to provide common big data services, boost efficiency and reduce cost. The articulation of this cost is important to people considering a switch from a data warehouse to Big Data. One metric proposed by the authors involves the yearly aggregation of the initial monthly cost of leasing cloud storage, expressed in monetary units, times the maximum storage capacity.
This costing is comparable to that of an Enterprise Search service hosted in the cloud, such as CloudWatch.
There are three remarkable features of hosting the indexing service in the cloud that often go missing from on-premise deployments. First, the service creates an index pipeline instead of a store, so that different sources can contribute to the pipeline for indexing. Second, a staging XML or JSON cache is used, so that configuration changes are made against the staging cache rather than per data source. This makes rebuilding the index unnecessary when data sources are added or when the configuration for those data sources changes. Finally, by hosting the service in the cloud or on a cluster from the cloud, the service can be elastic and expand to as many resources as necessary, with automatic rollover, load balancing and differentiation in terms of components.
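As a rough illustration of the storage cost metric described above, here is a minimal Java sketch. It assumes one particular reading of the metric: twelve months times the initial monthly price per unit of storage times the maximum storage capacity. The class name, method name, units and numbers are all hypothetical and are not taken from the paper.

```java
// Hypothetical sketch of one reading of the yearly storage-leasing cost metric.
final class StorageLeaseCost {
    // Assumption: cost = 12 months x monthly price per GB x maximum capacity in GB.
    static double yearlyCost(double monthlyCostPerGb, double maxCapacityGb) {
        return 12 * monthlyCostPerGb * maxCapacityGb;
    }

    public static void main(String[] args) {
        // Illustrative numbers only: 0.25 monetary units per GB per month, 1000 GB leased.
        System.out.println(yearlyCost(0.25, 1000.0));
    }
}
```

The point of such a calculation is only to put the leasing cost on a yearly footing so that it can be compared against the yearly cost of running a data warehouse.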
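The first two features can be pictured with a small sketch. The Java code below is only an illustration under stated assumptions: the class IndexPipelineSketch, the StagedDoc shape and the tokenization rule are all hypothetical and do not come from any real search product. Every source hands its documents to the pipeline in one staged shape, and the index is updated incrementally from that staged form, so adding a new source does not force a rebuild.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch: none of these names come from a real search service.
final class IndexPipelineSketch {
    // The common staged shape every source is converted to before indexing.
    static final class StagedDoc {
        final String source;
        final String id;
        final String text;

        StagedDoc(String source, String id, String text) {
            this.source = source;
            this.id = id;
            this.text = text;
        }
    }

    // Stands in for the staging XML/JSON cache shared by all sources.
    private final List<StagedDoc> staging = new ArrayList<>();
    // Inverted index: term -> ids of the documents that contain it.
    private final Map<String, Set<String>> index = new HashMap<>();

    // Any source contributes by staging a document; only that document is indexed.
    void contribute(String source, String id, String text) {
        StagedDoc doc = new StagedDoc(source, id, text);
        staging.add(doc);
        // The tokenization rule applies to the staged form, not to a particular source.
        for (String term : doc.text.toLowerCase().split("\\W+")) {
            if (!term.isEmpty()) {
                index.computeIfAbsent(term, t -> new TreeSet<>()).add(doc.id);
            }
        }
    }

    Set<String> search(String term) {
        return index.getOrDefault(term.toLowerCase(), Collections.<String>emptySet());
    }

    int stagedCount() {
        return staging.size();
    }

    public static void main(String[] args) {
        IndexPipelineSketch pipeline = new IndexPipelineSketch();
        pipeline.contribute("wiki", "w1", "Cloud search basics");
        // A second source joins later; existing index entries are left untouched.
        pipeline.contribute("tickets", "t1", "Search outage report");
        System.out.println(pipeline.search("search"));
    }
}
```

Elasticity, the third feature, is a deployment concern and is not modeled here.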
#codingexercise
Given a string S and a library of strings B = {b1, ... bm}, construct an approximation of the string S by using copies of strings in B. For example B = { abab, bbbaaa, ccbb, ccaacc} and S = abaccbbbaabbccbbccaabab.
The solution consists of taking strings from B and assigning them to non-overlapping positions in S. Strings from B may be used multiple times. Each unmatched character in S costs alpha, and each mismatched character costs beta. The mismatch between positions i and j is the number of mismatched characters in b[j] when it is aligned starting at position i in S. We compute the optimal costs of positions 1 to n in the given string. We determine the optimal cost of position k as a function of the costs of the previous positions.
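// Optimals[k] is the minimum cost of the first k characters of S; it needs S.length() + 1 entries.
void GetCost(String S, List<String> B, int alpha, int beta, int[] Optimals)
{
    int n = S.length();
    Optimals[0] = 0; // the empty prefix costs nothing
    for (int k = 1; k <= n; k++)
    {
        Optimals[k] = Optimals[k-1] + alpha; // default: leave character k unmatched
        for (int j = 0; j < B.size(); j++)
        {
            int p = k - B.get(j).length(); // characters of S before the copy of B[j] that ends at k
            if (p < 0) continue;           // B[j] does not fit in the first k characters
            Optimals[k] = Math.min(Optimals[k], Optimals[p] + beta * Mismatch(S, B.get(j), p));
        }
    }
}
// Counts the characters of b that differ from S when b is aligned at 0-based offset p.
int Mismatch(String S, String b, int p)
{
    int count = 0;
    for (int t = 0; t < b.length(); t++) if (S.charAt(p + t) != b.charAt(t)) count++;
    return count;
}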
The purpose of the above code is to show that we find the optimal solution by evaluating each position as a candidate end point for a string from the given list. Either the current character is left unmatched, or it is the last character of a copy of some string from the list. We assume the former by default and note down the cost as the previous position's cost plus alpha. Then, for each string in the list that fits within the prefix ending at the current position, we align it so that it ends at the current position; if it starts right after position p, its cost is the optimal cost already computed up to p plus the number of mismatched characters in that alignment times the unit cost beta. We keep the smallest of these candidates. As we can see, the previous computations aid the later ones, so we use memoization by keeping track of the optimal cost for each position.
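To try the recurrence on the example from the exercise, here is a self-contained Java sketch. The class name StringApproximation, the method names, and the choice alpha = beta = 1 are assumptions made for illustration; the exercise itself does not fix the values of alpha and beta.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative sketch of the position-by-position recurrence; names are hypothetical.
final class StringApproximation {
    // Number of characters of b that differ from s when b is aligned at 0-based offset start.
    static int mismatch(String s, String b, int start) {
        int count = 0;
        for (int t = 0; t < b.length(); t++) {
            if (s.charAt(start + t) != b.charAt(t)) {
                count++;
            }
        }
        return count;
    }

    // Returns optimal[0..n], where optimal[k] is the minimum cost of the first k characters of s.
    static int[] optimalCosts(String s, List<String> library, int alpha, int beta) {
        int n = s.length();
        int[] optimal = new int[n + 1]; // optimal[0] = 0: the empty prefix costs nothing
        for (int k = 1; k <= n; k++) {
            // Default: character k is unmatched.
            optimal[k] = optimal[k - 1] + alpha;
            for (String b : library) {
                int p = k - b.length(); // characters before the copy of b that ends at k
                if (b.isEmpty() || p < 0) {
                    continue; // b is unusable here
                }
                optimal[k] = Math.min(optimal[k], optimal[p] + beta * mismatch(s, b, p));
            }
        }
        return optimal;
    }

    public static void main(String[] args) {
        List<String> library = Arrays.asList("abab", "bbbaaa", "ccbb", "ccaacc");
        String s = "abaccbbbaabbccbbccaabab";
        // alpha = 1 and beta = 1 are assumed values for the two unit costs.
        int[] optimal = optimalCosts(s, library, 1, 1);
        System.out.println("Cost of approximating S: " + optimal[s.length()]);
    }
}
```

Each position looks back at most one step per library string, so the table is filled in time proportional to the length of S times the total length of the strings in B.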
