Sunday, March 26, 2017

#codingexercise
Count the number of distinct subsequences
void Combine(List<int> A, List<int> b, int start, List<List<int>> sequences)
{
    for (int i = start; i < A.Count; i++)
    {
        b.Add(A[i]);                      // include A[i] in the current subsequence
        sequences.Add(new List<int>(b));  // record a copy before we backtrack
        if (i + 1 < A.Count)
            Combine(A, b, i + 1, sequences);
        b.RemoveAt(b.Count - 1);          // backtrack
    }
}
// De-duplicate afterwards, e.g. sequences.Distinct() with a sequence-equality comparer.
Alternatively,
int GetCountDistinct(List<int> A)
{
    // last[d] is the index of the most recent earlier occurrence of digit d (0-9)
    var last = new int[10];
    for (int d = 0; d < 10; d++)
        last[d] = -1;
    int n = A.Count;
    var dp = new int[n + 1];
    dp[0] = 1;                            // the empty subsequence
    for (int i = 1; i <= n; i++)
    {
        dp[i] = 2 * dp[i - 1];            // either extend each earlier subsequence with A[i-1] or not
        if (last[A[i - 1]] != -1)
            dp[i] -= dp[last[A[i - 1]]];  // remove the subsequences double-counted by the repeat
        last[A[i - 1]] = i - 1;
    }
    return dp[n];
}
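For example, as a quick check of the DP above (counting the empty subsequence):
// A = {1, 2, 2}: dp[1] = 2 -> {}, {1}; dp[2] = 4 -> adds {2}, {1,2};
// dp[3] = 2*4 - dp[1] = 6 -> {2,2} and {1,2,2} are new, {2} and {1,2} repeat.
var count = GetCountDistinct(new List<int> { 1, 2, 2 });  // 6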

Saturday, March 25, 2017

#codingexercise
Count all increasing subsequences in a given integer array
void Combine(List<int> digits, List<int> b, int start, ref int count)
{
    for (int i = start; i < digits.Count; i++)
    {
        b.Add(digits[i]);                 // include digits[i] in the current subsequence
        count += GetIS(b);                // adds 1 when the subsequence is increasing
        if (i + 1 < digits.Count)
            Combine(digits, b, i + 1, ref count);
        b.RemoveAt(b.Count - 1);          // backtrack
    }
}
int GetIS(List<int> A)
{
    // returns 1 if A is strictly increasing, 0 otherwise
    for (int i = 1; i < A.Count; i++)
        if (A[i] <= A[i - 1])
            return 0;
    return 1;
}
Alternatively,
int GetCountIS(List<int> A, int n)
{
    // result[i] counts the increasing subsequences that end at index i
    var result = new int[n];
    for (int i = 0; i < n; i++)
        result[i] = 1;                    // the single element A[i] by itself
    for (int i = 1; i < n; i++)
        for (int j = 0; j < i; j++)
            if (A[i] > A[j])
                result[i] += result[j];

    int count = 0;
    for (int i = 0; i < n; i++)
        count += result[i];
    return count;
}
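As a quick check: for A = {1, 2, 3}, result = {1, 2, 4}, which sums to the seven increasing subsequences {1}, {2}, {3}, {1,2}, {1,3}, {2,3} and {1,2,3}:
var total = GetCountIS(new List<int> { 1, 2, 3 }, 3);  // 7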

Friday, March 24, 2017

Yesterday we were talking about Enterprise Search. Enterprise search such as Google's does site indexing and is very helpful for the pages an organization serves on the web. The difference is in the index generated and the topology of the search and indexing pipeline. Traditionally this was maintained on premise with dedicated hardware. This is now being sunset in favor of services from the cloud. There are a few remarkable features of hosting the indexing service in the cloud that often go missing from on-premise deployments. First, there is a formal separation between indexing and analytics. Second, the service creates an index pipeline instead of a store so that different sources can contribute to the pipeline for indexing. Third, a staging XML or JSON cache is used so that configuration changes are not per data source but based on the staging cache. This makes rebuilding the index unnecessary when data sources are added or their configurations change. Fourth, by hosting the service in the cloud or on a cluster from the cloud, the service can be elastic and expand to as many resources as necessary, with automatic rollover, load balancing and differentiation in terms of components. Finally, cloud-based indexing uses Big Data rather than a simple index.
Not just Google Enterprise Search Services but also HP TRIM and KeyView are hosted in the cloud. Most of them work with Big Data for their index. Connectors can be implemented with a range of popular search engines. Companies pay for the SLAs from the connector. Connectors are available for a growing range of repositories, which include file systems, CIFS and NFS shares, Box.com, S3, etc., relational databases including tables and their snapshots, content management systems such as Documentum and SharePoint, collaboration software such as Confluence, Jira, ServiceNow and TeamForge, CRM software such as SalesForce and RightNow, websites hosted with the Aspider web crawler, the Heritrix web crawler and a staging repository, as well as a variety of feeds such as FTP, RSS, etc. The Service Level Agreements from the connector cover security with sign-on such as LDAP, NTLM and CAS, performance goals such as not degrading the target repository, and content enrichment such as cleansing, parsing, entity extraction and categorization.
Analytics happens with a proprietary query language and usually varies by platform. Search Technologies has mentioned a variety of techniques related to search.
I believe that desktop search can be included with System Center automations. This lets the data to be indexed follow the same route as logs and metrics, with the convenience of central management.
Personal document indexing as a productivity tool then merely requires running queries against a cloud service.

#codingexercise
Break text into lines that are as balanced as possible, with typographical considerations such as avoiding excessive white space, widows and orphans, etc.
void BreakLines(int[] optimals)           // optimals[k] is the best total penalty for words 1..k
{
    int n = optimals.Length - 1;
    optimals[0] = 0;
    for (int k = 1; k <= n; k++)
    {
        optimals[k] = int.MaxValue;
        for (int j = 0; j <= k - 1; j++)
            optimals[k] = Math.Min(optimals[k], optimals[j] + Penalty(j, k));
    }
}
where Penalty(j, k) is the penalty of a line that starts at position j + 1 and ends at position k.
The above method could also be written recursively; without memoization, though, the recursion would recompute the same subproblems.
BreakLines gives the minimum total penalty.
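Penalty itself is left abstract above. A common choice, assumed here rather than taken from the post, charges the cube of the slack left on the line; the word lengths and the line width W are assumed inputs:
// Sketch of a Penalty: words j+1..k go on one line of width W, separated by single spaces.
// Assumes fields int[] wordLen (1-indexed word lengths) and int W (line width).
int Penalty(int j, int k)
{
    int used = k - j - 1;                  // the spaces between the words
    for (int w = j + 1; w <= k; w++)
        used += wordLen[w];
    if (used > W) return int.MaxValue / 2; // overflow-safe "infinite" penalty for overfull lines
    int slack = W - used;
    return slack * slack * slack;          // cubing punishes very loose lines
}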
To lay out the text, we also record the break point whenever we find a lower cost:
void LayoutText(int[] optimals, int[] Best)
{
    int n = optimals.Length - 1;
    optimals[0] = 0;
    for (int k = 1; k <= n; k++)
    {
        optimals[k] = int.MaxValue;
        for (int j = 0; j <= k - 1; j++)
        {
            var temp = optimals[j] + Penalty(j, k);
            if (temp < optimals[k])
            {
                optimals[k] = temp;
                Best[k] = j;              // the best line ending at k starts after j
            }
        }
    }
}
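Best[] then encodes the breaks; walking it backward from n recovers the lines (a small sketch):
void PrintBreaks(int[] Best, int n)
{
    var ends = new Stack<int>();
    for (int k = n; k > 0; k = Best[k])
        ends.Push(k);                     // a line ends at word k and starts after word Best[k]
    foreach (var end in ends)
        Console.WriteLine("line ends at word " + end);
}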


Dynamic programming is about expressing the unknown in terms of recursive subproblems. For example, to count the number of permutations of N elements with k inversions, we can say:

int GetInv(int N, int k)
{
    if (k == 0) return 1;                 // only the sorted order has zero inversions
    if (N == 0) return 0;
    int count = 0;
    // placing the largest element contributes between 0 and N-1 new inversions
    for (int i = 0; i <= Math.Min(k, N - 1); i++)
        count += GetInv(N - 1, k - i);
    return count;
}
This can also use memoization.
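A memoized sketch of the same recursion, caching results on the (N, k) pair:
static Dictionary<(int, int), int> invCache = new Dictionary<(int, int), int>();
int GetInvMemo(int N, int k)
{
    if (k == 0) return 1;
    if (N == 0) return 0;
    if (invCache.TryGetValue((N, k), out int cached)) return cached;
    int count = 0;
    for (int i = 0; i <= Math.Min(k, N - 1); i++)
        count += GetInvMemo(N - 1, k - i);
    invCache[(N, k)] = count;             // each (N, k) pair is now computed only once
    return count;
}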

Thursday, March 23, 2017

Yesterday we were discussing Enterprise Search. Companies are getting rid of their yellow Google search appliances and embracing cloud-based services instead. Not just Google Enterprise Search Services but also HP TRIM and KeyView are hosted in the cloud. Most of them work with Big Data for their index. This is in sharp contrast to the desktop search index stores. These services work independently from the connectors that send the data, which is a way to future-proof the investment in enterprise search. The search service maintains an index pipeline that reindexes only on configuration changes, not data additions. Connectors can be implemented with a range of popular search engines. Companies pay for the SLAs from the connector. Connectors are available for a growing range of repositories, which include file systems, CIFS and NFS shares, Box.com, S3, etc., relational databases including tables and their snapshots, content management systems such as Documentum and SharePoint, collaboration software such as Confluence, Jira, ServiceNow and TeamForge, CRM software such as SalesForce and RightNow, websites hosted with the Aspider web crawler, the Heritrix web crawler and a staging repository, as well as a variety of feeds such as FTP, RSS, etc. The Service Level Agreements from the connector cover security with sign-on such as LDAP, NTLM and CAS, performance goals such as not degrading the target repository, and content enrichment such as cleansing, parsing, entity extraction and categorization.
#codingexercise
Break text into lines that are as balanced as possible, with typographical considerations such as avoiding excessive white space, widows and orphans, etc.
void BreakLines(int[] optimals)           // optimals[k] is the best total penalty for words 1..k
{
    int n = optimals.Length - 1;
    optimals[0] = 0;
    for (int k = 1; k <= n; k++)
    {
        optimals[k] = int.MaxValue;
        for (int j = 0; j <= k - 1; j++)
            optimals[k] = Math.Min(optimals[k], optimals[j] + Penalty(j, k));
    }
}
where Penalty(j, k) is the penalty of a line that starts at position j + 1 and ends at position k.
The above method could also be written recursively; without memoization, though, the recursion would recompute the same subproblems.

Wednesday, March 22, 2017

We continue reading the paper "Big Data and Cloud Computing: A survey of the State-of-the-Art and Research Challenges" by Skourletopoulos et al. This paper talks about the comparisons of data warehouse and big data as a cloud offering. IBM data scientists argue that the key dimensions of big data are: volume, velocity, variety and veracity. A Big Data as a service stack may get data from other big data sources, operational data stores, staging databases, data warehouses and data marts. Zheng et al. showed that the service-generated big data included service logs, service QoS and service relationships in the form of services identification and migration. A cloud-based big data analytics service provisioning platform named CLAaaS is presented in the literature to help describe the significant features of the workflow systems, such as multi-tenancy for a wide range of analytic tools and back-end data sources, user group customizations and web collaboration. Big Data as a service was introduced in order to provide common big data services, boost efficiency and reduce cost. The articulation of this cost is important to people considering a switch-over from a data warehouse to Big Data. One metric proposed by the authors is the yearly aggregation of the initial monthly cost for leasing cloud storage, expressed in monetary units, times the maximum storage capacity.
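Read literally, that metric is a simple product (a sketch; the variable names here are assumed, not the paper's):
// yearly cost = 12 months x initial monthly lease cost per storage unit x maximum capacity
decimal YearlyStorageCost(decimal monthlyLeaseCostPerUnit, decimal maxStorageCapacity)
{
    return 12m * monthlyLeaseCostPerUnit * maxStorageCapacity;
}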
This costing is comparable to that of an Enterprise Search service hosted in the cloud, such as CloudWatch.
There are two or three remarkable features of hosting the indexing service in the cloud that often go missing from on-premise deployments. First, the service creates an index pipeline instead of a store so that different sources can contribute to the pipeline for indexing. Second, a staging XML or JSON cache is used so that configuration changes are not per data source but based on the staging cache. This makes rebuilding the index unnecessary when data sources are added or their configurations change. Finally, by hosting the service in the cloud or on a cluster from the cloud, the service can be elastic and expand to as many resources as necessary, with automatic rollover, load balancing and differentiation in terms of components.
#codingexercise
Given a string S and a library of strings B = {b1, ..., bm}, construct an approximation of the string S by using copies of strings in B. For example, B = {abab, bbbaaa, ccbb, ccaacc} and S = abaccbbbaabbccbbccaabab.
The solution consists of taking strings from B and assigning them to non-overlapping substrings of S. Strings from B may be used multiple times. There is a cost alpha for each unmatched character in S and a cost beta for each mismatched character. The mismatch between positions i and j is the number of mismatched characters in b[j] when it is aligned starting at position i in S. We compute the optimal costs of positions 1 to n in the given string, determining the optimal cost of position k as a function of the costs of the previous positions.
void GetCost(string S, List<string> B, int alpha, int beta, int[] Optimals)
{
    int n = S.Length;
    Optimals[0] = 0;
    for (int k = 1; k <= n; k++)
    {
        // default: position k is left unmatched at cost alpha
        Optimals[k] = Optimals[k - 1] + alpha;
        for (int j = 0; j < B.Count; j++)
        {
            int p = k - B[j].Length + 1;  // align B[j] to end at position k (1-indexed)
            if (p < 1) continue;
            Optimals[k] = Math.Min(Optimals[k], Optimals[p - 1] + beta * Mismatch(p, j));
        }
    }
}
The purpose of the above code is to show that we find the optimal solution by evaluating the cost of each position as a candidate for the use of a string from the given list. Either the current position is a continuation of the previous one or the beginning of the next match. We assume the former by default, even if it is a mismatch, and note down the cost as alpha. Then, for each of the strings, we take the alignment that ends at the current position and add its cost to the previously evaluated cost at the position p where that alignment begins. This cost is therefore the cost computed prior to p plus the number of mismatched characters in the alignment times the unit cost beta. As we can see, the previous computations aid the future ones, and therefore we use memoization by keeping track of the cost for each position.
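Mismatch is left abstract above; a small sketch (assuming S and B are accessible, say as fields) just counts the characters of B[j] that differ from S when B[j] is aligned at 1-indexed position p:
int Mismatch(int p, int j)
{
    int count = 0;
    for (int q = 0; q < B[j].Length; q++)
        if (S[p - 1 + q] != B[j][q])      // compare against the aligned character of S
            count++;
    return count;
}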

Tuesday, March 21, 2017

We started reading the paper "Big Data and Cloud Computing: A survey of the State-of-the-Art and Research Challenges" by Skourletopoulos et al. This paper talks about the comparisons of data warehouse and big data as a cloud offering. IBM data scientists argue that the key dimensions of big data are: volume, velocity, variety and veracity. A Big Data as a service stack may get data from other big data sources, operational data stores, staging databases, data warehouses and data marts. Zheng et al. showed that the service-generated big data included service logs, service QoS and service relationships in the form of services identification and migration. A cloud-based big data analytics service provisioning platform named CLAaaS is presented in the literature to help describe the significant features of the workflow systems, such as multi-tenancy for a wide range of analytic tools and back-end data sources, user group customizations and web collaboration.
On the other hand, analytics was described in a service-oriented Decision Support System in Demirkan and Delen's study. This study shows that a decision support system is best organized with data flowing from information sources through the data service tier, which in turn transforms it for the information service tier and finally the analytics service tier.
#codingexercise
Highway Billboard problem
Consider a highway of M miles. We have to place billboards on the highway so that revenue is maximized. The possible sites for billboards are given by x1 < x2 < ... < xn, with revenues r1, r2, ..., rn, and these sites lie between 0 and M. No two billboards can be placed within t miles of each other.
// i is the current mile, scanned from m down to 1; next indexes the sites
// from the high end (xi sorted ascending, next starting at xi.Count - 1)
int GetMaxRevenue(List<int> xi, List<int> ri, int m, int t, int i, int next)
{
    if (i <= 0 || next < 0) return 0;
    if (xi[next] > i)                     // this site lies beyond the current mile
        return GetMaxRevenue(xi, ri, m, t, i, next - 1);
    if (xi[next] != i)                    // no site at the current mile
        return GetMaxRevenue(xi, ri, m, t, i - 1, next);
    // a site exists at mile i: either skip it, or take it and move back t+1 miles
    int skip = GetMaxRevenue(xi, ri, m, t, i - 1, next - 1);
    int take = ri[next] + GetMaxRevenue(xi, ri, m, t, i - t - 1, next - 1);
    return Math.Max(skip, take);
}
As with all dynamic programming methods with overlapping subproblems, memoization will help. 
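A bottom-up version makes that explicit; maxRev[i] is the best revenue achievable using miles 1..i (a sketch, assuming x is sorted ascending):
int GetMaxRevenueDP(List<int> x, List<int> r, int m, int t)
{
    var maxRev = new int[m + 1];
    int next = 0;                         // index of the next site to consider
    for (int i = 1; i <= m; i++)
    {
        if (next < x.Count && x[next] == i)
        {
            // take the site at mile i (jumping back t+1 miles) or carry the previous best
            int take = r[next] + (i > t ? maxRev[i - t - 1] : 0);
            maxRev[i] = Math.Max(maxRev[i - 1], take);
            next++;
        }
        else
            maxRev[i] = maxRev[i - 1];
    }
    return maxRev[m];
}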

Monday, March 20, 2017

We started reading the paper "Big Data and Cloud Computing: A survey of the State-of-the-Art and Research Challenges" by Skourletopoulos et al. This paper talks about the comparisons of data warehouse and big data as a cloud offering. IBM data scientists argue that the key dimensions of big data are: volume, velocity, variety and veracity. A Big Data as a service stack may get data from other big data sources, operational data stores, staging databases, data warehouses and data marts. Zheng et al. showed that the service-generated big data included service logs, service QoS and service relationships in the form of services identification and migration. A cloud-based big data analytics service provisioning platform named CLAaaS is presented in the literature to help describe the significant features of the workflow systems, such as multi-tenancy for a wide range of analytic tools and back-end data sources, user group customizations and web collaboration.
On the other hand, analytics was described in a service-oriented Decision Support System in Demirkan and Delen's study. This study shows that a decision support system is best organized with data flowing from information sources through the data service tier, which in turn transforms it for the information service tier and finally the analytics service tier.
There has been some effort at cost-benefit analysis in cloud and mobile computing environments, but there has been little or no effort in the area of evaluation and classification of the Big Data as a service model. There have been efforts to examine the cost benefit of extending cloud computing with clusters, to calculate the TCO, or to assess the economic efficiency of the cloud.
#codingexercise
int GetNChooseK(int n, int k)
{
    if (k < 0 || k > n) return 0;
    if (n < 0) return 0;
    return (int)(Factorial(n) / (Factorial(n - k) * Factorial(k)));
}
long Factorial(int n)                     // long delays overflow; fine for small n
{
    if (n <= 1) return 1;
    return n * Factorial(n - 1);
}
We can also do this using a dynamic programming approach:
int GetNChooseKDP(int n, int k)
{
    if (k == 0 || k == n)
        return 1;
    // Pascal's rule; memoize or tabulate to avoid recomputing subproblems
    return GetNChooseKDP(n - 1, k - 1) + GetNChooseKDP(n - 1, k);
}
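The tabulated form fills Pascal's triangle row by row, so each subproblem is computed once, in O(n*k) time (a sketch):
int GetNChooseKTable(int n, int k)
{
    var C = new int[n + 1, k + 1];
    for (int i = 0; i <= n; i++)
        for (int j = 0; j <= Math.Min(i, k); j++)
            C[i, j] = (j == 0 || j == i) ? 1 : C[i - 1, j - 1] + C[i - 1, j];
    return C[n, k];
}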
Find the number of strings of length n that can be formed from the letters a, b and c such that b occurs at most once and c occurs at most twice.
void GetCount(string A, StringBuilder b, int level, int n)   // A = "abc"
{
    if (b.Length == n) { Console.WriteLine(b); return; }
    for (int k = 0; k < A.Length; k++)
    {
        if (A[k] == 'b' && b.ToString().Count(x => x == 'b') == 1) continue; // at most one b
        if (A[k] == 'c' && b.ToString().Count(x => x == 'c') == 2) continue; // at most two c's
        b.Append(A[k]);
        GetCount(A, b, level + 1, n);
        b.Length -= 1;                    // backtrack
    }
}
The above method could also be modified to count directly, by either placing or skipping each letter at every position, as the sketch below shows.
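A counting version of that idea recurses on the remaining length and on the numbers of b's and c's used so far (a sketch):
int CountStrings(int n, int bUsed, int cUsed)
{
    if (n == 0) return 1;                                  // one way: the string is complete
    int count = CountStrings(n - 1, bUsed, cUsed);         // place an 'a'
    if (bUsed < 1) count += CountStrings(n - 1, bUsed + 1, cUsed); // place the single 'b'
    if (cUsed < 2) count += CountStrings(n - 1, bUsed, cUsed + 1); // place a 'c'
    return count;
}
// CountStrings(n, 0, 0) counts the valid strings of length n.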