Cluster computing

Saturday, June 24, 2017

Today we discuss Cloudera architecture. Cloudera is known for its data management software that can be deployed and run on any cloud. It offers an enterprise data hub, an analytics DB, an operational DB, data science and engineering, and essentials. It is elastic and flexible, it provides high-performance analytics, it can easily provision across multiple clouds, and it can be used for automated metering and billing. Essentially, the big data platform allows different data models, real-time data pipelines and streaming applications. It lets data models break free from vendor lock-in, with the flexibility to be community defined. Moreover, it lets the database scale by hosting it in the cloud.

The data science workbench offered by Cloudera provides a console in a web browser, where users authenticate with Kerberos against the cluster KDC. Engines are spun up based on engine kernels and profiles, and we can seamlessly connect to Spark, Hive, and Impala. A command prompt and an editor are both available for interactive command execution. Code is executed in a session, and we can quit the session. The workbench uses Docker and Kubernetes to manage containers. Cloudera is supported on dedicated Hadoop hosts.

Cloudera also adds a data engineering service called Altus. It is a platform that works against a cloud by allowing clusters to be set up and torn down, and jobs to be submitted to those clusters. Clusters may be Apache Spark, MR2 or Hive. Behind the data engineering service, these clusters are provisioned using EC2 and S3: EC2 provides the compute layer and S3 provides the storage layer. Note that the clusters are not provisioned with an Apache Mesos and Marathon stack, and the storage is not provided on other file-based or database-based technologies, but this can be expanded in the future. Containerization technologies and backend as a service, aka lambda functions, can also be supported.

The main notion here is that Cloudera works with existing public clouds, while it offers an enterprise manager for on-premise solutions. It provides great management capabilities, it is open source, and it provides great platform support with the right mix of open source tools for analytics and operations. It even draws parallels with Splunk, which happens to be a machine data platform. Cloudera may well benefit from the same practice around Sustaining Engineering and Support that endeared Splunk to its customers. It may be interesting to find out whether Cloudera can slot in as a data platform underneath Splunk, so that Splunk becomes less proprietary and embraces Cloudera's community model. The time series database for Splunk can be left alone, as only the cluster operations are migrated.
#codingexercise
Given a rod and a price chart for pieces of different lengths, determine the cut lengths that yield the maximum total price.
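using System;
using System.Collections.Generic;

// p[k] is the price of a piece of length k, so p needs n + 1 entries and p[0] is unused.
int GetMaxPrices(List<int> p, int n)
{
    // dp[i] holds the best total price for a rod of length i.
    var dp = new int[n + 1];
    dp[0] = 0;
    for (int i = 1; i <= n; i++)
    {
        var price = int.MinValue;
        // Try every length k for the first piece and reuse the best price for the rest.
        for (int k = 1; k <= i; k++)
        {
            price = Math.Max(price, p[k] + dp[i - k]);
        }
        dp[i] = price;
    }
    return dp[n];
}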
In order to determine the cut lengths, we can keep, for each position i, a list of the remainders i-k that caused an update to the price. Then, starting from position n, we recursively enumerate the positions that contributed to n, down to position 0.
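The idea above can be sketched as follows. This is a minimal sketch with simplifying assumptions, not a definitive implementation: instead of a list per position it keeps only the single piece length that last improved the price, and it walks back from n with a loop instead of recursion, so it returns one optimal set of cuts, not all of them. The names RodCutting, GetCutLengths and firstCut are hypothetical, and p[k] is assumed to be the price of a piece of length k with p[0] unused, as in GetMaxPrices above.

```csharp
using System;
using System.Collections.Generic;

static class RodCutting
{
    // Hypothetical helper: returns one optimal list of piece lengths for a rod of length n.
    // Assumes p[k] is the price of a piece of length k and p[0] is unused.
    public static List<int> GetCutLengths(List<int> p, int n)
    {
        var dp = new int[n + 1];        // dp[i]: best total price for length i
        var firstCut = new int[n + 1];  // firstCut[i]: piece length k that gave dp[i]
        for (int i = 1; i <= n; i++)
        {
            var price = int.MinValue;
            for (int k = 1; k <= i; k++)
            {
                int candidate = p[k] + dp[i - k];
                if (candidate > price)
                {
                    price = candidate;
                    firstCut[i] = k;    // remember the cut that caused the update
                }
            }
            dp[i] = price;
        }

        // Walk back from n: take the recorded piece, then continue with the remainder.
        var cuts = new List<int>();
        for (int remaining = n; remaining > 0; remaining -= firstCut[remaining])
        {
            cuts.Add(firstCut[remaining]);
        }
        return cuts;
    }
}
```

For example, with the prices 1, 5, 8, 9, 10, 17, 17, 20 for lengths 1 through 8 (passed as a list with a leading unused 0) and n = 8, this sketch returns the pieces 2 and 6, for a total price of 5 + 17 = 22.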
