Cluster computing

Wednesday, November 20, 2019

Partitioning is probably the reason why tools like Spark and Flink became popular for big data.
The custom partition function of Flink can be demonstrated with the following:
final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
DataSet<Tuple2<Long, Double>> sorted = env.readCsvFile("ratings.csv")
    .ignoreFirstLine()
    .includeFields(false, true, true, false) // keep only movieId and rating
    .types(Long.class, Double.class)
    .groupBy(0) // group by movieId
    .reduceGroup(new GroupReduceFunction<Tuple2<Long, Double>, Tuple2<Long, Double>>() {
        @Override
        public void reduce(Iterable<Tuple2<Long, Double>> values, Collector<Tuple2<Long, Double>> out) throws Exception {
            Long movieId = null;
            double total = 0;
            int count = 0;
            for (Tuple2<Long, Double> value : values) {
                movieId = value.f0;
                total += value.f1;
                count++;
            }
            // emit the average rating only for movies with more than 50 ratings
            if (count > 50) {
                out.collect(new Tuple2<>(movieId, total / count));
            }
        }
    })
    .partitionCustom(new Partitioner<Double>() {
        @Override
        public int partition(Double key, int numPartitions) {
            // partition by the integer part of the average rating, clamped to a valid index
            return Math.max(0, Math.min(key.intValue() - 1, numPartitions - 1));
        }
    }, 1); // field 1 is the average rating
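
As a rough illustration of what a custom partitioner does, independent of Flink, the sketch below routes (movieId, average rating) pairs to buckets by the integer part of the rating. The class and method names are hypothetical, and clamping the index to a valid range is an assumption rather than something Flink does for you.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical, Flink-free sketch: route records to partitions by a key-derived index.
final class RatingPartitionSketch {

    // Maps an average rating to a partition index, clamped to [0, numPartitions - 1].
    static int partition(double averageRating, int numPartitions) {
        int index = (int) averageRating - 1;
        return Math.max(0, Math.min(index, numPartitions - 1));
    }

    // Distributes the movie ids of (movieId -> averageRating) entries into numPartitions buckets.
    static List<List<Long>> distribute(Map<Long, Double> averages, int numPartitions) {
        List<List<Long>> partitions = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            partitions.add(new ArrayList<>());
        }
        for (Map.Entry<Long, Double> entry : averages.entrySet()) {
            partitions.get(partition(entry.getValue(), numPartitions)).add(entry.getKey());
        }
        return partitions;
    }
}
```

With five partitions, a movie averaging 4.2 would land in bucket 3 and one averaging 1.5 in bucket 0, so each bucket holds movies of a similar rating band.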

Tuesday, November 19, 2019

This is a continuation of the earlier posts to enumerate funny aspects of software engineering practice:

390) Build a product where people can learn about it and hear from others empowered in their work and the fan following grows.

391) Build a product where users have to wait for orphaned resource cleanup before they can proceed with a re-install.

392) Build a product where the users have to frequently run the installer and it sometimes fails to complete.

393) Build a product where the software is blamed because the administrator was too shy to read the manual.

394) Build a product where the resources provided by the customer for the software do not meet the requirements.

395) Build a product where different parts of the software need to be installed differently and find that deployments are usually haphazard.

396) Build a product where the installation in the production environment is so elaborate that it requires planning, dry runs and coordination across teams.

397) Build a product where every update is used by the staff to justify setting aside a day.

398) Build a product where the reality and perception are deliberately kept different at a customer site.

399) Build a product where the vision for the product is different from what the customer wants to use it for.

400) Build a product where the quirkiness of the product offers fodder for all kinds of talks, from conferences to boardroom meetings.

Build a product where the last mile before the release usually requires everyone to pitch in.

Build a product where the pre-release efforts can never truly be separated from development efforts.

Build a product where the last mile gathers unwanted attention from people who don’t have any direct dependence on the product.

Build a product where the announcements become something to look forward to, both for the people building the product and for those waiting for it outside.

Build a product where the process of building never truly becomes joyful.

Build a product where the release of a product is never really a starting point for the next version.

Build a product where the last few weeks before the release, rather than the development phase, are usually reserved for making the customer experience better.

Build a product where Murphy’s law seems to be more applicable towards the endgame.

Build a product where the week after the release still manages to throw surprises.

Build a product where the customers’ celebrations of their success with the product add joy to the work involved during development but the hands that made the product are no longer there.

Monday, November 18, 2019

A comparison of Flink SQL execution and Facebook’s Presto continued:

The Flink Application provides the ability to write SQL query expressions. This abstraction works closely with the Table API and SQL queries can be executed over tables. The Table API is a language centered around tables and follows a relational model. Tables have a schema attached and the API provides the relational operators of selection, projection and join. Programs written with Table API go through an optimizer that applies optimization rules before execution.

Presto from Facebook is a distributed SQL query engine that can operate on streams from various data sources and supports ad hoc queries in near real time.

Querying a key-value collection is handled natively by the data store. In SQL over a relational store, the equivalent query is popularly described as a join, where the key-values are treated as a table with a key column and a value column. The keys to include in the predicate can be put in a separate temporary table holding just the keys of interest, and a join can be performed between the two tables on matching keys.
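
The join analogy can be sketched in plain Java as a hash join: index the key-values once, then probe the index with the keys of interest. This is only an illustration under assumed types (string keys and values); the class and method names are hypothetical and do not come from any data store's API.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of joining a key-value "table" with a temporary "table" of keys.
final class KeyValueJoinSketch {

    // Builds an index over the key-values once, then probes it once per key of interest.
    static List<String> joinOnKeys(List<Map.Entry<String, String>> keyValues, List<String> keysOfInterest) {
        Map<String, List<String>> index = new HashMap<>();
        for (Map.Entry<String, String> kv : keyValues) {
            index.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
        }
        List<String> matches = new ArrayList<>();
        for (String key : keysOfInterest) {
            matches.addAll(index.getOrDefault(key, Collections.emptyList()));
        }
        return matches;
    }
}
```

The results come back grouped by the order of the keys of interest, and keys with no match contribute nothing.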

Without the analogy of the join, the key-value collection requires standard query operators such as a where clause that tests for a match against a set of keys. This is rather expensive compared to the join because it works on a large list of key-values and may iterate over the entire list repeatedly to match one or more keys in the provided set.

Most key-value collections are scoped. They are not necessarily in one large global list; instead, the key-values are scoped to a document or an object. The document may be in one of two forms: JSON or XML. JSON has its own query language, JMESPath, and XML also supports path-based queries. When the key-values are scoped, an application can search them efficiently using standard query operators without requiring the paths inherent to a document format such as JSON or XML.
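
A minimal sketch of searching scoped key-values with standard operators, assuming the key-values of one document have already been loaded into a map. The class and method names are hypothetical; no JSON or XML parsing is shown.

```java
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical sketch: filter the key-values scoped to one document without any path expression.
final class ScopedKeyValueSketch {

    // Returns the entries of the scoped collection whose keys appear in the requested set.
    static Map<String, String> select(Map<String, String> scoped, Set<String> keys) {
        return scoped.entrySet().stream()
                .filter(entry -> keys.contains(entry.getKey()))
                .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue));
    }
}
```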

Presto’s scalability to processing petabytes of data is unparalleled, and the use of a distributed SQL query engine also helps.

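// Returns the k-th (1-based) element of the m x n matrix A in anticlockwise spiral order,
// starting at the top-left corner and moving down the first column; returns -1 if out of range.
static int getKthAntiClockWise(int[][] A, int m, int n, int k)
{
    if (n < 1 || m < 1 || k < 1 || k > m * n) return -1;
    int off = 0; // top-left corner of the current ring; replaces copying the inner matrix
    while (true) {
        if (k <= m)
            return A[off + k - 1][off]; // down the left column
        if (k <= m + n - 1)
            return A[off + m - 1][off + k - m]; // right along the bottom row
        if (k <= 2 * m + n - 2)
            return A[off + 2 * m + n - 2 - k][off + n - 1]; // up the right column
        if (k <= 2 * m + 2 * n - 4)
            return A[off][off + 2 * m + 2 * n - 3 - k]; // left along the top row
        // skip the outer ring and continue with the inner (m-2) x (n-2) matrix
        k -= 2 * m + 2 * n - 4;
        m -= 2; n -= 2; off++;
    }
}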
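
One way to sanity-check a routine like the one above is to enumerate the whole anticlockwise spiral and index into it. The sketch below assumes a rectangular matrix and a traversal that starts at the top-left corner and moves down the first column; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical brute-force cross-check: list the full anticlockwise spiral, then pick the k-th entry.
final class AntiClockwiseSpiralSketch {

    // Lists the elements ring by ring: down the left column, right along the bottom row,
    // up the right column, then left along the top row.
    static List<Integer> traverse(int[][] a) {
        List<Integer> order = new ArrayList<>();
        if (a.length == 0 || a[0].length == 0) return order;
        int top = 0, bottom = a.length - 1, left = 0, right = a[0].length - 1;
        while (top <= bottom && left <= right) {
            for (int r = top; r <= bottom; r++) order.add(a[r][left]);
            for (int c = left + 1; c <= right; c++) order.add(a[bottom][c]);
            if (left < right) {
                for (int r = bottom - 1; r >= top; r--) order.add(a[r][right]);
            }
            if (top < bottom) {
                for (int c = right - 1; c > left; c--) order.add(a[top][c]);
            }
            top++; bottom--; left++; right--;
        }
        return order;
    }

    // 1-based lookup of the k-th element in that order, or -1 when k is out of range.
    static int kth(int[][] a, int k) {
        List<Integer> order = traverse(a);
        return (k < 1 || k > order.size()) ? -1 : order.get(k - 1);
    }
}
```

The brute-force version costs O(m*n) time and space per lookup, so it is only useful for testing the direct index arithmetic on small matrices.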

Sunday, November 17, 2019

A comparison of Flink SQL execution and Facebook’s Presto:
The Flink Application provides the ability to write SQL query expressions. This abstraction works closely with the Table API and SQL queries can be executed over tables. The Table API is a language centered around tables and follows a relational model. Tables have a schema attached and the API provides the relational operators of selection, projection and join. Programs written with Table API go through an optimizer that applies optimization rules before execution.
Flink applications generally do not need the above abstraction of the Table API and SQL layers. Instead, they work directly on the core APIs: DataStream (unbounded streams) and DataSet (bounded data sets). These APIs provide the ability to perform stateful stream processing.
For example,
DataStream<String> lines = env.addSource(new FlinkKafkaConsumer<>(…)); // source
DataStream<Event> events = lines.map((line) -> parse(line)); // transformation
DataStream<Statistics> stats = events.keyBy("id").timeWindow(Time.seconds(10)).apply(new MyWindowAggregationFunction()); // windowed aggregation per key
stats.addSink(new BucketingSink(path)); // sink
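
To make the keyBy and timeWindow steps concrete, here is a plain-Java sketch with no Flink dependency that groups events by id and by aligned, non-overlapping windows, and counts the events in each. The Event class, the method name and the "id@windowStart" result key are all hypothetical, and timestamps are assumed to be non-negative milliseconds.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of a keyed tumbling-window count, outside of Flink.
final class TumblingWindowSketch {

    static final class Event {
        final String id;
        final long timestampMillis;

        Event(String id, long timestampMillis) {
            this.id = id;
            this.timestampMillis = timestampMillis;
        }
    }

    // Counts events per (id, window start); windows are aligned spans of windowMillis.
    static Map<String, Long> countPerWindow(Iterable<Event> events, long windowMillis) {
        Map<String, Long> counts = new TreeMap<>();
        for (Event event : events) {
            long windowStart = event.timestampMillis - (event.timestampMillis % windowMillis);
            counts.merge(event.id + "@" + windowStart, 1L, Long::sum);
        }
        return counts;
    }
}
```

Unlike a real stream processor, this sketch sees all events up front and keeps no per-window state that could be emitted and discarded when a window closes.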
Presto from Facebook is a distributed SQL query engine that can operate on streams from various data sources and supports ad hoc queries in near real time. It does not partition work based on MapReduce; instead, it executes the query with a custom SQL execution engine written in Java. It has a pipelined data model that can run multiple stages at once while pipelining the data between stages as it becomes available. This reduces end-to-end time while maximizing parallelization via stages on large data sets. A coordinator takes the incoming query from the user and draws up the plan and the assignment of resources. Facebook’s Presto can run on large social media data sets on the order of petabytes. It can also run over HDFS for interactive analytics.
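
The idea of running stages at once and handing data over as it becomes available can be sketched with two threads and a queue. This is a toy illustration and not how Presto is implemented; the class name, the two arithmetic "stages" and the end-of-stream marker are all assumptions, and the input is assumed not to produce the marker value.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical two-stage pipeline: stage 2 consumes rows while stage 1 is still producing them.
final class PipelinedStagesSketch {

    private static final Integer END = Integer.MIN_VALUE; // assumed end-of-stream marker

    static List<Integer> run(List<Integer> input) {
        BlockingQueue<Integer> queue = new LinkedBlockingQueue<>();
        List<Integer> results = new ArrayList<>();

        // Stage 1: doubles each value and hands it over immediately.
        Thread producer = new Thread(() -> {
            for (Integer value : input) {
                queue.add(value * 2);
            }
            queue.add(END);
        });

        // Stage 2: adds one to each value as soon as it arrives.
        Thread consumer = new Thread(() -> {
            try {
                for (Integer value = queue.take(); !value.equals(END); value = queue.take()) {
                    results.add(value + 1);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        producer.start();
        consumer.start();
        try {
            producer.join();
            consumer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("pipeline interrupted", e);
        }
        return results;
    }
}
```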
There is also a difference in the queries when we match a single key or many keys. For example, when we use the == operator versus the IN operator in the query statement, the size of the list of key-values to be iterated does not reduce. It is only the efficiency of matching one tuple against the set of keys in the predicate that improves when we use an IN operator, because we don’t have to traverse the entire list multiple times. Instead, each entry is matched against the set of keys in the predicate specified by the IN operator. The use of a join, on the other hand, reduces the size of the range significantly and gives the query execution a chance to optimize the plan.
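
The difference between repeated == matches and a single IN match can be sketched as follows, assuming the key-values are an in-memory list of string pairs. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch contrasting one scan per key (==) with one scan overall (IN).
final class InPredicateSketch {

    // Equality per key: the entire list is traversed once for every key.
    static List<String> whereKeyEqualsEach(List<Map.Entry<String, String>> keyValues, List<String> keys) {
        List<String> matches = new ArrayList<>();
        for (String key : keys) {
            for (Map.Entry<String, String> kv : keyValues) {
                if (kv.getKey().equals(key)) {
                    matches.add(kv.getValue());
                }
            }
        }
        return matches;
    }

    // IN: the list is traversed once and each entry is tested against the key set.
    static List<String> whereKeyIn(List<Map.Entry<String, String>> keyValues, Set<String> keys) {
        List<String> matches = new ArrayList<>();
        for (Map.Entry<String, String> kv : keyValues) {
            if (keys.contains(kv.getKey())) {
                matches.add(kv.getValue());
            }
        }
        return matches;
    }
}
```

Both methods still visit every key-value at least once, which is the point made above: IN removes the repeated traversals but does not shrink the list being scanned.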
Just like the standard query operators of .NET, the Flink SQL layer is merely a convenience over the Table API. Presto, on the other hand, offers to run over any kind of data source, not just the Table API.

Saturday, November 16, 2019

This is a continuation of the earlier posts to enumerate funny aspects of software engineering practice:

380) Build a product with a well-thought-out dashboard and find the users gravitating to the page.
381) Build a product with little or no styling and find the applications not gaining appeal.
382) Build a product with specific investment in stylesheets and see dramatic improvement in perception.
383) Build a product with customizable styles and the partners become happy.
384) Build a product with styles that can be changed and satisfaction grows among end users.
385) Build a product with styles that suit groups and membership of the group grows.
386) Build a product with a logo that can be made into stickers to be offered as giveaways and it becomes popular among young professionals.
387) Build a product with marketing events that become popular and the awareness increases.
388) Build a product with partnerships where partners talk about the product and the fan following grows.
389) Build a product with advocacy groups and training and the number of skilled users grows.
390) Build a product where people can learn about it and hear from others empowered in their work and the fan following grows.
391) Build a product where users have to wait for orphaned resource cleanup before they can proceed with a re-install.
392) Build a product where the users have to frequently run the installer and it sometimes fails to complete.
393) Build a product where the software is blamed because the administrator was too shy to read the manual.
394) Build a product where the resources provided by the customer for the software do not meet the requirements.
395) Build a product where different parts of the software need to be installed differently and find that deployments are usually haphazard.
396) Build a product where the installation in the production environment is so elaborate that it requires planning, dry runs and coordination across teams.
397) Build a product where every update is used by the staff to justify setting aside a day.
398) Build a product where the reality and perception are deliberately kept different at a customer site.
399) Build a product where the vision for the product is different from what the customer wants to use it for.
400) Build a product where the quirkiness of the product offers fodder for all kinds of talks, from conferences to boardroom meetings.

Friday, November 15, 2019

This is a continuation of the earlier posts to enumerate funny aspects of software engineering practice:
370) Build a product that gets a thumbs up from production support.
371) Build a product that makes it easy to add storage and find prolific use of storage by users.
372) Build a product that makes it easy for users to complete workflows with fewer checks and find that users tend to experiment rather than let the workflow guide them.
373) Build a product that makes workflows too convoluted and users tend to use it only for partial completion.
374) Build a product where users can create their own workflows and it becomes popular with audiences that like user-friendly, designer-style tools.
375) Build a product with little or no composition of workflows and users tend to write several clones of customized workflows.
376) Build a product with the ability to support live debugging and it becomes popular for development environments.
377) Build a product with lots of levers and the dashboard looks intimidating.
378) Build a product with fewer levers and find the customers unhappy.
379) Build a product where you let the users create their own panels for levers and they hardly do it.
380) Build a product with a well-thought-out dashboard and find the users gravitating to the page.
381) Build a product with little or no styling and find the applications not gaining appeal.
382) Build a product with specific investment in stylesheets and see dramatic improvement in perception.
383) Build a product with customizable styles and the partners become happy.
384) Build a product with styles that can be changed and satisfaction grows among end users.
385) Build a product with styles that suit groups and membership of the group grows.
386) Build a product with a logo that can be made into stickers to be offered as giveaways and it becomes popular among young professionals.
387) Build a product with marketing events that become popular and the awareness increases.
388) Build a product with partnerships where partners talk about the product and the fan following grows.
389) Build a product with advocacy groups and training and the number of skilled users grows.
390) Build a product where people can learn about it and hear from others empowered in their work and the fan following grows.