Saturday, November 23, 2019

A comparison of Flink SQL execution and Facebook’s Presto continued:

Flink provides the ability to write SQL query expressions. This abstraction works closely with the Table API, and SQL queries can be executed over tables. The Table API is a language centered around tables that follows a relational model. Tables have a schema attached, and the API provides the relational operators of selection, projection, and join. Programs written with the Table API go through an optimizer that applies optimization rules before execution.
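As a minimal sketch of these operators, assuming two tables named Orders and Customers have already been registered with the table environment (the table and field names here are illustrative, not from any particular application):

import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.java.BatchTableEnvironment;

ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
BatchTableEnvironment tableEnv = BatchTableEnvironment.create(env);

// Relational operators over registered tables: join, selection, projection.
Table orders = tableEnv.scan("Orders");
Table customers = tableEnv.scan("Customers");
Table result = orders
        .join(customers).where("o_custId = c_id")   // join with a selection predicate
        .select("c_name, o_amount");                // projection

// The same query expressed through the SQL abstraction over the Table API:
Table viaSql = tableEnv.sqlQuery(
        "SELECT c_name, o_amount FROM Orders JOIN Customers ON o_custId = c_id");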

Presto from Facebook is a distributed SQL query engine that can operate on streams from various data sources, supporting ad hoc queries in near real-time.
Microsoft facilitated streaming operators with StreamInsight queries. Microsoft StreamInsight queries follow a five-step procedure:
1) Define events in terms of the payload, as the data values of the event, and the shape, as the lifetime of the event along the time axis.
2) Define the input streams of the event as a function of the event payload and shape. For example, this could be a simple enumerable over some time interval.
3) Based on the event definitions and the input stream, determine the output stream and express it as a query. In a way, this describes a flow chart for the query.
4) Bind the query to a consumer. This could be to a console, for example:
        var query = from win in inputStream.TumblingWindow(TimeSpan.FromMinutes(3)) select win.Count();
5) Run the query and evaluate it based on time.
StreamInsight queries can be written with .Net and used with standard query operators. The use of ‘WindowedStream’ in Flink applications and TumblingWindow in .Net are constructs that allow operations on a stream one window at a time.
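For comparison, a tumbling count in Flink might be sketched as follows, assuming an existing DataStream<Event> named events whose getId() accessor is illustrative:

import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.windowing.time.Time;

// Count events per id over tumbling three-minute windows, the analog
// of the .Net TumblingWindow query above.
DataStream<Tuple2<String, Long>> counts = events
        .map(e -> Tuple2.of(e.getId(), 1L))              // one marker per event
        .returns(Types.TUPLE(Types.STRING, Types.LONG))  // keep tuple type information
        .keyBy(0)                                        // key by the id field
        .timeWindow(Time.minutes(3))                     // tumbling three-minute window
        .sum(1);                                         // count per key per window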


Just like the standard query operators of .Net, the Flink SQL layer is merely a convenience over the Table API. On the other hand, Presto can run over many kinds of data sources, not just tables.

Although Apache Spark query code and Apache Flink query code look very similar, the former treats stream processing as a special case of batch processing while the latter does just the reverse. Also, Apache Flink provides a SQL abstraction over its Table API.



Friday, November 22, 2019

A comparison of Flink SQL execution and Facebook’s Presto continued:
Flink provides the ability to write SQL query expressions. This abstraction works closely with the Table API, and SQL queries can be executed over tables. The Table API is a language centered around tables that follows a relational model. Tables have a schema attached, and the API provides the relational operators of selection, projection, and join. Programs written with the Table API go through an optimizer that applies optimization rules before execution.
Presto from Facebook is a distributed SQL query engine that can operate on streams from various data sources, supporting ad hoc queries in near real-time.
Consider the time and space dimension queries that are generally required in dashboards, charts, and graphs. These queries need to search over accumulated data that might be quite large, often exceeding terabytes. As the data grows, the metadata becomes all the more important, and its organization can now be tailored to the queries instead of relying on the organization of the data. If there is a need to separate online ad hoc queries on current metadata from more analytical and intensive background queries, then we can choose different organizations of the information for each category so that they serve their queries better.
SQL queries have dominated almost all database and warehouse queries, no matter how high the software stack has been built over these products. On the other hand, simple summation-form logic on big data needs to be written with Map-Reduce methods. Although query languages such as U-SQL are trying to bridge the gap, and adapters are written to translate SQL queries over other forms of data stores, they are not native to the unstructured stores, which often come up with their own query language.
The language for the query has traditionally been SQL. Tools like LogParser allow SQL queries to be executed over enumerables. SQL has been supporting user defined operators for a while now. These user defined operators help with additional computations that are not present as built-ins. In the case of relational data, these generally have been user defined functions or user defined aggregates. With an enumerable data set, the SQL supported by LogParser is somewhat limited. Any implementation of a query execution layer over a key-value collection could choose to allow or disallow user defined operators. These enable computation on, say, user defined data types that are not restricted to the system defined types. Such types have been useful with, say, spatial coordinates or geographical data for easier abstraction and simpler expression of computational logic. For example, vector addition can be done with user defined data types and user defined operators.
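As a minimal sketch of that vector addition example, assuming Flink’s Table API as the execution layer (the Vec type and the VecAdd function are hypothetical names):

import org.apache.flink.table.functions.ScalarFunction;

// Hypothetical user defined data type: a two-dimensional vector.
public class Vec {
    public double x;
    public double y;
    public Vec() {}
    public Vec(double x, double y) { this.x = x; this.y = y; }
}

// User defined operator that adds two vectors component-wise.
public class VecAdd extends ScalarFunction {
    public Vec eval(Vec a, Vec b) {
        return new Vec(a.x + b.x, a.y + b.y);
    }
}

Once registered with, say, tableEnv.registerFunction("vecAdd", new VecAdd()), the operator becomes available to Table API and SQL queries just like a built-in.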
The solutions that are built on top of SQL queries are not necessarily looking to write SQL. They want programmability and are just as content to write code using the Table API as they are with SQL queries. Consequently, business intelligence and CRM solutions write their code with standard query operators over their data. These well-defined operators are applicable to most collections of data, and the notion is shared with other programming languages such as .Net.

Thursday, November 21, 2019

This is a continuation of the earlier posts to enumerate funny aspects of software engineering practice:

401) Build a product where the last mile before the release usually requires everyone to pitch in

402) Build a product where the pre-release efforts can never truly be separated from development efforts

403) Build a product where the last mile gathers unwanted attention, even from those who don’t have any direct dependence on the product.

404) Build a product where the announcements become something to look forward to, both for the people building the product and those waiting for it outside.

405) Build a product where the process of building never truly becomes a joyful process.

406) Build a product where the release of a product is never really a starting point for the next version

407) Build a product where the last few weeks before the release are usually reserved for making the customer experience better, rather than doing so throughout development.

408) Build a product where Murphy’s law seems to be more applicable toward the endgame.

409) Build a product where the week after the release still manages to throw surprises.

410) Build a product where customer celebrations of their success with the product add joy to the work involved during development, but the hands that made the product are no longer there.

411) Build a product where the innovations come from intellectual horsepower as opposed to the forces shaping the product, and find more maintenance down the road.

412) Build a product where the people building the product leave a mark and recognition for themselves, only to find it revised by the next batch.

413) Build a product as a v1 in the market and find the adage that nothing remains unchanged holding more truth than ever before.

414) Build a product where the product matures version after version and find its customers snowball with loyalty

415) Build a product where the establishment of a product allows ventures in sister products that also do well

416) Build a product where vendors vie to ship together.

417) Build a product where knowledge about competitor products is never too far away.

418) Build a product where the patented innovations spawn open source substitutes and workarounds

419) Build a product where the dynamics of the people building the product shapes the product more than anything else

Wednesday, November 20, 2019

Partitioning is probably the reason why tools like Spark and Flink became popular for Big Data.
The partition function of Flink can be demonstrated with the following:
import org.apache.flink.api.common.functions.GroupReduceFunction;
import org.apache.flink.api.common.functions.Partitioner;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.util.Collector;

final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
// Read (movieId, rating) pairs, average the ratings per movie, and route
// each average to a custom partition.
DataSet<Tuple2<Long, Double>> partitioned = env.readCsvFile("ratings.csv")
    .ignoreFirstLine()
    .includeFields(false, true, true, false)   // keep the movieId and rating columns
    .types(Long.class, Double.class)
    .groupBy(0)                                // group by movieId
    .reduceGroup(new GroupReduceFunction<Tuple2<Long, Double>, Tuple2<Long, Double>>() {
        @Override
        public void reduce(Iterable<Tuple2<Long, Double>> values, Collector<Tuple2<Long, Double>> out) throws Exception {
            Long movieId = null;
            double total = 0;
            int count = 0;
            for (Tuple2<Long, Double> value : values) {
                movieId = value.f0;
                total += value.f1;
                count++;
            }
            if (count > 50) {                  // keep movies with more than 50 ratings
                out.collect(new Tuple2<>(movieId, total / count));
            }
        }
    })
    .partitionCustom(new Partitioner<Double>() {
        @Override
        public int partition(Double key, int numPartitions) {
            return key.intValue() - 1;         // average ratings 1.0-5.0 map to partitions 0-4
        }
    }, 1);                                     // partition on field 1, the average rating

Tuesday, November 19, 2019

This is a continuation of the earlier posts to enumerate funny aspects of software engineering practice:

390) Build a product where people can learn about and hear from others empowered in their work, and the fan following grows.

391) Build a product where they have to wait for orphaned resources cleanup before they can proceed with a re-install.

392) Build a product where the users have to frequently run the installer and it doesn’t complete some of the time.

393) Build a product where the software is blamed because the administrator was reluctant to read the manual.

394) Build a product where the resources provided by the customer for the software do not meet the requirements.

395) Build a product where different parts of the software need to be installed differently and find that deployments are usually haphazard

396) Build a product where the installation on production environment is so elaborate that it requires planning, dry runs and coordination across teams

397) Build a product where every update is used by the staff to justify setting aside a day.

398) Build a product where the reality and perception are deliberately kept different at a customer site

399) Build a product where the vision for the product is different from what the customer wants to use it for.

400) Build a product where the quirkiness of the product offers fodder for all kinds of talks, from conferences to board room meetings.



Monday, November 18, 2019

A comparison of Flink SQL execution and Facebook’s Presto continued:

Flink provides the ability to write SQL query expressions. This abstraction works closely with the Table API, and SQL queries can be executed over tables. The Table API is a language centered around tables that follows a relational model. Tables have a schema attached, and the API provides the relational operators of selection, projection, and join. Programs written with the Table API go through an optimizer that applies optimization rules before execution.

Presto from Facebook is a distributed SQL query engine that can operate on streams from various data sources, supporting ad hoc queries in near real-time.

The querying of a key-value collection is handled natively by the data store. This translates to what is popularly described in SQL over a relational store as a join, where the key-values can be considered a table with the key and value as columns. The desired keys for the predicate can be put in a separate temporary table holding just the keys of interest, and a join can be performed between the two based on matching keys.

Without the analogy of the join, key-value collections require standard query operators like a where clause that tests for a match against a set of keys. This is rather expensive compared to the join because we iterate over a large list of key-values, possibly repeatedly, for matches against one or more keys in the provided set.
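A self-contained sketch of the difference (the keys and values here are illustrative):

import java.util.*;

public class KeyValueLookup {
    public static void main(String[] args) {
        List<Map.Entry<String, String>> kvList = List.of(
                Map.entry("k1", "v1"), Map.entry("k2", "v2"), Map.entry("k3", "v3"));
        Set<String> keysOfInterest = Set.of("k1", "k3"); // the temporary table of keys

        // IN-style predicate: every entry in the list is tested against the key set.
        for (Map.Entry<String, String> e : kvList) {
            if (keysOfInterest.contains(e.getKey())) {
                System.out.println("IN match: " + e);
            }
        }

        // Join-style: build a hash index once, then probe only the keys of interest,
        // which shrinks the work from the size of the list to the size of the key set.
        Map<String, String> index = new HashMap<>();
        for (Map.Entry<String, String> e : kvList) {
            index.put(e.getKey(), e.getValue());
        }
        for (String k : keysOfInterest) {
            String v = index.get(k);
            if (v != null) System.out.println("JOIN match: " + k + "=" + v);
        }
    }
}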

Most key-value collections are scoped. They are not necessarily in one large global list. Such key-values become scoped to the document or the object. The document may be in one of two forms: Json and Xml. The Json format has its own query language, referred to as JMESPath, and Xml also supports path-based queries. When the key-values are scoped, they can be efficiently searched by an application using standard query operators, without requiring the use of the paths inherent to a document format such as Json or Xml.

Presto’s scalability in processing petabytes of data is unparalleled, and the use of a distributed SQL query engine helps.

int getKthAntiClockWise(int[][] A, int m, int n, int k)
{
    // Returns the k-th element of the anti-clockwise spiral traversal
    // of an m x n matrix (m rows, n columns), or -1 if out of range.
    if (n < 1 || m < 1 || k < 1) return -1;
    if (k <= m)
        return A[k - 1][0];                                   // down the first column
    if (k <= m + n - 1)
        return A[m - 1][k - m];                               // right along the bottom row
    if (k <= m + n - 1 + m - 1)
        return A[m - 1 - (k - (m + n - 1))][n - 1];           // up the last column
    if (k <= m + n - 1 + m - 1 + n - 2)
        return A[0][n - 1 - (k - (m + n - 1 + m - 1))];       // left along the top row
    return getKthAntiClockWise(copyInner(A), m - 2, n - 2, k - (2 * m + 2 * n - 4));
}

// Copies the inner (m-2) x (n-2) submatrix starting at (1,1) using System.arraycopy.
int[][] copyInner(int[][] A)
{
    int[][] B = new int[A.length - 2][A[0].length - 2];
    for (int i = 1; i < A.length - 1; i++)
        System.arraycopy(A[i], 1, B[i - 1], 0, A[i].length - 2);
    return B;
}

Sunday, November 17, 2019

A comparison of Flink SQL execution and Facebook’s Presto:
Flink provides the ability to write SQL query expressions. This abstraction works closely with the Table API, and SQL queries can be executed over tables. The Table API is a language centered around tables that follows a relational model. Tables have a schema attached, and the API provides the relational operators of selection, projection, and join. Programs written with the Table API go through an optimizer that applies optimization rules before execution.
Flink applications generally do not need to use the above abstractions of the Table API and SQL layer. Instead they can work directly with the core APIs: the DataStream (unbounded) and DataSet (bounded) APIs. These APIs provide the ability to perform stateful stream processing.
For example,
DataStream<String> lines = env.addSource(new FlinkKafkaConsumer<>(…)); // source
DataStream<Event> events = lines.map(line -> parse(line)); // transformation
DataStream<Statistics> stats = events
    .keyBy("id")
    .timeWindow(Time.seconds(10))
    .apply(new MyWindowAggregationFunction()); // windowed aggregation
stats.addSink(new BucketingSink<>(path)); // sink
Presto from Facebook is a distributed SQL query engine that can operate on streams from various data sources, supporting ad hoc queries in near real-time. It does not partition based on MapReduce and executes the query with a custom SQL execution engine written in Java. It has a pipelined data model that can run multiple stages at once, pipelining data between stages as it becomes available. This reduces end-to-end time while maximizing parallelization via stages on large data sets. A coordinator takes the incoming query from the user and draws up the plan and the assignment of resources. Facebook’s Presto can run on large data sets of social media, on the order of petabytes. It can also run over HDFS for interactive analytics.
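As a hedged sketch, such an ad hoc query can be issued to a Presto coordinator from Java through Presto’s JDBC driver; the host, catalog, schema, table, and user below are placeholders:

import java.sql.*;

public class PrestoQuery {
    public static void main(String[] args) throws SQLException {
        // The coordinator plans the query and assigns resources across workers.
        String url = "jdbc:presto://coordinator:8080/hive/default";
        try (Connection conn = DriverManager.getConnection(url, "analyst", null);
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                     "SELECT id, COUNT(*) AS cnt FROM events GROUP BY id")) {
            while (rs.next()) {
                System.out.println(rs.getString("id") + " -> " + rs.getLong("cnt"));
            }
        }
    }
}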
There is also a difference in the queries when we match a single key versus many keys. For example, when we use the == operator versus the IN operator in the query statement, the size of the list of key-values to be iterated does not shrink. Only the efficiency of matching a tuple against the set of keys in the predicate improves with an IN operator, because we don’t have to traverse the entire list multiple times; each entry is matched against the set of keys specified by the IN operator. The use of a join, on the other hand, reduces the size of the range significantly and gives the query execution a chance to optimize the plan.
Just like the standard query operators of .Net, the Flink SQL layer is merely a convenience over the Table API. On the other hand, Presto can run over many kinds of data sources, not just tables.