Monday, November 25, 2019

A comparison of Flink SQL execution and Facebook’s Presto continued:



Flink applications provide the ability to write SQL query expressions. This abstraction works closely with the Table API, and SQL queries can be executed over tables. The Table API is a language-integrated API centered around tables and follows a relational model. Tables have a schema attached, and the API provides the relational operators of selection, projection, and join. Programs written with the Table API go through an optimizer that applies optimization rules before execution.
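A minimal sketch of that relationship is below, assuming a Flink 1.9-style batch table environment and a registered table named Orders with userId and amount columns (the table and column names are illustrative assumptions, not from the original post). The same result can be expressed through the Table API or the SQL abstraction, and both go through the same optimizer:

import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.java.BatchTableEnvironment;

public class TableSqlSketch {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
        BatchTableEnvironment tableEnv = BatchTableEnvironment.create(env);

        // Assume a table "Orders" with columns userId and amount has been registered.

        // Table API: selection, projection, and grouping as method calls.
        Table byUser = tableEnv.scan("Orders")
                .groupBy("userId")
                .select("userId, amount.sum as total");

        // The same query expressed through the SQL abstraction.
        Table byUserSql = tableEnv.sqlQuery(
                "SELECT userId, SUM(amount) AS total FROM Orders GROUP BY userId");
    }
}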



Presto from Facebook is a distributed SQL query engine that can operate over various data sources, supporting ad hoc queries in near real time.
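Presto is typically reached over its JDBC interface. The sketch below shows an ad hoc query issued from Java, assuming the Presto JDBC driver is on the classpath; the coordinator host, catalog, schema, and table are hypothetical placeholders:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class PrestoAdhocQuery {
    public static void main(String[] args) throws Exception {
        // Hypothetical coordinator, catalog, and schema; substitute real values.
        String url = "jdbc:presto://presto-coordinator:8080/hive/default";
        try (Connection conn = DriverManager.getConnection(url, "analyst", "");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                     "SELECT region, count(*) AS orders FROM orders GROUP BY region")) {
            while (rs.next()) {
                System.out.println(rs.getString("region") + " -> " + rs.getLong("orders"));
            }
        }
    }
}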



Just like the standard query operators of .Net, the Flink SQL layer is merely a convenience over the Table API. Presto, on the other hand, can run over many kinds of data sources, not just tables exposed through a Table API.



Although Apache Spark query code and Apache Flink query code look very similar, the former treats stream processing as a special case of batch processing while the latter does just the reverse.

Also, Apache Flink provides a SQL abstraction over its Table API.


While Apache Spark provides ".map(...)" and ".reduce(...)" programmability syntax to support batch-oriented processing, Apache Flink provides a Table API with ".groupBy(...)" and ".orderBy(...)" syntax. It provides a SQL abstraction and supports stream processing as the norm.
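The contrast can be made concrete with a small Table API sketch. The Ratings table and its columns below are assumptions for illustration, showing the ".groupBy(...)" and ".orderBy(...)" style in place of hand-written map and reduce functions:

import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.java.BatchTableEnvironment;

public class GroupByOrderBySketch {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
        BatchTableEnvironment tableEnv = BatchTableEnvironment.create(env);

        // Assume a table "Ratings" with columns movieId and rating has been registered.
        Table averages = tableEnv.scan("Ratings")
                .groupBy("movieId")
                .select("movieId, rating.avg as avgRating")
                .orderBy("avgRating.desc");
    }
}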

Sunday, November 24, 2019

Flink Savepoint versus checkpoint
Frequently, Apache Flink users find that the tasks they submit to the Flink runtime run long. This happens when the data these tasks process is very large. There are usually several tasks that perform the same operation by dividing the data among themselves. These tasks do not typically operate in batch mode; they are most likely performing stream-based operations, handling one window at a time. The state of a Flink application can be persisted with savepoints. This is something the application can configure for its tasks before they run. Tasks are usually run as part of Flink ‘jobs’, depending on the parallelism set by the application. The savepoint is usually stored as a set of files under the <base>/.flink/savepoint folder.
Savepoints can be manually generated from running Flink jobs with the following command:
# ./bin/flink savepoint <jobID>
Checkpoints, by contrast, are system generated; they do not depend on the user. Checkpoints are also written to the file system in much the same way as savepoints. Although they use different names and are meant to be triggered by the user versus the system, they share the same mechanism to persist state for jobs, as long as those jobs are enumerable.
The list of running jobs can be seen from the command line with
# ./bin/flink list -r
where the -r option lists only the running jobs.
Since the code to persist savepoints and checkpoints is the same, the persistence of state needs to follow the policies of the system, which include a limit on the number of states an application can persist.
This implies that manual savepointing will fail if those limits are violated, causing all savepoint triggers to fail. The configuration of savepointing and checkpointing is described in the Apache Flink documentation. Checkpointing may also be possible with the reader group for the stream store; its limits may need to be honored as well.
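A rough sketch of how an application opts in to checkpointing is below; the interval, timeout, and checkpoint directory are arbitrary illustrative values, not recommendations:

import org.apache.flink.runtime.state.filesystem.FsStateBackend;
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.CheckpointConfig.ExternalizedCheckpointCleanup;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointConfigSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Take a checkpoint every 60 seconds with exactly-once guarantees.
        env.enableCheckpointing(60_000, CheckpointingMode.EXACTLY_ONCE);

        // Keep at most one checkpoint in flight and allow up to 10 minutes per checkpoint.
        env.getCheckpointConfig().setMaxConcurrentCheckpoints(1);
        env.getCheckpointConfig().setCheckpointTimeout(600_000);

        // Retain externalized checkpoints on cancellation so they can be used
        // for recovery much like savepoints.
        env.getCheckpointConfig().enableExternalizedCheckpoints(
                ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);

        // Checkpoint state goes to a file system path, analogous to the savepoint directory.
        env.setStateBackend(new FsStateBackend("file:///tmp/flink-checkpoints"));

        // ... build the streaming job and call env.execute(...) here.
    }
}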
Users have syntax from Flink, such as savepoints, to capture progress. Users can create, own, or delete savepoints, which represent the execution state of a streaming job. These savepoints point to actual files on storage. Flink also provides checkpointing, which it creates and deletes without user intervention.
While checkpoints are focused on recovery, are much more lightweight than savepoints, and are bound to the job lifetime, they can serve equally well as diagnostic mechanisms.
The detection of failed savepoints is clear from both the JobManager logs and the TaskManager logs. Shell execution of the flink command for savepoints is run from the JobManager, where the flink binary is available. The logs are conveniently available in the log directory.

Saturday, November 23, 2019

A comparison of Flink SQL execution and Facebook’s Presto continued:

Flink applications provide the ability to write SQL query expressions. This abstraction works closely with the Table API, and SQL queries can be executed over tables. The Table API is a language-integrated API centered around tables and follows a relational model. Tables have a schema attached, and the API provides the relational operators of selection, projection, and join. Programs written with the Table API go through an optimizer that applies optimization rules before execution.

Presto from Facebook is a distributed SQL query engine that can operate over various data sources, supporting ad hoc queries in near real time.
Microsoft facilitated streaming operators with StreamInsight queries. Microsoft StreamInsight queries follow a five-step procedure:
1) Define events in terms of the payload, the data values of the event, and the shape, the lifetime of the event along the time axis.
2) Define the input streams of the events as a function of the event payload and shape. For example, this could be a simple enumerable over some time interval.
3) Based on the event definitions and the input stream, determine the output stream and express it as a query. In a way, this describes a flow chart for the query.
4) Bind the query to a consumer. This could be the console. For example:
        var query = from win in inputStream.TumblingWindow(TimeSpan.FromMinutes(3)) select win.Count();
5) Run the query and evaluate it based on time.
The StreamInsight queries can be written with .Net and used with the standard query operators. The use of ‘WindowedDataStream’ in Flink applications and TumblingWindow in .Net are constructs that allow operating on a stream one window at a time.
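For comparison, a Flink counterpart to the tumbling-window count above might look like the following sketch; the socket source, port, and window size are assumptions for illustration:

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class TumblingWindowCountSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Illustrative source: one event per line arriving on a local socket.
        DataStream<String> events = env.socketTextStream("localhost", 9999);

        // Count events in non-overlapping three-minute windows, analogous to
        // TumblingWindow(TimeSpan.FromMinutes(3)) followed by Count() in StreamInsight.
        DataStream<Long> counts = events
                .map(new MapFunction<String, Long>() {
                    @Override
                    public Long map(String value) {
                        return 1L;
                    }
                })
                .windowAll(TumblingProcessingTimeWindows.of(Time.minutes(3)))
                .reduce((a, b) -> a + b);

        counts.print();
        env.execute("Tumbling window count");
    }
}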


Just like the standard query operators of .Net, the Flink SQL layer is merely a convenience over the Table API. Presto, on the other hand, can run over many kinds of data sources, not just tables exposed through a Table API.

Although Apache Spark query code and Apache Flink query code look very similar, the former treats stream processing as a special case of batch processing while the latter does just the reverse.
Also, Apache Flink provides a SQL abstraction over its Table API.



Friday, November 22, 2019

A comparison of Flink SQL execution and Facebook’s Presto continued:
Flink applications provide the ability to write SQL query expressions. This abstraction works closely with the Table API, and SQL queries can be executed over tables. The Table API is a language-integrated API centered around tables and follows a relational model. Tables have a schema attached, and the API provides the relational operators of selection, projection, and join. Programs written with the Table API go through an optimizer that applies optimization rules before execution.
Presto from Facebook is a distributed SQL query engine that can operate over various data sources, supporting ad hoc queries in near real time.
Consider the time and space dimension queries that are generally required in dashboards, charts, and graphs. These queries need to search over accumulated data, which might be quite large, often exceeding terabytes. As the data grows, the metadata becomes all the more important, and its organization can now be tailored to the queries instead of relying on the organization of the data. If there is a need to separate online ad hoc queries on current metadata from more analytical and intensive background queries, then we can choose different organizations of the information in each category so that each serves its queries better.
SQL queries have dominated almost all database and warehouse querying, no matter how high the software stack has been built over these products. On the other hand, simple summation-form logic on big data needs to be written with map-reduce methods. Although query languages such as U-SQL are trying to bridge the gap, and adapters are written to translate SQL queries over other forms of data stores, they are not native to the unstructured stores, which often come with their own query languages.
The language for the query has traditionally been SQL. Tools like LogParser allow SQL queries to be executed over enumerables. SQL has been supporting user-defined operators for a while now. These user-defined operators help with additional computations that are not present as built-ins. In the case of relational data, these have generally been user-defined functions or user-defined aggregates. With enumerable data sets, the SQL supported by LogParser is somewhat limited. Any implementation of a query execution layer over a key-value collection could choose to allow or disallow user-defined operators. These enable computation on, say, user-defined data types that are not restricted to the system-defined types. Such types have been useful with, say, spatial coordinates or geographical data for easier abstraction and simpler expression of computational logic. For example, vector addition can be done with user-defined data types and user-defined operators.
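A loose illustration of this idea in Flink's Table API is sketched below; the distance function, the Trips table, and its columns are hypothetical, standing in for a user-defined operator over spatial coordinates:

import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.java.BatchTableEnvironment;
import org.apache.flink.table.functions.ScalarFunction;

public class UdfSketch {

    // A user-defined scalar function computing a planar distance between two points.
    public static class Distance extends ScalarFunction {
        public double eval(double x1, double y1, double x2, double y2) {
            double dx = x1 - x2;
            double dy = y1 - y2;
            return Math.sqrt(dx * dx + dy * dy);
        }
    }

    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
        BatchTableEnvironment tableEnv = BatchTableEnvironment.create(env);

        // Make the UDF available to both the Table API and SQL queries.
        tableEnv.registerFunction("distance", new Distance());

        // Assume a table "Trips" with pickup/dropoff coordinates has been registered.
        Table trips = tableEnv.sqlQuery(
                "SELECT tripId, distance(pickupX, pickupY, dropX, dropY) AS dist FROM Trips");
    }
}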
The solutions that are built on top of SQL queries are not necessarily looking to write SQL. They want programmability and are just as content to write code using the Table API as they are with SQL queries. Consequently, business intelligence and CRM solutions write their code with standard query operators over their data. These well-defined operators are applicable to most collections of data, and the notion is shared with other programming languages such as .Net.

Thursday, November 21, 2019

This is a continuation of the earlier posts to enumerate funny aspects of software engineering practice:

401) Build a product where the last mile before the release usually requires everyone to pitch in

402) Build a product where the pre-release efforts can never truly be separated from development efforts

403) Build a product where the last mile gathers unwanted attention from those who don’t have any direct dependence on the product.

 404) Build a product where the announcements become something to look forward to both for people building the product and those waiting for it outside.

405) Build a product where the process of building never truly becomes a joyful process.

406) Build a product where the release of a product is never really a starting point for the next version

407) Build a product where the last few weeks before the release are usually reserved for making the customer experience better, rather than doing so during development.

408) Build a product where Murphy’s law seems to be more applicable towards the endgame.

409) Build a product where the week after the release still manages to throw surprises.

410) Build a product where the customers’ celebrations of their success with the product add joy to the work involved during development, but the hands that made the product are no longer there.

411) Build a product where the innovations are from intellectual horsepower as opposed to the forces shaping the product, and find more maintenance down the road.

412) Build a product where the people building the product leave a mark and recognition for themselves, only to find it revised by the next batch.

413) Build a product as a v1 in the market and find the adage that nothing remains unchanged holding more truth than ever before.

414) Build a product where the product matures version after version and find its customers snowball with loyalty

415) Build a product where the establishment of a product allows ventures in sister products that also do well

416) Build a product where vendors vie to ship together.

417) Build a product where knowledge about how the competitor products work is never too far away.

418) Build a product where the patented innovations spawn open source substitutes and workarounds

419) Build a product where the dynamics of the people building the product shapes the product more than anything else

Wednesday, November 20, 2019

Partitioning is probably the reason why tools like Spark and Flink become popular for Big Data.
The partition function of Flink can be demonstrated with the following:
import org.apache.flink.api.common.functions.GroupReduceFunction;
import org.apache.flink.api.common.functions.Partitioner;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.util.Collector;

public class RatingsPartitionExample {
    public static void main(String[] args) throws Exception {
        final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // Read (movieId, rating) pairs from ratings.csv, skipping the header line.
        DataSet<Tuple2<Long, Double>> averagedRatings = env.readCsvFile("ratings.csv")
                .ignoreFirstLine()
                .includeFields(false, true, true, false)
                .types(Long.class, Double.class)
                // Group the ratings by movieId.
                .groupBy(0)
                // Emit the average rating per movie, but only for movies with more than 50 ratings.
                .reduceGroup(new GroupReduceFunction<Tuple2<Long, Double>, Tuple2<Long, Double>>() {
                    @Override
                    public void reduce(Iterable<Tuple2<Long, Double>> values,
                                       Collector<Tuple2<Long, Double>> out) throws Exception {
                        Long movieId = null;
                        double total = 0;
                        int count = 0;
                        for (Tuple2<Long, Double> value : values) {
                            movieId = value.f0;
                            total += value.f1;
                            count++;
                        }
                        if (count > 50) {
                            out.collect(new Tuple2<>(movieId, total / count));
                        }
                    }
                })
                // Route records to partitions by the average rating (field 1):
                // an average of, say, 3.7 lands in partition 2.
                .partitionCustom(new Partitioner<Double>() {
                    @Override
                    public int partition(Double key, int numPartitions) {
                        return key.intValue() - 1;
                    }
                }, 1);

        averagedRatings.print();
    }
}

Tuesday, November 19, 2019

This is a continuation of the earlier posts to enumerate funny aspects of software engineering practice:

390) Build a product where people can learn about and hear others empowered in their work and the fan following grows.

391) Build a product where they have to wait for orphaned-resource cleanup before they can proceed with a re-install.

392) Build a product where the users have to frequently run the installer and it doesn’t complete some of the time.

393) Build a product where the software is blamed because the administrator was shy to read the manual

394) Build a product where the resources provided by the customer for the software do not meet the requirements.

395) Build a product where different parts of the software need to be installed differently and find that deployments are usually haphazard

396) Build a product where the installation on production environment is so elaborate that it requires planning, dry runs and coordination across teams

397) Build a product where every update is used by the staff to justify setting aside a day.

398) Build a product where the reality and perception are deliberately kept different at a customer site

399) Build a product where the vision for the product is different from what the customer wants to use it for.

400) Build a product where the quirkiness of the product offers fodder for all kinds of talks, from conferences to boardroom meetings.
