Tuesday, February 16, 2021

User interface and API design for Streaming Queries (continued from previous post ...)

 The Proposal  

Query Interface is not about keywords. 

When it comes to a user interface for querying data, people often associate the Google search box with typing search terms. That is not necessarily the right interface for streaming queries; if anything, it is simplistic and keyword-based rather than operator-based. The Splunk interface is a much cleaner model because it has matured from search over machine data. Streaming queries are not table queries, and they do not operate only on big data in the Hadoop file system. They do have standard operators that translate directly to operator keywords and can be chained with pipes on the search bar in the same manner they are chained in a Java file. This translation of query operator expressions on the search bar into a Java-language Flink streaming query is a patentable design for a streaming user interface, and one that will prove immensely flexible for direct execution of user queries as compared to packaged queries. 
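
As a sketch of the idea, and only a sketch: a piped expression such as "search errors | where status >= 500 | stats count by host" could be parsed segment by segment and emitted as a chained Flink DataStream program in Java. The expression syntax, the LogEvent type, and the sample data below are hypothetical; only the Flink calls themselves are real APIs.

// Hedged sketch: each pipe segment on the search bar maps to one chained operator.
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class SearchBarTranslationSketch {
    // Hypothetical event shape produced by an upstream source.
    public static class LogEvent {
        public String host;
        public int status;
        public LogEvent() { }
        public LogEvent(String host, int status) { this.host = host; this.status = status; }
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // In a real system this stream would come from a connector; here it is inlined.
        DataStream<LogEvent> events = env.fromElements(
                new LogEvent("web-1", 200),
                new LogEvent("web-2", 503),
                new LogEvent("web-2", 500));

        // "| where status >= 500"  ->  filter(...)
        // "| stats count by host"  ->  map to (host, 1), keyBy(host), sum(1)
        DataStream<Tuple2<String, Integer>> counts = events
                .filter(e -> e.status >= 500)
                .map(e -> Tuple2.of(e.host, 1))
                .returns(Types.TUPLE(Types.STRING, Types.INT))
                .keyBy(t -> t.f0)
                .sum(1);

        counts.print();
        env.execute("search-bar-to-flink-sketch");
    }
}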

Some truths about user interface development: 

1) User Interface development estimates slip too often. It is not necessarily due to user interface test automation frameworks such as Selenium. Instead, it is caused by the user interface becoming monolithic and clunky. It takes on a large maintenance tail as different browsers and timing issues must be handled or special-cased. 

2) As the number of people working on the user interface increases, the gains to user interface library development follow an efficiency curve rather than growing linearly.  

3) The user interface should be more like a library than an all-inclusive customized application. It should switch easily as a layer over one data source, another, or both. 

4) It may be tempting to copy existing solutions from competitors such as 

Splunk 

SQL 

Tableau 

Machine Learning User Interface 

But it is far more important to streamline the translation of plain-English search expressions on the search bar into Java Flink queries that can run in a fast, serverless manner. 

5) If a wire diagram becomes more appealing and the go-to resource for all problem solving, the product will likely never materialize. Instead, the comfort and pain of using the existing technologies must be honed into improvements in user experience and usability. 

6) A plug-and-play JavaScript plugin framework, like the one for SharePoint, is a great start for standard query operators. The essence of stream querying, along with the continuous availability of results and their paging, is something that will not be found elsewhere and should be salient in a user interface over streams. 

7) Security is not a functionality of the user interface. It is a non-functional aspect and should not require scattered checks all over the place; instead it should be enforced at the point of admission control (see the sketch after this list). 

8) A user interface is great to build on a REST interface and with thin clients, but the functionality should be composable and inheritable.  

9) Separation of concerns, especially between dashboard elements and the logic pertaining to one or more search queries, will tremendously improve the offerings to the customer. 
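
The admission-control point mentioned in item 7 could be a single filter at the front of the REST layer, so no dashboard or query handler repeats the check. This is a minimal sketch assuming a Jakarta Servlet container; the token validation is a placeholder, not a real scheme.

// Hedged sketch of admission control at one choke point instead of scattered checks.
import jakarta.servlet.Filter;
import jakarta.servlet.FilterChain;
import jakarta.servlet.ServletException;
import jakarta.servlet.ServletRequest;
import jakarta.servlet.ServletResponse;
import jakarta.servlet.http.HttpServletRequest;
import jakarta.servlet.http.HttpServletResponse;

import java.io.IOException;

public class AdmissionControlFilter implements Filter {
    @Override
    public void doFilter(ServletRequest request, ServletResponse response, FilterChain chain)
            throws IOException, ServletException {
        HttpServletRequest http = (HttpServletRequest) request;
        String token = http.getHeader("Authorization");

        // Single point of admission: reject here, before any query logic runs.
        if (token == null || !isValid(token)) {
            ((HttpServletResponse) response).sendError(HttpServletResponse.SC_UNAUTHORIZED);
            return;
        }
        chain.doFilter(request, response);
    }

    // Placeholder: a real deployment would verify a signed token or a session.
    private boolean isValid(String token) {
        return token.startsWith("Bearer ");
    }
}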

Monday, February 15, 2021

User Interface and API design for Streaming queries

 Problem statement: 

Analytical applications that perform queries on data are best visualized via catchy charts and graphs. Most applications built on storage and analytical products have a user interface that lets users create containers for their data but offers very little leverage for the customer to author streaming queries. Even the option to run a query is often nothing more than the upload of a compiled-language program artifact. The application reuses a template for requests and responses, but this is very different from the long-running, streaming responses that are typical of these queries. A solution is presented in this document. 

Description 

An application can choose to offer a set of packaged queries for the user to choose from a dropdown menu while internalizing the upload of the corresponding programs, their execution, and the return of the results. One of the restrictions that comes with packaged queries exposed via REST APIs is their ability to scale, since they consume significant resources on the backend and continue to run for a long time. These restrictions cannot be relaxed without some reduction in their resource usage. The API must provide a way for consumers to launch several queries with trackers, and those queries should complete reliably even if they are executed one by one. This is facilitated with the help of a reference to the query and a progress indicator. The reference is merely an opaque identifier that only the system issues and uses to look up the status. The indicator could be another API that takes the reference and returns the status. It is relatively easy for the system to separate read-only status information from read-write operations, so the number of times the status indicator is called does not degrade the rest of the system. The status-information part of the system is cleanly separated and is usually populated periodically, by collection or by push, from the rest of the system. The separation of read-write from read-only also allows the two to be treated differently. For example, it is possible to replace the technology for the read-only side separately from the technology for read-write, or to swap the read-only technology from one to another for improvements on that side alone. 
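
A minimal sketch of the reference-and-progress pattern described above, with the read-only status store kept apart from the read-write submission path. All class and method names here are illustrative rather than an existing API.

// Hedged sketch: an opaque reference is issued on submit; status is a separate, read-only path.
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

public class QueryTrackerSketch {
    public enum State { QUEUED, RUNNING, SUCCEEDED, FAILED }

    // Read-only status record, kept separate from the read-write submission path.
    public record QueryStatus(State state, int percentComplete) { }

    private final Map<String, QueryStatus> statusStore = new ConcurrentHashMap<>();

    // POST /queries -> returns an opaque reference that only the system issues and resolves.
    public String submit(String queryExpression) {
        String reference = UUID.randomUUID().toString();
        statusStore.put(reference, new QueryStatus(State.QUEUED, 0));
        // The long-running execution itself would be handed off to a backend worker here.
        return reference;
    }

    // GET /queries/{reference}/status -> read-only; calling it often cannot degrade
    // the read-write side because it only touches the status store.
    public QueryStatus status(String reference) {
        return statusStore.getOrDefault(reference, new QueryStatus(State.FAILED, 0));
    }

    // The worker (or a periodic collector) pushes progress into the status store.
    public void report(String reference, State state, int percentComplete) {
        statusStore.put(reference, new QueryStatus(state, percentComplete));
    }
}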

The design of all REST APIs generally follows a convention. This practice gives well-recognized URI qualifier patterns, query parameters, and methods. Exceptions, errors, and logging are typically handled with the best usage of the HTTP protocol. The 1.1 version of that protocol is revised so it helps with the  

SQL queries can also be translated from REST API requests with the help of engines like Presto or LogParser. 

Tools like LogParser allow SQL queries to be executed over enumerables. SQL has supported user-defined operators for a while now. Aggregation operations have the benefit that they can support both batch- and streaming-mode operation. These operations can therefore work over large datasets because they view only a portion at a time. The batch operation can be parallelized across several processors, which has generally been the case with Big Data and cluster-mode operations. Both batch and stream processing can operate on unstructured data.  

Presto from Facebook is a distributed SQL query engine that can operate on streams from various data sources, supporting ad hoc queries in near real time. 
Just like the standard query operators of .NET, the Flink SQL layer is merely a convenience over the Table API. Presto, on the other hand, offers to run over any kind of data source, not just Table APIs. 
Although Apache Spark query code and Apache Flink query code look very similar, the former treats stream processing as a special case of batch processing while the latter does just the reverse. 
Also, Apache Flink provides a SQL abstraction over its Table API. 
While Apache Spark provides ".map(...)" and ".reduce(...)" syntax to support batch-oriented processing, Apache Flink provides a Table API with ".groupBy(...)" and ".orderBy(...)" syntax. It also provides a SQL abstraction and supports stream processing as the norm. 
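
To make the contrast concrete, here is a small sketch that runs the same aggregation through the Flink Table API and through its SQL abstraction. The table and column names are made up for illustration.

// Hedged sketch: the SQL abstraction is a convenience over the same Table API.
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.TableEnvironment;

import static org.apache.flink.table.api.Expressions.$;
import static org.apache.flink.table.api.Expressions.row;

public class FlinkTableSketch {
    public static void main(String[] args) {
        TableEnvironment tEnv = TableEnvironment.create(
                EnvironmentSettings.newInstance().inStreamingMode().build());

        // Inline rows stand in for a streaming source connector.
        Table events = tEnv.fromValues(
                row("web-1", 200),
                row("web-2", 503),
                row("web-2", 500))
            .as("host", "status");

        // Table API style: groupBy(...) / select(...) chained like standard query operators.
        Table byHost = events
                .groupBy($("host"))
                .select($("host"), $("status").count().as("cnt"));

        // The same query expressed through the SQL abstraction.
        tEnv.createTemporaryView("events_view", events);
        Table viaSql = tEnv.sqlQuery(
                "SELECT host, COUNT(status) AS cnt FROM events_view GROUP BY host");

        byHost.execute().print();
        viaSql.execute().print();
    }
}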

The APIs vary when the engines change, even though each forms an interface through which any virtualized query can be executed, with its own query-processing syntax. They vary because the requests and responses vary, and because their size and content vary. Some APIs are better suited to stream processing and others to batch processing. 

Sunday, February 14, 2021

Importer-Exporter

Problem Statement: Data in a database poses a challenge to utilities that have matured as command-line tools because it is not in a declarative form. The moment data is exported, it becomes an instant legacy. On the other hand, when the data is exported as files, objects, or events, businesses are empowered to use non-relational technologies such as Big Data analysis stacks, the Snowflake warehouse, and stream analytics. What is needed is an import-export tool, and this article discusses the options available for building one. 

 

Solution: Data import/export is not a new feature. Extract-Transform-Load operations can export data in a variety of formats such as CSV or JSON. What has been missing is a layout of the exported data that mirrors its organization in the database, so that the data becomes easy to differentiate and work with on disk instead of remaining hidden within the database. When the cells, rows, and tables are organized so that they are easy to locate in a folder/file hierarchy, they become easy to handle with command-line tools that are savvy about folder/file hierarchies and can recursively differentiate instances of data. 
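
One way such a layout could be produced is sketched below with plain JDBC: the database becomes a folder, each table a sub-folder, and each row a small JSON-like file. The connection URL, table names, and file naming are assumptions for illustration only.

// Hedged sketch: flatten database -> table -> row into a folder/file hierarchy.
import java.nio.file.Files;
import java.nio.file.Path;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.ResultSetMetaData;
import java.sql.Statement;
import java.util.List;

public class FolderFileExporter {
    public static void main(String[] args) throws Exception {
        Path root = Path.of("export", "mydb");                 // database -> top-level folder (assumed)
        List<String> tables = List.of("customers", "orders");  // assumed table names

        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost/mydb", "user", "pass")) {  // assumed connection
            for (String table : tables) {
                Path tableDir = root.resolve(table);            // table -> sub-folder
                Files.createDirectories(tableDir);
                try (Statement st = conn.createStatement();
                     ResultSet rs = st.executeQuery("SELECT * FROM " + table)) {
                    ResultSetMetaData md = rs.getMetaData();
                    long rowNumber = 0;
                    while (rs.next()) {
                        // row -> one JSON-ish file, so folder-aware tools can diff snapshots
                        StringBuilder json = new StringBuilder("{");
                        for (int i = 1; i <= md.getColumnCount(); i++) {
                            if (i > 1) json.append(", ");
                            json.append('"').append(md.getColumnName(i)).append("\": \"")
                                .append(rs.getString(i)).append('"');
                        }
                        json.append("}");
                        Files.writeString(tableDir.resolve("row-" + (++rowNumber) + ".json"),
                                json.toString());
                    }
                }
            }
        }
    }
}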

  

The ETL wizards are only part of the solution because they export data but not the organization. If they were to export schema, it would assist with the organization of the data, but it would not be sufficient for the end-to-end capability required from a tool to differentiate instances of data between database snapshots. 

 

OData was a technique that popularized exposing data to the web using conventional web Application Programming Interfaces (APIs). This data could now be imported via a command-line tool such as curl, and JavaScript Object Notation (JSON) made it universally acceptable for programmability. Yet the organization information was largely left to the caller, and the onus of making hundreds of calls does not address the need described. Again, this was an insufficient solution but certainly a helpful step. 
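
The call pattern looks roughly like the following sketch with the JDK HTTP client, where a real client would follow "@odata.nextLink" page after page; the endpoint is hypothetical and the JSON handling is left as a comment because it needs a parser.

// Hedged sketch: paging through an OData collection with java.net.http.
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class ODataPager {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        String next = "https://example.org/odata/Customers"; // hypothetical OData endpoint

        while (next != null) {
            HttpRequest request = HttpRequest.newBuilder(URI.create(next))
                    .header("Accept", "application/json")
                    .build();
            HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());

            System.out.println("fetched page of " + response.body().length() + " bytes");

            // A real client would parse the JSON, persist the "value" array, and read
            // "@odata.nextLink" for the next page; this sketch stops after one page.
            next = null;
        }
    }
}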

 

The S3 API, with its tri-level hierarchy of Namespace, Bucket, and Object, was an interesting evolution because each object is web-accessible. If the data could be exported from a database into these containers, the use of command-line tools became unnecessary, as downstream systems such as event processors could read the data directly without the need to put it on disk between stages of processing. The ability to differentiate instances of data from database snapshots could be adapted to iterate over the remote web locations, which incidentally maintain a folder-prefix organization. Whether the database, tables, and columns could be translated to Namespaces, Buckets, and Objects remained at the discretion of the caller and depended on the nature and size of the data. This mapping, even if predetermined, could not be found in the existing pieces of solutions proposed by ETL wizards, OData, or S3 exporters and importers. 
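
One possible mapping, and only one of many, is sketched below with the AWS SDK for Java v2: the database name as the bucket and "table/row" as the key prefix. The bucket and key names are assumptions; stores that expose an explicit namespace level would add one more tier above the bucket.

// Hedged sketch: each exported row becomes a web-accessible object.
import software.amazon.awssdk.core.sync.RequestBody;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.PutObjectRequest;

public class S3Exporter {
    public static void main(String[] args) {
        try (S3Client s3 = S3Client.create()) {
            String bucket = "mydb";                       // database -> bucket (assumed name)
            String key = "customers/row-42.json";         // table/row -> key prefix + object (assumed)
            String body = "{\"id\": 42, \"name\": \"example\"}";

            // Downstream processors can read this object directly without an
            // intermediate copy on disk between stages of processing.
            s3.putObject(PutObjectRequest.builder().bucket(bucket).key(key).build(),
                    RequestBody.fromString(body));
        }
    }
}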

 

Filesystem remains universally appealing to all shell-based command-line tools. Writing C program-based tools that can download data from a database and write them to disk in the form of a folder or files is, therefore, the only complete solution to differentiate instances of data between database snapshots. This option was always left out to the individual developers to implement for their use case because the size of the production database could be upwards of Terabytes and filesystems stretched even on clusters were not considered practical for production. But these very developers truly proved that data is not sticky to relational databases because they found and promoted batch and stream-based processing that avoids the centralized store. 
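
Once two snapshots are laid out as folders and files, differentiating them needs nothing more than a recursive walk. A hedged sketch, with the snapshot paths assumed:

// Hedged sketch: compare two exported snapshot trees file by file.
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

public class SnapshotDiff {
    public static void main(String[] args) throws IOException {
        Path before = Path.of("export", "snapshot-2021-02-13"); // assumed snapshot folders
        Path after = Path.of("export", "snapshot-2021-02-14");

        try (Stream<Path> files = Files.walk(before)) {
            files.filter(Files::isRegularFile).forEach(oldFile -> {
                Path relative = before.relativize(oldFile);
                Path newFile = after.resolve(relative);
                try {
                    if (!Files.exists(newFile)) {
                        System.out.println("deleted: " + relative);
                    } else if (Files.mismatch(oldFile, newFile) != -1) {
                        System.out.println("changed: " + relative);
                    }
                } catch (IOException e) {
                    throw new RuntimeException(e);
                }
            });
        }
        // Rows added in the newer snapshot would be found by walking in the reverse direction.
    }
}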

 

Conclusion: With the ability to connect to a relational store and a predetermined strategy to flatten data from database snapshots to folder/file layout, a shell-based command-line tool can enable a source control tracking system on data which allows data revisions to be visualized and compared better.