Sunday, April 22, 2018

I came across an interesting discussion on how to efficiently search with multiple predicates and support paging.
The resolution is that we almost always have to push the predicates down into the database, preferably into the query itself, so that the optimizer has a chance to determine the best query plan for it.
In the absence of a database, we end up emulating the query execution that the database would have done, and we are still unlikely to be efficient and consistent in all cases simply because the web-service layer only has an enumeration-based data structure to work with.
The database, on the other hand, is closest to the storage; it indexes and organizes the records so that they can be looked up efficiently, and candidate query plans can be compared so that the most efficient one is chosen. An in-memory, iteration-only data structure will only limit us and will not scale with the size of the data.
That said, the predicates are expected to evaluate the same way regardless of which layer they are implemented in. If the predicates are combined with an OR clause rather than an AND clause, then each predicate yields its own result set, and the same records may appear in more than one of them. If we filter on one predicate and also allow matches on another, the two result sets have to be merged into one before the result is returned to the caller. Since the result sets may overlap, the merge has to return only the distinct elements, which is easily done by comparing the unique identifiers of the records.
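As a concrete illustration, here is a minimal sketch of merging the result sets of OR-ed predicates while keeping only distinct records; the record shape and the id field are assumptions for the example.

```typescript
// Hypothetical record shape for illustration; any type with a unique identifier works.
interface Item {
  id: number;
  name: string;
}

// Evaluate each predicate independently, then union the result sets,
// using the unique identifier to drop records that match more than one predicate.
function unionDistinct(records: Item[], predicates: Array<(r: Item) => boolean>): Item[] {
  const seen = new Set<number>();
  const merged: Item[] = [];
  for (const predicate of predicates) {
    for (const match of records.filter(predicate)) {
      if (!seen.has(match.id)) {
        seen.add(match.id);
        merged.push(match);
      }
    }
  }
  return merged;
}
```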
The full result has to be selected before the section that is returned to the user can be determined. That section is described by an offset and count pair over the enumeration of the results. If the query remains the same over time and only the paging parameters vary between requests, we can even cache the result and return just the paged section. The API would persist the predicate and its result set in a cache so that subsequent paging calls return consistent responses. Paging can also be done as part of predicate evaluation by passing the well-known LIMIT and OFFSET parameters directly in the SQL query. In an enumerator we do the same with Skip and Take.
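And a corresponding sketch of the paging step over the merged (and possibly cached) result, mirroring Skip and Take in an enumerator or LIMIT and OFFSET in SQL; the parameter names are illustrative.

```typescript
// Return one page of an already evaluated (and possibly cached) result set.
// offset: how many records to skip; count: how many records to take.
function page<T>(results: T[], offset: number, count: number): T[] {
  // Equivalent to results.Skip(offset).Take(count) in an enumerator,
  // or "... LIMIT count OFFSET offset" when pushed into the SQL query.
  return results.slice(offset, offset + count);
}
```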

#codingexercise : https://1drv.ms/u/s!Ashlm-Nw-wnWti-TEUQKcdNRtPgJ
#sqlexercise : https://goo.gl/Zz1yoV

Saturday, April 21, 2018


Reconstructing a sequence of numbers from separate smaller sub-sequences.
Let us consider a sequence 1,3,7,2,4,5,9,8
And we are given a set of sub-sequences as
1,3,2,4,9
1,7,4,5,9,8
Some elements are duplicated across the sub-sequences, but each element appears only once within a list. The elements of each sub-sequence appear in the same relative order as they do in the main sequence.
The question is whether we can reconstruct the original sequence deterministically from the given sub-sequences.
On the surface, we know that we can interleave the m elements of one sub-sequence into the n+1 gaps around the n elements of the other (ignoring the duplicates). That gives us a large number of combinations, and these combinations do not pin down a unique master sequence of distinct numbers.
Moreover, the number of elements placed from one sequence into any given gap of the other can vary from zero up to all of the remaining elements. Counting the ways to fill the gaps is the well-known stars and bars problem, which yields a binomial coefficient number of combinations.
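As a quick check on that count, the stars and bars formula for distributing the m elements into the n + 1 gaps gives

\[ \binom{m + (n+1) - 1}{(n+1) - 1} \;=\; \binom{m+n}{n} \]

possible interleavings, far too many to single out the master sequence without more information.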
Consequently, it does appear that there may be no solution to this problem.
If the elements were sorted, this problem would reduce to merging two lists, where at each step we pick for the destination the smaller of the current elements from the first and second arrays.
To pick one element at a time from one of the arrays, we need to visit the elements of both arrays sequentially, advancing by one element in one of the arrays at each step. For that to work, our choices have to come out in the same order as the master sequence.
We may give some weight to the choice of an element from either array, and this weight has to factor in the index of the element when it is available in both arrays. For example, the number 4 appears in both arrays above, but since its index is lower in the second array than in the first, it is taken from there and the other occurrence of 4 is ignored. Moreover, the elements of the first sub-sequence that precede 4 must all be emitted before the 4 from the second array is reached. We could use this availability of duplicates as a way to make progress, as long as there are enough duplicates in both arrays to indicate which array to advance.
Therefore, if the element 7 in the second sub-sequence above had been preceded by a 3, the master sequence could have been reconstructed to the point of 1,3,7,2,4.
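A conservative sketch of this anchor-based merge idea follows; it reconstructs the master only when the shared duplicates force every choice, and reports ambiguity otherwise. The function name and the ambiguity handling are illustrative assumptions, and the inputs are assumed to contain no repeats within a single list.

```typescript
// Merge two sub-sequences using the shared (duplicate) elements as anchors.
// Returns the reconstructed master when every choice is forced, or null when
// the interleaving between two anchors is ambiguous or the inputs are inconsistent.
function reconstruct(a: number[], b: number[]): number[] | null {
  const inB = new Set(b);
  const common = new Set(a.filter(x => inB.has(x)));
  const result: number[] = [];
  let i = 0;
  let j = 0;
  while (i < a.length || j < b.length) {
    // Collect elements exclusive to a until its next anchor.
    const aRun: number[] = [];
    while (i < a.length && !common.has(a[i])) aRun.push(a[i++]);
    // Collect elements exclusive to b until its next anchor.
    const bRun: number[] = [];
    while (j < b.length && !common.has(b[j])) bRun.push(b[j++]);
    // If both arrays contribute elements before the next anchor, their relative
    // order in the master is not determined by the inputs.
    if (aRun.length > 0 && bRun.length > 0) return null;
    result.push(...aRun, ...bRun);
    // Both arrays should now sit at the same anchor; emit it once and advance both.
    if (i < a.length && j < b.length) {
      if (a[i] !== b[j]) return null; // anchors out of order: inconsistent inputs
      result.push(a[i]);
      i++;
      j++;
    }
  }
  return result;
}

// For the example above, reconstruct([1,3,2,4,9], [1,7,4,5,9,8]) returns null:
// between the anchors 1 and 4 both arrays contribute elements (3,2 versus 7),
// so the master 1,3,7,2,4,5,9,8 cannot be recovered uniquely.
```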


Friday, April 20, 2018

Challenges cited from health information collaboration in Information Technology projects
The design of IT systems for collaboration in data sharing is itself quite a challenge given that publishers and subscribers do not adhere to any standard. In the health information industry, such designs are considered even more difficult because they also have to satisfy the demands of sponsors and users. 
In the law enforcement industry, legal requests and responses may even follow a uniform standard across participating companies and agencies; the goal there is timeliness and the availability of data. Health information sharing standards have a slightly different emphasis. There is plenty of data when it comes to collections, but the privacy and boundary rules have many manifestations. Although health records may not be subject to the same rules as personally identifiable information (PII), they have their own conformance requirements, which have differing interpretations and implementations.
The techniques for information sharing include:
1) A gateway between two heterogeneous systems, such as across companies
2) A message bus between organizations within an enterprise that have different areas of emphasis
3) Web services within an organization, for sharing data that is not exposed directly from the data stores
4) Exposed data sources such as databases and their connectors

These techniques matter only insofar as they can be applied in a given context. The time-honored tradition, however, has been to store the data in a database, or archive it in a data warehouse, and to connect those data stores via web services.
The pull and push between publishers and subscribers does not necessarily come with a protocol; participating systems can choose whatever works for them.
In the absence of any protocol or enforcement, information sharing depends largely on the level of collaboration between participants. Such collaboration is particularly relevant in coalescing health information across boundaries.
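As a small illustration of the pull and push, here is a minimal sketch of the kind of in-process publish-subscribe surface a message bus (technique 2 above) might expose; the topic name and payload shape are illustrative assumptions rather than any particular product's API.

```typescript
// A toy message bus: publishers push records to a topic and subscribers receive
// them through a registered handler. Real buses add durability, ordering and auth.
type Handler<T> = (message: T) => void;

class MessageBus<T> {
  private subscribers = new Map<string, Handler<T>[]>();

  subscribe(topic: string, handler: Handler<T>): void {
    const handlers = this.subscribers.get(topic) ?? [];
    handlers.push(handler);
    this.subscribers.set(topic, handlers);
  }

  publish(topic: string, message: T): void {
    for (const handler of this.subscribers.get(topic) ?? []) {
      handler(message);
    }
  }
}

// Example: a lab system publishes a result and a clinic system consumes it.
const bus = new MessageBus<{ patientId: string; observation: string }>();
bus.subscribe("lab-results", msg => console.log("received", msg.patientId));
bus.publish("lab-results", { patientId: "p-001", observation: "A1C 5.6%" });
```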

#codingexercise 
https://ideone.com/XHJ2II

Thursday, April 19, 2018

Standard Query Operators in all programming languages
We were discussing standard query operations.
There may be some new queries in practice today that were not as prevalent earlier. These queries often originate from machine learning, where they augment the current data mining practice of grouping, ranking, searching and sorting. Such queries call for new algorithms that are expressed in the form of code, and this logic, together with the standard query operators, is augmenting the query language into a newer language.
When a programming language is missing the standard query operators while another language has taken the lead with expanded libraries that cover the newer querying techniques, we have two ways to overcome the gap. Plugins may provide these operators, but such plugins often change the denominator over which the original language was executable.
First, write the same kind of primitives in a shared module in the existing language with the broader reach, so that they can be replaced with whatever comes next in the language revisions (a sketch of this approach appears below).
Second, adopt the plugins and upgrade your language and toolset to take advantage of better expressiveness, testability and brevity in the logic artifacts that we keep in version control.
Both of these techniques require extra work beyond the development we currently do to meet business goals. Consequently, adopting the standard query operators incurs a cost, but the benefits far outweigh it, since we are aligning ourselves with the general direction in which queries are being expressed.
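As a rough illustration of the first approach, here is a minimal sketch of such a shared module in TypeScript; the operator names and signatures are illustrative assumptions rather than any existing library.

```typescript
// A tiny shared module of standard query operators, kept in one place so it can
// be swapped out when the language or its libraries catch up.
export function where<T>(items: T[], predicate: (item: T) => boolean): T[] {
  return items.filter(predicate);
}

export function distinctBy<T, K>(items: T[], key: (item: T) => K): T[] {
  const seen = new Set<K>();
  return items.filter(item => {
    const k = key(item);
    if (seen.has(k)) return false;
    seen.add(k);
    return true;
  });
}

export function groupBy<T, K>(items: T[], key: (item: T) => K): Map<K, T[]> {
  const groups = new Map<K, T[]>();
  for (const item of items) {
    const k = key(item);
    const bucket = groups.get(k);
    if (bucket) bucket.push(item); else groups.set(k, [item]);
  }
  return groups;
}
```

Callers compose these helpers the way they would compose the corresponding query operators, and the module can be retired once equivalent operators arrive natively in the language or its standard libraries.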
It may be interesting to note that queries get a lot more variety from systems other than online transactional processing. Analytical processing, business intelligence stacks and reporting systems come with far more complex queries, and data warehouses provide immense datasets to exercise them. Data mining and other techniques are consistently expanding the definition of queries, and the notions of batch processing and stream processing have been introduced as well. Finally, clustering algorithms, decision trees, SVMs and neural-net based machine learning techniques are adding query logic that was not anticipated in the SQL query language. Universal query languages are now trying to broaden the reach of the existing query language and standardize it. With these emerging trends, the standard query operators do not appear to be anywhere close to being discontinued; instead they are likely to be expanded where possible.

Wednesday, April 18, 2018

Standard Query Operators in all programming languages
Let us say we have a list of records and we want to perform some queries on it, such as finding which records match a given criterion. Other well-known queries include grouping, eliminating duplicates, aggregating, finding a single record, counting records, and so on. If we have two lists, we may want to join them, find their intersection, or find the records that appear in one but not the other.
These are called standard query operations. They are standard because databases have traditionally described them in great detail, and the operators in their query languages are not only thorough but are also emulated up the layers of the software stack, where the database usually powers all of the layers above it.
Complex queries are often translated into these standard query operators to make them simpler to understand. Consequently the standard query operators become primitives and a convenient point of standardization across the many business areas where queries are used.
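To make the two-list operations above concrete, here is a minimal sketch in TypeScript; the function names and key selectors are illustrative assumptions.

```typescript
// Intersection: records of the first list whose key also appears in the second.
function intersectBy<T, K>(a: T[], b: T[], key: (item: T) => K): T[] {
  const keysInB = new Set(b.map(key));
  return a.filter(item => keysInB.has(key(item)));
}

// Except: records of the first list whose key does not appear in the second.
function exceptBy<T, K>(a: T[], b: T[], key: (item: T) => K): T[] {
  const keysInB = new Set(b.map(key));
  return a.filter(item => !keysInB.has(key(item)));
}

// Inner join: pair up records from both lists that share the same key.
function joinBy<A, B, K>(a: A[], b: B[], keyA: (x: A) => K, keyB: (x: B) => K): Array<[A, B]> {
  const pairs: Array<[A, B]> = [];
  for (const x of a) {
    for (const y of b) {
      if (keyA(x) === keyB(y)) pairs.push([x, y]);
    }
  }
  return pairs;
}
```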
There may be some new queries in practice today that were not as prevalent earlier. These queries often originate from machine learning, where they augment the current data mining practice of grouping, ranking, searching and sorting. Such queries call for new algorithms that are expressed in the form of code, and this logic, together with the query operators above, is augmenting the query language into a newer language.
Yet not all programming languages come with libraries that facilitate even the standard query operators, while a certain language has taken the lead with expanded libraries that cover the newer querying techniques. Plugins may provide these operators, but such plugins often change the denominator over which the original language was executable.
There are two ways to mitigate such discordance:
First, write the same kind of primitives in a shared module in the existing language with the broader reach, so that they can be replaced with whatever comes next in the language revisions.
Second, adopt the plugins and upgrade your language and toolset to take advantage of better expressiveness, testability and brevity in the logic artifacts that we keep in version control.
Both of these techniques require extra work beyond the development we currently do to meet business goals. Consequently, adopting the standard query operators incurs a cost, but the benefits far outweigh it, since we are aligning ourselves with the general direction in which queries are being expressed.
#codingexercise http://js.do/code/209859 


Tuesday, April 17, 2018

We were discussing why interface description languages don't become popular compared to their alternatives. We said that interfaces or contracts provide the comfort for participants to work independently, offload validation, and determine functional and non-functional requirements, but the alternative of working with granular, well-documented, stateless requests is a lot more appealing.
We also discussed the trade-offs between state-driven sessions and stateless APIs in the context of client-side application development. We noted that there is a wonderful ecosystem of browser-based, client-side software development with standard jargon and tools.
Today we see that there are differences between client-side and server-side software development. For example, client-side JavaScript-based applications must mitigate security threats that are fundamentally different from those faced by server-side code, because the client scripts can run anywhere and on any device. Threats such as cross-site scripting, man-in-the-middle attacks, SQL injection and cross-origin resource sharing misconfigurations are all vulnerabilities exploited from the client side.
Consistency in the way we describe our markup, stylesheets and scripts helps smooth out the organization of code on the client side, but it helps only so much; there is nothing preventing the developer from moving logic forward from the server to the client. When we did have fat clients, they were touted as sophisticated tools. With the help of the Model-View-Controller architecture, we moved to separate view-models for the tabs of the same web page and loosened the coupling between rendering and the content of the page. UI developers have suggested that view-models make a page rather limited and highly restrictive, requiring more navigation in workflows and reducing usability in general. The logic on the client side can also be further simplified without compromising rendering or usability; for example, we could make one call to the API and data source instead of several, or load the entire page directly as a response from the backend. While the API may shape how the user interface is implemented, there is little need for the user interface to depend on any state-based interface. Therefore the argument holds even in front-end development.

#codingexercise : water retained in an elevation map :  https://ideone.com/HBEn3F

Monday, April 16, 2018

We were discussing why interface description languages don't become popular compared to their alternatives. We said that interfaces or contracts provide the comfort for participants to work independently, offload validation, and determine functional and non-functional requirements, but the alternative of working with granular, well-documented, stateless requests is a lot more appealing.
Contracts are verbose. They take time to be written. They are also brittle when business needs change and the contract requirements change.
Contracts are also static and binding for both the producer and the consumer; changes to the contract involve escalations and involvement from both sides.
Contracts, whether they describe services or component interactions, are generally being replaced by technologies where we use pre-determined, well-accepted verbs and a stateless design.
The most important advantage of a stateless design and call syntax is that each request is granularly authenticated, authorized and audited. By keeping a one-to-one relationship between a request and a response, and not having the client span multiple requests with shared state, we avoid dangling resources on the server side and state management on the clients. The clients have to do more preparation for each request, but this turns out to be a convenience when all the requests share the same chores.
The other advantage of stateless design is that most of the requests now follow well-established protocols, which brings important benefits. First, they are widely accepted in the industry. Second, they move away from company-specific proprietary contracts. Third, the wide acceptance fosters a community of developers, tools and ecosystems.
The popularity of a technology is determined not only by the simplicity of the product but also by the audience using it. When tools, development environments and products promote stateless design over stateful contracts, then no matter how much we invest in an interface specification language, its adoption will depend on an audience that will generally go with whatever is most convenient.
Protocols can be stateful or stateless, and either may come in useful under different circumstances. Contracts are still useful as the descriptor for a service in its three-part description of Address, Binding and Contract. But when each service behaves the same as every other cookie-cutter service, with a well-known binding and contract, nothing more than the address may be needed.
Leaving the communications aside, let us look at the proposed benefit of client-side state management. Certainly, it improves the performance of numerous intermediary requests if the setup and teardown are what take the biggest performance hit. The client also gets to keep a reference to the server-side resource for as long as it wants and for as long as the underlying communication channel holds. Proponents argue that a stateful design may yield higher performance for trunk or bulk callers, as opposed to retail individual callers, since it reduces the number of clients to a manageable few. While this may be true, it underscores that stateful design is more of an exception than the norm.
#codingexercise: https://ideone.com/3ZY49O