Tuesday, April 24, 2018


Events as a measurement of Cloud Database performance
In the previous post, we talked about cloud databases. They come with the benefits of the cloud, can still offer performance on par with an on-premise database, and perhaps better Service-Level Agreements. Today we talk about a test case to measure the performance of a cloud database, say Azure Cosmos DB, by using it as a queue. I could not find a comparable study to match this test case.
Test case: Let us say I stream millions of events per minute to Cosmos DB, and each event bears a tuple <EventID, Byte[] payload>. The payload is irrelevant other than forcing a data copy operation for each event, thereby introducing processing time, so I use the same payload with each event.
The EventIDs range from 1 to Integer.MaxValue. I want to store these in a big table where I utilize the 255-byte row key to support horizontal scaling of my table.
Then I want to run a query every few seconds to tell me the top 50 events by count in the table.
The query will be:
SELECT events.ID AS id,
       COUNT(*) OVER (PARTITION BY events.ID) AS count
FROM events
ORDER BY count DESC;
which semantically has the same meaning as a GROUP BY with COUNT(*).
This is a read-only query, so the writes to the table should not suffer any performance impact. In other words, the analysis query is separate from the transactional queries that insert the events into the table.
In order to tune the performance of the Cloud Database, I set the connection policy to
·         Direct mode
·         Protocol to TCP
·         Avoid startup latency on first request
·         Collocate clients in same Azure region for performance
·         Increase number of threads/tasks
This helps to squeeze the processing time of the events as they make their way to the table. I will attempt to provision more than 100,000 Request Units (RUs) of throughput.
This high throughput will generate a load under which the analysis query may become slower over time as the data size increases massively.
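Here is a minimal sketch of that client configuration and write path, assuming the Microsoft.Azure.DocumentDB .NET SDK; the endpoint, key, database and collection names are placeholders for illustration only:

// A sketch, not a definitive implementation: the account endpoint, key,
// database name and collection name below are placeholders.
using System;
using System.Threading.Tasks;
using Microsoft.Azure.Documents.Client;

class EventWriter
{
    static async Task Main()
    {
        var connectionPolicy = new ConnectionPolicy
        {
            ConnectionMode = ConnectionMode.Direct,   // direct mode
            ConnectionProtocol = Protocol.Tcp         // TCP protocol
        };

        // Collocate this client in the same Azure region as the account.
        var client = new DocumentClient(
            new Uri("https://<account>.documents.azure.com:443/"),  // placeholder endpoint
            "<auth-key>",                                            // placeholder key
            connectionPolicy);

        // Warm up the client to avoid the startup latency on the first request.
        await client.OpenAsync();

        var collectionUri = UriFactory.CreateDocumentCollectionUri("eventsdb", "events");
        var payload = new byte[1024];   // the same payload is reused for every event

        // Many such loops would run on concurrent tasks to drive up the throughput.
        for (int eventId = 1; eventId <= 1000; eventId++)
        {
            await client.CreateDocumentAsync(
                collectionUri,
                new { id = eventId.ToString(), payload });
        }
    }
}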
Conclusion – This test case and its pricing will help determine whether it is indeed practical to store high-volume traffic, such as from a mobile fleet or IoT devices, in the database.


Monday, April 23, 2018

Databases can scale up when deployed on-premise. How about the migration to cloud databases? What are some of the differences?
This post tries to find those answers. An on-premise database can scale only under constraints involving hardware resources, manpower and costs. That said, an on-premise database can work without internet access, provide high transactions per second, keep the data and applications in house, and store sensitive data.
Database servers such as SQL Server have also become cloud ready and even provide mission-critical cloud performance, while lately they have been focusing on advanced analytics and rich visualizations. Higher-end database servers that require, say, 64 CPUs are often an on-premise-only choice.
We don't talk about data warehouses in this section because they are different from databases. That said, Azure SQL DB and Azure SQL DW both work in the cloud as relational stores.
Azure Data Lake Analytics and Azure Data Lake Store are non-relational, and while they are hosted in the cloud, the corresponding on-premise analytics role is usually fulfilled by Hadoop.
Analytics includes such things as federated query, Power BI, Azure Machine Learning, and Azure Data Factory.
SQL Server on an Azure VM is considered IaaS, while a virtualized database in the cloud is considered PaaS. In the latter case, the Azure cloud services manage scale and resilience while allowing the application developers to focus on the application and the data. Regardless of the choice, consistent tools are available for all the database options above.
Migrating databases from single servers to Azure VMs is straightforward, but MDMs and other databases that are company assets involve more chores.

#sqlexercise : https://goo.gl/Zz1yoV

Sunday, April 22, 2018

I came across an interesting discussion on how to efficiently search with multiple predicates and support paging.
The resolution is that we almost always have to push the predicates down into the database, and even into the query itself, so that the optimizer has a chance to determine the best query plan for them.
In the absence of a database, we would be emulating the work of query execution outside the database, and we are still not likely to be efficient and consistent in all cases, simply because we only have an enumeration-based data structure in the web-service layer.
On the other hand, the database is closest to the storage; it indexes and organizes the records so that they are looked up more efficiently. Query plans can be compared and the most efficient one chosen. An in-memory, iteration-only data structure will only limit us and will not scale to the size of the data.
That said, the predicates are expected to evaluate the same way regardless of which layer they are implemented in. If we have a set of predicates and they are combined with an OR clause as opposed to an AND clause, then we will likely have a result set from each predicate, and the same records may appear in the results of more than one predicate. If we filter based on one predicate and also allow matches based on another predicate, the two result sets may then be merged into one so that the result can be returned to the caller. The result sets may have duplicates, so the merge may have to return only the distinct elements. This can easily be done by comparing the unique identifiers of the records in the result sets.
The selection of the result is required prior to determining the section that needs to be returned to the user. This section is determined by the start, offset pair in the enumeration of the results. If the queries remain the same over time, and the requests vary only in the paging parameters, then we can even cache the result and return only the paged section. The API will likely persist the predicate and result sets in a cache so that subsequent calls for paging return the same responses. This can even be done as part of the predicate evaluation by simply passing the well-known LIMIT and OFFSET parameters directly in the SQL query. In the enumerator we do this with Skip and Take, as sketched below.
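Here is a rough sketch of that merge-and-page fallback in the web-service layer; the Record type, the predicate result sets and the paging parameters are hypothetical names used only for illustration:

// A sketch under the assumptions above: Record, the predicate result sets
// and the paging parameters are made-up names for illustration.
using System;
using System.Collections.Generic;
using System.Linq;

class Record
{
    public int Id { get; set; }
    public string Name { get; set; }
}

static class PagedSearch
{
    // Merge the result sets of two OR'ed predicates, keep only distinct
    // records by their unique identifier, and return the requested page.
    public static List<Record> Search(
        IEnumerable<Record> matchesForPredicateA,
        IEnumerable<Record> matchesForPredicateB,
        int offset,
        int limit)
    {
        var merged = matchesForPredicateA
            .Concat(matchesForPredicateB)
            .GroupBy(r => r.Id)              // duplicates across result sets share an Id
            .Select(g => g.First());

        // The database equivalent is LIMIT and OFFSET in the SQL query itself;
        // in the enumerator the same paging is expressed with Skip and Take.
        return merged.Skip(offset).Take(limit).ToList();
    }
}

The point of the sketch is also its weakness: the merge and the paging happen only after all the matching records have been pulled into memory, which is exactly why pushing the predicates and the paging down into the database scales better.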

#codingexercise : https://1drv.ms/u/s!Ashlm-Nw-wnWti-TEUQKcdNRtPgJ
#sqlexercise : https://goo.gl/Zz1yoV

Saturday, April 21, 2018


Reconstructing a sequence of numbers from separate smaller sub-sequences.
Let us consider a sequence 1,3,7,2,4,5,9,8
And we are given a set of sub-sequences as
1,3,2,4,9
1,7,4,5,9,8
There are duplicates across the two lists, but each element appears only once within each list. The elements appear in the same relative order in the sub-sequences as they do in the main sequence.
The question is: can we reconstruct the original sequence from the given sub-sequences deterministically?
On the surface, we know that we can interleave the m elements of one sub-sequence into the n-1 gaps between the elements of the other, excluding the duplicates. That gives us a large number of combinations, and these combinations do not point to a unique master sequence of distinct numbers.
Moreover, the number of entries from one sequence that go into a gap of the second sequence can vary from one to the number of elements that follow the choice, inclusive. The choice of elements for these gaps follows the well-known stars and bars problem, which yields a binomial-coefficient number of combinations.
Consequently, it does appear that there may be no deterministic solution to this problem.
If the elements were sorted, then this problem would have been a merge of two lists, where we pick the next element for the destination as the smaller of the front elements of the first and second arrays.
To make a choice of picking one element from one of the arrays, we need to sequentially visit the elements in both arrays, advancing one element in either array each time. In order to do so, our choices have to follow the same order as the master sequence.
We may give some weight to the choice of the element from the two arrays. This assignment of weight has to factor in the index of the element if it is available in both arrays. For example, the number 4 appears in both arrays above, but since its index is lower in the second array than in the first, it is picked there and the other 4 is ignored. Moreover, the first few elements of the first sub-sequence before the number 4 are picked before the 4 in the second array is encountered. We could use this availability of duplicates as a way to make progress, as long as there are enough duplicates in both arrays to indicate the choice of array.
Therefore, if the element 7 in the second sub-sequence above had been preceded by a 3, the master sequence could have been reconstructed up to the point 1, 3, 7, 2, 4.
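The duplicate-as-anchor idea can be sketched as a merge that emits the non-shared elements between consecutive anchors; this is only an illustration of the heuristic, not a definitive reconstruction, because the order of the elements between two anchors remains ambiguous:

// A sketch of the anchor-based merge: shared elements (the duplicates) act
// as anchors, and the non-shared elements between consecutive anchors are
// emitted from the first array and then from the second. The output is one
// sequence consistent with both sub-sequences, not necessarily the original.
using System;
using System.Collections.Generic;
using System.Linq;

static class SequenceMerge
{
    public static List<int> Merge(int[] a, int[] b)
    {
        var shared = new HashSet<int>(a.Intersect(b));
        var result = new List<int>();
        int i = 0, j = 0;
        while (i < a.Length || j < b.Length)
        {
            // Emit the non-shared elements of a up to its next anchor.
            while (i < a.Length && !shared.Contains(a[i])) result.Add(a[i++]);
            // Emit the non-shared elements of b up to its next anchor.
            while (j < b.Length && !shared.Contains(b[j])) result.Add(b[j++]);
            if (i < a.Length && j < b.Length && a[i] == b[j])
            {
                result.Add(a[i]);        // matching anchor: emit it only once
                i++; j++;
            }
            else if (i < a.Length)
            {
                result.Add(a[i++]);      // b is exhausted (or the anchors disagree)
            }
            else if (j < b.Length)
            {
                result.Add(b[j++]);      // a is exhausted
            }
        }
        return result;
    }

    static void Main()
    {
        var merged = Merge(new[] { 1, 3, 2, 4, 9 }, new[] { 1, 7, 4, 5, 9, 8 });
        Console.WriteLine(string.Join(",", merged));   // 1,3,2,7,4,5,9,8
    }
}

For the example above this yields 1, 3, 2, 7, 4, 5, 9, 8, which is consistent with both sub-sequences but swaps the 7 and the 2 relative to the original 1, 3, 7, 2, 4, 5, 9, 8, which is exactly the ambiguity that makes the reconstruction non-deterministic.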


Friday, April 20, 2018

Challenges in health information collaboration in Information Technology projects
The design of IT systems for collaboration in data sharing is itself quite a challenge given that publishers and subscribers do not adhere to any standard. In the health information industry, such designs are considered even more difficult because they also have to satisfy the demands of sponsors and users. 
In the law enforcement industry, legal requests and responses may even have a uniform standard across participating companies and law enforcement agencies. The goal there is timeliness and availability of data. Health information sharing standards have a slightly different area of emphasis. There is plenty of data when it comes to collections, but privacy and boundary rules have many manifestations. Although health records may not be subject to the same rules as personally identifiable information (PII), they have their own conformance requirements, which have different interpretations and implementations.
The techniques for information sharing include:
1) Gateway between two heterogeneous systems, such as across companies
2) Message Bus between enterprise wide organizations with different areas of emphasis
3) Web-services  within an organization for the sharing of data that is not exposed directly from data stores
4) And exposed data sources such as databases and their connectors

The techniques matter only insofar as they can be applied in a given context. However, the time-honored tradition has been to store the data in a database or archive it in a data warehouse and connect those data stores via web services.
The pull and push between publishers and subscribers do not necessarily come with a protocol. Participating systems can choose whatever works for them.
In the absence of any protocol or enforcement, information sharing falls largely on the level of collaboration between participants. Such collaboration is particularly relevant in coalescing health information across boundaries. 

#codingexercise 
https://ideone.com/XHJ2II

Thursday, April 19, 2018

Standard Query Operators in all programming languages
We were discussing standard query operations 
There may be some new queries in practice today that were not as prevalent earlier. These queries may originate from machine learning, where they augment the current data mining practice of grouping, ranking, searching and sorting. Such queries call for some new algorithms, which are expressed in the form of code. Such logic, together with the query operators discussed earlier, augments the query language to form a newer language.
When a programming language is missing the standard query operators, while a certain language has taken the lead with expanded libraries to cover the newer querying techniques, we have two ways to overcome the gap. Some plugins may provide these operators, but such plugins often change the denominator over which the original language was executable.
First, write code in the existing language with the broader reach, keeping the same kind of primitives in a shared module so that they can be replaced with whatever comes next in the language revisions.
Second, adopt the plugins and upgrade the language and toolset to take advantage of better expressions, testability and brevity in the logic artifacts that we keep in version control.
Both these techniques require extra work over the development that we currently do to meet business goals. Consequently, the use of standard query operators might incur a cost, but the benefits far outweigh it, since we are aligning ourselves with the general direction in which queries are being expressed.
It may be interesting to note that queries get a lot more variety from systems other than online transactional processing. Analytical processing, business intelligence stacks and reporting systems come with far more complex queries. Data warehouses provide immense datasets to exercise the queries. Data mining and other techniques are consistently expanding the definition of queries. In addition, the notions of batch processing and stream processing have been introduced. Finally, clustering algorithms, decision trees, SVMs and neural-net-based machine learning techniques are adding query logic that was not anticipated in the SQL query language. Universal query languages are now trying to broaden the breadth of the existing query language and standardize it. With these emerging trends, the standard query operators do not appear to be anywhere close to being discontinued; instead, they are likely to be expanded where possible.

Wednesday, April 18, 2018

Standard Query Operators in all programming languages
Let us say we have a list of records. We want to perform some queries on the list to find information such as which records match a given criterion. Other well-known queries include grouping, eliminating duplicates, aggregating, finding one record, counting records, and so on. If we have two lists, we may want to join the lists, find the intersection, or find the records that appear in one but not the other.
These are called standard query operations. They are standard because databases have traditionally described these operations in great detail and the operators in their query language are not only thorough but are also emulated up the layers in the software stack where usually the database powers all the layers above.
Many complex queries are often translated to these standard query operators to make them simpler to understand. Consequently these standard query operators become primitives and a convenience to standardize across several business areas where queries are used. 
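For concreteness, here is a minimal sketch of these standard query operators expressed with C# LINQ over in-memory lists; the Record type and the sample data are made up for illustration:

// A sketch of the standard query operators over in-memory lists using LINQ;
// the Record type and the sample data are hypothetical.
using System;
using System.Collections.Generic;
using System.Linq;

class Record
{
    public int Id { get; set; }
    public string Category { get; set; }
    public double Amount { get; set; }
}

class StandardQueryDemo
{
    static void Main()
    {
        var first = new List<Record>
        {
            new Record { Id = 1, Category = "a", Amount = 10 },
            new Record { Id = 2, Category = "b", Amount = 20 },
            new Record { Id = 2, Category = "b", Amount = 20 },   // duplicate by Id
        };
        var second = new List<Record> { new Record { Id = 2, Category = "b", Amount = 20 } };

        // Records that match a given criterion.
        var matches = first.Where(r => r.Amount > 15);

        // Grouping and aggregating per group.
        var totals = first.GroupBy(r => r.Category)
                          .Select(g => new { Category = g.Key, Total = g.Sum(r => r.Amount) });

        // Eliminating duplicates by the unique identifier.
        var distinct = first.GroupBy(r => r.Id).Select(g => g.First());

        // Finding one record and counting records.
        var one = first.FirstOrDefault(r => r.Id == 1);
        var count = first.Count();

        // Joining two lists, and the intersection and difference by Id.
        var joined = first.Join(second, f => f.Id, s => s.Id, (f, s) => new { f.Id, f.Category });
        var common = first.Select(r => r.Id).Intersect(second.Select(r => r.Id));
        var onlyInFirst = first.Select(r => r.Id).Except(second.Select(r => r.Id));

        Console.WriteLine($"matches={matches.Count()} groups={totals.Count()} distinct={distinct.Count()} " +
                          $"one={one?.Id} count={count} joined={joined.Count()} " +
                          $"common={common.Count()} onlyInFirst={onlyInFirst.Count()}");
    }
}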
There may be some new queries in practice today that were not as prevalent earlier. These queries may originate from machine learning, where they augment the current data mining practice of grouping, ranking, searching and sorting. Such queries call for some new algorithms, which are expressed in the form of code. Such logic, together with the query operators above, augments the query language to form a newer language.
Yet not all programming languages come with libraries to facilitate even the standard query operators, while a certain language has taken the lead with expanded libraries to cover the newer querying techniques. Some plugins may provide these operators, but such plugins often change the denominator over which the original language was executable.
There are two ways to mitigate such discordance:
First, write code in the existing language with the broader reach, keeping the same kind of primitives in a shared module so that they can be replaced with whatever comes next in the language revisions.
Second, adopt the plugins and upgrade the language and toolset to take advantage of better expressions, testability and brevity in the logic artifacts that we keep in version control.
Both these techniques require extra work over the development that we currently do to meet business goals. Consequently, the use of standard query operators might incur a cost, but the benefits far outweigh it, since we are aligning ourselves with the general direction in which queries are being expressed.
#codingexercise http://js.do/code/209859