Friday, April 27, 2018

The previous post was on Events as a measurement of Cloud Database performance with a focus on the pricing of the cloud database. We factored in the advantages of a public cloud managed RDS over a self-managed AWS instance, namely: upgrades, backups and failover are provided as services, there is more infrastructure and database security, the database appears as a managed appliance, and failover is a packaged service.

The managed database instance types offer a range of CPU and memory selections, and their storage is scalable on demand. Automated backups have a retention period of up to 35 days, and manual snapshots are stored in S3 for durability. An availability zone is a physically distinct, independent infrastructure; deploying across multiple availability zones comes with database synchronization, so the database is better prepared for failures. Read replicas help offload read traffic. The entire database may be snapshotted and copied across regions for greater durability. The provisioned compute, storage and IOPS constitute the bill.

Performance is improved by offloading read traffic to replicas, putting a cache in front of the RDS instance, and scaling up the storage or resizing the instances. CloudWatch alarms and DB event notifications enable databases to be monitored.
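As a minimal sketch of the read-offloading pattern just described, the following Python snippet routes writes to the primary endpoint and reads to a replica, consulting a cache first. The endpoint names, the FakeConnection stub and the cache dictionary are hypothetical placeholders for illustration, not any particular RDS or caching API.

import random

PRIMARY = "mydb-primary.example.amazonaws.com"           # hypothetical primary endpoint
REPLICAS = ["mydb-replica-1.example.amazonaws.com",
            "mydb-replica-2.example.amazonaws.com"]       # hypothetical read replicas
cache = {}  # stand-in for a cache (e.g. Redis/ElastiCache) placed in front of the database

class FakeConnection:
    """Stub standing in for a real database connection to a given endpoint."""
    def __init__(self, endpoint):
        self.endpoint = endpoint
    def execute(self, sql, params=()):
        return f"ran {sql!r} with {params} on {self.endpoint}"

def execute_write(sql, params=()):
    # writes always go to the primary; replication keeps the replicas current
    return FakeConnection(PRIMARY).execute(sql, params)

def execute_read(sql, params=()):
    # serve repeated reads from the cache, otherwise pick any replica
    key = (sql, tuple(params))
    if key not in cache:
        cache[key] = FakeConnection(random.choice(REPLICAS)).execute(sql, params)
    return cache[key]

print(execute_write("INSERT INTO orders VALUES (?)", (42,)))
print(execute_read("SELECT COUNT(*) FROM orders"))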

In short, RDS allows developers to focus on app optimization with schema design, query construction, query optimization while allowing all infrastructure and maintenance to be wrapped under managed services.

RDS alone may not scale in a distributed manner. Therefore, software such as ScaleBase allows the creation of a distributed relational database in which database instances are scaled out: a single-instance database can be transformed into a multiple-instance distributed relational database. The benefits of such a distributed database include massive scale, instant deployment, retention of all the single-instance RDS benefits, automatic load balancing (especially around replica lag and the splitting of reads and writes), and increased ROI with no application code changes.

#codingexercise https://ideone.com/pyiZ7C 

Thursday, April 26, 2018

The previous post was on Events as a measurement of Cloud Database performance with a focus on the pricing of the cloud database. 
With the help of managed services in the cloud, deployments are simple and fast, and so is scaling. Moreover, patching, backups and replication are also handled appropriately. It is compatible with almost all applications, and it has fast, predictable performance. There is no cost to get started, and we pay only for what we use.
Initially the usage costs are a fraction of the costs incurred by on-premise solutions, and for the foreseeable future the usage expenses may not exceed them. However, neither the cloud provider nor the cloud consumer controls the data, which only explodes and increases over time.
Traditionally, disks kept up with these storage requirements and the cost was in the infrastructure; but since cloud usage is billed in proportion to the incoming data, there is no telling when the managed database service costs will exceed those of the alternatives.
The metering of cloud database usage and the corresponding bills will eventually become significant enough for corporations to take notice as more and more data finds its way to the cloud. Storage has been touted to be zero cost in the cloud, but compute is not, and any amount of incoming data is paired with compute requirements. While the cost may be granular per request, it aggregates when billions of requests are processed in ever shorter durations. Consequently, the cloud may no longer be the least expensive option. Higher metering and more monitoring, along with a dramatic increase in traffic, add up to substantial costs while there is no fallback option to keep the costs under the bar.
The fallback option to cut costs and make them more manageable is itself a very important challenge. What will money-starved public cloud tenants do when the usage only accrues with the explosion of data, regardless of economic cycles? This, in fact, is the motivating factor to consider a sort of disaster management plan for when the public cloud becomes unaffordable, even if that seems unthinkable today.
I'm not touting the on-premise solution or hybrid cloud technology as the fallback option. I don't even consider them a viable option regardless of economic times or technological advancement. The improvements to cloud services have leaped far ahead of such competition, and cloud consumers are savvy about what they need from the cloud. Instead, I focus on some kind of easing of traffic into less expensive options, primarily with the adoption of quality of service for cloud service users and their migration to more flat-rate options. Corporate spending is not affected, but there will be real money saved when the traffic to the cloud is kept sustainable.
Let us take a look at this Quality of Service model. When all the data goes into the database and the data is measured only in size and number of requests, there is no distinction between Personally Identifiable Information and regular chatty traffic. In the networking world, the ability to perform QoS depends on the ability to differentiate traffic using the <src-ip, src-port, dest-ip, dest-port, protocol> tuple. With the enumeration and classification of traffic, we can have traffic serviced under different QoS agreements using the IntServ or DiffServ architecture. Similar QoS managers could differentiate data so that companies not only have granularity but also take better care of their more important categories of data.
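As a rough sketch of what such a QoS manager for data could look like, the Python snippet below classifies incoming database requests into service classes, much as DiffServ classifies packets by their 5-tuple. The class names, the request fields and the per-class rate limits are assumptions made up purely for illustration.

from dataclasses import dataclass

@dataclass
class Request:
    tenant: str
    table: str
    contains_pii: bool   # stands in for any data classification attached to the request
    operation: str       # "read" or "write"

# per-class ceiling on requests per second, analogous to a DiffServ per-hop behavior
CLASS_LIMITS = {"expedited": 10000, "assured": 2000, "best_effort": 500}

def classify(req: Request) -> str:
    # PII and other regulated data gets the premium, strictly metered class
    if req.contains_pii:
        return "expedited"
    # transactional writes get an assured class
    if req.operation == "write":
        return "assured"
    # everything else (chatty, low-value reads) is best effort
    return "best_effort"

r = Request(tenant="acme", table="customers", contains_pii=True, operation="read")
print(classify(r), CLASS_LIMITS[classify(r)])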

#codingexercise: https://1drv.ms/u/s!Ashlm-Nw-wnWtjSc1V5IlYwWFY88

Wednesday, April 25, 2018

The previous post was on Events as a measurement of Cloud Database performance. Today we focus on the pricing of the cloud database. 
It may be true that the cost of running a cloud database for a high-volume load is significantly cheaper than the cost of running the same database and connections on-premise. This comes from the following managed services:
Scaling
High Availability
Database backups
DB s/w patches
DB s/w installs
OS patches
OS installation
Server Maintenance
Rack & Stack
Power, HVAC, net
However, I argue that the metering of cloud database usage and the corresponding bills will eventually become significant enough for corporations to take notice. This is because the pay-as-you-go model, albeit granular, is expected to grow dramatically as usage grows. Consequently, the cloud may no longer be the least expensive option. Higher metering and more monitoring, along with a dramatic increase in traffic, add up to substantial costs while there is no fallback option to keep the costs under the bar.
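To make the aggregation concrete, here is a back-of-the-envelope projection in Python. The per-million-request price and the traffic growth rate are purely hypothetical numbers chosen for illustration, not any provider's actual rates.

# Hypothetical pay-as-you-go cost projection; the unit price and growth rate are made up.
price_per_million_requests = 0.25   # assumed dollars per million requests
monthly_requests = 2_000_000_000    # assumed starting traffic: 2 billion requests per month
growth_per_month = 1.10             # assumed 10% month-over-month traffic growth

for month in range(1, 13):
    cost = (monthly_requests / 1_000_000) * price_per_million_requests
    print(f"month {month:2d}: {monthly_requests / 1e9:5.1f}B requests -> ${cost:,.0f}")
    monthly_requests *= growth_per_month

# Even a tiny per-request price compounds with traffic into a bill that keeps growing,
# which is the point being made about metering above.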
The fallback option to cut costs and make them more manageable is itself a very important challenge. What will money-starved public cloud tenants do when the usage only accrues with the explosion of data, regardless of economic cycles? This, in fact, is the motivating factor to consider a sort of disaster management plan for when the public cloud becomes unaffordable, even if that seems unthinkable today.
I'm not touting the on-premise solution or hybrid cloud technology as the fallback option. I don't even consider them a viable option regardless of economic times or technological advancement. The improvements to cloud services have leaped far ahead of such competition. Instead, I focus on some kind of easing of traffic into less expensive options, primarily with the adoption of quality of service for cloud service users and their migration to more flat-rate options. Corporate spending is not affected, but there will be real money saved when the traffic to the cloud is kept sustainable.
#codingexercise: https://ideone.com/F6QWcu 
#codingexercise: https://ideone.com/pKE68t

Tuesday, April 24, 2018


Events as a measurement of Cloud Database performance
In the previous post, we talked about cloud databases. They come with the benefits of the cloud and can still offer on-par performance with an on-premise database, and perhaps better Service-Level Agreements. Today we talk about a test case to measure the performance of a cloud database, say Azure Cosmos DB, using it as a queue. I could not find a comparable study to match this test case.
Test case: Let us say I stream millions of events per minute to Cosmos DB, and each event bears a tuple <EventID, Byte[] payload>. The payload is irrelevant other than forcing a data copy operation for each event, thereby introducing some processing time, so I use the same payload with each event.
The EventIDs range from 1 to Integer.MaxValue. I want to store this in a big table where I utilize the 255-byte row key to support horizontal scaling of my table.
Then I want to run a query every few seconds to tell me the top 50 events by count in the table.
The query will be:
SELECT DISTINCT events.ID AS id,
       COUNT(*) OVER ( PARTITION BY events.ID ) AS count
FROM   events
ORDER BY count DESC;
taking only the first 50 rows, which is semantically the same as a GROUP BY events.ID with COUNT(*).
This is a read-only query so the writes to the table should not suffer any performance impact. In other words, the analysis query is separate from the transactional queries to insert the events in a table.
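For reference, here is a minimal sketch of the transactional write side of this test case, assuming the azure-cosmos Python SDK (v4); the endpoint, key, database and container names are placeholders, and the container is assumed to be partitioned on the event id.

import os
import uuid
from azure.cosmos import CosmosClient

# placeholders: the real endpoint and key would come from the Azure portal
client = CosmosClient(os.environ["COSMOS_ENDPOINT"], credential=os.environ["COSMOS_KEY"])
container = client.get_database_client("perfdb").get_container_client("events")

PAYLOAD = "x" * 1024  # the same payload for every event, only there to force a data copy

def write_event(event_id: int):
    # the event id doubles as the partition key so events spread horizontally
    container.create_item(body={
        "id": str(uuid.uuid4()),   # unique document id
        "eventId": event_id,       # ranges from 1 to Integer.MaxValue
        "payload": PAYLOAD,
    })

for event_id in range(1, 1001):    # a small slice of the millions-per-minute stream
    write_event(event_id)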
In order to tune the performance of the cloud database, I set the connection policy to:
· Direct mode
· TCP protocol
· Avoid startup latency on the first request
· Collocate clients in the same Azure region for performance
· Increase the number of threads/tasks
This helps to squeeze the processing time of the events as they make their way to the table. The provisioned throughput in Request Units (RUs) will be pushed to exceed 100,000.
This high throughput will generate a load under which the analysis query performance may become slower over time with the massive increase in data size.
Conclusion – this test case and its pricing will help determine whether it is indeed practical to store high-volume traffic, such as from a mobile fleet or IoT devices, in the database.


Monday, April 23, 2018

Databases can scale up when deployed on-premise. How about the migration to cloud databases? What are some of the differences?
This post tries to find those answers. An on-premise database can scale only under constraints of hardware resources, manpower and costs. That said, an on-premise database can work without internet access, provide high TPS, keep the data and applications in house, and can store sensitive data.
Database servers such as SQL Server have also become cloud-ready and even provide mission-critical cloud performance, while lately they have been focusing on advanced analytics and rich visualizations. Higher-end database servers that require, say, 64 CPUs are often an on-premise-only choice.
We don't talk about data warehouses in this section because they are different from databases. That said, Azure SQL DB and Azure SQL DW both work in the cloud as relational stores.
Azure Data Lake Analytics and Azure Data Lake Store are non-relational, and while they are hosted in the cloud, the corresponding on-premise analytics role is usually fulfilled by Hadoop.
Analytics includes such things as federated query, Power BI, Azure Machine Learning, and Azure Data Factory.
SQL Server on an Azure VM is considered IaaS, while a virtualized database in the cloud is considered PaaS. In the latter case, the Azure cloud services manage scale and resilience while allowing application developers to focus on the application and the data. Regardless of the choice, consistent tools work across all the database options above.
Migrating databases from single servers to Azure VMs is straightforward, but MDMs and other databases that are company assets involve more chores.

#sqlexercise : https://goo.gl/Zz1yoV

Sunday, April 22, 2018

I came across an interesting discussion on how to efficiently search with multiple predicates and support paging.
The resolution is that we almost always have to push the predicates down into the database, and even into the query itself, so that the optimizer has a chance to determine the best query plan.
In the absence of a database, we would be emulating the work of query execution outside the database, and we are still not likely to be efficient and consistent in all cases, simply because we have only an enumeration-based data structure in the web-service layer.
On the other hand, the database is closest to the storage; it indexes and organizes the records so that they are looked up more efficiently, and its query plans can be compared so that the most efficient one is chosen. Having an in-memory, iteration-only data structure will only limit us and will not scale with the size of the data.
That said, the predicates are expected to evaluate the same way regardless of which layer they are implemented in. If we have a set of predicates separated by OR clauses rather than AND clauses, then we will likely have a result set from each predicate, and the same records may appear in the results of more than one predicate. If we filter based on one predicate and also allow matches based on another, the two result sets may then be merged into one before the result is returned to the caller. The result sets may have duplicates, so the merge must return only the distinct elements. This can easily be done by comparing the unique identifiers of the records in the result sets.
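For illustration, here is a minimal Python sketch of merging the result sets of two OR-ed predicates while deduplicating by the record's unique identifier; the Record type, the sample rows and the predicates are made-up placeholders.

from dataclasses import dataclass

@dataclass
class Record:
    id: int
    name: str
    age: int

rows = [Record(1, "alice", 34), Record(2, "bob", 29), Record(3, "carol", 41)]

# two OR-ed predicates evaluated independently, each yielding its own result set
result_a = [r for r in rows if r.age > 30]      # matches alice and carol
result_b = [r for r in rows if r.name < "c"]    # matches alice and bob (alice again)

def merge_distinct(*result_sets):
    seen, merged = set(), []
    for result in result_sets:
        for record in result:
            if record.id not in seen:   # compare by unique identifier, not by value
                seen.add(record.id)
                merged.append(record)
    return merged

print([r.id for r in merge_distinct(result_a, result_b)])   # -> [1, 3, 2]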
The selection of the result is required prior to determining the section that needs to be returned to the user. This section is determined by the (start, offset) pair in the enumeration of the results. If the query remains the same over time and the request only varies in the paging parameters, then we can even cache the result and return only the paged section. The API will likely persist the predicates and result sets in a cache so that subsequent calls for paging return the same responses. This can even be done as part of predicate evaluation by simply passing the well-known LIMIT and OFFSET parameters directly in the SQL query. In the enumerator we do this with Skip and Take.
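A small sketch of that paging step over a cached result set follows; the cache key, the query stub and the page parameters are illustrative, and the slice plays the role of LIMIT/OFFSET in SQL or Skip/Take in an enumerator.

result_cache = {}

def run_query(query_key):
    # placeholder for the expensive predicate evaluation pushed down to the database
    return list(range(1, 101))    # pretend the full result set is the ids 1..100

def get_page(query_key, offset, limit):
    if query_key not in result_cache:        # evaluate once, reuse for later pages
        result_cache[query_key] = run_query(query_key)
    full = result_cache[query_key]
    return full[offset:offset + limit]       # Skip(offset).Take(limit)

print(get_page("age > 30 OR name < 'c'", offset=0, limit=10))    # first page
print(get_page("age > 30 OR name < 'c'", offset=10, limit=10))   # second page, from cache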

#codingexercise : https://1drv.ms/u/s!Ashlm-Nw-wnWti-TEUQKcdNRtPgJ
#sqlexercise : https://goo.gl/Zz1yoV

Saturday, April 21, 2018


Reconstructing a sequence of numbers from separate smaller sub-sequences.
Let us consider a sequence 1,3,7,2,4,5,9,8
And we are given a set of sub-sequences as
1,3,2,4,9
1,7,4,5,9,8
There are duplicates across the lists, but each element appears only once within a list. The elements appear in each sub-sequence in the same relative order as they appear in the main sequence.
The question is: can we reconstruct the original sequence from the given sub-sequences deterministically?
On the surface, we know that we can interleave the m elements of one sub-sequence into the n-1 gaps of the other, excluding duplicates. That gives us a large number of combinations, and these combinations do not identify a unique master sequence of distinct numbers.
Moreover, the number of entries from one sequence placed into a gap of the second sequence can vary from one up to the number of elements remaining after the choice. The choice of elements for these gaps follows the well-known stars and bars problem, which yields a binomial coefficient number of combinations.
Consequently, it does appear that there may be no solution to this problem.
If the elements were sorted, then this problem would have been a merge of two lists, where we pick the element for the destination based on whichever of the two front elements is smaller.
To pick one element from one of the arrays, we need to visit the elements of both arrays sequentially, advancing one element in either array each time. In order to do so, our choices have to follow the same order as the master sequence.
We may give some weight to the choice of the element from either array. This assignment of weight has to factor in the index of the element when it is available in both arrays. For example, the number 4 appears in both arrays above, but since its index is lower in the second array than in the first, it will be picked from there and the other 4 will be ignored. Moreover, the first few elements of the first sub-sequence before the number 4 will be picked before the element 4 in the second array is encountered. We could use this availability of duplicates as a way to make progress, as long as there are sufficient duplicates in both arrays to indicate which array to choose from.
Therefore, if the element 7 in the second sub-sequence had been preceded by a 3, the elements of the master sequence could have been reconstructed up to the point 1,3,7,2,4.
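To make this concrete, here is a small Python sketch of the reasoning above: it uses the elements common to both sub-sequences (the duplicates) as anchors and counts how many interleavings each gap between anchors admits, showing why the reconstruction is not deterministic. Everything beyond the two example lists is illustrative.

from math import comb

def common_anchors(a, b):
    # elements appearing in both lists, kept in their shared relative order
    in_both = set(a) & set(b)
    return [x for x in a if x in in_both]

def segments_between_anchors(seq, anchors):
    # the runs of elements that fall strictly between consecutive anchors
    pieces, start = [], 0
    for anchor in anchors:
        i = seq.index(anchor, start)
        pieces.append(seq[start:i])
        start = i + 1
    pieces.append(seq[start:])
    return pieces

a = [1, 3, 2, 4, 9]
b = [1, 7, 4, 5, 9, 8]

anchors = common_anchors(a, b)                  # [1, 4, 9]
gaps_a = segments_between_anchors(a, anchors)
gaps_b = segments_between_anchors(b, anchors)

ways = 1
for ga, gb in zip(gaps_a, gaps_b):
    # stars-and-bars / merge count: C(m+n, m) orderings of the two gap segments
    ways *= comb(len(ga) + len(gb), len(ga))

print("anchors:", anchors)
print("gaps:", list(zip(gaps_a, gaps_b)))
print("possible interleavings:", ways)          # 3 here, so the order is not deterministic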