Sunday, April 29, 2018

We were discussing the benefits of a managed RDS instance. Let us now talk about cost and performance optimization in RDS.
RDS supports multiple engines including Aurora, MySQL, MariaDB, PostgreSQL, Oracle, and SQL Server. Being a managed service, it handles provisioning, patching, scaling, replicas, and backup/restore for all these engines, and it supports multiple Availability Zones. As a managed service it lowers TCO and frees teams to focus on differentiating work, which makes it attractive for instances big or small.
The storage type may be selected between GP2 and IO1: the former is general-purpose SSD storage, while the latter provides provisioned IOPS for consistently high performance. GP2 baseline IOPS and burst rates depend on the volume size, and bursting draws down a credit balance; as long as burst credits remain, general-purpose GP2 serves well, so the credit balance needs to be monitored.
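As a sketch, the remaining GP2 burst credit can be watched through the BurstBalance metric in CloudWatch; the instance identifier and region below are hypothetical:
import boto3
from datetime import datetime, timedelta

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")
# BurstBalance reports the percentage of GP2 burst credits remaining
resp = cloudwatch.get_metric_statistics(
    Namespace="AWS/RDS",
    MetricName="BurstBalance",
    Dimensions=[{"Name": "DBInstanceIdentifier", "Value": "mydbinstance"}],
    StartTime=datetime.utcnow() - timedelta(hours=1),
    EndTime=datetime.utcnow(),
    Period=300,
    Statistics=["Average"],
)
for point in sorted(resp["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Average"])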
Compute or memory can be scaled up or down. Storage can be scaled up to 16 TB, with no downtime for storage scaling.
Failovers are automatic and replication is synchronous. Multi-AZ deployment is inexpensive and enabled with one click. Read replicas relieve pressure on the source database by adding read capacity.
Backups are managed with automated and manual snapshots. Transaction logs are stored every 5 minutes. There is no penalty for backups. Snapshots can be copied across regions, and a backup can be restored to an entirely new database instance.
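As a minimal boto3 sketch of the cross-region copy and the restore (snapshot and instance names are made up):
import boto3

# Run against the destination region; the source snapshot is referenced by ARN
rds_west = boto3.client("rds", region_name="us-west-2")
rds_west.copy_db_snapshot(
    SourceDBSnapshotIdentifier="arn:aws:rds:us-east-1:123456789012:snapshot:mysnap",
    TargetDBSnapshotIdentifier="mysnap-copy",
    SourceRegion="us-east-1",
)
# Restore the copy into an entirely new database instance
rds_west.restore_db_instance_from_db_snapshot(
    DBInstanceIdentifier="mydb-restored",
    DBSnapshotIdentifier="mysnap-copy",
)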
New volumes can be populated from Amazon S3. A VPC allows network isolation, and resource-level permission control is based on IAM access control. There is encryption at rest and SSL-based protection in transit, with no penalty for encryption; key management is centralized, with access control and auditing of key activity.
Access grants and revokes are maintained with IAM users for everyone, including the admin. Multi-factor authentication may also be set up.
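As an illustration of resource-level permission control, an IAM policy can scope RDS actions to a single instance; the policy name and ARN below are made up:
import boto3, json

policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": ["rds:DescribeDBInstances", "rds:CreateDBSnapshot"],
        # Hypothetical ARN: grants these actions on one instance only
        "Resource": "arn:aws:rds:us-east-1:123456789012:db:mydbinstance",
    }],
}
iam = boto3.client("iam")
iam.create_policy(PolicyName="rds-single-instance-ops",
                  PolicyDocument=json.dumps(policy))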
CloudWatch helps with monitoring. The metrics usually involve CPU, storage, memory, swap usage, read and write activity, latency, throughput, and replica lag. CloudWatch alarms are similar to on-premises monitoring tools. Additional performance insights can be gained by measuring active sessions, identifying sources of bottlenecks with an available tool, and discovering problems with log analysis and windowing of timelines.
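For example, a replica-lag alarm might look like the following sketch (instance name and SNS topic are assumptions):
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")
cloudwatch.put_metric_alarm(
    AlarmName="mydbinstance-replica-lag",
    Namespace="AWS/RDS",
    MetricName="ReplicaLag",
    Dimensions=[{"Name": "DBInstanceIdentifier", "Value": "mydbinstance-replica"}],
    Statistic="Average",
    Period=300,
    EvaluationPeriods=3,
    Threshold=60.0,  # alarm when lag averages over a minute
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:ops-alerts"],
)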
Billing is usually metered in GB-months for storage, alongside instance hours for compute.
#application : https://1drv.ms/w/s!Ashlm-Nw-wnWtkN701ndJWdxfcO4

Saturday, April 28, 2018

We were discussing the benefits of a managed RDS instance. Today we take up some questions about cloud databases.

Does a multi-model cloud database instance lose fidelity and performance compared to a dedicated relational database?
The answer is probably no, because the cloud scales horizontally: what the database server did to manage partitions, the cloud does too. A matrix of database servers acting as a distributed database comes with its own coordination activities, whereas a cloud database seamlessly provides a big table. Can the service-level agreement of a big table match the service-level agreement of a distributed query on a SQL server? The answer is probably yes, because the partitions of data and their corresponding processing are now flattened.

Are developers encouraged to use cloud databases as their conventional development database which they then move to production? This answer is also probably yes: a technology that does not require a change of habit is more likely to get adopted, and the tenets of cloud-scale processing only improve traditional processing. Moreover, queries are standardized in language, as opposed to the custom map-reduce logic that NoSQL users must write and maintain as a distributable package.

#codingexercise https://ideone.com/Ar5cOO

Friday, April 27, 2018

The previous post was on Events as a measurement of Cloud Database performance with a focus on the pricing of the cloud database. We factored in the advantages of a public-cloud managed RDS over a self-managed AWS instance, namely: upgrades, backups, and failover are provided as packaged services, there is better infrastructure and database security, and the database appears as a managed appliance.

The managed database instance types offer a range of CPU and memory selections, and their storage is scalable on demand. Automated backups have a retention period of 35 days, and manual snapshots are stored in S3 for durability. An Availability Zone is a physically distinct, independent infrastructure; multiple Availability Zones come with database synchronization, so they are better prepared for failures. Read replicas help offload read traffic. The entire database may be snapshotted and copied across regions for greater durability. Provisioned compute, storage, and IOPS constitute the bill.

Performance is improved by offloading read traffic to replicas, putting a cache in front of the RDS instance, and scaling up the storage or resizing the instances. CloudWatch alerts and DB event notifications enable databases to be monitored.
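As a sketch of the cache-in-front pattern, assuming a Redis cache (redis-py) and a MySQL-flavored RDS endpoint (PyMySQL), with hypothetical hosts and schema:
import json
import redis
import pymysql

cache = redis.Redis(host="mycache.example.com", port=6379)
db = pymysql.connect(host="mydb.abc123.us-east-1.rds.amazonaws.com",
                     user="app", password="secret", database="appdb")

def get_user(user_id):
    key = f"user:{user_id}"
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)  # cache hit: no round trip to RDS
    with db.cursor() as cur:
        cur.execute("SELECT id, name FROM users WHERE id = %s", (user_id,))
        row = cur.fetchone()
    if row is None:
        return None
    user = {"id": row[0], "name": row[1]}
    cache.setex(key, 300, json.dumps(user))  # cache for five minutes
    return user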

In short, RDS allows developers to focus on app optimization with schema design, query construction, and query optimization, while all infrastructure and maintenance is wrapped under managed services.

RDS alone may not scale in a distributed manner. Therefore software such as ScaleBase allows creation of a distributed relational database where database instances are scaled out: a single-instance database can be transformed into a multiple-instance distributed relational database. The benefits of such a distributed database include massive scale, instant deployment, keeping all the single-instance RDS benefits, automatic load balancing (especially around replica lag and the splitting of reads and writes, sketched below), and finally increased ROI with no application code changes.
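This is not ScaleBase's actual implementation, but a toy sketch of the read/write splitting idea with hypothetical endpoints:
import random
import pymysql

PRIMARY = dict(host="mydb-primary.example.com", user="app",
               password="secret", database="appdb")
REPLICAS = [dict(host=h, user="app", password="secret", database="appdb")
            for h in ("mydb-replica-1.example.com", "mydb-replica-2.example.com")]

def run(sql, params=()):
    # Route SELECTs to a random replica; send everything else to the primary
    is_read = sql.lstrip().upper().startswith("SELECT")
    conn = pymysql.connect(**(random.choice(REPLICAS) if is_read else PRIMARY))
    try:
        with conn.cursor() as cur:
            cur.execute(sql, params)
            if cur.description:  # a result set came back
                return cur.fetchall()
            conn.commit()
    finally:
        conn.close()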

#codingexercise https://ideone.com/pyiZ7C 

Thursday, April 26, 2018

The previous post was on Events as a measurement of Cloud Database performance with a focus on the pricing of the cloud database. 
With the help of managed services in the cloud, deployments are simple and fast, and so is scaling. Moreover, patching, backups, and replication are also handled appropriately. It is compatible with almost all applications and has fast, predictable performance. There is no cost to get started, and we pay for what we use.
Initially the usage costs are a fraction of the costs incurred by on-premise solutions, and for the foreseeable future the usage expenses may not exceed them. However, neither the cloud provider nor the cloud consumer controls the data, which only explodes and increases over time.
Traditionally, disks kept up with these storage requirements at a fixed infrastructure cost, but since cloud usage charges are proportional to the incoming data, there is no telling when managed database service costs will exceed the alternatives.
The metering of cloud database usage and the corresponding bills will eventually become significant enough for corporations to take notice as more and more data finds its way to the cloud. Storage has been touted to be zero cost in the cloud, but compute is not, and any amount of incoming data is paired with compute requirements. While the costs may be granular per request, they aggregate when billions of requests are processed in shorter and shorter durations. Consequently, the cloud may no longer be the least expensive option. Higher metering and more monitoring, along with a dramatic increase in traffic, add up to substantial costs while there is no fallback option to keep the costs under the bar.
The fallback option to cut costs and make them more manageable is itself a very important challenge. What will money-starved public cloud tenants do when the usage only accrues with the explosion of data, regardless of economic cycles? This in fact is the motivating factor to consider a sort of disaster-management plan for when the public cloud becomes unaffordable, even if that seems unthinkable today.
I'm not touting the on-premise solution or hybrid cloud technology as the fallback option; I don't even consider them a viable option in any economic times or state of technological advancement. The improvements to cloud services have leaped far ahead of such competition, and cloud consumers are savvy about what they need from the cloud. Instead I focus on some kind of easing of traffic into less expensive options, primarily with the adoption of quality of service for cloud service users and their migration to more flat-rate options. Corporate spending is not affected, but there will be real money saved when the traffic to the cloud is kept sustainable.
Let us take a look at this Quality of Service model. When all the data goes into the database and the data is measured only in size and number of requests, there is no distinction between Personally Identifiable Information and regular chatty traffic. In the networking world, the ability to perform QoS depends on the ability to differentiate traffic using the five-tuple <src-ip, src-port, dest-ip, dest-port, protocol>. With the enumeration and classification of traffic, we can have traffic serviced under different QoS agreements using the IntServ or DiffServ architecture. Similar QoS managers could differentiate data so that companies not only have granularity but also take better care of their more important categories of data.
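A toy sketch of such a data QoS classifier, with made-up categories standing in for DiffServ-style classes:
# Hypothetical service classes, mirroring DiffServ code points
SERVICE_CLASSES = {"pii": "expedited", "transactional": "assured",
                   "chatty": "best-effort"}
PII_FIELDS = {"ssn", "email", "phone"}  # assumed markers of PII

def classify(record):
    if PII_FIELDS & record.keys():
        return SERVICE_CLASSES["pii"]
    if record.get("kind") == "order":
        return SERVICE_CLASSES["transactional"]
    return SERVICE_CLASSES["chatty"]

print(classify({"ssn": "123-45-6789"}))  # expedited
print(classify({"kind": "order"}))       # assured
print(classify({"heartbeat": 1}))        # best-effort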

#codingexercise: https://1drv.ms/u/s!Ashlm-Nw-wnWtjSc1V5IlYwWFY88

Wednesday, April 25, 2018

The previous post was on Events as a measurement of Cloud Database performance. Today we focus on the pricing of the cloud database. 
It may be true that the cost of running a cloud database under high-volume load is significantly cheaper than the cost of running the same database and connections on-premise. This comes from the following managed services:
Scaling
High Availability
Database backups
DB s/w patches
DB s/w installs
OS patches
OS installation
Server Maintenance
Rack & Stack
Power, HVAC, net
However, I argue that the metering of cloud database usage and the corresponding bills will eventually become significant enough for corporations to take notice. This is because the pay-as-you-go model, albeit granular, is in fact expected to grow dramatically as usage grows. Consequently, the cloud may no longer be the least expensive option. Higher metering and more monitoring, along with a dramatic increase in traffic, add up to substantial costs while there is no fallback option to keep the costs under the bar.
The fallback option to cut costs and make them more manageable is itself a very important challenge. What will money-starved public cloud tenants do when the usage only accrues with the explosion of data, regardless of economic cycles? This in fact is the motivating factor to consider a sort of disaster-management plan for when the public cloud becomes unaffordable, even if that seems unthinkable today.
I'm not touting the on-premise solution or hybrid cloud technology as the fallback option; I don't even consider them a viable option in any economic times or state of technological advancement. The improvements to cloud services have leaped far ahead of such competition. Instead I focus on some kind of easing of traffic into less expensive options, primarily with the adoption of quality of service for cloud service users and their migration to more flat-rate options. Corporate spending is not affected, but there will be real money saved when the traffic to the cloud is kept sustainable.
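To illustrate the crossover with made-up rates (deliberately not actual cloud pricing):
# Illustrative only: hypothetical rates and growth, not real pricing
metered_rate = 0.20   # dollars per million requests
flat_rate = 2000.0    # dollars per month for a flat-rate tier

requests_m = 1.0      # millions of requests in month 1
for month in range(1, 25):
    metered_cost = requests_m * metered_rate
    if metered_cost > flat_rate:
        print(f"month {month}: metered bill ${metered_cost:,.0f} exceeds flat rate")
        break
    requests_m *= 2   # traffic doubles as data explodes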
#codingexercise: https://ideone.com/F6QWcu 
#codingexercise: https://ideone.com/pKE68t

Tuesday, April 24, 2018


Events as a measurement of Cloud Database performance
In the previous post, we talked about cloud databases. They come with the benefits of the cloud and can still offer on-par performance with an on-premise database, and perhaps better service-level agreements. Today we talk about a test case to measure the performance of a cloud database, say Azure Cosmos DB, using it as a queue. I could not find a comparable study to match this test case.
Test case: Let us say I stream millions of events per minute to CosmosDB, and each bears a tuple <EventID, Byte[] payload>. The payload is irrelevant other than forcing a data copy operation for each event, thereby introducing processing time, so I use the same payload with each event.
The EventIDs range from 1 to Integer.MaxValue. I want to store this in a big table where I utilize the 255-byte row key to support horizontal scaling of my table.
Then I want to run a query every few seconds to tell me the top 50 events by count in the table.
The query will be:
SELECT events.ID AS id,
       COUNT(*) OVER ( PARTITION BY events.ID ) AS count
FROM events
ORDER BY count DESC;
which semantically has the same meaning as a GROUP BY with COUNT(*).
This is a read-only query so the writes to the table should not suffer any performance impact. In other words, the analysis query is separate from the transactional queries to insert the events in a table.
In order to tune the performance of the Cloud Database, I set the connection policy to
·         Direct mode
·         Protocol to TCP
·         Avoid startup latency on first request
·         Collocate clients in same Azure region for performance
·         Increase number of threads/tasks
This helps to squeeze the processing time of the events as they make their way to the table. We will attempt to drive the number of Request Units (RUs) consumed beyond 100,000.
This high throughput generates a load under which the analysis query may become slower over time with the massive increase in data size.
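A minimal sketch of the ingestion loop with the azure-cosmos Python SDK, single-threaded for brevity (the Direct/TCP connection policy above is configured through the .NET SDK); the database, container, and environment variable names are assumptions:
import os
import uuid
from azure.cosmos import CosmosClient

client = CosmosClient(os.environ["COSMOS_URL"],
                      credential=os.environ["COSMOS_KEY"])
container = client.get_database_client("eventsdb").get_container_client("events")

payload = "x" * 1024  # the same payload every time, forcing a data copy per event

for event_id in range(1, 1_000_001):
    container.create_item({
        "id": str(uuid.uuid4()),
        "eventId": event_id,  # ranges from 1 toward Integer.MaxValue
        "payload": payload,
    })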
Conclusion – This test case and its pricing will help determine whether it is indeed practical to store high-volume traffic, such as from mobile fleets or IoT devices, in the database.


Monday, April 23, 2018

Databases can scale up when deployed on-premise. How about the migration to cloud databases? What are some of the differences?
This post tries to find those answers. An on-premise database can scale only under constraints of hardware resources, manpower, and costs. That said, an on-premise database can work without internet access, provide high TPS, keep the data and applications in house, and store sensitive data.
Database servers such as SQL Server have also become cloud-ready and even provide mission-critical cloud performance, while lately they have been focusing on advanced analytics and rich visualizations. Higher-end database servers that require, say, 64 CPUs are often an on-premise-only choice.
We don't talk about data warehouses in this section because they are different from databases. That said Azure SQL DB and DW both work in the cloud as relational stores.
Azure Data Lake Analytics and Azure Data Lake Store are non-relational, and while they are hosted in the cloud, the corresponding on-premise analytics role is usually fulfilled by Hadoop.
Analytics includes such things as federated query, Power BI, Azure Machine Learning, and Azure Data Factory.
SQL Server on an Azure VM is considered IaaS, while a virtualized database in the cloud is considered PaaS. In the latter case, Azure Cloud Services manage scale and resilience while allowing application developers to focus on the application and the data. Regardless of the choice, consistent tools work across all the database options above.
Migrating a database from single servers to Azure VMs is straightforward, but MDMs and other databases that are company assets involve more chores.

#sqlexercise : https://goo.gl/Zz1yoV