Tuesday, May 1, 2018

Today we discuss the AWS Database Migration Service (DMS). This service allows consolidation, distribution, and replication of databases. The source database remains fully operational during the migration, minimizing downtime for applications that rely on the database. It supports almost all the major brands of databases. It can also perform heterogeneous migrations, such as from Oracle to Microsoft SQL Server.
When the databases are different, the AWS Schema Conversion Tool is used. The steps for conversion include: assessment, database schema conversion, application conversion, scripts conversion, integration with third-party applications, data migration, functional testing of the entire system, performance tuning, integration and deployment, training and knowledge transfer, documentation and version control, and post-production support. The Schema Conversion Tool assists with the first few steps, up to the data migration step. Database objects such as tables, views, indexes, code, user-defined types, aggregates, stored procedures, functions, triggers, and packages can be moved with the SQL generated by the Schema Conversion Tool. The tool also provides an assessment report and an executive summary. As long as the tool has the drivers for the source and destination databases, we can rely on the migration automation performed this way. Subsequently, configuration and settings need to be specified on the target database, including those for performance, memory, and the assessment report. The number of tables, schemas, and users/roles/permissions determines the duration of the migration.
DMS differs from the Schema Conversion Tool in that it is generally used for data migration rather than schema migration.
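As a rough sketch of how a DMS data migration might be kicked off programmatically with boto3 (all ARNs and identifiers below are placeholders, and the source/target endpoints and replication instance are assumed to exist already):

import json
import boto3

dms = boto3.client("dms")

# Create a task that does a full load and then keeps replicating changes (CDC).
task = dms.create_replication_task(
    ReplicationTaskIdentifier="oracle-to-sqlserver-task",
    SourceEndpointArn="arn:aws:dms:us-west-2:123456789012:endpoint:SOURCE",
    TargetEndpointArn="arn:aws:dms:us-west-2:123456789012:endpoint:TARGET",
    ReplicationInstanceArn="arn:aws:dms:us-west-2:123456789012:rep:INSTANCE",
    MigrationType="full-load-and-cdc",
    TableMappings=json.dumps({
        "rules": [{
            "rule-type": "selection",
            "rule-id": "1",
            "rule-name": "include-all",
            "object-locator": {"schema-name": "%", "table-name": "%"},
            "rule-action": "include",
        }]
    }),
)

# Start the task once it has been created.
dms.start_replication_task(
    ReplicationTaskArn=task["ReplicationTask"]["ReplicationTaskArn"],
    StartReplicationTaskType="start-replication",
)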
#codingexercises
Sierpinski triangle:
// Iteratively applies the recurrence count(n) = 3 * count(n-1) + 2, starting from 1.
double GetCountRepeated(int n)
{
    double result = 1;
    for (int i = 0; i < n; i++)
    {
        result = 3 * result + 2;
    }
    return result;
}
which can also be written recursively:
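A minimal recursive sketch (shown here in Python) that preserves the same count(n) = 3 * count(n-1) + 2 recurrence used in the loop above; the recurrence itself is taken from the original code, not re-derived:

def get_count_repeated(n):
    # Base case matches the iterative version's starting value of 1.
    if n <= 0:
        return 1
    # Same step as the loop body: triple the previous count and add 2.
    return 3 * get_count_repeated(n - 1) + 2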
another: https://ideone.com/F6QWcu
and finally: Text Summarization app: http://shrink-text.westus2.cloudapp.azure.com:8668/add

Monday, April 30, 2018

I have been working on a user interface for customers to upload content to a server for some processing. Users may have to upload large files, and both the upload and the processing may take time. I came across an interesting jQuery plugin for showing a progress bar.
The jQuery-File-Upload plugin gives an example as follows:
$(function () {
    $('#fileupload').fileupload({
        dataType: 'json',
        done: function (e, data) {
            $.each(data.result.files, function (index, file) {
                $('<p/>').text(file.name).appendTo(document.body);
            });
        }
    });
});

The upload progress bar is indicated this way:
$('#fileupload').fileupload({
     :
    progressall: function (e, data) {
        var progress = parseInt(data.loaded / data.total * 100, 10);
        $('#progress .bar').css(
            'width',
            progress + '%'
        );
    }
});

Notice that the callback depends on data. This notion can be borrowed on the server side where, given a request id, the server can indicate progress; a hypothetical server-side counterpart is sketched after the snippet below:
function updateProgress() {
        if (stopProgressCheck) return;
        var webMethod = progressServiceURL + "/GetProgress";
        var parameters = "{'requestId':'" + requestId + "'}";

        $.ajax({
            type: "POST",
            url: webMethod,
            data: parameters,
            contentType: "application/json; charset=utf-8",
            dataType: "json",
            success: function (msg) {
                if (msg.d != "NONE") { //add any necessary checks
                    //add code to update progress bar status using value in msg.d
                    statusTimerID = setTimeout(updateProgress, 100); //set time interval as required
                }
            },
            error: function (x, t, m) {
                alert(m);
            }
       }); 
    }
Courtesy: https://stackoverflow.com/questions/24608335/jquery-progress-bar-server-side
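The linked answer uses an ASP.NET web method, but the idea ports to any stack. Here is a minimal, hypothetical sketch of the /GetProgress endpoint in Python with Flask; the in-memory PROGRESS store and the field name "d" (kept to match the client code above) are assumptions:

from flask import Flask, jsonify, request

app = Flask(__name__)

# Progress per request id; a real service would use a shared cache or database
# so that any worker handling the upload can report its progress here.
PROGRESS = {}

@app.route("/GetProgress", methods=["POST"])
def get_progress():
    request_id = request.get_json().get("requestId")
    percent = PROGRESS.get(request_id)
    # Mirror the client's expectation: a field named "d" holding either the
    # progress value or "NONE" when there is nothing to report.
    return jsonify({"d": str(percent) if percent is not None else "NONE"})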
The alternative to this seems to be to use the HTTP 102 (Processing) status code.
#application : https://1drv.ms/w/s!Ashlm-Nw-wnWtkN701ndJWdxfcO4

Sunday, April 29, 2018

We were discussing the benefits of a managed RDS instance. Let us now talk about cost and performance optimization in RDS.
RDS supports multiple engines including Aurora, MySQL, MariaDB, PostgreSQL, Oracle, and SQL Server. Being a managed service, it supports provisioning, patching, scaling, replicas, backup/restore, and scaling up for all these engines. It supports multiple availability zones. As a managed service it lowers TCO and allows more focus on differentiation, which makes it attractive for managed instances big or small.
The storage type may be selected between GP2 and IO1. The former is general purpose and the latter is provisioned IOPS for consistent, high performance.
Depending on the volume size, the burst rate and the IOPS rate on GP2 need to be monitored. GP2 imposes a burst credit limit, and as long as we have credit against this limit, general purpose GP2 serves well.
Compute or memory can be scaled up or down. Storage can be scaled up to 16 TB. There is no downtime for storage scaling.
Failovers are automatic. Replication is synchronous. Multi-AZ deployment is inexpensive and enabled with one click. Read replicas relieve pressure on the source database with additional read capacity.
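A minimal sketch of adding a read replica with boto3, assuming an existing source instance named mydb (the identifiers and instance class are placeholders):

import boto3

rds = boto3.client("rds")

# Create a replica of the existing instance; read traffic can then be
# directed to it to relieve pressure on the source.
rds.create_db_instance_read_replica(
    DBInstanceIdentifier="mydb-replica-1",
    SourceDBInstanceIdentifier="mydb",
    DBInstanceClass="db.r4.large",
)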
Backups are managed with automated and manual snapshots. Transaction logs are stored every 5 minutes. There is no penalty for backups. Snapshots can be copied across regions. A backup can be restored to an entirely new database instance.
New volumes can be populated from Amazon S3. A VPC allows network isolation. Resource-level permission control is based on IAM access control. There is encryption at rest and SSL-based protection in transit. There is no penalty for encryption. Moreover, key management is centralized with access control and auditing of key activity.
Access grants and revokes are maintained with an IAM user for everyone, including the admin. Multi-factor authentication may also be set up.
CloudWatch may provide help with monitoring. The metrics usually involve those for CPU, storage and memory, swap usage, reads and writes, latency and throughput, and replica lag. CloudWatch alarms are similar to on-premises monitoring tools. Performance insights can be gained additionally by measuring active sessions, identifying sources of bottlenecks with an available tool, and discovering problems with log analysis and windowing of timelines.
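For example, a small boto3 sketch to watch the GP2 burst credit balance mentioned earlier (the instance identifier is a placeholder):

import boto3
from datetime import datetime, timedelta

cloudwatch = boto3.client("cloudwatch")

# Average burst-bucket credit balance over the last hour, in 5-minute buckets.
stats = cloudwatch.get_metric_statistics(
    Namespace="AWS/RDS",
    MetricName="BurstBalance",
    Dimensions=[{"Name": "DBInstanceIdentifier", "Value": "mydb"}],
    StartTime=datetime.utcnow() - timedelta(hours=1),
    EndTime=datetime.utcnow(),
    Period=300,
    Statistics=["Average"],
)
for point in sorted(stats["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Average"])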
Billing is usually in the form of GB-months.
#application : https://1drv.ms/w/s!Ashlm-Nw-wnWtkN701ndJWdxfcO4

Saturday, April 28, 2018

We were discussing the benefits of a managed RDS instance:

The managed database instance types offer a range of CPU and memory selections. Moreover, their storage is scalable on demand. Automated backups have a retention period of up to 35 days and manual snapshots are stored in S3 for durability. An availability zone is a physically distinct, independent infrastructure. A multi-AZ deployment comes with database synchronization, so it is better prepared for failures. Read replicas help offload read traffic. The entire database may be snapshotted and copied across regions for greater durability. Provisioned compute, storage, and IOPS constitute the bill.

Performance is improved by offloading read traffic to replicas, putting a cache in front of RDS, and scaling up the storage or resizing the instances. CloudWatch alerts and DB event notifications enable databases to be monitored.

In short, RDS allows developers to focus on app optimization, with schema design, query construction, and query optimization, while allowing all infrastructure and maintenance to be wrapped under managed services.

RDS alone may not scale in a distributed manner. Therefore software such as ScaleBase allows creation of a distributed relational database where database instances are scaled out. A single-instance database can now be transformed into a multiple-instance distributed relational database. The benefits of such a distributed database include massive scale, instant deployment, keeping all the single-instance RDS benefits, automatic load balancing, especially with lags from replicas and the splitting of reads and writes, and finally increased ROI with no app code changes required.

Does a multi-model cloud database instance lose fidelity and performance compared to a dedicated relational database?
The answer is probably no, because a cloud scales horizontally, and what the database server did to manage partitions is what the cloud does too. A matrix of database servers as a distributed database model comes with coordination activities. A cloud database seamlessly provides a big table. Can the service-level agreement of a big table match the service-level agreement of a distributed query on a SQL server? The answer is probably yes, because the partitions of data and the corresponding processing are now flattened.

Are developers encouraged to use cloud databases as their conventional development database which they then move to production? This answer is also probably yes: the technology that does not require a change of habit is more likely to get adopted, and all the tenets of cloud-scale processing only improve traditional processing. Moreover, queries are standardized in language, as opposed to writing custom map-reduce logic and maintaining a library of those as a distributable package, as NoSQL users do.

#codingexercise https://ideone.com/Ar5cOO

Friday, April 27, 2018

The previous post was on Events as a measurement of Cloud Database performance with a focus on the pricing of the cloud database. We factored in the advantages of a public-cloud managed RDS over a self-managed AWS instance, namely: upgrades, backup, and failover are provided as services, there is more infrastructure and database security, the database appears as a managed appliance, and failover is a packaged service.

The managed database instance types offer a range of CPU and memory selections. Moreover, their storage is scalable on demand. Automated backups have a retention period of up to 35 days and manual snapshots are stored in S3 for durability. An availability zone is a physically distinct, independent infrastructure. A multi-AZ deployment comes with database synchronization, so it is better prepared for failures. Read replicas help offload read traffic. The entire database may be snapshotted and copied across regions for greater durability. Provisioned compute, storage, and IOPS constitute the bill.

Performance is improved by offloading read traffic to replicas, putting a cache in front of RDS, and scaling up the storage or resizing the instances. CloudWatch alerts and DB event notifications enable databases to be monitored.

In short, RDS allows developers to focus on app optimization, with schema design, query construction, and query optimization, while allowing all infrastructure and maintenance to be wrapped under managed services.

RDS alone may not scale in a distributed manner. Therefore software such as ScaleBase allows creation of a distributed relational database where database instances are scaled out. A single-instance database can now be transformed into a multiple-instance distributed relational database. The benefits of such a distributed database include massive scale, instant deployment, keeping all the single-instance RDS benefits, automatic load balancing, especially with lags from replicas and the splitting of reads and writes, and finally increased ROI with no app code changes required.

#codingexercise https://ideone.com/pyiZ7C 

Thursday, April 26, 2018

The previous post was on Events as a measurement of Cloud Database performance with a focus on the pricing of the cloud database. 
With the help of managed services in the cloud, deployments are simple and fast and so is scaling. Moreover, patching, backups, and replication are also handled appropriately. It is compatible with almost all applications. And it has fast, predictable performance. There is no cost to get started and we pay for what we use.
Initially the usage costs are a fraction of the costs incurred from on-premise solutions. Moreover, for the foreseeable future, the usage expenses may not exceed on-premise solutions. However, neither the cloud provider nor the cloud consumer controls the data, which only explodes and increases over time.
Traditionally, disks kept up with these storage requirements and there was a cost for the infrastructure, but since usage is proportional to the incoming data, there is no telling when the costs of a managed database service will exceed those of the alternatives.
The metering of cloud database usage and the corresponding bills will eventually become significant enough for corporations to take notice as more and more data finds its way to the cloud. Storage has been touted to be zero cost in the cloud, but compute is not. Any amount of incoming data is paired with compute requirements. While the costs may be granular per request processed, they aggregate when billions of requests are processed in smaller and smaller durations. Consequently, the cloud may no longer be the least expensive option. Higher metering and more monitoring, along with a dramatic increase in traffic, add up to substantial costs while there is no fallback option to keep the costs under the bar.
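As a back-of-the-envelope illustration (the per-request rate and traffic volume below are purely hypothetical assumptions, not quoted prices):

# Hypothetical numbers only: a tiny per-request charge still aggregates quickly.
price_per_million_requests = 0.20      # assumed rate in dollars, not a real quote
requests_per_day = 2_000_000_000       # two billion requests per day
monthly_cost = requests_per_day / 1_000_000 * price_per_million_requests * 30
print(f"~${monthly_cost:,.0f} per month")  # about $12,000 per month at these assumptions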
The fallback option to cut costs and make them more manageable is itself a very important challenge. What will money-starved public cloud tenants do when the usage only accrues with the explosion of data, regardless of economic cycles? This in fact is the motivating factor to consider a sort of disaster management plan for when the public cloud becomes unaffordable, even if that may seem unthinkable today.
I'm not touting the on-premise solution or hybrid cloud technology as the fallback option. I don't even consider them a viable option regardless of economic times or technological advancement. The improvements to cloud services have leaped far ahead of such competition, and cloud consumers are savvy about what they need from the cloud. Instead, I focus on some kind of easing of traffic into less expensive options, primarily with the adoption of quality of service for cloud service users and their migration to more flat-rate options. Corporate spending is not affected, but there will be real money saved when the traffic to the cloud is kept sustainable.
Let us take a look at this Quality of Service model. When all the data goes into the database and the data is measured only in size and number of requests, there is no distinction between Personally Identifiable Information and regular chatty traffic. In the networking world, the ability to perform QoS depends on the ability to differentiate traffic using the <src-ip, src-port, dest-ip, dest-port, protocol> tuple. With the enumeration and classification of traffic, we can have traffic serviced with different QoS agreements using the IntServ or DiffServ architecture. Similar QoS managers could differentiate data so that companies not only have granularity but also take better care of their more important categories of data.
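A hypothetical sketch of how such a data-side QoS classifier might look; the tuple fields mirror the networking five-tuple above and the service class names are borrowed loosely from DiffServ (all names here are assumptions, not an existing API):

from typing import NamedTuple

class Flow(NamedTuple):
    src_ip: str
    src_port: int
    dest_ip: str
    dest_port: int
    protocol: str

def service_class(flow: Flow, contains_pii: bool) -> str:
    # PII gets the strictest handling, analogous to DiffServ expedited forwarding.
    if contains_pii:
        return "expedited"
    # Ordinary business traffic gets assured forwarding.
    if flow.protocol == "https" and flow.dest_port == 443:
        return "assured"
    # Everything else, such as chatty telemetry, is best effort.
    return "best-effort"

print(service_class(Flow("10.0.0.5", 52110, "10.0.1.9", 443, "https"), contains_pii=True))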

New:
#codingexercise: https://1drv.ms/u/s!Ashlm-Nw-wnWtjSc1V5IlYwWFY88

Wednesday, April 25, 2018

The previous post was on Events as a measurement of Cloud Database performance. Today we focus on the pricing of the cloud database. 
It may be true that the cost of running a cloud database for a high volume load is significantly cheaper than the cost of running the same database and connections on-premise. This comes from the following managed services:
Scaling
High Availability
Database backups
DB s/w patches
DB s/w installs
OS patches
OS installation
Server Maintenance
Rack & Stack
Power, HVAC, net
However, I argue that the metering of cloud database usage and the corresponding bills will eventually become significant enough for corporations to take notice. This is because the pay-as-you-go model, albeit granular, is in fact only expected to grow dramatically as usage grows. Consequently, the cloud may no longer be the least expensive option. Higher metering and more monitoring, along with a dramatic increase in traffic, add up to substantial costs while there is no fallback option to keep the costs under the bar.
The fallback option to cut costs and make them more manageable is itself a very important challenge. What will money-starved public cloud tenants do when the usage only accrues with the explosion of data, regardless of economic cycles? This in fact is the motivating factor to consider a sort of disaster management plan for when the public cloud becomes unaffordable, even if that may seem unthinkable today.
I'm not touting the on-premise solution or hybrid cloud technology as the fallback option. I don't even consider them a viable option regardless of economic times or technological advancement. The improvements to cloud services have leaped far ahead of such competition. Instead, I focus on some kind of easing of traffic into less expensive options, primarily with the adoption of quality of service for cloud service users and their migration to more flat-rate options. Corporate spending is not affected, but there will be real money saved when the traffic to the cloud is kept sustainable.
#codingexercise: https://ideone.com/F6QWcu 
#codingexercise: https://ideone.com/pKE68t