Friday, December 18, 2015

Today's post talks about AWS statistics gathering.
Historical data for the current value of a metric can be very useful on its own, but AWS also provides aggregators: sum, max, min and avg.
If a metric takes continuous values and its sum is meaningful, we can maintain a running total by taking the sum over every interval. Typically this interval can be set in the call that grabs the statistics.
Because consecutive calls cover consecutive, non-overlapping intervals, their return values can simply be added together.

function addCurrentToCumulative($current, $cumulative)
{
    foreach ($current as $metric) {
        // first time we see this metric, seed it with a zero value
        if (array_key_exists($metric['key'], $cumulative) == false) {
            $cumulative[$metric['key']] = array('key' => $metric['key'], 'value' => 0, 'units' => $metric['units']);
        }

        if (array_key_exists('cumulate', $metric) && $metric['cumulate']) {
            // running total: add this interval's sum to the total so far
            $cumulative[$metric['key']]['value'] += $metric['sum'];
        } else {
            // non-cumulative metric: keep the latest average instead
            $cumulative[$metric['key']]['value'] = $metric['avg'];
        }
    }
    return $cumulative;
}
function saveOrPrintCumulative($cumulative)
{
    print "\n==========================================================\n";
    print " Aggregated Metrics\n";
    print "==========================================================\n";

    foreach ($cumulative as $item) {
        $key = $item['key'];
        $value = $item['value'];
        $units = $item['units'];
        print "Metric -- $key $value $units \n";
    }
}

// after every $interval minutes:
// grab the latest statistics with grab_stats
// then cumulate by adding current to the running total

$tablename = "Metrics";
$cumulative = array();
$current = grab_stats($client, $tablename);
$cumulative = addCurrentToCumulative($current, $cumulative);
sleep($interval * 60); // interval is in minutes, sleep takes seconds
$current = grab_stats($client, $tablename);
$cumulative = addCurrentToCumulative($current, $cumulative);

Note that cumulating strictly forward in time ensures that no intervals overlap and no datapoints are double-counted.

Thursday, December 17, 2015

<?php 
require 'vendor/autoload.php'; 

use Aws\CloudWatch\CloudWatchClient; 


$key = "Your_key"; 
$secret = "Your_secret"; 
$region = "us-west-2";
$version ="latest";


// Use the us-west-2 region and latest version of each client. 
$sharedConfig = [ 
   'region'      => $region, 
   'version'     => $version, 
   'credentials' => array( 
       'key'    => $key, 
       'secret' => $secret, 
   ), 
]; 


// Create an SDK class used to share configuration across clients. 
$sdk = new Aws\Sdk($sharedConfig); 
$client = $sdk->createCloudWatch(); 


function grabber($client, $tablename, $metric) { 
  $output = array(); 
  $results = $client->getMetricStatistics(array( 
    'Namespace'  => 'AWS/ECS', 
    'MetricName' => $metric, 
    'Dimensions' => array( 
      array( 
        'Name' => 'TableName', 
        'Value' => $tablename, 
      ), 
    ), 
    'StartTime'  => strtotime('-1 days'), // or '-'.$interval.' minutes' 
    'EndTime'    => strtotime('now'), 
    'Period'     => 300, 
    'Statistics' => array('Minimum', 'Maximum', 'Average', 'Sum'), 
  )); 
  print "-------------------------------------------\n"; 
  print "    $metric\n"; 
  print "-------------------------------------------\n"; 
  // print each datapoint and collect it for cumulation 
  foreach ($results['Datapoints'] as $item) { 
    $min = $item['Minimum']; 
    $max = $item['Maximum']; 
    $avg = $item['Average']; 
    $sum = $item['Sum']; 
    $time = $item['Timestamp']; 
    print "$time -- min $min, max $max, avg $avg, sum $sum\n"; 
    array_push($output, array('key'=>$time, 'min'=>$min, 'max'=>$max, 'avg'=>$avg, 'sum'=>$sum, 'cumulate'=>true, 'units'=>'count')); 
  } 
  return $output; 
} 



To detect whether a rectangle of contiguous 1s or 0s exists at a given top-left corner in a 2D array of 1s and 0s, and if so to find its size, we use the following approach:

Starting at this point as the top left, we walk the rectangle by extending the diagonal, the bottom side or the right side, making sure all newly encountered elements within the new bounds have the same value.

Wednesday, December 16, 2015

int GetSizeOfGreatestRectangleOf1or0(int[,] binaries, int rows, int cols)
{
   // maps the linear index of a rectangle's top-left corner to that of its bottom-right corner
   var rectangles = new SortedList<int, int>();
   for (int k = 0; k < rows * cols; k++)
   {
         int row = k / cols;
         int col = k % cols;
         bool inRectangle = false;

         for (int l = 0; l < rectangles.Count; l++)
         {
             int startrow = rectangles.Keys[l] / cols;
             int startcol = rectangles.Keys[l] % cols;
             int lastrow = rectangles.Values[l] / cols;
             int lastcol = rectangles.Values[l] % cols;
             if (startrow <= row && row <= lastrow && startcol <= col && col <= lastcol)
             {
                Debug.Assert(binaries[startrow, startcol] == binaries[row, col]);
                // already part of a rectangle
                inRectangle = true;
                break;
             }
         }

         if (inRectangle)
         {
             continue;
         }

         // add a rectangle if one exists, starting at this point as top left:
         // we walk the rectangle by extending the diagonal, the bottom side or the right side,
         // making sure all newly encountered elements within the new bounds are of the same value
         int val = binaries[row, col];
         int endrow = row;
         int endcol = col;
         bool grown = true;
         while (grown)
         {
             grown = false;
             if (endcol + 1 < cols && IsUniformColumn(binaries, row, endrow, endcol + 1, val))
             {
                 endcol++;
                 grown = true;
             }
             if (endrow + 1 < rows && IsUniformRow(binaries, col, endcol, endrow + 1, val))
             {
                 endrow++;
                 grown = true;
             }
         }
         if (endrow > row && endcol > col)
         {
             rectangles.Add(k, endrow * cols + endcol);
         }
   }

   // with all rectangles found, return the max size
   int size = 0;
   for (int l = 0; l < rectangles.Count; l++)
   {
     int startrow = rectangles.Keys[l] / cols;
     int startcol = rectangles.Keys[l] % cols;
     int endrow = rectangles.Values[l] / cols;
     int endcol = rectangles.Values[l] % cols;
     Debug.Assert(endcol > startcol && endrow > startrow);
     int cursize = (endcol - startcol + 1) * (endrow - startrow + 1);
     if (cursize > size)
     {
         size = cursize;
     }
   }
   return size;
}

bool IsUniformRow(int[,] binaries, int startcol, int endcol, int row, int val)
{
   for (int c = startcol; c <= endcol; c++)
       if (binaries[row, c] != val) return false;
   return true;
}

bool IsUniformColumn(int[,] binaries, int startrow, int endrow, int col, int val)
{
   for (int r = startrow; r <= endrow; r++)
       if (binaries[r, col] != val) return false;
   return true;
}

Tuesday, December 15, 2015

Today we cover the distributed mini-batch (DMB) algorithm. Previously we discussed the serial algorithm. We now evaluate the update rule phi in a distributed environment. The technique resembles the serial mini-batch algorithm and retains some of its steps, except that it runs in parallel on each node in the network. We specify a batch size b and assume that the number of nodes k divides both b and mu, where mu is the number of additional inputs that arrive at the nodes while the vector sum is in flight.
This means that each batch spans b plus mu consecutive inputs. During each batch j, all of the nodes use a common predictor wj. During the first b inputs, the nodes calculate and accumulate the stochastic gradients of the loss function f at wj. Once the nodes have accumulated b gradients altogether, they start a distributed vector-sum operation to calculate the sum of these b gradients. While the vector sum completes in the background, mu additional inputs arrive and the system keeps processing them using the same predictor wj. Although these inputs are processed, their gradients are discarded; this waste can be made negligible by choosing appropriate values for b. As a rule of thumb, b can be set around the square root of the total number of inputs m, because that is roughly the minimum we need to process to get a feel for the data. Note that the mini-batch should not be as large as half the full batch, since that offers no significant improvement in iterations; it has to be smaller, but it cannot shrink to single inputs either, since mini-batching is a hybrid between the stochastic and full-batch methods.
When the vector-sum operation completes, each node holds the sum of the b gradients collected during batch j. Each node divides this sum by b to obtain the average gradient, and uses that average gradient in the update rule phi, which results in a synchronized prediction for the next batch. Therefore, during batch j each node processes (b + mu)/k inputs with the current predictor, but only the first b/k gradients are used to compute the next predictor. All of the b + mu inputs are counted in the regret measure.
This is not a no-communication parallelization; in fact, we use communication precisely to minimize the degradation we would suffer with no communication.
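As a rough single-process illustration of this workflow, and not the paper's implementation, the sketch below simulates the k nodes in one loop; the squared loss, the gradient step standing in for phi, and all the constants are assumptions.

<?php
// Simulated distributed mini-batch (DMB): k nodes share predictor w,
// accumulate b gradients in total, vector-sum them, and update together.
function gradient($w, $z) {
    return 2 * ($w - $z); // gradient of the stand-in loss f(w, z) = (w - z)^2
}

$k = 4;       // number of nodes
$b = 100;     // gradients per batch across all nodes (k divides b)
$mu = 20;     // inputs served during the vector sum; their gradients are discarded
$eta = 0.1;   // step size for the stand-in update rule phi
$w = 0.0;     // common predictor

for ($j = 1; $j <= 50; $j++) {
    $nodeSums = array_fill(0, $k, 0.0);
    // each node accumulates b/k stochastic gradients at the shared predictor w
    for ($node = 0; $node < $k; $node++) {
        for ($s = 0; $s < $b / $k; $s++) {
            $z = 3.0 + (mt_rand() / mt_getrandmax() - 0.5); // i.i.d. input around 3
            $nodeSums[$node] += gradient($w, $z);
        }
    }
    // the distributed vector sum collapses to an array sum in this simulation;
    // meanwhile the mu extra inputs would be predicted with the same w
    $avgGradient = array_sum($nodeSums) / $b;
    $w = $w - $eta * $avgGradient; // synchronized update on every node
}
print "final predictor w = $w\n"; // approaches 3, the mean of the inputs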
#problemsolving
Given a 2D array of 1 and 0, Find the largest rectangle (may not be square) which is made up of all 1 or 0.
The straightforward solution would be to find the size of the rectangle that the current element is part of and return the maximum of such sizes encountered. To detect the rectangle we could walk along its boundary if one exists. An optimization is to keep track of all top-left and bottom-right pairs of rectangles encountered so far and to skip an element when it is already contained in one. These rectangles can be kept sorted by their top-left corners; we need check only the rectangles whose top-left is earlier than the current element.

Monday, December 14, 2015

AWS Metrics
AWS metrics relevant to compute are provided by
1. Amazon ECS Metrics for clusters which monitor CPU and memory utilization across the cluster as a whole, and across the services in the clusters.
2. Amazon EC2 Metrics for compute instances which monitor CPU utilization, disk reads/writes, network traffic and status checks
Amazon ECS Metrics:
CPUUtilization - The percentage of CPU units that are used in the cluster or service.
Cluster CPU utilization (metrics that are filtered by ClusterName without ServiceName) is measured as the total CPU units in use by Amazon ECS tasks on the cluster, divided by the total CPU units that were registered for all of the container instances in the cluster.
Service CPU utilization (metrics that are filtered by ClusterName and ServiceName) is measured as the total CPU units in use by the tasks that belong to the service, divided by the total number of CPU units that are reserved for the tasks that belong to the service.
Units: Percent
MemoryUtilization - The percentage of memory that is used in the cluster or service.
Cluster memory utilization (metrics that are filtered by ClusterName without ServiceName) is measured as the total memory in use by Amazon ECS tasks on the cluster, divided by the total amount of memory that was registered for all of the container instances in the cluster.
Service memory utilization (metrics that are filtered by ClusterName and ServiceName) is measured as the total memory in use by the tasks that belong to the service, divided by the total memory that is reserved for the tasks that belong to the service.
Units: Percent
Dimensions for Amazon ECS Metrics are also provided to refine the metrics returned for the Amazon ECS resources. These include:
ClusterName
This dimension filters the data requested for all resources in a specified cluster. All Amazon ECS metrics are filtered by ClusterName.
ServiceName
This dimension filters the data requested for all resources in a specified service within a specified cluster.
Amazon EC2 Metrics:
The following metrics are available from each EC2 instance.
CPUCreditUsage
(Only valid for T2 instances) The number of CPU credits consumed during the specified period.
CPUCreditBalance
(Only valid for T2 instances) The number of CPU credits that an instance has accumulated.
CPUUtilization
The percentage of allocated EC2 compute units that are currently in use on the instance. This metric identifies the processing power required to run an application upon a selected instance.
DiskReadOps
Completed read operations from all ephemeral disks available to the instance in a specified period of time.
DiskWriteOps
Completed write operations to all ephemeral disks available to the instance in a specified period of time.
DiskReadBytes
Bytes read from all ephemeral disks available to the instance
DiskWriteBytes
Bytes written to all ephemeral disks available to the instance
NetworkIn
The number of bytes received on all network interfaces by the instance.
NetworkOut
The number of bytes sent out on all network interfaces by the instance.
Dimensions for Amazon EC2 Metrics allow you to filter the EC2 instance data
These include:
ImageId : This dimension filters the data requested for all instances running this EC2 Amazon Machine Image (AMI).
InstanceId: This dimension filters the data you request for the identified instance only.
InstanceType: This dimension filters the data you request for all instances running with this specified instance type.
AWS metrics relevant to storage are provided by Amazon Simple Storage Service Dimensions:
BucketSizeBytes - The amount of data in bytes stored in a bucket in the Standard storage class or in the Reduced Redundancy Storage (RRS) class.
NumberOfObjects - The total number of objects stored in a bucket.
And the corresponding dimensions used to filter the data include:
BucketName - This dimension filters the data requested for the identified bucket only.
StorageType - This dimension filters the data you have stored in a bucket by the type of storage. The types are StandardStorage for the Standard storage class, ReducedRedundancyStorage for the Reduced Redundancy Storage (RRS) class, and AllStorageTypes.
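
As a short sketch using the PHP SDK client from the earlier posts, the S3 storage metrics above can be pulled as follows; the bucket name is a placeholder, and the daily period matches how S3 reports storage metrics.

$result = $client->getMetricStatistics(array(
    'Namespace'  => 'AWS/S3',
    'MetricName' => 'BucketSizeBytes',
    'Dimensions' => array(
        array('Name' => 'BucketName',  'Value' => 'my-bucket'),       // placeholder bucket
        array('Name' => 'StorageType', 'Value' => 'StandardStorage'),
    ),
    'StartTime'  => strtotime('-2 days'),
    'EndTime'    => strtotime('now'),
    'Period'     => 86400, // S3 storage metrics are reported daily
    'Statistics' => array('Average'),
));
foreach ($result['Datapoints'] as $point) {
    print $point['Timestamp'] . ' -- ' . $point['Average'] . " bytes\n";
}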
Sample API call: http://monitoring.amazonaws.com/?SignatureVersion=2
&Action=ListMetrics
&Version=2010-08-01
&AWSAccessKeyId=< AWS Access Key Id>
&SignatureVersion=2
&SignatureMethod=HmacSHA256
&Timestamp=2010-11-17T05%3A13%3A00.000Z
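
The equivalent ListMetrics call through the PHP SDK client created in the earlier post might look like this minimal sketch; the namespace filter is an optional assumption.

$result = $client->listMetrics(array(
    'Namespace' => 'AWS/EC2', // optional filter; omit to list all metrics
));
foreach ($result['Metrics'] as $metric) {
    print $metric['Namespace'] . ' ' . $metric['MetricName'] . "\n";
}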

Sample error response
This is specified as:
<ErrorResponse>
  <Error>
    <Type>Sender</Type>
    <Code>InvalidParameterValue</Code>
    <Message>Invalid nextToken</Message>
  </Error>
  <RequestId>cb1c4191-0e5d-11e2-9c15-6f306828daaa</RequestId>
</ErrorResponse>

SDK Libraries are available in Java, PHP, Python, Ruby and .NET.

Sample to get metrics
To get statistics, we use the GetMetricStatistics API, which provides a variety of statistics, though the data points are only retained for 14 days.
Example: Call GetMetricStatistics with the following parameters to get the CPU utilization per hour for an EC2 instance for a 3-day range

MetricName = CPUUtilization
Period = 3600
Statistics list includes Maximum
Dimensions (Name=InstanceId, Value="<your-instance-id>")
Namespace = AWS/EC2
StartTime = 2011-01-09T23:18:00
EndTime = 2011-01-12T23:18:00
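
Assuming the $client from the earlier posts, a rough PHP SDK equivalent of this call is sketched below; the instance id stays a placeholder.

$result = $client->getMetricStatistics(array(
    'Namespace'  => 'AWS/EC2',
    'MetricName' => 'CPUUtilization',
    'Dimensions' => array(
        array('Name' => 'InstanceId', 'Value' => '<your-instance-id>'), // placeholder
    ),
    'StartTime'  => strtotime('2011-01-09T23:18:00'),
    'EndTime'    => strtotime('2011-01-12T23:18:00'),
    'Period'     => 3600,
    'Statistics' => array('Maximum'),
));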

Sample to create alarms
To create or update an alarm, use the PutMetricAlarm API.
To list any or all of the currently configured alarms, or to list alarms in a particular state, use the DescribeAlarms API.
To disable and enable alarms, use the DisableAlarmActions and EnableAlarmActions APIs (mon-disable-alarm-actions and mon-enable-alarm-actions commands).
To test an alarm by setting it to any state use the SetAlarmState API (mon-set-alarm-state command)
Finally, to view an alarm's history use the DescribeAlarmHistory API (mon-describe-alarm-history command).
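
As a hedged sketch, creating an alarm through the same PHP SDK client might look as follows; the alarm name, threshold, evaluation periods and SNS topic ARN are all assumptions.

$client->putMetricAlarm(array(
    'AlarmName'          => 'high-cpu',          // assumed name
    'MetricName'         => 'CPUUtilization',
    'Namespace'          => 'AWS/EC2',
    'Statistic'          => 'Average',
    'Period'             => 300,
    'EvaluationPeriods'  => 2,
    'Threshold'          => 80.0,
    'ComparisonOperator' => 'GreaterThanThreshold',
    'Dimensions'         => array(
        array('Name' => 'InstanceId', 'Value' => '<your-instance-id>'), // placeholder
    ),
    'AlarmActions'       => array('arn:aws:sns:us-west-2:123456789012:my-topic'), // assumed topic
));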

#codingexercise
string ReverseWordsInASentence(string sentence)
{
    var words = sentence.Split(' ');

    // Aggregate prepends each next word, reversing the word order
    var reversedSentence = words.Aggregate((sent, next) => next + " " + sent);

    return reversedSentence;
}

Sunday, December 13, 2015

Billing System for IT Cloud Provider

Billing is useful to the consumers to know the usage that they make of the system resources. Although a private cloud provider may not require payments from the consumer, a report of various system resource usages together with some analysis charts and graphs makes for a very presentable and actionable memo. The consumer can then directly participate in capacity planning and conservation.

 


This is where a software billing management system comes in useful. The billing system for a cloud provider is dedicated to the collection and processing of the usage data as it pertains to the end consumer billing. Detailed metrics for system resources are drawn from pay-per-use monitors that are either available internally or via external providers. By external provider we mean cloud-providing vendor software that may already have its own telemetry. These gather runtime usage data that is stored in a repository. When the usage data is coupled with different pricing policies, we can get different bills generated. Pricing policies can include different schemes, such as resource-based or consumer-based, and models can vary from traditional pay-per-use models to flat-rate or pay-per-allocation models or their combinations.

Usage may also be combined with quotas. When quotas are exceeded, they can appear on the bills with notations, and further usage requests by these cloud consumers can be blocked.

 


Different providers provide their own mechanisms for usage monitoring and billing. Two examples cited here for providers are Openstack and AWS.

In the case of Openstack, we have Ceilometer. Ceilometer is OpenStack’s telemetry. Its source code is available at https://github.com/openstack/ceilometer. It collects all kinds of monitoring and metering information so that no two agents need to be written to collect the same data. While monitoring and metering are different, a typical consumer of Ceilometer’s metering data is, say, a billing system, for which it acts as a unique and central point of contact to acquire all of the information collected across all OpenStack core components. The project strives to be efficient in terms of CPU and network costs. It supports both push notifications from the existing services and a pull model by polling the infrastructure. Deployers can configure the type of data collected. Publishers are available as a Python implementation. A REST API is also available to expose the data collected by the metering system.

 

Bills, even if the amounts are written off, can be delivered by mail. Their layout and crispness reflect the metering and usage data, giving the customer a good indication of what is costing how much.

 

With regard to AWS billing, there is a console built into AWS Billing and Cost Management that already provides well known and documented cost analysis. Bills can be generated and exported as CSVs to S3 buckets that can feed into different applications.

 

As a consolidator over cloud providers, we can maintain a consistent, common user interface that accesses the same information from the underlying providers where available, adding new providers as needed.

 

The design is one of a pull model, where each bill is generated by pulling the metrics, usage and rates from the underlying providers and from our own database.

 


 

Central to the discussion on billing is the notion of a database for customer data. This is proposed in spite of the direct API access available to the different cloud providers, because a database promotes an ecosystem that leads to different workflows. The APIs are still used to populate the data, except that the data is now organized in a time-series manner and analyzed per customer or resource.

 

UI / Console    <-- pull --+                +-- push --   Provider 1
                           |  API/Database  |
Bills (emails)  <-- pull --+                +-- push --   Provider 2

 


Openstack Billing provider:


Ceilometer measures the usage of resources in terms of disk usage, network usage and CPU usage. For each instance, the usage details for the following meters will be shown:

1. Disk Read Bytes
2. Disk Write Bytes
3. Disk Write Requests
4. Disk Read Requests
5. Network outgoing bytes
6. Network incoming bytes
7. Network outgoing packets
8. Network incoming packets
9. CPU

 

Ceilometer arranges all the data in terms of resources and their meters.

We can get a list of all the resources using

GET /v2/resources/ and
GET /v2/resources/(resource_id)

where each resource contains links, such as a URL to identify itself and a URL for its meters, for example http://localhost:8777/v2/meters/volume?q.field=resource_id&q.value=bd9431c1-8d69-4ad3-803a-8d4a6b89fd36

The meters themselves are available from

GET /v2/meters/ which returns all known meters, and
GET /v2/meters/(meter_name) which returns the samples for one meter, with statistics from
GET /v2/meters/(meter_name)/statistics

Samples and statistics can also be drawn from

GET /v2/samples and
GET /v2/samples/(sample_id)
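
A minimal PHP sketch of polling one of these endpoints with curl follows; the endpoint host and port, the meter name and the auth token are assumptions for illustration.

// Hypothetical Ceilometer endpoint and Keystone token; adjust for your deployment.
$endpoint = 'http://localhost:8777/v2/meters/cpu/statistics';
$token = 'Your_keystone_token';

$ch = curl_init($endpoint);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_HTTPHEADER, array('X-Auth-Token: ' . $token));
$response = curl_exec($ch);
curl_close($ch);

// the statistics endpoint returns a JSON list of aggregates per period
$statistics = json_decode($response, true);
foreach ($statistics as $stat) {
    print $stat['duration_start'] . ' -- avg ' . $stat['avg'] . "\n";
}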

 

Tables can be created in the following manner (a sketch follows the list):

1) RatePlan
   a. Id
   b. Name
   c. CPU cost per second
   d. Disk cost per MB
   e. Network cost per MB
   f. Cost for flavors

2) Table for instance-bill, with the following attributes
   a. Resource_id
   b. Rateplan_id
   c. Date and time at which the bill was calculated
   d. Resource_bill
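
A minimal sketch of these two tables using PHP's PDO with SQLite; the column types, and any names not in the list above, are assumptions.

// Sketch: create the RatePlan and instance-bill staging tables (SQLite via PDO).
$db = new PDO('sqlite:billing.db');
$db->exec('CREATE TABLE IF NOT EXISTS RatePlan (
    Id INTEGER PRIMARY KEY,
    Name TEXT,
    CpuCostPerSecond REAL,
    DiskCostPerMB REAL,
    NetworkCostPerMB REAL,
    FlavorCost REAL
)');
$db->exec('CREATE TABLE IF NOT EXISTS InstanceBill (
    Resource_id TEXT,
    Rateplan_id INTEGER REFERENCES RatePlan(Id),
    CalculatedAt TEXT,   -- date and time at which the bill was calculated
    Resource_bill REAL
)');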

 

 

Amazon Billing Provider


AWS Cloud operates on a pay-as-you-go model, so the bill reflects the actual usage. This also means that we need to check the account activity every so often, or get alerts through alarms set up on Amazon CloudWatch metrics. Amazon CloudWatch provides monitoring of the estimated charges using billing alerts, and delivers them via Amazon SNS, a well known publisher-subscriber service for mobile and enterprise messaging. The billing alerts sent by Amazon CloudWatch include estimates stored as CloudWatch metrics, and this metric data is stored for 14 days. The data includes variations of the following:

Estimated charges : Total

Estimated charges : by service

Estimated charges : by Linked Account

Estimated charges : by Linked Account and Service

Note that these are estimates and not predictions. They approximate the cost and do not take trends or potential changes in the AWS usage pattern into account.

The use of these alerts is unique to the consolidated billing for the private cloud provider. Essentially, the alerts populate a dynamic view (in-memory) or a stored table for each customer, account and service.

The table can have the following attributes:

- Current period cost by service
- Cumulative cost by service

With this table, we can generate reports at customer, account and service granularity. If we keep a time-series collection of these costs, we can pull historical reports as well. One thing that should stand out is that CloudWatch already manages most of the cost estimation and can provide different kinds of analysis built into the alerts. Therefore the table we keep for the data we pull for our customers is merely a staging store; the bulk of the calculation and processing is already done for us.

The relevant APIs to populate the staging store in this case would be as follows:

Services/cloudwatch.class.php:

__construct ( $options )

Constructs a new instance of AmazonCloudWatch.

put_metric_alarm ( $alarm_name, $metric_name, $namespace, $statistic, $period, $evaluation_periods, $threshold, $comparison_operator, $opt )

Creates or updates an alarm and associates it with the specified Amazon CloudWatch metric. Optionally, this operation can associate one or more Amazon Simple Notification Service resources with the alarm.

delete_alarms ( $alarm_names, $opt )

Deletes all specified alarms

describe_alarm_history ( $opt )

Retrieves history for the specified alarm. Filter alarms by date range or item type. If an alarm name is not specified, Amazon CloudWatch returns histories for all of the owner’s alarms.

enable_alarm_actions ( $alarm_names, $opt )

Enables actions for the specified alarms.

disable_alarm_actions ( $alarm_names, $opt )

Disables actions for the specified alarms. When an alarm’s actions are disabled the alarm’s state may change, but none of the alarm’s actions will execute.

list_metrics ( $opt )

Returns a list of valid metrics stored for the AWS account owner. Returned metrics can be used with GetMetricStatistics to obtain statistical data for a given metric.
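
Putting one of these to use, here is a hedged sketch that creates a billing alert with the put_metric_alarm signature listed above; the alarm name, period and threshold are assumptions (AWS/Billing EstimatedCharges is the metric CloudWatch uses for estimated charges).

// Sketch following the SDK 1.x signature listed above; names and values are assumptions.
require_once 'sdk.class.php';
$cw = new AmazonCloudWatch();
$response = $cw->put_metric_alarm(
    'billing-alert',                  // $alarm_name (assumed)
    'EstimatedCharges',               // $metric_name
    'AWS/Billing',                    // $namespace
    'Maximum',                        // $statistic
    21600,                            // $period: six hours
    1,                                // $evaluation_periods
    100.0,                            // $threshold in USD (assumed)
    'GreaterThanOrEqualToThreshold',  // $comparison_operator
    array('Dimensions' => array(array('Name' => 'Currency', 'Value' => 'USD')))
);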

 

There is no need to merge the bills from different providers, since we strive to be more granular in itemizing the costs. They can appear as different sections, one from each provider. This saves the trouble of artificially merging into a common set of metrics when it may not be available as such from each provider; and even when it is available, the syntax, semantics and nuances may vary. Our customers recognize services and instances of compute and storage. As long as we keep those as granular as possible in the bills, the customer has the luxury of aggregating them himself or seeing the summary wherever possible in one of the analysis charts.

 

Note that the providers may also maintain their own billing database that provides a rich query API. This might even argue for removing our database and performing the query on demand when the bills are requested by the user, either interactively or during batch processing. However, such databases rarely store more than a few weeks of data, leaving our system to poll and store every so often in order to build up time-series data for longer analysis. In addition, the user's account or plan may change frequently, and she may even want notifications when she is exceeding the free tier of usage. Additionally, one provider may prefer a notification model over a polling model, while another may offer only a pull model for the data. A staging store for the billing data relevant to the consumer bills enables us to talk to different providers. We can keep the staging store as minimal as necessary, relying more on getting the information via APIs, but it does not seem avoidable.

 


Most of the cloud providers mentioned as examples already provide a rich set of APIs for usage data, as shown. We just have to focus on the service that extracts, transforms and loads the database for use by the administrators and the consumers. Reports and publishing as emails can be done by pulling the data from this database.

 


This software is easy to build with the existing frameworks that we already have, both in terms of the data providers and the infrastructure for hosting it.

Friday, December 11, 2015

Today we continue to review the paper by Dekel, Gilad-Bachrach, Shamir and Xiao, which describes optimal distributed online prediction using mini-batches. In this section we review serial online prediction using mini-batches.
We saw from the analysis of the parallel DMB algorithm in the earlier post that the expected regret depends on the variance of the stochastic gradients. This dependency implies that we can reduce the variance by averaging the stochastic gradients over mini-batches.
The serial mini-batch algorithm does exactly that. Let us take a look.
In the earlier setting, the update rule was applied after each input was received. In the serial mini-batch algorithm, it is applied only periodically, hence the notion of mini-batches. Here b is a user-defined batch size, and every b consecutive inputs are considered one batch.
The algorithm now becomes:
for j = 1, 2, ... do
    initialize avg-gradient-j = 0
    for s = 1 to b do
        define i = (j - 1)b + s
        predict wj
        receive input zi sampled i.i.d. from an unknown distribution
        suffer loss f(wj, zi)
        gradient-i = delta-w f(wj, zi)
        avg-gradient-j = avg-gradient-j + (1/b) gradient-i
    end
    set (wj+1, aj+1) = phi(aj, avg-gradient-j, step-size)
end

The prediction remains constant for the duration of each batch and is updated only when a batch ends. While processing the b inputs in batch j, the algorithm calculates and accumulates the gradients and defines their average. Each mini-batch thus produces one average gradient. The appeal of the serial mini-batch setting is that the update rule is applied less frequently, which may have computational benefits.
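
A minimal runnable sketch of the serial mini-batch loop above, with a squared loss standing in for f and a plain gradient step standing in for the update rule phi; both, and all constants, are assumptions.

<?php
// Serial mini-batch sketch: average b stochastic gradients before each update.
// f(w, z) = (w - z)^2 and a plain gradient step stand in for the loss and phi.
$b = 50;      // batch size
$eta = 0.1;   // step size
$w = 0.0;     // predictor

for ($j = 1; $j <= 100; $j++) {
    $avgGradient = 0.0;
    for ($s = 1; $s <= $b; $s++) {
        $z = 5.0 + (mt_rand() / mt_getrandmax() - 0.5); // i.i.d. input around 5
        // predict with wj, suffer loss f(wj, zi), accumulate (1/b) * gradient
        $avgGradient += (1.0 / $b) * 2 * ($w - $z);
    }
    // update rule phi: one gradient step on the averaged gradient
    $w = $w - $eta * $avgGradient;
}
print "final predictor w = $w\n"; // converges toward 5, the input mean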

#codingexercise
int max(List<int> numbers)
{
    int max = int.MinValue;
    for (int i = 0; i < numbers.Count; i++)
        if (numbers[i] > max)
            max = numbers[i];
    return max;
}