Tuesday, December 22, 2015

Workflow for setting up Amazon metrics collection:
1) Edit the Billing preferences as shown:
[Image: AWS billing preferences options]
Notice that the AWS billing preferences provide three options:
The first option is the one that we will be customizing for our customers, so here it serves merely as a validity check.
The second option is the one we are interested in, because it lets us subscribe to the billing metrics through CloudWatch.
The third option can also be used with automation, so we set it up with an S3 bucket.
Notice that the policy to be applied to a bucket named raja0034billing is as follows:
{ 
  "Version": "2008-10-17", 
  "Id": "Policy1335892530063", 
  "Statement": [ 
    { 
      "Sid": "Stmt1335892150622", 
      "Effect": "Allow", 
      "Principal": { 
        "AWS": "arn:aws:iam::386209384616:root" 
      }, 
      "Action": [ 
        "s3:GetBucketAcl", 
        "s3:GetBucketPolicy" 
      ], 
      "Resource": "arn:aws:s3:::raja0034billing" 
    }, 
    { 
      "Sid": "Stmt1335892526596", 
      "Effect": "Allow", 
      "Principal": { 
        "AWS": "arn:aws:iam::386209384616:root" 
      }, 
      "Action": [ 
        "s3:PutObject" 
      ], 
      "Resource": "arn:aws:s3:::raja0034billing/*" 
    } 
  ] 
} 
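
The same policy can also be attached programmatically. Below is a minimal sketch with the AWS SDK for PHP v3, assuming the policy JSON above has been saved to a local file (the file name is hypothetical) and credentials come from the environment:

<?php
require 'vendor/autoload.php';

use Aws\S3\S3Client;

// Sketch only: attach the billing policy shown above to the bucket.
$s3 = new S3Client(array('region' => 'us-west-2', 'version' => 'latest'));
$policy = file_get_contents('billing-bucket-policy.json'); // hypothetical file holding the JSON above
$s3->putBucketPolicy(array(
    'Bucket' => 'raja0034billing',
    'Policy' => $policy,
));
?>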
For option 2), selecting alarms and metrics, we have to set things up in CloudWatch, which displays both ECS and EBS metrics:
[Image: CloudWatch console showing metrics by service]
Selected Metrics could then include: 
[Image: selected CloudWatch metrics]
Note that the metrics are categorized by service. If you want EC2 and EBS metrics specifically, you would change the region to the one where your instances are:
[Image: CloudWatch region selector]
Notice we have changed from the East region to the West.
Then you would see the metrics we want to grab statistics for. We will do this periodically and save the results in a database as time-series data, for historical records and querying.
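
Before grabbing statistics, a quick way to see which metrics a region exposes is the ListMetrics call. A minimal sketch with the AWS SDK for PHP v3 (the AWS/EC2 namespace is just an example):

<?php
require 'vendor/autoload.php';

use Aws\CloudWatch\CloudWatchClient;

// Sketch only: list the EC2 metrics available in the chosen region.
$client = new CloudWatchClient(array('region' => 'us-west-2', 'version' => 'latest'));
$result = $client->listMetrics(array('Namespace' => 'AWS/EC2'));
foreach ($result['Metrics'] as $metric) {
    print $metric['Namespace'] . ' -- ' . $metric['MetricName'] . "\n";
}
?>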
#codingexercise
enum Direction { Diagonal, Horizontal, Vertical }

// Grows a rectangle of identical 1s (or 0s) with top-left corner (startx, starty)
// and current bottom-right corner (x, y); size accumulates the cell count.
// The assumed helpers AllElementsInNewRowAreSame / AllElementsInNewColAreSame
// check that every element of the newly added row or column equals v.
void SizeOf1or0OnlyRectangle(int[,] matrix, int rows, int cols,
    int startx, int starty, int x, int y, ref int size, Direction dir)
{
    int v = matrix[startx, starty];
    if (dir == Direction.Diagonal && x + 1 < rows && y + 1 < cols &&
        AllElementsInNewRowAreSame(matrix, startx, starty, x + 1, y + 1, v) &&
        AllElementsInNewColAreSame(matrix, startx, starty, x + 1, y + 1, v))
    {
        size += (x + 1 - startx) + (y + 1 - starty) + 1; // new row + new column, shared corner counted once
        SizeOf1or0OnlyRectangle(matrix, rows, cols, startx, starty, x + 1, y + 1, ref size, dir);
    }
    if (dir == Direction.Horizontal && y + 1 < cols &&
        AllElementsInNewColAreSame(matrix, startx, starty, x, y + 1, v))
    {
        size += x - startx + 1; // one new column of the current height
        SizeOf1or0OnlyRectangle(matrix, rows, cols, startx, starty, x, y + 1, ref size, dir);
    }
    if (dir == Direction.Vertical && x + 1 < rows &&
        AllElementsInNewRowAreSame(matrix, startx, starty, x + 1, y, v))
    {
        size += y - starty + 1; // one new row of the current width
        SizeOf1or0OnlyRectangle(matrix, rows, cols, startx, starty, x + 1, y, ref size, dir);
    }
}

Friday, December 18, 2015

Today's post talks about AWS statistics gathering:
Historical data on the current values of metrics can be very useful, but AWS also provides the aggregates: Sum, Maximum, Minimum and Average.
If the metrics have continuous values and their sum is important, we can maintain a running total by taking the sum every interval. Typically this interval can be set in the call that grabs the statistics.
Since the calls are consecutive, their return values are non-overlapping and the summation is therefore straightforward.

function addCurrentToCumulative($current, $cumulative) {
    foreach ($current as $metric) {
        if (array_key_exists($metric['key'], $cumulative) == false) {
            $cumulative[$metric['key']] = array('key' => $metric['key'], 'value' => 0, 'units' => $metric['units']);
        }
        if (array_key_exists('cumulate', $metric) && $metric['cumulate']) {
            $cumulative[$metric['key']]['value'] += $metric['sum'];
        } else {
            $cumulative[$metric['key']]['value'] = $metric['avg'];
        }
    }
    return $cumulative;
}
function saveOrPrintCumulative($cumulative) {
    print "\n==========================================================\n";
    print " Aggregated Metrics\n";
    print "==========================================================\n";
    foreach ($cumulative as $item) {
        $key = $item['key'];
        $value = $item['value'];
        $units = $item['units'];
        print "Metric -- $key $value $units \n";
    }
}

// after every $interval minutes:
// add the current statistics to the cumulative total,
// then refresh current with grab_stats

$tablename = "Metrics";
$cumulative = array();
$current = grab_stats($client, $tablename);
$cumulative = addCurrentToCumulative($current, $cumulative);
sleep($interval * 60); // seconds
$current = grab_stats($client, $tablename);
$cumulative = addCurrentToCumulative($current, $cumulative);
saveOrPrintCumulative($cumulative);

Note that the forward-moving cumulation ensures that no overlaps or double counting are introduced.
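
Since the plan is to save these values in a database as time-series data, here is a minimal sketch with PDO and SQLite; the database file, table name and schema are all assumptions, and $cumulative is the array built above:

<?php
// Sketch only: persist cumulative metrics as time-series rows.
$db = new PDO('sqlite:metrics.db'); // hypothetical database file
$db->exec("CREATE TABLE IF NOT EXISTS metrics (ts TEXT, key TEXT, value REAL, units TEXT)");
$stmt = $db->prepare("INSERT INTO metrics (ts, key, value, units) VALUES (?, ?, ?, ?)");
foreach ($cumulative as $item) {
    $stmt->execute(array(date('c'), $item['key'], $item['value'], $item['units']));
}
?>

Querying this table by key and timestamp range then gives the historical view described above.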

Thursday, December 17, 2015

<?php 
require 'vendor/autoload.php'; 

use Aws\CloudWatch\CloudWatchClient; 


$key = "Your_key"; 
$secret = "Your_secret"; 
$region = "us-west-2";
$version ="latest";


// Use the us-west-2 region and latest version of each client. 
$sharedConfig = [
    'region'      => $region,
    'version'     => $version,
    'credentials' => array(
        'key'    => $key,
        'secret' => $secret,
    ),
];


// Create an SDK class used to share configuration across clients. 
$sdk = new Aws\Sdk($sharedConfig); 
$client = $sdk->createCloudWatch(); 


function grabber($client, $tablename, $metric) { 
 $output = array(); 
 $results = $client->getMetricStatistics(array( 
   'Namespace'  => 'AWS/ECS', 
   'MetricName' => $metric, 
   'Dimensions' => array( 
     array( 

       'Name' => 'TableName', 
       'Value' => $tablename, 
     ), 
   ), 
   'StartTime'  => strtotime('-1 days'), //'-'.$interval.' minutes'), 
   'EndTime'    => strtotime('now'), 
   'Period'     => 300, 
   'Statistics' => array('Minimum', 'Maximum', 'Average', 'Sum'), 
 )); 
 echo 'RESULTS='.serialize($results); 
print "-------------------------------------------\n"; 
 print "    $metric\n"; 
 print "-------------------------------------------\n"; 
 foreach ($results['Datapoints'] as $item) {
    $min  = $item['Minimum'];
    $max  = $item['Maximum'];
    $avg  = $item['Average'];
    $sum  = $item['Sum'];
    $time = $item['Timestamp'];
    print "$time -- min $min, max $max, avg $avg, sum $sum\n";
    array_push($output, array('key' => $time, 'min' => $min, 'max' => $max, 'avg' => $avg, 'sum' => $sum, 'cumulate' => true, 'units' => 'count'));
 }
 return $output;
}
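
A hypothetical invocation, combining this grabber with the cumulation routine from the post above (the metric name is illustrative only):

$tablename = 'Metrics';
$current = grabber($client, $tablename, 'CPUUtilization'); // example metric name, not prescriptive
$cumulative = addCurrentToCumulative($current, array());
saveOrPrintCumulative($cumulative);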



To detect whether a rectangle of contiguous 1s or 0s exists, and if so its size, given the position of its top-left corner in a 2D array of 1s and 0s, we use the following approach:

starting at this point as the top left, we walk the rectangle by growing the diagonal, the bottom side, or the right side, making sure all newly encountered elements within the new bounds have the same value.

Wednesday, December 16, 2015

int GetSizeOfGreatestRectangleOf1or0(int[,] binaries, int rows, int cols)
{
   // maps the linear index of a rectangle's top-left corner to that of its bottom-right corner
   var rectangles = new SortedList<int, int>();
   for (int k = 0; k < rows * cols; k++)
   {
         int row = k / cols;
         int col = k % cols;
         bool inRectangle = false;

         for (int l = 0; l < rectangles.Count; l++)
         {
             int startrow = rectangles.Keys[l] / cols;
             int startcol = rectangles.Keys[l] % cols;
             int endrow = rectangles.Values[l] / cols;
             int endcol = rectangles.Values[l] % cols;
             if (startrow <= row && row <= endrow && startcol <= col && col <= endcol)
             {
                Debug.Assert(binaries[startrow, startcol] == binaries[row, col]);
                // already part of a discovered rectangle
                inRectangle = true;
                break;
             }
         }

         if (inRectangle)
         {
             continue;
         }

         // add a rectangle here if one exists, starting at this point as the top left:
         // walk the rectangle by growing the diagonal, the bottom side, or the right side,
         // making sure all newly encountered elements within the new bounds have the same value.
   }

   // with all rectangles found, return the maximum size
   int size = 0;
   for (int l = 0; l < rectangles.Count; l++)
   {
     int startrow = rectangles.Keys[l] / cols;
     int startcol = rectangles.Keys[l] % cols;
     int endrow = rectangles.Values[l] / cols;
     int endcol = rectangles.Values[l] % cols;
     Debug.Assert(endcol > startcol && endrow > startrow);
     int cursize = (endcol - startcol + 1) * (endrow - startrow + 1);
     if (cursize > size)
     {
         size = cursize;
     }
   }
   return size;
}

Tuesday, December 15, 2015

Today we cover the distributed mini-batch algorithm. Previously we were discussing the serial algorithm. We now evaluate the phi update rule in a distributed environment. The technique resembles the serial mini-batch algorithm and retains some of its steps, except that it runs in parallel on each of the k nodes in the network; the following illustrates the overall workflow. We specify a batch size b and assume that k divides both b and mu, where mu is the number of additional inputs that arrive at the nodes while communication is in flight.
This means that each batch contains b + mu consecutive inputs. During each batch j, all of the nodes use a common predictor w_j. During the first b inputs, the nodes calculate and accumulate the stochastic gradients of the loss function f at w_j. Once the nodes have accumulated b gradients altogether, they start a distributed vector-sum operation to calculate the sum of these b gradients. While the vector sum completes in the background, mu additional inputs arrive and the system keeps processing them using the same predictor w_j. These inputs are processed, but their gradients are discarded; this waste can be made negligible by choosing appropriate values for b. As a rule of thumb, the square root of the total number m of inputs is a number around which we can set b, because that is roughly the minimum we need to process to get a feel for the batches. Note that the mini-batch cannot be as large as half the full batch, since at that size it offers no significant improvement in iterations; it has to be smaller. But it cannot shrink to a single sample either, since the mini-batch is a hybrid between stochastic and full-batch processing.
When the vector-sum operation completes, each node holds the sum of the b gradients collected during batch j. Each node divides this sum by b to obtain the average gradient, and uses this average gradient in the update rule phi, which results in a synchronized predictor for the next batch. Therefore, during batch j each node processes (b + mu)/k inputs with the current predictor, but only the first b/k gradients are used to compute the next predictor. All of the b + mu inputs are used to calculate the regret measure.
This is not a no-communication parallelization. In fact, we use the communication precisely to minimize the degradation we would suffer with no communication at all.
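A toy, single-process sketch of one batch round may make the workflow concrete. Everything here is an illustrative assumption: a scalar predictor, squared loss so the gradient is 2(w - x), and plain gradient descent standing in for the phi update rule:

<?php
// Toy simulation of one distributed mini-batch round with k nodes.
function grad($w, $x) { return 2 * ($w - $x); } // gradient of (w - x)^2

$k   = 4;    // number of nodes
$b   = 8;    // gradients used per batch (k divides b)
$mu  = 4;    // extra inputs that arrive while the vector sum is in flight
$eta = 0.1;  // learning rate for the stand-in phi rule
$w   = 0.0;  // common predictor w_j shared by all nodes

$inputs = array(); // b + mu consecutive inputs for batch j
for ($i = 0; $i < $b + $mu; $i++) { $inputs[] = mt_rand(0, 100) / 100.0; }

// each node accumulates b/k stochastic gradients at the shared predictor w_j
$nodeSums = array_fill(0, $k, 0.0);
for ($i = 0; $i < $b; $i++) {
    $nodeSums[$i % $k] += grad($w, $inputs[$i]);
}
// the remaining mu inputs are processed with w_j, but their gradients are discarded

// distributed vector sum (here a plain reduction), then the average over b
$avgGrad = array_sum($nodeSums) / $b;

// phi update: every node applies the same step, so the predictors stay synchronized
$w = $w - $eta * $avgGrad;
print "next predictor w = $w\n";
?>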
#problemsolving
Given a 2D array of 1s and 0s, find the largest rectangle (not necessarily a square) that is made up entirely of 1s or entirely of 0s.
The straightforward solution is to find the size of the rectangle that the current element is part of and to return the maximum such size encountered. To detect the rectangle, we can walk along its boundary if one exists. An optimization is to keep track of all the top-left and bottom-right corner pairs of rectangles already encountered, and to skip an element when it is contained in one of them. These rectangles can be kept sorted in the order of their top-left corners, so we check only the rectangles whose top left is earlier than the current element.