Monday, December 8, 2014


We resume our discussion of the WRL system's cache analysis program.  It was found that when simulating split instruction and data first-level caches and a direct-mapped second-level cache, analysis plus trace generation takes about 100 times as long as untraced execution.  For example, tracing and analyzing a run of TV, a WRL timing verification program, extends the runtime from 25 minutes to 45 hours.

Saving traces for later analysis:
On-the-fly analysis does not require storage. When storage was required, the data was put on tapes for the sake of capacity and portability.  Data was written onto the tape using the Mache scheme. A differencing technique was used to write the data: the original trace data was first converted into a cache difference trace consisting of reference points and differences.  Addresses are represented as a difference from the previous address, rather than as an absolute address, which exposes patterns in the trace reference string. The cache difference trace is then compressed: common sequences of data are translated into single code words. Reference locality and the regularity of trace entry formats allow trace data to be compressed very effectively.
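As a rough sketch of the differencing step (the entry format here is hypothetical, not the WRL on-tape layout), addresses can be rewritten as deltas from the previous address, leaving a string of small, repetitive values that a dictionary-style compressor squeezes well:

```python
def to_difference_trace(addresses):
    """Convert absolute addresses into a reference point plus deltas."""
    if not addresses:
        return []
    out = [addresses[0]]                 # first entry is a reference point
    for prev, cur in zip(addresses, addresses[1:]):
        out.append(cur - prev)           # small, repetitive deltas
    return out

def from_difference_trace(diffs):
    """Reverse the transformation to recover the original trace."""
    if not diffs:
        return []
    out = [diffs[0]]
    for d in diffs[1:]:
        out.append(out[-1] + d)
    return out

trace = [4096, 4100, 4104, 4108, 8192, 8196]   # sequential fetches, then a jump
diffs = to_difference_trace(trace)
assert from_difference_trace(diffs) == trace
```

Sequential instruction fetches become runs of identical small deltas, which is exactly the regularity the compression step exploits.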

We next review the base-case cache organizations. A number of assumptions were made about the machine used for the analysis.
The only assumption related to the generation of traces is that of a RISC architecture in which instruction fetches and explicit data loads and stores are the only forms of memory reference. This makes the simple form of the traces useful even though they contain no information about instruction types other than the loads and the stores.
Among the assumptions made about the machine, one is that it is a pipelined architecture: if there were no cache misses, a new instruction would start on every cycle. This is possible when the page offset is used simultaneously to retrieve the real page from the TLB and to index the correct line in the cache. If there are no delays other than those caused by cache misses, the base cost of an instruction is one cycle, so the goal of the memory hierarchy design is to keep the CPI close to 1.
Another assumption is that the machine is fast enough to require two levels of cache. The cost of going to the second-level cache was taken to be about 12 cycles.
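These assumptions can be turned into a back-of-the-envelope CPI model. The base cost of one cycle and the roughly 12-cycle second-level access come from the text; the references-per-instruction figure, the miss ratios, and the memory-access cost below are made-up illustrative values.

```python
def effective_cpi(refs_per_instr, l1_miss_ratio, l1_miss_cost,
                  l2_miss_ratio, l2_miss_cost, base=1.0):
    """CPI = base cost + per-reference miss cycles, filtered level by level."""
    l1_contrib = refs_per_instr * l1_miss_ratio * l1_miss_cost
    l2_contrib = refs_per_instr * l1_miss_ratio * l2_miss_ratio * l2_miss_cost
    return base + l1_contrib + l2_contrib

# 1.3 memory references per instruction (fetch plus some loads/stores),
# 5% first-level misses costing 12 cycles, 10% of those missing again
# and costing 100 cycles to go to memory -- all illustrative numbers.
cpi = effective_cpi(1.3, 0.05, 12, 0.10, 100)
# 1 + 0.78 + 0.65 = 2.43
```

With no misses at all the model collapses to the base cost of 1, which is the target the memory hierarchy design tries to stay close to.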

These assumptions are in line with the discussions on this topic with the hardware engineers who built the WRL system.

The first-level data cache is a write-through cache; a four-entry FIFO queues up writes to the second level.
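A toy model of that four-entry write buffer: stores are queued and drained to the second level, and when the queue is full the processor would have to stall until a slot frees up. The entry format, drain policy, and stall accounting here are illustrative, not taken from the WRL hardware.

```python
from collections import deque

class WriteBuffer:
    """Four-entry FIFO between a write-through L1 and the second-level cache."""
    def __init__(self, capacity=4):
        self.capacity = capacity
        self.queue = deque()
        self.stalls = 0

    def write(self, addr, value):
        if len(self.queue) == self.capacity:
            self.stalls += 1          # processor would stall here
            self.queue.popleft()      # model one entry draining to L2
        self.queue.append((addr, value))

buf = WriteBuffer()
for i in range(6):                    # six back-to-back stores
    buf.write(0x1000 + 4 * i, i)
# four stores fit in the buffer; the remaining two each force a drain
```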

#codingexercise
decimal getStdDev(decimal[] a)
{
    if (a == null) return 0;
    return a.StdDev();
}


Sunday, December 7, 2014

In today's post we continue discussing the WRL trace generation system. We discussed a verification technique for the correctness of the trace data.  First, the simulation results were compared with those of an existing instruction-level simulator. Second, to check the correctness of the information created by the trace code, synchronization had to be verified. Two techniques were used for this purpose: one, a program that executed a simple tight infinite loop was run to generate a recognizable pattern, and a variant of the analysis program then checked whether anything was missing or out of place; two, the trace entry generated by the operating system on entry to the kernel was extended to include a sequence number that was incremented on each kernel entry.
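The second technique boils down to a simple gap check over the kernel-entry sequence numbers: any number that does not advance by one marks a spot where trace entries were lost. The entry layout here is hypothetical, not the WRL trace format.

```python
def find_gaps(kernel_entries):
    """Return positions where the sequence number does not advance by one."""
    gaps = []
    for i, (prev, cur) in enumerate(zip(kernel_entries, kernel_entries[1:])):
        if cur != prev + 1:
            gaps.append(i + 1)
    return gaps

assert find_gaps([1, 2, 3, 4]) == []        # seamless trace
assert find_gaps([1, 2, 5, 6]) == [2]       # entries 3 and 4 were lost
```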
We next look at a cache analysis program. The requirements were that it be flexible: it should be easy to change from run to run; it should not take up so much memory that it causes paging overhead; and it should be as fast as possible, given that analysis takes much more time than trace generation. Constants whose values were specified at compile time defined the number of cache levels, split or integrated data and instruction caches, degree of associativity, number of cache lines, line size, write policy, and cost of hit and miss (in units of processor cycles). A version of this program is compiled for each cache configuration, and there can even be a fully associative version of the second-level cache.
The program writes its output at intervals that can be based on the number of instructions, trace entries, or memory references encountered. The data written includes both interval and cumulative summaries: for each cache, the number of references, the miss ratio, the contribution to cycles per instruction, and the percentage of misses that resulted from user/kernel conflicts; the contribution to the CPI of second-level compulsory misses (first-time references); the second-level cache's contribution to the CPI if it were fully associative; and, for user mode and kernel mode separately, the number of cycles executed and the CPI, broken down into instruction, load, and store contributions.
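In the same spirit, here is a minimal direct-mapped miss-ratio simulator. The text fixes the cache geometry at compile time; Python has no compile-time constants, so plain module constants stand in, and the sizes below are made up.

```python
LINE_SIZE = 16          # bytes per line (illustrative)
NUM_LINES = 64          # direct mapped: one line per set (illustrative)

def miss_ratio(addresses):
    """Simulate a direct-mapped cache and report the miss ratio."""
    tags = [None] * NUM_LINES
    misses = 0
    for addr in addresses:
        line = addr // LINE_SIZE
        index = line % NUM_LINES         # which cache line the address maps to
        tag = line // NUM_LINES          # which block currently owns that line
        if tags[index] != tag:
            misses += 1
            tags[index] = tag
    return misses / len(addresses)

# sequential references within a line hit after the first access to the line
ratio = miss_ratio([0, 4, 8, 12, 16, 20])   # two line fills out of six refs
```

Compiling one specialized version per configuration, as the text describes, trades binary count for the speed of having the geometry constant-folded into the inner loop.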
#codingexercise
decimal getMedian(decimal[] a)
{
    if (a == null) return 0;
    return a.Median();
}
#codingexercise
decimal getMode(decimal[] a)
{
    if (a == null) return 0;
    return a.Mode();
}


Friday, December 5, 2014

#codingexercise
decimal getAvg(decimal[] a)
{
    if (a == null) return 0;
    return a.Avg();
}

In this post, we continue our study of the WRL trace system and review interrupt timing. Traces may not accurately represent normal execution because code segments are executed at different times and therefore possibly in a different order. Slowdown in execution means that various interrupts happen sooner or more frequently than they would without tracing. If user code does asynchronous I/O or relies on signals, more trace code is executed before and after the completion of an I/O operation.
Another such concern is that the additional timer interrupts are not only traced but also affect the rate at which process switching is mandated.  For example, if there are more context switches, the time slice for actual user code execution is reduced. To overcome this, we could increase the quantum associated with traced processes. Similarly, if the additional timer interrupts are a problem, we could account for them in the analysis program rather than modify the trace data.

Since less user code is executed between process switches, the traced processes make slower progress. This affects the accuracy of multiple-process traces. To account for this, the process switch interval was increased. This does not affect interrupts, since they still need to be recognized immediately.

The process switch interval was increased by a factor of 8 to reflect actual switching. This corresponded to approximately 200,000 user instructions between switches. To simulate faster machines, the process switch interval was doubled.  The average number of instructions executed between process switches for the multi-process benchmark was about 175,000, because the quantum is measured in cycles rather than instructions.
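The gap between the intended ~200,000 and the observed ~175,000 instructions per switch falls directly out of the quantum being measured in cycles: instructions per quantum is quantum cycles divided by CPI. The quantum value below is chosen to reproduce that arithmetic and is not given in the text.

```python
def instructions_per_quantum(quantum_cycles, cpi):
    """A quantum measured in cycles covers fewer instructions as CPI rises."""
    return quantum_cycles / cpi

quantum = 200_000   # cycles, sized for 200,000 instructions at an ideal CPI of 1
assert instructions_per_quantum(quantum, 1.0) == 200_000
# at a CPI of about 1.14, the same quantum covers only ~175,000 instructions
assert round(instructions_per_quantum(quantum, 200_000 / 175_000)) == 175_000
```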

In the comparison with simulation results, we were fortunate to have an original simulator for the WRL system that could print out traces for short, real user programs. The WRL system was modified to write out traces in a form similar to that produced by the simulator. The resulting trace files were then compared, as were the numbers of cache hits and misses.

In order to check the correctness of the information created by the trace code, it was necessary to ensure that kernel and user access to the trace buffer was synchronized and that there were no gaps in the traces produced while the analysis program runs. To verify synchronization, two tests were used. First, a program was run infinitely in a simple tight loop and the resulting trace pattern was checked for consistency. Second, a sequence number was added to the trace entry on each entry into the kernel. Again the pattern was found to be consistent.
Today we continue discussing the WRL system for trace generation. We look into the accuracy of trace data and how to eliminate gaps for seamless traces. An accurate address trace represents the actual run-time sequence of memory references made by a set of trace-linked programs, but trace data may be distorted by a variety of causes: for example, code that is missed by tracing during normal execution, code that is executed at a different time in a traced system, or code that is executed only in the tracing system.
Factors affecting the distortion are the design of the tracing mechanism, the slowdown caused by the tracing, and the interruption of the traced programs in order to execute trace extraction and analysis code.
For a trace to accurately represent real execution, there must be no gaps in the trace. These gaps may be introduced from interruptions during trace extraction and analysis.
If the user program is not time-dependent, there are no issues with seamlessness: code can be stopped and resumed from the same point once the trace buffer has been emptied.  But this is not the case when the kernel is being traced. The kernel cannot be prevented from executing. Moreover, the trace should include code that services interrupts belonging to the traced process. In such cases, we have to carefully hand-place tracing code that turns off tracing prior to the execution of the analysis program.
Another inaccuracy results from tracing code that is solely an artifact of tracing. This happens only during kernel tracing and not during user-mode tracing, because in user program execution the only extra code is the trace code itself. As an example, a TLB fault might occur whenever a user process first references a new page of the trace buffer. This would be captured in a kernel trace of the TLB fault handler. If the trace of the fault handler was sufficiently long, the next user reference to the trace buffer could be to the next page in the buffer, causing another TLB fault.  This would create a nasty loop that fills the trace buffer with fault-handling traces.
To solve this problem, we cannot merely turn off tracing when the user's trace code is entered, because there may be other types of interrupts that need to be traced, and multiple interrupts can occur on a single trip to the kernel. To decide selectively what to trace, the tracing assembler code is instrumented to turn off tracing whenever a TLB fault occurs on an address that is within the range of the trace buffer.
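That selective decision reduces to an address-range check: a TLB fault whose address falls inside the trace buffer is an artifact of tracing and should not be traced. The buffer bounds below are illustrative, not the WRL layout.

```python
TRACE_BUF_START = 0x4000_0000   # illustrative trace buffer bounds
TRACE_BUF_END   = 0x4010_0000   # exclusive

def should_trace_tlb_fault(fault_addr):
    """Trace the fault only if it was not caused by the trace buffer itself."""
    return not (TRACE_BUF_START <= fault_addr < TRACE_BUF_END)

assert should_trace_tlb_fault(0x1000_0000)       # ordinary user fault: trace it
assert not should_trace_tlb_fault(0x4000_2000)   # trace-buffer fault: skip it
```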
Another cause of trace-induced TLB faults is the modification of the TLB during trace extraction and analysis. Execution of the analysis code can substantially modify the TLB; after the analysis program has run, the traced user process would normally have to refault in its relevant translations. To avoid this, the TLB state is saved before the analysis program runs, and after the analysis the contents of the TLB are restored prior to turning tracing back on.
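Abstractly, that save/restore discipline looks like the following sketch, with the TLB modeled as a plain dictionary of virtual-to-physical translations (real TLB state is hardware, not something Python can touch):

```python
def run_analysis_with_tlb_restore(tlb, analysis):
    """Snapshot the TLB, let the analysis pollute it, then restore it."""
    saved = dict(tlb)        # snapshot before the analysis program runs
    analysis(tlb)            # analysis may evict or replace entries
    tlb.clear()
    tlb.update(saved)        # restore prior contents before tracing resumes

tlb = {0x1000: 0x9000, 0x2000: 0xA000}
run_analysis_with_tlb_restore(tlb, lambda t: t.update({0x7000: 0xB000}))
# the traced process sees its original translations and does not refault
assert tlb == {0x1000: 0x9000, 0x2000: 0xA000}
```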

#codingexercise
decimal getSum(decimal[] a)
{
    if (a == null) return 0;
    return a.Sum();
}

Thursday, December 4, 2014



#codingexercise
decimal getMin(decimal[] a)
{
    if (a == null) return 0;
    return a.Min();
}


#codingexercise
decimal getMax(decimal[] a)
{
    if (a == null) return 0;
    return a.Max();
}

Simple OAuth implementation

An OAuth server implements one or more of the following workflows:

1) Client Credential grant
2) Implicit grant
3) Authorization code grant
4) Refresh token grant

1) is for guest access without any user context
2) is less secure but supports the password pattern
3) is the typical OAuth workflow, supporting the anti-password pattern
4) is used to refresh the code/token after it expires

The simplest case is to issue a bearer token with an indefinite expiration time via method 2) for user resource access; it is also the least secure.

For more details, or for accuracy, please refer to the OAuth 2.0 RFC (RFC 6749).

# implicit workflow
# GET /oauth/authorize/v1?client_id=CarouselTest1&response_type=token&username=foo&password=bar HTTP/1.1
# HTTP/1.1 200 OK
# <meta http-equiv="refresh" content="0;url=http://127.0.0.1#expires_in=86399945&token_type=bearer&access_token=ACCESS_TOKEN">

# authorization workflow
# auth_code request
# GET /oauth/authorize/?client_id=BCTest1&redirect_uri=http%3A%2F%2F127.0.0.1:8888/callback&scope=openid HTTP/1.1
# Content-Type : application/x-www-form-urlencoded
# HTTP/1.1 302 Moved Temporarily
# Location: http://127.0.0.1/oauth/?code=AUTH_CODE

# access_token request
# POST /oauth/token/v1 HTTP/1.1
# Content-Type: application/x-www-form-urlencoded
# grant_type=authorization_code&client_id=THE_CLIENT_ID&client_secret=THE_CLIENT_SECRET&code=AUTH_CODE
# HTTP/1.1 200 OK
{"token_type":"bearer","expires_in":86399945,"refresh_token":"REFRESH_TOKEN","access_token":"ACCESS_TOKEN"}


# refresh token request
# POST /oauth/token/v1 HTTP/1.1
# Content-Type: application/x-www-form-urlencoded
# grant_type=refresh_token&client_id=THE_CLIENT_ID&client_secret=THE_CLIENT_SECRET&refresh_token=REFRESH_TOKEN
# HTTP/1.1 200 OK
{"token_type":"bearer","expires_in":86399968,"refresh_token":"REFRESH_TOKEN","access_token":"ACCESS_TOKEN"}


# client_credential request
# POST /oauth/token/v1 HTTP/1.1
# Content-Type: application/x-www-form-urlencoded
# grant_type=access_token&client_id=THE_CLIENT_ID&client_secret=THE_CLIENT_SECRET
# HTTP/1.1 200 OK
{"token_type":"bearer","expires_in":86399945,"refresh_token":"REFRESH_TOKEN","access_token":"ACCESS_TOKEN"}


This is implemented as:

# coding: utf-8

from datetime import datetime, timedelta
from flask import Flask
from flask import session, request
from flask import make_response
from flask import render_template, redirect, jsonify
from flask_sqlalchemy import SQLAlchemy
from werkzeug.security import gen_salt
from flask_oauthlib.provider import OAuth2Provider
import random
import string

app = Flask(__name__, template_folder='templates')
app.debug = True
app.secret_key = 'notasecret'
app.config.update({
    'SQLALCHEMY_DATABASE_URI': 'mysql://root:P@ssword@127.0.0.1/oauth',
})
db = SQLAlchemy(app)
#oauth = OAuth2Provider(app)

@app.route('/oauth/token/', methods=['POST'])
#@oauth.token_handler
def access_token():
    # token requests arrive as POSTed application/x-www-form-urlencoded bodies,
    # so parameters are read from the form rather than the query string
    client_id = request.form.get('client_id', '')
    client_secret = request.form.get('client_secret', '')
    if not client_id.strip() or not client_secret.strip():  # ideally we check registered clients
        return make_response(jsonify({'error': 'invalid client'}), 400)
    grant_type = request.form.get('grant_type', '')
    if grant_type == 'authorization_code':
        code = request.form.get('code', '')  # ideally we check the code issued
        if not code.strip():
            return jsonify({'error': 'invalid code'}), 400
        refresh_token = ''.join([random.choice('0123456789ABCDEF') for x in range(0, 16)])  # ideally we save tokens issued
        access_token = ''.join([random.choice('0123456789ABCDEF') for x in range(0, 16)])  # ideally we save tokens issued
        return jsonify({"token_type": "bearer", "expires_in": 3600, "refresh_token": refresh_token, "access_token": access_token}), 200
    if grant_type == 'access_token':
        # guest access token
        refresh_token = ''.join([random.choice('0123456789ABCDEF') for x in range(0, 16)])  # ideally we save tokens issued
        access_token = ''.join([random.choice('0123456789ABCDEF') for x in range(0, 16)])  # ideally we save tokens issued
        return jsonify({"token_type": "bearer", "expires_in": 3600, "refresh_token": refresh_token, "access_token": access_token}), 200
    if grant_type == 'refresh_token':
        refresh_token = request.form.get('refresh_token', '')
        if not refresh_token.strip():
            return make_response(jsonify({'error': 'invalid refresh_token'}), 400)
        refresh_token = ''.join([random.choice('0123456789ABCDEF') for x in range(0, 16)])  # ideally we save tokens issued
        access_token = ''.join([random.choice('0123456789ABCDEF') for x in range(0, 16)])  # ideally we save tokens issued
        return jsonify({"token_type": "bearer", "expires_in": 3600, "refresh_token": refresh_token, "access_token": access_token}), 200
    return make_response(jsonify({'error': 'invalid grant_type'}), 400)

@app.route('/oauth/authorize/', methods=['GET'])
#@oauth.authorize_handler
def authorize(*args, **kwargs):
    response_type = request.args.get('response_type', '')
    if response_type == 'token':
        # implicit
        username = request.args.get('username', '')
        password = request.args.get('password', '')
        if username.strip() and password.strip():  # ideally we check a membership provider
            access_token = ''.join([random.choice('0123456789ABCDEF') for x in range(0, 16)])  # ideally we save tokens issued
            return make_response(jsonify({'url': 'http://127.0.0.1/', 'token_type': 'bearer', 'access_token': access_token, 'expires_in': 3600}), 200)
        return make_response(jsonify({'error': 'invalid credentials'}), 400)
    else:
        # auth code request
        client_id = request.args.get('client_id', '')  # ideally we check registered clients
        if not client_id.strip():
            return make_response(jsonify({'error': 'invalid client'}), 400)
        redirect_uri = request.args.get('redirect_uri', '')  # ideally we check the registered callback uri
        if not redirect_uri.strip():
            return make_response(jsonify({'error': 'invalid redirect_uri'}), 400)
        auth_code = ''.join([random.choice(string.digits + string.ascii_uppercase) for x in range(0, 12)])
        redirect_uri += "?code=" + auth_code  # ideally we check for an existing query string
        return redirect(redirect_uri)

if __name__ == '__main__':
    app.run(port=8888)

Usually we don't send the client secret over the wire. Instead, we could use something like this:
# Here we simply propose the use of an OTP token to sign and make a request.
# implemented in totp.py discussed earlier as the OneTimePasswordAlgorithm ()
# for more detail, please visit : http://1drv.ms/13YBENz and code at https://github.com/ravibeta/apisecurity-1.0
otp = OneTimePasswordAlgorithm()
import hashlib
otp.generateTOTP(client_secret, epoch_time, '6', hashlib.sha256)

'857652'
StringToSign = ?timestamp=23453231&token=857652
Put the HMAC_SHA1 of the above in the authorization header
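A sketch of that signing step with Python's standard library: compute the HMAC-SHA1 of the string to sign under the client secret and place the digest in the authorization header. The header name and format below are assumptions; the text only says the HMAC goes in the authorization header.

```python
import hashlib
import hmac

def sign_request(client_secret, string_to_sign):
    """Return the hex HMAC-SHA1 of string_to_sign keyed by the client secret."""
    mac = hmac.new(client_secret.encode(), string_to_sign.encode(), hashlib.sha1)
    return mac.hexdigest()

string_to_sign = "?timestamp=23453231&token=857652"
signature = sign_request("THE_CLIENT_SECRET", string_to_sign)
headers = {"Authorization": "HMAC " + signature}  # hypothetical header format
```

The server, which also knows the client secret, recomputes the HMAC over the same string and compares; the secret itself never travels over the wire.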

Today we continue our discussion of the WRL system. We saw how the trace was controlled, the format of the trace buffer entries, the store and analyze operations on the trace, and how simulation was used to study different cache organizations. We now review the accuracy of the trace data and the elimination of gaps for seamless traces.
One improvement to this system could have been to add time statistics (elapsed time, worker time) to each block of traced program execution between context switches.
#codingexercise
decimal GetDistinctRangeVariance(int[] A)
{
    if (A == null) return 0;
    return A.DistinctRangeVariance();
}