Friday, December 5, 2014

Today we continue discussing the WRL system for trace generation. We look into the accuracy of the trace data and how to eliminate gaps so that traces are seamless. An accurate address trace represents the actual run-time sequence of memory references made by a set of trace-linked programs, but trace data may be distorted by a variety of causes. For example, code may be missed from tracing during normal execution, code may be executed at a different time in the traced system, or code may be executed in the tracing system that would not execute at all otherwise.
Factors affecting this distortion include the design of the tracing mechanism, the slowdown caused by tracing, and the interruption of the traced programs in order to execute trace extraction and analysis code.
For a trace to accurately represent real execution, there must be no gaps in it. Gaps may be introduced by the interruptions for trace extraction and analysis.
If the user program is not time-dependent, seamlessness is not an issue: the program can be stopped and resumed from the same point once the trace buffer has been emptied. This is not the case when the kernel is being traced, because the kernel cannot be prevented from executing. Moreover, the trace should include the code that services interrupts belonging to the traced process. In such cases, tracing code has to be placed carefully by hand so that tracing is turned off prior to the execution of the analysis program.
Another inaccuracy results from tracing code that is solely an artifact of tracing. This happens only during kernel tracing and not during user-mode tracing, because in the user program's execution the only extra code is the trace code itself. As an example, a TLB fault might occur whenever a user process first references a new page of the trace buffer. This would be captured in a kernel trace of the TLB fault handler. If the trace of the fault handler was sufficiently long, the next user reference to the trace buffer could be to the next page in the buffer, causing another TLB fault. This would create a nasty loop that fills the trace buffer with fault-handler traces.
To solve this problem, we cannot merely turn off tracing when the user's trace code is entered, because there may be other types of interrupts that need to be traced, and multiple interrupts can occur on a single trip into the kernel. To decide selectively what to trace, the hand-instrumented tracing code in the assembly routines turns off tracing whenever a TLB fault occurs on an address that is within the range of the trace buffer.
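The check described above amounts to comparing the faulting address against the bounds of the trace buffer. The following is a minimal Python sketch of the decision only (the real logic lives in the hand-instrumented assembly routines, and the buffer bounds here are purely illustrative):

TRACE_BUF_START = 0x40000000  # illustrative bounds of the shared trace buffer
TRACE_BUF_END = 0x60000000

def should_trace_tlb_fault(fault_address):
    # Suppress tracing of TLB faults that are artifacts of tracing, i.e. faults
    # on pages of the trace buffer itself; all other faults remain traced.
    return not (TRACE_BUF_START <= fault_address < TRACE_BUF_END)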
Another cause of trace-induced TLB faults is the modification of the TLB during trace extraction and analysis. Execution of the analysis code could substantially modify the TLB, and after the analysis program has run, the traced user program would normally have to fault its relevant translations back in. To avoid this, the TLB state is saved before running the analysis program, and after the analysis the contents of the TLB are restored prior to turning tracing back on.

#coding exercise

decimal getSum(decimal[] a)
{
    if (a == null) return 0;
    return a.Sum();
}

Thursday, December 4, 2014



#coding exercise
decimal getMin(decimal[] a)
{
    if (a == null) return 0;
    return a.Min();
}


#coding exercise
decimal getMax(decimal[] a)
{
    if (a == null) return 0;
    return a.Max();
}

Simple OAuth implementation

An OAuth server implements one or more of the following workflows:

1) Client Credential grant
2) Implicit grant
3) Authorization code grant
4) Refresh token grant

1) is for guest access without any user context
2) is less secure but supports password-based patterns
3) is the typical OAuth workflow and supports the anti-password pattern
4) is used to refresh the access token after it expires

The simplest case, and the least secure, is to issue a bearer token with an indefinite expiration time via method 2) for user resource access.

For more details or for accuracy, please refer to the OAuth 2.0 RFC (RFC 6749).

# implicit workflow
# GET /oauth/authorize/v1?client_id=CarouselTest1&response_type=token&username=foo&password=bar HTTP/1.1
# HTTP/1.1 200 OK
# <meta http-equiv="refresh" content="0;url=http://127.0.0.1#expires_in=86399945&token_type=bearer&access_token=ACCESS_TOKEN">

# authorization workflow
# auth_code request
# GET /oauth/authorize/?client_id=BCTest1&redirect_uri=http%3A%2F%2F127.0.0.1:8888/callback&scope=openid HTTP/1.1
# Content-Type : application/x-www-form-urlencoded
# HTTP/1.1 302 Moved Temporarily
# Location: http://127.0.0.1/oauth/?code=AUTH_CODE

# access_token request
# POST /oauth/token/v1 HTTP/1.1
# Content-Type: application/x-www-form-urlencoded
# grant_type=authorization_code&client_id=THE_CLIENT_ID&client_secret=THE_CLIENT_SECRET&code=AUTH_CODE
# HTTP/1.1 200 OK
{"token_type":"bearer","expires_in":86399945,"refresh_token":"REFRESH_TOKEN","access_token":"ACCESS_TOKEN"}


# refresh token request
# POST /oauth/token/v1 HTTP/1.1
# Content-Type: application/x-www-form-urlencoded
# grant_type=refresh_token&client_id=THE_CLIENT_ID&client_secret=THE_CLIENT_SECRET&refresh_token=REFRESH_TOKEN
# HTTP/1.1 200 OK
{"token_type":"bearer","expires_in":86399968,"refresh_token":"REFRESH_TOKEN","access_token":"ACCESS_TOKEN"}


# client_credential request
# POST /oauth/token/v1 HTTP/1.1
# Content-Type: application/x-www-form-urlencoded
# grant_type=client_credentials&client_id=THE_CLIENT_ID&client_secret=THE_CLIENT_SECRET
# HTTP/1.1 200 OK
#{"token_type":"bearer","expires_in":86399945,"refresh_token":"REFRESH_TOKEN","access_token":"ACCESS_TOKEN"}


This is implemented as:

# coding: utf-8

from datetime import datetime, timedelta
from flask import Flask
from flask import session, request
from flask import make_response
from flask import render_template, redirect, jsonify
from flask_sqlalchemy import SQLAlchemy
from werkzeug.security import gen_salt
from flask_oauthlib.provider import OAuth2Provider
import random
import string

app = Flask(__name__, template_folder='templates')
app.debug = True
app.secret_key = 'notasecret'
app.config.update({
    'SQLALCHEMY_DATABASE_URI': 'mysql://root:P%40ssword@127.0.0.1/oauth',  # '@' in the password must be percent-encoded
})
db = SQLAlchemy(app)
#oauth = OAuth2Provider(app)

@app.route('/oauth/token/', methods=['POST'])
#@oauth.token_handler
def access_token():
    # token requests are form-encoded POSTs, so read parameters from the body as well as the query string
    client_id = request.values.get('client_id', '')
    client_secret = request.values.get('client_secret', '')
    if not client_id.strip() or not client_secret.strip():  # ideally we check registered clients
       return make_response(jsonify({'error':'invalid client'}), 400)
    grant_type = request.values.get('grant_type', '')
    if grant_type == 'authorization_code':
       code = request.values.get('code', '') # ideally we check code issued
       if not code.strip():
          return jsonify({'error':'invalid code'}), 400
       refresh_token = ''.join([random.choice('0123456789ABCDEF') for x in range(0, 16)]) # ideally we save tokens issued
       access_token = ''.join([random.choice('0123456789ABCDEF') for x in range(0, 16)]) # ideally we save tokens issued
       return jsonify({"token_type":"bearer","expires_in":3600,"refresh_token":refresh_token,"access_token":access_token}), 200
    if grant_type == 'client_credentials':
       # guest access token for the client credentials grant
       refresh_token = ''.join([random.choice('0123456789ABCDEF') for x in range(0, 16)]) # ideally we save tokens issued
       access_token = ''.join([random.choice('0123456789ABCDEF') for x in range(0, 16)]) # ideally we save tokens issued

       return jsonify({"token_type":"bearer","expires_in":3600,"refresh_token":refresh_token,"access_token":access_token}), 200
    if grant_type == 'refresh_token':
       refresh_token = request.values.get('refresh_token', '')
       if not refresh_token.strip():
          return make_response(jsonify({'error':'invalid refresh_token'}), 400)
       refresh_token = ''.join([random.choice('0123456789ABCDEF') for x in range(0, 16)]) # ideally we save tokens issued
       access_token = ''.join([random.choice('0123456789ABCDEF') for x in range(0, 16)]) # ideally we save tokens issued
       return jsonify({"token_type":"bearer","expires_in":3600,"refresh_token":refresh_token,"access_token":access_token}), 200
    return make_response(jsonify({'error':'invalid grant_type'}), 400)

@app.route('/oauth/authorize/', methods=['GET'])
#@oauth.authorize_handler
def authorize(*args, **kwargs):
         response_type = request.args.get('response_type', '')
         if response_type == 'token':
            # implicit
            username = request.args.get('username', '')
            password = request.args.get('password', '')
            access_token = ''
            if username.strip() and password.strip(): # ideally we check a membership provider
               access_token = ''.join([random.choice('0123456789ABCDEF') for x in range(0, 16)]) # ideally we save tokens issued
               return make_response(jsonify({'url':'http://127.0.0.1/', 'token_type': 'bearer', 'access_token': access_token, 'expires_in': 3600}), 200)
            return make_response(jsonify({'error':'invalid credentials'}), 400)
         else:
            # auth code request
            client_id = request.args.get('client_id', '') # ideally we check registered clients
            if not client_id.strip():
                return make_response(jsonify({'error':'invalid client'}), 400)
            redirect_uri = request.args.get('redirect_uri', '') # ideally we check registered callback uri
            if not redirect_uri.strip():
                return make_response(jsonify({'error': 'invalid redirect_uri'}), 400)
            auth_code = ''.join([random.choice(string.digits + string.ascii_uppercase) for x in range(0, 12)])
            redirect_uri += "?code=" + auth_code # ideally we check for query strings
            return redirect(redirect_uri)

if __name__ == '__main__':
    app.run(port=8888)
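For completeness, here is how a client might exercise the sketch server above using the requests library; the endpoints and parameter names follow the example requests earlier, the client ids come from those examples, and the secret is a placeholder:

import requests

BASE = 'http://127.0.0.1:8888'

# implicit grant: obtain a bearer token directly (least secure, see above)
r = requests.get(BASE + '/oauth/authorize/',
                 params={'response_type': 'token', 'client_id': 'CarouselTest1',
                         'username': 'foo', 'password': 'bar'})
print(r.json())  # token_type, access_token, expires_in

# authorization code grant: fetch the code, then exchange it for a token
r = requests.get(BASE + '/oauth/authorize/',
                 params={'client_id': 'BCTest1',
                         'redirect_uri': 'http://127.0.0.1:8888/callback'},
                 allow_redirects=False)
code = r.headers['Location'].split('code=')[1]
r = requests.post(BASE + '/oauth/token/',
                  data={'grant_type': 'authorization_code', 'client_id': 'BCTest1',
                        'client_secret': 'THE_CLIENT_SECRET', 'code': code})
print(r.json())  # bearer token along with a refresh_token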

Usually we don't send the client secret over the wire.
Instead, we could use something like this:
# Here we simply propose the use of an OTP token to sign and make a request.
# implemented in totp.py discussed earlier as the OneTimePasswordAlgorithm ()
# for more detail, please visit : http://1drv.ms/13YBENz and code at https://github.com/ravibeta/apisecurity-1.0
import time
import hashlib
from totp import OneTimePasswordAlgorithm  # the totp.py module mentioned above

client_secret = 'THE_CLIENT_SECRET'  # the secret registered for the client; never sent over the wire
otp = OneTimePasswordAlgorithm()
epoch_time = str(int(time.time()))
token = otp.generateTOTP(client_secret, epoch_time, '6', hashlib.sha256)
# e.g. token == '857652'
string_to_sign = '?timestamp=' + epoch_time + '&token=' + token
# Put the HMAC_SHA1 of string_to_sign in the Authorization header
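The signing step mentioned above can be done with Python's standard hmac module; a minimal sketch, assuming the client secret is the signing key and the header scheme name is illustrative:

import hmac
import base64
import hashlib

digest = hmac.new(client_secret.encode('utf-8'),
                  string_to_sign.encode('utf-8'),
                  hashlib.sha1).digest()
signature = base64.b64encode(digest).decode('utf-8')
headers = {'Authorization': 'HMAC ' + signature}  # sent along with the request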

Today we continue our discussion of the WRL system. We saw how the trace was controlled, the format of the trace buffer entries, the store and analyze operations on the trace, and how simulation was used to study different cache organizations. We now review the accuracy of the trace data and the elimination of gaps for seamless traces.
One improvement to this system could have been to add time statistics (elapsed time, worker time) to each of the blocks of traced program execution between context switches.
#codingexercise
decimal GetDistinctRangeVariance(int[] A)
{
    if (A == null) return 0;
    return A.DistinctRangeVariance();
}

Wednesday, December 3, 2014

We continue our discussion on the WRL system to generate and analyze trace data. We looked at the formatting of the trace entries. We now look at some special considerations for the kernel. The kernel can also be traced because it, too, is relinked. This is true for nearly all kernel code except for the part written in assembly language, because the linker does not modify it. On the WRL system, this part is responsible for determining the cause of traps and interrupts, saving and restoring co-processor state and register contents, manipulating the clock and the translation lookaside buffer, and managing returns from interrupts. It also includes the idle loop. In these cases, the trace routines are inserted by hand.
The trace can grow very quickly. The trace buffer is nearly half the size of the memory on the system, and yet it might represent only two seconds of untraced execution. This trace data has to be extracted so that tracing can continue without too much disruption or loss of accuracy.
The interruption may entail extracting the partial trace and writing it out, or analyzing it immediately so that it does not have to be saved. The latter is preferred because the former does not scale to long traces. When the trace buffer is nearly full, the operating system turns off tracing and runs a very high priority analysis process. Execution of this analysis is controlled by a read system call on a special file. The read returns the virtual address of the beginning of the trace buffer and the number of available entries only when the buffer is full or nearly full. The data can be analyzed in any manner, and tracing is stopped for this duration. When the data has been processed, the read is executed again, tracing is turned back on, and traced user programs can once again execute.
Storing the trace data helps make the simulation results reproducible and thereby helps with the study of different cache organizations across simulations; on-the-fly analysis does not. Reproducibility is possible for the trace data of a single process, but for multiple processes or for kernel traces the trace will not be identical from run to run. There are a couple of potential solutions to this problem. It might be possible to simulate more than one cache organization at a time, but this does not allow the comparison of new and old results. Another possibility is to do a sufficient number of controlled runs to determine the variance in the results for each cache organization, so that any statistically significant difference between any two caches can be detected.
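As a rough sketch of that last approach, the repeated runs for each cache organization can be summarized by their mean and variance, and a simple two-sample (Welch) t statistic indicates whether an observed difference is likely to be significant; the miss-ratio numbers below are made up purely for illustration:

import math
import statistics

# Hypothetical miss ratios from repeated controlled runs of two cache organizations
runs_a = [0.052, 0.049, 0.051, 0.050, 0.053]
runs_b = [0.047, 0.046, 0.048, 0.045, 0.047]

mean_a, var_a = statistics.mean(runs_a), statistics.variance(runs_a)
mean_b, var_b = statistics.mean(runs_b), statistics.variance(runs_b)

# Welch's t statistic: a large |t| suggests the difference between the two
# cache organizations is statistically significant rather than run-to-run noise
t = (mean_a - mean_b) / math.sqrt(var_a / len(runs_a) + var_b / len(runs_b))
print(mean_a, mean_b, t)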
#codingexercise
decimal GetDistinctRangeVariance(int[] A)
{
    if (A == null) return 0;
    return A.DistinctRangeVariance();
}
https://github.com/ravibeta/PythonExamples/blob/master/totp.py

#codingexercise

decimal GetDistinctRangeClusterCenter(int[] A)
{
    if (A == null) return 0;
    return A.DistinctRangeClusterCenter();
}

Tuesday, December 2, 2014

Today we will continue our discussion on the WRL system to generate and analyze trace data. The trace code itself was written in a proprietary assembly language, using registers to avoid memory references; eight of these registers are used for trace generation. The kernel uses a separate register set from the user process, so only the trace index and the trace pointer values need to be copied from user to kernel mode; in the switch from kernel to user they are always recalculated. All processes have the shared trace buffer mapped to the same virtual address. The linker inserts trace branches as part of the automatic generation of address references. In addition, it is possible for the user or the kernel to add custom trace routines; only the kernel does so for this study. The addresses inserted in the trace buffer by the kernel are physical addresses, whereas those inserted by user processes are virtual addresses. It is essential for the analysis code to be able to tell one from the other. Similarly, it is important to attribute sequences of virtual addresses to a particular process when more than one is being traced. This is done by making a change-mode entry in the trace buffer on every transition from kernel to user or back. The entry indicates the direction of the switch and the process id.
Tracing is done only when a trace flag is set; the flag is stored in a trace register, and all writes to the trace buffer check this value. Tracing is turned on at the first interrupt following a write to a location in /dev/kmem. The set of user programs that are traced is determined by the ones that are specially linked; the kernel is traced only when it is trace-linked. When tracing is off, 4 to 6 additional instructions are executed at every trace point. The user code does not incur any cost when it is not trace-linked, but the kernel does: even when the kernel is not linked for tracing, it is slower because it has to execute instructions to keep the trace buffer consistent across every switch.
The trace buffer entries are formatted so that they can be divided into two types based on whether they fill one or two 32-bit words in the buffer. Data reference entries are one word long: a load/store entry takes up a single 32-bit word, with the load or store distinguished by the first two high-order bits of the word and the remaining bits used for the address, which is guaranteed to be less than 1 GB. Any entry with non-zero high-order bits is therefore a data entry. An instruction entry, on the other hand, represents the beginning of a basic block and has a first word denoting the type and a count, where the count is the number of instructions in the basic block starting at the address in the second word.
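To make the entry layout concrete, here is a small Python sketch that decodes entries under the format described above; the specific two-bit tag values assigned to loads, stores, and instruction blocks are assumptions for illustration only:

# Assumed tag values in the two high-order bits of the first word (illustrative only)
TAG_INSTR_BLOCK = 0b00   # instruction entry: word 1 = type/count, word 2 = start address
TAG_LOAD = 0b01          # data entry: remaining 30 bits hold the (< 1 GB) address
TAG_STORE = 0b10

def decode_entry(words, i):
    # Decode the trace entry starting at index i; return (entry, next_index).
    first = words[i]
    tag = (first >> 30) & 0x3
    if tag == TAG_INSTR_BLOCK:
        count = first & 0x3FFFFFFF   # number of instructions in the basic block
        start = words[i + 1]         # address of the first instruction in the block
        return ('instr_block', start, count), i + 2
    address = first & 0x3FFFFFFF     # data address, guaranteed to be less than 1 GB
    kind = 'load' if tag == TAG_LOAD else 'store'
    return (kind, address), i + 1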
#codingexercise
int GetDistinctRangeMode(int[] A)
{
    if (A == null) return 0;
    return A.DistinctRangeMode();
}

int GetDistinctRangeMedian(int[] A)
{
    if (A == null) return 0;
    return A.DistinctRangeMedian();
}

decimal GetDistinctRangeStdDev(int[] A)
{
    if (A == null) return 0;
    return A.DistinctRangeStdDev();
}

Monday, December 1, 2014

Today we discuss another WRL report, titled "Long Address Traces from RISC Machines: Generation and Analysis" by Borg, Kessler, Lazana and Wall. It talks about cache design at a time when processor speed was picking up: if the processor speed increased by a factor of ten with the memory system remaining the same, the overall execution speed improved merely by a factor of two. The difference was mainly attributed to servicing cache misses.
To analyze caches, they used simulation driven by memory address traces of running systems. The accuracy of the results depends both on the simulation model and on the trace data. Some of the limitations of earlier tracing systems included complexity, inaccuracy, lack of system references, short length, inflexibility, or applicability only to CISC machines. They built a system to overcome these limitations in which trace generation is based on link-time code modification. This makes the generation of new traces easier, and the system is flexible enough to allow control of what is traced and when it is traced. For large, fast systems, very long traces are required; this system poses no limitation on the length of the trace.
The traces generated by this system were used to analyze the behavior of multi-level cache hierarchies. The link-time code modification was possible because the programs were compiled to an intermediate language called Mahler. An option to the linker enables branches to trace code to be inserted into the program wherever a referenced address is to be recorded. Instruction addresses are computed and written into a trace buffer; these addresses are those where the instructions would have been located were it not for the trace branches. The trace buffer is shared among the kernel and all running processes, but only those that have been trace-linked actually use the buffer. The buffer is mapped to the high end of every user's virtual address space, and the user trace code can write directly to the trace buffer by referencing its virtual address. The trace code is written in an assembly language proprietary to the WRL system. Since kernel execution is uninterruptible, unlike user execution, additional code is required to ensure correct synchronization among the users of the trace buffer. A user trace entry could therefore be split by an arbitrary amount of trace data generated by the kernel.
Trace extraction and analysis are time-consuming and cannot be performed concurrently with tracing, so tracing is interrupted for those activities. The challenge is to ensure that the trace remains seamless. A partial-trace scheme is used in which the trace data is analyzed immediately by a high-priority analysis process, so the trace does not have to be saved.
We will also review another paper, on combining branch predictors by McFarling, from Rajeev's website after this one.
#codingexercise
int GetDistinctRangeMin(int[] A)
{
    if (A == null) return 0;
    return A.DistinctRangeMin();
}

int GetDistinctRangeMax(int[] A)
{
    if (A == null) return 0;
    return A.DistinctRangeMax();
}