Wednesday, September 4, 2013

Today I want to make a post that is a departure from the ones I have been writing lately. Much of this was learned in an effort to automate a report, and I want to write it down here. First, an introduction to the objective, then a choice of approaches and a discussion of their tradeoffs, followed by a possibly expedient solution.
The objective in this case was to generate daily automated mails of a pivot table. The table comes from a TFS query that lists work items assigned to owners, with item priorities of, say, 1, 2, or 3. The generated pivot table has the count of work items at each priority listed for every owner. This table should flow daily to a distribution list of subscribers so that they get it in their inbox.
Although there could be several ways to do this task, I will list the things that need to be considered.
The data could change every day.
The pivot table format could change.
TFS queries may not support both the listing of the data and the pivoting.
This is a custom report. By that I mean we cannot use the canned reports from a TFS connection because they don't pivot the data.
The pivot table output should be in HTML format for mailing out.
Transforming the data from tables through XSLT or any other reporting stage, or invoking tools on dedicated local servers, adds maintenance and cost.
Ease of use, such as using Excel to pivot the table or chart it, helps accommodate changes to the output format over time.
The server that sends out the mails should preferably do this daily. It is alright to do it more than once and to a designated person, so that one of the mails can be selected for sending out to a wider audience. This allows some manual control if desired, and is optional.
The solutions could include, in no particular order:
1) write SQL Server Reporting Services (RDL) reports so that they can be viewed from the TFS server
2) write a custom tool to generate the output and the mail using libraries and plugins; the executable can be scheduled to be invoked with a task scheduler
3) save TFS queries and make them accessible over the Internet, at which point the source table is already in HTML and available via web requests
Out of the three, the tradeoffs are in terms of flexibility to change, availability of the data and the report online, and the cost of ownership involved. Users who are familiar with these will spot them right away; for those who are new, it could be explained as: the more moving pieces there are, the more it costs to own them and to transform the data as it flows between them.
Lastly, the solution that could be considered expedient is to have the Excel application publish just the pivot table item, and not the sheets or the workbook. This item could appear directly on, say, a SharePoint website, which provides HTTP access to the pivot table. It can then be retrieved with a web request and mailed out by a script that is invoked by the task scheduler or a reporting server.
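As a rough sketch of that expedient route, the fetch-and-mail script could be very small. The URL, SMTP host and addresses below are placeholders I made up for illustration, not the actual servers, and the snippet assumes the System.Net and System.Net.Mail namespaces:
            // Sketch only: fetch the published pivot table HTML and mail it out.
            // The URL, SMTP host and addresses are hypothetical placeholders.
            var web = new WebClient { Credentials = CredentialCache.DefaultCredentials };
            string pivotHtml = web.DownloadString("http://sharepoint.example.com/reports/pivot.html");

            var mail = new MailMessage("reports@example.com", "report-reviewer@example.com")
            {
                Subject = "Daily work item pivot for " + DateTime.Today.ToShortDateString(),
                Body = pivotHtml,
                IsBodyHtml = true
            };
            new SmtpClient("smtp.example.com").Send(mail);
The task scheduler can run this daily and send it to the designated person first, who can then forward the chosen mail to the wider distribution list.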

SHA-1 hashing

SHA-1 hashes are 160 bits, or 20 bytes, long and are usually written as 40 hexadecimal digits. The message digest design is similar to Rivest's design for MD4 and MD5. The hash state is five blocks of 32 bits each, unsigned and in big-endian order, initialized to fixed constants h0 through h4. The message is first preprocessed: append the bit 1 to the message, append padding of up to 512 bits so that the length is congruent to 448 modulo 512, and then append the length of the original message as an unsigned 64-bit number.
Then process the message in successive 512-bit chunks. For each chunk, break the chunk into sixteen 32-bit big-endian words and extend them into eighty 32-bit words this way: each of the 16th through 79th words is the XOR of the words 3, 8, 14, and 16 positions earlier, left-rotated by 1. Initialize the working values for this chunk as a copy of the five hash values. In the main loop, for i from 0 to 79, split into four equal ranges, combine the working values with AND, OR, NOT and XOR in a predefined manner specified differently for each range, add a per-range constant and the ith word, then shuffle the working values along, re-assigning the first from the result and left-rotating the value that moves into the third slot by 30. At the end of the loop, add the chunk's working values into the hash values accumulated so far. The final hash value is the concatenation of the five 32-bit hash values.
The .NET library packages this up in the ComputeHash method on its hash algorithm classes. For example, a keyed HMAC-SHA256 hash of a content string, given the secret key bytes, can be computed like this:
            var bytesToSign = Encoding.UTF8.GetBytes(content);
            HMAC hmacSha256 = new HMACSHA256(secretKeyBytes);
            byte[] hashBytes = hmacSha256.ComputeHash(bytesToSign);
            return Convert.ToBase64String(hashBytes);
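That snippet is actually a keyed HMAC-SHA256. For the plain SHA-1 digest described in this post, the SHA1 class works the same way; 'content' is an assumed input string:
            // Plain SHA-1 digest of a string; 'content' is an assumed input.
            byte[] inputBytes = Encoding.UTF8.GetBytes(content);
            using (SHA1 sha1 = SHA1.Create())
            {
                byte[] hash = sha1.ComputeHash(inputBytes);            // 20 bytes = 160 bits
                return BitConverter.ToString(hash).Replace("-", "");   // 40 hex digits
            }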

Tuesday, September 3, 2013

In this post, I'm going to talk about hash tables. Hash tables are popular because it takes constant time to look up a data record. A hash function is used to generate a fixed-length representation of the actual data record. Hash functions can make a table lookup faster and can help detect duplicated or similar records: if two hash codes differ, the records are certainly different, while equal hash codes mark candidates that still need to be compared. Hashes don't retain the original data.
In a hash table, the hash function maps the search key to an index, which then gives the place where the data is inserted. Here the index only refers to a bucket since the range of the key values is typically larger than the range of hashes. Each bucket corresponds to a set of records.
Duplicate records are found by going through the set of records in the same bucket. This scanning is necessary because hashing only ensures that identical records end up in the same bucket; distinct records can land there too, which is called a collision. When collisions are kept to a minimum, the hashing function performs well.
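As a toy sketch of the bucket idea, and not any particular library's implementation, the insert-or-detect-duplicate step could look like this:
            // Toy sketch: map a key to a bucket, then scan that bucket for duplicates.
            var buckets = new List<string>[16];
            string key = "user42";                                          // hypothetical key
            int index = (key.GetHashCode() & 0x7FFFFFFF) % buckets.Length;  // hash -> bucket index
            if (buckets[index] == null) buckets[index] = new List<string>();
            bool duplicate = buckets[index].Contains(key);                  // scan only this bucket
            if (!duplicate) buckets[index].Add(key);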
Incidentally, geometric hashing can be used to find points that are close in a plane or three dimensional space. Here the hashing function is interpreted as a partition of that space into a grid of cells. The return value of the function is typically a tuple with two or more indices such as the dimensions of a plane.
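For example, with an assumed cell size and a point (x, y), the geometric hash could simply be the pair of grid cell indices:
            // Geometric hashing sketch: nearby points fall into the same or adjacent cells.
            double cellSize = 10.0;                                   // assumed grid resolution
            var cell = Tuple.Create((int)Math.Floor(x / cellSize),    // x and y are the point's coordinates
                                    (int)Math.Floor(y / cellSize));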
Universal hashing is a scheme in which there is a family of hashing functions to choose from, and a function is chosen at random such that when two distinct keys are hashed, the probability that they collide is at most 1 in n, where n is the number of different hash values desired. However, it could still have more collisions than a special-purpose hash function.
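A textbook example of such a family, not tied to any particular library, is h(k) = ((a·k + b) mod p) mod n with p prime and a, b drawn at random per table; the bucket count and sample key below are made up for illustration:
            // Universal hashing sketch: a and b are chosen at random once per table.
            const long p = 2147483647;                   // a prime (2^31 - 1) larger than any key used
            var rng = new Random();
            long a = rng.Next(1, int.MaxValue);          // 1 <= a < p
            long b = rng.Next(0, int.MaxValue);          // 0 <= b < p
            int n = 1024;                                // number of distinct hash values desired
            long key = 123456789;                        // hypothetical key, smaller than p
            long bucket = ((a * key + b) % p) % n;       // two distinct keys collide with probability <= 1/n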
Hash functions are also used in cryptography because a hash gives a compact representation of the data and can guard against tampering: if the data were modified, it would be hard to hide the change and still produce the same hash.
Cryptographic hash functions such as SHA-1 map inputs more evenly across the entire range of hash values, so they serve as good general-purpose hash functions. Note that the goal is a uniform spread rather than true randomness; a function that behaves like a randomizing function, while remaining deterministic, is a good choice for hashing.
Hash functions have to be deterministic. If they are given the same keys, they should produce the same hash again. This does not mean hashing functions cannot be used with things that change such as the time of the day or a memory address. Whenever a key changes, it can generally be rehashed.



OAuth implicit and authorization code grants are for the WebUI to use. This is because the userId translation need not be visible to the user or the clients. WebUI testing covers the implicit and authorization code grants. It's the WebUI that makes sure the redirects are forwarded to the right clients. This could be done with the redirect uri parameter, the state parameter and possibly callbacks. If the clients are not secured, the bearer token could land with a phishing client. If the phishing client hijacks the token, it can easily use it to access protected resources. This is a security vulnerability.
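A minimal sketch of the redirect the WebUI builds for the authorization code grant might look like this; the endpoint, client id and redirect uri are hypothetical placeholders:
            // Sketch of an OAuth authorization request; all values here are placeholders.
            string authorizeEndpoint = "https://auth.example.com/oauth/authorize";
            string clientId = "my-client-id";
            string redirectUri = "https://webui.example.com/oauth/callback";
            string state = Guid.NewGuid().ToString("N");  // echoed back to detect forged redirects

            string redirectUrl = authorizeEndpoint
                + "?response_type=code"                   // or "token" for the implicit grant
                + "&client_id=" + Uri.EscapeDataString(clientId)
                + "&redirect_uri=" + Uri.EscapeDataString(redirectUri)
                + "&state=" + Uri.EscapeDataString(state);
            // The WebUI redirects the browser to redirectUrl and later verifies that the state it
            // receives back matches, so the token is not handed to a phishing client.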

Monday, September 2, 2013

One way to avoid persisting any token-related information in the OAuth provider is to compute a cryptographic hash over the parameters provided during the token request and a timestamp. Since the hash is generated by a cryptographic provider, it is opaque to the client. Because it is opaque, I assumed I could encrypt and decrypt the userId and the clientId on the server side, so that the clients could use the result as the OAuth access token while the server could easily decrypt the token to recover the userId and the clientId.
            string data = your_data_here;
            DateTime now = DateTime.UtcNow;
            string timestamp = now.ToString("yyyy-MM-ddTHH:mm:ssZ");
            string signMe = data + timestamp;
            byte[] bytesToSign = Encoding.UTF8.GetBytes(signMe);

            var encryptedBytes = provider.Encrypt(bytesToSign, true);

            string decryptedTest = System.Text.Encoding.UTF8.GetString(
                provider.Decrypt(encryptedBytes, true));

However, after writing the post above I tried it out and found that the encoded string is not what I was looking to pass around as a token. Instead a hash will do. A hash can be computed like this:

            byte[] secretKeyBytes = Encoding.UTF8.GetBytes(
                "Ada Lovelace" + DateTime.Now.ToString("yyyy-MM-ddTHH:mm:ssZ"));
            var bytesToSign = Encoding.UTF8.GetBytes(content);
            HMAC hmacSha256 = new HMACSHA256(secretKeyBytes);
            byte[] hashBytes = hmacSha256.ComputeHash(bytesToSign);
            return Convert.ToBase64String(hashBytes);

The only thing is that the RFC says the access token may itself contain the authorization information, such as a data portion and a signature, so that it is self-verifiable. If we were to do that, the data would have to consist of public strings.
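A sketch of such a self-verifiable token, assuming the userId and clientId are acceptable as public strings and that the server holds a secret key, could pair the data with its HMAC signature:
            // Self-contained token sketch: public data plus a signature, both Base64-encoded.
            // serverKeyBytes is an assumed server-side secret; userId and clientId are public strings.
            string data = userId + "|" + clientId + "|" + DateTime.UtcNow.ToString("yyyy-MM-ddTHH:mm:ssZ");
            byte[] dataBytes = Encoding.UTF8.GetBytes(data);
            byte[] signature = new HMACSHA256(serverKeyBytes).ComputeHash(dataBytes);
            string accessToken = Convert.ToBase64String(dataBytes) + "." + Convert.ToBase64String(signature);
            // To verify, split on '.', recompute the HMAC over the first part and compare to the second.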

Sunday, September 1, 2013

We discussed substituting the OAuth provider in the proxy with a token provider from the API. The token provider is dependent on the client registrations and doesn't issue a token without a client involved. Therefore client registrations are also provisioned by the .com website for the company. This is different from the WebUI required for the OAuth login. The WebUI logins target the user profiles registered with the .com website. The same database could also hold client registration data, although the clients are associated with developer logins. The developer logins are more closely associated with the API and proxy than with the .com website user profile and can be independent. It is certainly possible for the .com website to provide or unify these logins. The goal in such efforts is to reduce the network RTT between the user and the API server. It also reduces the dependencies on the proxy. The WebUI is a central piece for the end-to-end working of OAuth. The tests for the WebUI involve some of the following:
1) before login, check for client description and image
2)  login with a username and password to see that the grant is successful. Successful grant could be seen with a sample API call such as get all stores.
3) deny the grant to see if the application responds properly
4) check the list of all authorized clients
5) revoke a single client to check if the client no longer has access.
We mentioned in the previous post a token mapping table. We primarily considered the mapping between user, client, and token and discussed token CRUD operations. In this post we consider token archival policy. Since tokens are created for each user on each client, there could be a large number of tokens generated in a day, month or year. So an archival policy for tokens, or periodic purges of expired tokens, is necessary. Since the tokens issued are progressive, it is unlikely that the tokens issued earlier will be used or referenced. In any case, a client presenting an expired token will be returned an invalid_token error. Therefore, the purge or archival of older tokens will not interfere with the progressive tokens. We could even consider keeping an in-memory sliding window of tokens issued in the last hour: as new tokens are added, any tokens older than the last hour are deleted. Since the OAuth provider server is expected to be continuously available to respond to authorization requests, the sliding window can be in-memory. Additionally, we may want to keep track of other token-related data as well, such as the token scope, the redirect URI, whether a refresh token is required, the response type, the clientId and the client secret. When codes are issued, we could treat them as a special case of tokens so that a token issued in response to a code can look up the previously issued code.
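A minimal sketch of that in-memory sliding window, assuming a one-hour expiry and a simple (issue time, token) pair, is a queue ordered by issue time:
            // Sliding-window token store sketch: drop tokens older than one hour on each insert.
            var window = new Queue<Tuple<DateTime, string>>();           // (issuedAtUtc, token), oldest first
            DateTime now = DateTime.UtcNow;
            window.Enqueue(Tuple.Create(now, newToken));                 // 'newToken' is the token just issued
            while (window.Count > 0 && now - window.Peek().Item1 > TimeSpan.FromHours(1))
                window.Dequeue();                                        // expired; purge or archive here
The same structure could carry the additional token-related data mentioned above by widening the pair into a small record.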