Tuesday, September 3, 2013

In this post, I'm going to talk about hash tables. Hash tables are popular because lookups take constant time on average. A hash function is used to generate a fixed-length representation of the actual data record. Hash functions can make a table lookup faster and can detect duplicated records: two records with different hash codes are guaranteed to differ, while equal hash codes suggest, but do not guarantee, equal records. Hashes don't retain the original data.
In a hash table, the hash function maps the search key to an index, which gives the place where the data is stored. Here the index refers to a bucket, since the range of the key values is typically larger than the range of hash values. Each bucket corresponds to a set of records.
Duplicate records are found by going through the set of records in the same bucket. This scanning is necessary because distinct keys can hash to the same bucket; such occurrences are called collisions. When collisions are kept to a minimum, the hash function performs well.
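To make the bucket idea concrete, here is a minimal sketch of a hash table with separate chaining; the class and method names are illustrative, not from any particular library:

```csharp
using System;
using System.Collections.Generic;

class ChainedHashTable
{
    private readonly List<KeyValuePair<string, string>>[] buckets;

    public ChainedHashTable(int size = 16)
    {
        buckets = new List<KeyValuePair<string, string>>[size];
        for (int i = 0; i < size; i++)
            buckets[i] = new List<KeyValuePair<string, string>>();
    }

    // The hash function maps the key to a bucket index.
    private int BucketIndex(string key) =>
        (key.GetHashCode() & int.MaxValue) % buckets.Length;

    public void Put(string key, string value)
    {
        var bucket = buckets[BucketIndex(key)];
        // Scan the bucket for a duplicate key; this scan is the cost of collisions.
        for (int i = 0; i < bucket.Count; i++)
        {
            if (bucket[i].Key == key)
            {
                bucket[i] = new KeyValuePair<string, string>(key, value);
                return;
            }
        }
        bucket.Add(new KeyValuePair<string, string>(key, value));
    }

    public string Get(string key)
    {
        foreach (var pair in buckets[BucketIndex(key)])
            if (pair.Key == key)
                return pair.Value;
        return null;
    }
}
```

With few keys per bucket, the scan is short and lookup stays close to constant time; with many collisions it degrades toward a linear scan.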
Incidentally, geometric hashing can be used to find points that are close together in a plane or in three-dimensional space. Here the hash function is interpreted as a partition of that space into a grid of cells. The return value of the function is typically a tuple with two or more indices, one per dimension.
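A sketch of that grid partition, where the cell size is an assumed parameter: points whose cell indices match lie within the same cell and are therefore close.

```csharp
using System;

static class GeoHash
{
    // Map a 2-D point to its grid cell; the returned tuple is the "hash".
    public static (int, int) Cell(double x, double y, double cellSize) =>
        ((int)Math.Floor(x / cellSize), (int)Math.Floor(y / cellSize));
}
```

A nearness query then only needs to examine the point's own cell and its neighboring cells.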
Universal hashing is a scheme in which a hash function is chosen at random from a family of functions such that, for any two distinct keys, the probability that they collide is at most 1/n, where n is the number of distinct hash values desired. However, a function chosen this way could still have more collisions than a special-purpose hash function.
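One classic universal family is h(x) = ((a·x + b) mod p) mod n, with p a prime larger than any key and a, b drawn at random. The constants and names below are illustrative assumptions:

```csharp
using System;

class UniversalHash
{
    private const long P = 2147483647; // Mersenne prime 2^31 - 1
    private readonly long a, b;
    private readonly int n;

    // Picking a fresh (a, b) selects one member of the family at random.
    public UniversalHash(int n, Random rng)
    {
        this.n = n;
        a = 1 + (long)(rng.NextDouble() * (P - 1)); // a in [1, P-1]
        b = (long)(rng.NextDouble() * P);           // b in [0, P-1]
    }

    public int Hash(long key) => (int)(((a * key + b) % P) % n);
}
```

Each instance is deterministic once chosen; the 1/n collision bound holds over the random choice of (a, b), not over any single instance.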
Hash functions are also used in cryptography because a hash gives a compact representation of the data and can guard against tampering. If the data were modified, it would be hard to produce a modified version that yields the same hash.
Cryptographic hash functions such as SHA-1 map inputs more evenly across the entire range of hash values. Therefore, they serve as good general-purpose hash functions. Note that this spread is uniform rather than random; a function that behaves like a random mapping is nonetheless a good choice for hashing.
Hash functions have to be deterministic: given the same key, they should produce the same hash again. This does not mean hash functions cannot be used with things that change, such as the time of day or a memory address. Whenever a key changes, it can generally be rehashed.



OAuth implicit and authorization code grants are for the WebUI to use. This is because the userId translation need not be visible to users or clients. WebUI testing covers the implicit and authorization code grants. It's the WebUI that makes sure the redirects are forwarded to the right clients. This could be done with the redirect URI parameter, the state parameter, and possibly callbacks. If the clients are not secured, the bearer token could land at a phishing client. If the phishing client hijacks the token, it can easily use it to access protected resources. This is a security vulnerability.

Monday, September 2, 2013

One way to avoid persisting any token-related information in the OAuth provider is to compute a cryptographic hash over the parameters provided during the token request and a timestamp. Since the hash is generated by a cryptographic provider, it is opaque to the client. Given that opacity, I assumed I could encrypt and decrypt the userId and the clientId on the server side, so that clients can use the result as the OAuth access token while the server can easily decrypt the token to recover the userId and the clientId.
            // provider is assumed to be an RSA provider; it is declared here
            // so the snippet is self-contained.
            var provider = new RSACryptoServiceProvider();
            string data = your_data_here;
            DateTime now = DateTime.UtcNow;
            string timestamp = now.ToString("yyyy-MM-ddTHH:mm:ssZ");
            string signMe = data + timestamp;
            byte[] bytesToSign = Encoding.UTF8.GetBytes(signMe);

            // Encrypt with OAEP padding (the second argument).
            var encryptedBytes = provider.Encrypt(bytesToSign, true);

            // Round-trip to verify the server can decrypt.
            string decryptedTest = Encoding.UTF8.GetString(
                provider.Decrypt(encryptedBytes, true));

However, after writing the post above I tried it out and found that the encrypted string is not what I was looking to pass around as a token. Instead a hash will do. A hash can be computed like this:

            // "Ada Lovelace" plus a timestamp serves as the sample HMAC secret.
            byte[] secretKeyBytes = Encoding.UTF8.GetBytes(
                "Ada Lovelace" + DateTime.Now.ToString("yyyy-MM-ddTHH:mm:ssZ"));
            byte[] bytesToSign = Encoding.UTF8.GetBytes(content);
            HMAC hmacSha256 = new HMACSHA256(secretKeyBytes);
            byte[] hashBytes = hmacSha256.ComputeHash(bytesToSign);
            return Convert.ToBase64String(hashBytes);

The only thing is that the RFC says the access token may be self-contained, carrying the authorization information, such as data plus a signature, so that it is self-verifiable. If we were to do that, the data must consist of public strings.

Sunday, September 1, 2013

We discussed substituting the OAuth provider in the proxy with a token provider from the API. The token provider depends on the client registrations and doesn't issue a token without a client involved. Therefore client registrations are also provisioned by the .com website for the company. This is different from the WebUI required for the OAuth login. The WebUI logins target the user profiles registered with the .com website. The same database could also hold client registration data, although the clients are associated with developer logins. The developer logins are more closely associated with the API and proxy than with the .com website user profile and can be independent. If the .com website were to provide or unify these logins, that is certainly possible. The goal in such efforts is to reduce the network RTT between the user and the API server. It also reduces the dependencies on the proxy. The WebUI is a central piece for the end-to-end working of OAuth. The tests for the WebUI involve some of the following:
1) Before login, check for the client description and image.
2) Login with a username and password to see that the grant is successful. A successful grant could be verified with a sample API call such as getting all stores.
3) Deny the grant to see if the application responds properly.
4) Check the list of all authorized clients.
5) Revoke a single client to check that the client no longer has access.
We mentioned a token mapping table in the previous post. We primarily considered the mapping between user, client, and token and discussed token CRUD operations. In this post we consider a token archival policy. Since tokens are created for each user on each client, a large number of tokens could be generated in a day, month, or year, so an archival policy or periodic purge of expired tokens is necessary. Since the tokens issued are progressive, it is unlikely that tokens issued earlier will be used or referenced. In any case, a client presenting an expired token will be returned an invalid_token error. Therefore, the purge or archival of older tokens will not interfere with the progressive tokens. We could even consider keeping an in-memory sliding window of the tokens issued in the last hour: as new tokens are added, any tokens older than one hour are deleted. Since the OAuth provider server is expected to be continuously available to respond to authorization requests, the sliding window can be in-memory. Additionally, we may want to keep track of other token-related data such as the token scope, URI, requirement for a refresh token, response type, clientId, and client secret. When codes are issued, we could treat them as a special case of tokens, so that a token issued in response to a code can look up the previously issued code.
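The sliding window could be sketched like this, assuming a one-hour token lifetime and with illustrative names; a queue works because tokens arrive in issue order:

```csharp
using System;
using System.Collections.Generic;

class TokenWindow
{
    private readonly Queue<(string Token, DateTime IssuedAt)> window =
        new Queue<(string, DateTime)>();
    private readonly TimeSpan lifetime = TimeSpan.FromHours(1);

    public void Add(string token, DateTime issuedAt)
    {
        // Purge tokens that have fallen out of the one-hour window.
        while (window.Count > 0 && issuedAt - window.Peek().IssuedAt > lifetime)
            window.Dequeue();
        window.Enqueue((token, issuedAt));
    }

    // A token is valid only if it is still inside the window.
    public bool IsValid(string token)
    {
        foreach (var entry in window)
            if (entry.Token == token)
                return true;
        return false;
    }
}
```

A client presenting a token that has aged out of the window would receive the invalid_token error mentioned above.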

Saturday, August 31, 2013

The APIs we mentioned in the previous post for OAuth have resource qualifiers based on clients, users, and tokens since there can be several of each. We have not talked about claims. Claims provide access to different scopes such as account, payment, rewards, etc. We leave this as an internal granularity that we can expose subsequently without any revisions to the initial set. For the first cut of the APIs we want to keep it simple and exhaust the mapping between the clients, the users, and their tokens. The tokens are themselves hashes and have an expiry time of around an hour. So we keep track of the tokens issued for the different client and user authorizations. The table for the tokens could have a mapping to the user table and the client table based on the user and client ids respectively. The table is populated for every authorization grant. When tokens are issued, their issue time is recorded, so that on every access to a token we can check whether it has expired. Since the tokens are hashes or strings, it's useful to index them, which helps lookup time. An alternative would be to look them up by the date of issue so that we retrieve only the tokens issued in the last hour and check for the presence of the token presented. Tokens will carry different privileges depending on whether the userId and clientId fields are populated. There could be fields for, say, a "bearer" token. These tokens are important since OAuth treats them differently; RFC 6750 describes them in detail. The lookup of a token needs to be fast since tokens will most likely be used more times than they are issued. A cryptographic hash is sufficient for the token since we don't want to tie it to any information other than the mapping we keep internally, and we do want to make it hard for hackers to generate or break the hash. The .NET libraries make it easy to generate such a hash.
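A minimal sketch of that indexed lookup with an expiry check on each access, assuming the one-hour lifetime discussed above and using illustrative names:

```csharp
using System;
using System.Collections.Generic;

class TokenTable
{
    // The token string is the key, so lookup is O(1); the value records
    // the issue time so every access can check for expiry.
    private readonly Dictionary<string, DateTime> issuedAt =
        new Dictionary<string, DateTime>();

    public void Insert(string token, DateTime when) => issuedAt[token] = when;

    public bool IsActive(string token, DateTime now) =>
        issuedAt.TryGetValue(token, out var t) && now - t < TimeSpan.FromHours(1);
}
```

In the real table the value would also carry the userId and clientId mapping, but the expiry check on access stays the same.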
Tokens, once generated, should not be removed from the table unless the user requests it. Revoking on user request keeps the table consistent internally and externally because the APIs return the list of tokens that the clients can keep track of. Revoking a token is important because the user or client can choose to revoke one or more tokens simultaneously, so this operation is different from the insert or the lookup. The same token should not be generated more than once, so retries have to be idempotent.

Friday, August 30, 2013

In the previous post, we began with a mention of how the proxy and the API interact today and moved on to how we want them to interact. Here we revisit the current interactions. The proxy may work across multiple API providers. However, we will keep our discussion to the API implementation. In the API implementation, we have the following methods provided by the OAuth provider:
1) Get the description of a client
2) Create an access token when the API calls the provider
3) Create an authorization code when the API calls the provider
4) Get the applications for a user
5) Revoke an access token
When tokens or codes are created, a user context is passed in that is used to track the client authorizations and the responses. This is how the tokens and codes are associated with a client and a user. Depending on whether the provider is implemented by a proxy or by the retail company, the methods mentioned above are required to establish the mapping between the user, the client, and their tokens/codes. The methods also take in additional parameters to make the token effective, such as the grant type, scope, response type, inclusion of a refresh token in the response, a URI, and a state. These parameters are relevant to tailoring the tokens as requested.
However, let's take a closer look at the APIs to see what we could add or improve. The first thing is testability. Do we have all the information exposed so that the data can be tested? No; we could do with a list of tokens associated with each user. If we have a list of tokens issued for a client and a list of tokens issued for a user, we should be able to see which tokens are issued for clients only, and confirm that the tokens associated with clients only are for low-privilege uses as opposed to those issued with a user context. This test helps us know the set of tokens that have more privilege than others. By separating the tokens this way and knowing the full set of tokens issued, we should be able to ensure that all access is accounted for.
Next, we could do with a list of clients authorized by each user and a list of users that a client has been authorized for. This could help us determine whether an authorization updates both lists. This is relevant because a token for user resources should be issued only when both are present.
Lastly, the list of authorized clients for a user and the list of users for a client should be updated with the corresponding revokes. When all the tokens for a user with a given client are revoked, it could be considered equivalent to revoking the client authorization, so that the client can request the user to re-authorize via OAuth for the next session. That is up to the discretion of the client and could be facilitated for the user with a checkbox for keeping the client authorized the first time the user signs in with the OAuth WebUI.