Monday, September 30, 2013

In this post, I want to talk about the Application object in the Microsoft Excel object model, which lets you work with Excel workbooks and sheets programmatically.
You could write code such as the following:
Application.Windows("book1.xls").Activate
or
Set x1 = CreateObject("Excel.sheet")
x1.Application.Workbooks.Open "newbook.xls"

You also have PivotTable and PivotChart objects in the Excel object model. They facilitate pivot transformations of data. You could use the PivotTableWizard method to generate a PivotTable:
ActiveSheet.PivotTableWizard xlDatabase, Range("A1:C100")
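If you are driving Excel from outside VBA, the same object model is reachable over COM. Here is a rough Python sketch using the pywin32 bindings; the package, the file path and the data range are assumptions for illustration.

import win32com.client

xl = win32com.client.Dispatch("Excel.Application")
xl.Visible = True
wb = xl.Workbooks.Open(r"C:\temp\newbook.xls")      # placeholder path
xl.Windows("newbook.xls").Activate()

xlDatabase = 1                                      # XlPivotTableSourceType constant
ws = wb.ActiveSheet
pivot = ws.PivotTableWizard(xlDatabase, ws.Range("A1:C100"))   # assumes data in A1:C100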

Sunday, September 29, 2013

In the previous post, we discussed generating different tokens for the same user and client over time. We could do this based on a varying component we add to the userId and clientId before encryption.
This was made possible with an hhmmss integer that we append to the UserId and ClientId.
This had the benefit that we used a fixed-length string to encrypt and decrypt. However, all of the original string may be known to the user, so if we want to keep some part of it unknown, we could add a magic number.
All of this could be avoided if we used an offset to rotate the userId+clientId string based on, say, hhmmss. The logic to translate the hhmmss into an offset that falls between 0 and the length of the original fixed-length string is entirely up to the server.
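Here is a small Python sketch of that rotation idea; the fixed widths and the hhmmss-to-offset mapping are just one possible server-side choice.

from datetime import datetime

def rotate(s, offset):
    offset %= len(s)
    return s[offset:] + s[:offset]

def prepare_input(user_id, client_id):
    now = datetime.utcnow()
    hhmmss = now.hour * 10000 + now.minute * 100 + now.second
    payload = f"{user_id:016d}{client_id:08d}"   # fixed width, left padded, as in the post
    offset = hhmmss % len(payload)               # any server-chosen mapping into [0, len) works
    return rotate(payload, offset), offset

# After decryption, the server undoes the rotation with rotate(rotated, len(rotated) - offset),
# so it needs a way to recover the offset, for example from the token issue time.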
We also mentioned that with the OAuth server, we preferred to keep a token table that gets populated for all OAuth token grants. At the same time, we said that the APIs that validate the tokens need not rely on the token table and can directly decrypt the tokens to identify the userId and the clientId. The table is only for auditing purposes at this point.
The token validation does not need to happen within the API implementation, although APIs could choose to do so. That validation could be done with the ActionFilterAttributes we discussed earlier, or even via HTTP handlers. The URI and query string could still be passed to the API implementation.
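As a rough sketch of keeping validation out of the API body, a Python decorator can play the part that the ActionFilterAttribute or HTTP handler plays in the ASP.NET pipeline; the decode_ids helper below is only a stand-in for the real token decryption, and all names are illustrative.

import base64
import functools

def decode_ids(token):
    """Stand-in for the real decryption: recover (user_id, client_id) or None."""
    try:
        plain = base64.urlsafe_b64decode(token.encode()).decode()
        return int(plain[:16]), int(plain[16:24])
    except Exception:
        return None

def require_token(handler):
    @functools.wraps(handler)
    def wrapper(request):
        ids = decode_ids(request.get("access_token", ""))
        if ids is None:
            return {"status": 401, "body": "invalid or missing token"}
        request["user_id"], request["client_id"] = ids   # handler never touches the token table
        return handler(request)
    return wrapper

@require_token
def get_profile(request):
    return {"status": 200, "body": f"profile for user {request['user_id']}"}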
The id conversion may need to occur if the APIs would like to get the userId and clientId from the token itself, so that the API resources do not require a user or client in the resource qualifiers. This is because the id in the token is integer based; if the earlier implementation was based on GUIDs, the id and the GUID for the same user or client may need to be looked up against each other.
APIs are not expected to do away with the user or client id, since the token is a security artifact and not a functional artifact. To the user it could be a redundant or additional task to provide user and client information along with a token. To the API implementation, irrespective of the caller, the userId and clientId could be parameters so that callers can look up the information when the parameter values change.
That said, most of the resources are based on user profiles and have nothing to do with client profiles. If the users are already being aliased so that they don't have to enter their userId, then the resource qualifiers for the API implementations can certainly choose not to require userId and clientId. This will benefit the users who call the API.
I mentioned earlier that on mobile devices, where text entry is still generally difficult, it is better to require less from users when they have to specify the API directly.
Lastly, the database schema may need to be modified if the ID parameter is not already in the form currently being proposed.

Saturday, September 28, 2013

In today's post I want to continue the discussion on generating tokens by encryption. We mentioned encrypting the UserId and ClientId and circulating the base64-encoded string as the token. We rely on the strength of the encryption to pass the information around. However, we missed mentioning a magic number to add to the UserId and ClientId before we encrypt. This is important for several reasons. One is that we want to be able to vary the tokens for the same UserId and ClientId, and we want to make it hard for a hacker to guess how we vary them. One way to do this, for example, would be to use the current time, such as in hhmmss format, along with an undisclosed constant increment.
Another way to generate the tokens without adding another number to the input is to rotate the UserId and ClientId: the string constituting the UserId and the ClientId is split and the two substrings exchanged in their positions. After decryption, we can swap them again to get the original string back. Since the client Ids are not expected to grow as large as the integer maximum, we can use the leftmost padding of the client Ids as the delimiter.
Another way to do this would be to use Luhn's algorithm, the one used to generate the check digit on credit card numbers. Here every second digit from the right in the original sequence is doubled (subtracting 9 whenever the result exceeds 9), all the digits are summed, and the sum multiplied by 9 and taken modulo 10 gives the check digit to append at the end.
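A small Python sketch of that check digit calculation, applied to the fixed-width UserId+ClientId string; the function and variable names are illustrative.

def luhn_check_digit(digits):
    total = 0
    # Double every second digit from the right; subtract 9 when the result exceeds 9.
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 0:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return (total * 9) % 10

# Append the digit to the fixed-width UserId+ClientId string before encrypting.
payload = f"{12345678901234:016d}{4321:08d}"
token_input = payload + str(luhn_check_digit(payload))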
No matter how we vary the input, we can base64 encode the encrypted result to produce the token.
The OAuth spec does not restrict the tokens to being a hash. The tokens could be anything. If they store information about the user and the client for validation during API calls, that is not restricted either; the OAuth spec itself mentions such possibilities.
When considering the tradeoffs between a hash and encrypted content for a token, the main caveat is whether the tokens are interpretable. If a malicious user can decrypt the tokens, that is a severe security vulnerability: tokens can then be forged and the OAuth server will compromise the protected resources. Since the token is the primary means to gain access to the protected resources of the user, there's no telling when a token might be forged and misused, or by whom. Tokens therefore need to be as tamper proof as the user's password.
If the tokens were not encrypted content but merely a hash that has no significance in terms of contents, and whose relevance is private to the server, then we are relying on a simpler security model. This means we don't have to keep upgrading the token-generating technology as more and more ways are discovered to break it. It does, however, add persistence to the OAuth server, so that these hashes can be tied back to the user and the client.
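A minimal sketch of that simpler model in Python: the token is random and opaque, and the server-side table (a dictionary here, standing in for the persisted token table) is the only way to tie it back to the user and client.

import secrets

token_table = {}   # stands in for the persisted token table

def grant_opaque_token(user_id, client_id):
    token = secrets.token_urlsafe(32)          # random, carries no information
    token_table[token] = (user_id, client_id)  # the server-side mapping is the only way back
    return token

def resolve(token):
    return token_table.get(token)              # None means unknown or revoked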
Even though I have been advocating a token database, and a hash is a convenient, simpler security model, I firmly believe that we need both a token table and tokens that carry the user and client information in a way that only the server can generate.
The simpler security model does not buy us the improvement in scale and performance in the API implementations where token validation is desirable. All API calls should have token validation. Some could have user validation, but all should have client validation.

Friday, September 27, 2013

In today's post I want to cover encryption and decryption of data we have talked about in the previous posts. For example, we wanted to generate an OAuth token based on information about the user and the client. So we wanted to do something like
Encrypt(UserID + ClientID) = Token
where UserID is a large integer and ClientID is a regular integer. The original text can therefore be 16 and 8 characters in length, which gives us 24 characters. We use fixed lengths for both UserID and ClientID and pad on the left. If we want to keep the size of the encrypted text the same as the original string, we could run AES in a stream mode. If we were to use stronger or block-padded algorithms, the size would likely bloat. And when you hex or base64 encode the result, the text grows again: hex encoding doubles it and base64 adds about a third.
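Here is a minimal Python sketch of the same idea, assuming the pyca/cryptography package; the key and nonce handling is for illustration only.

import base64
import os
from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

KEY = os.urandom(32)     # server-side secret, 256-bit AES key
NONCE = os.urandom(16)   # CTR nonce; reusing one nonce for every token is NOT safe in practice,
                         # it is fixed here only so the decrypt side stays stateless for the sketch

def encrypt_token(user_id, client_id):
    plaintext = f"{user_id:016d}{client_id:08d}".encode()   # 16 + 8 characters, left padded
    enc = Cipher(algorithms.AES(KEY), modes.CTR(NONCE)).encryptor()
    ciphertext = enc.update(plaintext) + enc.finalize()     # same length as plaintext in CTR mode
    return base64.urlsafe_b64encode(ciphertext).decode()

def decrypt_token(token):
    dec = Cipher(algorithms.AES(KEY), modes.CTR(NONCE)).decryptor()
    plain = (dec.update(base64.urlsafe_b64decode(token)) + dec.finalize()).decode()
    return int(plain[:16]), int(plain[16:])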
In the database, there is an easy way to encrypt the string using ENCRYPTBYKEY(key_GUID, cleartext) and to decrypt it with DECRYPTBYKEY:

CREATE CERTIFICATE Certificate01
   ENCRYPTION BY PASSWORD = 'adasdefjasdsad7fafd98s0'
   WITH SUBJECT = 'OAuth tokens for user and client',
   EXPIRY_DATE = '20201212';
GO

-- The symmetric key must exist before it can be opened; AES_256 is one reasonable choice.
CREATE SYMMETRIC KEY Token_Key_01
   WITH ALGORITHM = AES_256
   ENCRYPTION BY CERTIFICATE Certificate01;
GO

OPEN SYMMETRIC KEY Token_Key_01
   DECRYPTION BY CERTIFICATE Certificate01;

UPDATE Token
SET Token = EncryptByKey(Key_GUID('Token_Key_01'),
             Convert(Varchar(16), UserID) + Convert(Varchar(8), ClientID));
GO

I want to mention the choice of encryption algorithm. This could be AES, DES, 3DES and so on; a hash such as SHA-1 is one way and would not let us decrypt the token back into the user and client. The encryption could be either block based or stream based.
For our purposes, we want to keep the size of the tokens to a reasonable length. Since access tokens are likely passed around in the URI as a query parameter, the token should not be very long.
Moreover, decryption should be fast enough to serve as a quick, reasonable check on the tokens.
This way the storage of the tokens can be separated from the validation of the user and client against a token. The storage of the tokens is useful for auditing and the like. The data is always pushed by the token-granting endpoint. There is no pull required from the database if the API implementations can simply decrypt the token.
OAuth token database usage considerations
Here is a list of items to consider when provisioning a token table for all OAuth logins:
1) During the login, once the token is created, we have all the parameters that were used to create it. They have already been validated, which is why the token was created, so the data entry should be reasonably fast; no additional validation needs to occur.
2) Since this is an insert into a table that stores only the last hour of tokens, the table is not expected to grow arbitrarily large, so the performance of the insert should not suffer.
3) For the majority of the tokens issued, the user credentials will have been requested, so expect that the user Id will be available. The table should populate the user Ids.
4) When each API call is made that validates the token against the user and the client, the lookup should be fast. Since these lookups are based on hashes or API keys, those columns should be indexed.
5) During an API call we are only looking at a single token in the table, so other callers should not be affected, since the same client is expected to make the calls for that user. If another instance of the client is making calls for the same user, a different token is expected. So performance will not suffer, and there should be no chance of performance degradation between API calls.

Thursday, September 26, 2013

In the previous post I mentioned an archival policy for a token table. Today, during implementation, I used the following logic:
WHILE there exist records to be moved from active to archive
BEGIN
    SELECT the ids of a few candidates at a time
    INSERT the candidates that have not already been inserted into the archive
    DELETE these records from the active table
END
There are a couple of points that should be called out. For one, there is no user transaction scope involved. This is intentional. I don't want bulk inserts that can take an enormous amount of time compared to these small batches. Bulk operations are also subject to failures and are not effective in moving records when the bulk insert fails.
Similarly, a user transaction to cover all the steps is almost unnecessary when we can structure the operations so that each step checks its own work and that of the preceding steps. Those checks help with identifying failures and taking corrective action. Moreover, the user transaction only ties the operations together, while each operation is itself transacted. User transactions are typically used for larger data movements. By taking only a few records, say a handful at a time, checking that they are not actively in use and that they have not already been copied to the destination by a previous aborted run, and then deleting them from the source so that they do not come up again, we remove the need for a user transaction and for locks on a range of records. Besides, by keeping the number of records to a handful during the move, we don't have to join the full source and destination tables; we join only against the handful of records we are interested in, which tremendously improves query performance.
But how do we really guarantee that we are indeed moving the records without failures, and that this is done in a rolling manner until all records have been moved?
We do this by finding the records correctly.
How? We identify the set of records with the TOP keyword so that we take only the few we are interested in. Our search criterion is a predicate that does not change for these records, so the same records will match it again and again. Then we keep track of their IDs in a table variable that consists of a single column of IDs. So long as these records are not being actively used and our logic is the only one doing the move, we effectively own them. Then we take these records by their IDs and compare them with the source and the destination. Since we check for the undesirable cases, such as originals left behind in the source or duplicate inserts into the destination, we know the move is succeeding. Lastly, once we are done with the operation, we will not find these records again to move over, which guarantees that the process converges. We just repeat the process until all the records matching the criteria have been moved.
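Here is a runnable sketch of the same batched, idempotent move using Python and sqlite3; the table and column names are illustrative, and the archive table is assumed to have Id as its primary key so that duplicate inserts are ignored.

import sqlite3

BATCH = 100   # a handful of records per pass

def archive_expired(conn):
    while True:
        ids = [row[0] for row in conn.execute(
            "SELECT Id FROM TokenActive WHERE ExpiresAt < datetime('now') LIMIT ?", (BATCH,))]
        if not ids:
            break                      # nothing left matching the predicate; we are done
        marks = ",".join("?" * len(ids))
        # Copy only rows not already in the archive (tolerates a previously aborted run).
        conn.execute(f"INSERT OR IGNORE INTO TokenArchive "
                     f"SELECT * FROM TokenActive WHERE Id IN ({marks})", ids)
        # Remove the originals so the same rows are not picked up on the next pass.
        conn.execute(f"DELETE FROM TokenActive WHERE Id IN ({marks})", ids)
        conn.commit()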
This process is robust, clean, tolerant of failures, and resumable from any point in the run.