Saturday, September 28, 2013

In today's post I want to continue the discussion on generating tokens by encryption. We mentioned encrypting the UserId and ClientId and circulating the base64-encoded string as the token, relying on the strength of the encryption to pass the information around. However, we didn't mention adding a magic number to the UserId and ClientId before we encrypt. This is important for a couple of reasons: we want to be able to vary the tokens for the same UserId and ClientId, and we want to make it hard for an attacker to guess how we vary them. One way to do this, for example, would be to use the current time, say in hhmmss format, combined with an undisclosed constant increment.
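As a rough sketch of that idea in T-SQL, with purely illustrative values for the UserId, the ClientId and the constant increment (734 seconds here), the plaintext could be varied like this before it is handed to the encryption routine:

DECLARE @UserID VARCHAR(16) = RIGHT(REPLICATE('0', 16) + '1234567890', 16);
DECLARE @ClientID VARCHAR(8) = RIGHT(REPLICATE('0', 8) + '42', 8);
-- hhmmss of the current time shifted by the undisclosed constant increment
DECLARE @salt CHAR(6) = REPLACE(CONVERT(CHAR(8), DATEADD(SECOND, 734, GETUTCDATE()), 108), ':', '');
SELECT @UserID + @ClientID + @salt AS PlainTextToEncrypt;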
Another way to vary the tokens without adding another number to the input is to rotate the UserId and ClientId. The string constituting the UserId and the ClientId is split and the two substrings exchanged in their positions. After decryption, we swap them again to get the original string back. Since the client Ids are not expected to grow as large as the integer max, we can use the leftmost padding of the client Ids as the delimiter.
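A minimal sketch of that rotation, assuming the fixed-length, left-padded layout of sixteen characters for the UserID and eight for the ClientID (the values are illustrative):

DECLARE @plain VARCHAR(24) = RIGHT(REPLICATE('0', 16) + '1234567890', 16)
                           + RIGHT(REPLICATE('0', 8) + '42', 8);
-- swap the UserID and ClientID substrings before encrypting
DECLARE @rotated VARCHAR(24) = SUBSTRING(@plain, 17, 8) + SUBSTRING(@plain, 1, 16);
-- after decrypting, swap back to recover the original string
DECLARE @recovered VARCHAR(24) = SUBSTRING(@rotated, 9, 16) + SUBSTRING(@rotated, 1, 8);
SELECT @plain AS Original, @rotated AS Rotated, @recovered AS Recovered;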
Another way to do this would be to use the Luhn algorithm that is used to generate the check digit on credit card numbers. Here every second digit, starting from the rightmost, is doubled (subtracting 9 whenever the doubled value exceeds 9), all the digits are summed, and the sum is multiplied by 9 and taken modulo 10. This gives the check digit to add to the end.
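Here is a small sketch of that check digit computation in T-SQL; the padded input string is an illustrative assumption:

DECLARE @input VARCHAR(24) = '000000123456789000000042';  -- padded UserID + ClientID
DECLARE @sum INT = 0, @pos INT = 1, @i INT = LEN(@input), @digit INT;
WHILE @i >= 1
BEGIN
    SET @digit = CAST(SUBSTRING(@input, @i, 1) AS INT);
    IF @pos % 2 = 1  -- double every second digit, starting from the rightmost
    BEGIN
        SET @digit = @digit * 2;
        IF @digit > 9 SET @digit = @digit - 9;
    END
    SET @sum = @sum + @digit;
    SET @i = @i - 1;
    SET @pos = @pos + 1;
END
SELECT (@sum * 9) % 10 AS CheckDigit;  -- append this digit to the end of the plaintext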
No matter how we vary the token contents, we can create a base64-encoded token.
The OAuth spec does not restrict the tokens to being a hash. The tokens could be anything. If they store information about the user and the client for validation during API calls, that is not restricted either; the spec even mentions such possibilities.
When considering the tradeoffs between a hash and encrypted content for a token, the main caveat is whether the tokens are interpretable. If a malicious user can decrypt the tokens, that is a severe security vulnerability. Tokens can then be faked and the OAuth server will end up compromising the protected resources. Since the token is the primary means to gain access to the protected resources of the user, there's no telling when the token will be faked and misused, or by whom. Tokens therefore need to be as tamper-proof as the user's password.
If the tokens were not encrypted content but merely a hash that has no significance in terms of contents, with the relevance kept private to the server, then we are relying on a simpler security model. This means we don't have to keep upgrading the token generating technologies as more and more ways are discovered to break them. It does, however, add a persistence requirement to the OAuth server so that these hashes can be tied back to the user and the client.
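For comparison, a minimal sketch of such an opaque token, assuming a hypothetical Token table (sketched further below) that ties the value back to the user and the client:

DECLARE @UserID BIGINT = 1234567890, @ClientID INT = 42;  -- illustrative values
DECLARE @opaque VARBINARY(32) = CRYPT_GEN_RANDOM(32);     -- carries no user or client information
INSERT INTO Token (Token, UserID, ClientID, IssuedAt, ExpiresAt)
VALUES (@opaque, @UserID, @ClientID, GETUTCDATE(), DATEADD(HOUR, 1, GETUTCDATE()));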
Even though I have been advocating a token database, and a hash is a convenient, simpler security model, I firmly believe that we need both a token table and tokens that carry user and client information in a form that only the server can generate.
The simpler security model does not, by itself, buy us improvements in scale and performance for the API implementations, where token validation is desirable. All API calls should have token validation. Some could have user validation, but all should have client validation.

Friday, September 27, 2013

In today's post I want to cover encryption and decryption of data we have talked about in the previous posts. For example, we wanted to generate an OAuth token based on information about the user and the client. So we wanted to do something like
Encrypt(UserID + ClientID) = Token
where UserID is a large integer and ClientID is a regular integer. The original text can therefore be 16 and 8 characters in length, which gives us 24 characters. We use fixed lengths for both UserID and ClientID and pad left. If we want to keep the size of the encrypted text the same as the original string, we could run AES in a stream mode, which keeps the ciphertext the same length as the plaintext. If we were to use stronger algorithms or block modes with padding, the size would likely bloat. And when you hex- or base64-encode the result, the text grows again: hex doubles it, and base64 adds about a third.
In the database, there is an easy way to encrypt and decrypt the string using ENCRYPTBYKEY (which takes a key GUID) and DECRYPTBYKEY, for example:

CREATE CERTIFICATE Certificate01
   ENCRYPTION BY PASSWORD = 'adasdefjasdsad7fafd98s0'
   WITH SUBJECT = 'OAuth tokens for user and client',
   EXPIRY_DATE = '20201212';
GO

-- the symmetric key that actually encrypts the tokens, protected by the certificate
CREATE SYMMETRIC KEY Token_Key_01
   WITH ALGORITHM = AES_256
   ENCRYPTION BY CERTIFICATE Certificate01;
GO

OPEN SYMMETRIC KEY Token_Key_01
   DECRYPTION BY CERTIFICATE Certificate01
   WITH PASSWORD = 'adasdefjasdsad7fafd98s0';

-- the Token column is assumed to be varbinary, since ENCRYPTBYKEY returns varbinary
UPDATE Token
SET Token = EncryptByKey(Key_GUID('Token_Key_01'),
             RIGHT(REPLICATE('0', 16) + Convert(Varchar(16), UserID), 16)
           + RIGHT(REPLICATE('0', 8) + Convert(Varchar(8), ClientID), 8));

CLOSE SYMMETRIC KEY Token_Key_01;
GO

I want to mention the choice of encryption algorithm. This could be AES, DES, 3DES and so on (SHA1, by contrast, is a hashing algorithm rather than an encryption algorithm). The ciphers can be used in both block and stream modes.
For our purposes, we want to keep the tokens at a reasonable length. Since access tokens are likely passed around in the URI as a query parameter, they should not be very long.
Moreover, decryption should be fast enough to serve as a quick and reasonable check on the tokens.
This way the storage of the tokens can be separated from the validation of the user and client against a token. The storage of the token comes in useful for auditing and the like. The data is always pushed by the token granting endpoint. There is no pull required from the database if the API implementations can merely decrypt the token.
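As a sketch of that validation path, reusing the key and certificate from the example above (the @token variable and the 16 + 8 character layout are assumptions):

DECLARE @token VARBINARY(256);  -- the access token presented on the API call
OPEN SYMMETRIC KEY Token_Key_01
   DECRYPTION BY CERTIFICATE Certificate01
   WITH PASSWORD = 'adasdefjasdsad7fafd98s0';
DECLARE @plain VARCHAR(24) = CONVERT(VARCHAR(24), DECRYPTBYKEY(@token));
SELECT SUBSTRING(@plain, 1, 16) AS UserID, SUBSTRING(@plain, 17, 8) AS ClientID;
CLOSE SYMMETRIC KEY Token_Key_01;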
OAuth token database usage considerations
Here is a list of items to consider when provisioning a token table for all OAuth logins:
1) During the login, once the token is created, we have all the parameters that were used to create it. They have already been validated, which is why the token was created, so the data entry should be reasonably fast. There should be no additional validations required.
2) Since this is an insert into a table that stores only the last hour of tokens, the table is not expected to grow arbitrarily large, so the performance of the insert should not suffer.
3) For the majority of tokens issued, the user's credentials will have been requested, so expect the user Id to be available. The table should populate the user Ids.
4) When each API call is made that validates the token against the user and the client, the lookup should be fast. Since these lookups are based on hashes or API keys, those columns should be indexed (see the sketch after this list).
5) During the API call we are only looking at a single token in the table, so other callers should not be affected, since the same client is expected to make the calls for that user. If another instance of the client is making calls for the same user, a different token is expected. So performance will not suffer, and there should be little chance of performance degradation between API calls.
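A minimal sketch of such a token table, with the lookup column indexed as noted in item 4; all the names and sizes here are illustrative assumptions:

CREATE TABLE Token
(
    TokenId    INT IDENTITY(1,1) PRIMARY KEY,
    Token      VARBINARY(256) NOT NULL,  -- encrypted content or opaque hash
    UserID     BIGINT NULL,              -- populated whenever user credentials were presented
    ClientID   INT NOT NULL,
    IssuedAt   DATETIME NOT NULL DEFAULT GETUTCDATE(),
    ExpiresAt  DATETIME NOT NULL
);
-- API calls look tokens up by value, so index that column for fast validation
CREATE NONCLUSTERED INDEX IX_Token_Token
    ON Token (Token) INCLUDE (UserID, ClientID, ExpiresAt);
GO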



Thursday, September 26, 2013

In the previous post I mentioned an archival policy for a token table. Today during implementation I used the following logic:
WHILE EXISTS (records to be moved from active to archive)
BEGIN
    SELECT the ids of the candidates, a few at a time
    INSERT the candidates that have not already been inserted
    DELETE these records from the active table
END
There are a couple of points that should be called out. For one, there is no user transaction scope involved. This is intentional. I don't want bulk inserts, which can take an enormous amount of time compared to these small batches. Bulk inserts are also subject to failures, and they are not effective at moving records when the bulk insert fails.
Similarly, a user transaction to cover all cases is almost unnecessary when we can structure the operations so that there are checks in each step and in the preceding steps. The latter helps with identifying failures and taking corrective actions. Moreover, the user transaction only helps to tie the operations together while each operation is itself transacted. The user transaction is typically used with larger data movements. By taking only a few records, say a handful at a time, checking that they won't be actively used, checking that they have not already been copied to the destination by a previous aborted run, and deleting the records from the source so that they won't come up again, we remove the need for a user transaction and for taking locks on a range of records. Besides, by keeping the number of records to a handful during the move, we don't have to join the full source and destination tables and instead join only with the handful of records we are interested in. This tremendously improves query performance.
But how do we really guarantee that we are indeed moving the records without failures, and that this happens in a rolling manner until all records have been moved?
We do this by finding the records correctly.
How? We identify the set of records with the TOP keyword to take only the few records we are interested in. Our search criterion is a predicate that does not change for these records, so the same records will match the predicate again and again. Then we keep track of their IDs in a table variable that consists of a single column of these IDs. So long as these records are not being actively used and our logic is the only one doing the move, we effectively own these records. Then we take these records by their IDs and compare them with the source and the destination. Since we check for all the undesirable cases, such as originals still left behind in the source and duplicate inserts into the destination, we know that the move is succeeding. Lastly, once we are done with the operation, we will not find these records again to move over, which guarantees that the process works. We just have to repeat the process over and over again until all the records matching the criteria have been moved.
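Putting that together, here is a sketch of the loop in T-SQL, assuming hypothetical ActiveToken and ArchiveToken tables that share the same columns, with TokenId as the key (a plain column, not an identity, in the archive):

DECLARE @Batch TABLE (TokenId INT PRIMARY KEY);  -- single-column table variable holding the IDs
WHILE EXISTS (SELECT 1 FROM ActiveToken WHERE ExpiresAt < GETUTCDATE())
BEGIN
    DELETE FROM @Batch;
    -- take only the top few candidates; the predicate keeps matching them until they are moved
    INSERT INTO @Batch (TokenId)
    SELECT TOP (50) TokenId
    FROM ActiveToken
    WHERE ExpiresAt < GETUTCDATE();
    -- copy only the candidates not already present in the destination (covers a previously aborted run)
    INSERT INTO ArchiveToken
    SELECT a.*
    FROM ActiveToken a
    JOIN @Batch b ON b.TokenId = a.TokenId
    WHERE NOT EXISTS (SELECT 1 FROM ArchiveToken x WHERE x.TokenId = a.TokenId);
    -- remove the moved records from the source so they are not picked up again
    DELETE a
    FROM ActiveToken a
    JOIN @Batch b ON b.TokenId = a.TokenId;
END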
This process is robust, clean, tolerant of failures, and resumable from any point in the run.

Wednesday, September 25, 2013

In a previous post I showed a sample of the source index tool. Today we can talk about a web UI for a source index server. I started with an ASP.NET MVC4 WebAPI project and added a model and views for a search form and results display. The index built beforehand from a local enlistment of the source seems to work just fine.
There is a caveat though. The Tokenizer, or Analyzer as it is called in the Lucene.Net object model, is not able to tokenize some symbols. But this is customizable, and so long as we strip the preceding or succeeding delimiters such as '[', '{', '(', '=', ':', we should be able to use the syntax.
The size of the index is around a few MB for around 32000 source files with the basic tokenizer. So it is not expected to grow beyond a few GB of storage even if all the symbols were tokenized.
Since the indexing happens offline and the index can be rebuilt periodically or when the source has changed sufficiently, rebuilding does not affect the users of the source index server.
As an aside, for those following the previous posts on the OAuth token database, maintenance of the source index server is similar to maintenance of the token database. An archival policy helps keep the token database at a certain size because all expired tokens are archived and the token expiry time is usually one hour. Similarly, the index for the source index server can be periodically rebuilt and swapped with the existing index.
With regard to the analyzer, a StandardAnalyzer could be used instead of a simple Analyzer. It takes a parameter for different versions, and a more recent version could be used to better tokenize source code.
Declarative versus code-based security
Configuration files are very helpful for declaring all the settings that affect a running application. These changes are made without touching the code. Since security is configurable, security settings are often declared in the config file. In WCF, there are options to set the elements for different aspects of communication contracts in the configuration file as well as to set them via properties on objects in code. For example, a service binding has numerous parameters that are set in the configuration file.
In production, these settings can be overridden, which provides a least intrusive mechanism to resolve issues. This is critical in production because code or data changes require tremendous validation to prevent regressions, and that is time-consuming and costly.
Configuration files are also scoped to the appdomain, the machine, or the enterprise, so changes can be applied at the appropriate scope. Usually the application configuration file alone has a lot of settings.
Changes need not be applied directly to the configuration file. They can be applied externally via a user settings file, which overrides the existing configuration. This again is very desirable in production.
However, the settings can explode into a large number. When there are too many configuration settings, they are occasionally stored in a database, which adds an external repository for these settings.
Usually the settings are flat and easy to iterate over, while some involve their own sections.
Settings provided this way are easy to validate as well.
Besides, CM tools are available that can make the configuration changes across a number of servers.