Thursday, May 22, 2014

When we discussed radix sort, we saw that it works well for keys varying over a large range. If the keys fall in a small range, then bucket sort suffices. Bucket sort works something like this:
we make an array of r buckets
we make one pass through the records, inserting the record with key k into bucket k
we make one pass through the array of buckets, reading off the records in order as we go.
The idea behind bucket sort is that we split the range of key values into several buckets so that the records are distributed roughly uniformly across them. With auxiliary storage to hold the items of each bucket in a linked list, we rely on the fact that no one bucket will have a large number of elements, so sorting each bucket is inexpensive and the overall cost stays linear.
The algorithm is as follows:
void bucketSort(A, n)
{
    for (int i = 0; i < n; i++)
    {
        Insert(A[i], B[A[i]]);
    }
    for (int j = 0; j < r; j++)
    {
        InsertionSort(B[j]);
    }
    Concatenate(B);
}
The complexity of bucket sort is O(n + r).
In bucket sort we make an r-way decision at each step through indirect addressing, which is a significant optimization over the comparison model.
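As a minimal sketch in C, assuming integer keys in [0, r) and that the key is the whole record, so each bucket only needs to count occurrences (the bucket-of-lists version generalizes this to full records):

```c
#include <string.h>

#define R 256   /* r: size of the key range, an assumption of this sketch */

/* bucket_sort: sort n integer keys in [0, R) in O(n + R) time */
void bucket_sort(int *a, int n)
{
    int count[R];
    memset(count, 0, sizeof(count));
    for (int i = 0; i < n; i++)      /* one pass: drop each key into its bucket */
        count[a[i]]++;
    int k = 0;
    for (int b = 0; b < R; b++)      /* one pass over the buckets: read off in order */
        while (count[b]-- > 0)
            a[k++] = b;
}
```

The two passes mirror the steps above: one pass over the records, one pass over the buckets.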
Radix sort can work with a larger range of r, i.e., r can be of the order of 10 to 256.
In radix sort the complexity is slightly different. If the strings are of length L and we spend O(n) time at each level, then the complexity is O(nL). It could end a lot earlier due to empty buckets.
In most-significant-digit (or character) first radix sort, we distinguish the strings by just their first few characters. If we now ask how small L can be, given that we distinguish strings only by their first few characters, we are making an r-way split instead of a two-way split. The lower bound on the length is then log_r(n), and if all the strings are roughly this length, the complexity is O(n log_r(n)).
In least-significant-digit-first radix sort, the complexity is again different from most-significant first.
Here the running time is O(L(n + r)) because we concatenate L times, and at each stage the sort is a bucket sort costing O(n + r), or simply O(n) when r is a constant, giving an overall complexity of O(nL).
The advantage of LSD radix sort is that it doesn't need any recursion.
At implementation time we might want to pad all strings on the right. This makes the strings the same length, and the padding does not affect their comparison order, since the white space character has a lower value than the letter 'a'. Also, space-optimized tries may require slightly more processing to navigate the pointers for the keys.

Wednesday, May 21, 2014

Tonight we will look at a data structure called the trie.
A trie is an ordered tree data structure that can store a dynamic set or an associative array where the keys are usually strings. We search by prefixes, and hence tries are also referred to as prefix trees or radix trees. In a binary search tree, a node keeps the keys associated with that node. In a trie, this is not the case; what we use is the position of the node in the tree. This position is used to construct the prefix. All the descendants of a node share a common prefix of the string associated with that node, and the root is associated with the empty string. Values are normally not associated with every node, only with leaves and some inner nodes. This lends the trie to being stored in a space-optimized manner, in which form it is referred to as a compact prefix tree.
The term trie comes from retrieval. The use of prefixes implies that we can view the trie as a deterministic finite automaton without loops: the transitions from one node to another are forward only.
While characters can be used to represent the keys, this is not always necessary. Instead the keys can be stored as bits, in which case we have a bitwise trie. Fixed-size keys such as integers or memory addresses are convenient to store this way.
A trie is very often used as a replacement for a hash table. There are several advantages. First, lookup is faster because it is prefix based: we walk down from the root in O(m) time, where m is the length of the key. Insertions and deletions are cheap.
There is no chaining and consequently no collisions.
This comes in very useful for applications such as autocomplete and spellcheckers.
Tries can provide automatic ordering of keys which comes in useful for auto complete or predictive text. They also come in useful for approximation algorithms.
We could take the example of radix sort: tries are to radix sort what binary search trees are to quicksort.
In particular, each node has an array of size r of child pointers.  When we store words in a tree, we keep the letters on the edges and walk down the tree reading the words.
When we insert, we end each word with a $, because some strings are prefixes of others. To do a lookup, we just walk down the tree letter by letter and then check whether the node carries the $ that ends the word. If we reach a null pointer, we know the key is not there.
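A minimal sketch in C of this insert and lookup, assuming lowercase words and r = 26, with an is_word flag standing in for the $ end marker:

```c
#include <stdlib.h>

#define R 26   /* child pointers per node: one per lowercase letter */

struct TrieNode {
    struct TrieNode *child[R];
    int is_word;               /* plays the role of the terminating $ */
};

struct TrieNode *trie_new(void)
{
    return calloc(1, sizeof(struct TrieNode));
}

void trie_insert(struct TrieNode *root, const char *word)
{
    for (; *word; word++) {
        int c = *word - 'a';
        if (!root->child[c])
            root->child[c] = trie_new();
        root = root->child[c]; /* letters live on the edges as we walk down */
    }
    root->is_word = 1;
}

int trie_lookup(const struct TrieNode *root, const char *word)
{
    for (; *word; word++) {
        root = root->child[*word - 'a'];
        if (!root)             /* null pointer: the key is not there */
            return 0;
    }
    return root->is_word;      /* check the node carries the ending marker */
}
```

Lookup cost is O(m) in the key length m, independent of how many keys are stored.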
There are a couple of ways of making the trie more practical. In particular, we can:
compress paths that don't branch into a single edge
split only when needed.
These techniques optimize storage.
Radix sort comes in different flavors. For example, there's the most significant digit/character first and the least significant digit/character first radix sort.
In the most significant first radix sort, we sort by the most significant to get the buckets. Then we recursively sort each bucket, starting with the next most significant digit/character and so on.
In the least-significant first radix sort, we sort by the least significant digit/character to get the buckets. Then we concatenate them. This method is trickier to follow but it's easy to implement. The reason we can go from the least significant digit onwards is that at each stage (significance) of the sorting we preserve the results of the previous iteration, i.e., the sort is stable. We can prove by mathematical induction that the algorithm makes progress because of this stability.
void RadixSort(A, digits)
{
    for (int d = 1; d <= digits; d++)
        InsertionSort(A, d);   // any stable sort keyed on digit d will do
}

// digit(x, d) extracts the d-th least significant digit of x
void InsertionSort(A, d)
{
    for (int i = 1; i < n; i++)
    {
        int x = A[i];
        int j = i - 1;
        while (j >= 0 && digit(A[j], d) > digit(x, d))
        {
            A[j + 1] = A[j];
            j = j - 1;
        }
        A[j + 1] = x;   // insertion sort is stable, so earlier passes are preserved
    }
}
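The per-digit stable sort can also be a counting sort, which keeps each pass at O(n + r). A minimal sketch in C on non-negative integers in base 10 (the fixed output-buffer size is an assumption of this sketch):

```c
#include <string.h>

/* lsd_radix_sort: stable counting sort on each digit, least significant first */
void lsd_radix_sort(int *a, int n, int max_digits)
{
    int out[64];               /* scratch buffer; this sketch assumes n <= 64 */
    int divisor = 1;
    for (int d = 0; d < max_digits; d++, divisor *= 10) {
        int count[10] = {0};
        for (int i = 0; i < n; i++)           /* bucket counts for this digit */
            count[(a[i] / divisor) % 10]++;
        for (int b = 1; b < 10; b++)          /* prefix sums: bucket end positions */
            count[b] += count[b - 1];
        for (int i = n - 1; i >= 0; i--)      /* backward pass keeps the sort stable */
            out[--count[(a[i] / divisor) % 10]] = a[i];
        memcpy(a, out, n * sizeof(int));      /* "concatenate" the buckets */
    }
}
```

Each of the L passes costs O(n + r) with r = 10, giving the O(L(n + r)) bound discussed above.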

Let us discuss another Splunk application. We referred to a Splunk app that integrates Log Parser; we will discuss it a little more today.
Log Parser search queries, as we know, are SQL queries, whereas Splunk takes unix-like search operators. Moreover, Splunk exposes a query language for its indexed data, whereas Log Parser can work with a variety of data sources. They say data is sticky, meaning that users like to keep using the same applications they have been using for their data unless there is a compelling advantage or the cost and risk of moving the data is negligible. So that Splunk may work with data that belonged to Log Parser, let us look into translating queries to Log Parser, allowing us to avoid the data import.

I will return to this shortly, but first I will digress to discuss an interesting data structure called the trie.

Tuesday, May 20, 2014

We will look at a few more examples of this regular expression search. We looked at how match called matchhere, and matchhere in turn was recursive or called matchstar. We will now look at, say, grep functionality.
grep is implemented this way: we read strings from a file a buffer at a time and null-terminate the buffer. We run match over this buffer, increment the match count, print it, and continue.
In other words, grep is a continuous search of the match over the entire file, a buffer at a time.
Now suppose we want matchstar to implement the leftmost longest match, and suppose we were to implement the ^ and $ handling as well.
In this case, we will have to watch out for the following cases:
^ - match at the beginning
$ - match at the end
if both ^ and $ are present, then the literal must match the entire string, beginning to end.
if both ^ and $ are specified and there is no literal, then match even empty strings.
literals before ^ and after $ are not permitted.
Let us now look at a way to use regex for validating a phone number. Note that phone numbers can come in different forms some with brackets, others with or without hyphens and some missing the country code etc.
Given a ten digit US phone number, where the 3 digit area code can be optional, the regex would look something like this:
^(\(\d{3}\)|\d{3}[-]?)?\d{3}[-]?\d{4}$
Here the first character in the regex, the caret, anchors the match to the start of the string containing the phone number.
The first ( denotes a group, while \( denotes a literal parenthesis in the optional area code.
\d matches a digit, which must occur the number of times specified in the { } braces.
The | bar indicates that the area code may be present either with parentheses or without.
The closing ) characters match the ones opened above.
? makes the group optional, while $ matches the end of the line.

Monday, May 19, 2014

Regular expressions provide a compact and expressive notation for describing patterns of text. Their algorithms are interesting and are helpful in a variety of applications. They are credited with making scripting an alternative to programming. They come in several flavors, but almost all use wildcards, which are special notations.
Some common wildcards include the following
* for zero or more occurrences of the previous character
. to denote a single character
$ for the end of string
^ for the beginning of a string
^$ matches an empty string
[] to include specific sets of characters
+ for one or more occurrences
? for zero or one occurrence
\ as a prefix to quote a meta character
These building blocks are combined with parenthesis for grouping.

Even if we take just the four elementary regular expression wildcards ^, $, . and *, we can use them for a modest regular expression search function. Kernighan and Pike showed that with functions for match and matchhere this can be simple and easy. match determines whether a string matches a regular expression. matchhere checks if the text matches at each position in turn. As soon as we find a match, the search is over. If the expression begins with a ^, the text must begin with a match of the remainder of the pattern. Otherwise the characters are traversed one by one, invoking matchhere at each position. Even empty text must be examined, because * can match empty strings.
Example 1:
Regular Expressions

by Brian W. Kernighan and Rob Pike
/* match: search for re anywhere in text */
int match(char *re, char *text)
{
   if (re[0] == '^')
      return matchhere(re+1, text);
   do {  /* must look at empty string */
      if (matchhere(re, text))
         return 1;
   } while (*text++ != '\0');
   return 0;
}

The matchhere method searches for the regular expression at the beginning of the text. We first check for a null or empty pattern, then we check for an asterisk after the current character, in which case a run of text may be matched; such a run ends at the null terminator, at a character that doesn't match, or never starts if the expression character isn't a '.' or a match. We call this the matchstar method. Next we check for a $ at the end of the pattern against the null terminator of the text. Lastly we check for more text and call matchhere again.

The matchstar method matches zero or more instances by calling matchhere on the text one character at a time, until the prefix character doesn't match (unless that prefix is a dot, which matches any character) or until the end of the text is reached.

The matchhere method calls the matchstar method in one of the four cases it checks. For the other cases, it either returns a boolean corresponding to the end of the pattern or the end of the text being reached, or it calls itself recursively when the current character in the regular expression is a period or a literal match.
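For reference, the matchhere and matchstar routines described above, as given by Kernighan and Pike:

```c
int matchhere(char *re, char *text);
int matchstar(int c, char *re, char *text);

/* matchhere: search for re at beginning of text */
int matchhere(char *re, char *text)
{
    if (re[0] == '\0')
        return 1;
    if (re[1] == '*')
        return matchstar(re[0], re+2, text);
    if (re[0] == '$' && re[1] == '\0')
        return *text == '\0';
    if (*text != '\0' && (re[0] == '.' || re[0] == *text))
        return matchhere(re+1, text+1);
    return 0;
}

/* matchstar: search for c*re at beginning of text */
int matchstar(int c, char *re, char *text)
{
    do {    /* a * matches zero or more instances */
        if (matchhere(re, text))
            return 1;
    } while (*text != '\0' && (*text++ == c || c == '.'));
    return 0;
}
```

Together with the match function above, these four wildcard cases cover ^, $, . and *.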

We will look at a few more examples of this regular expression search.
In this post, we conclude how we can map raw function pointers to symbols offline. This means we no longer depend on debugging sessions to resolve a stack. When we encounter an unresolved stack trace for our executable, we simply pass the raw function pointers to our tool along with the load address of our executable, and each of these function pointers is then looked up based on its RVA. The RVA is the relative address of the function pointer from the load address.
RVA =  eip  - load address
if the stack trace was already available in the format executable name + offset, the offset translates to the RVA
the raw function pointers are easier to see in a crash log, and hence the tool takes the eip directly. Note that the eip points to the return address in a debugger; here we use it as the function pointer
 When we reviewed the win32 process startup stages, we saw how the process was initialized and what constitutes its address space. This helped us calculate the offsets.
The PDB is laid out in contiguous memory regions identified by section numbers. Each section has an address range, and addresses are found as offsets from a given section. This is different from the relative virtual address, which indicates a position from the load address. The function pointer in an unresolved stack is specifically an offset equal to the RVA from the load address. We use this to find the symbols. All symbols have type information associated with them. We filter the result to the type Function because that's what we are after.
Note that the symbols can be found by different methods such as the two lookups we have mentioned above. At the same time we can also iterate over all the symbols to dump the information based on tables. Since the PDB keeps track of all information regarding symbols in tables, we can exhaustively scan the tables for information on all symbols.
Dumping all the information on the symbols helps to get the information in text format which can then be searched for RVA, offset, section number or name of a function.
The latter approach is quite useful when we have heavy text parsing utilities.



Sunday, May 18, 2014

Today I want to try out the dia2dump sample.
        IDiaSymbol* pFunc; // initialized elsewhere
:
        DWORD seg = 0;
        DWORD offset = 0;
        DWORD sect = 0;
        g_pDiaSession->findSymbolByAddr( sect, offset, SymTagFunction, &pFunc );
or

        BSTR bstrName;

        if (pCompiland->get_name(&bstrName) != S_OK) {
            wprintf(L"(???)\n\n");
        }
        else {
            wprintf(L"%s\n\n", bstrName);
            SysFreeString(bstrName);
        }

And we specify the load address for the executable file that corresponds to the symbols in this symbol store.

HRESULT put_loadAddress (
   ULONGLONG retVal
);

It is important to call this method when you get an IDiaSession object and before you start using the object.

A section contrib is defined as  a contiguous block of memory contributed to the image by a compiland.
A segment is a portion of the address space. A section contrib can map to segments.

An offset is the difference between a given raw instruction pointer and the load address of the process.

If you don't know the section and the offset, you can put the entire address as an offset from the load address in the offset and specify section as zero.
or you can iterate over the section numbers
eg:
/Users/rrajamani/Downloads/dia2dump.txt:Function: [00447B60][0001:00446B60] ServerConfig::getSSLConfig(public: struct ssl_config __cdecl ServerConfig::getSSLConfig(void) __ptr64)

Here the RVA is 00447B60  [ eip - Process Load Address ]
         the Segment is 0001
         the offset is 00446B60





        DWORD64 dwDisplacement = 0;
        DWORD64 dwAddress = _wcstoui64(argv[i], NULL, 16);
        DWORD64 dwRVA = dwAddress - dwLoadAddress;
        long displacement = 0;
        IDiaSymbol* pFunc = 0;
        error = (DWORD)g_pDiaSession->findSymbolByRVAEx(dwRVA, SymTagFunction, &pFunc, &displacement);

        if (!error && pFunc)
        {
            BSTR bstrName;

            if (pFunc->get_name(&bstrName) != S_OK) {
                wprintf(L"(???)\n\n");
            }
            else {
                wprintf(L"%s \n\n", bstrName);
                if (displacement)
                    wprintf(L"+ 0x%x \n\n", displacement);
                else
                    wprintf(L" \n\n");
                SysFreeString(bstrName);
            }
        }