Cluster computing

Tuesday, May 20, 2014

We will look at a few more examples of this regular expression search. We looked at how match called matchhere, and how matchhere in turn either called itself recursively or called matchstar. We will now look at grep functionality.
grep is implemented this way: we read strings from a file a buffer at a time and null-terminate the buffer. We run match over this buffer and, on a match, increment the match count, print the buffer, and continue.
In other words, grep is a repeated application of match over the entire file, a buffer at a time.
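The loop described above can be sketched roughly as follows. This is a minimal sketch, not the original grep: the names grep_stream, match_fn and literal_match are hypothetical, the buffer is filled one line at a time with fgets (which null-terminates it), and literal_match is only a plain substring stand-in for the match function discussed in these posts.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical matcher signature: returns 1 when re matches somewhere in text. */
typedef int (*match_fn)(const char *re, const char *text);

/* Stand-in matcher for this sketch only: a plain substring test. */
int literal_match(const char *re, const char *text)
{
    return strstr(text, re) != NULL;
}

/* Scan a stream one buffer at a time (one line, at most BUFSIZ-1 chars).
   Each matching buffer is counted and, if out is non-NULL, printed. */
int grep_stream(FILE *in, const char *re, match_fn matcher, FILE *out)
{
    char buf[BUFSIZ];
    int nmatch = 0;

    while (fgets(buf, sizeof buf, in) != NULL) {   /* fgets null-terminates buf */
        size_t n = strlen(buf);
        if (n > 0 && buf[n - 1] == '\n')
            buf[n - 1] = '\0';                     /* drop the trailing newline */
        if (matcher(re, buf)) {
            nmatch++;
            if (out != NULL)
                fprintf(out, "%s\n", buf);
        }
    }
    return nmatch;
}
```

Passing the matcher as a function pointer keeps the loop independent of the regular expression code, so the real match can be swapped in for literal_match.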
Now if we want the matchstar to implement the leftmost longest search for
Suppose we were to implement the ^ and the $ anchors as well.
In this case, we will have to watch out for the following cases:
^ - match at the beginning
$ - match at the end
if both ^ and $ are specified, then match only when the whole string equals the literal, so that the literal spans from the beginning to the end.
if both ^ and $ are specified and there is no literal, then match the empty string.
literals before ^ and after $ are not permitted.
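The cases above can be sketched for the simplest situation, where the pattern is just an optional ^, a literal, and an optional $. The function name match_anchored and its return convention (1 for a match, 0 for no match, -1 for a misplaced anchor) are assumptions made for this illustration only.

```c
#include <string.h>

/* Match a pattern of the form [^]literal[$] against text.
   Returns 1 on a match, 0 on no match, -1 if an anchor is misplaced. */
int match_anchored(const char *pat, const char *text)
{
    size_t plen = strlen(pat);
    size_t tlen = strlen(text);
    size_t i;
    int bol, eol;

    bol = (plen > 0 && pat[0] == '^');          /* anchored at the beginning */
    if (bol) { pat++; plen--; }
    eol = (plen > 0 && pat[plen - 1] == '$');   /* anchored at the end */
    if (eol) plen--;

    for (i = 0; i < plen; i++)                  /* no anchors inside the literal */
        if (pat[i] == '^' || pat[i] == '$')
            return -1;

    if (plen > tlen)
        return 0;
    if (bol && eol)                             /* whole string must equal the literal */
        return tlen == plen && memcmp(pat, text, plen) == 0;
    if (bol)                                    /* literal must be a prefix */
        return memcmp(pat, text, plen) == 0;
    if (eol)                                    /* literal must be a suffix */
        return memcmp(pat, text + tlen - plen, plen) == 0;
    for (i = 0; i + plen <= tlen; i++)          /* unanchored: search every position */
        if (memcmp(pat, text + i, plen) == 0)
            return 1;
    return 0;
}
```

With this sketch the pattern ^$ matches only the empty string, and a pattern such as a^b is rejected because the anchor is not at the edge.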
Let us now look at a way to use regex for validating a phone number. Note that phone numbers can come in different forms: some with brackets, others with or without hyphens, and some missing the country code.
Given a ten-digit US phone number, where the three-digit area code can be optional, the regex would look something like this:
^(\(\d{3}\)|^\d{3}[-]?)?\d{3}[-]?\d{4}$
Here the first character in the regex, and the caret after the vertical bar, denote the start of the line containing the phone number.
The first ( opens a group, while \( denotes the literal parenthesis in an optional area code.
\d matches a digit, repeated as many times as specified in the { } braces.
The | bar indicates that the area code may be present either with parentheses or without.
Then there are characters to close the ones opened above.
? makes the group optional while $ matches the end of a line.
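As a rough check of this pattern, here is a sketch using the POSIX regcomp/regexec functions from regex.h. Two adaptations are assumptions of this sketch: POSIX extended syntax has no \d, so [0-9] is used instead, and the redundant caret after the vertical bar is dropped. The function name is_us_phone is hypothetical.

```c
#include <regex.h>
#include <stddef.h>

/* Returns 1 if s looks like a US phone number with an optional area code,
   0 if it does not, and -1 if the pattern fails to compile. */
int is_us_phone(const char *s)
{
    /* Optional area code, "(ddd)" or "ddd" with an optional hyphen,
       then ddd, an optional hyphen, and dddd. */
    static const char *pattern =
        "^(\\([0-9]{3}\\)|[0-9]{3}-?)?[0-9]{3}-?[0-9]{4}$";
    regex_t re;
    int rc;

    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    rc = regexec(&re, s, 0, NULL, 0);   /* 0 means the string matched */
    regfree(&re);
    return rc == 0;
}
```

Compiling the pattern on every call keeps the sketch short; a real validator would likely compile it once and reuse it.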

Monday, May 19, 2014

Regular expressions provide a compact and expressive notation for describing patterns of text. Their algorithms are interesting and are helpful in a variety of applications. They are credited with making scripting an alternative to programming. They come in several flavors, but almost all use wildcards, which are special notations.
Some common wildcards include the following:
* for zero or more occurrences of the preceding character
. to denote a single character
$ for the end of string
^ for the beginning of a string
^$ matches an empty string
[] to include specific sets of characters
+ for one or more occurrences
? for zero or one occurrence
\ as a prefix to quote a meta character
These building blocks are combined with parentheses for grouping.

Even if we take only the four elementary regular expression wildcards ^, $, . and *, we can use those for a modest regular expression search function. Kernighan and Pike showed that with functions for match and matchhere this can be simple and easy. match determines whether a string matches a regular expression anywhere in the text. matchhere checks whether the text matches the expression starting at a given position. As soon as we find a match, the search is over. If the expression begins with a ^, the text must begin with a match of the remainder of the pattern. Otherwise the characters are traversed one by one, invoking matchhere at each position. Even empty text needs one attempt, because * can match the empty string.

Example 1:

Regular Expressions

by Brian W. Kernighan and Rob Pike

/* match: search for re anywhere in text */
int match(char *re, char *text)
{
    if (re[0] == '^')
        return matchhere(re+1, text);
    do { /* must look at empty string */
        if (matchhere(re, text))
            return 1;
    } while (*text++ != '\0');
    return 0;
}

The matchhere method searches for the regular expression at the beginning of the text. We first check for the end of the regular expression, in which case everything has matched. Then we check whether the next pattern character is an asterisk, and if so we call the matchstar method to match zero or more occurrences of the current character. Next we check for a $ at the end of the expression, which matches only if the text has also reached its null terminator. Lastly, if there is more text and the current pattern character is either a '.' or the same literal as the current text character, we call matchhere recursively on the rest of the pattern and the rest of the text.

The matchstar method matches zero or more instances of a character by calling matchhere on the remaining pattern at successive positions in the text. It advances one character at a time for as long as the text character equals the starred character (or the starred character is a dot, which means any character), and stops when a character doesn't match or the end of the text is reached.

The matchhere method calls the matchstar method in one of the four cases that it checks. For the other cases, it either returns a boolean corresponding to end of pattern reached or end of text reached, or it calls itself recursively if the current character in the regular expression is a period or a literal match.
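The description above corresponds closely to the matcher published by Kernighan and Pike. The sketch below follows that code, with const added to the parameters as a small liberty; match is repeated only so that the block compiles on its own.

```c
int matchhere(const char *re, const char *text);
int matchstar(int c, const char *re, const char *text);

/* match: search for re anywhere in text */
int match(const char *re, const char *text)
{
    if (re[0] == '^')
        return matchhere(re + 1, text);
    do {    /* must look even if the text is empty */
        if (matchhere(re, text))
            return 1;
    } while (*text++ != '\0');
    return 0;
}

/* matchhere: search for re at the beginning of text */
int matchhere(const char *re, const char *text)
{
    if (re[0] == '\0')                  /* end of pattern: everything matched */
        return 1;
    if (re[1] == '*')                   /* starred character */
        return matchstar(re[0], re + 2, text);
    if (re[0] == '$' && re[1] == '\0')  /* $ at the end of the pattern */
        return *text == '\0';
    if (*text != '\0' && (re[0] == '.' || re[0] == *text))
        return matchhere(re + 1, text + 1);
    return 0;
}

/* matchstar: search for c*re at the beginning of text */
int matchstar(int c, const char *re, const char *text)
{
    do {    /* a * matches zero or more instances */
        if (matchhere(re, text))
            return 1;
    } while (*text != '\0' && (*text++ == c || c == '.'));
    return 0;
}
```

For instance, match("^ab*c$", "abbbc") succeeds, while match("ab$", "abc") fails because the text continues past the point where $ requires it to end.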

We will look at a few more examples of this regular expression search.

In this post, we conclude how we can map raw function pointers to symbols offline. This means we no longer depend on debugging sessions to resolve a stack. When we encounter a stack trace for our executable that is not resolved, we simply pass the raw function pointers to our tool along with the load address of our executable, and each of these function pointers is then looked up based on its RVA. The RVA is the relative address of the function pointer from the load address.
RVA = eip - load address
If the stack trace is already available in the format executable name + offset, the offset translates to the RVA.
The raw function pointers are easier to see in a crash log and hence the tool takes the eip directly. Note that in a debugger the eip recorded for a frame points to the return address; here we refer to it as the function pointer.
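The RVA computation above is a single subtraction. A minimal sketch, assuming both addresses arrive as hexadecimal strings (as they would from a crash log or a command line); the function name rva_from_hex and the sample addresses in the usage note are hypothetical.

```c
#include <stdint.h>
#include <stdlib.h>

/* Parse a raw instruction pointer and a load address given as hex strings
   and store ip - load in *rva. Returns 0 on success, -1 on bad input. */
int rva_from_hex(const char *ip_hex, const char *load_hex, uint64_t *rva)
{
    char *end = NULL;
    uint64_t ip, load;

    ip = strtoull(ip_hex, &end, 16);
    if (end == ip_hex || *end != '\0')      /* nothing parsed, or trailing junk */
        return -1;
    load = strtoull(load_hex, &end, 16);
    if (end == load_hex || *end != '\0')
        return -1;
    if (ip < load)                          /* pointer is not inside this image */
        return -1;
    *rva = ip - load;                       /* RVA = eip - load address */
    return 0;
}
```

For example, an instruction pointer of 0x140447B60 with a load address of 0x140000000 gives an RVA of 0x447B60.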
When we reviewed the win32 process startup stages, we saw how the process was initialized and what constitutes its address space. This helped us calculate the offsets.
The PDB is laid out in contiguous regions identified by section numbers. Each section has an address range, and addresses are expressed as an offset from the start of a given section. This is different from the Relative Virtual Address, which indicates a position relative to the load address. Each function pointer in an unresolved stack lies at an offset equal to its RVA from the load address. We use this to find the symbols. All symbols have type information associated with them. We filter the result to only the type Function because that's what we are after.
Note that the symbols can be found by different methods such as the two lookups we have mentioned above. At the same time we can also iterate over all the symbols to dump the information based on tables. Since the PDB keeps track of all information regarding symbols in tables, we can exhaustively scan the tables for information on all symbols.
Dumping all the information on the symbols helps to get the information in text format which can then be searched for RVA, offset, section number or name of a function.
The latter approach is quite useful when we have heavy text parsing utilities.

Sunday, May 18, 2014

Today I want to try out the dia2dump sample.
IDiaSymbol* pFunc; // initialized elsewhere
:
DWORD seg = 0;
DWORD offset = 0;
DWORD sect = 0;
g_pDiaSession->findSymbolByAddr( sect, offset, SymTagFunction, &pFunc );
or

BSTR bstrName;
if (pCompiland->get_name(&bstrName) != S_OK) {
    wprintf(L"(???)\n\n");
}
else {
    wprintf(L"%s\n\n", bstrName);
    SysFreeString(bstrName);
}

And we specify the load address for the executable file that corresponds to the symbols in this symbol store.

HRESULT put_loadAddress (
ULONGLONG retVal
);

It is important to call this method when you get an IDiaSession object and before you start using the object.

A section contrib is defined as a contiguous block of memory contributed to the image by a compiland.
A segment is a portion of the address space. A section contrib can map to segments.

An offset is the difference between a given raw instruction pointer and the load address of the process.

If you don't know the section and the offset, you can pass the entire address, as an offset from the load address, in the offset parameter and specify the section as zero,
or you can iterate over the section numbers.
e.g.:
/Users/rrajamani/Downloads/dia2dump.txt:Function: [00447B60][0001:00446B60] ServerConfig::getSSLConfig(public: struct ssl_config __cdecl ServerConfig::getSSLConfig(void) __ptr64)

Here the RVA is 00447B60 [ eip - Process Load Address ]
   the Segment is 0001
   the offset is 00446B60
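Since the dump is plain text, the three fields can be pulled out of such a line with sscanf. This is a sketch that assumes every function line has the shape shown above (Function: [RVA][section:offset] name); the DumpEntry type and the parse_dump_line name are hypothetical.

```c
#include <stdio.h>
#include <string.h>

/* Fields of one "Function:" line from the text dump. */
typedef struct {
    unsigned rva;       /* relative virtual address */
    unsigned section;   /* section (segment) number */
    unsigned offset;    /* offset within that section */
    char name[256];     /* function name up to the first '(' */
} DumpEntry;

/* Returns 1 and fills *out if line contains a well-formed Function entry,
   otherwise returns 0. Any prefix before "Function:" is skipped. */
int parse_dump_line(const char *line, DumpEntry *out)
{
    const char *p = strstr(line, "Function: [");
    if (p == NULL)
        return 0;
    return sscanf(p, "Function: [%x][%x:%x] %255[^(\n]",
                  &out->rva, &out->section, &out->offset, out->name) == 4;
}
```

For the sample line, the parsed RVA minus the parsed offset is 0x1000, which is consistent with section 0001 starting 0x1000 bytes past the load address.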

DWORD64 dwDisplacement = 0;
DWORD64 dwAddress = _wcstoui64(argv[i], NULL, 16);   // raw instruction pointer from the command line
DWORD64 dwRVA = dwAddress - dwLoadAddress;           // RVA = eip - load address
long displacement = 0;                               // bytes past the start of the function
IDiaSymbol* pFunc = 0;
error = (DWORD)g_pDiaSession->findSymbolByRVAEx((DWORD)dwRVA, SymTagFunction, &pFunc, &displacement );
if (!error && pFunc)
{
    BSTR bstrName;
    if (pFunc->get_name(&bstrName) != S_OK) {
        wprintf(L"(???)\n\n");
    }
    else {
        wprintf(L"%s \n\n", bstrName);
        if (displacement)
            wprintf(L"+ 0x%lx \n\n", displacement);
        else
            wprintf(L" \n\n");
        SysFreeString(bstrName);
    }
    pFunc->Release();   // release the symbol once we are done with it
}

Saturday, May 17, 2014

A look at the Win32 virtual address space for a process and its creation.
Let's begin by reviewing the steps involved in creating a process using a call such as CreateProcess().
First, the image file (.exe) to be executed is opened inside the creating process (caller).
Second, the executive process object is created.
Third, the initial thread (stack, context and executive thread object) is created.
Then the Win32 subsystem is notified of the new process so that it can set up for the new process and thread.
Then the calling process can start execution of the initial thread (unless the Create Suspended flag was specified) and return.
In the context of the new process and thread, the initialization of the address space is completed (such as the load of the required DLLs) and then execution at the entry point to the image is started (the main subroutine).
It's probably relevant to mention here that in the first stage of loading the image, we find out what kind of application it is. If it's a Win32 application, it can be executed directly. If it's an OS/2 application, we run Os2.exe. If it's POSIX, we run Posix.exe, and if it's MS-DOS, we run it in the MS-DOS virtual machine (Ntvdm.exe).
In the second stage, we create the executive process object. This involves the following substages:
Setting up the EPROCESS block
Creating the initial process address space
Creating the kernel process block
Concluding the setup of the process address space
Setting up the PEB
and completing the setup of the executive process object.
In the substages above, the one involving the creation of the initial Process Address Space will consist of three pages:
the Page directory
the Hyperspace page and
the Working set list
In the sixth stage, which performs process initialization in the context of the new process, the image begins execution in user mode. This is done by creating a trap frame that specifies the previous mode as user and the address to return to as the main entry point of the image. Thus, when the trap that causes the thread to start execution in kernel mode is dismissed, it begins execution in user mode at the entry point.
Inside the x86 process virtual address space starting from low to high memory, the kernel portion at the high memory consists of HAL usage sections, crash dump information, nonpaged pool expansion, system PTEs, paged pool, system cache, system working set list, etc. The user mode consists of hyperspace and process working set list, page tables and page directory, system PTEs, mapped views and system code. The kernel and the user mode addresses are separated by unused areas. In Linux, we have a memory mapped region in this space. The kernel space is reserved in the higher memory. The stack grows downwards adjacent to kernel space. The heap grows upwards from the user space.
Note that the instruction set is static and so we have

    if (pFunc->get_addressSection ( &seg ) == S_OK &&
        pFunc->get_addressOffset ( &offset ) == S_OK)
    {
        pFunc->get_length ( &length );
        pSession->findLinesByAddr( seg, offset, static_cast<DWORD>( length ), &pEnum );
    }

IDiaSymbol* pFunc;
pSession->findSymbolByAddr( isect, offset, SymTagFunction, &pFunc );

Today we will discuss how to use the DIA interfaces for reading the PDB, in contrast to the approach in our previous post.
MSDN lists the following steps:

//initialize

CComPtr<IDiaDataSource> pSource;
HRESULT hr = CoCreateInstance( CLSID_DiaSource,
                               NULL,
                               CLSCTX_INPROC_SERVER,
                               __uuidof( IDiaDataSource ),
                               (void **) &pSource);

if (FAILED(hr))
{
    Fatal("Could not CoCreate CLSID_DiaSource. Register msdia80.dll." );
}

//load

wchar_t wszFilename[ _MAX_PATH ];
mbstowcs( wszFilename, szFilename, sizeof( wszFilename )/sizeof( wszFilename[0] ) );
if ( FAILED( pSource->loadDataFromPdb( wszFilename ) ) )
{
    if ( FAILED( pSource->loadDataForExe( wszFilename, NULL, NULL ) ) )
    {
        Fatal( "loadDataFromPdb/Exe" );
    }
}

CComPtr<IDiaSession> psession;
if ( FAILED( pSource->openSession( &psession ) ) ) 
{
    Fatal( "openSession" );
}

CComPtr<IDiaSymbol> pglobal;
if ( FAILED( psession->get_globalScope( &pglobal) ) )
{
    Fatal( "get_globalScope" );
}

CComPtr<IDiaEnumTables> pTables;
if ( FAILED( psession->getEnumTables( &pTables ) ) )
{
    Fatal( "getEnumTables" );
}

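CComPtr< IDiaTable > pTable;
ULONG celt = 0; // number of tables returned by each call to Next
while ( SUCCEEDED( hr = pTables->Next( 1, &pTable, &celt ) ) && celt == 1 )
{
     // Do something with each IDiaTable.
     pTable = NULL; // release this table before fetching the next one
}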

And the following lines indicate how to translate an address to function lines:

void dumpFunctionLines( IDiaSymbol* pSymbol, IDiaSession* pSession )
{
    ULONGLONG length = 0;
    DWORD isect = 0;
    DWORD offset = 0;
    pSymbol->get_addressSection( &isect );
    pSymbol->get_addressOffset( &offset );
    pSymbol->get_length( &length );
    if ( isect != 0 && length > 0 ) {
        CComPtr< IDiaEnumLineNumbers > pLines;
        if ( SUCCEEDED( pSession->findLinesByAddr( isect, offset, static_cast<DWORD>( length ), &pLines ) ) ) {
            CComPtr< IDiaLineNumber > pLine;

Thursday, May 15, 2014

Today I want to talk about auditing in the context of software. Auditing is not only for security but also for compliance and governance. Typically, audit records are hashed or signed to create a trustworthy audit trail. When different accounts access a protected resource, an audit trail can reconstruct the timeline of actors and their operations. Signing is done with the help of two keys, a private key and a public key. The private key is used to encrypt (sign) a message so that it cannot be altered unnoticed. Anyone with the public key can decrypt it, though. Of course the public key has to be shared in a trustworthy way, and hence a certification authority is involved as an intermediary. A receiver with the public key can then decrypt the message back to its original content.
Unlike other product features, auditing has a special requirement. It is said that incomplete security is worse than no security. The main concern here is how to audit events and sign them such that we can tell they have not been tampered with. Auditing explains 'who' did 'what' 'when'. As such it must be pristine. If there is tampering, it lays waste to all the information gathered so far. Moreover, information for a specific event may be critical. If this information is tampered with, there could be a different consequence than the one intended. Even though audit events are generally not actionable except for regulations and compliance, they are another data point and are often demanded from software systems. Hence auditing is as important a feature as any, if not more. To add to the requirements of auditing, security events may themselves be recorded through auditing. Information about actions, actors and subjects are also events, and as such they are subject to the same signing and hashing that auditing applies.

One of the concerns with the sensitivity of the information is addressed by preventing any changes, such as update or delete, to an audited event. This helps guarantee that the data has not been tampered with. But can it be enforced? What mechanism can we put in place for this?
Some techniques include hashing. A hashing function such as SHA256 will generate the same hash for the same input. Moreover, the input cannot feasibly be recovered from its hash. Given a hash and the original from different sources, we can check whether they correspond.
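The check described above (store a digest separately, recompute it later, compare) can be sketched as follows. An important assumption: standard C has no SHA256, so this sketch swaps in 64-bit FNV-1a, which is NOT a cryptographic hash and only stands in to show the flow; a real audit trail would call SHA256 from a cryptographic library. The names audit_digest and audit_verify are hypothetical.

```c
#include <stdint.h>

/* 64-bit FNV-1a over a null-terminated record.
   Non-cryptographic stand-in for SHA256, used only to illustrate the flow. */
uint64_t audit_digest(const char *record)
{
    const unsigned char *p;
    uint64_t h = 14695981039346656037ULL;   /* FNV offset basis */

    for (p = (const unsigned char *)record; *p != '\0'; p++) {
        h ^= *p;
        h *= 1099511628211ULL;              /* FNV prime */
    }
    return h;
}

/* Returns 1 if the record still produces the digest stored at audit time,
   0 if the record (or the stored digest) has changed. */
int audit_verify(const char *record, uint64_t stored_digest)
{
    return audit_digest(record) == stored_digest;
}
```

The digest would be kept somewhere the writer of the audit record cannot modify, so that a later mismatch points to tampering with the record.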
Another technique is to use the public key cryptography mentioned above. A public key and a private key can be used to enable only a selected set of receivers to understand the message. The public key is distributed while the private key is secured. When the public key is handed out, it must be trusted to come from the sender, which is why it is distributed as a certificate.
If we would like to verify that the certificate did come from the sender, we can reach the certificate authority. The private key is left to the issuing party to secure. The keys can be generated with tools, but it's left to the party to secure them. That said, if the users of the software do not generally interact with the keys, or are admins themselves, then this is less severe.
In general it's good to be a little paranoid about systems when it comes to security.