Thursday, June 6, 2013

Intermediate code:
The grammar of an intermediate language (IL) treats a program as a sequence of instructions. An instruction is one of:
1) a label or marker for jump statements
2) an assignment to a variable
3) a unary operator applied to an atomic expression, with the result stored in a variable
4) a binary operator applied to a variable and an atomic expression, with the result stored in a variable
5) a transfer from memory to a variable
6) a transfer from a variable to memory
7) a jump to a label
8) a conditional selection between two labels
9) a function call (note that procedure calls are treated the same as function calls; arguments and results are variables)
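As an illustration (the instruction syntax here is made up, not Mogensen's exact notation), a statement like x := a + b*c followed by a conditional call might become a flat sequence exercising several of the instruction kinds above:

```
t1 := b * c                 ; binary operation, result in a variable
x  := a + t1                ; binary operation, result in a variable
IF x > 0 THEN l1 ELSE l2    ; conditional selection between two labels
LABEL l1                    ; label for jump statements
x  := CALL f(x)             ; function call, argument and result are variables
LABEL l2
```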
The compiler tries to do as much work as possible at compile time rather than deferring the same work to runtime; work moved to compile time is paid for once instead of at every execution.
Now, getting back to the conversion to the intermediate language, let us look at how to translate expressions. The main challenge is that an expression is structured as a tree, while the intermediate language is flat: the result of every operation is stored in a variable, and every non-constant argument is stored in another variable. A translation function is used to describe the overall technique. This function returns the code as a synthesized attribute. It translates variables and functions to the names that correspond to them in the intermediate language. The symbol tables are passed as inherited attributes to the translation function. In addition, attributes are used to decide where to store the values evaluated from sub-expressions. There are two ways to do this:
1) the location of the values of a sub-expression can be passed up as a synthesized attribute to a parent expression which can decide where to place its own value
2) the parent expression can decide where it wants to find the values of its sub-expressions and pass this information down as inherited attributes.
Method 1 doesn't have to generate code, as it can simply return the name of the variable that already holds the value. However, this only works under the assumption that the variable isn't updated before the value is used by the parent expression. If the expression contains function calls, those calls may have side effects that invalidate this assumption. Method 2 doesn't have this problem, since the value is created immediately before the assignment is executed. In addition, this method can generate code that calculates the expression result directly into the desired variable instead of going through temporary variables.
An inherited attribute, say 'place', names the IL variable used to store the result of the expression.
The translation function uses a switch case construct for the type of expression and evaluates this way:
1) if the expression is a constant, its value is stored in place
2) if the expression is a variable, the equivalent IL variable is found in vtable and copied to place
3) if it's a unary operation, a new IL variable is generated to hold the value of the operation's argument; the argument is translated using the newly generated variable as its place, and then an IL operation assigns the result to the inherited place. The ++ operator concatenates the two lists of instructions.
4) if it's a binary operation, two new IL variables are generated to hold the values of the arguments, the arguments are translated, and finally a binary operation in the intermediate language assigns the final result to the inherited place
5) if it's a function call, the arguments are translated first, then a function call is generated with the result assigned to the inherited place. The name of the function is looked up in ftable for the corresponding name in the IL. (text from Mogensen)
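The switch on expression kinds described above can be sketched in Python. This is a minimal sketch, not Mogensen's actual code: the AST tuples, the newvar helper and the textual IL instruction format are all illustrative assumptions.

```python
import itertools

_counter = itertools.count(1)

def newvar():
    # Generate a fresh IL variable for intermediate results.
    return f"t{next(_counter)}"

def trans_exp(exp, vtable, ftable, place):
    """Translate an AST tuple into a list of IL instructions that
    leave the expression's value in the inherited attribute `place`."""
    kind = exp[0]
    if kind == "num":                      # 1) constant: store its value in place
        return [f"{place} := {exp[1]}"]
    if kind == "id":                       # 2) variable: copy the IL name to place
        return [f"{place} := {vtable[exp[1]]}"]
    if kind == "unop":                     # 3) e.g. ("unop", "-", arg)
        t = newvar()
        code = trans_exp(exp[2], vtable, ftable, t)
        return code + [f"{place} := {exp[1]} {t}"]       # list concatenation plays the role of ++
    if kind == "binop":                    # 4) e.g. ("binop", "+", e1, e2)
        t1, t2 = newvar(), newvar()
        code1 = trans_exp(exp[2], vtable, ftable, t1)
        code2 = trans_exp(exp[3], vtable, ftable, t2)
        return code1 + code2 + [f"{place} := {t1} {exp[1]} {t2}"]
    if kind == "call":                     # 5) e.g. ("call", "f", [args...])
        temps, code = [], []
        for arg in exp[2]:
            t = newvar()
            code += trans_exp(arg, vtable, ftable, t)
            temps.append(t)
        return code + [f"{place} := CALL {ftable[exp[1]]}({', '.join(temps)})"]
    raise ValueError(f"unknown expression kind: {kind}")
```

For example, translating x + 3 with x bound to the IL variable v0 and place r yields the flat sequence t1 := v0, t2 := 3, r := t1 + t2, matching the inherited-place scheme of method 2.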

In the gcc compiler, you can specify a -march parameter that indicates the instruction set the compiler may use, and separately a -mtune=cpu-type parameter that indicates the processor for which the generated code should be optimized.
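For example (both flags are documented in the GCC manual; the source file name is only a placeholder):

```shell
# Restrict code generation to the baseline x86-64 instruction set,
# but schedule the generated code for the CPU of the build machine.
gcc -march=x86-64 -mtune=native -O2 prog.c -o prog
```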

intermediate code

Compilers generate intermediate code before translating to machine code. This breaks down the compiler into smaller jobs and has the following advantages:
1) since machine code varies from one machine architecture to another, each high level language needs only one translation, to the intermediate code, rather than one per target machine; with n source languages and m target machines, this means n front ends plus m back ends instead of n x m full compilers
2) programs written in several high level languages can share the same translation from intermediate code to machine code
3) the intermediate language need not be translated to machine code at all; it can instead be interpreted by an external program. If the interpreter for the intermediate language is written in a language for which there are existing compilers on the target machines, then there are even more advantages:
1) no actual backend, i.e. translation to machine code, needs to be written for new machines
2) a compiled program can be redistributed for all machines in a single intermediate form
3) the intermediate form may be more compact than machine code.
The intermediate language should make it easy to translate from the high level language to the intermediate language and from the intermediate language to machine code. Furthermore, the intermediate language should enable optimizations.
Intermediate languages are also assessed based on their "graininess", i.e. whether an operation corresponds to a large or a small amount of work. The operations can be coarse grained for easy interpretation, with each translated into a sequence of machine code instructions. On the other hand, the operations can be fine grained, such that several intermediate code instructions combine into one machine code instruction.
(text from Mogensen)

Wednesday, June 5, 2013

type checking a program

The previous post talked about treating the program as a list of functions. The program is type correct when all of its functions are type correct. The functions are type checked using a symbol table in which every function is bound to its type. This takes two passes over the list of functions: one to build the symbol table and another to check the function definitions against it. With an additional helper function that extracts the pair of a function's declared name and type (the type consisting of the types of the arguments and the type of the result), we can build the symbol table and then check the function definitions.
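The two passes can be sketched in Python. This is only a sketch: the tuple shape of a function definition, the helper names, and the stand-in expression checker (handling just constants and identifiers) are all assumptions for illustration.

```python
def check_exp(exp, vtable, ftable):
    # Minimal stand-in expression checker: constants are int,
    # identifiers get their type from vtable.
    if exp[0] == "num":
        return "int"
    if exp[0] == "id":
        return vtable[exp[1]]
    raise TypeError(f"unhandled expression kind {exp[0]}")

def get_fun(fundef):
    # Helper: pair a function's declared name with its type,
    # i.e. (argument types, result type), read off the declaration.
    name, params, result_type, _body = fundef
    return name, ([t for _n, t in params], result_type)

def check_fun(fundef, ftable):
    # Build vtable from the parameters, then match the body's type
    # against the declared result type.
    name, params, result_type, body = fundef
    vtable = dict(params)
    if check_exp(body, vtable, ftable) != result_type:
        raise TypeError(f"body of {name} does not match declared result type")

def check_program(funs):
    """Type check a program given as a list of function definitions."""
    # Pass 1: build the function symbol table from the declarations.
    ftable = {}
    for f in funs:
        name, ftype = get_fun(f)
        if name in ftable:
            raise TypeError(f"function {name} declared twice")
        ftable[name] = ftype
    # Pass 2: check every function definition against the table.
    for f in funs:
        check_fun(f, ftable)
```

A program such as [("identity", [("x", "int")], "int", ("id", "x"))] passes both passes, while declaring the result type as bool for the same body raises a type error in pass 2.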
Assignments, data structures and overloading can also be type-checked. When a variable is given a value by an assignment, its type can be checked. Whether the variable had a previously assigned value or if the assigned value is never used are additional checks that can be performed but they are not based on the structural information that type checking depends on.
To check data structures such as structs, or a value that may be of different types, the type checker uses a data structure similar to the one used by the abstract syntax trees of declarations. When the structured data is modified by operations that have well defined types for arguments and result, then they can be type checked just like functions.
When methods are overloaded, the same name is used for several different operations. They can all be tried one after the other and the expected type and arguments can be used to pick the matching one.
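The try-each-candidate scheme for overloading can be sketched as follows. This is an illustrative sketch only: it matches on argument types alone, though the expected result type could be used to disambiguate further, and the candidate representation is an assumption.

```python
def resolve_overload(candidates, arg_types):
    """Try each declared signature in turn and pick the one whose
    argument types match. `candidates` is a list of
    (argument types, result type) pairs for one overloaded name."""
    for params, result in candidates:
        if list(params) == list(arg_types):
            return result
    raise TypeError(f"no overload matches argument types {arg_types}")
```

For an overloaded + with signatures (int, int) -> int and (float, float) -> float, resolving against float arguments selects the float version.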
When a function is defined with a polymorphic type, the type checker can insert the actual types at every use of the generic / polymorphic type. This is different from type checking overloaded methods in that the function can be instantiated at any type, not just at a fixed list of declared alternatives.
If the program uses implicit types, a type inference algorithm gathers information about the uses of functions and variables with which it infers the types.
Thus we have seen type checking for expressions, function declarations and programs. Type checking enforces static type correctness and adds to the checks for the correctness of a program. This kind of checking is often not context free, and hence it is done conceptually after lexing and parsing, though it could be interleaved with the syntax analysis.
Also, the translator may exploit or depend on type information, in which case type checking may be combined with the actual translation.
Some other constructs of a language also add to type safety. For example, virtual functions, templates and static_casts all improve type checking.
(text from Mogensen) 

type checking

Variables and functions are bound to their types in a symbol table, which we use in type checking. Let us take an example where variables and functions have separate namespaces; then we will need two symbol tables. Let us say variables are bound to integers and booleans only. Functions are bound to the types of their arguments and the type of their result. The symbol tables for variables and functions are inherited attributes for expressions. The type of the expression is returned as a synthesized attribute. We generally use an abstract syntax tree for representation. To retrieve the attributes, we assume there are predefined functions for each terminal type. For each non-terminal, we define one or more functions that take an abstract syntax subtree and inherited attributes as arguments and return the synthesized attributes. Let us assume the symbol table for variables is given by the parameter vtable and the symbol table for functions by the parameter ftable. Let us say the function error reports a type error. Since more than one error could be reported, error reports and then returns rather than aborting; the type checker guesses a type for the offending expression and continues checking.
The type checking function evaluates based on the expression, vtable and ftable as parameters. Based on a switch case construct for the expression, we do the following:
1) if the expression is a numeric constant, its type is int; if it is an identifier, its type is looked up in vtable via its name, and if the name is not bound, an error is reported
2) if the expression is an addition, we call the type checker on each operand, and if both are int we return int, else we call error
3) if the expression is a comparison, we check the operands as in 2) and return a boolean
4) if the expression is an if-then-else branch, the condition is checked to be a bool and the two branches are checked to have the same type, which is returned; otherwise we report an error
5) if the expression is a function call, the function name is looked up in ftable for the number and types of its arguments and its result type. If the arguments match, the resulting type is the return type; if the function is not found in ftable, or the arguments don't match, an error is reported
6) a let expression declares a new variable, and the return type is that of the body expression. The symbol table is extended using bind, and the extended table is used for checking the body expression and finding its type.
This is how we type check expressions.
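The case analysis above can be sketched in Python. This is a hedged sketch, not Mogensen's code: the AST tuples are illustrative, and error here prints a message, guesses a type, and lets checking continue, as described above.

```python
def check_exp(exp, vtable, ftable):
    """Return the type of `exp` by cases, guessing a type on error
    so that checking can continue and report further errors."""
    def error(msg, guess):
        print("type error:", msg)
        return guess

    kind = exp[0]
    if kind == "num":                           # numeric constant
        return "int"
    if kind == "id":                            # identifier: look up in vtable
        if exp[1] in vtable:
            return vtable[exp[1]]
        return error(f"unknown variable {exp[1]}", "int")
    if kind == "plus":                          # addition: both operands int
        t1 = check_exp(exp[1], vtable, ftable)
        t2 = check_exp(exp[2], vtable, ftable)
        return "int" if t1 == t2 == "int" else error("non-int operand to +", "int")
    if kind == "equal":                         # comparison: operands agree, result bool
        t1 = check_exp(exp[1], vtable, ftable)
        t2 = check_exp(exp[2], vtable, ftable)
        return "bool" if t1 == t2 else error("comparison of different types", "bool")
    if kind == "if":                            # condition bool, branches agree
        if check_exp(exp[1], vtable, ftable) != "bool":
            return error("condition is not boolean", "int")
        t1 = check_exp(exp[2], vtable, ftable)
        t2 = check_exp(exp[3], vtable, ftable)
        return t1 if t1 == t2 else error("branches differ in type", t1)
    if kind == "call":                          # function call checked against ftable
        if exp[1] not in ftable:
            return error(f"unknown function {exp[1]}", "int")
        params, result = ftable[exp[1]]
        args = [check_exp(a, vtable, ftable) for a in exp[2]]
        if args != list(params):
            return error(f"bad arguments to {exp[1]}", result)
        return result
    if kind == "let":                           # let id = e1 in e2: extend vtable
        t1 = check_exp(exp[2], vtable, ftable)
        return check_exp(exp[3], dict(vtable, **{exp[1]: t1}), ftable)
    raise ValueError(f"unknown expression kind: {kind}")
```

For instance, checking if 1 = 2 then 1 else 2 yields int, while let y = 1 in y shows the vtable being extended by bind before the body is checked.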
For function declarations, type checking is done by building a symbol table for the variables (the parameters) and using it to match the result type of the function against the type of the body. The type-check function for function declarations returns no value; it just checks for errors.
Similarly, a program can be considered a list of functions and is deemed type-correct if all the functions are type correct.
(text from Mogensen)
 

Tuesday, June 4, 2013

Symbol tables continued

A simple imperative symbol table is used for imperative languages. It is usually implemented as a stack. Such a table should support the following:
1) empty : an empty table to begin with
2) bind : each new name-object pair is pushed onto the stack
3) lookup : the stack is searched from top to bottom for a match
4) enter : the current top of the stack is noted (a marker is pushed)
5) exit : the previous top-of-stack marker is recalled and becomes the current top
An efficiency issue common to all of the symbol table implementations mentioned is that lookup requires a linear search. A common solution is hashing, where the names are hashed to integers. However, that complicates the enter and exit functions, because the hash table would have to be updated across all its buckets at every scope entry and exit. Instead, an auxiliary stack is maintained to record all updates to the table, so that they can be undone in time proportional to the number of updates made in the local scope.
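The five operations above can be sketched as a small Python class. This is a plain linear-search sketch (no hashing), with the scope marker implemented as a sentinel pushed on the same stack; the class and method names are illustrative.

```python
class SymbolTable:
    """Stack-based imperative symbol table: one stack records both
    bindings and scope markers."""
    _MARKER = object()          # sentinel pushed at each scope entry

    def __init__(self):         # 1) empty: start with an empty table
        self._stack = []

    def bind(self, name, obj):  # 2) bind: push a new name-object pair
        self._stack.append((name, obj))

    def lookup(self, name):     # 3) lookup: search from top to bottom
        for entry in reversed(self._stack):
            if entry is not self._MARKER and entry[0] == name:
                return entry[1]
        raise KeyError(name)

    def enter(self):            # 4) enter: note the current top of stack
        self._stack.append(self._MARKER)

    def exit(self):             # 5) exit: pop back to the previous marker
        while self._stack.pop() is not self._MARKER:
            pass
```

Binding x in an outer scope, entering a new scope and rebinding x makes lookup find the inner binding; after exit, the outer binding is visible again.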

Separate name spaces : Functions and variables can have separate name spaces, which means defining a name in one space doesn't affect the same name in another. Name spaces can also be shared, and sharing can exist between any subsets of the name spaces. Separate name spaces are easily implemented using one symbol table per name space, while shared name spaces share a single table. Separate name spaces can also share a single symbol table if name-space indicators are added to the names. A name-space indicator can be a textual prefix to the name or a tag paired with the name. In either case, a lookup matches both the name-space indicator and the name.

If we look at the objects in a program, some of them share namespaces. Types, functions and procedures, and variables share the same namespace. Labels, data constructors and modules don't share name spaces.

Type checking is enforced either in a separate phase immediately after the syntax analysis, or interleaved with it. Actual translation often depends on type information, so type checking may be combined with the translation. The checking phase involves several passes over the abstract syntax tree, and each pass is a recursive walk that either gathers information for later passes or uses the information gathered in earlier ones. Such information is called attributes. There are two kinds of attributes of the syntax tree: synthesized attributes are passed upwards in the syntax tree, from the leaves up to the root; inherited attributes are conversely passed downwards. Attributes synthesized in one subtree may be inherited in another subtree, or by the same subtree in a different pass. When declarations are recursive, the scope may be the same syntax tree as the declaration itself; for example, the symbol table is synthesized by the declaration and inherited by the scope of the declaration. ( text from Mogensen )
Finding stack frames from dump has multi level solutions.
1) Manually walk through the dump. This method reads the dump file header, lists the number of streams, finds the exception stream, reads the context and finds the stack pointer. Next it iterates through the streams to find the memory list stream, dumps the list of memory ranges, finds the range corresponding to the stack pointer, reads the memory pages for that range, and dumps the memory at the stack pointer to get the stack frames. For each of the stack frames, verify that the source corresponding to the current entry makes a call to the next frame. For any errors in completing the operations above, display a message that the stack frames could not be fully resolved. To find the source corresponding to each entry, go to the stream with the module list information and, for each stack frame, resolve the module, file, function and offset. This is exactly what a debugger would do. The benefit of walking through the dump manually instead of via a debugger is that this can be an in-memory, stream-based operation: dump files come in a ZipArchive, and there are methods on ZipArchive to read the files as streams without extracting them, whereas the debugger and its SDK do not support reading a stream. This is not a marginal benefit, because when we are dealing with thousands of dumps we are not wasting time copying large files over the network, maintaining the lifetime of the copies, or keeping track of archived locations in our system. This is efficient and doable, but expensive to re-implement outside the debugger.
2) Use the debugger SDK. This assumes we programmatically call the debugger proxy to read the stack frames for us, operating on a dump that we have extracted to a local copy. The SDK has the benefit that it can be imported directly into a PowerShell cmdlet, improving the automation that is desired for reading the dumps. The caveat is that the SDK requires full trust, and either registration in the GAC of the system on which it runs or a directive to skip verification; the latter is AppDomain-based, so we need to do it early on. This is not a problem in a test automation environment. Further, a dedicated service can be written that takes as input the location of each dump and reads the stack trace using the pre-loaded debugger SDK. In addition to portability, using the SDK has the advantage that exception handling and propagation are easier, since everything stays in-process. Moreover, the SDK comes with definitions of stack frames and validation logic that obviate the string-based parsing and re-interpretation required when invoking a shell debugger. At this level of solution, the changes are not as expensive, and we reuse the debugger without having to rewrite the functionality we need from that layer.
3) Use a singleton service that watches a folder for new dumps, reads them using a debugger process or the layer mentioned just above, and stores the stack frames in a data store accessible to all. The service abstracts the implementation and provides APIs that can be used by different clients; automation clients can call these APIs directly for their tasks. This approach has the advantage of providing a single point of maintenance for all the usages.

Monday, June 3, 2013

Symbol tables

Objects such as variables, functions and types are named. At its declaration, a name is made a synonym for the object; this is called binding. The binding is visible within a block of code, which is the scope of the object. Such declarations are local; if the scope is the entire program, they are global. If there are nested scopes, the declaration closest to the usage, as it appears in the syntax tree, is used. Scoping based on the syntax tree is called static or lexical binding. In dynamic binding, the declaration most recently encountered during execution defines the current use of the name.
A symbol table is a table that binds names to objects. We start with an empty table. We need to bind a name to an object; in case the name is already defined in the symbol table, the new binding takes precedence. We should be able to look up a name in the symbol table to find the object it is bound to. We also need to be able to enter a new scope and to exit a scope, reverting the table to what it was prior to entry.
Symbol tables can be implemented using a persistent or functional data structure. In this method, a copy is made of the data structure whenever an operation updates it, hence preserving the old structure. Only a portion of the data structure needs to be copied; the rest can be shared. There is another approach, called the imperative approach, in which, when an update is made, the old binding of any name that is overwritten is recorded (pushed) on a stack. When a new scope is entered, a marker is pushed on the stack. When the scope is exited, the bindings on the stack are used to revert to the earlier symbol table. (text from Mogensen)
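The persistent approach can be sketched in Python. Note the simplification: copying a dict duplicates the whole table, whereas a real persistent structure (such as a balanced tree) would share most of the old structure and copy only a path; the function names here are illustrative.

```python
def empty_table():
    # Persistent (functional) symbol table: every update returns a new
    # table and leaves the old one untouched, so exiting a scope means
    # simply resuming use of the table saved at scope entry.
    return {}

def bind(table, name, obj):
    new = dict(table)   # copy on update; the old table is preserved
    new[name] = obj     # a new binding shadows any earlier one
    return new

def lookup(table, name):
    return table[name]
```

After binding x in an outer table and rebinding it in an inner one, the inner table sees the new binding while the outer table still holds the old one, so no explicit undo stack is needed.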