Wednesday, February 27, 2013

A comparison of cloud hosting for web applications

There are quite a few metrics to evaluate and compare cloud hosting as provided by different vendors such as Amazon, Azure, Rackspace, Uhuru. Some of these are listed here:
1) Setting up the environment. Cloud hosting service providers may offer storage, application hosting, web site hosting and many other features. In order to create a cloud environment, one may need to specify one or more of these choices. Amazon and Azure, for instance, provide various environment options. Some make it easy to deploy run-of-the-mill web applications, while others serve more customized needs such as different forms of storage - tables, blobs, databases, et al.
2) Deployment. Application deployment is a repeated practice and hence the UI for deployment should consider ease of use. Time taken for the deployment, or to go live, is also an important consideration. In this case, it's not just the application developer who wants to be able to bounce the production server and look at the stats, but also the customers who might have to wait while the application is still deploying. Application deployment is probably the single most repeated interaction between a cloud service provider and the application developer.
3) Application deployment options are another big differentiator between the vendors. Some vendors allow for specifying the region where the servers are located. Some also allow configuring the network and the DNS registration of the application. Some allow remote desktop to the server itself. Some others allow for configuring security on the servers and the application. Some allow more than one way of uploading the software, for example from packages or via source-control-based deployment.
4) Another important consideration is the variety of web applications supported by the service provider. For example, some vendors may allow .Net application hosting, others may allow PHP, Ruby on Rails, etc. Deployments for these might also require different servers to be cut as VM slices - different both in terms of operating system and hosting stack.
5) Ease of use for the end-to-end flow. This is probably the single most important factor in making these services popular. In that respect, the Uhuru web application hosting experience is a breeze. However, I haven't looked into .Net application deployment there.


Text Search and keyword lookup

Keyword lookup is a form of search which includes cases where the topic words for the provided text may not be part of the text. This is usually done with some semantic dictionary or thesaurus and lookups against it. Unfortunately, such a thesaurus is specific to the domain for which the text is written. As an example, let's take a look at the keywords used for the internet. Search engine optimization professionals use keyword research to find and research the actual search terms people enter into search engines when conducting a search. They look up the relevancy of the keywords to a particular site. They find the ranks of the keywords from various search engines for competing sites. They buy sample campaigns for the keywords from search engines and they predict the click-through rate. Although search engine optimization is not related to automated keyword discovery, it serves to indicate that relevancy and choice of keywords is a complex task and concordance alone is not enough. Additional metadata in the form of distance vectors, connectivity between words, or their hierarchy in WordNet is required. Building a better dictionary or thesaurus for the domain is one consideration. Selecting the metrics to evaluate the keywords is another. Using the statistics for the keywords encountered so far is a third consideration.

Tuesday, February 26, 2013

FullText revisited

Let's take a look at full text search again. Fulltext is about indexing and searching. Fulltext is executed on a document or a full text database. Full text search is differentiated from searches based on metadata, or on parts of the original texts represented in the databases, because it tries to match all of the words in all the documents for the pattern mentioned by the user. It builds a concordance of all the words ever encountered and then executes the search on this catalog. The catalog is refreshed in a background task. Words may be stemmed or filtered before being pushed into the database. This also means that there can be many false positives for a search, an expression used to denote the results that are returned but not relevant to the intended search. Clustering techniques based on Bayesian algorithms can help reduce false positives.
Depending on the occurrences of words relevant to the categories, a search term can be placed in one or more of the categories. There is a pair of metrics used to describe search results - precision and recall. Recall measures the fraction of the relevant results that are actually returned by a search, and precision measures the fraction of the returned results that are relevant. Some of the tools aim to improve querying so as to improve the relevancy of the results. These tools utilize different forms of searches such as keyword search, field-restricted search, boolean queries, phrase search, concept search, concordance search, proximity search, regular expression, fuzzy search and wildcard search.
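As a quick illustration of these two metrics, here is a minimal C# sketch (my own construction, not taken from any particular search library) that computes precision and recall from sets of retrieved and relevant document ids:

using System.Collections.Generic;
using System.Linq;

static class SearchMetrics
{
    // Precision: fraction of retrieved documents that are relevant.
    // Recall: fraction of relevant documents that were retrieved.
    public static (double Precision, double Recall) Evaluate(ISet<int> retrieved, ISet<int> relevant)
    {
        int truePositives = retrieved.Count(relevant.Contains);
        double precision = retrieved.Count == 0 ? 0 : (double)truePositives / retrieved.Count;
        double recall = relevant.Count == 0 ? 0 : (double)truePositives / relevant.Count;
        return (precision, recall);
    }
}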

Forensics

Forensic Science has a number of examples of statistical approaches to text search. These include
Factor analysis:
This method describes variability among observed, correlated variables in terms of a potentially lower number of unobserved variables called factors. The variations in three or four observed variables mainly reflect the variations in fewer unobserved variables. Factor analysis searches for such joint variations in response to unobserved latent variables.
Bayesian statistics: This method uses degrees of belief, also called Bayesian probabilities, to express a model of the world. These are based on interpretation using probabilities.
Poisson Distribution: This is used to model the number of events occurring within a fixed interval of time.
Multivariate analysis: This involves observation and analysis of more than one statistical outcome variable at a time.
Discriminant function analysis: This involves predicting a grouping variable based on one or more independent variables.
Cusum analysis: This is a cumulative sum method that tracks the frequency of two- or three-letter words in each sentence, a habit of the writer, and thus can be used to detect tampered sections.
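As a rough illustration of the cusum idea (my own sketch, not a forensic tool), the following C# accumulates the deviation of a per-sentence feature, such as the proportion of two- and three-letter words, from its mean; abrupt changes in the cumulative sum hint at sections whose style differs from the rest of the text:

using System.Linq;

static class Cusum
{
    // values[i] could be, for example, the fraction of two- and three-letter words in sentence i.
    public static double[] Compute(double[] values)
    {
        double mean = values.Average();
        var cusum = new double[values.Length];
        double running = 0;
        for (int i = 0; i < values.Length; i++)
        {
            running += values[i] - mean;   // accumulate deviation from the mean
            cusum[i] = running;
        }
        return cusum;
    }
}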
 

Sunday, February 24, 2013

Integrated access to multiple data sources

Large organizations typically have several databases, and users may want to access data from more than one source. For example, an organization can have one data store for the product catalog, also called master data, another for billing and payments, and yet another for reporting. These databases may contain some common information, but determining the exact relationship between tables in different databases can be tough. For example, prices in one database might be dollars per dozen items and in another might be dollars per item. XML DTDs offer the promise that such semantic mismatches can be avoided if all parties conform to a single standard DTD. However, there are many legacy databases and most domains do not yet have an agreed-upon DTD. Semantic mismatches can be resolved and hidden from users by using relational views over the tables from the two databases. Defining a collection of views to give users a uniform presentation of relevant data from multiple databases is called semantic integration. The task of defining these views for semantic integration can be challenging when there is little or no documentation for the existing databases.
If the underlying databases are managed by different DBMSs, some kind of middleware may be used to evaluate queries over the integrating views, retrieving data at query execution time. Alternatively, the integrating views can be materialized and stored in a data warehouse. Queries can then be run over the warehoused data instead of the source DBMSs at run time.

new venture

http://indexer.cloudapp.net

schema design 2

Normal forms are guidance for avoiding known problems during schema design. Given a schema, whether we decompose it into smaller schemas is determined by these normal forms. The normal forms are based on functional dependencies and are first normal form, second normal form, third normal form and Boyce-Codd normal form. Each of these forms has increasingly restrictive requirements. A relation is in first normal form if every field contains atomic values and not lists or sets. 2NF is mainly of historical interest. 3NF and BCNF are important from a design standpoint. Boyce-Codd normal form holds for a relation R if, for every FD X->A that holds over R, one of the following statements is true:
A belongs to X : that is, it is a trivial FD or
X is a superkey
Third normal form holds when for every FD X-> A that holds over R, one of the following statements is true:
A belongs to X : that is, it is a trivial FD or
X is a superkey or
A is part of some key for R

Saturday, February 23, 2013

Setting up a website

If you want to create a website for yourself, here are some of the things you need to do:
1) Your company must have a name and logo. This is central to the theme on any page that displays information about your company.
2) Your company website must explain the company's value proposition in a simple and clear manner to the user from the home page. A picture, a slogan or a paragraph that conveys this to the user should be on the front page of your website.
3) Your company website must elicit user attention with achievements, partners or better yet examples of real life usage.
4) Your company website must have business contact information if users are expected to make contact.
5) Your company must hold the copyright to all the information on the website.
 

Document parsing

Structured text is very valuable for identifying topics or keywords in a text. Word documents provide markup for such information, and this can be useful for finding topics. Word documents can be parsed to retrieve the table of contents or the structure, and this can be used to divide the text into sections that can then be treated as unstructured. Content that is not text but has titles or captions to go with it should be treated the same as headings for section text. These are also candidates for document indexing. Thus an improvement to indexing unstructured text is to add logic to extract the document structure and utilize the layout of the information presented by the user. The data structure used for capturing this layout for the purposes of indexing could have elements holding references to sections, their type and location; these elements are but a list populated from document parsing. Each element of this list will be treated the same way as unstructured text of any length.
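A minimal C# sketch of such a layout-capturing data structure might look like the following (the type and property names are hypothetical, chosen only for illustration):

using System.Collections.Generic;

// One entry per heading, caption or titled block discovered while parsing the document.
public class SectionElement
{
    public string Title { get; set; }   // heading or caption text
    public string Kind { get; set; }    // e.g. "heading", "table caption", "figure caption"
    public int Offset { get; set; }     // character offset where the section body starts
    public int Length { get; set; }     // length of the section body
}

public class DocumentLayout
{
    // Populated during parsing; each element's body is then indexed as unstructured text.
    public List<SectionElement> Sections { get; } = new List<SectionElement>();
}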
 

Friday, February 22, 2013

schema design

Schema refinement in database design:
Good schema design relies on decomposition. One goal of decomposition is to eliminate redundancy. Redundancy causes several problems: storage is repeated, updates may not make changes to all copies in a consistent manner, insertion may be dependent on other items, and deletion may not be possible without losing other information as well.
Several normal forms have been proposed for relations and if a schema conforms to one of these, it can avoid certain kinds of problems. A property of decomposition is lossless-join which enables us to recover any instance of the decomposed relation from corresponding instances of the smaller relations. Another property is dependency-preservation which enables us to enforce any constraint on the original relation by simply enforcing some constraints on each of the smaller relations.
A good designer will ask if a relation is in a normal form or if the decomposition is dependency preserving. Relations will often have functional dependencies. A functional dependency X -> Y holds when every pair of tuples that agree on X also agree on Y; that is, if t1.X = t2.X, then t1.Y = t2.Y.
A primary key constraint is a special case of an FD.
Constraints can be defined on an Entity set. For example, the SSN can be a key to the tuples with attributes name, lot, rating, hourly_wages, hours_worked etc.
Constraints on a relationship set. These constraints can eliminate redundancies that ER design may not catch. For example, if a contract with a contract id relates quantities, suppliers, departments and parts, it may be better to decompose the relation into CQSD and SDP.
Functional Dependencies also help with identifying attributes of entities and particularly find attributes that are on the wrong entity set. For example, employees can work in at most one department. So we can decompose this into two entities Workers and Departments  with attributes as workers(ssn, name, did, since)  and Departments(did, dname, budget, lot).
The closure of a set of functional dependencies is defined as the set of FDs implied by a given set F of FDs and is denoted F+. Armstrong's axioms help find the closure of a set of FDs. The axioms are:
1. Reflexivity: if Y is a subset of X, then X --> Y
2. Augmentation: if X-->Y, then XZ-->YZ for any Z
3. Transitivity: if X-->Y and Y-->Z, then X-->Z
It is convenient to use additional rules while reasoning about F+
Union: if X-->Y and X-->Z, then X-->YZ
Decomposition: if X-->YZ, then X-->Y and X-->Z
These additional rules are not essential; their soundness can be proved using Armstrong's axioms.
The attribute closure of an attribute set X is the set of attributes A such that X --> A can be inferred using Armstrong's axioms.
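As an illustration, here is a short C# sketch (my own, with single-character attribute names for brevity) that computes the attribute closure X+ under a set of FDs; X is a superkey exactly when its closure contains all the attributes of the relation:

using System.Collections.Generic;

class FunctionalDependency
{
    public HashSet<char> Left;   // X in X --> Y
    public HashSet<char> Right;  // Y in X --> Y
}

static class Closure
{
    // Repeatedly add the right side of every FD whose left side is already inside the closure.
    public static HashSet<char> AttributeClosure(IEnumerable<char> x, List<FunctionalDependency> fds)
    {
        var closure = new HashSet<char>(x);
        bool changed = true;
        while (changed)
        {
            changed = false;
            foreach (var fd in fds)
            {
                if (fd.Left.IsSubsetOf(closure) && !fd.Right.IsSubsetOf(closure))
                {
                    closure.UnionWith(fd.Right);
                    changed = true;
                }
            }
        }
        return closure;
    }
}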
 

Thursday, February 21, 2013

Skip list

Skip list insertion and deletion have to keep track of the references at all levels of the node to be inserted or deleted. So the first thing we do for an insertion or deletion is, for each of the levels from top to bottom, find all nodes with references that should now be changed. This could be done on the stack via recursion or on the heap. For insertion, these references are updated to point to the new node, and the new node points to what they pointed to earlier; for deletion, they are updated to point to what the target node points to and the target node is spliced out. In all cases, the level-zero iteration of nodes spans the entire list.

Insertion should perform the updates to the references as it walks down the levels. Deletion could perform the updates as it returns up the levels.

Wednesday, February 20, 2013

finding answers quickly

Many times when we search, we really are interested in the first few results rather than all of the results we get. There is an urgency for the user to get only the first few or the best few answers quickly. Rarely does the user go past the second or third page of the results. If the user does not find what they want, they usually refine the query and resubmit it. This is true not just for searches but also for decision support. Another pattern is when the user may want to avoid writing a complex query and instead write one that gives an approximate answer quickly, then refine the query continually until an exact answer is found.
Top N Queries refers to the first of these kinds of querying. This is very helpful for the users to be able to explicitly indicate how many answers they want, making it possible for the DBMS to optimize execution. As an example, one could request the products that are top 10 by sales.
So here the optimizer would add a predicate that checks for sales to be greater than a cut-off value. Histograms can be used to estimate this cut-off for the top values. The cut-off value can be refined if the tuples returned are fewer or more than the number specified by the user, but the approach is based on pre-computation.
Another approach is to use online aggregation, where the answer quality is continually refined as the computation progresses. First, the grouping column value is determined to be within a range with a confidence level, then the user-input priority is used to determine the order of the tuples to work with, and finally non-blocking algorithms are used for the relational operators. For example, the sort-merge join algorithm blocks because sorting requires all input tuples; nested loops join and hash join are therefore preferable to sort-merge join for online aggregation.
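To make the top-N idea concrete, here is a hedged C# LINQ sketch (the type and property names are invented for illustration) expressing a "top 10 products by sales" query; a DBMS with top-N support would push the equivalent cut-off predicate into the plan instead:

using System.Collections.Generic;
using System.Linq;

class Product { public string Name; public decimal Sales; }

static class TopN
{
    // Equivalent to: SELECT TOP 10 ... ORDER BY Sales DESC
    public static IEnumerable<Product> Top10BySales(IEnumerable<Product> products)
    {
        return products.OrderByDescending(p => p.Sales).Take(10);
    }
}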

Tuesday, February 19, 2013

Distributed Databases (2)

Points to review:
Parallel databases evaluate queries in parallel in order to improve performance. Parallel queries can be executed on shared-memory, shared-disk and shared-nothing systems. In the shared-nothing system, linear speed-up and scale-up are possible.
In pipelined parallelism, one operator consumes the output of another operator.
In data-partitioned parallelism, the input data is partitioned and we work on each partition in parallel.
Data can be partitioned with round-robin partitioning, where tuple i goes to processor i mod n; with hash partitioning, where a hash function distributes the tuples; or with range partitioning, where the tuples are distributed over n search-key ranges (a small sketch of these schemes follows this set of points). Existing code can be parallelized by introducing split and merge operators.
Shared nothing architecture lends itself to data partitioning.
For query optimizations, we take parallelism within each operator and use pipelined parallelism to exploit parallelism between operators.
In a distributed database, data is stored across several locations. When the user does not have to know the location of the data, we achieve distributed data independence. When there is no difference between distributed transactions and local transactions, we achieve distributed transaction atomicity. If a transaction involves activities at different sites, the activity at a given site is called a subtransaction.
In distributed databases, a single relation might be fragmented or replicated across several sites. Horizontal and vertical fragmentation may apply to  rows and columns respectively. Fragments need qualifying names. Distributed catalog management is used to keep track of what is stored where.
When processing queries in a distributed DBMS, the location of the partitions of the relations need to be taken into account.  Two relations that reside on different sites can be joined by sending one relation to the other site and performing the join locally. Choice of sites could be based on the number of tuples to select or join. Semijoins and Bloomjoins reduce the number of tuples sent across the network. In such query processing, communication costs and autonomy of individual sites matter.
In replication, we store several copies of the relation or partition on different sites. In synchronous replication, all copies of a replicated relation are updated before the transaction commits. In asynchronous replication, copies are only updated periodically. Synchronous replication involves voting, where an update must write a majority of copies, and a read must access at least enough copies to make sure that one of the copies is current. Asynchronous replication involves peer-to-peer replication, where more than one copy can be updatable and a conflict resolution strategy deals with conflicting changes. Alternatively, in primary site replication, there is one primary copy that is updatable; the other secondary copies cannot be updated. Changes on the primary are captured and then applied to the other sites.
Locking can be either central at a designated primary copy or fully distributed. Distributed deadlock detection is required.
Recovery is done with a commit protocol that co-ordinates activities at different sites involved in the transaction. In two phase commit each transaction has a designated co-ordinator site. Subtransactions are executed at subordinate sites. Subordinate sites block until co-ordinator site is recovered.
Three-phase commit can avoid blocking even if the co-ordinator site fails. In 3PC, a co-ordinator site sends out prepare messages and receives yes votes from all subordinates. Then it sends out pre-commit messages rather than commit messages. When a sufficient number of acknowledgements have been received, the co-ordinator force-writes a commit log record and sends a commit message to all the subordinates. The pre-commit message lets the sites communicate with each other without waiting for the co-ordinator to recover.
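As referenced in the partitioning point above, here is a small C# sketch of the three partitioning schemes (my own illustration, not any particular engine's code):

static class Partitioning
{
    // Round-robin: tuple i goes to processor i mod n.
    public static int RoundRobin(int tupleIndex, int n) => tupleIndex % n;

    // Hash partitioning: a hash function on the partitioning key distributes the tuples.
    public static int HashPartition(object key, int n) => (key.GetHashCode() & int.MaxValue) % n;

    // Range partitioning: tuples are routed by which of the n key ranges the key falls into.
    // boundaries holds the ascending upper bounds of the first n-1 ranges.
    public static int RangePartition(int key, int[] boundaries)
    {
        for (int i = 0; i < boundaries.Length; i++)
            if (key <= boundaries[i]) return i;
        return boundaries.Length;   // last range
    }
}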

 

Monday, February 18, 2013

Distributed databases

What are the features in the database server for distributed databases ?

Distributed databases are faced with the challenge of dealing with heterogeneity and autonomy. Logically, these databases are connected via links. Physically, they share storage and query processing. In addition, recovery and concurrency control now span linked servers.

Distributed databases originated from internet and mobile computing. The motivation was to provide distributed computing over LAN, WAN, internet etc to boost availability and reliability via redundancy and performance via distribution of work.

The computing environment for distributed databases can vary from homogeneous nodes in a shared-nothing architecture to more centralized or truly distributed heterogeneous computers. In such cases, the transparency of computing, the networking costs, and the replication, fragmentation and allocation become important considerations. And site and communication failures add to issues with robustness.

In a conceptual distributed management model, heterogeneity and autonomy are the classification criteria considered.
The degree of homogeneity depends on the choices of data models, constraints and query languages, as well as on the semantic attributes of an entity, attribute names, the meaning of attributes, and isolation level choices.

The degree of local autonomy depends on the data model design autonomy, communication autonomy, execution autonomy and association autonomy.

The above define the classification criteria for conceptual DMs. With this classification, we can consider the following categories of distributed databases.
1. Distribution transparent category: These are homogeneous databases with no local autonomy that look like a single centralized DB to users.
2. Multidatabase category: These have no common schema.
3. Federated category: These have a common schema, but local autonomy.
In addition, hardware and software may be heterogeneous

In federated databases, we can have a schema hierarchy wherein the local component schema is translated to a canonical data model and then exported (as a public view), federated (as a global view for the DBAs), or made external (as views for user groups).

A federated schema is usually built by consensus, and data interchange standards complement the federated schema. As an example, XML provides a syntax for a federated schema and for the data in it. This category has the support of major IT vendors.

Let's consider schema integration examples.

An engineering database may have relational schema such as with the following entities
Engineer ( Engineer no, name, title, salary)
Project (Project Number, Project Name, budget, location)
Client (Client name, address)
and with following relationships (N:1)
Engineer works in Project (Responsibility, Duration)
Project Contracted by Client (Contract Date)

Also, an Employee database may have a CODASYL schema
Department (dept-name, budget, manager)
Employee( e# name, address, title, salary)
Department employs Employee (1:N relationship)

Now we can integrate this schema.
Here's one solution that uses an Entity Relationship diagram:

Entities:
Department(dname, budget, manager)
Employee(E#, name, title, address, salary)
Engineer()
Project(PNo, Pname, Budget, Location)
Client(Client Name, Address)

Relationships
Engineer() is a subtype of employee
Department employs Employee(1:N)
Engineer works in project (M:N) : (Responsibility, Duration)
Project Contracted by Clients(M:N) (Contract Date)

Linking of databases can be done with the following:
Global naming of data objects such as tables, so we can have two-part names for tables in the form tablename@databasename, e.g. emp@sales.
Database links may be created for one-way communication, such as sales over us.america.
Database replication could be set up as read-only (a replicated master table) or read-write (a snapshot with periodic refresh).

Physical storage of distributed databases involves the following. First, the tables could be fragmented, with horizontal division by rows, vertical division by columns, or both. Second, the degree of replication could be specified: each fragment could appear at one site only, which is called non-redundant replication, or all the sites could have all the data, which is called full replication. Alternatively, selective replication could be specified, where some fragments have a higher degree of replication.

Query processing also requires some features for distributed databases. Data transfer strategies now need to consider communication costs besides CPU cost and I/O cost. Query decomposition and allocation between databases should also be determined. Consider the case when data is at site 1 and site 2 and the query is at site 3. In such cases, we could send both tables to site 3, or send one table to site 2 and then the result to site 3, or send the other table to site 1 and then the result to site 3.
Another data transfer strategy is the semi-join strategy. Here site 2 could take some of the columns of one table and send them to site 1. Site 1 could then join against the projection and send the result back to site 2. Site 2 could then join it with its data and then send the result to site 3.
Semi-join strategies are usually cheaper than conventional strategies when the tables are large but the semi-join is cheap, or when there are lots of irrelevant columns and very small join selectivity.
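A hedged C# LINQ sketch of the semi-join idea (the table and column names are invented): site 1 joins against only the join-key projection shipped from site 2, so that only the matching, reduced rows travel over the network:

using System.Collections.Generic;
using System.Linq;

class Order { public int CustomerId; public decimal Amount; }   // lives at site 1
class Customer { public int Id; public string Region; }         // lives at site 2

static class SemiJoin
{
    // Step 1 (at site 2): project only the join column and send it to site 1.
    public static HashSet<int> ProjectJoinKeys(IEnumerable<Customer> customers)
    {
        return new HashSet<int>(customers.Select(c => c.Id));
    }

    // Step 2 (at site 1): keep only the orders whose key appears in the projection;
    // only these reduced rows are shipped back for the full join.
    public static IEnumerable<Order> Reduce(IEnumerable<Order> orders, HashSet<int> keys)
    {
        return orders.Where(o => keys.Contains(o.CustomerId));
    }
}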

Thus distributed databases require a few more features.

Sunday, February 17, 2013

Data Structures using C#

Arrays serve as a popular homogeneous, uniform-sized, direct-access data structure. Not all array elements need be of uniform length, though: jagged arrays are arrays of arrays where the contained arrays can be of variable length. These are declared as follows:
int[][] jaggedArray = new int[3][];
jaggedArray[0] = new int[] { 1, 3, 5, 7, 9 };
jaggedArray[1] = new int[] { 0, 2, 4, 6 };
jaggedArray[2] = new int[] { 11, 22 };
A multidimensional array on the other hand is declared as follows:
int[,] multiArray = new int[,]
{ { 1, 2, 3 }, { 4, 5, 6 }, { 7, 8, 9 } };

Saturday, February 16, 2013

Object Oriented databases


Object oriented databases originated from well-known limitations of the relational DBMS, which was too simple for CAD/CAM and GIS/CASE since there were limited datatypes and no UDTs at the time. Among the other limitations were that the RDBMS did not allow the concepts of generalization and aggregation, and there was a mismatch between the DB language and application languages. There was no support for design transactions, versions and schema evolution. Hence the approach of extended relational DBs (or the object-relational model) with support for user-defined data types, stored procedures or rules, transitive closures, and generalization.

Fundamentally, an object oriented database implements a conceptual object data model with objects and their identifiers. This model is all about encapsulation, classes, inheritance, overloading and overriding, and complex values and types. Therefore it differs from an entity relationship model, where there are explicitly defined relations and schemas. The object model extends relationships to add support for subclasses and instances, composition hierarchies, checking out / checking in of objects, etc. Integrity constraints are also extended in object databases with referential integrity constraints, uniqueness constraints, and sibling or sub-class relationships. Support for composite objects with references to other objects with exclusive or shared parts is also available.

Similarly, the query language and optimization have been extended. The queries look very similar to those in programming languages. Expressions with support for All or Any are also made available. The issues with query optimization in the object-relational model include user-defined methods (hard to optimize), class-hierarchy-based queries (requiring new strategies), multi-valued attributes (needing new cost models) and dual buffering (separately processing objects in the buffer and on the disk).

Locking is at the granularity of objects (intention locks) on large complex objects. Transactions share objects in memory buffers and recovery needs to handle memory buffers.

An object database provides a new model closer to application development and requires refinements on the relational model. When objects are flattened, they lend themselves to storage in a relational model.

Friday, February 15, 2013

Spatial databases

A spatial database stores information about objects in space, which we can express in the form of geometry, such as points, lines and polygons, or geography. These are data types very different from the customary varchar, int, varbinary etc. data types for databases. Spatial data is also queried very differently. The Open Geospatial Consortium defines the following query types:
Spatial measurements
Spatial functions
Spatial predicates
Constructor functions
Observer functions
NoSQL systems like MongoDB and CouchDB support only a limited set.
The indexes used for spatial data types are different too. Due to the nature of the spatial data types, some patterns are very common in spatial queries, and regular indexes don't support these very well. For example, finding how far two points are apart, or whether a point is within a spatial area of interest, requires algorithms and data structures that are different from B-trees or bitmap indexes.
For example in Microsoft SQL Server, you can do the following:
CREATE TABLE SpatialTable
( id int IDENTITY (1,1),
GeomCol1 geometry,
GeomCol2 AS GeomCol1.STAsText() );
GO

INSERT into SpatialTable (GeomCol1)
VALUES (geometry::STGeomFromText('LINESTRING (20 180, 100 100, 180 180)', 0));

INSERT into SpatialTable(GeomCol1)
VALUES (geometry::STGeomFromText('POLYGON ((0 0, 150 0, 150 150, 0 150, 0 0))', 0));

DECLARE @geom1 geometry;
DECLARE @geom2 geometry;
DECLARE @result geometry;

SELECT @geom1 = GeomCol1 FROM SpatialTable Where id = 1;
SELECT @geom2 = GeomCol1 FROM SpatialTable Where id = 2;
SELECT @result = @geom1.STIntersection(@geom2);
SELECT @result.STAsText();

Incidentally, expressing spatial data types as .NET UDTs and performing spatial operations is practical but just not deep enough for the kind of features and the size of data such as for astronomical study.
 

recap

C sharp data structures
Here are some common data structures in C#
ArrayList: This is a dynamically sized array of objects that implements the IList interface. It works by maintaining an internal array of objects that is replaced with a larger array when it reaches its capacity of elements. It is very efficient at adding elements, since there is usually a free slot at the end, but it is inefficient at inserting into the middle. Searching can be efficient if the BinarySearch method can be used, but you must Sort the array first.
BitArray: This is a dynamically sized array of Boolean values. It is more memory efficient than a simple array of bools because it uses only one bit for each value, whereas a bool array uses a byte for each value.
Hashtable class: A Hashtable class is a standard dictionary (key/value) data structure that also uses a hashing algorithm to store and index values efficiently. The GetHashCode method is used for hashing. Any type used as a key in the hashtable should return a good hash of the object’s internal value.
Queue class: A Queue is a standard first in first out (FIFO) data structure providing simple operations to enqueue, dequeue, peek etc.
SortedList class: A SortedList is a standard dictionary data structure that uses a binary-chop search to index efficiently. SortedList implements IDictionary.
Stack class: A stack class is a standard Last-in first-out (LIFO) data structure.
 StringCollection: A StringCollection is a standard collection data structure for storing strings.
HashSet: A HashSet is a generic collection that contains no duplicate elements and where the elements are in no particular order. The capacity of the HashSet is the number of elements the object can hold.  
LinkedList: A LinkedList is a general purpose linked list. It supports enumerators and implements the ICollection interface, consistent with other collection classes in the .Net framework.
 SortedDictionary: A sorted dictionary is a binary search tree and is very similar to the SortedList but differs in execution speed and memory use.
ConcurrentQueue generic is used for thread safe queue operations and can be used safely by multiple threads. You could invoke multiple threads with the Parallel library i.e. Parallel.Invoke(action, action);
Similarly ConcurrentBag, ConcurrentDictionary  and ConcurrentStack are also available.
Partitioner and OrderablePartitioner provides a manner of partitioning data into multiple partitions.
A BlockingCollection provides blocking and bounding capabilities for thread-safe collections that implement IProducerConsumerCollection<T>.
BinaryTree and BinarySearchTree can be implemented with a generic Node<T> class.
public class Node<T>
{
    public Node<T> Left { get; set; }
    public Node<T> Right { get; set; }
    public T Data { get; set; }
}
and traversals can be written in a recursive way:
public static void PreorderTraversal<T>(Node<T> current)
{
    if (current != null)
    {
        Console.WriteLine(current.Data);
        PreorderTraversal(current.Left);
        PreorderTraversal(current.Right);
    }
}

Dynamics CRM Q & A

Role based security is different from object based security. An access right is a right granted to a user on an object. A privilege is a right granted to a user on a class of objects. Access rights apply only after privileges have taken effect. Policies and roles make it easy to grant privileges. Role based security uses the authentication and authorization mechanism of the application server.

Find performs a search on an attribute for which it is defined. Advanced Find performs a search on one or more conditions and attributes that the user may choose.

CRM discovery service is used for ORG related activities while CRM metadata service is used for CRM record activities.

A plugin is different from a workflow. A plugin is a logic that can be executed repeatedly for different events or entities, even when offline, with or without impersonation and usually runs quickly. A workflow is governed by an end-user, can have sub-processes triggered and can take a long time.

MSCRM_Config and MSCRM_orgname databases come out of the box with an MSCRM install.

The main challenge with CRM is to keep the focus and center on the customer record while scaling or using better top level tools and custom dashboards.

What is the typical cost of using CRM ? It varies between $20 to $350 per month per user.

What should CRM do ? CRM should help monitor, deliver and measure my effectiveness.

What do you see as new and emerging in CRM ? Mobile applications supporting CRM is a new and emerging field.

Do you have a few key best practices someone considering CRM can use ?
Some suggestions are: first, consider your future needs; second, take the opportunity to clean up your data now; and third, communicate throughout the process to get early buy-in.

What advantage might CRM have for specific verticals? If you are a services company with a long sales cycle, project terms, and frequent interactions and touch points, CRM will be exponentially more valuable to you.

Reference: Q & A slideshare.net/warsha.agarwal

Thursday, February 14, 2013

PHP, Heroku, MongoDB

MongoDB is a key-value store. It's built for high performance and is schema-free. It is a document-oriented database rather than a relational store; hence it falls under NoSQL (Not Only SQL). It's used by GitHub and SourceForge. You can try it yourself at http://try.mongodb.org/ and you can use it from most languages such as C, C++, C#, PHP, Python, node.js, Haskell, Scala, Java etc. It uses terms similar to other databases: you type show dbs, use admin, show collections etc. A collection is like a table and each document is like a row. You can type
db.accounts.find().forEach(function(doc) { print (tojson(doc));});
You can display each document in JSON format this way.  The database is also easy to understand with similar functionality.
Query
You use dynamic queries. You can insert JSON like
db.colors.insert({name:'red', primary:'true'})
> var cursor = db.colors.find()
> cursor.next()
and of course SELECT * from colors where name = 'green'
or db.colors.find({primary: true}).sort({name: 1}).limit(1)
db.people.find({age: {$lte: 27}})
Operators
You can also use $gt, $lt, $ne, $in, $nin, $mod, $all, $size, $exists, $type, $elemMatch, $not and $where.
Index
You can also use Indexes. db.colors.ensureIndex({name: 1}) for single ascending or -1 for single descending order or {unique:true} for unique and {background:true} for non-blocking index in background
You can also mix with db.colors.ensureIndex({name: 1, created_at: -1})
Aggregation:
You can use db.colors.distinct('name')
You can also use items group
db.items.group({key:{template:true}, initial:{count:0}, reduce:function(obj, prev) { prev.count += 1}});
Data Types:
Array, Binary, Boolean, DateTime, DB reference, Embedded object, Integer, Null, ObjectId, RegExp, String, Symbol, Timestamp
Relationships are modeled similarly, either normalized or embedded (embedding is pre-joining). Embed when the document always appears with its parent. There is a 4MB document size limit.
And there's support via e-mail, IRC, Web, book, conference and training.
reference : slides from jnunemaker

PHP:
A web programming language that runs on the web server. It's a free, cross-platform, interpreted language targeted towards web programming. The PHP interpreter runs the commands between <?php ?> tags, supports a string syntax called a here document, and makes it easy to display, for example, information about a database. A PHP form page processes submitted form data via $_POST.
PHP works on pieces of text called strings. Strings can be of three types: enclosed in single quotes, enclosed in double quotes, or a here document. Strings can even contain binary data such as an image or sound. Strings support variable interpolation, i.e. the variables in double-quoted strings will be substituted with their values. Joining is done with implode and splitting with explode; a string can be randomly shuffled with str_shuffle(). Variables are Perl-like, beginning with $.
 Pass by reference and value are both allowed.

Wednesday, February 13, 2013

Interview Preparation


 

Insertion of a binary tree involves the following steps:

If the root is null, make this the root.

If the root is a single node with no children and its value is greater than the input, insert the input as the left child of the root. Otherwise insert it as the right child.

If the root has any children, traverse right if the root is smaller than the input, otherwise traverse left. In each case, append the input to the tree as a leaf. The leaf attaches to its parent just as in the case of the root above.
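A minimal C# sketch of these insertion steps (my own simplification; it omits the sentinel and does not handle duplicates):

public class TreeNode
{
    public int Value;
    public TreeNode Left, Right;
    public TreeNode(int value) { Value = value; }
}

public static class BinarySearchTree
{
    // Walk down from the root, going left for smaller values and right for larger,
    // and attach the new value as a leaf where the walk falls off the tree.
    public static TreeNode Insert(TreeNode root, int value)
    {
        if (root == null) return new TreeNode(value);
        if (value < root.Value)
            root.Left = Insert(root.Left, value);
        else
            root.Right = Insert(root.Right, value);
        return root;
    }
}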

Deletion from a binary tree involves the following steps:

Deletion is tricky because deleting a node may require several of the surrounding nodes to be rewritten, since splicing the node out may be required. The case of deleting leaf nodes is trivial because a left or right child leaf can be removed without affecting the rest of the tree.
There are three cases for the node to be deleted. These cases depend on how many children the node to be deleted has.
1) the first case is when there are no children , we just remove it.
2) the second case is when there is only one child, we splice out the node.
3) the third case is when there are two children: we splice out the successor, which has at most one child, and use it to replace the node to be deleted.
 
So the steps are as follows:
Walk down the tree to find the node to be deleted. Traverse right if the node to be deleted has a value greater than the root. Arrive at the node to be deleted, or at null if the node is not found.
If the node to be deleted is the root and the root has two children, the tree successor has to be found, spliced out and made the root. Otherwise, if there is at most one child, that child is chosen to replace the node. We can call the node to be deleted z and the one to be spliced y; y = z when there is at most one child.

If y has a left child, we will need to update that child's parent, and hence we call it x.
Otherwise we choose x as y's right child.
If x is not null, we update its parent to point to y's parent.
If y's parent is null, then y is the root, hence x has to be made the root.
Otherwise, if y was the left child of its parent, then that parent's left child pointer should now point to x; otherwise we update the right child pointer.

if y != z, then we transfer y's data to z.

Red Black Tree

The Red black tree must satisfy the following properties:
1. Every node is either red or black.
2. The root is black
3. Every leaf is black
4. If a node is red, then its children are black
5. For each node, all paths from the node down to descendant leaves contain the same number of black nodes.

Tree Insertions or deletions require rotations.

1. Tree Insertion is very much like the tree insert procedure but involves recoloring and rotation at the end to satisfy the red-black properties. There are three cases to be considered depending on whether the uncle (y) of the node just inserted (z) is red or black and z is a right or left child.
To enumerate these cases:
Case 1:  z's uncle y is red : Action - recolor the parent and the uncle, z moved up
Case 2:  z's uncle y is black and z is a right child : Action - perform a left rotation
Case 3:  z's uncle y is black and z is a left child : Action - perform a right rotation
The colors of the root and the leaves are checked and fixed at the end of the procedure.

2. Tree deletion is very much like the ordinary tree delete procedure but also requires recoloring and rotations. There are four cases to be considered, and these depend on whether the sibling (w) of the node to be fixed (x) is red or black and whether w's children are red or black.
To enumerate these cases:
Case 1: x's sibling w is red : Action - exchange the color of the parent and w and perform a left rotation
Case 2: x's sibling w is black, and both of w's children are black : Action - color the sibling to red and move x up the tree
Case 3: x's sibling w is black, w's left child is red, and w's right child is black : Action - exchange the color of w and its left child and perform a right rotation.
Case 4: x's sibling w is black and w's right child is red : Action - remove extra black and perform left rotation.

Graph: This can have nodes similar to an n-ary tree. A graph can also be represented by its edges and vertices.
Breadth First Traveral:
Use a queue and insert the starting node. When a node is dequeued, its adjacent nodes are enqueued. Each dequeue operation is one step of the breadth-first traversal. This continues until the queue is empty, as in the sketch below.
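Here is a short C# sketch of that traversal over an adjacency-list graph (the representation is assumed, not prescribed):

using System;
using System.Collections.Generic;

static class GraphTraversal
{
    // graph maps each node to the list of its adjacent nodes.
    public static void BreadthFirst(Dictionary<int, List<int>> graph, int start)
    {
        var visited = new HashSet<int> { start };
        var queue = new Queue<int>();
        queue.Enqueue(start);
        while (queue.Count > 0)
        {
            int node = queue.Dequeue();     // one step of the breadth-first traversal
            Console.WriteLine(node);
            foreach (int next in graph[node])
                if (visited.Add(next))      // enqueue neighbors not seen before
                    queue.Enqueue(next);
        }
    }
}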

Depth first traversal is easy to write with a recursive function.  Check the parameter, define the terminal condition, recurse down the adjacent nodes, one by one.

Dijkstra's algorithm is useful to find single-source shortest paths on a weighted, directed graph. You repeatedly pick the unvisited vertex with the minimum distance and then relax the edges for each vertex adjacent to the pick.

B-tree is also popular. Each node x has the number of keys currently stored, the n[x] keys themselves stored in non-decreasing order, and a boolean value to say whether it is a leaf or an internal node.

Techniques to use various data structures

Divide and conquer: This performs faster by collapsing linear searches into log n searches and then combining the solutions to the sub-problems. It's great for sub-problems that are independent.

Dynamic Programming: This also combines the solutions to sub-problems except that the sub-problems overlap.

Greedy Algorithms :  This makes the choice that looks best at the moment. It will always make the locally optimal choice.
There's top down as well as bottom up approach where the problems are decomposed top down and solutions are built bottom up.


Some examples:

1. Longest Common Subsequence : 
Step 1 : Characterize a LCS with the following criteria: say Z is an LCS of X and Y
1. if Xm == Yn, then Zk = Xm = Yn and Zk-1 is an LCS of Xm-1 and Yn-1
2. If Xm != Yn, then Zk != Xm => Z is an LCS of Xm-1 and Y
3. If Xm != Yn, then Zk != Yn => Z is an LCS of X and Yn-1

Step 2: Define the recursion
C[i, j]  =    0                                         if i = 0, or j = 0
           =    c[i-1, j-1] + 1                   if i, j > 0 and xi = yj
           =    max(c[i, j - 1], c[i -1, j ])    if i, j > 0 and xi != yj

Step 3: Computing the length of an LCS
LCS-Length(X, Y) : Use the step 2 and set up a recursion to compute the length.
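A short C# sketch of LCS-Length following the recursion in step 2 (bottom-up dynamic programming, my own illustration):

using System;

static class Lcs
{
    public static int Length(string x, string y)
    {
        int m = x.Length, n = y.Length;
        var c = new int[m + 1, n + 1];   // c[i, j] = length of an LCS of x[0..i) and y[0..j)
        for (int i = 1; i <= m; i++)
            for (int j = 1; j <= n; j++)
                c[i, j] = x[i - 1] == y[j - 1]
                    ? c[i - 1, j - 1] + 1
                    : Math.Max(c[i, j - 1], c[i - 1, j]);
        return c[m, n];
    }
}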

2. String pattern matching :
Finite state Automata can be used for the following:
Define a finite automaton matcher with right-going edges forming the spine of the automaton, with as many states as the pattern specifies. A directed edge shows a state transition. Pattern match length and offset are both captured.
The C# language provisions for performing a regex have some good design and implementation.
Here it is for  a quick recap.

Regex r = new Regex(pattern, options);
int[] gnums = r.GetGroupNumbers();
Match m = r.Match(text);
while (m.Success)
{
    for (int i = 0; i < gnums.Length; i++)
    {
        Group g = m.Groups[gnums[i]];
        CaptureCollection cc = g.Captures;
        for (int j = 0; j < cc.Count; j++)
        {
            Capture c = cc[j];
            Console.WriteLine("\t\tCapture{0} = [{1}] Index = {2} and Length = {3}", j, c, c.Index, c.Length);
        }
    }
    m = m.NextMatch();
}

Moving to the language based interview questions on C#, here is a quick recap of some questions asked in interviews:

Some questions :
1. [TCS] Describe the garbage collector in .Net. It's a generational mark-and-compact collector.
2. [TCS] What is reflection ? How's it used ?  Suggested answer : Reflection is a way to manipulate the types based on their metadata. Some uses of reflection are late binding, serialization , remoting, attributes etc. 
3. [Starbucks] How do you wrap your MVC methods for error handling. Answer [HandleError] attribute
4. [Starbucks] How do you define your serialization attributes. Why is order useful?
5. [Starbucks] How do you use WCF with SOAP ? What additional items do you specify in the web.config ?
6. [T-Mobile] What attributes are used to define REST APIs in Controller ? HttpGet, HttpPost
7. [Infosys] With use of a single sign on software  such as siteminder across various internal applications, a user session is lost between navigations. How do you diagnose and fix the problem ? This could be an issue with javascript, services, siteminder or network.
8. [Infosys] How do you prioritize, keep the team moving, meet the deadlines, and deliver with high quality?
9. [Ontela] How would you design a file system? What are the layers and interfaces?

There are a lot of considerations for the system programming questions. For example, file system design has to consider organizing files and block-level storage, locking, logging, file access security and remoting. The layers involve file organization, file management, file allocation strategy, file buffering and the disk management interface.
File allocation strategies have to consider file allocation in large blocks or on small file system devices. File allocation strategies could consider storage in a linked list of large blocks, consecutive storage in a single block, and storage in a bunch of small blocks tracked by an index. If random access is important, the linked list might be a poor choice. If fast reads are important, ascertaining the physical location of data takes constant time for consecutive or indexed storage. If internal and external fragmentation is to be considered, then the allocation size will need to be adjusted. If free space needs to be tracked, a linked list or bit vector could be used. Operating systems like Linux allow file systems to be extended or added as plugins with their implementation of the Virtual File System.
Virtual File System is an indirection layer which handles all the file system calls and calls the necessary file system functions in the physical file system code to do the I/O. The physical file system may employ one or more buffer cache to interact with device drivers and in turn with device controllers.
File systems can be mounted, and this populates the file descriptors, which contain several kinds of information, including information that is common to every file system type.

Talking about files, let's take a look at newer dbs now.

libraries for data mining

Some of the data mining packages available on Google Code are nifty utilities to mine 2D data. They come with various algorithms and technique implementations and can be used directly with different kinds of data. One such approach involves computing cluster centroids and points, then finding the Euclidean distance between the points and the centroids. The Euclidean distance is the square root of the sum of squares of the x and y offsets from the centroid. In the case of finding keywords in a text, the WordNet similarity distance between the selected candidates could be considered and a 2D array (points, clusters) of similarities populated, each ranging between 0 and 1. We could iterate through all these points to compute the sum of the Euclidean distances from a point to all of the clusters. Also, for the same point, we could compute the inverse of the fraction of the distance to each cluster over the sum just computed. Then we could normalize the matrix to tag each data point to different clusters. Once we know the membership of each word in the different clusters, we could populate the index with words that maximize cluster membership.
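A rough C# sketch of the distance and membership computation described above (my own construction in a fuzzy-clustering flavor; the centroids and the per-point distances are assumed to be given):

using System;

static class ClusterMembership
{
    // Euclidean distance between a 2D point and a centroid.
    public static double Distance((double X, double Y) p, (double X, double Y) c)
    {
        return Math.Sqrt((p.X - c.X) * (p.X - c.X) + (p.Y - c.Y) * (p.Y - c.Y));
    }

    // For one point, turn its distances to all centroids into normalized memberships:
    // closer clusters get larger weights and the weights sum to 1.
    public static double[] Memberships(double[] distances)
    {
        var weights = new double[distances.Length];
        double sum = 0;
        for (int i = 0; i < distances.Length; i++)
        {
            weights[i] = 1.0 / (distances[i] + 1e-9);   // guard against a zero distance
            sum += weights[i];
        }
        for (int i = 0; i < weights.Length; i++)
            weights[i] /= sum;
        return weights;
    }
}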

Monday, February 11, 2013

Glossary


Glossary of terms in computing

Automation : This refers to the set of tasks that have been specified in a way that they can run without manual intervention. This comes in very useful in test setup, execution, tear down, storing results, and making reports. Frequently such tasks are delegated to a test harness.

Balancing :  This refers to sharing the load between one or more processors or units of control such that there is a distribution of work resulting in more concurrency and faster execution as well as fault tolerance.

Cloud Computing : This refers to the use of one or more virtual machines in a hosted environment, where the provider of the VMs and their network, the blades and the hardware takes away the onus of maintaining the system on which the application runs, while assuring high availability and geographic reachability.

Degree of Parallelism: This refers to the number of concurrent units of control invoked in the execution of a task, usually by a partitioning of data.

Execution : This refers to the actions needed to complete a task. The actions are usually outlined ahead of time, together with the context and data to operate on, in a tightly controlled environment.

Flow diagrams : This refers to either the control or the data flows in a system where the system interaction and vulnerabilities can be studied by seeing how the data changes and with what controls as it goes through the system. This is one way to white box a system in that its internal workings are made clear with examples of data and control. Such diagrams are useful for threat analysis.

GigaByte: This is a unit of storage equal to 10 power 9 bytes and represents a way to describe the size of say files or databases.

Hashing : This is a technique to make lookups faster by keeping track of where insertions were made for entries in a collection so that they can be retrieved with a fixed cost. Typically, a range of hash values is used and a hash is computed on the object given and then put into the category.  The hashing is very effective when a good hash function is used otherwise chaining and linear search may result.

Instruction : This is a specific task specified to the processor and can be considered ‘bare metal’ or ‘native’ in that it is independent of any programming languages and very much tied to the hardware for which it is specified.

Jump : This is a very specific instruction given to the processor to move from the next instruction in a sequence to something outside the sequence.

Keys, public and private : This refers to cryptographic key pairs usually used for encryption such that only the sender and receiver can interpret the encrypted traffic. The private key is also called the secret and is used for decrypting while the public key can be shared.

Locking: This refers to the locking of a set of instructions or resources such that they provide mutual exclusion so as to prevent clobbering by interfering reads or writes. Interestingly, Lock free implies the guarantee of atomic reads and writes without conflicts by tightly controlling the sequence in which they appear but without the use of locks.

Monitoring: This is ensuring continuous availability of say a web service by frequently checking for some heartbeat events or directly calling the service periodically.

Nightly run: This refers to running large analytical workloads such as OLAP on system when there is less chance for conflicts with update statements.

Proxy: This is a software entity that sits between a sender and receiver and usually passes the traffic through. Proxies can be either transparent or opaque and are commonly used with http and https.

Query: This is the language of retrieving data from a data store and can be both data definition as well as data manipulation in nature. They can be as granular as a batch or statement in T-SQL

Resource: A resource is known by its cost and is helpful with management. It can be used to refer to the memory or CPU required for execution and can loosely be used for humans assigned to different tasks.

Scale out/up: A scale-out refers to additional servers added horizontally to, say, a cluster, whereas a scale-up tries to increase the capacity of a single server in terms of memory or processors.

Ticket: A ticket is a snapshot that captures enough information to identify an incident and for tracking. It can be associated with a user on whose behalf it is created and tracked as it is worked upon and closed.

User: A user of a computer system knows only the functionality and not the mechanics of the system. In other words, the user should not be expected to figure out whether a system failed to complete a specified action or what the remedial actions are. The user only sees the system as a tool for solving a problem.

Violation, access:  Access violation is a bad address specified to the processor and most commonly is a null reference.

Zero: A zero is a state of a bit in which there is no signal and is a convenient test for the Boolean state false. It  provides a comparison against  a signal.

 

 

Friday, February 8, 2013

Abstract Data Types

Abstract Data Types (ADTs) can be used directly in the relational model. They don't change the algebra, just the expressions over attribute values. However, deeper support for non-relational structured types is provided via XML with languages like XPath and XQuery. The following three approaches enumerate some of the handling of structured types like XML. The first is to build a custom database system that operates on data with structured types, but this is perhaps more laborious and intensive than supporting XML in the relational model. The second approach is to treat the complex type as an ADT, where say a column of type XML is used to store one XML document per row. All executions are handled outside the query optimizer. A third approach is for the DBMS to normalize the nested structure into a set of relations upon insertion, and this is often called shredding. With shredding, the sub-objects are mapped with foreign keys, and ordering information between elements of the same level is removed, which improves performance.

Thursday, February 7, 2013

a new way to tag words

When you want to select keywords in a text, there are several techniques available, such as linguistic search etc., as mentioned earlier. Usually words are tagged with various parts of speech (POS) available from, say, WordNet. I propose a slightly different tag that might make document indexing easier. This is a very simple metric that is based on how frequently these words are used across the internet. With this metric, I'm not suggesting a better way to find hapaxes. Instead I'm proposing to differentiate between frequently occurring, not-so-common words in a text block.
The metric could be a simple Poisson distribution, where the probability of observing $k$ occurrences is $P(k; \lambda) = \frac{\lambda^k e^{-\lambda}}{k!}$, with $k$ the number of occurrences of the word, $k!$ the factorial of $k$, and $\lambda$ a positive real number giving the expected rate of occurrence.
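
As a minimal sketch of how this metric might be applied, assuming we already have each word's background rate per text block (the Poisson parameter) from some internet-scale corpus; the rates and counts below are made up for illustration:

from math import exp, factorial

def poisson_pmf(k, lam):
    # Probability of exactly k occurrences under a Poisson(lam) model.
    return (lam ** k) * exp(-lam) / factorial(k)

# Hypothetical background rates: expected occurrences per text block,
# taken in practice from a large corpus (these numbers are made up).
background_rate = {"the": 60.0, "indexing": 0.3, "poisson": 0.05}

def tag_words(counts):
    # Attach to each word the probability of its observed count under its
    # background rate; improbably high counts are candidate index terms.
    tags = {}
    for word, k in counts.items():
        lam = background_rate.get(word, 0.1)  # assumed default rate for unseen words
        tags[word] = poisson_pmf(k, lam)
    return tags

print(tag_words({"the": 55, "indexing": 4, "poisson": 2}))

A word that occurs far more often than its background rate predicts gets a very small probability here, which is what flags it as interesting for indexing.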


Tuesday, February 5, 2013

indexing methods

Among the various algorithms for automated document indexing, the following variations could be considered:
1) Use clusters based on the number of words sharing similar concepts and their cumulative frequency. This amounts to a histogram organized by topic, with a weight assigned to each candidate equal to the fraction of its occurrences among the overall occurrences of all the words in its cluster. This idea is based on the paper "Document Indexing with a Concept Hierarchy" by Gelbukh, Sidorov and Guzman. However, this variation plans to use existing concepts, as detected by clustering functions over any dictionary or ontology, to find relevance between index candidates.
One way to improve performance is to keep the entire dictionary in memory, or at least the pages that are relevant to the candidates at hand; since we know the synsets of the words and their hierarchy, we can find those pages. Alternatively, we could load only fixed-size sections of the dictionary into memory at a time, so that sections can be switched in and out as required for processing. A small sketch of the cluster-weighting idea follows.
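
As a rough sketch of variation 1, assuming we already have a mapping from words to concept clusters (the hand-built dictionary below stands in for WordNet or an ontology), each candidate's weight can be computed as its share of its cluster's total occurrences:

from collections import Counter

# Hypothetical concept clusters; in practice these would come from a
# dictionary or ontology such as WordNet.
concept_of = {"index": "retrieval", "search": "retrieval", "query": "retrieval",
              "cluster": "grouping", "group": "grouping"}

def cluster_weights(words):
    # Weight each candidate by its fraction of all occurrences in its concept cluster.
    counts = Counter(w for w in words if w in concept_of)
    cluster_totals = Counter()
    for word, count in counts.items():
        cluster_totals[concept_of[word]] += count
    return {word: count / cluster_totals[concept_of[word]]
            for word, count in counts.items()}

text = "search index search query cluster group cluster cluster".split()
print(cluster_weights(text))

The words with the largest share of their cluster would be preferred as index entries for that concept.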

Monday, February 4, 2013

Design of skip lists


There are sample implementations of skip lists in many languages:
In C++ here is one reference
In C# here is another reference
In Python here is another reference
The common features across all these implementations are as follows:
1. Each skip list node has a height and a set of neighboring skip list nodes, precisely as many neighboring nodes as its height, some of which may be null.
2. Each skip list node has some piece of data associated with it.
3. Each skip list node has a key used to look it up.

The skip list is initialized with the maximum height that any node may have.
Head and tail sentinel nodes are created and linked to each other.

Insert operation:
Find where the new node goes by comparing keys.
If the key is a duplicate, don't insert; otherwise perform the insert.
Use a random number to generate the new node's height.

Remove operation:
Find the node to be deleted by traversing from the highest available level down to the lowest until it is reached.
For each level present on that node, update the forward references that point to it (from the head or from predecessor nodes) to skip past it, then delete the node.
If the node is not found, return false. A minimal sketch of these operations follows.
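
Here is a minimal Python sketch of these operations; the names, the fixed maximum height and the 0.5 coin-flip probability are my own choices rather than details taken from the referenced implementations:

import random

MAX_HEIGHT = 16

class Node:
    def __init__(self, key, value, height):
        self.key, self.value = key, value
        self.forward = [None] * height            # one forward link per level

class SkipList:
    def __init__(self):
        self.head = Node(None, None, MAX_HEIGHT)  # sentinel with maximum height

    def _random_height(self):
        # Flip a coin to decide how many levels the new node participates in.
        h = 1
        while h < MAX_HEIGHT and random.random() < 0.5:
            h += 1
        return h

    def _predecessors(self, key):
        # Collect the rightmost node before `key` on every level.
        preds, node = [None] * MAX_HEIGHT, self.head
        for level in range(MAX_HEIGHT - 1, -1, -1):
            while node.forward[level] and node.forward[level].key < key:
                node = node.forward[level]
            preds[level] = node
        return preds

    def search(self, key):
        node = self._predecessors(key)[0].forward[0]
        return node.value if node and node.key == key else None

    def insert(self, key, value):
        preds = self._predecessors(key)
        nxt = preds[0].forward[0]
        if nxt and nxt.key == key:                # duplicate: do not insert
            return False
        node = Node(key, value, self._random_height())
        for level in range(len(node.forward)):    # splice in at each of its levels
            node.forward[level] = preds[level].forward[level]
            preds[level].forward[level] = node
        return True

    def remove(self, key):
        preds = self._predecessors(key)
        node = preds[0].forward[0]
        if not node or node.key != key:           # not found
            return False
        for level in range(len(node.forward)):    # unlink at each of its levels
            preds[level].forward[level] = node.forward[level]
        return True

sl = SkipList()
for k in [3, 1, 2]:
    sl.insert(k, str(k))
print(sl.search(2), sl.remove(2), sl.search(2))   # prints: 2 True None

This version omits an explicit tail sentinel; None forward references play that role.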


Sunday, February 3, 2013

automatic document indexing

Automatic document indexing of an unstructured document involves cherry-picking words from various parts of the document and classifying them. Research has proposed statistical and linguistic methods, machine learning, neural networks, self-organizing maps and connectionist models. Among the linguistic methods, natural language processing libraries have enabled semantic interpretation of index candidates. In the absence of other input in the form of mark-ups, metadata or punctuation, the index candidates can be picked based on clusters and outliers among the topics presented in the text. To detect semantic relevance of the words, we can use concept hierarchy structures such as those from WordNet. Within a set of related words representing a cluster, all other things being equal, the one appearing in the text with the highest frequency can be picked.


Design:
I propose a single table with a record for every word occurring in the document, along with its attributes. Attributes will
include the following: word offsets, page numbers, frequency, tags, emphasis, etc. Candidates will be unigrams.
Classifier logic can be written in separate modules to determine the clusters and outliers in the above table.
Tables will be implemented as skip lists in memory for fast traversal, insert and update.
Document parsing, word stemming, canonicalization and population of the table is the first step (a sketch of this step appears after the lists below).
Clustering and index selection is the next step.
Advantages:
The index selected will be semantically organized and connected.
Entries and sub-entries can be computed with different, interchangeable classifier algorithms.
Index candidates could be taken as the common subset produced by the various classifiers.
Disadvantages:
The index selected will be single word representations.
Clusters could miss the newly introduced words that are not in dictionary or ontology used.
Outliers could miss an idea that is not mentioned anywhere else.
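
Below is a rough sketch of the first step under simplified assumptions: the canonicalization is a crude suffix strip rather than a real stemmer, and the attributes are limited to offsets and frequency:

import re
from collections import defaultdict

def canonicalize(word):
    # Very crude stemming/canonicalization, for illustration only.
    word = word.lower()
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def build_word_table(text):
    # Record each unigram with its offsets and frequency; this table is the
    # input to the later clustering and index-selection steps.
    table = defaultdict(lambda: {"offsets": [], "frequency": 0})
    for match in re.finditer(r"[A-Za-z]+", text):
        word = canonicalize(match.group())
        entry = table[word]
        entry["offsets"].append(match.start())
        entry["frequency"] += 1
    return dict(table)

sample = "Indexing documents involves picking words; indexes connect related words."
for word, attrs in build_word_table(sample).items():
    print(word, attrs)

In the proposed design this table would live in an in-memory skip list keyed by the canonical word rather than a Python dictionary.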

Interview questions from various companies

Interview questions I've been asked in the last five years

Amazon
1. Compute the max sum of a sequence in an integer array
2. Design a memory manager
3. Design a parking lot.

Alaska Airlines:
1. Describe the data structure for aircraft seating system.
2. Write a SQL query with a join and group by
3. What do you see as the challenge in working here ?

Avanade
1. What are the following C# keywords used for : static, protected internal, etc. ?
2. Write SQL queries for finding the customers who bought ten or more products ?

BankOfAmerica:
1. What architecture would you propose for the login functionality for our customers ?

BMC
1. Have you worked on the Remedy stack ?
2. Explain what kind of monitoring you have specified.
3. How would you deploy on Linux-based systems ?
4. What would your ideal deployment UI look like ?

Bocada:
1. Identify the sibling for a node in a Binary Tree.
2. Pick between Breadth-First Search and Depth-First Search. Implement one. What data structure would you use ?

Expedia
1. Given a binary search tree, find a specified node ?
2. Write code for Breadth-First Search ?
3. How would you reverse the words in a string ?
4. If you are an engineer on call for production support of a technology you do not own, how would you handle incident reports ?

Facebook
1. Given a set of elements, print the power set which involves zero or more discrete combinations of its elements.

Clear.Com
How would you store and retrieve integers from a large file ?
ClearWire

F5
1. Explain the difference between TCP and UDP
2. Get the N'th last element from a singly linked list.

IBM  ( Netezza )
1. Explain how query processing works in a DBMS ?

Honeywell
1. Is XML search case-sensitive ?
2. When would you use WCF ?
3. What's REST ?
4. What is dependency injection ?

Intel

Infosys
1.  How would you improve the performance of web applications ?
2. With use of a single sign on software  such as siteminder across various internal applications, a user session is lost between navigations. How do you diagnose and fix the problem ? This could be an issue with javascript, services, siteminder or network.
3. How do you prioritize, keep the team moving, meet the deadlines, and deliver with high quality ?

Intelius
1. Reverse the nodes of a linked list.

Micronet/ Kelly:
1. What are lambda operations ?
2. What's the difference between string and string builder ?


Microsoft Corporation
    BING
1. Given a chess board, how would you find the smallest number of moves for a coin to reach a position on the board, if possible ?
2. How would you schedule a task to start at time t ?
    Azure
1. Get predecessor in a binary tree.
    Exchange
    Windows TCP/IP
1. Explain how would you search for patterns in a string ?

EMC
    DataDomain,
1. Have you worked with Active Directory ?
2. What is CIFS ?
3. How do ACLs work ?
    Isilon

Ontela
1. Given a square matrix of size N with entries 'X' or 'O', how can you tell if there is a line of all 'X's or all 'O's ?
2. How would you design a file system ? What are the layers and Interface ?

Banking software:
Write a method to shuffle the numbers in an array and write test cases for the same in pair programming.


RealNetworks
Write a function to search for string patterns.

Riversand
1. How would you swap two numbers without a temporary variable ?
2. Have you worked with MDM before ?

Salesforce
1. Write a program to compute the Inventory and prepare a report as shown with the given input.

Skype
1. Write a protocol handler that reverses the bytes

Starbucks
1. What is wrong with the following implementation of WCF code and its config file ?
2. How do you wrap your MVC methods for error handling ? Suggested answer: the [HandleError] attribute.
3. How do you define your serialization attributes ? Why is specifying an order useful ?
4. How do you use WCF with SOAP ? What additional items do you specify in the web.config ?

Teneros
1. Write code to demonstrate inheritance and access modifiers.

TCS
1. Which would you say is your strength ? Front-End, Back-End, Middle-Tier ?
2. Describe the garbage collector in .Net. Suggested answer: it is a generational mark-and-compact collector.
3. What is reflection ? How is it used ? Suggested answer: Reflection is a way to inspect and manipulate types based on their metadata. Some uses of reflection are late binding, serialization, remoting, attributes, etc.
4. What are some of the UI patterns for search functionality and the corresponding ORM queries?

T-Mobile
1. What's app data directory ? Why do you need it ?
2. Write code to search for a number in a binary tree where the nodes are in arbitrary order.

Zillow
1. How would you find out if a given number is prime ?
2. Write test cases for the same.