Cluster computing

Monday, February 11, 2013

Glossary

Glossary of terms in computing

Automation : This refers to the set of tasks that have been specified in a way that they can run without manual intervention. This comes in very useful in test setup, execution, tear down, results store, and reports making. Frequently such tasks are delegated to a test harness.

Balancing : This refers to sharing the load between one or more processors or units of control such that there is a distribution of work resulting in more concurrency and faster execution as well as fault tolerance.

Cloud Computing : This refers to the use of one or more virtual machines in a hosted environment where the provider of VM and its network, the blades and hardware are taking away the onus of maintaining the application on a proprietary system while assuring high availability and geographic reachability.

Degree of Parallelism: This refers to the number of concurrent unit of contol invoked in the execution of a task usually by a partitioning of data.

Execution This refers to the actions needed to complete a task. The actions are usually outlined ahead together with the context and data with which to operate on and in a tightly controlled environment.

Flow diagrams : This refers to either the control or the data flows in a system where the system interaction and vulnerabilities can be studied by seeing how the data changes and with what controls as it goes through the system. This is one way to white box a system in that its internal workings are made clear with examples of data and control. Such diagrams are useful for threat analysis.

GigaByte: This is a unit of storage equal to 10 power 9 bytes and represents a way to describe the size of say files or databases.

Hashing : This is a technique to make lookups faster by keeping track of where insertions were made for entries in a collection so that they can be retrieved with a fixed cost. Typically, a range of hash values is used and a hash is computed on the object given and then put into the category. The hashing is very effective when a good hash function is used otherwise chaining and linear search may result.

Instruction : This is a specific task specified to the processor and can be considered ‘bare metal’ or ‘native’ in that it is independent of any programming languages and very much tied to the hardware for which it is specified.

Jump : This is a very specific instruction given to the processor to move from the next instruction in a sequence to something outside the sequence.

Keys, public and private : This refers to cryptographic key pairs usually used for encryption such that only the sender and receiver can interpret the encrypted traffic. The private key is also called the secret and is used for decrypting while the public key can be shared.

Locking: This refers to the locking of a set of instructions or resources such that they provide mutual exclusion so as to prevent clobbering by interfering reads or writes. Interestingly, Lock free implies the guarantee of atomic reads and writes without conflicts by tightly controlling the sequence in which they appear but without the use of locks.

Monitoring: This is ensuring continuous availability of say a web service by frequently checking for some heartbeat events or directly calling the service periodically.

Nightly run: This refers to running large analytical workloads such as OLAP on system when there is less chance for conflicts with update statements.

Proxy: This is a software entity that sits between a sender and receiver and usually passes the traffic . Proxy can be either transparent or opaque and are commonly used with http and https.

Query: This is the language of retrieving data from a data store and can be both data definition as well as data manipulation in nature. They can be as granular as a batch or statement in T-SQL

Resource: A resource is known by its cost and is helpful with management. It can be used to refer to memory or cpu required fo r execution and can loosely be used for humans assigned to different tasks.

Scale out/up: A scale out is used to refer to additional servers added horizontally to say a cluster where as a scale up tries to increase the capacity of a single server in terms of memory or processors

Ticket: A ticket is a snapshot that captures enough information to identify an incident and for tracking. It can be associated with a user on whose behalf it is created and tracked as it is worked upon and closed.

User: A user to a computer system knows only the functionality and not the mechanics of the system. In other words, the user should not be expected to figure out whether a system failed to complete a specified action or what the remedial actions are. The user only sees the system as a tool for solving a processing.

Violation, access: Access violation is a bad address specified to the processor and most commonly is a null reference.

Zero: A zero is a state of a bit in which there is no signal and is a convenient test for the Boolean state false. It provides a comparison against a signal.

Friday, February 8, 2013

Abstract Data Types

Abstract Data Types (ADTs) can be used directly in the relational model. They don't change the algebra, just the expressions over attribute values. However deeper support for non-relational structured types is provided via XML with languages like XPath and XQuery. The following three approaches enumerate some of the handling of structured types like XML. The first is to build a custom database system that operates on data with structured types but this perhaps more laborious and intensive than supporting XML in relational model. The second approach is to treat the complex type as an ADT where say a column type of XML is used to store one XML document per row. All executions are handled outside the query optimizer. A third approach is for the DBMS to normalize the nested structure into a set of relations upon insertion and this is often called shredding. With shredding, the sub-objects are mapped with foreign keys and ordering information between elements of the same level is removed which improves performance.

Thursday, February 7, 2013

a new way to tag words

When you want to select keywords in a text, there are several techniques such as linguistic search etc available as mentioned earlier. Usually words are tagged as various parts of speech (POS) available from say WordNet. I propose a slightly different tag that might make document indexing easier. This is a very simple metric that is based on how frequently these words are used across the internet. With this metric, I'm not suggesting a better way to find hapaxes. Instead I'm proposing to differentiate between frequently occuring not so common words in a text block.

The metric could be a simple Poisson Distribution where the formula is

and K is the number of occurrences of the word, k! is the factorial of k and

is a positive real number

Tuesday, February 5, 2013

indexing methods

Among the various algorithms for automated document indexing, the following variations could be considered:
1) use clusters based on number of words sharing similar concepts and their cumulative frequency. ( so this is a histogram based on topic and weights assigned to individual candidates based on the fraction of occurrances of this candidate to the overall occurances of all the words in the cluster. This idea is based on a paper by Document Indexing with a Concept Hierarchy by Gelbukh, Sidorov and Guzman. However, this variation plans to use existing concepts as detected by clustering functions using any dictionary or ontology to find relevance between index candidates.
One way to improve performance is such that the entire dictionary be in-memory or perhaps those pages that are relevant to the candidates at hand. In order to find the pages, we know the synsets of the words and their hierarchy. Alternatively, we could consider taking only fixed size sections of the dictionary and loading them in memory at a time. This way only the sections can be switched in and out as required for processing.

Monday, February 4, 2013

Design of skip lists

There are sample implementations of skip lists in many languages:
In C++ here is one reference
In C# here is another reference
In Python here is another reference
The common features across all these implementations are as follows:
1. Each skip list node has a height and a set of neighboring skip list nodes, precisely as many neighboring nodes as its height, some of which may be null.
2. Each skip list node has some piece of data associated with it.
3. Each skip list node has a key to look it up

Skiplist is initialized with the max height of any node.
Head and tail are initialized and attached

Insert operation:
Find where the node goes by comparing keys
If its a duplicate, don't insert, otherwise perform insert
Use a random number to generate a height

Remove operation:
Find the node to be deleted, traverse using available max height to lowest till you reach the node
for each of the levels on the current node, update head and tail forward references and delete the node
if not found return false

Sunday, February 3, 2013

automatic document indexing

Automatic document indexing of an unstructured document involves cherry picking of words from various parts of the document and classifying them. Research has proposed statistical and linguistic methods, machine learning, neural-networks, self-organizing maps and connectionist models. Among the linguistic methods, natural language processing libraries have enabled semantic interpretation of index candidates. In the absence of other input in terms of mark-ups, metadata or punctuations, the index candidates could be picked up based on clusters and outliers to the topics presented in the text. To detect semantic relevance of the words, we can use concept hierarchy structures such as from WordNet. In a set of related words representing a cluster and all other things considered equal, then the one appearing in the text with the highest frequency can be picked.


Design:



I propose a single table with the record of all words occuring in the document and their attributes. Attributes will 



include the following: word offsets, page numbers, frequency, tags and emphasis etc. Candidates will be unigrams. 



Classifier logic could be written in separate modules to determine the clusters and outliers in the above table. 



Tables will be implemented as skip-lists in memory for fast traversal, insert and update.



Document parsing, word stemming, canonicalization and population of the table is the first step.



Clustering and index selection is the next step.







Advantages:

The index selected will be semantically organized and connected.

Entries and Sub-Entries can be computed with different interchange-able classifier algorithms.

Index candidates could be a common subset of various classifier.



Disadvantages:

The index selected will be single word representations.

Clusters could miss the newly introduced words that are not in dictionary or ontology used.

Outliers could miss an idea that is not mentioned anywhere else.

Interview questions from various companies

Interview Questions I've been asked in interviews in the last five years

Amazon
1. Compute the max sum of a sequence in an integer array
2. Design a memory manager
3. Design a parking lot.

Alaska Airlines:
1. Describe the data structure for aircraft seating system.
2. Write a SQL query with a join and group by
3. What do you see as the challenge in working here ?

Avanade
1. What are the following C# keywords used for : static, protected internal, etc. ?
2. Write SQL queries for finding the customers who bought ten or more products ?

BankOfAmerica:
1. What architecture would you propose for the login functionality for our customers ?

BMC
1. Have you worked on the Remedy stack ?
2. Explain what kind of monitoring have you specified ?
3. How would you deploy on Linux based systems ?
4. How would your ideal deployment UI look like ?

Bocada:
1. Identify the sibling for a node in a Binary Tree.
2. Pick between Breadth-First Search and Depth-First Search. Implement one. What data structure would you use ?

Expedia
1. Given a binary search tree, find a specified node ?
2. Write code for Breadth-First Search ?
4. How would you reverse the words in a string ?
3. If you are an engineer on call for production support of a technology you do not own, how would you handle incident reports ?

Facebook
1. Given a set of elements, print the power set which involves zero or more discrete combinations of its elements.

Clear.Com
How would you store and retrieve integers from a large file ?
ClearWire

F5
1. Explain the difference between TCP and UDP
2. Get the N'th last element from a singly linked list.

IBM ( Netezza )
1. Explain how query processing works in a DBMS ?

Honeywell
1. Is XML search case-sensitive ?
2. When would you use WCF ?
3. What's REST ?
4. What is dependency injection ?

Intel

Infosys
1. How would you improve the performance of web applications ?
2. With use of a single sign on software such as siteminder across various internal applications, a user session is lost between navigations. How do you diagnose and fix the problem ? This could be an issue with javascript, services, siteminder or network.
3. How do you prioritze, keep the team moving, meet the deadlines, and deliver with high quality

Intelius
1. Reverse the nodes of a linked list.

Micronet/ Kelly:
1. What are lambda operations ?
2. What's the difference between string and string builder ?

Microsoft Corporation
    BING
1. Given a chess board, how would you find the smallest number of moves for a coin to reach a position on the board, if possible .
2. How would you schedule a task to start at time t ?
    Azure
1. Get predecessor in a binary tree.
    Exchange
    Windows TCP/IP
1. Explain how would you search for patterns in a string ?

EMC
    DataDomain,
1. Have you worked with Active Directory ?
2. What is CIFS ?
3. How do ACLs work ?
    Isilon

Ontela
1. Given a square matrix of size N and entries 'X' or 'O', how can you tell if there's a line of 'X' and 'O's ?
2. How would you design a file system ? What are the layers and Interface ?

Banking software:
Write a method to shuffle the numbers in an array and write test cases for the same in pair programming.

RealNetworks
Write a function to search for string patterns.

Riversand
1. How would you swap two numbers without a temporary variable ?
2. Have you worked with MDM before ?

Salesforce
1. Write a program to compute the Inventory and prepare a report as shown with the given input.

Skype
1. Write a protocol handler that reverses the bytes

Starbucks
1. What are wrong with the following implementation of WCF code and config file ?
2. How do you wrap your MVC methods for error handling. Answer [HandleError] attribute ?
3. How do you define your serialization attributes. Why is order useful?
4. How do you use WCF with SOAP ? What additional items do you specify in the web.config ?

Teneros
1. Write code to demonstrate inheritance and access modifiers.

TCS
1. Which would you say is your strength ? Front-End, Back-End, Middle-Tier ?
2. Desribe the garbage collector in .Net. It's compact generational markdown compactor.
3. What is reflection ? How's it used ? Suggested answer : Reflection is a way to manipulate the types based on their metadata. Some uses of reflection are late binding, serialization , remoting, attributes etc.
4. What are some of the UI patterns for search functionality and the corresponding ORM queries?

T-Mobile
1. What's app data directory ? Why do you need it ?
2. Write code to search a number in a binary tree where the nodes are arbitrary ?

Zillow
1. How would you find out if a given number is prime ?
2. Write test cases for the same.