Tuesday, February 5, 2013

indexing methods

Among the various algorithms for automated document indexing, the following variations could be considered:
1) Use clusters based on the number of words sharing similar concepts and their cumulative frequency. This is effectively a histogram keyed by topic, with weights assigned to individual candidates based on the fraction of occurrences of each candidate over the overall occurrences of all the words in its cluster. The idea is based on the paper "Document Indexing with a Concept Hierarchy" by Gelbukh, Sidorov and Guzman. However, this variation plans to use existing concepts as detected by clustering functions, using any dictionary or ontology to find relevance between index candidates.
One way to improve performance is to keep the entire dictionary in memory, or at least the pages relevant to the candidates at hand. To find those pages, we know the synsets of the words and their hierarchy. Alternatively, we could load fixed-size sections of the dictionary into memory at a time, so that only the sections need to be switched in and out as required for processing.
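As a sketch of the weighting scheme above, the following assumes clusters have already been produced by a dictionary or ontology lookup (the example words and clusters are invented for illustration); each candidate's weight is the fraction of its occurrences over the cumulative occurrences of all the words in its cluster:

```python
from collections import Counter

def cluster_weights(words, clusters):
    """Weight each candidate by its share of the cumulative frequency
    of all the words in its concept cluster."""
    freq = Counter(words)
    weights = {}
    for concept, members in clusters.items():
        total = sum(freq[w] for w in members)  # cumulative cluster frequency
        if total == 0:
            continue  # no member of this cluster occurs in the text
        for w in members:
            if freq[w]:
                weights[(concept, w)] = freq[w] / total
    return weights

# Invented example: clusters as they might come from a dictionary lookup
words = ["loan", "loan", "credit", "river"]
clusters = {"finance": ["loan", "credit", "bank"], "nature": ["river", "bank"]}
```

Keying the result by (concept, word) lets a word such as "bank" carry a different weight in each cluster it belongs to.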

Monday, February 4, 2013

Design of skip lists


There are sample implementations of skip lists in many languages:
In C++ here is one reference
In C# here is another reference
In Python here is another reference
The common features across all these implementations are as follows:
1. Each skip list node has a height and a set of neighboring skip list nodes, precisely as many neighboring nodes as its height, some of which may be null.
2. Each skip list node has some piece of data associated with it.
3. Each skip list node has a key to look it up.

The skip list is initialized with the maximum height of any node.
The head and tail are initialized and attached.

Insert operation:
Find where the node goes by comparing keys.
If it's a duplicate, don't insert; otherwise perform the insert.
Use a random number to generate the new node's height.

Remove operation:
Find the node to be deleted, traversing from the maximum available height down to the lowest level until you reach the node.
For each level present on the node, update the forward references of its predecessors to bypass it, then delete the node.
If the node is not found, return false.
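The operations above can be sketched in Python as follows. This is a minimal illustration, not any of the referenced implementations: heights are capped at a fixed maximum, and the tail is represented implicitly by null forward references rather than an explicit node.

```python
import random

class Node:
    def __init__(self, key, value, height):
        self.key = key
        self.value = value
        self.forward = [None] * height  # one forward pointer per level

class SkipList:
    def __init__(self, max_height=16, p=0.5):
        self.max_height = max_height
        self.p = p
        self.head = Node(None, None, max_height)  # sentinel head

    def _random_height(self):
        # Coin flips: each level is kept with probability p, capped at max.
        h = 1
        while random.random() < self.p and h < self.max_height:
            h += 1
        return h

    def _predecessors(self, key):
        """Walk from the highest level down, recording the last node
        before `key` at each level."""
        preds = [None] * self.max_height
        node = self.head
        for level in range(self.max_height - 1, -1, -1):
            while node.forward[level] and node.forward[level].key < key:
                node = node.forward[level]
            preds[level] = node
        return preds

    def find(self, key):
        candidate = self._predecessors(key)[0].forward[0]
        return candidate.value if candidate and candidate.key == key else None

    def insert(self, key, value):
        preds = self._predecessors(key)
        candidate = preds[0].forward[0]
        if candidate and candidate.key == key:
            return False  # duplicate: don't insert
        node = Node(key, value, self._random_height())
        for level in range(len(node.forward)):
            node.forward[level] = preds[level].forward[level]
            preds[level].forward[level] = node
        return True

    def remove(self, key):
        preds = self._predecessors(key)
        candidate = preds[0].forward[0]
        if not candidate or candidate.key != key:
            return False  # not found
        for level in range(len(candidate.forward)):
            preds[level].forward[level] = candidate.forward[level]
        return True
```

Note that both insert and remove reuse the same predecessor search, which is the traversal from the highest level down described above.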


Sunday, February 3, 2013

automatic document indexing

Automatic document indexing of an unstructured document involves cherry-picking words from various parts of the document and classifying them. Research has proposed statistical and linguistic methods, machine learning, neural networks, self-organizing maps and connectionist models. Among the linguistic methods, natural language processing libraries have enabled semantic interpretation of index candidates. In the absence of other input such as mark-up, metadata or punctuation, the index candidates can be picked based on clusters and outliers relative to the topics presented in the text. To detect the semantic relevance of the words, we can use concept hierarchy structures such as those from WordNet. Within a set of related words representing a cluster, all other things being equal, the one appearing in the text with the highest frequency can be picked.


Design:
I propose a single table with a record for every word occurring in the document, along with its attributes. Attributes will
include the following: word offsets, page numbers, frequency, tags, emphasis, etc. Candidates will be unigrams.
Classifier logic could be written in separate modules to determine the clusters and outliers in the above table.
Tables will be implemented as skip-lists in memory for fast traversal, insert and update.
Document parsing, word stemming, canonicalization and population of the table constitute the first step.
Clustering and index selection are the next step.
Advantages:
The index selected will be semantically organized and connected.
Entries and sub-entries can be computed with different interchangeable classifier algorithms.
Index candidates could be a common subset of the results of various classifiers.
Disadvantages:
The index entries will be single-word representations.
Clusters could miss newly introduced words that are not in the dictionary or ontology used.
Outliers could miss an idea that is not mentioned anywhere else.
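The first step of the design above (parsing the document and populating the table) might look like this minimal sketch. Stemming and canonicalization are omitted here, and only offsets and frequency are recorded, so this is an illustration rather than the proposed implementation:

```python
from collections import defaultdict
import re

def build_word_table(text):
    """Build a table of word records with their attributes.
    Words are simply lower-cased in lieu of real canonicalization."""
    table = defaultdict(lambda: {"offsets": [], "frequency": 0})
    for match in re.finditer(r"[A-Za-z]+", text):
        word = match.group().lower()
        rec = table[word]
        rec["offsets"].append(match.start())  # character offset in the document
        rec["frequency"] += 1
    return dict(table)
```

The classifier modules would then read this table to determine clusters and outliers, without needing to touch the raw document again.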

Interview questions from various companies

Interview Questions I've been asked in interviews in the last five years

Amazon
1. Compute the max sum of a sequence in an integer array
2. Design a memory manager
3. Design a parking lot.
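A common answer to the first Amazon question, assuming it asks for the maximum sum of a contiguous subarray, is Kadane's algorithm:

```python
def max_subarray_sum(nums):
    """Kadane's algorithm: track the best sum of a subarray ending at
    each position; the answer is the best of those."""
    best = current = nums[0]
    for x in nums[1:]:
        current = max(x, current + x)  # extend the run or start fresh at x
        best = max(best, current)
    return best
```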

Alaska Airlines:
1. Describe the data structure for aircraft seating system.
2. Write a SQL query with a join and group by
3. What do you see as the challenge in working here ?

Avanade
1. What are the following C# keywords used for : static, protected internal, etc. ?
2. Write SQL queries for finding the customers who bought ten or more products ?

BankOfAmerica:
1. What architecture would you propose for the login functionality for our customers ?

BMC
1. Have you worked on the Remedy stack ?
2. What kind of monitoring have you specified ?
3. How would you deploy on Linux based systems ?
4. What would your ideal deployment UI look like ?

Bocada:
1. Identify the sibling for a node in a Binary Tree.
2. Pick between Breadth-First Search and Depth-First Search. Implement one. What data structure would you use ?

Expedia
1. Given a binary search tree, find a specified node.
2. Write code for Breadth-First Search.
3. How would you reverse the words in a string ?
4. If you are an engineer on call for production support of a technology you do not own, how would you handle incident reports ?

Facebook
1. Given a set of elements, print the power set which involves zero or more discrete combinations of its elements.
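One way to answer the Facebook question is to grow the power set iteratively, doubling it with each element:

```python
def power_set(elements):
    """Build the power set: for each element, extend every subset found
    so far with a copy that includes it."""
    subsets = [[]]  # the empty set is always present
    for e in elements:
        subsets += [s + [e] for s in subsets]
    return subsets
```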

Clear.Com
How would you store and retrieve integers from a large file ?
ClearWire

F5
1. Explain the difference between TCP and UDP
2. Get the N'th last element from a singly linked list.
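The second F5 question is commonly answered with two pointers kept a fixed distance apart, so the list is traversed only once:

```python
class ListNode:
    def __init__(self, value, next=None):
        self.value = value
        self.next = next

def nth_from_end(head, n):
    """Advance `lead` n nodes ahead, then move both pointers in step
    until `lead` falls off the end; `trail` is then the n'th last node."""
    lead = trail = head
    for _ in range(n):
        if lead is None:
            return None  # list is shorter than n
        lead = lead.next
    while lead is not None:
        lead = lead.next
        trail = trail.next
    return trail
```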

IBM  ( Netezza )
1. Explain how query processing works in a DBMS ?

Honeywell
1. Is XML search case-sensitive ?
2. When would you use WCF ?
3. What's REST ?
4. What is dependency injection ?

Intel

Infosys
1.  How would you improve the performance of web applications ?
2. With the use of single sign-on software such as SiteMinder across various internal applications, a user session is lost between navigations. How do you diagnose and fix the problem ? This could be an issue with JavaScript, services, SiteMinder or the network.
3. How do you prioritize, keep the team moving, meet the deadlines, and deliver with high quality ?

Intelius
1. Reverse the nodes of a linked list.
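The Intelius question can be answered iteratively in O(1) extra space by re-pointing each node at its predecessor:

```python
class ListNode:
    def __init__(self, value, next=None):
        self.value = value
        self.next = next

def reverse_list(head):
    """Walk the list once, reversing each node's next pointer."""
    prev = None
    while head is not None:
        # Tuple assignment: the right side is evaluated before any rebinding.
        head.next, prev, head = prev, head, head.next
    return prev  # new head of the reversed list
```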

Micronet/ Kelly:
1. What are lambda operations ?
2. What's the difference between string and string builder ?


Microsoft Corporation
    BING
1. Given a chess board, how would you find the smallest number of moves for a piece to reach a given position on the board, if possible ?
2. How would you schedule a task to start at time t ?
    Azure
1. Get predecessor in a binary tree.
    Exchange
    Windows TCP/IP
1. Explain how you would search for patterns in a string.

EMC
    DataDomain
1. Have you worked with Active Directory ?
2. What is CIFS ?
3. How do ACLs work ?
    Isilon

Ontela
1. Given a square matrix of size N with entries 'X' or 'O', how can you tell if there's a complete line of 'X's or of 'O's ?
2. How would you design a file system ? What are the layers and Interface ?

Banking software:
Write a method to shuffle the numbers in an array and write test cases for the same in pair programming.


RealNetworks
Write a function to search for string patterns.

Riversand
1. How would you swap two numbers without a temporary variable ?
2. Have you worked with MDM before ?

Salesforce
1. Write a program to compute the Inventory and prepare a report as shown with the given input.

Skype
1. Write a protocol handler that reverses the bytes

Starbucks
1. What is wrong with the following implementation of WCF code and its config file ?
2. How do you wrap your MVC methods for error handling ? Answer: the [HandleError] attribute.
3. How do you define your serialization attributes ? Why is order useful ?
4. How do you use WCF with SOAP ? What additional items do you specify in the web.config ?

Teneros
1. Write code to demonstrate inheritance and access modifiers.

TCS
1. Which would you say is your strength ? Front-End, Back-End, Middle-Tier ?
2. Describe the garbage collector in .NET. It's a compacting, generational mark-and-sweep collector.
3. What is reflection ? How is it used ? Suggested answer: Reflection is a way to inspect and manipulate types based on their metadata. Some uses of reflection are late binding, serialization, remoting, attributes, etc.
4. What are some of the UI patterns for search functionality and the corresponding ORM queries?

T-Mobile
1. What's app data directory ? Why do you need it ?
2. Write code to search for a number in a binary tree where the nodes are in arbitrary order.

Zillow
1. How would you find out if a given number is prime ?
2. Write test cases for the same.
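A sketch of the Zillow questions: trial division up to the square root, with the assertions doubling as the requested test cases:

```python
def is_prime(n):
    """Trial division up to sqrt(n); even numbers are handled up front."""
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    i = 3
    while i * i <= n:
        if n % i == 0:
            return False
        i += 2
    return True

# Test cases (the second question)
assert is_prime(2) and is_prime(3) and is_prime(97)
assert not is_prime(1) and not is_prime(0) and not is_prime(-7)
assert not is_prime(91)  # 7 * 13, a classic near-miss
```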

Saturday, February 2, 2013

DBMS storage management

Sequential access to data is at least an order of magnitude faster than random access, and the gap has been increasing. As a result, a DBMS storage manager finds it critical to place blocks on the disk so that queries can access them sequentially. Since the DBMS knows its workload access patterns better than the underlying system does, there is a requirement that it exercise full control over the spatial positioning of database blocks on disk. The best way for the DBMS to enforce spatial locality of the data is to store it directly on the raw device. However, this uses up entire disk partitions, and the interfaces are OS specific. Instead, developments like RAID and SAN have made the virtual device more appealing. Consequently, the storage manager now allocates a single large file and places blocks directly in that file, which is essentially treated as a linear array of disk-resident pages. Further, DBMS vendors also allow the database size to be customized to suit a workload. In addition to where the data is written, the question of when the data is written is equally important. A DBMS will try to postpone or reorder writes, and this may conflict with the OS read-ahead and write-behind approach. Write-ahead logging is required by the DBMS to provide durability and correctness. Another source of conflict is the impact of OS buffering on DBMS I/O optimizations: the data access pattern is known from the query plan, whereas the OS only knows about bytes, so the double buffering between the DBMS and the OS is considered redundant. The cost of this copying is often ignored; however, the major bottleneck for throughput in well-tuned transaction processing has not been I/O, since sufficient disks and larger RAM can keep the processors supplied with data all the time. Operating systems now support options to turn off this additional buffering, which ensures that writes go to the disk when requested. In order to provide access to database pages, the DBMS implements a buffer pool.
Along with the array of buffer pool frames, it maintains a hash table that maps the page numbers currently held in memory to their location in the frame table, the location of the page on disk, and metadata about the page such as a dirty bit and any information needed by the page replacement policy. Active pages are "pinned" in memory, and these are typically a small fraction of the overall buffer pool. Since the contents of a page can be diverse, the focus is instead on the page replacement policy algorithms. With 64-bit addressing and falling memory prices, very large buffer pools are now possible. Flash memory is also making its impact, with a cost/performance tradeoff relative to disk and RAM. Another factor improving data residency has been compression, but this requires representations that are amenable to the data processing and query processing internals. Finally, column-oriented storage is being considered in the non-traditional database market.
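A toy sketch of the buffer pool bookkeeping described above: a hash table from page numbers to frames, pin counts, dirty bits, and replacement of unpinned pages. The `disk` dictionary stands in for the underlying file, and plain LRU is assumed as the replacement policy; real systems use more elaborate schemes.

```python
from collections import OrderedDict

class BufferPool:
    """Toy buffer pool with a fixed number of frames and LRU replacement
    of unpinned pages. `disk` maps page numbers to page contents."""
    def __init__(self, capacity, disk):
        self.capacity = capacity
        self.disk = disk
        self.frames = OrderedDict()  # page_no -> frame metadata, in LRU order

    def pin(self, page_no):
        if page_no in self.frames:
            self.frames.move_to_end(page_no)  # mark as recently used
        else:
            if len(self.frames) >= self.capacity:
                self._evict()
            self.frames[page_no] = {"data": self.disk[page_no],
                                    "pin_count": 0, "dirty": False}
        frame = self.frames[page_no]
        frame["pin_count"] += 1
        return frame

    def unpin(self, page_no, dirty=False):
        frame = self.frames[page_no]
        frame["pin_count"] -= 1
        frame["dirty"] = frame["dirty"] or dirty

    def _evict(self):
        # Scan in LRU order for an unpinned frame; write it back if dirty.
        for page_no, frame in self.frames.items():
            if frame["pin_count"] == 0:
                if frame["dirty"]:
                    self.disk[page_no] = frame["data"]
                del self.frames[page_no]
                return
        raise RuntimeError("all frames pinned")
```

The dirty bit defers the write-back until eviction, which is exactly why the DBMS, not the OS, wants to decide when pages reach the disk.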

Friday, February 1, 2013

Data Warehouses

Data Warehouses are large historical databases that are loaded with data periodically. They are then used for decision support, which accounts for nearly one third of all DBMS activity, and they have come to require custom query optimization and execution engine support. The first difference is that they are not suited for online transaction processing (OLTP), where response time is critical. Historical data, on the other hand, is used for analysis such as deciding which items to put on promotion and which layout is important for the upcoming season. These kinds of queries are based on the wisdom that historical information enables better stock management and more savings. The second difference is that such databases require a schema very different from others, such as a star, multi-level star or snowflake schema, where customers, products, stores, times, etc. are the dimension tables surrounding a central fact table. Many dimensions are naturally hierarchical, hence the use of the multi-level star or snowflake. The third difference is that they require very different data structures from the B+-trees that were optimized for fast insertion, deletion and update of records. In contrast, a data warehouse performs an initial load and then the data is static for months. Bitmaps help store this information with far fewer bits and are also advantageous for various kinds of sophisticated bitmap arithmetic, including conjunctive filters. The fourth difference is that these databases require fast periodic loads of large amounts of data. These loads are not only accumulated data but are also pulled from operational systems during, say, night time. Loads and queries can potentially conflict, hence the need for techniques such as update-in-place and historical queries. Updates are timestamped, and MVCC isolation is provided by a few vendors. The fifth difference is that joins on the data are very costly to perform, hence the need to capture a view and persist it until the next update.
The sixth difference is that these databases support predictable queries consisting of various aggregates. These aggregates, often called data cubes, are a special class of materialized views that allow users to navigate the data. Lastly, such systems required special hardware, and some vendors like Teradata and Netezza provided proprietary hardware.
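To illustrate the bitmap arithmetic mentioned above, here is a small sketch (the column values are invented) in which a conjunctive filter becomes a single bitwise AND of two bitmaps:

```python
def bitmap(values, predicate):
    """Build a bitmap, packed into a Python int, with bit i set when
    row i satisfies the predicate."""
    bits = 0
    for i, v in enumerate(values):
        if predicate(v):
            bits |= 1 << i
    return bits

# Invented columns: find rows where region == 'west' AND product == 'coffee'
regions  = ["west", "east", "west", "west"]
products = ["coffee", "coffee", "tea", "coffee"]
west   = bitmap(regions,  lambda r: r == "west")
coffee = bitmap(products, lambda p: p == "coffee")
matches = west & coffee  # the conjunction is one bitwise AND
rows = [i for i in range(len(regions)) if matches >> i & 1]
```

With these four rows, `west` is 0b1101 and `coffee` is 0b1011, so `matches` is 0b1001, i.e. rows 0 and 3, without touching the row data at all.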

Data archival landscape

Data explodes continuously, and companies are forced to retain this data over longer periods of time. Consequently, companies look for more capable archival solutions, and there are different vendors and technologies in this space. Many have overlapping ideas or technology propositions, and the landscape can become quite murky with the jargon each one uses. Here is a glossary of a few key concepts.
1. NAS: Network-Attached Storage is probably the most common form of large-scale storage for files and clusters. These are shared-nothing commodity servers joined together with a high-speed network so that they create a combined storage space for the data. As opposed to a SAN, or Storage Area Network, NAS uses an IP network to connect the servers. SANs are used for large databases.
2. Appliance: A storage appliance is logically a drive to interacting systems. Physically, it may comprise a server that allows for the addition and removal of disk drives. These can grow to very large numbers and sizes in terms of drives and capacity. Moreover, appliances can comprise a very complex organization of servers and drives in a scaled architecture to enable generational data and archival. Their appeal includes low maintenance and little or no custom application work for integration.
3. Tiering: Tiering is yet another solution from the storage industry, which enables policies to be specified for the generational mark-down of data and its movement between tiers. This allows hardware to be differentiated to suit various kinds of storage traffic. By providing tiers, storage space is prioritized based on media cost and usage.