Sunday, February 3, 2013

Interview questions from various companies

Interview questions I've been asked in the last five years

Amazon
1. Compute the maximum sum of a contiguous sequence in an integer array (a sketch follows this list).
2. Design a memory manager.
3. Design a parking lot.
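For question 1, a minimal sketch of Kadane's algorithm for the maximum contiguous-subarray sum, assuming "sequence" means a contiguous run; the function name and sample input are illustrative.

def max_subarray_sum(nums):
    """Kadane's algorithm: best sum over all contiguous subarrays of nums."""
    best = current = nums[0]
    for x in nums[1:]:
        current = max(x, current + x)   # either extend the running subarray or start over at x
        best = max(best, current)
    return best

print(max_subarray_sum([-2, 1, -3, 4, -1, 2, 1, -5, 4]))  # 6, from the subarray [4, -1, 2, 1]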

Alaska Airlines:
1. Describe the data structures for an aircraft seating system.
2. Write a SQL query with a join and a GROUP BY.
3. What do you see as the challenges of working here?

Avanade
1. What are the following C# keywords used for: static, protected internal, etc.?
2. Write a SQL query to find the customers who bought ten or more products.

BankOfAmerica:
1. What architecture would you propose for the login functionality for our customers?

BMC
1. Have you worked on the Remedy stack?
2. What kind of monitoring have you specified?
3. How would you deploy on Linux-based systems?
4. What would your ideal deployment UI look like?

Bocada:
1. Identify the sibling for a node in a Binary Tree.
2. Pick between Breadth-First Search and Depth-First Search. Implement one. What data structure would you use?

Expedia
1. Given a binary search tree, find a specified node.
2. Write code for Breadth-First Search (a sketch follows this list).
3. How would you reverse the words in a string?
4. If you are an engineer on call for production support of a technology you do not own, how would you handle incident reports?
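For question 2, a hedged sketch of breadth-first search over a binary tree, written here to locate a target value; the Node class and names are assumptions for illustration.

from collections import deque

class Node:
    def __init__(self, value, left=None, right=None):
        self.value, self.left, self.right = value, left, right

def bfs_find(root, target):
    """Level-order (breadth-first) traversal; returns the first node whose value equals target."""
    queue = deque([root] if root else [])
    while queue:
        node = queue.popleft()
        if node.value == target:
            return node
        if node.left:
            queue.append(node.left)
        if node.right:
            queue.append(node.right)
    return None

root = Node(8, Node(3, Node(1), Node(6)), Node(10, None, Node(14)))
print(bfs_find(root, 6).value)   # 6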

Facebook
1. Given a set of elements, print its power set, i.e. all subsets of zero or more of its elements (a sketch follows).
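A minimal sketch using a bitmask counter, where each bit of the counter decides whether the corresponding element is included; names and the sample set are illustrative.

def power_set(items):
    """Return all subsets of items; bit i of the counter selects element i."""
    n = len(items)
    return [[items[i] for i in range(n) if mask & (1 << i)] for mask in range(1 << n)]

print(power_set(['a', 'b', 'c']))
# [[], ['a'], ['b'], ['a', 'b'], ['c'], ['a', 'c'], ['b', 'c'], ['a', 'b', 'c']]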

Clear.Com
How would you store and retrieve integers from a large file?
ClearWire

F5
1. Explain the difference between TCP and UDP
2. Get the Nth-from-last element of a singly linked list (a sketch follows this list).
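For question 2, a sketch of the usual two-pointer approach, assuming a 1-based count from the end; the ListNode class is illustrative.

class ListNode:
    def __init__(self, value, next=None):
        self.value, self.next = value, next

def nth_from_last(head, n):
    """Return the nth node from the end (1-based) by keeping two pointers n nodes apart."""
    lead = follow = head
    for _ in range(n):
        if lead is None:
            return None          # the list has fewer than n nodes
        lead = lead.next
    while lead:
        lead, follow = lead.next, follow.next
    return follow

head = ListNode(1, ListNode(2, ListNode(3, ListNode(4))))
print(nth_from_last(head, 2).value)   # 3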

IBM  ( Netezza )
1. Explain how query processing works in a DBMS.

Honeywell
1. Is XML search case-sensitive?
2. When would you use WCF?
3. What's REST?
4. What is dependency injection? (A short sketch follows this list.)
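For question 4, a tiny constructor-injection example; the classes are hypothetical and only show the idea of passing a dependency in rather than constructing it inside.

class SmtpMailer:
    def send(self, to, body):
        print(f"sending to {to}: {body}")

class Notifier:
    def __init__(self, mailer):
        # The dependency is injected through the constructor instead of being
        # created inside the class, so tests can pass in a fake mailer.
        self.mailer = mailer

    def notify(self, user):
        self.mailer.send(user, "hello")

Notifier(SmtpMailer()).notify("a@example.com")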

Intel

Infosys
1. How would you improve the performance of web applications?
2. With single sign-on software such as SiteMinder used across various internal applications, a user session is lost between navigations. How do you diagnose and fix the problem? It could be an issue with JavaScript, services, SiteMinder, or the network.
3. How do you prioritize, keep the team moving, meet deadlines, and deliver with high quality?

Intelius
1. Reverse the nodes of a linked list (a sketch follows).
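A minimal iterative sketch; the ListNode class is illustrative.

class ListNode:
    def __init__(self, value, next=None):
        self.value, self.next = value, next

def reverse_list(head):
    """Reverse a singly linked list in place and return the new head."""
    prev = None
    while head:
        head.next, prev, head = prev, head, head.next
    return prev

head = reverse_list(ListNode(1, ListNode(2, ListNode(3))))
print(head.value, head.next.value, head.next.next.value)   # 3 2 1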

Micronet/ Kelly:
1. What are lambda expressions used for?
2. What's the difference between string and StringBuilder?


Microsoft Corporation
    BING
1. Given a chess board, how would you find the smallest number of moves for a piece to reach a given position on the board, if it is reachable? (A sketch for a knight follows.)
2. How would you schedule a task to start at time t?
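For question 1, a hedged sketch assuming the piece is a knight on an 8x8 board; breadth-first search over board squares yields the minimum number of moves.

from collections import deque

def min_knight_moves(start, target, n=8):
    """BFS over board squares; returns the fewest knight moves from start to target, or -1."""
    jumps = [(1, 2), (2, 1), (-1, 2), (-2, 1), (1, -2), (2, -1), (-1, -2), (-2, -1)]
    seen, queue = {start}, deque([(start, 0)])
    while queue:
        (r, c), dist = queue.popleft()
        if (r, c) == target:
            return dist
        for dr, dc in jumps:
            nxt = (r + dr, c + dc)
            if 0 <= nxt[0] < n and 0 <= nxt[1] < n and nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return -1

print(min_knight_moves((0, 0), (7, 7)))  # 6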
    Azure
1. Get the predecessor of a node in a binary tree.
    Exchange
    Windows TCP/IP
1. Explain how you would search for patterns in a string.

EMC
    DataDomain
1. Have you worked with Active Directory?
2. What is CIFS?
3. How do ACLs work?
    Isilon

Ontela
1. Given a square matrix of size N with entries 'X' or 'O', how can you tell whether any row, column, or diagonal is a line of all 'X's or all 'O's? (A sketch follows this list.)
2. How would you design a file system? What are the layers and interfaces?
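For question 1, a sketch under the assumption that a "line" means a complete row, column, or main diagonal of one symbol; names are illustrative.

def winning_line(board):
    """Return 'X' or 'O' if any row, column, or main diagonal is all one symbol, else None."""
    n = len(board)
    lines = [[board[i][j] for j in range(n)] for i in range(n)]                  # rows
    lines += [[board[i][j] for i in range(n)] for j in range(n)]                 # columns
    lines += [[board[i][i] for i in range(n)], [board[i][n - 1 - i] for i in range(n)]]  # diagonals
    for line in lines:
        if line[0] in ('X', 'O') and all(cell == line[0] for cell in line):
            return line[0]
    return None

print(winning_line([['X', 'O', 'O'],
                    ['O', 'X', 'O'],
                    ['O', 'O', 'X']]))  # X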

Banking software:
Write a method to shuffle the numbers in an array, and write test cases for it, as a pair-programming exercise (a sketch of the shuffle follows).
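A minimal Fisher-Yates shuffle sketch; in an interview the test cases would check length, element preservation, and rough uniformity.

import random

def shuffle_in_place(nums):
    """Fisher-Yates shuffle: every permutation is equally likely."""
    for i in range(len(nums) - 1, 0, -1):
        j = random.randint(0, i)            # pick a slot from the not-yet-fixed prefix
        nums[i], nums[j] = nums[j], nums[i]
    return nums

print(shuffle_in_place(list(range(10))))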


RealNetworks
Write a function to search for string patterns (a sketch follows).
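A hedged sketch using a naive scan; a fuller answer might discuss KMP, Boyer-Moore, or Rabin-Karp instead.

def find_pattern(text, pattern):
    """Return the index of the first occurrence of pattern in text, or -1 (naive scan)."""
    n, m = len(text), len(pattern)
    for i in range(n - m + 1):
        if text[i:i + m] == pattern:
            return i
    return -1

print(find_pattern("the quick brown fox", "brown"))  # 10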

Riversand
1. How would you swap two numbers without a temporary variable? (A sketch follows this list.)
2. Have you worked with MDM before?
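For question 1, an XOR-based sketch; in Python the idiomatic answer is simply a, b = b, a, so this only shows the trick.

def swap_without_temp(a, b):
    """Swap two integers using XOR, with no temporary variable."""
    a ^= b
    b ^= a   # b now holds the original a
    a ^= b   # a now holds the original b
    return a, b

print(swap_without_temp(3, 7))  # (7, 3)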

Salesforce
1. Write a program to compute the Inventory and prepare a report as shown with the given input.

Skype
1. Write a protocol handler that reverses the bytes

Starbucks
1. What is wrong with the following WCF implementation and its config file?
2. How do you wrap your MVC methods for error handling? Suggested answer: the [HandleError] attribute.
3. How do you define your serialization attributes? Why is order useful?
4. How do you use WCF with SOAP? What additional items do you specify in the web.config?

Teneros
1. Write code to demonstrate inheritance and access modifiers.

TCS
1. Which would you say is your strength: Front-End, Back-End, or Middle-Tier?
2. Describe the garbage collector in .NET. Suggested answer: it is a compacting, generational mark-and-sweep collector.
3. What is reflection? How is it used? Suggested answer: reflection is a way to inspect and manipulate types based on their metadata. Some uses of reflection are late binding, serialization, remoting, attributes, etc.
4. What are some of the UI patterns for search functionality and the corresponding ORM queries?

T-Mobile
1. What is the app data directory? Why do you need it?
2. Write code to search for a number in a binary tree whose nodes are in arbitrary order (a sketch follows this list).
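For question 2, a recursive depth-first sketch; since the nodes are unordered, every node may need to be visited. The Node class is illustrative.

class Node:
    def __init__(self, value, left=None, right=None):
        self.value, self.left, self.right = value, left, right

def contains(node, target):
    """Depth-first search of an unordered binary tree."""
    if node is None:
        return False
    return node.value == target or contains(node.left, target) or contains(node.right, target)

tree = Node(5, Node(9, Node(2)), Node(1))
print(contains(tree, 2), contains(tree, 7))   # True False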

Zillow
1. How would you find out if a given number is prime? (A sketch with test cases follows this list.)
2. Write test cases for the same.
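A trial-division sketch with a few test cases; fine for small inputs, though a fuller answer might mention probabilistic tests for large numbers.

def is_prime(n):
    """Trial division up to sqrt(n); sufficient for small inputs."""
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    i = 3
    while i * i <= n:
        if n % i == 0:
            return False
        i += 2
    return True

# Test cases covering edge values, small primes, and composites.
assert not is_prime(0) and not is_prime(1)
assert is_prime(2) and is_prime(3) and is_prime(97)
assert not is_prime(91)   # 7 * 13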

Saturday, February 2, 2013

DBMS storage management

Sequential access to data is at least an order of magnitude faster than random access, and the gap keeps growing. As a result, a DBMS storage manager finds it critical to place blocks on disk so that queries can access them sequentially. Since the DBMS knows its workload access patterns better than the underlying OS does, it wants full control over the spatial positioning of database blocks on disk. The best way for the DBMS to enforce spatial locality is to store the data directly on a raw device, but this uses up entire disk partitions and the interfaces are OS-specific. Instead, developments like RAID and SAN have made virtual devices more appealing. Consequently, the storage manager typically allocates a single large file and places blocks directly in it, treating the file as a linear array of disk-resident pages. Some DBMS vendors also allow the database page size to be set to a value appropriate for the workload.

In addition to where the data is written, when the data is written is equally important. A DBMS will try to postpone or reorder writes, and this may conflict with the OS read-ahead and write-behind behavior; write-ahead logging is required by the DBMS for durability and correctness. Another source of conflict is the impact of OS buffering on DBMS I/O optimizations: the data access pattern is known from the query plan, whereas the OS only sees bytes, so double buffering between the DBMS and the OS is redundant. Copying is often dismissed as cheap, yet the major throughput bottleneck in well-tuned transaction processing is typically not I/O: with sufficient disks and enough RAM the processors can be kept supplied with data, so redundant copies matter. Operating systems now support options to turn off this additional buffering, which ensures that writes go to disk when requested.

To provide access to database pages, the DBMS implements a buffer pool. Along with the array of buffer frames, it maintains a hash table that maps the page numbers currently held in memory to their location in the frame table, records the location of each page on disk storage, and keeps per-page metadata such as a dirty bit and any information needed by the page replacement policy. Active pages are "pinned" in memory, and these are typically a small fraction of the overall buffer pool. Since the contents of pages can be diverse, attention focuses instead on the page replacement policy. With 64-bit addressing and falling memory prices, very large buffer pools are now possible, and flash memory is making its impact with a cost/performance point between disk and RAM. Another way to improve data residency has been compression, though this requires representations that are amenable to the data processing and query processing internals. Finally, column-oriented storage is being adopted in the non-traditional database market.
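A toy sketch of the buffer-pool bookkeeping described above (a hash table from page number to frame, pin counts, dirty bits, write-back on eviction); the class shape and the simple evict-first-unpinned policy are illustrative assumptions, not how any particular DBMS does it.

from collections import OrderedDict

class BufferPool:
    """Toy buffer pool: maps page numbers to in-memory frames and tracks dirty/pinned state."""

    def __init__(self, capacity, disk):
        self.capacity = capacity
        self.disk = disk                       # dict page_no -> bytes, standing in for the data file
        self.frames = OrderedDict()            # page_no -> {'data', 'dirty', 'pins'}; order approximates LRU

    def get_page(self, page_no):
        if page_no not in self.frames:
            if len(self.frames) >= self.capacity:
                self._evict()
            self.frames[page_no] = {'data': self.disk[page_no], 'dirty': False, 'pins': 0}
        self.frames.move_to_end(page_no)       # mark as most recently used
        self.frames[page_no]['pins'] += 1      # "pin" the page while a caller is using it
        return self.frames[page_no]

    def unpin(self, page_no, dirty=False):
        frame = self.frames[page_no]
        frame['pins'] -= 1
        frame['dirty'] = frame['dirty'] or dirty

    def _evict(self):
        for page_no, frame in self.frames.items():
            if frame['pins'] == 0:             # only unpinned pages are eviction candidates
                if frame['dirty']:
                    self.disk[page_no] = frame['data']   # write back before dropping
                del self.frames[page_no]
                return
        raise RuntimeError("all pages pinned")

pool = BufferPool(capacity=2, disk={1: b'page one', 2: b'page two', 3: b'page three'})
pool.get_page(1); pool.unpin(1, dirty=True)
pool.get_page(2); pool.unpin(2)
pool.get_page(3); pool.unpin(3)                # forces eviction of page 1, writing it back first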

Friday, February 1, 2013

Data Warehouses

Data warehouses are large historical databases that get loaded with data periodically and are then used for decision support, which accounts for nearly one third of all DBMS activity. They have come to require custom query optimization and execution engine support.

The first difference is that they are not aimed at online transaction processing (OLTP), where response time is critical. Historical data, on the other hand, is used for analysis, such as deciding which items to put on promotion and how layout should change for the upcoming season; these kinds of queries rest on the wisdom that historical information enables better stock management and more savings. The second difference is that such databases use a schema quite unlike others, such as a star, multi-level star, or snowflake schema, where customers, products, stores, times, and so on are the dimension tables around a central fact table; many dimensions are naturally hierarchical, hence the multi-level star or snowflake forms. The third difference is that they require very different data structures from B+-trees, which are optimized for fast insertion, deletion, and update of records. In contrast, a data warehouse performs an initial load and then the data stays static for months. Bitmaps store this information with far fewer bits and also support sophisticated bitmap arithmetic, including conjunctive filters.

The fourth difference is that these databases require fast periodic loads of large amounts of data, both accumulated bulk data and data loaded from operational systems during, say, nighttime windows. Loads and queries can conflict, so techniques such as update-in-place and historical queries are needed; a few vendors timestamp updates and provide MVCC isolation. The fifth difference is that joins on the data are very costly to perform, so there is a need to capture a view and persist it until the next update. The sixth difference is that these databases support predictable queries consisting of various aggregates; these aggregates, often called data cubes, are a special class of materialized views that let the user navigate the data. Lastly, such systems sometimes call for special hardware, and vendors like Teradata and Netezza provide proprietary hardware.
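A tiny sketch of the bitmap idea mentioned above: one bitmap per distinct column value, so a conjunctive filter reduces to a single bitwise AND; the column names and data are made up.

def bitmap_index(values):
    """Build one bitmap (stored as a Python int) per distinct value of a column."""
    bitmaps = {}
    for row, value in enumerate(values):
        bitmaps[value] = bitmaps.get(value, 0) | (1 << row)
    return bitmaps

state = bitmap_index(['WA', 'CA', 'WA', 'OR', 'WA'])
gender = bitmap_index(['F', 'F', 'M', 'F', 'F'])

# Conjunctive filter: rows where state = 'WA' AND gender = 'F' is one bitwise AND.
matches = state['WA'] & gender['F']
print([row for row in range(5) if matches & (1 << row)])   # [0, 4]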

Data archival landscape

Data explodes continuously, and companies are forced to retain it over ever longer periods of time. Consequently, they look for more capable archival solutions, and there are many vendors and technologies in this space. Many have overlapping ideas or technology propositions, and the picture can become quite murky with the jargon each one uses. Here is a glossary of a few key concepts.
1. NAS: Network-Attached Storage is probably the most common form of large-scale storage for files and clusters. These are shared-nothing commodity servers joined together with a high-speed network so that they present a combined storage space for the data. As opposed to a SAN, or Storage Area Network, NAS uses an IP network to connect the servers; SANs are typically used for large databases.
2. Appliance: A storage appliance is logically a drive to the systems that interact with it. Physically, it may consist of a server chassis that allows disk drives to be added and removed. Appliances can grow very large in terms of the number of drives and total capacity. Moreover, they can be organized into quite complex arrangements of servers and drives in a scaled architecture to enable generational data management and archival. Their appeal includes low maintenance and little or no custom application work for integration.
3. Tiering: Tiering is yet another solution from the storage industry; it lets policies be specified for the generational mark-down of data and its movement between tiers. This allows different classes of hardware to serve different kinds of storage traffic, so storage space is prioritized based on media cost and usage.
 

Thursday, January 31, 2013

Threads and Process


"An operating system thread is a unit of control without additional private OS context and without a private address space. Each OS thread has full access to the memory of the other threads executing within the same multithreaded OS process. Thread execution is scheduled by the operating system kernel scheduler and these threads are often called "kernel threads" or k-threads.
A Lightweight Thread package is an application-level construct that supports multiple threads within a single OS process. Unlike OS threads scheduled by the OS, lightweight threads are scheduled by an application-level thread scheduler. The difference between a lightweight thread and a kernel thread is that a lightweight thread is scheduled in user-space without kernel scheduler involvement or knowledge. The combination of the user-space scheduler and all of its lightweight threads run withing a single OS process and appears to the OS scheduler as a single thread of execution.  Lightweight threads have the advantage of faster thread switches when compared to OS threads since there is no need to do an OS kernel mode switch to schedule the next thread. Lightweight threads ahve the disadvantage, however, that any blocking operation such as a synchronous I/O by any thread will block all threads in the process. This prevents any of the other threads frommaking progress while one thread is blocked waiting for an OS resource. Lightweight thread packages avoid this by (1) issuing only asychronous (non-blocking) I/O requests and (2) not invoking any OS operations that could block. Generally, lightweight threads offer a more difficult programming model than writing software based on either OS processes or OS threads. Some DBMSs implement their own lightweight thread (LWT) packages. These are a special case of general LWT packages. We refer to these threads as DBMS threads and simply threads when the distinction between DBMS, general LWT and OS threads are unimportant to the discussion. A DBMS client is the software component that implements the API used by the application programs to communicate with a DBMS. Some example database access APIs are JDBC, ODBC, and OLE/DB. In addition, there are a wide variety of proprietary database access API sets. Some programs are written using embedded SQL, a technique of mixing programming language statements with database access statements. This was first delivered in IBM COBOL and PL/I and, much later, in SQL/J which implements embedded SQL for Java. Embedded SQL is processed by preprocessors that translate the embedded SQL statements into direct calls to data access APIs. Calls made to these APIs are marshaled by the DBMS client component and sent to the DBMS over some communications protocol. The protocols are usually proprietary and often undocumented. In the past, there have been several efforts to standardize client-to-database communication protocols, with Open Group DRDA being perhas the best known, but none have achived broad adoption. A DBMS worker is the thread of execution in the DBMS that does work on behalf of a DBMS client. A 1:1 mapping exists between a DBMS worker and a DBMS client: the DBMS worker handles all SQL requests from a single DBMS client. The DBMS client sends SQL requests to the DBMS server. The worker executes each request and returns the result to the client."
Reference: Architecture of a database system - Hellerstein, Stonebraker, Hamilton
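Not from the paper: a small Python asyncio sketch that illustrates the same point about user-space scheduling, namely that cooperatively scheduled tasks only interleave at non-blocking yield points, and a single blocking call stalls every task in the process.

import asyncio, time

async def worker(name):
    for step in range(2):
        print(f"{name} step {step}")
        await asyncio.sleep(0.1)        # non-blocking: yields to the user-space scheduler

async def blocking_worker():
    time.sleep(0.3)                     # blocking call: stalls every task in the event loop
    print("blocking worker done")

async def main():
    await asyncio.gather(worker("A"), worker("B"), blocking_worker())

asyncio.run(main())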

Wednesday, January 30, 2013

Python NLTK processing

NLTK, short for Natural Language Toolkit, is a library available for the Python programming language. It defines the following constructs (a short usage sketch follows):
concordance: lists every occurrence of a word in the text along with its surrounding context.
similar: lists the other words that appear in the same range of contexts as the specified word.
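A short usage sketch, assuming the NLTK sample texts have been downloaded; text1 is the Moby Dick sample from the NLTK book corpus.

# Requires: pip install nltk, then nltk.download('book') to fetch the sample texts.
from nltk.book import text1

text1.concordance("monstrous")  # every occurrence of the word, with surrounding context
text1.similar("monstrous")      # other words that appear in a similar range of contexts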

Tuesday, January 29, 2013

Lock free

Lock primitives such as mutual exclusion locks are popular, but once they number more than a handful their benefits wane. Programmers try to maintain correctness in the face of conflicting operations, but in serializing some of them they also serialize some non-conflicting operations. Similarly, programmers try to maintain liveness, yet locks end up being held longer than would otherwise be necessary. Also, without scheduler support, programmers need to be aware of priority inversion problems. Lastly, for high performance, programmers must balance the granularity at which locking operates against the time the application spends acquiring and releasing locks.
Hence, three different APIs are proposed by Fraser and Harris, each of which attempts to address the shortcomings mentioned above. They are non-blocking and generally allow disjoint-access parallelism, and they all follow a common optimistic style.
The first API provides multi-word compare-and-swap (MCAS), which generalizes the single-word CAS operation found on many processors. It atomically updates one or more memory locations from a set of expected values to a set of new values. The API comes with two operations, MCASRead and MCAS: the former reads a value from a location that may be subject to MCAS updates, while the latter performs the atomic multi-word update.
The second API provides word-based software transactional memory (WSTM), which avoids some of these problems by allowing a series of reads and writes to be grouped as a software transaction and applied to the heap atomically. This API comes with two operations, WSTMRead and WSTMWrite, one for reads and one for writes.
The third API provides an object-based software transactional memory (OSTM), which allows a thread to 'open' a set of objects for transactional access and, once more, to commit updates to them atomically. Each object is accessed through an OSTMHandle, which must be passed to an OSTMOpenForReading or OSTMOpenForWriting call in order to obtain access to the underlying data.
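Not one of the paper's APIs, but a small Python sketch of the common optimistic, retry-based style they share: read the current value, compute a new one, and try to install it atomically, retrying on conflict. Python has no hardware CAS, so the Cell class below simulates one; all names are illustrative.

import threading

class Cell:
    """Simulates a single word with an atomic compare-and-swap."""
    def __init__(self, value):
        self.value = value
        self._guard = threading.Lock()   # stands in for the atomicity the hardware would provide

    def compare_and_swap(self, expected, new):
        with self._guard:
            if self.value == expected:
                self.value = new
                return True
            return False

def optimistic_increment(cell):
    """Optimistic style: read, compute a new value, try to install it, retry on conflict."""
    while True:
        seen = cell.value
        if cell.compare_and_swap(seen, seen + 1):
            return

counter = Cell(0)
threads = [threading.Thread(target=lambda: [optimistic_increment(counter) for _ in range(1000)])
           for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(counter.value)   # 4000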
As a comparison between the three APIs, here are some criteria to consider:
First, disjoint-access parallelism is provided by MCAS when accessing disjoint sets of words, probabilistically by WSTM, and by OSTM when accessing disjoint sets of objects.
Second, MCAS offers no read parallelism, whereas both WSTM and OSTM allow it.
Third, as a space cost, MCAS reserves two bits in each word, WSTM uses a fixed-size table of 65,536 double-word entries, and OSTM uses one word in each object handle.
Fourth, MCAS has no composability, whereas both WSTM and OSTM are composable.