Saturday, May 11, 2013

File System Interface

Let us quickly review the operating system design principles of a file system.
We want to discuss the functions of file systems, describe the interfaces to file systems, discuss file-system design tradeoffs and access methods, and explore file-system protection.
Files are mapped by the operating system onto physical devices. A file is usually the smallest allotment of logical secondary storage. A file can have attributes such as name, identifier, type, location, size, and protection.
File operations include creating, writing, reading, repositioning, deleting, and truncating a file. These six basic operations comprise the minimal set of required file operations.
Most of the file operations mentioned involve searching the directory for the entry associated with the named file. To avoid this repeated search, many systems require that an open() call be made prior to file use, and the file system maintains an open-file table containing information about all open files. The open system call typically returns a pointer to the entry in the open-file table, and this pointer, not the actual file name, is used in all subsequent file operations.
Similarly, the operating system maintains a file-open count so that it knows when an entry is no longer in use and can reuse open-file table entries; otherwise it could run out of space in the table. The file system also has to locate the file on the disk.
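As a rough illustration of how open(), the open-file table, and the open count fit together, here is a minimal Python sketch; the names OpenFileTable and FileEntry are hypothetical and not modeled on any particular kernel.

class FileEntry:
    def __init__(self, name, location):
        self.name = name          # file name from the directory entry
        self.location = location  # where the file lives on disk
        self.open_count = 0       # number of processes that have it open

class OpenFileTable:
    def __init__(self):
        self._entries = {}        # name -> FileEntry

    def open(self, name, location=None):
        # Return a handle (the table entry) used for all later operations,
        # so the directory search happens only once.
        entry = self._entries.get(name)
        if entry is None:
            entry = FileEntry(name, location)
            self._entries[name] = entry
        entry.open_count += 1
        return entry

    def close(self, entry):
        entry.open_count -= 1
        if entry.open_count == 0:
            # No process has the file open; the slot can be reused.
            del self._entries[entry.name]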
Each process opens a file in an access mode so that subsequent I/O requests can be allowed or denied. File systems also provide facilities for locking an open file. File locks usually behave like reader-writer locks: a shared lock is similar to a reader lock, so several processes can acquire it concurrently, while an exclusive lock is like a writer lock, allowing only one process to acquire it at a time.
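On Unix-like systems, advisory locks with exactly this shared/exclusive flavor are available through Python's fcntl module; a small sketch (the file name data.txt is just an example):

import fcntl

# Exclusive (writer) lock: only one process may hold it at a time.
with open("data.txt", "a") as f:
    fcntl.flock(f, fcntl.LOCK_EX)
    f.write("appended under an exclusive lock\n")
    fcntl.flock(f, fcntl.LOCK_UN)

# Shared (reader) lock: several processes may hold it concurrently.
with open("data.txt", "r") as f:
    fcntl.flock(f, fcntl.LOCK_SH)
    contents = f.read()
    fcntl.flock(f, fcntl.LOCK_UN)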
File types are also important to the file system. File extensions inform the file system of the programs associated with a file type and the actions to be taken.
File types can also be used to indicate the internal structure of the file. For example, an executable might have a specific file structure so that the operating system can determine where in memory to load the file and where the first instruction is. The Macintosh operating system, for example, requires that files have a resource fork and a data fork. The resource fork contains information of interest to the user, while the data fork contains program code or data - the traditional file contents.
Access methods determine how the information stored in a file is accessed. The sequential access method uses rewind, read next, and write next operations; editors and compilers use files in this manner, and reads and writes make up the bulk of the operations on a file. Another method is direct access: the file is made up of fixed-length logical records that allow programs to read and write in no particular order. This method supports operations such as read n and write n, where n is the relative block number.
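The contrast between the two access methods is easy to see with ordinary file I/O. A rough Python sketch, using a made-up records.bin of fixed-length records (the record size is just an example value):

RECORD_SIZE = 64  # bytes per fixed-length logical record (example value)

# Create a small file of ten fixed-length records for the demonstration.
with open("records.bin", "wb") as f:
    for i in range(10):
        f.write(bytes([i]) * RECORD_SIZE)

# Sequential access: read next, read next, ..., rewind.
with open("records.bin", "rb") as f:
    first = f.read(RECORD_SIZE)    # read next
    second = f.read(RECORD_SIZE)   # read next
    f.seek(0)                      # rewind

# Direct access: "read n", where n is the relative block number.
def read_block(f, n, size=RECORD_SIZE):
    f.seek(n * size)               # jump straight to block n
    return f.read(size)

with open("records.bin", "rb") as f:
    block7 = read_block(f, 7)      # no need to read blocks 0..6 first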
File systems also have to keep track of their layout, referred to as the storage structure. Directories are a useful way of organizing files, and they support operations such as searching for a file, creating a file, deleting a file, listing a directory, and renaming a file. Traversing the file system to list all directories and files is another common operation. The layout of files and directories may also be saved to a backup copy.
A file system usually has to be mounted before it can be used. The mount point defines the location within the file structure where the file system is to be attached. The file system is then checked to verify that it is valid.
A file system can also be remote, as in the case of a distributed file system where remote file systems are visible from the local machine.
 

Friday, May 10, 2013

Today I had to do some debugging, and it reminded me of a book I had read earlier, so I'm writing a review of it here. The book is titled "Debugging: The 9 Indispensable Rules for Finding Even the Most Elusive Software and Hardware Problems" by David J. Agans.
This book advocates a set of rules that will make your debugging more methodical, save you hours of arduous labor, help you avoid wild goose chases, and refine your analytical skills. Playing by these rules may not guarantee success in every case, but it will help you avoid the common pitfalls. The toughest challenge with debugging has always been that every issue (we call them bugs) is a case-by-case affair, and the worst drawback is the assumptions and generalizations that creep into the investigation. These rules hone your investigation skills, as is evident both from the real-world examples quoted in the book and from some documented investigations gone wrong.
These rules are not that hard and are listed below:
Rule 1: "Understand the system" You should read the manual, know the expected behaviors, and know the road map and your tools. Ignorance leads to frustration, and the system cannot be controlled if you don't know how it should behave.
Rule 2: "Make it fail" A big part of understanding what is wrong is finding out what it takes to make it fail. If you know the steps that repeat a symptom to the point where the failure happens deterministically and repeatedly, you have armed yourself with the knowledge of what you want to investigate. Start at the beginning: use a baseline or a control as your starting point, and then find the steps from that point on. The toughest part of this rule is that some bugs are not easy to repeat; they are often written off as timing issues, bad luck, lies and statistics, and so on. This rule encourages you to persist first in learning how to make it fail.
Rule 3: "Quit thinking and look" The very first thing we do when we get a bug is put on our thinking caps, but however sound our reasoning seems, we don't know until we have taken a look. There may be data that surfaces only because we looked. Looking can be hard, which is exactly why it deserves its own rule: systems may not be transparent, or they may be complex (we call them black boxes), but even so we can often open up logs or traces, or in some cases add instrumentation, to learn what the system is doing.
Rule 4: "Divide and Conquer" The divide-and-conquer approach narrows the search space so that we spend less time on what doesn't matter and more on what does. You can also fix the bugs you know about first and silence the noise so that you can take a hard look at the bad bug. You can even narrow down the layers vertically or the components horizontally to get to the bug.
Rule 5: "Change one thing at a time" While the rules above are about observing the system and learning from it, even coaxing more information out of it than it normally reveals, this rule is about the prudence of making one change at a time. The phrases that capture it are "use a rifle, not a shotgun" and "grab the brass bar with both hands". The recommendation is to change and compare one thing at a time.
Rule 6: "Keep an audit trail" Write down every change you make to the system, in order, along with the results. Doing this will help you correlate, rediscover covered ground, and learn from it. The phrase that captures it is "the shortest pencil is longer than the longest memory".
Rule 7: "Check the plug" This rule helps you avoid making assumptions, or better yet, question them. You should not start from square three but from square one. If the power switch is not on, you are not going to see any response. "Check the plug" is a rule that says go through your steps methodically from the beginning, and the beginning includes checking the basics.
Rule 8: "Get a fresh view" As in the game of chess, sometimes we pigeonhole ourselves into the very thing we want to solve, and sometimes we are too shy to get help or ask an expert. We don't have to be sure of the cause; we can simply report the symptoms and get a fresh view.
Rule 9: "If you didn't fix it, it ain't fixed" This rule says that if you thought you made a fix and the symptoms stopped, you cannot rest until you know it is really fixed and that it was your fix that fixed it. Ignoring this rule means pretending that the problem has gone away; it never just goes away by itself. Fix the cause and the process, but be doubly sure that it was your fix that fixed it.

Inverted file, TF-IDF and Shannon information associated with terms

An inverted file is a data structure that is useful for retrieving all documents that contain the terms in a query. For each word, an ordered list of the identifiers of the documents containing that word is maintained. All the words are organized in a second index structure, such as a B+ tree or hash index, referred to as the vocabulary index. For a single query term, the vocabulary index is looked up to find the inverted list, and the rids in that list are mapped to the documents containing the term. For a conjunction of several terms, the process is repeated one word at a time and the resulting lists are intersected. For a query with a disjunction of terms, the inverted lists are merged.
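A toy illustration in Python, with an ordinary dictionary standing in for the vocabulary index (a real system would use a B+ tree or hash index and map rids to documents on disk):

from collections import defaultdict

documents = {
    1: "the quick brown fox",
    2: "the lazy dog",
    3: "quick dog",
}

# Build the inverted file: word -> ordered list of document identifiers.
inverted = defaultdict(list)
for doc_id, text in sorted(documents.items()):
    for word in set(text.split()):
        inverted[word].append(doc_id)

def lookup_conjunction(words):
    # Documents containing every query term: intersect the inverted lists.
    result = None
    for w in words:
        ids = set(inverted.get(w, []))
        result = ids if result is None else result & ids
    return sorted(result or [])

def lookup_disjunction(words):
    # Documents containing any query term: merge the inverted lists.
    result = set()
    for w in words:
        result |= set(inverted.get(w, []))
    return sorted(result)

print(lookup_conjunction(["quick", "dog"]))  # [3]
print(lookup_disjunction(["quick", "dog"]))  # [1, 2, 3]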
Term frequency-inverse document frequency (Salton & Yang 1973) gives a weight for a given term relative to other terms and can be useful for such things as keyword extraction. TF-IDF is defined as N(w) log(D / D(w)), where N(w) is the number of times word w occurs in the document, D(w) is the number of documents containing word w in the training data, and D is the total number of documents in the training data.
Shannon information similarly gives a code for encoding the words in a set W; we can use this code to, say, compress or transmit the document efficiently. Then l(w) = -log P(w) gives the code length required for a given word, and we can use this length to compute the Shannon information as -N(w) log P(w), where N(w) is the number of times word w occurs in the text. The Hang and Yamanishi paper further compares Shannon information to TF-IDF by estimating P(w) with its maximum-likelihood estimate, which gives -N(w) log(F(w) / F), where F(w) is the frequency of word w in the training data and F is the total frequency of all words in the training data. Note that the Shannon information now looks similar to TF-IDF and improves on the weights associated with words for keyword extraction.
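Both weights are easy to compute side by side; here is a small Python sketch over a made-up three-document training set (the data and the printed numbers are purely illustrative):

import math

docs = [
    ["data", "base", "index", "index"],
    ["data", "query", "index"],
    ["query", "plan"],
]

def tfidf(word, doc, docs):
    # N(w) * log(D / D(w)) for word w within one document.
    n_w = doc.count(word)                        # N(w): occurrences in this document
    d = len(docs)                                # D: total number of documents
    d_w = sum(1 for d_ in docs if word in d_)    # D(w): documents containing w
    return n_w * math.log(d / d_w)

def shannon_info(word, doc, docs):
    # -N(w) * log(F(w) / F), the maximum-likelihood variant described above.
    n_w = doc.count(word)
    f_w = sum(d_.count(word) for d_ in docs)     # F(w): total occurrences of w
    f = sum(len(d_) for d_ in docs)              # F: total occurrences of all words
    return -n_w * math.log(f_w / f)

print(tfidf("index", docs[0], docs))         # ~0.81
print(shannon_info("index", docs[0], docs))  # ~2.20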

Thursday, May 9, 2013

Indexing for text search

A text search database is a collection of documents and is referred to as a text database. Let us think of this database as a single relation schema with one field of type document. Each record in this relation contains exactly one document.
Now, let us look at a class of queries based on keyword search that lets us ask for all documents containing a given keyword. This is the most common kind of query on the web today. A more complex query is: find all documents that have keyword 1 or keyword 2 and not keyword 3. For such queries we may also want to look for synonyms, such as automobile instead of car, and we can rank the retrieved documents by the proximity of the query keywords in the document.
There are two types of such queries for text databases: Boolean queries and ranked queries. In a Boolean query, the user supplies a Boolean expression like the one mentioned above, written in what is called conjunctive normal form. Such a query consists of j conjuncts, each of which consists of several disjuncts; each conjunct corresponds to one concept, and the different words within a conjunct correspond to different terms for the same concept. In a ranked query, the user also specifies a list of words, but the result of the query is a list of documents ranked by their relevance to the user's terms. Determining when and how a document is relevant to a set of user terms is a different problem; the algorithms for that problem belong to information retrieval, which is closely related to database management. Information retrieval systems, unlike database systems, don't address updates, concurrency control, and recovery because the data is largely static. The criteria used to evaluate such retrieval systems include precision and recall, defined respectively as the percentage of retrieved documents that are relevant and the percentage of relevant documents that are retrieved in response to a query. For Boolean queries, there are two index schemas - the inverted file index and the signature file index. The first is widely used due to its simplicity and good performance, but it has a large space overhead. The second has a small space overhead and enables a quick search that eliminates non-matching documents. Words are stemmed and analyzed for a common root during index creation. - as read from a textbook on database management systems.
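Precision and recall reduce to simple set computations once the relevant documents are known; a quick Python sketch with made-up document ids:

def precision_recall(retrieved, relevant):
    # Precision: fraction of retrieved documents that are relevant.
    # Recall: fraction of relevant documents that were retrieved.
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# A query returned documents 1, 2, 5; the truly relevant set is 2, 3, 5, 7.
print(precision_recall([1, 2, 5], [2, 3, 5, 7]))  # (0.666..., 0.5)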

Signature file

A signature file is a data structure for text database systems that supports efficient evaluation of Boolean queries. It contains an index record for each document in the database; this index record is the signature, and it has a fixed size of b bits, where b is called the signature width. The bits that are set depend on the words that appear in the document. Words are mapped to bits by a hash function, and the bits indicated by the hash results are set. The same bit can be set by two different words if the hash function maps both words to the same bit. One signature matches another if it has at least the same bits set as the other. For a query consisting of a conjunction of terms, we first generate the query signature by applying the hash function to each word in the query. Then we scan the signature file to get all documents whose signatures match the query signature, because every such document is a potential result of the query. Note that a matching document may not actually contain the query words, in which case it is called a false positive. False positives can be expensive, since we have to retrieve the document, parse it, stem it, and then check whether it contains the query terms. For queries with a disjunction of terms, we generate a list of query signatures and accept a document if its signature matches any of them. For each query we have to scan the complete signature file, and there are as many records in the signature file as there are documents. To reduce the amount of data retrieved per query, we can vertically partition the signature file into bit slices; such an index is called a bit-sliced signature file. The length of each bit slice is still equal to the number of documents in the database, but for a query with q bits set, we only need to retrieve q bit slices. - as read from a textbook on database management systems
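Here is a minimal Python sketch of the idea; the signature width of 64 bits and the md5-based hash are arbitrary choices for illustration:

import hashlib

B = 64  # signature width in bits (example value)

def signature(words, b=B):
    # Hash each word to a bit position and OR the bits together.
    sig = 0
    for w in words:
        bit = int(hashlib.md5(w.encode()).hexdigest(), 16) % b
        sig |= 1 << bit
    return sig

def matches(doc_sig, query_sig):
    # A document signature matches if every bit of the query signature is set in it.
    return doc_sig & query_sig == query_sig

docs = {
    1: "signature files support boolean queries",
    2: "inverted files are widely used",
}
doc_sigs = {doc_id: signature(text.split()) for doc_id, text in docs.items()}

query_sig = signature(["boolean", "queries"])
candidates = [d for d, s in doc_sigs.items() if matches(s, query_sig)]
print(candidates)  # potential results; each must still be checked for false positives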

continuing on the previous post

One more concept in the visual rendering of objects is that the objects should redraw themselves on any user input, such as dragging, adding, or deleting them.

Wednesday, May 8, 2013

If you have used tools such as code map, you may have liked the display of the organization of items as objects that float around to new locations as new ones are added and old ones are removed. While the default layout is such that each object is evenly spaced and clustered around the center of the canvas, you can drag any object to a new position, and the connectors between the moved object and the others are automatically redrawn, avoiding any objects that may lie in the path between them. This post talks about how to implement the logic that makes these objects float around into an appealing visual display.

First, each item, as it is added to the canvas, finds its starting position. This is typically the center of the canvas for the first item. The next item is placed such that the centers of both objects are equidistant from the center. As more objects are added, they occupy the vertices of the corresponding regular polygon. This works when the objects are all equal, and the items appear uniformly spread around the canvas. Sometimes, however, objects are represented as a hierarchy with multiple levels, each level appearing as a line of items; in that case, the objects occupy positions on the line such that their centers are equidistant. If there were a graph of objects, then the node with the maximum number of edges could occupy the center of the canvas, and the other nodes could be arranged around it in concentric circles such that the ones with more edges are closer. No matter how we choose to position the centers, the objects themselves can be visualized as squares or rectangles so that some space is left around them when they are rendered.

Now we can talk about the connectors, which run between the midpoints of the edges of the squares or rectangles bounding the objects. These connectors are drawn as straight lines when there are no obstructions and as curves when they have to avoid other objects in their path. The curves are drawn with no particular preference for the left or right of the obstruction and are chosen so that the distance covered is minimal, which is easy to find given the x and y co-ordinates of the start and end of the connector. When an object is moved, all the connectors to that object are redrawn based on the new destination; this redrawing does not affect the other objects. The curvature of a connector depends on the size of the obstruction. Both the connectors and the objects make use of their centers for finding positions on the canvas. Since the strategies for spacing out the centers on the canvas are interchangeable, the code that implements them could use a design pattern such as Strategy.
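A minimal sketch of the even-spacing idea described above, placing n object centers at the vertices of a regular polygon around the canvas center; the function name, canvas size, and radius are illustrative choices:

import math

def layout_positions(n, center=(200.0, 200.0), radius=120.0):
    # Place n object centers at the vertices of a regular n-gon so that
    # they are evenly spaced and equidistant from the canvas center.
    cx, cy = center
    positions = []
    for i in range(n):
        angle = 2 * math.pi * i / n
        positions.append((cx + radius * math.cos(angle),
                          cy + radius * math.sin(angle)))
    return positions

# Five objects spread evenly around the middle of a hypothetical 400x400 canvas.
for x, y in layout_positions(5):
    print(round(x, 1), round(y, 1))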