Sunday, March 31, 2013

How does classifier-based text data mining work?

Let's say we want an implementation for detecting keywords in text using a semantic lookup and clustering-based data mining approach. The data structures and the steps of the control flow are explained below.
In general for any such implementation, the steps are as follows:
1) The raw data undergoes a data selection step based on stemming and part-of-speech tagging.
2) The data is cleaned to remove stop words.
3) In the data mining step, we extract the keywords by evaluating each word and building our tree or running a decision tree.
4) In the evaluation step, we filter and present the results.


To begin with, we view an unstructured document as a bag of words and attempt to find hierarchical clusters. We use a CF (clustering feature) tree data structure, as in the BIRCH system (Zhang et al.), to maintain a data set that we can run our algorithms and others against. It is scalable and efficient for the task at hand.
The CF tree represents hierarchical clusters of keywords that we insert into the tree one by one. For each keyword, we find the cluster it belongs to using a distance function that measures how far the keyword is from the existing cluster centers. The distance functions are interchangeable.
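As a minimal sketch of that assignment step (not the BIRCH algorithm itself), the following places each keyword vector into the nearest cluster or starts a new one; the vector representation, the threshold value and the running-mean center update are assumptions for illustration:

import math

def euclidean(a, b):
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def assign(keyword_vector, clusters, threshold=1.0, distance=euclidean):
    # clusters is a list of dicts: {'center': [...], 'members': [[...], ...]}
    best, best_dist = None, float('inf')
    for cluster in clusters:
        d = distance(keyword_vector, cluster['center'])
        if d < best_dist:
            best, best_dist = cluster, d
    if best is None or best_dist > threshold:
        # no cluster is close enough: start a new cluster around this keyword
        clusters.append({'center': list(keyword_vector), 'members': [keyword_vector]})
    else:
        best['members'].append(keyword_vector)
        # recompute the center as the mean of the member vectors
        n = len(best['members'])
        best['center'] = [sum(dim) / n for dim in zip(*best['members'])]
    return clusters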


We could tag parts of speech this way:
import nltk
from nltk.corpus import brown

# build a frequency distribution of the most common word suffixes
suffix_fdist = nltk.FreqDist()
for word in brown.words():
    word = word.lower()
    suffix_fdist[word[-1:]] += 1
    suffix_fdist[word[-2:]] += 1
    suffix_fdist[word[-3:]] += 1
common_suffixes = [suffix for (suffix, count) in suffix_fdist.most_common(100)]

def pos_features(word):
    features = {}
    for suffix in common_suffixes:
        features['endswith(%s)' % suffix] = word.lower().endswith(suffix)
    return features

tagged_words = brown.tagged_words(categories='news')
featuresets = [(pos_features(n), g) for (n, g) in tagged_words]
size = int(len(featuresets) * 0.1)
train_set, test_set = featuresets[size:], featuresets[:size]
classifier = nltk.DecisionTreeClassifier.train(train_set)
classifier.classify(pos_features('cats'))
# 'NNS'

Saturday, March 30, 2013

A review of emerging trends in search user interface

Chapter review from book: Search User Interfaces Marti Hearst 2009
Search interface technology has newer trends in the areas of mobile search, multimedia search, social search, and a hybrid of natural language and command-based queries.
Mobile search interfaces stand out because mobile is predicted to be the primary means of connecting to the internet by 2020 (Rainie, 2008). Here time and location information could be used to better anticipate a mobile user's information needs. For example, a user may want to reserve a table at a restaurant in the next half-hour within a one-mile radius. Another factor observed by Kamvar and Baluja (2006) from a study of requests to mobile search engine sites was that query length was shorter for handheld devices than for desktop searches, and the queries had fewer variations. They also observed that users did more retries and reformulations than changes of topic. Context-based search, where context can be defined as current activity, location, time and conversation, becomes relevant to mobile queries. Query entry on mobile devices is also subject to dynamic term suggestions and anticipation of common queries. Search results are also presented differently on handheld devices, for example with formatting that lists only the essential content. Formatting could preserve the layout with thumbnails while highlighting the relevant text. Information visualization and categorization also help improve usability on these devices; this is advocated for navigating pages in the FaThumb interface (Karlson et al.). Results could also be specialized for certain queries, such as showing maps when a query involves locations. Besides presenting results, some browsers are also optimized for handheld devices.
Images, video and audio are also increasingly being searched on mobile devices. While automatic image recognition is still difficult, automatic speech recognition has greatly improved, and there are techniques to input spoken queries. Image searches remain an important part of the overall searches performed on mobile devices, and the problem is harder when the text or metadata associated with the images does not describe the image content adequately. Videos are searched by segmenting the data into scenes and then shots; text associated with the shots is attributed to the corresponding index in the videos. Audio searches are improved when the associated audio can be converted, at least partially, to text.
When the different multimedia results are to be included for a user's query input, it is called blended results or universal search. In general search results are improved with keywords, text labels from anchor texts, surrounding text in web pages or human assisted tags.
Another interesting trend for search is social search. Web 2.0 is considered to be the interactive web, where people interact with one another for their online needs. Social ranking of web pages, collaborative search and human-powered question answering are all examples of user interaction for search. Social ranking draws on the "wisdom of crowds" and is generally indicated by the number of people's recommendations, which a search engine can then rank to display the results. Social tagging of websites, liking a website on social networking tools such as Facebook, and search engine features that let the user rank results are all used for ranking based on social interaction.
Multi-person and collaborative search, on the other hand, is for users doing common tasks involving collaboration, such as travel planning, grocery shopping, literature search, technical information search and fact finding. Typically a centralized location for assimilating a list and a view into participants' actions is provided for such search. The latter helps with precision and recall. Precision and recall are two terms used to measure search: precision is the fraction of the retrieved instances that are relevant, while recall is the fraction of relevant instances that are retrieved. Therefore precision considers the retrieved results and recall considers the universe of relevant results. Along the same lines, another set of metrics is freshness and relevance. Freshness is about documents not yet looked at and relevance is about documents that match the user's query. These two metrics counterbalance each other and are continuously updated.
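To make the two measures concrete, here is a minimal set-based sketch (the set formulation is an assumption for illustration; ranked retrieval adds rank cutoffs and averaging):

def precision_recall(retrieved, relevant):
    # retrieved and relevant are sets of document ids
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# 3 of the 4 returned results are relevant, out of 6 relevant documents overall
print(precision_recall({1, 2, 3, 9}, {1, 2, 3, 4, 5, 6}))   # (0.75, 0.5)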
Massive scale human question answering is another technique for social search where people enter questions and thousands of others suggest answers in near real-time. This is particularly helpful in certain languages other than English.
Lastly, semantic search is another trend where keywords are discerned by looking up related synonyms to enhance search results.


Thursday, March 28, 2013

Search functionality in UI

In a web application, domain object values are often very relevant and useful for tracking records or pulling up case information. For example, a case could be referred to by the name of an individual or by the product the individual purchased. In such cases, the relational storage schema or the business object mapping is not what matters; the value of an entity is all the user wants to provide. The search functionality then has to pull up the corresponding data from the data source or store. The values provided for search are simple strings, and it can be assumed that the corresponding entity attributes store such values as strings. Therefore, the lookup of these values could be implemented with stored procedures that use full-text search.

Wednesday, March 27, 2013

Book Review: Key Words and Corpus Analysis in Language Education


Title: Textual Patterns: Key Words and Corpus Analysis in Language Education
Author: Mike Scott & Christopher Tribble (2006)
Summary of chapters 4 and 5 relevant to keyword extraction.
In chapter 4 of the book, the authors introduce a method for identifying keywords in a text. They propose that a keyword list yields two main kinds of output: aboutness and style.
The authors describe keyness as a quality words may have in a given text or set of texts; it is textual. They cite an example of a sentence that can be broken into a dozen trivia, each of which has some keyness. What we select out of these can be based on a threshold, which is the third factor in this analysis. Once we have a list of words and their frequencies, we can compare them to those from a known reference corpus to eliminate usual and frequently occurring words, so that only the outstanding ones remain. The contrast between the reference and the target text is handled in a statistical way: statistical tests of probability compare a given finding to what can be expected and express the difference as a quantitative number.
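As an illustration of that statistical contrast, here is a minimal sketch of a frequency-based keyness score; the log-likelihood form is a common choice in corpus work and is an assumption here, not the book's exact procedure:

import math

def keyness(word, target_counts, reference_counts):
    # Log-likelihood (G2) keyness of a word in a target corpus versus a reference corpus.
    # target_counts and reference_counts are dicts of word -> frequency.
    a = target_counts.get(word, 0)            # frequency in target
    b = reference_counts.get(word, 0)         # frequency in reference
    c = sum(target_counts.values())           # target corpus size
    d = sum(reference_counts.values())        # reference corpus size
    e1 = c * (a + b) / (c + d)                # expected frequency in target
    e2 = d * (a + b) / (c + d)                # expected frequency in reference
    g2 = 0.0
    if a > 0:
        g2 += 2 * a * math.log(a / e1)
    if b > 0:
        g2 += 2 * b * math.log(b / e2)
    return g2                                 # higher means more "key" in the target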
But statistically significant contrast words need not reflect importance and aboutness; they can be considered to represent "style". Aboutness can be gathered from part-of-speech analysis. Usually there is a target that references a "genid0" that ties it to the various parts of speech.
This often depends on context which can be temporal and spatial. Context also has many levels such as collocations, same sentence, same paragraph, same section or chapter, same story or the whole text, a set of texts or the context of culture.
Therefore there are several issues with detecting style and aboutness. First is the issue of selecting a text section versus a text versus a corpus versus a sub-corpus. Second is the statistical issue of what can be claimed. The third issue is how to choose a reference corpus. The fourth issue is handling related forms such as antonyms. The fifth issue is the status of the keywords one may identify and what is to be done with them.
Choosing a reference corpus can be tackled with a moderate-sized, mixed-bag corpus. The keyword procedure using this reference corpus, as mentioned earlier, is fairly robust, because even an unrelated reference corpus can detect words that indicate aboutness. And aboutness may not be one thing but a combination of several different ones.
Related forms can be handled with a tool such as WordSmith by using stemming and ignoring semantic relations.
Status of a keyword can be determined with the help of context or purpose, since it is not evident from the text itself. As an example, "banana" and "weapons of mass destruction" may not have status without context. But status can be a pointer to a specific textual aboutness and/or style, or even a pattern. Or the keywords may have been statistically arrived at but not yet established. Thus keyword candidacy can be determined.
Keyword clusters or key phrases are also important for describing the aboutness and style of a given text or set of texts. When such groups of co-occurring words are selected, how many are positively or negatively key, and are there any patterns in these two types? These are further lines of research.
Thus the authors conclude that keyness is a pointer to the importance which can be sub-textual, textual or intertextual.

Tuesday, March 26, 2013

Java Continued part 3

Java server-side code is implemented with code-behind, JSP pages, server-side scripts, etc. It's usually a best practice to keep code out of JSP pages because syntax errors in Java code in a JSP page are not detected until the page is deployed, as opposed to tag libraries and servlets. JSP pages may be maintained by page authors who are not Java experts, and mixed HTML markup and Java in JSP can be hard to read. Besides, JSP pages are primarily intended for presentation logic.
Servers are also implemented with Struts framework. Do not mix DynaForm beans with Struts. Use scaffolding to your advantage.

Turney's paper on learning algorithms for keyphrase extraction

Turney's paper on keyphrase extraction uses the C4.5 decision tree induction algorithm (Quinlan, 1993) and proposes GenEx, a hybrid genetic algorithm for keyphrase extraction. C4.5 is a general-purpose algorithm, and the paper uses it in a few different ways. Their proposal GenEx, however, is tailor-made for learning to extract key phrases. It has two main components: the Genitor genetic algorithm (Whitley, 1989) and the Extractor keyphrase extraction algorithm (Turney, 1997, 1999). Extractor takes a document as input and produces a list of keyphrases as output. Twelve parameters govern how it processes the input, and Genitor tunes these parameters to maximize fitness on the training data. The twelve parameters of Extractor are as follows:
NUM_PHRASES
NUM_WORKING
FACTOR_TWO_ONE
FACTOR_THREE_ONE
MIN_LENGTH_LOW_RANK
MIN_RANK_LOW_LENGTH
FIRST_LOW_THRESH
FIRST_HIGH_THRESH
FIRST_LOW_FACTOR
FIRST_HIGH_FACTOR
STEM_LENGTH
SUPPRESS_PROPER
The Extractor algorithm works something like this (a rough sketch of the first three steps in code follows the list):
Step 1 : Find single stems
Step 2 : Score single stems - The score is the number of times the stem appears in the text, multiplied by a factor. If the stem first appears before FIRST_LOW_THRESH, multiply the frequency by FIRST_LOW_FACTOR; if it first appears after FIRST_HIGH_THRESH, multiply the frequency by FIRST_HIGH_FACTOR. Typically FIRST_LOW_FACTOR is greater than one and FIRST_HIGH_FACTOR is less than one, so stems that appear early in the document get a higher score and stems that appear late get a lower score.
Step 3 : Select top single stems : Rank the stems in the order of decreasing score and make a list of the top NUM_WORKING single stems.
Step 4 : Find Stem phrases : Make a list of all phrases ( unigrams, bigrams, trigrams) that repeat without gap or punctuation. Stem each phrase by truncating each word in the phrase at STEM_LENGTH characters
Step 5 : Score stem phrases : For each stem phrase, count how often the stem phrase appears in the text and note where it first appears. Assign a score to each phrase just as in step 2, then adjust it based on the number of words in the stem phrase: if there are two words, multiply the score by FACTOR_TWO_ONE, and if there are three words, multiply the score by FACTOR_THREE_ONE. Thus longer phrases get a boosted score so they can compete with single-word stems.
Step 6 : Expand single stems : For each stem in the list of the top NUM_WORKING single stems, find the highest scoring stem phrase of one, two or three stems that contains the given stem. This builds a list of NUM_WORKING stem phrases, sorted by their scores.
Step 7 : Drop duplicates: For example, when two single stems expand to the same two word stem phrase, they are duplicates. Preserve the ranking among the remaining stems.
Step 8 : Add suffixes : For each of the remaining stem phrases, find the most frequent corresponding whole phrases
Step 9 : Add capitals : Prefer the lowest number of capitals in the phrases. This is usually the best choice unless the capitalization is inconsistent, where, say, one of the words is a proper noun.
Step 10 : Final output : We now have an ordered list of mixed-case (upper and lower case, if appropriate) phrases with suffixes added. The list is ordered by the scores calculated in step 2.
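Here is the rough sketch of steps 1 through 3 promised above. The truncation-based stemming mirrors the paper's STEM_LENGTH idea, but the parameter values are illustrative defaults, not Turney's tuned settings:

def score_single_stems(words, stem_length=5, num_working=50,
                       first_low_thresh=40, first_high_thresh=400,
                       first_low_factor=2.0, first_high_factor=0.5):
    # Step 1: stem by truncation and record frequency and first position.
    freq, first_pos = {}, {}
    for position, word in enumerate(words):
        stem = word.lower()[:stem_length]          # crude stemming by truncation
        freq[stem] = freq.get(stem, 0) + 1
        first_pos.setdefault(stem, position)
    # Step 2: score by frequency adjusted for the position of first appearance.
    scores = {}
    for stem, count in freq.items():
        factor = 1.0
        if first_pos[stem] < first_low_thresh:      # appears early: boost
            factor = first_low_factor
        elif first_pos[stem] > first_high_thresh:   # appears late: discount
            factor = first_high_factor
        scores[stem] = count * factor
    # Step 3: keep the top NUM_WORKING stems by decreasing score.
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:num_working], scores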
The Genitor algorithm may be viewed as a method for optimizing a string of bits using techniques inspired by biological evolution. Think of a collection of bit strings as a population of individuals. New bit strings are created by randomly changing existing individuals (this operation is called mutation) and by combining substrings from parents to make new children (this operation is called crossover). Each individual is assigned a score, called its fitness, based on some measure of the quality of the bit string with respect to a given task.
GenEx : The parameters in GenEx are set using the standard machine learning paradigm of supervised learning. The algorithm is tuned with a dataset consisting of a set of documents paired with their target keyphrase lists. The document set is divided into training and testing subsets. The learning process adjusts the parameters to maximize the match between the output of Extractor and the target keyphrase lists, using the training data. The success of the learning process is measured by examining the match on the testing data.

Monday, March 25, 2013

Java continued ...

Java has excellent support for client-server programming. In general, it has good support for networking and particularly for client-server application development. Java has built-in support for applets and rich support for the document model. Applets enable many of the common UI operations we perform on Windows. Microsoft has runtime support for Java called Microsoft J++.

Sunday, March 24, 2013

Java

I'm trying to take a java test today so here is a quick overview:
 Java features: simple, object-oriented, robust, multi-threaded, architecture-neutral, interpreted and high-performance, distributed and dynamic.
Code is usually written in one of two ways: procedural, a set of instructions that defines "what is happening," or object-oriented, where code is organized around data to describe "who is being affected."

Common data types include byte, int, double, etc. You can use the instanceof operator to test an object's type.
String operations : startsWith, equalsIgnoreCase, regionMatches, reverse (on StringBuffer), etc.

Common data structures are Vector, Dictionary, Hashtable, StringTokenizer, BitSet, and Observable.
use the static and this keywords for class methods liberally, e.g.:
class Box
{
    public static void main(String args[]) { }
}

use implements and extends judiciously
class Deadlock implements Runnable 
class Box extends Shape

 use Thread, ThreadGroup and Runnable
 use Cloneable interface
use the Observable class and the Observer interface. Observable calls notifyObservers() and Observer declares the update(Observable, Object) method.

Friday, March 22, 2013

Expression Tree and WPF

DependencyProperty in WPF represents a property that can be set through mechanisms such as styling, data binding, animation and inheritance. It also reports information such as whether a property value can be coerced. Further, it can be edited in the Visual Studio designer just like other properties.
QueryControls can make use of DependencyProperty when representing the expression tree on the group canvas. This is very helpful as clauses are added, deleted, modified, or grouped/ungrouped. Expressions are linear textual representations; they are adapted with ExpressionAdapters to form query trees. LINQ provides the functionality for expression evaluation.

UI automation continued

There are several diagnostics that can help with UI automation. For example, the Coded UI WPF control classes have a DrawHighlight() method that makes a control appear highlighted on screen. Another option is the spy tool that comes with the Visual Studio Coded UI test project template, which lets you identify the properties with which to search for a control. Controls can be searched for by first explicitly setting the property to search on, such as a class name, name or display text, and then calling the FindMatchingControls method, which recursively searches the document model. Logging can also be turned on for WPF, which helps further with the asynchronous calls made through the UI.

Thursday, March 21, 2013

User Interface automation

When automating UI test cases, there are several challenges that cost time and effort. First, the controls on the winforms may not be named; to find a control, one may have to walk the document model. Second, the controls may not be at the same level in the tree as they visually appear in the UI. One may have to use a tool to detect the level and the location of the tree node, and sometimes the tools don't give the correct information. Further, the properties or attributes used for identifying the control may require a recursive search over the control hierarchy in the document. Third, the controls may not all be consistent across UI panes, and their access and usage vary even when they are the same controls; this causes additional code to be written for each such case. Fourth, there are many different ways to implement the same functionality involving different controls or generally different workflows, and the automation code has to deal with all of these cases. Fifth, the controls may not always be out of the box. They may provide advanced features or require additional test cases, and custom controls have their own test requirements. All of these add to the code variations. Fortunately, they can be organized and handled as much as possible.

Wednesday, March 20, 2013

Test harness and drivers

Frequently, we develop modules for automated test execution but don't integrate all of them. We do this mostly because we are short on time or resources. A test framework that's useful for manual tests can also be useful for automated tests; just the road-map of the integration needs to be worked out. This is where a technical specification and planning tools help, because they list all the dependencies, the layers, the object model and mappings, the translations and operations, the stages of development, the costs for development and test, the trade-offs, and the deliverables from sprint planning. With a knowledge of the means and methods to integrate test tools and frameworks captured in written form, resources can pick them up or execute independently of each other. The benefits of integration are mutual to the individual software products being integrated and add overall value to existing products. I want to be able to do this.

Tuesday, March 19, 2013

automated cars

Why are automated cars better than manual ?
There are several reasons why automated cars are better than manual.
First, the driver has fewer things to remember and practice for the ride. There are fewer chances of driver manoeuvre errors, and hence it is safer to drive. Driver weariness and machine faults are significant contributors to accidents. If the transmission is automated, it removes one factor in the cause of accidents and improves safety when driving.
Second, the bulk of city driving is in stop-and-go traffic, and this is better handled with an automatic transmission. Stop-and-go traffic requires the gears to be shifted through neutral before going to higher gears, and this adds constant wear and tear. Besides, the driving is smoother when the gear shifts are not noticeable.
Third, gear shifting is smoother in automated cars than in manual ones, so there are fewer chances of improper use or damage. Even in the case of emergency responses, the driver may occasionally wear out a manual transmission since it takes more time to respond. Experimental studies have shown that the first 1.8 seconds after an impending collision is detected are crucial in avoiding a fatal accident.
Fourth, the car has lower associated costs in terms of insurance and maintenance. Insurance companies favor the norm, and automatic transmissions have become popular in sales. Car prices are affected when the transmission is automatic, and so is the resale value of the car.
Fifth, the control over torque that comes with different gear shifts is not necessary for most city driving in the United States. This control is great for, say, racing or driving on rugged terrain, but it can usually be done without in city driving.

Monday, March 18, 2013

.net source for standard query operators

The standard query operators implementation is available for download from .NET. For a collection of query clauses, a grouping can be made; these are treated as subtrees in an expression tree. Each item of the source is run through the decision tree to decide whether it is included in the result. The choice to include an item depends on the success of a predicate and the logical operation with the current evaluation.

expression tree

An expression tree is an efficient data representation of a query lambda expression. In LINQ to SQL, the expression tree gets converted to SQL to ensure the filtering and ordering are performed on the server. For lambda expressions, either IL code or an expression tree can be generated: if an operator accepts a method delegate, IL code is emitted; if an operator accepts an expression of a method delegate, an expression tree is produced. LINQ to Objects generates IL code because its operators take generic delegates, whereas the LINQ to SQL implementation takes an expression tree that gets converted to SQL to be executed by SQL Server.

Sunday, March 17, 2013

Longest common subsequence

When there is a recursive algorithm based on the subproblems but the total number of subproblems is not too great, it is possible to cache all solutions in a table. In the case of a longest common subsequence detection, running through all the subsequences would take exponential time. Instead we can define a recursion based on the following :
Let c[i, j] be the length of the longest common subsequence of the prefixes X[1..i] and Y[1..j]. Recursion:
c[i, j] = 0                              if i = 0 or j = 0
c[i, j] = c[i-1, j-1] + 1                if i, j > 0 and x_i = y_j
c[i, j] = max( c[i, j-1], c[i-1, j] )    otherwise
This way we reuse the solutions we have already computed. We compute the table c[i, j] bottom-up and also store a value b[i, j] that records whether c[i, j-1], c[i-1, j] or c[i-1, j-1] + 1 was optimal. We can then recover the longest common subsequence by walking backwards along the path from the last cell of b to the first.
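A minimal bottom-up implementation of this recurrence, with the b table used to walk backwards (the tie-breaking between equal neighbors is arbitrary):

def longest_common_subsequence(x, y):
    # c[i][j] holds LCS lengths; b[i][j] records which case was optimal.
    m, n = len(x), len(y)
    c = [[0] * (n + 1) for _ in range(m + 1)]
    b = [[None] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                c[i][j] = c[i - 1][j - 1] + 1
                b[i][j] = 'diag'
            elif c[i - 1][j] >= c[i][j - 1]:
                c[i][j] = c[i - 1][j]
                b[i][j] = 'up'
            else:
                c[i][j] = c[i][j - 1]
                b[i][j] = 'left'
    # walk backwards from the last cell to recover the subsequence
    i, j, result = m, n, []
    while i > 0 and j > 0:
        if b[i][j] == 'diag':
            result.append(x[i - 1])
            i, j = i - 1, j - 1
        elif b[i][j] == 'up':
            i -= 1
        else:
            j -= 1
    return ''.join(reversed(result))

print(longest_common_subsequence("ABCBDAB", "BDCABA"))   # an LCS of length 4, e.g. 'BCBA'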

Saturday, March 16, 2013

decision trees


From the workings of comparison-based sorting algorithms, we can derive a decision tree. A decision tree can be visualized as nodes for each comparison and left and right subtrees for the yes or no outcomes. Each execution of the algorithm gives a downward path in the tree. Worst-case time is lower-bounded by the longest downward path (the height) in this binary tree. A tree of height h has at most 2^h leaves, so if a tree has L leaves, its height is at least log2 L.
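As a quick worked check of that bound: a comparison sort of n items must distinguish all n! orderings, so its decision tree needs at least n! leaves and height at least log2(n!). A small calculation (n = 10 is just an example):

import math

# Height needed by any comparison-sort decision tree on n items:
# it must have at least n! leaves, hence height >= log2(n!).
n = 10
leaves_needed = math.factorial(n)
min_height = math.ceil(math.log2(leaves_needed))
print(n, leaves_needed, min_height)   # 10 3628800 22, i.e. at least 22 comparisons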

Greedy algorithms have no general recipe. In general, a greedy algorithm assumes that there is an objective function f(x1, x2, ..., xn) to optimize (say, maximize) that depends on some choices. Second, there is a way to estimate the contribution of each choice to the final value, without taking into account how our choices will constrain later choices. Finally, the algorithm is greedy if it always makes the choice with the best estimated contribution. A greedy algorithm is frequently not the best choice, but sometimes it is.

There are algorithm problems for which nothing is better than trying out all possibilities. This is especially practical when the number of possibilities is bounded. For example, we can use exhaustive search to find the elements of a maximum independent set in a graph. An independent set is one in which no two vertices have a connecting edge, and a maximum independent set is a largest such set. The independent set problem is NP-complete.
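A minimal sketch of such an exhaustive search for a maximum independent set (fine only for small graphs, since it is exponential):

from itertools import combinations

def maximum_independent_set(vertices, edges):
    # Try all subsets, largest first, and return the first with no internal edge.
    edge_set = {frozenset(e) for e in edges}
    for size in range(len(vertices), 0, -1):
        for subset in combinations(vertices, size):
            if all(frozenset(pair) not in edge_set
                   for pair in combinations(subset, 2)):
                return set(subset)
    return set()

# a 4-cycle: the best we can do is two opposite vertices
print(maximum_independent_set([1, 2, 3, 4], [(1, 2), (2, 3), (3, 4), (4, 1)]))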

Courtesy: study material uploaded by Peter Gacs, Boston University.

Friday, March 15, 2013

How do you design a text mining system ?


We begin by encapsulating the core concept of a text mining system, which is a concordance. Next we support a management layer over the concordance. We streamline the insertions, deletions and updates. We encapsulate every parameter the system depends on. We build supportability and stack trace capture for diagnosis. We write unit tests side by side with the implementation. Sounds exciting.
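A minimal sketch of what the concordance at the core could look like; the tokenization and the word-to-positions layout are assumptions for illustration:

import re
from collections import defaultdict

class Concordance:
    # Maps each word to the list of positions where it occurs, so that
    # insertions and lookups stay simple for the layers built on top of it.
    def __init__(self):
        self.postings = defaultdict(list)
        self.size = 0

    def add_document(self, text):
        for match in re.finditer(r"\w+", text.lower()):
            self.postings[match.group()].append(self.size)
            self.size += 1

    def occurrences(self, word):
        return self.postings.get(word.lower(), [])

c = Concordance()
c.add_document("the quick brown fox jumps over the lazy dog")
print(c.occurrences("the"))   # [0, 6]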

Thursday, March 14, 2013

Red black trees revisited

Among the data structures in a computer science textbook, red-black trees possibly capture the imagination most, with their colorful transformations during rotations and recolorings of subtrees.
So let's revisit this data structure and the insert and delete operations for these.
For a red-black tree:
1)  Every node is either red or black.
2) The root node is black.
3) The sentinel nodes are black.
4) If a node is red, then both its children are black.
5) For each node, all paths from that node down to the leaves have the same number of black nodes.

Left Rotate operation: The right child becomes the parent. The left subtree of this child becomes the right subtree of the displaced node.

Right Rotate operation: The left child becomes the parent. The right subtree of this child becomes the left subtree of the displaced node.
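A minimal sketch of the left rotation with parent pointers (the node fields and the simplified nil handling are assumptions; a full CLRS-style implementation uses a sentinel, and right rotation is symmetric):

class Node:
    def __init__(self, key):
        self.key = key
        self.left = self.right = self.parent = None
        self.color = 'red'

class Tree:
    def __init__(self, root=None):
        self.root = root

def left_rotate(tree, x):
    y = x.right                       # y, the right child, will become the parent
    x.right = y.left                  # y's left subtree moves under x
    if y.left is not None:
        y.left.parent = x
    y.parent = x.parent               # y takes x's place under x's parent
    if x.parent is None:
        tree.root = y
    elif x is x.parent.left:
        x.parent.left = y
    else:
        x.parent.right = y
    y.left = x                        # x becomes y's left child
    x.parent = y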

Insert operation : goes very much like a binary search tree insert followed by an RB-tree fixup. To insert a node z in a tree T, we walk down the tree with x as the current node and y as its parent. Then we append z as the left or the right child of y and color the new node z red. Then we fix up the tree. Fixup iterates as long as z's parent is red. If z's uncle is red, we recolor the nodes and move the pointer up the tree. If z and its parent are both red but z's uncle is black, then if z is the right child of z.p we perform a left rotation, else if z is the left child of z.p we perform a right rotation.

Delete operation: also goes very much like a tree delete followed by an RB-tree fixup. If z is the node to be deleted from a tree T, we maintain a node y as the node either to be removed or to be moved within the tree. We also record y's original color, since it could change. We also maintain x as the sole child of z so that we can transplant it. If z has both children, we choose y as the tree minimum of z's right subtree and x as the child of y, so that we may perform the same transplant operation on x as before, and then we transplant z. We set y's color to z's color. If the original color of y was black, we perform the RB-tree fixup for delete. We begin the fixup with x and iterate while x is not the root and x's color is black. In each iteration, we take w as x's sibling and perform rotations and recolorings to fix up the tree. If x is the left child and w is red, we color w black and do a left rotation on x's parent. If both of w's children are black, we simply color w red and move x up to its parent. Else, if only w's right child is black, we color w's left child black, right rotate w, and set w to the right child of x's parent. If x's sibling w is black and w's right child is red, we set w's color to x's parent's color, color both x's parent and w's right child black, and then left rotate on x's parent. Since we started out with x being the left child, the same cases apply when x is the right child, but with left and right exchanged.

Data Structures

The heap data structure is very useful for organizing data for constant-time retrieval of the maximum or minimum entry and logarithmic-time insertion and removal. The max-heapify method works like this: take the left child, the right child and the root and find the maximum among the three. If the largest is not the root, swap it with the root and recurse down into the subtree of the swapped node.

Insertion is also logarithmic: if you insert the element at the end, it can be floated up.
To build a heap from an array, heapify the subtree at each index from n/2 down to 1.
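A minimal sketch of max-heapify and build-heap on a 0-indexed Python list (an adaptation of the textbook's 1-based pseudocode):

def max_heapify(a, i, n):
    # Sift a[i] down within a[0:n] so the subtree rooted at i is a max-heap.
    left, right = 2 * i + 1, 2 * i + 2
    largest = i
    if left < n and a[left] > a[largest]:
        largest = left
    if right < n and a[right] > a[largest]:
        largest = right
    if largest != i:
        a[i], a[largest] = a[largest], a[i]
        max_heapify(a, largest, n)

def build_max_heap(a):
    # Heapify every internal node, bottom-up.
    for i in range(len(a) // 2 - 1, -1, -1):
        max_heapify(a, i, len(a))

data = [4, 1, 3, 2, 16, 9, 10, 14, 8, 7]
build_max_heap(data)
print(data[0])   # 16, the maximum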

Wednesday, March 13, 2013

web API and mobile application clients

In my experience with working on a retail company's API for launching their mobile applications on Android and iOS devices, there were a few items worth recalling:
1) The APIs were external facing and hence they required authentication and authorization over the internet.
2) The APIs could be signed on to from mobile applications as well as desktop applications.
3) The APIs supported both encrypted as well as unencrypted access over Internet.
4) The APIs were versioned and had a versioning policy.
5) The APIs were categorized based on services they offered. They exposed resources for these services.
6) The APIs had a cache for improving performance. The cache used URI hashing as well as periodic refreshes from data sources.
7) The APIs were published through a web proxy that provided monitoring services.
8) The APIs were called by clients with different APIkeys.
9) The APIs were implemented to access data from different data sources.
10) The APIs had SLAs defined and honored. These covered timeout and availability.
11) The APIs were performance tested with different workloads.
12) The API URIs were qualified with the resource and query parameters necessary for shopping.

Tuesday, March 12, 2013

keyword extraction using naive Bayes

Keywords could be considered local to the document they appear in. Consequently, keywords are characterized not only by their term frequency but also by their appearance in a given document as opposed to others. This has been utilized in papers such as Yasin et al. on keyword extraction using naive Bayes to identify whether a word belongs to the class of ordinary words or keywords. The metric is called TFxIDF, which combines term frequency and inverse document frequency: TFxIDF(W, D) = P(word in D is W) x [ -log P(W appears in a document) ]. Assuming feature values are independent, a naive Bayes classifier has been proposed in this paper with the following model:
P(key | T, D, PT, PS) = P(T|key) x P(D|key) x P(PT|key) x P(PS|key) x P(key) / P(T, D, PT, PS), where P(key) denotes the prior probability that the word is a key, P(T|key) denotes the probability of having TFxIDF score T given that the word is a key, P(D|key) denotes the probability of having distance D to the previous occurrence of the same word given that it is a key, and P(PT|key) and P(PS|key) denote the corresponding probabilities for the features PT and PS given that the word is a key.
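A minimal sketch of the TFxIDF part of that model (the add-one smoothing is an assumption so that a word absent from the corpus does not yield an infinite score):

import math

def tf_idf(word, document, corpus):
    # TFxIDF(W, D) = P(word in D is W) * -log P(W appears in a document).
    # document is a list of tokens; corpus is a list of such documents.
    tf = document.count(word) / len(document)
    docs_with_word = sum(1 for d in corpus if word in d)
    idf = -math.log((docs_with_word + 1) / (len(corpus) + 1))
    return tf * idf

doc = "the cat sat on the mat".split()
corpus = [doc, "the dog ate my homework".split(), "a cat and a dog".split()]
print(tf_idf("cat", doc, corpus))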

Saturday, March 9, 2013

conceptual schema versus physical schema

The conceptual schema, sometimes called the logical schema, is the data model in the DBMS. It describes all the relations that are stored in the database. These relations could contain information in the form of entities and relationships. For example, entities such as employees and departments could have relationships such as employees working for a department. Each collection of entities and each collection of relationships can be described as a relation, leading to the conceptual schema. The process of arriving at a choice of relations and of fields for each relation is called conceptual database design.

The physical schema specifies additional storage details; it describes how the conceptual schema is stored on disk. The file organizations used to store the relations and auxiliary data structures such as indexes that speed up data retrieval operations are all part of the physical schema.

Friday, March 8, 2013

Computer Vision 2

There are other techniques to image segmentation. Some of them are enumerated below:
Region growing methods: Here seeds mark each of the objects to be segmented. The regions are iteratively grown by comparing all unallocated neighbouring pixels to the region. The difference between a pixel's intensity value and the region's mean is used as a similarity measure.
Split and merge methods: This method splits the image into four quadrants, which can then be merged if they are found to be homogeneous.
Partial differential equation based methods: PDE methods involve setting up an equation, such as a curve propagation or Lagrangian equation, that parameterizes the contours. To evolve the contour, also called a snake, derivatives are computed using finite differences. Between similar choices, the steepest gradient descent is chosen to minimize the energy, which leads to fast and efficient processing.
Level Set Methods: The level set method was proposed to track moving interfaces. It can be used to efficiently address the problem of curve/surface propagation in an implicit manner. The central idea is to represent the evolving contour using a signed function where its zero level corresponds to the actual contour. Then according to the motion equation of the contour, one can easily derive a similar flow for the implicit surface that when applied to the zero level will reflect the propagation of the contour.
Fast marching methods specify the evolution of a closed curve as a function of time T and speed F(x) in the normal direction at a point x  on the curve. The speed function is specified, and the time at which the contour crosses a point x is obtained by solving the equation.
Graph partitioning methods can also effectively be used for image segmentation. In these methods, the image is modeled as a weighted undirected graph.

Thursday, March 7, 2013

Image processing

Edge detection and segmentation is a form of image processing where the changes in intensity at the edges are used to detect lines or curves. The edges identified by edge detection are often disconnected; they are connected to form closed region boundaries, which are then segmented. The simplest method of image segmentation is the thresholding method, in which a threshold value is used to turn a gray-scale image into a binary image. Clustering methods like the K-means algorithm can also be used iteratively to select thresholds. Here the distance between each pixel and its cluster center is minimized, and the cluster center is recomputed by averaging all of the pixels in the cluster. Compression-based methods choose the optimal segmentation based on the coding length of the data: segmentation tries to find patterns in an image, and any regularity in the image can be used to compress it. For example, the contours of an image are represented as a chain code, and the smoother the boundary, the shorter the encoding. Another approach is to use histogram-based methods, which are very efficient because they require only one pass over the pixels. The peaks and valleys in the histogram can be used to detect clusters.
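A minimal sketch of threshold selection in that spirit, essentially two-cluster K-means on the gray levels (the convergence tolerance and the sample values are arbitrary):

def iterative_threshold(pixels, tolerance=0.5):
    # Repeatedly split the gray levels into two groups and move the threshold
    # to the midpoint of the two group means until it stops changing.
    t = sum(pixels) / len(pixels)            # start from the overall mean
    while True:
        low = [p for p in pixels if p <= t]
        high = [p for p in pixels if p > t]
        if not low or not high:
            return t
        new_t = (sum(low) / len(low) + sum(high) / len(high)) / 2
        if abs(new_t - t) < tolerance:
            return new_t
        t = new_t

gray = [12, 14, 15, 20, 22, 200, 210, 215, 220, 230]
t = iterative_threshold(gray)
binary = [1 if p > t else 0 for p in gray]   # threshold to a binary image
print(round(t, 1), binary)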

Wednesday, March 6, 2013

REST API Design.

This link here talks about REST API design.
http://blog.apigee.com/detail/slides_for_restful_api_design_second_edition_webinar)
eg: /owners/bob/dogs/search?q=fluffy
/services/data/v1.0/sobjects/Account
/v1/dogs/
/dogs?id=123
 

Monday, March 4, 2013

RAID levels

Disks are potential bottlenecks for system performance and storage system reliability. If the disk fails, the data is lost. A  disk array is used  to increase performance and reliability through data striping and redundancy. Instead of having a single copy of data, redundant information is maintained and carefully organized so that in the case of a disk failure, it can be used to reconstruct the contents of the failed disk. These redundant array of independent disk organizations are referred as RAID levels and each level represents a tradeoff between reliability and performance.
In Data Striping, the data is segmented into equal-size partitions that are distributed over multiple disks. The size of the partition is called the striping unit. The partitions are usually distributed using a round robin mechanism.
In redundancy, if the mean time to failure of a single disk is about a few years, it is smaller for a disk array. Hence check disks and parity schemes are involved to improve reliability. The check disk contains information that can be used to recover from the failure of any one disk in the array. This group of data disks and check disks together constitutes a reliability group.
Here is a list of the RAID levels:
Level 0: Non-redundant: A RAID level 0 system uses data striping to increase the maximum bandwidth available. No redundant information is maintained. This is usually the least expensive solution.
Level 1: Mirrored: Instead of having one copy of the data, two copies are maintained. This type of redundancy is often called mirroring. Every write of a disk block involves a write on both disks. This allows parallel reads but is usually the most expensive solution.
Level 0+1: Striping and Mirroring: Like in level 1, read requests can be scheduled to both the disk and its mirror image and bandwidth for contiguous blocks is improved from the aggregation of all the disks.
Level 2: Error-correcting codes: In RAID level 2, the striping unit is a single bit. The redundancy scheme used is the Hamming code. The number of check disks grows logarithmically with the number of data disks.
Level 3: Bit-Interleaved Parity: The redundancy scheme of RAID level 2 is better in terms of cost than RAID level 1, but it keeps more redundant information than is necessary. Instead of using several disks to store a Hamming code that identifies which disk has failed, we rely on that information from the disk controller and use a single check disk with parity information, which is the lowest overhead possible (a small parity-reconstruction sketch follows this list).
Level 4: Block-Interleaved parity. RAID level 4 has a striping unit of a disk block, instead of a single bit  as in RAID level 3.  Block-level striping has the advantage that read requests the size of a disk block can be served entirely by the disk where the requested block resides.  The write of a single block  still requires a read-modify-write cycle, but only one data disk  and the check disk are involved and the difference between the old data block and the new data block is noted.
Level 5: Block-Interleaved Distributed Parity: This level improves upon the previous level by distributing the parity blocks uniformly over all disks, instead of storing them on a single check disk. This has two advantages. First, several write requests can potentially be processed in parallel, since the bottleneck of a single check disk is removed. Second, read requests have a higher degree of parallelism. This level usually has the best performance.
Level 6: P+Q Redundancy: Recovery from the failure of a single disk is usually not sufficient in very large disk arrays. First, a second disk might fail before replacement, and second, the probability of a second disk failing is not negligible. A RAID level 6 system uses Reed-Solomon codes to be able to recover from two simultaneous disk failures.
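To illustrate the parity idea behind levels 3 through 5, here is the small sketch mentioned above: the check disk stores the byte-wise XOR of the data blocks, and XOR-ing the parity with the surviving blocks recovers a lost block (the block contents are made up for the example):

def parity_block(blocks):
    # Byte-wise XOR of the data blocks, as stored on the check disk.
    parity = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            parity[i] ^= byte
    return bytes(parity)

def reconstruct(surviving_blocks, parity):
    # XOR of the parity with the surviving blocks yields the missing block.
    return parity_block(list(surviving_blocks) + [parity])

d0, d1, d2 = b"RAID", b"DATA", b"BLKS"
p = parity_block([d0, d1, d2])
print(reconstruct([d0, d2], p))   # b'DATA', i.e. d1 recovered after its disk fails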

Saturday, March 2, 2013

Page numbers for a book index

After the index words have been chosen, they are usually displayed with their page numbers. The trouble with page numbers is that they depend on the rendering software. The page sizes could change for an electronic document, so the page numbers for the index may need to be regenerated. This change in page size could be tracked with a variety of options, mostly after the index words have been selected. We will consider the trade-offs here. First, the page number for each occurrence can be associated with the index words by a linear search for every occurrence of the word; this is probably a naive approach. Second, we already have word offsets from the start of the document for each of the words, so we can translate the offsets to page numbers if we know the last word offset of each page. We can build an integer array where the index gives the page number and the data gives the offset of the last word on that page; then for each word that we list in the index, we can also print the associated page numbers. Third, we can utilize any information the document rendering system provides, including methods and data structures for page lookup. From these approaches, it should be clear that the best source of page number data is the rendering engine, which is external to the index generator program.
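A minimal sketch of the second approach, mapping a word offset to a page number with a binary search over the last-word offsets (the page offsets below are made up):

import bisect

def page_for_offset(word_offset, last_offsets):
    # last_offsets[i] is the offset of the last word on page i+1;
    # a binary search gives the page containing word_offset.
    return bisect.bisect_left(last_offsets, word_offset) + 1

# pages end at word offsets 249, 511 and 799 in this example
last_offsets = [249, 511, 799]
print(page_for_offset(300, last_offsets))   # 2
print(page_for_offset(750, last_offsets))   # 3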

For rendering of word documents or for the retrieving the page numbers, we could use either microsoft.office.interop.Word or  aspose.words library.

To determine the renderer, the program could look up the file extension and use the appropriate library. The library can be loaded as the input file is read, or used via dependency injection. A wrapper over the libraries that exposes only the page-information method may be sufficient, or a design pattern can abstract away the implementations on a per-file-format basis.

Friday, March 1, 2013

Web service helper methods


Here are some common service provider helper methods for web services or applications:

WCFSerializationHelpers:

This uses a data contract serializer and deserializer to serialize the object and return it as string. It supports more than one format such as XML or JSON output. The object should be checked for serializability before performing the operation. This could be tested with:

public static T WcfRoundtripSerialize<T>(T instance) where T : class, new()

 RestServiceAsyncClientHelpers:

This is a helper class to make 'GET' and 'POST' calls asynchronously using HttpWebRequest.BeginGetResponse-like methods. XmlSerializer can be used to serialize and deserialize the streams. The request Method and ContentType (such as 'text/xml') should also be specified.

SOAPMessageInspector  

This is a helper class to dump the SOAP messages or for modifying them. These can be hooked to the SOAP stream by implementing SOAPExtension. Helper methods can be invoked based on the stage of the processing as discerned from the soap server message.