Monday, June 10, 2013

knowledge discovery

Knowledge extraction requires either the reuse of existing formal knowledge (identifiers or ontologies) or the generation of a schema from the source data. It is similar to NLP and ETL, but the results are represented in a machine-readable and machine-interpretable format such as RDF (Resource Description Framework). RDF is a model for describing information (metadata) about resources, such as web resources. The RDF data model uses subject-predicate-object expressions, also called triples. RDF is an abstract model and has several serialization formats. A collection of RDF statements intrinsically represents a labeled, directed multigraph. As such, an RDF-based data model can be persisted in relational stores as a table of triples or in similar layouts. The reverse, mapping relational databases to RDF, is also common because it makes relational data available to the semantic web.
The process of information extraction uses traditional methods from ETL, which transform the data into structured formats. Knowledge extraction can be categorized based on:
1) source, such as whether it is text, DB, XML, CSV etc.
2) exposition, such as an ontology file or a semantic database.
3) synchronization, such as whether the extraction runs once or multiple times and how it is synced.
4) reuse of vocabularies, such as whether the tool is able to map table columns to resources.
5) automatization, such as the degree to which the extraction is assisted or whether it is manual.
6) domain ontology requirement, such as whether a pre-existing ontology is needed.
Mapping from RDB tables / views to RDF entities proceeds by converting each column to a predicate, each column value to an object, each row key to a subject, and each row to a collection of triples with a common subject.
In more detail, the mapping involves creating an RDFS class for each table, converting all primary keys and foreign keys into IRIs, assigning a predicate IRI to each column, and assigning an rdf:type predicate to each row that links it to the RDFS class IRI corresponding to the table.
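The row-to-triples mapping above can be sketched in a few lines of Python. This is only an illustration under assumptions: the base IRI http://example.org/, the function name rows_to_triples, and the sample table are all hypothetical, and triples are plain tuples rather than objects from an RDF library.

```python
BASE = "http://example.org/"  # hypothetical base IRI, chosen for illustration


def rows_to_triples(table, key_column, rows):
    """Turn relational rows into (subject, predicate, object) tuples."""
    class_iri = f"{BASE}{table}"  # one class per table
    triples = []
    for row in rows:
        # The row key becomes the subject IRI shared by all triples of the row.
        subject = f"{BASE}{table}/{row[key_column]}"
        triples.append((subject, "rdf:type", class_iri))
        for column, value in row.items():
            if column == key_column or value is None:
                continue  # the key is already in the subject; NULLs yield no triple
            predicate = f"{BASE}{table}#{column}"  # one predicate per column
            triples.append((subject, predicate, value))
    return triples


rows = [
    {"id": 1, "name": "Ada", "dept": "R&D"},
    {"id": 2, "name": "Alan", "dept": None},
]
triples = rows_to_triples("employee", "id", rows)
for t in triples:
    print(t)
```

Skipping NULL values is a design choice made here for brevity; a real tool may handle them differently.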

machine learning

Unsupervised learning is clustering. The class labels of the training data are unknown; given a set of measurements, the data is grouped into clusters.
Supervised learning is classification: the training data has both measurements and labels indicating the class of each observation. New data is classified based on the training data.
Classification constructs a model. A model is a set of rules of the form: if this condition holds, then assign this label. The classification model is used to predict categorical class labels, and its accuracy is estimated on test data.
Prediction, by contrast, models continuous-valued functions: it computes a value, for example with a formula over the attributes of the data.
Classification tasks include induction from a training set and deduction on a test set. A learning algorithm is used to build a model. Once the model is trained, it can be used for deduction on the test set.
The speed of a classifier covers both the time it takes to construct the model and the time it takes to use it for classification.
Classification models are evaluated based on accuracy, speed, robustness, scalability, interpretability, and other such measures.
Classification algorithms are called supervised learning because a supervisor prepares a set of examples of a target concept. The task is to find a hypothesis that explains the target concept, and performance is measured by how accurately the hypothesis explains the examples.
If we define the problem domain as a set X, then classification is a function c that maps X to a result set D.
If we consider a set of <x,d> pairs with x belonging to X and d belonging to D, then this set, called the experience E, is explained by a hypothesis h, and the set of all such hypotheses is H. The goodness of a hypothesis h is the percentage of examples that it explains correctly.
Given the examples in E, supervised learning algorithms search the hypotheses H for the one that best explains the examples in E. The type of hypothesis required influences the search algorithm. The more complex the representation, the more complex the search algorithm. Search can go from general to specific or from specific to general. It is simpler to work with single dimensions and a boolean d, also known as a concept.
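As a rough illustration of searching H for the hypothesis with the best goodness, here is a minimal sketch. The threshold hypotheses, the sample experience E, and the names goodness and make_threshold are assumptions made up for this example; they are not from the lecture notes.

```python
def goodness(h, E):
    """Percentage of <x, d> examples in E that hypothesis h explains correctly."""
    correct = sum(1 for x, d in E if h(x) == d)
    return 100.0 * correct / len(E)


def make_threshold(t):
    """A one-dimensional boolean hypothesis: x belongs to the concept if x >= t."""
    return lambda x: x >= t


# Experience E: pairs of a measurement x and a boolean label d.
E = [(1, False), (2, False), (3, True), (4, True), (5, True)]

# Hypothesis space H: one threshold hypothesis per candidate value of t.
H = {t: make_threshold(t) for t in range(1, 7)}

# Exhaustive search over H for the hypothesis with the highest goodness.
best_t = max(H, key=lambda t: goodness(H[t], E))
print(best_t, goodness(H[best_t], E))
```

Restricting H to thresholds is itself a choice about which hypotheses the learner may consider; a richer representation would need a smarter search than enumeration.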
Inductive bias is the set of assumptions that, together with the training data, deductively justify the classification assigned by the learner to future instances. There can be a number of hypotheses consistent with the training data. Each learning algorithm has an inductive bias that affects the hypothesis selection. Inductive bias can be based on language syntax, heuristic-based semantics, rank-based preference, or restriction of the search space.
[Courtesy : Prof. Pier Luca Lanzi lecture notes]

Sunday, June 9, 2013

Text mining targets unstructured text, as opposed to structured text in web mining, databases in data mining, and patterns in natural language processing. Data mining and natural language processing both find patterns, and information is retrieved with queries, while text mining finds nuggets. That said, text mining overlaps with all of the above.
The processing stages in text mining are text storage, text preprocessing, text transformation (attribute generation), attribute selection, data mining (pattern discovery), and interpretation (evaluation). Text does not have to be stored in raw form only; it can also be stored as document clusters. Text is characterized by one or more of the following:
source : human input, automated input in different languages and formats
context : words and phrases create context.
ambiguity : words and sentences are disambiguated based on an ontology
noise : erroneous data, misspelt data, stop words, etc
format : normal speech, interactive chat, etc.
sparseness : also known as document density, the percentage of terms present in a typical document
Text processing involves cleanup, tokenization, part-of-speech tagging, word sense disambiguation and semantic structures. Text cleanup involves removing junk characters, binary formats, tables, figures and formulas. Tokenization is splitting the text into a set of tokens. Part-of-speech tagging associates words with parts of speech and can be grammar based or statistically based. Word sense disambiguation determines which of a word's distinct senses is used in a given sentence. Semantic processing involves either chunking, which produces syntactic constructs like noun phrases and verb phrases, or full parsing, which yields a tree. Chunking is more common.
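The cleanup and tokenization steps can be sketched with the standard library alone. This is a deliberately small illustration: the stop-word list is a tiny made-up sample, the function name preprocess is hypothetical, and tagging, disambiguation and chunking are left out.

```python
import re

# A tiny illustrative stop-word list; real lists are much longer.
STOP_WORDS = {"the", "a", "an", "is", "of", "and", "in", "to"}


def preprocess(text):
    """Clean up raw text, split it into tokens, and drop stop words."""
    # Cleanup: lowercase, then replace anything that is not a letter, digit or space.
    cleaned = re.sub(r"[^a-z0-9\s]", " ", text.lower())
    # Tokenization: split on whitespace.
    tokens = cleaned.split()
    # Noise removal: discard stop words.
    return [t for t in tokens if t not in STOP_WORDS]


tokens = preprocess("The mining of text is fun, and useful!")
print(tokens)
```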
Text transformation involves text representation and feature selection, which characterize the document. A classifier is used to automatically generate labels (attributes) from the features fed into it.
Feature selection is based on two approaches. One is to select the features before using them in a classifier, which requires a feature ranking method. The other is to select the features based on how well they work in a classifier. In the latter case the classifier is part of the feature selection method, and the process is often iterative. However, the classifier needs to be trained, and the evaluation is based on actual use, so the features are evaluated iteratively. In the former case, there are many more choices for feature selection since it is independent of the classifier. Each feature is evaluated once, lowering computational cost. The attributes generated are the labels of the classes automatically produced by the classifier on the selected features.
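The first approach, ranking features once before any classifier is involved, might look like the sketch below. The scoring rule (absolute difference in document frequency between two classes) is just one simple ranking method picked for illustration, and the function name and sample documents are made up. It assumes binary labels 0/1 with at least one document in each class.

```python
from collections import Counter


def rank_features(docs, labels):
    """Rank words by how differently often they occur in class 1 vs class 0."""
    doc_freq = {0: Counter(), 1: Counter()}
    class_size = Counter(labels)
    for tokens, label in zip(docs, labels):
        doc_freq[label].update(set(tokens))  # count each word once per document
    vocab = set(doc_freq[0]) | set(doc_freq[1])
    # Each feature is scored exactly once, independent of any classifier.
    score = {
        w: abs(doc_freq[1][w] / class_size[1] - doc_freq[0][w] / class_size[0])
        for w in vocab
    }
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(vocab, key=lambda w: (-score[w], w))


docs = [
    ["cheap", "pills", "now"],
    ["meeting", "now"],
    ["cheap", "offer"],
    ["project", "meeting"],
]
labels = [1, 0, 1, 0]
ranked = rank_features(docs, labels)
print(ranked[:2])
```

The second approach would instead train and evaluate a classifier on candidate feature subsets, which is why it costs more per iteration.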
Attribute selection is done because high dimensionality causes problems for machine learners; hence irrelevant features are removed.
After the attributes are selected, patterns can be found with data mining techniques and the results interpreted.

Review of text mining slides from CSE634 presentation by Chiwara, Al-Ayyoub, Hossain and Gupta

Fast Marching Method

In the family of boundary value problem solving methods, the fast marching method is another numerical method. It solves the Eikonal equation.

Introduction

The fast marching method was introduced by Sethian. It was originally designed for forward-only propagation of the boundary with positive values for speed, but was later extended to handle both positive and negative values for speed. This extension is called the Generalized Fast Marching Method.

Motivation

This method allows us to convert the problem of a continuously forward or backward moving front into a stationary formulation where the arrival surface gives the front at the zero level. Furthermore, using a grid with unit distance and numerical techniques, this method is fast, efficient and highly stable.

Terminology

Arrival surface: The arrival surface results from the travel time function, which gives the arrival time of the front at a given point.

Narrow band: This is the set of active data points being considered for the calculation of the next level of the front. The travel time of each of the data points is computed based on the gradient between the previous point and the current point as well as the gradient between the current point and the next point.

Initial Data

The speed at a given data point is known. Each data point can have a different speed.

Algorithm

The fast marching method solves boundary value problems that describe how a closed curve evolves as a function of time t with a speed f(x) in the normal direction at a point x on the curve. The speed is given, and the time at which the contour passes the point x is calculated from the equation. Different points may have different speeds. The region grows outward from the seed of the area. The more points considered, the better the contour. The fast marching method makes use of a stationary surface that gives the arrival information of the moving boundary. Starting from the initial curve, each iteration builds a little more of the surface upwards. It is called fast because it uses a heap-based priority queue that quickly identifies the proper data point to update and hence to build the surface. For each of the N points at any given height h, the next point is determined, always in the order given by the priority queue. The surface grows with height h in monotonically increasing order because previously evaluated grid points are not revisited. This holds because the speed is non-negative and because the heap guarantees that the grid point with the minimum distance is selected first out of the data points in the sweep of N points at any level. Hence the entire operation of determining the next contour takes O(N log N) time.

This is a numerical method because it tries to view the curve as a set of discrete markers. The fast marching method has applications in image processing such as for region growing.

In the fast marching method, let us say there are N points on the zero level of the surface we track in an N*N*N unit-distance grid. The curve is pushed outwards in unit distances. Then, each step involves the following:

1) The point with the smallest distance is removed from the priority queue and its value is frozen, so that we never revisit it and our direction is strictly outward.

2) Neighboring points are added to the priority queue so that the band of active points keeps unit thickness.

3) The distances of the neighbors of the removed point are recomputed. The finite-difference update is a quadratic equation in several variables.
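
The three steps can be put together in a short sketch. This is not Sethian's reference implementation; it is a minimal first-order version on a 2D unit grid, written under stated assumptions: the function name fast_march is made up, only frozen neighbors feed the update, the smaller neighbor on each axis is used, and stale heap entries are skipped instead of being decreased in place.

```python
import heapq
import math


def fast_march(speed, seeds):
    """Travel times on a 2D unit grid; speed holds positive values, seeds start at time 0."""
    rows, cols = len(speed), len(speed[0])
    T = [[math.inf] * cols for _ in range(rows)]
    frozen = [[False] * cols for _ in range(rows)]
    heap = []
    for i, j in seeds:
        T[i][j] = 0.0
        heapq.heappush(heap, (0.0, i, j))

    def known(i, j):
        # Travel time of a frozen neighbor; infinity if off-grid or not frozen yet.
        if 0 <= i < rows and 0 <= j < cols and frozen[i][j]:
            return T[i][j]
        return math.inf

    def update(i, j):
        # Step 3: solve the local quadratic using the smaller neighbor on each axis.
        a = min(known(i - 1, j), known(i + 1, j))
        b = min(known(i, j - 1), known(i, j + 1))
        f = 1.0 / speed[i][j]  # unit distance divided by speed
        if abs(a - b) >= f:
            return min(a, b) + f  # only one axis contributes
        return 0.5 * (a + b + math.sqrt(2.0 * f * f - (a - b) ** 2))

    while heap:
        # Step 1: remove the point with the smallest travel time and freeze it.
        _, i, j = heapq.heappop(heap)
        if frozen[i][j]:
            continue  # stale entry left over from an earlier, larger estimate
        frozen[i][j] = True
        # Step 2: (re)insert the unfrozen neighbors with their new estimates.
        for ni, nj in ((i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1)):
            if 0 <= ni < rows and 0 <= nj < cols and not frozen[ni][nj]:
                t_new = update(ni, nj)
                if t_new < T[ni][nj]:
                    T[ni][nj] = t_new
                    heapq.heappush(heap, (t_new, ni, nj))
    return T


T = fast_march([[1.0] * 3 for _ in range(3)], [(0, 0)])
for row in T:
    print([round(t, 3) for t in row])
```

Pushing duplicates and skipping stale entries keeps the code short; an implementation with a decrease-key heap would avoid the extra entries.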

Computational Complexity

The entire operation of determining the next contour takes O(N log N) time: each of the N points is removed from the heap once, and each heap operation costs O(log N).

Error Analysis

There are two limitations of this approach.

This method does not apply to continuous boundaries

There is a potential large error in the first step of the numerical calculation.

Courtesy : Al Khalifa, S. Fomel : Fast Marching Implementation, Sethian : Fast Marching Method.

Saturday, June 8, 2013

rebuilding the contour in fast marching method

In the previous post, we did not discuss the partial differential equation (PDE) or how to solve it numerically. Specifically, we did not discuss the distance function (time cost) based on which the next data point is selected. In this post we mention it with an example.
We start from a data point whose travel time is frozen, i.e. we don't change it again, and proceed outwards to its neighboring points. We put these neighboring points in a priority queue where the points are ordered by travel time, smallest first. Then the top point in this priority queue, which has the minimum travel time, is removed. This point is then frozen. The neighboring points are updated with the Euclidean norm of the gradient of the travel time, which equals the inverse of the speed. The gradient components are along the x, y, z axes and are computed as finite differences: the difference in travel time between the new point and the old point on the same axis over a unit distance. We take the maximum of this difference and zero. The differences are also taken between the next point and the new point in the normal direction. We take the minimum of this difference and zero. The Euclidean norm is the square root of a sum of squares, just like we measure a hypotenuse. So we sum the squares of the maximum and minimum respectively for each of the three axes. This gives us a quadratic in the travel time t at the new point and the speed v that has a well-defined solution. When only one neighbor contributes, the solution is either the travel time at the previous point + (unit-distance / speed) or that at the next point + (unit-distance / speed). In the fast marching method, we simplify this computation to the forward-only direction and, on the zero-level plane, to only the x and y coordinates.
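Written out, the description above corresponds to the following upwind scheme. The notation is an assumption for this sketch: T is the travel time, F the speed, h = 1 the grid spacing, and D^{-x}, D^{+x} the backward and forward differences along x. The closed-form solution shown is the commonly used two-dimensional simplification that keeps only the smaller neighbor on each axis, with a and b denoting those neighbors.

```latex
% Backward and forward differences along x (unit spacing); y and z are analogous
D^{-x}T = T_{i,j,k} - T_{i-1,j,k}, \qquad D^{+x}T = T_{i+1,j,k} - T_{i,j,k}

% Upwind discretization of |\nabla T| = 1/F
\Big[ \max(D^{-x}T, 0)^2 + \min(D^{+x}T, 0)^2
    + \max(D^{-y}T, 0)^2 + \min(D^{+y}T, 0)^2
    + \max(D^{-z}T, 0)^2 + \min(D^{+z}T, 0)^2 \Big]^{1/2} = \frac{1}{F}

% Two-dimensional case with a = \min(T_{i-1,j}, T_{i+1,j}), b = \min(T_{i,j-1}, T_{i,j+1})
T =
\begin{cases}
  \min(a, b) + \dfrac{1}{F}, & |a - b| \ge \dfrac{1}{F} \\[2ex]
  \dfrac{a + b + \sqrt{2/F^2 - (a - b)^2}}{2}, & \text{otherwise}
\end{cases}
```

For example, with a = b = 1 and F = 1 the second case applies and gives T = (2 + sqrt(2)) / 2, roughly 1.707, which is less than the 2 that stepping along a single axis would give.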
We repeat this for all other neighboring points around the one we removed from the priority queue.
It might so happen that one or more of the neighboring points are already in the active set, i.e. in the priority queue. If we take as the center a new point for which we want to compute the travel time, then there are four neighbours around it, and one of them, call it A, has the smallest of all the trial values. Then there are four cases we need to consider for the other three neighbors: 1) none of them is in the active set, 2) one of them is in the active set, 3) two of them are in the active set and 4) all three of them are in the active set.
In the case where none of the neighbors are in the active set, the travel time of the new point is the sum of that of A and the inverse speed function f (f is based on the unit distance).
In the case where one of the neighboring points b is already in the active set, we know that its travel time is more than that of the point we just removed, and also that the travel time of the new point will be less than the sum of the travel time for b and the inverse speed function f.
If two of the neighbors are in the active set and one of them is frozen, then this point is in the forward direction and will be the contributor to the inverse speed function, since the frozen value will not be changed, and this degenerates to case 1).
If three of the neighbors are in the active set, the smallest value in each coordinate direction is taken and this degenerates to the cases discussed above.
Thus we obtain the travel times of the neighboring points as well and add them to the priority queue.
The neighboring points are typically referred to as the narrow band, and within it we locate the grid point with the minimum travel time.

Friday, June 7, 2013

I came across an interesting problem today and I will mention it here verbatim and try to solve it. The question was: how do you delete a node from a singly linked list if the head is not given? So we only have the current node, and the node has to be freed. Furthermore, the singly linked list is not a circular list that we can traverse to find the head. Since we don't know the head and we can't traverse the list, we don't know the previous node and cannot update its reference.
By deleting the node without knowing the previous node, we break the list and create dangling references. So one solution I suggested was that we treat the list as physically immutable and overlay a layer that shows the logical list by skipping over the nodes that are deleted. During traversal we pass through the deleted nodes without doing any operation on them. We can keep additional state per node in a wrapper that says whether it is deleted or not. Another approach is to scan memory for all occurrences of the current pointer and set them to null now that the current node is freed, though this could break users of the structure that expect a valid pointer and do not handle a null one. Whether that is acceptable depends on how the list is used.
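Here is a minimal sketch of the first suggestion, the logical overlay. The class and function names are made up for illustration, and the deleted flag lives on the node itself rather than in a separate wrapper to keep the example short. Note that the node is only marked, not freed, so the memory is not reclaimed until some later cleanup pass.

```python
class Node:
    def __init__(self, value, next=None):
        self.value = value
        self.next = next
        self.deleted = False  # logical-deletion flag; links are never changed


def delete(node):
    """Logically delete a node without needing the head or the previous node."""
    node.deleted = True


def traverse(head):
    """Yield the values of the logical list, skipping logically deleted nodes."""
    node = head
    while node is not None:
        if not node.deleted:
            yield node.value
        node = node.next


# Build 1 -> 2 -> 3, then delete the middle node given only a reference to it.
c = Node(3)
b = Node(2, c)
a = Node(1, b)
delete(b)
values = list(traverse(a))
print(values)
```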