We discuss bootstrapping in Natural Language Processing, reading from the paper by Ehara, Sato, Oiwa, and Nakagawa and from the nlp blogger. Bootstrapping is a technique that uses a binary classifier in an iterative manner: starting from an initial labeled "seed" set, unlabeled instances are labeled over successive iterations. The steps involved are as follows (a sketch of the loop appears after the list):
Build a classifier with rules to select the positive answer with high precision, and
build a classifier with rules to select the negative answer with high precision.
Run the classifiers on a large data set to collect a labeled set.
Train the classifier on the labeled set.
Run the classifier on the original data.
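To make the loop concrete, here is a minimal sketch of the steps above. The rule functions positive_rule() and negative_rule() and the scikit-learn-style classifier object are hypothetical placeholders, not names from the paper.

```python
def bootstrap(instances, positive_rule, negative_rule, classifier):
    # Steps 1-2: apply the high-precision positive/negative rules
    # to harvest a labeled set from the large data set.
    labeled, labels = [], []
    for x in instances:
        if positive_rule(x):
            labeled.append(x)
            labels.append(1)
        elif negative_rule(x):
            labeled.append(x)
            labels.append(0)
    # Step 3: train the classifier on the rule-labeled set.
    classifier.fit(labeled, labels)
    # Step 4: run the trained classifier back over the original data.
    return classifier.predict(instances)
```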
The choice of the seed set is important, and it is chosen based on two considerations: the choice of seed set affects the accuracy of the resulting classifier, and each seed carries a human labeling cost. The authors present a model called expected model rotation that works well on realistic data, and with this model they propose an "iterative seeding framework". At the end of each iteration, the seeds are assessed for quality and improved in response to the current human labels and the data characteristics. This involves a criterion that improves the seeds, as well as a score tracking the similarity of the seeds that is updated with each iteration. A rough skeleton of this loop follows.
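Here is one way the control flow of that framework could look in code; goodness() and label_by_human() are hypothetical placeholders standing in for the paper's criterion and the human annotator, so this is a sketch under those assumptions rather than the authors' implementation.

```python
def iterative_seeding(unlabeled, seeds, rounds, goodness, label_by_human):
    """Grow the seed set by one human-labeled instance per iteration."""
    for _ in range(rounds):
        # Score every unlabeled instance with the goodness-of-seed criterion.
        scores = {x: goodness(x, seeds) for x in unlabeled}
        # The best candidate becomes the next seed, labeled by a human.
        best = max(scores, key=scores.get)
        seeds[best] = label_by_human(best)
        unlabeled.remove(best)
    return seeds
```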
In bootstrapping, given a dataset, we can take the first l instances to be labeled and the remaining u instances to be unlabeled. Each instance is an m-dimensional feature vector, and the first l carry labels drawn from a set C of semantic classes. If the task is to pick out animal names from the given instances, then C consists of the labels animal and not-animal, and the problem is binary. The ranking is given by the score vector computed from the seed vector $y = y_{\text{animal}} - y_{\text{not-animal}}$. Simple bootstrapping proceeds by setting the score vector using the first classifier rule for the first label, and continues until all of the rules for each label have been identified.
If X is the n × m feature matrix pertaining to a given text, the score vector after c iterations is obtained as $f = \left(\frac{1}{m}\frac{1}{n} X X^{\top}\right)^{c} y$.
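As a toy numerical check of this formula, the matrix X and seed vector y below are made up for illustration, with +1 for an animal seed, -1 for a non-animal seed, and 0 for unlabeled instances.

```python
import numpy as np

# n = 4 instances, m = 3 features; the values are invented for illustration.
X = np.array([[1, 1, 0],
              [1, 0, 1],
              [0, 1, 1],
              [1, 1, 1]], dtype=float)
n, m = X.shape
y = np.array([1.0, 0.0, -1.0, 0.0])     # y_animal - y_not-animal

c = 3                                   # number of bootstrapping iterations
A = (X @ X.T) / (m * n)                 # (1/(mn)) X X^T
f = np.linalg.matrix_power(A, c) @ y    # score vector after c iterations
print(f)                                # ranking over the instances
```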
To make f seed dependent, Komachi et al. (2008) proposed writing the score vector as $f = (-L)^{c} y$, where the normalized graph Laplacian is defined as $L = I - D^{-1/2} X X^{\top} D^{-1/2}$ and D is the diagonal degree matrix of $X X^{\top}$. This method is called Laplacian label propagation, and its score vector is given by $f = (I + \beta L)^{-1} y$.
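The same toy data can be pushed through Laplacian label propagation. This follows the formulas above directly; the value of beta is an assumption made for the example.

```python
import numpy as np

X = np.array([[1, 1, 0],
              [1, 0, 1],
              [0, 1, 1],
              [1, 1, 1]], dtype=float)
y = np.array([1.0, 0.0, -1.0, 0.0])

W = X @ X.T                                        # instance similarity graph
D_inv_sqrt = np.diag(1.0 / np.sqrt(W.sum(axis=1)))
L = np.eye(len(W)) - D_inv_sqrt @ W @ D_inv_sqrt   # normalized Laplacian
beta = 0.5                                         # regularization weight (assumed)
f = np.linalg.solve(np.eye(len(W)) + beta * L, y)  # (I + beta L)^(-1) y
print(f)
```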
We will shortly see how this Laplacian propagation works. The thing to note here is that bootstrapping sits between the supervised and unsupervised modes, and even though it is said to work well, it does not have the same rigor as vector space models or graph-based efforts. The reason we are looking into bootstrapping is merely to see the usage of Laplacian operators.
Let us look further into the paper for the Laplacian operator. The authors propose a criterion for iterative seeding. For each unlabeled instance they define a goodness of seed $g_i$, and they select the instance with the highest goodness of seed as the seed added in the next iteration; each seed-selection criterion defines $g_i$ differently, capturing how strongly the unlabeled instance would influence the model. They show that the Komachi label propagation score can be interpreted as the margin between each unlabeled data point and the hyperplane obtained by ridge regression, and they express this with a transformed equation.
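One illustrative way to act on that margin view, not necessarily the paper's exact criterion, is to treat |f_i| as the distance of unlabeled instance i from the ridge-regression hyperplane and take the goodness of seed to favor the most uncertain (smallest-margin) instance; the scores below are made up.

```python
import numpy as np

f = np.array([0.62, 0.31, -0.55, 0.05])   # made-up propagation scores
labeled = {0, 2}                           # indices of the current seeds
unlabeled = [i for i in range(len(f)) if i not in labeled]

# Margin of each unlabeled instance from the hyperplane; a small margin
# means the model is uncertain there, so labeling it is most informative.
margins = {i: abs(f[i]) for i in unlabeled}
next_seed = min(margins, key=margins.get)
print(next_seed)                           # -> 3 in this toy example
```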