We continue discussing Convolutional Neural Networks (CNNs) for embedding images and 3D objects in a shared space, shape signatures, and image retrieval.
We were discussing the Euclidean distance and the distance matrix. As we know, with Euclidean distance the chi-square measure, which is a sum of squared errors, gives a good indication of how close the objects are to the mean; it is therefore a measure of goodness of fit. The same principle applies to the embedding space. By using a notion of error, we can make sure that the shapes and images embedded in the space do not violate the pairwise distances among members. The only variation the authors apply to this measure is the use of the Sammon error instead of chi-square, because it encourages the preservation of local neighborhood structure during embedding. The joint embedding space is a Euclidean space of lower dimension, while the shapes and images are represented in the original high-dimensional space.
The Sammon error is a weighted sum of the differences between the original pairwise distances and the pairwise distances in the embedding. Each term is weighted by the inverse of the original distance, so pairs of dissimilar shapes, which are far apart to begin with, are weighted down and nearby pairs dominate the objective.
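As a concrete illustration, here is a minimal sketch of the Sammon error in Python (NumPy); the function name and the exact normalization are choices made for illustration, not the authors' code.

```python
import numpy as np

def sammon_error(D_orig, D_embed, eps=1e-12):
    """Weighted mismatch between original and embedding pairwise distances.

    D_orig  : (n, n) distances in the original high-dimensional space
    D_embed : (n, n) distances in the low-dimensional embedding
    Pairs with a large original distance get a small weight (1 / D_orig),
    so the error emphasizes preserving local neighborhoods.
    """
    iu = np.triu_indices_from(D_orig, k=1)        # count each pair once
    d, d_hat = D_orig[iu], D_embed[iu]
    weights = 1.0 / (d + eps)
    return np.sum(weights * (d - d_hat) ** 2) / (np.sum(d) + eps)
```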
The embedding of shapes and images proceeds by minimizing the Sammon error with non-linear multi-dimensional scaling (MDS). MDS is a means of visualizing the level of similarity of individual cases in a dataset: items are placed in an N-dimensional space so that the pairwise distances between them are preserved as well as possible.
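A minimal sketch of this kind of non-linear MDS, assuming the sammon_error helper above and using SciPy's general-purpose optimizer as a stand-in for the authors' solver:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.spatial.distance import pdist, squareform

def embed_by_sammon(D_orig, dim=2, seed=0):
    """Place n items in a dim-dimensional space by minimizing the Sammon error."""
    n = D_orig.shape[0]
    rng = np.random.default_rng(seed)
    x0 = rng.normal(scale=1e-2, size=n * dim)     # small random initial layout

    def objective(x):
        X = x.reshape(n, dim)
        D_embed = squareform(pdist(X))            # pairwise distances in the embedding
        return sammon_error(D_orig, D_embed)

    res = minimize(objective, x0)                 # gradient-free call; fine for small n
    return res.x.reshape(n, dim)
```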
We saw how the embedding space is created. Mapping new shapes takes slightly more effort. The space was originally constructed from a set of 3D shapes that were jointly embedded, so introducing a new shape requires us to find an embedding point for it. The steps for this include:
First, a feature vector is computed.
Second, pairwise distances between the new shape and the already-embedded shapes are computed.
Third, we minimize the Sammon error, this time applying the Liu-Nocedal method (L-BFGS), a large-scale optimization method that combines BFGS steps and conjugate-direction steps. BFGS itself is an iterative method for solving unconstrained non-linear optimization problems. A sketch of the whole procedure appears below.
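Putting the three steps together, here is a hedged sketch of how a new shape might be placed in the existing embedding: the previously embedded points stay fixed and only the new point's coordinates are optimized, using SciPy's L-BFGS-B routine (a limited-memory variant in the spirit of the Liu-Nocedal method) rather than the authors' own implementation.

```python
import numpy as np
from scipy.optimize import minimize

def embed_new_shape(feature_new, features_existing, points_existing, eps=1e-12):
    """Find an embedding point for a new shape while keeping existing points fixed.

    feature_new       : feature vector of the new shape (step 1)
    features_existing : (n, f) feature vectors of already-embedded shapes
    points_existing   : (n, d) their coordinates in the embedding space
    """
    # Step 2: pairwise distances from the new shape to the embedded shapes.
    d = np.linalg.norm(features_existing - feature_new, axis=1)
    weights = 1.0 / (d + eps)

    # Step 3: minimize the Sammon error restricted to the new point.
    def objective(p):
        d_hat = np.linalg.norm(points_existing - p, axis=1)
        return np.sum(weights * (d - d_hat) ** 2)

    p0 = points_existing.mean(axis=0)             # start at the centroid
    res = minimize(objective, p0, method="L-BFGS-B")
    return res.x
```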
The 3D shapes in the embedding space carry abundant information both for training the CNN and for data generation. The shapes are represented as clean and complete meshes, which allows control and flexibility. Many images can be generated from each shape using a rendering process; this is the image synthesis step. In the embedding space, each shape is mapped to a point, so for every rendered image its association with a shape, and hence with an embedding point, is automatically known. The collection of images and shapes forms the training data for the CNN.
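A hedged sketch of how such training pairs could be assembled, assuming a hypothetical render(shape, camera, light, background) helper that is not part of the original text:

```python
import numpy as np

def build_training_pairs(shapes, embedding_points, render, views_per_shape=50, seed=0):
    """Pair rendered images with the embedding point of the shape they came from.

    shapes           : list of 3D meshes
    embedding_points : (n, d) coordinates of each shape in the joint space
    render           : hypothetical helper producing an image for given
                       camera, lighting, and background parameters
    """
    rng = np.random.default_rng(seed)
    images, targets = [], []
    for shape, point in zip(shapes, embedding_points):
        for _ in range(views_per_shape):
            camera = rng.uniform(0, 360, size=3)        # random viewpoint
            light = rng.uniform(0.2, 1.0)               # random lighting intensity
            background = rng.integers(0, 255, (224, 224, 3), dtype=np.uint8)
            images.append(render(shape, camera, light, background))
            targets.append(point)                       # label = shape's embedding point
    return images, np.array(targets)
```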
CNN models can approximate high-dimensional and non-linear functions; recall that the feature vector has a large number of attributes and the Sammon error minimization objective is a non-linear function. A CNN can fit millions of parameters, so it can be precise and informative once it is trained on a large amount of data. If the data is not adequate, the CNN cannot learn enough latent information and the results suffer from overfitting. When the images are generated with rich variation in lighting and viewpoint and superimposed on random backgrounds, the CNN has sufficient data. Approximately 1 million images are synthesized per category.
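As a rough illustration of the training setup, here is a minimal PyTorch sketch of a CNN that regresses from a rendered image to its shape's embedding point; the architecture, loss, and dimensions are assumptions for illustration, not the network described by the authors.

```python
import torch
import torch.nn as nn

class EmbeddingRegressor(nn.Module):
    """Small CNN that maps an RGB image to a point in the joint embedding space."""
    def __init__(self, embed_dim=128):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.head = nn.Linear(64, embed_dim)

    def forward(self, x):
        return self.head(self.features(x).flatten(1))

model = EmbeddingRegressor(embed_dim=128)
criterion = nn.MSELoss()                      # distance to the shape's embedding point
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# One illustrative training step on a synthetic batch.
images = torch.randn(8, 3, 224, 224)          # stand-ins for rendered images
targets = torch.randn(8, 128)                 # embedding points of their source shapes
loss = criterion(model(images), targets)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```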