Wednesday, November 8, 2017

We were discussing the difference between data mining and machine learning, given that there is some overlap. In this context, I want to explain modeling in general terms. We will be following slides from Costa, Kleinstein and Hershberg on model fitting and error estimation.
A model articulates quantitatively how a system behaves. It might involve an equation or a system of equations, with variables denoting the observed quantities. The purpose of the model is to make predictions from those variables. To make the predictions reasonably accurate, the model is often trained on one set of data before being used to predict on held-out test data; adjusting the model during this process is referred to as model tuning. Models use numerical methods to examine complex situations and produce predictions. Common techniques for building a model include statistical methods, numerical methods, matrix factorization, and optimization. From Newton's laws onward, we have used this kind of technique to understand and use our world.
Sometimes we relied on experimental data to corroborate the model and tune it. Other times, we simulated the model to see whether the predicted outcomes matched the observed data. There are some caveats with this form of analysis. A model is merely a representation of our understanding based on our assumptions; it is not the truth. The experimental data is closer to the truth than the model. Even the experimental data may be tainted by how we question nature rather than by nature itself. This is what Heisenberg and Covell warn against. An inaccurate model may not be reliable for prediction. Even if the model is closer to the truth, garbage in may result in garbage out.

#codingexercise
Compute the Tanimoto coefficient between two vectors:

double get_tanimoto_coefficient(int dimension, double* vec1, double* vec2)
{
    double dotProduct = 0;
    double magnitude1 = 0;
    double magnitude2 = 0;
    int i;

    for (i = 0; i < dimension; i++) dotProduct += vec1[i] * vec2[i];
    for (i = 0; i < dimension; i++) magnitude1 += vec1[i] * vec1[i];
    for (i = 0; i < dimension; i++) magnitude2 += vec2[i] * vec2[i];

    /* Tanimoto: dot / (|v1|^2 + |v2|^2 - dot) */
    double denominator = magnitude1 + magnitude2 - dotProduct;
    if (denominator == 0) return 0;
    return dotProduct / denominator;
}

The Tanimoto coefficient looks awfully similar to cosine similarity, but the two measures differ: cosine similarity normalizes the dot product by the product of the vector magnitudes, while the Tanimoto denominator is the sum of the squared magnitudes minus the dot product.
