Friday, November 10, 2017

We were discussing modeling. A model articulates how a system behaves quantitatively. Models use numerical methods to examine complex situations and come up with predictions. The most common techniques for building a model include statistical methods, numerical methods, matrix factorizations and optimization.
Sometimes we rely on experimental data to corroborate the model and tune it. Other times, we simulate the model to see whether the predicted outcomes match the observed data. There are some caveats with this form of analysis. A model is merely a representation of our understanding based on our assumptions; it is not the truth. The experimental data is closer to the truth than the model, and even the experimental data may be tainted by how we question nature rather than by nature itself. This is what Heisenberg and Covell warn against. A model that is inaccurate may not be reliable in prediction, and even if the model is closer to the truth, garbage in may result in garbage out.
Any model has a test measure to determine its effectiveness. Since the observed and the predicted values are both known, a suitable test metric may be chosen. For example, the sum of squared errors or the F-measure may be used to compare and improve models.
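The sum of squared errors mentioned above is simple enough to show directly. Here is a minimal sketch; the function name and signature are my own, not from the original discussion:

// Sum of squared errors between observed and predicted values.
// Smaller values indicate a better fit of the model to the data.
double sum_of_squared_errors(int n, double* observed, double* predicted)
{
    double sse = 0;
    for (int i = 0; i < n; i++)
    {
        double error = observed[i] - predicted[i];
        sse += error * error;
    }
    return sse;
}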
#codingexercise 
implement the fix centroid step of k-means
// Recompute each cluster's centroid as the member with the smallest total
// squared cosine distance to the other members of the cluster (a medoid-style update).
bool fix_centroids(int dimension, double** vectors, int* centroids, int* cluster_labels, int size, int k)
{
    bool centroids_updated = false;
    int* new_centroids = (int*) malloc(k * sizeof(int));
    if (new_centroids == NULL) { printf("Require more memory\n"); exit(1); }
    // Start from the current centroids so that a cluster with no better
    // candidate keeps its existing centroid.
    for (int i = 0; i < k; i++) new_centroids[i] = centroids[i];
    for (int i = 0; i < k; i++)
    {
        int label = i;
        double* centroid = vectors[centroids[label]];
        // Baseline: total squared distance from the current centroid to the
        // other members of this cluster.
        double minimum = 0;
        for (int j = 0; j < size; j++)
        {
             if (j != centroids[label] && cluster_labels[j] == label)
             {
                double cosd = get_cosine_distance(dimension, centroid, vectors[j]);
                minimum += cosd * cosd;
             }
        }

        // Try every member of the cluster as a candidate centroid and keep the
        // one with the smallest total squared distance to the other members.
        for (int j = 0; j < size; j++)
        {
             if (cluster_labels[j] != label) continue;
             double distance = 0;
             for (int m = 0; m < size; m++)
             {
                if (cluster_labels[m] != label) continue;
                if (m == j) continue;
                double cosd = get_cosine_distance(dimension, vectors[m], vectors[j]);
                distance += cosd * cosd;
             }

             if (distance < minimum)
             {
                 minimum = distance;
                 new_centroids[label] = j;
                 centroids_updated = true;
             }
        }
    }

    if (centroids_updated)
    {
        for (int j = 0; j < k; j++)
        {
            centroids[j] = new_centroids[j];
        }
    }
    free(new_centroids);
    return centroids_updated;

}

Thursday, November 9, 2017

We were discussing modeling in general terms. We will be following the slides from Costa, Kleinstein and Hershberg on Model fitting and error estimation.
A model articulates how a system behaves quantitatively. Models use numerical methods to examine complex situations and come up with predictions. The most common techniques for building a model include statistical methods, numerical methods, matrix factorizations and optimization.
Sometimes we rely on experimental data to corroborate the model and tune it. Other times, we simulate the model to see whether the predicted outcomes match the observed data. There are some caveats with this form of analysis. A model is merely a representation of our understanding based on our assumptions; it is not the truth. The experimental data is closer to the truth than the model, and even the experimental data may be tainted by how we question nature rather than by nature itself. This is what Heisenberg and Covell warn against. A model that is inaccurate may not be reliable in prediction, and even if the model is closer to the truth, garbage in may result in garbage out.
#codingexercise
assign clusters to vectors as part of k-means:
// Assign each vector to the cluster whose centroid is closest to it
// under the cosine distance.
void assign_clusters(int dimension, double** vectors, int* centroids, int* cluster_labels, int size, int k)
{
    for (int m = 0; m < size; m++)
    {
        // Start from the distance to the currently assigned centroid.
        double minimum = get_cosine_distance(dimension, vectors[centroids[cluster_labels[m]]], vectors[m]);
        for (int i = 0; i < k; i++)
        {
             double* centroid = vectors[centroids[i]];
             double distance = get_cosine_distance(dimension, centroid, vectors[m]);
             if (distance < minimum)
             {
                 minimum = distance;
                 cluster_labels[m] = i;
             }
        }
    }
}

Wednesday, November 8, 2017

We were discussing the difference between data mining and machine learning given that there is some overlap. In this context, I want to attempt explaining modeling in general terms. We will be following slides from Costa, Kleinstein and Hershberg on Model fitting and error estimation.
A model articulates how a system behaves quantitatively. It might involve an equation or a system of equations, with variables denoting the observed quantities. The purpose of the model is to give a prediction based on those variables. To make the prediction reasonably accurate, the model is often trained on one set of data before being used to predict on the test data. This is referred to as model tuning. Models use numerical methods to examine complex situations and come up with predictions. The most common techniques for building a model include statistical methods, numerical methods, matrix factorizations and optimization. Starting from Newton's laws, we have used this kind of technique to understand and use our world.
Sometimes we rely on experimental data to corroborate the model and tune it. Other times, we simulate the model to see whether the predicted outcomes match the observed data. There are some caveats with this form of analysis. A model is merely a representation of our understanding based on our assumptions; it is not the truth. The experimental data is closer to the truth than the model, and even the experimental data may be tainted by how we question nature rather than by nature itself. This is what Heisenberg and Covell warn against. A model that is inaccurate may not be reliable in prediction, and even if the model is closer to the truth, garbage in may result in garbage out.
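To make model fitting concrete, here is a minimal sketch of fitting a straight line y = a + b*x to data by ordinary least squares. The closed-form solution is standard, but the function name fit_line and the sample data are mine, not from the slides:

#include <stdio.h>

// Fit y = a + b*x by ordinary least squares:
// b = sum((x - mean_x)*(y - mean_y)) / sum((x - mean_x)^2), a = mean_y - b*mean_x.
void fit_line(int n, double* x, double* y, double* a, double* b)
{
    double mean_x = 0, mean_y = 0;
    for (int i = 0; i < n; i++) { mean_x += x[i]; mean_y += y[i]; }
    mean_x /= n;
    mean_y /= n;
    double sxy = 0, sxx = 0;
    for (int i = 0; i < n; i++)
    {
        sxy += (x[i] - mean_x) * (y[i] - mean_y);
        sxx += (x[i] - mean_x) * (x[i] - mean_x);
    }
    *b = (sxx == 0) ? 0 : sxy / sxx;
    *a = mean_y - (*b) * mean_x;
}

int main(void)
{
    // Illustrative data only.
    double x[] = {1, 2, 3, 4, 5};
    double y[] = {2.1, 3.9, 6.2, 8.1, 9.8};
    double a, b;
    fit_line(5, x, y, &a, &b);
    printf("y = %.3f + %.3f * x\n", a, b);
    return 0;
}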

#codingexercise
Get Tanimoto coefficient between two vectors

// Tanimoto coefficient: dot(v1, v2) / (|v1|^2 + |v2|^2 - dot(v1, v2)).
double get_tanimoto_coefficient(int dimension, double* vec1, double* vec2)
{
    double dotProduct = 0;
    double denominator = 0;
    for (int i = 0; i < dimension; i++) dotProduct += vec1[i] * vec2[i];
    for (int i = 0; i < dimension; i++) denominator += vec1[i] * vec1[i];
    for (int i = 0; i < dimension; i++) denominator += vec2[i] * vec2[i];
    denominator -= dotProduct;
    if (denominator == 0) return 0;
    return dotProduct / denominator;
}

The Tanimoto coefficient looks awfully similar to the cosine measure, but the two have very different meanings and share little beyond the form of the expression.
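To see the difference on the same inputs, here is a small self-contained sketch; the sample vectors are made up, and the cosine similarity is computed inline:

#include <stdio.h>
#include <math.h>

int main(void)
{
    // Two illustrative 3-dimensional vectors.
    double v1[] = {1.0, 2.0, 3.0};
    double v2[] = {2.0, 4.0, 1.0};
    double dot = 0, n1 = 0, n2 = 0;
    for (int i = 0; i < 3; i++)
    {
        dot += v1[i] * v2[i];
        n1 += v1[i] * v1[i];
        n2 += v2[i] * v2[i];
    }
    // Cosine similarity normalizes by the product of the magnitudes.
    double cosine = dot / (sqrt(n1) * sqrt(n2));
    // Tanimoto normalizes by the sum of squared magnitudes minus the dot product.
    double tanimoto = dot / (n1 + n2 - dot);
    printf("cosine similarity = %f, tanimoto = %f\n", cosine, tanimoto);
    return 0;
}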

Tuesday, November 7, 2017

We were discussing the difference between data mining and machine learning given that there is some overlap. In this context, I want to attempt explaining modeling in general terms. We will be following slides from Costa, Kleinstein and Hershberg on Model fitting and error estimation.
A model articulates how a system behaves quantitatively. It might involve an equation or a system of equations, with variables denoting the observed quantities. The purpose of the model is to give a prediction based on those variables. To make the prediction reasonably accurate, the model is often trained on one set of data before being used to predict on the test data. This is referred to as model tuning. Models use numerical methods to examine complex situations and come up with predictions. The most common techniques for building a model include statistical methods, numerical methods, matrix factorizations and optimization. Starting from Newton's laws, we have used this kind of technique to understand and use our world.


#codingexercise
Describe the k-means clustering technique
// k-means over cosine distances: alternate between assigning vectors to the
// nearest centroid and recomputing the centroids until they stop changing
// or the iteration limit is reached.
void kmeans(int dimension, double **vectors, int size, int k, int max_iterations)
{
     if (vectors == NULL || *vectors == NULL || size == 0 || k == 0 || dimension == 0) return;

     int* cluster_labels = initialize_cluster_labels(vectors, size, k);
     int* centroids = initialize_centroids(vectors, size, k);
     bool centroids_updated = true;
     int count = 0;
     while (centroids_updated)
     {
         count++;
         if (count > max_iterations) break;
         assign_clusters(dimension, vectors, centroids, cluster_labels, size, k);
         centroids_updated = fix_centroids(dimension, vectors, centroids, cluster_labels, size, k);
     }

     print_clusters(cluster_labels, size);
     free(centroids);
     free(cluster_labels);
     return;
}
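The helpers that kmeans relies on (initialize_cluster_labels, initialize_centroids and print_clusters) are not shown in the post. Below is one plausible, minimal sketch of them, together with a small example invocation; these are my assumptions rather than the author's versions, and they presume the assign_clusters, fix_centroids and get_cosine_distance routines from these posts are available in the same file.

#include <stdio.h>
#include <stdlib.h>

// Hypothetical helper: give every vector a starting label in round-robin order.
int* initialize_cluster_labels(double** vectors, int size, int k)
{
    int* labels = (int*) malloc(size * sizeof(int));
    if (labels == NULL) { printf("Require more memory\n"); exit(1); }
    for (int i = 0; i < size; i++) labels[i] = i % k;
    return labels;
}

// Hypothetical helper: use the first k vectors as the starting centroids.
int* initialize_centroids(double** vectors, int size, int k)
{
    int* centroids = (int*) malloc(k * sizeof(int));
    if (centroids == NULL) { printf("Require more memory\n"); exit(1); }
    for (int i = 0; i < k; i++) centroids[i] = i;
    return centroids;
}

// Hypothetical helper: print the cluster assigned to each vector.
void print_clusters(int* cluster_labels, int size)
{
    for (int i = 0; i < size; i++)
        printf("vector %d -> cluster %d\n", i, cluster_labels[i]);
}

// Example invocation on a tiny, illustrative 2-dimensional data set.
int main(void)
{
    double a[] = {1.0, 0.0}, b[] = {0.9, 0.1}, c[] = {0.0, 1.0}, d[] = {0.1, 0.9};
    double* vectors[] = {a, b, c, d};
    kmeans(2, vectors, 4, 2, 100);
    return 0;
}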

Monday, November 6, 2017

We were enumerating the differences between data mining and machine learning. Data Mining is generally used in conjunction with a database. Some of the algorithms used with data mining for building models and making predictions fall into the following categories:
Classification algorithms - for finding similar groups based on discrete variables
Regression algorithms - for finding statistical correlations on continuous variables from attributes
Segmentation algorithms - for dividing into groups with similar properties
Association algorithms - for finding correlations between different attributes in a data set
Sequence Analysis Algorithms - for finding groups via paths in sequences

Some of the machine learning algorithms, such as those from the MicrosoftML package, include:
fast linear  for binary classification or linear regression
one class SVM for anomaly detection
fast trees for regression
fast forests for churn detection and building multiple trees
neural net for binary and multi-class classification
logistic regression for classifying sentiments from feedback

Applications of machine learning generally include:
1) making recommendations with collaborative filtering (a small sketch follows this list)
2) discovering groups using clustering and other unsupervised methods, as opposed to neural networks, decision trees, support vector machines and Bayesian filtering, which are supervised learning methods
3) searching and ranking, as used with PageRank for web pages
4) text document filtering
5) modeling decision trees and
6) evolving intelligence, such as with genetic programming and the elimination of weaknesses
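To make item 1 concrete, here is a minimal sketch of user-based collaborative filtering over a small, made-up ratings matrix. The similarity and recommend_for functions, the matrix values and the dimensions are all hypothetical, chosen only for illustration, and the cosine similarity is computed inline:

#include <stdio.h>
#include <math.h>

#define USERS 4
#define ITEMS 5

// Cosine similarity between two rating rows (0 means not rated).
double similarity(const double* a, const double* b)
{
    double dot = 0, na = 0, nb = 0;
    for (int i = 0; i < ITEMS; i++)
    {
        dot += a[i] * b[i];
        na += a[i] * a[i];
        nb += b[i] * b[i];
    }
    if (na == 0 || nb == 0) return 0;
    return dot / (sqrt(na) * sqrt(nb));
}

// Recommend items to `user` that the most similar other user has rated
// but `user` has not.
void recommend_for(double ratings[USERS][ITEMS], int user)
{
    int best = -1;
    double best_sim = -1;
    for (int u = 0; u < USERS; u++)
    {
        if (u == user) continue;
        double s = similarity(ratings[user], ratings[u]);
        if (s > best_sim) { best_sim = s; best = u; }
    }
    for (int i = 0; i < ITEMS; i++)
        if (ratings[user][i] == 0 && ratings[best][i] > 0)
            printf("recommend item %d to user %d (from user %d)\n", i, user, best);
}

int main(void)
{
    // Hypothetical ratings: rows are users, columns are items, 0 = unrated.
    double ratings[USERS][ITEMS] = {
        {5, 3, 0, 1, 0},
        {4, 0, 0, 1, 2},
        {1, 1, 0, 5, 4},
        {0, 3, 4, 1, 0}
    };
    recommend_for(ratings, 0);
    return 0;
}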

#codingexercise
Get cosine distance between two vectors

// Cosine distance between two vectors: 1 - cosine similarity, so that smaller
// values mean more similar vectors (the k-means routines minimize this value).
double get_cosine_distance(int dimension, double* vec1, double* vec2)
{
    double dotProduct = 0;
    double magnitude1 = 0;
    double magnitude2 = 0;
    for (int i = 0; i < dimension; i++) dotProduct += vec1[i] * vec2[i];
    for (int i = 0; i < dimension; i++) magnitude1 += vec1[i] * vec1[i];
    for (int i = 0; i < dimension; i++) magnitude2 += vec2[i] * vec2[i];
    // Guard against zero-length vectors; treat them as maximally distant.
    if (magnitude1 == 0 || magnitude2 == 0) return 1;
    return 1 - dotProduct / (sqrt(magnitude1) * sqrt(magnitude2));
}

Sunday, November 5, 2017

#classifier
another way to do kmeans : cexamples/classifier.c
but unit-tests are missing -sigh

Yesterday we discussed data virtualization, which is helpful for exploring and viewing data. In fact, visualization is an important functional area for software development, and many tools are written and developed to find knowledge in vast sets of data.
Today we explore data visualization. This is what distinguishes Data Mining from machine learning.
While machine learning uses concepts such as supervised and unsupervised classifiers, it can be understood as a set of algorithms. Data Mining on the other hand uses those and other algorithms in conjunction with a database so that the data can be queried to yield the result set that summarizes the findings. These result sets can then be drawn on charts and represented on dashboards.
Yet data mining and machine learning are separate domains in themselves. Machine learning may find use with text analysis, images and other static data that is not represented in tables. Data Mining, on the other hand, translates most data into something that can be stored in a database, and this has worked well for organizations that want to safeguard their data. Moreover, we can view the difference as a top-down versus bottom-up view. For example, when we use statistics to build a regression model, we are binding different parameters together to mean something and tuning the model with experimental data. A machine learning algorithm, on the other hand, builds something like a decision tree classifier from the data as it is made available. The output from a machine learning algorithm may be input for a data mining process. Some machine learning algorithms are forms of batch processing, while data mining techniques may be applied in a streaming manner.
Both data mining and machine learning tools have been domain specific, such as in the finance, retail or telecommunications industries. These tools integrate the domain-specific knowledge with data analysis techniques to answer usually very specific queries.
Tools are evaluated on data types, system issues, data sources, data mining functions, coupling with a database or data warehouse, scalability, visualization and user interface. Among these, visual data mining is popular for its designer-style user interface that renders data, results and process in a graphical and usually interactive presentation.
Visualization tools such as the Grafana stack, used for viewing elaborate charts and eye candy, only require read permissions on the data, as they execute queries to fetch the data for making the charts.


Saturday, November 4, 2017

Data Virtualization deep dive
Data evolves over time and with the introduction of new processes. As data ages, it becomes difficult to re-organize. In some cases, the data is actively used by the business, which may not even permit a downtime. Moreover, as data grows, it may be repurposed with changing requirements. As more and more departments and organizations visit the data, it may require separation of concerns. For example, an organization may want to see a customer's identity but not his or her credit cards. Similarly, another might want to see the items purchased by a customer but not the shipping addresses. Data also explodes at a phenomenal rate, and once it starts accruing it does not stop until the business shuts down.
Organizations grapple to tame the data with compartmentalized databases. Databases are convenient for storing data because they ensure atomicity, consistency, isolation and durability. They are also extremely performant and efficient in how the data is stored physically and accessed over the web. By separating databases for different purposes, companies try to be nimble and reduce the time to release operations to production. However, this is merely suited for expediting new offerings to market; it does not handle data analysis and insights. Consequently, data is staged from operations for loading into a warehouse, which is better suited to gathering all the data for analysis. Even then the warehouses proliferate. In addition, workflows that extract-transform-load the data between operational databases are found reusable for different databases, which makes more copies of the data. Syntax and semantics vary for the same entity from database to database. Databases also become distributed and separated across regions, requiring the use of web services to pull and process data.
There are many types of databases used by companies because they serve different purposes. A relational database organizes data for efficient querying. A NoSQL database organizes data for large-scale distributed batch processing. A graph database persists many forms of relationships between entities. Databases fragment the view of data from the perspective of the business domain. This calls for a unified experience regardless of where or how the data is stored. Data Virtualization tries to address this with consistent, wholesome, unified views and manipulation. It introduces a platform and tool that abstracts away the real topology of how the data is organized.
The word virtual indicates that we are no longer looking at the physical representation but at the semantics. With data virtualization, we can explore and discover related information. We can also view the entire collection of databases as a unified repository. The actual data source may not just be a database; it could be a database, a data warehouse, an Online Analytical Processing application, web services, Software-as-a-Service, a NoSQL database or any mix of these.
A certain degree of consolidation and consistency is preferred by data virtualization users. It is easier to query something with the same syntax rather than have to change it over and over again. Even though virtualization may aim to span a vast breadth of technologies and software stacks, it cannot be a panacea. Therefore, virtualization runs the risk of being fragmented just like databases. Some have taken this a step further: can each database also come with its own logic, granular enough to be made available over the web? In other words, can each data source be a service in itself, so that databases and data virtualization are no longer the frontend for users, and users can instead mix and match different data sources with the same programmability over the web? This so-called microservices architecture puts nice boundaries on the source of truth and still manages to hide the complexity of a farm or a cluster behind the service. While services are great for programmers, they are not intended for users who want to visually work with the data using a tool. Therefore, data virtualization has moved even closer to the user by pushing the microservices down as sources of data. Finally, data virtualization comes with immense capabilities to browse and search the data like none other.