Cluster computing

Tuesday, September 19, 2017

Today we start reviewing the slides from Stanford that introduce Natural Language Processing via Vector Semantics. This is useful to study when we wonder about the usefulness of graphs in text analysis. Its true that a vector of features is like a row with columns.Tabular data is easy to work with using conventional database and data mining techniques. Moreover, representing a problem in vector space means we can utilize all the matrix analysis techniques that we have known for a long while. In fact, new improvements may be possible as we re-discover the suitability of more and more techniques from this domain. For example, matrix Factorization, Eigen values and Eigen vectors, gaussian and laplacian are straight out of mathematical text books and continue to serve as techniques to try with NLP. Matrix mathematics is also useful to succinctly describe a problem with their use of notation for an entire matrix. Similarly, Vector space also helps visualize the problems in cartesian co-ordinates with relative or absolute point of reference. This is another form of analysis that we have long understood and continue to find it useful to explain ideas. Vector space gives us a way to describe the problem in a space where we can describe and visualize magnitude and direction. Together with matrix, vectors allows us to see transformations.
Graphs came about as an evolution from disconnected and linearly related data. When we establish linear dependencies, we immediately see dependencies that exist beyond just two instances in a linear model and contributions from neighbors other than those two. Graph also gave us methods such as traversal, centrality and pagerank that introduced us to ways we can visualize and solve problems. People who work with graph databases become so involved with seeing relationships described as edges between neighbors that they start questioning what purpose relational databases have and why are there even joins between tables. They see graph as liberating to store and analyze relationships. Graphs now can even be worked with in batch no-sql mode which lets us scale our operations like never before. In fact, graphs take a long time to create but once they are created, they can serve useful analysis in a fraction of the time that was spent on creating it. Graphs are also supported out of box from graph databases and work very well with their techniques. Many packages for analysis also make it popular to work with graphs.
Vectors can be transformed into graphs based on similarity between vectors. Depending on the strength of similarity, we may draw edges between the nodes that represent the vectors. This gives us a way to determine how important a vector among its neighbors - a very useful concept if we have to select a few vectors from the many. Still the selection can also work with representing vectors as tabular rows and a classifier that can issue tag. So while these different techniques may perform similar tasks, they are not mutually exclusive to each other and can even work with one another. We will start reading on how vector representation improves the notion of what the nodes or entities are.
#coding exercise
Find the largest rectangular area in a histogram.
Each bar of the histogram may serve as the start of a rectangle. So the top left corner may act as the top left of a rectangle whose width may increment over other bars as long as they allow it. Therefore for each bar in the histogram, we try to draw a rectangle starting at the top left and take the maximum by enclosing area of all the rectangles formed.

Cluster computing

Tuesday, September 19, 2017

No comments:

Post a Comment