Thursday, November 21, 2013

I came across a book Algorithms for clustering data by Jain and Dubes. I want to start posting from that book. But first I want to jump to a section towards the end of the book which talks about Graph theory. As we may know, a graph has a finite set of nodes and they can represent objects being clustered.  A set of edges records the interactions between pairs of vertices. The vertices and edges are related by a function f that maps edges into unordered pairs of vertices. we discuss finite undirected graphs when we say unordered pairs of vertices. Therefore a graph is a tuple V, E, f.
Some applications require that labels or weights be assigned to the vertices or to the edges.
As an example, the edges could be labeled by the distance between patterns representing the objects. There are some common definitions as well. A subgraph is a subset of nodes and vertices of the original graph where there are no edges or vertices that are not in the original graph. A path in graph is an alternating sequence of vertices and edges that contains no repeated edges and no repeated vertices and for which edge ei is incident to vertices vi and vi+1 for i = 1 to n.
 A graph is connected if a path exists between any two vertices in the graph.
 A component is a maximal piece of a connected graph. Maximal means as many nodes as possible. So a component is not a proper subgraph. of another graph because an edge may be drawn where it didn't exist in the original graph.  A graph is complete if an edge is assigned to every possible pair of nodes. A complete graph on n nodes contains exactly n(n-1)/2 edges.
 There are some interesting observations as well. A maximal complete subgraph of G is a complete subgraph of G that is not a proper subgraph of any other complete subgraph of G. In other words, H is a complete subgraph of G if an edge is incident to every pair of vertices in H but no additional vertices of G can be added to H without destroying the completeness.  A maximal complete subgraph is also called a clique.
A cycle is the same as a path except that vertices v1 and vn are the same vertex.  A tree is a connected graph with no cycles. If a subgraph has m vertices, a tree containing these vertices has m-1 edges. A spanning tree is a tree containing all vertices in the graph. The weight of a tree is the sum of the weights of the edges when the weights represent dissimilarity or ratio scale. A minimal spanning tree is the one with the least weight  among all other spanning trees.
A number of algorithms are available for finding the minimum spanning tree. MSTs of complete graph are important in cluster analysis.
Prim's algorithm is easy to try out by hand. The edges are ranked by their weights and insert the edges by order of dissimilarity. The shortest is included first until a tree is formed. We prevent any cycles when building an MST.
A threshold graph is one which has n labeled nodes and N unlabeled edges. A random graph is one which has edges inserted at random. A rank graph is a threshold graph in which the edges are labeled. N nodes can be sampled from n(n-1)/2 edges without replacement or ordering and combined with N! to get the number of rank graphs.

No comments:

Post a Comment