Thursday, November 20, 2014

This is a writeup on how to collect and visualize data that indicates who talks to whom and represent it as a neat little graph alongside your favorite application with which you browse this communication.  Let us take an example of an e-mail application that has mails sent to you.  Inevitably you will be in all of the e-mails. That’s what pigeonholing might do to you. But even with this data, you may be seeing mails where you were included just to be in the loop. These from and to labeled people may give indications not only to who is chatty but also whose mail is more important than others especially when it comes from outside your organization. Although this is a complex domain and it has many subjective variables, we can begin with this use case.

Let us say we iterate through our list of mails and populate the following table:
-       - - - - - - - - - - -
-       From |    To    |
-       - - - - - - - - - - -
-       A              B         
-       B               C            
-       D               E             
:
:

This may start to look like a graph where we can quickly form nodes with participants such as A, B, C, D and E. We also form edges between them for each entry that we make in this table. Notice that the edges have weights and they are directed.

We can write this graph as an adjacency list or matrix. We could prefer Matrix but for our purposes we simply want to illustrate a visualization of the information we have learned.  Notice though that although we have decided to layout a graph, it can be drawn in several different ways. Some can be messier than others. Perhaps a better way to draw them would be to have them spaced apart and try not to have the edges cross over too much. This will make it easier to see the graph.

Now let us try the same with this graph. We have a technique by which we can optimize this graph layout.

# The following program lays out a graph with little or no crossing lines.
# This is adapted from a sample in "Programming Collective Intelligence" by OReilly Media

from PIL import Image, ImageDraw
import math
import random

vertex = ['A','B','C','D','E']
links=[('A', 'B'),
('B', 'C'),
('C', 'D'),
('D', 'E'),
('E', 'A'),
('C', 'E'),
('A', 'D'),
('E', 'B')]
domain=[(10,370)]*(len(vertex)*2)

def randomoptimize(domain,costf):
    best=999999999
    bestr=None
    for i in range(1000):
        # Create a random solution
        r=[random.randint(domain[i][0],domain[i][1]) for i in range(len(domain))]
        # Get the cost
        cost=costf(r)
        # Compare it to the best one so far
        if cost<best:
            best=cost
            bestr=r
    return r

def annealingoptimize(domain,costf,T=10000.0,cool=0.95,step=1):
    # Initialize the values randomly
    vec=[float(random.randint(domain[i][0],domain[i][1]))
         for i in range(len(domain))]

    while T>0.1:
        # Choose one of the indices
        i=random.randint(0,len(domain)-1)
        # Choose a direction to change it
        dir=random.randint(-step,step)
        # Create a new list with one of the values changed
        vecb=vec[:]
        vecb[i]+=dir
        if vecb[i]<domain[i][0]: vecb[i]=domain[i][0]
        elif vecb[i]>domain[i][1]: vecb[i]=domain[i][1]

        # Calculate the current cost and the new cost
        ea=costf(vec)
        eb=costf(vecb)
        p=pow(math.e,(-eb-ea)/T)
        # Is it better, or does it make the probability
        # cutoff?
        if (eb<ea or random.random( )<p):
            vec=vecb

        # Decrease the temperature
        T=T*cool
    return vec

def crosscount(v):
    # Convert the number list into a dictionary of person:(x,y)
    loc=dict([(vertex[i],(v[i*2],v[i*2+1])) for i in range(0,len(vertex))])
    total=0

    # Loop through every pair of links
    for i in range(len(links)):
      for j in range(i+1,len(links)):

        # Get the locations
        (x1,y1),(x2,y2)=loc[links[i][0]],loc[links[i][1]]
        (x3,y3),(x4,y4)=loc[links[j][0]],loc[links[j][1]]
       
        den=(y4-y3)*(x2-x1)-(x4-x3)*(y2-y1)
       
        # den==0 if the lines are parallel
        if den==0: continue

        # Otherwise ua and ub are the fraction of the
        # line where they cross
        ua=((x4-x3)*(y1-y3)-(y4-y3)*(x1-x3))/den
        ub=((x2-x1)*(y1-y3)-(y2-y1)*(x1-x3))/den
       
        # If the fraction is between 0 and 1 for both lines
        # then they cross each other
        if ua>0 and ua<1 and ub>0 and ub<1:
            total+=1

    for i in range(len(vertex)):
        for j in range(i+1,len(vertex)):
          # Get the locations of the two nodes
          (x1,y1),(x2,y2)=loc[vertex[i]],loc[vertex[j]]

          # Find the distance between them
          dist=math.sqrt(math.pow(x1-x2,2)+math.pow(y1-y2,2))
          # Penalize any nodes closer than 50 pixels
          if dist<50:
            total+=(1.0-(dist/50.0))
    return total
                       
def drawnetwork(loc):
    #create the image
    img = Image.new('RGB', (400,400),(255,255,255))
    draw=ImageDraw.Draw(img)
    #create the position dict
    pos=dict([(vertex[i],(loc[i*2],loc[i*2+1])) for i in range(0, len(vertex))])
    #Draw Links
    for (a,b) in links:
        draw.line((pos[a],pos[b]), fill=(255,0,0))
    #Draw vertex
    for (n,p) in pos.items():
        draw.text(p,n,(0,0,0))
    img.save('graph.jpg', 'JPEG')
    img.show()

sol=randomoptimize(domain,crosscount)
crosscount(sol)
sol=annealingoptimize(domain,crosscount,step=50,cool=0.99)
crosscount(sol)
drawnetwork(sol)

Now this draws the graph as :



This is a neat little graph and shows that the participant A is more popular.

No comments:

Post a Comment