This is a writeup on how to collect and visualize data that
indicates who talks to whom and represent it as a neat little graph alongside
your favorite application with which you browse this communication. Let us take an example of an e-mail
application that has mails sent to you.
Inevitably you will be in all of the e-mails. That’s what pigeonholing might
do to you. But even with this data, you may be seeing mails where you were
included just to be in the loop. These from and to labeled people may give
indications not only to who is chatty but also whose mail is more important
than others especially when it comes from outside your organization. Although
this is a complex domain and it has many subjective variables, we can begin
with this use case.
Let us say we iterate through our list of mails and populate
the following table:
-
- - - - - - - - - - -
-
From |
To |
-
- - - - - - - - - - -
-
A
B
-
B
C
-
D
E
:
:
This may start to look like a graph where we can quickly
form nodes with participants such as A, B, C, D and E. We also form edges between
them for each entry that we make in this table. Notice that the edges have
weights and they are directed.
We can write this graph as an adjacency list or matrix. We
could prefer Matrix but for our purposes we simply want to illustrate a
visualization of the information we have learned. Notice though that although we have decided
to layout a graph, it can be drawn in several different ways. Some can be
messier than others. Perhaps a better way to draw them would be to have them
spaced apart and try not to have the edges cross over too much. This will make
it easier to see the graph.
Now let us try the same with this graph. We have a technique
by which we can optimize this graph layout.
# The following program lays out a graph with little or no
crossing lines.
# This is adapted from a sample in "Programming
Collective Intelligence" by OReilly Media
from PIL import Image, ImageDraw
import math
import random
vertex = ['A','B','C','D','E']
links=[('A', 'B'),
('B', 'C'),
('C', 'D'),
('D', 'E'),
('E', 'A'),
('C', 'E'),
('A', 'D'),
('E', 'B')]
domain=[(10,370)]*(len(vertex)*2)
def randomoptimize(domain,costf):
best=999999999
bestr=None
for i in
range(1000):
# Create a
random solution
r=[random.randint(domain[i][0],domain[i][1]) for i in
range(len(domain))]
# Get the cost
cost=costf(r)
# Compare it
to the best one so far
if
cost<best:
best=cost
bestr=r
return r
def annealingoptimize(domain,costf,T=10000.0,cool=0.95,step=1):
# Initialize the
values randomly
vec=[float(random.randint(domain[i][0],domain[i][1]))
for i in
range(len(domain))]
while T>0.1:
# Choose one
of the indices
i=random.randint(0,len(domain)-1)
# Choose a
direction to change it
dir=random.randint(-step,step)
# Create a new
list with one of the values changed
vecb=vec[:]
vecb[i]+=dir
if
vecb[i]<domain[i][0]: vecb[i]=domain[i][0]
elif
vecb[i]>domain[i][1]: vecb[i]=domain[i][1]
# Calculate
the current cost and the new cost
ea=costf(vec)
eb=costf(vecb)
p=pow(math.e,(-eb-ea)/T)
# Is it
better, or does it make the probability
# cutoff?
if (eb<ea
or random.random( )<p):
vec=vecb
# Decrease the
temperature
T=T*cool
return vec
def crosscount(v):
# Convert the
number list into a dictionary of person:(x,y)
loc=dict([(vertex[i],(v[i*2],v[i*2+1])) for i in range(0,len(vertex))])
total=0
# Loop through
every pair of links
for i in
range(len(links)):
for j in
range(i+1,len(links)):
# Get the
locations
(x1,y1),(x2,y2)=loc[links[i][0]],loc[links[i][1]]
(x3,y3),(x4,y4)=loc[links[j][0]],loc[links[j][1]]
den=(y4-y3)*(x2-x1)-(x4-x3)*(y2-y1)
# den==0 if
the lines are parallel
if den==0:
continue
# Otherwise ua
and ub are the fraction of the
# line where
they cross
ua=((x4-x3)*(y1-y3)-(y4-y3)*(x1-x3))/den
ub=((x2-x1)*(y1-y3)-(y2-y1)*(x1-x3))/den
# If the
fraction is between 0 and 1 for both lines
# then they
cross each other
if ua>0 and
ua<1 and ub>0 and ub<1:
total+=1
for i in
range(len(vertex)):
for j in
range(i+1,len(vertex)):
# Get the
locations of the two nodes
(x1,y1),(x2,y2)=loc[vertex[i]],loc[vertex[j]]
# Find the
distance between them
dist=math.sqrt(math.pow(x1-x2,2)+math.pow(y1-y2,2))
# Penalize
any nodes closer than 50 pixels
if
dist<50:
total+=(1.0-(dist/50.0))
return total
def drawnetwork(loc):
#create the image
img =
Image.new('RGB', (400,400),(255,255,255))
draw=ImageDraw.Draw(img)
#create the
position dict
pos=dict([(vertex[i],(loc[i*2],loc[i*2+1])) for i in range(0,
len(vertex))])
#Draw Links
for (a,b) in
links:
draw.line((pos[a],pos[b]), fill=(255,0,0))
#Draw vertex
for (n,p) in
pos.items():
draw.text(p,n,(0,0,0))
img.save('graph.jpg', 'JPEG')
img.show()
sol=randomoptimize(domain,crosscount)
crosscount(sol)
sol=annealingoptimize(domain,crosscount,step=50,cool=0.99)
crosscount(sol)
drawnetwork(sol)
Now this draws the graph as :
This is a neat little graph and shows that the participant A
is more popular.
No comments:
Post a Comment