Tuesday, July 16, 2019

#codingexercise
A method to classify words into groups by their synonyms:
def classify_synonyms():
    # Each word maps to a list of its synonyms; a word joins a group when
    # the group's label appears among the word's synonyms.
    words = [{'cat': ['animal', 'feline']},
             {'dog': ['animal', 'lupus']},
             {'dolphin': ['fish', 'pisces']},
             {'spider': ['insect', 'arachnid']}]
    groups = []
    for item in words:
        merged = False
        for index, group in enumerate(groups):
            label = next(iter(group))
            for synonyms in item.values():
                if label in synonyms:
                    # Merge the word into the matching group.
                    groups[index] = {label: group[label] + list(item.keys())}
                    merged = True
        if not merged:
            # No existing group matched; start a new group labeled
            # with the word's first synonym.
            word = next(iter(item.keys()))
            synonyms = next(iter(item.values()))
            groups.append({synonyms[0]: [word]})
    print(groups)
classify_synonyms()
#[{'animal': ['cat', 'dog']}, {'fish': ['dolphin']}, {'insect': ['spider']}] 
The above method merely classifies the input to the first level of grouping. It does not factor in multiple matches between synonyms, selection of the best match among the synonyms, unavailability of synonyms, unrelated words, or unrelated synonyms. The purpose is only to show that, given a criterion for selecting a group, the words can be merged. The output of the first level can then be taken as the input to the second level, that level merged in turn, and so on until a dendrogram emerges, as sketched below.
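As a hint of how the next level might form, here is a minimal sketch. The synonyms_of lookup and the group_level helper (a variant of the method above that returns groups instead of printing them, using a simplified first-synonym criterion) are illustrative assumptions, not part of the original exercise.

def group_level(words):
    # Group each word under its first synonym and return the groups.
    groups = {}
    for item in words:
        for word, synonyms in item.items():
            groups.setdefault(synonyms[0], []).append(word)
    return [{label: members} for label, members in groups.items()]

def synonyms_of(label):
    # Hypothetical lookup for the synonyms of a group label.
    lookup = {'animal': ['organism'], 'fish': ['organism']}
    return lookup.get(label, [label])

level1 = group_level([{'cat': ['animal']}, {'dog': ['animal']}, {'dolphin': ['fish']}])
level2 = group_level([{label: synonyms_of(label)} for group in level1 for label in group])
print(level2)  # [{'organism': ['animal', 'fish']}]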
Given this dendrogram, the edge distance between two words can serve as a distance metric for semantic similarity; a sketch follows this paragraph.
Since this hierarchical classification is performed only on the finite number of input words in a text, its cost is bounded by O(n log n), assuming a fixed upper bound on the cost of each merge.
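Here is a minimal sketch of that edge distance, assuming the dendrogram is stored as child-to-parent links; the parent map below is hypothetical.

def edge_distance(a, b, parent):
    # Distance is the number of edges from each word up to their
    # lowest common ancestor.
    def path_to_root(n):
        path = [n]
        while n in parent:
            n = parent[n]
            path.append(n)
        return path
    path_a, path_b = path_to_root(a), path_to_root(b)
    ancestors = set(path_a)
    for hops_b, n in enumerate(path_b):
        if n in ancestors:
            return path_a.index(n) + hops_b
    return None  # the words are in disconnected groups

parent = {'cat': 'animal', 'dog': 'animal', 'dolphin': 'fish'}
print(edge_distance('cat', 'dog', parent))  # 2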

For example, once each group points to its parent, the level of every entry in the hierarchy can be computed by a memoized walk up the parent links. The dataframe below is assumed for illustration.

import pandas as pd

# GroupID holds the parent row's ID; 0 marks the root.
df = pd.DataFrame({'ID': [1, 2, 3, 4],
                   'Group': ['animal', 'cat', 'dog', 'feline'],
                   'GroupID': [0, 1, 1, 2]})

def nlevel(id, group_dict=df.set_index('ID').GroupID, _cache={0: 0}):
    # A node sits one edge below its parent; cache results so shared
    # ancestors are computed only once.
    if id in _cache:
        return _cache[id]
    _cache[id] = 1 + nlevel(group_dict[id], group_dict)
    return _cache[id]

df['nLevel'] = df.ID.map(nlevel)
print(df[['nLevel', 'ID', 'Group']])  # nLevel: 1, 2, 2, 3

It is also possible to apply PageRank, since the synonyms connect the words with edges. The recursive pseudocode is rewritten below to use the scores from the previous iteration, because direct recursion would not terminate on a cyclic synonym graph; the damping factor comes from the standard PageRank formulation.

def pagerank(u, constant, damping, adjacencies, number_of_links, scores):
    # One update of u's rank: a constant teleport term plus the damped
    # rank each neighbor v passes along, spread over v's outgoing links.
    total = constant
    for v in adjacencies(u):
        total += damping * scores[v] / number_of_links(v)
    return total
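A minimal driver for this update, iterated to a fixed point over a small assumed synonym graph (the graph, the damping value of 0.85, and the iteration count are illustrative):

graph = {'cat': ['dog', 'feline'], 'dog': ['cat'], 'feline': ['cat']}
scores = {w: 1.0 / len(graph) for w in graph}  # start from a uniform rank
for _ in range(20):
    scores = {u: pagerank(u, 0.15 / len(graph), 0.85,
                          lambda x: graph[x], lambda x: len(graph[x]),
                          scores)
              for u in graph}
print(scores)  # 'cat' ends with the highest rank: both other words link to it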
