Monday, July 1, 2019

Today we continue discussing the data structure for storing a thesaurus.
Earlier we described a hierarchical representation of words based on their synonyms, built with a recursive CTE, as a way of establishing clusters. Here we note that the hierarchy level is incremented each time words are merged. Two words can be merged into the same group only when their synonym lists share a common term or, if such extended processing is permitted, when their synonyms are within a threshold degree of separation.
def classify_synonyms():
    # Each entry maps a word to its list of synonyms.
    words = [{'cat': ['animal', 'feline']},
             {'dog': ['animal', 'lupus']},
             {'dolphin': ['fish', 'pisces']},
             {'spider': ['insect', 'arachnid']}]
    groups = []
    for item in words:
        if item not in groups:
            merged = False
            for key in groups:
                group = next(iter(key))            # the group's label
                for value in item.values():
                    if group in value:             # a synonym matches the label
                        index = groups.index(key)
                        members = []
                        for v in groups[index].values():
                            members += v
                        members += item.keys()     # add the word to the group
                        groups[index] = {group: members}
                        merged = True
            if not merged:
                # Start a new group labeled with the word's first synonym.
                k = next(iter(item.keys()))
                v = next(iter(item.values()))
                groups.append({v[0]: [k]})
    print(groups)

classify_synonyms()
# [{'animal': ['cat', 'dog']}, {'fish': ['dolphin']}, {'insect': ['spider']}]
The method above merely classifies the input to the first level of grouping. It does not handle multiple matches between synonyms, selection of the best match among the synonyms, unavailability of synonyms, unrelated words, or unrelated synonyms. The purpose is only to show that, given a criterion for selecting a group, the words can be merged. The output of the first level can then be taken as the input of the second level, the second level can be merged in the same way, and so on until a dendrogram emerges.
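The repeated merging across levels can be sketched as follows. This is an illustrative sketch, not the post's code: the merge criterion is simplified to "two groups share a member", and the function, name, and sample input are all hypothetical.

```python
# Hedged sketch: repeatedly merge groups that share a member, keeping the
# grouping produced at each level, until no two groups overlap.
def merge_levels(groups):
    history = [list(groups)]
    while True:
        merged, used = [], set()
        for i, g in enumerate(groups):
            if i in used:
                continue
            combined = set(g)
            for j in range(i + 1, len(groups)):
                if j not in used and combined & set(groups[j]):
                    combined |= set(groups[j])   # absorb the overlapping group
                    used.add(j)
            merged.append(sorted(combined))
        if len(merged) == len(groups):           # fixed point: nothing merged
            return history
        groups = merged
        history.append(list(groups))

# e.g. first-level clusters feeding a second level of merging:
print(merge_levels([['cat', 'dog'], ['dog', 'wolf'], ['dolphin']]))
```

Each entry of the returned history is one level of the hierarchy, so the list as a whole is a flat encoding of the dendrogram.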
Given this dendrogram, edge distance can serve as the distance metric for semantic similarity.
Since this hierarchical classification is performed only over the finite number of input words in a text, its cost is bounded at O(n log n), assuming a fixed upper cost for each merge.
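As an illustration of edge distance over such a dendrogram: the `edge_distance` helper and the parent map below are hypothetical, assuming the tree is stored as a child-to-parent mapping.

```python
# Hedged sketch: edge distance between two leaves is the number of edges on
# the path through their lowest common ancestor.
def edge_distance(a, b, parent):
    def ancestors(node):
        path = [node]
        while node in parent:            # walk up to the root
            node = parent[node]
            path.append(node)
        return path
    pa, pb = ancestors(a), ancestors(b)
    pb_set = set(pb)
    for depth, node in enumerate(pa):    # first shared ancestor is the LCA
        if node in pb_set:
            return depth + pb.index(node)
    return None                          # nodes lie in different trees

# 'cat' and 'dog' join at 'animal'; 'dolphin' only joins them at the root.
parent = {'cat': 'animal', 'dog': 'animal', 'dolphin': 'fish',
          'animal': 'root', 'fish': 'root'}
print(edge_distance('cat', 'dog', parent))      # 2 edges via 'animal'
print(edge_distance('cat', 'dolphin', parent))  # 4 edges via 'root'
```

Words that merge at a lower level of the dendrogram are thus scored as semantically closer than words that only meet near the root.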

The level of each node in such a hierarchy can be computed recursively from a child-to-parent mapping. The snippet below assumes a pandas DataFrame df with an ID column and a GroupID column giving each node's parent (0 for the root); the original version never wrote to its cache, so the memoization is fixed here:

def nlevel(id, group_dict=df.GroupID, _cache={0: 0}):
    if id in _cache:
        return _cache[id]
    # walk up to the parent and memoize the result
    _cache[id] = 1 + nlevel(group_dict[id], group_dict)
    return _cache[id]

df['nLevel'] = df.ID.map(nlevel)

print(df[['nLevel', 'ID', 'Group']])
