Wednesday, May 26, 2021

Decision Tree modeling on API Error root cause analysis   

Problem statement: Given a method to collect root causes from the many data points that API error codes leave in logs, can we determine the relief time?

   

Solution: There are two stages to solving this problem:  

1. Stage 1 – discover the root cause and create a summary to capture it

2. Stage 2 – use decision tree modeling to determine the relief time

 

Stage 1:  

The first stage involves a data pipeline that converts each log entry into a JSON object holding the request and response details, status code, error message, remote server info, and query and request parameters, and then hashes these objects into buckets. Once the dictionaries have been collected from a batch of log entries, we can transform them into a vector representation, using the notable request-response pairs as features. We can then generate the weight matrix for the hidden layer of a neural network.

We use that hidden layer to determine salience using the gradient descent method.

   

All values lie within the [0, 1] co-occurrence probability range.
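The bucketing and vectorization steps described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the field names (`status_code`, `error_message`, `remote_server`, `path`) are hypothetical because the post does not give the log schema, and relative bucket frequency is used as a simple stand-in for the co-occurrence probabilities.

```python
import hashlib
import json
from collections import Counter
from typing import Dict, List, Tuple


def bucket_key(entry: Dict) -> str:
    """Hash the fields that characterize an error into a stable bucket id."""
    # Hypothetical field names: the real pipeline's schema is not given.
    signature = {
        "status_code": entry.get("status_code"),
        "error_message": entry.get("error_message"),
        "remote_server": entry.get("remote_server"),
        "path": entry.get("path"),
    }
    # sort_keys makes the serialization, and therefore the hash, deterministic.
    canonical = json.dumps(signature, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


def vectorize(entries: List[Dict]) -> Tuple[List[str], List[float]]:
    """Return the bucket ids (features) and their relative frequencies."""
    counts = Counter(bucket_key(entry) for entry in entries)
    features = sorted(counts)
    total = sum(counts.values())
    # Each component is a frequency, so it falls in the [0, 1] range.
    return features, [counts[f] / total for f in features]
```

Entries that share the same signature fall into the same bucket, so a batch dominated by one failure mode produces a vector with one large component.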

   

The solution to the quadratic form representing the embeddings is found by arriving at its minimum, which is the point satisfying Ax = b, using the conjugate gradient method.
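For reference, the quadratic form usually associated with the conjugate gradient method can be written as below. This assumes A is symmetric and positive definite, which the post does not state but which the method requires.

```latex
f(x) = \tfrac{1}{2}\, x^{T} A x - b^{T} x,
\qquad
\nabla f(x) = A x - b
```

Setting the gradient to zero gives Ax = b, so minimizing f and solving the linear system are the same problem.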

We are given an input matrix A, a vector b, a starting value x, a maximum number of iterations i-max, and an error tolerance epsilon < 1.

   

The method proceeds as follows:

set i to 0

set residual to b - Ax

set search-direction to residual

set delta-new to dot-product(residual-transposed, residual)

set delta-0 to delta-new

while i < i-max and delta-new > epsilon^2 * delta-0 do:

    q = dot-product(A, search-direction)

    alpha = delta-new / dot-product(search-direction-transposed, q)

    x = x + alpha * search-direction

    if i is divisible by 50

        residual = b - Ax

    else

        residual = residual - alpha * q

    delta-old = delta-new

    delta-new = dot-product(residual-transposed, residual)

    beta = delta-new / delta-old

    search-direction = residual + beta * search-direction

    i = i + 1
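The steps above translate fairly directly into Python. The following is a minimal sketch using plain lists, assuming A is a symmetric positive-definite matrix given as a list of rows. The helper names `dot` and `mat_vec` and the default values for `i_max` and `epsilon` are illustrative choices, not taken from the post.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def mat_vec(A, v):
    """Multiply matrix A (a list of rows) by vector v."""
    return [dot(row, v) for row in A]


def conjugate_gradient(A, b, x, i_max=1000, epsilon=1e-8):
    """Approximately solve Ax = b, assuming A is symmetric positive definite."""
    x = list(x)
    residual = [bi - axi for bi, axi in zip(b, mat_vec(A, x))]
    direction = list(residual)
    delta_new = dot(residual, residual)
    delta_0 = delta_new
    i = 0
    while i < i_max and delta_new > epsilon ** 2 * delta_0:
        q = mat_vec(A, direction)
        alpha = delta_new / dot(direction, q)
        x = [xi + alpha * di for xi, di in zip(x, direction)]
        if i % 50 == 0:
            # Recompute the exact residual periodically to shed rounding error.
            residual = [bi - axi for bi, axi in zip(b, mat_vec(A, x))]
        else:
            residual = [ri - alpha * qi for ri, qi in zip(residual, q)]
        delta_old = delta_new
        delta_new = dot(residual, residual)
        beta = delta_new / delta_old
        direction = [ri + beta * di for ri, di in zip(residual, direction)]
        i += 1
    return x
```

For example, `conjugate_gradient([[4.0, 1.0], [1.0, 3.0]], [1.0, 2.0], [0.0, 0.0])` should return a vector close to the exact solution (1/11, 7/11).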

   

Root cause capture – API error summaries that are captured from various sources and appear in the logs can be stack hashed. The root cause can be described by a specific summary, the point in time associated with it, the duration over which it appears, and the time at which a fix was introduced, if known.
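One way to hold such a record is sketched below. The class and field names are hypothetical, and the definition of relief time as the interval between first sighting and the fix is an assumption made for illustration, since the post does not define it precisely.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


def stack_hash(frames: List[str]) -> str:
    """Hash the ordered stack frames so identical traces share one id."""
    joined = "\n".join(frame.strip() for frame in frames)
    return hashlib.sha1(joined.encode("utf-8")).hexdigest()


@dataclass
class RootCause:
    summary: str                         # the error summary text
    stack_id: str                        # output of stack_hash
    first_seen: datetime                 # point in time of the summary
    duration: timedelta                  # how long the error kept appearing
    fixed_at: Optional[datetime] = None  # time of fix, if known

    def relief_time(self) -> Optional[timedelta]:
        """Assumed definition: time from first sighting until the fix."""
        if self.fixed_at is None:
            return None
        return self.fixed_at - self.first_seen
```

Records whose fix time is known can serve as labeled examples for the second stage, while the rest are the cases to predict.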

   

Stage 2: Decision tree modeling can help predict relief time. It involves both a classification tree and a regression tree. A function divides the rows into two datasets based on the value of a specific column. Of the two lists of rows that are returned, one matches the criterion for the split while the other does not. This works well when the attribute to split on is clear.
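A minimal sketch of such a dividing function is shown below. The name `divide_set` is illustrative, and the choice to compare numeric values with `>=` and all other values with `==` is an assumption that follows common practice rather than anything stated in the post.

```python
def divide_set(rows, column, value):
    """Split rows into (matching, non_matching) on one column."""
    if isinstance(value, (int, float)):
        # Assumed rule for numbers: match when the cell is at least `value`.
        def matches(row):
            return row[column] >= value
    else:
        # Assumed rule for text and other values: match on equality.
        def matches(row):
            return row[column] == value

    matching = [row for row in rows if matches(row)]
    non_matching = [row for row in rows if not matches(row)]
    return matching, non_matching
```

Calling `divide_set(rows, 0, "500")` on rows whose first column holds a status code would separate the 500 responses from everything else.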

To see how good an attribute is, the entropy of the whole group is calculated first. Then the group is divided by the possible values of each attribute, and the entropy of the two new groups is calculated. To determine which attribute is best to divide on, the information gain is calculated; it is the difference between the current entropy and the weighted-average entropy of the two new groups. The algorithm calculates the information gain for every attribute and chooses the one with the highest value.
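The entropy and information gain calculations can be sketched as follows, assuming each row is a tuple whose last element is the outcome label. The function names are illustrative.

```python
from collections import Counter
from math import log2


def entropy(rows):
    """Shannon entropy (in bits) of the outcome labels in the last column."""
    total = len(rows)
    counts = Counter(row[-1] for row in rows)
    return -sum((c / total) * log2(c / total) for c in counts.values())


def information_gain(rows, set1, set2):
    """Current entropy minus the weighted-average entropy of the two subsets."""
    p = len(set1) / len(rows)
    return entropy(rows) - p * entropy(set1) - (1 - p) * entropy(set2)
```

A split that separates the labels perfectly recovers all of the parent's entropy as gain, while a split that leaves each subset as mixed as the parent has a gain of zero.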

Each set is subdivided only if the recursion of the above step can proceed. The recursion terminates when a solid conclusion has been reached, which is a way of saying that the information gain from splitting a node is no more than zero. Until then, the branches keep dividing, creating a tree by calculating the best attribute for each new node. If a threshold for entropy is set, the decision tree is 'pruned'.
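The recursion can be sketched as below. This is a simplified illustration, not a definitive implementation: the tree is a plain dictionary, and the hypothetical `min_gain` parameter stops splitting early when the gain is too small, which is a stand-in for pruning rather than a true post-hoc pruning pass.

```python
from collections import Counter
from math import log2


def entropy(rows):
    total = len(rows)
    counts = Counter(row[-1] for row in rows)
    return -sum((c / total) * log2(c / total) for c in counts.values())


def matches(cell, value):
    # Assumed rule: numbers compare with >=, everything else with ==.
    if isinstance(value, (int, float)):
        return cell >= value
    return cell == value


def build_tree(rows, min_gain=0.0):
    """Recursively split rows on the attribute with the highest gain."""
    current = entropy(rows)
    best_gain, best = 0.0, None
    for column in range(len(rows[0]) - 1):  # last column is the target
        for value in {row[column] for row in rows}:
            set1 = [r for r in rows if matches(r[column], value)]
            set2 = [r for r in rows if not matches(r[column], value)]
            if not set1 or not set2:
                continue
            p = len(set1) / len(rows)
            gain = current - p * entropy(set1) - (1 - p) * entropy(set2)
            if gain > best_gain:
                best_gain, best = gain, (column, value, set1, set2)
    # Stop when no split improves on the current entropy by more than min_gain.
    if best is None or best_gain <= min_gain:
        return {"leaf": dict(Counter(row[-1] for row in rows))}
    column, value, set1, set2 = best
    return {
        "column": column,
        "value": value,
        "true": build_tree(set1, min_gain),
        "false": build_tree(set2, min_gain),
    }


def classify(tree, observation):
    """Walk the tree and return the label counts at the leaf reached."""
    while "leaf" not in tree:
        branch = "true" if matches(observation[tree["column"]], tree["value"]) else "false"
        tree = tree[branch]
    return tree["leaf"]
```

Raising `min_gain` above zero yields a shallower tree, trading some accuracy on the training rows for less sensitivity to noise.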

When working with a set of tuples, it is easier to reserve the last element of each tuple for the result at every recursion level. Text and numeric data do not have to be differentiated for this algorithm to run. The algorithm takes all the existing rows and assumes the last column is the target value. The data is divided into a training set and a testing set, and usually a 70/30 split is used.
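A 70/30 split of the rows can be sketched as follows. The function name and the fixed `seed` default are illustrative choices. Shuffling before the cut is an assumption, made so that the ordering of the log entries does not bias either set.

```python
import random


def train_test_split(rows, train_fraction=0.7, seed=0):
    """Shuffle a copy of rows and cut it into (training, testing) lists."""
    shuffled = list(rows)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]
```

The tree is then built from the training rows only, and its predictions are compared against the last column of the testing rows.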

 

