Monday, June 6, 2022

Continuous root cause analysis via analysis of time-series events:  


Problem statement: Given a method to collect many data points for errors in logs, can we predict the resolution time of the next root cause?


Solution: There are two stages to solving this problem:

1. Stage 1 – discover the root cause and create a summary that captures it.

2. Stage 2 – use a time-series algorithm to predict the relief time.


Stage 1:

When exception stack traces are collected from a batch of log entries, we can transform them into a vector representation that uses the notable stack traces as features. We can then start from the hidden weight matrix that a neural network layer generates and use that hidden layer to determine salience by gradient descent.


All values lie within the [0, 1] range of co-occurrence probabilities.
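One possible reading of the co-occurrence values is sketched below: for every pair of notable stack traces, estimate the probability that both appear in the same batch of log entries. The function name `cooccurrence_matrix`, the representation of a batch as a set of stack-trace identifiers, and the choice of "same batch" as the co-occurrence window are all assumptions made for illustration; the post does not specify them.

```python
from typing import Iterable, List, Sequence, Set


def cooccurrence_matrix(batches: Iterable[Set[str]],
                        vocabulary: Sequence[str]) -> List[List[float]]:
    """Estimate P(trace i and trace j appear in the same batch).

    `batches` holds one set of stack-trace identifiers per batch of log
    entries (an assumed representation). `vocabulary` lists the notable
    stack traces used as features. Every returned value lies in [0, 1].
    """
    index = {trace: k for k, trace in enumerate(vocabulary)}
    n = len(vocabulary)
    counts = [[0] * n for _ in range(n)]
    total = 0
    for batch in batches:
        total += 1
        # Keep only the notable traces that occur in this batch.
        present = sorted({index[t] for t in batch if t in index})
        for i in present:
            for j in present:
                counts[i][j] += 1
    if total == 0:
        return [[0.0] * n for _ in range(n)]
    return [[c / total for c in row] for row in counts]
```

The diagonal entry for a trace is simply the fraction of batches in which it appears, and the matrix is symmetric by construction.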


The embeddings are represented by a quadratic form whose minimum is the solution of Ax = b, and we find that minimum with the conjugate gradient method. This requires A to be symmetric and positive-definite.

We are given the input matrix A, the vector b, a starting value x, a maximum number of iterations i-max, and an error tolerance epsilon < 1.


The method proceeds as follows:

set i to 0
set residual to b - Ax
set search-direction to residual
set delta-new to dot-product(residual-transposed, residual)
set delta-0 to delta-new
while i < i-max and delta-new > epsilon^2 * delta-0 do:
    q = dot-product(A, search-direction)
    alpha = delta-new / dot-product(search-direction-transposed, q)
    x = x + alpha * search-direction
    if i is divisible by 50:
        residual = b - Ax
    else:
        residual = residual - alpha * q
    delta-old = delta-new
    delta-new = dot-product(residual-transposed, residual)
    beta = delta-new / delta-old
    search-direction = residual + beta * search-direction
    i = i + 1
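The steps above can be written as a short Python routine. This is a minimal sketch that uses plain lists rather than a numerical library and assumes A is symmetric and positive-definite. The helper names `mat_vec` and `dot` and the parameter `refresh_every` (the interval of 50 in the steps above) are introduced here for illustration.

```python
from typing import List, Sequence

Vector = List[float]
Matrix = Sequence[Sequence[float]]


def mat_vec(A: Matrix, v: Sequence[float]) -> Vector:
    """Matrix-vector product A v."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Inner product u^T v."""
    return sum(a * b for a, b in zip(u, v))


def conjugate_gradient(A: Matrix, b: Sequence[float], x0: Sequence[float],
                       i_max: int = 1000, epsilon: float = 1e-10,
                       refresh_every: int = 50) -> Vector:
    """Solve A x = b for a symmetric positive-definite A."""
    x = list(x0)
    residual = [bi - axi for bi, axi in zip(b, mat_vec(A, x))]
    direction = list(residual)
    delta_new = dot(residual, residual)
    delta_0 = delta_new
    i = 0
    while i < i_max and delta_new > epsilon ** 2 * delta_0:
        q = mat_vec(A, direction)
        alpha = delta_new / dot(direction, q)
        x = [xi + alpha * di for xi, di in zip(x, direction)]
        if i % refresh_every == 0:
            # Recompute the exact residual to shed accumulated rounding error.
            residual = [bi - axi for bi, axi in zip(b, mat_vec(A, x))]
        else:
            # Cheaper recurrence that reuses q = A * direction.
            residual = [ri - alpha * qi for ri, qi in zip(residual, q)]
        delta_old = delta_new
        delta_new = dot(residual, residual)
        beta = delta_new / delta_old
        direction = [ri + beta * di for ri, di in zip(residual, direction)]
        i += 1
    return x
```

For example, `conjugate_gradient([[4.0, 1.0], [1.0, 3.0]], [1.0, 2.0], [2.0, 1.0])` converges to approximately `[1/11, 7/11]`. In exact arithmetic an n-by-n system needs at most n iterations.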


Root cause capture – Exception stack traces that are captured from various sources and appear in the logs can be stack hashed. A root cause can then be described by a specific stack trace, its associated point in time, the duration over which it appears, and the time at which a fix was introduced, if known.
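As an illustration of stack hashing, the sketch below reduces a stack trace to its top frames, strips file names and line numbers, and hashes the result, so that the same failure logged with different messages or from slightly different builds maps to one key. It assumes Java-style frames that begin with "at "; the function `stack_hash`, the `RootCause` record, and its field names are hypothetical and only mirror the attributes listed above.

```python
import hashlib
import re
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


def stack_hash(stack_trace: str, top_frames: int = 5) -> str:
    """Hash the top frames of a trace, ignoring messages and line numbers."""
    frames = []
    for line in stack_trace.splitlines():
        line = line.strip()
        if not line.startswith("at "):
            continue  # skip the exception message and "Caused by" lines
        # Drop the trailing "(File.java:123)" part so line shifts do not matter.
        frames.append(re.sub(r"\(.*\)$", "", line[3:]).strip())
    key = "\n".join(frames[:top_frames])
    return hashlib.sha256(key.encode("utf-8")).hexdigest()[:16]


@dataclass
class RootCause:
    """Summary record for one root cause (field names are assumptions)."""
    stack_hash: str
    first_seen: datetime
    duration: timedelta
    fixed_at: Optional[datetime] = None  # unknown until a fix is introduced
```

Two traces that differ only in their exception message or line numbers then share one `stack_hash` and can be counted as occurrences of the same root cause.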


Stage 2: A time-series algorithm needs no attributes other than the historical collection of relief times to predict the next relief time. It looks only at the scalar value, regardless of the type of an individual incident, the factors that played into its relief time, or its root cause attributes. The historical data is used to estimate the relief time of the incoming event, as if the relief times were points in a scatter plot along the timeline. Unlike other data mining algorithms, which involve additional attributes of the event, this approach applies a single auto-regressive method to the continuous data to make a short-term prediction. The regression is retrained automatically as the data accrues.
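A minimal sketch of such a predictor follows, assuming the simplest auto-regressive model, AR(1), fitted by ordinary least squares: each relief time is modeled as a constant plus a multiple of the previous one. The post does not name a specific model order or library, so the functions `fit_ar1` and `predict_next` are illustrative only.

```python
from typing import Sequence, Tuple


def fit_ar1(series: Sequence[float]) -> Tuple[float, float]:
    """Fit x[t] = c + phi * x[t-1] by least squares; return (c, phi)."""
    if len(series) < 3:
        raise ValueError("need at least three observations")
    xs = series[:-1]  # previous relief times
    ys = series[1:]   # relief times that followed them
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        # No variation in the history: fall back to predicting the mean.
        return mean_y, 0.0
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    phi = sxy / sxx
    return mean_y - phi * mean_x, phi


def predict_next(series: Sequence[float]) -> float:
    """Refit on the full history and predict the next relief time."""
    c, phi = fit_ar1(series)
    return c + phi * series[-1]
```

Calling `predict_next` again after each new relief time is appended refits the model on the longer history, which is one way to keep the regression trained as the data accrues.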
