Exception StackTrace associations for root cause analysis
Problem statement: Given a method to collect root causes from the many error data points in logs, can we determine associations between those root causes?
Solution: There are two stages to solving this problem:
Stage 1 – discover root causes and create a summary that captures each one.
Stage 2 – apply an association data mining algorithm to the root causes.
Stage 1:
The first stage is a data pipeline that converts log entries into exception stack traces and hashes them into buckets; a sample is linked at the end of this post. Once the exception stack traces are collected from a batch of log entries, we can transform them into a vector representation, using the notable stack frames as features. From these vectors we can then generate the weighted hidden-layer matrix of a neural network.
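A minimal Python sketch of this first step follows; it is not the pipeline from the linked sample. It assumes Java-style traces (an exception line followed by frame lines such as "\tat com.example.Foo.bar(Foo.java:10)"), treats frames outside the hypothetical NOISE_PREFIXES list of JDK packages as the "notable" ones, and uses a truncated SHA-256 hash of the top frames as the bucket key. All function names are illustrative.

```python
import hashlib
import re
from collections import defaultdict

# Assumption: Java-style traces, i.e. an exception line followed by frame
# lines such as "\tat com.example.Foo.bar(Foo.java:10)".
FRAME_RE = re.compile(r"^\s+at\s+([\w$.<>]+)\(")
# Hypothetical filter: JDK-internal frames are not treated as "notable".
NOISE_PREFIXES = ("java.", "javax.", "sun.", "jdk.")


def extract_traces(log_lines):
    """Attach each notable frame line to the exception line that precedes it."""
    traces, current = [], None
    for line in log_lines:
        match = FRAME_RE.match(line)
        if match:
            frame = match.group(1)  # method name only; file and line number are dropped
            if current is not None and not frame.startswith(NOISE_PREFIXES):
                current["frames"].append(frame)
        elif "Exception" in line or "Error" in line:
            current = {"header": line.strip(), "frames": []}
            traces.append(current)
        else:
            current = None  # any other log line ends the current trace
    return [t for t in traces if t["frames"]]


def stack_hash(frames, depth=10):
    """Hash the top `depth` frames so that repeats of a trace share a bucket."""
    return hashlib.sha256("\n".join(frames[:depth]).encode("utf-8")).hexdigest()[:12]


def bucket_and_vectorize(traces):
    """Group traces by stack hash and give each bucket a 0/1 frame-feature vector."""
    buckets = defaultdict(list)
    for trace in traces:
        buckets[stack_hash(trace["frames"])].append(trace)
    vocab = sorted({f for t in traces for f in t["frames"]})
    index = {frame: i for i, frame in enumerate(vocab)}
    vectors = {}
    for key, members in buckets.items():
        vec = [0] * len(vocab)
        for frame in members[0]["frames"]:  # vector built from the bucket's first member
            vec[index[frame]] = 1
        vectors[key] = vec
    return buckets, vocab, vectors
```

Hashing only the top frames keeps traces that differ deep in framework code in the same bucket; the depth of 10 is an arbitrary choice.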
We use that hidden layer to determine salience with gradient descent.
All values lie within the [0, 1] co-occurrence probability range.
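The post does not say exactly how these probabilities are computed. One simple reading, sketched below as an assumption, is the fraction of observed traces in which two frame features appear together; with one 0/1 row per trace in a matrix X of n rows, that is X^T X / n, which always lies in [0, 1].

```python
import numpy as np


def cooccurrence_probabilities(vectors):
    """vectors: one 0/1 row per observed stack trace, one column per frame feature.

    Entry (i, j) is the fraction of traces that contain both frame i and frame j;
    the diagonal is each frame's own frequency.
    """
    X = np.asarray(vectors, dtype=float)
    return (X.T @ X) / X.shape[0]
```

The result is symmetric and positive semidefinite; whether the matrix A used below is derived from it is not stated in the post.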
The embeddings are obtained by minimizing a quadratic form. When A is symmetric positive definite, the minimum of that form is the solution of Ax = b, which we find with the conjugate gradient method.
The inputs are the matrix A, the vector b, a starting value x, a maximum number of iterations i-max, and an error tolerance epsilon < 1.
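The step from the quadratic form to Ax = b can be written out as follows. This assumes A is symmetric positive definite, which conjugate gradient requires but the post does not state explicitly: the gradient vanishes exactly at the solution of Ax = b, and the residual the algorithm tracks is the negative gradient.

```latex
\begin{aligned}
f(x) &= \tfrac{1}{2}\, x^{\top} A x - b^{\top} x, \qquad A = A^{\top},\ A \succ 0 \\
\nabla f(x) &= A x - b \\
\nabla f(x^{*}) = 0 \;&\Longleftrightarrow\; A x^{*} = b \\
r &= b - A x = -\nabla f(x)
\end{aligned}
```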
The method proceeds as follows:
set i to 0
set residual to b - Ax
set search-direction to residual
set delta-new to the dot product residual-transposed . residual
set delta-0 to delta-new
while i < i-max and delta-new > epsilon^2 . delta-0 do:
    q = dot-product(A, search-direction)
    alpha = delta-new / (search-direction-transposed . q)
    x = x + alpha . search-direction
    if i is divisible by 50:
        residual = b - Ax   (recompute exactly to discard accumulated floating-point error)
    else:
        residual = residual - alpha . q
    delta-old = delta-new
    delta-new = dot-product(residual-transposed, residual)
    beta = delta-new / delta-old
    search-direction = residual + beta . search-direction
    i = i + 1
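A runnable NumPy translation of these steps is sketched below, offered as an illustration rather than the author's code. The default values for i_max and epsilon are arbitrary choices, and A is assumed to be symmetric positive definite.

```python
import numpy as np


def conjugate_gradient(A, b, x0, i_max=1000, epsilon=1e-8):
    """Solve A x = b for symmetric positive definite A.

    Returns the solution and the number of iterations performed.
    """
    b = np.asarray(b, dtype=float)
    x = np.array(x0, dtype=float)
    r = b - A @ x                  # residual = negative gradient of the quadratic form
    d = r.copy()                   # first search direction is steepest descent
    delta_new = r @ r
    delta_0 = delta_new
    i = 0
    while i < i_max and delta_new > epsilon**2 * delta_0:
        q = A @ d
        alpha = delta_new / (d @ q)    # exact line search along d
        x = x + alpha * d
        if i % 50 == 0:
            r = b - A @ x              # periodic exact recompute removes rounding drift
        else:
            r = r - alpha * q          # cheap recursive residual update
        delta_old = delta_new
        delta_new = r @ r
        beta = delta_new / delta_old
        d = r + beta * d               # next direction, A-conjugate to the previous ones
        i += 1
    return x, i


A = np.array([[4.0, 1.0], [1.0, 3.0]])
x, iterations = conjugate_gradient(A, [1.0, 2.0], [2.0, 1.0])
```

For a small dense system np.linalg.solve gives the same answer directly; conjugate gradient pays off when A is large and sparse, because it only needs matrix-vector products.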
Root cause capture – Exception stack traces captured from various sources that appear in the logs can be stack-hashed. A root cause can then be described by a specific stack trace, the point in time at which it appears, the duration over which it appears, and the time a fix was introduced, if known.
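One way to hold such a summary is a small record per stack hash, sketched below. The RootCause class and summarize helper are hypothetical names; duration is taken as the last occurrence minus the first, and the fix time is optional because it is often unknown.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class RootCause:
    """Hypothetical summary record; fields mirror the description above."""
    stack_hash: str
    first_seen: datetime
    last_seen: datetime
    fix_time: Optional[datetime] = None

    @property
    def duration(self) -> timedelta:
        return self.last_seen - self.first_seen


def summarize(stack_hash, timestamps, fix_time=None):
    """Collapse every occurrence of one stack hash into a single root-cause summary."""
    ordered = sorted(timestamps)
    return RootCause(stack_hash, ordered[0], ordered[-1], fix_time)
```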
Stage 2:
Association data mining determines whether two root causes occur together. The computation involves two computed columns, Support and Probability. Support is the percentage of cases in which a rule must hold before it is considered valid; we require a rule to be found in at least 1 percent of cases.
Probability defines how likely an association must be before it is considered valid; we keep any association with a probability of at least 10 percent.
Bayesian conditional probability and confidence can also be used. Association rules are formed from a pair of item-sets, an antecedent and a consequent, because we want to find the value of seeing one item together with another. Let I be a set of items and T a set of transactions. An association is a subset of I whose items occur together in transactions of T. Support(S1) is the fraction of transactions in T that contain S1. For subsets S1 and S2 of I, the association rule S1 -> S2 has support(S1 -> S2) = Support(S1 union S2) and confidence(S1 -> S2) = Support(S1 union S2) / Support(S1). A third metric, Lift, is defined as Confidence(S1 -> S2) / Support(S2). Lift is often preferred because a very common consequent S2 yields high confidence for almost any S1; dividing by Support(S2) corrects for this, so lift exceeds 1.0 only when S1 and S2 occur together more often than they would if they were independent.
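These definitions translate directly into code. The sketch below mines single-item rules S1 -> S2, assuming each transaction is the set of root-cause stack hashes seen in one time window (the post does not define a transaction for log data). It also assumes the Probability threshold corresponds to rule confidence, and uses the 1 percent support and 10 percent probability thresholds from above as defaults.

```python
from itertools import permutations


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def pair_rules(transactions, min_support=0.01, min_confidence=0.10):
    """Return (S1, S2, support, confidence, lift) for single-item rules S1 -> S2."""
    items = sorted(set().union(*transactions))
    rules = []
    for antecedent, consequent in permutations(items, 2):
        s_both = support({antecedent, consequent}, transactions)
        if s_both < min_support:
            continue
        confidence = s_both / support({antecedent}, transactions)
        if confidence < min_confidence:
            continue
        lift = confidence / support({consequent}, transactions)
        rules.append((antecedent, consequent, s_both, confidence, lift))
    return rules
```

Lift equals Support(S1 union S2) / (Support(S1) . Support(S2)), so S1 -> S2 and S2 -> S1 share the same lift and differ only in confidence. Pairwise counting is quadratic in the number of distinct root causes; for larger item-sets an algorithm such as Apriori prunes candidates instead.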
Some databases support creating association models that can be persisted and evaluated against each incoming request. A 70/30 training/testing split of the data is commonly used when building such models.
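A 70/30 split could be sketched as below, with hypothetical helper names: rules are mined on the first part and their confidence is re-measured on the held-out part. Because log data is time ordered, the sketch also allows a chronological split (earlier 70 percent for training), which avoids training on windows that come after the ones being tested; which variant a given database uses is not specified here.

```python
import random


def train_test_split(transactions, train_fraction=0.7, shuffle=True, seed=0):
    """Split transactions into a training part and a held-out part.

    With shuffle=False the input order is kept, giving a chronological split.
    """
    items = list(transactions)
    if shuffle:
        random.Random(seed).shuffle(items)  # seeded so the split is reproducible
    cut = round(len(items) * train_fraction)
    return items[:cut], items[cut:]


def holdout_confidence(antecedent, consequent, transactions):
    """Confidence of antecedent -> consequent measured on held-out transactions."""
    with_antecedent = [t for t in transactions if antecedent in t]
    if not with_antecedent:
        return None  # the antecedent never appears, so confidence is undefined
    hits = sum(1 for t in with_antecedent if consequent in t)
    return hits / len(with_antecedent)
```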
Sample: https://jsfiddle.net/g2snw4da/