Continuous root cause analysis via analysis of time-series events:
Problem statement: Given a method to collect many data points for errors in logs, can we predict the resolution time of the next root cause?
Solution: There are two stages to solving this problem:
1. Stage 1 – discover the root cause and create a summary that captures it.
2. Stage 2 – use a time-series algorithm to predict the relief time.
Stage 1:
When exception stack traces are collected from a batch of log entries, we can transform them into a vector representation that uses the notable stack traces as features. We then start with the hidden weight matrix that the neural network layer generates and use that hidden layer to determine salience using the gradient descent method. All values fall within the [0,1] co-occurrence probability range.
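The post does not spell out how the vectors or the co-occurrence probabilities are built, so the following is only a minimal sketch under stated assumptions: each batch of log entries is reduced to the set of notable stack traces it contains, each batch becomes a binary feature vector, and a co-occurrence probability is the fraction of batches in which two features appear together. The feature names (`npe`, `oom`, `timeout`) and the helper names `encode` and `cooccurrence` are hypothetical.

```python
def encode(batches, features):
    """Return one binary vector per batch: 1 if the feature appears in it."""
    return [[1 if f in batch else 0 for f in features] for batch in batches]


def cooccurrence(vectors):
    """matrix[i][j] is the fraction of batches containing both feature i and j."""
    n = len(vectors)
    k = len(vectors[0])
    return [
        [sum(v[i] * v[j] for v in vectors) / n for j in range(k)]
        for i in range(k)
    ]


# Hypothetical stack-trace features and batches, for illustration only.
features = ["npe", "oom", "timeout"]
batches = [{"npe", "timeout"}, {"npe", "oom", "timeout"}, {"oom"}, {"npe"}]

vectors = encode(batches, features)
matrix = cooccurrence(vectors)
```

Because every entry is a count divided by the number of batches, all values stay within [0,1], which matches the range stated above.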
The solution to the quadratic form representing the embeddings is found by arriving at its minimum, which satisfies Ax = b, using the conjugate gradient method. The minimum coincides with the solution of Ax = b when A is symmetric and positive-definite.
We are given an input matrix A, a vector b, a starting value x, a maximum number of iterations i-max and an error tolerance epsilon < 1.
This method proceeds as follows:
set i to 0
set residual to b - Ax
set search-direction to residual
set delta-new to dot-product(residual-transposed, residual)
set delta-0 to delta-new
while i < i-max and delta-new > epsilon^2 * delta-0 do:
    q = dot-product(A, search-direction)
    alpha = delta-new / dot-product(search-direction-transposed, q)
    x = x + alpha * search-direction
    if i is divisible by 50:
        residual = b - Ax
    else:
        residual = residual - alpha * q
    delta-old = delta-new
    delta-new = dot-product(residual-transposed, residual)
    beta = delta-new / delta-old
    search-direction = residual + beta * search-direction
    i = i + 1
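The steps above can be sketched in plain Python as follows. This is an illustrative translation rather than a tuned implementation: it assumes A is a small, dense, symmetric positive-definite matrix given as a list of rows, and the helper names `dot` and `matvec`, the default values for `i_max` and `epsilon`, and the 2x2 example system are assumptions made for the example.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def matvec(A, v):
    """Product of a matrix (list of rows) with a vector."""
    return [dot(row, v) for row in A]


def conjugate_gradient(A, b, x, i_max=1000, epsilon=1e-10):
    """Solve Ax = b for symmetric positive-definite A, starting from x."""
    x = list(x)
    residual = [bi - ai for bi, ai in zip(b, matvec(A, x))]
    direction = list(residual)
    delta_new = dot(residual, residual)
    delta_0 = delta_new
    i = 0
    while i < i_max and delta_new > epsilon ** 2 * delta_0:
        q = matvec(A, direction)
        alpha = delta_new / dot(direction, q)
        x = [xi + alpha * di for xi, di in zip(x, direction)]
        if i % 50 == 0:
            # Recompute the exact residual periodically to limit rounding drift.
            residual = [bi - ai for bi, ai in zip(b, matvec(A, x))]
        else:
            residual = [ri - alpha * qi for ri, qi in zip(residual, q)]
        delta_old = delta_new
        delta_new = dot(residual, residual)
        beta = delta_new / delta_old
        direction = [ri + beta * di for ri, di in zip(residual, direction)]
        i += 1
    return x


# Small symmetric positive-definite system, chosen only for illustration.
A = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
solution = conjugate_gradient(A, b, [0.0, 0.0])
```

The stopping test compares the squared residual norm against epsilon squared times its initial value, so the tolerance is relative to the starting error rather than absolute.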
Root cause capture – Exception stack traces that are captured from various sources and appear in the logs can be stack hashed, so that repeated occurrences of the same trace map to the same identifier. The root cause can then be described by a specific stack trace, its associated point in time, the duration over which it appears and the time the fix was introduced, if known.
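As a rough illustration of stack hashing and of the root-cause summary, the sketch below normalizes a stack trace and hashes it with SHA-256 from Python's standard `hashlib`. The normalization rule (stripping line numbers so that small code shifts do not change the hash), the `RootCause` field names and the `com.example` traces are assumptions made for the example, not details given in the post.

```python
import hashlib
import re
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


def stack_hash(stack_trace: str) -> str:
    """Hash a stack trace after dropping blank lines and line numbers."""
    frames = []
    for line in stack_trace.splitlines():
        line = line.strip()
        if not line:
            continue
        # Assumption: line numbers such as ":42" are noise for grouping.
        frames.append(re.sub(r":\d+", "", line))
    return hashlib.sha256("\n".join(frames).encode("utf-8")).hexdigest()


@dataclass
class RootCause:
    """Summary of one root cause; field names are hypothetical."""
    stack_hash: str
    first_seen: datetime
    duration: timedelta
    fixed_at: Optional[datetime] = None  # unknown until a fix is introduced


# Two hypothetical traces that differ only in a line number.
trace_a = (
    "java.lang.NullPointerException\n"
    "  at com.example.Foo.bar(Foo.java:42)\n"
    "  at com.example.Main.main(Main.java:10)"
)
trace_b = trace_a.replace("Foo.java:42", "Foo.java:57")

cause = RootCause(
    stack_hash=stack_hash(trace_a),
    first_seen=datetime(2024, 1, 1, 9, 0),
    duration=timedelta(hours=3),
)
```

Under this normalization both traces produce the same hash, so they would be counted as one root cause; a stricter or looser rule would change how traces are grouped.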
Stage 2: A time-series algorithm does not need any attributes other than the historical collection of relief times to predict the next relief time. It looks only at the scalar value, regardless of the type of incident, the factors playing into its relief time or its root-cause attributes. The historical data is used to estimate the relief time of the incoming event, as if the relief times were a scatter plot along the timeline. Unlike other data mining algorithms that involve additional attributes of the event, this approach uses a single auto-regressive method on the continuous data to make a short-term prediction. The regression is retrained automatically as the data accrues.
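The post does not name a specific auto-regressive model, so the sketch below assumes the simplest one, a first-order model y[t] = c + phi * y[t-1] fitted by ordinary least squares on the relief times alone. The function names `fit_ar1` and `predict_next` and the sample values are hypothetical.

```python
def fit_ar1(values):
    """Least-squares fit of y[t] = c + phi * y[t-1]; returns (c, phi)."""
    if len(values) < 2:
        raise ValueError("need at least two relief times to fit the model")
    xs = values[:-1]  # previous relief times
    ys = values[1:]   # relief times that followed them
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    if var_x == 0:
        # No variation in the history: fall back to predicting the mean.
        return mean_y, 0.0
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    phi = cov / var_x
    c = mean_y - phi * mean_x
    return c, phi


def predict_next(values):
    """Refit on the full history and predict the next relief time."""
    c, phi = fit_ar1(values)
    return c + phi * values[-1]


# Hypothetical relief times in hours, oldest first.
relief_hours = [4.0, 5.0, 4.5, 6.0, 5.5, 6.5]
estimate = predict_next(relief_hours)

# As data accrues, append the observed relief time and call predict_next again.
relief_hours.append(6.0)
updated_estimate = predict_next(relief_hours)
```

Refitting on every call is what "trained as the data accrues" amounts to in this sketch; only the scalar history is used, with no incident attributes.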