Decision Tree modeling on API Error root cause analysis
Problem statement: Given a method that collects root causes from the many
data points that API error codes leave in logs, can the relief time be
determined?
Solution: There are two stages to solving this problem:
1. Stage 1 – discover the root cause and create a summary to capture it
2. Stage 2 – use decision tree modeling to determine the relief time.
Stage 1:
The first stage involves a data pipeline that converts each log entry to a
JSON object with request and response details, status code, error message,
remote server info, and query and request parameters, and hashes these
objects into buckets. When the dictionaries are collected from a batch of log
entries, we can transform them into a vector representation, using the notable
request-response pairs as features. Then we can generate a hidden weight
matrix for the neural network. We use that hidden layer to determine the
salience using the gradient descent method. All values are within the [0,1]
co-occurrence probability range.
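The bucketing and feature step above can be sketched roughly as follows. This is a minimal illustration, not the pipeline itself: the field names, the `num_buckets` parameter and the helper names `to_record`, `bucket_key`, `bucketize` and `pair_frequencies` are all assumptions made for the example, and the "feature" here is simply the relative frequency of each request/status pair, which keeps every value in [0,1].

```python
import hashlib
import json
from collections import Counter, defaultdict

# Hypothetical field names; the real log schema may differ.
FIELDS = ("request", "response", "status_code", "error_message",
          "remote_server", "query_params", "request_params")


def to_record(entry):
    """Keep only the fields named in the text; missing ones become None."""
    return {name: entry.get(name) for name in FIELDS}


def bucket_key(record, num_buckets=64):
    """Hash a canonical JSON form so equal records always share a bucket."""
    canonical = json.dumps(record, sort_keys=True, default=str)
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_buckets


def bucketize(entries, num_buckets=64):
    """Group parsed log entries by their bucket key."""
    buckets = defaultdict(list)
    for entry in entries:
        record = to_record(entry)
        buckets[bucket_key(record, num_buckets)].append(record)
    return buckets


def pair_frequencies(records):
    """Relative frequency of each (request, status_code) pair, in [0, 1]."""
    counts = Counter((str(r["request"]), r["status_code"]) for r in records)
    total = len(records)
    return {pair: n / total for pair, n in counts.items()} if total else {}
```

A batch of parsed entries would go through `bucketize` first, and the records of a bucket (or of the whole batch) through `pair_frequencies` to get a sparse feature vector.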
The embeddings are represented by a quadratic form, and its minimum is the
solution of Ax = b, which we find using the conjugate gradient method (this
requires A to be symmetric and positive-definite).
We are given an input matrix A, a vector b, a starting value x, a maximum
number of iterations i-max and an error tolerance epsilon < 1.
This method proceeds this way:
set i to 0
set residual to b - Ax
set search-direction to residual
set delta-new to dot-product(residual-transposed, residual)
set delta-0 to delta-new
while i < i-max and delta-new > epsilon^2 * delta-0 do:
    q = dot-product(A, search-direction)
    alpha = delta-new / dot-product(search-direction-transposed, q)
    x = x + alpha * search-direction
    if i is divisible by 50
        residual = b - Ax
    else
        residual = residual - alpha * q
    delta-old = delta-new
    delta-new = dot-product(residual-transposed, residual)
    beta = delta-new / delta-old
    search-direction = residual + beta * search-direction
    i = i + 1
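For readers who want to run the steps above, here is one way they might look in plain Python. It is a sketch under a few assumptions: A is a small dense symmetric positive-definite matrix given as a list of rows, and the helper names `dot` and `matvec` and the default values for `i_max` and `epsilon` are made up for this example.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def matvec(A, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [dot(row, v) for row in A]


def conjugate_gradient(A, b, x, i_max=1000, epsilon=1e-10):
    """Approximately solve Ax = b for symmetric positive-definite A."""
    x = list(x)
    residual = [bi - axi for bi, axi in zip(b, matvec(A, x))]
    direction = list(residual)
    delta_new = dot(residual, residual)
    delta_0 = delta_new
    i = 0
    while i < i_max and delta_new > epsilon ** 2 * delta_0:
        q = matvec(A, direction)
        alpha = delta_new / dot(direction, q)
        x = [xi + alpha * di for xi, di in zip(x, direction)]
        if i % 50 == 0:
            # Recompute the exact residual now and then to shed rounding drift.
            residual = [bi - axi for bi, axi in zip(b, matvec(A, x))]
        else:
            residual = [ri - alpha * qi for ri, qi in zip(residual, q)]
        delta_old = delta_new
        delta_new = dot(residual, residual)
        beta = delta_new / delta_old
        direction = [ri + beta * di for ri, di in zip(residual, direction)]
        i += 1
    return x
```

For example, `conjugate_gradient([[4.0, 1.0], [1.0, 3.0]], [1.0, 2.0], [0.0, 0.0])` should land close to x = (1/11, 7/11), the exact solution of that small system.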
Root cause capture – API error summaries that are captured from various
sources and appear in the logs can be stack hashed. The root cause can be
described by a specific summary, its associated point in time, the duration
over which it appears, and the time a fix was introduced, if known.
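A root cause record along these lines could be sketched as below. Everything named here is hypothetical: the `RootCause` class, its field names, and the normalization inside `stack_hash` (lowercasing and masking numbers so that the same error with different ids hashes alike). The `relief_time` method also assumes that relief time means the gap between first appearance and the fix, which the text does not spell out.

```python
import hashlib
import re
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


def stack_hash(summary: str) -> str:
    """Short stable hash of an error summary with numbers masked out."""
    normalized = re.sub(r"0x[0-9a-f]+|\d+", "#", summary.strip().lower())
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()[:12]


@dataclass
class RootCause:
    summary: str                          # the error summary text
    first_seen: datetime                  # associated point in time
    duration: timedelta                   # how long it kept appearing
    fixed_at: Optional[datetime] = None   # time of fix, if known

    @property
    def key(self) -> str:
        return stack_hash(self.summary)

    def relief_time(self) -> Optional[timedelta]:
        # Assumption: relief time = time of fix minus first appearance.
        if self.fixed_at is None:
            return None
        return self.fixed_at - self.first_seen
```

Records with a known `fixed_at` give the labeled examples that Stage 2 can learn from; records without one are the cases to predict.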
Stage 2: Decision tree modeling can help predict relief time. It involves
both a classification and a regression tree. A function divides the rows into
two datasets based on the value of a specific column. The two lists of rows
that are returned are such that one set matches the criteria for the split
while the other does not. When the attribute to be chosen is clear, this works
well.
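A minimal sketch of such a dividing function, assuming rows are tuples and that numeric columns are split with `>=` while other columns are split on equality (the name `divide_set` and that numeric/categorical convention are choices made for this example):

```python
def divide_set(rows, column, value):
    """Split rows into (matching, non-matching) on one column's value."""
    if isinstance(value, (int, float)) and not isinstance(value, bool):
        def matches(row):
            return row[column] >= value   # numeric: threshold split
    else:
        def matches(row):
            return row[column] == value   # categorical: equality split
    set1 = [row for row in rows if matches(row)]
    set2 = [row for row in rows if not matches(row)]
    return set1, set2
```

With rows such as `(500, "payments", "slow")`, calling `divide_set(rows, 0, 500)` separates the 5xx status codes from the rest, and `divide_set(rows, 1, "search")` separates one service from the others.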
To see how good an attribute is, the entropy of the whole group is
calculated. Then the group is divided by the possible values of each attribute
and the entropy of the two new groups is calculated. To determine which
attribute is best to divide on, the information gain is calculated, which is
the difference between the current entropy and the weighted-average entropy of
the two new groups. The algorithm calculates the information gain for every
attribute and chooses the one with the highest information gain.
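The entropy and information gain calculation can be written out in a few lines. This sketch assumes the outcome to predict is the last element of each row and uses base-2 logarithms; the function names are illustrative.

```python
from collections import Counter
from math import log2


def entropy(rows):
    """Shannon entropy (in bits) of the labels in the last column."""
    counts = Counter(row[-1] for row in rows)
    total = len(rows)
    return -sum((n / total) * log2(n / total) for n in counts.values())


def information_gain(rows, set1, set2):
    """Current entropy minus the weighted-average entropy of the two halves."""
    p = len(set1) / len(rows)
    return entropy(rows) - p * entropy(set1) - (1 - p) * entropy(set2)
```

A split that separates the labels perfectly recovers the full entropy of the parent group as gain, while a split that leaves both halves as mixed as the parent gains nothing.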
Each set is subdivided only if the recursion of the above step can proceed.
The recursion is terminated when a solid conclusion has been reached, which is
a way of saying that the information gain from splitting a node is no more
than zero. The branches keep dividing, creating a tree by calculating the best
attribute for each new node. If a threshold is set for the entropy reduction,
splits that gain less than the threshold are not kept, and the decision tree
is ‘pruned’.
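Putting the recursion together, a small classification tree might be built like this. It is a sketch, not the author's implementation: it only does equality splits, treats the last column as a hypothetical relief-time label such as "slow" or "fast", and applies the `min_gain` threshold while growing the tree (stopping early) rather than merging branches afterwards.

```python
from collections import Counter
from math import log2


def entropy(rows):
    counts = Counter(row[-1] for row in rows)
    total = len(rows)
    return -sum((n / total) * log2(n / total) for n in counts.values())


def build_tree(rows, min_gain=0.0):
    """Recursively split on the attribute/value with the highest gain."""
    current = entropy(rows)
    best_gain, best_split = 0.0, None
    for column in range(len(rows[0]) - 1):
        # dict.fromkeys keeps first-seen order, so ties break predictably.
        for value in dict.fromkeys(row[column] for row in rows):
            set1 = [r for r in rows if r[column] == value]
            set2 = [r for r in rows if r[column] != value]
            if not set1 or not set2:
                continue
            p = len(set1) / len(rows)
            gain = current - p * entropy(set1) - (1 - p) * entropy(set2)
            if gain > best_gain:
                best_gain, best_split = gain, (column, value, set1, set2)
    # Stop when no split helps, or when the best gain is under the threshold.
    if best_split is None or best_gain <= min_gain:
        return {"leaf": dict(Counter(row[-1] for row in rows))}
    column, value, set1, set2 = best_split
    return {"column": column, "value": value,
            "true": build_tree(set1, min_gain),
            "false": build_tree(set2, min_gain)}


def classify(tree, observation):
    """Walk the tree and return the most common label at the leaf reached."""
    while "leaf" not in tree:
        matched = observation[tree["column"]] == tree["value"]
        tree = tree["true"] if matched else tree["false"]
    return max(tree["leaf"], key=tree["leaf"].get)
```

Raising `min_gain` yields a shallower tree with more mixed leaves, which is the trade-off the pruning threshold controls.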