Thursday, January 2, 2020

Exception Stream:
When a computer program has a fault, it generates an exception with a stacktrace of all the methods leading up to the point of failure. Developers write code so that exceptions are written out to logs for offline analysis by DevOps engineers. This article lists the implementation considerations for an automated solution that finds the top generators of exceptions on a continuous basis.
We list the high-level steps first and then document the implementation considerations specific to continuous automated reporting of top exception generators. The steps are:
1) parse the logs by age and extract the exceptions
2) generate a hash for each exception stacktrace and store the stacktrace along with its hash in a separate stream
3) run a program that continuously monitors the exceptions written to the stream, categorizes them and builds a histogram
4) report the top generators of exceptions from the histogram on demand at any point in time.
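Steps 1 and 2 can be sketched in a few lines of plain Python. This is a minimal illustration, not a stackhasher replacement: it assumes Java-style log entries where an exception line is followed by indented "at ..." frames, and the names extract_exceptions and stack_hash are hypothetical. Line numbers are left out of the hash so that the same failure hashes to the same value across builds.

```python
import hashlib
import re

# Assumed layout: a line naming an exception type, followed by indented
# "at package.Class.method(File.java:NN)" frames. Real logs may differ.
HEADER = re.compile(r"([\w.$]+(?:Exception|Error))\b")
FRAME = re.compile(r"^\s+at\s+([\w.$<>]+)\(")


def extract_exceptions(lines):
    """Yield (exception_type, frames) for each stacktrace found in lines."""
    current = None
    for line in lines:
        frame = FRAME.match(line)
        if frame and current is not None:
            # Keep only the method name; file and line number are dropped.
            current[1].append(frame.group(1))
            continue
        # A non-frame line ends the stacktrace collected so far.
        if current is not None and current[1]:
            yield current
        current = None
        header = HEADER.search(line)
        if header:
            current = (header.group(1), [])
    if current is not None and current[1]:
        yield current


def stack_hash(exc_type, frames, depth=5):
    """Hash the exception type and the top `depth` frames (depth is an assumed cutoff)."""
    key = "\n".join([exc_type] + frames[:depth])
    return hashlib.sha256(key.encode("utf-8")).hexdigest()
```

Each (hash, stacktrace) pair produced this way is what would be appended to the exception stream.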
A stream store is chosen for the exceptions specifically to meet the demands of continuous reporting. Traditional solutions usually poll the exception histogram periodically to draw the bar chart of the top generators of exceptions.
The number of generators of exceptions and the number of exceptions for any generator can be arbitrary, so it helps to store each exception individually in the stream. Generating the histogram and persisting it, together with the position of the last accounted exception in the exception stacktrace stream, is optional. Keeping separate streams for log entries, exception stacktraces and histograms makes the computation efficient because no work is repeated. Otherwise, the exception stacktrace stream alone is sufficient in the stream store.
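The role of the positional reference can be shown with a small sketch. It assumes the stream is modeled as a Python list of (hash, stacktrace) records and that a generator is identified by its stack hash. The class name ExceptionHistogram is hypothetical, and a real stream store would supply its own notion of position.

```python
from collections import Counter


class ExceptionHistogram:
    """Counts exceptions per stack hash and remembers how far it has read."""

    def __init__(self):
        self.counts = Counter()
        self.position = 0  # index of the next unread record in the stream

    def update(self, stream):
        """Account only for records appended since the previous call."""
        for stack_hash, _stacktrace in stream[self.position:]:
            self.counts[stack_hash] += 1
        self.position = len(stream)

    def top(self, n=3):
        """Return the n most frequent generators as (hash, count) pairs."""
        return self.counts.most_common(n)
```

Calling update repeatedly never recounts an exception, which is the repetition the separate histogram and position are meant to avoid.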
Apache Flink is the natural choice for writing and reading the exception stacktrace stream, and the reader and the writer can be separate programs.
The writer of exception stacktraces to the stream has to parse the logs. Usually the logs are available on the filesystem and seldom in the stream store, but it is advisable to propagate the logs to their own stream in the stream store. Parsing log entries to extract exceptions and hashing the exception stacktrace are well-known techniques implemented by stackhasher programs. Generating the histogram is a straightforward Flink query. Persisting the histogram and the position of the last read exception only reduces the working set over which the histogram needs to be updated. Regular state persistence in the Flink application code can flush the histogram to disk.
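What gets persisted is small: the counts and the last read position. The sketch below uses plain Python and a JSON file as a stand-in for Flink's state persistence, which has its own API that is not shown here. The function names save_state and load_state and the file layout are assumptions made for illustration.

```python
import json
import os
import tempfile


def save_state(path, counts, position):
    """Write the histogram and the last read position, replacing the file atomically."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump({"position": position, "counts": counts}, f)
    # os.replace swaps the file in one step, so a crash cannot leave it half-written.
    os.replace(tmp_path, path)


def load_state(path):
    """Return (counts, position); start from an empty histogram if nothing was saved."""
    try:
        with open(path) as f:
            state = json.load(f)
    except FileNotFoundError:
        return {}, 0
    return state["counts"], state["position"]
```

On restart, the reader would load the saved position and resume from there, so only exceptions written after the last flush need to be counted again.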
It may be argued that the histogram is a metric, unlike the log entry or the exception stacktrace. A suitable metrics stack and an InfluxQL query can suffice to produce it. Storing the histogram in an object store for universal web access, metrics and alerts may also be an alternative. However, the convenience of updating the histogram at the point of reading the exception stacktrace makes it easy to include in the Flink application.
