Cluster computing: Retry

Tuesday, February 23, 2021

Retry

The case for retrying on errors in JavaScript:

Problem statement: Scripts written for DevOps are susceptible to intermittent failures that can be overcome with retries. For example, a database may not be reachable within the limited timeout that is usually set for scripts. A script may also call nested commands, each with its own timeout. Such errors do not occur under normal conditions, so they escape functional tests. If a limited number of retries is attempted, the code usually executes successfully and the failure does not recur. Retry also has the side benefit of reducing the cost of detecting and investigating spurious errors that would otherwise surface in logs, incidents, and false defect reports. How to handle intermittent failures is the subject of this article.

Solution: The implementation of the retry is independent of the callee. Scripts are not written in programming languages like Java, where the callee can be standardized to a Runnable. Scripts also do not mandate conformance to a convention across their usages. The most common form of a callee is an ordinary JavaScript function. Invoking the function could raise a timeout exception, so the retry works only by wrapping the call in a try-catch handler.

The catch handler can look specifically for timeout exceptions and handle them by retrying the callee. The number of retries is set to a small, finite number such as ten.

The retry logic ensures that the result it returns is that of the callee; otherwise, it propagates the unknown exception. Retry applies only to known exceptions, and unknown exceptions pass through to the caller. The behavior seen by the caller is therefore the same as that of the callee. If the callee returns undefined in error conditions, the retry must also return undefined.
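As a minimal sketch of this behavior, the function below retries only on one known error type and lets everything else through. TimeoutError and retryOnTimeout are hypothetical names chosen for illustration; a real callee may signal a timeout differently, so the instanceof check would need to be adapted.

```javascript
// Hypothetical error type standing in for whatever timeout error the callee throws.
function TimeoutError(message) {
    this.name = 'TimeoutError';
    this.message = message || 'operation timed out';
}
TimeoutError.prototype = Object.create(Error.prototype);
TimeoutError.prototype.constructor = TimeoutError;

// Calls fn up to maxAttempts times (assumed to be at least 1),
// retrying only when fn throws a TimeoutError.
function retryOnTimeout(fn, maxAttempts) {
    var lastError;
    for (var attempt = 1; attempt <= maxAttempts; attempt++) {
        try {
            // Success: hand back the callee's result unchanged, even if it is undefined.
            return fn();
        } catch (e) {
            if (!(e instanceof TimeoutError)) {
                throw e; // unknown error: propagate to the caller untouched
            }
            lastError = e; // known intermittent failure: try again
        }
    }
    // Every attempt timed out, so the caller sees the last timeout.
    throw lastError;
}
```

Rethrowing the last timeout once the attempts are exhausted is one possible design choice; it keeps the caller's view identical to calling the callee directly.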

The retry script can optionally log the number of attempts to help find the root cause of intermittent errors and to ensure that they do not go undetected for a long time. Such automatic handling of intermittent failures, along with transparency about what those errors were, will likely bring down the overall number of exceptions that operations encounters in a system. This improves the health of the system, as well as the metrics and reports shown on monitoring dashboards.
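One way to add this logging, sketched under assumptions: the retry takes a log callback and records each failed attempt, plus the attempt on which the call finally succeeded. The names retryWithLog, isRetryable, and log are hypothetical; log could be console.log or whatever logger the scripting platform provides.

```javascript
// Retries fn up to maxAttempts times (assumed to be at least 1).
// isRetryable(e) decides whether an error is a known intermittent failure;
// log(message) receives one line per failed attempt and one on late success.
function retryWithLog(fn, maxAttempts, isRetryable, log) {
    for (var attempt = 1; attempt <= maxAttempts; attempt++) {
        try {
            var result = fn();
            if (attempt > 1) {
                // Only report when a retry was actually needed.
                log('succeeded on attempt ' + attempt + ' of ' + maxAttempts);
            }
            return result;
        } catch (e) {
            // Unknown errors and the final failure go straight to the caller.
            if (!isRetryable(e) || attempt === maxAttempts) {
                throw e;
            }
            log('attempt ' + attempt + ' failed: ' + e.message);
        }
    }
}
```

Because nothing is logged when the first attempt succeeds, the log stays quiet under normal conditions and any entry points at an intermittent failure worth investigating.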

The retry logic can be made flexible and reusable by accepting numRetries and delayMillis, the delay between retries of the callee. These parameters can be set when the retry function object is initialized or passed in on each call.
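Both ways of supplying the parameters can be combined, as in the sketch below: defaults are fixed when the retry function is created, and a call may override them. createRetry, the overrides argument, isRetryable, and the injected sleep function are assumptions made for this example; sleep stands in for whatever blocking delay the platform offers. As in the sample code at the end of this post, numRetries counts the total number of attempts.

```javascript
// Builds a retry function with default settings.
// defaults.numRetries   total number of attempts
// defaults.delayMillis  pause between attempts, in milliseconds
// defaults.sleep        hypothetical blocking delay function taking milliseconds
// defaults.isRetryable  hypothetical predicate identifying known intermittent errors
function createRetry(defaults) {
    return function retry(fn, overrides) {
        var opts = overrides || {};
        // Per-call values win over the defaults given at creation time.
        var numRetries = opts.numRetries !== undefined ? opts.numRetries : defaults.numRetries;
        var delayMillis = opts.delayMillis !== undefined ? opts.delayMillis : defaults.delayMillis;
        for (var attempt = 1; attempt <= numRetries; attempt++) {
            try {
                return fn();
            } catch (e) {
                // Unknown errors and the final failure go straight to the caller.
                if (!defaults.isRetryable(e) || attempt === numRetries) {
                    throw e;
                }
                if (delayMillis > 0) {
                    defaults.sleep(delayMillis); // wait before the next attempt
                }
            }
        }
    };
}
```

A script could create one retry function with conservative defaults and pass something like { numRetries: 3 } only for the calls that need a different limit.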

Since retry addresses a broad range of callees and their intermittent failures, it can be put in a library with the callee passed as a parameter. Reusing the retry logic brings consistency and a single place to add instrumentation for future investigations.

Conclusion: Retry logic is a simple and effective technique to improve DevOps scripts and bring down the number of errors and exceptions substantially.

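// Sample code
// for retries with exponential backoff

var RetryUtil = Class.create();

RetryUtil.prototype = {

    initialize: function() {
        this.numRetries = 10;
        this.delayMillis = 0;
        this.toleratedExceptions = [];
    },

    // Calls fn up to numRetries times and returns its result. Only errors that are
    // instances of a type listed in toleratedExceptions are retried; any other
    // error is rethrown to the caller at once.
    retryOperation: function(fn, numRetries, delayMillis, toleratedExceptions) {
        if (numRetries) { this.numRetries = numRetries; }
        if (delayMillis) { this.delayMillis = delayMillis; }
        if (toleratedExceptions && toleratedExceptions.length > 0) { this.toleratedExceptions = toleratedExceptions; }

        var lastError;
        for (var r = 0; r < this.numRetries; r++) {
            try {
                return fn.call();
            } catch (e) {
                // Check whether the error is one of the tolerated types.
                var tolerated = false;
                for (var i = 0; i < this.toleratedExceptions.length; i++) {
                    if (e instanceof this.toleratedExceptions[i]) {
                        tolerated = true;
                        break;
                    }
                }
                if (!tolerated) { throw e; }
                lastError = e;
                // Back off exponentially before the next attempt: delay, 2x delay, 4x delay, ...
                if (this.delayMillis > 0 && r < this.numRetries - 1) {
                    gs.sleep(this.delayMillis * Math.pow(2, r));
                }
            }
        }
        // Every attempt failed with a tolerated exception, so surface the last one.
        throw lastError;
    },

    type: 'RetryUtil'
};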
