Monday, January 10, 2022

Fault Injection Testing:

The stability and resiliency of software are critical to the smooth running of an application. Fault injection testing is the deliberate introduction of errors and faults into a system to observe its behavior. The goal is for the software to work correctly despite errors encountered from calls made to dependencies such as other APIs, system calls and so on. By introducing intermittent failure conditions over time, the application is exercised as realistically as it would be in production, where hardware and software faults can occur randomly but the services must remain available and business continuity must be maintained.
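
As a minimal sketch of what introducing intermittent failures can look like in code, the wrapper below makes a fraction of calls to a dependency fail before they reach it. Every name here (with_faults, InjectedFault, fetch_price, price_or_default) is hypothetical and chosen only for illustration; a real experiment would target an actual dependency client.

```python
import random


class InjectedFault(Exception):
    """Raised in place of a real dependency error (hypothetical type)."""


def with_faults(func, failure_rate, rng):
    """Wrap func so that roughly failure_rate of calls fail before reaching it."""
    def wrapper(*args, **kwargs):
        if rng.random() < failure_rate:
            raise InjectedFault(f"injected failure calling {func.__name__}")
        return func(*args, **kwargs)
    return wrapper


def fetch_price(item):
    # Stand-in for a call to a dependency such as a remote API (assumption).
    return {"item": item, "price": 10}


def price_or_default(fetch, item):
    """Caller under test: it should degrade gracefully instead of crashing."""
    try:
        return fetch(item)["price"]
    except InjectedFault:
        return None


# A seeded generator keeps the experiment repeatable.
rng = random.Random(42)
flaky_fetch = with_faults(fetch_price, 0.3, rng)
results = [price_or_default(flaky_fetch, "widget") for _ in range(1000)]
```

The caller passes the experiment if every entry in results is either a price or the fallback value, which means no injected fault escaped as an unhandled exception.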

The system needs to be resilient to the conditions that cause production disruptions. The dependencies might include infrastructure, platform, network, third-party software, or APIs. The risk of impact from a dependency failure may be direct or cascading. Fault injection methods are a way to increase coverage and validate software robustness and error handling, either at build time or at run time, with the intention of embracing failure as a part of the development lifecycle. These methods assist service teams in designing and continuously validating for failure, accounting for known and unknown failure conditions, architecting for redundancy and employing retry and back-off mechanisms. Together with the introduction of intermittent failures and continuous monitoring in the staging environment of service deployments, these methods promote broad coverage of the known and unknown faults that can impact the service in production. The purpose of monitoring during these experiments is to observe the fault and its recovery time, to survey the symptoms in related components, and to determine the thresholds and values with which alerts can be set.
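
The retry and back-off mechanism mentioned above can be sketched as follows, together with a test double that fails a fixed number of times before succeeding. The function call_with_retry, the class FlakyDependency and the default delays are assumptions made for this example, not part of any particular library.

```python
import time


def call_with_retry(func, max_attempts=4, base_delay=0.1, sleep=time.sleep):
    """Call func, retrying on ConnectionError with exponential back-off."""
    for attempt in range(max_attempts):
        try:
            return func()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # retries exhausted: surface the failure to the caller
            sleep(base_delay * 2 ** attempt)  # 0.1s, 0.2s, 0.4s, ...


class FlakyDependency:
    """Hypothetical dependency that fails a set number of times, then recovers."""

    def __init__(self, failures_before_success):
        self.remaining = failures_before_success
        self.calls = 0

    def __call__(self):
        self.calls += 1
        if self.remaining > 0:
            self.remaining -= 1
            raise ConnectionError("injected fault")
        return "ok"


# Record the delays instead of sleeping so the experiment runs instantly.
delays = []
dependency = FlakyDependency(failures_before_success=2)
result = call_with_retry(dependency, sleep=delays.append)
```

Passing the sleep function as a parameter is a design choice for testability: the same code waits for real in production and only records the requested delays under test.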

Fault engineering is equally applicable to software, protocol, and infrastructure. Software faults involve error-handling code paths and in-process memory management, for which edge-case unit tests, integration tests, and stress and soak load tests are written. Protocol faults include the vulnerabilities in communication interfaces such as command-line parameters or APIs. Examples of tests that mitigate these include fuzzing, which provides invalid, unexpected, or random data as input so that we can assess the level of protocol stability of a component. Infrastructure faults include outages, networking problems, and hardware failures. The tests that mitigate these cause faults in the underlying infrastructure, such as shutting down virtual machines, crashing processes, expiring certificates and others.
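
To illustrate the fuzzing idea, here is a small sketch that feeds random printable strings to an input parser and records any exception other than the expected rejection. The parser parse_port and the harness fuzz are hypothetical stand-ins for a real interface and a real fuzzing tool.

```python
import random
import string


def parse_port(text):
    """Hypothetical interface under test: parse a TCP port number."""
    if not text.isascii() or not text.isdigit():
        raise ValueError("port must be decimal digits")
    port = int(text)
    if not 1 <= port <= 65535:
        raise ValueError("port out of range")
    return port


def fuzz(target, iterations, rng):
    """Call target with random strings; collect unexpected exceptions."""
    crashes = []
    for _ in range(iterations):
        length = rng.randint(0, 12)
        candidate = "".join(rng.choice(string.printable) for _ in range(length))
        try:
            target(candidate)
        except ValueError:
            pass  # a controlled rejection of bad input is the desired behavior
        except Exception as exc:
            crashes.append((candidate, exc))  # anything else is a finding
    return crashes


crashes = fuzz(parse_port, 5000, random.Random(0))
```

An empty crashes list suggests the parser rejects malformed input in a controlled way; each entry otherwise pairs an offending input with the exception it triggered.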

One of the challenges with these methods is the signal-to-noise ratio of the errors. A fault is the hypothesized cause of an error. An error is the part of the system state that may lead to a failure, and a failure in one component can in turn act as a fault for another. Since they occur in a cycle, the fault-error-failure cycle can produce many errors, from which the ones that must be fixed to improve system resilience and reliability need to be discerned. When these experiments are run for short durations, the number of errors to investigate is usually low. Leveraging automation to continuously validate what matters during the experiment allows the detection of even those errors that are hard to find manually.

Such automation can even be introduced into the software release pipeline. This promotes a shift-left approach, where testing occurs as early in the development and project timeline as when the code is written. It follows the test-early-and-often principle, and the benefit is the ability to troubleshoot the issues encountered via debugging.

The outcomes of fault injection testing are the measurement and definition of a steady, healthy state for the system's interoperability, the identification of the difference between the baseline state and the anomalous state, and the documentation of the processes and observations needed to identify and act on the result.
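
One simple way to tell the baseline state from the anomalous state is to derive an alert threshold from measurements taken in the steady state and flag experiment samples that exceed it. The sketch below assumes latency in milliseconds as the health metric and a mean-plus-three-standard-deviations rule; both are illustrative choices, and the sample values are made up.

```python
import statistics


def alert_threshold(baseline, k=3.0):
    """Threshold set k sample standard deviations above the baseline mean."""
    return statistics.mean(baseline) + k * statistics.stdev(baseline)


def anomalies(samples, threshold):
    """Return (index, value) pairs for samples above the threshold."""
    return [(i, v) for i, v in enumerate(samples) if v > threshold]


# Made-up latencies: steady state first, then during a fault experiment.
baseline_ms = [100, 102, 98, 101, 99, 100, 103, 97]
experiment_ms = [101, 99, 250, 240, 102, 100]

threshold = alert_threshold(baseline_ms)
flagged = anomalies(experiment_ms, threshold)
```

The indices of the flagged samples also give a rough view of how long the system stayed outside its steady state before it recovered.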
