Fault Injection Testing:
The stability and resiliency of software are critical to the smooth
running of an application. Fault injection testing is the deliberate
introduction of errors and faults into a system to observe its behavior. The goal
is for the software to work correctly despite errors returned by calls
made to dependencies such as other APIs and system calls. By introducing
intermittent failure conditions over time, the test exercises the application under
conditions close to production, where hardware and software faults can occur randomly
but the services must remain available and business continuity must be maintained.
The system needs to be resilient to the conditions that cause
production disruptions. Its dependencies
might include infrastructure, platform, network, third-party software,
or APIs. The impact of a dependency failure may be direct or cascading.
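As a rough illustration of intermittent failures in a dependency call, the sketch below wraps a function so that a configurable fraction of calls fail. The names InjectedFault, with_intermittent_faults, fetch_price, and price_or_default are hypothetical and exist only for this example; a real experiment would wrap an actual client.

```python
import random


class InjectedFault(Exception):
    """Raised in place of a real dependency error during an experiment."""


def with_intermittent_faults(func, failure_rate, rng):
    """Wrap func so that a fraction of calls raise InjectedFault instead of running."""
    def wrapper(*args, **kwargs):
        if rng.random() < failure_rate:
            raise InjectedFault(f"injected failure in {func.__name__}")
        return func(*args, **kwargs)
    return wrapper


def fetch_price(item):
    # Hypothetical dependency call; a real one would hit an API or a database.
    return {"item": item, "price": 10}


def price_or_default(fetch, item):
    """Application code under test: it must stay available when the dependency fails."""
    try:
        return fetch(item)["price"]
    except InjectedFault:
        return None  # degrade gracefully instead of crashing


rng = random.Random(42)  # seeded so the experiment is repeatable
flaky_fetch = with_intermittent_faults(fetch_price, failure_rate=0.3, rng=rng)
results = [price_or_default(flaky_fetch, "widget") for _ in range(1000)]
degraded = results.count(None)  # how many calls fell back to the default
```

Seeding the random generator is a design choice: it keeps the failures intermittent while making a failing run reproducible.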
Fault injection methods increase coverage and validate software
robustness and error handling, either at build time or at run time, with the
intention of embracing failure as a part of the development lifecycle. These
methods assist service teams in designing and continuously validating for
failure, accounting for known and unknown failure conditions, architecting for
redundancy, and employing retry and back-off mechanisms. Together with the
introduction of intermittent failures and continuous monitoring in the staging
environment of service deployments, these methods promote broad coverage
of the known and unknown faults that can impact the service in production. The
purpose of monitoring during these experiments is to observe each
fault and its recovery time, to survey the symptoms in related components, and to
determine the thresholds and values with which alerts can be set.
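The retry and back-off mechanism mentioned above can be sketched as follows, together with a fake dependency that fails a fixed number of times before recovering. This is a minimal sketch, not a production policy: call_with_retry and failing_n_times are hypothetical names, ConnectionError stands in for whatever error the real dependency raises, and the delays are illustrative.

```python
import time


def call_with_retry(func, attempts=4, base_delay=0.01, sleep=time.sleep):
    """Call func, retrying on ConnectionError with exponential back-off.

    Returns (result, delays) so a test can inspect the waits that were used.
    """
    delays = []
    for attempt in range(attempts):
        try:
            return func(), delays
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the failure to the caller
            delay = base_delay * (2 ** attempt)  # 1x, 2x, 4x, ... the base delay
            delays.append(delay)
            sleep(delay)


def failing_n_times(n):
    """Hypothetical dependency that fails its first n calls, then recovers."""
    state = {"calls": 0}

    def dependency():
        state["calls"] += 1
        if state["calls"] <= n:
            raise ConnectionError("injected outage")
        return "ok"

    return dependency


# The sleep function is replaced with a no-op so the experiment runs instantly.
result, delays = call_with_retry(failing_n_times(2), sleep=lambda s: None)
```

Here the call succeeds on the third attempt after two recorded back-off delays. Summing the delays gives a crude lower bound on the recovery time that monitoring would observe.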
Fault engineering is equally applicable to software,
protocol, and infrastructure. Software
faults include error-handling code paths and in-process memory management, for
which edge-case unit tests, integration tests, and stress and soak load tests
are written. Protocol faults include the vulnerabilities in communication
interfaces such as command-line parameters or APIs. Tests that
mitigate these include fuzzing, which provides invalid, unexpected, or random
data as input so that we can assess the protocol stability of a
component. Infrastructure faults include
outages, networking problems, and hardware failures. The tests that mitigate these cause
faults in the underlying infrastructure, such as shutting down virtual machines,
crashing processes, and expiring certificates.
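To make the fuzzing idea concrete, here is a small sketch that throws random strings at an input handler and records any input that fails in an undocumented way. The handler parse_port and the harness fuzz are hypothetical; dedicated fuzzing tools are far more thorough, and this only shows the shape of the technique.

```python
import random
import string


def parse_port(text):
    """Hypothetical input handler: parse 'port=<n>' and return n in 1..65535."""
    key, sep, value = text.partition("=")
    if key != "port" or not sep:
        raise ValueError("expected port=<n>")
    if not (value.isascii() and value.isdigit()):
        raise ValueError("port must be decimal digits")
    port = int(value)
    if not 1 <= port <= 65535:
        raise ValueError("port out of range")
    return port


def fuzz(target, runs, rng):
    """Feed random strings to target; collect inputs that fail in an unexpected way."""
    alphabet = string.printable + "\u0661\u00b2"  # also try non-ASCII digit characters
    crashes = []
    for _ in range(runs):
        prefix = "port=" * rng.randint(0, 1)  # half the inputs look almost valid
        body = "".join(rng.choice(alphabet) for _ in range(rng.randint(0, 8)))
        data = prefix + body
        try:
            target(data)
        except ValueError:
            pass  # documented rejection of bad input
        except Exception as exc:  # anything else is a robustness bug
            crashes.append((data, exc))
    return crashes


crashes = fuzz(parse_port, runs=2000, rng=random.Random(7))
```

An empty crashes list means every rejected input was rejected through the documented ValueError path; any entry is a candidate bug with the input that triggered it.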
One of the challenges with these methods is the signal-to-noise
ratio of the errors. A fault is the hypothesized cause of an error. An error is an
incorrect state in the system that can lead to a failure, and a failure in one
component can act as a fault for another. Since they occur in a
cycle, the fault-error-failure cycle can produce many errors, from which the
ones that must be fixed to improve system resilience and reliability need to be
discerned. When these experiments are run for short durations, the number of
errors to investigate is usually low. Using automation to
continuously validate what matters during the experiment allows the detection
even of errors that are hard to find manually.
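One simple way to improve the signal-to-noise ratio is to group the raw errors by a signature and drop the ones that are just the injected fault itself. The sketch below assumes error records are dictionaries with 'component' and 'kind' keys; that shape, the triage function, and the sample records are assumptions for illustration only.

```python
from collections import Counter


def triage(errors, ignore=()):
    """Group raw error records by (component, kind) and rank them by frequency.

    errors: iterable of dicts with 'component' and 'kind' keys (assumed shape).
    ignore: kinds expected during the experiment, such as the injected fault itself.
    """
    counts = Counter(
        (e["component"], e["kind"]) for e in errors if e["kind"] not in ignore
    )
    return counts.most_common()


observed = [
    {"component": "gateway", "kind": "InjectedFault"},
    {"component": "orders", "kind": "Timeout"},
    {"component": "orders", "kind": "Timeout"},
    {"component": "billing", "kind": "NullReference"},
    {"component": "gateway", "kind": "InjectedFault"},
]
ranked = triage(observed, ignore={"InjectedFault"})
```

In this made-up sample the injected faults are filtered out, leaving the cascading timeouts in orders ranked ahead of the single billing error.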
Such automation can even be introduced into the software release
pipeline. This promotes a shift-left approach where the testing occurs
as early in the development and project timeline as when the code is written.
It follows the test-early-and-often principle, and the benefit is
the ability to troubleshoot the issues encountered by debugging.
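A fault-injection check that runs in the pipeline can be as small as a unit test that forces a dependency to fail. The sketch below uses the standard unittest.mock module; InventoryClient and can_order are hypothetical stand-ins for real application code.

```python
import unittest
from unittest import mock


class InventoryClient:
    """Hypothetical client for a downstream service."""

    def stock(self, item):
        raise NotImplementedError("would call the real service")


def can_order(client, item):
    """Code under test: treat an unreachable inventory service as 'not available'."""
    try:
        return client.stock(item) > 0
    except TimeoutError:
        return False


class CanOrderFaultTest(unittest.TestCase):
    def test_timeout_is_handled(self):
        client = InventoryClient()
        # Inject the fault: every call to stock() now times out.
        with mock.patch.object(client, "stock", side_effect=TimeoutError("injected")):
            self.assertFalse(can_order(client, "widget"))

    def test_healthy_dependency(self):
        client = InventoryClient()
        with mock.patch.object(client, "stock", return_value=3):
            self.assertTrue(can_order(client, "widget"))


# Run the tests programmatically, as a pipeline step might.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(CanOrderFaultTest)
outcome = unittest.TextTestRunner(verbosity=0).run(suite)
```

Because the fault is injected at build time, a failure here can be stepped through in a debugger on the developer's machine.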
The outcomes of fault injection testing are the
measurement and definition of a steady, healthy state for the system's
interoperability, the identification of the difference between the baseline state and the
anomalous state, and the documentation of the processes and observations needed to identify and
act on the results.
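The comparison of baseline and anomalous states can be sketched with a simple statistical threshold. This example assumes latency in milliseconds is the health metric and that three standard deviations is the tolerance; both choices, the function names, and the sample numbers are illustrative assumptions rather than recommendations.

```python
from statistics import mean, stdev


def steady_state(samples):
    """Summarize healthy-run latencies (ms) as a baseline mean and standard deviation."""
    return mean(samples), stdev(samples)


def is_anomalous(value, baseline, tolerance=3.0):
    """Flag a measurement more than `tolerance` standard deviations from the baseline mean."""
    avg, spread = baseline
    return abs(value - avg) > tolerance * spread


healthy_latencies = [100, 102, 98, 101, 99, 100, 103, 97]  # illustrative numbers only
baseline = steady_state(healthy_latencies)

# Measurements taken while a fault is being injected.
during_fault = [101, 250, 99]
flags = [is_anomalous(v, baseline) for v in during_fault]
```

With these numbers the baseline mean is 100 ms with a sample standard deviation of 2 ms, so only the 250 ms measurement is flagged. The same threshold is a natural starting value for an alert.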