Recently, I came across a situation where CI/CD pipelines
were making unintended changes to resources in the Azure public cloud. The
IaC was written in Terraform against the Azure provider. The symptom was that
when a code change was pushed through the pipeline, settings on unrelated
resources would fall off. This hurt the uptime of those resources, and business
continuity suffered until the settings were restored. It was also getting hard
to predict which resources would be affected, since the change being pushed did
not touch those resources at all. The team responsible for the IaC is referred
to here as the infrastructure team.
There is also some context that predates these symptoms.
First, the subscription where these resources were impacted had long been a
shared subscription and one of the first to be tried out. Consequently, it held
proofs of concept, multiple versions, and many stakeholders, some of whom had
contributor access and could update their own resources directly. The number of
resource groups, subnets and virtual networks had grown quite large, and a few
had been neglected. The resources most affected were the app services, and it
so happened that the application engineering team had started requesting
changes to them more often than ever before for an improvement they owned.
One specific example of a change that had accumulated in the
portal was virtual network integration for these resources. Whenever these
settings fell off, connectivity was disrupted, resulting in some downtime. While
this applied to outbound traffic from the resource, similar discrepancies
were noticed on the inbound side, where access restrictions were lost on the
resource. Since the inbound and outbound traffic settings were maintained by
the infrastructure team, they were supposed to be captured in the IaC. Some of
these definitions did appear in the IaC but, on closer inspection, turned out
to be incorrect or were not being enforced. Other settings were specific to a
single resource and closely tied to the code or container image deployed to it.
The application engineering team managed these.
Another source of errors was the Terraform
state. The state was being edited on its own, without corresponding changes to
either the resources in the portal or their definitions in the IaC. This was
done to get past conflicts found while validating or executing the IaC, but it
only produced a different set of conflicts when the pipeline ran. In some cases,
resources were even destroyed during the pipeline run. It is not wrong to edit
the state, but the edit should serve to keep it in sync with both the portal
and the IaC. One direction is to bring the state in sync with the portal first
and then propagate the changes back to the IaC. The other direction is to
change the IaC, write the change through the state, and then push it to the
resources in the portal. Non-prod and prod must each have their own set of IaC,
state and actual resources, and these sets must be kept separate.
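Since the most damaging outcome described above was resources being destroyed during a pipeline run, one possible safeguard is to inspect the plan before it is applied. The following is a minimal sketch, assuming the pipeline saves a plan and converts it with `terraform show -json`; the sample JSON, the resource addresses and the function name `find_destructive_changes` are hypothetical and only illustrate the shape of the check.

```python
import json

# Assumption: this text is a trimmed stand-in for the output of
# `terraform show -json <planfile>`; the resource addresses are hypothetical.
SAMPLE_PLAN_JSON = """
{
  "resource_changes": [
    {"address": "azurerm_linux_web_app.orders",
     "change": {"actions": ["update"]}},
    {"address": "azurerm_subnet.legacy",
     "change": {"actions": ["delete"]}},
    {"address": "azurerm_service_plan.shared",
     "change": {"actions": ["delete", "create"]}},
    {"address": "azurerm_resource_group.poc",
     "change": {"actions": ["no-op"]}}
  ]
}
"""


def find_destructive_changes(plan):
    """Return (address, actions) for each resource the plan would delete or replace."""
    flagged = []
    for resource_change in plan.get("resource_changes", []):
        actions = resource_change.get("change", {}).get("actions", [])
        # A replacement shows up as ["delete", "create"] or ["create", "delete"],
        # so checking for "delete" covers both plain deletes and replacements.
        if "delete" in actions:
            flagged.append((resource_change["address"], actions))
    return flagged


if __name__ == "__main__":
    plan = json.loads(SAMPLE_PLAN_JSON)
    for address, actions in find_destructive_changes(plan):
        print(f"DESTRUCTIVE: {address} {actions}")
```

A pipeline step could fail the run, or require a manual approval, whenever this list is not empty, so that a state edit that would otherwise destroy a resource is caught before apply.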
Lastly, the changes made to keep all three in sync
were often spread out over time and distributed among authors, which itself
became a source of errors and discrepancies. A baseline in which state, IaC
and actual resources agree is necessary before making incremental changes. It
is equally important to keep them in sync going forward. The best way to do
this is to enumerate all discrepancies, close the gap to establish the
baseline, and then put in place the process and practice that keep the three
from drifting out of sync.
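Enumerating discrepancies can be partly automated from the same plan output. The sketch below is an illustration under stated assumptions, not the method the team used: it takes a plan already parsed from `terraform show -json` and lists, per resource, the top-level attributes whose `before` and `after` values differ. The sample data and the names `changed_attributes` and `drift_report` are hypothetical.

```python
# Assumption: a trimmed stand-in for `terraform show -json <planfile>` output,
# already parsed into a dict; addresses and attribute values are hypothetical.
SAMPLE_PLAN = {
    "resource_changes": [
        {
            "address": "azurerm_linux_web_app.orders",
            "change": {
                "actions": ["update"],
                "before": {"name": "orders", "https_only": True,
                           "virtual_network_subnet_id": "example-subnet-id"},
                "after": {"name": "orders", "https_only": True,
                          "virtual_network_subnet_id": None},
            },
        },
        {
            "address": "azurerm_resource_group.poc",
            "change": {"actions": ["no-op"],
                       "before": {"name": "poc"}, "after": {"name": "poc"}},
        },
    ]
}


def changed_attributes(resource_change):
    """List top-level attribute names whose before/after values differ."""
    change = resource_change.get("change", {})
    # "before" is null for creates and "after" is null for deletes.
    before = change.get("before") or {}
    after = change.get("after") or {}
    # Note: values unknown until apply are absent from "after", so they are
    # reported here as changed even if they end up identical.
    keys = set(before) | set(after)
    return sorted(k for k in keys if before.get(k) != after.get(k))


def drift_report(plan):
    """Map each resource with pending changes to its actions and changed attributes."""
    report = {}
    for resource_change in plan.get("resource_changes", []):
        actions = resource_change.get("change", {}).get("actions", [])
        if actions in (["no-op"], ["read"]):
            continue  # nothing would be modified for this resource
        report[resource_change["address"]] = {
            "actions": actions,
            "attributes": changed_attributes(resource_change),
        }
    return report


if __name__ == "__main__":
    for address, details in drift_report(SAMPLE_PLAN).items():
        print(address, details["actions"], details["attributes"])
```

In the sample, the report shows the virtual network integration on the web app about to be cleared, which is the kind of setting that fell off in the situation described. An empty report for a plan run against an unchanged IaC is one reasonable definition of the baseline, with the caveat that resources created in the portal and never brought into the state will not appear in any plan.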