Cluster computing

Thursday, August 31, 2023

Recently, I came across a situation where CI/CD pipelines were making unintended changes to resources in the Azure public cloud. The IaC was written in Terraform against the Azure provider. The symptom was that when a code change was pushed through the pipeline, settings on unrelated resources would fall off. This hurt the uptime of those resources, and business continuity suffered until the settings were restored. It was also hard to tell in advance which resources would be affected, since the author's change had nothing to do with them. The team responsible for the IaC is referred to here as the infrastructure team.
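One way to see ahead of time which resources a change would touch is to inspect the plan before it is applied. The sketch below is only an illustration, not what the team used: it assumes the pipeline has already produced a JSON plan with `terraform plan -out=tfplan` followed by `terraform show -json tfplan > plan.json`, and the function name and the resource addresses in the sample are hypothetical.

```python
import json


def load_plan(path):
    """Load a plan produced by `terraform show -json tfplan`."""
    with open(path) as f:
        return json.load(f)


def out_of_scope_changes(plan, expected_prefixes):
    """Return (address, actions) pairs for planned changes outside the
    set of resources the author intended to modify."""
    unexpected = []
    for rc in plan.get("resource_changes", []):
        actions = rc.get("change", {}).get("actions", [])
        # "no-op" and "read" leave the real resource untouched.
        if not set(actions) - {"no-op", "read"}:
            continue
        address = rc.get("address", "")
        if not any(address.startswith(p) for p in expected_prefixes):
            unexpected.append((address, actions))
    return unexpected


# Hypothetical plan fragment: the author only meant to touch storage.
sample_plan = {
    "resource_changes": [
        {"address": "azurerm_storage_account.logs",
         "change": {"actions": ["update"]}},
        {"address": "azurerm_linux_web_app.api",
         "change": {"actions": ["update"]}},
        {"address": "azurerm_linux_web_app.web",
         "change": {"actions": ["no-op"]}},
    ]
}

for address, actions in out_of_scope_changes(
        sample_plan, ["azurerm_storage_account."]):
    print(f"unexpected {'/'.join(actions)}: {address}")
```

A pipeline step could fail the run whenever this list is non-empty, so that an update to an unrelated app service is reviewed before it is applied.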

Some context preceded these symptoms. The subscription where the resources were impacted had long been a shared subscription and one of the first to be tried out. Consequently, it held proofs of concept, multiple versions of resources, and many stakeholders, some of whom had contributor access and could update their own resources directly. The number of resource groups, subnets and virtual networks had grown quite large, and a few of them had been neglected. The resources most affected were the app services, and it so happened that the application engineering team had started requiring changes to them more often than ever before for an improvement they owned.

One specific example of changes that had accumulated in the portal was virtual network integration for these resources. Whenever these settings fell off, connectivity was disrupted, resulting in downtime. While this applied to outbound traffic from the resource, similar discrepancies were noticed on the inbound side, where access restrictions on the resource were lost. Since both the inbound and outbound traffic settings were maintained by the infrastructure team, they were supposed to be captured in the IaC. Some of these definitions did appear in the IaC, but on closer inspection they turned out to be improperly defined or not actually enforced. Other settings were specific to a single resource and closely tied to the code or container image deployed to it. The application engineering team managed these.

Another source of errors was the Terraform state. The state was being maintained and even updated independently of the resources in the portal and their definitions in the IaC, without corresponding changes in either. This was done to get past conflicts found while planning or executing the IaC, but it produced other conflicts when the pipeline ran. As a result, resources were even destroyed during pipeline execution. Editing the state is not wrong in itself, but it should be done only to bring the state in sync with both the portal and the IaC. One direction is to sync the state with the portal first and then propagate the changes back to the IaC. The other direction is to change the IaC, write those changes through the state, and then push them to the resources in the portal. Non-prod and prod must each have their own IaC, state and actual resources, and these sets must be kept separate.
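Destroyed resources of the kind described above can be caught before apply by scanning the plan for delete actions. This is a minimal sketch under the same assumption of a JSON plan exported with `terraform show -json`; the helper names and the sample addresses are hypothetical, and the gate is one possible safeguard rather than the fix the team adopted.

```python
def destructive_changes(plan):
    """Return addresses the plan would delete or replace.

    A replacement appears in the JSON plan as the action pair
    ["delete", "create"] or ["create", "delete"], so checking for
    "delete" covers both outright deletion and replacement.
    """
    return [
        rc["address"]
        for rc in plan.get("resource_changes", [])
        if "delete" in rc.get("change", {}).get("actions", [])
    ]


def gate_exit_code(plan):
    """Return 1 if the plan destroys anything, else 0, for use as a
    pipeline step's exit status."""
    return 1 if destructive_changes(plan) else 0


# Hypothetical plan fragment with one replacement and one deletion.
sample_plan = {
    "resource_changes": [
        {"address": "azurerm_linux_web_app.api",
         "change": {"actions": ["delete", "create"]}},
        {"address": "azurerm_subnet.apps",
         "change": {"actions": ["update"]}},
        {"address": "azurerm_service_plan.old",
         "change": {"actions": ["delete"]}},
    ]
}

for address in destructive_changes(sample_plan):
    print(f"plan would destroy or replace: {address}")
```

Requiring an explicit approval whenever `gate_exit_code` returns 1 keeps a stale or hand-edited state from silently removing resources.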

Lastly, the changes made to keep all three in sync were often spread out over time and across authors, which introduced further errors and discrepancies. A baseline in which the state, the IaC and the actual resources agree is necessary before incremental changes can be made safely, and the three must be kept in sync from then on. The best way to get there is to enumerate every discrepancy, close the gap to establish the baseline, and then put a process and practice in place that prevents them from drifting apart again.
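Enumerating discrepancies can be partly automated. The sketch below assumes a Terraform version whose JSON plan output includes a `resource_drift` list, for example a plan created with `terraform plan -refresh-only -out=tfplan` and exported with `terraform show -json tfplan`. The function name and the sample resource are hypothetical, and only top-level attributes are compared.

```python
def drift_report(plan):
    """Map each drifted resource address to the sorted list of
    top-level attributes whose values differ between the recorded
    state ("before") and the real resource ("after")."""
    report = {}
    for entry in plan.get("resource_drift", []):
        change = entry.get("change", {})
        # Either side can be null, e.g. when the resource was
        # deleted outside Terraform.
        before = change.get("before") or {}
        after = change.get("after") or {}
        changed = sorted(
            key
            for key in set(before) | set(after)
            if before.get(key) != after.get(key)
        )
        report[entry["address"]] = changed
    return report


# Hypothetical drift entry: virtual network integration was removed
# from an app service outside of the IaC.
sample_plan = {
    "resource_drift": [
        {
            "address": "azurerm_linux_web_app.api",
            "change": {
                "before": {"https_only": True,
                           "virtual_network_subnet_id": "subnet-a"},
                "after": {"https_only": True,
                          "virtual_network_subnet_id": None},
            },
        }
    ]
}

for address, attributes in drift_report(sample_plan).items():
    print(address, attributes)
```

An empty report across a subscription is one practical definition of the baseline, and running the same check on a schedule is one way to notice when the three fall out of sync again.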

References: Earlier articles on IaC shortcomings and resolutions: IacResolutionsPart21.docx
