Thursday, August 11, 2022

 This is a continuation of a series of articles on hosting solutions and services on the Azure public cloud, with the most recent discussion on multitenancy here. This article continues the discussion of troubleshooting the Azure Arc instance, with a few cases for data services. 

The troubleshooting of Azure Arc data services is similar to that for the resource bridge.  

Logs can be collected for further investigation, and this is probably the foremost resolution technique. 
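
As a rough sketch, the data controller's logs can be collected with the az CLI. The namespace and target folder below are placeholders, and the exact flag names can vary slightly across CLI versions, so check az arcdata dc debug copy-logs --help before running:

# Copy logs from the data controller namespace to a local folder for offline analysis.
# --exclude-dumps and --skip-compress are optional and can be dropped if dumps or a
# compressed archive are wanted.
az arcdata dc debug copy-logs \
  --k8s-namespace arc \
  --use-k8s \
  --exclude-dumps \
  --skip-compress \
  --target-folder ./arc-logs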

Errors pertaining to logs upload can stem from missing Log Analytics workspace credentials. This is generally the case for Azure Arc data controllers that are deployed in the direct connectivity mode using kubectl, and the logs upload error message reads “spec.settings.azure.autoUploadLogs is true, but failed to get log-workspace-secret secret.” Creating a secret with the Log Analytics workspace credentials, containing the WorkspaceID and SharedAccessKey, resolves this error. 
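
A minimal sketch of creating that secret with kubectl follows. The namespace and credential values are placeholders, and the data key names shown here (workspaceId and primaryKey) are an assumption that mirrors the workspace fields mentioned above; verify the exact key names expected by the deployed data controller version:

# Create the Log Analytics workspace secret that autoUploadLogs looks for.
# Replace the namespace, workspace ID, and shared access (primary) key with real values.
kubectl create secret generic log-workspace-secret \
  --namespace arc \
  --from-literal=workspaceId='<log-analytics-workspace-id>' \
  --from-literal=primaryKey='<log-analytics-shared-access-key>'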

Similarly, metrics upload might also cause errors in the direct connected mode. The permissions needed for the managed service identity (MSI) must be properly granted, otherwise the error message will indicate “AuthorizationFailed”. This can be resolved by retrieving the MSI for the Azure Arc data controller extension and granting it the required roles, such as Monitoring Metrics Publisher. Automatic upload of metrics can then be set up with a command such as: az arcdata dc update --name arcdc --resource-group <myresourcegroup> --auto-upload-metrics true. 
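
One way to do the role grant, sketched below with placeholder resource, cluster, and extension names, is to read the extension's managed identity principal id and then assign the role at the resource group scope:

# Look up the managed identity of the Arc data services extension
# (the extension name here is a placeholder; use the name it was deployed with).
MSI_OBJECT_ID=$(az k8s-extension show \
  --resource-group myresourcegroup \
  --cluster-name myconnectedcluster \
  --cluster-type connectedClusters \
  --name arc-data-services \
  --query identity.principalId -o tsv)

# Grant the role required for metrics upload at the resource group scope.
az role assignment create \
  --assignee "$MSI_OBJECT_ID" \
  --role "Monitoring Metrics Publisher" \
  --scope "/subscriptions/<subscription-id>/resourceGroups/myresourcegroup"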

Usage information is different from logs and metrics, but the technique for resolving errors with usage uploads is the same as for metrics. The typical error is again AuthorizationFailed, even though the permissions for uploading usage information are granted automatically when the Azure Arc data controller is set up in the direct connected mode. Resolving the permissions issue requires retrieving the MSI and granting the required roles. 
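
Assuming the same pattern as for metrics, automatic usage upload can be toggled on the data controller once the roles are in place; a sketch with placeholder names:

# Enable automatic upload of usage information in the direct connected mode.
az arcdata dc update \
  --name arcdc \
  --resource-group myresourcegroup \
  --auto-upload-usage true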

Errors pertaining to upgrades usually come from incorrect image tags. The error message encountered reads as follows: “Failed to await bootstrap job complete after retrying for two minutes.” The bootstrap job status reflects “ErrImagePull” and the pod description reads “Failed to pull image”. The version log will have the correct image tag. Running the upgrade command with the correct image tag resolves this error. 
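
A sketch of the diagnosis and fix, with placeholder namespace, data controller, and version values; list-upgrades reports the valid versions, and the upgrade is then rerun with one of them:

# Confirm the ErrImagePull / "Failed to pull image" symptom on the bootstrap job pod.
kubectl get pods --namespace arc
kubectl describe pod <bootstrap-job-pod-name> --namespace arc

# List the image versions (tags) that are actually available for upgrade.
az arcdata dc list-upgrades --k8s-namespace arc --use-k8s

# Re-run the upgrade with a valid desired version.
az arcdata dc upgrade \
  --name arcdc \
  --resource-group myresourcegroup \
  --desired-version <valid-image-tag>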

If there are no errors but the upgrade job runs for longer than fifteen minutes, it is likely that there was a connection failure to the repository or registry.  When we view the pod's description, it will usually say “failed to resolve reference” in this case. If the image was deployed from a private registry, the yaml file used for the upgrade might be referring to mcr.microsoft.com instead of the private registry. Correctly specifying the registry and the repository in the yaml file will resolve this error. 
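
As an illustration, the same correction can be applied directly to the deployed custom resource. The spec.docker field names below are an assumption based on common data controller specs, and the registry, repository, tag, and names are placeholders; verify the field layout against the deployed resource before patching:

# Point the data controller image settings at the private registry instead of mcr.microsoft.com.
kubectl patch datacontroller <datacontroller-name> --namespace arc --type merge \
  --patch '{"spec":{"docker":{"registry":"myprivateregistry.example.com","repository":"arcdata","imageTag":"<valid-image-tag>"}}}'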

Upgrade jobs can also run long if there are not enough resources.  A pod that shows only some of its containers as ready is a good indicator of this problem. Viewing the events or the logs can point to the root cause as insufficient cpu or memory. More nodes can be added to the Kubernetes cluster, or more resources can be assigned to the existing nodes, to overcome this error. 
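
The kubectl commands below are usually enough to confirm the resource shortfall before scaling the cluster; the namespace and pod names are placeholders:

# Find pods where only some containers are ready (for example 1/3 in the READY column).
kubectl get pods --namespace arc

# Events and logs usually name the constrained resource (cpu or memory).
kubectl describe pod <pod-name> --namespace arc
kubectl get events --namespace arc --sort-by=.metadata.creationTimestamp

# Check node-level headroom (requires the metrics server to be installed).
kubectl top nodes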


#codingexercise

// Appends the nodes of a binary tree to the list in in-order (left, node, right) sequence.
void ToInOrderList(Node root, ref List<Node> all)
{
    if (root == null) return;
    ToInOrderList(root.left, ref all);
    all.Add(root);
    ToInOrderList(root.right, ref all);
}
