Thursday, July 13, 2023

Path-based routing and rewrite rule sets in Application Gateway.

 


One of the frequently encountered issues with App Service is incorrect redirect URLs. This shows up either as the App Service URL being exposed in the browser after a redirection, or as broken authentication and authorization because the redirect carries the wrong hostname. The root cause of this symptom is a setup in which the application gateway overrides the hostname it sends towards App Service, so the hostname the backend sees differs from the one the browser uses. The hostname is typically overridden to the default "azurewebsites.net" domain: this happens when the application gateway's HTTP settings either have "Pick host name from backend address" enabled or have "Override with specific domain name" set to a value different from the one in the browser's request.

The typical resolution for this symptom is to rewrite the Location header so that it carries the application gateway's domain name as the hostname. A rewrite rule is added to do just that, with a condition that evaluates whether the Location header in the response contains azurewebsites.net.

When Azure App Service sends a redirection response, it uses the same hostname in the Location header as the one in the request it received from the application gateway. This makes the client redirect to contoso.azurewebsites.net/path2 instead of going through the application gateway, which is not desirable. Setting the Location header to the application gateway's domain name resolves this situation.

The steps for replacing the hostname are as follows:

First, a rewrite rule is written with a condition that evaluates whether the Location header in the response contains azurewebsites.net. The pattern can be specified in PCRE format as (https?):\/\/.*azurewebsites\.net(.*)$

Second, the Location header is rewritten to carry the application gateway's hostname. This is done by entering {http_resp_Location_1}://contoso.com{http_resp_Location_2} as the header value. Alternatively, the server variable 'host' can be used to set the hostname to match the original request.

This manifests as an If-Then condition-action pair. The If section specifies the type of variable to check (an HTTP header), the header type (Response), the header name (a Common Header), the common header itself (Location), case-sensitivity (off), the operator (equals), and the pattern to match, (https?):\/\/.*azurewebsites\.net(.*)$. The Then section specifies the action type (Set), the header type (Response), the header name (Common Header), the common header itself (Location), and the header value, {http_resp_Location_1}://contoso.com{http_resp_Location_2}.
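As a quick sanity check of the pattern and the capture-group references, here is a minimal Python sketch; the sample Location value and contoso.com are illustrative placeholders, and groups 1 and 2 correspond to {http_resp_Location_1} and {http_resp_Location_2}.

```
import re

# Same pattern as the rewrite condition: group 1 is the scheme, group 2 is the path and query.
pattern = re.compile(r"(https?):\/\/.*azurewebsites\.net(.*)$", re.IGNORECASE)

location = "https://contoso.azurewebsites.net/path2?x=1"   # hypothetical backend redirect
match = pattern.match(location)
if match:
    rewritten = f"{match.group(1)}://contoso.com{match.group(2)}"
    print(rewritten)   # https://contoso.com/path2?x=1
```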

This completes the rewrite. Care must be taken to ensure the appropriate use of rewrite versus redirect.

A rewrite changes the URL before the application gateway sends the request to the backend; the change is hidden from the user, so the URL in the browser does not change. A redirect sends a new URL back to the client as the response, and the URL the user sees in the browser updates to that new URL.

Wednesday, July 12, 2023

 

Overwatch implementation involves cluster configuration, Overwatch configuration, and running the jobs. The steps outlined in this article are a guide to producing cluster utilization reports from an Azure Databricks instance. It starts with some concepts as an overview and the context in which the steps are performed, followed by the steps themselves, and closes with running and viewing the Overwatch jobs and reports. This is a continuation of the previous article on similar topics.

Overwatch can be thought of as an analytics project over Databricks. It collects data from multiple sources such as APIs and cluster logs, enriches and aggregates that data, and comes at little or no cost. The audit logs and cluster logs are the primary data sources, and the cluster logs are crucial for cluster utilization data. A dedicated storage account is required for them, with a time-to-live enabled so that retention does not grow and incur unnecessary costs. The cluster logs must not be stored on DBFS directly but can reside on an external store. When there are several Databricks workspaces, numbered say 1 to N, each workspace pushes its diagnostic data to Event Hubs and writes its cluster logs to the per-region dedicated storage account. One of the Databricks workspaces is chosen to deploy Overwatch. The Overwatch jobs read the storage account and the Event Hub diagnostic data to create bronze, silver, and gold data pipelines, which can be read from anywhere for the reports.

The steps involved in the Overwatch configuration include the following:

1. Create a storage account.

2. Create an Azure Event Hub namespace.

3. Store the Event Hub connection string in a Key Vault.

4. Enable diagnostic settings on the Databricks instance to send data to the Event Hub.

5. Store the Databricks PAT token in the Key Vault.

6. Create a secret scope.

7. Use the Databricks Overwatch notebook from the link and replace the parameters.

8. Configure the storage account within the workspace.

9. Create the cluster, add the Maven libraries com.databricks.labs:overwatch and com.microsoft.azure:azure-eventhubs-spark to the cluster, and run the Overwatch notebook.

There are a few elaborations to the steps above; otherwise they are routine. All the Azure resources can be created with default settings. The connection string for the Event Hub is stored in the Key Vault as a secret. The personal access token (PAT) created from Databricks is also stored in the Key Vault as a secret; the PAT token is created from the user settings of the Azure Databricks instance. A scope is created to import the token back from the Key Vault into Databricks. A cluster is created to run the Databricks job, and the two Maven libraries are added to the cluster's libraries. The logging tab of the advanced options in the cluster's configuration allows us to specify a DBFS location backed by the external storage account created to store the cluster logs. The diagnostic settings of the Azure Databricks resource in the Azure portal allow the diagnostic data to be sent to the Event Hub.
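For reference, the same log destination can also be expressed in a cluster's JSON definition through the Clusters API's cluster_log_conf field; the mount path below is a hypothetical example of the external storage account mounted into DBFS.

```
# Fragment of a cluster definition pointing cluster logs at the dedicated storage account.
cluster_definition_fragment = {
    "cluster_log_conf": {
        "dbfs": {
            "destination": "dbfs:/mnt/cluster-logs"   # hypothetical mount of the external storage account
        }
    }
}
```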

The notebook that defines the Databricks jobs for Overwatch takes the above configuration as parameters, including the DBFS location that is the target for the cluster logs, the Extract-Transform-Load (ETL) database name which stores the tables used for the dashboard, the consumer database name, the secret scope, the secret key for the PAT token, the secret key for the Event Hub, the topic name in the Event Hub, the primordial date from which Overwatch starts, the maximum number of days to bound the data, and the scopes.
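A rough sketch of how those parameters might be gathered before being handed to the notebook; the key names and values below are illustrative placeholders rather than the exact identifiers the Overwatch deployment notebook expects.

```
# Illustrative placeholders only; substitute values from your own deployment.
overwatch_parameters = {
    "cluster_log_target": "dbfs:/mnt/cluster-logs",        # dbfs location for the cluster logs
    "etl_database_name": "overwatch_etl",                  # holds the tables behind the dashboards
    "consumer_database_name": "overwatch",                 # database exposed to report consumers
    "secret_scope": "overwatch-scope",
    "secret_key_pat": "databricks-pat",                    # secret key for the PAT token
    "secret_key_eventhub": "eventhub-connection-string",   # secret key for the Event Hub
    "eventhub_topic_name": "databricks-diagnostics",
    "primordial_date": "2023-07-01",                       # date from which Overwatch starts
    "max_days": 30,                                        # bound on the data pulled per run
    "scopes": ["audit", "clusters", "sparkEvents", "jobs"],
}
```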

Overwatch provides both summary and drill-down options to understand the operations of a Databricks instance. It has two primary modes: Historical and Real-time. It coalesces all the logs produced by Spark and Databricks via a periodic job run and then enriches this data through various API calls. The jobs from the notebook create the configuration string with OverwatchParams. Most functionality can be realized by instantiating the workspace object with these OverwatchParams. It provides two tables, dbuCostDetails and instanceDetails, which can then be used for reports.
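As a sketch of how a report might start, a Databricks notebook can read the consumer database directly; the database name "overwatch" below is an assumption, and spark and display are the notebook-provided globals.

```
# Hypothetical report query from a Databricks notebook against the Overwatch consumer database.
dbu_costs = spark.sql("""
    SELECT *
    FROM overwatch.dbucostdetails
""")
display(dbu_costs)
```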

Tuesday, July 11, 2023

The previous article discussed Azure Databricks and compute cluster usage monitoring. Several approaches were suggested.

First, the user interface provides a way to view usage per cluster by navigating to the CPU utilization metrics reported for each compute resource. The metric is based on the total CPU seconds' cost and is averaged over the chosen time interval, which defaults to an hour.

Second, there is a PowerShell scripting library that allows the REST APIs to be used to query the Databricks workspace directly. The suggested access token for this was the user's personal access token (PAT), and the cluster's id is passed as a request parameter. The response of this API is on par with the information from the user interface and provides a scriptable approach.

Since this might not provide visibility across the Databricks instance, the use of REST APIs directly against the Azure resource was suggested. Authentication for this API requires a token issued by the Microsoft login servers. While this provides properties, metadata, and general manageability for the resource, the information pertaining to cluster usage is rather limited. Still, it is possible to query the last used time as an indication of whether the compute can be shut down for a workspace user.

Databricks itself provides a set of REST APIs, and these give information on the last used time; if the duration since the last usage exceeds a certain threshold, the compute can be stopped. This threshold is the auto-termination time, and values on the order of minutes are usually sufficient. The default value is 72 hours, which may be quite extravagant.
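A minimal sketch of that check against the REST API, assuming a placeholder workspace URL, PAT token, and idle threshold: clusters/list reports the running clusters (including a last_activity_time field, as seen in the sample response from the previous article) and clusters/delete terminates a cluster without permanently removing it.

```
import time
import requests

WORKSPACE_URL = "https://adb-1234567890123456.7.azuredatabricks.net"   # placeholder workspace URL
TOKEN = "dapi..."                                                       # placeholder PAT token
IDLE_THRESHOLD_MS = 30 * 60 * 1000                                      # stop clusters idle for 30+ minutes

headers = {"Authorization": f"Bearer {TOKEN}"}
clusters = requests.get(f"{WORKSPACE_URL}/api/2.0/clusters/list", headers=headers).json()

now_ms = int(time.time() * 1000)
for c in clusters.get("clusters", []):
    last_activity = c.get("last_activity_time", now_ms)    # treat a missing field as "active now"
    if c.get("state") == "RUNNING" and now_ms - last_activity > IDLE_THRESHOLD_MS:
        # clusters/delete terminates (stops) the cluster; it does not permanently remove it.
        requests.post(f"{WORKSPACE_URL}/api/2.0/clusters/delete",
                      headers=headers, json={"cluster_id": c["cluster_id"]})
```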

Third, there is an SDK from Databricks that can be used with Python to make the above step easier. This option makes it tidy to create charts and graphs from the cluster usage and works well for reporting purposes. Samples can include listing the budgets, downloading billable usage, or directly querying the clusters:

from databricks.sdk import AccountClient

# Account-level client; credentials come from the environment or a configuration profile.
a = AccountClient()

# List the budgets configured for the account.
budgets = a.budgets.list()

# Download billable usage for the given month range.
a.billable_usage.download(start_month="2023-06", end_month="2023-07")
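To query the clusters directly, the workspace-level client can be used in the same way; a minimal sketch, assuming credentials are picked up from the environment or a configuration profile just like the account client above:

```
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()

# List clusters with their state and auto-termination setting for a quick usage overview.
for c in w.clusters.list():
    print(c.cluster_name, c.state, c.autotermination_minutes)
```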

One of the benefits of using the SDK is that we are no longer required to set up the Unity Catalog to query the system tables ourselves. The SDK can provide authoritative information.

Similarly, another convenient way to query cluster usage is with Overwatch. It provides both summary and drill-down options to understand the operations of a Databricks instance. It has two primary modes: Historical and Real-time. It coalesces all the logs produced by Spark and Databricks via a periodic job run and then enriches this data through various API calls. Most functionality can be realized by instantiating the workspace object with suitable OverwatchParams. It provides two tables, dbuCostDetails and instanceDetails.

 


Monday, July 10, 2023

 

Azure Databricks Cluster usage report.

1.       From UI:

a.       Click compute in the sidebar.

b.       Select the compute resource to view metrics for

c.       Click the metrics tab.

d.       Refer to the CPU utilization cluster metrics. This is based on total CPU seconds’ cost. The metric is averaged out based on which time interval is displayed in the chart. The default time interval is the last hour.

e.       There is also a GPU metric chart. This is based on the percentage of GPU utilization also averaged out for the chosen time interval.

2.       From PowerShell:

a.       Install-Module -Name DatabricksPS

b.       RESPONSE=$(curl http://localhost:50342/oauth2/token --data "resource=https://management.azure.com/" -H Metadata:true -s)

c.       ACCESS_TOKEN=$(echo $RESPONSE | python -c 'import sys, json; print (json.load(sys.stdin)["access_token"])')

d.       accesstoken=$ACCESS_TOKEN

e.       $apiUrl="https://management.azure.com/subscriptions/656e67c6-f810-4ea6-8b89-636dd0b6774c/resourceGroups/rg-temp/providers/Microsoft.Databricks/workspaces/wks-rg-temp-1?api-version=2023-02-01"

f.        $headers=@{"Authorization"= "Bearer ${accesstoken}"}

g.       Invoke-RestMethod -Method "GET" -Uri $apiUrl -headers $headers

Sample Response:

```                                   

properties : @{managedResourceGroupId=/subscriptions/656e67c6-f810-4ea6-8b89-636dd0b6774c/resourceGroups/dbw-rg-temp-1; parameters=;

             provisioningState=Succeeded; authorizations=System.Object[]; createdBy=; updatedBy=; workspaceId=8965842579407484;

             workspaceUrl=adb-8965842579407484.4.azuredatabricks.net; createdDateTime=7/9/2023 8:14:11 PM}

id         : /subscriptions/656e67c6-f810-4ea6-8b89-636dd0b6774c/resourceGroups/rg-temp/providers/Microsoft.Databricks/workspaces/wks-rg-temp-1

name       : wks-rg-temp-1

type       : Microsoft.Databricks/workspaces

sku        : @{name=premium}

location   : centralus

tags       :

```

h.       $accesstoken=”dapie12c.....b3a2-2”

i.         $headers=@{"Authorization"= "Bearer ${accesstoken}"}

j.         $apiUrl="https://adb-8965...484.4.azuredatabricks.net/api/2.0/clusters/get" # curl --request GET "${apiUrl}"    --header "Authorization: Bearer ${accesstoken}"     --data '{ "cluster_id": "1234-567890-a12bcde3" }'

 

Sample Response:

PS /home/ravi> Invoke-RestMethod -Method "GET" -Uri $apiUrl -headers $headers -Body '{ "cluster_id": "0709-203735-epaybeni" }'

 

cluster_id                   : 0709-203735-epaybeni

creator_user_name            : ravibeta@hotmail.com

driver                       : @{private_ip=10.139.64.4; public_dns=40.113.230.98; node_id=6a94d151f85c48f691ecfa2b501ddb8c;

                               instance_id=9f3e9e6cbf344608942f28e5db5c22a2; start_timestamp=1688935207640; node_attributes=; host_private_ip=10.139.0.4}

spark_context_id             : 2865750645175411723

driver_healthy               : True

jdbc_port                    : 10000

cluster_name                 : Ravi Rajamani's Personal Compute Cluster

spark_version                : 13.2.x-cpu-ml-scala2.12

spark_conf                   : @{spark.databricks.delta.preview.enabled=true; spark.databricks.cluster.profile=singleNode; spark.master=local[*, 4]}

azure_attributes             : @{first_on_demand=1; availability=ON_DEMAND_AZURE; spot_bid_max_price=-1}

node_type_id                 : Standard_DS3_v2

driver_node_type_id          : Standard_DS3_v2

custom_tags                  : @{ResourceClass=SingleNode}

autotermination_minutes      : 4320

enable_elastic_disk          : True

disk_spec                    :

cluster_source               : UI

single_user_name             : ravibeta@hotmail.com

policy_id                    : 00164B5BAEB11244

enable_local_disk_encryption : False

instance_source              : @{node_type_id=Standard_DS3_v2}

driver_instance_source       : @{node_type_id=Standard_DS3_v2}

data_security_mode           : LEGACY_SINGLE_USER_STANDARD

runtime_engine               : STANDARD

effective_spark_version      : 13.2.x-cpu-ml-scala2.12

state                        : RUNNING

state_message                :

start_time                   : 1688935055795

last_state_loss_time         : 0

last_activity_time           : 1688935483841

last_restarted_time          : 1688935294611

num_workers                  : 0

cluster_memory_mb            : 14336

cluster_cores                : 4

default_tags                 : @{Vendor=Databricks; Creator=ravibeta@hotmail.com; ClusterName=Ravi Rajamani's Personal Compute Cluster;

                               ClusterId=0709-203735-epaybeni}

init_scripts_safe_mode       : False

 

k.       Set-DatabricksEnvironment -AccessToken $accesstoken -ApiRootUrl "https://adb-8965...484.4.azuredatabricks.net"

l.         Get-DatabricksCluster | Stop-DatabricksCluster

The above uses PAT tokens. It can also use Azure AD as follows:

PS /home/ravi> $credUser = Get-Credential                                                                              

 

PowerShell credential request

Enter your credentials.

User: ravibeta@hotmail.com

Password for user ravibeta@hotmail.com: ****************

 

PS /home/ravi> $tenantId = "1f4c33e1-e960-43bf-a135-6db8b82b6885"; $subscriptionId = "656e67c6-f810-4ea6-8b89-636dd0b6774c";

PS /home/ravi> $resourceGroupName = "rg-temp"                                                                              

PS /home/ravi> $workspaceName="ravibeta@hotmail.com"

PS /home/ravi> $azureResourceId="/subscriptions/656e67c6-f810-4ea6-8b89-636dd0b6774c/resourceGroups/rg-temp/providers/Microsoft.Databricks/workspaces/wks-rg-temp-1"

PS /home/ravi> $workspaceName="wks-rg-temp-1"

PS /home/ravi> $clientId="50996fd9-da74-4f41-b262-490d074bc807"

PS /home/ravi> $apiUrl="https://adb-8965842579407484.4.azuredatabricks.net/"

PS /home/ravi> Set-DatabricksEnvironment -ClientID $clientId -Credential $credUser -AzureResourceID $azureResourceId -TenantID $tenantId -ApiRootUrl $apiUrl

Connect-AzAccount -UseDeviceAuthentication will not be sufficient for the above cmdlet.

The same credential can be used with more than one environment.

Reference: https://learn.microsoft.com/en-us/azure/databricks/administration-guide/account-settings/usage-detail-tags

Sunday, July 9, 2023

 

Application of the MicrosoftML rxFastTrees algorithm to insurance payment validations and predictions:

Logistic regression is a well-known statistical technique used to model binary outcomes. It can be applied to detect root causes of payment errors. It uses statistical measures, is highly flexible, takes any kind of input, and supports different analytical tasks. This regression dampens the effects of extreme values and evaluates several factors that affect a pair of outcomes.

Logistic regression differs from other regression techniques in its use of statistical measures. Regression in general is very useful for calculating a linear relationship between a dependent and an independent variable and then using that relationship for prediction. Errors demonstrate elongated scatter plots in specific categories. Even when the errors come with different error details in the same category, they can be plotted with correlation. This technique is suitable for specific error categories from an account.

One advantage of logistic regression is that the algorithm is highly flexible, takes any kind of input, and supports several different analytical tasks:

-          Use demographics to make predictions about outcomes, such as the probability of defaulting on payments.

-          Explore and weight the factors that contribute to a result. For example, find the factors that influence customers to make a repeat late payment.

-          Classify claims, payments, or other objects that have many attributes.

The MicrosoftML rxFastTrees algorithm is another example.

 

The gradient boosting behind rxFastTrees can be used with several loss functions, including the squared loss function.

The algorithm for least squares regression can be written as:

1. Set the initial approximation F_0(x) to the mean of the target values y.

2. For each of a set of successive increments or boosts m = 1, ..., M, each based on the preceding iterations, do steps 3 to 5.

3. Calculate the new residuals r_i = y_i - F_{m-1}(x_i).

4. Find the line of search by fitting a base learner h_m(x) to the residuals and choosing the step rho_m that minimizes the aggregated squared residuals sum_i (r_i - rho * h_m(x_i))^2.

5. Perform the boost along the line of search: F_m(x) = F_{m-1}(x) + rho_m * h_m(x).

6. Repeat steps 3, 4, and 5 for each boost in step 2.
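A minimal Python sketch of these steps, using scikit-learn regression trees as the base learners rather than the rxFastTrees implementation itself; the data, tree depth, and learning rate are arbitrary illustrative choices.

```
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def ls_boost(X, y, n_boosts=100, learning_rate=0.1, max_depth=3):
    """Least-squares gradient boosting following steps 1-6 above."""
    f0 = y.mean()                        # step 1: initial approximation
    F = np.full(len(y), f0)
    trees = []
    for _ in range(n_boosts):            # step 2: successive boosts
        residuals = y - F                # step 3: new residuals
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, residuals)  # step 4: line of search
        F += learning_rate * tree.predict(X)                                 # step 5: boost along it
        trees.append(tree)               # step 6: repeat
    return f0, trees

def predict(f0, trees, X, learning_rate=0.1):
    return f0 + learning_rate * sum(t.predict(X) for t in trees)

# Tiny illustrative example: recover a noisy quadratic.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 1))
y = 3 * X[:, 0] ** 2 + rng.normal(scale=0.1, size=200)
f0, trees = ls_boost(X, y)
print(predict(f0, trees, X[:5]))
```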


Saturday, July 8, 2023

 

Some methods of organization for large-scale Infrastructure-as-Code deployments.

The purpose of IaC is to provide dynamic, reliable, and repeatable infrastructure suitable for cases where manual approaches and management practices cannot keep up. When automation grows to the point of becoming a cloud-based service responsible for deploying cloud resources and stamps that provision other services, which are diverse, consumer-facing, generally available public cloud services, some learnings can be called out that apply universally across a large spectrum of industry clouds.

A service that deploys other services must accept IaC deployment logic with templates, intrinsics, and deterministic execution, much like any other workflow management system. This helps determine the order in which tasks run and how they are retried. The tasks are self-described. The automation consists of a scheduler to trigger scheduled workflows and submit tasks to the executor, an executor to run the tasks, a web server for a management interface, a folder for the directed acyclic graphs representing the deployment logic artifacts, and a metadata database to store state. The workflows don't restrict what can be specified as a task, which can be an Operator (a predefined task using, say, Python), a Sensor (entirely about waiting for an external event to happen), or a custom task specified via a Python function decorated with @task.
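A minimal sketch of what such a self-described task graph can look like, using Airflow-style TaskFlow decorators (Airflow 2.x) as one example of a workflow manager with this shape; the DAG name, task bodies, and artifact names are hypothetical.

```
from datetime import datetime
from airflow.decorators import dag, task

@dag(schedule="@daily", start_date=datetime(2023, 7, 1), catchup=False)
def deploy_stamp():
    @task
    def render_templates():
        # Hypothetical step: expand IaC templates and intrinsics with parameters.
        return ["network.json", "appservice.json"]

    @task
    def submit_deployment(artifacts):
        # Hypothetical step: hand the rendered artifacts to the IaC provider;
        # ordering and retries are handled by the scheduler and executor.
        print(f"deploying {artifacts}")

    submit_deployment(render_templates())

deploy_stamp()
```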

The organization of such artifacts poses two necessities. The first is to leverage the built-in templates and deployment capabilities of the target IaC provider, as well as their packaging in the format suitable to the automation, which demands that certain declarations, phases, and sequences be called out. The second is the coordination of context switches between the automation service and the IaC provider, which involves a preamble and an epilogue to each context switch for bookkeeping and state reconciliation.

This taught us that authors of large IaC deployments are best served by uniform, consistent, and global naming conventions; registries that the system can publish for cross-subscription and cross-region lookups; diligent parametrization at every scope, including hierarchies; dependency declarations; and a reduced need for scriptability in favor of system- and user-defined organizational units of templates. Supporting operations via read-only stores and frequently publishing continuous, up-to-date information on the rollout helps keep operations separate from the design and development of the IaC.

IaC writers frequently find themselves in positions where the separation between pipeline automation and IaC declarations is not clean or self-contained, or requires extensive customization. One approach that has worked on this front is to make multiple passes over the development: one pass provides the initial deployment capability, and another consolidates it and applies best practices via refactoring and reuse. Keeping the development pass DevOps-based, feature-centric, and agile helps converge to a working solution, with learnings carried from iteration to iteration. The refactoring pass is more generational in nature; it provides cross-cutting perspectives and non-functional guarantees.

A library of routines, operators, data types, global parameters, and registries is almost inevitable with large-scale IaC deployments, but unlike packages for programming languages, these are often organically curated and self-maintained. By leveraging the tracking and versioning support of source control, it is possible to preserve compatibility as capabilities are made native to the IaC provider or automation service.

Reference: IaC shortcomings and resolutions. 

 

Friday, July 7, 2023

 

IaC Resolutions Part 8:

Previous articles in this regard have discussed resolutions for shortcomings in the use of Infrastructure-as-Code (IaC) in various scenarios. This section discusses the resolution for the case where changes to resources involve breaking a deadlock in state awareness between, say, a pair of resources.

Consider a specific association between, say, a firewall and a network resource such as a gateway. The firewall must be associated with the gateway so that unwanted traffic through that appliance can be blocked. While they remain associated, they remember each other's identifier and state. Initially, the firewall may remain in detection mode, where it is merely passive; it becomes active in prevention mode. When an attempt is made to toggle the modes, the association prevents it. Neither end of the association can tell what state to be in without exchanging the information, and when they are deployed or updated in place, neither knows about or informs the other.

There are two ways to overcome this limitation.

First, a direction is established between the resources so that an update to one forcibly updates the state of the other. This is supported by the gateway when it allows the information to be written through by an update to the state of one resource.

Second, the changes are made by the IaC provider first to one resource and then to the other, so that the update to the second picks up the state of the first during its change. In this mode, the firewall can be activated after the gateway knows that there is such a firewall.

If the IaC tries to form an association while updating the state of one resource, the other might end up with an inconsistent state. One of the two resolutions above mitigates this.

This is easy when there is a one-to-one relationship between resources, but sometimes there are one-to-many relationships. For example, a gateway might have more than a dozen app services as its backend members, each of which might be allowing public access. If the gateway must consolidate access to all the app services, then changes are required on the gateway to route traffic to each app service as intended by the client, and a restriction is required on the app services to allow only private access from the gateway.

Consider the sequence in which these changes must be made, given that the final operational state of the gateway is acceptable only when all the app services, barring none, remain reachable for a client through the gateway.

If the app services toggle their access from public to gateway-only before the gateway becomes operational, they incur some downtime, and the duration is not necessarily bounded if one app service fails to listen to the gateway. The correct sequence involves first making the change on the gateway to set up proper routing, then restricting the app services to accept only the gateway, and finally having the gateway validate all the app service flows from a client before enabling them.

Each app service might have nuances about whether the gateway can reach it one way or another. Usually, if they are part of the same vnet, this is not a concern; otherwise, peering might be required. Even if peering is available, routing by address, resolution by name, or both might be required unless the services are universally known on the world wide web. If public access is disabled, then private links must be established, and this might require work on both the gateway and the app service. Lastly, with each change, an app service must maintain its inbound and outbound settings properly for bidirectional communication, so some vetting is required on the app service side independent of the gateway.

Putting this all together via IaC requires that the changes be made in stages and that each stage be validated independently.