Friday, June 30, 2023

 

How to enable Unity Catalog for Azure Databricks?

Azure Databricks is an Azure-managed service for provisioning Databricks instances, a platform that unifies data, analytics, and AI. Users who started on older versions of Databricks may not yet have migrated to Unity Catalog, its centralized governance and administration module. This article explains how to enable and work with Unity Catalog.

Databricks does not force us to migrate our data into proprietary storage systems to use the platform. Instead, it allows us to integrate the platform with external storage and deploys compute to process the data where it lives. We control the integrations and manage the permissions. Unity Catalog extends this relationship by letting us manage permissions for accessing that data using SQL syntax from within Azure Databricks.

The primary purpose is integrated access control:

·       Unity Catalog provides centralized access control, auditing, lineage, and data discovery capabilities across Azure Databricks workspaces. It offers a single place to administer data access policies that apply across all workspaces and personas. It automatically captures user-level audit logs that record access to your data. Unity Catalog also captures lineage data that tracks how data assets are created and used across all languages and personas.

 

·       An Azure managed identity can access external storage on behalf of Unity Catalog users. Managed identities provide an identity for applications to use when they connect to resources that support Azure Active Directory (Azure AD) authentication.

 

Unity Catalog is organized as a hierarchy with the metastore at the top, followed by catalogs, then schemas, and finally tables and views at the leaf level. Every object is referenced via a three-level namespace in the format catalog.schema.table. The metastore is the top-level container for metadata. Besides the metastore, Unity Catalog also comprises a user-management module.
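
As a minimal sketch of this hierarchy, the snippet below creates a catalog, a schema, and a table and then grants access with SQL issued from a notebook; the object and group names (sales, reporting, orders, analysts) are illustrative placeholders, not defaults of any workspace.

# Illustrative only: names are placeholders; run from a Databricks notebook where `spark` exists.
spark.sql("CREATE CATALOG IF NOT EXISTS sales")
spark.sql("CREATE SCHEMA IF NOT EXISTS sales.reporting")
spark.sql("""
    CREATE TABLE IF NOT EXISTS sales.reporting.orders (
        order_id BIGINT,
        amount   DOUBLE
    )
""")

# Every object is addressed by its three-level name: catalog.schema.table.
df = spark.table("sales.reporting.orders")

# Access control is expressed in SQL and applies across all attached workspaces.
spark.sql("GRANT USE CATALOG ON CATALOG sales TO `analysts`")
spark.sql("GRANT USE SCHEMA ON SCHEMA sales.reporting TO `analysts`")
spark.sql("GRANT SELECT ON TABLE sales.reporting.orders TO `analysts`")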

The steps to set up Unity Catalog are:

1.       Configure a storage container and Azure managed identity with read-write access to it.

2.       Create a metastore

3.       Attach workspaces to the metastore

4.       Add users, groups and service principals to the Azure Databricks account.

Many people struggle with these steps simply because the entry point is hidden: it sits behind the user icon in the account administration portal under the menu item “Manage Account”. Once this item is found, it is easy to follow the getting-started tutorial to create a metastore and set up Unity Catalog as directed; a rough automation sketch follows.
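
For those who prefer to script steps 2 and 3, a rough sketch against the Unity Catalog REST API is shown below. The endpoint paths and payload fields are assumptions that should be verified against the current Databricks REST API reference, and the host, token, IDs, and names are placeholders.

import requests

HOST = "https://adb-1234567890123456.7.azuredatabricks.net"  # placeholder workspace URL
TOKEN = "<personal-access-token>"                             # placeholder credential
headers = {"Authorization": f"Bearer {TOKEN}"}

# Step 2 (assumed endpoint): create a metastore rooted in the ADLS container from step 1.
resp = requests.post(
    f"{HOST}/api/2.1/unity-catalog/metastores",
    headers=headers,
    json={
        "name": "primary-metastore",
        "storage_root": "abfss://metastore@mystorageaccount.dfs.core.windows.net/",
        "region": "westus",
    },
)
resp.raise_for_status()
metastore_id = resp.json()["metastore_id"]

# Step 3 (assumed endpoint): attach a workspace to the metastore.
workspace_id = 1234567890123456  # placeholder workspace ID
requests.put(
    f"{HOST}/api/2.1/unity-catalog/workspaces/{workspace_id}/metastore",
    headers=headers,
    json={"metastore_id": metastore_id, "default_catalog_name": "main"},
).raise_for_status()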

The steps for setting up an integration of a brand-new instance with Azure Data Lake Storage are:

1.       Create an Azure Databricks instance in a vnet.

2.       Create an ADB access connector resource for ADLS.

3.       Use the access connector MI to access the Unity Catalog root storage account by specifying the access connector id under Data->Metastore.

4.       Create a storage credential in Unity Catalog for this managed identity.

5.       Set up your data lake storage account with a storage firewall that allows only Optum IPs.

6.       Grant access to this storage account by allowing access from the specific resource type and the Databricks instance.

7.       Set up the storage credential with an external location mapping and access-control policies for users and groups in Unity Catalog, as sketched below.
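
A minimal sketch of step 7, issued as SQL through spark.sql from a notebook, is shown below. The credential, location, container, and group names are placeholders, and the storage credential `adls_mi_cred` is assumed to have already been created from the access connector's managed identity in step 4.

# Placeholders throughout: adjust names and the abfss:// URL to your environment.
spark.sql("""
    CREATE EXTERNAL LOCATION IF NOT EXISTS lakehouse_raw
    URL 'abfss://raw@mydatalake.dfs.core.windows.net/'
    WITH (STORAGE CREDENTIAL adls_mi_cred)
""")

# Contributors may create external tables and read/write files in that location.
spark.sql("GRANT READ FILES, WRITE FILES ON EXTERNAL LOCATION lakehouse_raw TO `data-engineers`")
spark.sql("GRANT CREATE EXTERNAL TABLE ON EXTERNAL LOCATION lakehouse_raw TO `data-engineers`")

# Downstream consumers only need object-level grants, not direct storage access.
spark.sql("GRANT SELECT ON TABLE main.raw.events TO `analysts`")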

 

Problem Statement: You are given a 0-indexed integer array nums.

You may swap any two adjacent elements of nums any number of times.

A valid array meets the following conditions:

·        The largest element (any of the largest elements if there are multiple) is at the rightmost position in the array.

·        The smallest element (any of the smallest elements if there are multiple) is at the leftmost position in the array.

Return the minimum swaps required to make nums a valid array.

 

Example 1:

Input: nums = [3,4,5,5,3,1]

Output: 6

Explanation: Perform the following swaps:

- Swap 1: Swap the 3rd and 4th elements, nums is then [3,4,5,3,5,1].

- Swap 2: Swap the 4th and 5th elements, nums is then [3,4,5,3,1,5].

- Swap 3: Swap the 3rd and 4th elements, nums is then [3,4,5,1,3,5].

- Swap 4: Swap the 2nd and 3rd elements, nums is then [3,4,1,5,3,5].

- Swap 5: Swap the 1st and 2nd elements, nums is then [3,1,4,5,3,5].

- Swap 6: Swap the 0th and 1st elements, nums is then [1,3,4,5,3,5].

It can be shown that 6 is the minimum number of swaps required to make the array valid.

Example 2:

Input: nums = [9]

Output: 0

Explanation: The array is already valid, so we return 0.

 

Constraints:

·         1 <= nums.length <= 10^5

·         1 <= nums[i] <= 10^5

Solution:

import java.util.Arrays;
import java.util.stream.Collectors;

class Solution {
    public int minimumSwaps(int[] nums) {
        int min = Arrays.stream(nums).min().getAsInt();
        int max = Arrays.stream(nums).max().getAsInt();
        int count = 0;

        // One pass suffices: bubble the last occurrence of the maximum to the right
        // end, then bubble the first occurrence of the minimum to the left end.
        if (nums[0] != min || nums[nums.length - 1] != max) {
            var numsList = Arrays.stream(nums).boxed().collect(Collectors.toList());
            int end = numsList.lastIndexOf(max);
            for (int i = end; i < nums.length - 1; i++) {
                swap(nums, i, i + 1);
                count++;
            }

            // Recompute positions because moving the maximum may have shifted the minimum.
            numsList = Arrays.stream(nums).boxed().collect(Collectors.toList());
            int start = numsList.indexOf(min);
            for (int j = start; j >= 1; j--) {
                swap(nums, j, j - 1);
                count++;
            }
        }

        return count;
    }

    public void swap(int[] nums, int i, int j) {
        int temp = nums[j];
        nums[j] = nums[i];
        nums[i] = temp;
    }
}

 

Input

nums =

[3,4,5,5,3,1]

Output

6

Expected

6

 

Input

nums =

[9]

Output

0

Expected

0

 


Thursday, June 29, 2023

Workflows with Airflow:

Apache Airflow is a platform for building and running workflows. A workflow is represented as a Directed Acyclic Graph (DAG) in which the nodes are tasks and the edges are dependencies; the graph determines the order in which tasks run and how they are retried. Tasks are self-describing. An Airflow deployment consists of a scheduler that triggers scheduled workflows and submits tasks to the executor, an executor that runs those tasks, a web server that provides a management interface, a folder for DAG artifacts, and a metadata database that stores state. Airflow places few restrictions on what a task can be: an Operator, that is, a predefined task template; a Sensor, which does nothing but wait for an external event to happen; or a custom task written as a Python function decorated with @task.

 

Runs of the tasks in a workflow can be executed repeatedly by processing the DAG, and they can run in parallel. Edges can be modified by setting the upstream and downstream relationships between a task and its dependency. Data can be passed between tasks using XCom, a cross-communication system for exchanging small pieces of state, by uploading to and downloading from external storage, or via implicit exchanges, as in the sketch below.
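
A minimal TaskFlow-style DAG illustrating these ideas is sketched below; the DAG id, schedule, and task bodies are invented for illustration, and the `schedule` argument assumes Airflow 2.4 or later. Returning a value from one @task and passing it into the next exchanges the data through an implicit XCom.

from datetime import datetime

from airflow.decorators import dag, task


@dag(schedule="@daily", start_date=datetime(2023, 6, 1), catchup=False)
def example_etl():
    @task(retries=2)
    def extract() -> list[int]:
        # Pretend to pull a batch of records from a source system.
        return [1, 2, 3]

    @task
    def transform(records: list[int]) -> int:
        # The `records` argument arrives via an implicit XCom exchange.
        return sum(records)

    @task
    def load(total: int) -> None:
        print(f"Loading total={total}")

    # Passing return values wires up the upstream/downstream edges.
    load(transform(extract()))


example_etl()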

 

Airflow dispatches tasks to workers as capacity becomes available, so individual tasks may fail and be retried, but the workflow eventually completes. Notions such as sub-DAGs and TaskGroups are introduced for better manageability.

 

One of the characteristics of Airflow is that it prioritizes flow: there is no need to describe data inputs or outputs, and all aspects of the flow can be visualized, including pipeline dependencies, progress, logs, code, tasks, and success status.

Airflow is in use by over ten thousand organizations, with popular use cases that include orchestrating batch ETL jobs; organizing, executing, and monitoring data flows; building ETL pipelines that extract batch data from hybrid sources and run Spark jobs; training machine learning models; generating automated reports; and performing backups and other DevOps tasks. It might not be ideal for streaming events, because the scheduling requirements differ between batch and stream processing. Airflow also does not offer versioning of pipelines, so source control may become necessary for such cases. Airflow epitomizes pipeline-as-code, with artifacts described in Python for creating jobs, stitching jobs together, programming other necessary data pipelines, and debugging and troubleshooting.

Wednesday, June 28, 2023

The following IaC snippet shows GitHub integration using the azuread and github Terraform providers.


data "azuread_client_config" "current" {}

variable "namespace" {

   description = "The namespace for end-user deployment"

   type = string

   # Intended to default to something like "${var.name}-<uuid>", but Terraform variable
   # defaults cannot reference other variables or call functions, so the caller must
   # supply a unique value.

}

resource "azuread_group" "contributor_group" {

  display_name     = "${var.namespace} contributor group"

  owners           = [data.azuread_client_config.current.object_id]

  security_enabled = true

  onpremises_group_type = "UniversalSecurityGroup"

  # onpremises_sync_enabled is read-only on azuread_group; group writeback is requested instead.
  writeback_enabled = true

}

resource "azuread_group" "operator_group" {

  display_name     = "${var.namespace} operator group"

  owners           = [data.azuread_client_config.current.object_id]

  security_enabled = true

  onpremises_group_type = "UniversalSecurityGroup"

  # onpremises_sync_enabled is read-only on azuread_group; group writeback is requested instead.
  writeback_enabled = true

}

resource "github_team" "deployment_contributors" {

  name        = "${var.namespace} contributor-team"

  description = "Has read-write access"

  privacy     = "closed"

}

resource "github_team" "deployment_operators" {

  name        = "${var.namespace} operator-team"

  description = "Has read-only access"

  privacy     = "closed"

}

resource "github_repository" "pipelines" {

  name        = "${var.namespace}-pipelines"

  description = "${var.namespace} pipeline artifacts"

  # `private` is superseded by `visibility` in the github provider and would conflict with it.
  visibility = "private"

  auto_init = true

  template {

    owner                = "MyOrganization"

    repository           = "pipeline-template"

    include_all_branches = true

  }

}

resource "github_branch" "contributors-branch" {

  repository = github_repository.pipelines.name

  branch     = "contributors-branch"

}

resource "github_branch" "operators-branch" {

  repository = github_repository.pipelines.name

  branch     = "operators-branch"

}

resource "github_branch_protection" "contributors_branch_protection" {

  repository_id = github_repository.pipelines.name

  pattern          = github_branch.contributors-branch.branch

  enforce_admins   = true

  allows_deletions = false

  push_restrictions = [

    # The team is declared above as a resource (not a data source); use its node ID.
    github_team.deployment_contributors.node_id,

  ]

}

resource "github_branch_protection" "operators_branch_protection" {

  repository_id = github_repository.pipelines.name

  pattern          = github_branch.operators-branch.branch

  enforce_admins   = true

  allows_deletions = false

  push_restrictions = [

    # The team is declared above as a resource (not a data source); use its node ID.
    github_team.deployment_operators.node_id,

  ]

}

 

Tuesday, June 27, 2023

 

Firewall Rules:

This article follows up on a previous one regarding firewall rules. A web application firewall (WAF) serves to deter attacks against web applications; WAFs are also referred to as Web Application Shields or Web Application Security Filters. This section of the article is aimed at technical decision makers as well as application owners, so that they are better prepared with the concepts behind the best practices for setting up a web application firewall.

Access to a web application, in this context, measures the extent to which required changes to the application source code can be carried out in-house, on time, or by third parties. Between the extremes of no access and full access, a WAF is useful for consolidating access and providing safety measures such as encryption. Within that range, the benefit of a WAF is smaller when the application is mostly developed in-house with little buy-in, and larger when the application has a high percentage of modifications and more buy-in.

Unlike securing the transport of data between clients and servers, the firewall does not come with an option to offload to an external device and is designed as a software plug-in. Prioritizing which web applications to secure behind a firewall depends on access to personal data, access to confidential information, whether the application is essential to completing critical business processes, and its relevance for attaining critical certifications. When access is denied by a firewall, certain risks and costs apply, such as interruption of business processes and damage-compensation claims. The maintenance contract for the applications and short error-reproduction times play as significant a role in how a firewall is perceived as whether its features are actually used, even when it is configured correctly.

A WAF can help with cookie protection through its support for signed and encrypted cookies. It can prevent information leakage with a cloaking or cleaning filter. It tackles session riding with URL encryption and tokens. It can check file uploads for viruses. It can deter parameter tampering and forced browsing. It provides protection against path traversal and performs link validation. It provides logging for specific or permitted parts of requests. It can force SSL, prevent cross-site tracing, block command injection and SQL injection, and apply just-in-time patching. It also provides protection against HTTP request smuggling.
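
As one concrete illustration of the cookie-protection point, the sketch below shows how a filter might sign an outgoing cookie value with an HMAC and verify it on the way back in. This is a conceptual example of the technique only, not the implementation of any particular WAF product, and the secret is a placeholder.

import hashlib
import hmac

SECRET = b"rotate-me-regularly"  # placeholder; a real deployment uses a managed secret


def sign_cookie(value: str) -> str:
    """Append an HMAC so client-side tampering is detectable."""
    mac = hmac.new(SECRET, value.encode(), hashlib.sha256).hexdigest()
    return f"{value}.{mac}"


def verify_cookie(signed: str) -> str | None:
    """Return the original value if the signature checks out, otherwise None."""
    value, _, mac = signed.rpartition(".")
    expected = hmac.new(SECRET, value.encode(), hashlib.sha256).hexdigest()
    return value if hmac.compare_digest(mac, expected) else None


cookie = sign_cookie("session=abc123; role=user")
assert verify_cookie(cookie) == "session=abc123; role=user"
assert verify_cookie(cookie.replace("user", "admin")) is None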

Central or decentralized infrastructure, performance criteria, conformance to existing security policies, iterative implementation from basic security to full protection, role distribution, prioritizing applications, and providing full protection are some of the areas of best practice.

Monday, June 26, 2023

 

Firewall rules:

Access control can be role-based: the identity of a caller is determined by what the caller has or what the caller knows, the caller is authenticated and authorized, and permissions are then derived from the associated role. While this can be an elaborate mechanism, some simple criteria can suffice for admission control. The difference between the two becomes clear when the system does not want to reveal any further information about failed calls, including whether a second chance or a retry is permitted. This behavior is on par with rate limits against calls to a resource, so it is expected for API calls. The two mechanisms can even be complementary to each other.

An allowlist grants access to those on the list but does not specify any behavior for those who are not, which is why another list is required to articulate who must be denied. Adding a single entity to an allowlist implicitly places everyone else on the deny list, so some systems go the extra length of automatically adding a deny-all rule whenever an allowlist entry is added. These complementary lists must be kept mutually exclusive, with priority given to allow and a catch-all of deny.

An allowlist is maintained across the various types of clients of the resource. Everyone must opt into the allowlist, or the system must enforce it; otherwise the allowlist means nothing. It is also harder to enforce when the system does not actively monitor or use the allowlist and it is instead understood and enforced outside the system by a third party.

Including an allowlist or a denylist is sometimes considered an antipattern, especially when the system does not need admission control and can scale arbitrarily to any load without one set of callers affecting another. When a few specific clients or bots could cause denial of service to others, the lists are justified. If there are many criteria by which allow or deny must be decided, the lists no longer suffice, and a fully developed classifier that encapsulates the decision-making and interprets the properties of the callers might be justified. The convention for evaluating rules in a classifier is to evaluate them one by one, in program order. Processing can stop at any rule or fall through to the rest; the rules are evaluated as if there were an 'or' between the previous rule and the current one. The logic inside a single rule, on the other hand, can be complex, involving both the 'or' and 'and' logical operators.

Often allowlists and denylists become part of a rule, and the system stores and processes these rules. The expressions in a rule could be evaluated as a tree if they were not restricted to a flat sequence of first-level predicates. Evaluating an expression tree is recursive and may require the plan to be compiled and saved so that it is not repeatedly prepared for matching against a caller's attributes. The way regular expressions are compiled before matching speaks to how the classifier should match the incoming criteria; a small sketch follows.
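
The sketch below shows one way such a classifier could be structured: predicates are compiled once up front, rules are evaluated in program order, the first matching rule decides, and a catch-all deny sits at the end. The rule contents are invented for illustration.

import re
from dataclasses import dataclass
from typing import Callable


@dataclass
class Rule:
    name: str
    predicate: Callable[[dict], bool]  # complex 'and'/'or' logic lives inside a single rule
    allow: bool


# Compiled once and reused, like a saved evaluation plan.
INTERNAL_NET = re.compile(r"^10\.")
BOT_AGENT = re.compile(r"bot|crawler", re.IGNORECASE)

RULES = [
    Rule("allow-internal", lambda c: bool(INTERNAL_NET.match(c.get("ip", ""))), allow=True),
    Rule("deny-bots", lambda c: bool(BOT_AGENT.search(c.get("user_agent", ""))), allow=False),
    Rule("allow-partners", lambda c: c.get("api_key") in {"partner-1", "partner-2"}, allow=True),
    Rule("deny-all", lambda c: True, allow=False),  # catch-all deny
]


def classify(caller: dict) -> bool:
    """Evaluate rules in program order; the first matching rule decides."""
    for rule in RULES:
        if rule.predicate(caller):
            return rule.allow
    return False


print(classify({"ip": "10.1.2.3"}))                         # True: allow-internal
print(classify({"ip": "52.0.0.1", "user_agent": "MyBot"}))  # False: deny-bots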

Sunday, June 25, 2023

 

#Requires -Version 2.0

<#

Synopsis: The following PowerShell script serves as the complementary restore

example for the AKS cluster backup introduced with the earlier backup script.

The concept behind this form of BCDR solution is described here:

https://learn.microsoft.com/en-us/azure/backup/azure-kubernetes-service-cluster-backup-concept

#>

param (

    [Parameter(Mandatory=$true)][string]$resourceGroupName,

    [Parameter(Mandatory=$true)][string]$accountName,

    [Parameter(Mandatory=$true)][string]$subscriptionId,

    [Parameter(Mandatory=$true)][string]$aksClusterName,

    [Parameter(Mandatory=$true)][string]$aksClusterRG,

    [string]$backupVaultRG = "testBkpVaultRG",

    [string]$backupVaultName = "TestBkpVault",

    [string]$location = "westus",

    [string]$containerName = "backupc",

    [string]$storageAccountName = "sabackup",

    [string]$storageAccountRG = "rgbackup",

    [string]$environment = "AzureCloud"

)

 

Connect-AzAccount -Environment "$environment"

Set-AzContext -SubscriptionId "$subscriptionId"

Write-Host "Before we start, test the backup vault"

$TestBkpVault = Get-AzDataProtectionBackupVault -ResourceGroupName $backupVaultRG -VaultName $backupVaultName -ErrorAction Stop

if ($null -eq $TestBkpVault) {

    Write-Host "This script should not be executed if the vault cannot be found."

    exit 1

}

 

$policyDefn = Get-AzDataProtectionPolicyTemplate -DatasourceType AzureKubernetesService

$policyDefn.PolicyRule[0].Trigger | Format-List

# Sample output:
#   ObjectType: ScheduleBasedTriggerContext
#   ScheduleRepeatingTimeInterval: {R/2023-04-05T13:00:00+00:00/PT4H}
#   TaggingCriterion: {Default}

$policyDefn.PolicyRule[1].Lifecycle | Format-List

# Sample output:
#   DeleteAfterDuration: P7D
#   DeleteAfterObjectType: AbsoluteDeleteOption
#   SourceDataStoreObjectType: DataStoreInfoBase
#   SourceDataStoreType: OperationalStore
#   TargetDataStoreCopySetting:

 

 

$aksBkpPol = Get-AzDataProtectionBackupPolicy -ResourceGroupName $backupVaultRG -VaultName $TestBkpVault.Name -Name "aksBkpPolicy"

 

if ($null -eq $aksBkpPol) {

   Write-Host "This script should not be executed if there was no backup policy."

   exit 1

}

 

Write-Host "Tracking all the backup jobs"

$job = Search-AzDataProtectionJobInAzGraph -Subscription $subscriptionId -ResourceGroupName $backupVaultRG -Vault $TestBkpVault.Name -DatasourceType AzureKubernetesService  -Operation OnDemandBackup

 

Write-Host "Fetch the relevant recovery point"

$AllInstances = Get-AzDataProtectionBackupInstance -ResourceGroupName $backupVaultRG -VaultName $TestBkpVault.Name

 

Write-Host "Searching across multiple vaults and subscriptions"

$AllInstances = Search-AzDataProtectionBackupInstanceInAzGraph -ResourceGroupName $backupVaultRG -VaultName $TestBkpVault.Name -DatasourceType AzureKubernetesService  -ProtectionStatus ProtectionConfigured

if ($null -eq $AllInstances) {

   Write-Host "This script should not be executed if there was no backup instance."

   exit 1

}

Write-Host "Once the instance is identified, fetch the relevant recovery point"

$rp = Get-AzDataProtectionRecoveryPoint -ResourceGroupName $backupVaultRG -VaultName $TestBkpVault.Name -BackupInstanceName $AllInstances[2].BackupInstanceName

 

Write-Host "Prepare the restore request"

$aksClusterId= "/subscriptions/$subscriptionId/resourceGroups/$resourceGroup/providers/Microsoft.ContainerService/managedClusters/$aksClusterName"

$aksRestoreCriteria = New-AzDataProtectionRestoreConfigurationClientObject -DatasourceType AzureKubernetesService  -PersistentVolumeRestoreMode RestoreWithVolumeData  -IncludeClusterScopeResource $true -NamespaceMapping  @{"sourceNamespace"="targetNamespace"}

$backupInstance = $AllInstances[2]

$aksRestoreRequest = Initialize-AzDataProtectionRestoreRequest -DatasourceType AzureKubernetesService -SourceDataStore OperationalStore -RestoreLocation $location -RestoreType OriginalLocation -RecoveryPoint $rp[0].Property.RecoveryPointId -RestoreConfiguration $aksRestoreCriteria -BackupInstance $backupInstance

 

Write-Host "Trigger the restore"

$validateRestore = Test-AzDataProtectionBackupInstanceRestore -SubscriptionId $subscriptionId -ResourceGroupName $aksClusterRG -VaultName $backupVaultName -RestoreRequest $aksRestoreRequest -Name $backupInstance.BackupInstanceName

$restoreJob = Start-AzDataProtectionBackupInstanceRestore -SubscriptionId $subscriptionId -ResourceGroupName $aksClusterRG -VaultName $backupVaultName -BackupInstanceName $backupInstance.BackupInstanceName -Parameter $aksRestoreRequest

 

 

Write-Host "Track all the restore jobs"

$job = Search-AzDataProtectionJobInAzGraph -Subscription $subscriptionId -ResourceGroupName $backupVaultRG -Vault $TestBkpVault.Name -DatasourceType AzureKubernetesService -Operation Restore