Tuesday, April 11, 2023

The following is a Terraform module to create an Azure dashboard from user-supplied content and a query. It quickly deploys a panel for a query on a supported metric; the query can be authored externally in KQL and referenced here via a URL in the dashboard definition. Dashboard panels and visualizations can be created from the Shared dashboards page: https://portal.azure.com/#view/HubsExtension/BrowseResource/resourceType/Microsoft.Portal%2Fdashboards

 

provider "azurerm" { 

  features {} 

} 

 

data "azurerm_subscription" "current" {} 

 

resource "azurerm_resource_group" example_rg { 

  name     = var.resource_group 

  location = "Central US" 

} 

 

resource "azurerm_portal_dashboard" example_dashboard_name { 

  name                = var.dashboard_name 

  resource_group_name = azurerm_resource_group.example_rg.name 

  location            = azurerm_resource_group.example_rg.location 

  tags = { 

    source = "terraform" 

  } 

  dashboard_properties = <<DASH 

{ 

   "lenses": { 

        "0": { 

            "order": 0, 

            "parts": { 

                "0": { 

                    "position": { 

                        "x": 0, 

                        "y": 0, 

                        "rowSpan": 5, 

                        "colSpan": 5 

                    }, 

                    "metadata": { 

                        "inputs": [], 

                        "type": "Extension/HubsExtension/PartType/MarkdownPart", 

                        "settings": { 

                            "content": { 

                                "settings": { 

                                    "content": "${var.query_content}", 

                                    "subtitle": "", 

                                    "title": "${var.dashboard_title}" 

                                } 

                            } 

                        } 

                    } 

                }, 

                "1": { 

                    "position": { 

                        "x": 5, 

                        "y": 0, 

                        "rowSpan": 5, 

                        "colSpan": 6 

                    }, 

                    "metadata": { 

                        "inputs": [], 

                        "type": "Extension/Microsoft_Azure_Monitoring/PartType/MetricsChartPart", 

                        "settings": { 

                            "content": { 

                                "settings": { 

                                    "title": "Important Information", 

                                    "subtitle": "", 

                                    "src": "${var.query_link}", 

                                    "autoplay": true 

                                } 

                            } 

                        } 

                    } 

                } 

            } 

        } 

    }, 

    "metadata": { 

        "owner": "${var.dashboard_owner_email}", 

        "model": { 

            "timeRange": { 

                "value": { 

                    "relative": { 

                        "duration": 24, 

                        "timeUnit": 1 

                    } 

                }, 

                "type": "MsPortalFx.Composition.Configuration.ValueTypes.TimeRange" 

            }, 

            "filterLocale": { 

                "value": "en-us" 

            }, 

            "filters": { 

                "value": { 

                    "MsPortalFx_TimeRange": { 

                        "model": { 

                            "format": "utc", 

                            "granularity": "auto", 

                            "relative": "24h" 

                        }, 

                        "displayCache": { 

                            "name": "UTC Time", 

                            "value": "Past 24 hours" 

                        }, 

                        "filteredPartIds": [ 

                            "StartboardPart-UnboundPart-ae44fef5-76b8-46b0-86f0-2b3f47bad1c7" 

                        ] 

                    } 

                } 

            } 

        } 

    } 

} 

DASH 

} 

 

As with all Terraform modules, this can be run via the following Terraform commands: 

terraform init 

terraform plan 

terraform apply 

and terraform destroy can be run when the resources are no longer needed.
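Since the module references input variables such as var.resource_group, var.dashboard_name, var.dashboard_title, var.dashboard_owner_email, var.query_content, and var.query_link, these values can be supplied on the command line or through a .tfvars file. A minimal sketch of such a run, where every value is a placeholder and not part of the original module:

# Placeholder values for the module's input variables.
terraform init
terraform plan \
  -var="resource_group=rg-dashboards-nonprod-centralus-001" \
  -var="dashboard_name=example-dashboard" \
  -var="dashboard_title=Example Dashboard" \
  -var="dashboard_owner_email=owner@example.com" \
  -var="query_content=### Example markdown content" \
  -var="query_link=https://example.com/path/to/query" \
  -out=tfplan
terraform apply tfplan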

Monday, April 10, 2023

 

This is a continuation of the previous posts on the Azure Data Platform and discusses the considerations for a specific scenario: moving data from on-premises IBM object storage to Azure storage.

Organizing storage assets to support governance, operational management, and accounting requirements is necessary for data migration to the cloud. Well-defined naming and metadata-tagging conventions help to quickly locate and manage resources. These conventions also help to associate cloud usage costs with business teams via chargeback and showback accounting mechanisms.

Naming and tagging serve different purposes. Proper naming is essential for identification and security, while tagging enriches the metadata. A name includes parts that indicate the business unit or project owner, and it can also carry parts for workload, application, environment, criticality, and other such information. Tags can repeat these values, but they are better suited to metadata that does not need to be reflected in the name.

An effective naming strategy follows a pattern such as:

<resourceType>-<workload/application>-<environment>-<region>-<instance>.

For example,

rg-projectx-nonprod-centralus-001

is an example of a good naming convention for a resource group.

The projectx part here refers to the name of the project or business capability, but it could comprise multiple parts carrying hierarchical information, such as contoso-fin-navigator for <organization>-<department>-<service>.

Similarly, the <instance> could comprise <role>-<instanceSuffix>, for example vm-transactional-nonprod-centralus-db-001.

For storage accounts, the name must be 3-24 characters long and contain only lowercase letters and numbers, so any spaces in a project name must be removed; there is no option to substitute the space character with a hyphen. Container names do permit hyphens. A container name must be between 3 and 63 characters long and all lowercase.

Existing naming conventions can be used, if they have been serving adequately and are compatible with the public cloud naming restrictions.

It is preferable to map source buckets to destination containers, and source namespaces to destination storage accounts, on a one-to-one basis.
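As an illustration of these naming and tagging conventions (the names and tags below are hypothetical, not prescribed here), a destination storage account and container could be created with the Azure CLI:

# Hypothetical names following the convention above: storage account names allow only
# lowercase letters and numbers, while container names also allow hyphens.
az storage account create \
  --name stprojectxnonprodcus001 \
  --resource-group rg-projectx-nonprod-centralus-001 \
  --location centralus \
  --sku Standard_LRS \
  --kind StorageV2 \
  --tags environment=nonprod project=projectx costcenter=1234

az storage container create \
  --name projectx-raw-001 \
  --account-name stprojectxnonprodcus001 \
  --auth-mode login

The tags carry the metadata used for chargeback and showback accounting without overloading the resource names.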

The current limits for an on-premises IBM object storage include the following:

100 buckets per object storage instance.

Maximum object size of 10 TB.

Unlimited number of objects in an instance.

Maximum key length of 1,024 characters.

Storage classes are set at the bucket level.

Changing the storage class requires manually copying data from one bucket to another.

Archiving can be applied independently of storage class.

IBM COS is accessible via the S3 protocol.

These limits are well within those of an Azure storage account.

Sunday, April 9, 2023

The following script can be used to upload local files and folders from an on-premises machine to an Azure storage account using the AzCopy utility, a command-line tool for copying blobs or files to or from a storage account. The tool requires authentication but allows unattended login via a security principal. It can also resume a previous execution of the copy command with the help of journaling, and a custom location for the journal (job plan) files can be configured. Azure Data Lake Storage Gen2 works only with recent versions of AzCopy (v10 onwards). For multi-region deployments, it is recommended to land the data in one region and then replicate it globally using AzCopy.

#!/bin/sh

throw() {

  echo "$*" >&2

  exit 1

}

 

STORAGE_ACCOUNT_NAME=

CONTAINER_NAME=

LOCAL_FOLDER_PATH=

 

usage() {

  echo

  echo "Usage: $(basename $0) -b arg -c arg -l arg [-h]"

  echo

  echo "-b - The name of the blob storage account."

  echo "-c - The name of the container."

  echo "-l - The name of the local system folder."

  echo "-h - This help text."

  echo

}

 

parse_options() {

while getopts ':b:l:c:h' opt; do

  case "$opt" in

    b)

      STORAGE_ACCOUNT_NAME="$OPTARG"

      ;;

    l)

      LOCAL_FOLDER_PATH="$OPTARG"

      ;;

    c)

      CONTAINER_NAME="$OPTARG"

      ;;

    h)

      echo "Processing option 'h'"

      usage

      exit 0

      ;;

    :)

      echo "option requires an argument.\n"

      usage

      exit 1

      ;;

    ?)

      echo "Invalid command option.\n"

      usage

      exit 1

      ;;

  esac

done

shift "$(($OPTIND -1))"

}

parse_options "$@"

if ([ -z "$LOCAL_FOLDER_PATH" ] || [ -z "$STORAGE_ACCOUNT_NAME" ] || [ -z "$CONTAINER_NAME" ]);

then 

  echo "Invalid command.\n"

  usage

  exit 1

fi

./azcopy login 

./azcopy copy "$LOCAL_FOLDER_PATH" "https://$STORAGE_ACCOUNT_NAME.blob.core.windows.net/$CONTAINER_NAME" --recursive=true 

./azcopy sync "$LOCAL_FOLDER_PATH" "https://$STORAGE_ACCOUNT_NAME.blob.core.windows.net/$CONTAINER_NAME" --recursive=true 

# crontab -e

# */5 * * * * sh /path/to/upload.sh

The script can be made to run periodically so that the delta changes can also be propagated.
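For unattended runs, such as the cron schedule suggested in the script's comments, azcopy supports signing in with a service principal instead of an interactive login. A minimal sketch, where the application ID, tenant ID, and secret are placeholders:

# Placeholder identifiers; azcopy reads the client secret from the environment.
export AZCOPY_SPA_CLIENT_SECRET='<service-principal-secret>'
./azcopy login --service-principal \
  --application-id "<application-id>" \
  --tenant-id "<tenant-id>"
./azcopy copy "$LOCAL_FOLDER_PATH" \
  "https://$STORAGE_ACCOUNT_NAME.blob.core.windows.net/$CONTAINER_NAME" --recursive=true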

 

#codingexercise

Problem Statement: A 0-indexed integer array nums is given.

Swaps of adjacent elements can be performed on nums.

A valid array meets the following conditions:

·       The largest element (any of the largest elements if there are multiple) is at the rightmost position in the array.

·       The smallest element (any of the smallest elements if there are multiple) is at the leftmost position in the array.

Return the minimum swaps required to make nums a valid array.

 

Example 1:

Input: nums = [3,4,5,5,3,1]

Output: 6

Explanation: Perform the following swaps:

- Swap 1: Swap the 3rd and 4th elements, nums is then [3,4,5,3,5,1].

- Swap 2: Swap the 4th and 5th elements, nums is then [3,4,5,3,1,5].

- Swap 3: Swap the 3rd and 4th elements, nums is then [3,4,5,1,3,5].

- Swap 4: Swap the 2nd and 3rd elements, nums is then [3,4,1,5,3,5].

- Swap 5: Swap the 1st and 2nd elements, nums is then [3,1,4,5,3,5].

- Swap 6: Swap the 0th and 1st elements, nums is then [1,3,4,5,3,5].

It can be shown that 6 is the minimum number of swaps required to make nums a valid array.

Example 2:

Input: nums = [9]

Output: 0

Explanation: The array is already valid, so we return 0.

 

Constraints:

·         1 <= nums.length <= 10^5

·         1 <= nums[i] <= 10^5

Solution:

import java.util.Arrays;
import java.util.stream.Collectors;

class Solution {

    public int minimumSwaps(int[] nums) {

        int min = Arrays.stream(nums).min().getAsInt();

        int max = Arrays.stream(nums).max().getAsInt();

        int count = 0;

        // Repeat until the smallest value is first and the largest value is last.
        while ((nums[0] != min || nums[nums.length-1] != max) && count < 2 * nums.length) {

            // Bubble the rightmost occurrence of the maximum toward the end with adjacent swaps.

            var numsList = Arrays.stream(nums).boxed().collect(Collectors.toList());

            var end = numsList.lastIndexOf(max);

            for (int i = end; i < nums.length-1; i++) {

                swap(nums, i, i+1);

                count++;

            }

 

            // Bubble the leftmost occurrence of the minimum toward the front with adjacent swaps.

            numsList = Arrays.stream(nums).boxed().collect(Collectors.toList());

            var start = numsList.indexOf(min);

            for (int j = start; j >= 1; j--) {

                swap(nums, j, j-1);

                count++;

            }

        }

 

        return count;

    }

 

    public void swap(int[] nums, int i, int j) {

        int temp = nums[j];

        nums[j] = nums[i];

        nums[i] = temp;

    }

}

 

Input

nums =

[3,4,5,5,3,1]

Output

6

Expected

6

 

Input

nums =

[9]

Output

0

Expected

0

 

Saturday, April 8, 2023

 

Using Azure Data Factory to upload and transform data to Azure Data Lake Storage Gen2.

This is a continuation of the articles on Azure Data Platform as they appear here and discusses data security and compliance for ADF and Data Lakes.

ADF can work with many types of data sources and can ingest files and folders that vary in size and number, up to petabytes of data. Microsoft Purview can be used to govern, protect, and manage data estates. It provides integrated coverage and helps address the fragmentation of data at rest and in transit. This kind of solution helps an organization protect sensitive data across clouds, applications, and devices, identify data risks, and manage regulatory compliance requirements. It helps to create an up-to-date map of the entire data estate that includes data classification and end-to-end lineage, to identify sensitive data, to create a secure environment for data consumers to find valuable data, and to generate insights about how the data is stored and used. With data in ADF and the data lake, such a report is very helpful for meeting compliance with standards such as SOC, ISO, HITRUST, FedRAMP, and HIPAA.

ADF can be connected to Microsoft Purview. There are two options to do so:

1.       Connect to the Microsoft Purview account from Azure Data Factory, or

2.       Register the Data Factory in Microsoft Purview.

Either connection can be complemented with Azure Monitor based alerts.

The prerequisites are an Owner or Contributor role on the ADF to connect it to a Microsoft Purview account, and a system-assigned managed identity enabled on the ADF. Establishing the connection only requires the Azure subscription to locate the Purview accounts, one of which is then selected. The connection information is stored in the ADF resource. ADF's managed identity is used to authenticate lineage push operations from ADF to the Microsoft Purview account. The Data Curator role on the root collection of the Microsoft Purview account must be assigned to the managed identity of the ADF.
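As a sketch of the managed-identity prerequisite (the resource names below are placeholders), the data factory's system-assigned identity can be looked up with the Azure CLI and then granted the Data Curator role on the Purview root collection from the Microsoft Purview governance portal:

# Placeholder names; prints the principal ID of the data factory's managed identity.
az resource show \
  --resource-group rg-projectx-nonprod-centralus-001 \
  --resource-type "Microsoft.DataFactory/factories" \
  --name adf-projectx-nonprod-centralus-001 \
  --query identity.principalId \
  --output tsv
# Assign this principal the Data Curator role on the Purview root collection
# (done in the Microsoft Purview governance portal rather than through Azure RBAC).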

Both this connection and the Purview integration capabilities can be monitored. The default integration capability is the data lineage pipeline; when this pipeline is executed, the lineage information is transmitted to the Purview account. The search bar at the top center of the Data Factory authoring UI can be used to search for data and perform actions, which is very helpful for understanding the data based on metadata, lineage, and annotations. Many organizations rely heavily on tagging and metadata, even to the point of specifying paths and dedicating storage containers for such information.

With the data discovered through Microsoft Purview, it is possible to create a Linked Service, Dataset, or data flow over it.

All activity runs from ADF report status, copy duration, throughput, data read, files read, data written, files written, peak connections for both read and write, the parallel copies used, the data integration units, and the queue and transfer durations, providing complete information on the activities performed for monitoring or troubleshooting.