Saturday, April 22, 2023

 

Missing ranges:  

You are given an inclusive range [lower, upper] and a sorted unique integer array nums, where all elements are in the inclusive range.  

A number x is considered missing if x is in the range [lower, upper] and x is not in nums.  

Return the smallest sorted list of ranges that cover every missing number exactly. That is, no element of nums is in any of the ranges, and each missing number is in one of the ranges.  

Each range [a,b] in the list should be output as:  

· "a->b" if a != b

· "a" if a == b

   

Example 1: 

Input: nums = [0,1,3,50,75], lower = 0, upper = 99  

Output: ["2","4->49","51->74","76->99"]  

Explanation: The ranges are:  

[2,2] --> "2"  

[4,49] --> "4->49"  

[51,74] --> "51->74"  

[76,99] --> "76->99"  

Example 2: 

Input: nums = [-1], lower = -1, upper = -1  

Output: []  

Explanation: There are no missing ranges since there are no missing numbers.  

   

Constraints: 

· -10^9 <= lower <= upper <= 10^9

· 0 <= nums.length <= 100

· lower <= nums[i] <= upper

· All the values of nums are unique.

 

 

class Solution {
    public List<String> missingRanges(int lower, int upper, int[] nums) {
        List<String> result = new ArrayList<>();
        int start = lower;
        for (int i = 0; i < nums.length; i++) {
            if (start < nums[i]) {
                int end = nums[i] - 1;
                if (start == end) result.add(String.valueOf(start));
                else result.add(start + "->" + end);
            }
            start = nums[i] + 1;
        }
        if (start == upper) {
            result.add(String.valueOf(start));
        } else if (start < upper) {
            result.add(start + "->" + upper);
        }
        return result;
    }
}

 

 

Friday, April 21, 2023

This is a continuation of articles as they appear here for a discussion on the Azure Data Platform: 

All storage operations with the S3 command-line tools are idempotent and retriable, which makes them robust against provider failures, rate limits and network connectivity disruptions. While s3cmd and azcopy must download files and folders in a step separate from uploading them, rclone can accomplish the download and upload in a single command invocation by specifying different data connections as command-line arguments. It does so in a streaming manner, which is more performant and leaves little or no footprint on the local filesystem.

For example: 

user@macbook localFolder % rclone copy --metadata --progress aws:s3-namespace azure:blob-account
Transferred: 215.273 MiB / 4.637 GiB, 5%, 197.226 KiB/s, ETA 6h32m14s
 

will stream from one to another leaving no trace in the local filesystem. 
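To make the contrast concrete, a sketch of the two approaches is below. The bucket, container and storage account names are placeholders, and the exact flags should be checked against each tool's documentation:

```shell
# Two-step transfer with s3cmd and azcopy: data is staged on the local disk first.
s3cmd get --recursive s3://source-bucket/prefix/ ./staging/
azcopy copy "./staging/" "https://myaccount.blob.core.windows.net/container/" --recursive

# Single streaming transfer with rclone: no local staging required.
rclone copy --metadata --progress aws:source-bucket/prefix azure:container
```

The rclone invocation reads from the `aws` remote and writes to the `azure` remote in one pass, which is what makes the single-command, low-footprint transfer possible.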

The following script can be used to set tags on objects when copying an object from an S3 source to a destination, even when the --metadata parameter of the rclone command does not carry them over:

#!/bin/bash
# Copy objects from srcFolder to destFolder, re-applying any S3 object tags
# as rclone metadata. The rclone commands are echoed first as a dry run.
for name in `rclone lsf con-aws-1:srcFolder --recursive`;
do
  # Skip folder entries, which rclone lsf lists with a trailing slash.
  if [[ $name != */ ]]
  then
    var=`aws s3api get-object-tagging --bucket "srcFolder" --key "$name" --output json --query "TagSet"`
    echo "$var"
    count=`echo "$var" | jq length`
    tags=""
    # Flatten the TagSet array into a Key=Value&Key=Value& string.
    for (( i=0; i < $count; i++ )); do tags+=`echo $var | jq .[$i].Key`"="`echo $var | jq .[$i].Value`"&"; done
    tags=`echo $tags | tr -d '"'`
    echo $tags
    # e.g. Key2=Value2&Key1=Value1&
    if [[ -n $tags ]]
    then
      echo rclone copy --metadata-set "$tags" con-aws-1:/srcFolder/$name con-aws-1:/destFolder
    else
      echo rclone copy con-aws-1:/srcFolder/$name con-aws-1:/destFolder
    fi
  fi
done

 

#[ 

#    { 

#        "Key": "Key2", 

#        "Value": "Value2" 

#    }, 

#    { 

#        "Key": "Key1", 

#        "Value": "Value1" 

#    } 

#] 

#Key2=Value2&Key1=Value1& 
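The inner loop that builds the Key=Value& string can also be expressed as a single jq filter, avoiding the per-element jq invocations and the trailing ampersand. The sample TagSet JSON below is hypothetical, mirroring the commented output above:

```shell
# Hypothetical TagSet JSON, as returned by `aws s3api get-object-tagging --query "TagSet"`.
var='[{"Key":"Key2","Value":"Value2"},{"Key":"Key1","Value":"Value1"}]'

# Map each tag to Key=Value, then join with "&"; -r strips the JSON quotes.
tags=$(echo "$var" | jq -r 'map(.Key + "=" + .Value) | join("&")')
echo "$tags"   # Key2=Value2&Key1=Value1
```

This runs jq once per object instead of twice per tag, which also matters when objects carry many tags.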

 

 


Thursday, April 20, 2023

Azure Data Factory and self-hosted Integration Runtime: 

This is a continuation of the articles on Azure Data Platform as they appear here. Azure Data Factory is a managed cloud service from the Azure public cloud that supports data migration and integration between networks. This article focuses on setting up a site-to-site VPN for connecting on-premises to the Azure cloud.

Azure, self-hosted and Azure-SSIS integration runtimes are the flavors of compute infrastructure that Azure Data Factory uses to provide data integration capabilities across different network environments. These include executing a data flow in a managed Azure compute environment, copying data across data stores in public or private networks, dispatching and monitoring transformation activities, and natively executing SQL Server Integration Services packages in a managed Azure compute environment. Of these, the self-hosted runtime can be used for data movement and activity dispatch across on-premises and Azure networks. The self-hosted integration runtime cannot be used for managed compute, autoscaling or data flows, but it can be used for on-premises data access, private link/private endpoint connectivity and custom components/drivers. It requires the on-premises network to be connected to Azure via ExpressRoute or VPN. The private endpoints are managed by the Azure Data Factory service. 

The setting up of site-to-site connection involves the use of Azure Virtual WAN.

An IPsec/IKE VPN connection is required to connect to Azure resources over a virtual WAN. This involves a VPN device located on-premises that has an externally facing public IP address assigned to it. The steps to set this up are: 1. Create a virtual WAN, 2. Configure virtual hub basic settings, 3. Configure site-to-site VPN gateway settings, 4. Create a site, 5. Connect the VPN site to the virtual hub, 6. Connect a VNet to the virtual hub, 7. Download the VPN device configuration file, and 8. View or edit the VPN gateway.
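The same steps can be sketched with the Azure CLI. The resource names, region, address spaces, peer IP and pre-shared key below are all placeholders, and the flags should be verified against the current `az` reference before use:

```shell
# 1. Create the virtual WAN (Standard type; Basic supports only site-to-site).
az network vwan create -g MyRG -n MyVWAN --location eastus --type Standard

# 2. Create the virtual hub with its private address space.
az network vhub create -g MyRG -n MyHub --vwan MyVWAN \
    --address-prefix 10.1.0.0/24 --location eastus

# 3. Create the site-to-site VPN gateway inside the hub.
az network vpn-gateway create -g MyRG -n MyGateway --vhub MyHub --location eastus

# 4. Create the VPN site representing the on-premises location.
az network vpn-site create -g MyRG -n MyBranch --virtual-wan MyVWAN \
    --ip-address 203.0.113.10 --address-prefixes 10.2.0.0/24 --location eastus

# 5. Connect the VPN site to the hub's gateway with a pre-shared key.
az network vpn-gateway connection create -g MyRG -n MyConnection \
    --gateway-name MyGateway --remote-vpn-site MyBranch --shared-key 'abc123'
```

After the connection is created, the device configuration file can be downloaded from the portal (or via the CLI) to configure the on-premises VPN device.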

The pre-requisites on the Azure side of the connection are 1. An Azure subscription, 2. A virtual network without any existing virtual network gateways and IP address range to use for the virtual hub private address space.

The virtual WAN is actually a set of resources collectively instantiated to represent a virtual overlay of the Azure network. It requires a subscription, resource group, location, name and a type of Basic or Standard. Basic supports only site-to-site connections, while Standard adds advanced features.

A virtual hub is required to contain a dedicated gateway for site-to-site functionality. It requires a subscription, location, name, private address space in CIDR notation, capacity in terms of routing infrastructure units, routing preference and a router Autonomous System Number (ASN).

The site-to-site connection is configured with the router ASN, Gateway scale units and routing preference as Microsoft network or Internet.

Next, a site is configured in the virtual WAN to correspond to the physical location from where the connections will be initiated. It requires the location, a name, the device vendor (such as Citrix, Cisco or Barracuda) and a private address space. Links can be added to represent the physical links at the location.

When the site is created, it can be viewed from the virtual WAN page. The VPN site is then connected to the virtual hub. Connecting a site requires several settings: a pre-shared key; a protocol choice of IKEv2 or IKEv1; IPsec settings as default or custom; a flag indicating whether the default route should propagate, so that virtual networks connecting to the hub have this gateway reachability added to their routing tables; a flag indicating whether the policy-based traffic selector must be left disabled; a flag indicating whether the traffic selector must be configured; and a connection mode chosen from default, initiator-only or responder-only.

When the connection is made, its status will show as updating. After the updating completes, the site shows the connection and connectivity status. A virtual network can then be connected and the VPN device configuration information can be downloaded.    

 

Tuesday, April 18, 2023

Azure Data Factory and self-hosted Integration Runtime: 

This is a continuation of the posts on Azure Data Platform. Azure Data Factory is a managed cloud service from the Azure public cloud that supports data migration and integration between networks. 

Azure, self-hosted and Azure-SSIS integration runtimes are the flavors of compute infrastructure that Azure Data Factory uses to provide data integration capabilities across different network environments. These include executing a data flow in a managed Azure compute environment, copying data across data stores in public or private networks, dispatching and monitoring transformation activities, and natively executing SQL Server Integration Services packages in a managed Azure compute environment. Of these, the self-hosted runtime can be used for data movement and activity dispatch across on-premises and Azure networks. The self-hosted integration runtime cannot be used for managed compute, autoscaling or data flows, but it can be used for on-premises data access, private link/private endpoint connectivity and custom components/drivers. It requires the on-premises network to be connected to Azure via ExpressRoute or VPN. The private endpoints are managed by the Azure Data Factory service. 

This article describes how we can configure a self-hosted integration runtime once it has been identified as the right choice of Integration Runtime for the purpose at hand. The right choice also positions the infrastructure against growing business needs and any future increase in the workload, especially given that there is no one size fits all approach. 

As a compute resource for the ADF pipeline, the self-hosted integration runtime benefits from being close to the data source, so that data movement and transformation activities are performed with improved performance. While the location can be decided automatically for the Azure integration runtime, some constraints apply. For example, the self-hosted integration runtime will be located in the region with the local machines, virtual machines and their scale sets. If the data store is on an on-premises network or behind a firewall, the choice is between an Azure integration runtime on a managed virtual network and a self-hosted integration runtime. 

When the data store is not publicly accessible, such as when it is behind a firewall or on a private network, some additional setup is necessary, such as the use of Azure Private Link and a load balancer in the case of the Azure integration runtime on a managed virtual network, or a VPN connection in the case of the self-hosted integration runtime. 

If the data is highly confidential, it is better to encrypt it both in transit and at rest. Communications must happen over HTTPS with TLS. Private endpoints can be created to access the Azure resources over the VNet. 

The self-hosted integration runtime is installed on customer machines, so end users maintain them. Auto-updates and expiry notifications can be set up to facilitate this. A diagnostic tool to perform health checks is also available, and, as always, Azure Monitor and Azure Log Analytics help with troubleshooting. The number of concurrent activities that the self-hosted integration runtime can run depends on the machine size and cluster size.