Monday, April 17, 2023

Azure Data Factory and self-hosted Integration Runtime:

 

This is a continuation of the articles on the Azure Data Platform as they appear in the previous posts. Azure Data Factory is a managed cloud service in the Azure public cloud that supports data migration and integration across networks.

A self-hosted integration runtime is the compute infrastructure that Azure Data Factory uses to provide data integration capabilities across different network environments. A self-hosted integration runtime can run copy activities between a cloud data store and a data store in a private network. It can also dispatch transform activities against compute resources in either network. This article describes how to create and configure a self-hosted integration runtime.
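
A self-hosted IR entry can be created in the data factory before installing the runtime software on an on-premises machine. As a minimal sketch with the Azure CLI, assuming the datafactory CLI extension is installed and using placeholder resource names:

az datafactory integration-runtime self-hosted create --resource-group MyRG --factory-name MyDataFactory --name MySelfHostedIR --description "Self-hosted IR for on-premises copy activities"

The authentication key that registers the on-premises installation against this entry can then be retrieved from the portal or the CLI.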

There are a few considerations. A self-hosted integration runtime does not have to be dedicated to a data source, a network, or an Azure Data Factory instance; it can also be shared with others in the same tenant. A single instance of the self-hosted IR is installed on a single computer, and it can either be put in sharing mode for two different Azure Data Factories, or there can be two on-premises computers, one for each data factory. The self-hosted IR provides compute capability and does not need to be installed on the same computer as the data store. Several self-hosted IRs can be installed on different machines, each forming a connection between the same data source and the Azure public cloud. This is particularly helpful for data sources that sit behind a firewall with no direct line of sight from the public cloud.

Tasks hosted on this compute capability can fail when FIPS-compliant encryption is enabled, and when that happens there are two options: 1. disable FIPS-compliant encryption, or 2. store credentials and secrets in an Azure Key Vault. The registry value that controls FIPS-compliant encryption is HKLM\System\CurrentControlSet\Control\Lsa\FipsAlgorithmPolicy\Enabled, with the value 1 for enabled and 0 for disabled.
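
On the IR host, the current FIPS setting can be inspected and, if policy allows, turned off with the built-in reg utility. This is a sketch to be run from an elevated command prompt; whether to disable FIPS mode is a policy decision:

reg query "HKLM\System\CurrentControlSet\Control\Lsa\FipsAlgorithmPolicy" /v Enabled

reg add "HKLM\System\CurrentControlSet\Control\Lsa\FipsAlgorithmPolicy" /v Enabled /t REG_DWORD /d 0 /f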

The self-hosted IR performs read-write against the on-premises data store and sets up control and data paths against Azure. It encrypts the credentials by using the Windows Data Protection Application Programming Interface (DPAPI) and saves them locally. If multiple nodes are set up for high availability, the credentials are further synchronized across the other nodes; each node encrypts the credentials by using DPAPI and stores them locally.

The Azure Data Factory pipelines communicate with the self-hosted IR to schedule and manage jobs. Communication is via a control channel that uses a shared Azure Relay connection. When an activity job needs to be run, the service queues the request along with the credential information. The IR polls the queue and starts the job.

The self-hosted integration runtime copies data between the on-premises store and cloud storage. The direction of the copy depends on how the copy activity is configured in the data pipeline. It performs these data transfers over HTTPS. It can also act as an intermediary for certain types of data transfers, but usually the copy direction, with a specific source and destination, must be specified.

 

Sunday, April 16, 2023

 

Amazon Simple Storage Service, or S3 for short, is a limitless, durable cloud storage service that can be used to store, back up, and protect data in the form of blobs or files. There is a well-defined API to access the storage, and many independent on-premises S3 storage solution providers largely conform to this protocol with suitable enhancements for their solutions. This allows high interoperability between command-line and user-interface tools for browsing the storage. Some of the well-known UI tools used to browse S3 storage on-premises or in the cloud are S3 Browser for Windows and Commander One for macOS. Commonly used command-line tools are s3cmd, azcopy, and rclone. This article highlights some of the features of the command-line tools, beginning with the common functionality and proceeding to the differences.

All command-line tools require authentication with the S3 storage prior to making API requests. Each request is given a signature based on the credentials, usually in the form of an access key and secret pair. These credentials can encapsulate a variety of permissions over the namespaces, buckets, and object hierarchy and are issued by the owner of those storage containers. The tools also recognize the account-level credentials issued by the cloud provider for its blob service as an alternative to storage account or container-specific credentials. Since the create, update, delete, and retrieve operations on storage objects and the signing of requests follow well-defined protocols, the credentials and their providers can be set as parameters to use with the tools. Some version incompatibilities might exist, but many object storage providers follow this pattern. Many of them also recognize shared access signatures (SAS) for accessing a specific container in the storage. While keys and secrets are recorded and saved for reuse, shared access signatures are not saved at the provider and expire after a set duration. Because they grant access to any bearer and cannot easily be revoked once issued, short of rotating the signing keys or deleting an associated stored access policy, they are best issued with short expirations and narrowly scoped permissions.
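
As an illustration of a container-scoped, expiring credential, a shared access signature for an Azure blob container can be generated with the Azure CLI. This is a sketch; the account name, container name, key and expiry are placeholders:

az storage container generate-sas --account-name mystorageaccount --name mycontainer --permissions rl --expiry 2023-05-01T00:00Z --account-key <account-key> --output tsv

The resulting token can then be appended to the container URL when tools such as azcopy access the container.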

Between the command-line tools s3cmd, azcopy and rclone, there is considerable variation in which storage providers they support. S3cmd is favored for Linux-based storage appliances and azcopy works well with Microsoft’s Azure public cloud, but rclone wins hands down, covering not only the widest selection of providers but also the most advanced feature set. Because the cloud providers play to the strengths of their own ecosystems and the Azure and AWS storage protocols are incompatible, azcopy can read from either cloud, whereas s3cmd works only with stores that conform to the Amazon S3 service. Storage operations with these tools are idempotent and retriable, which makes them robust against provider failures, rate limits and network connectivity disruptions. While s3cmd and azcopy must download files and folders in a step separate from uploading them, rclone can accomplish the download and upload in a single command invocation by specifying the two data connections as command-line arguments and streaming between them, which is more performant and leaves a smaller footprint. For example:

user@macbook localFolder % rclone copy --metadata --progress aws:s3-namespace azure:blob-account
Transferred: 215.273 MiB / 4.637 GiB, 5%, 197.226 KiB/s, ETA 6h32m14s

will stream from one to another leaving no trace in the local filesystem.

Saturday, April 15, 2023

This is a continuation of the posts on Azure Data Platform and discusses the connections to be made for Azure Data Factory to reach on-premises data stores.

The networking solution to connect Azure Data Factory to on-premises is the same as for any other Azure service to reach the on-premises network. It includes private endpoints for incoming traffic, virtual network integration for outgoing traffic, gateways, and on-premises connectivity. Every virtual network can have its own gateway, and this gateway can be used to connect to the on-premises network. When virtual networks connect via their gateways, the traffic between them flows through the peering configuration and uses the Azure backbone. A virtual network can be configured to use a peered network's gateway as a transit point to an on-premises network, but in that case it cannot have its own gateway, because a virtual network can use only one gateway, either local or remote. Gateway transit is a peering property that lets one virtual network use the VPN gateway in the peered virtual network for cross-premises or VNet-to-VNet connectivity.

Gateway transit allows the peered virtual networks to use the Azure VPN gateway in the hub. Connectivity available on the VPN gateway includes Site-to-Site connections to the on-premises site, Point-to-Site connections for VPN clients, and VNet-to-VNet connections to other VNets. The VPN gateway is placed in the hub virtual network, and the virtual network gateway must be in the Resource Manager deployment model, not the classic deployment model. The Resource Manager and classic deployment models are two different ways of deploying Azure solutions and involve different API sets; the two models aren’t compatible with each other. Virtual machines, storage accounts, and virtual networks support both the Resource Manager and classic deployment models, while all other services support only Resource Manager. Resources created by these two methods do not behave the same.

In a hub-and-spoke virtual network topology, gateway transit allows the spoke networks to share the VPN gateway in the hub. Routes to the gateway-connected virtual network, or to the on-premises network, propagate to the route tables of all the peered networks. This default behavior can be restricted by creating a route table with the “Disable BGP route propagation” option and associating that route table with the subnets. Configuring the peering requires the Microsoft.Network/virtualNetworkPeerings/write permission, which the Network Contributor role grants.
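
As a sketch with the Azure CLI, using placeholder resource, network and subnet names, a route table with BGP route propagation disabled can be created and associated with a subnet as follows:

az network route-table create --resource-group MyRG --name NoPropagationRT --disable-bgp-route-propagation true

az network vnet subnet update --resource-group MyRG --vnet-name SpokeRMVNet --name WorkloadSubnet --route-table NoPropagationRT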

In the Azure portal view of the Hub RM virtual network, a peering is added with the values in the “This virtual network” section set to “Allow” for traffic to the remote virtual network, “Allow” for traffic forwarded to the remote virtual network, and with the option selected to “Use this virtual network’s gateway or Route Server”. On the same page, the “Remote virtual network” section must specify the deployment model as “Resource Manager”, the virtual network as the Spoke RM virtual network, traffic to the remote virtual network as allowed, and traffic forwarded from the remote virtual network as allowed. The virtual network gateway setting for this section specifies using the remote virtual network’s gateway, which is the hub’s gateway. When this peering is created, its status appears as connected on both virtual networks. An equivalent configuration with the Azure CLI is sketched below.
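
A sketch of the equivalent Azure CLI configuration, assuming a recent CLI version, both virtual networks in the same subscription, and placeholder names:

az network vnet peering create --resource-group MyRG --name HubToSpoke --vnet-name HubRMVNet --remote-vnet SpokeRMVNet --allow-vnet-access --allow-forwarded-traffic --allow-gateway-transit

az network vnet peering create --resource-group MyRG --name SpokeToHub --vnet-name SpokeRMVNet --remote-vnet HubRMVNet --allow-vnet-access --allow-forwarded-traffic --use-remote-gateways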

To confirm that the virtual networks are peered, we can check the effective routes by spot checking the routes of a network interface in any subnet of the virtual network. These subnets will have routes with a next hop type of VNet peering for each address space in each peered virtual network.
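
For the spot check, the effective routes of a network interface can be listed with the Azure CLI; the NIC and resource group names below are placeholders:

az network nic show-effective-route-table --resource-group MyRG --name myVmNic --output table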

#codingexercise 

Problem Statement: A 0-indexed integer array nums is given.

Swaps of adjacent elements are able to be performed on nums.

A valid array meets the following conditions:

·        The largest element (any of the largest elements if there are multiple) is at the rightmost position in the array.

·        The smallest element (any of the smallest elements if there are multiple) is at the leftmost position in the array.

Return the minimum swaps required to make nums a valid array.

 

Example 1:

Input: nums = [3,4,5,5,3,1]

Output: 6

Explanation: Perform the following swaps:

- Swap 1: Swap the 3rd and 4th elements, nums is then [3,4,5,3,5,1].

- Swap 2: Swap the 4th and 5th elements, nums is then [3,4,5,3,1,5].

- Swap 3: Swap the 3rd and 4th elements, nums is then [3,4,5,1,3,5].

- Swap 4: Swap the 2nd and 3rd elements, nums is then [3,4,1,5,3,5].

- Swap 5: Swap the 1st and 2nd elements, nums is then [3,1,4,5,3,5].

- Swap 6: Swap the 0th and 1st elements, nums is then [1,3,4,5,3,5].

It can be shown that 6 swaps is the minimum swaps required to make a valid array.

Example 2:

Input: nums = [9]

Output: 0

Explanation: The array is already valid, so we return 0.

 

Constraints:

·         1 <= nums.length <= 10^5

·         1 <= nums[i] <= 10^5

Solution:

import java.util.Arrays;
import java.util.stream.Collectors;

class Solution {
    public int minimumSwaps(int[] nums) {
        int min = Arrays.stream(nums).min().getAsInt();
        int max = Arrays.stream(nums).max().getAsInt();
        int count = 0;
        // Loop until the smallest value sits at the front and the largest at the end,
        // with a guard of 2 * n swaps as an upper bound.
        while ((nums[0] != min || nums[nums.length - 1] != max) && count < 2 * nums.length) {
            // Bubble the last occurrence of the maximum to the rightmost position.
            var numsList = Arrays.stream(nums).boxed().collect(Collectors.toList());
            var end = numsList.lastIndexOf(max);
            for (int i = end; i < nums.length - 1; i++) {
                swap(nums, i, i + 1);
                count++;
            }

            // Bubble the first occurrence of the minimum to the leftmost position.
            numsList = Arrays.stream(nums).boxed().collect(Collectors.toList());
            var start = numsList.indexOf(min);
            for (int j = start; j >= 1; j--) {
                swap(nums, j, j - 1);
                count++;
            }
        }
        return count;
    }

    public void swap(int[] nums, int i, int j) {
        int temp = nums[j];
        nums[j] = nums[i];
        nums[i] = temp;
    }
}

 

Input

nums =

[3,4,5,5,3,1]

Output

6

Expected

6

 

Input

nums =

[9]

Output

0

Expected

0

 

 

 

 


Friday, April 14, 2023

 This is a continuation of the posts on Azure Data Platform and discusses the connections to be made for Azure Data Factory to reach on-premises data stores.

Computer networks are what protect hosts from attacks in public networks. They also allow connections to each other so that resources in one network can communicate with resources in another network. Networks can be on-premises or in the cloud, logical or physical, and use subnets and CIDR ranges, which can produce similar-looking private IP addresses such as 10.x.y.z that are unique and meaningful only within their own network. Gateways are often used to reach an IP address that does not belong to the current network. While gateways work well for outgoing traffic, private endpoints and DNS resolvers serve incoming requests.

There are three forms of connectivity that are often re-used patterns across different connectivity requirements. These are:

1.       Point to Point:

These are commonly used to connect one endpoint to another. Endpoints refer to a combination of IP address and port. When point-to-point connectivity is established, it allows a network flow between the two that can be uniquely identified by a 5-part tuple of source IP address, source port, destination IP address, destination port, and protocol. A rule establishing point-to-point connectivity allows bidirectional traffic and needs to be authored only once to apply to both resources.

2.       Point to Site:

 

This is established between an endpoint and a network so that it is easy for that endpoint to communicate with any resource in the destination network and for those resources to respond.

This connection is great for clients that require little or no change to their network but would like to connect with another network. When the point-to-site connection involves a virtual private network, the communications are sent through an encrypted tunnel over an IP network such as the internet.

 

3.       Site to Site:

 

These connect entire networks to each other. When both are virtual networks in the same cloud, this form of connectivity is often called peering. Site-to-site connections are not limited to cloud networks; they can also join a cloud virtual network with an independently hosted on-premises network. When the site-to-site connection involves a virtual private network, the communications are sent through an encrypted tunnel over an IP network such as the internet. In this case, the on-premises VPN device is usually connected to the Azure VPN gateway, as sketched below.
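
A sketch of the site-to-site case with the Azure CLI, assuming the hub VPN gateway already exists and using placeholder names, addresses and key:

az network local-gateway create --resource-group MyRG --name OnPremGateway --gateway-ip-address 203.0.113.10 --local-address-prefixes 10.10.0.0/16

az network vpn-connection create --resource-group MyRG --name HubToOnPrem --vnet-gateway1 HubVpnGateway --local-gateway2 OnPremGateway --shared-key <pre-shared-key>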

 

Lastly, connectivity can also be outsourced to third-party network providers so that networks, say on-premises and in the cloud, can be connected via a dedicated circuit that remains isolated from the public internet. In Azure these are ExpressRoute connections; they are more reliable and faster, with consistent latencies and higher security than typical connections over the internet. ExpressRoute providers usually offer many choices of connectivity models and pricing for their customers.

 

Thursday, April 13, 2023

This is a continuation of the previous posts on Azure Data Platform and discusses one of the best clients for accessing storage both on-premises and with cloud providers.

1. Download:

rclone config                            

Current remotes:

 

Name                 Type

====                 ====

con-s3-mac-prj-1     s3

 

e) Edit existing remote

n) New remote

d) Delete remote

r) Rename remote

c) Copy remote

s) Set configuration password

q) Quit config

e/n/d/r/c/s/q> e

 

Select remote.

Choose a number from below, or type in an existing value.

 1 > con-s3-mac-prj-1

remote> 1

 

Editing existing "con-s3-mac-prj-1" remote with options:

- type: s3

- provider: IBMCOS

- access_key_id: <your_access_key>

- secret_access_key: <your_access_secret>

- endpoint: s3-onpremise-store.company.com


2. Upload:

Data downloaded via rclone can be uploaded to Azure via AzCopy:

AzCopy is a command-line utility that can run on a system inside an on-premises network with no private link connectivity to the Azure public cloud, using outbound internet access to transfer files to an Azure storage account with or without hierarchical namespace support.

 

The command is usually specified as:

azcopy copy './localFolder' 'https://mystorageaccount.dfs.core.windows.net/mycontainer/folder/' --recursive
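
Authentication for the destination can be supplied either by appending a SAS token to the URL or by signing in with Azure AD first. A sketch, assuming the signed-in identity holds a blob data role on the account and with the tenant ID as a placeholder:

azcopy login --tenant-id <tenant-id>

azcopy copy './localFolder' 'https://mystorageaccount.dfs.core.windows.net/mycontainer/folder/' --recursive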

Wednesday, April 12, 2023

 

Working with Azure Data Factory.

This is a continuation of the articles on Azure Data Platform as they appear here and discusses the validations to be performed when using the Azure Data Factory to copy large sets of files from source to sink.

Validations to be performed:

1.       Are the file types at the source in the supported list of file formats, which includes Avro, binary, delimited text, Excel, JSON, ORC, Parquet, and XML?

2.       Is the duration of the transfer worked out in terms of size and bandwidth? A 1 PB transfer over 50 Mbps would take 64.7 months, and 100 TB over 10 Gbps would take about 1 day.

3.       Has the integrity of objects been checked with additional checksums? A checksum can be provided at the time of upload and verified on completion. Different checksums such as CRC and SHA can be used (a sketch of doing this with azcopy appears after this list).

4.       There can be up to 256 data integration units for each copy activity, in a serverless manner. Is this adequate for tuning performance? A single copy activity reads and writes with multiple workers.

5.       The dataset at source must be surveyed for folder structure, file pattern, and data schema. When testing with artificial data, ensure adequate size such as with “{repeat 10000 echo some test} > large-file” or “sudo fallocate -l 2G bigfile”

6.       Is the metadata preserved? This could include tagging. There are five built-in system properties, namely contentType, contentLanguage, contentEncoding, contentDisposition, and cacheControl, that can be preserved during the copy activity, and the same applies to all user-specified metadata. Validation of the copied metadata on a spot-checking and sampling basis is recommended. This is achieved by specifying the "preserve": ["Attributes"] setting in the copy activity JSON.

7.       The copy activity can also preserve the ACLs, owner, and group for the users. This is achieved by specifying "preserve": ["ACL", "Owner", "Group"] in the copy activity JSON. Spot checking and sampling of the copied permissions is recommended.

8.       The copy activity is executed on an integration runtime. Even copying from on-premises without direct link connectivity, but with outbound HTTP(S) connectivity, can be done via an Azure integration runtime. The pipeline around the copy activity can include multiple stages: ingest, prepare, transform and analyze, and publish.

9.       The copy activity between a source and a sink involves reading, integration runtime steps, and writing, each of which must be robust across the supported formats. Limiting the customizations of the copy activity means that separate validation of the reads, transformations, and writes can be avoided.

10.   Data Factory supports incremental copying, and validation of the delta data from a source to a sink can be done by verifying the LastModifiedDate on which the filter is based. Since it can involve scanning a large number of files for a possibly short list of changed files, this can take time. The ADF template that supports incremental copying is named “Copy new files only by LastModifiedDate” and defines six parameters: FolderPath_Source, Directory_Source, FolderPath_Destination, Directory_Destination, LastModified_From and LastModified_To. The last two are used to bound the time window. Data source connections to both the source and the sink must be specified with this template. When the connections are specified, the pipeline user interface appears and requires these six parameters to be defined along with a trigger. A tumbling window trigger that executes every fifteen minutes is sufficient to catch up on the delta when the “no end” option is specified. All components of this pipeline, including the trigger, must be published. The trigger run parameters are the same as the incremental parameters and can be referenced via @trigger().outputs.windowStartTime.

11.   Monitoring for incremental copying is set up with the same time interval as the tumbling window trigger time interval. The results of the monitoring will indicate whether only the files changed in the time window were copied in each pipeline run. It will have a link to the run ID.
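
As referenced in item 3, a minimal sketch of attaching and verifying checksums with azcopy, using MD5 and placeholder account, container and folder names:

azcopy copy './localFolder' 'https://mystorageaccount.blob.core.windows.net/mycontainer/folder/' --recursive --put-md5

azcopy copy 'https://mystorageaccount.blob.core.windows.net/mycontainer/folder/' './localFolder' --recursive --check-md5 FailIfDifferent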