Cluster computing

Tuesday, April 18, 2023

Azure Data Factory and self-hosted integration runtime:

This is a continuation of the posts on Azure Data Platform. Azure Data Factory is a managed cloud service from the Azure public cloud that supports data migration and integration between networks.

Azure, self-hosted and Azure-SSIS integration runtimes are the flavors of compute infrastructure that Azure Data Factory uses to provide data integration capabilities across different network environments. These capabilities include executing a data flow in a managed Azure compute environment, copying data across data stores in public or private networks, dispatching and monitoring transformation activities, and natively executing SQL Server Integration Services (SSIS) packages in a managed Azure compute environment. Of these runtimes, the self-hosted integration runtime can be used for data movement and activity dispatch across on-premises and Azure networks. It cannot be used for managed compute, autoscale or data flows, but it can be used for on-premises data access, private link/private endpoint and custom components or drivers. It requires the on-premises network to be connected to Azure via ExpressRoute or VPN. The private endpoints are managed by the Azure Data Factory service.
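The capability list above can be turned into a quick pre-check before committing to a self-hosted integration runtime. The following is a minimal sketch: the dictionary simply transcribes the capabilities named in this post, and the names SELF_HOSTED_CAPABILITIES and unmet_requirements are illustrative assumptions, not part of any Azure SDK.

```python
from typing import Dict, Iterable, List

# Capabilities of the self-hosted integration runtime as listed in this post.
# The structure and names are hypothetical, chosen only for illustration.
SELF_HOSTED_CAPABILITIES: Dict[str, bool] = {
    "managed compute": False,
    "autoscale": False,
    "data flow": False,
    "on-premises data access": True,
    "private link/private endpoint": True,
    "custom component/driver": True,
}


def unmet_requirements(requirements: Iterable[str]) -> List[str]:
    """Return the requirements that a self-hosted runtime cannot satisfy."""
    unmet = []
    for name in requirements:
        if name not in SELF_HOSTED_CAPABILITIES:
            # Refuse to guess about capabilities the post does not mention.
            raise ValueError(f"unknown requirement: {name!r}")
        if not SELF_HOSTED_CAPABILITIES[name]:
            unmet.append(name)
    return unmet
```

An empty result suggests the self-hosted runtime is a candidate; a non-empty result lists the needs that point toward one of the other runtime flavors.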

This article describes how to configure a self-hosted integration runtime once it has been identified as the right choice of integration runtime for the purpose at hand. The right choice also positions the infrastructure to meet growing business needs and any future increase in the workload, especially given that there is no one-size-fits-all approach.

As a compute resource for the ADF pipeline, the self-hosted integration runtime benefits from being close to the data source, so that data movement, activity dispatch and data transformation perform better. While the Azure integration runtime can decide its location automatically, some constraints apply. For example, the self-hosted integration runtime is located in the same region as the local machines, virtual machines or scale sets that host it. If the data store is on an on-premises network or behind a firewall, the integration runtime can be either an Azure integration runtime on a managed virtual network or a self-hosted integration runtime.

When the data store is not publicly accessible, such as when it is behind a firewall or on a private network, some additional setup is necessary: Azure Private Link and a Load Balancer in the case of an Azure integration runtime on a managed virtual network, or a VPN connection in the case of a self-hosted integration runtime.
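The rule in the preceding paragraph can be written down as a small lookup, which is handy as a checklist item in deployment scripts. This is only a sketch of the decision described in this post; the runtime labels "azure-managed-vnet" and "self-hosted" and the function name required_setup are hypothetical names introduced here.

```python
from typing import List


def required_setup(runtime: str, publicly_accessible: bool) -> List[str]:
    """Return the extra network setup this post calls for.

    `runtime` uses hypothetical labels: "azure-managed-vnet" or "self-hosted".
    """
    if runtime not in ("azure-managed-vnet", "self-hosted"):
        raise ValueError(f"unsupported runtime label: {runtime!r}")
    if publicly_accessible:
        # The post only requires additional setup for non-public data stores.
        return []
    if runtime == "azure-managed-vnet":
        return ["Azure Private Link", "Load Balancer"]
    return ["VPN connection"]
```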

If the data is highly confidential, it should be encrypted both in transit and at rest. Communications must happen over HTTPS with TLS. Private endpoints can be created to access Azure resources over the VNet.
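For any helper scripts that talk to endpoints alongside the runtime, the HTTPS and TLS requirement can be enforced on the client side. The sketch below uses only the Python standard library; the choice of TLS 1.2 as the minimum version is an assumption made for illustration and is not stated in this post, and strict_tls_context is a hypothetical helper name.

```python
import ssl


def strict_tls_context() -> ssl.SSLContext:
    """Build a client-side TLS context that rejects older protocol versions."""
    # create_default_context() enables certificate and hostname verification.
    context = ssl.create_default_context()
    # Assumption: TLS 1.2 is the lowest version we are willing to negotiate.
    context.minimum_version = ssl.TLSVersion.TLSv1_2
    return context
```

The returned context can be passed to standard-library clients, for example as the `context` argument of `urllib.request.urlopen`, so that a connection fails rather than silently falling back to a weaker protocol.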

The self-hosted integration runtime is installed on customer machines, so the end users maintain them. Auto-updates and expiry notifications can be set up to facilitate this. A diagnostic tool that performs some health checks is also available. As always, Azure Monitor and Azure Log Analytics help with troubleshooting. The number of concurrent activities that the self-hosted integration runtime can run depends on the machine size and the cluster size.
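To reason about the last point, a rough capacity estimate can be sketched by adding up a per-node concurrency limit across the nodes of the cluster. This additive model is a simplifying assumption for planning purposes; the actual per-node limit is determined by the service from the machine size, and the numbers used with this helper are hypothetical.

```python
from typing import Sequence


def cluster_concurrency(per_node_limits: Sequence[int]) -> int:
    """Estimate total concurrent activities as the sum of per-node limits.

    Each entry is the (assumed) concurrent-activity limit of one node.
    """
    if not per_node_limits:
        raise ValueError("a cluster needs at least one node")
    if any(limit < 1 for limit in per_node_limits):
        raise ValueError("per-node limits must be positive")
    return sum(per_node_limits)
```

Under this model, scaling up raises one node's entry, while scaling out adds another entry to the list.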
