Monday, April 17, 2023

Azure Data Factory and self-hosted Integration Runtime:


This is a continuation of the articles on the Azure Data Platform from the previous posts. Azure Data Factory is a managed service on the Azure public cloud that supports data migration and integration across networks.

The Azure self-hosted integration runtime is the compute infrastructure that Azure Data Factory uses to provide data integration capabilities across different network environments. A self-hosted integration runtime can run copy activities between a cloud data store and a data store in a private network, and it can dispatch transformation activities to compute resources in either network. This article describes how to create and configure a self-hosted integration runtime.

There are a few considerations. A self-hosted integration runtime does not have to be dedicated to a single data source, network, or Azure Data Factory instance; it can also be shared with others in the same tenant. Only one instance of the self-hosted IR can be installed on a given computer. To serve two different data factories, either put that single instance in sharing mode or install a separate instance on a second on-premises computer, one for each Data Factory. Because the self-hosted IR is compute capability, it does not need to be installed on the same computer as the data store. Several self-hosted IRs installed on different machines can form connections between the same data source and the Azure public cloud, which is particularly helpful for data sources behind a firewall with no line of sight to the public cloud.

Tasks hosted on this compute capability can fail when FIPS-compliant encryption is enabled. When that happens, there are two options: 1. disable FIPS-compliant encryption, or 2. store credentials and secrets in a Key Vault. The registry value for FIPS-compliant encryption is HKLM\System\CurrentControlSet\Control\Lsa\FipsAlgorithmPolicy\Enabled, with the value 1 for enabled and 0 for disabled.
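If you go with the first option, the value can be changed with a standard Windows registry export file applied to the machine running the IR; a minimal sketch (setting Enabled to 0 disables FIPS-compliant encryption machine-wide, so check your compliance requirements first):

```reg
Windows Registry Editor Version 5.00

; Disable FIPS-compliant encryption (0 = disabled, 1 = enabled)
[HKEY_LOCAL_MACHINE\System\CurrentControlSet\Control\Lsa\FipsAlgorithmPolicy]
"Enabled"=dword:00000000
```

A reboot is typically needed for the policy change to take effect.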

The self-hosted IR performs read-write operations against the on-premises data store and sets up control and data paths to Azure. It encrypts the credentials by using the Windows Data Protection API (DPAPI) and saves them locally. If multiple nodes are set up for high availability, the credentials are also synchronized across the other nodes, and each node encrypts them with DPAPI and stores them locally.

The Azure Data Factory pipelines communicate with the self-hosted IR to schedule and manage jobs. Communication is via a control channel that uses a shared Azure Relay connection. When an activity job needs to be run, the service queues the request along with the credential information. The IR polls the queue and starts the job.
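The queue-and-poll pattern described above can be sketched in a few lines of Python. This is purely illustrative of the dispatch model, not the actual Azure Data Factory or Azure Relay API; the names (`service_enqueue`, `ir_poll_loop`, the credential reference string) are hypothetical:

```python
import queue
import threading

# Illustrative sketch: the Data Factory service queues activity requests
# (with credential references), and the self-hosted IR polls the queue
# and starts each job. Names are hypothetical, not real ADF APIs.

job_queue = queue.Queue()
results = []

def service_enqueue(activity, credential_ref):
    """The service side: queue a job request with its credential info."""
    job_queue.put({"activity": activity, "credential_ref": credential_ref})

def ir_poll_loop():
    """The self-hosted IR side: poll the queue and run each job."""
    while True:
        job = job_queue.get()
        if job is None:  # sentinel telling the worker to stop
            break
        results.append(f"ran {job['activity']} using {job['credential_ref']}")

worker = threading.Thread(target=ir_poll_loop)
worker.start()
service_enqueue("CopyFromSqlToBlob", "keyvault:sql-conn")
job_queue.put(None)  # stop the worker after the queued job drains
worker.join()
print(results)  # ['ran CopyFromSqlToBlob using keyvault:sql-conn']
```

The real control channel rides on a shared Azure Relay connection rather than an in-process queue, but the shape is the same: the service never reaches into the private network; the IR always pulls work outward.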

The self-hosted integration runtime copies data between the on-premises store and cloud storage. The direction of the copy depends on how the copy activity is configured in the data pipeline, and it performs these transfers over HTTPS. It can also act as an intermediary for certain types of data transfers, but the direction of copying, with a specific source and sink, must usually be specified.
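The direction of a copy is fixed by the source and sink of the copy activity in the pipeline definition. A minimal sketch of such an activity, assuming an on-premises SQL Server dataset and an Azure Blob dataset (the dataset names here are hypothetical placeholders):

```json
{
  "name": "CopyOnPremSqlToBlob",
  "type": "Copy",
  "inputs": [ { "referenceName": "OnPremSqlDataset", "type": "DatasetReference" } ],
  "outputs": [ { "referenceName": "AzureBlobDataset", "type": "DatasetReference" } ],
  "typeProperties": {
    "source": { "type": "SqlServerSource" },
    "sink": { "type": "BlobSink" }
  }
}
```

Swapping the source and sink (and the corresponding datasets) reverses the direction of the copy; the self-hosted IR is selected via the linked service that backs the on-premises dataset.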

