This post continues the earlier articles in this series on the Azure Data
Platform. Azure Data Factory
is a managed cloud service from the Azure public cloud that supports data
migration and integration between networks.
The self-hosted integration runtime (IR) is the
compute infrastructure that Azure Data Factory uses to provide data
integration capabilities across different network environments. A self-hosted IR can run
copy activities between a cloud data store and a data store in a private
network. It can also dispatch transform activities to compute resources in
either network. This article describes how to create and configure a
self-hosted integration runtime.
There are a few considerations. A self-hosted
integration runtime does not have to be dedicated to a data source, a network,
or an Azure Data Factory instance; it can be shared with other data factories
in the same tenant. Only one instance of the self-hosted IR can be installed on
a single computer, so two different Azure Data Factories either share that
instance through the sharing mode or each use a separate on-premises computer.
The self-hosted IR is a compute capability and does not need to be
installed on the same computer as the data store. Several self-hosted IRs can
be installed on different machines and connect the same data source to the
Azure public cloud. This is particularly
helpful for data sources that sit behind a firewall with no line of sight to the
public cloud.
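The difference between a dedicated and a shared (linked) self-hosted IR can be
sketched as two resource definitions. This is a minimal illustration, assuming
the JSON shape used by Data Factory resource definitions (a "SelfHosted" type,
with a "linkedInfo" section for the linked case); the helper names, IR names,
and resource ID are hypothetical placeholders, and the exact property names
should be checked against the current Azure documentation.

```python
import json


def self_hosted_ir(name: str, description: str = "") -> dict:
    """Definition of a dedicated self-hosted IR (assumed JSON shape)."""
    return {
        "name": name,
        "properties": {"type": "SelfHosted", "description": description},
    }


def linked_self_hosted_ir(name: str, shared_ir_resource_id: str) -> dict:
    """Definition of a linked IR that points at an IR shared by another factory.

    The "linkedInfo" layout is an assumption based on the public resource format.
    """
    return {
        "name": name,
        "properties": {
            "type": "SelfHosted",
            "typeProperties": {
                "linkedInfo": {
                    "resourceId": shared_ir_resource_id,
                    "authorizationType": "Rbac",
                }
            },
        },
    }


if __name__ == "__main__":
    # Placeholder names; the elided resource ID is intentionally not filled in.
    print(json.dumps(self_hosted_ir("OnPremIR", "runs on one machine"), indent=2))
    print(json.dumps(linked_self_hosted_ir("LinkedIR", "<shared IR resource ID>"), indent=2))
```

In this sketch the second factory does not install anything; it only refers to
the IR that the first factory has shared.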
Tasks that run on this compute capability can fail
when FIPS-compliant encryption is enabled. When that happens, there are two
options: 1. disable FIPS-compliant encryption, or 2. store credentials and
secrets in a Key Vault. FIPS-compliant encryption is controlled by the registry
value HKLM\System\CurrentControlSet\Control\Lsa\FipsAlgorithmPolicy\Enabled,
where 1 means enabled and 0 means disabled.
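The registry value described above can be inspected with a short script. This
is a minimal sketch using the standard-library winreg module, which exists only
on Windows; the function and constant names are illustrative, and the script
only reads the value, it does not change it.

```python
import sys

# Registry location of the FIPS policy, relative to HKEY_LOCAL_MACHINE.
FIPS_KEY_PATH = r"System\CurrentControlSet\Control\Lsa\FipsAlgorithmPolicy"
FIPS_VALUE_NAME = "Enabled"


def describe_fips_policy(enabled_value: int) -> str:
    """Translate the registry value into a readable state."""
    if enabled_value == 1:
        return "enabled"
    if enabled_value == 0:
        return "disabled"
    raise ValueError(f"unexpected FIPS policy value: {enabled_value!r}")


def read_fips_enabled() -> int:
    """Read the Enabled value from the registry (Windows only)."""
    import winreg  # imported here so the module still loads on other platforms

    with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, FIPS_KEY_PATH) as key:
        value, _value_type = winreg.QueryValueEx(key, FIPS_VALUE_NAME)
    return int(value)


if __name__ == "__main__":
    if sys.platform == "win32":
        state = describe_fips_policy(read_fips_enabled())
        print("FIPS-compliant encryption is", state)
    else:
        print("The FIPS registry value can only be read on Windows.")
```

Running this on the machine that hosts the IR shows which of the two options
applies before any setting is changed.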
The self-hosted IR performs reads and writes against
the on-premises data store and sets up control and data paths to Azure. It
encrypts the credentials by using the Windows Data Protection Application
Programming Interface (DPAPI) and saves the credentials locally. If multiple nodes are
set up for high availability, the credentials are also synchronized across
the other nodes. Each node encrypts the credentials by using DPAPI and stores
them locally.
The Azure Data Factory pipelines communicate with
the self-hosted IR to schedule and manage jobs. Communication is via a control
channel that uses a shared Azure Relay connection. When an activity job needs
to be run, the service queues the request along with the credential
information. The IR polls the queue and starts the job.
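The queue-and-poll flow described above can be modeled in a few lines. This is
a toy, in-process simulation and not the Azure Relay API: the ToyControlChannel
class, the JobRequest fields, and the polling function are all hypothetical
names chosen for illustration.

```python
import queue
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class JobRequest:
    """A queued activity request; the credential is only a reference string here."""
    activity_name: str
    credential_reference: str


class ToyControlChannel:
    """Stand-in for the control channel: the service enqueues, the IR polls."""

    def __init__(self) -> None:
        self._requests: "queue.Queue[JobRequest]" = queue.Queue()

    def enqueue(self, request: JobRequest) -> None:
        self._requests.put(request)

    def poll(self) -> Optional[JobRequest]:
        """Return the next request, or None when the queue is empty."""
        try:
            return self._requests.get_nowait()
        except queue.Empty:
            return None


def run_ir_poll_loop(channel: ToyControlChannel, max_polls: int) -> List[str]:
    """Poll the channel a fixed number of times and 'start' every job found."""
    started = []
    for _ in range(max_polls):
        request = channel.poll()
        if request is None:
            continue  # nothing queued on this poll
        started.append(request.activity_name)
    return started


if __name__ == "__main__":
    channel = ToyControlChannel()
    channel.enqueue(JobRequest("CopyOrders", "credential-ref-1"))
    print(run_ir_poll_loop(channel, max_polls=3))
```

The point of the model is that the IR initiates the polling, which is why the
on-premises side does not need to accept inbound connections from the service.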
The self-hosted integration runtime copies data
between the on-premises store and cloud storage. The direction of the copy
depends on how the copy activity is configured in the data pipeline. It
performs these data transfers over HTTPS. It can also act as an
intermediary for certain types of data transfers, but usually the direction
of the copy, with a specific source and destination, must be specified.
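How the configuration determines the copy direction can be sketched by building
a copy activity definition and swapping its source and sink. This is a minimal
illustration, assuming the general JSON shape of a Data Factory copy activity;
the helper name, the activity and dataset names, and the choice of source and
sink types are placeholders, not taken from a real pipeline.

```python
import json


def copy_activity(name: str, source_dataset: str, source_type: str,
                  sink_dataset: str, sink_type: str) -> dict:
    """Build a copy activity definition; data flows from source to sink."""
    return {
        "name": name,
        "type": "Copy",
        "inputs": [{"referenceName": source_dataset, "type": "DatasetReference"}],
        "outputs": [{"referenceName": sink_dataset, "type": "DatasetReference"}],
        "typeProperties": {
            "source": {"type": source_type},
            "sink": {"type": sink_type},
        },
    }


# On-premises SQL Server to cloud blob storage (placeholder dataset names).
upload = copy_activity("OnPremToCloud", "OnPremSqlTable", "SqlServerSource",
                       "CloudBlobFolder", "BlobSink")

# The reverse direction is the same activity with source and sink exchanged.
download = copy_activity("CloudToOnPrem", "CloudBlobFolder", "BlobSource",
                         "OnPremSqlTable", "SqlServerSink")

if __name__ == "__main__":
    print(json.dumps(upload, indent=2))
    print(json.dumps(download, indent=2))
```

In both cases the self-hosted IR does the work; only the source and sink in the
activity definition decide which way the data moves.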