Working with Azure Data Factory.
This is a continuation of the earlier articles on the Azure Data Platform and discusses the validations to be performed when using Azure Data Factory to copy large sets of files from a source to a sink.
Validations to be performed:
1. Are the file types at the source among the supported file formats, which include Avro, binary, delimited text, Excel, JSON, ORC, Parquet, and XML? The format is declared on the dataset, as sketched below.
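For illustration, a minimal sketch of a delimited-text dataset definition; the dataset, linked service, container, and folder names are hypothetical placeholders:

{
  "name": "SourceDelimitedText",
  "properties": {
    "type": "DelimitedText",
    "linkedServiceName": {
      "referenceName": "SourceBlobStorage",
      "type": "LinkedServiceReference"
    },
    "typeProperties": {
      "location": {
        "type": "AzureBlobStorageLocation",
        "container": "input",
        "folderPath": "incoming"
      },
      "columnDelimiter": ",",
      "firstRowAsHeader": true
    },
    "schema": []
  }
}

A copy that needs to move files exactly as-is would instead use a Binary dataset, which skips any parsing of the content.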
2. Has the duration of the transfer been worked out in terms of size over bandwidth? As a rough check, transfer time is the data size in bits divided by the effective bandwidth: 100 TB over 10 Gbps is about 8 x 10^14 bits / 10^10 bits per second, or roughly 80,000 seconds, which is close to a day, while 1 PB over a 50 Mbps link stretches to about 64.7 months.
3. Has the integrity of the objects been checked with additional checksums? A checksum can be provided at the time of upload and verified on completion, and different checksums such as CRC and SHA can be used. The copy activity also offers a data consistency verification option, sketched below.
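One way to push part of this check into the service itself, assuming the data consistency verification setting is available for the chosen source and sink pair (it applies mainly to binary file copies), is the validateDataConsistency flag on the copy activity; the fragment below is a sketch, and the skip-on-inconsistency behavior is optional:

"typeProperties": {
  "source": { "type": "BinarySource" },
  "sink": { "type": "BinarySink" },
  "validateDataConsistency": true,
  "skipErrorFile": { "dataInconsistency": true }
}

With skipErrorFile set this way, files that fail verification are skipped instead of failing the whole run, which pairs well with the spot checking recommended later.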
4. Up to 256 data integration units (DIUs) can be applied to each copy activity, in a serverless manner. Is this adequate for tuning performance? A single copy activity reads and writes with multiple workers, and both the DIU count and the degree of parallelism can be set explicitly, as sketched below.
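A minimal sketch of the relevant fragment; the values are illustrative rather than recommendations, and leaving dataIntegrationUnits unset lets the service choose automatically:

"typeProperties": {
  "source": { "type": "BinarySource" },
  "sink": { "type": "BinarySink" },
  "dataIntegrationUnits": 128,
  "parallelCopies": 32
}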
5. The dataset at the source must be surveyed for folder structure, file pattern, and data schema. When testing with artificial data, ensure adequate size, for example with "{repeat 10000 echo some test} > large-file" or "sudo fallocate -l 2G bigfile". The file pattern found in the survey can be expressed directly on the copy source, as sketched below.
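A minimal sketch of a copy source that targets a specific pattern, assuming Blob storage and hypothetical folder and wildcard values:

"source": {
  "type": "DelimitedTextSource",
  "storeSettings": {
    "type": "AzureBlobStorageReadSettings",
    "recursive": true,
    "wildcardFolderPath": "incoming/*",
    "wildcardFileName": "*.csv"
  }
}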
6. Is the metadata preserved? This could include tagging. There are five built-in system properties, namely contentType, contentLanguage, contentEncoding, contentDisposition, and cacheControl, that can be preserved during the copy activity, and the same holds for all end-user specified metadata. Validation of the copied metadata on a spot-check and sampling basis is recommended. Preservation is achieved by specifying the "preserve": ["Attributes"] attribute in the copy activity JSON, as sketched below.
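A minimal sketch of that fragment in context; the source and sink types are placeholders for a binary copy between stores that support these attributes:

"typeProperties": {
  "source": { "type": "BinarySource" },
  "sink": { "type": "BinarySink" },
  "preserve": [ "Attributes" ]
}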
7. The copy activity also supports preserving the ACLs, owner, and group for the users. This is achieved by specifying "preserve": ["ACL", "Owner", "Group"] in the copy activity JSON, as sketched below. Spot checking and sampling of the copy activity results is recommended.
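A minimal sketch of such an activity, assuming a copy between two Azure Data Lake Storage Gen2 accounts, which is the scenario where ACL preservation applies; the dataset names are hypothetical:

{
  "name": "CopyWithAcls",
  "type": "Copy",
  "inputs": [ { "referenceName": "SourceAdlsGen2Files", "type": "DatasetReference" } ],
  "outputs": [ { "referenceName": "SinkAdlsGen2Files", "type": "DatasetReference" } ],
  "typeProperties": {
    "source": { "type": "BinarySource" },
    "sink": { "type": "BinarySink" },
    "preserve": [ "ACL", "Owner", "Group" ]
  }
}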
8. The copy activity is executed on an integration runtime. Even copying from on-premises sources without direct-link connectivity can be done via an Azure Integration Runtime when the source is reachable over HTTP. The copy activity then typically forms the ingest step of a pipeline whose stages span ingest, prepare, transform and analyze, and publish. The integration runtime is selected on the linked service, as sketched below.
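A minimal sketch of an HTTP source linked service resolved through the Azure Integration Runtime; the URL is a placeholder and the runtime reference assumes the default AutoResolveIntegrationRuntime:

{
  "name": "HttpSourceLinkedService",
  "properties": {
    "type": "HttpServer",
    "typeProperties": {
      "url": "https://example.com/exports/",
      "authenticationType": "Anonymous"
    },
    "connectVia": {
      "referenceName": "AutoResolveIntegrationRuntime",
      "type": "IntegrationRuntimeReference"
    }
  }
}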
9. The copy activity between a source and a sink involves a read, integration runtime steps, and a write, each of which must be robust against the supported formats. Limit the customizations of the copy activity so that separate validations of the reads, transformations, and writes can be avoided; a binary-to-binary copy with no schema mapping is the simplest such baseline, as sketched below.
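A minimal sketch of such a baseline copy activity; the dataset names are hypothetical, and both datasets would be of type Binary so that content is moved without parsing or column mapping:

{
  "name": "BaselineBinaryCopy",
  "type": "Copy",
  "inputs": [ { "referenceName": "SourceBinaryFiles", "type": "DatasetReference" } ],
  "outputs": [ { "referenceName": "SinkBinaryFiles", "type": "DatasetReference" } ],
  "typeProperties": {
    "source": { "type": "BinarySource" },
    "sink": { "type": "BinarySink" }
  }
}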
10. Data Factory supports incremental copying, and validation of the delta data from a source to a sink can be done by verifying the LastModifiedDate on which the filter is based. Since this can involve scanning a large number of files for a possibly short list of changed files, it can take time. The template to be used with ADF that supports incremental copying is named "Copy new files only by LastModifiedDate" and defines six parameters: FolderPath_Source, Directory_Source, FolderPath_Destination, Directory_Destination, LastModified_From, and LastModified_To. The last two are used to bound the time window. Data source connections to both the source and the sink must be specified with this template. When the connections are specified, the pipeline user interface will appear and will require these six parameters to be defined along with a trigger. A tumbling window trigger that executes every fifteen minutes is sufficient to catch up on the delta when the "no end" option is specified. All components of this pipeline, including the trigger, must be published. The trigger run parameters map onto the incremental parameters and can be referenced via @trigger().outputs.windowStartTime and @trigger().outputs.windowEndTime, as sketched below.
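A minimal sketch of such a trigger definition; the trigger name, pipeline name, and start time are placeholders, and leaving endTime out corresponds to the "no end" option:

{
  "name": "DeltaCopyTrigger",
  "properties": {
    "type": "TumblingWindowTrigger",
    "typeProperties": {
      "frequency": "Minute",
      "interval": 15,
      "startTime": "2024-01-01T00:00:00Z",
      "maxConcurrency": 1
    },
    "pipeline": {
      "pipelineReference": {
        "referenceName": "CopyNewFilesOnlyByLastModifiedDate",
        "type": "PipelineReference"
      },
      "parameters": {
        "LastModified_From": "@trigger().outputs.windowStartTime",
        "LastModified_To": "@trigger().outputs.windowEndTime"
      }
    }
  }
}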
11. Monitoring for incremental copying is set up with the same time interval as the tumbling window trigger interval. The monitoring results will indicate whether only the files changed in the time window were copied in each pipeline run, and each result carries a link to the run ID.