Working with Azure Data Factory.
This is a continuation of the earlier articles on the Azure Data Platform and discusses the validations to be performed when using Azure Data Factory to copy large sets of files from a source to a sink.
Validations to be performed:
1. Are the file types at the source among the supported file formats, which include Avro, binary, delimited text, Excel, JSON, ORC, Parquet, and XML? The format is declared on the dataset, as sketched below.
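For illustration, a minimal sketch of a delimited-text dataset definition; the dataset, linked service, container, and folder names are hypothetical placeholders:

{
  "name": "SourceDelimitedText",
  "properties": {
    "type": "DelimitedText",
    "linkedServiceName": {
      "referenceName": "SourceBlobStorage",
      "type": "LinkedServiceReference"
    },
    "typeProperties": {
      "location": {
        "type": "AzureBlobStorageLocation",
        "container": "input",
        "folderPath": "incoming"
      },
      "columnDelimiter": ",",
      "firstRowAsHeader": true
    },
    "schema": []
  }
}

A copy that needs to move files exactly as-is would instead use a Binary dataset, which skips any parsing of the content.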
2. Has the duration of the transfer been worked out in terms of size over bandwidth? As a rough check, transfer time is the data size in bits divided by the effective bandwidth: 100 TB over 10 Gbps is about 8 x 10^14 bits / 10^10 bits per second, or roughly 80,000 seconds, which is close to a day, while 1 PB over a 50 Mbps link stretches to about 64.7 months.
3. Has the integrity of the objects been checked with additional checksums? A checksum can be provided at the time of upload and verified on completion, and different checksums such as CRC and SHA can be used. The copy activity also offers a data consistency verification option, sketched below.
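One way to push part of this check into the service itself, assuming the data consistency verification setting is available for the chosen source and sink pair (it applies mainly to binary file copies), is the validateDataConsistency flag on the copy activity; the fragment below is a sketch, and the skip-on-inconsistency behavior is optional:

"typeProperties": {
  "source": { "type": "BinarySource" },
  "sink": { "type": "BinarySink" },
  "validateDataConsistency": true,
  "skipErrorFile": { "dataInconsistency": true }
}

With skipErrorFile set this way, files that fail verification are skipped instead of failing the whole run, which pairs well with the spot checking recommended later.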
4. Up to 256 data integration units (DIUs) can be applied to each copy activity, in a serverless manner. Is this adequate for tuning performance? A single copy activity reads and writes with multiple workers, and both the DIU count and the degree of parallelism can be set explicitly, as sketched below.
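A minimal sketch of the relevant fragment; the values are illustrative rather than recommendations, and leaving dataIntegrationUnits unset lets the service choose automatically:

"typeProperties": {
  "source": { "type": "BinarySource" },
  "sink": { "type": "BinarySink" },
  "dataIntegrationUnits": 128,
  "parallelCopies": 32
}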
5. The dataset at the source must be surveyed for folder structure, file pattern, and data schema. When testing with artificial data, ensure adequate size, for example with "{repeat 10000 echo some test} > large-file" or "sudo fallocate -l 2G bigfile". The file pattern found in the survey can be expressed directly on the copy source, as sketched below.
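A minimal sketch of a copy source that targets a specific pattern, assuming Blob storage and hypothetical folder and wildcard values:

"source": {
  "type": "DelimitedTextSource",
  "storeSettings": {
    "type": "AzureBlobStorageReadSettings",
    "recursive": true,
    "wildcardFolderPath": "incoming/*",
    "wildcardFileName": "*.csv"
  }
}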
6. Is the metadata preserved? This could include tagging. There are five built-in system properties, namely contentType, contentLanguage, contentEncoding, contentDisposition, and cacheControl, that can be preserved during the copy activity, and the same holds for all end-user specified metadata. Validation of the copied metadata on a spot-check and sampling basis is recommended. Preservation is achieved by specifying the "preserve": ["Attributes"] attribute in the copy activity JSON, as sketched below.
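A minimal sketch of that fragment in context; the source and sink types are placeholders for a binary copy between stores that support these attributes:

"typeProperties": {
  "source": { "type": "BinarySource" },
  "sink": { "type": "BinarySink" },
  "preserve": [ "Attributes" ]
}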
7. The copy activity also supports preserving the ACLs, owner, and group for the users. This is achieved by specifying "preserve": ["ACL", "Owner", "Group"] in the copy activity JSON, as sketched below. Spot checking and sampling of the copy activity results is recommended.
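A minimal sketch of such an activity, assuming a copy between two Azure Data Lake Storage Gen2 accounts, which is the scenario where ACL preservation applies; the dataset names are hypothetical:

{
  "name": "CopyWithAcls",
  "type": "Copy",
  "inputs": [ { "referenceName": "SourceAdlsGen2Files", "type": "DatasetReference" } ],
  "outputs": [ { "referenceName": "SinkAdlsGen2Files", "type": "DatasetReference" } ],
  "typeProperties": {
    "source": { "type": "BinarySource" },
    "sink": { "type": "BinarySink" },
    "preserve": [ "ACL", "Owner", "Group" ]
  }
}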
8. The copy activity is executed on an integration runtime. Even copying from on-premises sources without direct-link connectivity can be done via an Azure Integration Runtime when the source is reachable over HTTP. The copy activity then typically forms the ingest step of a pipeline whose stages span ingest, prepare, transform and analyze, and publish. The integration runtime is selected on the linked service, as sketched below.
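A minimal sketch of an HTTP source linked service resolved through the Azure Integration Runtime; the URL is a placeholder and the runtime reference assumes the default AutoResolveIntegrationRuntime:

{
  "name": "HttpSourceLinkedService",
  "properties": {
    "type": "HttpServer",
    "typeProperties": {
      "url": "https://example.com/exports/",
      "authenticationType": "Anonymous"
    },
    "connectVia": {
      "referenceName": "AutoResolveIntegrationRuntime",
      "type": "IntegrationRuntimeReference"
    }
  }
}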
9. The copy activity between a source and a sink involves a read, integration runtime steps, and a write, each of which must be robust against the supported formats. Limit the customizations of the copy activity so that separate validations of the reads, transformations, and writes can be avoided; a binary-to-binary copy with no schema mapping is the simplest such baseline, as sketched below.
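A minimal sketch of such a baseline copy activity; the dataset names are hypothetical, and both datasets would be of type Binary so that content is moved without parsing or column mapping:

{
  "name": "BaselineBinaryCopy",
  "type": "Copy",
  "inputs": [ { "referenceName": "SourceBinaryFiles", "type": "DatasetReference" } ],
  "outputs": [ { "referenceName": "SinkBinaryFiles", "type": "DatasetReference" } ],
  "typeProperties": {
    "source": { "type": "BinarySource" },
    "sink": { "type": "BinarySink" }
  }
}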
10. Data Factory supports incremental copying, and validation of the delta data from a source to a sink can be done by verifying the LastModifiedDate on which the filter is based. Since this can involve scanning a large number of files for a possibly short list of changed files, it can take time. The template to be used with ADF that supports incremental copying is named "Copy new files only by LastModifiedDate" and defines six parameters: FolderPath_Source, Directory_Source, FolderPath_Destination, Directory_Destination, LastModified_From, and LastModified_To. The last two are used to bound the time window. Data source connections to both the source and the sink must be specified with this template. When the connections are specified, the pipeline user interface will appear and will require these six parameters to be defined along with a trigger. A tumbling window trigger that executes every fifteen minutes is sufficient to catch up on the delta when the "no end" option is specified. All components of this pipeline, including the trigger, must be published. The trigger run parameters map onto the incremental parameters and can be referenced via @trigger().outputs.windowStartTime and @trigger().outputs.windowEndTime, as sketched below.
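A minimal sketch of such a trigger definition; the trigger name, pipeline name, and start time are placeholders, and leaving endTime out corresponds to the "no end" option:

{
  "name": "DeltaCopyTrigger",
  "properties": {
    "type": "TumblingWindowTrigger",
    "typeProperties": {
      "frequency": "Minute",
      "interval": 15,
      "startTime": "2024-01-01T00:00:00Z",
      "maxConcurrency": 1
    },
    "pipeline": {
      "pipelineReference": {
        "referenceName": "CopyNewFilesOnlyByLastModifiedDate",
        "type": "PipelineReference"
      },
      "parameters": {
        "LastModified_From": "@trigger().outputs.windowStartTime",
        "LastModified_To": "@trigger().outputs.windowEndTime"
      }
    }
  }
}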
11. Monitoring for incremental copying is set up with the same time interval as the tumbling window trigger interval. The monitoring results will indicate whether only the files changed in the time window were copied in each pipeline run, and each result carries a link to the run ID.