Monday, April 24, 2023

 

This article focuses on rolling back changes when a data transfer results in errors.

 

Both Azure Data Factory and the Data Migration tool flag errors that must be corrected before performing a migration to Azure. Warnings and errors can surface while preparing to migrate. After correcting each error, the validations can be run again to verify that all errors have been resolved.

 

Any artifacts created by a previous dry run must be removed before a new one begins. Starting from a clean slate is preferable in any data migration effort, because it is hard to sift through the artifacts of an earlier run and decide which are still relevant.
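
As a minimal sketch of such a cleanup, assuming the migration stages its data in Azure Blob Storage and that dry-run output lives under a dedicated prefix (both assumptions for illustration, not requirements of the tools above), the leftover artifacts can be swept away before the next run begins:

```python
# Hypothetical cleanup of a previous dry run's artifacts.
# Container name and the "dryrun/" prefix are illustrative assumptions.
from azure.storage.blob import ContainerClient

container = ContainerClient.from_connection_string(
    conn_str="<storage-connection-string>",
    container_name="migration-staging",
)

# Delete every blob written under the dry-run prefix so the next run starts clean.
for blob in container.list_blobs(name_starts_with="dryrun/"):
    container.delete_blob(blob.name)
    print(f"removed stale artifact: {blob.name}")
```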

 

Renaming imported data and containers to prevent conflicts at the destination is another important preparation. Reserving namespaces and allocating containers are essential for a smooth migration.
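
One simple way to reserve such a namespace, sketched below under the assumption that destination containers can carry a run identifier in their names (the naming scheme is purely illustrative), is to derive every destination name from the run itself so that a rerun never collides with earlier output:

```python
# Run-scoped destination names so reruns do not conflict with earlier output.
# The naming convention here is an assumption, not a tool requirement.
from datetime import datetime, timezone

def destination_container(base_name: str, run_id: str) -> str:
    # e.g. "sales-data" + "r202304241200" -> "sales-data-r202304241200"
    return f"{base_name}-{run_id}"

run_id = "r" + datetime.now(timezone.utc).strftime("%Y%m%d%H%M")
print(destination_container("sales-data", run_id))
```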

 

Even with the most careful planning, errors can come from environmental factors such as API failures, network disconnects, disk failures, and rate limits. Responding to them properly helps ensure that the data transfer gets a safe start, makes incremental progress throughout its duration, and finishes well. Monitoring and alerts from the copy activity during the transfer are an important tool for guaranteeing this, just as maximizing bandwidth utilization and parallelizing copy activities are important for reducing the overall duration of the transfer.
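
A rough sketch of that monitoring, using the Azure Data Factory management SDK to poll a pipeline run and surface a failure for alerting, might look like the following; the subscription, resource group, factory, and pipeline names are placeholders, and the alerting hook is left as an assumption:

```python
# Hedged sketch: start a copy pipeline, poll its run status, and flag a failure.
import time
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

# Kick off the copy pipeline and poll until it reaches a terminal state.
run = client.pipelines.create_run("<resource-group>", "<factory-name>", "<copy-pipeline>")
while True:
    status = client.pipeline_runs.get("<resource-group>", "<factory-name>", run.run_id).status
    if status not in ("Queued", "InProgress"):
        break
    time.sleep(30)

if status == "Failed":
    # Hook into whatever alerting channel is in place (email, webhook, metric alert).
    print(f"copy pipeline run {run.run_id} failed; raise an alert")
```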

A few numbers might help indicate the spectrum of copy activity in terms of size and duration. A 1 GB data transfer over a 50 Mbps connection takes about 2.7 minutes, and over a 5 Gbps connection about 0.03 minutes. Organizations usually have data on the order of terabytes or petabytes, which is orders of magnitude greater than a gigabyte. A 1 PB data transfer over 50 Mbps takes about 64.7 months, and over 10 Gbps about 0.3 months.
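
The arithmetic behind those figures is easy to verify; the calculation below ignores protocol overhead and retries, which push real-world numbers somewhat higher:

```python
# Back-of-the-envelope transfer time: size in bytes, bandwidth in bits per second.
def transfer_minutes(size_bytes: float, bandwidth_bps: float) -> float:
    return size_bytes * 8 / bandwidth_bps / 60

print(round(transfer_minutes(1e9, 50e6), 1))    # 1 GB over 50 Mbps -> ~2.7 min
print(round(transfer_minutes(1e9, 5e9), 2))     # 1 GB over 5 Gbps  -> ~0.03 min
print(round(transfer_minutes(1e15, 50e6) / (60 * 24 * 30), 1))  # 1 PB over 50 Mbps -> ~62 months
```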

Restarting the whole data transfer is impractical when the duration is on the order of days or months, so some preparation is required to make progress incremental. Fortunately, workload segregation helps isolate the data transfers so that they can run in parallel, and writing the data to different containers reduces the scope and severity of errors.
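
A sketch of that segregation, assuming the transfer can be partitioned by container and that `copy_container` stands in for whatever actually moves the data, shows how a failure stays scoped to a single unit of work:

```python
# Each container (or partition) is copied as an independent job, so a failure
# affects only that unit and only that unit needs a rerun.
from concurrent.futures import ThreadPoolExecutor, as_completed

def copy_container(name: str) -> str:
    ...  # stand-in for the real copy of this container; raises on failure
    return name

containers = ["sales-2021", "sales-2022", "sales-2023"]
failed = []
with ThreadPoolExecutor(max_workers=4) as pool:
    futures = {pool.submit(copy_container, c): c for c in containers}
    for fut in as_completed(futures):
        try:
            print(f"finished {fut.result()}")
        except Exception as exc:
            failed.append(futures[fut])
            print(f"{futures[fut]} failed: {exc}")

# Rerun only the failed units instead of restarting the whole transfer.
```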

Calls made for copying are idempotent and retriable: they detect the state of the destination and make no changes if the copy completed earlier, and they find no artifacts at the destination if it did not. Many of the errors during copying are transient, and the logs will show that a retry succeeds. Some errors, however, do not resolve on retry, and these become visible through the metrics and alerts that are set up. The dashboard provides continuous monitoring, indicates the source of the error, and helps zero in on the activity to rectify.
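
The idempotent-and-retriable behavior can be sketched as below, assuming blob-level copies and a simple existence check as the completion test; a real implementation would also compare sizes or checksums:

```python
# Skip work already done at the destination; retry transient failures with backoff.
import time
from azure.storage.blob import BlobClient

def copy_blob_once(source_url: str, dest: BlobClient) -> None:
    if dest.exists():
        return  # copy completed on an earlier attempt; make no changes
    dest.start_copy_from_url(source_url)

def copy_with_retries(source_url: str, dest: BlobClient, attempts: int = 3) -> None:
    for i in range(attempts):
        try:
            copy_blob_once(source_url, dest)
            return
        except Exception:
            if i == attempts - 1:
                raise  # not transient after all; let metrics and alerts surface it
            time.sleep(2 ** i)
```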

Finally, one of the most important considerations is that the logic and customizations performed during the copy activity must be kept to a minimum, since the data transfers span the network. When restructuring becomes part of the data transfer, or additional routines such as adding tags or metadata are included, they introduce more failure points. If these steps are deferred to the destination until after the data transfer completes, the copy activities go more smoothly.
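
Deferring that customization might look like the sketch below, where metadata is stamped onto the blobs at the destination only after the copy has finished; the container name and metadata values are illustrative assumptions:

```python
# Apply metadata at the destination after the transfer, not during the copy activity.
from azure.storage.blob import ContainerClient

container = ContainerClient.from_connection_string(
    conn_str="<storage-connection-string>",
    container_name="migrated-data",
)

for blob in container.list_blobs():
    container.get_blob_client(blob.name).set_blob_metadata(
        {"migrated": "true", "batch": "2023-04"}
    )
```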
