This article continues the discussion on the Azure Data Platform from the previous posts and focuses on best practices for copying large datasets from source to destination.
As businesses want to do more with their data, they build analytical capabilities that feel constrained by having to use the same on-premises data storage appliances. Both the existing infrastructure and the new analytical stacks increasingly depend on cloud technologies, and making the data available in the cloud aligns with recent trends in how data is used. The foremost challenge with these data transfers has been the size and count of the containers to transfer.
A few numbers help indicate the spectrum of copy activity and the duration it takes. A 1 GB data transfer over a 50 Mbps connection takes about 2.7 minutes; over a 5 Gbps connection it takes about 0.03 minutes. Organizations usually have data on the order of terabytes or petabytes, which is orders of magnitude more than a gigabyte. A 1 PB data transfer over 50 Mbps takes over 64.7 months, while over 10 Gbps it takes about 0.3 months.
The primary tool to overcome these challenges is automation, and the best place for such automation continues to be the cloud, as it facilitates orchestration between source and destination. The following are some best practices and considerations for such automation.
First, the more uniform the copy activity workload, the simpler the automation and the less fragile it becomes against the vagaries of the containers and their data. Moving large amounts of data is repetitive once the technique for moving one container is worked out; all others follow the same routine.
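For illustration, the following is a minimal sketch of such a uniform routine, assuming the azure-storage-blob Python SDK and hypothetical connection strings and container names supplied by the operator; every container goes through the same copy_container routine, which keeps the automation simple and uniform.

# A minimal sketch of a uniform copy routine; connection strings and
# container names are placeholders supplied by the operator.
from azure.storage.blob import BlobServiceClient

SOURCE_CONN_STR = "<source-connection-string>"       # assumption: provided by the operator
DEST_CONN_STR = "<destination-connection-string>"    # assumption: provided by the operator

source_service = BlobServiceClient.from_connection_string(SOURCE_CONN_STR)
dest_service = BlobServiceClient.from_connection_string(DEST_CONN_STR)

def copy_container(container_name: str) -> None:
    """Copy every blob in one container using the same routine as all others."""
    source_container = source_service.get_container_client(container_name)
    dest_container = dest_service.get_container_client(container_name)
    if not dest_container.exists():
        dest_container.create_container()
    for blob in source_container.list_blobs():
        source_blob = source_container.get_blob_client(blob.name)
        dest_blob = dest_container.get_blob_client(blob.name)
        # Server-side copy from the source blob URL; assumes the URL is readable
        # by the destination account (for example, with a SAS token appended).
        dest_blob.start_copy_from_url(source_blob.url)

# The same routine is repeated for every container in the inventory.
for name in ["container-a", "container-b"]:           # assumption: illustrative names
    copy_container(name)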
Second, the robustness of the copy activity is required and is possible when the calls made for copying are idempotent and retriable: they detect the state of the destination and make no changes if the copy completed earlier, and they find no artifacts at the destination when the copy has not yet completed. Many of the errors during copying are transient, and the logs indicate that a retry succeeds. Some, however, do not proceed further, and these become visible through the metrics and alerts that are set up. The dashboard provides continuous monitoring, points to the source of the error, and helps zero in on the activity to rectify.
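A minimal sketch of such an idempotent call is shown below, assuming the azure-storage-blob Python SDK and that the two BlobClient objects have already been constructed; the destination is probed first so that a re-run makes no changes once the copy has completed.

# A minimal sketch of an idempotent copy of a single blob; re-running it
# after a successful copy is a no-op.
from azure.storage.blob import BlobClient

def copy_blob_idempotent(source_blob: BlobClient, dest_blob: BlobClient) -> str:
    """Copy one blob only if it is not already present at the destination."""
    if dest_blob.exists():
        return "skipped"          # the copy completed earlier; nothing to change
    # The artifact is absent, so the copy has not completed; start it now.
    # Assumes the source URL is readable by the destination account.
    dest_blob.start_copy_from_url(source_blob.url)
    return "copied"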
Third, even with the most careful planning, errors can come from environmental factors such as API failures, network disconnects, disk failures and rate limits. A proper response helps ensure that the overall data transfer has a safe start, incremental progress throughout its duration and a good finish. The monitoring and alerts from the copy activity during the transfer are an important tool to guarantee that, just as it is important to maximize bandwidth utilization and the parallelization of copy activities to reduce the duration of the overall transfer.
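A minimal sketch of a retry wrapper for such transient errors follows, assuming a hypothetical copy_fn callable supplied by the caller; the attempt count and backoff values are illustrative only, and a real automation would narrow the exceptions to the transient error types it expects.

# A minimal sketch of retrying a copy call with exponential backoff and jitter.
import random
import time

def copy_with_retries(copy_fn, max_attempts: int = 5) -> None:
    """Retry a copy call on transient failures, backing off between attempts."""
    for attempt in range(1, max_attempts + 1):
        try:
            copy_fn()
            return
        except Exception as error:        # in practice, catch transient error types only
            if attempt == max_attempts:
                # Surface the error to metrics and alerts instead of retrying forever.
                raise
            delay = (2 ** attempt) + random.uniform(0, 1)
            print(f"attempt {attempt} failed with {error!r}; retrying in {delay:.1f}s")
            time.sleep(delay)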
It is important to consider both security and performance. Global read access across an entire collection of containers is more useful to the automation than individual access via separate credentials. Any time an inventory of containers, their access and their mappings to the destination must be persisted, there is a greater chance that it drifts into an inconsistent state. Dynamic determination of source and destination, and the simplicity of a copy command that does not marshal many parameters, result in faster and simpler management of the overall data transfer. Parallel copying over isolated datasets is another technique to boost copy performance so that the overall duration converges faster than a sequential pass.
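A minimal sketch of such parallel copying is shown below, assuming a hypothetical copy_container routine like the one sketched earlier; independent containers are copied concurrently so the overall duration converges faster than running them one after another.

# A minimal sketch of fanning out one copy task per isolated container.
from concurrent.futures import ThreadPoolExecutor, as_completed

def copy_all(container_names, copy_container, max_workers: int = 8) -> None:
    """Copy independent containers in parallel and report any failures."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(copy_container, name): name for name in container_names}
        for future in as_completed(futures):
            name = futures[future]
            try:
                future.result()
                print(f"{name}: done")
            except Exception as error:
                print(f"{name}: failed with {error!r}")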
Transformation sneaks into the data transfers from the source container primarily on demand from the owners of the data, as they seek to modernize its usage and want the changes included so that the data at the destination is in better shape for the code to be developed. On the other hand, it is important for the automation to take ownership of the data transfer so that customizations are kept out of the transfers as much as possible and the transaction is completed. Because the data at the destination is a copy, it is just as amenable to private transformation by the owner as it is at the source.
There is a key difference between transferring staging data and production data: the latter must be carefully scoped, isolated, secured and communicated prior to the transfer. It is in this case that certain transformations and eliminations might be performed as part of the migration. It is also important that production data deny all access before permissions are granted only to those who will use it.
The overall budget for copy activity always comprises stages for planning, deploying and executing, and the cost can be reduced by careful preparation and by leveraging tried and tested methods.