This article continues the discussion on the Azure Data Platform from the previous posts and focuses on best practices for copying large datasets from source to destination.
As businesses want to do more with their data, they build analytical capabilities that feel constrained by having to use the same on-premises data storage appliances. Both the existing infrastructure and the new analytical stacks increasingly depend on cloud technologies, and making the data available in the cloud aligns with recent trends in how data is used. The foremost challenge with these data transfers has been the size and count of the containers to transfer.
A few numbers help indicate the spectrum of copy activity and the duration it takes. A 1 GB data transfer over a 50 Mbps connection takes about 2.7 minutes; over a 5 Gbps connection it takes about 0.03 minutes. Organizations usually have data on the order of terabytes or petabytes, which is orders of magnitude more than a gigabyte. A 1 PB data transfer over 50 Mbps takes over 64.7 months, while over 10 Gbps it takes about 0.3 months.
The primary tool to overcome these challenges is automation, and the best place for such automation continues to be the cloud, as it facilitates orchestration between source and destination. The following are some best practices and considerations for such automation.
First, the more uniform the copy activity workload, the simpler the automation and the less fragile it becomes against the vagaries of the containers and their data. Moving large amounts of data is repetitive once the technique for moving one container is worked out; all others follow the same routine.
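For illustration, the following is a minimal sketch of such a uniform routine, assuming the azure-storage-blob Python SDK and hypothetical connection strings and container names supplied by the operator; every container goes through the same copy_container routine, which keeps the automation simple and uniform.

# A minimal sketch of a uniform copy routine; connection strings and
# container names are placeholders supplied by the operator.
from azure.storage.blob import BlobServiceClient

SOURCE_CONN_STR = "<source-connection-string>"       # assumption: provided by the operator
DEST_CONN_STR = "<destination-connection-string>"    # assumption: provided by the operator

source_service = BlobServiceClient.from_connection_string(SOURCE_CONN_STR)
dest_service = BlobServiceClient.from_connection_string(DEST_CONN_STR)

def copy_container(container_name: str) -> None:
    """Copy every blob in one container using the same routine as all others."""
    source_container = source_service.get_container_client(container_name)
    dest_container = dest_service.get_container_client(container_name)
    if not dest_container.exists():
        dest_container.create_container()
    for blob in source_container.list_blobs():
        source_blob = source_container.get_blob_client(blob.name)
        dest_blob = dest_container.get_blob_client(blob.name)
        # Server-side copy from the source blob URL; assumes the URL is readable
        # by the destination account (for example, with a SAS token appended).
        dest_blob.start_copy_from_url(source_blob.url)

# The same routine is repeated for every container in the inventory.
for name in ["container-a", "container-b"]:           # assumption: illustrative names
    copy_container(name)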
Second, the robustness of the copy activity is required and is possible when the calls made for copying are idempotent and retriable: they detect the state of the destination and make no changes if the copy completed earlier, and they find no artifacts at the destination when the copy has not yet completed. Many of the errors during copying are transient, and the logs indicate that a retry succeeds. Some, however, do not proceed further, and these become visible through the metrics and alerts that are set up. The dashboard provides continuous monitoring, points to the source of the error, and helps zero in on the activity to rectify.
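A minimal sketch of such an idempotent call is shown below, assuming the azure-storage-blob Python SDK and that the two BlobClient objects have already been constructed; the destination is probed first so that a re-run makes no changes once the copy has completed.

# A minimal sketch of an idempotent copy of a single blob; re-running it
# after a successful copy is a no-op.
from azure.storage.blob import BlobClient

def copy_blob_idempotent(source_blob: BlobClient, dest_blob: BlobClient) -> str:
    """Copy one blob only if it is not already present at the destination."""
    if dest_blob.exists():
        return "skipped"          # the copy completed earlier; nothing to change
    # The artifact is absent, so the copy has not completed; start it now.
    # Assumes the source URL is readable by the destination account.
    dest_blob.start_copy_from_url(source_blob.url)
    return "copied"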
Third, even with the most careful planning, errors can come from environmental factors such as API failures, network disconnects, disk failures and rate limits. A proper response helps ensure that the overall data transfer has a safe start, incremental progress throughout its duration and a good finish. The monitoring and alerts from the copy activity during the transfer are an important tool to guarantee that, just as it is important to maximize bandwidth utilization and the parallelization of copy activities to reduce the duration of the overall transfer.
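A minimal sketch of a retry wrapper for such transient errors follows, assuming a hypothetical copy_fn callable supplied by the caller; the attempt count and backoff values are illustrative only, and a real automation would narrow the exceptions to the transient error types it expects.

# A minimal sketch of retrying a copy call with exponential backoff and jitter.
import random
import time

def copy_with_retries(copy_fn, max_attempts: int = 5) -> None:
    """Retry a copy call on transient failures, backing off between attempts."""
    for attempt in range(1, max_attempts + 1):
        try:
            copy_fn()
            return
        except Exception as error:        # in practice, catch transient error types only
            if attempt == max_attempts:
                # Surface the error to metrics and alerts instead of retrying forever.
                raise
            delay = (2 ** attempt) + random.uniform(0, 1)
            print(f"attempt {attempt} failed with {error!r}; retrying in {delay:.1f}s")
            time.sleep(delay)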
It is important to consider both security and performance. Global read access across an entire collection of containers is more useful to the automation than individual access via separate credentials. Any time an inventory of containers, their access and their mappings to the destination must be persisted, there is a greater chance that it drifts into an inconsistent state. Dynamic determination of source and destination, and the simplicity of a copy command that does not marshal many parameters, result in faster and simpler management of the overall data transfer. Parallel copying over isolated datasets is another technique to boost copy performance so that the overall duration converges faster than a sequential pass.
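A minimal sketch of such parallel copying is shown below, assuming a hypothetical copy_container routine like the one sketched earlier; independent containers are copied concurrently so the overall duration converges faster than running them one after another.

# A minimal sketch of fanning out one copy task per isolated container.
from concurrent.futures import ThreadPoolExecutor, as_completed

def copy_all(container_names, copy_container, max_workers: int = 8) -> None:
    """Copy independent containers in parallel and report any failures."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(copy_container, name): name for name in container_names}
        for future in as_completed(futures):
            name = futures[future]
            try:
                future.result()
                print(f"{name}: done")
            except Exception as error:
                print(f"{name}: failed with {error!r}")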
Transformation sneaks into the data transfers from the source container primarily on demand from the owners of the data, as they seek to modernize its usage and want the changes included so that the data at the destination is in better shape for the code to be developed. On the other hand, it is important for the automation to take ownership of the data transfer so that customizations are kept out of the transfers as much as possible and the transaction is completed. Because the data at the destination is a copy, it is just as amenable to private transformation by the owner as it is at the source.
There is a key difference between transferring staging data and production data: the latter must be carefully scoped, isolated, secured and communicated prior to the transfer. It is in this case that certain transformations and eliminations might be performed as part of the migration. It is also important that production data deny all access before permissions are granted only to those who will use it.
The overall budget for copy activity always comprises stages for planning, deploying and executing, and the cost can be reduced by careful preparation and by leveraging tried and tested methods.