Friday, April 28, 2023

Microservice extractors for on-premises enterprise applications

Abstract:

Metamodels have been the traditional way to discover and describe complex legacy applications. Unfortunately, this technique has remained largely manual, with tools that are limited mainly to academic interest. There are seven metamodels, including the Knowledge Discovery Metamodel (KDM), the Abstract Syntax Tree Metamodel (ASTM), the Software Measurement Metamodel (SMM), and metamodels for program analysis, visualization, refactoring, and transformation. This paper argues that the final state for application modernization routinely converges to a known form. Since the monolithic legacy application is refactored into a ring of independent microservices, it is better for the tools to work backwards from this target state than to attempt to describe the legacy system exhaustively through metamodels. There are two main proposals in this document. First, all interfaces in a legacy system are evaluated as candidates for microservices: they are run through a set of rules in a classifier, then grouped, ranked, sorted, and selected into a shortlist of microservices. Second, designing each microservice by extracting it holistically and individually from the legacy system offers more benefit than a single end-to-end pass. The benefits of both can be summed up in terms of developer satisfaction.

Description:

Model-driven software development evolves existing systems and facilitates the creation of new software systems.

The salient features of model-driven software development include:

1.       Domain-specific languages (DSLs) that express models at different abstraction levels. 

2.       Concrete notation syntaxes for the DSLs, defined separately from the abstract syntax.

3.       Model transformations for generating code from models either directly by model-to-text transformations or indirectly by intermediate model-to-model transformations. 

An abstract syntax is defined by a metamodel, which uses a metamodeling language to describe a set of concepts and their relationships. These languages use object-oriented constructs to build metamodels. The relationship between a model and its metamodel can be described as a “conforms-to” relationship.

There are seven metamodels, including the Knowledge Discovery Metamodel (KDM), the Abstract Syntax Tree Metamodel (ASTM), the Software Measurement Metamodel (SMM), and metamodels for program analysis, visualization, refactoring, and transformation.

ASTM and KDM are complementary in modeling a software system’s syntax and semantics. ASTM uses abstract syntax trees mainly to represent the source code’s syntax, while KDM represents semantic information about the software system, ranging from the source code to higher levels of abstraction. KDM is the language of architecture: it provides a common interchange format intended for representing software assets and for tool interoperability. Platform, user interface, and data can each have their own KDM representation, organized as packages. These packages are grouped into four abstract layers to improve modularity and separation of concerns: infrastructure, program elements, runtime resources, and abstractions.

SMM is the metamodel that can represent both metrics and measurements. It includes a set of elements to describe the metrics in KDM models and their measurements.  

Take the example of modernizing a database Forms application by migrating it to a Java platform: an important part of the migration involves the PL/SQL triggers in the legacy Forms code. In a Forms application, the sets of SQL statements corresponding to triggers are tightly coupled to the user interface, and the cost of the migration project is proportional to the number and complexity of these couplings. The reverse engineering process involves extracting KDM models from the SQL code.

An extractor that generates the KDM model from SQL code can be automated. Frameworks exist that provide domain-specific languages for model extraction, and they can be used to create a model that conforms to a target KDM from a program that conforms to a grammar. Dedicated parsers help with this code-to-model transformation.
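As an illustration, here is a minimal sketch of such an extractor: it scans PL/SQL trigger text with regular expressions and emits a dictionary that loosely mirrors KDM code and action elements. The element names are placeholders rather than the normative KDM schema, and a production extractor would rely on a dedicated parser as noted above.

import re

TRIGGER_RE = re.compile(r"CREATE\s+OR\s+REPLACE\s+TRIGGER\s+(\w+)", re.IGNORECASE)
TABLE_RE = re.compile(r"\b(?:FROM|INTO|UPDATE)\s+(\w+)", re.IGNORECASE)

def extract_kdm_like_model(plsql_source: str) -> dict:
    """Toy code-to-model transformation: triggers become callable units, table references become action relations."""
    model = {"kdm:CodeModel": []}
    # Split the source so that each chunk starts with one trigger definition.
    chunks = re.split(r"(?=CREATE\s+OR\s+REPLACE\s+TRIGGER)", plsql_source, flags=re.IGNORECASE)
    for chunk in chunks:
        trigger = TRIGGER_RE.search(chunk)
        if not trigger:
            continue
        model["kdm:CodeModel"].append({
            "element": "CallableUnit",
            "name": trigger.group(1),
            # Tables the trigger reads or writes become its action relations.
            "actionRelations": sorted({t.lower() for t in TABLE_RE.findall(chunk)}),
        })
    return model

sample = """
CREATE OR REPLACE TRIGGER audit_orders
AFTER INSERT ON orders
BEGIN
  INSERT INTO order_audit SELECT * FROM orders;
END;
"""
print(extract_kdm_like_model(sample))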

With the popularity of machine learning techniques and softmax classification, extracting domain classes according to the abstract syntax tree metamodel and semantic graph information has become more meaningful. The two-step process of parsing to yield the Abstract Syntax Tree Metamodel and restructuring to express the Knowledge Discovery Metamodel is enhanced with collocation and dependency information. This results in classifications at code organization units that were previously omitted. For example, code organization and call graphs can be used as inputs for such learning.
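A small sketch of such a classifier is shown below. The feature vectors are hypothetical code-unit metrics (fan-in, fan-out, statement count, and the ratio of UI symbols) and the class labels are illustrative; with its default lbfgs solver, scikit-learn's logistic regression fits a multinomial (softmax) model over the classes.

import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical features per code unit: [fan_in, fan_out, statement_count, ui_symbol_ratio]
X = np.array([
    [2, 9, 120, 0.70],   # trigger/screen heavy units
    [1, 7,  90, 0.65],
    [8, 2, 300, 0.05],   # data access heavy units
    [9, 3, 250, 0.10],
    [4, 4,  60, 0.20],   # platform/utility units
    [5, 5,  80, 0.25],
])
y = ["ui", "ui", "data", "data", "platform", "platform"]

clf = LogisticRegression(max_iter=1000).fit(X, y)
probs = clf.predict_proba([[3, 8, 110, 0.60]])
print(dict(zip(clf.classes_, probs[0].round(2))))   # class probabilities for a new code unit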

The discovery of KDM and SMM can also be broken down into independent learning mechanisms, with dependency complexity being one of them.

The migration to microservices is sometimes described by the “horseshoe model,” which comprises three steps: reverse engineering, architectural transformation, and forward engineering. The system before the migration is the pre-existing system; the system after the migration is the new system. The transition between the two can be described in terms of the pre-existing architecture and the microservices architecture.

The reverse engineering step analyzes the pre-existing system by means of code analysis tools or existing documentation and identifies the legacy elements that are candidates for transformation into services. The transformation step restructures the pre-existing architecture into a microservice-based one, reshaping the design elements, restructuring the architecture, and altering business models and business strategies. Finally, in the forward engineering step, the design of the new system is finalized.

Therefore, the pattern of parsing, reverse engineering, restructuring, and forward engineering is the same whether it is performed once for the whole system or individually for each microservice. Repeating the cycle end-to-end for each microservice provides significant improvements, including some that were humanly impossible earlier. After all the microservices have been formed, the repetitions provide significant learnings that make them leaner and meaner, improving their quality and separation of concerns.

The motivation behind this approach is that application readiness is usually understood by going through a checklist. The operational and application readiness checklist assesses several dozen characteristics. This supports a data-driven, quantitative approach to modernization.

The use of a classifier to run these rules is well established in the industry, with plenty of precedent. Typically, the rules are evaluated in program order as a sequence of conditions; a minimal sketch of such a candidate classifier appears after the list below. Learning about interfaces can also be improved with data mining techniques. These include:

1.       Classification algorithms - These are useful for finding similar groups based on discrete variables.

They are used for true/false binary classification, and multiple-label classification is also supported. There are many techniques, but the data should either show distinct regions on a scatter plot with their own centroids or, when that is hard to tell, be scanned breadth-first for neighbors within a given radius, forming trees, or leaves if they fall short.
Use Case: Useful for categorization of symbols beyond their nomenclature. The primary use case is to see clusters of symbols that match based on features. By translating the symbols to a vector space and assessing the quality of each cluster with a sum of squared errors, it is easy to analyze a substantial number of symbols as belonging to specific clusters from a management perspective. A clustering sketch along these lines follows this list.

2.       Regression algorithms - These are particularly useful to calculate a linear relationship between a dependent and an independent variable and then use that relationship for prediction.
Use case: Source code symbols show elongated scatter plots in specific categories. Even when the symbols are dedicated to a category, their lifetimes are bounded and can be plotted along a timeline. One of the best advantages of linear regression is prediction with time as the independent variable. When a data point has many factors contributing to its occurrence, linear regression gives an immediate ability to predict where the next occurrence may happen. This is far easier than coming up with a model that is a good fit for all the data points.

3.        Segmentation algorithms - A segmentation algorithm divides data into groups, or clusters, of items that have similar properties.
Use Case: Customer segmentation based on a symbol feature set is a quite common application of this algorithm. It helps prioritize usages between consumers.

4.       Association algorithms - This is used for finding correlations between different attributes in a data set.

Use Case: Association data mining allows users to see helpful messages such as “consumers who used this set of symbols also used this other set of symbols.”

5.       Sequence analysis algorithms: These are used for finding groups via paths in sequences. A sequence clustering algorithm is like the clustering algorithms mentioned above, but instead of finding groups based on similar attributes, it finds groups based on similar paths in a sequence. A sequence is a series of events; for example, a series of web clicks by a user is a sequence. It can also be compared to the IDs of any sortable data maintained in a separate table. Usually, there is support for a sequence column: the sequence data has a nested table that contains a sequence ID, which can be any sortable data type.
Use Case: This is especially useful to find sequences in symbol usages across a variety of components. Generally, a set of SELECT SQL statements follows the opening of a database connection, which could lead to an interpretation that this querying is useful for resource state representation. This sort of data-driven sequence determination helps find new sequences and target them actively, even suggesting transitions that might have escaped the casual source code reader.

Sequence analysis helps leverage the state-based meaning encoded in the use of symbols.

6.       Outlier mining algorithms: Outliers are the rows that are most dissimilar. Given a relation R(A1, A2, ..., An) and a similarity function between rows of R, find the rows in R that are dissimilar to most points in R. The objective is to maximize the dissimilarity function under a constraint on the number of outliers, or to report the significant outliers if a threshold is given.
The choices for similarity measures between rows include distance functions such as Euclidean, Manhattan, string-edit, and graph distance, as well as L2 metrics. The choices for aggregate dissimilarity measures include the distance to the K nearest neighbors, the density of the neighborhood outside the expected range, and the attribute differences with nearby neighbors.

Use Case: The steps to determine outliers can be listed as: 1. cluster the regular data, for example via K-means, 2. compute the distance of each tuple in R to the nearest cluster center, and 3. choose the top-K rows, or those with scores outside the expected range. Finding outliers manually is sometimes humanly impossible because the volume of symbols can be quite high, yet outliers are important for discovering new insights so that they can be encompassed. If there are numerous outliers, they significantly increase the cost of building the KDM; if not, the patterns help identify efficiencies.

7.       Decision tree: This is one of the most heavily used and easiest to visualize mining algorithms. A decision tree can serve as both a classification and a regression tree. A function divides the rows into two datasets based on the value of a specific column. The two lists of rows returned are such that one set matches the criteria for the split while the other does not. When the attribute to split on is clear, this works well.

Use Case: A decision tree algorithm uses the attributes of the service symbols to make a prediction, such as whether a set of symbols representing a component can be included or excluded. The ease of visualizing the split at each level throws light on the importance of those sets. This information is useful for pruning and drawing the tree.

8.       Logistic regression: This is a form of regression that supports binary outcomes. It uses statistical measures, is highly flexible, takes any kind of input, and supports different analytical tasks. This regression dampens the effect of extreme values and evaluates several factors that affect a pair of outcomes.

Use Case: This can be used for finding repetitions in symbol usages.

9.       Neural network: This is a widely used method for machine learning involving neurons that have one or more gates for input and output. Each neuron assigns a weight, usually based on probability, to each feature, and the weights are normalized across the features, resulting in a weight matrix that articulates the underlying model in the training dataset. The model can then be used with a test dataset to predict outcome probabilities. Neurons are organized in layers; each layer is computed independently of the others, and layers can be stacked so that the output of one becomes the input to the next.

Use Case: This is widely used for softmax classifiers in NLP applied to source code as text. It finds latent semantics in the usage of symbols based on their co-occurrence.

10.   Naïve Bayes algorithm: This is probably the most straightforward statistical, probability-based data mining algorithm. The probability is simply the fraction of interesting cases over total cases. Bayes probability is conditional probability, which adjusts the probability based on the premise, that is, the evidence already observed.

Use Case: This is widely used for cases where conditions apply, especially binary conditions such as with or without. If the input variables are independent, their states can be calculated as probabilities, and if there is at least one predictable output, this algorithm can be applied. The simplicity of computing states by counting, per class, for each input variable and then displaying those states against those variables for a given value makes this algorithm easy to visualize, debug, and use as a predictor.

11.   Plugin algorithms: Several algorithms get customized to the domain they are applied to, resulting in unconventional or new algorithms. For example, a hybrid approach of clustering combined with association can help determine the relevant associations when the matrix is quite large and has a long tail of irrelevant associations from the Cartesian product. In such cases, clustering is done prior to association to determine the key items before this market-basket analysis.
Use Case: Source code symbols are notoriously prone to appearing similar, with variations, even when they pertain to the same category. These symbols do not have pre-populated fields from a template, and everyone enters input values that differ from one another. Using a hybrid approach, it is possible to preprocess these symbols with clustering before analyzing them, such as with association clustering.

12.   Simultaneous classifiers and regions-of-interest regressors: Neural network algorithms typically involve a classifier for use with the tensors or vectors, but regions-of-interest regressors provide bounding-box localizations. This form of layering allows incremental semantic improvements over the underlying raw data.

Use Case: Symbol usages are time-series data, and as more and more usages accumulate, specific time ranges become as important as the semantic classification of the symbols.

13.   Collaborative filtering: Recommendations include suggestions for a knowledge base or for finding model service symbols. To make a recommendation, first a group sharing similar tastes is found, and then the preferences of the group are used to make a ranked list of suggestions. This technique is called collaborative filtering. A common data structure for keeping track of people and their preferences is a nested dictionary. This dictionary could use a quantitative ranking, say on a scale of 1 to 5, to denote the preferences of the people in the selected group. To find similar people to form a group, some form of similarity score is used. One way to calculate this score is to plot the items that the people have ranked in common and use them as axes in a chart; the people who are close together on the chart can then form a group.

Use Case: Several approaches mentioned earlier provide a perspective on solving this case. This one differs in that the opinions of multiple pre-established profiles in a group are taken to determine the best set of interfaces to recommend.

14.   Collaborative filtering via item-based filtering: This is like the previous technique, except that the previous one was user-based and this one is item-based. It is significantly faster than the user-based approach but requires storage for an item similarity table.

Use Case: There are certain filtering cases where divulging which profiles go with which preferences is helpful to the profiles. At other times, it is preferable to use item-based similarity. Similarity scores are computed in both cases. All other considerations being the same, the item-based approach is better for a sparse dataset; both approaches perform similarly for a dense dataset.

15.   Hierarchical clustering: Although classification algorithms vary quite a lot, the hierarchical algorithm stands out and is called out separately in this category. It creates a dendrogram in which the nodes are arranged in a hierarchy.

Use Case: A domain-specific ontology in the form of a dendrogram can be quite helpful to mining algorithms.

16.   NLP algorithms: Popular NLP algorithms like BERT can be used towards text mining.  

NLP models are extremely useful for processing text from work notes and other attachments associated with the symbols.
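Returning to the classifier mentioned before the list above, the following is a minimal sketch of the rule-based candidate classifier: every interface of the legacy system is scored against a few readiness rules, ranked, and shortlisted as a microservice candidate. The fields, rules, and weights are illustrative assumptions rather than a prescribed rule set.

from dataclasses import dataclass

@dataclass
class Interface:
    name: str
    domain: str
    fan_in: int            # callers inside the monolith
    fan_out: int           # calls it makes to other modules
    shares_database: bool
    has_tests: bool

# Each rule is (label, predicate, weight); the weights are arbitrary for illustration.
RULES = [
    ("low coupling",    lambda i: i.fan_out <= 3,        3),
    ("clear consumers", lambda i: i.fan_in >= 2,         2),
    ("owns its data",   lambda i: not i.shares_database, 3),
    ("has tests",       lambda i: i.has_tests,           2),
]

def score(interface):
    return sum(weight for _, rule, weight in RULES if rule(interface))

def shortlist(interfaces, top_n=3):
    ranked = sorted(((i.name, score(i)) for i in interfaces), key=lambda pair: -pair[1])
    return ranked[:top_n]

candidates = [
    Interface("OrderEntry", "sales", 5, 2, False, True),
    Interface("Billing", "finance", 4, 6, True, True),
    Interface("Inventory", "warehouse", 3, 1, False, False),
]
print(shortlist(candidates))   # the highest scoring interfaces become the microservice shortlist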
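The clustering use case from item 1 above can be sketched just as briefly. The rows below are hypothetical symbol vectors (call count, file count, average argument count), and scikit-learn's KMeans exposes the sum of squared errors as inertia_ for judging cluster quality.

import numpy as np
from sklearn.cluster import KMeans

# Hypothetical symbol vectors: [call_count, file_count, average_argument_count]
symbol_vectors = np.array([
    [120, 14, 2.0], [110, 12, 2.2], [130, 15, 1.9],   # widely used helpers
    [10, 1, 4.0], [8, 1, 3.8],                        # narrowly used, parameter heavy
    [55, 6, 1.0], [60, 7, 1.1],                       # mid-tier accessors
])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(symbol_vectors)
print(kmeans.labels_)             # cluster membership per symbol
print(round(kmeans.inertia_, 1))  # sum of squared errors to the nearest centroids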

Machine learning algorithms are a tiny fraction of the overall code used to realize prediction systems in production. As noted in the paper “Hidden Technical Debt in Machine Learning Systems” by Sculley, Holt, and others, the machine learning code comprises mainly the model, while all the other components, such as configuration, data collection, feature extraction, data verification, process management tools, machine resource management, serving infrastructure, and monitoring, make up the rest of the stack. All these components are usually hybrid stacks, especially when the model is hosted on-premises. Public clouds do have pipelines and the relevant automation, with better management and monitoring programmability than on-premises, but it is usually easier for startups to embrace public clouds than for established large companies that have significant investments in their inventory, DevOps, and datacenters.

Monitoring and the pipeline contribute significantly towards streamlining the process and answering questions such as: Why did the model predict this? When was it trained? Who deployed it? Which release was it deployed in? At what time was the production system updated? What were the changes in the predictions? What did the key performance indicators show after the update? Public cloud services have enabled both the ML pipeline and its monitoring. The steps involved in creating a pipeline usually are: configuring a workspace and creating a datastore; downloading and storing sample data; registering and using objects for transferring intermediate data between pipeline steps; downloading and registering the model; creating and attaching the remote compute target; writing a processing script; building the pipeline by setting up the environment and stack necessary to execute the script; creating the configuration to wrap the script; creating the pipeline step with the above-mentioned environment, resources, input and output data, and a reference to the script; and submitting the pipeline. Many of these steps are easily automated with the help of built-in objects published by the public cloud services to build and run such a pipeline. A pipeline is a reusable object, one that can be invoked over the wire with a web request.
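As a compressed sketch of those steps, the following uses the v1 azureml-sdk Python objects; the script name, source directory, compute target, and experiment name are placeholders for values that would come from the actual workspace.

from azureml.core import Workspace, Experiment, Datastore
from azureml.core.runconfig import RunConfiguration
from azureml.pipeline.core import Pipeline, PipelineData
from azureml.pipeline.steps import PythonScriptStep

ws = Workspace.from_config()                           # configure the workspace
datastore = Datastore.get(ws, "workspaceblobstore")    # datastore holding the sample data
staged = PipelineData("staged", datastore=datastore)   # intermediate data between pipeline steps

run_config = RunConfiguration()                        # environment and stack for the script
step = PythonScriptStep(
    name="prepare-and-score",
    script_name="prep.py",                             # the processing script
    source_directory="./scripts",
    compute_target="cpu-cluster",                      # remote compute target attached to the workspace
    outputs=[staged],
    runconfig=run_config,
)

pipeline = Pipeline(workspace=ws, steps=[step])        # build the pipeline from its steps
run = Experiment(ws, "modernization-scoring").submit(pipeline)
run.wait_for_completion(show_output=True)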

  

Machine learning services collect the same kinds of monitoring data as the other public cloud resources. These logs, metrics and events can then be collected, routed, and analyzed to tune the machine learning model.

 

Conclusion:

Many companies say they are in the initial stages of the migration process because the number and size of legacy elements in their software portfolio continue to be a challenge to get through. That said, these companies also deploy anywhere from a handful to hundreds of microservices while still going through the migration. Some migrations require several months and even a couple of years. Management is usually supportive of migrations, and the business-IT alignment, comprising technical solutions and business strategies, is even more overwhelmingly supportive of them.

The overall quality of the microservices refactored from the original source code can be evaluated with a score against a set of well-known criteria involving DRY principles.

Microservices are implemented as small services by small teams that fit Amazon’s definition of a two-pizza team. The migration activities begin with an understanding of both the low-level and the high-level sources of information. The source code and test suites comprise the low-level sources. The high-level sources comprise textual documents, architectural documents, data models or schemas, and box-and-line diagrams. Relevant knowledge about the system also resides with people, in some extreme cases as tribal knowledge. Less common but useful sources of information include UML diagrams, contracts with customers, architecture recovery tools for information extraction, and performance data. Very rarely, there are also cases where the pre-existing system is considered so bad that its owners do not look at the source code.

Such an understanding can also be used to determine whether it is better to implement new functionality in the pre-existing system or in the new system. It can also help with improving documentation, or with understanding what to keep and what to discard in the new system.

Thursday, April 27, 2023

 

This is a continuation of the articles in the previous posts on the Azure Data Platform, and it focuses on the best practices for copying large datasets from source to destination.

 

As businesses want to do more with their data, they build analytical capabilities that feel constrained by the same on-premises data storage appliances. Both the existing infrastructure and the new analytical stacks are increasingly dependent on cloud technologies, and making the data available from the cloud leverages the recent trends in using that data. The foremost challenge with these data transfers is the size and count of the containers to transfer.

A few numbers might help indicate the spectrum of copy activity and the duration it takes. A 1 GB data transfer over a 50 Mbps connection takes about 2.7 minutes, and over a 5 Gbps connection about 0.03 minutes. Organizations usually have data on the order of terabytes or petabytes, which is orders of magnitude greater than a gigabyte. A 1 PB data transfer over 50 Mbps takes over 64.7 months, and over 10 Gbps about 0.3 months.
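The arithmetic behind these figures is simple line-rate math, as the sketch below shows; published guidance adds protocol overhead and throttling on top of these idealized numbers.

def transfer_minutes(size_gb, link_mbps):
    """Back-of-the-envelope duration at line rate, ignoring overhead and throttling."""
    bits = size_gb * 1_000_000_000 * 8          # decimal gigabytes to bits
    seconds = bits / (link_mbps * 1_000_000)    # megabits per second to bits per second
    return seconds / 60.0

print(round(transfer_minutes(1, 50), 1))        # 1 GB over 50 Mbps -> about 2.7 minutes
print(round(transfer_minutes(1, 5_000), 2))     # 1 GB over 5 Gbps -> about 0.03 minutes
print(round(transfer_minutes(1_000_000, 10_000) / (60 * 24 * 30), 2))  # 1 PB over 10 Gbps -> about 0.3 months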

The primary tool to overcome these challenges is automation and the best place for such automation continues to be the cloud as it facilitates orchestration between source and destination. The following are some of the best practices and considerations for such automations.

First, the more uniform the copy activity workload, the simpler the automation and the less fragile it becomes against the vagaries of the containers and their data. Moving large amounts of data is repetitive once the technique for moving one container is worked out and all others follow the same routine.

Second, the copy activity must be robust, which is possible when the calls made for copying are idempotent and retriable: they detect the state of the destination and make no changes if the copying was completed earlier, and the destination artifacts are absent if the copying has not been completed. Many times, the errors during copying are transient, and the logs indicate that a retry succeeds. Some copies, however, might not proceed further, and these become visible via the metrics and alerts that are set up. The dashboard provides continuous monitoring, indicates the source of the error, and helps zero in on the activity to rectify.
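A minimal sketch of that control flow, independent of any particular storage SDK, might look like the following; the callables are placeholders for the concrete read, exists, and write operations of the chosen source and destination.

import time

def copy_if_absent(src_read, dst_exists, dst_write, key, retries=3, backoff_seconds=5):
    """Idempotent, retriable copy of a single container item."""
    if dst_exists(key):                  # copying completed earlier: make no changes
        return "skipped"
    for attempt in range(1, retries + 1):
        try:
            dst_write(key, src_read(key))
            return "copied"
        except Exception:                # transient errors: back off and retry, then surface
            if attempt == retries:
                raise
            time.sleep(backoff_seconds * attempt)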

Third, even with the most careful planning, errors can come from environmental factors such as API failures, network disconnects, disk failures, and rate limits. A proper response can help ensure that the overall data transfer has a safe start, incremental progress throughout its duration, and a good finish. The monitoring and alerts from the copy activity during the transfer are an important tool to guarantee that, just as much as it is important to maximize bandwidth utilization and to parallelize copy activities to reduce the duration of the overall transfer.

It is important to consider both security and performance. Global read access against an entire collection of containers is more useful to the automation than individual access via separate credentials. Any time an inventory of containers, their access, and their mappings to the destination must be persisted, there is a greater chance that it gets into an inconsistent state. A dynamic determination of source and destination, and the simplicity of a copy command without marshalling many parameters, results in faster and simpler management of the overall data transfer. Parallel copying via isolation of datasets is another technique to boost the performance of the copy activity so that the overall duration converges faster than a sequential approach.

Transformation sneaks into the data transfers from the source container primarily on demand from the owners of the data, as they seek to modernize the usage of the data and want to include changes so that the data at the destination is in better shape for the code to be developed. On the other hand, it is important for the automation to take ownership of the data transfer so that customizations are kept out of the transfers as much as possible and the transaction is completed. The data at the destination is just as amenable to private transformation by its owner as it is at the source, because it is a copy.

There is a key difference between transferring staging data and production data in that the latter must be carefully scoped, isolated, secured, and communicated prior to the transfer. It is in this case that certain transformations and eliminations might be performed as part of the migration. It is also important that the production data deny all access before permissions are granted only to those who will use it.

The overall budget for copy activity always comprises stages for planning, deploying and executing and the cost can be reduced by careful preparation and leveraging tried and tested methods.

 

Wednesday, April 26, 2023

This is a continuation of articles as they appear here for a discussion on the Azure Data Platform: 

This article continues the discussion on copying between source and destination with a focus on the declaration of such a copy activity in the Azure Data Factory.

 

A copy activity in Azure Data Factory that copies from all the buckets under an S3 account to Azure Data Lake Storage Gen2 requires an iteration in the pipeline logic. For example,

{
    "name": "CopyPrjxPodItemsPipeline_23n",
    "properties": {
        "activities": [
            {
                "name": "ForEachItemInPod",
                "type": "ForEach",
                "dependsOn": [
                    {
                        "activity": "GetPodContents",
                        "dependencyConditions": [
                            "Succeeded"
                        ]
                    }
                ],
                "userProperties": [],
                "typeProperties": {
                    "items": {
                        "value": "@activity('GetPodContents').output.childItems",
                        "type": "Expression"
                    },
                    "isSequential": true,
                    "activities": [
                        {
                            "name": "CopyPodItem",
                            "type": "Copy",
                            "dependsOn": [],
                            "policy": {
                                "timeout": "0.12:00:00",
                                "retry": 0,
                                "retryIntervalInSeconds": 30,
                                "secureOutput": false,
                                "secureInput": false
                            },
                            "userProperties": [
                                {
                                    "name": "preserve",
                                    "value": "Attributes"
                                }
                            ],
                            "typeProperties": {
                                "source": {
                                    "type": "BinarySource",
                                    "storeSettings": {
                                        "type": "AmazonS3CompatibleReadSettings",
                                        "recursive": true
                                    },
                                    "formatSettings": {
                                        "type": "BinaryReadSettings"
                                    }
                                },
                                "sink": {
                                    "type": "BinarySink",
                                    "storeSettings": {
                                        "type": "AzureBlobFSWriteSettings",
                                        "copyBehavior": "PreserveHierarchy"
                                    }
                                },
                                "preserve": [
                                    "Attributes"
                                ],
                                "enableStaging": false
                            },
                            "inputs": [
                                {
                                    "referenceName": "SourceDataset_23n",
                                    "type": "DatasetReference",
                                    "parameters": {
                                        "bucketName": "@item().name"
                                    }
                                }
                            ],
                            "outputs": [
                                {
                                    "referenceName": "DestinationDataset_23n",
                                    "type": "DatasetReference",
                                    "parameters": {
                                        "bucketName": "@item().name"
                                    }
                                }
                            ]
                        }
                    ]
                }
            },
            {
                "name": "GetPodContents",
                "type": "GetMetadata",
                "dependsOn": [],
                "policy": {
                    "timeout": "0.12:00:00",
                    "retry": 0,
                    "retryIntervalInSeconds": 30,
                    "secureOutput": false,
                    "secureInput": false
                },
                "userProperties": [],
                "typeProperties": {
                    "dataset": {
                        "referenceName": "SourceDataset_prjx",
                        "type": "DatasetReference"
                    },
                    "fieldList": [
                        "childItems"
                    ],
                    "storeSettings": {
                        "type": "AmazonS3CompatibleReadSettings",
                        "enablePartitionDiscovery": false
                    },
                    "formatSettings": {
                        "type": "BinaryReadSettings"
                    }
                }
            }
        ],
        "annotations": [],
        "lastPublishTime": "2023-04-25T15:18:34Z"
    },
    "type": "Microsoft.DataFactory/factories/pipelines"
}

There are a few things to note about the pipeline logic above:

1.       It requires source and destination connections as a prerequisite to the copy activity.

2.       The copy activity is inside the forEach loop.

3.       The forEach loop gets the item list from the GetPodContents activity, which reads the buckets at the source.

4.       Metadata is preserved for each object as it is copied from source to destination.

5.       The iteration is sequential because, if all the activities were writing to the same location, the original length of the location might be read differently by each.

6.       The destination happens to be a distributed file system and preserves the original hierarchy of objects.
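Once published, the pipeline is a reusable object that can be run programmatically. A minimal sketch with the azure-mgmt-datafactory Python package follows; the subscription, resource group, and factory names are placeholders.

from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

# Start a run of the pipeline declared above and poll its status.
run = client.pipelines.create_run("<resource-group>", "<factory-name>", "CopyPrjxPodItemsPipeline_23n")
status = client.pipeline_runs.get("<resource-group>", "<factory-name>", run.run_id)
print(status.status)   # Queued, InProgress, Succeeded, or Failed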


Monday, April 24, 2023

 

This article focuses on the rolling back of changes when the data transfer results in errors.

 

Both Azure Data Factory and Data Migration tool flag errors that need to be corrected prior to performing a migration to Azure. Warnings and errors can be received when preparing to migrate.  After correcting each error, the validations can be run again to verify resolution of all errors.

 

Any previously created dry-run artifacts must be eliminated before a new one commences. Starting from a clean slate is preferable with any data migration effort because it is harder to sift through the artifacts from the previous run to say whether they are relevant or not.

 

Renaming imported data and containers to prevent conflicts at the destination is another important preparation. Reserving namespaces and allocating containers are essential for a smooth migration.

 

Even with the most careful planning, errors can come from environmental factors such as API failures, network disconnects, disk failures, and rate limits. A proper response can help ensure that the overall data transfer has a safe start, incremental progress throughout its duration, and a good finish. The monitoring and alerts from the copy activity during the transfer are an important tool to guarantee that, just as much as it is important to maximize bandwidth utilization and to parallelize copy activities to reduce the duration of the overall transfer.

A few numbers might help indicate the spectrum of copy activity in terms of size and duration. A 1 GB data transfer over a 50 Mbps connection takes about 2.7 minutes, and over a 5 Gbps connection about 0.03 minutes. Organizations usually have data on the order of terabytes or petabytes, which is orders of magnitude greater than a gigabyte. A 1 PB data transfer over 50 Mbps takes over 64.7 months, and over 10 Gbps about 0.3 months.

Restarting the whole data transfer is impractical when the duration is on the order of days or months, so some preparation is required to make progress incremental. Fortunately, workload segregation helps isolate the data transfers so that they can happen in parallel, and the different containers to which the data is written reduce the scope and severity of errors.

Calls made for copying are idempotent and retriable: they detect the state of the destination and make no changes if the copying was completed earlier, and the destination artifacts are absent if the copying has not been completed. Many times, the errors during copying are transient, and the logs indicate that a retry succeeds. Some copies, however, might not proceed further, and these become visible via the metrics and alerts that are set up. The dashboard provides continuous monitoring, indicates the source of the error, and helps zero in on the activity to rectify.

Finally, one of the most important considerations is that the logic and customizations during the copy activity must be reduced because the data transfers span the network. When restructuring becomes part of the data transfer, or there are additional routines that add tags or metadata, they introduce more failure points. If these are deferred to the destination until after the data transfer, the copy activities go smoothly.

Sunday, April 23, 2023

 

This is a continuation of articles as they appear here for a discussion on the Azure Data Platform: 

This article focuses on the troubleshooting of import and migration errors.

 

Both Azure Data Factory and Data Migration tool flag errors that need to be corrected prior to performing a migration to Azure. Warnings and errors can be received when preparing to migrate.  After correcting each error, the validations can be run again to verify resolution of all errors.

 

One of the first preparation steps in the task summary for importing data from on-premises must be to complete a dry run of the end-to-end import before scheduling the production import. The collection of datasets must be determined, assigned, and mapped next. Putting together this information on how the source data will appear in the destination is more than just deciding on the layout. It helps iron out all the concerns around the current and future use of the data in the cloud and its access controls, as well as the planning of resources, even before a single data transfer commences. Creating a portable backup of this collection is essential to get consensus from all stakeholders so that there is enough transparency to the data transfers.

 

Generating SAS keys to enable the import is equally important. SAS keys are allowed to expire, and they grant sufficient permissions to complete the data transfers. They can be included with the collection so that it is easy to look up the SAS key for the dataset to import. When the data is only read from the source without any modifications, there is no need to take a backup of the source. If there are transformations at the source prior to the migration, it is better to take a backup.
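A minimal sketch of generating such a SAS for a destination container with the azure-storage-blob package is shown below; the account, container, and key values are placeholders from the migration inventory, and the permissions and expiry are illustrative.

from datetime import datetime, timedelta
from azure.storage.blob import generate_container_sas, ContainerSasPermissions

sas_token = generate_container_sas(
    account_name="<storage-account>",
    container_name="<landing-container>",
    account_key="<account-key>",
    permission=ContainerSasPermissions(read=True, write=True, list=True),
    expiry=datetime.utcnow() + timedelta(days=7),   # the key is allowed to expire
)
print(sas_token)   # recorded alongside the dataset entry in the collection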

 

Any previous dry-run artifacts must be eliminated before a new one commences. Starting from a clean slate is preferable with any data migration effort because it is harder to sift through the artifacts from the previous run to say whether they are relevant or not.

 

Renaming imported data and containers to prevent conflicts at the destination is another important preparation. Reserving namespaces and allocating containers are essential for a smooth migration.

 

Billing and pay-per-use must represent the proper costs. This can be done by setting up billing for the organization.

 

Reconnecting to the destination must be error-free for all consumers, and this can be tried out prior to onboarding them. A guide for the new namespace, organization, and access control could additionally be provided.