Cluster computing

Azure Machine Learning provides an environment to create and manage the end-to-end life cycle of Machine Learning models. Unlike general purpose software, Azure machine learning has significantly different requirements such as the use of a wide variety of technologies, libraries and frameworks, separation of training and testing phases before deploying and use of a model and iterations for model tuning independent of the model creation and training etc. Azure Machine Learning’s compatibility with open-source frameworks and platforms like PyTorch and TensorFlow makes it an effective all-in-one platform for integrating and handling data and models which tremendously relieves the onus on the business to develop new capabilities. Azure Machine Learning is designed for all skill levels, with advanced MLOps features and simple no-code model creation and deployment.

We will compare this environment with TensorFlow but for those unfamiliar with the latter, here is a use case with TensorFlow. A JavaScript application performs image processing with a machine learning algorithm. When enough training data images have been processed, the model learns the characteristics of the drawings which results in their labels. Then as it runs through the test data set, it can predict the label of the drawing using the model. TensorFlow has a library called Keras which can help author the model and deploy it to an environment such as Colab where the model can be trained on a GPU. Once the training is done, the model can be loaded and run anywhere else including a browser. The power of TensorFlow is in its ability to load the model and make predictions in the browser itself.

The labeling of drawings starts with a sample of say a hundred classes. The data for each class is available on Google Cloud as numpy arrays with several images numbering say N, for that class. The dataset is pre-processed for training where it is converted to batches and outputs the probabilities.

As with any ML learning example, the data is split into 70% training set and 30% test set. There is no order to the data and the split is taken over a random set.

TensorFlow makes it easy to construct this model using the TensorFlow Lite ModelMaker. It can only present the output after the model is trained. In this case, the model must be run after the training data has labels assigned. This might be done by hand. The model works better with fewer parameters. It might contain 3 convolutional layers and 2 dense layers. The pooling size is specified for each of the convolutional layers and they are stacked up on the model. The model is trained using the tf.train.AdamOptimizer() and compiled with a loss function, optimizer just created, and a metric such as top k in terms of categorical accuracy. The summary of the model can be printed for viewing the model. With a set of epochs and batches, the model can be trained. Annotations help TensorFlow Lite converter to fuse TF.Text API. This fusion leads to a significant speedup than conventional models. The architecture for the model is also tweaked to include projection layer along with the usual convolutional layer and attention encoder mechanism which achieves similar accuracy but with much smaller model size. There is native support for HashTables for NLP models.

With the model and training/test sets defined, it is now as easy to evaluate the model and run the inference. The model can also be saved and restored. It is executed faster when there is GPU added to the computing.

When the model is trained, it can be done in batches of predefined size. The number of passes of the entire training dataset called epochs can also be set up front. A batch size of 256 and the number of steps as 5 could be used. These are called model tuning parameters. Every model has a speed, Mean Average Precision and output. The higher the precision, the lower the speed. It is helpful to visualize the training with the help of a high chart that updates the chart with the loss after each epoch. Usually there will be a downward trend in the loss which is referred to as the model is converging.

When the model is trained, it might take a lot of time say about 4 hours. When the test data has been evaluated, the model’s efficiency can be predicted using precision and recall, terms that are used to refer to positive inferences by the model and those that were indeed positive within those inferences.

Azure Machine Learning has a drag and drop interface that can be used to train and deploy models. It uses a machine learning workspace to organize shared resources such as pipelines, datasets, compute resources, registered models, published pipelines, and real-time endpoints. A visual canvas helps build end to end machine learning workflow. It trains, tests and deploys models all in the designer. The datasets and components can be dragged and dropped onto the canvas. A pipeline draft connects the components. A pipeline run can be submitted using the resources in the workspace. The training pipelines can be converted to inference pipelines and the pipelines can be published to submit a new pipeline that can be run with different parameters and datasets. A training pipeline can be reused for different models and a batch inference pipeline can be used to make predictions on new data.

Cluster computing

Wednesday, December 22, 2021

No comments:

Post a Comment