Friday, March 15, 2024

 

This is a continuation of previous articles on IaC shortcomings and their resolutions. While the previous articles focused on the Azure Machine Learning Workspace as a resource for training models, this article lists some of the choices data scientists make between models. Classification, regression, recommendation and clustering are common machine learning tasks, but the use case determines the model, so the list here is organized by purpose.

If the purpose is to predict between two categories, two-class classification models are appropriate. Simple yes-or-no answers fall into this category. If there are 100 features or fewer, a linear model such as the two-class support vector machine is a good choice. If a fast-training linear model is needed, a two-class averaged perceptron is suitable. Similarly, a two-class decision forest offers accurate training, a two-class logistic regression offers fast training, a two-class boosted decision tree is accurate and fast to train but has a large memory footprint, and a two-class neural network is accurate but has long training times. A rough comparison of these options is sketched below.
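As a rough illustration only, the sketch below compares these trade-offs using scikit-learn estimators as assumed stand-ins for the Azure ML designer components named above; the synthetic dataset and the particular estimators are choices made for this sketch, not part of the original recommendation.

# Minimal sketch: scikit-learn stand-ins for the two-class model choices above.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression, Perceptron
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

candidates = {
    "linear SVM (100 features or fewer)": LinearSVC(),
    "averaged perceptron (fast, linear)": Perceptron(),
    "logistic regression (fast)": LogisticRegression(max_iter=1000),
    "decision forest (accurate)": RandomForestClassifier(),
    "boosted decision tree (accurate, larger memory)": GradientBoostingClassifier(),
    "neural network (accurate, slow to train)": MLPClassifier(max_iter=1000),
}

for name, model in candidates.items():
    model.fit(X_train, y_train)
    print(f"{name}: test accuracy {model.score(X_test, y_test):.3f}")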

On that note, multiclass classification models are used when there are multiple possible answers. Multiclass logistic regression offers fast training, a multiclass neural network is accurate but has long training times, and a multiclass decision forest is both accurate and fast to train. One-vs-all multiclass builds on a two-class classifier, while one-vs-one multiclass is less sensitive to an imbalanced dataset at the cost of greater complexity. The multiclass boosted decision tree suits needs that are non-parametric, with fast training times and scalability.
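The one-vs-all and one-vs-one strategies in particular are composed of two-class classifiers. The sketch below illustrates that composition with scikit-learn wrappers standing in for the corresponding Azure ML components; the iris dataset is simply an assumed example.

# One-vs-all and one-vs-one built from a two-class logistic regression.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier

X, y = load_iris(return_X_y=True)

# One-vs-all: one binary classifier per class, each class against the rest.
ova = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)

# One-vs-one: one binary classifier per pair of classes.
ovo = OneVsOneClassifier(LogisticRegression(max_iter=1000)).fit(X, y)

print("one-vs-all classifiers:", len(ova.estimators_))   # 3 for 3 classes
print("one-vs-one classifiers:", len(ovo.estimators_))   # 3 * (3 - 1) / 2 = 3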

Regression models are used to make forecasts by estimating the relationships between values. Predicting a distribution calls for Fast Forest Quantile Regression; predicting event counts, Poisson Regression; fast training, Linear Regression; small data sets, Bayesian Linear Regression; accurate and fast training, Decision Forest Regression; accurate but long training, Neural Network Regression; and accurate, fast training with a large memory footprint, Boosted Decision Tree Regression.
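A minimal sketch of a few of these regression choices follows, again with scikit-learn estimators as assumed stand-ins: quantile loss approximates the idea of predicting a distribution, a Poisson regression handles event counts, and plain and Bayesian linear fits cover the fast-training and small-data cases. The synthetic data is an assumption made for the sketch.

import numpy as np
from sklearn.linear_model import LinearRegression, BayesianRidge, PoissonRegressor
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = X @ np.array([1.5, -2.0, 0.5, 0.0, 1.0]) + rng.normal(scale=0.5, size=500)

# Predict the 10th and 90th percentiles rather than a single point estimate.
low = GradientBoostingRegressor(loss="quantile", alpha=0.1).fit(X, y)
high = GradientBoostingRegressor(loss="quantile", alpha=0.9).fit(X, y)
print("80% interval for first row:", low.predict(X[:1]), high.predict(X[:1]))

# Non-negative counts suit a Poisson regression.
counts = rng.poisson(lam=np.exp(0.3 * X[:, 0] + 1.0))
print("Poisson fit score:", PoissonRegressor().fit(X, counts).score(X, counts))

# Fast linear fit and a Bayesian variant for small data sets.
print("Linear R^2:", LinearRegression().fit(X, y).score(X, y))
print("Bayesian R^2:", BayesianRidge().fit(X, y).score(X, y))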

Recommenders are used when the use case involves suggesting what might be interesting to someone. A hybrid recommender that combines collaborative filtering with a content-based approach calls for the Wide & Deep recommender, while collaborative filtering on its own can be served by an SVD recommender.
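The collaborative filtering idea behind an SVD recommender can be sketched by factorizing a toy user-item rating matrix with plain NumPy; this is only an approximation of the concept, not the Azure ML component itself, and the ratings below are invented for illustration.

import numpy as np

ratings = np.array([   # rows = users, columns = items, 0 = unrated
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
    [0, 1, 4, 5],
], dtype=float)

# Factorize and keep the top-k latent factors.
U, s, Vt = np.linalg.svd(ratings, full_matrices=False)
k = 2
reconstructed = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# The reconstruction fills in scores for previously unrated items.
print(np.round(reconstructed, 2))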

Clustering separates similar data points into intuitive groups for organization. Discovering structure through unsupervised learning can be done with a K-Means clusterer.
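A minimal K-Means sketch with scikit-learn, assuming synthetic blob data, could look like this:

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("cluster sizes:", [int((kmeans.labels_ == c).sum()) for c in range(3)])
print("centroids:\n", kmeans.cluster_centers_)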

Finding unusual occurrences, rare data points or outliers can be done with anomaly detection models such as One-Class SVM when an aggressive boundary is needed, or PCA-based anomaly detection for fast training times.
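Both approaches can be sketched as follows; the PCA reconstruction-error detector here is a hand-rolled approximation of the idea rather than the Azure ML component, and the injected outliers are an assumption of the sketch.

import numpy as np
from sklearn.svm import OneClassSVM
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
normal = rng.normal(size=(500, 10))
outliers = rng.normal(loc=6.0, size=(10, 10))
X = np.vstack([normal, outliers])

# One-Class SVM: nu controls how aggressive the boundary around normal data is.
svm_labels = OneClassSVM(nu=0.05).fit(normal).predict(X)   # -1 marks anomalies
print("One-Class SVM flagged:", int((svm_labels == -1).sum()))

# PCA-based: project to a few components and score by reconstruction error.
pca = PCA(n_components=3).fit(normal)
errors = ((X - pca.inverse_transform(pca.transform(X))) ** 2).sum(axis=1)
threshold = np.percentile(errors, 98)
print("PCA-based flagged:", int((errors > threshold).sum()))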

Image classification models interpret images by using deep learning neural networks. ResNet and DenseNet are some examples in this category.
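As a hedged sketch, a pretrained ResNet can stand in for this family, assuming PyTorch, torchvision and Pillow are installed; example.jpg is a hypothetical local image path used only for illustration.

import torch
from torchvision import models, transforms
from PIL import Image

model = models.resnet18(weights="IMAGENET1K_V1").eval()
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

image = Image.open("example.jpg")          # hypothetical local image
with torch.no_grad():
    logits = model(preprocess(image).unsqueeze(0))
print("predicted ImageNet class index:", int(logits.argmax()))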

Text analytics models interpret text. Words can be converted to values with a Word2Vec model for use in NLP tasks; cleaning operations on text, such as removal of stop words and case normalization, can be done with the Preprocess Text module; converting text to features using the Vowpal Wabbit library can be done with Feature Hashing; a dictionary of n-grams can be extracted with Extract N-Gram Features; and topic modeling can be done with Latent Dirichlet Allocation. Since text analytics is often part of a pipeline that transforms text to vectors and discovers embeddings, these tasks are often used together. Cloud services provide a great endpoint for these models: Azure Cognitive Services provides a rich text analytics API, and Azure Text Analytics v3 supports multiple languages.
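A minimal sketch of such a pipeline, using scikit-learn transformers as assumed stand-ins for the modules named above and omitting the Word2Vec step (which would typically come from a library such as gensim), might look like this with made-up sample documents:

from sklearn.feature_extraction.text import (
    ENGLISH_STOP_WORDS, HashingVectorizer, CountVectorizer)
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "The model training pipeline transforms text to vectors",
    "Azure provides a rich text analytics API in the cloud",
    "Topic modeling discovers themes across many documents",
]

# Preprocess Text analogue: lower-case and drop stop words.
cleaned = [" ".join(w for w in d.lower().split() if w not in ENGLISH_STOP_WORDS)
           for d in docs]

# Feature Hashing analogue: fixed-width hashed features.
hashed = HashingVectorizer(n_features=2**8).transform(cleaned)

# Extract N-Gram Features analogue: unigrams and bigrams.
ngrams = CountVectorizer(ngram_range=(1, 2)).fit_transform(cleaned)

# Latent Dirichlet Allocation over the n-gram counts.
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(ngrams)
print("hashed shape:", hashed.shape, "n-gram shape:", ngrams.shape)
print("topic-word matrix shape:", lda.components_.shape)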

The best machine learning model for your predictive analytics solution is driven both by the nature of the data and the purpose at hand.

Previous articles: IaCResolutionsPart92.docx

 
