Cluster computing

Some of fleet management data science algorithms are captured via a comparison table of well-known data mining algorithms as follows:

Data Mining Algorithms Description Use case

Classification algorithms This is useful for finding similar groups based on discrete variables

It is used for true/false binary classification. Multiple label classifications are also supported. There are many techniques, but the data should have either distinct regions on a scatter plot with their own centroids or if it is hard to tell, scan breadth first for the neighbors within a given radius forming trees or leaves if they fall short.

Useful for categorization of fleet path changes beyond the nomenclature. Primary use case is to see clusters of service request that match based on features. By translating to a vector space and assessing the quality of cluster with a sum of square of errors, it is easy to analyze large number of changes as belonging to specific clusters for management perspective.

Regression algorithms This is very useful to calculate a linear relationship between a dependent and independent variable, and then use that relationship for prediction. Fleet path changes demonstrate elongated scatter plots in specific categories. Even when the path changes come demanding different formations in the same category, the reorientation times are bounded and can be plotted along the timeline. One of the best advantages of linear regression is the prediction about time as an independent variable. When the data point has many factors contributing to their occurrence, a linear regression gives an immediate ability to predict where the next occurrence may happen. This is far easier to do than coming up with a model that behaves like a good fit for all the data points.

Segmentation algorithms A segmentation algorithm divides data into groups or clusters or items that have similar properties. Path change stimuli segmentation based on fleet path change feature set is a very common application of this algorithm. It helps prioritize the response to certain stimuli.

Association algorithms This is used for finding correlations between different attributes in a data set Association data mining allows these users to see helpful messages such as “stimulii who caused a path change for this fleet type also caused a path change for this other fleet formation”

Sequence Analysis Algorithms This is used for finding groups via paths in sequences. A Sequence Clustering algorithm is like a clustering algorithm mentioned above but instead of finding groups based on similar attributes, it finds groups based on similar paths in a sequence. A sequence is a series of events. For example, a series of web clicks by a user is a sequence. It can also be compared to the IDs of any sortable data maintained in a separate table. Usually, there is support for a sequence column. The sequence data has a nested table that contains a sequence ID which can be any sortable data type. This is very useful to find sequences of fleet path changes opened across customers. Generally, a transit failure could result in a cascading failure across the transport network. This sort of sequence determination in a data driven manner helps find new sequences and target them actively even suggesting the same to the stimulii who cause ath changes to the fleet formations so that they can be better prepared for failures across relays.

Sequence Analysis also helps with interactive formation changes as described here.

Outliers Mining Algorithm Outliers are the rows that are most dissimilar. Given a relation R(A1, A2, ..., An), and a similarity function between rows of R, find rows in R which are dissimilar to most point in R. The objective is to maximize dissimilarity function in with a constraint on the number of outliers or significant outliers if given.

The choices for similarity measures between rows include distance functions such as Euclidean, Manhattan, string-edits, graph-distance etc and L2 metrics. The choices for aggregate dissimilarity measures is the distance of K nearest neighbors, density of neighborhood outside the expected range and the attribute differences with nearby neighbors The steps to determine outliers can be listed as: 1. Cluster regular via K-means, 2. Compute distance of each tuple in R to nearest cluster center and 3. choose top-K rows, or those with scores outside the expected range. Finding outliers is sometimes humanly impossible because the number of path changes can be quite high. Outliers are important to discover new strategies to encompass them. If there are numerous outliers, they will significantly increase costs. If they were not, then the patterns help identify efficiencies.

Decision tree This is probably one of the most heavily used and easy to visualize mining algorithms. The decision tree is both a classification and a regression tree. A function divides the rows into two datasets based on the value of a specific column. The two list of rows that are returned are such that one set matches the criteria for the split while the other does not. When the attribute to be chosen is clear, this works well. A Decision Tree algorithm uses the attributes of the external stimulii to make a prediction such as the reorientation time on a next path change. The ease of visualization of split at each level helps throw light on the importance of those attributes. This information becomes useful to prune the tree and to draw the tree

Logistic Regression This is a form of regression that supports binary outcomes. It uses statistical measures, is highly flexible, takes any kind of input and supports different analytical tasks. This regression folds the effects of extreme values and evaluates several factors that affect a pair of outcomes. Path changes based on stimulii category can be used to predict the likelihood of a path change from a category of stimulii. It can also be used for finding repetitions in requests

Neural Network This is a widely used method for machine learning involving neurons that have one or more gates for input and output. Each neuron assigns a weight usually based on probability for each feature and the weights are normalized across resulting in a weighted matrix that articulates the underlying model in the training dataset. Then it can be used with a test data set to predict the outcome probability. Neurons are organized in layers and each layer is independent of the other and can be stacked so they take the output of one as the input to the other. Widely used for SoftMax classifier in NLP associated with fleet path changes. Since descriptions of stimulii, fleet formation changes, path adjustments and adjustment time to modified path and formation captured by spatial and temporal are conformant to narratives with metric-like quantizations, Natural Language Processing could become a significant part of the data mining and ML portfolio

Naïve Bayes algorithm This is probably the most straightforward statistical probability-based data mining algorithm compared to others.

The probability is a mere fraction of interesting cases to total cases. Bayes probability is conditional probability which adjusts the probability based on the premise. This is widely used for cases where conditions apply, especially binary conditions such as with or without. If the input variables are independent, their states can be calculated as probabilities, and if there is at least a predictable output, this algorithm can be applied. The simplicity of computing states by counting for class using each input variable and then displaying those states against those variables for a give value, makes this algorithm easy to visualize, debug and use as a predictor.

Plugin Algorithms Several algorithms get customized to the domain they are applied to resulting in unconventional or new algorithms. For example, a hybrid approach on association clustering can benefit determining relevant associations when the matrix is quite large and has a large tail of irrelevant associations from the cartesian product. In such cases, clustering could be done prior to association to determine the key items prior to this market-basket analysis. Fleet path changes are notoriously susceptible to apply with variations even when pertaining to the same category. These path changes do not have pre-populated properties from a template, and spatial and temporal changes can vary drastically along one or both. Using a hybrid approach, it is possible to preprocess these path changes with clustering before analyzing such as with association clustering.

Simultaneous classifiers and regions-of-interest regressors Neural nets algorithms typically involve a classifier for use with the tensors or vectors. But regions-of-interest regressors provide bounding-box localizations. This form of layering allows incremental semantic improvements to the underlying raw data. Fleet path changes are time-series data and as more and more are applied, specific time ranges become as important as the semantic classification of the origin of path changes and their descriptions. Using this technique, underlying issues can be discovered as tied to internal or external factors. The determination of root cause behind a handful of path changes is valuable information.

Collaborative filtering Recommendations include suggestions for a knowledge base, or to find model service requests. To make a recommendation, first a group sharing similar taste is found and then the preferences of the group is used to make a ranked list of suggestions. This technique is called collaborative filtering. A common data structure that helps with keeping track of people and their preferences is a nested dictionary. This dictionary could use a quantitative ranking say on a scale of 1 to 5 to denote the preferences of the people in the selected group. To find similar people to form a group, we use some form of a similarity score. One way to calculate this score is to plot the items that the people have ranked in common and use them as axes in a chart. Then the people who are close together on the chart can form a group. Several approaches mentioned earlier provide a perspective to solving a problem. This is different from those in that opinions from multiple participants or sensors in a stimuli creation or recognition agens are taken to determine the best set of fleet formation or path changes to recommend.

Collaborative Filtering via Item-based filtering This filtering is like the previous except that it was for user-based approach, and this is for item-based approach. It is significantly faster than the user-based approach but requires the storage for an item similarity table. There are certain filtering cases where divulging which stimuli/sensors go with what formation/path change, is helpful to the fleet manager or participants. At other times, it is preferable to use item/flight path-based similarity. Similarity scores are computed in both cases. All other considerations being same, item-based approach is better for sparse dataset. Both stimuli-based and item-based approach perform similarly for the dense dataset.

Hierarchical clustering Although classification algorithms vary quite a lot, hierarchical algorithm stands out and is called out separately in this category. It creates a dendrogram where the nodes are arranged in a hierarchy. Specific domain-based ontology in the form of dendrogram can be quite helpful to mining algorithms.

NLP algorithms Popular NLP algorithms like BERT can be used towards text mining. NLP models come very useful for processing flight path commentary and associated artifacts in the fleet flight management.

Algorithm Implementations:

https://jsfiddle.net/g2snw4da/

https://jsfiddle.net/jmd62ap3/

https://jsfiddle.net/hqs4kxrf/

https://jsfiddle.net/xdqyt89a/

#codingexercise https://1drv.ms/w/s!Ashlm-Nw-wnWhPAI9qa_UY0gWf8ZPA?e=PRMYxU

Cluster computing

Sunday, June 23, 2024

No comments:

Post a Comment