Monday, December 19, 2022

 

The application of data mining and machine learning techniques to Reverse Engineering.

An earlier article1 introduced the notion and purpose of reverse engineering. This article focuses on the transition from text to model for source code so that the abstract knowledge discovery model (KDM) can be enhanced.

The premise for doing this is similar to what a compiler does in creating a symbol table and maintaining dependencies. In particular, treating the symbols as nodes and their dependencies as edges yields a rich graph on which relationships can be superimposed and queried for different insights. These insights help with a better representation of the KDM. Some queries can be based on well-known architectural designs, such as model-view-controller, that leverage both the functionality and the layout of the source code. The purpose of this article, however, is to leverage well-known data mining algorithms to glean more insights. Even a basic linear or non-linear ranking of the symbols, followed by thresholding, can be very useful for representing the architecture, as in the sketch below.
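As a minimal sketch (not a prescribed implementation), the following Python snippet persists symbols as nodes and dependencies as edges, ranks symbols by fan-in, and keeps only those above a threshold. The symbol names, edges, and threshold are made up for illustration.

```python
# A minimal sketch of the symbol-and-dependency graph with a basic ranking.
# Symbol names and the threshold below are hypothetical.
from collections import defaultdict

edges = [
    ("OrderController", "OrderService"),
    ("OrderService", "OrderRepository"),
    ("InvoiceService", "OrderRepository"),
    ("OrderService", "Logger"),
    ("InvoiceService", "Logger"),
    ("OrderController", "Logger"),
]

graph = defaultdict(set)            # adjacency list: caller -> callees
fan_in = defaultdict(int)           # how many distinct symbols depend on a symbol
for src, dst in edges:
    if dst not in graph[src]:
        graph[src].add(dst)
        fan_in[dst] += 1

# a basic linear ranking: symbols ordered by fan-in, thresholded
threshold = 2
significant = sorted((s for s, d in fan_in.items() if d >= threshold),
                     key=lambda s: -fan_in[s])
print(significant)                  # e.g. ['Logger', 'OrderRepository']
```

Persisting the graph this way keeps the learned information available for later queries, which is all that the KDM rendering step needs.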

We cover just a few of the data mining algorithms to begin with and close with a discussion of machine learning methods, including softmax classification, that can make excellent use of co-occurrence data. Finally, we suggest that this does not need to be a one-pass KDM builder and that the use of a pipeline and metrics can be helpful for incremental or continual enhancement of the KDM. The symbol and dependency graph is merely the persistence of the information learned, which can be leveraged for analysis and reporting, such as rendering a KDM.
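To make the softmax idea concrete, here is a hedged sketch of softmax classification over co-occurrence features. The co-occurrence counts, the layer labels, and the plain gradient-descent loop are illustrative assumptions, not a prescribed method.

```python
# A minimal sketch of softmax classification where each symbol is described by
# how often it co-occurs with a small vocabulary of other symbols.
import numpy as np

# rows: symbols; columns: co-occurrence counts with ("Controller", "Repository", "View")
X = np.array([[5, 0, 1],
              [0, 6, 0],
              [1, 0, 7],
              [4, 1, 0]], dtype=float)
y = np.array([0, 1, 2, 0])            # hypothetical layer labels: 0=web, 1=data, 2=ui
num_classes = 3

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)       # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

W = np.zeros((X.shape[1], num_classes))
for _ in range(500):                            # plain gradient descent
    probs = softmax(X @ W)
    onehot = np.eye(num_classes)[y]
    grad = X.T @ (probs - onehot) / len(X)
    W -= 0.1 * grad

print(softmax(X @ W).argmax(axis=1))            # predicted layer per symbol
```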

Classification algorithms

Classification is useful for finding similar groups based on discrete variables.

It is used for true/false binary classification, and multi-label classification is also supported. There are many techniques, but the data should either form distinct regions on a scatter plot, each with its own centroid, or, when that is hard to tell, a breadth-first scan for the neighbors within a given radius can grow groups, with the points that fall short left out as noise, as in the sketch below.
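A minimal sketch of that radius-based, breadth-first grouping might look as follows; the points, the radius, and the noise label are made-up illustrations in the spirit of density-based clustering.

```python
# A minimal sketch of grouping points by scanning breadth first for neighbors
# within a given radius; isolated points are treated as noise (-1).
from collections import deque

points = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2),      # one dense region
          (5.0, 5.1), (5.2, 5.0),                  # another dense region
          (9.0, 0.0)]                              # a point that falls short
radius = 0.5

def neighbors(i):
    xi, yi = points[i]
    return [j for j, (xj, yj) in enumerate(points)
            if i != j and (xi - xj) ** 2 + (yi - yj) ** 2 <= radius ** 2]

labels = [None] * len(points)
cluster = 0
for start in range(len(points)):
    if labels[start] is not None:
        continue
    if not neighbors(start):
        labels[start] = -1                         # falls short: treat as noise
        continue
    labels[start] = cluster
    queue = deque(neighbors(start))                # breadth-first expansion
    while queue:
        j = queue.popleft()
        if labels[j] is None or labels[j] == -1:
            labels[j] = cluster
            queue.extend(k for k in neighbors(j) if labels[k] is None)
    cluster += 1

print(labels)                                      # e.g. [0, 0, 0, 1, 1, -1]
```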

 

Clustering is useful for categorizing symbols beyond their nomenclature. The primary use case is to see which clusters of symbols match based on their features. By translating the symbols into a vector space and assessing the quality of a cluster with the sum of squared errors, it is easy to analyze a large number of symbols as belonging to specific clusters from a management perspective, as sketched below.
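As an illustration, the following sketch clusters hypothetical symbol feature vectors with scikit-learn's KMeans and reads the sum of squared errors from its inertia_ attribute; the feature choices (fan-in, fan-out, lines of code) and values are assumptions.

```python
# A minimal sketch of clustering symbol feature vectors and judging cluster
# quality with the sum of squared errors (KMeans.inertia_).
import numpy as np
from sklearn.cluster import KMeans

symbols = ["OrderService", "InvoiceService", "Logger", "StringUtils"]
X = np.array([[12, 4, 800],       # hypothetical (fan_in, fan_out, lines_of_code)
              [10, 5, 650],
              [40, 0, 120],
              [35, 0,  90]], dtype=float)

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
for name, label in zip(symbols, km.labels_):
    print(name, "-> cluster", label)
print("SSE:", km.inertia_)         # sum of squared errors to the centroids
```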

Decision tree

This is probably one of the most heavily used and easiest to visualize mining algorithms. A decision tree can serve as both a classification and a regression tree. A split function divides the rows into two datasets based on the value of a specific column. The two lists of rows that are returned are such that one set matches the criterion for the split while the other does not. When the attribute to split on is clear, this works well.
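A minimal sketch of such a split function, with made-up symbol attributes, could be:

```python
# A minimal sketch of the split described above: divide rows into the set that
# matches the criterion on a given column and the set that does not.
def divide_set(rows, column, value):
    if isinstance(value, (int, float)):
        matches = lambda row: row[column] >= value
    else:
        matches = lambda row: row[column] == value
    set1 = [row for row in rows if matches(row)]
    set2 = [row for row in rows if not matches(row)]
    return set1, set2

rows = [
    # hypothetical (visibility, fan_in, layer) attributes per symbol
    ("public",  12, "service"),
    ("private",  2, "helper"),
    ("public",   7, "service"),
    ("private",  1, "helper"),
]
included, excluded = divide_set(rows, column=1, value=5)   # split on fan_in >= 5
print(included)
print(excluded)
```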

A decision tree algorithm uses the attributes of the service symbols to make a prediction, such as whether a set of symbols representing a component can be included or excluded. The ease of visualizing the split at each level helps throw light on the importance of those sets. This information becomes useful for pruning the tree and for drawing it.
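As a hedged example, the sketch below fits scikit-learn's DecisionTreeClassifier on hypothetical symbol attributes to predict include/exclude and prints the splits at each level with export_text; the features, labels, and depth limit are assumptions for illustration.

```python
# A minimal sketch of predicting include/exclude for symbols and drawing the
# splits at each level as text.
from sklearn.tree import DecisionTreeClassifier, export_text

features = ["fan_in", "fan_out", "is_public"]    # hypothetical symbol attributes
X = [[12, 3, 1],
     [ 2, 0, 0],
     [ 9, 4, 1],
     [ 1, 1, 0],
     [15, 2, 1]]
y = ["include", "exclude", "include", "exclude", "include"]

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
print(export_text(tree, feature_names=features))  # the split at each level
print(tree.predict([[8, 2, 1]]))                  # e.g. ['include']
```

Limiting the depth (or cutting back splits that add little) is one simple way to prune before drawing the tree.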

