The application of data mining and machine
learning techniques to Reverse Engineering.
An earlier article [1] introduced the
notion and purpose of reverse engineering. This article focuses on the
transition from text to model for source code, so that the abstract
knowledge discovery model (KDM) can be enhanced.
The premise is similar to what a
compiler does when it creates a symbol table and maintains dependencies. In
particular, treating the symbols as nodes and their dependencies as
edges yields a rich graph on which relationships can be superimposed and queried
for different insights. These insights help build a better representation of the
KDM. Some queries can be based on well-known architecture
designs, such as model-view-controller, that leverage both the functionality and
the layout of source code. The purpose of this article, however, is to leverage
well-known data mining algorithms to glean more insights. Even a basic linear
or non-linear ranking of the symbols, followed by thresholding, can be very useful
for representing the architecture.
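The ranking-and-thresholding idea can be sketched in a few lines. This is only a minimal illustration, not the method used to build the KDM: the symbol names and edges are hypothetical, and counting dependents (in-degree) is assumed here as one simple linear ranking among many possible ones.

```python
from collections import defaultdict

# Hypothetical dependency edges: (symbol, symbol it depends on).
EDGES = [
    ("OrderController", "OrderService"),
    ("OrderController", "OrderView"),
    ("OrderService", "OrderRepository"),
    ("InvoiceService", "OrderRepository"),
    ("OrderRepository", "Database"),
    ("InvoiceService", "Database"),
]


def build_graph(edges):
    """Return an adjacency map: symbol -> set of symbols it depends on."""
    graph = defaultdict(set)
    for src, dst in edges:
        graph[src].add(dst)
        graph[dst]  # touch the key so leaf symbols also appear as nodes
    return graph


def rank_by_dependents(graph):
    """Linear ranking: score each symbol by how many symbols depend on it."""
    scores = {symbol: 0 for symbol in graph}
    for deps in graph.values():
        for dep in deps:
            scores[dep] += 1
    return scores


def threshold(scores, minimum):
    """Keep symbols scoring at least `minimum`, highest score first."""
    kept = [(s, v) for s, v in scores.items() if v >= minimum]
    return sorted(kept, key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    scores = rank_by_dependents(build_graph(EDGES))
    for symbol, score in threshold(scores, 2):
        print(symbol, score)
```

With these sample edges, only the symbols that at least two others depend on survive the threshold, which is one way to surface the shared, lower layers of an architecture.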
Classification algorithms |
Useful for finding similar groups based on discrete variables. They are
used for true/false binary classification, and multi-label classification is
also supported. There are many techniques, but the data should either form
distinct regions on a scatter plot, each with its own centroid, or, if that is
hard to tell, be scanned breadth-first for neighbors within a given radius,
forming trees, or leaves when too few neighbors are found. |
Useful for categorizing symbols beyond their nomenclature. The primary use case
is to see which symbols cluster together based on their features. By translating
the symbols to a vector space and assessing cluster quality with the sum of
squared errors, it is easy to assign a large number of symbols to specific
clusters for a management perspective. |
Decision tree |
This is probably one of the most heavily used and easiest to visualize mining
algorithms. The decision tree is both a classification and a regression tree.
A function divides the rows into two datasets based on the value of a specific
column. Of the two lists of rows returned, one matches the criterion for the
split and the other does not. This works well when the attribute to split on
is clear. |
A decision tree algorithm uses the attributes of the service symbols to make a
prediction, such as whether a set of symbols representing a component should be
included or excluded. The ease of visualizing the split at each level sheds
light on the importance of those sets. This information is useful both for
pruning the tree and for drawing it. |
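The cluster-quality measure mentioned in the classification row can be illustrated as follows. This is a rough sketch under stated assumptions: each symbol is represented by a hypothetical two-feature vector (fan-in, fan-out), and the cluster assignments are given by hand rather than produced by a clustering algorithm.

```python
def centroid(points):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))


def sum_of_squared_errors(clusters):
    """Total squared distance of every point to its own cluster centroid."""
    total = 0.0
    for points in clusters.values():
        center = centroid(points)
        for point in points:
            total += sum((x - m) ** 2 for x, m in zip(point, center))
    return total


# Hypothetical feature vectors per symbol: (fan_in, fan_out).
tight = {
    "ui": [(1, 5), (1, 7)],
    "data": [(6, 1), (8, 1)],
}
# The same four vectors grouped badly, mixing UI and data symbols.
loose = {
    "a": [(1, 5), (6, 1)],
    "b": [(1, 7), (8, 1)],
}

if __name__ == "__main__":
    print(sum_of_squared_errors(tight))
    print(sum_of_squared_errors(loose))
```

A lower sum of squared errors indicates that the symbols sit closer to their cluster centroids, so the first grouping is the better fit for these sample vectors.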
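The row-splitting function described in the decision tree row might look like the sketch below. The function name, the row layout, and the convention of using `>=` for numeric columns and equality for everything else are assumptions made for illustration; the article itself only states that the rows are divided into a matching set and a non-matching set.

```python
def divide_rows(rows, column, value):
    """Split rows into (matching, non_matching) on one column.

    Assumed convention: numeric values split on `>=`, anything else on `==`.
    """
    if isinstance(value, (int, float)):
        def matches(row):
            return row[column] >= value
    else:
        def matches(row):
            return row[column] == value

    matching = [row for row in rows if matches(row)]
    non_matching = [row for row in rows if not matches(row)]
    return matching, non_matching


# Hypothetical rows: (symbol, layer, number of dependents, include in component)
ROWS = [
    ("OrderService", "service", 3, "include"),
    ("OrderView", "view", 1, "exclude"),
    ("InvoiceService", "service", 2, "include"),
    ("Logger", "util", 9, "exclude"),
]

if __name__ == "__main__":
    by_layer = divide_rows(ROWS, 1, "service")
    by_dependents = divide_rows(ROWS, 2, 3)
    print(by_layer)
    print(by_dependents)
```

In this sample, splitting on the layer column separates the included symbols from the excluded ones cleanly, while splitting on the dependents count does not, which is the kind of difference a tree-building procedure would use to choose the attribute at each level.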