Wednesday, March 31, 2021

 Applications of Data Mining to Reward points collection service  

Continuation of use cases: Outliers can also be detected by data mining algorithms, where the choices for similarity measures between rows include distance functions such as Euclidean (L2) distance, Manhattan (L1) distance, and graph distance. The choices for aggregate dissimilarity measures include the distance to the k nearest neighbors, a neighborhood density outside the expected range, and the attribute differences with nearby neighbors. Outliers are important because they prompt new strategies to encompass them: if outliers are numerous, they significantly increase organizational costs; if they are few, the remaining patterns help identify efficiencies. A Decision Tree algorithm uses the attributes of the service requests to make a prediction, such as the relief time on a case resolution. The ease of visualizing the split at each level throws light on the importance of those attributes, and this information is useful both for drawing the tree and for pruning it. Logistic regression helps determine user appreciation based on demographics, and it can also be used to find repetitions in requests. Neural networks with a softmax classifier can classify appreciation terms in chat text on channels.
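
A minimal sketch of the distance-based detection described above, scoring each grant by the distance to its k-th nearest neighbor (in Python; the feature encoding is hypothetical):

import numpy as np

def knn_outlier_scores(points, k):
    # Pairwise Euclidean (L2) distances between all rows.
    diffs = points[:, None, :] - points[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    dists.sort(axis=1)      # column 0 is each row's distance to itself (0)
    return dists[:, k]      # distance to the k-th nearest neighbor

# Hypothetical grants encoded as (points, grants_received, days_since_last).
grants = np.array([[10, 3, 5], [12, 4, 6], [11, 3, 7], [95, 1, 60]], float)
print(knn_outlier_scores(grants, k=2).argmax())   # row 3 is the outlier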

The Naive Bayes algorithm can be used for use cases where binary conditions apply. The policies that organizations define for granting reward points as employee appreciation are usually authored as a set of conditions, and these conditions can be maintained in the service. When the conditions pertain to attributes from the source where the reward points are published, their probabilities become relevant to this algorithm, especially when each input is taken on a with/without basis and the input variables are independent. The simplicity of counting or summing reward points that meet a binary condition, together with the ease of visualizing, debugging, and using the model as a predictor, makes this algorithm quite popular. For example, reward points can be counted based on whether or not the appreciation came from a specific person.
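
A short sketch with scikit-learn's BernoulliNB, which models exactly these with/without binary inputs; the condition names and labels below are made up for illustration:

from sklearn.naive_bayes import BernoulliNB

# Hypothetical binary conditions per grant:
# [from_manager, cross_team, tied_to_milestone]
X = [[1, 0, 1], [1, 1, 1], [0, 0, 0], [0, 1, 0], [1, 0, 0]]
y = [1, 1, 0, 0, 1]   # 1 = grant counted under the policy, 0 = not

model = BernoulliNB().fit(X, y)
print(model.predict([[1, 1, 0]]))        # predicted policy outcome
print(model.predict_proba([[1, 1, 0]]))  # class probabilities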

Collaborative filtering is another use case where binary conditions apply. It is particularly useful when there are multiple participants in a group whose opinions determine the best grant of reward points. The earlier approaches articulated explicit conditions; this algorithm avoids conditions altogether and replaces them with ratings. The participants in the group can be selected to form a diverse set or a cohesive set, depending on the purpose. Grants can then be calculated from existing reward points with the help of this opinion group, which avoids many of the pitfalls of condition-based logic, including the disclosure of rules, the gaming of rules, and their circumvention.
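
A minimal user-based collaborative filtering sketch, predicting a missing rating as a similarity-weighted average of the other opinion-group members' ratings; the rating matrix is hypothetical:

import numpy as np

# Rows: opinion-group members; columns: candidate grants; 0 = not rated.
ratings = np.array([[5, 3, 0, 1],
                    [4, 0, 0, 1],
                    [1, 1, 0, 5],
                    [1, 0, 4, 4]], dtype=float)

def predict(ratings, user, item):
    # Cosine similarity of every member to the target user.
    norms = np.linalg.norm(ratings, axis=1)
    sims = ratings @ ratings[user] / (norms * norms[user] + 1e-9)
    raters = ratings[:, item] > 0     # members who rated this item
    raters[user] = False
    if not raters.any():
        return 0.0
    return sims[raters] @ ratings[raters, item] / (sims[raters].sum() + 1e-9)

print(predict(ratings, user=1, item=1))   # estimated rating for an unrated grant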

Hierarchical clustering is helpful when we want to cluster the reward points to match the organizational hierarchy, giving credit to a manager when their direct reports do well. This is a standard practice in many companies. It may not be evident from the flat, independent grants assigned to individuals that the reward points can be grouped by the hierarchy to which each user belongs. The distance between members in the organizational hierarchy can also serve as the metric that determines the hierarchical clustering of reward point grants.
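
A sketch with SciPy's hierarchical clustering, assuming the pairwise distance between two employees is the number of hops between them in the reporting tree; the matrix below is hypothetical:

from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform
import numpy as np

# Hypothetical org-hierarchy distances (hops) between four employees.
org_dist = np.array([[0, 1, 2, 4],
                     [1, 0, 1, 3],
                     [2, 1, 0, 2],
                     [4, 3, 2, 0]], dtype=float)

Z = linkage(squareform(org_dist), method="average")  # agglomerative merge tree
print(fcluster(Z, t=2, criterion="distance"))        # cluster id per employee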

Conclusion: There are several algorithms in data mining that are applicable to the Reward points repository.  

 

 

Tuesday, March 30, 2021

 Applications of Data Mining to Reward points collection service 

Continuation of use cases: Some use cases stand out from the others for the reward points collection service. Classification algorithms are useful for categorizing reward point assignments based on source attributes. The primary use case is to see clusters of appreciation patterns that match on attributes. By translating to a vector space and assessing the quality of a cluster with a sum of squared errors, a large number of grants can be analyzed as belonging to specific clusters, which gives management insight into group dynamics. Reward point grants for a user show elongated scatter plots in specific categories. Even when the grants come from varying contexts, the time to the next appreciation can be plotted along the timeline, and an advantage of linear regression is that the time of the next occurrence can be read directly off the fitted line. When the data points have many factors contributing to their occurrence, linear regression gives an immediate ability to predict where the next occurrence may happen, which is far easier than coming up with a model that fits all the data points well. Customer segmentation based on reward points is a very common application of this kind of algorithm and helps prioritize the response to certain customers. Association data mining allows the management of an organization to see helpful patterns such as “employees who appreciated this user also appreciated this other user”. Sequence clustering can be used to find patterns of appreciation from the same user. With these examples, it is possible for organizations to understand and appreciate what may be missing from mere performance evaluation through the organizational hierarchy.
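
A sketch of the regression idea, fitting hypothetical appreciation days against their occurrence number so that the next occurrence can be extrapolated from the fitted line:

import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical days (from a start date) on which one user received points.
days = np.array([3.0, 9.0, 16.0, 21.0, 29.0, 34.0])
occurrence = np.arange(len(days)).reshape(-1, 1)   # 0, 1, 2, ...

model = LinearRegression().fit(occurrence, days)
print(model.predict([[len(days)]])[0])   # predicted day of the next appreciation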

Outliers can also be detected by data mining algorithms, where the choices for similarity measures between rows include distance functions such as Euclidean (L2) distance, Manhattan (L1) distance, and graph distance. The choices for aggregate dissimilarity measures include the distance to the k nearest neighbors, a neighborhood density outside the expected range, and the attribute differences with nearby neighbors. Outliers are important because they prompt new strategies to encompass them: if outliers are numerous, they significantly increase organizational costs; if they are few, the remaining patterns help identify efficiencies. A Decision Tree algorithm uses the attributes of the service requests to make a prediction, such as the relief time on a case resolution. The ease of visualizing the split at each level throws light on the importance of those attributes, and this information is useful both for drawing the tree and for pruning it. Logistic regression helps determine user appreciation based on demographics, and it can also be used to find repetitions in requests. Neural networks with a softmax classifier can classify appreciation terms in chat text on channels.
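
A minimal logistic-regression sketch for the appreciation determination mentioned above; the demographic features and labels are hypothetical:

from sklearn.linear_model import LogisticRegression

# Hypothetical per-user features: [tenure_years, team_size, requests_filed]
X = [[1, 5, 2], [7, 12, 0], [3, 8, 1], [10, 4, 3], [2, 6, 0], [8, 9, 1]]
y = [0, 1, 0, 1, 0, 1]   # 1 = user received an appreciation this quarter

model = LogisticRegression().fit(X, y)
print(model.predict([[5, 7, 1]]))        # predicted outcome
print(model.predict_proba([[5, 7, 1]]))  # probability of appreciation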

Conclusion: The reward points service is an investment for the organization whose costs are repaid by the benefits of improved organizational health, satisfaction, and productivity. The use of data mining algorithms on this data also empowers the management with better knowledge of group dynamics.

 

 

 

 

Monday, March 29, 2021

 Applications of Data Mining to Reward points collection service 

Problem statement: As with any data collected by a web service, analysis is not restricted to mere queries on the accumulated data. Deep learning techniques and data mining provide insights that can empower organizations beyond the employee appreciation for which the service is used. We review some of these use cases in this article.

Solution: Data mining is a tried and tested method for gaining insights into relational data. Reward points are collected as a relation between users and their accumulated peer appreciation. Standard mining techniques such as clustering, sequence mining, decision trees, regression, segmentation, and association algorithms provide many insights. When many data mining algorithms could be applied to a given dataset, choosing among them may require some exploration of the data.

If the use case is well articulated, the choice of data mining algorithm becomes immediately clear. The use case becomes clear only when the data is well known and the business objective is known. Usually only the latter is stated, such as the prediction of an attribute associated with the data. For example, a dataset suitable for supervised learning may have labels that are best determined with some exploration of the training data. These techniques are required to determine the rules with which to assign labels to the raw data. If the rules were already available for business purposes, then assigning labels would be merely an automation task that prepares the training set.

In the absence of business rules to assign labels to the data, the dataset for data mining is usually large and cannot be compared by mere inspection; some visualization tools are necessary. In this regard, two algorithms stand out for making the task easier. First, the decision tree algorithm can be used to find the relationships between rows, and the visualization can be established in the form of the attributes that are significant to the outcome. The tree can be pruned to see which attributes matter and which do not. The split of the nodes at each level helps visualize the relative strength of those attributes across rows. This is very helpful when the tree is generated without supervision.
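
A sketch of this exploration with scikit-learn, where export_text prints the level-by-level splits described above and max_depth acts as a simple pruning knob; the request attributes are hypothetical:

from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical request attributes: [priority, reopened_count, age_days]
X = [[1, 0, 2], [3, 2, 30], [2, 0, 5], [3, 1, 25], [1, 1, 4], [2, 2, 20]]
y = [0, 1, 0, 1, 0, 1]   # 1 = long relief time

tree = DecisionTreeClassifier(max_depth=2).fit(X, y)   # shallow depth ~ pruning
print(export_text(tree, feature_names=["priority", "reopened", "age_days"]))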

The other algorithm is the Naive Bayes classifier, used to assign labels to data. This classifier is helpful for exploring data, finding relationships between input columns and predictable columns, and then using that initial exploration to create additional models. Since it compares across columns for a given row, it evaluates the binary probabilities of the outcome with and without the attribute in each column.
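
The with/without comparison can also be sketched directly as conditional frequencies per column; the data here is made up:

import numpy as np

# Rows: grants; columns: binary attributes; y: the predictable column.
X = np.array([[1, 0], [1, 1], [0, 1], [0, 0], [1, 1]])
y = np.array([1, 1, 0, 0, 1])

for col in range(X.shape[1]):
    with_attr = y[X[:, col] == 1].mean()      # P(y=1 | attribute present)
    without_attr = y[X[:, col] == 0].mean()   # P(y=1 | attribute absent)
    print(f"column {col}: with={with_attr:.2f} without={without_attr:.2f}")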

Together, these two algorithms help with the initial exploration of the data and with choosing the right algorithm for a given purpose. Usually, the data is split 70/30 between training data and test data for the purpose of prediction.
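
For example, the customary 70/30 split with scikit-learn:

from sklearn.model_selection import train_test_split

X = [[i] for i in range(10)]
y = [i % 2 for i in range(10)]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)   # 70% train, 30% test
print(len(X_train), len(X_test))            # 7 3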

Some use cases stand out from the others for the reward points collection service. Classification algorithms are useful for categorizing reward point assignments based on source attributes. The primary use case is to see clusters of appreciation patterns that match on attributes. By translating to a vector space and assessing the quality of a cluster with a sum of squared errors, a large number of grants can be analyzed as belonging to specific clusters, which gives management insight into group dynamics. Reward point grants for a user show elongated scatter plots in specific categories. Even when the grants come from varying contexts, the time to the next appreciation can be plotted along the timeline, and an advantage of linear regression is that the time of the next occurrence can be read directly off the fitted line. When the data points have many factors contributing to their occurrence, linear regression gives an immediate ability to predict where the next occurrence may happen, which is far easier than coming up with a model that fits all the data points well. Customer segmentation based on reward points is a very common application of this kind of algorithm and helps prioritize the response to certain customers. Association data mining allows the management of an organization to see helpful patterns such as “employees who appreciated this user also appreciated this other user”. Sequence clustering can be used to find patterns of appreciation from the same user. With these examples, it is possible for organizations to understand and appreciate what may be missing from mere performance evaluation through the organizational hierarchy.
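
A sketch of the clustering idea with k-means, where inertia_ reports the sum of squared errors used above to assess cluster quality; the grant vectors are hypothetical:

import numpy as np
from sklearn.cluster import KMeans

# Hypothetical grant vectors: [points, sender_team_id, hour_of_day]
grants = np.array([[10, 1, 9], [12, 1, 10], [50, 2, 17],
                   [55, 2, 18], [11, 1, 9], [52, 2, 16]], dtype=float)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(grants)
print(kmeans.labels_)    # cluster assignment per grant
print(kmeans.inertia_)   # sum of squared errors; lower = tighter clusters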

Conclusion: The reward points service is an investment for the organization whose costs are repaid by the benefits of improved organizational health, satisfaction, and productivity. The use of data mining algorithms on this data also empowers the management with better knowledge of group dynamics.

 

 

 

Sunday, March 28, 2021

Hashes

 A play on hash

The idea behind hashing is that an entity’s representation can be mapped to a value in a set, with the nice side effect that those values have a uniform random distribution even when the entity’s representations do not. This allows us to count the number of distinct elements by finding prefixes of leading zeros in a binary representation long enough to capture the number of possible values. For example, if k is the length of the longest run of leading zeros observed in an n-bit hash, then Math.pow(2, k) approximates the number of unique elements, because on average a run of k leading zeros occurs once in every Math.pow(2, k) uniformly distributed values. We thus size the count of distinct elements with the help of this prefix. The outliers and the bias in this estimate are known and can be corrected with a harmonic mean and a constant correction factor, respectively.
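
A rough Python sketch of this single-stream estimate, using SHA-1 as a stand-in for a uniform hash; without the corrections mentioned above it is only an order-of-magnitude estimate:

import hashlib

def leading_zeros(value, bits=32):
    # Interpret the first 32 bits of the hash as a binary string.
    h = int.from_bytes(hashlib.sha1(value.encode()).digest()[:4], "big")
    b = format(h, f"0{bits}b")
    return len(b) - len(b.lstrip("0"))

def estimate_distinct(values):
    # k = longest run of leading zeros seen; estimate = Math.pow(2, k).
    return 2 ** max(leading_zeros(v) for v in values)

print(estimate_distinct([f"user{i}" for i in range(1000)]))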

If we use m hash functions to generate m hashes for the same entity’s representation, each mapping to a bit position in an array that starts as all zeros, then the presence of even a single zero bit among an entity’s hashed positions immediately rules out that entity as a member of the set.

Going a step further, if we find all the bit positions where the hashes come out to zero, then by inverting those positions to 1 and keeping the others at 0 we can construct, out of the Math.pow(2, m) possible hashes, at least one whose corresponding element is not part of the set.


Saturday, March 27, 2021

 The difference between a Bloom filter and HyperLogLog: 

A Bloom filter is used when we must determine whether a member is definitely not in a set. It can return false positives, but this is tolerated when what we need to know reliably is that a member is not part of the set. We will see how to put this to use, but let us first review the data structure. Members can be added to the set but not removed. Instead of a large hash table that does not fit into memory, it works with just enough memory and an error-free hash that provides a uniform random distribution. An empty Bloom filter is a bit array of m bits, all set to zero. There are k hash functions that map each member to one of the m array positions; the optimal k is proportional to m divided by the expected number of members. When a member is added, each of the k hash functions maps it to its corresponding position, setting those bits to 1. To test whether a member is in the set, the k hash functions are re-evaluated. If any of the hashed positions holds a 0, the member is definitely not in the set. If all of them are set to 1, then either the member is part of the set or those bits were set when other members were inserted. The hash functions must be chosen such that, as m and k increase, there is little or no correlation between the different bit-fields of a hash. In some cases, a single hash function with k offsets to the initial values suffices. For large data sets, examples include the enhanced double hashing or triple hashing schemes, which derive the k positions from two or three hash values. Removal of values is not supported because it is not easy to tell which positions can safely be undone: a bit may have been set by more than one member.
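
A minimal Bloom filter sketch along these lines, deriving the k positions by double hashing from a single SHA-256 digest:

import hashlib

class BloomFilter:
    def __init__(self, m=1024, k=5):
        self.m, self.k = m, k
        self.bits = [0] * m

    def _positions(self, member):
        # Double hashing: k positions from two base hash values.
        d = hashlib.sha256(member.encode()).digest()
        h1 = int.from_bytes(d[:8], "big")
        h2 = int.from_bytes(d[8:16], "big")
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, member):
        for p in self._positions(member):
            self.bits[p] = 1

    def might_contain(self, member):
        # Any zero bit means the member is definitely not in the set.
        return all(self.bits[p] for p in self._positions(member))

bf = BloomFilter()
bf.add("alice")
print(bf.might_contain("alice"))    # True
print(bf.might_contain("mallory"))  # False, with high probability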

In key-value storage, Bloom filters are helpful for determining whether a key exists: the values can live on disk while the Bloom filter itself stays in memory.

A HyperLogLog is used to determine the number of distinct members in a multiset. Determining the exact number takes a lot of memory; instead, this algorithm uses significantly less memory to determine an approximate count. A cardinality of more than a billion can be estimated with just a couple of kilobytes of memory. It works by assigning a number to each member, converting the numbers to binary, and determining the maximum number of leading zeros. If there are n bits to represent a number, then Math.pow(2, n) distinct members can be represented. A hash function is applied to each member of the original multiset to obtain uniformly distributed random numbers, just as many as there were members in the original multiset. There is a large variance possible in this estimate, but it can be reduced by splitting the multiset into numerous subsets, estimating the maximum number of leading zeros for each subset, and using a harmonic mean to combine the estimates. Variance measures how far a set of numbers is spread from its mean value, and a harmonic mean is a special kind of mean suited to combining rates.
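
A sketch of the bucketed estimate with a harmonic mean; 0.673 is the published bias-correction constant for 16 registers, and the register count is kept deliberately small here:

import hashlib

M = 16   # number of subsets (registers); the top 4 hash bits pick the subset

def rank(x, bits):
    # Position of the first 1-bit: leading zeros + 1.
    b = format(x, f"0{bits}b")
    return len(b) - len(b.lstrip("0")) + 1

def hll_estimate(values):
    registers = [0] * M
    for v in values:
        h = int.from_bytes(hashlib.sha1(v.encode()).digest()[:8], "big")
        j = h >> 60                     # register index from the top 4 bits
        rest = h & ((1 << 60) - 1)      # remaining 60 bits
        registers[j] = max(registers[j], rank(rest, 60))
    # Harmonic mean of the per-register estimates, scaled and bias-corrected.
    return 0.673 * M * M / sum(2.0 ** -r for r in registers)

print(round(hll_estimate([f"user{i}" for i in range(5000)])))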