Event management solutions are widely used to support early attack detection, investigation, and response. Leading products in this space focus on analyzing security data in real time. These security information and event management (SIEM) systems collect, store, investigate, mitigate, and report on security data for incident response, forensics, and regulatory compliance.
The data that is collected for analysis comes from logs, metrics, and events, which are usually combined with contextual information about users, assets, threats, and vulnerabilities. This machine data comes from a variety of sources and usually carries a timestamp on every entry. The store for this data is usually a time-series database that consistently preserves the order of events as they arrive. These stores also support watermarks and savepoints for their readers and writers, respectively, so that they can resume from an earlier point in time.
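To make the shape of this machine data concrete, the sketch below models a timestamped event carrying contextual fields, along with a toy in-memory store whose readers can resume from a saved position. It is a minimal illustration only; the names (SecurityEvent, EventStore, read_from) are assumptions for this sketch and not the API of any particular SIEM product or time-series database.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

# Hypothetical shape of one machine-data entry: the raw message plus
# contextual fields about the user and asset, keyed by a timestamp.
@dataclass
class SecurityEvent:
    timestamp: datetime
    source: str               # e.g. "firewall", "auth-log", "endpoint-agent"
    message: str
    user: str | None = None
    asset: str | None = None

class EventStore:
    """Toy in-memory stand-in for a time-series store that keeps arrival order."""
    def __init__(self):
        self._events: list[SecurityEvent] = []

    def append(self, event: SecurityEvent) -> int:
        self._events.append(event)
        return len(self._events) - 1   # the position acts like a writer's savepoint

    def read_from(self, position: int):
        # A reader resumes from an earlier point by replaying everything
        # at or after its saved position.
        yield from self._events[position:]

store = EventStore()
store.append(SecurityEvent(datetime.now(timezone.utc), "auth-log", "failed login", user="alice"))
checkpoint = store.append(SecurityEvent(datetime.now(timezone.utc), "firewall", "port scan detected"))
for event in store.read_from(checkpoint):
    print(event.source, event.message)
```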
Because these systems depend on timelines, analytical processing has traditionally been performed on historical batches of data. Elastic storage for big data popularized the map-reduce technique, in which a computation is mapped over each batch and the intermediate outputs are then reduced to a single result. This approach was too slow for high-volume traffic such as that from the Internet-of-Things. Shrinking the batches reduced the latency of each batch, but it did not reduce the overall compute. Newer streaming algorithms instead compute the result continuously as data arrives, one event at a time. Stream processing libraries support aggregation and transformation by processing each event as it arrives, so the logic for analytics can be written conveniently by filtering or aggregating a subset of events. Higher-level query languages such as SQL can also be supported by viewing the stream of events as a table.
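As a minimal sketch of this one-event-at-a-time style, the snippet below filters a stream and maintains a running aggregate without ever assembling a batch. The event fields and the alert threshold are illustrative assumptions, not a prescribed schema.

```python
from collections import Counter

def stream_failed_logins(events):
    """Filter the stream down to failed-login events, one at a time."""
    for event in events:
        if event.get("type") == "login_failure":
            yield event

def count_by_user(events, alert_threshold=5):
    """Continuously aggregate failures per user, emitting an alert as soon as
    the running count reaches the threshold (no batch is ever materialized)."""
    counts = Counter()
    for event in events:
        counts[event["user"]] += 1
        if counts[event["user"]] == alert_threshold:
            yield {"user": event["user"], "failures": alert_threshold}

# The source could just as well be an unbounded log tail or socket.
events = [{"type": "login_failure", "user": "alice"}] * 6
for alert in count_by_user(stream_failed_logins(events)):
    print("possible brute force:", alert)
```

The same filter-and-aggregate logic is what a streaming SQL engine expresses declaratively when it treats the event stream as a continuously growing table.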
The table is a popular means for analytics because the database industry has a rich analytical tradition. A family of data mining techniques such as Association Mining, Clustering, Decision Trees, Linear Regression, Logistic Regression, Naïve Bayes, Neural Networks, Sequence Clustering, and Time-Series analysis becomes available out of the box once the data appears in a table. Unfortunately, not all of these techniques apply directly to streaming data, where records arrive one by one. The standard practice for using these techniques for forecasting, risk and probability analysis, recommendations, finding sequences, and grouping starts with a model. First, the problem is defined and the data is prepared; the data is then explored to guide the selection of a model; the models are built and trained over the data, validated, and finally kept up to date. Usually, 70% of the data is used for training, and the remaining 30% is used for testing and predictions. A similar workflow is followed with machine learning packages and algorithms.
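The short sketch below illustrates that 70/30 train-and-test workflow, assuming the scikit-learn package is available; the synthetic two-feature data set and the choice of logistic regression are assumptions made purely for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Illustrative synthetic data: two numeric features per record
# (e.g. failed-login count and bytes transferred) and a 0/1 "malicious" label.
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 2))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

# 70% of the records train the model; the remaining 30% are held out.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, random_state=42
)

model = LogisticRegression()
model.fit(X_train, y_train)                                # build and train
print("held-out accuracy:", model.score(X_test, y_test))   # validate on the 30%
print("prediction for a new record:", model.predict([[1.2, -0.3]]))
```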