Wednesday, July 16, 2014

Machine data has several traits that differ from human text and discourse. When we search machine data we see fields such as message, timestamp, application, source, sourcetype, host, size, count, protocol, and protocol data format. In other words, there is a finite set of domain terms that covers a large portion of the logs we see.
These domain terms repeat heavily because machines and repeaters often emit the same set of messages at different times. At the same time, there are messages whose content varies a great deal, such as web traffic captures. These two types, however, are well delineated in terms of the flows that produce them.
Machine data can therefore be separated into three types:
Data that doesn't vary much in content or terms
Data whose content varies a lot while its metadata stays the same
Data that mixes the two above
When the first two cases constitute the bulk of the messages, we can treat them differently. The mixture of the two amounts to noise and can be ignored for the moment; we will treat it as any other arbitrary text later.
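As a rough sketch of how the first two types might be told apart, we can mask the variable parts of each message and count the distinct templates that remain. The masking rule and the sample messages below are hypothetical, not a prescription:

    import re

    def template(message):
        # Mask digits (and thus IPs, ports, ids) so only the fixed
        # terms of the message remain.
        return re.sub(r"\d+", "N", message)

    messages = [
        "Failed password for root from 10.0.0.5",
        "Failed password for root from 10.0.0.9",
        "GET /item?id=8231 took 112 ms",
    ]
    print({template(m) for m in messages})
    # Few templates relative to message count means low-variation data (type 1).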
Predictive patterns in the data help with analysis, so the first two cases are interesting and lend themselves to field extraction, summarization, and so on.
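A minimal sketch of field extraction, assuming a hypothetical syslog-like line format; real extractions depend on the source and sourcetype:

    import re

    line = "2014-07-16 10:32:01 host1 sshd[4242]: Failed password for root from 10.0.0.5"

    # Named groups pull out the metadata fields; the pattern is illustrative only.
    pattern = re.compile(
        r"(?P<timestamp>\S+ \S+) (?P<host>\S+) (?P<app>\w+)\[(?P<pid>\d+)\]: (?P<message>.*)"
    )
    match = pattern.match(line)
    if match:
        fields = match.groupdict()
        # fields -> {'timestamp': '2014-07-16 10:32:01', 'host': 'host1', ...}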
Terms extracted from machine data fall into a frequency table that is far more skewed than a concordance built from human text, so the selection of these terms differs from that for regular text.
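A small illustration of this skew, counting terms over a few made-up events; with real logs the head of the distribution is dominated by a handful of repeated domain terms:

    from collections import Counter

    events = [
        "sshd: Failed password for root",
        "sshd: Failed password for admin",
        "sshd: Accepted password for alice",
    ]
    counts = Counter(term for event in events for term in event.split())
    print(counts.most_common(5))
    # Repeated domain terms such as 'sshd:' and 'password' dominate the head.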
Moreover, the context for the data can be described in terms of the metadata fields. These fields are predetermined, or even specified by the forwarders of the data. The indexers that store the raw and searchable data hold both data and metadata from a wide variety of sources.
Associating the terms with their context lets us look for the terms that can be considered the semantics of the machine data. These terms are special in that they tag the data, and finding them tells us what the machine data is about.
We can associate the terms and the contexts with a bipartite graph.
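A minimal sketch of such a bipartite graph as a pair of adjacency maps, with hypothetical (host, sourcetype) pairs standing in for the context:

    from collections import defaultdict

    term_to_context = defaultdict(set)
    context_to_term = defaultdict(set)

    # Each record pairs the metadata of an event with the terms extracted from it.
    records = [
        ({"host": "host1", "sourcetype": "sshd"}, ["Failed", "password", "root"]),
        ({"host": "host2", "sourcetype": "access_log"}, ["GET", "/index.html", "200"]),
    ]
    for meta, terms in records:
        context = (meta["host"], meta["sourcetype"])
        for term in terms:
            term_to_context[term].add(context)
            context_to_term[context].add(term)

    print(term_to_context["Failed"])  # the contexts that this term tags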
When we perform a random walk to connect the terms and the contexts, we are performing a special case of an iterative Markov chain. In a random walk we follow a stochastic process with random variables X1, X2, ..., Xk such that X1 = v, the starting vertex, and X(i+1) is a vertex chosen uniformly at random from the neighbors of Xi. The number p(v,w,k) is the probability that a random walk of length k starting at v ends at w. If each edge is treated as a one-ohm resistor, then the effective resistance from a point to infinity is finite exactly when the graph is transient. We will review the Aldous and Fill book on random walks on graphs next.
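A small sketch that estimates p(v,w,k) by simulating the walk; the graph here is a toy 4-cycle, not machine data:

    import random

    def estimate_p(graph, v, w, k, trials=10000):
        # Estimate the probability that a walk of length k starting at v ends at w.
        hits = 0
        for _ in range(trials):
            node = v
            for _ in range(k):
                node = random.choice(graph[node])  # uniform over neighbors
            if node == w:
                hits += 1
        return hits / trials

    # Adjacency lists for a 4-cycle: 0 - 1 - 2 - 3 - 0.
    graph = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
    print(estimate_p(graph, 0, 2, 2))  # walks of length 2 from 0 end at 2 half the time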
