Tuesday, June 17, 2014

In the JoBim Text project, Gliozzo et al. introduced an interactive visualization component. JoBim is an open source platform for large-scale distributional semantics based on a graph representation. A distributional thesaurus is computed from bipartite graphs of words and context features. For every sentence, a context is generated for semantically similar words. Then the capabilities of the conceptualized text are expanded in an interactive visualization.
The visualization can be used as a semantic parser as well as for disambiguating word senses induced by graph clustering.
The paper starts from the view that the meaning of a text can be fully defined by the semantic oppositions and relations between its words. To obtain this knowledge, co-occurrences with syntactic contexts are extracted from very large corpora. The approach does not use a quadratic computation of word similarities; instead it uses MapReduce. The advantage is that MapReduce works well on sparse contexts and scales to large corpora.
The result is a graph connecting the most discriminative contexts to terms, with explicit links between the most similar terms. This graph represents a local model of semantic relations for each term. Compare this with a model of semantic relations with fixed dimensions. In other words, this is an example of ego networks. This paper describes how to compute a distributional thesaurus and how to contextualize distributional similarity.
The JoBim framework is named after the pairs of observed terms (Jos) and context features (Bims) that form its edges. The operation of splitting observations into Jo-Bim pairs is referred to as the holing operation.
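As a toy illustration (this is my own simplification, not the project's actual holing implementation, which works over dependency parses), a holing operation over plain token neighbors might look like this in Python:

def holing(tokens):
    # Split a tokenized sentence into (Jo, Bim) pairs.
    # The context (Bim) here is just the neighboring words with a hole '@'
    # where the term (Jo) was observed; the real system would typically use
    # dependency-parse contexts instead of simple neighbors.
    pairs = []
    for i, term in enumerate(tokens):
        left = tokens[i - 1] if i > 0 else "<s>"
        right = tokens[i + 1] if i + 1 < len(tokens) else "</s>"
        pairs.append((term, left + " @ " + right))
    return pairs

print(holing(["the", "jaguar", "roared"]))
# [('the', '<s> @ jaguar'), ('jaguar', 'the @ roared'), ('roared', 'jaguar @ </s>')]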
The significance of each pair (t, c) is computed and then only the p most significant pairs are kept per term t, resulting in the first-order graph. The second-order graph is extracted as the similarity between two Jos, based on the number of salient features the two Jos share. The similarity over the Jos defines a distributional thesaurus; the paper says this can be computed efficiently in a few MapReduce steps and claims it compares favorably with other measures. This similarity can also be replaced by any other mechanism.

The paper then proceeds to discuss contextualization, which, as we know, depends a lot on smoothing. There are many term-context pairs that are valid but may or may not be represented in the corpus. To find similar contexts, they expand the term arguments with similar terms. But the similarity of these terms depends on context. The paper therefore uses joint inference to expand terms in context, with marginal inference in a conditional random field (CRF). The CRF works something like this: a particular assignment x is a single definite sequence of either original or expanded words. Its weight depends on the degree to which the term-context associations present in this sentence are found in the corpus, as well as on the out-of-context similarity of each expanded term to the corresponding term in the sentence. The proportion of the latter to the former is specified by a tuning parameter.
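To make the first-order/second-order construction concrete, here is a small in-memory sketch. The actual system uses a different significance measure (LMI in the paper, if I recall correctly) and runs as MapReduce jobs over large corpora; I use plain PMI and Python dictionaries purely for illustration, and the function names are mine.

import math
from collections import Counter, defaultdict

def first_order(pairs, p=3):
    # Keep the p most significant contexts per term (first-order graph).
    term_counts, ctx_counts, pair_counts = Counter(), Counter(), Counter()
    for t, c in pairs:
        term_counts[t] += 1
        ctx_counts[c] += 1
        pair_counts[(t, c)] += 1
    n = len(pairs)
    salient = defaultdict(list)
    for (t, c), f in pair_counts.items():
        pmi = math.log(f * n / (term_counts[t] * ctx_counts[c]))
        salient[t].append((pmi, c))
    return {t: {c for _, c in sorted(scored, reverse=True)[:p]}
            for t, scored in salient.items()}

def second_order_similarity(features, t1, t2):
    # Similarity between two Jos = number of salient features they share.
    return len(features[t1] & features[t2])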

We will now look at word sense induction, disambiguation and cluster labeling. With the ranking based on contextually similar terms, there is some implicit word sense disambiguation. But this paper addresses it explicitly with word sense induction.

OK so we will cover some more of this in tonight's post.
The authors mention a WSI technique and use information extracted by IS-A patterns to label clusters of terms that pertain to the same taxonomy or domain. The aggregated context features of the clusters help attribute the terms in the distributional thesaurus with the word cluster senses, and these are assigned in the context part of the entry.
The clustering algorithm they use is the Chinese Whispers graph clustering algorithm, which finds the number of clusters automatically. The IS-A relationships between terms and their frequencies are found from part-of-speech patterns. This gives us a list of IS-A relationships between terms and their frequencies. Then we find clusters of words that share the same word sense. The aggregate IS-A relationships for each cluster are found by summing the frequencies of the hypernyms and multiplying this sum by the number of words in the cluster that elicited the hypernym. This yields labels for the clusters that are taxonomic and provide an abstraction layer over the terms. For example, jaguar can be clustered into the cat sense or the car sense, and the highest scoring hypernyms provide a concise description of these senses. The occurrences of ambiguous words in context can now be disambiguated to these cluster senses.
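As I read it, the label score for a candidate hypernym is the sum of its IS-A frequencies over the cluster multiplied by the number of cluster words that elicited it. A toy sketch of that scoring, with made-up counts:

from collections import defaultdict

def label_cluster(cluster, isa_counts):
    # isa_counts maps term -> {hypernym: frequency} extracted by IS-A patterns.
    # Score = (sum of hypernym frequencies over the cluster) *
    #         (number of cluster words that elicited this hypernym).
    freq_sum = defaultdict(int)
    support = defaultdict(int)
    for term in cluster:
        for hypernym, freq in isa_counts.get(term, {}).items():
            freq_sum[hypernym] += freq
            support[hypernym] += 1
    scores = {h: freq_sum[h] * support[h] for h in freq_sum}
    return sorted(scores.items(), key=lambda kv: -kv[1])

# toy counts for the 'cat sense' cluster of jaguar
isa_counts = {
    "jaguar":  {"animal": 12, "cat": 9, "car": 7},
    "leopard": {"animal": 15, "cat": 11},
    "tiger":   {"animal": 20, "cat": 14},
}
print(label_cluster(["jaguar", "leopard", "tiger"], isa_counts))
# [('animal', 141), ('cat', 102), ('car', 7)]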
The visualization for the expansion of the term-context pairs uses the Open Domain Model, which is trained on newspaper corpora.

We will next talk about Interactive Visualization Features, but before we do that let us first talk about the Chinese Whispers clustering algorithm.

Chinese Whispers is a randomized graph clustering algorithm (Biemann). The edges are added in increasing numbers over time. The algorithm partitions the nodes of a weighted undirected graph. The name is derived from a children's game in which players whisper words to each other. The goal of the game is to arrive at some funny derivative of the original message by passing it through several noisy channels.
The CW algorithm aims at finding groups of nodes that broadcast the same message to their neighbors.
The algorithm proceeds something like this:
    assign each vertex to its own class
    while there are changes:
        for all vertices, taken in randomized order:
            class(v) = highest ranked class in the neighborhood of v
The nodes are processed for a small number of iterations and inherit the strongest class in the local neighborhood, that is, the class whose sum of edge weights to the current node is maximal.
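Here is a small runnable sketch of the algorithm; the graph is given as an adjacency dictionary, and the function name and defaults are mine.

import random
from collections import defaultdict

def chinese_whispers(graph, iterations=20, seed=0):
    # graph maps node -> {neighbor: edge_weight}; undirected, weighted.
    rng = random.Random(seed)
    labels = {v: v for v in graph}          # every vertex starts in its own class
    nodes = list(graph)
    for _ in range(iterations):
        changed = False
        rng.shuffle(nodes)                  # randomized processing order
        for v in nodes:
            weights = defaultdict(float)
            for u, w in graph[v].items():
                weights[labels[u]] += w     # vote of each neighboring class
            if weights:
                best = max(weights, key=weights.get)
                if best != labels[v]:
                    labels[v] = best
                    changed = True
        if not changed:
            break
    return labels                           # the number of clusters emerges on its own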


Having discussed centrality measures in social graphs, let us look at some more social graph techniques. I'm going to discuss ego networks. An ego network takes the individual as the focus. But before we delve into that, I will review the idea of embedding in a network. Embedding is the extent to which individuals find themselves in dense, strong ties. These ties are reciprocated, and they suggest some measure of constraint on individuals from the way they are connected to others. This gives us an idea of the population and some of its subtleties, but not so much about the positives and negatives facing the individual.

If we look at the context of an individual, then we are looking at a local scope, and this is the study of ego networks. The ego is an individual focus, a node of the graph; there are as many egos as there are nodes. An ego can be a person, a group, or an organization. The neighborhood of an ego is the collection of all nodes around the ego to which there is a connection or a path. The connections need not be one step, but they usually are. The boundaries of an ego network are defined in terms of this neighborhood. Ego networks are generally undirected because the connections are symmetric. If they are not, then it is possible to define an in-network and an out-network: the ties in an in-network point inwards and those in the out-network point outwards. The strength and weakness of the ties can be expressed as probabilities, and using these we can define thresholds for the ones we want to study. Finding these ties is a technique referred to as data extraction; the second technique is subgraph extraction. The density of an ego network can be summarized by a density index computed for each ego in a dataset. We will review more graph techniques shortly.
Courtesy: Hanneman online text.
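To make the ego network and its density concrete, here is a small sketch using networkx (my choice of library, not from the text):

import networkx as nx

# a toy undirected social graph
G = nx.Graph()
G.add_edges_from([("alice", "bob"), ("alice", "carol"),
                  ("bob", "carol"), ("alice", "dave")])

# the ego network of alice: alice plus everyone one step away
ego = nx.ego_graph(G, "alice", radius=1)

# density = realized ties / possible ties within the ego network
print(sorted(ego.nodes()), nx.density(ego))
# ['alice', 'bob', 'carol', 'dave'] 0.6666666666666666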

Monday, June 16, 2014

The benefit of registering a handler to create tracker events, as described in the previous posts, is that it can be invoked from different sources such as the UI, the CLI (command line interface), and configuration files. The handler adds, deletes, and lists the trackers. Trackers are listed based on the pipeline data they are registered for. We start with a global tracker, so we may want to list just one.
In the previous two posts we discussed adding a tracer/marker that carries tracking and operation log information to the event data in Splunk. We mentioned that all events have host and timestamp information, and we covered why we wanted this special event. The crux of the implementation is the introduction of these special events into the data pipeline. To create the event, we follow the same mechanism as we do for audit events.
Before :
             1       2       3
             __      __      __
            |__|    |__|    |__|              Events --->
------------------------------------

After :
             1       2     Tracker      3
             __      __      __         __
            |__|    |__|    |__|       |__|   Events --->
------------------------------------

When we generate these events, we want to inject them locally into each stream. But we could start with a global tracker that traces the route of the data between the various instances in a deployment. The events make their way to the internal index. The event is written by creating an instance of the pipeline data from a pre-specified model. Fields are added to the event so that it conforms to a regular event.
These are then sent on their way to a queue. Initially we could send the events directly to the queue serviced by an indexing thread, but in order to not hold up that queue or deadlock ourselves, we could create a separate queue that feeds into the other queue.
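A minimal sketch of that two-queue idea in Python, purely illustrative (the queue names and sizes are hypothetical, not Splunk's internals):

import queue
import threading

index_queue = queue.Queue(maxsize=1000)   # the queue serviced by the indexing thread
tracker_queue = queue.Queue()             # separate staging queue for tracker events

def drain_tracker_queue():
    # Feed tracker events into the indexing queue without blocking the
    # producer: if the indexing queue is full we retry, so the tracker
    # path can never hold up or deadlock the indexing path.
    while True:
        event = tracker_queue.get()
        while True:
            try:
                index_queue.put(event, timeout=1)
                break
            except queue.Full:
                continue                  # indexing queue busy; keep retrying

threading.Thread(target=drain_tracker_queue, daemon=True).start()
tracker_queue.put({"type": "tracker", "host": "fwd-01"})   # example event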


Sunday, June 15, 2014

In addition to the previous post, where we described in some detail how trackers work when they are injected into a pipeline for data in Splunk, let us now look at the changes to the indexer to support adding labels or trackers in the data pipeline. When the data flows into an input processor, it is typically initiated by a monitor or a processor in the forwarder. When it arrives at the indexer, it is typically handled by one or more processors that apply the configurations specified by the user. When the raw data is written to the Splunk database, it is received by an Indexed Value Receiver that writes it to the appropriate index. As an example, if we take the signing processor that encrypts the data before it is written to the database, we see that it takes the data from the forwarder and forwards it to the Splunk database. Here the data is changed before it makes its way to the database.
Addition of this tracker is via a keyword that can be added to any existing pipeline data. Adding a keyword does not need to be user visible; it can appear in the transformation or expansion of user queries or inputs. For example, at search time the keyword tracker can be used to display the tracking data that was injected into the data stream. This can be implemented by a corresponding search processor. Similarly, on the input side, a corresponding processor can inject the trackers into the data pipeline. The data might go to different indexes, and it's debatable whether the trackers should go only into the internal index or into the one specified for the data. The choice of the internal index lets us persist the trackers separately from the data, so that the handlers for the different processors do not affect the data itself, and this requires little code change. On the other hand, if we keep the trackers and the data in the same index, then almost all components reading the data from the index will need to differentiate between the data and the trackers. Associations between the data and the trackers, even if kept separately, are easy to maintain because we keep track of timestamps, and trackers pertaining to an input can show where they are inlined with the data by means of those timestamps.
IndexProcessor has methods to retrieve buckets and events by timestamps. We add a method to IndexProcessor to add the tracker to internal buckets. 

Saturday, June 14, 2014

Why do we discuss a tracer in the event data if all data carry hostname and timestamp fields? We will answer this shortly, but tracers enable us to generate artificial markers of varying size that test the flow of data in production without affecting or requiring any user changes to the existing data or its manipulation. In other words, we add new data with different payloads. As part of the payload, it could journal (yes, I'm using that term) not only what hosts are participating at what time, but also record all operations taken on that data flow, so that we can see whether the record of operations matches the final result on the data. If the tracer weren't there, we would have to identify the machines that participated and comb the logs on those machines to reconstruct the timeline, whereas these records are available automatically. This is no replacement for the loggers of the existing components, just an addition for each processor to indicate what it executed.
There are three components to enabling a tracer in the event data for Splunk. First is the forwarder which creates the event.  Second is the indexer which indexes the data. Third is the search-head or peer that searches the data.
To create a tracer in the data, an admin handler is registered. This handler takes the following methods (a small sketch follows the list):
to create/update
to delete
to list
and others such as reload.
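The following is only a hypothetical shape for such a handler, written to show the create/update, delete, and list operations; it is not Splunk's actual admin handler API and the names are mine.

class TrackerAdminHandler:
    # Hypothetical handler shape only -- not Splunk's actual admin handler API.
    def __init__(self):
        self.trackers = {}                      # name -> settings

    def create_or_update(self, name, settings):
        self.trackers[name] = settings          # e.g. {"interval": "15m", "pipeline": "main"}

    def delete(self, name):
        self.trackers.pop(name, None)

    def list(self, pipeline=None):
        # trackers are listed by the pipeline data they are registered for
        return {n: s for n, s in self.trackers.items()
                if pipeline is None or s.get("pipeline") == pipeline}

    def reload(self):
        pass                                    # re-read settings from the conf file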
The tracer can be triggered via the UI, a conf file, and even the CLI.
The intention of the tracer is to show how the data is flowing and can be put in any pipeline.
Therefore it's a processor that can be invoked from different sources and streams. However, we will deal with creating a dedicated data flow for the tracer that can be started from any forwarder.
To add the processor and the admin handler, we just implement the existing interfaces.
Enabling the indexer to add its own data is slightly different.
We will cover that in more detail shortly.
First, the important thing is that the data has to be labeled. We can create a custom data record with a header that identifies this type of data uniquely.
The second thing is to create a record for each host as the data makes its way. Each host adds a separate entry with its hostname and timestamp. Further details can be added later as appropriate.
The destination index for this kind of data should always be internal since this is for diagnostics. The destination could be switched to nullQueue if specified but this is not relevant at this time. 
The third thing is to create a mechanism to turn this on and off. This could be done via the controls that the admin handler processes.
Initially, there needs to be only one tracer for a Splunk instance, but this can be expanded to different inputs as desired. The config section entry that specifies it globally for the Splunk instance can also be specified locally for any input.
The presence of an entry in the config for the tracer indicates that Splunk will attempt to send tracer data every 15 minutes through its deployments to its indexes, where it can be viewed globally.
The tracer data is very much like audit data except for being more public and user friendly. It has information that enables a holistic view of all actively participating Splunk instances.


Friday, June 13, 2014

How do we implement a tracer event data in Splunk ?
First, we create a modular input that can take a special kind of data. We know the data structures to do this because we have seen how to use them before.
Second, we need a destination or sink where this data will be indexed. We should not be using the null queue since these tracers carry useful information and gathering them over time will help with statistics.
The information is added per node during transit, and different types of nodes, such as search peers, indexers, or forwarders, can stamp their node type.
The overall implementation is just a class declaration and definition that registers itself as a processor for all incoming data and kicks into action only when a specific kind of data is encountered.
The data is very simple and has a reserved field used to identify the tracer.
The payload in the tracer consists of simple structured data: the timestamp, the node type, the node id, and the duration of time spent.
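A sketch of what one such record could look like; the field names are illustrative, not an actual schema:

import json
import time

tracer_record = {
    "tracer_id": "t-0001",        # reserved field identifying the tracer
    "timestamp": time.time(),     # when this node handled the event
    "node_type": "forwarder",     # forwarder / indexer / search peer
    "node_id": "fwd-01",
    "duration_ms": 3,             # time spent on this node
}
print(json.dumps(tracer_record))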
Also, in any topology, the tracer data will flow from one source to one destination. For multicast, there will be many more copies of the tracers made. Once they are all indexed, we can group them. Also, the final destination for the data can be one index or all indexes. In other words, we flood the topology to cover all the indexers.
Where the tracer differs from the existing heartbeat functionality is that it covers the entire route rather than just adjacent source-destination pairs. A tracer is triggered to flow through a path consisting of one or more nodes. It can be triggered by the user or by periodic runs.
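Once the copies are indexed, grouping them back into routes is straightforward; a small sketch, assuming records shaped like the tracer_record above:

from collections import defaultdict

def reconstruct_routes(records):
    # Group indexed tracer records by tracer_id and order each group by
    # timestamp to recover the path a tracer took through the deployment.
    routes = defaultdict(list)
    for r in records:
        routes[r["tracer_id"]].append(r)
    return {tid: sorted(rs, key=lambda r: r["timestamp"])
            for tid, rs in routes.items()}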

Today we look at a little bit more on this feature. We want to see how the parser, processor, and indexer will treat this data.
Let's start with the parser.

Today I will implement this some more.