Sunday, March 30, 2014

Today we talk about distributed search in Splunk. As we know from the discussion on distributed deployment, Splunk Enterprise performs three major functions as it moves data through the data pipeline. First, it consumes data from files, the network or elsewhere. Then it indexes the data. Finally, it runs interactive or scheduled searches on the indexed data. Each of these functions can be split off onto dedicated Splunk instances, which can number from a few to thousands.
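To make the first stage concrete, here is a minimal inputs.conf sketch for consuming data from a file and from the network. The log path, port and index names are assumptions for illustration only.

    # inputs.conf -- hypothetical data consumption example
    # Monitor an application log file on disk
    [monitor:///var/log/myapp/app.log]
    sourcetype = myapp_log
    index = main

    # Listen for syslog-style data on a UDP port (port is an assumption)
    [udp://514]
    sourcetype = syslog
    index = main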
The dedicated instances that consume data are called forwarders, and they can be lightweight or universal. The indexers do the heavy lifting of indexing, and they are usually the instances with the higher-performance configurations. Typically, there is one indexer for every 150 to 200 GB of daily indexing, with a 12 to 24 core server for each indexer instance. Indexers can also be clustered and configured to replicate each other's data. This process is known as index replication. Replication helps prevent data loss and improves data availability. Clusters also have a built-in distributed search capability, which we will get to shortly.

The other type of dedicated instance is the search head. Search heads coordinate searches across the set of indexers, consolidate their results and present them to the user. For the largest environments, several search heads sharing a single configuration set can be pooled together. With search head pooling, we can coordinate simultaneous searches across a large number of indexers.

With these logical definitions, we can now mix and match the components on the same or different physical machines and configurations to suit our needs. Often, if several instances share the same machine, it will be a high-end machine with several cores and plenty of RAM. Deployment decisions are based on the amount of incoming data, the amount of indexed data, the number of concurrent users, the number of saved searches, the types of searches employed and the choice to run Splunk apps.
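To make the forwarder-to-indexer relationship concrete, here is a minimal outputs.conf sketch for a universal forwarder that load-balances its events across two indexers. The host names and the receiving port 9997 are assumptions for illustration; each indexer would also need a matching receiving input (for example a [splunktcp://9997] stanza in its inputs.conf).

    # outputs.conf on a universal forwarder (hypothetical hosts)
    [tcpout]
    defaultGroup = primary_indexers

    [tcpout:primary_indexers]
    # Auto load-balance events across the listed indexers
    server = indexer01.example.com:9997, indexer02.example.com:9997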
Note that distributed search is not a deployment add-on. It is built-in functionality in Splunk Enterprise.
In distributed search we look at use cases such as the following:
As a Splunk administrator, I want to distribute indexing and searching loads across multiple Splunk instances so that it is possible to search and index large quantities of data in a reasonable time.
As a Splunk administrator, I want to use distributed search to control access to indexed data, so that admins and security personnel have full access while other users can access only their own data (a sketch of this with authorize.conf follows the list).
As a Splunk administrator, I want to have different Splunk instances in different geographical offices, so that local offices have access to their own data while the centralized headquarters has access at the corporate level.
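For the access-control use case above, roles defined in authorize.conf can limit which indexes a user may search. The role and index names below are hypothetical; only the setting names come from Splunk's standard configuration.

    # authorize.conf (hypothetical roles and index names)
    # Security personnel: allowed to search every index
    [role_security_admin]
    importRoles = user
    srchIndexesAllowed = *

    # Application team: allowed to search only its own index
    [role_appteam_user]
    importRoles = user
    srchIndexesAllowed = appteam_index
    srchIndexesDefault = appteam_index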
In distributed search, the Splunk Enterprise instance that does the searching is referred to as the search head, and the instances that participate in the search are called search peers. Typically, the search head runs its search across all of its search peers unless the search is limited to a subset, for example with the splunk_server field in the query. To routinely search only some of the peers, we can also configure the search head with just that subset as its search peers.
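As a sketch of how a search head learns about its peers, a peer can be added from the search head's CLI or directly in distsearch.conf, and a query can then be narrowed with the splunk_server field. The host names, credentials and index name below are assumptions for illustration.

    # On the search head: register a search peer (hypothetical host and credentials)
    splunk add search-server https://indexer01.example.com:8089 -auth admin:changeme -remoteUsername admin -remotePassword peerpassword

    # Equivalent distsearch.conf stanza on the search head
    [distributedSearch]
    servers = https://indexer01.example.com:8089, https://indexer02.example.com:8089

    # SPL query restricted to one peer; splunk_server matches the peer's server name
    index=main splunk_server=indexer01 error | stats count by host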
When a distributed search is initiated, the search head replicates and distributes its knowledge objects to its search peers. Knowledge objects include saved searches, event types and other entities used in searching across indexes. They are packaged into a knowledge bundle and distributed to the peers.
Knowledge bundles typically reside under the $SPLUNK_HOME/var/run folder and have a .bundle extension. Bundles can contain almost all the contents of the search head's apps. Instead of copying the bundles over to every peer, it is sometimes advisable to mount the knowledge bundles on shared storage so the peers can read them in place. Along with the bundle, user authorization information flows to the peers: the user running the search, that user's role and the location of the distributed authorize.conf file are included with the bundle.
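A minimal sketch of the mounted-bundle setup, assuming the search head's bundles directory has been shared (for example over NFS) at the path shown; the search head name and mount point are hypothetical.

    # distsearch.conf on the search head: stop replicating bundles to peers
    [replicationSettings]
    shareBundles = false

    # distsearch.conf on each search peer: read the bundles from shared storage
    [searchhead:searchhead01]
    mounted_bundles = true
    bundles_location = /mnt/searchhead01/var/run/searchhead01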


