Friday, January 31, 2014

In this post, we look at a few of the examples from the previous post:
    source = job_listings | where salary > industry_average
uses the where clause's predicate to filter the results.
    dedup source sortby -delay
keeps the first result for each unique source, with results sorted by the delay field in descending order.
    head (action="startup")
returns events from the top of the results until it reaches one that does not match action="startup".
    transaction clientip maxpause=5s
groups events that share the same client IP address and have no gaps or pauses longer than five seconds.
The resulting transactions include additional fields such as duration and eventcount.
The stats command calculates statistical values on events, grouped by the values of specified fields.
    stats dc(host) returns the distinct count of host values.
    stats count(eval(method="GET")) as GET by host returns the number of GET requests for each web server. Other functions, such as avg, percentile (perc<X>), and range, can also be used with stats.
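Several such functions can be combined in a single stats invocation; a minimal sketch, assuming the events carry a delay field:
    stats avg(delay) max(delay) range(delay) by host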
Some functions apply only to timechart and cannot be used with chart or stats.
    chart max(delay) over host
returns max(delay) for each value of host.
    timechart span=1m avg(CPU) by host
charts the average value of CPU usage each minute for each host.
Filtering, modifying, and adding fields can be done with commands such as eval, rex, and lookup; a short sketch follows below.
The eval command calculates the value of a new field from existing fields.
The rex command creates new fields by matching regular expressions against the event text.
The lookup command adds fields by looking values up in an external lookup table.
Fields can be specified explicitly (for example, col1 through colN) or with wildcard characters.
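As a rough illustration of these three commands (the field names, the regular expression, and the usertogroup lookup table are hypothetical, not taken from the earlier examples):
    eval velocity = distance/time
    rex field=_raw "From: (?<from>\S+) To: (?<to>\S+)"
    lookup usertogroup user OUTPUT group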

Thursday, January 30, 2014

We will now describe the Search Processing Language. We mentioned earlier that Splunk shifts the focus from organizing data to formulating useful queries. The end result may be only a few records out of a mountain of original data; the value of the query language lies in how easily it lets you retrieve that result.
Search commands are separated by the pipe operator, the well-known operator that redirects the output of one command to the input of the next. For example, we could display selected columns of the top few rows of an input data set by chaining the search, top, and fields commands with their qualifiers, as sketched below.
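A minimal sketch of such a pipeline, assuming a syslog source type and a user field (both illustrative):
    search sourcetype=syslog ERROR | top user | fields - percent
Here top produces count and percent columns for each user, and fields - percent drops the percent column from the output.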
If you are not sure what to filter on, you can list all events, group them, and even cluster them. There are some data mining tips available as well. This method of exploration has been termed 'spelunking', hence the name of the product.
Some tips for using search commands: use quotation marks around phrases that contain spaces or special characters; arguments are case-insensitive; AND is implied between search terms unless OR, which has higher precedence, is specified explicitly; and subsearches, where one search is used as an argument to another, are enclosed in square brackets, as sketched below.
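A rough subsearch sketch, assuming a syslog source type and login-error events that carry a user field (all illustrative):
    sourcetype=syslog [ search login error | return user ]
The inner search runs first, and its result (a user=value pair) is substituted into the outer search.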
The common SPL commands include the following:
Sorting results - ordering the results, and optionally limiting their number, with the sort command
Filtering results - selecting only a subset of the original set of events, with one or more of the following commands: search, where, dedup, head, tail, etc.
Grouping results - grouping events based on some pattern, as with the transaction command
Reporting results - displaying a summary of the search results, such as with top/rare, stats, chart, timechart, etc.
Filtering, modifying, and adding fields - enhancing or transforming the results by removing, modifying, or adding fields, such as with the fields, replace, eval, rex, and lookup commands; a small sketch combining several of these categories follows
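A minimal sketch combining sorting, filtering, and reporting, assuming events that contain an error term (the term and field names are illustrative):
    search error | sort -_time | head 10 | stats count by host
This takes the ten most recent error events and counts them per host.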
Commands often work with about 10,000 events at a time by default unless explicitly overridden to include all of them. There is no support for C-like statements as in dtrace, and it is not as UI-oriented as Instruments. However, a variety of arguments can be passed to each of the search commands, and the language is platform agnostic. Perhaps it should support indexing and searching its own logs. The arguments include options such as startswith and endswith as well as key=value operands.
Statistical functions are available with the stats command, which supports a variety of built-in functions.
The chart and timechart commands are used when building reports.

In this post we look at how to get data into Splunk. Splunk divides raw, discrete data into events. When you run a search, it looks for matching events. Events can be visualized as structured data with attributes, or equivalently as a set of keyword/value pairs. Since events are timestamped, Splunk's indexes can efficiently retrieve events in time-series order. However, events need to be textual, not binary, image, sound, or other opaque data files; core dumps, for example, can be converted to stack traces first. Users can specify custom transformations before indexing. Data sources can include files, network inputs, and scripted inputs. Downloading, installing, and starting Splunk is easy, and when you reach the welcome screen there is an Add Data button to import the data. Indexing is unique and efficient in that it associates the time with the words in the event without touching the raw data; with this map of time-based words, the index looks up the corresponding events. A stream of data is divided into individual events, and the timestamp field enables Splunk to retrieve events within a time range.
Splunk has a user interface called the Summary Dashboard. It gives you a quick overview of the data. It has a search bar, a time range picker, a running total of the indexed data, and three panels - one each for sources, source types, and hosts. The sources panel shows which sources (files, network, or scripted inputs) the data comes from; the source types panel shows the types of those sources; and the hosts panel shows the hosts the data comes from. The contents of the search dashboard include the following:
Timeline - indicates the distribution of matching events for the search over time.
Fields sidebar - the relevant fields extracted along with the events.
Field discovery switch - turns automatic field discovery on or off.
Results area - events ordered by timestamp, showing the raw text for each event along with the fields selected in the fields sidebar and their values.

Wednesday, January 29, 2014

Today I'm going to talk about Splunk.
Perhaps I will first delve into one of its features. As you probably know, Splunk enables powerful analytics over machine data. It treats data as key/value pairs that can be looked up as niftily and as fast as with any big data system. This is the crux of Splunk: it allows search over machine data to find the relevant information when the data is otherwise difficult to navigate due to its volume. Notice that it eases the transition from organizing data to writing better queries. The queries can be expressed in a select-like form and language.
While I will go into these in detail, including the technical architecture, shortly, I first want to cover regular expressions over the data. Regex is powerful because it allows both matching and extracting data. The patterns can be specified separately, and they use the same metacharacters for describing a pattern as anywhere else.
The indexer can selectively filter out events based on such regexes. This is specified via two configuration files, props.conf and transforms.conf - one for configuring Splunk's processing properties and the other for configuring data transformations.
props.conf is used for line-breaking multiline events, setting up character set encoding, processing binary files, recognizing timestamps, setting up rule-based source type recognition, anonymizing or obfuscating data, routing selected data, creating new index-time field extractions, creating new search-time field extractions, and setting up lookup tables for fields from external sources. transforms.conf is used for configuring similar attributes, and all of these require corresponding settings in props.conf, as sketched below.
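As a rough sketch of how the two files work together (the log path and stanza names here are hypothetical), dropping DEBUG events from a log before indexing might look like this:
    # props.conf
    [source::/var/log/myapp.log]
    TRANSFORMS-dropdebug = drop_debug_events

    # transforms.conf
    [drop_debug_events]
    REGEX = DEBUG
    DEST_KEY = queue
    FORMAT = nullQueue
The transform matches the regex against each raw event and routes matching events to the nullQueue, which discards them.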
This feature gives the user a powerful capability: transforming events, selectively filtering them, and adding enriched information. Imagine working not only with the original data but with something that can be transformed into more meaningful representations. Such a feature not only helps with search and results but also helps in visualizing the data better.
 
In today's post, I want to briefly discuss the differences between SOAP and REST based services.
SOAP follows a convention where data is exchanged as messages and the operations are usually methods on objects. Data can be exchanged as both text and binary, and specific tools are needed to inspect the messages.
REST is lightweight in that it describes resources and the operations on them with a small set of predefined verbs, usually HTTP verbs. Instead of a heavyweight XML format, it can use JSON. Traffic can be intercepted with standard HTTP tools such as Fiddler.

Tuesday, January 28, 2014

In this post, I wanted to start on a book for Splunk. I will have to defer that to the next few posts. Instead I want to add a few more items on AWS, specifically Amazon EC2 instance store volumes. These are called ephemeral drives and provide block-level storage. The volumes are preconfigured and pre-attached to the same physical server that hosts the EC2 instance. The EC2 instance type determines the size of the storage. Some instance types, such as the micro instance, may have no instance storage at all, and Amazon EBS storage is used instead.
HI1 instances can have very fast solid state drives. In contrast, HS1 instances are optimized for very high storage density and sequential I/O.
Local instance store volumes are ideal for temporary storage of information that is continually changing, such as buffers, caches, scratch data, and other temporary content, or for data that is replicated across a fleet of instances, such as a load-balanced pool of servers.
The instance owns the data and storage, unlike EBS, where volumes can be attached and detached. High I/O instances typically use instance store volumes backed by SSD and are ideally suited for many high-performance database workloads such as Cassandra and MongoDB.
Data warehouses, Hadoop storage nodes, and cluster file systems involve a lot of high sequential I/O. These are better supported by high-storage instances. EC2 instance storage has the same general performance characteristics as Amazon EBS volumes, RAID striping is supported, and because the disks are local, the bandwidth to them is not limited by the network.
Unlike EBS volume data, data on instance store volumes persists only for the lifetime of the instance; for durability, the main concern is that data does not persist if the instance is stopped or fails. The cost of the Amazon EC2 instance includes any local instance store volumes.
Storage volumes are fixed and defined by the instance type, so scaling the number of store volumes is not an option; instead, the instance and its data can be instantiated multiple times to scale.
Local instance store volumes are tied to a particular Amazon EC2 instance and are fixed in number and size.
I'm going to pick up a book on Splunk next, but I will complete the discussion on AWS and Amazon S3 (Simple Storage Service) first. We talked about usage patterns, performance, durability, and availability. We will look at the cost model, scalability and elasticity, interfaces, and anti-patterns next. Then we will review Amazon Glacier.
S3 supports a virtually unlimited number of files in a directory. Unlike a disk drive, which restricts the size of the data before partitioning, S3 can store an unlimited number of bytes. Objects are stored in a single bucket, and S3 will scale out and distribute redundant copies of the information.
In terms of interfaces, standard REST and SOAP-based interfaces are provided, and they support both management and data operations. Objects are stored in buckets, and each object has a unique key. Objects are addressed over the web rather than through a file system, but key names can mimic a file-system-like hierarchy.
SDKs that wrap these interfaces are available for most popular programming languages.
Situations where S3 is not the right choice include the following: S3 is not a standalone filesystem;
it cannot be queried to retrieve a specific object unless you know the bucket name and key;
it does not suit rapidly changing data, and it is also not intended as backup or archival storage on its own. While it is ideal for websites, it is used to store the static content, with the dynamic content served from EC2.
Amazon Glacier is an extremely low-cost storage service for backup and archival. Customers can reliably store their data for as little as 1 cent per gigabyte per month. You store data in Amazon Glacier as archives. Archives are limited to 4 TB each, but there is no limit on their number.