Thursday, January 30, 2014

We will now describe the Search Processing Language (SPL). We mentioned earlier that Splunk shifts the focus from organizing data to writing useful queries. The end result may be only a few records out of a mountain of original data, and it is the ease of the query language that makes retrieving that result practical.
Search commands are separated by the pipe operator, the well-known operator that redirects the output of one command as the input to another. For example, we could list the chosen columns of the top few rows of an input data set by chaining the search | top | fields commands, each with its own qualifiers.
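As a rough sketch of how such a pipeline might be run programmatically, here is a Python example using the Splunk SDK (splunklib); the host, credentials, index, and field names are placeholders rather than anything prescribed by Splunk.

    import splunklib.client as client
    import splunklib.results as results

    # Connect to a Splunk instance; host, port, and credentials are placeholders.
    service = client.connect(host="localhost", port=8089,
                             username="admin", password="changeme")

    # search | top | fields: match events, keep the most frequent hosts,
    # and return only the columns we care about.
    query = 'search index=main error | top limit=5 host | fields host, count'
    reader = results.ResultsReader(service.jobs.oneshot(query))

    for row in reader:
        if isinstance(row, dict):   # skip any diagnostic messages in the stream
            print(row.get("host"), row.get("count"))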
If you are not sure what to filter on, you can list all events, group them, and even cluster them; there are some mining tips available as well. This method of exploration has been termed 'spelunking', hence the name of the product.
Some tips for using the search commands: use quotation marks around phrases, rely on the case-insensitivity of the arguments you specify, remember that AND is the default boolean between search terms unless OR is given explicitly (OR has the higher precedence), and use subsearches, where one search command is an argument to another and is enclosed in square brackets.
The common SPL commands include the following:
Sorting results - ordering the results, and optionally limiting their number, with the sort command
Filtering results - selecting only a subset of the original set of events, with one or more of the commands search, where, dedup, head, tail, etc.
Grouping results - grouping events that share some pattern, as with the transaction command
Reporting results - displaying a summary of the search results, as with top/rare, stats, chart, timechart, etc.
Filtering, modifying, and adding fields - enhancing or transforming the results by removing, modifying, or adding fields, as with the fields, replace, eval, rex, and lookup commands.
Commands often work with about 10,000 events at a time by default unless explicitly overridden to include all of them. There is no support for C-like statements as with dtrace, and it is not as UI-oriented as Instruments. However, a variety of arguments can be passed to each of the search commands, and the language is platform agnostic; perhaps it should also support indexing and searching its own logs. These arguments include operators such as startswith and endswith as well as key-value operands.
Statistical functions are available with the stats command, which supports a variety of built-in functions.
The chart and timechart commands are used with report builders.
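Putting several of these command groups together, here is a hypothetical pipeline, written as a Python string so each stage can be annotated; the index, sourcetype, and field names are made up for illustration.

    # A hypothetical SPL pipeline combining the command groups above.
    # Index, sourcetype, and field names are illustrative only.
    query = (
        'search index=web sourcetype=access_combined status>=500 '  # filter events
        '| dedup clientip '                                         # drop duplicate clients
        '| eval response_kb = bytes / 1024 '                        # add a derived field
        '| stats count, avg(response_kb) AS avg_kb BY host '        # report per host
        '| sort -count'                                             # order the results
    )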

In this post we look at how to get data into Splunk. Splunk divides raw discrete data into events, and when you do a search, it looks for matching events. Events can be visualized as structured data with attributes; they can also be viewed as sets of keyword/value pairs. Since events are timestamped, Splunk's indexes can efficiently retrieve events in time-series order. However, events need to be textual, not binary, image, sound, or other opaque data files; core dumps, for instance, can be converted to stack traces, and users can specify custom transformations before indexing. Data sources can include files, network inputs, and scripted inputs. Downloading, installing, and starting Splunk is easy, and when you reach the welcome screen there is an Add Data button to import the data. Indexing is unique and efficient in that it associates the time with the words in each event without touching the raw data; with this map of time-based words, the index looks up the corresponding events. A stream of data is thus divided into individual events, and the timestamp field enables Splunk to retrieve events within a time range.
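To make this concrete, here is one way to push a single event into an index over Splunk's REST API from Python; the endpoint, port, credentials, and index name are assumptions based on a default install, so verify them against your Splunk version.

    import requests

    # Submit one raw, timestamped event to Splunk's receivers endpoint.
    # URL, credentials, index, and sourcetype are placeholders.
    event = "2014-01-30 10:15:32 action=login user=alice status=success"
    resp = requests.post(
        "https://localhost:8089/services/receivers/simple",
        params={"index": "main", "sourcetype": "myapp"},
        data=event,
        auth=("admin", "changeme"),
        verify=False,   # default installs use a self-signed certificate
    )
    print(resp.status_code)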
Splunk has a user interface called the Summary Dashboard. It gives you a quick overview of the data. It has a search bar, a time range picker, a running total of the indexed data, and three panels - one each for sources, source types, and hosts. The sources panel shows which sources (files, network, or scripted inputs) the data comes from, the source types panel shows the types of those sources, and the hosts panel shows which hosts the data comes from. The contents of the search dashboard include the following:
Timeline - indicates the matching events for the search over time.
Fields sidebar - the relevant fields extracted along with the events.
Field discovery switch - turns automatic field discovery on or off.
Results area - events are ordered by timestamp, and the raw text of each event is shown, including the fields selected in the fields sidebar along with their values.

Wednesday, January 29, 2014

Today I'm going to talk about Splunk.
Perhaps I will first delve into one of its features. As you probably know, Splunk enables great analytics over machine data, and it treats data as key-value pairs that can be looked up just as niftily and as fast as with any Big Data system. This is the crux of Splunk: it allows search over machine data to find the relevant information when the data is otherwise difficult to navigate because of its volume. Notice that it eases the transition from organizing data to querying it better, and the queries can be expressed in a select-like form and language.
While I will go into these in detail, including the technical architecture, shortly, I want to cover regular expressions over the data first. Regex is powerful because it allows both matching and extracting data. The patterns can be specified separately, and they use the same metacharacters for describing a pattern as anywhere else.
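As a small illustration outside Splunk itself, the same metacharacters can extract and filter key-value data in Python; the log format here is invented.

    import re

    # Extract key=value pairs from a raw log line, much like a field extraction.
    line = "2014-01-29 09:12:01 user=alice action=login status=success"
    fields = dict(re.findall(r'(\w+)=(\S+)', line))
    print(fields)   # {'user': 'alice', 'action': 'login', 'status': 'success'}

    # The same idea selectively filters events: keep only lines that match.
    raw_events = [line, "2014-01-29 09:12:05 user=bob action=login status=failure"]
    failures = [e for e in raw_events if re.search(r'status=fail', e)]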
The indexer can selectively filter out events based on these regexes. This is specified via two configuration files, props.conf and transforms.conf - one for configuring Splunk's processing properties and the other for configuring data transformations.
Props.conf is used for line-breaking multiline events, setting up character set encoding, processing binary files, recognizing timestamps, setting up rule-based source type recognition, anonymizing or obfuscating data, routing selected data, creating new index-time field extractions, creating new search-time field extractions, and setting up lookup tables for fields from external sources. Transforms.conf is used for configuring similar attributes, and all of its stanzas require corresponding settings in props.conf.
This feature gives the user a powerful capability by transforming events, selectively filtering them, and adding enhanced information. Imagine working not only with the original data but with something that can be transformed into more meaningful representations. Such a feature helps not only with search results but also with visualizing the data better.
 
In today's post, I want to briefly discuss the differences between SOAP and REST based services.
SOAP follows a convention where data is exchanged as messages and the operations are usually methods on objects. Data can be exchanged as both text and binary, and specific tools are needed to inspect the messages.
REST is a lightweight style in that it describes resources and exposes operations through a set of predefined verbs, usually the HTTP verbs. Instead of a heavyweight XML format, it can use JSON, and the data can be intercepted with standard HTTP traffic tools such as Fiddler.
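A minimal sketch of the REST side in Python; the base URL and resource layout are placeholders, and the point is simply that a plain HTTP verb plus JSON is enough, with no SOAP envelope or special tooling.

    import requests

    # Plain HTTP verbs and JSON; any proxy such as Fiddler can inspect this traffic.
    base = "https://api.example.com"

    resp = requests.get(base + "/orders/42")      # read a resource
    order = resp.json()

    resp = requests.post(base + "/orders",        # create a resource
                         json={"item": "book", "quantity": 1})
    print(resp.status_code, resp.headers.get("Location"))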

Tuesday, January 28, 2014

In this post, I wanted to start on a book for Splunk, but I will have to defer that to the next few posts. Instead I want to add a few more items on AWS, specifically Amazon EC2 instance store volumes. These are called ephemeral drives and provide block-level storage. The storage is preconfigured and pre-attached to the same physical server that hosts the EC2 instance, and the instance type determines the size of the storage. Some instance types, such as the micro instance, may have no instance storage at all, and Amazon EBS storage is used instead.
HI1 instances can have very fast solid-state drives; in contrast, HS1 instances are optimized for very high storage density and sequential I/O.
Local instance store volumes are ideal for temporary storage of information that is continually changing, such as buffers, caches, scratch data, and other temporary content, or for data that is replicated across a fleet of instances, such as a load-balanced pool of servers.
The instance owns the data and storage, unlike EBS, where store volumes can be attached or detached. High I/O instances typically use instance store volumes backed by SSD and are ideally suited to many high-performance database workloads such as Cassandra and MongoDB.
Data warehouses, Hadoop storage nodes, and cluster file systems involve a lot of high sequential I/O, and these are better supported by high storage instances. EC2 instance storage has the same performance characteristics as Amazon EBS volumes, RAID striping is supported, and because the disks are local, the bandwidth to them is not limited by the network.
Unlike EBS volume data, data on instance store volumes persists only for the lifetime of the instance; in terms of durability, data persists across instance reboots but not beyond the instance's lifetime. The cost of the Amazon EC2 instance includes any local instance store volumes.
Storage volumes are fixed and defined by the instance type, so scaling the number of store volumes is not an option, but the overall instance, along with its data, can be instantiated multiple times to scale.
Local instance store volumes are tied to a particular Amazon EC2 instance and are fixed in number and size.
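As a sketch of how that fixed mapping is declared, here is a launch call with the Python SDK (boto3); the AMI ID, instance type, and device name are placeholders, and whether an ephemeral volume is actually available depends on the instance type.

    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    # Instance store (ephemeral) volumes are mapped at launch time;
    # they cannot be attached or detached later the way EBS volumes can.
    ec2.run_instances(
        ImageId="ami-12345678",        # placeholder AMI
        InstanceType="m3.large",       # placeholder instance type
        MinCount=1,
        MaxCount=1,
        BlockDeviceMappings=[
            {"DeviceName": "/dev/sdb", "VirtualName": "ephemeral0"},
        ],
    )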
I'm going to pick up a book on Splunk next, but I will complete the discussion on AWS and Amazon S3 (Simple Storage Service) first. We talked about usage patterns, performance, durability, and availability. We will look at the cost model, scalability and elasticity, interfaces, and anti-patterns next. Then we will review Amazon Glacier.
S3 supports a virtually unlimited number of files in any directory. Unlike a disk drive, which restricts the size of the data before it must be partitioned, S3 can store an unlimited number of bytes. Objects are stored in a single bucket, and S3 will scale out and distribute redundant copies of the information.
In terms of interfaces, standard REST and SOAP-based interfaces are provided, and they support both management and data operations. Objects are stored in buckets, and each object has a unique key. Objects are addressed over the web rather than through a file system, though keys can mimic a file-system-like hierarchy.
SDKs built over these interfaces are available for most popular programming languages.
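For instance, with the Python SDK (boto3), storing and retrieving an object is a pair of calls against a bucket and key; the bucket and key names below are placeholders.

    import boto3

    s3 = boto3.client("s3")

    # Objects live in a bucket and are addressed by a unique key;
    # the slashes in the key only mimic a directory hierarchy.
    s3.put_object(Bucket="my-example-bucket",
                  Key="logs/2014/01/28/app.log",
                  Body=b"hello from S3")

    obj = s3.get_object(Bucket="my-example-bucket", Key="logs/2014/01/28/app.log")
    print(obj["Body"].read())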
Cases where S3 is not the right choice include the following: S3 is not a standalone file system.
It cannot be queried to retrieve a specific object unless you know the bucket name and key.
It doesn't suit rapidly changing data, and it is also not a backup or archival store by itself. While it is ideal for websites, it is used to store the static content, with the dynamic content served from EC2.
Amazon Glacier is an extremely low-cost storage service for backup and archival. Customers can reliably store their data for as little as one cent per gigabyte per month. You store data in Amazon Glacier as archives; archives are limited to 4 TB, but there is no limit on their number.
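A sketch of archiving with boto3; the vault name and file are placeholders, and note that retrieving an archive later is an asynchronous job rather than an immediate read.

    import boto3

    glacier = boto3.client("glacier", region_name="us-east-1")

    # Upload an archive to a vault; the vault name is a placeholder.
    with open("backup-2014-01-30.tar.gz", "rb") as f:
        resp = glacier.upload_archive(vaultName="my-backups", body=f)

    # Keep the archive ID; it is needed to initiate a retrieval job later.
    print(resp["archiveId"])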

Monday, January 27, 2014

In this post, we continue our discussion on AWS, particularly the storage options. As we have seen, AWS is a flexible, cost-effective, easy-to-use cloud computing platform. The different choices for storage are: memory-based storage such as file caches, object caches, in-memory databases, and RAM; message queues that provide temporary durable storage for data sent asynchronously between computer systems or application components; storage area networks (SAN), where virtual disk LUNs often provide the highest level of disk performance and durability; direct-attached storage, where local hard disk drives or arrays residing in each server provide higher performance than a SAN; and network-attached storage, which provides a file-level interface that can be shared across multiple systems. Finally, we have the traditional databases where structured data resides, as well as NoSQL non-relational databases and data warehouses. Backup and archive storage includes non-disk media such as tapes or optical media.

These options differ in performance, durability, and cost, as well as in their interfaces, and architects consider all of these factors when making choices. Sometimes these combinations form a hierarchy of data tiers. Amazon Simple Storage Service (S3) is storage for the internet. It can store any amount of data, at any time, from within the compute cloud or from anywhere on the web. Writing, reading, and deleting objects of any size is possible, and it is highly scalable, allowing concurrent read and write access. Amazon S3 is commonly used for analyzing financial transactions, clickstream analytics, and media transcoding.

The performance of Amazon S3 from within Amazon EC2 in the same region is fast. It is also built to scale in storage, requests, and users. To speed access to the relevant data, Amazon S3 is often paired with a database such as DynamoDB or Amazon RDS: Amazon S3 stores the actual information while the database serves as a repository for the associated metadata. This metadata can be easily indexed and queried, which helps locate an object's reference with a query.
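A rough sketch of that pairing with boto3: the object body goes to S3 while a small metadata item goes to DynamoDB for indexing and querying; the bucket, table, and attribute names are invented for illustration.

    import boto3

    s3 = boto3.client("s3")
    dynamodb = boto3.client("dynamodb")

    bucket, key = "media-assets-example", "videos/clip-001.mp4"

    # Store the actual object in S3 ...
    with open("clip-001.mp4", "rb") as f:
        s3.put_object(Bucket=bucket, Key=key, Body=f)

    # ... and its queryable metadata in DynamoDB, pointing back to the S3 key.
    dynamodb.put_item(
        TableName="asset-metadata",
        Item={
            "asset_id": {"S": "clip-001"},
            "s3_key":   {"S": key},
            "duration": {"N": "42"},
            "format":   {"S": "mp4"},
        },
    )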

Durability is provided via automatic and synchronous saving of data across multiple devices and multiple facilities. Availability for mission-critical data is designed to such a high percentage that there is minuscule or no downtime. For non-critical data, Reduced Redundancy Storage in Amazon S3 can be used.
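For example, opting a non-critical object into Reduced Redundancy Storage is just a storage-class argument on the upload; the bucket and key here are placeholders.

    import boto3

    s3 = boto3.client("s3")

    # Non-critical, easily reproducible data can use Reduced Redundancy Storage.
    s3.put_object(Bucket="my-example-bucket",
                  Key="thumbnails/clip-001.jpg",
                  Body=b"...",
                  StorageClass="REDUCED_REDUNDANCY")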