Sunday, July 27, 2014

Splunk can index both text and binary data. In this post, we will see how Splunk can be used alongside archival storage devices. Companies like Data Domain have built a strong commercial business around backup and archival; their ability to back up data continuously and use deduplication to reduce its size makes pairing them with Splunk interesting.

But first let us look at what it means to index binary data. Indexing text is well understood: compact hash values and efficient data structures such as a B+ tree support lookup and retrieval, and text also lends itself to key-value pair extraction, which comes in handy with NoSQL databases. The trouble with binary data is that it cannot be meaningfully searched or analyzed. Unless there is textual metadata associated with it, binary data is not helpful; an image file's bytes, for example, are far less useful than its size, creation tool, username, camera, GPS location, and so on. Even textual representations such as XML are of limited help, since they are hard for humans to read and require parsing. So while serializing code objects in an application may be useful, logging their significant key-value pairs is even better, since those are in a plain textual format that lends itself to Splunk forwarding, indexing, and searching.
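To make that last point concrete, here is a minimal sketch in Python of what such key-value logging might look like. The field names (event, path, size_bytes, and so on) and the log file name are illustrative assumptions, not a prescribed schema; a Splunk forwarder monitoring the file would pick up the events, and Splunk's automatic key=value extraction makes each field searchable.

```python
import logging

# Write one-line events to a file that a Splunk forwarder can monitor.
logging.basicConfig(
    filename="app_events.log",
    format="%(asctime)s %(message)s",
    level=logging.INFO,
)

def log_image_metadata(path, size_bytes, tool, user, camera, gps):
    # Emit the image's descriptive attributes as key=value pairs
    # instead of serializing the binary object itself.
    logging.info(
        'event=image_ingested path="%s" size_bytes=%d tool="%s" '
        'user="%s" camera="%s" gps="%s"',
        path, size_bytes, tool, user, camera, gps,
    )

log_image_metadata("/data/photos/img_0042.jpg", 2480134,
                   "Lightroom", "alice", "Canon EOS", "37.77,-122.42")
```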
The same approach applies to both periodic and ad hoc maintenance archival. Applications that archive data sometimes move terabytes at a time. Moreover, they interpret and organize this data so that nothing is lost during the move, while the delta changes between, say, two backup runs are collected and saved efficiently in both size and computation. A lot of metadata is gathered in the process by these applications, and the same can be logged to Splunk, which in turn enables superior analytics on it. One characteristic of data that is backed up or archived regularly is that it contains a great deal of recurrence and repetition; that is why companies like Data Domain are able to de-duplicate the events and shrink the footprint of the data to be archived. Those computations carry a lot of information that can be expressed as rich metadata suitable for later analytics: for example, the source of the data, the programs that use it, metadata on what was de-duped, and the location and availability of the archived copy. This way, applications need not fetch the archived data to answer analytical queries; they can work directly off the Splunk indexes instead.
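As a sketch of the kind of per-run metadata such an application could log, consider the following. Every field name and value here is hypothetical, not taken from any actual Data Domain or Splunk schema; the point is only that each backup run emits a single-line JSON event, which Splunk parses into searchable fields out of the box.

```python
import json
import time

def log_backup_run(log, run_id, source, bytes_scanned,
                   bytes_after_dedup, archive_location):
    # One single-line JSON event per backup run, capturing the
    # deduplication result and where the archived copy lives.
    event = {
        "time": time.time(),
        "event": "backup_run_complete",
        "run_id": run_id,
        "source": source,
        "bytes_scanned": bytes_scanned,
        "bytes_after_dedup": bytes_after_dedup,
        "dedup_ratio": round(bytes_scanned / max(bytes_after_dedup, 1), 2),
        "archive_location": archive_location,
    }
    log.write(json.dumps(event) + "\n")

with open("backup_runs.log", "a") as log:
    log_backup_run(log, "run-20140727-01", "/vol/projects",
                   1200000000000, 85000000000, "archive01:/pool7")
```

Once such events are indexed, a search along the lines of `source="backup_runs.log" | stats avg(dedup_ratio) by source` would report deduplication effectiveness per data source, without ever touching the archived bytes themselves.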
