Cluster computing

Wednesday, August 13, 2014

Having discussed Splunk DB in the previous posts, I want to outline and discuss one of the key concepts and its implementation in this post today. This is the time series partitioning of data or buckets and the inverted time search on these buckets. There are other very powerful concepts such as key value map and reduce that enables powerful analytics. We suggested and follow up on whether we could externalized this to NoSQL databases. In the meantime, I want to focus on just the time series data, its persistence and search - essentially whether we can avoid buckets.
Events that flow into Splunk are marked with timestamp. The timestamp gives an order to the event. In the absence of any other search query parameters, the timestamp helps partition the data correctly.
The events need to be sorted by timestamp. However, events can arrive out of order. Its helpful to think of these events as precious data. We should not drop events, if possible. So a sufficient window to let the out of order events arrive could be maintained. When the events arrive continuously, they can be streamed to a file or a table. There is no need to split the file or the table logically. If we are concerned about the reliability and fault tolerance, we can handle it externally. If we are concerned about a single point of maintenance, we can distribute it. If we are concerned about ACID properties, we can use a traditional relational database. Outwards, there is nothing preventing us from viewing it as a singular data store.
The ability to put an order to the late arrival events and then to search back time for query matching is the consideration for this post. One of the ways to do this is with a sliding window protocol. While we cannot solve the problem of unlimited sequence numbers, we can hope to index and maintain events that are of reasonable delay. If we cannot control that, we specify an auxilary storage for such overly delayed packets, which we can later retry sequencing asynchronously. We could do this with fixed sequence numbers. For every event sequenced, the sliding window moves forward by one event. Sequencing one event involves inserting it in the right place. Typically we avoid the lookup and the insertion by copying the minimum time-stamp to the start position of the sliding window and exchanging it with the element at the start. We do this only if the element at start is not the minimum. We also check that the element before the start is earlier than the element at the start. Then we advance the sliding window forward by one event.Since we check only the elements within the window and the element prior to start we can wrap-around the index for a fixed size sequence number.

Cluster computing

Wednesday, August 13, 2014

No comments:

Post a Comment