Saturday, January 25, 2020

Filters:
Filters are common in applications that query data. When data is stored in tables, filters usually appear as predicates that reduce the number of records returned. Tables carry a lot of metadata, and queries are prepared and cached, which makes running these filters very fast.
With the advent of Big Data storage, applications built on map-reduce had to do their own filtering. There was neither an obvious way to prepare the data nor an easy way to cache the results.
The same is true for streaming applications. An application that writes a filter such as the following:
    import org.apache.flink.api.common.functions.FilterFunction;

    // Keeps only event records, dropping any line marked "NonEvent".
    private static class EventFilter implements FilterFunction<String> {
        @Override
        public boolean filter(String line) throws Exception {
            return !line.contains("NonEvent");
        }
    }
has no way of knowing how long it will take to scan the data in the stream and produce the filtered results.
If a filter is not re-used, it can be inlined with the application logic, since the full set of operations taken to evaluate one record will not take much longer than the single filter predicate on its own.
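In practice, the predicate can simply live inside the processing function itself. Below is a minimal sketch, assuming Flink's DataStream API (which the FilterFunction above matches); the enrich() step is a hypothetical stand-in for the application logic:

    import org.apache.flink.api.common.functions.FlatMapFunction;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.util.Collector;

    public class InlineFilterJob {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
            DataStream<String> lines =
                env.fromElements("Event:login", "NonEvent:heartbeat", "Event:click");

            // The filter predicate is inlined with the application logic:
            // records that fail the check are simply never emitted.
            lines.flatMap(new FlatMapFunction<String, String>() {
                @Override
                public void flatMap(String line, Collector<String> out) {
                    if (!line.contains("NonEvent")) { // the inlined predicate
                        out.collect(enrich(line));    // the application logic proper
                    }
                }
            }).print();

            env.execute("inline-filter");
        }

        // Hypothetical stand-in for whatever processing follows the filter.
        private static String enrich(String line) {
            return line.toUpperCase();
        }
    }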
If a filter is re-used, it can be packaged independently. This allows the same filtering logic to be run again and again on different data streams.
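For example, the EventFilter above, once made accessible as a standalone class, can be handed to any number of pipelines. A minimal sketch with two hypothetical sources of line-oriented data:

    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class ReusableFilterJob {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            // Two independent streams; the names and contents are illustrative.
            DataStream<String> webLogs = env.fromElements("Event:view", "NonEvent:ping");
            DataStream<String> appLogs = env.fromElements("NonEvent:gc", "Event:crash");

            // The same packaged filter runs on different data streams.
            EventFilter eventsOnly = new EventFilter();
            webLogs.filter(eventsOnly).print();
            appLogs.filter(eventsOnly).print();

            env.execute("reusable-filter");
        }
    }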
Another strategy to make filtering fast is to do it once rather than several times. For example, applying the filter to one data stream yields another stream containing just the data of interest, and that resulting stream can serve as the feed for all of the application logic.
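A minimal sketch of that fan-out: the predicate is evaluated once, and every downstream computation reads the already-filtered stream.

    import org.apache.flink.api.common.functions.MapFunction;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class FilterOnceJob {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
            DataStream<String> lines =
                env.fromElements("Event:login", "NonEvent:ping", "Event:click");

            // Filter once: this stream carries just the data of interest.
            DataStream<String> events = lines.filter(new EventFilter());

            // All application logic feeds off the single filtered stream,
            // so the predicate never runs a second time per record.
            events.map(new MapFunction<String, Integer>() {
                @Override
                public Integer map(String line) { return line.length(); }
            }).print();
            events.print();

            env.execute("filter-once");
        }
    }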
Yet another strategy is to combine the transformation with the statistics it is meant to produce. For example, if the purpose of the filtering is just to count, the counts can be collected from the non-transformed stream via map-reduce and saved in forms that can be re-used later.
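For instance, when all the filter would feed is a counter, the count can be gathered straight from the raw stream in map-reduce style: map each record to a (category, 1) pair and sum by key. A minimal sketch, reusing the "NonEvent" marker from the code above; the printed result stands in for saving the counts in a re-usable form:

    import org.apache.flink.api.common.functions.MapFunction;
    import org.apache.flink.api.java.functions.KeySelector;
    import org.apache.flink.api.java.tuple.Tuple2;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class CountWithoutFilterJob {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            env.fromElements("Event:login", "NonEvent:ping", "Event:click")
                // Map phase: tag every record with its category and a count of one.
                .map(new MapFunction<String, Tuple2<String, Integer>>() {
                    @Override
                    public Tuple2<String, Integer> map(String line) {
                        String category = line.contains("NonEvent") ? "nonEvent" : "event";
                        return Tuple2.of(category, 1);
                    }
                })
                // Reduce phase: running counts per category, collected without
                // ever materializing a filtered stream.
                .keyBy(new KeySelector<Tuple2<String, Integer>, String>() {
                    @Override
                    public String getKey(Tuple2<String, Integer> t) { return t.f0; }
                })
                .sum(1)
                .print();

            env.execute("count-without-filter");
        }
    }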
These are some of the techniques for filtering large data sets.
