Saturday, July 26, 2014

As I mentioned in the previous post, we are going to write a custom command that transforms search results into XML. Something like:
        // Collect the full set of field names once, then emit one
        // <SearchResult> element per result with a child tag per field.
        std::set<Str> allFields;
        results.getAllKeys(allFields);
        SearchResults::iterator it;
        for (it = results.begin(); it != results.end(); ++it) {
            SearchResult r = **it;
            _output.append("<SearchResult>");
            for (std::set<Str>::const_iterator sit = allFields.begin();
                 sit != allFields.end(); ++sit) {
                sr_index_t index = results.getIndex(*sit);
                // trim so that the xml tags are constructed without whitespace
                if (r.exists(index)) {
                    _output.append("<" + (*sit).trim() + ">");
                    _output.append(r.getValue(index));
                    _output.append("</" + (*sit).trim() + ">");
                }
            }
            _output.append("</SearchResult>");
        }

but Splunk already ships xpath.py, which makes an event value valid XML, i.e. it wraps it as <data>%s</data> where the inner XML is the value corresponding to _raw in the event. This is different from what we want above.
There are also data-structure-to-XML Python recipes on the web, such as Recipe #577268.
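
For a quick sense of that approach, here is a minimal sketch, not taken from the recipe, that serializes a list of result dictionaries with the standard library; the function name and sample fields are made up for illustration:

#!/usr/bin/python
# Illustrative only: serialize a list of result dicts to XML with the
# standard library. The name results_to_xml is ours, not Splunk's.
import xml.etree.ElementTree as ET

def results_to_xml(results):
    root = ET.Element("searchResults")
    for result in results:
        node = ET.SubElement(root, "searchResult")
        for field, value in result.items():
            # strip whitespace so the tag name stays well formed
            child = ET.SubElement(node, str(field).strip())
            child.text = str(value)
    return ET.tostring(root)

print(results_to_xml([{"host": "web01", "sourcetype": "access_combined"}]))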

There's also another way, described in the Splunk SDK as follows:
To use the reader, instantiate :class:`ResultsReader` on a search result stream
as follows:

    reader = ResultsReader(result_stream)
    for item in reader:
        print(item)
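
If you are reading results outside of a custom command, here is a minimal sketch of where that result stream might come from with the splunklib SDK; the host, credentials, and search string below are placeholders:

# Sketch: read search results with the Splunk Python SDK (splunklib).
# Connection details and the search string are placeholders.
import splunklib.client as client
import splunklib.results as results

service = client.connect(host="localhost", port=8089,
                         username="admin", password="changeme")
stream = service.jobs.oneshot("search index=_internal | head 5")
reader = results.ResultsReader(stream)
for item in reader:
    if isinstance(item, dict):       # an ordinary result row
        print(item)
    else:                            # a diagnostic results.Message
        print("%s: %s" % (item.type, item.message))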

We try to do it this way:
rrajamani-mbp15r:splunkb rrajamani$ cat etc/system/local/commands.conf
[smplcmd]
filename = smplcmd.py
streaming = true
local = true
retainsevents = true
overrides_timeorder = false
supports_rawargs = true

# untested
#!/usr/bin/python
import splunk.Intersplunk as si
import time

if __name__ == '__main__':
    try:
        keywords, options = si.getKeywordsAndOptions()
        results, dummyresults, settings = si.getOrganizedResults()
        myxml = "<searchResults>"
        fields = ["host", "source", "sourcetype", "_raw", "_time"]
        outfield = options.get('outfield', 'xml')
        for result in results:
            element = "<searchResult>"
            for i in fields:
                field = options.get('field', str(i))
                val = result.get(field, None)
                if val is not None:
                    element += "<" + str(field).strip() + ">" + str(val) + "</" + str(field).strip() + ">"
            element += "</searchResult>"
            myxml += element
        myxml += "</searchResults>"
        # attach the combined xml to the last result so it shows up as a field
        if results:
            results[-1][outfield] = myxml
        si.outputResults(results)
    except Exception, e:
        import traceback
        stack = traceback.format_exc()
        si.generateErrorResults("Error '%s'. %s" % (e, stack))

Friday, July 25, 2014

Today I'm going to talk about writing custom search commands in Python. You can use them with search operators in Splunk this way:
index=_internal | head 1 | smplcmd

rrajamani-mbp15r:splunkb rrajamani$ cat etc/system/local/commands.conf
[smplcmd]
filename = smplcmd.py
streaming = true
local = true
retainsevents = true
overrides_timeorder = false
supports_rawargs = true


rrajamani-mbp15r:splunkb rrajamani$ cat ./etc/apps/search/bin/smplcmd.py
#!/usr/bin/python
import splunk.Intersplunk as si
import time
if __name__ == '__main__':
    try:
        keywords,options = si.getKeywordsAndOptions()
        defaultval = options.get('default', None)
        results,dummyresults,settings = si.getOrganizedResults()
        # pass through
        si.outputResults(results)
    except Exception, e:
        import traceback
        stack = traceback.format_exc()
        si.generateErrorResults("Error '%s'. %s" % (e, stack))

Next, we will write a custom command that transforms search results to XML.

This summer I'm going to devote a series of detailed posts to implementing Splunk entirely in .NET. Being a git-based developer, we will write some lightweight NuGet packages and adopt test-driven development and continuous integration on a git repository to go with it. Effectively, we will build SplunkLite in .NET.

Wednesday, July 23, 2014

In tonight's post we continue the discussion on file security checks for path names. Some of these checks are internalized by the operating system APIs. The trouble with path names is that they come from untrusted users and, as with all strings, carry the risk of buffer overruns. In addition, a path might point to a device or pseudo-device location that passes for a path but amounts to a security breach. Even if the application runs with low privilege and does not require administrator rights, failing to validate path names adequately on Windows introduces exploitable vulnerabilities, including gaining access to the application or redirection to invoke malicious software; the application can be made to do something other than what it was intended to do. Safeguards include validating local and UNC paths and securing access with ACLs. Device driver, printer, and registry paths should be avoided. It is preferable to treat the path as opaque and interpret it with OS APIs rather than parsing it ourselves, although some simple checks are not ruled out, and the level of security should be modulated with the rest of the application: it is not right to bar the window if the door is open. The choice of API matters too; a single API call can often perform most of the checks we want.
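
As an illustration only, here is a minimal sketch of the kind of simple checks described above; the prefix list, device names, and helper name are our own choices and by no means exhaustive:

# Illustrative sketch of simple path sanity checks before handing the
# path to the OS; the prefix list and helper name are our own choices.
import os

_DEVICE_PREFIXES = ("\\\\?\\", "\\\\.\\")          # \\?\ and \\.\ extended/device prefixes
_PSEUDO_DEVICES = {"con", "prn", "aux", "nul", "com1", "lpt1"}

def looks_safe(path, max_len=260):
    if not path or len(path) > max_len:
        return False                               # guard against empty or overlong input
    if path.startswith(_DEVICE_PREFIXES):
        return False                               # reject device and extended-length prefixes
    stem = os.path.splitext(os.path.basename(path))[0].lower()
    if stem in _PSEUDO_DEVICES:
        return False                               # reject Windows pseudo-device names
    # beyond these simple checks, treat the path as opaque and let the OS decide
    return os.path.isfile(os.path.normpath(path))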

Tuesday, July 22, 2014

We will discuss some configuration file entries for Splunk, particularly those related to path specifiers, say for the certificates used to launch Splunk in https mode: their syntax, semantics, and migration issues. When Splunk is configured to run in https mode, the user sets a flag called enableSplunkWebSSL and two paths for the certificates: the private key (privKeyPath) and the certification authority certificate (caCertPath). The path specified with these keys is considered relative to the splunk_home directory. However, users may choose to keep the certificates wherever they like, so the paths could include '..' specifiers. Paths could also start with '/', which on Unix-style machines is the absolute-path specifier and is generally not supported when the path is taken as relative.
Since the user can store certificates anywhere on the machine, the path could instead be read as an absolute path. That way the user can specify the path directly, without the cumbersome '..' notation, and the paths are treated the same as the other configuration keys for Splunk. Other than that there are no advantages.
Now let's look at the caveats for converting relative to absolute paths.
First, if the keys were specified, then Splunk was working in https mode, so the certificates exist on the target. If the certificates are found under splunk_home, then during migration we can normalize the paths and convert them to absolute paths. If the certificates are found under the root by way of '..' entries in the path, this too can be made absolute with something like os.path.normpath(os.path.join(os.getcwd(), path)) in the migration script. If the certificates are not found by either means, these keys should be removed so that Splunk can launch in the default http mode (although this constitutes a change in behavior).
Now that absolute paths have been specified in the configuration files, splunkd can assume that they point directly to the certificates and need not prepend splunk_home. So it first checks whether the certificates are found at the path as given. Next it checks whether the certificates are found under splunk_home with the path appended. This step cannot be avoided because we cannot rely on the migration script all the time; the user can change the settings anytime after first run. We could rely on the '/' prefix, since the migration script makes paths absolute with a '/' prefix, and if it is missing proceed to look for the certificates under splunk_home. However, the '/' prefix only applies on Linux; on Windows we don't have that luxury, so something like os.path.isabs(x) may need to be implemented and used by splunkd. Besides, paths on Windows have several security issues: for example, we should not allow paths beginning with \\?\ or other device and pseudo-device specifiers, so merely checking whether the path exists may not be enough. Certificates should also not be on remote machines.
Finally, with the new change to support both absolute and relative paths, the splunkd process assumes that most paths encountered are absolute. These paths need to be checked for prefixes, length, and validity before the certificates are looked up under them. If the certificates are not found, either because they don't exist or because they are not accessible, and the path is relative, we look for the certificates under splunk_home; if that doesn't work we error out.
if (absolute)
    check_and_return
if (relative)
    check_and_return
error_and_escape
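
A minimal sketch of that resolution order in Python, assuming a SPLUNK_HOME environment variable and a hypothetical helper name; the real splunkd logic is in C++ and more thorough than this:

# Sketch of the certificate path resolution order described above.
# SPLUNK_HOME and the error message wording are assumptions for illustration.
import os

def resolve_cert_path(configured_path):
    splunk_home = os.environ.get("SPLUNK_HOME", "/opt/splunk")
    if os.path.isabs(configured_path):
        candidate = os.path.normpath(configured_path)
        if os.path.isfile(candidate):
            return candidate                      # absolute path points at the cert
    candidate = os.path.normpath(os.path.join(splunk_home, configured_path))
    if os.path.isfile(candidate):
        return candidate                          # relative path resolved under splunk_home
    raise ValueError("certificate not found for configured path: %s" % configured_path)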

Today we discuss another application for Splunk. I want to spend the next few days reviewing the implementation of some core components of Splunk, but for now I want to talk about API monitoring. Splunk exposes a REST API for its features, called from the UI and by the SDKs, and these API calls are logged in the web access log. The same APIs can be called from mobile applications on Android devices and iPhones/iPads. The purpose of this application is to get statistics from API calls, such as the percentage of calls that encountered an error, the number of internal server errors, and the number and distribution of timeouts. With the statistics gathered, we can set up alerts on exceeded thresholds. Essentially, this is along the same lines as Mashery's API management solution. While the APIs monitored by Mashery help study traffic from all devices to the API providers, in this case we are talking about the traffic to a Splunk instance from enterprise users. Mobile apps are not currently available for Splunk, but when they are, this kind of application would help troubleshoot them as well because it would show the differences between those devices and other callers.
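
To make the statistics concrete, here is a rough sketch of the kind of numbers involved, computed directly over raw access log lines; the log path and the combined-log-format parsing are assumptions for illustration:

# Rough sketch: error-rate statistics from web access log lines.
# The file path and the position of the status code are assumptions
# based on a combined-log-style format.
from collections import Counter

def status_stats(path):
    counts = Counter()
    total = 0
    with open(path) as f:
        for line in f:
            parts = line.split('"')
            if len(parts) < 3:
                continue
            fields = parts[2].split()
            if not fields:
                continue
            counts[fields[0]] += 1             # status code follows the quoted request
            total += 1
    errors = sum(c for s, c in counts.items() if s.startswith(("4", "5")))
    return {
        "total": total,
        "error_pct": 100.0 * errors / total if total else 0.0,
        "internal_server_errors": counts.get("500", 0),
    }

print(status_stats("var/log/splunk/web_access.log"))
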
The way Mashery works is with the use of an http/s proxy. In this case, however, we rely on the logs directly, assuming that all the data we need is available in them. The difference between searching the logs and running this application is that the application has continuous visualization and fires alerts.
This kind of application also differs from a REST modular input: the latter indexes the responses from the APIs, whereas here we are not interested in the responses but in the response codes. At the same time we are also interested in the user-agent and other header information to enrich our stats, so long as they are logged.
Caching is a service available in Mashery or from applications such as AppFabric, but it is more likely a candidate feature for Splunk itself than for this application, given the type of input to the application. Caching works well when requests and responses are intercepted, but this application is expected to use the log as its input.

Monday, July 21, 2014

Continuing from the previous post, we were discussing a logger for software components. In today's post we look at the component registration of logging channels. Initially a component may just specify a name (string) or an identifier (GUID) to differentiate its logging channel, but the requirement that each new component specify a new channel is not usually enforced. Furthermore, logging at all levels is left to the discretion of the component owners, and this is generally inadequate. Besides, some components are considered too core to be of any interest to users, and consequently their logging is left out. With the new logger, we require that the components have a supportability review and that they are enabled to log machine data without restriction on size or frequency, while at the same time supporting a lot more features.
Hence one of the improvements we require from component registration is metadata for the component's logging channel. This metadata includes, among other things, the intended audience, frequency, error message mapping for corrective actions, support for payloads, grouping, etc. In other words, it helps the logging consumer take appropriate actions on the logging payload. Today the consumer decides whether to flush to disk, send to logging subscribers, or redirect to a database. It slaps headers on the data, for example for the listener when sending over the network, takes different actions when converting the data to binary mode, supports operations such as compression and encryption, and maintains different modes of operation, such as performance-oriented with fast flush to disk or feature-oriented as above. Throttling and resource management of logging channels is possible via redirection to a null queue.
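
As an illustration of the kind of registration metadata we mean, here is a hypothetical sketch; the field names and the registry itself are made up, not an existing Splunk or logging-library API:

# Hypothetical channel registration metadata; field names are illustrative.
CHANNEL_REGISTRY = {}

def register_channel(component, audience="admin", expected_rate_per_sec=10,
                     error_actions=None, supports_payload=False, group=None):
    """Record per-channel metadata so the logging consumer can decide how
    to route, throttle, or enrich entries from this component."""
    CHANNEL_REGISTRY[component] = {
        "audience": audience,                    # intended reader of the channel
        "expected_rate_per_sec": expected_rate_per_sec,
        "error_actions": error_actions or {},    # error code -> corrective action text
        "supports_payload": supports_payload,    # can entries carry a binary payload
        "group": group,                          # for grouping related channels
    }

register_channel("HttpListener", audience="support",
                 error_actions={"SSL0001": "re-check the configured certificate paths"})
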
In general, a sliding window protocol could be implemented for the logging channel, with support for sequence numbers; many of its features can be compared to a TCP implementation. TCP has several features such as reordering and flow control, but for our purposes we don't have reordering issues.

Sunday, July 20, 2014

In today's post we continue to investigate applications of Splunk. One of them is supportability. Processes, memory, CPU utilization, file descriptor usage, and system call failures account for the bulk of the failures that require supportability measures. The most important of the supportability measures is logging, and although all components log, most of the fear around verbose logging has centered on pollution of the logs. In fact, the most heavily used components lack helpful logging precisely because they are used so often that logging them would rapidly grow the log to an overwhelming size. Such a log is offensive to admins, who view the splunkd log as actionable and for their eyes only.
Searches, on the other hand, have their own logs, generated for the duration of the session. Search artifacts are a blessing for across-the-board troubleshooting: they can be turned to debug mode, the generated log file persists only for the duration of the user session that invoked the search, and it does not bother the admins.
What is required for the components that don't log even to the search logs, because they are so heavily used or run at times other than searches, is to combine the search-log technique with this kind of logging.
The call to action is not just for components to log more, support logging to a different destination, or have grades of logging, but to fundamentally allow a component to log without any concern for resources or impact. The component can specify flags for concerns such as logging levels or actions. A mechanism may also be needed for loggers to specify round robin.
The benefit of a round-robin in-memory log buffer is the decoupling of producers from consumers. We will talk about logging improvements a lot more and cover many aspects, but the goal for now is to cover just this.
The in-memory buffer is entirely owned by the application, and as such the components can be given the slot number to write to. The entries will follow some format that we will discuss later. There is only one consumer for this in-memory buffer, and it services one or more out-of-process consumers that honor the user/admin's choices for destination, longevity, and transformations.
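
A toy sketch of such a buffer, with many producers and a single draining consumer; the slot count, locking strategy, and drain hand-off are deliberate simplifications:

# Toy in-memory round-robin log buffer: many producers, one consumer.
# Slot count, locking, and the drain hand-off are deliberate simplifications.
import threading

class RingLogBuffer(object):
    def __init__(self, slots=1024):
        self._slots = [None] * slots
        self._next = 0
        self._lock = threading.Lock()

    def write(self, component, message):
        """Producers call this; old entries are overwritten when the buffer wraps."""
        with self._lock:
            slot = self._next
            self._slots[slot] = (component, message)
            self._next = (self._next + 1) % len(self._slots)
            return slot                          # the slot number the component wrote to

    def drain(self):
        """The single consumer copies out current entries and hands them to
        out-of-process consumers (destination, longevity, transformations)."""
        with self._lock:
            entries = [e for e in self._slots if e is not None]
            self._slots = [None] * len(self._slots)
            self._next = 0
        return entries

buf = RingLogBuffer(slots=8)
buf.write("HttpListener", "listening for requests")
print(buf.drain())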