In tonight's post we continue the discussion of file-security checks for path names. Some of these checks are internalized by the operating system's APIs. The trouble with path names is that they come from untrusted users and, like all strings, carry the risk of buffer overruns. In addition, a path might point to a device or pseudo-device location that passes for an ordinary file path but amounts to a security breach. Even when an application runs with low privilege and does not require administrator rights, inadequate path validation on Windows creates vulnerabilities that can be exploited, for example to gain access to the application or to redirect it into invoking malicious software; the application can be compromised away from its intended behavior. Safeguards include validating local and UNC paths and securing access with ACLs. Device-driver, printer, and registry paths should be rejected. It is preferable to treat the path as opaque and let the OS APIs interpret it rather than parsing it ourselves, though some simple checks are still worthwhile, and the level of security should be matched to the rest of the application: there is no point barring the window when the door is open. The choice of API also matters; a single API call can often perform most of the checks we want.
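As a minimal sketch of a few such rejections (the exact checks are my own choices, not a complete validator), here is how some of the dangerous Windows path forms could be screened in Python:

import ntpath

# Reserved Windows device names that can masquerade as file names.
RESERVED = {"CON", "PRN", "AUX", "NUL"} | \
           {"COM%d" % i for i in range(1, 10)} | \
           {"LPT%d" % i for i in range(1, 10)}

def is_suspicious_windows_path(path):
    # \\?\ and \\.\ prefixes bypass normal Win32 path parsing.
    if path.startswith("\\\\?\\") or path.startswith("\\\\.\\"):
        return True
    # Reject any component that names a reserved device, even with an
    # extension (e.g. 'CON.txt' still resolves to the console device).
    for part in ntpath.normpath(path).split("\\"):
        if part.split(".")[0].upper() in RESERVED:
            return True
    return False

UNC paths (\\server\share) would need their own validation on top of this, as would length limits; the point is that the cheap checks come before handing the path to the OS.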
Wednesday, July 23, 2014
Tuesday, July 22, 2014
We will discuss some Splunk configuration-file entries, particularly the path specifiers for the certificates used to launch Splunk in https mode: their syntax, semantics, and migration issues. When Splunk is configured to run in https mode, the user sets a flag called enableSplunkWebSSL and two certificate paths: the private key (privKeyPath) and the certification-authority certificate (caCertPath). The paths specified with these keys are taken relative to the splunk_home directory. However, users can keep their certificates wherever they like, so the paths may include '..' specifiers. On unix-style machines a path can also start with '/', but that prefix is generally not supported when the path is taken as relative, because a leading '/' marks an absolute path.
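To illustrate the relative-path semantics, a small Python sketch (the helper name is hypothetical; only the key names come from the configuration):

import os

def resolve_cert_key(splunk_home, configured_path):
    # Relative values such as 'etc/auth/splunkweb/privkey.pem' or
    # '../certs/privkey.pem' are taken relative to splunk_home.
    return os.path.normpath(os.path.join(splunk_home, configured_path))

print(resolve_cert_key("/opt/splunk", "../certs/privkey.pem"))
# -> /opt/certs/privkey.pem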
Since the user can store certificates anywhere on the machine, the path could instead be read as an absolute path. That way the user can specify the location directly, without the cumbersome '..' notation, and the paths are treated the same as the other Splunk configuration keys. Beyond that there are no advantages.
Now let's look at the caveats of converting relative paths to absolute ones.
First, if these keys were specified, then Splunk was running in https mode, so the certificates exist on the target. If the certificates are found under splunk_home during migration, we can normalize the paths and convert them to absolute ones. If the certificates are reached from the root by way of '..' entries in the path, this too can be made absolute with something like os.path.normpath(os.path.join(os.getcwd(), path)) in the migration script. If the certificates are not found by either means, these keys should be removed so that Splunk falls back to its default http mode (although this constitutes a change in behavior).
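A hedged sketch of that migration logic in Python (the function name is hypothetical; the real script may differ):

import os

def migrate_cert_path(splunk_home, configured_path):
    # Case 1: the certificate is under splunk_home; normalize to absolute.
    candidate = os.path.normpath(os.path.join(splunk_home, configured_path))
    if os.path.isfile(candidate):
        return candidate
    # Case 2: the path climbs out via '..'; resolve against the cwd.
    candidate = os.path.normpath(os.path.join(os.getcwd(), configured_path))
    if os.path.isfile(candidate):
        return candidate
    # Case 3: not found either way; drop the key so Splunk reverts to http.
    return None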
Now that absolute paths appear in the configuration files, splunkd can assume they point directly at the certificates and need not prepend splunk_home. So it first checks whether the certificates exist at the path as given. Next it checks whether they are found under splunk_home with the path appended. This second step cannot be skipped, because we cannot rely on the migration script alone: the user can change the settings any time after first run. We could key off the '/' prefix, since the migration script writes absolute paths beginning with '/', and fall back to looking under splunk_home when it is missing; but the '/' prefix only works on linux, and on windows we don't have that luxury. An equivalent of os.path.isabs(x) may need to be implemented and used by splunkd. Moreover, paths on windows raise several security issues: for example, we should not allow paths that begin with \\?\ or that name devices and pseudo-devices. Merely checking that the path exists is not enough. And certificates should not live on remote machines.
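As a sketch of the extra validation splunkd would need (an assumption on my part, not the shipped code):

import os

def is_acceptable_cert_path(path):
    # An equivalent of os.path.isabs: handles '/...' on unix and
    # 'C:\...' on windows.
    if not os.path.isabs(path):
        return False
    # Reject \\?\ and device/pseudo-device prefixes on windows.
    if path.startswith("\\\\?\\") or path.startswith("\\\\.\\"):
        return False
    # Certificates should not live on remote machines, so reject UNC paths.
    if path.startswith("\\\\"):
        return False
    return os.path.isfile(path)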
Finally, with the new support for both absolute and relative paths, splunkd assumes that most paths it encounters are absolute. These paths need to be checked for prefixes, length, and validity before the certificates are looked up under them. If the certificates are not found, either because they don't exist or because they are not accessible, and the path is relative, we look for the certificates under splunk_home; if that also fails, we error out.
import os

def resolve_cert_path(path, splunk_home):
    if os.path.isabs(path) and os.path.isfile(path):
        return path  # absolute: check and return directly
    candidate = os.path.join(splunk_home, path)
    if os.path.isfile(candidate):
        return candidate  # relative: found under splunk_home
    raise IOError("certificate not found: " + path)  # error and escape
Today we discuss another application for Splunk. I want to spend the next few days reviewing the implementation of some core components of Splunk, but for now I want to talk about API monitoring. Splunk exposes a REST API for its features, called from the UI and from the SDKs, and these API calls are logged in the web access log. The same APIs can be called from mobile applications on Android devices and iPhones/iPads. The purpose of this application is to gather statistics from the API calls, such as the percentage of calls that returned an error, the number of internal server errors, and the number and distribution of timeouts. With those statistics gathered, we can set up alerts for thresholds being exceeded. Essentially, this is along the same lines as the Mashery API management solution. While the APIs monitored by Mashery help study traffic from all devices to the API providers, here we are talking about traffic to a Splunk instance from enterprise users. Mobile apps are not currently available for Splunk, but when they are, this kind of application will help troubleshoot them as well, because it will show the differences between those devices and other callers.
Mashery works by means of an HTTP(S) proxy. Here, however, we rely on the logs directly, assuming all the data we need is available in them. The difference between searching the logs and running this application is that the application provides continuous visualization and fires alerts.
This kind of application differs from a REST modular input: the latter indexes the responses from the APIs, whereas here we care not about the responses but about the response codes. At the same time we are interested in the user-agent and other header information to enrich our statistics, so long as they are logged.
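As a rough illustration of the statistics, assuming the access log follows the common/combined format (that layout is an assumption), Python along these lines would do:

import re
from collections import Counter

# Assumes the status code appears as '" 404 ' right after the quoted
# request, as in common/combined log format.
STATUS = re.compile(r'" (\d{3}) ')

def status_stats(log_lines):
    counts = Counter()
    for line in log_lines:
        match = STATUS.search(line)
        if match:
            counts[match.group(1)] += 1
    total = sum(counts.values()) or 1
    errors = sum(v for k, v in counts.items() if int(k) >= 400)
    return {"error_pct": 100.0 * errors / total,
            "internal_server_errors": counts["500"]}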
Caching is a service available in Mashery or from applications such as AppFabric, but it is more likely a candidate feature for Splunk itself than for this application, given the application's input. Caching works well when requests and responses are intercepted, whereas this application is expected to use the log as its input.
Monday, July 21, 2014
Continuing from the previous post, we were discussing a logger for software components. In today's post we look at the component registration of logging channels. Initially a component may just specify a name (string) or an identifier (guid) to differentiate its logging channel, but requiring each new component to register a new channel is not usually enforced. Furthermore, logging at all levels is left to the discretion of the component owners, which is generally inadequate. Besides, some components are considered too core to be of interest to users, so their logging is left out entirely. With the new logger, we require that components undergo a supportability review and that they be enabled to log machine data without restriction on size or frequency, while supporting many more features.
Hence one of the improvements we require from component registration is metadata for the component's logging channel. This metadata includes, among other things, the intended audience, the expected frequency, error-message mappings for corrective actions, payload support, and grouping. In other words, it helps the logging consumer take appropriate actions on the logging payload. Today the consumer decides whether to flush to disk, send to logging subscribers, or redirect to a database; it slaps headers on the data (for example, for a listener when sending over the network), takes different actions when converting the data to binary mode, supports operations such as compression and encryption, and maintains different modes of operation, whether performance-oriented with fast flushes to disk or feature-oriented as above. Throttling and resource management of logging channels is possible via redirection to a null queue.
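To make this concrete, here is a sketch of what channel registration metadata might look like (the class and field names are my own, mirroring the list above):

class LogChannelMetadata(object):
    def __init__(self, name, audience, frequency, error_actions,
                 supports_payload, group):
        self.name = name                    # channel name or guid
        self.audience = audience            # e.g. 'admin', 'support'
        self.frequency = frequency          # expected events per second
        self.error_actions = error_actions  # message id -> corrective action
        self.supports_payload = supports_payload
        self.group = group                  # grouping for the consumer

channel = LogChannelMetadata("index-pipeline", "support", 100,
                             {"E1001": "restart the pipeline"}, True, "core")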
In general, a sliding-window protocol could be implemented for the logging channel, with support for sequence numbers; many of its features invite comparison with a TCP implementation. TCP has several features, such as reordering and flow control; for our purposes we don't have reordering issues.
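As a minimal sketch of the sequence-number idea (the window and ack mechanics are assumptions borrowed from TCP, minus reordering):

from collections import deque

class LogWindow(object):
    # The producer stamps entries with sequence numbers; the consumer
    # acks the highest sequence it has durably processed.
    def __init__(self, window_size):
        self.window_size = window_size
        self.next_seq = 0
        self.unacked = deque()

    def send(self, entry):
        if len(self.unacked) >= self.window_size:
            return False  # window full: throttle or drop
        self.unacked.append((self.next_seq, entry))
        self.next_seq += 1
        return True

    def ack(self, seq):
        # No reordering on this channel, so acks are cumulative.
        while self.unacked and self.unacked[0][0] <= seq:
            self.unacked.popleft()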
Sunday, July 20, 2014
In today's post we continue to investigate applications of Splunk. One of them is supportability. Processes, memory, CPU utilization, file-descriptor usage, and system-call failures account for the bulk of the failures that require supportability measures. The most important supportability measure is logging, and although all components log, most of the fear around verbose logging centers on polluting the logs. In fact, the most frequently used components often lack helpful logging precisely because they are exercised so often that they would rapidly grow the log to an overwhelming size. Such a log offends admins, who view the splunkd log as actionable and for their eyes only.
Searches, by contrast, have their own logs, generated for the duration of the session. Search artifacts are a blessing for across-the-board troubleshooting: the logging can be turned to debug mode, the generated log file persists only for the duration of the user session that invoked the search, and it does not bother the admins.
What is required for the components that don't log even to the search logs, because they are so heavily used or are used outside of searches, is to combine the search-log technique with this kind of logging.
The call to action is not just for components to log more, support different destinations, or offer grades of logging, but fundamentally to allow a component to log without any concern for resources or impact. The component can specify flags for concerns such as logging levels or actions. A mechanism may also be needed for loggers to request a round-robin buffer.
The benefit of a round-robin in-memory log buffer is the decoupling of producers from consumers. We will talk about logging improvements a lot more and cover many aspects, but the goal for now is just this.
The in-memory buffer is entirely owned by the application, so components can be given the slot number to write to. The entries will follow some format that we will discuss later. There can be only one consumer of this in-memory buffer, and it services one or more out-of-process consumers that honor the user's or admin's choices for destination, longevity, and transformations.
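A minimal sketch of such a buffer, assuming slot-number reservation and a single draining consumer as described:

import threading

class RingLogBuffer(object):
    # Producers reserve slot numbers; a single in-process consumer drains
    # the buffer for the out-of-process consumers. Oldest entries are
    # overwritten when the buffer wraps around.
    def __init__(self, num_slots):
        self.slots = [None] * num_slots
        self.cursor = 0
        self.lock = threading.Lock()

    def reserve(self):
        with self.lock:
            slot = self.cursor
            self.cursor = (self.cursor + 1) % len(self.slots)
            return slot

    def write(self, slot, entry):
        self.slots[slot] = entry  # the producer owns its reserved slot

    def drain(self):
        with self.lock:
            return [e for e in self.slots if e is not None]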
Saturday, July 19, 2014
Today we will look at the Markov chain a little bit more to answer the following questions:
Will the random walk ever return to where it started?
If yes, how long will it take?
If not, where will it go?
If we take the probability that the walk (started at 0) attains value 0 in n steps as u(n) for even n, we now want to find the probability f(n) that the walk will return to 0 for the first time in n steps.
Let Xn, n >= 0, be a Markov chain with transition probabilities pij, and let St, t >= 0, be independent of Xn; Sn can then be considered a selection.
The theorem for the calculation of f(n) is stated this way:
Let n be even and u(n) = P0(Sn = 0), where P0 denotes probability for the walk started at 0.
Then f(n) = P0(S1 != 0, ..., Sn-1 != 0, Sn = 0) = u(n)/(n-1).
And here's the proof:
Since the random walk cannot change sign before becoming zero,
f(n) = P0(S1 > 0, ... Sn-1 > 0, Sn = 0) + P0( S1 < 0, ... Sn-1 < 0, Sn = 0)
which comprises two equal terms.
Now,
P0(S1 > 0, ..., Sn-1 > 0, Sn = 0) = P0(Sn = 0) P0(S1 > 0, ..., Sn-1 > 0 | Sn = 0), by conditional probability.
The first factor is u(n). Given Sn = 0, Sn-1 equals 1 or -1 with equal probability, so conditioning further on Sn-1 = 1 contributes a factor of 1/2, and by the Markov property at time n-1 we can omit the event Sn = 0.
So f(n) can be rewritten as 2 · u(n) · 1/2 · P0(S1 > 0, S2 > 0, ..., Sn-2 > 0 | Sn-1 = 1)
and the last term, by the ballot theorem, is 1/(n-1), which proves that f(n) = u(n)/(n-1).
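Since u(n) = C(n, n/2) / 2^n for the simple symmetric walk, the theorem is easy to sanity-check numerically (a quick sketch; math.comb requires Python 3.8+):

from math import comb

def u(n):
    # P0(Sn = 0): choose which n/2 of the n steps go up.
    return comb(n, n // 2) / 2.0 ** n

def f(n):
    return u(n) / (n - 1)  # first-return probability from the theorem

print(f(2))  # 0.5   (up-down or down-up)
print(f(4))  # 0.125 (up-up-down-down or its mirror: 2 of 16 paths)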
To complete the series on random walks for developers from the previous two posts, we briefly summarize the goal and the process.
A random walk is a special case of a process called a Markov chain, where future events depend conditionally on the present and not on past events. It is driven by random numbers that assign a real value to events according to probabilities. The probabilities of moving between a finite set of states (sometimes just two, forward and backward, as in the case of a drunkard walking along a straight line) are called transition probabilities. Random walk leads to diffusion.
The iterations in a random walk obey the equation p(x, y) = p(x + z, y + z) for any translation z, where p(x, y) is the transition probability in the state space S.
Random walks possess certain other properties:
- time homogeneity
- space homogeneity
and sometimes skip-free transitions, which we have already discussed.
The transition probabilities are the main thing to be worked out in a random walk.
Usually this is expressed in terms of cases, such as
p(x) = pj if x = ej, for j = 1 to d in a d-dimensional vector space,
     = qj if x = -ej,
     = 0 otherwise.
We have seen how this simplifies to forward and backward motion in the case of a drunkard's linear walk.
The walk itself just iterates, calculating a new position based on the outcome of the transition probabilities.
The walk itself may be performed k times to average out the findings (hitting time) from each walk.
Each step traverses an edge of a bipartite graph, since the parity of the position alternates with each step.
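As a sketch of the whole procedure (the target position and forward bias are arbitrary choices for illustration):

import random

def hitting_time(target, p_forward, max_steps=10**6):
    # Iterate the walk: +1 with probability p_forward, else -1,
    # until the target position is hit (or we give up).
    position, steps = 0, 0
    while position != target and steps < max_steps:
        position += 1 if random.random() < p_forward else -1
        steps += 1
    return steps

# Repeat the walk k times and average the hitting time. A forward bias
# is used here because the symmetric walk's expected hitting time is
# infinite.
k = 1000
print(sum(hitting_time(10, 0.6) for _ in range(k)) / float(k))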
http://1drv.ms/1nf79KL