Saturday, January 11, 2014

Today I'm going to read from a book called Jumpstart Node.js by Don Nguyen.
Just a quick reminder that Node.js is a platform for writing server-side applications. It achieves high throughput via non-blocking I/O and a single-threaded event loop. Node.js includes a built-in HTTP server library, so Apache or lighttpd is not required.
The book cites WordSquared, an online, realtime, infinite game of Scrabble, as an introductory example of a Node.js application.
Node.js is available from GitHub or via a package manager.
On top of the HTTP server sits a framework called Connect, which provides middleware for cookies, sessions, logging and compression, to name a few. On top of Connect is Express, which adds support for routing, templates and a view rendering engine.
Node.js is minimalistic. Access to files on the web server is provided via the fs module, while Express and its route handlers come from the express module and a routes module, respectively.
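As a rough sketch of how these pieces fit together (assuming Express is installed via npm; the index.html file name, route paths and port below are placeholders, not anything prescribed by the book), a tiny server might look like this:

var express = require('express');
var fs = require('fs');
var app = express();

// A route handled by Express; Connect-style middleware could be added with app.use().
app.get('/hello', function (req, res) {
  res.send('hello from Express');
});

// Serve a file read through the fs module.
app.get('/', function (req, res) {
  fs.readFile('index.html', 'utf8', function (err, html) {
    if (err) return res.send('could not read file');
    res.send(html);
  });
});

app.listen(3000);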
Node.js makes heavy use of callback functions, since the platform is asynchronous. For example:
setTimeout(function(){ console.log('example'); }, 500);
The line after this statement executes immediately, while 'example' is printed only after the timeout.
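To make the non-blocking behavior concrete, here is a minimal sketch (data.txt is just a placeholder file name) showing that the statement after an asynchronous call runs before its callback fires:

var fs = require('fs');

// The read is handed off to the event loop; execution continues immediately.
fs.readFile('data.txt', 'utf8', function (err, contents) {
  if (err) throw err;
  console.log('second: file contents are ready');
});

console.log('first: this line runs before the file has been read');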
Node.js picks up changes to code only on restart. This can become tedious after a while, so the node supervisor can be installed to automatically restart the server whenever a file changes.
MongoLab is a cloud-based NoSQL provider and comes in useful for applications requiring a database.
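A minimal sketch of connecting to such a hosted database from Node.js with the mongodb driver might look like the following (the connection string, database and collection names are placeholders for whatever MongoLab assigns, not values from the book):

var MongoClient = require('mongodb').MongoClient;

// The URI would come from the MongoLab dashboard for the provisioned database.
var uri = 'mongodb://user:password@ds012345.mongolab.com:12345/mydb';

MongoClient.connect(uri, function (err, db) {
  if (err) throw err;
  var stocks = db.collection('stocks');
  stocks.insert({ code: 'ABC', price: 10.5 }, function (err, result) {
    if (err) throw err;
    console.log('inserted one document');
    db.close();
  });
});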
Backbone is an MVC framework for the client side. It can be combined with Node.js to provide a rich, realtime user interface.
To create a custom stock ticker, a filter on the stock's code could be implemented. When the user submits a request, Backbone makes a request to the server-side API and the returned data is placed into a model on the client side. Subsequent changes are made to the model, and bindings specify how those changes should be reflected in the user interface. To display the view, we have an initialize function and a setVisibility function: in initialize, a change in the model is bound to setVisibility, and in setVisibility we query the model's properties and show or hide the view accordingly. When the filter is applied, the stock list is thus updated.
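A rough sketch of that pattern in Backbone (the attribute names and filter logic here are made up for illustration; this is not the book's exact code) could look like:

var StockView = Backbone.View.extend({
  initialize: function () {
    // Bind changes in the model to the view's visibility logic.
    this.model.on('change', this.setVisibility, this);
  },
  setVisibility: function () {
    // Query the model's properties and show or hide the view accordingly.
    var filter = this.model.get('filter');   // e.g. a stock code typed by the user
    var visible = !filter || this.model.get('code') === filter;
    this.$el.toggle(visible);
  }
});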
In the previous post, we examined some tools on Linux to troubleshoot system issues. We continue our discussion with high CPU utilization issues. One approach is to read logs. Another approach is to take a core dump and restart the process; the ps and kill commands come in very useful for taking a dump. By logs we mean performance counter logs. On Linux these can come from the sar or vmstat tools, which can run in sampling mode. The logs help identify which core in a multicore processor is being utilized and whether the process running on that core has any processor affinity.
User workloads, if present, are also important to analyze. High CPU utilization could be triggered by a workload. This matters not merely because the workload gives insight into which component is being exercised, but also because it gives an idea of how to reproduce the problem deterministically. Narrowing down the scope of the occurrence sheds a lot of light on the underlying issue with the application: when the problem occurs, which components are likely affected, what's on the stack, and which frame is likely at the top of the stack. If there are deterministic steps to reproduce the problem, we can repeatedly trigger the situation for a closer study. In such cases the frame and the source code, in terms of module, file and line, can be identified, and then a resolution can be found.
Memory utilization is also a very common issue. There are two approaches here as well. One is to add instrumentation, either via the linker or via tracing, to see the call sequences that lead to memory allocations. Another is to use external tools that capture stack traces at every allocation, so that the application's memory footprint shows which allocations have not been freed and points to the corresponding offending code. Heap allocations are generally tagged to identify memory corruption issues. These tools work on the principle that the tags placed at the beginning and end of an allocation are not expected to be overwritten by the process's own code; any write access to the tags is likely from memory-corrupting code, and a stack trace captured at that moment will reveal the code path. This is useful for allocations and deallocations of all sizes.
Leaks and corruptions are two different syndromes that need to be investigated and resolved differently.
In the case of leaks, a code path may continuously leak memory when invoked. Tagging all allocations and capturing the stack at each allocation, or reducing the scope to a few components and tracking the objects they create, can give insight into which object or allocation is being missed. Corruption, on the other hand, is usually non-deterministic and can be caused by things such as timing issues, and the location of the corruption may also be random. Hence it's important to identify, from the pattern of corruption, which component is likely involved and whether minimal instrumentation can be introduced to track all objects with that memory footprint.

Friday, January 10, 2014

Troubleshooting high CPU usage in C++ programs on Linux requires the same discipline and order in debugging as it does anywhere else. A thread in the program could be hogging the CPU because it's spinning in a loop or because it's executing a computationally intensive routine, so high CPU utilization does not always mean a usefully busy thread. The uptime tool reports load averages over the last one, five and fifteen minutes. Using top, we can see which processes are the biggest contributors to the problem. sar is yet another tool that gives more detailed information on CPU utilization trends; its data is also available offline for use with tools like isag to generate charts or graphs. The raw data for sar is stored under /var/log/sa, where the various files represent the days of the respective month. The ps and pstree commands are also useful for system analysis; the ps -L option gives thread-level information. The mpstat command reports on each available CPU on a multiprocessor server, along with the global average activity across all CPUs. The KDE System Guard (KSysGuard) is the KDE task manager and performance monitor; it enables monitoring of local and remote hosts. The vmstat tool provides information about processes, memory, paging, block I/O, traps and CPU activity, and it can display either average data or actual samples. It reports CPU time in four categories:
us: time spent running non-kernel code (user time, including nice time)
sy: time spent running kernel code (system time)
id: time spent idle,
wa: time spent waiting for IO.
vmstat can run in a sampling mode.
Yielding between such computations is good programming practice, but knowing where to add those yields to relieve the CPU first requires figuring out who the culprit is.
Valgrind is a useful tool for detecting memory leaks. The GNU C library also comes with built-in functionality to help detect memory issues; however, it does not log the call stacks of the allocations it tracks. In addition, static code analysis tools can detect many code issues much earlier.
Event monitoring software can accelerate software development and test cycles. Event monitoring data is usually machine data generated by IT systems. Such data enables real-time searches to gain insight into the user experience, and dashboards with charts can then help analyze it. This data can be accessed over TCP, UDP and HTTP, and it can also be warehoused for analysis. Issues that frequently recur can be documented and searched more quickly with such data available, leading to faster debugging and problem solving. For example, the data can be queried to identify errors in the logs, which could then be addressed remotely.
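As one concrete illustration of how machine data reaches such a system, a small Node.js sketch that forwards a log event to a collector over TCP might look like the following (the host, port and payload shape are hypothetical):

var net = require('net');

// Connect to a (hypothetical) event-monitoring collector and send one JSON event per line.
var client = net.connect(5140, 'collector.example.com', function () {
  var event = { level: 'error', message: 'payment service timed out', time: Date.now() };
  client.write(JSON.stringify(event) + '\n');
  client.end();
});

client.on('error', function (err) {
  console.error('could not reach collector:', err.message);
});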
Machine data is massive and generated in streams. Being able to quickly navigate that volume to find the most relevant information for triaging issues is a differentiating factor for event monitoring software. Early-warning notifications, a rules engine, and trend detection are features that not only enable rapid development and test by providing feedback on deployed software but also increase customer satisfaction as code is incrementally built and released.
Data is available to be collected, indexed, searched and reported on. Applications can target specific interests such as security, or correlations for building rules and alerts. The data is also varied: from the network, from applications, and from enterprise infrastructure. Powerful querying increases the usability of such data. For example, while security data may inform about known threats, the ability to include non-security user and machine data may add insight into unknown threats. Queries could also cover automated anomaly and outlier detection to help with understanding advanced threats. Queries over such key-value data can be written using Pig operators such as load/read, store/write, foreach/iterate, filter/predicate, group/cogroup, collect, join, order, distinct, union, split, stream, dump and limit.
The depth and breadth of possibilities with event monitoring data seems endless. As more data becomes available and richer, more powerful analytical techniques grow with it, which will help arm developers and operations engineers to better address the needs of the organization. Some of the differentiators of such software include having one platform, fast return on investment, the ability to use different data collectors, support for non-traditional flat-file data stores, the ability to create and modify reports, the ability to create baselines and study changes, programmability to retrieve information as appropriate, and coverage of compliance, security, fraud detection and so on. Whether applications are able to build on the event monitoring software will be evident from the number of applications written against it.

Thursday, January 9, 2014

// Wildcard match: '?' matches exactly one character, '*' matches any run of characters (possibly empty).
bool IsMatch(string input, string pattern)
{
    if (string.IsNullOrWhiteSpace(input) || string.IsNullOrWhiteSpace(pattern)) return false;

    // Split the pattern on '*'; each segment is a run of literals and '?' wildcards.
    var segments = pattern.Split('*');

    // No '*' at all: the pattern must cover the entire input.
    if (segments.Length == 1)
        return segments[0].Length == input.Length && SegmentMatchesAt(input, 0, segments[0]);

    // The first segment is anchored at the start of the input.
    if (!SegmentMatchesAt(input, 0, segments[0])) return false;
    int pos = segments[0].Length;

    // Middle segments: slide to the earliest position where each fits; the surrounding '*' absorbs the gap.
    for (int i = 1; i < segments.Length - 1; i++)
    {
        string seg = segments[i];
        bool found = false;
        for (int start = pos; start + seg.Length <= input.Length; start++)
        {
            if (SegmentMatchesAt(input, start, seg))
            {
                pos = start + seg.Length;
                found = true;
                break;
            }
        }
        if (!found) return false;
    }

    // The last segment is anchored at the end of the input.
    string last = segments[segments.Length - 1];
    int lastStart = input.Length - last.Length;
    return lastStart >= pos && SegmentMatchesAt(input, lastStart, last);
}

// Checks whether the segment matches the input at the given position, treating '?' as any single character.
bool SegmentMatchesAt(string input, int pos, string segment)
{
    if (pos < 0 || pos + segment.Length > input.Length) return false;
    for (int k = 0; k < segment.Length; k++)
        if (segment[k] != '?' && segment[k] != input[pos + k]) return false;
    return true;
}
For example, with input "ABCDBDXYZ" and pattern "A*B?D*Z", IsMatch returns true.
 
This is a read from a blog post by Julian James. iOS continuous integration builds can be set up with HockeyApp, OS X Mavericks, OS X Server, and Xcode. These are installed first, and Xcode's remote repositories are pointed at source control. Bot runs are stored in their own folder. The project scheme can be edited to include pre-actions and post-actions. Bots can be given a schedule and a target. OCMock and OCHamcrest can be used for unit testing. Completion of the archive post-action signals the availability of the latest build; Instruments can then be run with a JavaScript file to test the UI. When the bot is run, it shows the number of tests passed.
JavaScript for UI automation can make use of the iOS UI Automation library reference, which has an object model for all the UI elements and for navigating them. Workflows are automated using the methods on these objects, such as UIANavigationBar. The target is obtained with UIATarget, the application is referred to with UIAApplication, and pages are available via UIAPageIndicator. A variety of controls are available via UIAButton, UIAElement, UIAPicker, UIAPickerWheel, UIAPopover, UIASearchBar, UIASecureTextField, UIATabBar, and UIAWebView. UI elements can be organized with UIATableCell, UIATableGroup and UIATableView to access individual cells, and UIATabGroup allows navigation between tabs.
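A short sketch of what such a script can look like (the button title and cell index here are hypothetical; the UIA* calls are from Apple's UI Automation library):

var target = UIATarget.localTarget();
var app = target.frontMostApp();
var window = app.mainWindow();

// Tap a button on the navigation bar, then select the first cell in the first table view.
UIALogger.logStart("Open first item");
app.navigationBar().buttons()["Add"].tap();
window.tableViews()[0].cells()[0].tap();
UIALogger.logPass("Open first item");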

Wednesday, January 8, 2014

A recap of the five progressive steps to database normalization described by Barry Wise.
We start out with an example where we store a user's name, company, company address, and personal urls - say url1 and url2.
The zero form is when all of this is in a single table and no normalization has occurred.
The first normal form is achieved by
1) eliminating repeating groups in individual tables
2) creating a separate table for each set of data
3) identifying each set of related data with a primary key
This yields a table where the user information is repeated for each url, which removes the limit on the number of url columns.
The Second normal form is achieved by
1) Creating separate tables for a set of values that apply to multiple records
2) Relate these tables with a foreign key
Basically, we break the url values into a separate table so we can add more in the future
The third normal form is achieved by
1) eliminating fields that do not depend on the key
Company name and address have nothing to do with the user id, so they are broken off into their own table
The fourth and higher normal forms depend on data relationships involving one-to-one, one-to-many and many-to-many.
The Fourth normal form is
1) In a many-to-many relationship, independent entities cannot be stored in the same table.
To relate many users to many urls, we define a url_relations table where a user id and a url id are paired.
The next normal form is the Fifth normal form which suggests that
1) The original table must be reconstructible from the tables into which it has been broken down. This is a way to check that no new columns have been added.
As always, remember that denormalization has its benefits as well.
Also, Litt's tips additionally mention the following:
1) create a table for each list. More than likely every list will have additional information
2) create non-meaningful identifiers.
This is to make sure that business rule changes do not affect the primary identifier