Sunday, January 12, 2014

This blog post is on study material for Sun certification for J2EE:
Architecture considers the functions of components, their interfaces, and their interactions. The architecture specification provides the basis for the application design and implementation steps. This book mentions that flexible systems minimize the need to adapt by maximizing their range of normal situations. In a J2EE environment, there could be a JNDI (Java Naming and Directory Interface) agent that knows which system elements are present, where they are, and what services they offer.
Classes and interfaces for an Enterprise JavaBeans component include the following: the Home (EJBHome) interface, the Remote (EJBObject) interface, the XML deployment descriptor, the bean class, and the context objects. The Home interface provides the lifecycle operations (create, remove, find) for an EJB. The JNDI agent is used by the client to locate an EJBHome object.
The remote interface provides access to the business methods within the EJB. An EJBObject represents a client's view of the EJB and acts as a proxy for it. It exposes the application-related interfaces for the object but not the interfaces that allow the container to manage and control the object. The container implements the state management, transaction control, security, and persistence services transparently. For each EJB instance, there is a SessionContext object or an EntityContext object. The context object is used to coordinate transactions, security, persistence, and other system services.
// Service.java
package examples;

public interface Service {
   public void sayBeanExample();
}

// ServiceBean.java
package examples;

import javax.ejb.Remote;
import javax.ejb.Stateless;
import javax.ejb.TransactionAttribute;
import javax.ejb.TransactionAttributeType;
import javax.interceptor.ExcludeDefaultInterceptors;

@Stateless
@TransactionAttribute(TransactionAttributeType.NEVER)
@Remote({examples.Service.class})
@ExcludeDefaultInterceptors
public class ServiceBean implements Service {
   public void sayBeanExample() {
      System.out.println("Hello From Service Bean!");
   }
}
The import statements are used to import, for example, the metadata annotations and the InvocationContext that maintains state between interceptors. The @Stateless and @Stateful annotations specify whether the EJB is a stateless or stateful session bean, the @Remote annotation specifies the remote interface, and the @EJB annotation is used for dependency injection and specifies the dependent "ServiceBean" stateless session bean context. The @Interceptors and @ExcludeClassInterceptors annotations specify, respectively, that the bean is associated with an interceptor class and that the interceptor methods should not fire for the annotated method. The @PreDestroy method is used for cleanup.
We looked at a Node.js and Backbone solution from the book JumpStart Node.js. The book continues with an example of real-time trades appearing in the browser. Here we add a function to store the exchange data every time it changes. The first task is to store the data; then it is transformed and sent to the client.
Instead of transmitting the entire data set, it is transformed. When the client makes the initial request, we transmit this data. Client-side filters could change, so it is better to use templates. We can use jQuery's get function to retrieve the template and send an initial 'requestData' message to the server so that the initial data can be sent.
As before, we use the initialize function to call the render function. We iterate through all the models and render each row individually with a separate view for the real-time trades. With the static template, this is now easier to render than when a string was used. With the data loaded, it is easier to handle just the updates with a separate method. A rough sketch of this flow follows.
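The sketch below assumes Backbone, Underscore, and jQuery are loaded on the page and that a Socket.IO-style socket object is available for the 'requestData' message; the container element, template URL, and names (Trade, TradeList, TradeListView) are illustrative rather than the book's exact code.
var Trade = Backbone.Model.extend({});
var TradeList = Backbone.Collection.extend({ model: Trade });

var TradeListView = Backbone.View.extend({
  el: '#trades',

  initialize: function() {
    // Re-render whenever the collection gains models or a model changes.
    this.collection.on('add reset change', this.render, this);
    // Fetch the static row template once, then ask the server for the initial data.
    var self = this;
    $.get('/templates/trade-row.html', function(template) {
      self.template = _.template(template);
      socket.emit('requestData'); // assumes a Socket.IO-style connection named socket
    });
  },

  render: function() {
    if (!this.template) return this;
    var self = this;
    // Render each model as a separate row using the template.
    var html = this.collection.map(function(trade) {
      return self.template(trade.toJSON());
    }).join('');
    this.$el.html(html);
    return this;
  }
});
// Usage: var view = new TradeListView({ collection: new TradeList() });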
Heroku can be used to deploy the Node.js application.
Express.js supports both production and development settings. One of the differences between the two is how errors are handled: in development we want as much error information as possible, while in production we lock it down.
We also provide a catch-all handling any request not processed by prior routes, as sketched below.
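A minimal sketch, assuming an Express 3-style application (the version in use around the time of the book); the routes, messages, and port handling are illustrative rather than the book's exact code.
var express = require('express');
var app = express();

// Ordinary routes would be registered here, e.g. app.get('/', ...).

// Catch-all for any request not processed by prior routes.
app.use(function(req, res) {
  res.send(404, 'Page not found');
});

// Environment-specific error handling.
app.configure('development', function() {
  // In development, expose as much error information as possible.
  app.use(express.errorHandler());
});
app.configure('production', function() {
  // In production, lock it down and hide the details.
  // The four-argument signature marks this as error-handling middleware.
  app.use(function(err, req, res, next) {
    res.send(500, 'Internal server error');
  });
});

app.listen(process.env.PORT || 3000);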
When it comes to hosting the application, there are several options available, such as IaaS offerings like Amazon's EC2 or PaaS offerings like Heroku. However, the cost of all this convenience is some loss of control. In general, this is not a problem and the convenience is well worth it.
For those choosing to deploy on either a dedicated server or EC2, it is better to use an option that automatically restarts the application upon any crash or file change. node-supervisor helps in this regard, but for production it is better to use the forever package since it has minimal overhead.
Version control and Heroku deployment should go together so that we can roll back unstable changes. With incremental changes, 'git push heroku master' can then become a habit.
We did not cover Socket.IO and scoop.it.

 

Saturday, January 11, 2014

Today I'm going to read from a book called JumpStart Node.js by Don Nguyen.
Just a quick reminder that Node.js is a platform for writing server-side applications. It achieves high throughput via non-blocking I/O and a single-threaded event loop. Node.js contains a built-in HTTP server library, so Apache or lighttpd is not required.
The book cites the application WordSquared as an introduction to applications in Node.js. This is an online, real-time, infinite game of Scrabble.
Node.js is available from GitHub via a package manager
On top of the HTTP server is a framework called Connect that provides support for cookies, sessions, logging and compression, to name a few. On top of Connect is Express, which adds support for routing, templates, and a view rendering engine.
Node.js is minimalistic. Access to web server files is provided via the fs module, and the web framework and routing are available via the express and routes modules; a minimal sketch of the two working together follows.
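This sketch assumes an Express 3-style app; the file name index.html and the port are illustrative.
var fs = require('fs');
var express = require('express');
var app = express();

// Read a file from disk with the fs module and serve it from an Express route.
app.get('/', function(req, res) {
  fs.readFile('index.html', 'utf8', function(err, contents) {
    if (err) return res.send(500, 'Could not read file');
    res.send(contents);
  });
});

app.listen(3000);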
Node.js allows callback functions, and these are used widely since Node.js is asynchronous. For example:
setTimeout(function(){ console.log('example'); }, 500);
The line after this statement is executed immediately, while 'example' is printed after the timeout.
Node.js picks up changes to code only on restart. This can become tedious after a while, so node-supervisor can be installed to automatically restart the server when files change.
MongoLab is a cloud-based NoSQL provider and can come in useful for applications requiring a database, as in the sketch below.
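A rough sketch of connecting to a MongoLab-hosted database with the node mongodb driver (the 1.x/2.x-era callback API); the connection string, collection, and document are placeholders.
var MongoClient = require('mongodb').MongoClient;

var uri = 'mongodb://user:password@host.mongolab.com:27017/mydb';

MongoClient.connect(uri, function(err, db) {
  if (err) throw err;
  var trades = db.collection('trades');
  // Store a sample document and read it back.
  trades.insert({ ticker: 'NODE', price: 42 }, function(err) {
    if (err) throw err;
    trades.find({ ticker: 'NODE' }).toArray(function(err, docs) {
      console.log(docs);
      db.close();
    });
  });
});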
Backbone is an MVC framework that runs in the browser. It can be combined with Node.js to provide a rich real-time user interface.
To create a custom stock ticker, a filter for the code of the stock could be implemented. When the user submits a request, Backbone makes a request to the server-side API. The data is placed into a model on the client side. Subsequent changes are made to the model, and bindings specify how these changes should be reflected in the user interface. To display the view, we have an initialize and a setVisibility function. In the initialize function, a change in the model is bound to the setVisibility function. In the latter, we query the properties and set the view accordingly. When the filtering is applied, the stock list is thus updated, as in the sketch below.
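A rough sketch of such a view, assuming Backbone and jQuery are loaded; the model attributes ('visible', 'ticker') and the filter logic are illustrative rather than the book's exact code.
var StockRowView = Backbone.View.extend({
  tagName: 'tr',

  initialize: function() {
    // Bind changes in the model to the setVisibility function.
    this.model.on('change', this.setVisibility, this);
  },

  setVisibility: function() {
    // Query the model's properties and show or hide the row accordingly.
    if (this.model.get('visible')) {
      this.$el.show();
    } else {
      this.$el.hide();
    }
  }
});

// Applying a filter updates the model, and the binding refreshes the view, e.g.:
// stock.set('visible', stock.get('ticker') === filterText);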
In the previous post, we examined some tools on Linux to troubleshoot system issues. We continue our discussion with high CPU utilization issues. One approach is to read logs. Another approach is to take a core dump and restart the process. The ps and kill commands come in very useful for taking a dump. By logs we mean performance counter logs; on Linux these could come from the sar tool or the vmstat tool, which can run in sampling mode. The logs help identify which core in a multicore processor is utilized and whether there is any processor affinity for the code of the process running on that core.
User workloads, if present, are also important to analyze. High CPU utilization could be triggered by a workload. This is important to identify not merely because the workload gives insight into which component is being utilized, but also because it gives an idea of how to reproduce the problem deterministically. Narrowing down the scope of the occurrence throws a lot of light on the underlying issue with the application, such as knowing when the problem occurs, which components are likely affected, what is on the stack, and which frame is likely at the top of the stack. If there are deterministic steps to reproduce the problem, we can repeatedly trigger the situation for a better study. In such cases the frame and the source code, in terms of module, file, and line, can be identified. Then a resolution can be found.
Memory utilization is also a very common issue. There are two approaches here as well. One approach is to add instrumentation, either via the linker or via tracing, to see the call sequences and identify memory allocations. Another approach is to use external tools to capture stack traces at all allocations so that the application memory footprint can show which allocations have not been freed and the corresponding offending code. Heap allocations are generally tagged to identify memory corruption issues. These tools work on the principle that the tags at the beginning and the end of an allocation are not expected to be overwritten by the process code, since the allocations are wrapped with tags by the tool. Any write access on the tags is likely from memory-corrupting code, and a stack trace at such a time will point to the code path. This is very useful for all sizes of allocations and de-allocations.
Leaks and corruptions are two different syndromes that need to be investigated and resolved differently.
In the case of leaks, a code path may continuously leak memory when invoked. Tagging all allocations and capturing the stack at each allocation, or reducing the scope to a few components and tracking the objects created by the component, can give insight into which object or allocation is missed. Corruption, on the other hand, is usually non-deterministic and can be caused by such things as timing issues. The place of corruption may also be random. Hence, it is important to identify from the pattern of corruption which component is likely involved and whether minimal instrumentation can be introduced to track all objects that have such a memory footprint.

Friday, January 10, 2014

Troubleshooting high CPU usage in C++ programs on Linux requires the same discipline and order in debugging as it does anywhere else. A thread in the program could be hogging the CPU because it's spinning in a loop or because it's doing a computationally intensive routine. High CPU utilization may not always mean a busy thread; it could be spinning. The uptime tool in Linux reports load averages for the last one, five, and fifteen minutes. Using top, we can see which processes are the biggest contributors to the problem. Sar is yet another tool that can give more detailed information on CPU utilization trends. The data is often available offline for use with tools like isag to generate charts or graphs. The raw data for the sar tool is stored under /var/log/sa, where the various files represent the days of the respective month. The ps and pstree tools are also useful for system analysis; the ps -L option gives thread-level information. The mpstat command is used to report on each of the available CPUs on a multiprocessor server; global average activity among all CPUs is also reported. The KDE System Guard (KSysGuard) is the KDE task manager and performance monitor. It enables monitoring of local and remote hosts. The vmstat tool provides information about processes, memory, paging, block I/O, traps, and CPU activity. The vmstat command displays either average data or actual samples. The vmstat tool reports CPU time as one of the following four:
us: time spent running non-kernel code (user time, including nice time)
sy: time spent running kernel code (system time)
id: time spent idle
wa: time spent waiting for I/O
vmstat can run in a sampling mode.
Yielding between such computations is a good programming practice, but knowing where to add these instructions to relieve the CPU requires first figuring out which code is the culprit.
 Valgrind is a useful tool for detecting memory leaks.
The GNU C library comes with built-in functionality to help detect memory issues. However, it does not log the call stacks of the memory allocations it tracks. There are also static code analysis tools that can detect code issues much earlier.
Event monitoring software can accelerate software development and test cycles. Event monitoring data is usually machine data generated by IT systems. Such data can enable real-time searches to gain insights into user experience. Dashboards with charts can then help analyze the data. This data can be accessed over TCP, UDP, and HTTP. Data can also be warehoused for analysis. Issues that frequently recur can be documented and searched more quickly with the availability of such data, leading to faster debugging and problem solving. For example, data can be queried to identify errors in the logs, which could then be addressed remotely.
Machine data is massive and generated in streams. Being able to quickly navigate this volume to find the most relevant information for triaging issues is a differentiating factor for event monitoring software. Early warning notifications, running a rules engine, and detecting trends are some of the features that not only enable rapid development and test by providing feedback on deployed software, but also increase customer satisfaction as code is incrementally built and released.
Data is available to be collected, indexed, searched, and reported on. Applications can target specific interests such as security or correlations for building rules and alerts. Data is also varied: from the network, from applications, and from enterprise infrastructure. Powerful querying increases the usability of such data. For example, while security data may inform about known threats, the ability to include non-security user and machine data may add insight into unknown threats. Queries could also cover automated anomaly and outlier detection that help with understanding advanced threats. Queries for such key-value data can be written using Pig commands such as load/read, store/write, foreach/iterate, filter/predicate, group/cogroup, collect, join, order, distinct, union, split, stream, dump, and limit. The depth and breadth of possibilities with event monitoring data seem endless. As more data becomes available and richer, more powerful analytical techniques grow, this will help arm developers and operations engineers to better address the needs of the organization. Some of the differentiators of such software include having one platform, fast return on investment, the ability to use different data collectors, the use of non-traditional flat-file data stores, the ability to create and modify existing reports, the ability to create baselines and study changes, programmability to retrieve information as appropriate, and the ability to include compliance, security, fraud detection, and so on. If applications are able to use the event monitoring software, it will be evident from the number of applications that are written against it.

Thursday, January 9, 2014

bool IsMatch(string input, string pattern)
{
    if (string.IsNullOrWhiteSpace(input) || string.IsNullOrWhiteSpace(pattern)) return false;

    // Greedy wildcard match with backtracking over the most recent '*':
    // '?' matches exactly one character, '*' matches any sequence (including none).
    int i = 0;          // current position in input
    int p = 0;          // current position in pattern
    int starIndex = -1; // position of the most recent '*' in pattern
    int matchIndex = 0; // position in input where that '*' started matching

    while (i < input.Length)
    {
        if (p < pattern.Length && (pattern[p] == '?' || pattern[p] == input[i]))
        {
            // A literal must match exactly; '?' consumes exactly one character.
            i++;
            p++;
        }
        else if (p < pattern.Length && pattern[p] == '*')
        {
            // Remember the '*' so it can absorb more characters later if needed.
            starIndex = p;
            matchIndex = i;
            p++;
        }
        else if (starIndex != -1)
        {
            // Mismatch after a '*': let the '*' swallow one more character and retry.
            p = starIndex + 1;
            matchIndex++;
            i = matchIndex;
        }
        else
        {
            return false;
        }
    }

    // Any pattern characters left over must all be '*' for the match to succeed.
    while (p < pattern.Length && pattern[p] == '*') p++;
    return p == pattern.Length;
}
string input can be "ABCDBDXYZ"
string pattern can be "A*B?D*Z"
Here '*' matches any sequence of characters (including none) and '?' matches exactly one character, so IsMatch returns true for this input and pattern.