Saturday, January 11, 2014

In the previous post, we examined some tools on Linux for troubleshooting system issues. We continue that discussion with high CPU utilization issues. One approach is to read logs; by logs we mean performance counter logs, which on Linux can come from the sar or vmstat tools running in sampling mode. These logs help identify which core of a multicore processor is being utilized and whether the process running on that core has any processor affinity. Another approach is to take a core dump and restart the process; the ps and kill commands come in very useful for taking a dump.
The user workload is also important to analyze when present. High CPU utilization can be triggered by a workload, and this matters not merely because the workload gives insight into which component is being exercised, but also because it suggests how to reproduce the problem deterministically. Narrowing down the scope of the occurrence throws a lot of light on the underlying issue: when the problem occurs, which components are likely affected, what is on the stack, and which frame is likely at the top. If there are deterministic steps to reproduce the problem, we can repeatedly trigger the situation for closer study. In such cases the frame and the source code, in terms of module, file, and line, can be identified, and a resolution can be found.
Memory utilization is also a very common issue, and there are two approaches here as well. One is to add instrumentation, either via the linker or via tracing, that records call sequences and identifies memory allocations. The other is to use external tools that capture stack traces at every allocation, so the application's memory footprint shows which allocations were never freed and points to the offending code. Heap allocations are often tagged to catch memory corruption. The technique works on the principle that the tags placed at the beginning and end of each allocation should never be overwritten by the process's own code, since the tool wraps the allocation in them. Any write to a tag is likely from memory-corrupting code, and a stack trace taken at that moment points to the culprit code path. This works for allocations and de-allocations of all sizes.
Leaks and corruptions are two different syndromes that need to be investigated and resolved differently.
In the case of a leak, a code path may continuously leak memory when invoked. Tagging all allocations and capturing the stack at each one, or reducing the scope to a few components and tracking the objects they create, can reveal which object or allocation is missed. Corruption, on the other hand, is usually nondeterministic and can be caused by such things as timing issues, and the location of the corruption may also be random. Hence it is important to identify from the pattern of corruption which component is likely involved, and whether minimal instrumentation can be introduced to track all objects with that memory footprint.
