Cluster computing

Tuesday, February 25, 2014

Program execution logging seems an art. It can be dismissed as a chore, but for sustaining engineering it is an invaluable diagnostic. Troubleshooting is easier when a descriptive message accompanies each error. Typically, these messages describe the error at the moment it occurs, without any indication of what the customer could do to mitigate it. I don't mean that error messages need to be expanded to include corrective actions in all cases. That would help, but an association between error messages and corrective actions could be maintained instead. If, say, we keep all our error message strings in one place, it becomes easy to correlate errors with actions by keeping them side by side.
The corrective action strings need not even appear in the logs, but the association could help support and sustaining engineers diagnose issues, especially when the workarounds require domain knowledge. This would avoid a lot of back-and-forth communication and even help engineers in the field.
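The association described above can be sketched as a single catalog keyed by error code. This is only an illustration under assumptions: the codes, messages, actions and the names `ErrorEntry`, `ERROR_CATALOG`, `log_line` and `corrective_action` are all hypothetical and not taken from any particular product.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ErrorEntry:
    """One catalog row: the logged message and, side by side, its mitigation."""
    message: str
    corrective_action: Optional[str] = None  # None when no workaround is known yet


# Hypothetical catalog: every error string lives in one place.
ERROR_CATALOG = {
    "E1001": ErrorEntry(
        message="Disk quota exceeded on data volume",
        corrective_action="Free space on the volume or raise the quota, then retry.",
    ),
    "E1002": ErrorEntry(message="License check failed"),
}


def log_line(code: str) -> str:
    """Build the text that goes into the log; the corrective action is left out."""
    return f"{code}: {ERROR_CATALOG[code].message}"


def corrective_action(code: str) -> str:
    """Look up the mitigation for support use, outside of the logging path."""
    entry = ERROR_CATALOG.get(code)
    if entry is None or entry.corrective_action is None:
        return "No documented corrective action"
    return entry.corrective_action
```

Here the log only ever carries the code and message, while support can call `corrective_action("E1001")` to find the workaround. Keeping the two apart is one way to avoid exposing more detail to customers than intended.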
At the same time, this solution may not be appropriate in all cases, for example where we don't want to be too informative to our customers or where we don't want to confound them with too much detail. Even in such cases, being elaborate about the error conditions and descriptive in the messages may help the appropriate audience target their actions.
Lastly, I want to add that many feature developers may already be aware of common symptoms and mitigations from the development phase. Capturing these artifacts will help with troubleshooting the feature at a later point in time. Building a history or database of such knowledge via simple bug tracking would help immensely, since troubleshooters often search the bug database for similar problems reported earlier.
Another consideration is that the application could maintain data structures exclusively for supportability. For example, it would be great if there were an enumeration of all the workers for a given component, along with their tasks, objects and states, and if these could be queried in a pull operation independent of the method they are working on. These pull operations could be invoked by views specific to runtime diagnostics, so they can be exposed via methods specific to management. They differ from logging in that they are actual calls to the product to retrieve enhanced runtime information.
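A minimal sketch of such a supportability structure follows, assuming an in-process registry that workers update as they go and that a management view can query at any time. The names `WorkerInfo`, `WorkerRegistry`, `update`, `remove` and `snapshot` are hypothetical, as are the component and task names, and the per-worker objects are omitted for brevity.

```python
import threading
from dataclasses import asdict, dataclass


@dataclass
class WorkerInfo:
    component: str
    worker_id: str
    task: str
    state: str


class WorkerRegistry:
    """Diagnostic-only bookkeeping, kept separate from the workers' own logic."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._workers = {}  # (component, worker_id) -> WorkerInfo

    def update(self, component: str, worker_id: str, task: str, state: str) -> None:
        """Called by a worker whenever its task or state changes."""
        with self._lock:
            self._workers[(component, worker_id)] = WorkerInfo(
                component, worker_id, task, state
            )

    def remove(self, component: str, worker_id: str) -> None:
        """Called when a worker exits; unknown workers are ignored."""
        with self._lock:
            self._workers.pop((component, worker_id), None)

    def snapshot(self, component: str) -> list:
        """Pull operation: copy the current view of one component's workers."""
        with self._lock:
            return [
                asdict(info)
                for (comp, _), info in sorted(self._workers.items())
                if comp == component
            ]


registry = WorkerRegistry()
registry.update("indexer", "w1", "merge-segments", "running")
registry.update("indexer", "w2", "none", "idle")
registry.update("search", "w1", "run-query", "running")
```

A management method could then return `registry.snapshot("indexer")` on demand, regardless of what the workers are executing at that moment. The snapshot hands back copies, so the caller never holds references into live worker state.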
