Cluster computing

Sunday, April 26, 2020

The reconfiguration of the longevity tool
The longevity tool requires a task configuration, which is usually provided at the start of execution through a ConfigMap that mounts the configuration on a read-only file share. The tool runs for a long time and may encounter intermittent exceptions along the way. Some of these exceptions can be ignored, so the tool does not need to stop and reset the counters gathered during the run.
This called for addressing the tool's limitations on several fronts. First, retries were added to the readers so that they could get past some of the exceptions that would otherwise bring the tool down. The readers were chosen first because they are independent and are launched with a reader group configuration. The writers also encounter exceptions, but handling those could wait until after the reader retries, since the events are random data and the readers perform validations that the writers do not.
Among the validations, byte-level validation was important because it confirms that the data read back is the same as the data written. The readers showed no malformed-event exceptions in regular runs and only a few when connections were closed abruptly. The most common exception encountered was SegmentTruncatedException. The number of exceptions dropped when the tool's connectors to the Pravega store were upgraded.
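The post does not say how the byte-level validation is implemented. As a minimal sketch of one common approach, the writer could prefix each random payload with a checksum that the reader recomputes. The function names and the CRC32 framing below are assumptions made for illustration, not the tool's actual event format.

```python
import struct
import zlib


def encode_event(payload: bytes) -> bytes:
    """Writer side (hypothetical framing): 4-byte big-endian CRC32, then payload."""
    return struct.pack(">I", zlib.crc32(payload)) + payload


def validate_event(event: bytes) -> bytes:
    """Reader side: return the payload, or raise if the bytes do not match."""
    if len(event) < 4:
        raise ValueError("malformed event: too short to hold a checksum")
    (expected,) = struct.unpack(">I", event[:4])
    payload = event[4:]
    if zlib.crc32(payload) != expected:
        raise ValueError("malformed event: checksum mismatch")
    return payload
```

With this framing the payload itself can stay random, because the reader needs no knowledge of what was written beyond the event bytes.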
The tool also improved considerably once diagnostic logging was added and spot bug fixes were made. These reduced the exceptions, but the retries were still needed to show the tester that the store was functional and that the tool's subsequent requests went through. The retry logic was expanded to accept two parameters from the user: numRetries and delayMillis, the delay between retries. These parameters are read from the task configuration at the start of the tool.
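The retry behaviour described above can be sketched roughly as follows. Only numRetries and delayMillis come from the post. The exception class, the helper names, and the fake reader are hypothetical stand-ins, and the real tool decides for itself which exceptions are retryable.

```python
import time


class TransientStoreError(Exception):
    """Hypothetical stand-in for an exception the tool chooses to retry."""


def read_with_retries(read_once, num_retries, delay_millis, sleep=time.sleep):
    """Call read_once(); on a transient error, retry up to num_retries times."""
    attempt = 0
    while True:
        try:
            return read_once()
        except TransientStoreError:
            if attempt >= num_retries:
                raise  # retries exhausted: surface the failure to the tester
            attempt += 1
            sleep(delay_millis / 1000.0)  # wait delayMillis before the next try


def make_flaky_reader(failures_before_success):
    """Build a fake read that fails a fixed number of times, then succeeds."""
    state = {"calls": 0}

    def read_once():
        state["calls"] += 1
        if state["calls"] <= failures_before_success:
            raise TransientStoreError("connection closed")
        return "event-%d" % state["calls"]

    return read_once


if __name__ == "__main__":
    # Values as they might appear in the task configuration.
    task_config = {"numRetries": 3, "delayMillis": 10}
    reader = make_flaky_reader(failures_before_success=2)
    print(read_with_retries(reader, task_config["numRetries"],
                            task_config["delayMillis"]))  # prints "event-3"
```

Passing the sleep function as a parameter is a design choice for this sketch only. It lets the delay be observed or skipped when the logic is exercised outside a long run.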
Subsequently, restart logic had to be added so that the readers could resume from the last position rather than from the beginning of the stream. This was solved with checkpoints, which are used to reset the reader group so that readers may come and go while progress continues from the last position. Checkpoints were added to the readers, but a configurable parameter was also needed to indicate that a reader is restartable. This parameter was added to the task configuration as well.
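To illustrate the idea of resuming from a checkpoint, here is a small in-memory model. It does not use the Pravega client. The class names, the list that stands in for a stream, and the checkpoint interval are all assumptions made for this sketch.

```python
class ReaderGroupState:
    """In-memory stand-in for reader group state shared by all readers."""

    def __init__(self):
        self.last_checkpoint = 0  # stream offset recorded by the latest checkpoint


class Reader:
    def __init__(self, stream, group, restartable, checkpoint_every=2):
        self.stream = stream
        self.group = group
        self.checkpoint_every = checkpoint_every
        # A restartable reader resumes from the last checkpoint; otherwise it
        # starts again from the beginning of the stream.
        self.position = group.last_checkpoint if restartable else 0

    def read(self, max_events):
        """Read up to max_events, checkpointing every checkpoint_every events."""
        events = []
        while self.position < len(self.stream) and len(events) < max_events:
            events.append(self.stream[self.position])
            self.position += 1
            if self.position % self.checkpoint_every == 0:
                self.group.last_checkpoint = self.position
        return events


if __name__ == "__main__":
    stream = ["e%d" % i for i in range(6)]
    group = ReaderGroupState()
    first = Reader(stream, group, restartable=True)
    first.read(3)                      # reads e0..e2, checkpoint lands at offset 2
    second = Reader(stream, group, restartable=True)
    print(second.read(10))             # resumes at e2, not at e0
```

In this model a replacement reader re-reads anything consumed after the last checkpoint, so the events between the checkpoint and the failure are seen twice. Whether that matters depends on how the run's counters are kept.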
The longevity tool runs in Docker containers, so there was no easy way to specify a restart as an argument after the tool was launched. A pair of APIs was added to restart the readers and the writers. This gave the tester the ability to get past failures of the tool by bringing down and reviving the writers and readers.
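The post does not describe the shape of these APIs. As a rough sketch, the two entry points might sit on a controller like the one below, which an HTTP handler inside the container could call. Every name here is hypothetical, and the running flag stands in for a live reader or writer thread.

```python
class Worker:
    """Hypothetical reader or writer; 'running' stands in for a live thread."""

    def __init__(self, name):
        self.name = name
        self.running = False

    def start(self):
        self.running = True

    def stop(self):
        self.running = False


class RestartController:
    """Holds the workers and exposes one restart entry point per kind."""

    def __init__(self, readers, writers):
        self.workers = {"readers": readers, "writers": writers}
        self.restart_counts = {"readers": 0, "writers": 0}
        self.events_validated = 0  # run-wide counter, untouched by restarts

    def start_all(self):
        for group in self.workers.values():
            for worker in group:
                worker.start()

    def _restart(self, kind):
        # Bring every worker of this kind down, then revive them all.
        for worker in self.workers[kind]:
            worker.stop()
        for worker in self.workers[kind]:
            worker.start()
        self.restart_counts[kind] += 1

    def restart_readers(self):
        self._restart("readers")

    def restart_writers(self):
        self._restart("writers")
```

Keeping the counters on the controller rather than on the workers is one way to let a restart revive the readers or writers without resetting what the run has gathered so far.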
Checkpoints meant that there were StreamCut positions that could now be used to reduce the segment range the readers need to work on. The range is specified as a pair of head and tail, where the tail points to the current boundary from which the next event is read, while the head can be adjusted so that it is not at the start of the stream. Since the segment ranges are logged, the tester can associate a point in time with a position from which the tool can be resumed. This was added as a segment number and last position pair in the task configuration. The inability to change the configuration during execution was a limitation of the tool. This was relaxed with an API that accepts an entirely new test configuration and kicks off a restart with the new parameters.
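A simplified sketch of the head adjustment and the reconfiguration call follows. It assumes the stream can be treated as a single ordered sequence of segments, which is a simplification of how a StreamCut works. The configuration keys (startFrom, segmentNumber, lastPosition) and the class names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Position:
    segment_number: int
    offset: int

    def key(self):
        # Assumption: segments are ordered by number, then by offset.
        return (self.segment_number, self.offset)


@dataclass(frozen=True)
class ReadRange:
    head: Position
    tail: Position


def adjust_head(full_range, task_config):
    """Move the head forward to the configured position, if one is given."""
    start = task_config.get("startFrom")  # hypothetical configuration key
    if start is None:
        return full_range
    new_head = Position(start["segmentNumber"], start["lastPosition"])
    if not (full_range.head.key() <= new_head.key() <= full_range.tail.key()):
        raise ValueError("configured position lies outside the stream range")
    return ReadRange(new_head, full_range.tail)


class LongevityRun:
    """Holds the active configuration and the range the readers work on."""

    def __init__(self, full_range, task_config):
        self.full_range = full_range
        self.task_config = task_config
        self.read_range = adjust_head(full_range, task_config)
        self.restarts = 0

    def reconfigure(self, new_config):
        # Validate first so a bad configuration leaves the current run untouched.
        new_range = adjust_head(self.full_range, new_config)
        self.task_config = new_config
        self.read_range = new_range
        self.restarts += 1  # stands in for restarting readers and writers
```

A tester who finds a segment number and position in the logs could pass them in the new configuration, and the run would restart with the head moved to that point while the tail stays at the current boundary.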
