Cluster computing

Thursday, March 7, 2019

Today we continue discussing best practices from storage engineering:

539) From supercomputers to large scale clusters, the size of compute, storage and network can be made to vary quite a bit. However, the need to own or manage such capability reduces significantly once it is commoditized and outsourced.

540) Some tasks are high priority and are usually fewer in number than the general class of tasks. If they arrive at an uncontrolled rate, they can incur significant cost. Most storage products therefore try to control the upstream workload for which they are designed. For example, if the high-priority tasks can be clearly distinguished from the rest, it is advantageous to treat the two classes differently.

541) The scheduling policies for tasks can vary from scheduler to scheduler. Usually a simple policy scales much better than a complicated one. For example, if all the tasks have a share in a pie representing the scheduler, then it is simpler to expand the pie than to re-adjust the pie slices dynamically to accommodate the tasks.
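The pie analogy can be made concrete with a small sketch. The function names `expand_pie` and `reslice_pie` and the slice sizes are hypothetical, chosen only for illustration: the first adds a task by growing the total, the second keeps the total fixed and has to touch every existing slice.

```python
from fractions import Fraction


def expand_pie(slices, new_task, new_size):
    """Add a task by growing the pie; existing slices are left untouched."""
    grown = dict(slices)
    grown[new_task] = Fraction(new_size)
    return grown


def reslice_pie(slices, new_task, new_size):
    """Add a task to a fixed-size pie; every existing slice must shrink."""
    total = sum(slices.values())
    scale = Fraction(total - new_size, total)
    resized = {task: size * scale for task, size in slices.items()}
    resized[new_task] = Fraction(new_size)
    return resized


slices = {"a": Fraction(60), "b": Fraction(40)}
expanded = expand_pie(slices, "c", 25)   # total grows, "a" and "b" unchanged
resliced = reslice_pie(slices, "c", 25)  # total fixed, "a" and "b" recomputed
```

Expanding is one assignment regardless of how many tasks exist, while re-slicing is work proportional to the number of tasks and changes every grant already handed out.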

542) The weights associated with tasks are set statically and then used in computations to determine the scheduling of the tasks. Scheduling can be measured in quanta of time, and a task that takes more than its expected quantum is called a quantum thief. A scheduler uses tallying to find a quantum thief and make it yield to other tasks.
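A minimal sketch of tallying, under stated assumptions: the class `TallyingScheduler`, the 10 ms base quantum, and the rule "a thief is a task whose used time exceeds its granted time" are all illustrative choices, not a description of any particular product.

```python
from collections import defaultdict


class TallyingScheduler:
    def __init__(self, weights, base_quantum_ms=10):
        self.weights = dict(weights)         # static weights, set once
        self.base_quantum_ms = base_quantum_ms
        self.granted_ms = defaultdict(int)   # tally of time granted per task
        self.used_ms = defaultdict(int)      # tally of time consumed per task

    def quantum_for(self, task):
        """A task's quantum is its weight times the base quantum."""
        return self.weights[task] * self.base_quantum_ms

    def record_run(self, task, elapsed_ms):
        """Book one scheduling turn: what was granted and what was used."""
        self.granted_ms[task] += self.quantum_for(task)
        self.used_ms[task] += elapsed_ms

    def thieves(self):
        """Tasks that have consumed more time than they were granted."""
        return sorted(t for t in self.weights
                      if self.used_ms[t] > self.granted_ms[t])

    def next_task(self):
        """Prefer the task furthest below its grant, so thieves yield."""
        return min(self.weights,
                   key=lambda t: (self.used_ms[t] - self.granted_ms[t], t))


scheduler = TallyingScheduler({"io": 2, "gc": 1})
scheduler.record_run("io", 15)   # granted 20 ms, used 15 ms
scheduler.record_run("gc", 25)   # granted 10 ms, used 25 ms
```

Here `gc` has overrun its grant, so it shows up in `thieves()` and `next_task()` picks `io` ahead of it.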

543) Book-keeping is essential for both the scheduler and the allocator, not only to keep track of grants but also for analysis and diagnostics.

544) A scheduler and an allocator can each have their own manager that separates the concerns of management from their work.

545) The more general purpose the scheduler and allocator become, the easier it is to use them in different components. Commodity implementations win hands down against specialized ones because they scale.

546) Requests for remote resources are expected to take longer than local operations. If they incur timeouts, the quantum grants may need to be stretched to cover them.

547) A timeout must expand to include the timeouts of its nested operations.
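The two points above can be sketched as follows. This assumes nested operations run sequentially, so the enclosing timeout is the sum of its own budget and those of its children; the `Operation` type, the operation names, and the millisecond figures are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Operation:
    name: str
    own_timeout_ms: int
    nested: List["Operation"] = field(default_factory=list)


def effective_timeout_ms(op):
    """Own budget plus the effective timeouts of all nested operations."""
    return op.own_timeout_ms + sum(effective_timeout_ms(child)
                                   for child in op.nested)


def quanta_needed(timeout_ms, quantum_ms):
    """Number of whole quanta a grant must stretch over (ceiling division)."""
    return -(-timeout_ms // quantum_ms)


remote_read = Operation("remote_read", 200, [
    Operation("resolve", 50),
    Operation("connect", 100, [Operation("handshake", 150)]),
])

total_ms = effective_timeout_ms(remote_read)   # 200 + 50 + (100 + 150)
grant = quanta_needed(total_ms, quantum_ms=40)
```

If the nested operations ran in parallel instead, the maximum of the children's timeouts would replace the sum.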

548) Event notification schemes help handle such timeouts at the appropriate scope.

549) A recovery state machine can help with global event handling for outages and recovery.

550) The number of steps taken to recover from outages can be reduced by dropping scoped containers in favor of standby containers.

551) Adding and dropping containers makes cleanup easy to address.

552) The number of replication groups is determined by the data that needs to be replicated. Generally there is very little data
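The recovery state machine mentioned in 549 could look roughly like the sketch below. The three states, the event names, and the transition table are assumptions made for illustration; a real system would define its own.

```python
from enum import Enum


class State(Enum):
    HEALTHY = "healthy"
    OUTAGE = "outage"
    RECOVERING = "recovering"


# (current state, event) -> next state; anything not listed is ignored.
TRANSITIONS = {
    (State.HEALTHY, "outage_detected"): State.OUTAGE,
    (State.OUTAGE, "recovery_started"): State.RECOVERING,
    (State.RECOVERING, "recovery_succeeded"): State.HEALTHY,
    (State.RECOVERING, "recovery_failed"): State.OUTAGE,
}


class RecoveryStateMachine:
    def __init__(self):
        self.state = State.HEALTHY
        self.history = [State.HEALTHY]   # kept for analysis and diagnostics

    def handle(self, event):
        """Apply an event; return False if it is not valid in this state."""
        next_state = TRANSITIONS.get((self.state, event))
        if next_state is None:
            return False
        self.state = next_state
        self.history.append(next_state)
        return True


machine = RecoveryStateMachine()
ignored = machine.handle("recovery_started")   # not valid while healthy
machine.handle("outage_detected")
machine.handle("recovery_started")
machine.handle("recovery_succeeded")
```

Keeping the transitions in one table gives every component a single place to route outage and recovery events, and the recorded history doubles as a diagnostic trail.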
