Friday, November 30, 2018

Today we continue discussing best practices in storage engineering:

119) When storage operations don’t go as planned, exceptions need to be raised and reported. Since exceptions bubble up from deep layers, they need to be properly wrapped and translated to be actionable to the user. Such exception handling and chaining often breaks, leading to costly troubleshooting. Consequently, code revisits and cleanup become a routine chore.
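The wrap-and-translate pattern above can be sketched as follows. This is a minimal illustration with assumed class and function names, not an implementation from any particular product; the key point is preserving the cause chain while presenting an actionable error.

```python
# Translating a low-level storage error into a user-facing exception
# while keeping the original cause attached for troubleshooting.

class DiskReadError(Exception):
    """Low-level failure raised deep in the I/O layer (illustrative)."""

class StorageOperationError(Exception):
    """User-facing wrapper carrying context the caller can act on."""
    def __init__(self, operation, detail):
        self.operation = operation
        super().__init__(f"{operation} failed: {detail}")

def read_block(block_id):
    # Stand-in for a deep-layer read that fails.
    raise DiskReadError(f"bad sector at block {block_id}")

def get_object(block_id):
    try:
        return read_block(block_id)
    except DiskReadError as e:
        # 'raise ... from' keeps the chain intact, so the deep-layer
        # detail is not lost when the error reaches the user.
        raise StorageOperationError("GET", str(e)) from e
```

Because the chain survives translation, logs show both the user-level operation and the root cause instead of one or the other.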

120) Exceptions and alerts don’t matter to the customer unless they come with wording that explains the mitigating action the user needs to take. Error code, level, and severity are other useful ways to describe the error. Diligence in preparing better error messages goes a long way toward helping end users.
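One way to make mitigation a first-class part of every error is to carry it in the error record itself. A minimal sketch, with all field names and the sample error assumed for illustration:

```python
# An error record that pairs a code and severity with the mitigation
# the user should take, so no alert ships without an action.

from dataclasses import dataclass

@dataclass
class StorageError:
    code: str          # stable identifier for lookup and support
    severity: str      # e.g. "warning", "error", "critical"
    message: str       # what went wrong
    mitigation: str    # the action the user can take

    def render(self):
        return (f"[{self.code}/{self.severity}] {self.message} "
                f"Mitigation: {self.mitigation}")

QUOTA_EXCEEDED = StorageError(
    code="STG-4003",
    severity="error",
    message="Bucket quota exceeded; writes are rejected.",
    mitigation="Delete unused objects or raise the bucket quota.",
)
```

Making `mitigation` a required field means the omission the item warns about cannot happen silently.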

121) The ratio of background jobs to data path workers is important. It is easy to delegate jobs to the background in order to keep the data path fast. However, if there is only one data path worker and the number of background jobs is very high, efficiency drops and message passing increases. Instead, it might be better to serialize the tasks on the same worker. The trade-off is even more glaring when the background workers poll or execute in scheduled cycles, because that introduces delays.
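The serialize-on-the-same-worker alternative can be sketched as below. The names and the shape of the tasks are assumed; the point is that follow-up work drains from a local queue with no hand-off to polled background workers, so no message passing or polling delay is incurred.

```python
# Instead of posting each follow-up job to a background worker,
# the data-path worker drains a small local queue serially.

from collections import deque

def handle_write(payload, followups):
    """followups: callables for work that would otherwise go to
    background jobs (e.g. indexing, replication)."""
    results = [f"wrote:{payload}"]
    queue = deque(followups)      # tasks serialized on the same worker
    while queue:
        task = queue.popleft()
        results.append(task())    # executed inline, in order
    return results
```

This trades data-path latency for fewer workers and no scheduling gaps; the item's ratio argument is about finding where that trade stops paying off.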

122) Event-based programming is harder to coordinate and diagnose than sequential programming, yet it is fondly used in many storage drivers and even in user-mode components that do not need to be highly responsive or where there might be a significant delay between action triggers. It also requires a driver verifier to analyze all the code paths. Often, synchronous execution suffices, with object-oriented design providing better organization and easier troubleshooting. While it is possible to mix the two, keeping execution aligned with the timeline in the logs for the activities performed by the storage product helps reduce the overall cost of maintenance.

Thursday, November 29, 2018

Today we continue discussing best practices in storage engineering:

115) Upgrade scenarios: As with any other product, a storage server has similar concerns for changes to data structures or requests/responses. While it is important for each feature to be backward compatible, it is also important to have a design that can introduce flexibility without encumbrance.

116) Multi-purpose applicability of logic: When we covered diagnostic tools and scripts, we mentioned logic that makes use of feedback from monitored data paths. This is one example, but verification and validation such as these are equally applicable to external monitoring and troubleshooting of the product. The same logic may apply in a diagnostic API as well as in product code as active data path corrections. Furthermore, such logic may be called from the user interface, the command line, or other SDKs. Therefore, validations throughout the product are candidates for repurposing.
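A sketch of the repurposing idea, with all names assumed: one validation routine written once, consumed both by the data path (which corrects actively) and by a diagnostic API (which only reports).

```python
# One validation routine, reused from the data path and the diagnostic API.

def validate_replica_count(expected, observed):
    """Pure check; returns (ok, detail) and is safe to call from any layer."""
    if observed < expected:
        return False, f"under-replicated: {observed}/{expected}"
    return True, "healthy"

def on_write_complete(expected, observed, repair):
    """Data path: the same check, but with an active correction."""
    ok, detail = validate_replica_count(expected, observed)
    if not ok:
        repair()          # e.g. schedule re-replication
    return detail

def diag_replicas(expected, observed):
    """Diagnostic API: report only, no side effects."""
    return validate_replica_count(expected, observed)[1]
```

Keeping the check pure is what makes it callable from the UI, CLI, or SDK without dragging in data-path side effects.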

117) Read-only/read-write: It is better to separate the read-only from the read-write portion of data because it separates task access to the data. Usually, online processing can be isolated to read-write while analytical processing can be restricted to read-only. The same holds true for static plans versus dynamic policies, and for server-side resources versus client-side policies.

118) While the control path is easier to describe and maintain, the data path is more difficult to determine upfront because customers use it any way they want. When we discussed assigning labels to incoming data and workloads, it was a way to classify the usages of the product so we can gain insight into how it is being used. In a feedback cycle, such labels provide a convenient understanding of the temporal and spatial nature of the data flow.
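Workload labeling can be as simple as tagging each request along a few axes. The thresholds and label names below are assumed purely for illustration:

```python
# Assign labels to an incoming request so usage patterns (temporal and
# spatial) become visible in a feedback cycle.

LARGE_THRESHOLD = 4 * 1024 * 1024   # illustrative cutoff: 4 MiB

def label_request(size_bytes, is_read, hour):
    labels = []
    labels.append("read" if is_read else "write")            # access type
    labels.append("large" if size_bytes >= LARGE_THRESHOLD
                  else "small")                              # spatial
    labels.append("off-peak" if hour < 6 or hour >= 22
                  else "peak")                               # temporal
    return labels
```

Aggregating these labels over time yields the usage insight the item describes without inspecting payloads.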

119) When storage operations don’t go as planned, exceptions need to be raised and reported. Since exceptions bubble up from deep layers, they need to be properly wrapped and translated to be actionable to the user. Such exception handling and chaining often breaks, leading to costly troubleshooting. Consequently, code revisits and cleanup become a routine chore.

120) Exceptions and alerts don’t matter to the customer unless they come with wording that explains the mitigating action the user needs to take. Error code, level, and severity are other useful ways to describe the error. Diligence in preparing better error messages goes a long way toward helping end users.


Wednesday, November 28, 2018

Today we continue discussing best practices in storage engineering:
111) Memory configuration: In a cluster environment, most of the nodes are commodity hardware. Typically, they have reasonable memory. However, the amount of storage data that can be processed by a node depends on fitting the corresponding data structures in memory. The larger the memory, the higher the capability of the server component in the control path. Therefore, there must be some choice of memory versus capability in the overall topology of the server so that it can be recommended to customers.
112) CPU configuration: Typically, VMs added as nodes to a storage cluster come in T-shirt-size configurations, with the number of CPUs and the memory defined for each size. There is no restriction on the storage server being deployed in a container or a single virtual machine. And since virtualization of the compute makes it harder to tell how the host scales up, the general rule of thumb has been the more, the better. It does not need to be so, and a certain configuration may provide the best advantage. Again, the choice and the recommendation must be conveyed to the customer.
113) Serverless computing: Many functions of a storage server/product are written in the form of microservices, or perhaps as components within layers if they are not separable. However, the notion of modularity can be taken further in the form of serverless computing, so that the lease on named compute servers does not affect the storage server.
114) Expansion: Although some configurations may be optimal, the storage server must be flexible to what the end user wants as configuration for the system resources. Availability of flash storage and its configuration via external additions to the hardware is a special case. But upgrading the storage server from one hardware generation to another must be permitted.
115) Upgrade scenarios: As with any other product, a storage server has similar concerns for changes to data structures or requests/responses. While it is important for each feature to be backward compatible, it is also important to have a design that can introduce flexibility without encumbrance.

Tuesday, November 27, 2018




Today we continue discussing best practices in storage engineering:

106) The CAP theorem states that a system cannot guarantee consistency, availability, and partition tolerance at the same time. However, it is possible to work around this when we use layering and design the system around a specific fault model. The append-only layer provides high availability. The partitioning layer provides strong consistency guarantees. Together they can handle a specific set of faults with high availability and strong consistency.
107) Workload profiles: Every storage engineering product will have data I/O, and one of the ways to try out the product is to use a set of workload profiles with varying patterns of data access.
108) Intra/inter: Since data I/O crosses multiple layers, a lower layer may perform operations that are similarly applicable to artifacts in another layer at a higher scope. For example, replication may occur between copies of objects within a single PUT operation and may equally apply to objects spanning sites designated for content replication. This not only emphasizes reusability but also provides a way to check the architecture for consistency.
109) Local/remote: While many components within the storage server take the disk operations to be local, certain components gather information across disks and components directly writing to them. In such cases, even if the disk is local, it is more consistent to access it via a loopback and simplify the logic by treating every such operation as remote.
110) Resource consumption: We referred to performance engineering for improving the data operations. However, the number of resources used per request was not called out, because it may have been perfectly acceptable if the elapsed time was within bounds. Still, resource conservation has a lot to do with reducing interactions, which in turn leads to efficiency.

Monday, November 26, 2018

Today we continue discussing best practices in storage engineering:
100) Conformance to verbs: The service-oriented architecture framework of providing web services defined contract and behavior in addition to address and binding for services, but the general shift in the industry has been toward RESTful services. This paradigm introduces well-known verbs for the operations permitted. Storage products that provide RESTful services must conform to the well-defined mapping of verbs to create-read-update-delete operations on their resources.
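The verb-to-operation mapping can be sketched with a toy dispatcher. The router and the in-memory store below are assumed for illustration; the mapping itself (POST/GET/PUT/DELETE to create/read/update/delete) is the standard REST convention the item refers to.

```python
# Conforming REST verbs to CRUD operations on a resource.

VERB_TO_OPERATION = {
    "POST": "create",
    "GET": "read",
    "PUT": "update",
    "DELETE": "delete",
}

def dispatch(verb, resource, store, body=None):
    """Returns (status_code, payload) for a request against a dict store."""
    op = VERB_TO_OPERATION.get(verb)
    if op == "create":
        store[resource] = body
        return 201, body
    if op == "read":
        return (200, store[resource]) if resource in store else (404, None)
    if op == "update":
        store[resource] = body
        return 200, body
    if op == "delete":
        return 204, store.pop(resource, None)
    return 405, None   # verb outside the well-defined mapping
```

Rejecting unmapped verbs with 405 is part of the conformance: clients can rely on the contract without consulting product-specific documentation.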

101) Storage products tend to form a large code base, which significantly hurts developer productivity when build time takes more than a few minutes. Consequently, the code base may need to be constantly refactored, or the build needs to be completed with more workers, memory, and profiling.

102) Profiling is not limited to build time. Like the performance counters mentioned earlier, there is a way to build instrumented code so that bottlenecks may be identified. Like build profiling, this has to be repeated, and trends need to be monitored.
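A minimal sketch of such instrumentation, with the registry and decorator names assumed: wrapping a code path records its elapsed time, and the accumulated samples are what get trended over repeated runs.

```python
# Instrumenting code paths so bottlenecks show up as per-function
# timing samples that can be monitored over time.

import time
from collections import defaultdict

TIMINGS = defaultdict(list)   # function name -> elapsed-time samples

def instrumented(fn):
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return fn(*args, **kwargs)
        finally:
            TIMINGS[fn.__name__].append(time.perf_counter() - start)
    return wrapper

@instrumented
def lookup(key):
    # Stand-in for a hot code path worth profiling.
    return hash(key) % 1024
```

In a real build this would be compiled in only for instrumented binaries, so production data paths pay no overhead.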

103) Stress testing the storage product also helps gain valuable insights into whether the product’s performance changes over time. This covers everything from memory leaks to resource starvation.

104) Diagnostic tools and scripts, including log queries used to troubleshoot during development, also become useful artifacts to share with the developer community for the storage product. Even if the storage product is used mostly for archival, there is value in sharing these, along with documentation, with the community.

105) Older versions of the storage product may have had to be diagnosed with scripts and log queries, but bringing them into the product in its current version as a diagnostic API makes them mainstream. Documentation for these and other APIs makes it easier on the developer community.

Sunday, November 25, 2018




Today we continue discussing best practices in storage engineering:
95) Reconfiguration: Most storage products are subject to pools of available resources managed by policies that can change from time to time. Whenever the server resources are changed, they must be changed in one operation so that the system presents a consistent view to all usages going forward. Such a system-wide change is a reconfiguration and is often implemented across storage products.

96) Auto-tuning: This is the feedback loop with which we allow the storage server/appliance/product to perform better, because the dynamic parameters are adjusted to values that better suit the workload.
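A toy version of such a feedback loop, with the parameter, target, and step sizes all assumed: a dynamic cache size is nudged up when the hit rate falls short of a target and reclaimed when it exceeds it.

```python
# One step of an auto-tuning feedback loop: adjust a dynamic parameter
# (here, a cache size in MiB) toward the observed workload.

def tune_cache_size(current_mb, hit_rate, target=0.9, step_mb=64,
                    min_mb=64, max_mb=4096):
    if hit_rate < target:
        return min(max_mb, current_mb + step_mb)   # grow the cache
    return max(min_mb, current_mb - step_mb)       # reclaim memory
```

Run periodically against fresh measurements, the loop converges the parameter to a value that suits the workload, which is exactly the feedback cycle the item describes.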

97) Acceptance: This is the user-defined level of service-level agreement for the APIs to the storage server, so that they maintain satisfactory performance, with the advantage that clients can now communicate with a pre-existing contract.

98) Address: This defines how the storage is discovered by the clients. For example, if there were services, this would define how the service would be discovered. If it were a network share, this would define how the remote share would be mapped. While most storage products enable users to create their own addresses for their storage artifacts, not every storage product provides a gateway to those addresses.

99) Binding: A binding protocol defines the transport protocol, encoding and security requirements before the data transfer can be initiated. Although storage products concern themselves with data at rest, they must provide ways to secure data in transit.

100) Conformance to verbs: The service-oriented architecture framework of providing web services defined contract and behavior in addition to address and binding for services, but the general shift in the industry has been toward RESTful services. This paradigm introduces well-known verbs for the operations permitted. Storage products that provide RESTful services must conform to the well-defined mapping of verbs to create-read-update-delete operations on their resources.

#codingexercise
int GetNodeWithLeavesEqualToThreshold(Node root, int threshold, ref List<Node> result)
{
    // Returns the number of leaves in the subtree rooted at root and
    // collects every node whose subtree contains exactly threshold leaves.
    if (root == null) return 0;
    if (root.left == null && root.right == null)
    {
        // A leaf is its own single-leaf subtree, so it qualifies when threshold == 1.
        if (threshold == 1) result.Add(root);
        return 1;
    }
    int left = GetNodeWithLeavesEqualToThreshold(root.left, threshold, ref result);
    int right = GetNodeWithLeavesEqualToThreshold(root.right, threshold, ref result);
    if (left + right == threshold)
    {
        result.Add(root);
    }
    return left + right;
}


Saturday, November 24, 2018

Today we continue discussing best practices in storage engineering:

92) Words: For the past fifty years that we have persisted our data, we have relied on the physical storage being the same for our photos and our documents, and on the logical organization over this storage to separate our content, so we may run or edit them respectively. From file systems to object storage, this physical storage has always been binary, with both the photos and documents appearing as 0s and 1s. However, text content has syntax and semantics that facilitate query and analytics, which are coming of age. Recently, natural language processing and text mining have made significant strides to help us do such things as classify, summarize, annotate, predict, index, and look up, which were previously not done, and not at such scale as where we save our data today, such as in the cloud. Even as we expand our capabilities on text, we still rely on our fifty-year-old tradition of mapping letters to binary sequences instead of the units of organization in natural language, such as words. Our data structures that store words spell out the letters instead of efficiently encoding the words. Even when we do read words and set up text processing on that content, we limit ourselves to what others tell us about their content. Words may appear not just in documents; they may appear even in such unreadable things as executables. Neither our current storage nor our logical organization is enough to fully locate all items of interest; we need ways to expand our definitions of both.

93) Inverted lists: We have referred to collections both in the organization of data and in queries over data. Another way we facilitate search over the data is by maintaining inverted lists of terms from the storage organizational units. This enables a faster lookup of locations corresponding to the presence of a search term. The inverted list may be constantly updated so that it remains consistent with the data. The lists are also helpful for gathering an overall ordering of terms by their occurrences.
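The structure can be sketched in a few lines. Tokenization here is simplified to whitespace splitting, and the function names are assumed; a real system would update the index incrementally as data changes rather than rebuilding it.

```python
# An inverted list: term -> set of locations, plus an overall ordering
# of terms by how many locations they occur in.

from collections import defaultdict

def build_inverted_index(docs):
    """docs: mapping of location id -> text."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def terms_by_frequency(index):
    """Terms ordered by number of locations, most frequent first."""
    return sorted(index, key=lambda t: len(index[t]), reverse=True)
```

Lookup then becomes a set retrieval (`index["disk"]`) instead of a scan over the storage units, which is the speedup the item describes.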

94) Deletion policies/retention period: This must be a configurable setting that helps ensure information is not erased prior to the expiration of a policy, which in this case could be the retention period. At the same time, the retention period could also be set as "undetermined" when content is archived, and then given a specific retention period at the time of an event.
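A sketch of that check, with the field names and the sentinel assumed: deletion is refused before the retention period expires, and an "undetermined" period blocks deletion entirely until an event replaces it with a concrete value.

```python
# Retention check: content may not be erased before its retention
# period expires; "undetermined" retention blocks deletion outright.

from datetime import datetime, timedelta

UNDETERMINED = None   # retention not yet fixed, e.g. archived content

def can_delete(created_at, retention_days, now=None):
    if retention_days is UNDETERMINED:
        return False                   # hold until an event sets a period
    now = now or datetime.utcnow()
    return now >= created_at + timedelta(days=retention_days)
```

Making the sentinel fail closed is the important design choice: an unset policy should never default to "deletable".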

95) Reconfiguration: Most storage products are subject to pools of available resources managed by policies that can change from time to time. Whenever the server resources are changed, they must be changed in one operation so that the system presents a consistent view to all usages going forward. Such a system-wide change is a reconfiguration and is often implemented across storage products.