Monday, November 26, 2018

Today we continue discussing the best practices from storage engineering:
100) Conformance to verbs: The service-oriented architecture framework for providing web services defined a contract and behavior, in addition to an address and binding, for each service; the general shift in the industry, however, has been from that architecture towards RESTful services. This paradigm introduces well-known verbs for the operations permitted. Storage products that provide RESTful services must conform to the well-defined mapping of HTTP verbs to create-read-update-delete operations on their resources.
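
As a minimal sketch of this mapping, the dispatcher below routes HTTP verbs to create-read-update-delete operations. The in-memory dict standing in for a resource store, and the `handle` function itself, are purely illustrative:

```python
# Sketch: the conventional mapping of HTTP verbs to CRUD operations
# that a RESTful storage service is expected to honor. The "store"
# dict is a stand-in for real storage.

store = {}

def handle(verb, key, value=None):
    """Dispatch an HTTP verb to the corresponding CRUD operation."""
    if verb == "POST":          # create
        store[key] = value
        return 201
    if verb == "GET":           # read
        return store.get(key)
    if verb == "PUT":           # update (idempotent replace)
        store[key] = value
        return 200
    if verb == "DELETE":        # delete
        store.pop(key, None)
        return 204
    return 405                  # verb outside the well-defined mapping
```

A real service would also layer in status codes such as 404 and the idempotency guarantees each verb implies.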

101) Storage products tend to form a large code base, which significantly hurts developer productivity when the build takes more than a few minutes. Consequently, the code base may need to be constantly refactored, or the build needs to be completed with more workers, more memory, and profiling.

102) Profiling is not limited to build time. Like the performance counters mentioned earlier, there is a way to build instrumented code so that bottlenecks may be identified. As with build profiling, this has to be repeated and the trends need to be monitored.
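
One hedged illustration of building instrumentation into code: a decorator (all names here are invented for the example) that accumulates per-function call counts and elapsed time, so repeated runs can be compared and trends monitored:

```python
import time
from collections import defaultdict

# Sketch: minimal instrumentation recording cumulative time and call
# counts per function; repeated profiling runs can then be diffed.

profile_data = defaultdict(lambda: {"calls": 0, "seconds": 0.0})

def instrumented(fn):
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return fn(*args, **kwargs)
        finally:
            entry = profile_data[fn.__name__]
            entry["calls"] += 1
            entry["seconds"] += time.perf_counter() - start
    return wrapper

@instrumented
def write_block(data):
    return len(data)   # stand-in for a real I/O path
```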

103) Stress testing the storage product also helps gain valuable insights into whether the product’s performance changes over time. This covers everything from memory leaks to resource starvation.
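
A toy stress harness along these lines, using Python's tracemalloc to watch for sustained memory growth across iterations; a production harness would run far longer and also track handles, threads, and other resources:

```python
import tracemalloc

# Sketch: run an operation many times and compare traced memory
# against a post-warm-up baseline; sustained growth suggests a leak.

def stress(operation, iterations=1000):
    tracemalloc.start()
    operation()                          # warm up caches and lazy allocations
    baseline, _ = tracemalloc.get_traced_memory()
    for _ in range(iterations):
        operation()
    current, _ = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return current - baseline            # bytes retained across the run

leaky_sink = []                          # deliberate leak target for testing

def well_behaved():
    list(range(100))                     # allocations freed each call
```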

104) Diagnostic tools and scripts, including the log queries used to troubleshoot during development, also become useful artifacts to share with the developer community for the storage product. Even if the storage product is used mostly for archival, there is value in sharing these tools and their documentation with the community.

105) Older versions of the storage product may have had to be diagnosed with scripts and log queries, but bringing these into the current version of the product as diagnostic APIs makes them mainstream. Documentation for these and other APIs makes it easier on the developer community.

Sunday, November 25, 2018

Today we continue discussing the best practices from storage engineering:
95) Reconfiguration: Most storage products are subject to pools of available resources managed by policies that can change from time to time. Whenever the server resources are changed, the changes must be made in one operation so that the system presents a consistent view to all usages going forward. Such a system-wide change is a reconfiguration, and it is often implemented across storage products.
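
A minimal sketch of presenting that consistent view: the configuration is treated as immutable and replaced in a single swap, so readers see either the old pool settings or the new ones, never a mix. The class and pool parameters below are illustrative only:

```python
import threading

# Sketch: reconfiguration as one atomic swap of an immutable config.

class Reconfigurable:
    def __init__(self, config):
        self._config = config          # treated as immutable once published
        self._lock = threading.Lock()

    def reconfigure(self, **changes):
        with self._lock:               # writers serialize; publish is one assignment
            self._config = dict(self._config, **changes)

    def snapshot(self):
        return self._config            # readers always get one consistent view

pools = Reconfigurable({"replicas": 3, "pool_size_gb": 100})
```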

96) Auto-tuning: This is the feedback-loop cycle by which we allow the storage server/appliance/product to perform better, because the dynamic parameters are adjusted to values that better suit the workload.
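
The feedback loop can be sketched as simple hill-climbing: measure, nudge the parameter, keep whatever lowers the observed cost. The latency function and the cache-size parameter are hypothetical stand-ins for real measurements:

```python
# Sketch: auto-tune one dynamic parameter against an observed metric.

def autotune(measure_latency, initial=64, step=16, rounds=10):
    size = initial
    best = measure_latency(size)
    for _ in range(rounds):
        for candidate in (size + step, max(step, size - step)):
            latency = measure_latency(candidate)
            if latency < best:          # keep the better-performing value
                best, size = latency, candidate
    return size

# A toy workload whose sweet spot is a 128-entry cache.
def toy_latency(cache_size):
    return abs(cache_size - 128) + 1
```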

97) Acceptance: This is the user-defined service-level agreement for the APIs to the storage server so that they maintain satisfactory performance, with the advantage that clients can now communicate against a pre-existing contract.

98) Address: This defines how the storage is discovered by the clients. For example, if there were services, this would define how the service would be discovered. If it were a network share, this would define how the remote share would be mapped. While most storage products enable users to create their own addresses for their storage artifacts, not every storage product provides a gateway to those addresses.

99) Binding: A binding protocol defines the transport protocol, encoding and security requirements before the data transfer can be initiated. Although storage products concern themselves with data at rest, they must provide ways to secure data in transit.

100) Conformance to verbs: The service-oriented architecture framework for providing web services defined a contract and behavior, in addition to an address and binding, for each service; the general shift in the industry, however, has been from that architecture towards RESTful services. This paradigm introduces well-known verbs for the operations permitted. Storage products that provide RESTful services must conform to the well-defined mapping of HTTP verbs to create-read-update-delete operations on their resources.

#codingexercise
// Collect nodes whose subtree contains exactly `threshold` leaves.
// Returns the number of leaves under root; matches are appended to result.
int GetNodeWithLeavesEqualToThreshold(Node root, int threshold, ref List<Node> result)
{
    if (root == null) return 0;
    if (root.left == null && root.right == null)
    {
        if (threshold == 1) result.Add(root); // a leaf subtends one leaf: itself
        return 1;
    }
    int left = GetNodeWithLeavesEqualToThreshold(root.left, threshold, ref result);
    int right = GetNodeWithLeavesEqualToThreshold(root.right, threshold, ref result);
    if (left + right == threshold)
    {
        result.Add(root);
    }
    return left + right;
}


Saturday, November 24, 2018

Today we continue discussing the best practices from storage engineering:

92) Words: For the past fifty years that we have learned to persist our data, we have relied on the physical storage being the same for our photos and our documents, and relied on the logical organization over this storage to separate our content, so we may run or edit them respectively. From file-systems to object storage, this physical storage has always been binary, with both the photos and the documents appearing as 0s and 1s. However, text content has syntax and semantics that facilitate query and analytics that are coming of age. Recently, natural language processing and text mining have made significant strides to help us do such things as classify, summarize, annotate, predict, index and look up that were previously not done, and not at such scale as where we save the content today, such as in the cloud. Even as we are expanding our capabilities on text, we have still relied on our fifty-year-old tradition of mapping letters to binary sequences instead of the units of organization in natural language, such as words. Our data structures that store words spell out the letters instead of efficiently encoding the words. Even when we do read words and set up text processing on that content, we limit ourselves to what others tell us about their content. Words may appear not just in documents; they may appear even in such unreadable things as executables. Neither our current storage nor our logical organization is enough to fully locate all items of interest; we need ways to expand our definitions of both.

93) Inverted Lists: We have referred to collections both in the organization of data and in the queries over data. Another way we facilitate search over the data is by maintaining inverted lists of terms from the storage organizational units. This enables a faster lookup of the locations corresponding to the presence of a search term. The inverted list may be constantly updated so that it remains consistent with the data. The lists are also helpful for gathering an overall ordering of terms by their occurrences.
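
A small sketch of such an inverted list, with postings per term and an overall ordering of terms by occurrence; the documents and their contents are invented for the example:

```python
from collections import defaultdict, Counter

# Sketch: term -> documents postings, plus term frequencies for ordering.

class InvertedIndex:
    def __init__(self):
        self.postings = defaultdict(set)
        self.frequencies = Counter()

    def add(self, doc_id, text):
        # Updating on ingest keeps the list consistent with the data.
        for term in text.lower().split():
            self.postings[term].add(doc_id)
            self.frequencies[term] += 1

    def lookup(self, term):
        return sorted(self.postings.get(term.lower(), set()))

    def top_terms(self, n):
        return [t for t, _ in self.frequencies.most_common(n)]

index = InvertedIndex()
index.add("doc1", "replica placement and replica repair")
index.add("doc2", "replica placement policy")
```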

94) Deletion policies/retention period: This must be a configurable setting which helps ensure that information is not erased prior to the expiration of a policy, which in this case could be the retention period. At the same time, this retention period could also be set as "undetermined" when content is archived, but given a specific retention period at the time of an event.
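
One way to model the configurable retention check, with None standing in for the "undetermined" period of archived content (that encoding is an assumption of this sketch):

```python
from datetime import datetime, timedelta

# Sketch: erasure is allowed only once the retention period has expired;
# None means "undetermined", so the content is never yet eligible.

def may_erase(created, retention_days, now):
    if retention_days is None:          # undetermined retention
        return False
    return now >= created + timedelta(days=retention_days)

created = datetime(2018, 1, 1)
```

At the triggering event, the policy would replace None with a concrete number of days.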

95) Reconfiguration: Most storage products are subject to pools of available resources managed by policies that can change from time to time. Whenever the server resources are changed, the changes must be made in one operation so that the system presents a consistent view to all usages going forward. Such a system-wide change is a reconfiguration, and it is often implemented across storage products.

Friday, November 23, 2018

Today we continue enumerating the best practices from storage engineering:
Data Types: There are some universally popular data types such as integers, floats, dates and strings that are easy to recognize even on the wire, and most storage engineering products also see them as ways of packing bytes in a byte sequence. However, storage engineering products, including log indexes, are actually in a far better position to expose more data types for the data they store than any other product, because the data finally persists in these products. Conventional relational databases made sense of the data in their tables only because the data type was registered with them. Not all storage engineering products have that luxury. Log indexes, for example, ingest data without user interaction. The ability to infer data types and auto-register them to facilitate richer forms of search and analytics is not popular, although it holds a lot of promise. Most products work largely by inferring fields rather than types within the data, because it gives them a way to allow users to search and analyze using tags that are familiar to them. However, looking up fields together as types only adds a little more aggregation on the server side while improving the convenience to the users.
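
A hedged sketch of such inference over ingested field values; the set of recognized types and the single date format here are deliberately minimal:

```python
from datetime import datetime

# Sketch: infer a data type for a raw field value, the kind of
# auto-registration a log index could do at ingest time.

def infer_type(value):
    for parse, name in ((int, "integer"), (float, "float")):
        try:
            parse(value)
            return name
        except ValueError:
            pass
    try:
        datetime.strptime(value, "%Y-%m-%d")   # one illustrative date format
        return "date"
    except ValueError:
        return "string"                        # fallback when nothing matches
```

Registering the inferred type alongside the field name would let the server aggregate fields as types.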

Words: For the past fifty years that we have learned to persist our data, we have relied on the physical storage being the same for our photos and our documents, and relied on the logical organization over this storage to separate our content, so we may run or edit them respectively. From file-systems to object storage, this physical storage has always been binary, with both the photos and the documents appearing as 0s and 1s. However, text content has syntax and semantics that facilitate query and analytics that are coming of age. Recently, natural language processing and text mining have made significant strides to help us do such things as classify, summarize, annotate, predict, index and look up that were previously not done, and not at such scale as where we save the content today, such as in the cloud. Even as we are expanding our capabilities on text, we have still relied on our fifty-year-old tradition of mapping letters to binary sequences instead of the units of organization in natural language, such as words. Our data structures that store words spell out the letters instead of efficiently encoding the words. Even when we do read words and set up text processing on that content, we limit ourselves to what others tell us about their content. Words may appear not just in documents; they may appear even in such unreadable things as executables. Neither our current storage nor our logical organization is enough to fully locate all items of interest; we need ways to expand our definitions of both.

Thursday, November 22, 2018

Today we continue discussing the best practices from storage engineering:

85) Statistics – We referred to statistics enabled counters earlier for the components of the storage server. This section merely refers to client-based statistics for the entire storage product whenever possible so that there can be differentiated tuning to workloads based on the data gathered from the server usage.

86) Tracers: As an extension of the above method for studying workloads, the usage of storage artifacts by a given workload may not always be known. In such cases, it is better for the storage server to inject markers or tracers to view the data path.

87) User versus system boundary: Many security vulnerabilities manifest themselves when code gets executed with user context rather than with system context. The execution of code in system context is privileged and maintains a few assumptions, including that it is the source of truth. Therefore, switching from user to system context is required wherever we can demarcate the boundary. If the context switching is missing, then it is likely that the code can be executed with user context.

88) Lines of control – Even when the code path for admission into the system has a clear user and system context defined, the user context is established only when the execution traverses the lines of authentication and authorization. Consequently, all user-facing entry points need to guarantee proper exception handling to minimize security risks from the line of control.

89) Impersonation – Usually identities are not switched by the system because most of the system code is executed with its own identity. However, there are cases when code is executed in user context in which case a system thread may need to use the security context of the user. Impersonation opens up a new dimension for tests when identities are switched between two user accounts and is generally best avoided.

90) Routines at the user-system boundary- When the boundary between user and system is clearly demarcated and secured, it facilitates the execution of common routines such as auditing, logging, exception handling and translations, resetting contexts and so on. In fact, the user-system context boundary is a convenient way to enforce security as well as collect data on the traffic.
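
One way to sketch this checkpoint is a decorator that audits every crossing and translates internal exceptions before they escape the boundary; the error type and the entry point below are invented for the example:

```python
import functools

# Sketch: common routines (auditing, exception translation) enforced
# at the user/system boundary, so every user-facing entry point passes
# through the same checkpoint.

audit_log = []

class StorageError(Exception):
    pass

def boundary(fn):
    @functools.wraps(fn)
    def wrapper(user, *args):
        audit_log.append((user, fn.__name__))    # audit every crossing
        try:
            return fn(user, *args)
        except Exception as exc:                 # translate internal errors
            raise StorageError(str(exc)) from exc
    return wrapper

@boundary
def read_object(user, key):
    if key != "known":
        raise KeyError(key)
    return b"payload"
```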

Wednesday, November 21, 2018

Today we continue discussing the best practices from storage engineering:

80) Interoperability – Most storage products work well when the clients are running on a supported flavor of an operating system. However, supporting other platforms allows the product to expand its usage. Interoperability is not just a convenience for the end-user; it is a reduction in management cost as well.

81) Application Configuration – Most storage products are best served by a static configuration that can determine the parameters to the system. Since product usages span a wide variety of deployments, most products offer at least a few parameters to tune the system. Configuration files have long been used with applications and servers on Unix-flavor platforms, and storage products also make use of them. Files also make it easy for listeners to watch for changes.

82) Dynamic Configuration – Applications and services have not only used configuration based on static files but also used dynamic configuration, which may be obtained from external configuration sources. Since the value is not updated by touching the production server, this approach finds appeal in cases where parameters need constant tuning without involving the production server.
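
A minimal sketch of dynamic configuration: values are pulled from an external source on refresh rather than by editing files on the server. The callable source and the parameter names are stand-ins for a real configuration service:

```python
# Sketch: dynamic configuration layered over static defaults; refresh()
# pulls the latest values from an external source.

class DynamicConfig:
    def __init__(self, source, defaults):
        self._source = source           # callable returning external values
        self._values = dict(defaults)   # static baseline

    def refresh(self):
        self._values.update(self._source())

    def get(self, key):
        return self._values[key]

external = {"cache_ttl_seconds": 30}    # pretend external config store
config = DynamicConfig(lambda: external, {"cache_ttl_seconds": 60})
```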

83) Platform-independent client library – Most frontends rely on some form of JavaScript for client-side scripting. However, JavaScript is also popular in server-side programming as a Node.js server. While portals and application servers are portable when written in JavaScript, the same applies equally to any client library or agent interacting with the cloud server.

84) External management tools – For object storage, S3 has been a primary API for the control and data paths to the storage server. Management tools that are cloud-agnostic provide tremendous leverage for bulk automation and datacenter computing. Consequently, storage products must strive to conform to these tools in ways that utilize already streamlined channels, such as well-published APIs, whenever possible.

85) Statistics – We referred to statistics enabled counters earlier for the components of the storage server. This section merely refers to client-based statistics for the entire storage product whenever possible so that there can be differentiated tuning to workloads based on the data gathered from the server usage.

Tuesday, November 20, 2018

Today we continue discussing the best practices from storage engineering:

75) Cachepoints – Cachepoints are used with consistent hashing. Cachepoints are arranged along a circle depicting the key range, and cache objects correspond to the range covered by a cachepoint. Virtual nodes can join and leave the network without impacting the operation of the ring.
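
The cachepoint arrangement can be sketched with a sorted list of hashed virtual points; the node names, the vnode count, and the choice of MD5 are illustrative only:

```python
import bisect
import hashlib

# Sketch: cachepoints on a consistent-hash ring. Each node is placed at
# several virtual points; a key maps to the first cachepoint clockwise.

def _point(label):
    return int(hashlib.md5(label.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, nodes, vnodes=8):
        self._points = []
        for node in nodes:
            for i in range(vnodes):
                self._points.append((_point(f"{node}#{i}"), node))
        self._points.sort()

    def node_for(self, key):
        h = _point(key)
        index = bisect.bisect(self._points, (h,)) % len(self._points)
        return self._points[index][1]

ring = Ring(["cache-a", "cache-b", "cache-c"])
```

Because removing a node deletes only its own points, keys whose successor cachepoint belonged to another node keep their assignment, which is why virtual nodes can join and leave without disrupting the ring.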

76) Stream/Batch/Sequential processing: Storage products often distinguish themselves as serving stream processing, batch processing or sequential processing. Yet the factors that determine the choice are equally applicable to the components within the product when they are not necessarily restricted by the overall design. There are ways to convert one form of processing into another, which drives down the cost. For example, event processing has largely been stream-based.

77) Joins – Relational data has made remarkable use of joins over tuples of data, involving storage and query improvements to handle these cases. Components within products that are used for unstructured data often have to perform some form of matching between collections. The straightforward way to implement this has been iterators over one or more collections that are filtered based on conditions that evaluate those collections. However, it helps to look up associations whenever possible by ways and means that can improve performance. Judicious choice of such techniques is always welcome wherever possible.
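
The difference between nested iterators and an association lookup can be sketched as a hash join; the volume and pool rows are invented for the example:

```python
from collections import defaultdict

# Sketch: replace a nested-iterator match between two collections with
# a hash lookup, one probe per row instead of a full inner scan.

def hash_join(left, right, key):
    buckets = defaultdict(list)
    for row in right:                         # build side: index right by key
        buckets[row[key]].append(row)
    joined = []
    for row in left:                          # probe side: one lookup per row
        for match in buckets.get(row[key], []):
            joined.append({**row, **match})
    return joined

volumes = [{"volume": "v1", "pool": "ssd"}, {"volume": "v2", "pool": "hdd"}]
pools = [{"pool": "ssd", "iops": 5000}, {"pool": "hdd", "iops": 500}]
```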

78) Strategies – The implementation of a certain piece of data-processing logic within a storage product often has a customized implementation maintained with the component as it improves from version to version. Very little effort is usually spent on externalizing the strategy across components to see what may belong in a shared category and potentially benefit the components. Even if there is only one strategy ever used with a component, this technique allows other strategies to be tried out independently of the product usage.

79) Plug-and-play architecture – The notion of plugins that work irrespective of the components and layers in a storage stack is well understood and part of software design. Yet the standardization of the interface such that it is applicable across implementations is often left pending for later. Instead, up-front standardization of interfaces promotes an ecosystem and adds convenience for the user.
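
A minimal sketch of such up-front standardization: one interface, with implementations registered by name so the layers above stay unchanged as plugins are added. The compression interface and its toy implementation are hypothetical:

```python
# Sketch: a plug-and-play registry behind one standardized interface.

class CompressionPlugin:
    name = "identity"

    def compress(self, data: bytes) -> bytes:
        return data

registry = {}

def register(plugin_cls):
    registry[plugin_cls.name] = plugin_cls()
    return plugin_cls

@register
class ReversePlugin(CompressionPlugin):   # toy implementation of the interface
    name = "reverse"

    def compress(self, data):
        return data[::-1]

def compress_with(name, data):
    return registry[name].compress(data)  # caller never sees the concrete class
```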

80) Interoperability – Most storage products work well when the clients are running on a supported flavor of an operating system. However, supporting other platforms allows the product to expand its usage. Interoperability is not just a convenience for the end-user; it is a reduction in management cost as well.