Friday, November 30, 2018

Today we continue discussing the best practice from storage engineering:

119) When storage operations don’t go as planned, exceptions need to be raised and reported. Since exceptions bubble up from deep layers, they need to be properly wrapped and translated to be actionable to the user. Such exception handling and chaining often breaks, leading to costly troubleshooting. Consequently, code revisits and cleanup become a routine chore.

120) Exceptions and alerts don’t matter to the customer if they don’t come with wording that explains the mitigating action the user needs to take. Error code, level and severity are other useful ways to describe an error. Diligence in preparing better error messages goes a long way toward helping end users.
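
To make this concrete, here is a minimal C# sketch of wrapping a low-level failure, assuming a hypothetical StorageException type and error code; the inner exception preserves the chain for troubleshooting while the message carries a code, severity and a suggested mitigation:

using System;
using System.IO;

// Hypothetical user-facing exception that carries actionable context.
public class StorageException : Exception
{
    public string ErrorCode { get; }
    public string Severity { get; }
    public string Mitigation { get; }

    public StorageException(string code, string severity, string mitigation,
                            string message, Exception inner)
        : base(message, inner)
    {
        ErrorCode = code;
        Severity = severity;
        Mitigation = mitigation;
    }
}

public static class VolumeReader
{
    public static byte[] ReadBlock(string path, int offset, int count)
    {
        try
        {
            using (var stream = File.OpenRead(path))
            {
                var buffer = new byte[count];
                stream.Seek(offset, SeekOrigin.Begin);
                stream.Read(buffer, 0, count); // partial reads ignored for brevity
                return buffer;
            }
        }
        catch (IOException e)
        {
            // Wrap the low-level exception instead of letting it bubble up raw;
            // keeping it as the inner exception preserves the chain.
            // "STG-1021" is an illustrative, made-up error code.
            throw new StorageException(
                "STG-1021", "Error",
                "Verify the volume is mounted and retry; contact support if the error persists.",
                $"Failed to read {count} bytes at offset {offset} from '{path}'.", e);
        }
    }
}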

121) The ratio of background jobs to data-path workers is important. It is easy to delegate jobs to the background in order to keep the data path fast. However, if there is only one data-path worker and the number of background jobs is very high, then efficiency drops and message passing increases. Instead, it might be better to serialize the tasks on the same worker. The trade-off is even more glaring when the background workers poll or execute in scheduled cycles, because that introduces delays.

122) Event-based programming is harder to coordinate and diagnose than sequential programming, yet it is widely used in storage drivers and even in user-mode components that do not need to be highly responsive or where there may be significant delay between action triggers. It requires a driver verifier to analyze all the code paths. In many cases, synchronous execution suffices, combined with object-oriented design for better organization and easier troubleshooting. While it is possible to mix the two, keeping the execution aligned with the timeline in the logs for the activities performed by the storage product helps reduce the overall cost of maintenance.

Thursday, November 29, 2018

Today we continue discussing the best practice from storage engineering:

115) Upgrade scenarios: As with any other product, a storage server has similar concerns about changes to data structures or requests/responses. While it is important for each feature to be backward compatible, it is also important to have a design that can introduce flexibility without encumbrance.

116) Multi-purpose applicability of logic: When we covered diagnostic tools and scripts, we mentioned logic that makes use of feedback from monitored data paths. This is one example, but verification and validation such as these are equally applicable to external monitoring and troubleshooting of the product. The same logic may apply in a diagnostic API as well as in product code as active data-path corrections. Furthermore, such logic may be called from the user interface, the command line or other SDKs. Therefore, validations throughout the product are candidates for repurposing.

117) Read-only/read-write: It is better to separate the read-only from the read-write portion of data because it separates how tasks access the data. Usually, online transaction processing can be isolated to read-write while analytical processing can be confined to read-only. The same holds true for static plans versus dynamic policies, and for server-side resources versus client-side policies.

118) While the control path is easier to describe and maintain, the data path is more difficult to determine upfront because customers use it any way they want. When we discussed assigning labels to incoming data and workloads, it was a way to classify the usages of the product so we can gain insight into how it is being used. In a feedback cycle, such labels provide a convenient understanding of the temporal and spatial nature of the data flow.

119) When storage operations don’t go as planned, exceptions need to be raised and reported. Since exceptions bubble up from deep layers, they need to be properly wrapped and translated to be actionable to the user. Such exception handling and chaining often breaks, leading to costly troubleshooting. Consequently, code revisits and cleanup become a routine chore.

120) Exceptions and alerts don’t matter to the customer if they don’t come with wording that explains the mitigating action the user needs to take. Error code, level and severity are other useful ways to describe an error. Diligence in preparing better error messages goes a long way toward helping end users.


Wednesday, November 28, 2018

Today we continue discussing the best practice from storage engineering:
111) Memory configuration: In a cluster environment, most of the nodes are commodity hardware. Typically, they have reasonable memory. However, the amount of storage data that can be processed by a node depends on fitting the corresponding data structures in memory. The larger the memory, the higher the capability of the server component in the control path. Therefore, there must be some choice of memory versus capability in the overall topology of the server so that it can be recommended to customers.
112) CPU configuration: Typically, VMs added as nodes to a storage cluster come in T-shirt-size configurations with the number of CPUs and the memory defined for each size. There is no restriction on the storage server being deployed in a container or a single virtual machine. And since the virtualization of compute makes it harder to tell how the host scales up, the general rule of thumb has been the more the better. This does not need to be so, and a certain configuration may provide the best advantage. Again, the choice and the recommendation must be conveyed to the customer.
113) Serverless computing: Many functions of a storage server/product are written in the form of microservices, or perhaps as components within layers if they are not separable. However, the notion of modularity can be taken further in the form of serverless computing so that the lease on named compute servers does not affect the storage server.
114) Expansion: Although some configurations may be optimal, the storage server must be flexible about what the end user wants as the configuration of system resources. Availability of flash storage and its configuration via external additions to the hardware is a special case, but upgrading the storage server from one hardware generation to another must be permitted.
115) Upgrade scenarios : As with any other product, a storage server also has similar concerns for changes to data structures or requests/responses. While it is important for each feature to be backward compatible, it is also important to have a design which can introduce flexibility without encumberance.

Tuesday, November 27, 2018

Today we continue discussing the best practice from storage engineering:

106) The CAP theorem states that a system cannot have availability, consistency, and partition tolerance all at the same time. However, it is possible to work around this when we use layering and design the system around a specific fault model. The append-only layer provides high availability. The partitioning layer provides strong consistency guarantees. Together they can handle a specific set of faults with high availability and strong consistency.
107) Workload profiles: Every storage engineering product will have data I/O, and one of the ways to try out the product is to use a set of workload profiles with varying patterns of data access.
108) Intra/Inter: Since data I/O crosses multiple layers, a lower layer may perform operations that are similarly applicable to artifacts in another layer at a higher scope. For example, replication may be between copies of objects within a single PUT operation and may also be equally applicable to objects spanning sites designated for content replication. This not only emphasizes reusability but also provides a way to check the architecture for consistency.
109) Local/Remote: While many components within the storage server take the disk operations to be local, there are certain components that gather information across disks and components directly writing to them. In such cases, even if the disk is local, it is more consistent to access local disks via a loopback and simplify the logic by treating every such operation as remote.
110) Resource consumption: We referred to performance engineering for improving the data operations. However, the number of resources used per request was not called out, because it may be perfectly acceptable if the elapsed time is within bounds. That said, resource conservation has a lot to do with reducing interactions, which in turn leads to efficiency.

Monday, November 26, 2018

Today we continue discussing the best practice from storage engineering:
100) Conformance to verbs: The service-oriented architecture framework for providing web services defined contract and behavior, in addition to address and binding, for services, but the general shift in the industry has been toward RESTful services from that architecture. This paradigm introduces well-known verbs for the operations permitted. Storage products that provide RESTful services must conform to the well-defined mapping of verbs to create-update-delete operations on their resources.
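
As a rough illustration, a verb-to-operation mapping might look like the following C# sketch; a real service would rely on a web framework’s routing, and the dispatcher here is an assumption for illustration:

using System;

// Hypothetical mapping of HTTP verbs to CRUD operations on resources.
public enum Crud { Create, Read, Update, Delete }

public static class VerbMapper
{
    public static Crud Map(string httpVerb)
    {
        switch (httpVerb.ToUpperInvariant())
        {
            case "POST":   return Crud.Create;  // create a new resource
            case "GET":    return Crud.Read;    // read, must be side-effect free
            case "PUT":    return Crud.Update;  // full update, idempotent
            case "PATCH":  return Crud.Update;  // partial update
            case "DELETE": return Crud.Delete;  // delete, idempotent
            default: throw new ArgumentException("Unsupported verb: " + httpVerb);
        }
    }
}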

101) Storage products tend to have a large code base, which significantly hurts developer productivity when a build takes more than a few minutes. Consequently, the code base may need to be constantly refactored, or the build needs to be completed with more workers, memory and profiling.

102) Profiling is not limited to build time. Like the performance counters mentioned earlier, there is a way to build instrumented code so that bottlenecks may be identified. Like build profiling, this has to be repeated and the trends need to be monitored.

103) Stress testing the storage product also helps gain valuable insight into whether the product’s performance changes over time. This covers everything from memory leaks to resource starvation.

104) Diagnostic tools and scripts, including the log queries used to troubleshoot during development, also become useful artifacts to share with the developer community for the storage product. Even if the storage product is used mostly for archival, there is value in sharing these, along with documentation, with the community.

105) Older versions of the storage product may have had to be diagnosed with scripts and log queries, but bringing them into the current version of the product as a diagnostic API makes them mainstream. Documentation for these and other APIs makes it easier on the developer community.

Sunday, November 25, 2018

Today we continue discussing the best practice from storage engineering:
95) Reconfiguration: Most storage products are subject to pools of available resources managed by policies that can change from time to time. Whenever the server resources are changed, the change must be made in one operation so that the system presents a consistent view to all usages going forward. Such a system-wide change is a reconfiguration and is implemented across many storage products.

96) Auto-tuning: This is the feedback-loop cycle that allows the storage server/appliance/product to perform better because the dynamic parameters are adjusted to values that better suit the workload.

97) Acceptance: This is the user-defined service-level agreement for the APIs of the storage server so that they maintain satisfactory performance, with the advantage that the clients can now communicate with a pre-existing contract.

98) Address: This defines how the storage is discovered by the clients. For example, if there were services, this would define how a service would be discovered. If it were a network share, this would define how the remote share would be mapped. While most storage products enable users to create their own addresses for their storage artifacts, not every storage product provides a gateway to those addresses.

99) Binding: A binding protocol defines the transport protocol, encoding and security requirements before data transfer can be initiated. Although storage products concern themselves with data at rest, they must also provide ways to secure data in transit.

100) Conformance to verbs: The service-oriented architecture framework for providing web services defined contract and behavior, in addition to address and binding, for services, but the general shift in the industry has been toward RESTful services from that architecture. This paradigm introduces well-known verbs for the operations permitted. Storage products that provide RESTful services must conform to the well-defined mapping of verbs to create-update-delete operations on their resources.

#codingexercise
int GetNodeWithLeavesEqualToThreshold(Node root, int threshold, ref List<Node> result)
{
    if (root == null) return 0;
    if (root.left == null && root.right == null) return 1;
    // count the leaves in each subtree
    int left = GetNodeWithLeavesEqualToThreshold(root.left, threshold, ref result);
    int right = GetNodeWithLeavesEqualToThreshold(root.right, threshold, ref result);
    // collect nodes whose subtree has exactly `threshold` leaves
    if (left + right == threshold)
    {
        result.Add(root);
    }
    return left + right;
}


Saturday, November 24, 2018

Today we continue discussing the best practice from storage engineering: 

92) Words: For the past fifty years that we have learned to persist our data, we have relied on the physical storage being the same for our photos and our documents and relied on the logical organization over this storage to separate our content, so we may run or edit them respectively. From file systems to object storage, this physical storage has always been binary, with both the photos and documents appearing as 0s and 1s. However, text content has syntax and semantics that facilitate query and analytics that are coming of age. Recently, natural language processing and text mining have made significant strides to help us do such things as classify, summarize, annotate, predict, index and look up content in ways that were previously not done, and not at the scale at which we save data today, such as in the cloud. Even as we expand our capabilities on text, we still rely on our fifty-year-old tradition of mapping letters to binary sequences instead of the units of organization in natural language, such as words. Our data structures that store words spell out the letters instead of efficiently encoding the words. Even when we do read words and set up text processing on that content, we limit ourselves to what others tell us about their content. Words may appear not just in documents; they may appear even in such unreadable things as executables. Neither our current storage nor our logical organization is enough to fully locate all items of interest; we need ways to expand our definitions of both.

93) Inverted lists: We have referred to collections both in the organization of data and in the queries over data. Another way we facilitate search over the data is by maintaining inverted lists of terms from the storage organizational units. This enables a faster lookup of the locations corresponding to the presence of a search term. The inverted list may be constantly updated so that it remains consistent with the data. The lists are also helpful for gathering an overall ordering of terms by their occurrences.
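
A minimal C# sketch of an inverted list, using a hypothetical numeric docId as the location unit:

using System;
using System.Collections.Generic;

// A minimal inverted list: term -> sorted set of document ids.
public class InvertedIndex
{
    private readonly Dictionary<string, SortedSet<long>> postings =
        new Dictionary<string, SortedSet<long>>();

    public void Add(long docId, IEnumerable<string> terms)
    {
        foreach (var term in terms)
        {
            SortedSet<long> docs;
            if (!postings.TryGetValue(term, out docs))
            {
                docs = new SortedSet<long>();
                postings[term] = docs;
            }
            docs.Add(docId);
        }
    }

    // Lookup returns the locations where the search term occurs.
    public IEnumerable<long> Lookup(string term)
    {
        SortedSet<long> docs;
        return postings.TryGetValue(term, out docs) ? docs : new SortedSet<long>();
    }
}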

94) Deletion policies/retention period: This must be a configurable setting that helps ensure information is not erased prior to the expiration of a policy, which in this case could be the retention period. At the same time, the retention period could also be set as "undetermined" when content is archived and then given a specific retention period at the time of an event.

95) Reconfiguration: Most storage products are subject to pools of available resources managed by policies that can change from time to time. Whenever the server resources are changed, the change must be made in one operation so that the system presents a consistent view to all usages going forward. Such a system-wide change is a reconfiguration and is implemented across many storage products.

Friday, November 23, 2018

Today we continue enumerating the best practice from storage engineering:
91) Data types: There are some universally popular data types such as integers, floats, dates and strings that are easy to recognize even on the wire, and most storage engineering products also see them as ways of packing bytes in a byte sequence. However, storage engineering products, including log indexes, are actually far more convenient places to expose more data types for the data they store than any other product, because the data finally persists in these products. Conventional relational databases made sense of the data in tables only because the data type was registered with them. Not all storage engineering products have that luxury. Log indexes, for example, ingest data without user interaction. The ability to infer data types and auto-register them to facilitate richer forms of search and analytics is not popular, although it holds a lot of promise. Most products work largely by inferring fields rather than types within the data because it gives them a way to let users search and analyze using tags that are familiar to them. However, looking up fields together as types only adds a little more aggregation on the server side while improving convenience for the users.

92) Words: For the past fifty years that we have learned to persist our data, we have relied on the physical storage being the same for our photos and our documents and relied on the logical organization over this storage to separate our content, so we may run or edit them respectively. From file systems to object storage, this physical storage has always been binary, with both the photos and documents appearing as 0s and 1s. However, text content has syntax and semantics that facilitate query and analytics that are coming of age. Recently, natural language processing and text mining have made significant strides to help us do such things as classify, summarize, annotate, predict, index and look up content in ways that were previously not done, and not at the scale at which we save data today, such as in the cloud. Even as we expand our capabilities on text, we still rely on our fifty-year-old tradition of mapping letters to binary sequences instead of the units of organization in natural language, such as words. Our data structures that store words spell out the letters instead of efficiently encoding the words. Even when we do read words and set up text processing on that content, we limit ourselves to what others tell us about their content. Words may appear not just in documents; they may appear even in such unreadable things as executables. Neither our current storage nor our logical organization is enough to fully locate all items of interest; we need ways to expand our definitions of both.

Thursday, November 22, 2018

Today we continue discussing the best practice from storage engineering:

85) Statistics – We referred to statistics-enabled counters earlier for the components of the storage server. This item refers to client-based statistics for the entire storage product, wherever possible, so that there can be differentiated tuning for workloads based on the data gathered from server usage.

86) Tracers: As an extension of the above method for studying workloads, the usage of storage artifacts by a given workload may not always be known. In such cases, it is better for the storage server to inject markers or tracers to view the data path.

87) User versus system boundary: Many security vulnerabilities manifest when code gets executed in user context rather than system context. The execution of code in system context is privileged and carries a few assumptions, including that it is the source of truth. Therefore, the switch from user to system context is required wherever we can demarcate the boundary. If the context switch is missing, then it is likely that the code can be executed in user context.

88) Lines of control – Even when the code path for admission into the system has clear user and system contexts defined, the user context is established only when execution traverses the lines of authentication and authorization. Consequently, all user-facing entry points need to guarantee proper exception handling to minimize security risks along the line of control.

89) Impersonation – Usually identities are not switched by the system, because most system code executes with the system’s own identity. However, there are cases when code executes in user context, in which case a system thread may need to use the security context of the user. Impersonation opens up a new dimension of tests when identities are switched between two user accounts and is generally best avoided.

90) Routines at the user-system boundary – When the boundary between user and system is clearly demarcated and secured, it facilitates the execution of common routines such as auditing, logging, exception handling and translation, resetting contexts and so on. In fact, the user-system context boundary is a convenient place to enforce security as well as to collect data on the traffic.

Wednesday, November 21, 2018

Today we continue discussing the best practice from storage engineering:

80) Interoperability – Most storage products work well when the clients run on a supported flavor of an operating system. However, this consideration allows the product to expand its usage. Interoperability is not just a convenience for the end user; it is a reduction in management cost as well.

81) Application configuration – Most storage products are best served by a static configuration that determines the parameters of the system. Since product usage spans a wide variety of deployments, most products offer at least a few parameters to tune the system. Configuration files have been used with applications and servers on Unix-flavor platforms, and storage products make use of them too. Files also make it easy for listeners to watch for changes.

82) Dynamic configuration – Applications and services use not only configuration based on static files but also dynamic configuration, which may be obtained from external configuration sources. Since the value is not updated by touching the production server, this finds appeal in cases where parameters need constant tuning that has to be done without involving the production server.

83) Platform-independent client library – Most frontends rely on some form of JavaScript for client-side scripting. However, JavaScript is also popular in server-side programming as a Node.js server. While portals and application servers are portable when written in JavaScript, the same applies equally to any client library or agent interaction for the cloud server.

84) External management tools – For object storage, S3 has been the primary API for the control and data path to the storage server. Management tools that are cloud agnostic provide tremendous leverage for bulk automation and datacenter computing. Consequently, storage products must strive to conform to these tools in ways that utilize already streamlined channels, such as well-published APIs, whenever possible.

85) Statistics – We referred to statistics-enabled counters earlier for the components of the storage server. This item refers to client-based statistics for the entire storage product, wherever possible, so that there can be differentiated tuning for workloads based on the data gathered from server usage.

Tuesday, November 20, 2018

Today we continue discussing the best practice from storage engineering:

75) Cachepoints – Cachepoints are used with consistent hashing. They are arranged along the circle depicting the key range and cache the objects corresponding to their range. Virtual nodes can join and leave the network without impacting the operation of the ring.

76) Stream/batch/sequential processing: Storage products often distinguish themselves as serving stream processing, batch processing or sequential processing. Yet the factors that determine the choice are equally applicable to the components within the product when they are not necessarily restricted by the overall design. There are ways to convert one form of processing into another, which drives down cost. For example, event processing has largely been stream-based.

77) Joins – Relational data has made remarkable use of joins over tuples, with storage and query improvements to handle these cases. Components within products used for unstructured data often encounter some form of matching between collections. The straightforward way to implement these has been iterators over one or more collections, filtered by conditions that evaluate those collections. However, it helps to look up associations whenever possible by means that can improve performance. Judicious choice of such techniques is always welcome wherever possible.

78) Strategies – The implementation of a certain piece of data-processing logic within a storage product is often customized and maintained with the component as it improves from version to version. Very little effort is usually spent on externalizing the strategy across components to see what may belong in a shared category and potentially benefit other components. Even if there is only one strategy ever used with a component, this technique allows other strategies to be tried out independent of the product usage.

79) Plug-and-play architecture – The notion of plugins that work irrespective of the components and layers in a storage stack is well understood and part of software design. Yet the standardization of the interface, such that it is applicable across implementations, is often left for later. Instead, up-front standardization of interfaces promotes an ecosystem and adds convenience for the user.

80) Interoperability – Most storage products work well when the clients run on a supported flavor of an operating system. However, this consideration allows the product to expand its usage. Interoperability is not just a convenience for the end user; it is a reduction in management cost as well.

Monday, November 19, 2018

Today we continue enumerating the best practice from storage engineering:

70) Topology: Most storage products are deployed as single instances, usually comprising a cluster or software-defined stacks. However, the layout the user chooses should remain as flexible as possible so that it can scale to their requirements. In this regard, each storage product/server/appliance must behave well with other instances in arrangements such as chaining or federation.

71) Virtual time: As the storage server virtualizes storage over heterogeneous media and expands elastically with demand, there is a need to coordinate activities across participating agents. In such cases, the only event sequence that can be established correctly is one based on virtual time.

72) Gossip protocol: In a distributed environment, the best way to detect failures and determine membership is with the help of a gossip protocol. When an existing node leaves the network, it may stop responding to the gossip protocol, so the neighbors become aware. The neighbors update the membership changes and copy data asynchronously.

73) Paxos algorithm: Some systems utilize state machine replication such as Paxos, which combines transaction logging for consensus with write-ahead logging for data recovery. Replicating the state machines lets the system tolerate the failure of a minority of replicas; tolerating Byzantine faults requires extended variants of such protocols.

74) Consistent hashing – Data is partitioned and replicated using consistent hashing to achieve scale and availability. Consistency is facilitated by object versioning. Replicas are maintained during updates based on a quorum-like technique.

75) Cachepoints – Cachepoints are used with consistent hashing. They are arranged along the circle depicting the key range and cache the objects corresponding to their range. Virtual nodes can join and leave the network without impacting the operation of the ring.
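
A minimal C# sketch of such a ring, assuming MD5 as the hash and a configurable count of virtual nodes per cachepoint:

using System;
using System.Collections.Generic;
using System.Security.Cryptography;
using System.Text;

// A minimal consistent-hash ring; virtual nodes let cachepoints join and
// leave without remapping most keys.
public class ConsistentHashRing
{
    private readonly SortedDictionary<uint, string> ring = new SortedDictionary<uint, string>();
    private readonly int virtualNodes;

    public ConsistentHashRing(int virtualNodes = 64) { this.virtualNodes = virtualNodes; }

    private static uint Hash(string key)
    {
        using (var md5 = MD5.Create())
        {
            var digest = md5.ComputeHash(Encoding.UTF8.GetBytes(key));
            return BitConverter.ToUInt32(digest, 0);
        }
    }

    public void AddNode(string node)
    {
        for (int i = 0; i < virtualNodes; i++)
            ring[Hash(node + "#" + i)] = node;
    }

    public void RemoveNode(string node)
    {
        for (int i = 0; i < virtualNodes; i++)
            ring.Remove(Hash(node + "#" + i));
    }

    // Walk clockwise to the first cachepoint at or after the key's position.
    public string GetNode(string key)
    {
        var h = Hash(key);
        foreach (var entry in ring)
            if (entry.Key >= h) return entry.Value;
        foreach (var entry in ring) return entry.Value; // wrap around the ring
        throw new InvalidOperationException("ring is empty");
    }
}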

Sunday, November 18, 2018

Today we continue discussing the best practice from storage engineering:

66) Serialization: There is nothing simpler than bytes and offsets for packing and persisting any data structure. The same holds true in storage engineering. We have referred to messages as a necessity for communication between layers and components. When these messages are written out, it is irrelevant whether the destination is local or remote; serialization comes in useful in both cases. Consequently, serialization and deserialization are required for most entities (a minimal sketch follows this list).

67) Directories: The organization we expect the user to maintain for their storage artifacts is also the same one we utilize ourselves within the storage layer, so that we don’t have to mix and match different entities. Folders help us organize key values in their own collections.

68) Replication strategy: We have referred to replication in storage organization and replication groups earlier, but there may be more than one strategy used for replication. The efficiency of replication is closely tied to the organization and the data-transfer requirements. Simple file-synchronization techniques include events and callbacks to indicate progress, preview the changes to be made, handle conflict resolution and provide graceful error handling per unit of transfer.

69) Number of replicas: Although replication groups are decided by the user and correspond to sites that participate in keeping their contents similar, every data and metadata unit of storage is also a candidate for replication whose count does not need to be configured and can be system defined. A set of three copies is the norm for most such objects and their metadata.

70) Topology: Most storage products are deployed as single instances, usually comprising a cluster or software-defined stacks. However, the layout the user chooses should remain as flexible as possible so that it can scale to their requirements. In this regard, each storage product/server/appliance must behave well with other instances in arrangements such as chaining or federation.
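
A minimal C# serialization sketch for a hypothetical key-value record, using length-prefixed binary encoding; the same bytes can go to a local file or across the wire:

using System;
using System.IO;

// A minimal record serialized to bytes and back.
public struct Record
{
    public string Key;
    public byte[] Value;

    public byte[] Serialize()
    {
        using (var ms = new MemoryStream())
        using (var writer = new BinaryWriter(ms))
        {
            writer.Write(Key);           // length-prefixed string
            writer.Write(Value.Length);  // payload size
            writer.Write(Value);         // payload bytes
            return ms.ToArray();
        }
    }

    public static Record Deserialize(byte[] bytes)
    {
        using (var ms = new MemoryStream(bytes))
        using (var reader = new BinaryReader(ms))
        {
            var record = new Record { Key = reader.ReadString() };
            record.Value = reader.ReadBytes(reader.ReadInt32());
            return record;
        }
    }
}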

Saturday, November 17, 2018

Today we continue discussing the best practice from storage engineering: 

61) Diagnostic queries: As each layer and component of the storage server creates and maintains its own data structures during execution, it helps to query these data structures at runtime to diagnose and troubleshoot erroneous behavior. While some of the queries may be straightforward if the data structures already support some form of aggregation, others may be quite involved and include a number of steps. In all these cases, the queries run against a live system, limited as far as possible to read-only operations.

62) Performance counters: Frequently, subsystems and components take a long time. It is not possible to exhaust diagnostic queries to discover the scope that takes the most time to execute. On the other hand, the code is perfectly clear about call sequences, so such code blocks are easy to identify in the source. Performance counters help measure the elapsed time for the execution of these code blocks.

63) Statistics counters: In addition to the above-mentioned diagnostic tools, we need to perform aggregation over the execution of certain code blocks. While performance counters measure elapsed time, these counters help with aggregations such as count, max, sum, and so on.
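
A minimal C# sketch combining both kinds of counters around a code block; the Measure helper and the counter names are assumptions for illustration:

using System;
using System.Collections.Concurrent;
using System.Diagnostics;

// Wraps a code block with a performance counter (elapsed time)
// and a statistics counter (invocation count).
public static class Counters
{
    private static readonly ConcurrentDictionary<string, long> counts =
        new ConcurrentDictionary<string, long>();
    private static readonly ConcurrentDictionary<string, long> elapsedMs =
        new ConcurrentDictionary<string, long>();

    public static T Measure<T>(string name, Func<T> block)
    {
        var watch = Stopwatch.StartNew();
        try { return block(); }
        finally
        {
            watch.Stop();
            counts.AddOrUpdate(name, 1, (_, c) => c + 1);
            elapsedMs.AddOrUpdate(name, watch.ElapsedMilliseconds,
                                  (_, total) => total + watch.ElapsedMilliseconds);
        }
    }
}

A call site might then read: var bytes = Counters.Measure("disk.read", () => ReadBlock(path, offset, count)); where ReadBlock stands in for any instrumented operation.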

64) Locks: These primitives are often used for thread synchronization. If their use cannot be avoided, it is best to keep them as few and as universal as possible. Partitioning and coordination solve this in many cases; the storage server relies on the latter approach together with versioning.

65) Parallelization: Generally there is no limit enforced on the number of parallel workers in the storage server or the number of partitions each worker operates on. However, the scheduler that interleaves workers works best when there is one active task to perform in any timeslice. Therefore, the number of tasks is ideal when it is one more than the number of processors. A queue helps hold the tasks until their execution. This judicious distribution of tasks improves performance in every layer.
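
A minimal C# sketch of this arrangement, holding tasks in a queue and draining it with one more worker than the processor count:

using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

// A bounded pool of workers draining a shared queue of tasks.
public static class WorkerPool
{
    public static void Run(BlockingCollection<Action> queue)
    {
        int workers = Environment.ProcessorCount + 1;
        var tasks = new Task[workers];
        for (int i = 0; i < workers; i++)
        {
            tasks[i] = Task.Run(() =>
            {
                // GetConsumingEnumerable blocks until work arrives
                // and exits when the queue is marked complete.
                foreach (var work in queue.GetConsumingEnumerable())
                    work();
            });
        }
        Task.WaitAll(tasks);
    }
}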

Friday, November 16, 2018

Today we continue discussing the best practice from storage engineering:

55) HTTPS: Encryption over HTTPS secures data in transit, complementing the encryption that secures data at rest. However, it comes with the onus of key and certificate management. HTTPS by default is not just a mandate over the internet but also a requirement even between departments in the same organization.

56) Key management: We have emphasized that keys are needed for encryption. This calls for the keys themselves to be kept secure. With the help of standardized key-management interfaces, we can use external key managers. Keys should also be rotated every now and then.

57) API security: It is almost a given that any storage service exposes APIs. Every request made over the web must be secured. While there are many authentication protocols, including OAuth, each request is sufficiently secured if it carries an authorization and a digital signature. API keys are not always required.
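
A rough C# sketch of request signing, loosely in the style of S3-like schemes; the exact string-to-sign varies by service and is an assumption here:

using System;
using System.Security.Cryptography;
using System.Text;

// Computes an HMAC-SHA256 signature over a canonical request string.
public static class RequestSigner
{
    public static string Sign(string secretKey, string verb, string resource,
                              string dateHeader)
    {
        // the string-to-sign here is illustrative; real schemes are stricter
        var stringToSign = verb + "\n" + resource + "\n" + dateHeader;
        using (var hmac = new HMACSHA256(Encoding.UTF8.GetBytes(secretKey)))
        {
            var digest = hmac.ComputeHash(Encoding.UTF8.GetBytes(stringToSign));
            return Convert.ToBase64String(digest);
        }
    }
}

The caller sends the signature in the Authorization header; the server recomputes it with the shared secret to verify the request.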

58) Integration with an authentication provider: File-system protocols have been integrated with Active Directory. This enables organizations to take advantage of authorizing domain users. Identity and access management for cloud services can also be used.

59) Auditing: Auditing serves to detect unwanted access and maintain compliance with regulatory agencies. Most storage services enable auditing in each and every component in the control path. This is very much like the logging for components. In addition, the application exposes a way to retrieve the audits.

60) Offloading: Every bookkeeping, auxiliary and routine activity that takes up system resources is a candidate for hardware offloading so long as it does not have significant conditional logic and is fairly isolated. This improves performance in the data path, especially when the activities can be consolidated globally.

#codingexercise
int GetNodeWithHeavierRightLeaves(Node root, ref List<Node> result)
{
    if (root == null) return 0;
    if (root.left == null && root.right == null) return 1;
    // count the leaves in each subtree
    int left = GetNodeWithHeavierRightLeaves(root.left, ref result);
    int right = GetNodeWithHeavierRightLeaves(root.right, ref result);
    // collect nodes whose right subtree has more than two extra leaves
    if (right > left + 2)
    {
        result.Add(root);
    }
    return left + right;
}

Thursday, November 15, 2018

Today we continue discussing the best practice from storage engineering: 

50) Hardware arrangement: One of the most overlooked considerations is the implication of the choice of hardware for storage servers. For example, a chassis with expansion bays for solid-state drives in the front is going to be more popular than others and will set up the storage servers to take advantage of storage improvements.

51) SSD: Virtually all layers of a storage server can take improvements from faster access on a solid-state device. SSDs, unlike spinning disks, have no seek time, and they offer far faster random read access than disks.

52) Faster connections: Networking between storage servers and components may not always be regulated if they are not on-premise. Consequently, it is better to set up direct connections and faster networks wherever possible.

53) Direct connections: These help to have better control over communication between two endpoints. A dedicated TCP connection comes with the benefit of congestion control and ordering, which translates to efficiency in data writes at the destination.

54) Virtual private network: Virtual private networks only add an IP header over the existing headers, so they may not improve the latency or bandwidth of the network, but they certainly secure it.

55) HTTPS: Encryption over HTTPS secures data in transit, complementing the encryption that secures data at rest. However, it comes with the onus of key and certificate management. HTTPS by default is not just a mandate over the internet but also a requirement even between departments in the same organization.

Wednesday, November 14, 2018

Today we continue discussing the best practice from storage engineering:

46) Security – This is an integral part of every storage product. The artifacts from the user need to have a proper access control list; otherwise there may be undesirable access. The mechanism for access control has traditionally differed across operating systems, but most agree on a role-based access control mechanism.

47) Performance – Storage operations need to support a high degree of concurrency and very fast operations. These operations may even be benchmarked. Although local operations are definitely cheaper than remote operations, they are not necessarily a bottleneck in most modern cloud storage services.

48) Row-level security – Although storage objects have granular access control lists, there is nothing preventing the extension of security to individual key values with the help of tags and labels that can be universally designated.

49) Workload alignment: Public clouds pride themselves on the metrics they set and compete with each other on; most also tune them to their advantage. It is important, however, to align the benchmarks with the workloads on the system.

50) Hardware arrangement: One of the most overlooked considerations is the implication of the choice of hardware for storage servers. For example, a chassis with expansion bays for solid-state drives in the front is going to be more popular than others and will set up the storage servers to take advantage of storage improvements.

Tuesday, November 13, 2018

Today we continue discussing the best practice from storage engineering:

41) Allocations: Although a storage organizational unit such as a file, blob or table seems like a single indivisible logical unit to the user, it translates to multiple physical-layer allocations. Files have a hierarchical organization, and low-level drivers translate them to a file location and byte offset on disk. This has been the traditional architecture, primarily driven by hierarchy and naming. Storage units have more than names: they have tags and metadata, and designing a file system that utilizes alternate forms of organization leveraging tags allows simultaneous use of different nomenclatures. This is an example where master data management can bring significant advantages, such as the use of attributes to look up files.

42) Catalogs: Physical organization does not always have to correlate directly with the way users save their data. A catalog is a great example of utilizing the existing organization to serve the various ways in which content is looked up or correlated. Moreover, custom tags can help increase the ways in which files can be managed and maintained. While lookups have translated to queries, content indexers have provided an alternate way to look up data. Here we refer to the organization of metadata so that the storage architecture can be separated from the logical organization and lookups.

43) System metadata – Metadata is not specific only to the storage artifacts from the user. Every layer maintains entities and bookkeeping for the layer immediately below, and these are often just as useful to query as some of the queries of the overall system. This metadata is internal and for system purposes only. Consequently, it is the source of truth for the artifacts in the system.

44) User metadata – We referred to metadata for user objects. However, such metadata is usually in the form of predetermined fields that the system exposes. In some cases, users can add more labels and tags, and this customization is referred to as user metadata. User metadata helps in scenarios outside the system where users want to group their content, which can then be used in classification and data mining.

45) User-defined functions, callbacks and webhooks – Labels and tags are only as useful to the user as their availability in queries. If the system does not support intensive or involved logic, the user is left to implement their own. Such expressions may involve custom user-defined operators and callbacks. These can be executed on a subset of the user data or all of it, and they can also be executed where the results can be streamed.


Monday, November 12, 2018

Today we continue discussing the best practice from storage engineering:
35) Container: Docker containers are also immensely popular for deployment, and every server benefits from portability because it makes the server resilient to the issues faced by the host.

36) Congestion control: When requests are queued to the storage server, it can use the same sliding-window technique that is used for congestion control in a TCP connection. Fundamentally, there is no difference between giving each request a serial number in either case and handling them within the boundaries of the received and the processed at a rate that keeps the bounds manageable.
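
A minimal C# sketch that approximates the sliding window with a counting semaphore; each request gets a serial number, and at most a window's worth of requests may be outstanding between received and processed:

using System;
using System.Threading;

// Bounds the gap between received and processed requests.
public class SlidingWindow
{
    private readonly SemaphoreSlim slots;
    private long nextSerial;

    public SlidingWindow(int window) { slots = new SemaphoreSlim(window, window); }

    public long Admit()
    {
        slots.Wait();                                 // blocks when the window is full
        return Interlocked.Increment(ref nextSerial); // serial number for the request
    }

    public void Complete()
    {
        slots.Release();                              // slides the window forward
    }
}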

37) Standard query operators: Many bookkeeping operations can be translated to aggregates over simple key-value collections. These aggregates do not necessarily have to be dedicated, customized logic. Instead, if there were a generic way to perform standard query operations, much of the accounting could simply become similar query patterns.
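
For example, with standard query operators (LINQ in C#), accounting over a key-value collection of hypothetical per-container byte counts reduces to a few generic expressions:

using System;
using System.Collections.Generic;
using System.Linq;

// Bookkeeping as standard query operators over a key-value collection.
public static class Accounting
{
    public static void Report(Dictionary<string, long> bytesPerContainer)
    {
        long total = bytesPerContainer.Values.Sum();     // aggregate: sum
        long largest = bytesPerContainer.Values.Max();   // aggregate: max
        var topFive = bytesPerContainer
            .OrderByDescending(kv => kv.Value)           // rank containers by usage
            .Take(5)
            .Select(kv => kv.Key);
        Console.WriteLine($"total={total} largest={largest} top={string.Join(",", topFive)}");
    }
}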

38) Queue: Most requests to the storage server are processed on a first-come, first-served basis. This naturally suits the use of a queue as the data structure to hold the requests. Queues may be distributed in order to handle large volumes of requests. With distributed queues, the requests may be sent to the partitions where they can be served best. Semantically, all distributed queue processors behave the same in terms of handling a request; they simply get the requests relevant to their partition.

39) Task schedulers: Queues are used not just with storage partitions; they are also used for prioritizing workloads. Background task processors usually have long-running jobs. These jobs may take several quanta of time slices. Even if a job were to block when executed, it may need to be interleaved with other jobs. The purpose of the task scheduler is to decide which job runs next on the processor. In order to facilitate retries and periodic execution, a crontab may be set up for the job.

40) Coordinator for nodes: Just like a scheduler, a coordinator hands out tasks to agents on remote nodes. This notion has been implemented in many forms, some as services over HTTP. In some cases, the web services have given way to a registry of tasks that all nodes can read and update states for an individual job.
#codingexercise
int GetNodeWithRightImbalancedLeaves(Node root, ref List<Node> result)
{
    if (root == null) return 0;
    if (root.left == null && root.right == null) return 1;
    // count the leaves in each subtree
    int left = GetNodeWithRightImbalancedLeaves(root.left, ref result);
    int right = GetNodeWithRightImbalancedLeaves(root.right, ref result);
    // collect nodes whose right subtree has more leaves than the left
    if (right > left)
    {
        result.Add(root);
    }
    return left + right;
}

Sunday, November 11, 2018

We continue discussing the best practice from storage engineering:
31) Acceleration – Although network acceleration with the help of direct TCP connections is essentially a networking-tier technique, it is equally applicable in the storage tier when the tier spans geographically distributed regions.
32) Datacenters and data stores: The choice of locations for datacenters and data stores plays heavily into the consolidation of storage technologies. When virtual machines are spun up on organizational assets, it is often done in private datacenters. Many web services use a datastore for their storage, especially if they have no need for local storage. Therefore, storage offerings have to be mindful of the presence of large datastores.
33) Distributed hash table: In order to scale horizontally over commodity compute, the storage tier uses a distributed hash table to assign and delegate resources and tasks. This facilitates a large peer-to-peer network that works well for large-scale processing, including high-volume workloads.
34) Cluster: This is another form of deployment, as opposed to a single-server deployment of storage servers. The advantages of using a cluster include horizontal scalability, fault tolerance and high availability. Cluster technology is now common practice and is widely adopted for any server deployment.
35) Container: Docker containers are also immensely popular for deployment, and every server benefits from portability because it makes the server resilient to the issues faced by the host.
36) Congestion control: When requests are queued to the storage server, it can use the same sliding-window technique that is used for congestion control in a TCP connection. Fundamentally, there is no difference between giving each request a serial number in either case and handling them within the boundaries of the received and the processed at a rate that keeps the bounds manageable.

Saturday, November 10, 2018

We continue discussing the best practice from storage engineering:
26) Containers – Data is organized per the units of organization of the storage device or appliance. These containers, however, do not necessarily remain the same size, because the user dictates what is packed in any container. Therefore, when it comes to data transfer, we can transfer a large container at a time or smaller ones. Also, users often have to specify attributes of the container, and sometimes this can go wrong. Instead of correcting a container beyond salvage, it might be easier to recreate another and transfer the data.
27) Geographical location – Administrators often determine the sites where their data needs to be replicated. This involves choosing the locations that will have the least latency to the users. This choice of sites may be common across data organizations and their owners, and customized where the choices are inadequate.
28) Backup – Although data backup has been cited earlier as a maintenance item, it is in fact prudent on the part of the owner or administrator to determine which data needs to be backed up. Tools like duplicity use the rsync protocol to determine incremental changes, and storage products may have a way to do this or allow it to be externalized.
29) Aging – Generally, the older the data, the more amenable it is for backup. The data’s age is progressive on the timeline. Therefore, it is easy to label the data as hot, warm or cold so that cut-offs for age-related treatments may then be taken. Cost savings on cheaper storage were touted as the primary motivation earlier, but this has recently been challenged. That said, aged data lends itself to treatments such as deduplication.
30) Compression – Probably the hallmark of any efficient storage is the packing of the data. Most data files and directories can be archived. For example, a tarball is a convenient way to make websites and installables portable. When data is viewed in the form of binaries, a long run of either 0s or 1s can be efficiently packed. When the binary sequence flips too often, it becomes efficient not to encode it and to leave it as is. That said, there are many efficient compression techniques available.

Friday, November 9, 2018

We continue discussing the best practice from storage engineering:
21) Maintenance – Every storage offering comes with a responsibility for administrators. Some excel at reducing this maintenance with the help of auto-tuning and automation of maintenance chores, while others present comprehensive dashboards and charts for detailed, interactive and involved maintenance. The managed services that moved technologies and stacks from on-premise to the cloud came with a reduction in total cost of ownership by centralizing and automating tasks that provided scalability, high availability, backups, software updates and patches, host and server maintenance, rack and stack, power and network redress, etc.
22) Data transfer – The performance considerations of I/O devices include throughput and latency in one form or another. Any storage offering may be robust and large but will remain inadequate if the data-transfer speed is low. In addition, data transfer may need to cross large geographical distances, and repeatedly so. Facilitating a dedicated network connection may not be feasible in all cases, so the baseline must itself be reasonable.
23) Gateway – Traditionally, gateways have been used to bridge different storage providers, or between on-premise and cloud, or even two similar but different-origin storage stacks. Gateways also help with load balancing, routing and proxy duties. Some storage providers are savvy enough to include this technology within their offering so that gateways are not needed everywhere.
24) Cache – A cache enables requests to be handled by providing the resource without looking it up in deeper layers. The technology can span across storage or be offered many levels deep in the stack. Caches not only improve performance but also save costs.
25) Checksum – This is a simple way to check data integrity, and it suffices in places where encryption may not be easy, especially when the keys required to encrypt and decrypt cannot be secured. This simple technique is no match for the advantages of encryption, but it is often put to use in low-level message transfers and for data at rest.
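
A minimal C# integrity check: store a digest alongside the data and compare it with one recomputed on read; any mismatch flags corruption:

using System;
using System.Security.Cryptography;

// Compute and verify a SHA-256 checksum over a byte buffer.
public static class Integrity
{
    public static byte[] Checksum(byte[] data)
    {
        using (var sha = SHA256.Create())
            return sha.ComputeHash(data);
    }

    public static bool Verify(byte[] data, byte[] storedChecksum)
    {
        var actual = Checksum(data);
        if (actual.Length != storedChecksum.Length) return false;
        for (int i = 0; i < actual.Length; i++)
            if (actual[i] != storedChecksum[i]) return false; // corruption detected
        return true;
    }
}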


Thursday, November 8, 2018

We continue listing the best practice from storage engineering:
15) Management – Storage is very much a resource. It can be created, updated and deleted. With software-defined technologies, the resource only takes a more gigantic form; otherwise it is the equivalent of a single data record for the user. Every such resource also has significant metadata. Consequently, we manage storage the same way we manage other resources.
16) Monitoring – Virtual large storage may be stretched across disks in one form or another, and physical resources such as disks often fail or run out of space. Therefore, monitoring becomes a crucial aspect.
17) Replication groups – Most storage organizations have to deal with copies of the data. This is generally handled with replication. There is no limit to the copies maintained, but if they span roots of storage organization, a replication group is created where the different sites are automatically synced.
18) Storage organization – We referred earlier to hierarchical organization that allows maximum flexibility to the user in terms of folders and depth. Here, the organization includes replication groups, if any, as well as the ability to maintain simultaneous organizations, such as when the storage is file-system enabled.
19) Background tasks – Routine and periodic tasks can be delegated to background workers instead of being executed inline with data in and out. These can be added to a background task scheduler that invokes them as specified. Some of the metadata for the storage entities is improved with journaling and other such background operations.
20) Relays – Most interactions between components are in the form of requests and responses. These may traverse multiple layers before they are authoritatively handled by a node and partition. Relays help translate requests and responses between layers. They are necessary for making the request-processing logic modular and chained.

#codingexercise
To determine whether an integer binary tree is a binary search tree, we can simply check that the root’s value lies within int_min and int_max and then recurse for every child with a correspondingly narrowed range:
bool IsBstHelper(Node root, int min, int max)
{
    if (root == null) return true;
    // the current value must lie within the allowed range
    if (root.data < min || root.data > max) return false;
    // narrow the range for each subtree
    return IsBstHelper(root.left, min, root.data - 1) &&
           IsBstHelper(root.right, root.data + 1, max);
}

Wednesday, November 7, 2018

We continue listing the best practice from storage engineering:
9) Data flow – Data flows into stores, and stores grow in size. Businesses and applications that generate data often find the data to be sticky once it accumulates. Consequently, a lot of attention is paid to early estimation of size and the kind of treatment to apply.

10) Distributed activity – File systems and object storage have to take advantage of horizontal scalability with the help of clusters and nodes. Consequently, the use of distributed processing, such as with the Paxos algorithm, becomes useful to take advantage of this strategy. Partitioning becomes useful in isolating activities.

11) Protocols – Nothing facilitates communication between peers or master-slave like a protocol. Even a description of the payload and the generic operations of create, update, list and delete becomes sufficient to handle storage-relevant operations at all levels.

12) Layering – Storage solutions have taught us that appliances can be stacked, services can be hierarchical and data may be tiered. A problem solved in one domain with a particular solution may be equally applicable to a similar problem in a different domain. This means we can use layers in the overall solution.

13) Virtualization – Cloud computing has taught us the benefit of virtualization at all levels, where different entities may be spanned with a universal access pattern. Storage is no exception, and every storage product tends to take advantage of this strategy.

14) Security and compliance – Regulatory agencies around the globe look for some kind of certification. Most storage providers have to demonstrate compliance with one or more of the following: PCI-DSS, HIPAA/HITECH, FedRAMP, the EU Data Protection Directive, FISMA and such others. Security is provided with the help of identity and access management, which comes in useful to secure individual storage artifacts.

Tuesday, November 6, 2018

We continue listing the best practice from storage engineering.
5) Seal your data – An append-only format for writing data is preferred for forms of data, such as events, that appear in a continuous stream. When we seal the data, we make all activities progressive on the timeline without loss of fidelity over time. If there are failures, seal the data. When data does not change, we can perform calculations that help us repair and recover.
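
A minimal C# append-only sketch: records are only ever appended, and sealing marks the file read-only so later calculations can trust its contents:

using System;
using System.IO;

// Data is only appended; Seal makes the file read-only.
public class AppendOnlyLog
{
    private readonly string path;
    public AppendOnlyLog(string path) { this.path = path; }

    public void Append(byte[] record)
    {
        // FileMode.Append guarantees writes only ever extend the file
        using (var stream = new FileStream(path, FileMode.Append, FileAccess.Write))
        {
            stream.Write(record, 0, record.Length);
        }
    }

    public void Seal()
    {
        File.SetAttributes(path, FileAttributes.ReadOnly);
    }
}
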
6) Versions and policy – As with most libraries, append-only data facilitates versioning, and versions can be managed with policies. Data may be static, but policies can be dynamic. When the storage is viewed as a library, users can go back in time and track revisions.
7) Deduplication – As data ages, there is very little need to access it regularly. It can be packed and saved in a format that reduces space. When the data is no longer used by an application or a user, it can be viewed as segments; these delineations facilitate the study of redundancy in the data. Redundant segments may then simply be skipped from storage, which allows a more manageable form of accumulated raw data.
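
A minimal C# sketch of fixed-size-segment deduplication: identical segments are stored once and referenced by content hash; production systems typically use content-defined chunking instead:

using System;
using System.Collections.Generic;
using System.Security.Cryptography;

// Stores each distinct segment once, keyed by its content hash.
public class SegmentStore
{
    private readonly Dictionary<string, byte[]> segments = new Dictionary<string, byte[]>();

    public List<string> Write(byte[] data, int segmentSize = 4096)
    {
        var recipe = new List<string>(); // ordered hashes reconstruct the data
        using (var sha = SHA256.Create())
        {
            for (int offset = 0; offset < data.Length; offset += segmentSize)
            {
                int length = Math.Min(segmentSize, data.Length - offset);
                var segment = new byte[length];
                Array.Copy(data, offset, segment, 0, length);
                var hash = Convert.ToBase64String(sha.ComputeHash(segment));
                if (!segments.ContainsKey(hash))   // redundant segments stored once
                    segments[hash] = segment;
                recipe.Add(hash);
            }
        }
        return recipe;
    }
}
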
8) Encryption – Encryption is probably the only technique to truly protect data when there can be unwanted or undesirable access. The scope of encryption may be limited to sensitive data if the rest of the raw data can be tolerated unencrypted.


Monday, November 5, 2018

Best practice from storage engineering:
Introduction: Storage is one of the three pillars of any commercial software. Together, the three concepts of compute, networking and storage are included directly as products to implement solutions, as components to make products, as perspectives for the implementation details of a feature within a product, and so on. Every algorithm that is implemented pays attention to these three perspectives in order to be efficient and correct. We cannot think of distributed or parallel algorithms without networking, efficiency without storage, or convergence without compute. Therefore, these disciplines bring certain best practices from the industry.

We list a few in this article from the storage engineering perspective:
1) Not a singleton – Most storage vendors know that data is precious. It cannot be lost or corrupted. Therefore, storage industry vendors go to great lengths to make data safe at rest by not allowing a single point of failure such as a disk crash. If the data is written to a store, it is made available with copies or archived as backup.
2) Protection against loss – Data, when stored, may get corrupted. In order to make sure the data does not change, we need to keep additional information. This is called erasure coding, and with the additional information about the data, we can not only validate the existing data, we may even be able to recreate the original data by tolerating a certain loss. How we store the data and the erasure code also determines the level of redundancy we can achieve.
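
The simplest form of erasure coding is a single parity block computed as the XOR of the data blocks; losing any one block is then recoverable. Real systems use Reed-Solomon codes to tolerate multiple losses. A minimal C# sketch:

using System;

// XOR parity over equal-length data blocks (RAID-5 style).
public static class Parity
{
    public static byte[] Compute(byte[][] blocks)
    {
        var parity = new byte[blocks[0].Length]; // assumes equal-length blocks
        foreach (var block in blocks)
            for (int i = 0; i < parity.Length; i++)
                parity[i] ^= block[i];
        return parity;
    }

    // Rebuild a lost block from the surviving blocks plus the parity.
    public static byte[] Recover(byte[][] survivors, byte[] parity)
    {
        var lost = (byte[])parity.Clone();
        foreach (var block in survivors)
            for (int i = 0; i < lost.Length; i++)
                lost[i] ^= block[i];
        return lost;
    }
}
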
3) Hot, warm, cold – Data differs in treatment based on access. Hot data is actively read and written. Warm and cold indicate progressive inactivity over the data. Each of these labels allows different leeway in the treatment of the data and in the cost of storage.
4) Organizational unit of data – Data is often written in one of several units of organization depending on the producer. For example, we may have blobs, files and block-level storage. These do not need to be handled the same way, and each organizational unit even comes with its own software stack to facilitate the storage.


#codingexercise
// predicate to select positive integer sequences from enumerated combinations
// requires Guava: com.google.common.collect.Collections2, com.google.common.base.Predicate
List<List<Integer>> result = new ArrayList<>(Collections2.filter(combinations,
        new Predicate<List<Integer>>() {
            @Override
            public boolean apply(List<Integer> sequence) {
                return isPositive(sequence);
            }
        }));