Sunday, December 23, 2018

Today we continue discussing best practices from storage engineering:

215) A fault domain is a grouping that covers known faults in isolation. Yet some faults may occur in combination. It is best to give names to recurring patterns of faults so that they can be included in the design of components.

216) Data-driven computing has required changes in storage products. Whereas online transactional activities were previously read-write intensive and synchronous, today most processing, including orders and payments, is done asynchronously on data-driven frameworks that usually employ message queuing. Storage products do better with improved caching for this kind of processing.
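A minimal sketch in Java can illustrate this style of processing. Here an in-process BlockingQueue stands in for a real message broker, and writeToStorage is a hypothetical stub for the storage-bound step:

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Sketch: the register thread enqueues orders and returns immediately;
// a worker drains the queue and writes to storage asynchronously.
public class AsyncOrderPipeline {
    record Order(String id, long amountCents) {}

    private final BlockingQueue<Order> queue = new ArrayBlockingQueue<>(1024);

    public void submit(Order order) throws InterruptedException {
        queue.put(order); // the synchronous hand-off ends here
    }

    public void startWorker() {
        Thread worker = new Thread(() -> {
            try {
                while (true) {
                    Order order = queue.take(); // blocks until work arrives
                    writeToStorage(order);      // the only storage-bound step
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        worker.setDaemon(true);
        worker.start();
    }

    private void writeToStorage(Order order) {
        // hypothetical storage call; a real system would batch and cache here
        System.out.println("persisted " + order.id());
    }

    public static void main(String[] args) throws InterruptedException {
        AsyncOrderPipeline p = new AsyncOrderPipeline();
        p.startWorker();
        p.submit(new Order("o-1", 2500));
        Thread.sleep(100); // give the worker time to drain in this demo
    }
}

The caller is unblocked as soon as the order is enqueued; the storage product only sees the worker's writes, which it can batch and cache.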

217) The latency allowed for executing an order, and the requirement to prevent repeats in order processing, have not been relaxed. A gift card run through a store register for renewal must make its way to the backend storage so that it is not charged again if it is run through before the prior execution completes. A storage product does not need to be a database with strong ACID guarantees, but it should support serialized reads against all read-write operations.
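The exact mechanism is not prescribed above, but a compare-and-set on first redemption is one way to serialize such read-writes. A minimal sketch, with an in-memory map standing in for the durable backend store:

import java.util.concurrent.ConcurrentHashMap;

// Sketch: the first redemption of a card wins; a concurrent or later
// retry sees the existing entry and is rejected.
public class GiftCardLedger {
    private final ConcurrentHashMap<String, Long> redeemed = new ConcurrentHashMap<>();

    // Returns true only for the first redemption of a given card id.
    public boolean redeem(String cardId) {
        return redeemed.putIfAbsent(cardId, System.currentTimeMillis()) == null;
    }

    public static void main(String[] args) {
        GiftCardLedger ledger = new GiftCardLedger();
        System.out.println(ledger.redeem("card-42")); // true: charged once
        System.out.println(ledger.redeem("card-42")); // false: repeat rejected
    }
}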

218) When the processing requirements are relaxed from ACID to eventual consistency, the latency requirements may not have been relaxed with them. Consequently, it is important for the storage to return results as early as possible. In such cases, it is helpful to structure the evaluation so that partial results can be returned.
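A minimal sketch of the idea, with a stand-in per-item computation and an assumed time budget: the scan stops at the deadline and returns whatever has accumulated instead of blocking for the full result.

import java.util.ArrayList;
import java.util.List;

public class PartialResults {
    // Scans items until the deadline; the partial list is still useful.
    public static List<Integer> scan(List<Integer> items, long budgetMillis) {
        long deadline = System.currentTimeMillis() + budgetMillis;
        List<Integer> out = new ArrayList<>();
        for (int item : items) {
            if (System.currentTimeMillis() > deadline) break; // return early
            out.add(item * item); // stand-in for per-item work
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(scan(List.of(1, 2, 3, 4), 50)); // possibly partial
    }
}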

219) Multiple active versions of a record need not be maintained. If several ongoing reads and writes of a record lead to multiple versions, each may merely hold a reference to a version, and older versions can be brought back by going back in time from the latest version of the record by means of the logs that were captured. This is how the read-consistency isolation level works in database products that are known to favor versions.
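A minimal sketch of this reconstruction, assuming each update pushes the prior value onto an undo log so that an older version can be rebuilt by walking back from the latest:

import java.util.ArrayDeque;
import java.util.Deque;

public class VersionedRecord {
    private String value;       // latest version only
    private long version;
    private final Deque<String> undoLog = new ArrayDeque<>(); // newest first

    public VersionedRecord(String initial) { this.value = initial; this.version = 1; }

    public void update(String newValue) {
        undoLog.push(value);    // capture the old value before overwriting
        value = newValue;
        version++;
    }

    // Reconstructs the value as of an older version by replaying the log.
    public String readAsOf(long asOfVersion) {
        String candidate = value;
        long v = version;
        for (String older : undoLog) {
            if (v <= asOfVersion) break;
            candidate = older;  // step one version back in time
            v--;
        }
        return candidate;
    }

    public static void main(String[] args) {
        VersionedRecord r = new VersionedRecord("a");
        r.update("b");
        r.update("c");
        System.out.println(r.readAsOf(1)); // "a"
        System.out.println(r.readAsOf(3)); // "c"
    }
}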

220) Between the choice of versions and timestamps for capturing data changes, it is better to denote an entity by name and version, as this narrows down the iteration over the records after they have been fetched.

Saturday, December 22, 2018

Today we continue discussing best practices from storage engineering:
210) Location-based services are a key addition to many data stores simply because location gives the data more interpretations that solve major business use cases. To facilitate this, the location may become part of the data and be maintained routinely, or there must be a service close to the storage that relates the two.

211) Although storage servers share capabilities with compute-layer API servers, in many cases they are not as transaction-oriented as the API layer. This calls for a more relaxed execution model that includes, among other things, background tasks and asynchronous processing. This affects some design choices, which may differ from those of the compute layer.

212) The consensus protocol used among the distributed components of a storage system is required to be fault-tolerant. However, the choice of consensus protocol, such as Paxos or Raft, may vary from system to system.

213) Message passing between agents in a distributed environment is required, and almost any protocol can suffice for it. Some storage systems use open-source components for this purpose while others build on message queuing.

214) Every storage server prepares for fault tolerance. Since faults can occur in any domain, temporarily or permanently, each component determines which activities to perform and how to work around what is unavailable.

215) A fault domain is a grouping that covers known faults in isolation. Yet some faults may occur in combination. It is best to give names to recurring patterns of faults so that they can be included in the design of components.


Friday, December 21, 2018

Today we continue discussing best practices from storage engineering:

205) Social networking applications also have heavy load-balancing requirements, and therefore more servers may need to be provisioned to handle their load. Since a storage tier does not necessarily expose load-balancing semantics, it should call out when an external load balancer must be used.

206) Networking dominates storage when distributed hash tables and message queues scale to social networking applications. WhatsApp's Erlang and FreeBSD architecture has shown unmatched symmetric multiprocessing (SMP) scalability.

207) Unstructured data generally becomes heterogeneous because there is no structure to keep it consistent. This is a big challenge for both data ingestion and machine parsing of the data. Moreover, the data remains incomplete as its variety of forms and heterogeneity increase.

208) Timeliness in processing large data sets is important to systems that cannot tolerate wide error margins. Since elapsed time and execution time differ depending on the rate at which tasks get processed, the overall schedule runs into hard limits unless there are ways to trade off compute against storage.

209) Spatial proximity of data is also important to prevent potential congestion along routes. In the cloud this is overcome with dedicated networks and direct access for companies, but that is not generally the norm for on-premise storage.

210) Location-based services are a key addition to many data stores simply because location gives the data more interpretations that solve major business use cases. To facilitate this, the location may become part of the data and be maintained routinely, or there must be a service close to the storage that relates the two.

Thursday, December 20, 2018

Today we continue discussing best practices from storage engineering:
200) Data modeling and analysis: A data model may be described with entity-relationship diagrams, JSON documents, objects, or graph nodes. However, the models are not final until after several trials. Allowing versions of the data models to be kept also helps subsequent analysis.

201) Data aggregation depends largely on the tools used. A SQL query can perform rollups; a map-reduce job can perform summation. These are very different usages, but the storage tier can improve performance if it is dedicated to either. To separate the usages on a shared storage tier, we could classify the workloads and present different sites with redundant copies or materialized views.
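As an illustration of the map-reduce flavored summation (a sketch, not tied to any particular product), the same grouped total that a SQL rollup would produce can be computed with a map step for key extraction and a reduce step for summation:

import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class GroupedSum {
    record Sale(String region, long amountCents) {}

    // Map: extract the grouping key. Reduce: sum the amounts per group.
    public static Map<String, Long> totalByRegion(List<Sale> sales) {
        return sales.stream().collect(
                Collectors.groupingBy(Sale::region,
                        Collectors.summingLong(Sale::amountCents)));
    }

    public static void main(String[] args) {
        List<Sale> sales = List.of(new Sale("east", 100), new Sale("east", 50),
                                   new Sale("west", 70));
        System.out.println(totalByRegion(sales)); // e.g. {east=150, west=70}
    }
}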

202) Even if we separate the query-processing use of the storage tier from the raw data transfer into the storage, we need to maintain partitions that separate read-only data from read-write data. This alleviates performance concerns as well as inconsistent views.

203) Social networking data has been a phenomenal use case for unstructured data storage instead of relational databases. This trend only expands, and the requirements for processing chat messages and group discussions are very different from those of conventional file or object-based storage.

204) The speed of data transfer, and not just its size, is also critical for social networking applications such as Facebook and Twitter. In such cases, we have to support a large number of concurrent messages. A messaging platform such as the one written in Erlang for WhatsApp may be more performant than servers written with extensive inter-process communication.

205) Social networking applications also have heavy load-balancing requirements, and therefore more servers may need to be provisioned to handle their load. Since a storage tier does not necessarily expose load-balancing semantics, it should call out when an external load balancer must be used.


Wednesday, December 19, 2018

Today we continue discussing best practices from storage engineering:

195) A user's location, personally identifiable information, and location-based service data are required to be redacted. This involves not only parsing for such data but also doing so over and over, starting from admission control at the integration boundaries. If the storage tier stores any of this information in the clear during or after the data transfer from the pipeline, it will not only be a security violation but will also fail compliance.

196) Data pipelines can be extended. The storage tier needs to be elastic and capable of meeting future demands. Object storage enables this seamlessly because it virtualizes the storage. If the storage spans clusters, nodes can be added. Segregation of data is done on the basis of storage containers.

197) When data gets connected, its value expands. Even if the storage tier does not see more than containers, it does very well when all the data appears in its containers. Connected data has a far larger audience than it had independently. Consequently, the storage tier should facilitate data acquisition and connections.

198) Big Data is generally accumulated from some source. Sensor data, for example, can be stored in NoSQL databases. However, the data is usable only when the right metadata is recorded along with the observational data. To do this continuously, the storage tier must facilitate metadata acquisition.
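A minimal sketch of such an ingest API: the observation and its metadata travel together, so neither is admitted without the other. The field names here are hypothetical.

import java.util.Map;

public class SensorIngest {
    record Observation(double value, long timestampMillis, Map<String, String> metadata) {}

    public void ingest(Observation obs) {
        // hypothetical sink; a NoSQL store would persist value and metadata side by side
        System.out.println(obs);
    }

    public static void main(String[] args) {
        new SensorIngest().ingest(new Observation(
                21.5, System.currentTimeMillis(),
                Map.of("sensorId", "t-100", "unit", "celsius", "site", "lab-1")));
    }
}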

199) Cleaning and parsing: Raw data is usually noisy and imperfect, and it has to be carefully parsed. For example, with full-text analysis we perform stemming and multi-stage pre-processing before the analysis. This applies to admission control and ingestion as well.
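A minimal sketch of multi-stage pre-processing: lowercasing, stripping punctuation, tokenizing, and a deliberately crude suffix-stripping stemmer. A real pipeline would use a proper stemmer such as Porter's at the ingestion boundary.

import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class TextCleaner {
    public static List<String> clean(String raw) {
        return Arrays.stream(raw.toLowerCase()
                        .replaceAll("[^a-z\\s]", " ")   // strip punctuation and noise
                        .split("\\s+"))
                .filter(w -> !w.isBlank())
                .map(TextCleaner::stem)
                .collect(Collectors.toList());
    }

    // Crude stemmer for illustration only: drops common English suffixes.
    private static String stem(String w) {
        if (w.endsWith("ing") && w.length() > 5) return w.substring(0, w.length() - 3);
        if (w.endsWith("s") && w.length() > 3) return w.substring(0, w.length() - 1);
        return w;
    }

    public static void main(String[] args) {
        System.out.println(clean("Parsing noisy, imperfect RAW data!"));
        // [pars, noisy, imperfect, raw, data]
    }
}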

200) Data modeling and analysis: A data model may be described with entity-relationship diagrams, JSON documents, objects, or graph nodes. However, the models are not final until after several trials. Allowing versions of the data models to be kept also helps subsequent analysis.


#codingexercise
How does SkipList work?
SkipList nodes have multiple next pointers that point ahead to later nodes, skipping by, say, 4, 2, or 1 elements.
In a sorted skip list this works as follows:

class SkipListNode {
    int data;
    SkipListNode next;
}

// Walks forward from a through the sorted list, taking strides of 4, 2, or 1
// while the next node's data does not exceed b's, and returns the first node
// whose data is greater than b's (or null at the end of the list).
SkipListNode skipAhead(SkipListNode a, SkipListNode b) {
    if (a == null || b == null) return a;
    SkipListNode cur = a;
    SkipListNode target = b;
    while (cur.next != null && cur.next.data <= target.data) {
        // skip by 4, if possible
        if (cur.next != null && cur.next.next != null && cur.next.next.next != null
                && cur.next.next.next.next != null
                && cur.next.next.next.next.data <= target.data) {
            cur = cur.next.next.next.next;
        }
        // skip by 2, if possible
        if (cur.next != null && cur.next.next != null
                && cur.next.next.data <= target.data) {
            cur = cur.next.next;
        }
        // skip by 1, if possible
        if (cur.next != null && cur.next.data <= target.data) {
            cur = cur.next;
        }
    }
    return cur.next;
}
Since a real skip list already maintains links at skip levels of 4, 2, 1, and so on, we avoid this checking and the chained next.next.next notation.

The number of levels for skipping need not be restricted to 4, 2, and 1.
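A minimal sketch of such a node, with an array of forward pointers, one per level, and the standard top-down search that this layout enables (the example list is hand-built for brevity):

public class LeveledSkipList {
    static class Node {
        int data;
        Node[] forward; // forward[level] points ahead at that level

        Node(int data, int levels) {
            this.data = data;
            this.forward = new Node[levels];
        }
    }

    // Search: start at the top level and move right while the next node is
    // small enough, then drop a level; at level 0, cur is the predecessor.
    static Node findPredecessor(Node head, int target) {
        Node cur = head;
        for (int level = head.forward.length - 1; level >= 0; level--) {
            while (cur.forward[level] != null && cur.forward[level].data < target) {
                cur = cur.forward[level];
            }
        }
        return cur;
    }

    public static void main(String[] args) {
        // head is a sentinel; 5 and 9 are linked at level 0, 9 also at level 1
        Node head = new Node(Integer.MIN_VALUE, 2);
        Node n5 = new Node(5, 1);
        Node n9 = new Node(9, 2);
        head.forward[0] = n5; n5.forward[0] = n9;
        head.forward[1] = n9;
        System.out.println(findPredecessor(head, 9).data); // 5
    }
}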

Tuesday, December 18, 2018

Today we continue discussing best practices from storage engineering:

191) A pipeline must hold up against a data tsunami. In addition, data flow may fluctuate, and the pipeline must hold up against the ebb and flow. Data may be measured in rate, duration, and size, and the pipeline may need to become elastic. If the storage tier cannot accommodate a pipeline's elasticity, it must state its Service Level Agreement clearly.

192) Data and its services have a large legacy in the form of existing products, processes, and practices. A storage tier cannot disrupt any of these and must therefore provide the flexibility to handle data transfers as diverse as Extract-Transform-Load and analytical pipelines.

193) Extension: ETL may require flexibility in extending logic and in its usage on clusters versus servers. Microservices are much easier to write, and they became popular with Big Data storage. Together they have bound compute and storage into their own verticals, with data stores expanding in number and variety. Queries written for one service now need to be rewritten for another, while the pipeline may or may not support data virtualization. Depending on the nature of the pipeline, the storage tier may also change.

194) Both synchronous and asynchronous processing need to be facilitated so that some data transfers can run online while others are relegated to the background. Publisher-subscriber message queues may be used in this regard. Standalone services and brokers do not scale the way cluster-based message queues do. It might take nearly a year to fetch the data into the analytics system and only a month for the analysis. While the benefit to the user may be immense, their patience for the overall elapsed time may be thin. Consequently, a storage tier should at the very least not require frequent and repeated data transfers from the user. Object storage, for instance, handles multi-zone operation automatically.
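A minimal sketch of the publisher-subscriber hand-off: the publisher returns as soon as the message is dispatched, while each subscriber processes it on a background executor, so online and background transfers can coexist. This is an in-process stand-in for a real broker.

import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Consumer;

public class MiniPubSub {
    private final List<Consumer<String>> subscribers = new CopyOnWriteArrayList<>();
    private final ExecutorService pool = Executors.newFixedThreadPool(2);

    public void subscribe(Consumer<String> handler) { subscribers.add(handler); }

    public void publish(String message) {
        for (Consumer<String> s : subscribers) {
            pool.submit(() -> s.accept(message)); // asynchronous delivery
        }
    }

    public static void main(String[] args) throws InterruptedException {
        MiniPubSub bus = new MiniPubSub();
        bus.subscribe(m -> System.out.println("analytics got " + m));
        bus.subscribe(m -> System.out.println("archival got " + m));
        bus.publish("batch-42"); // returns immediately
        Thread.sleep(100);
        bus.pool.shutdown();
    }
}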

195) A user's location, personally identifiable information, and location-based service data are required to be redacted. This involves not only parsing for such data but also doing so over and over, starting from admission control at the integration boundaries. If the storage tier stores any of this information in the clear during or after the data transfer from the pipeline, it will not only be a security violation but will also fail compliance.

Monday, December 17, 2018

185) Reliability of data: A storage platform can provide redundancy and availability, but it has no control over the content. The data from pipelines may sink into the storage, but if the pipeline is not the source of truth, the storage tier cannot guarantee that the data is reliable. Garbage in, garbage out applies to the storage tier as well.

186) Cost-based optimization: When we can determine the cost function for a state of the storage system or for the processing of a query, we naturally work toward the optimum by progressively decreasing the cost. Methods like simulated annealing serve this purpose. The tradeoff is that the cost function is an oversimplification: the tendency to consistently lower costs as a linear function does not represent all the parameters of the system. Data mining algorithms may help more here if we can form a decision tree or a classifier that encapsulates all the logic associated with the parameters, from both supervised and unsupervised learning.
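A minimal sketch of simulated annealing over a one-dimensional stand-in cost function (a quadratic with its minimum at 3); a storage system would substitute its own state cost and tune the schedule:

import java.util.Random;

public class Annealer {
    public static double minimize(java.util.function.DoubleUnaryOperator cost) {
        Random rnd = new Random(42);
        double x = rnd.nextDouble() * 10; // initial state
        for (double temp = 1.0; temp > 1e-4; temp *= 0.99) {
            double candidate = x + rnd.nextGaussian() * temp;
            double delta = cost.applyAsDouble(candidate) - cost.applyAsDouble(x);
            // accept improvements always, and worse moves with probability
            // exp(-delta / temp), which lets the search escape local minima
            if (delta < 0 || rnd.nextDouble() < Math.exp(-delta / temp)) {
                x = candidate;
            }
        }
        return x;
    }

    public static void main(String[] args) {
        // stand-in cost: minimum at x = 3
        System.out.println(minimize(x -> (x - 3) * (x - 3)));
    }
}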

187) AI/ML pipeline: One of the emerging trends in vectorized execution is its use with new AI/ML packages that are easy to run on GPU-based machines and that point at the data from the pipeline. While trees, graphs, and forests are a way to represent the decision-making models of the system, the storage tier can enable the analysis stack with better concurrency, partitions, and summation forms.

188) Declarative querying: SQL is a declarative query language. It works well for database systems and relational data; its bridging to document stores and Big Data is merely a convenience. A storage tier does not necessarily participate in the data management system, yet it has to enable querying.

189) Support for transformations from batch to stream processing using the same storage tier: Products like Apache Flume can support dual-mode processing by allowing transformations to different stores. Unless we have a data management system in place, a storage tier does not support SQL query keywords like Partition, Over, On, Before, or TumblingWindow. Support for SQL directly from the storage tier, using an iteration over storage resources, is rather limited. However, if support for products like Flume is there, then it makes no difference to the analysis whether the product is a time-series database or object storage.
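As a sketch of what a tumbling window computes, independent of any product: events are bucketed into aligned, non-overlapping windows of fixed width, and each bucket keeps a running count.

import java.util.Map;
import java.util.TreeMap;

public class TumblingWindowCount {
    private final long widthMillis;
    private final Map<Long, Long> counts = new TreeMap<>();

    public TumblingWindowCount(long widthMillis) { this.widthMillis = widthMillis; }

    public void add(long eventTimeMillis) {
        // align the event to the start of its window, then bump that bucket
        long windowStart = (eventTimeMillis / widthMillis) * widthMillis;
        counts.merge(windowStart, 1L, Long::sum);
    }

    public Map<Long, Long> counts() { return counts; }

    public static void main(String[] args) {
        TumblingWindowCount w = new TumblingWindowCount(1000);
        w.add(100); w.add(900); w.add(1500);
        System.out.println(w.counts()); // {0=2, 1000=1}
    }
}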

190) A pipeline may use the storage tier as a sink for its data. Since pipelines have their own reliability issues, a storage product must not degrade the pipeline no matter how many pipelines share the storage.