Friday, December 21, 2018

Today we continue discussing the best practices from storage engineering:

205) Social networking applications also have heavy load-balancing requirements, and therefore more servers may need to be provisioned to handle their load. Since a storage tier does not necessarily expose load-balancing semantics, it should call out when an external load balancer must be used.

206) Networking, rather than storage, dominates when distributed hash tables and message queues scale to social networking applications. WhatsApp's Erlang-on-FreeBSD architecture has shown unmatched symmetric multiprocessing (SMP) scalability.

207) Unstructured data generally becomes heterogeneous because there is no schema to keep it consistent. This is a big challenge for both data ingestion and machine parsing of the data. Moreover, the data tends to remain incomplete as its variety and heterogeneity increase.

208) Timeliness in processing large data sets is important to certain systems that cannot tolerate wide error margins. Since elapsed time and execution time diverge depending on the rate at which tasks get processed, the overall schedule hits hard limits unless there are ways to trade off compute against storage.

209) Spatial proximity of data is also important to prevent potential congestion along network routes. In the cloud this is overcome with dedicated cloud networks and direct-connect offerings for companies, but this is not generally the norm for on-premise storage.

210) Location-based services are a key addition to many data stores simply because location gives more interpretations to the data that solve major business use cases. To facilitate this, the location may become part of the data and be maintained routinely, or there must be a service close to the storage that relates the two.
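
A minimal sketch of the first approach in Java, assuming a hypothetical LocationTagged wrapper so that location travels with each record and a service close to the storage can answer bounding-box queries:

import java.util.ArrayList;
import java.util.List;

// Hypothetical wrapper: the location is part of the data and maintained with it.
class LocationTagged<T> {
    final T payload;
    final double lat, lon;
    LocationTagged(T payload, double lat, double lon) {
        this.payload = payload; this.lat = lat; this.lon = lon;
    }
}

class GeoIndex<T> {
    private final List<LocationTagged<T>> items = new ArrayList<>();

    void put(T payload, double lat, double lon) {
        items.add(new LocationTagged<>(payload, lat, lon));
    }

    // Returns all records inside a latitude/longitude bounding box.
    List<T> withinBox(double minLat, double maxLat, double minLon, double maxLon) {
        List<T> result = new ArrayList<>();
        for (LocationTagged<T> item : items) {
            if (item.lat >= minLat && item.lat <= maxLat
                    && item.lon >= minLon && item.lon <= maxLon) {
                result.add(item.payload);
            }
        }
        return result;
    }
}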

Thursday, December 20, 2018

Today we continue discussing the best practices from storage engineering:

200) Data modeling and analysis: A data model may be described with entity-relationship diagrams, JSON documents, objects, or graph nodes. However, the models are not final until after several trials. Keeping versions of the data models also helps subsequent analysis.

201) Data aggregation depends largely on the tools used. A SQL query can perform rollups; a map-reduce job can perform summation. These are very different usages, but the storage tier can improve performance if it is dedicated to either. To separate the usages on a shared storage tier, we could classify the workloads and present different sites with redundant copies or materialized views.
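
The contrast between the two aggregation styles can be illustrated with Java streams; the Sale record and the figures here are hypothetical:

import java.util.List;
import java.util.Map;
import static java.util.stream.Collectors.groupingBy;
import static java.util.stream.Collectors.summingLong;

class AggregationSketch {
    record Sale(String region, String product, long amount) {}

    public static void main(String[] args) {
        List<Sale> sales = List.of(
            new Sale("east", "disk", 100), new Sale("east", "ssd", 250),
            new Sale("west", "disk", 75));

        // SQL-style rollup: subtotals per (region, product) within each region.
        Map<String, Map<String, Long>> rollup = sales.stream()
            .collect(groupingBy(Sale::region,
                     groupingBy(Sale::product, summingLong(Sale::amount))));

        // Map-reduce style summation: map each record to its amount, reduce by addition.
        long total = sales.stream().mapToLong(Sale::amount).sum();

        System.out.println(rollup + " total=" + total);
    }
}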

202) Even if we separate the query-processing use of the storage tier from raw data transfer into the storage, we still need to keep read-only data in partitions separate from read-write data. This alleviates performance concerns as well as inconsistent views.

203) Social networking data has been a phenomenal use case for unstructured data storage instead of relational databases. This trend only expands, and the requirements for processing chat messages and group discussions are very different from those of conventional file or object-based storage.

204) The speed of data transfer, and not just the size, is also very critical in the case of social networking applications such as Facebook and Twitter. In such cases, we have to support a large number of concurrent messages. A messaging platform such as WhatsApp's, written in Erlang, may be more performant than servers written with extensive inter-process communication.

205) Social networking applications also have heavy load-balancing requirements, and therefore more servers may need to be provisioned to handle their load. Since a storage tier does not necessarily expose load-balancing semantics, it should call out when an external load balancer must be used.


Wednesday, December 19, 2018

Today we continue discussing the best practices from storage engineering:

195) A user's location, personally identifiable information, and location-based service data are required to be redacted. This involves not only parsing for such data but also doing it over and over, starting from admission control at the boundaries of integration. If the storage tier stores any of this information in the clear during or after the data transfer from the pipeline, it will not only be a security violation but will also fail compliance.
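
A minimal sketch of such redaction in Java; the patterns are illustrative assumptions, not a complete PII catalog:

import java.util.regex.Pattern;

class Redactor {
    // Illustrative patterns only; real PII detection needs far broader coverage.
    private static final Pattern EMAIL =
        Pattern.compile("[\\w.+-]+@[\\w-]+\\.[\\w.]+");
    private static final Pattern SSN =
        Pattern.compile("\\b\\d{3}-\\d{2}-\\d{4}\\b");
    private static final Pattern LAT_LON =
        Pattern.compile("-?\\d{1,3}\\.\\d+,\\s*-?\\d{1,3}\\.\\d+");

    // Applied at admission control, before anything is written to storage.
    static String redact(String record) {
        String out = EMAIL.matcher(record).replaceAll("[REDACTED-EMAIL]");
        out = SSN.matcher(out).replaceAll("[REDACTED-SSN]");
        out = LAT_LON.matcher(out).replaceAll("[REDACTED-LOCATION]");
        return out;
    }
}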

196) Data pipelines can be extended. The storage tier needs to be elastic and capable of meeting future demand. Object storage enables this seamlessly because it virtualizes the storage. If the storage spans clusters, nodes can be added. Segregation of data is done with storage containers.

197) When data gets connected, its value expands. Even if the storage tier does not see more than containers, it does very well when all the data appears in its containers. Connected data has a far larger audience than the pieces had independently. Consequently, the storage tier should facilitate data acquisition and connections.

198) Big data is generally accumulated from some source. Sensor data, for example, can be stored in NoSQL databases. However, the data is usable only when the right metadata is recorded with the observational data. To do this continuously, the storage tier must facilitate metadata acquisition.

199) Cleaning and parsing: Raw data is usually noisy and imperfect. It has to be carefully parsed. For example, with full-text analysis, we perform stemming and multi-stage pre-processing before the analysis. This applies to admission control and ingestion as well.
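
A minimal sketch of multi-stage pre-processing in Java; the suffix rules here are a crude stand-in for a real stemmer such as Porter's:

import java.util.ArrayList;
import java.util.List;

class TextPreprocessor {
    // Stage 1: normalize case and strip punctuation.
    // Stage 2: tokenize on whitespace.
    // Stage 3: naive suffix stemming.
    static List<String> preprocess(String raw) {
        String normalized = raw.toLowerCase().replaceAll("[^a-z0-9\\s]", " ");
        List<String> stems = new ArrayList<>();
        for (String token : normalized.trim().split("\\s+")) {
            if (!token.isEmpty()) stems.add(stem(token));
        }
        return stems;
    }

    static String stem(String token) {
        if (token.endsWith("ing") && token.length() > 5) return token.substring(0, token.length() - 3);
        if (token.endsWith("s") && token.length() > 3) return token.substring(0, token.length() - 1);
        return token;
    }
}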

200) Data modeling and analysis: A data model may be described with entity-relationship diagrams, JSON documents, objects, or graph nodes. However, the models are not final until after several trials. Keeping versions of the data models also helps subsequent analysis.


#codingexercise
How does a SkipList work?
SkipList nodes keep multiple next pointers that point ahead by decreasing strides, say 4, 2, and 1.
In a sorted skip list, a search emulated over single-stride links works as follows:

class SkipListNode {
    int data;
    SkipListNode next;
}

// Advances from node a toward node b in a sorted singly linked list,
// emulating skip strides of 4, 2, and 1; returns the first node past b.
static SkipListNode skipAhead(SkipListNode a, SkipListNode b) {
    if (a == null || b == null) return a;
    SkipListNode cur = a;
    SkipListNode target = b;
    while (cur.next != null && cur.next.data <= target.data) {
        // skip by 4, if possible
        if (cur.next.next != null && cur.next.next.next != null
                && cur.next.next.next.next != null
                && cur.next.next.next.next.data <= target.data) {
            cur = cur.next.next.next.next;
        // skip by 2, if possible
        } else if (cur.next.next != null
                && cur.next.next.data <= target.data) {
            cur = cur.next.next;
        // skip by 1
        } else {
            cur = cur.next;
        }
    }
    return cur.next;
}
Since a real skip list already maintains links at skip levels of 4, 2, 1 and so on, it avoids these checks and the chained next.next.next notation.

The skip strides need not be restricted to 4, 2, and 1; a node in a real skip list keeps one forward pointer per level, as sketched below.
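
A minimal sketch of such a node in Java, assuming a fixed maximum of four levels and omitting insertion and level assignment:

// A real skip list keeps one forward pointer per level, so the search
// drops down a level instead of chaining next.next.next checks.
class SkipList {
    static final int MAX_LEVEL = 4;

    static class Node {
        int data;
        Node[] forward = new Node[MAX_LEVEL];
        Node(int data) { this.data = data; }
    }

    final Node head = new Node(Integer.MIN_VALUE);

    // Returns the last node whose data is <= target.
    Node search(int target) {
        Node cur = head;
        for (int level = MAX_LEVEL - 1; level >= 0; level--) {
            while (cur.forward[level] != null && cur.forward[level].data <= target) {
                cur = cur.forward[level];
            }
        }
        return cur;
    }
}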

Tuesday, December 18, 2018

Today we continue discussing the best practices from storage engineering:

191) A pipeline must hold up against a data tsunami. In addition, data flow may fluctuate, and the pipeline must hold against the ebb and the flow. Data may be measured in rate, duration, and size, and the pipeline may need to become elastic. If the storage tier cannot accommodate a single pipeline's elasticity, it must state its Service Level Agreement clearly.

192) Data and its services have a large legacy in the form of existing products, processes, and practices. A storage tier cannot disrupt any of these and therefore must provide the flexibility to handle such diverse data transfers as Extract-Transform-Load and analytical pipelines.

193) Extension: ETL may require flexibility in extending logic and in usage on clusters versus servers. Microservices are much easier to write, and they became popular alongside big data storage. Together they have bound compute and storage into their own verticals, with data stores expanding in number and variety. Queries written in one service now need to be rewritten in another while the pipeline may or may not support data virtualization. Depending on the nature of the pipeline, the storage tier may also change.

194) Both synchronous and asynchronous processing need to be facilitated so that some data transfers can run online while others are relegated to the background. Publisher-subscriber message queues may be used in this regard; services and brokers do not scale as well as cluster-based message queues. It might take nearly a year to fetch the data into the analytics system and only a month for the analysis. While the benefit to the user may be immense, their patience for the overall elapsed time may be thin. Consequently, a storage tier can at best avoid requiring frequent and repeated data transfers from the user. Object storage, for instance, handles multi-zone replication automatically.
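
A minimal sketch of the asynchronous path in Java, using a blocking queue as a stand-in for a publisher-subscriber message queue:

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

class TransferBroker {
    private final BlockingQueue<String> queue = new LinkedBlockingQueue<>(1024);

    // Publisher: the online path enqueues and returns immediately.
    void publish(String transfer) throws InterruptedException {
        queue.put(transfer);
    }

    // Subscriber: a background worker drains transfers asynchronously.
    void startSubscriber() {
        Thread worker = new Thread(() -> {
            try {
                while (true) {
                    String transfer = queue.take();
                    System.out.println("processing " + transfer);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        worker.setDaemon(true);
        worker.start();
    }
}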

195) A user's location, personally identifiable information, and location-based service data are required to be redacted. This involves not only parsing for such data but also doing it over and over, starting from admission control at the boundaries of integration. If the storage tier stores any of this information in the clear during or after the data transfer from the pipeline, it will not only be a security violation but will also fail compliance.

Monday, December 17, 2018

185) Reliability of data: A storage platform can provide redundancy and availability, but it has no control over the content. The data from pipelines may sink into the storage, but if the pipeline is not the source of truth, the storage tier cannot guarantee that the data is reliable. Garbage in, garbage out applies to the storage tier as well.

186) Cost-based optimization: When we can determine the cost function for a state of the storage system or for the processing of a query, we naturally work toward the optimum by progressively decreasing the cost. Methods like simulated annealing serve this purpose. But the tradeoff is that the cost function is an oversimplification; the tendency to model cost as a consistently decreasing linear function does not represent all the parameters of the system. Data mining algorithms may help more here if we can form a decision tree or a classifier that encapsulates the logic associated with the parameters from both supervised and unsupervised learning.
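
A minimal sketch of simulated annealing in Java; the one-dimensional cost function is a stand-in for whatever the system actually measures:

import java.util.Random;

class AnnealingSketch {
    public static void main(String[] args) {
        Random rng = new Random(42);
        double x = rng.nextDouble() * 10;                    // initial state
        double temperature = 1.0;

        for (int step = 0; step < 10_000; step++) {
            double candidate = x + (rng.nextDouble() - 0.5); // neighbor state
            double delta = cost(candidate) - cost(x);
            // Always accept improvements; accept regressions with a
            // probability that shrinks as the temperature cools.
            if (delta < 0 || rng.nextDouble() < Math.exp(-delta / temperature)) {
                x = candidate;
            }
            temperature *= 0.999;                            // cooling schedule
        }
        System.out.println("near-optimum: " + x + " cost=" + cost(x));
    }

    // Stand-in cost function; a real system would measure query latency etc.
    static double cost(double x) { return (x - 3) * (x - 3); }
}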

187) AI/ML pipeline: One of the emerging trends is the use of vectorized execution with new AI/ML packages that are easy to run on GPU-based machines while pointing at the data from the pipeline. While trees, graphs, and forests are a way to represent the decision-making models of the system, the storage tier can enable the analysis stack with better concurrency, partitioning, and summation forms.

188) Declarative querying: SQL is a declarative query language. It works well for database systems and relational data; its bridging to document stores and big data is merely a convenience. A storage tier does not necessarily participate in data management systems, yet it has to enable querying.

189) Support for transformations from batch to stream processing using the same storage tier: Products like Apache Flume can support dual-mode processing by allowing transformations to different stores. Unless we have a data management system in place, a storage tier does not support SQL query keywords like PARTITION, OVER, ON, BEFORE, or TumblingWindow. Support for SQL directly from the storage tier, using an iteration of storage resources, is rather limited. However, if support for products like Flume is there, then it makes no difference to the analysis whether the product is a time-series database or object storage.
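
A minimal sketch in Java of the bucketing a TumblingWindow clause implies; the Event record and window size are hypothetical:

import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

class TumblingWindowSketch {
    record Event(long timestampMillis, double value) {}

    // Buckets events into fixed, non-overlapping (tumbling) windows and
    // sums each bucket, the kind of work the SQL clause delegates.
    static Map<Long, Double> tumble(List<Event> events, long windowMillis) {
        return events.stream().collect(Collectors.groupingBy(
            e -> e.timestampMillis() / windowMillis,   // window index
            TreeMap::new,
            Collectors.summingDouble(Event::value)));
    }
}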

190) A pipeline may use the storage tier as a sink for its data. Since pipelines have their own reliability issues, a storage product must not degrade the pipeline no matter how many pipelines share the storage.

Sunday, December 16, 2018

Today we continue discussing the best practices from storage engineering:

181) Vectorized execution means data is processed in a pipelined fashion without the intermediary results seen in map-reduce. Push-based execution means that relational operators push their results to their downstream operators rather than waiting for those operators to pull data. If the storage can serve both models of execution, they work really well.
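
A minimal sketch of push-based execution in Java; the filter, projection, and aggregation here stand in for relational operators fused into one pass:

import java.util.function.Consumer;

// Push-based plan: each operator pushes rows to its downstream consumer
// instead of the downstream pulling, and no intermediate result is staged.
class PushPipeline {
    static void scan(int[] rows, Consumer<Integer> downstream) {
        for (int row : rows) downstream.accept(row);   // source pushes rows
    }

    public static void main(String[] args) {
        int[] rows = {1, 2, 3, 4, 5, 6};
        long[] sum = {0};
        scan(rows, row -> {
            if (row % 2 == 0) {            // filter operator
                int projected = row * 10;  // projection operator
                sum[0] += projected;       // aggregation operator
            }
        });
        System.out.println("sum=" + sum[0]);
    }
}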

182) Data types, their formats, and their semantics have evolved widely. The pipeline must at least support primitives such as variants, objects, and arrays. Support for data stores has become more important than support for services. Therefore, data stores do well to support at least some of these primitives in order to cater to a wide variety of analytical workloads.

183) Time travel: Time travel means walking through different versions of the data. In SQL queries, this is done with the help of the AT and BEFORE keywords. Timestamps can be absolute, relative to the current time, or relative to previous statements. This is similar to change data capture in SQL Server, which also keeps a historical record of all the changes, except that we get there differently.
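
A minimal sketch in Java, assuming versions are keyed by timestamp so that AT and BEFORE become floor and lower lookups into the history:

import java.util.TreeMap;

class TimeTravelStore<V> {
    private final TreeMap<Long, V> versions = new TreeMap<>();

    void write(long timestamp, V value) { versions.put(timestamp, value); }

    // AT: the latest version at or before the given timestamp.
    V readAt(long timestamp) {
        var entry = versions.floorEntry(timestamp);
        return entry == null ? null : entry.getValue();
    }

    // BEFORE: the latest version strictly before the given timestamp.
    V readBefore(long timestamp) {
        var entry = versions.lowerEntry(timestamp);
        return entry == null ? null : entry.getValue();
    }
}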

184) Multi-version concurrency control: This provides concurrent access to the data in what may be referred to as transactional memory. To prevent reads and writes from seeing inconsistent views, we use locking or multiple copies of each data item. Since a version is a snapshot in time and any change results in a new version, a writer can roll back its change while making sure the change is never seen by others.
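
A minimal sketch of multi-version reads in Java; a real implementation would add garbage collection of old snapshots and conflict detection:

import java.util.HashMap;
import java.util.Map;

// Each commit produces a new numbered snapshot; readers pin a version and
// never see a half-applied change, and a writer rolls back simply by
// abandoning its uncommitted working copy.
class MvccStore {
    private final Map<Long, Map<String, String>> snapshots = new HashMap<>();
    private long currentVersion = 0;

    MvccStore() { snapshots.put(0L, new HashMap<>()); }

    synchronized long beginRead() { return currentVersion; }

    String read(long version, String key) { return snapshots.get(version).get(key); }

    synchronized void commit(Map<String, String> changes) {
        Map<String, String> next = new HashMap<>(snapshots.get(currentVersion));
        next.putAll(changes);                  // copy-on-write new version
        snapshots.put(++currentVersion, next);
    }
}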

185) Reliability of data: A storage platform can provide redundancy and availability, but it has no control over the content. The data from pipelines may sink into the storage, but if the pipeline is not the source of truth, the storage tier cannot guarantee that the data is reliable. Garbage in, garbage out applies to the storage tier as well.

Saturday, December 15, 2018

Today we continue discussing the best practices from storage engineering:

175) Storage can be used with any analysis package. As a storage tier, a product only serves the basic create-update-delete-list operations on its resources. Anything above and beyond that is in the compute layer. Consequently, storage products do well when they integrate nicely with popular analysis packages.

176) When it comes to integration, programmability is important. How well a machine learning SDK can work with the data in the storage matters not only to the analysis side but also to the storage side. The workloads from such analysis are significantly different from others because they require GPUs for the heavy iterations in their algorithms.

177) There are many data sources that can feed the analysis side, but the storage tier is uniquely positioned as a staging area for most in- and out-bound data transfers. Consequently, the easier it is to trace the data through the storage tier, the more popular it becomes with the analysis side.

178) Although we mentioned documentation, we have not elaborated on the plug-and-play kind of architecture. If the storage tier can act as a man in the middle with very little disruption to ongoing data I/O, it can become useful in many indirect ways that differ from the usual direct storage of data.

179) A key aspect of the storage tier is the recognition of a variety of data formats, or in a specific case, file types. If the file is an image rather than a text file, certain storage chores such as deduplication and compression help less. Therefore, if the storage product provides the ability to differentiate data natively, that helps its acceptance and popularity (see the sketch after this list).

180) File types are equally important for applications. Workloads become segregated by file types. If a storage container holds all kinds of files, the ability to work with them as if the file types were independent would immensely benefit separating workloads and providing differential treatment.
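
A minimal sketch of native file-type recognition in Java, using magic bytes; the signatures shown are a tiny illustrative subset:

// Content sniffing by leading magic bytes; a real product would cover many
// more signatures. Already-compressed types gain little from dedup/compression.
class FileTypeSniffer {
    static String sniff(byte[] header) {
        if (header.length >= 3 && (header[0] & 0xFF) == 0xFF
                && (header[1] & 0xFF) == 0xD8 && (header[2] & 0xFF) == 0xFF) {
            return "jpeg";          // already compressed; skip compression stage
        }
        if (header.length >= 4 && header[0] == 'P' && header[1] == 'K'
                && header[2] == 3 && header[3] == 4) {
            return "zip";           // already compressed
        }
        return "text/unknown";      // candidate for dedup and compression
    }
}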