Sunday, January 27, 2019

Today we continue discussing the best practices from storage engineering:

375) Most storage products don’t differentiate between human and machine data because that distinction involves upper layers of data management. However, dedicated differentiation between human and machine data can make the products more customized for these purposes.

376) Data storage requirements change from industry to industry. Finance data storage is largely in the form of distributed indexes and continuous data transfers. A cold storage product does not serve its needs even if the objects are accessible over the web.

377) Finance data is subject to a lot of calculations by proprietary and often well-guarded calculation engines that have largely relied on relational databases. Yet these same companies have also adopted NoSQL storage in favor of their warehouses. As their portfolios grow, they incubate new and emerging features and increasingly favor new technologies.

378) Finance data is also heavily regulated. The Sarbanes-Oxley Act sought to bring control to corporations in order to avoid accounting scandals. It specified disclosure controls, audits, and compliance terms.

379) Health industry data is another example with its own needs around data compliance. The Health Insurance Portability and Accountability Act (HIPAA) required a lot of controls around who can get access to personally identifiable information, and when and where.

380) Health data is often tightly integrated into proprietary stacks and organizations. Yet these organizations are also required to provide web access to all the information surrounding an individual in one place. This makes them require virtualized, cross-company data storage.

Saturday, January 26, 2019


Today we continue discussing the best practices from storage engineering:
371) Data management software such as Cloudera can be deployed and run on any cloud. It offers an enterprise data hub, an analytics DB, an operational DB, data science and engineering, and essentials. It is elastic and flexible, it has high-performance analytics, it can easily provision over multiple clouds, and it can be used for automated metering and billing. Essentially, it allows different data models, real-time data pipelines, and streaming applications with its big data platform. It lets data models break free from vendor lock-in, with the flexibility to be community defined.

372) The data science workbench offered by Cloudera involves a console in a web browser where users can authenticate themselves using Kerberos against the cluster KDC. Engines are spun up, and we can seamlessly connect with Spark, Hive, and Impala. The engines are spun up based on engine kernels and profiles.

373) Cloudera Data Science Workbench uses Docker and Kubernetes. Cloudera is supported on dedicated Hadoop hosts. Cloudera also adds a data engineering service called Altus. It’s a platform that works against a cloud by allowing clusters to be set up and torn down and jobs to be submitted to those clusters. Clusters may be Apache Spark, MR2, or Hive.

374) Containerization technologies and Backend-as-a-Service (aka lambda functions) can also be supported by products such as Cloudera, which makes them usable with existing public clouds while offering an on-premise solution.

375) Most storage products don’t differentiate between human and machine data because that distinction involves upper layers of data management. However, dedicated differentiation between human and machine data can make the products more customized for these purposes.


Friday, January 25, 2019

Today we continue discussing the best practices from storage engineering:

366) Sometimes performance drives the necessity to create other storage products. Social networking utilizes storage products that are not typical of enterprise or cloud storage. This neither means that social networking applications cannot be built on cloud services nor does it mean that the on-premise storage products necessarily have to conform to organizational hardware or virtualware needs.

367) To improve performance and scalability, Facebook had to introduce additional parallelization in the runtime and the shared contexts, which they called "WorkerContext". Bottlenecks and overheads such as checkpointing were addressed by scheduling. This was at a finer level than what the infrastructure provided.
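
A minimal sketch of the worker-context idea (illustrative class and method names, not the actual Giraph API): one object per worker holds state shared by all vertex computations on that worker, so shared data is loaded once per superstep instead of once per vertex.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative per-worker shared context: one instance per worker process,
// visible to every vertex computation running on that worker.
class SharedWorkerContext {
    // shared, read-mostly lookup table loaded once per superstep
    final Map<Long, Double> priors = new ConcurrentHashMap<>();

    void preSuperstep(long superstep) {
        // refresh shared data here once, instead of per vertex
        if (superstep == 0) {
            priors.put(1L, 0.15);
        }
    }

    double priorFor(long vertexId) {
        return priors.getOrDefault(vertexId, 0.15);
    }
}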

368) Facebook even optimized the memory utilization of the graph infrastructure because it allowed arbitrary vertex id, vertex value, edge, and message classes. They did this by 1) serializing edges into byte arrays and 2) serializing messages on the server.
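
A minimal sketch of the byte-array idea (illustrative, not Facebook's actual classes): instead of one object per edge, a vertex keeps its out-edges packed into a single byte[], which removes per-object overhead and lets the whole adjacency be one allocation.

import java.nio.ByteBuffer;

// Illustrative: out-edges packed as (targetId: long, weight: float) records in one byte[].
class PackedEdges {
    private static final int RECORD = Long.BYTES + Float.BYTES; // 12 bytes per edge
    private final byte[] data;

    PackedEdges(long[] targets, float[] weights) {
        ByteBuffer buf = ByteBuffer.allocate(targets.length * RECORD);
        for (int i = 0; i < targets.length; i++) {
            buf.putLong(targets[i]).putFloat(weights[i]);
        }
        data = buf.array();
    }

    int size() { return data.length / RECORD; }

    long target(int i) { return ByteBuffer.wrap(data, i * RECORD, RECORD).getLong(); }

    float weight(int i) { return ByteBuffer.wrap(data, i * RECORD + Long.BYTES, Float.BYTES).getFloat(); }
}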

369) Facebook improved parallelization with sharded aggregators that provided efficient shared state across workers. With this approach, each aggregator gets assigned to a randomly picked worker which then gathers the values, performs the aggregation and distributes the final values to master and other workers. This distributes the load that was otherwise entirely on the master.
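
A minimal sketch of the sharded-aggregator idea (illustrative names): each named aggregator is owned by one worker chosen from its name, so partial values flow to that owner for reduction rather than all converging on the master.

import java.util.List;

// Illustrative: pick the owning worker for an aggregator by hashing its name.
class AggregatorSharding {
    static int ownerOf(String aggregatorName, int workerCount) {
        return Math.floorMod(aggregatorName.hashCode(), workerCount);
    }

    // Each worker sends its partial value to the owner; the owner reduces and
    // then redistributes the final value to the master and the other workers.
    static double reduce(List<Double> partialSums) {
        return partialSums.stream().mapToDouble(Double::doubleValue).sum();
    }
}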

370) Many companies view graphs as an abstraction rather than an implementation of the underlying database. There are two reasons for this:
First, key-value stores suffice to capture the same information as a graph and can provide flexibility and speed for operations that can be translated as queries on these stores. These can then be specialized for the top graph features that an application needs (see the sketch after this item).
Second, different organizations within the company require different stacks built on the same logical data for reasons such as business impact, modular design, and ease of maintenance.
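
A minimal sketch of the graph-on-a-key-value-store idea (illustrative; a HashMap stands in for the actual store): adjacency is kept under a per-vertex key, so a neighbor lookup is just a key-value get.

import java.util.*;

// Illustrative: vertex id -> adjacency list stored as an ordinary key-value mapping.
class KeyValueGraph {
    private final Map<String, List<String>> store = new HashMap<>();

    void addEdge(String from, String to) {
        store.computeIfAbsent(from, k -> new ArrayList<>()).add(to);
    }

    List<String> neighbors(String vertex) {
        return store.getOrDefault(vertex, Collections.emptyList());
    }
}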

Thursday, January 24, 2019

Today we continue discussing the best practices from storage engineering:

360) A rewarding service increases appeal and usage by customers. This is what makes Blockchain popular.

361) Many database clusters are used as failover clusters and not as performance or scalable clusters. This is primarily because the storage server is designed for scale-up versus scale-out. This is merely an emphasis on the judicious choice of technology for design.

362) Some database servers use a SAN; it is shared storage and does not go offline. Most storage products have embraced Network Attached Storage (NAS). Similar considerations apply when the database server is hosted in a container and the database is on a shared volume.

363) Clustering does not save space or effort for backup or maintenance. And it does not scale out the reads for the database. Moreover, it does not guarantee 100% uptime.

364) The reporting stack is usually a pull-and-transform operation on any database and is generally independent of the data manipulation from online transactions. Therefore, if a storage product can simplify its design by offloading the reporting stack to, say, a time-series database, Grafana, and a charting stack, then it can focus on storage-server-related design.

365) The above is not necessarily true for analysis stacks, which often produce a large number of artifacts during computations and as such are heavily engaged in reads and writes on the same storage stack.

Wednesday, January 23, 2019

Today we continue discussing the best practices from storage engineering:

355) P2P is considered a top-heavy network. Top-heavy means we have an inverted pyramid of layers where the bottom layer is the network layer.  This is the substrate that connects different peers.  The overlay nodes management layer handles the management of these peers in terms of routing, location lookup and resource discovery. The layer on top of this is the features management layer which involves security management, resource management, reliability and fault resiliency.

356) Let us take a look at contracts. We discussed this earlier in reference to Merkle trees. A smart contract can even include code and be evaluated dynamically, just like a classifier with rules executed in program order.
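
A minimal sketch of the rules-in-program-order idea (illustrative only; real smart-contract platforms have their own languages and runtimes): the contract is a list of predicate/outcome pairs checked top to bottom, and the first matching rule decides the result.

import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Illustrative rule-based "contract": rules are evaluated in program order.
class RuleContract {
    record Rule(Predicate<Map<String, Object>> when, String outcome) {}

    private final List<Rule> rules;

    RuleContract(List<Rule> rules) { this.rules = rules; }

    String evaluate(Map<String, Object> facts) {
        for (Rule r : rules) {
            if (r.when().test(facts)) return r.outcome(); // first match wins
        }
        return "no-match";
    }
}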

357) A utility token issued for authorization need not be opaque. It can be made up of three parts: one which represents the user, another which represents the provider, and an irrefutable part that is like a stamp of authority. The section from the provider may also include the scope for the resources authorized.
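
A minimal sketch of such a three-part token (field names are illustrative, not any particular token format): a user section, a provider section carrying the authorized scope, and a signature acting as the stamp of authority over the other parts.

// Illustrative three-part utility token.
record UtilityToken(String userId,      // part 1: who the token was issued to
                    String providerId,  // part 2: which provider issued it
                    String scope,       // provider section may also carry the authorized scope
                    byte[] signature) { // part 3: irrefutable stamp of authority over userId|providerId|scope
}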

358) The attestation mechanism itself might vary. This might include, for example, a Merkle tree where each node of the tree represents an element of Personally-Identifiable-Information (PII) along with the hash and the hash of the hashes of the child nodes. The root hash then becomes the fingerprint of the data being attested.
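
A minimal sketch of computing such a fingerprint (assuming SHA-256; leaf ordering and odd-node handling vary by implementation): each leaf is the hash of one PII element, each interior node is the hash of its children's hashes, and the root hash is the attestation fingerprint.

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.ArrayList;
import java.util.List;

// Illustrative Merkle-root computation over a list of PII elements.
class MerkleFingerprint {
    static byte[] sha256(byte[] in) throws Exception {
        return MessageDigest.getInstance("SHA-256").digest(in);
    }

    static byte[] root(List<String> piiElements) throws Exception {
        List<byte[]> level = new ArrayList<>();
        for (String e : piiElements) {
            level.add(sha256(e.getBytes(StandardCharsets.UTF_8))); // leaf = hash of the element
        }
        while (level.size() > 1) {
            List<byte[]> next = new ArrayList<>();
            for (int i = 0; i < level.size(); i += 2) {
                byte[] left = level.get(i);
                byte[] right = (i + 1 < level.size()) ? level.get(i + 1) : left; // duplicate last if odd
                byte[] combined = new byte[left.length + right.length];
                System.arraycopy(left, 0, combined, 0, left.length);
                System.arraycopy(right, 0, combined, left.length, right.length);
                next.add(sha256(combined)); // parent = hash of the child hashes
            }
            level = next;
        }
        return level.get(0); // the root hash is the fingerprint
    }
}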

359) An immutable record which has its integrity checked and agreed on an ongoing basis provides a venerable source of truth.

360) A rewarding service increases appeal and usage by customers. This is what makes Blockchain popular.


// Returns every permutation of every distinct selection.
List<List<Integer>> getAllArrangements(List<List<Integer>> selected) {
    List<List<Integer>> ret = new ArrayList<>();
    selected.stream().distinct().forEach(x -> ret.addAll(getPermutations(x)));
    return ret;
}
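
The snippet above relies on a getPermutations helper that is not shown; a minimal sketch of one possible implementation (recursive, with no attempt at efficiency):

// One possible getPermutations: all orderings of the given list, built recursively.
List<List<Integer>> getPermutations(List<Integer> items) {
    List<List<Integer>> ret = new ArrayList<>();
    if (items.isEmpty()) {
        ret.add(new ArrayList<>());
        return ret;
    }
    for (int i = 0; i < items.size(); i++) {
        List<Integer> rest = new ArrayList<>(items);
        Integer head = rest.remove(i);
        for (List<Integer> perm : getPermutations(rest)) {
            perm.add(0, head); // prepend the chosen head to each sub-permutation
            ret.add(perm);
        }
    }
    return ret;
}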

Tuesday, January 22, 2019

Today we continue discussing the best practices from storage engineering:

351) P2P can be structured or unstructured.

352) In a structured topology, the P2P overlay is tightly controlled, usually with the help of a distributed hash table (DHT). The location information for the data objects is deterministic as the peers are chosen with identifiers corresponding to the data object's unique key. Content therefore goes to specified locations, which makes subsequent queries easier.
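
A minimal sketch of the deterministic-placement idea (illustrative; real DHTs such as Chord or Kademlia use consistent hashing over a large identifier space with finger tables or XOR metrics): the object's key hashes to an identifier, and the peer whose identifier is its closest successor on the ring owns the content.

import java.util.TreeMap;

// Illustrative DHT placement: the peer with the smallest id >= hash(key) owns the key, wrapping around.
class SimpleDht {
    private final TreeMap<Integer, String> ring = new TreeMap<>(); // peer id -> peer address

    void addPeer(int peerId, String address) { ring.put(peerId, address); }

    String ownerOf(String objectKey) {
        int id = objectKey.hashCode() & 0x7fffffff;   // non-negative identifier for the object
        Integer owner = ring.ceilingKey(id);          // closest successor on the ring
        if (owner == null) owner = ring.firstKey();   // wrap around past the largest peer id
        return ring.get(owner);
    }
}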

353) Unstructured P2P is composed of peers joining based on some rules and usually without any knowledge of the topology. In this case the query is broadcast and peers that have matching content return the data to the originating peer. This is useful for highly replicated items but not appropriate for rare items. In this approach, peers become readily overloaded and the system does not scale when there is a high rate of aggregate queries.

354) Recently there have been some attempts at standardization on key-based routing (KBR) API abstractions and OpenHash, an open, publicly available DHT service that helps with a unification platform.

355) P2P is considered a top-heavy network. Top-heavy means we have an inverted pyramid of layers where the bottom layer is the network layer.  This is the substrate that connects different peers.  The overlay nodes management layer handles the management of these peers in terms of routing, location lookup and resource discovery. The layer on top of this is the features management layer which involves security management, resource management, reliability and fault resiliency.

Monday, January 21, 2019

Today we continue discussing the best practices from storage engineering:

345) A P2P network introduces a network-first design where peers are autonomous agents and there is a protocol to enable them to negotiate contracts, transfer data, verify the integrity and availability of remote data, and reward with payments. It provides tools to enable all these interactions. Moreover, it enables the distribution of the storage of a file as shards on this network, and these shards can be stored using a distributed hash table. The shards themselves need not be stored in this hash table; rather, a distributed network and messaging could facilitate it with location information.

346) Messages are helpful to enforce consistency as nodes come up or go down. For example, a gossip protocol may be used for this purpose, and it involves propagating updates via message exchanges.

347) Message exchanges can include state or operation transfers. Both involve the use of vector clocks; a minimal vector-clock sketch appears after this list.

348) In the state transfer model, each replica maintains a state version tree which contains all the conflicting updates. When the client sends its vector clock, the replicas will check whether the client state precedes any of its current versions and discard it accordingly. When a replica receives updates from other replicas via gossip, it will merge the version trees.

349) In the operation transfer model, each replica has to first apply all operations corresponding to the cause before those corresponding to the effect. This is necessary to keep the operations in the same sequence on all replicas and is achieved by adding another entry in the vector clock, a V-state, that represents the time of the last updated state. In order that this causal order is maintained, each replica will buffer the update operation until it can be applied to the local state. A tuple of two timestamps - one from the client's view and another from the replica's local view - is associated with every submitted operation.

350) Since operations are in different stages of processing on different replicas, a replica will not discard the state or operations it has completed until it sees the vector clocks from all others to have preceded it.
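
To make the vector-clock reasoning in 347-350 concrete, here is a minimal sketch (illustrative; it follows the usual conventions of element-wise comparison and element-wise maximum for merge): a clock precedes another if every counter is less than or equal to the corresponding counter, and merging takes the per-replica maximum.

import java.util.HashMap;
import java.util.Map;

// Illustrative vector clock: replica id -> logical counter.
class VectorClock {
    final Map<String, Long> counters = new HashMap<>();

    void tick(String replicaId) {
        counters.merge(replicaId, 1L, Long::sum); // local event on this replica
    }

    // true if every counter in this clock is <= the corresponding counter in other,
    // i.e. this state precedes (or equals) the other and can be discarded on receipt
    boolean precedesOrEquals(VectorClock other) {
        for (Map.Entry<String, Long> e : counters.entrySet()) {
            if (e.getValue() > other.counters.getOrDefault(e.getKey(), 0L)) return false;
        }
        return true;
    }

    // element-wise maximum, used when merging version trees during gossip
    void mergeWith(VectorClock other) {
        other.counters.forEach((id, c) -> counters.merge(id, c, Long::max));
    }
}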