Wednesday, November 11, 2020

Network engineering continued ...

 This is a continuation of the earlier posts starting with this one: http://ravinote.blogspot.com/2020/09/best-practice-from-networking.html

  1. The availability of a service is improved by deploying a cluster rather than a single server. Background-loading processes, on the other hand, can use a primary server together with secondary servers: the primary is authoritative, but a secondary can serve the content when the primary is unavailable. 

  1. There are two flavors of the release consistency model: RCsc, which maintains sequential consistency among the special (synchronization) operations, and RCpc, which maintains processor consistency among them. All of the models in this group allow a processor to read its own write early; however, RCpc (along with PowerPC) is the only one whose straightforward implementation allows a read to return the value of another processor's write early. These models distinguish memory operations based on their type and provide stricter ordering constraints for some types of operations. 

  1. The weak ordering model classifies memory operations into two categories: data operations and synchronization operations. The programmer is required to label as synchronization any operation that enforces ordering; the model is then free to reorder the data operations between consecutive synchronization operations without affecting the program's correctness. 

  1. The other category of models - those relaxing all program orders, such as Alpha, RMO, and PowerPC - all provide explicit fence instructions as their safety nets (a rough Java analogue appears after this list). The Alpha model provides two different fence instructions: the memory barrier (MB) and the write memory barrier (WMB). The MB instruction can be used to maintain program order from any memory operation before the MB to any memory operation after it. The WMB instruction provides this guarantee only among write operations. 

  1. The PowerPC model provides a single fence instruction: the SYNC instruction. It behaves like the memory barrier (MB) with one exception: when there are two reads to the same location, the second read may return the value of an older write than the value returned by the first. The model therefore requires read-modify-write semantics to enforce program order between two reads to the same location. 

  1. A key goal of the programmer-centric approach is to define the operations that should be distinguished as synchronization. In other words, the programmer labels the operations of a program as either synchronization or data, and a program whose synchronization operations are correctly identified behaves as if it were sequentially consistent. 
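The fence instructions above have rough analogues in high-level languages. The following is a minimal Java sketch using the VarHandle fence methods (the class and field names are illustrative): VarHandle.fullFence() orders all memory operations across it, much like Alpha's MB, while VarHandle.storeStoreFence() orders only writes, much like the WMB.

import java.lang.invoke.VarHandle;

public class FenceExample {
    static int data = 0;
    static boolean ready = false;

    // Writer: the store-store fence keeps the write to data ordered
    // before the write to ready, analogous to Alpha's WMB.
    static void writer() {
        data = 42;
        VarHandle.storeStoreFence();
        ready = true;
    }

    // Reader: the load-load fence keeps the read of ready ordered before
    // the read of data; fullFence() would order all memory operations,
    // analogous to Alpha's MB.
    static int reader() {
        if (ready) {
            VarHandle.loadLoadFence();
            return data;
        }
        return -1;
    }
}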

Tuesday, November 10, 2020

Network engineering continued ...

This is a continuation of the earlier posts starting with this one: http://ravinote.blogspot.com/2020/09/best-practice-from-networking.html

The number of storage nodes in use by the store can, however, be changed. When this happens, the store undergoes reconfiguration: the partitions are rebalanced between old and new shards, with partitions redistributed from one shard to another.  

The more partitions there are, the finer the granularity for this reconfiguration. It is typical to have ten to twenty partitions per shard. Since the number of partitions cannot be changed afterwards, it is decided at design time. 
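As a minimal sketch of this arrangement (the hash-based scheme below is an assumption for illustration, not any particular product's algorithm), keys map to a fixed set of partitions, and only the partition-to-shard assignment changes during reconfiguration:

// The partition count is fixed at design time; the shard count can change.
static int partitionOf(String key, int numPartitions) {
    return Math.floorMod(key.hashCode(), numPartitions);
}

// Reconfiguration only changes how partitions are assigned to shards.
static int shardOf(int partition, int numShards) {
    return partition % numShards;
}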

The number of nodes belonging to a shard is called its replication factor, and increasing it improves throughput. The higher the replication factor, the faster the reads, because more replicas are available to serve them. The same is not true for writes, since every write must be copied to more replicas. Once the replication factor is set, the storage product takes care of creating the appropriate number of replication nodes for each shard.

A topology is a collection of storage nodes, replication nodes, and the associated services. At any point in time, a deployed store has exactly one topology. The initial topology is chosen to minimize the possibility of a single point of failure for any given shard: if a storage node hosts more than one replication node, those replication nodes will not be from the same shard, so that if the host machine goes down, the shard can continue to serve reads and writes.
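A minimal check of that placement rule might look like the following sketch (the map layout and names are illustrative):

import java.util.*;

// hostToShards maps each storage node to the shard ids of the replication
// nodes it hosts; placement is safe if no host repeats a shard.
static boolean noSinglePointOfFailure(Map<String, List<Integer>> hostToShards) {
    for (List<Integer> shards : hostToShards.values()) {
        if (new HashSet<>(shards).size() != shards.size()) return false;
    }
    return true;
}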

A buffer pool is a common technique used in databases and storage products to provide access to pages of data. Along with the array of frames that holds pages in memory, it maintains a hash table that maps the page numbers currently held in memory to their locations in the frame table, the location of each page on disk, and metadata about the page such as a dirty bit and any information needed by the page replacement policy. 
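A minimal sketch of such a buffer pool follows (the class and field names are illustrative; a real product adds latching, eviction, and disk I/O):

import java.util.HashMap;
import java.util.Map;

class BufferPool {
    static class Frame {
        byte[] page;        // in-memory copy of the page
        long diskOffset;    // location of the page on disk storage
        boolean dirty;      // set when the in-memory page is modified
        long lastUsed;      // metadata for the page replacement policy
    }

    private final Frame[] frames;                                 // the frame table
    private final Map<Long, Integer> pageTable = new HashMap<>(); // page number -> frame index

    BufferPool(int capacity) {
        frames = new Frame[capacity];
    }

    // Returns the frame holding the page, or null on a miss; a real pool
    // would then pick a victim frame, write it back if dirty, and read
    // the requested page from disk.
    Frame get(long pageNumber) {
        Integer slot = pageTable.get(pageNumber);
        return slot == null ? null : frames[slot];
    }
}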

#codingexercise

Given a BST, determine whether it is valid: 

boolean isValidBST(Node root) {
    return isValidBST(root, Long.MIN_VALUE, Long.MAX_VALUE);
}

// Comparing only the immediate children is not sufficient: every node must
// lie within the (min, max) bounds imposed by all of its ancestors.
boolean isValidBST(Node root, long min, long max) {
    if (root == null) return true;
    if (root.data <= min || root.data >= max) return false;
    return isValidBST(root.left, min, root.data)
        && isValidBST(root.right, root.data, max);
}


Monday, November 9, 2020

 This is a continuation of the earlier posts starting with this one: http://ravinote.blogspot.com/2020/09/best-practice-from-networking.html 

A reporting stack is usually a pull-and-transform operation over a database and is generally independent of the data manipulation performed by online transactions. Therefore, if a service can simplify its design by offloading reporting to, say, a time-series database with Grafana and a charting stack, it can focus on business-driven design. 

  1. The above is not necessarily true for analysis stacks, which often produce a large number of artifacts during computation and are therefore heavily engaged in read-write traffic on the same storage stack. 

  1. Sometimes performance drives the necessity to create other storage products. Social networking applications, for instance, utilize storage products that are not typical of enterprise or cloud storage. This neither means that such applications cannot be built on cloud services nor that on-premise storage products necessarily have to conform to organizational hardware or virtualization needs. 

  1. To improve performance and scalability, Facebook had to introduce additional parallelization in the runtime and a shared context, which they called "WorkerContext". Bottlenecks and overheads such as checkpointing were addressed by scheduling at a finer granularity than the infrastructure provided. 

  1. Facebook even optimized the memory utilization of the graph infrastructure, which was necessary because it allows arbitrary vertex-id, vertex-value, edge, and message classes. They did this by 1) serializing edges with a byte array and 2) serializing messages on the server. 

  1. Facebook improved parallelization with sharded aggregators that provide an efficient shared state across workers. With this approach, each aggregator is assigned to a randomly picked worker, which gathers the values, performs the aggregation, and distributes the final values to the master and the other workers. This distributes load that would otherwise fall entirely on the master. 

  1. Many companies view graphs as an abstraction rather than an implementation of the underlying database. There are two reasons for this:  
    First, key-value stores suffice to capture the same information as a graph and can provide flexibility and speed for operations that translate into queries on these stores. These can then be specialized for the top graph features that an application needs (a sketch follows this list). 
    Second, different organizations within the company require different stacks built on the same logical data, for reasons such as business impact, modular design, and ease of maintenance. 
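As a sketch of the first point (the layout is illustrative, not any particular store's schema), an adjacency list kept in an ordinary key-value map answers neighborhood queries with a single key lookup:

import java.util.*;

// A directed graph stored as key-value pairs: the key is the source
// vertex and the value is its adjacency list.
class KVGraph {
    private final Map<String, List<String>> kv = new HashMap<>();

    void addEdge(String from, String to) {
        kv.computeIfAbsent(from, k -> new ArrayList<>()).add(to);
    }

    // "Neighbors of v" translates to a single key lookup.
    List<String> neighbors(String v) {
        return kv.getOrDefault(v, Collections.emptyList());
    }
}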

Sunday, November 8, 2020

Network engineering continued ...

This is a continuation of the earlier posts starting with this one: http://ravinote.blogspot.com/2020/09/best-practice-from-networking.html

P2P can be structured or unstructured.

In a structured topology, the P2P overlay is tightly controlled, usually with the help of a distributed hash table (DHT). The location information for data objects is deterministic because the peers are chosen with identifiers corresponding to the data object's unique key. Content therefore goes to specified locations, which makes subsequent queries easier.
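A minimal sketch of that deterministic placement follows, assuming consistent hashing onto a ring of node identifiers (real DHTs such as Chord add finger tables and replication):

import java.util.Map;
import java.util.TreeMap;

// Each key is stored on the first node whose identifier is at or after
// the key's hash on the ring, wrapping around at the end.
class SimpleDHT {
    private final TreeMap<Integer, String> ring = new TreeMap<>();

    void join(String nodeAddress) {
        ring.put(nodeAddress.hashCode() & 0x7fffffff, nodeAddress);
    }

    String lookup(String key) {
        if (ring.isEmpty()) return null;
        int h = key.hashCode() & 0x7fffffff;
        Map.Entry<Integer, String> owner = ring.ceilingEntry(h);
        return owner != null ? owner.getValue() : ring.firstEntry().getValue();
    }
}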

Unstructured P2P is composed of peers joining based on some rules and usually without any knowledge of the topology. In this case, the query is broadcast and peers that have matching content return the data to the originating peer. This is useful for highly replicated items but not appropriate for rare items. In this approach, peers become readily overloaded and the system does not scale when there is a high rate of aggregate queries.
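A minimal sketch of such a flooded query follows (illustrative only; the TTL bound below is the usual way real systems keep the broadcast from overloading peers):

import java.util.*;

class Peer {
    final String id;
    final Set<String> content = new HashSet<>();
    final List<Peer> neighbors = new ArrayList<>();

    Peer(String id) { this.id = id; }

    // Broadcast the query to neighbors, decrementing a TTL at each hop;
    // peers holding the item report back to the originator.
    List<String> query(String item, int ttl, Set<String> visited) {
        List<String> hits = new ArrayList<>();
        if (ttl < 0 || !visited.add(id)) return hits;
        if (content.contains(item)) hits.add(id);
        for (Peer n : neighbors) hits.addAll(n.query(item, ttl - 1, visited));
        return hits;
    }
}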

Recently there have been some attempts at standardization, notably the key-based routing (KBR) API abstractions and OpenHash, an open, publicly available DHT service that helps with a unification platform. 

P2P is considered a top-heavy network: an inverted pyramid of layers whose bottom layer is the network layer, the substrate that connects the different peers. Above it, the overlay-node management layer handles the management of these peers in terms of routing, location lookup, and resource discovery. On top of that sits the features management layer, which covers security management, resource management, reliability, and fault resiliency.

#codingexercise
Trim nodes of a BST that lie outside a range:

public static Node trim(Node root, int low, int high) {
    if (root == null) return root;
    root.left = trim(root.left, low, high);
    root.right = trim(root.right, low, high);
    if (root.data < low || root.data > high) {
        // At most one of root.left and root.right can be non-null here:
        // if root.data < low, the trimmed left subtree is already null,
        // and if root.data > high, the trimmed right subtree is already null.
        if (root.right != null) return root.right;
        if (root.left != null) return root.left;
        return null;
    }
    return root;
}

Saturday, November 7, 2020

Network engineering continued ...

 This is a continuation of the earlier posts starting with this one: http://ravinote.blogspot.com/2020/09/best-practice-from-networking.html

  1. A Peer-to-Peer (P2P) network is popular for sharing content. In production storage, however, peer-to-peer networks have not been popular, because data in them accrues based on popularity, not on utility.  


  1. A P2P network introduces a network-first design where peers are autonomous agents and a protocol enables them to negotiate contracts, transfer data, verify the integrity and availability of remote data, and reward one another with payments. It provides tools to enable all these interactions. Moreover, it enables the storage of a file to be distributed as shards on this network, and these shards can be located using a distributed hash table. The shards themselves need not be stored in the hash table; rather, a distributed network and messaging can facilitate access by carrying their location information. 


  1. Messages are helpful to enforce consistency as nodes come up or go down. For example, a gossip protocol may be used for this purpose and it involves propagating updates via message exchanges. 

  1. Message exchanges can include state or operation transfers. Both involve the use of vector clocks (a minimal sketch appears after this list). 

  1. In the case of the state transfer model, each replica maintains a state-version tree that contains all the conflicting updates. When the client sends its vector clock, the replicas check whether the client's state precedes any of their current versions and discard it accordingly. When a replica receives updates from other replicas via gossip, it merges the version trees. 


  1. In the operation transfer model, each replica has to first apply all operations corresponding to the cause before those corresponding to the effect. This is necessary to keep the operations in the same sequence on all replicas and is achieved by adding another entry to the vector clock, a V-state, that represents the time of the last updated state. So that this causal order is maintained, each replica buffers the update operation until it can be applied to the local state. A tuple of two timestamps - one from the client's view and another from the replica's local view - is associated with every submitted operation. 


  1. Since operations are in different stages of processing on different replicas, a replica will not discard the state or operations it has completed until it sees that the vector clocks from all the others have preceded it. 

  1. P2P can be structured or unstructured.
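The vector clocks used in the state and operation transfer models above can be sketched minimally as per-replica counters that merge entrywise (a sketch only; real systems pair this with the version trees and operation buffering described above):

import java.util.*;

class VectorClock {
    private final Map<String, Long> counters = new HashMap<>();

    // A replica increments its own entry on each local update.
    void tick(String replicaId) {
        counters.merge(replicaId, 1L, Long::sum);
    }

    // "Happened before": every entry is <= the other's and at least one is
    // strictly less. If neither clock precedes the other, the updates
    // conflict and both versions must be kept.
    boolean precedes(VectorClock other) {
        boolean strictlyLess = false;
        Set<String> keys = new HashSet<>(counters.keySet());
        keys.addAll(other.counters.keySet());
        for (String k : keys) {
            long mine = counters.getOrDefault(k, 0L);
            long theirs = other.counters.getOrDefault(k, 0L);
            if (mine > theirs) return false;
            if (mine < theirs) strictlyLess = true;
        }
        return strictlyLess;
    }

    // Gossip merge: take the entrywise maximum.
    void merge(VectorClock other) {
        other.counters.forEach((k, v) -> counters.merge(k, v, Math::max));
    }
}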