Sunday, July 1, 2018

#codingexercise
Problem: Find the minimum number of steps to reach the end where each element represents the max number of steps that can be made forward from that element.
Solution:
int GetMinJumps(List<int> A, int start, int end)
{
    if (start == end) return 0;
    if (A[start] == 0) return int.MaxValue;

    int min = int.MaxValue;
    // the next pick is any reachable position to the right with the min cost
    for (int i = start + 1; i <= end && i <= start + A[start]; i++)
    {
        int jumps = GetMinJumps(A, i, end);
        if (jumps != int.MaxValue && jumps + 1 < min)
        {
            min = jumps + 1;
        }
    }
    return min;
}
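The recursion above recomputes the same subproblems exponentially many times. As an illustrative alternative (not part of the original exercise), a bottom-up sketch in Python fills a dp array from the end of the list:

```python
def min_jumps(a):
    # dp[i] = minimum number of jumps needed to reach the last index from i
    n = len(a)
    INF = float('inf')
    dp = [INF] * n
    dp[n - 1] = 0
    for i in range(n - 2, -1, -1):
        # from position i we may jump to any of the next a[i] positions
        for j in range(i + 1, min(i + a[i], n - 1) + 1):
            if dp[j] + 1 < dp[i]:
                dp[i] = dp[j] + 1
    return dp[0]
```

For example, min_jumps([1, 3, 5, 8, 9, 2, 6, 7, 6, 8, 9]) returns 3, via the jumps 1 → 3 → 8 → end.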

Saturday, June 30, 2018

#codingexercise
Problem: Given two container measurements of a and b used to distribute a commodity with a total quantity of d to n customers, determine the maximum number of customers who can be satisfied with their individual requirements of a and b.
Solution: use a greedy strategy that takes, in increasing order, the aggregated demands of a and b together for each person, which maximizes the customer count.

List<int> GetSatisfiedCustomers(List<Tuple<int,int>> demands, int a, int b, int d)
{
  Debug.Assert(demands != null);
  Debug.Assert(demands.Count > 0);
  Debug.Assert(d > 0);
  // aggregate each person's demand into unit measurements, keeping the index
  var aggregated = demands
        .Select((item, index) => new Tuple<int, int>(item.Item1 * a + item.Item2 * b, index))
        .ToList();
  aggregated.Sort(); // ascending by aggregated demand
  var customers = new List<int>();
  foreach (var demandByPerson in aggregated)
  {
      if (demandByPerson.Item1 <= d)
      {
          d -= demandByPerson.Item1;
          customers.Add(demandByPerson.Item2);
      }
  }
  return customers;
}
The above method represents the individual demand of each user in unit measurements and then sorts the users in increasing order of that demand. With the demands in increasing order, we are able to include as many customers as possible.
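The same greedy strategy can be sketched in Python; the function name and parameter order here are illustrative:

```python
def satisfied_customers(demands, a, b, d):
    # demands: list of (units of a, units of b) per customer; d: total quantity
    aggregated = sorted((x * a + y * b, i) for i, (x, y) in enumerate(demands))
    served = []
    for cost, i in aggregated:
        if cost <= d:          # serve the cheapest remaining customer first
            d -= cost
            served.append(i)
    return served
```

Serving the cheapest customers first leaves the most quantity for the rest, which is why the greedy order maximizes the count.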

Friday, June 29, 2018

We resume our discussion on the Storj network and its messaging.
The messaging system implemented by the Storj network is Quasar. The publisher-subscriber model is topic based and implemented using Bloom filters. The topics are predetermined and made available at the protocol level. This helps with implementing Storj over Kademlia. Three new message types are added that correspond to the subscribe, update and publish methods. The topics can therefore be extended to include such things as contract parameters and bandwidth reservation. The Kademlia messages facilitate the creation and propagation of filters. Each node maintains information about the topics to which it subscribes as well as those to which its neighbors subscribe. Three neighbors are chosen to form the initial subscribe list. The response to the subscribe includes the current filter list. The requesting node then merges these lists. This is a pull operation. The update message does the reverse: it pushes the filter changes to the three nearest neighbors. It usually follows the subscribe message, so the loop helps all the nodes learn what is relevant to their neighbors and to those reachable from them. Publish messages broadcast a message to the network. These messages are sent to the three nearest neighbors and include a topic parameter. If the topic is found in a node's own filter, the node processes the message. If it is found in the other filters in the filter list, it is forwarded to the three neighbors. If nothing matches, it is forwarded to a randomly selected peer. To prevent the message from coming back, the nodes add their id to the message. Publish messages also include a time to live to prevent spam attacks.
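A topic filter of this kind can be illustrated with a minimal Bloom filter sketch in Python; the class name and sizing are hypothetical and this is not Quasar's actual implementation. The OR-merge at the end mirrors how a node combines the filter lists it pulls from its neighbors:

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter; the size and hash count here are illustrative."""
    def __init__(self, size=256, hashes=3):
        self.size = size
        self.hashes = hashes
        self.bits = 0  # bit vector packed into a Python int

    def _positions(self, item):
        # derive k bit positions from salted SHA-256 digests
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest, "big") % self.size

    def add(self, item):
        for p in self._positions(item):
            self.bits |= 1 << p

    def __contains__(self, item):
        # may yield false positives, never false negatives
        return all(self.bits >> p & 1 for p in self._positions(item))

    def merge(self, other):
        # OR-merge, as when a node combines its neighbors' filter lists
        self.bits |= other.bits
```

Because membership tests can return false positives, a forwarded publish message may occasionally reach a node with no matching subscription, which is harmless in this protocol.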
We were comparing Kademlia with other P2P networks. In general they provide a good base for large scale data sharing and application level multicasting. Some of the desirable features of P2P networks include selection of peers, redundant storage, efficient location, hierarchical namespaces, authentication as well as anonymity of users. In terms of performance, P2P networks have desirable properties such as efficient routing, self-organization, massive scalability, robustness in deployments, fault tolerance, load balancing and explicit notions of locality. Perhaps the biggest takeaway is that a P2P network is an overlay network with no restriction on size, and there are two classes: structured and unstructured. Structured P2P means that the network topology is tightly controlled and the content is placed not at random peers but at specified locations, which makes subsequent queries more efficient. DHTs fall in this category, where the location of the data objects is deterministic and the keys are unique. In distributed computing we saw the benefit of arranging the nodes in a ring with hashes at different nodes. Napster was probably the first example to realize the distributed file sharing benefit, with the assertion that requests for popular content do not need to be sent to a central server. P2P file sharing systems are self-scaling.
#codingexercise
Given a paper of size A x B, cut the paper into squares of any size and determine the minimum number of squares.
uint GetMinCount(uint a, uint b)
{
    uint count = 0;
    uint min = Math.Min(a, b);
    uint max = Math.Max(a, b);
    // cut min x min squares from the longer side, as in the Euclidean algorithm
    while (min > 0)
    {
        count += max / min;
        uint remainder = max % min;
        max = min;
        min = remainder;
    }
    return count;
}
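A Python version of the same cut follows. Note that this greedy cut mirrors the Euclidean algorithm, always removing the largest square that fits; it is a heuristic, since for some sizes (such as 6 x 7) fewer squares are possible than the greedy count gives.

```python
def min_square_count(a, b):
    # repeatedly cut the largest square that fits, like the Euclidean algorithm
    small, large = min(a, b), max(a, b)
    count = 0
    while small > 0:
        count += large // small      # squares of side `small` cut from the strip
        large, small = small, large % small
    return count
```

For example, a 36 x 30 sheet yields one 30 x 30 square plus five 6 x 6 squares, six in total.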

Thursday, June 28, 2018

We resume our discussion on the Storj network.
We were looking at audits. The data owner must send a message to have the farmer issue the audit. This was done by extending the Kademlia message set with a new one. The message contains a hash of the data and a challenge. The farmer responds with a Merkle proof and the data owner issues payment to the farmer. The contract helps determine the exchange. Since it is between the data owner and the farmer, it includes all the elements for the node to form a relationship, transfer the data, respond to audits and settle on payments. Parties interested in forming a contract may use the publisher-subscriber model that is common in most messaging systems. Contracts are signed and stored by both parties.
The messaging system implemented by the Storj network is Quasar. The publisher-subscriber model is topic based and implemented using Bloom filters. The topics are predetermined and made available at the protocol level. This helps with implementing Storj over Kademlia. Three new message types are added that correspond to the subscribe, update and publish methods. The topics can therefore be extended to include such things as contract parameters and bandwidth reservation. The Kademlia messages facilitate the creation and propagation of filters. Each node maintains information about the topics to which it subscribes as well as those to which its neighbors subscribe. Three neighbors are chosen to form the initial subscribe list. The response to the subscribe includes the current filter list. The requesting node then merges these lists. This is a pull operation. The update message does the reverse: it pushes the filter changes to the three nearest neighbors. It usually follows the subscribe message, so the loop helps all the nodes learn what is relevant to their neighbors and to those reachable from them. Publish messages broadcast a message to the network. These messages are sent to the three nearest neighbors and include a topic parameter. If the topic is found in a node's own filter, the node processes the message. If it is found in the other filters in the filter list, it is forwarded to the three neighbors. If nothing matches, it is forwarded to a randomly selected peer. To prevent the message from coming back, the nodes add their id to the message. Publish messages also include a time to live to prevent spam attacks.
One of the things we quickly see in comparison with messaging systems such as RabbitMQ/ZeroMQ is that the distributed computing notions of virtual time and propagation are universal and required in these cases. Some of the fallacies we have to watch out for include the notions that the network is reliable, there is no latency, the bandwidth is infinite, the network is secure, the topology does not change, there is only one administrator, transport cost is zero, and the network is homogeneous.

Wednesday, June 27, 2018

We resume our discussion on the Storj network.
We were looking at contracts and negotiations. A contract is between the data owner and the farmer; the latter term describes the one who houses the data. The terms farmer and banker are interchangeable because he provides storage for one of the distributed ledgers and hopes to make money from mining it. The banker or farmer does not need to know the contents of the data. The data owner retains complete control over the encryption key, which determines access to the data. The data owner keeps a set of challenges, selects one of them and sends it to the farmer. The farmer uses the challenge and the data to generate the Merkle pre-leaf and uses it with the other leaves to generate a Merkle proof. This proof is sent back to the data owner. The data owner uses the root and tree depth to verify the proof.
The proofs may be compact but the tree may not be. Moreover, the shard may be hashed many times to generate pre-leaves. Therefore an optimization is used to generate partial audits using subsets of the data, which reduces overhead in compute and storage. Unlike full audits, partial audits give only a probabilistic assurance that the farmer retains the entire file. This means there could be false positives; since the probability is known, it gives a confidence level.
The data owner must send a message to have the farmer issue the audit. This was done by extending the Kademlia message set with a new one. The message contains a hash of the data and a challenge. The farmer responds with a Merkle proof and the data owner issues payment to the farmer. The contract helps determine the exchange. Since it is between the data owner and the farmer, it includes all the elements for the node to form a relationship, transfer the data, respond to audits and settle on payments. Parties interested in forming a contract may use the publisher-subscriber model that is common in most messaging systems. Contracts are signed and stored by both parties.
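To make the proof mechanics concrete, here is a generic Merkle proof sketch in Python using SHA-256 and duplicate-last padding. This is an illustrative construction only, not Storj's exact challenge and pre-leaf scheme:

```python
import hashlib

def h(data):
    return hashlib.sha256(data).digest()

def merkle_root(leaves):
    level = [h(leaf) for leaf in leaves]
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])  # duplicate the last node on odd levels
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

def merkle_proof(leaves, index):
    # collect the sibling hash (and its side) at every level up to the root
    level = [h(leaf) for leaf in leaves]
    proof = []
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])
        sibling = index ^ 1
        proof.append((level[sibling], sibling < index))
        index //= 2
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return proof

def verify(leaf, proof, root):
    # recompute the path from the leaf to the root using the sibling hashes
    node = h(leaf)
    for sibling, sibling_is_left in proof:
        node = h(sibling + node) if sibling_is_left else h(node + sibling)
    return node == root
```

The proof has only log(n) hashes for n leaves, which is why the data owner can verify retention while storing just the root and tree depth.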
#codingexercise
Connect N ropes with minimum cost
          Make a min-heap of the n ropes
          While the size of the heap is > 1
                       Extract the minimum cost rope
                       Extract the next minimum cost rope
                       Add their combined cost to the running total
                       Insert a rope with the combined cost
          The running total of combining costs is the minimum cost
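The rope steps above can be sketched with Python's heapq:

```python
import heapq

def min_rope_cost(ropes):
    # repeatedly combine the two cheapest ropes; the running total is minimal
    heap = list(ropes)
    heapq.heapify(heap)
    total = 0
    while len(heap) > 1:
        a = heapq.heappop(heap)
        b = heapq.heappop(heap)
        total += a + b
        heapq.heappush(heap, a + b)
    return total
```

For ropes [4, 3, 2, 6] the combines are 2+3=5, 4+5=9, 6+9=15, for a total of 29.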
Find the minimum number of coins from an infinite supply of varying denominations to form a change V
          sort the coins from big to small denominations
          while V > 0 and some denomination is <= V :
                pick a coin from the largest denomination that is smaller than or equal to V and add it to the collection to be returned
                reduce V by the value of the coin and repeat
          return the collection
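The coin steps can be sketched as follows. Note that the greedy choice is only guaranteed optimal for canonical denomination systems such as 1, 5, 10, 25; for arbitrary denominations a dynamic programming approach is needed.

```python
def min_coins_greedy(denominations, v):
    coins = []
    for d in sorted(denominations, reverse=True):
        while v >= d:          # take as many of the largest coin as fit
            coins.append(d)
            v -= d
    return coins
```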

Tuesday, June 26, 2018

We were discussing two new operators:

one for the use of additional columns to generate groups that facilitate a hierarchical view of entities, and another for additional rows to describe new pseudo-entities transformed from existing entities, which is helpful to view entities in the form of layers.
The ability to view the entities as a hierarchical organization lets us expand our logic to perform tree-based searches and to assign different roles to entities so that the data may have augmented information to improve the queries. Let us take an example. If we had an employee table that listed all the employees of an organization, then additional columns such as manager_level_001_id and manager_level_002_id and so on could provide information that does not repeatedly need to be calculated in subsequent queries. More importantly, the addition of columns for groups may be in a temporary table so that we don't necessarily write on the original data. Besides, if storage is unlimited, it helps to memoize results of processing that is going to be repeated. This favors a non-recursive tree algorithm to list the managers all the way to the root for any employee, as well as the reuse of the logic to determine the next level manager, as opposed to recursive queries across levels. This form of data extract, transform and load is popular in data reporting stacks that are used to publish charts and graphs from data.
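The non-recursive manager lookup and the precomputed level columns might be sketched as follows; the dictionary-based schema is a stand-in for the employee table:

```python
def manager_chain(manager_of, employee):
    # manager_of: dict mapping employee -> direct manager (root maps to None)
    chain = []
    m = manager_of.get(employee)
    while m is not None:        # iterative walk to the root, no recursion
        chain.append(m)
        m = manager_of.get(m)
    return chain

def add_level_columns(manager_of, employees):
    # precompute manager_level_001, _002, ... once, like a temporary table,
    # so later queries read the chain instead of recomputing it
    return {e: manager_chain(manager_of, e) for e in employees}
```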
The ability to view entities not just as those existing, but as composites that behave the same as existing entities, also enables simpler abstractions serving purposes different from those involving the original entities. We could take another example here. Let us say there was a table of email addresses. We could list different email addresses for the same person and also list different distribution lists that are made up of one or more email addresses. We could also dynamically combine individuals and distribution lists into new temporary entries. Combinations may be multi-leveled. Since there is a separation of concerns between the originals and the composites from the query's point of view, this is considered layering. Since the selection of original entities determines the composites for use in the higher layer, this is a different technique from grouping and hierarchies.
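Multi-level distribution lists as composite entities can be sketched like this; the address-book layout is hypothetical:

```python
def expand(address_book, entry, seen=None):
    # entry may be an individual email or a distribution list (possibly
    # multi-level); lists map to their members in address_book
    seen = seen if seen is not None else set()
    members = address_book.get(entry)
    if members is None:
        return {entry}          # an individual address is its own expansion
    out = set()
    for m in members:
        if m not in seen:       # guard against cyclic list definitions
            seen.add(m)
            out |= expand(address_book, m, seen)
    return out
```

The query layer only sees flat sets of addresses, while the composites live in their own layer, which is the separation of concerns described above.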

Monday, June 25, 2018

Yesterday we were discussing Storj as a peer to peer cloud storage network. Regardless of whether we store data via a distributed hash table, in streaming storage, object storage, or some cloud database, the enablement of data usage matters most in the choice of storage.
Let us take a look at one of the most common forms of data usage, which is querying. We had brought up earlier that the query language is usually independent of the data source and even the architecture. The language mostly enables the data usage and is a form of expression of logic.
We saw two different forms of expression of this language.
First, we saw the expression in the form of standard query operators, which became popular across languages as expressions such as .Where() and .Sum() as in LINQ. These were convenient primitives taking inspiration from the tried, tested and well established SQL query language. The language makes it possible to express queries in a succinct manner, often enabling chaining to refine and improve resultsets.
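A toy chainable query class in Python (illustrative only, not LINQ itself) shows the flavor of these operators:

```python
class Query:
    """Minimal chainable query over an in-memory sequence."""
    def __init__(self, items):
        self.items = list(items)

    def where(self, predicate):
        # filter, analogous to .Where()
        return Query(x for x in self.items if predicate(x))

    def select(self, transform):
        # projection, analogous to .Select()
        return Query(transform(x) for x in self.items)

    def sum(self):
        return sum(self.items)
```

Each operator returns a new Query, so calls chain left to right to refine the resultset.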
The second form of expression was the search query language, which has had a rich history in shell scripting and log analysis, where the results of one command are piped into another command for transformation and analysis. Although similar in nature to chaining operators, the expressions of this form of query involve more search-like capabilities across heterogeneous data, such as the use of regular expressions for detecting patterns. This form of expression not only involves powerful query operators but also facilitates data extract, transform and load as the data makes its way through the expressions.
Operators used in both these forms of query expressions have expanded to include more involved forms of computation such as grouping, ranking, sorting and similar analysis.
In particular, two new operators are possible - one for the use of additional columns to generate groups that facilitate a hierarchical view of entities, and another for additional rows to describe new pseudo-entities transformed from existing entities, which is helpful to view entities in the form of layers.
The ability to view the entities as a hierarchical organization lets us expand our logic to perform tree-based searches and to assign different roles to entities so that the data may have augmented information to improve the queries. More importantly, the addition of columns for groups may be in a temporary table so that we don't necessarily write on the original data. This form of data extract, transform and load is popular in data reporting stacks that are used to publish charts and graphs from data.
The ability to view entities not just as those existing, but as composites that behave the same as existing entities, also enables simpler abstractions serving purposes different from those involving the original entities. Since there is a separation of concerns between the originals and the composites, this is considered layering. Since the selection of original entities determines the composites for use in the higher layer, this is a different technique from grouping and hierarchies.
#codingexercise
We were discussing the count of the number of decreasing paths in a matrix when traversal is permitted only between adjacent horizontal and vertical cells. The same works for increasing paths.
  • Solution: we can sum the number of paths with each cell as the starting location if we know how to do it for a given starting position.

initialize the sum to one for the number of paths at the center, where the center forms a standalone single-element path
foreach of the four adjacent positions around the center:
          if the value is greater than the center:
                recurse to find the count at the adjacent cell if it's not available in the dp matrix, or insert the value there after finding it.
                add the count through that adjacent cell to the sum at the center.
return the sum of paths for this position
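The pseudocode above, applied to increasing paths, can be written with memoization in Python:

```python
def count_paths(matrix):
    # counts strictly increasing paths (every path of length >= 1),
    # moving only between adjacent horizontal and vertical cells
    n, m = len(matrix), len(matrix[0])
    dp = [[0] * m for _ in range(n)]   # 0 marks an uncomputed cell

    def paths_from(i, j):
        if dp[i][j]:
            return dp[i][j]
        total = 1  # the cell alone is a standalone single-element path
        for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            x, y = i + di, j + dj
            if 0 <= x < n and 0 <= y < m and matrix[x][y] > matrix[i][j]:
                total += paths_from(x, y)
        dp[i][j] = total
        return total

    # sum over every cell as the starting location
    return sum(paths_from(i, j) for i in range(n) for j in range(m))
```

For [[1, 2], [3, 4]] this returns 10: four single cells, four length-2 paths (1-2, 1-3, 2-4, 3-4) and two length-3 paths (1-2-4 and 1-3-4).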