Friday, July 13, 2018

We were discussing the difference between structured and unstructured P2P networks.
In a structured topology, the P2P overlay is tightly controlled, usually with the help of a distributed hash table (DHT). The location information for the data objects is deterministic, as the peers are chosen with identifiers corresponding to the data object's unique key. Content therefore goes to specified locations, which makes subsequent queries easier.
Since the query executes locally at the responsible node, performance is greatly improved compared to distributing the query over many peers.
Unstructured P2P is composed of peers joining based on some rules and usually without any knowledge of the topology. In this case the query is broadcast and peers that have matching content return the data to the originating peer. This is useful for highly replicated items but not appropriate for rare items. In this approach, peers become readily overloaded and the system does not scale when there is a high rate of aggregate queries.
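To make the contrast concrete, here is a minimal C# sketch; the Peer type, its neighbor list, the hash-modulo placement and the TTL-bounded flood are all illustrative assumptions rather than any particular overlay protocol (the usual System and System.Collections.Generic usings are assumed).
class Peer
{
    public int Id;
    public List<Peer> Neighbors = new List<Peer>();
    public Dictionary<string, string> Store = new Dictionary<string, string>();

    // Structured overlay: the key hashes to a deterministic position, so a lookup
    // is routed straight to the responsible peer instead of being broadcast.
    public static Peer Locate(List<Peer> ring, string key)
    {
        int slot = Math.Abs(key.GetHashCode()) % ring.Count;
        return ring[slot];
    }

    // Unstructured overlay: the query is flooded to neighbors with a TTL and every
    // peer holding matching content answers back to the originator.
    public IEnumerable<Peer> Flood(string key, int ttl, HashSet<int> visited)
    {
        if (ttl < 0 || !visited.Add(Id)) yield break;
        if (Store.ContainsKey(key)) yield return this;
        foreach (var neighbor in Neighbors)
            foreach (var hit in neighbor.Flood(key, ttl - 1, visited))
                yield return hit;
    }
}
The flood works well when many peers replicate the content, but a rare item may never be reached before the TTL expires, which is the scaling problem noted above.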
Sensor data is one domain that utilizes acquisitional query processing (AQP). Based on query needs, AQP techniques intelligently determine which nodes to acquire data from. Sensor databases generally manifest four different types of querying:
First, the time-series analysis queries are primarily focused on detecting trends or anomalies in archived streams. They can specify sort order, anomalies and non-contiguous change patterns. Incident detection falls under this category.
Second, the similarity search queries. In this class of queries, a user is interested in determining whether the data is similar to a given pattern. Again using archived data, the pattern matching helps determine events of interest. Surveillance data querying falls under this category.
Third, the classification queries are related to similarity search queries but these run classification algorithms that group and tag the event data. Determining grades of data is one example of this query processing.
Fourth, the signal processing queries. These are heavy computations performed directly on the data, such as the Fast Fourier Transform and filtering. They enable interpretations that are not possible via the grouping, sorting, searching and ranking techniques mentioned earlier.
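As a rough illustration of the first two classes, the sketch below assumes the archived readings are available in memory; the window size, the deviation threshold and the query pattern are hypothetical parameters, and System.Linq is assumed for the aggregates.
// time-series analysis: flag readings that deviate from the trailing window mean
IEnumerable<int> DetectAnomalies(IList<double> readings, int window, double threshold)
{
    for (int i = window; i < readings.Count; i++)
    {
        var trailing = readings.Skip(i - window).Take(window).ToList();
        double mean = trailing.Average();
        double sd = Math.Sqrt(trailing.Average(x => (x - mean) * (x - mean)));
        if (sd > 0 && Math.Abs(readings[i] - mean) > threshold * sd)
            yield return i;   // index of a suspected anomaly or incident
    }
}

// similarity search: Euclidean distance between a query pattern and an archived slice
double Distance(IList<double> archived, IList<double> pattern, int offset)
{
    double sum = 0;
    for (int i = 0; i < pattern.Count; i++)
    {
        double d = archived[offset + i] - pattern[i];
        sum += d * d;
    }
    return Math.Sqrt(sum);   // smaller distance means a closer match to the pattern
}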

#codingexercise
void PrintPairsForGivenSum(List<int> sorted, int sum)
{
    validate(sorted);
    for (int i = 0; i < sorted.Count; i++) {
        // look up the complement of the current element; binarySearch returns -1 when absent
        int index = binarySearch(sorted, sum - sorted[i], i);
        if (index != -1 && index != i) {
            Console.WriteLine("{0} {1}", sorted[i], sorted[index]);
        }
    }
}

Thursday, July 12, 2018

We were discussing the P2P network, which differs from other forms of distributed computing. For example, it differs from cluster computing, cloud computing, grid computing and connected topologies. In P2P, every node can act both as a client and a server. The nodes act autonomously and are free to join or leave the network. The network may even be the internet. There is no central co-ordination and no central master-slave relationship. The system is self-organizing without the need for a central global view from a co-ordinator. If Blockchain becomes a popular storage with which farmers can mine, a P2P network leveraging blockchain storage might become popular as a computing model.
P2P networks can be both structured and unstructured. In a structured topology, the P2P overlay is tightly controlled, usually with the help of a distributed hash table (DHT). The location information for the data objects is deterministic, as the peers are chosen with identifiers corresponding to the data object's unique key. Content therefore goes to specified locations, which makes subsequent queries easier.
Since the query executes locally at the responsible node, performance is greatly improved compared to distributing the query over many peers.
Unstructured P2P is composed of peers joining based on some rules and usually without any knowledge of the topology. In this case the query is broadcast and peers that have matching content return the data to the originating peer. This is useful for highly replicated items but not appropriate for rare items. In this approach, peers become readily overloaded and the system does not scale when there is a high rate of aggregate queries.
#codingexercise
Given an unsorted array find the largest gap between any two elements
int GetLargestGap(List<int> A)
{
    // the largest gap between any two elements is the difference between the extremes
    return Math.Abs(A.Max() - A.Min());
}

Wednesday, July 11, 2018

We were discussing the P2P network as a top-heavy architecture as opposed to the storage-first architectures. Let us elaborate on this a bit more. Top-heavy means we have an inverted pyramid of layers where the bottom layer is the network layer. This is the substrate that connects different peers. The overlay nodes management layer handles the management of these peers in terms of routing, location lookup and resource discovery. The layer on top of this is the features management layer, which involves security management, resource management, reliability and fault resiliency. Over this we have the services-specific layer, which includes services management, metadata, services messaging and services scheduling. Finally, we have the application layer on top, which involves applications, tools and services. Fundamentally, P2P networks do not arise from established and connected groups of systems. They don't have a reliable set of resources. Yet they have fault-tolerance, self-organization and massive scalability properties.
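To make the inverted pyramid concrete, the following purely illustrative C# interfaces name one abstraction per layer; the members are assumptions chosen to mirror the responsibilities listed above, not the API of any real P2P framework.
// bottom of the inverted pyramid: the substrate connecting the peers
interface INetworkLayer { void Send(int peerId, byte[] message); }
// overlay nodes management: routing, location lookup and resource discovery
interface IOverlayManagement { int Route(string key); IEnumerable<int> Discover(string resource); }
// features management: security, resource management, reliability, fault resiliency
interface IFeaturesManagement { bool Authenticate(int peerId); void Replicate(string key); }
// services-specific layer: services management, metadata, messaging, scheduling
interface IServicesLayer { void Schedule(string service); IDictionary<string, string> GetMetadata(string service); }
// top of the pyramid: applications, tools and services
interface IApplicationLayer { void Run(string application); }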

P2P networking differs from other forms of distributed computing. For example, it differs from cluster computing, cloud computing, grid computing and connected topologies. In P2P, every node can act both as a client and a server. The nodes act autonomously and are free to join or leave the network. The network may even be the internet. There is no central co-ordination and no central master-slave relationship. The system is self-organizing without the need for a central global view from a co-ordinator. If Blockchain becomes a popular storage with which farmers can mine, a P2P network leveraging blockchain storage might become popular as a computing model.

Courtesy: IEEE publications on these comparisons.
#codingexercise
Find the maximum number of unbalanced brackets of any one kind, apart from the balanced pairs. For example:
[]][][ => 1
[[][]] => 0
int GetCountSwaps(string p)
{
    validate(p);
    Stack<char> open = new Stack<char>();
    Stack<char> close = new Stack<char>();
    for (int i = 0; i < p.Length; i++)
    {
        if (p[i] == '[') {
            open.Push('[');
        }
        if (p[i] == ']') {
            if (open.Count > 0) {
                open.Pop(); // found a match
            } else {
                close.Push(']');
            }
        }
    }
    // whichever kind has more leftovers gives the maximum count of unbalanced brackets of one kind
    return Math.Max(close.Count, open.Count);
}
The GeekToGeek solution for this appears incorrect.

Tuesday, July 10, 2018

We return to discussing storage as a network. In particular, I want to bring up a level of separation between storage and networking and show that by moving this separation further into one domain we get the possibilities of technologies that are vastly different than if it were pushed into the other domain. For example, Peer-to-Peer (P2P) networking provides a good base for large-scale data sharing and application-level multicasting. Some of the desirable features of P2P networks include selection of peers, redundant storage, efficient location, hierarchical namespaces, authentication as well as anonymity of users. In terms of performance, P2P has desirable properties such as efficient routing, self-organization, massive scalability, robustness in deployments, fault tolerance, load balancing and explicit notions of locality. Perhaps the biggest takeaway is that P2P is an overlay network with no restriction on size, and there are two classes: structured and unstructured. Structured P2P means that the network topology is tightly controlled and the content is placed not at random peers but at specified locations, which makes subsequent queries more efficient. DHTs fall in this category, where the location of the data objects is deterministic and the keys are unique. Napster was probably the first example to realize the distributed file sharing benefit with the assertion that requests for popular content do not need to be sent to a central server. P2P file sharing systems are self-scaling.
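A minimal sketch of that deterministic placement, loosely in the spirit of a DHT such as Chord: peers sit on a sorted ring of identifiers and a key is served by the first peer at or after its hash. The Ring type and the FNV-style hash are assumptions for illustration only.
class Ring
{
    private readonly SortedDictionary<uint, string> nodes = new SortedDictionary<uint, string>();

    public void Join(string nodeName) { nodes[Hash(nodeName)] = nodeName; }

    // the peer responsible for a key is the first node at or after the key's hash
    public string Locate(string key)
    {
        uint h = Hash(key);
        string first = null;
        foreach (var kv in nodes)
        {
            if (first == null) first = kv.Value;
            if (kv.Key >= h) return kv.Value;
        }
        return first;   // wrap around the ring
    }

    private static uint Hash(string s)
    {
        uint h = 2166136261;                    // FNV-1a style hash, enough for a sketch
        foreach (char c in s) { h ^= c; h *= 16777619; }
        return h;
    }
}
Because every peer applies the same hash, the content for a key goes to the same place no matter who stores or queries it, which is what turns a lookup into a single routed request rather than a broadcast.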
We have referred to the P2P network as a top-heavy architecture as opposed to the storage-first architectures. Let us elaborate on this a bit more. Top-heavy means we have an inverted pyramid of layers where the bottom layer is the network layer. This is the substrate that connects different peers. The overlay nodes management layer handles the management of these peers in terms of routing, location lookup and resource discovery. The layer on top of this is the features management layer, which involves security management, resource management, reliability and fault resiliency. Over this we have the services-specific layer, which includes services management, metadata, services messaging and services scheduling. Finally, we have the application layer on top, which involves applications, tools and services. Fundamentally, P2P networks do not arise from established and connected groups of systems. They don't have a reliable set of resources. Yet they have fault-tolerance, self-organization and massive scalability properties.
Courtesy: IEEE publications on these comparisons.
#codingexercise
Find the maximum number of unbalanced brackets remaining after all the swaps needed to balance the brackets. For example:
[]][][ => 1
[[][]] => 0
int GetCountSwaps(string p)
{
    validate(p);
    Stack<char> open = new Stack<char>();
    Stack<char> close = new Stack<char>();
    for (int i = 0; i < p.Length; i++)
    {
        if (p[i] == '[') {
            open.Push('[');
        }
        if (p[i] == ']') {
            if (open.Count > 0) {
                open.Pop(); // found a match
            } else {
                close.Push(']');
            }
        }
    }
    // each swap pairs one leftover ']' with one leftover '['; the rest stay unbalanced
    int swaps = Math.Min(close.Count, open.Count);
    return Math.Max(close.Count - swaps, open.Count - swaps);
}
The GeekToGeek solution for this appears incorrect.


There is a way to build a solution using a linear scan with counters; however, only the stack keeps track of the sequence of unbalanced brackets.

One optimization in the preceding approach is that we treat an open-close pair as requiring zero swaps and a close-open pair as requiring one swap.
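For reference, here is a sketch of that linear scan with two counters in place of the stacks; the helper below is hypothetical and only reproduces the leftover counts that the stack versions compute, without preserving the sequence of unbalanced brackets.
void CountLeftovers(string p, out int open, out int close)
{
    open = 0;    // '[' still waiting for a match
    close = 0;   // ']' that arrived before any matching '['
    foreach (char c in p)
    {
        if (c == '[') open++;
        else if (c == ']')
        {
            if (open > 0) open--;   // found a match
            else close++;
        }
    }
    // Math.Min(open, close) gives the swaps needed to pair the leftovers;
    // after pairing, Math.Max(open, close) - Math.Min(open, close) brackets remain unbalanced.
}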

Monday, July 9, 2018

Natural language processing algorithms are often based on a collocation weight matrix, where the document is taken as a collection of words, each of which is represented by a limited-dimension feature vector consisting of weights from this weight matrix.


However, we can also consider a semantic matrix which assigns weights to words, again based on a similar pattern.
Since each matrix represents vectors in a different feature space, a word may no longer be treated as just one vector but as a combination of different vectors.

This results in a form of superimposed adjacency matrix.
Correlating feature spaces is like building a regression model. We don't want to do that, but if we assume equal contributions from each of the hidden matrix layers (collocation, semantics, etc.) then we can aggregate the weights not just along one plane but also vertically. The caveat is that we now seem to have an expanded union of all features across layers.
The addition of layers does not take away any of the benefits from the layer involving the collocation matrix. Instead, it enables custom layers in the form of tags associated with words. We don't necessarily have to go with the well-known collocation- and semantics-based relationships and can use tags for domain-specific text.
The weights across layers of matrices also do not need to be uniform. They can be skewed in favor of the customization layer so that domain-specific processing has more levers for tuning.
A vector space has a point of reference and dimensions. Combining vector spaces is not easy but not impossible; spacetime is an example. Instead of combining the spaces, the above method of layered matrices merely assigns them different weights for their representation of the whole.
Another approach may be to take a single matrix where the association between words is no longer based only on collocation probability but on a combination of different heuristics. If we can generalize associativity beyond mere collocation, we can still use the same technique of finding the hidden matrix with a neural net and a softmax classifier. In such cases, collocation, semantics and proprietary tags such as shades of meaning can become dimensions of the associativity between words.
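A minimal sketch of that aggregation, assuming each layer (collocation, semantics, custom tags) supplies a fixed-length vector of the same dimension for a given word along with a per-layer weight; the method and its inputs are hypothetical.
double[] CombineEmbeddings(IDictionary<string, double[]> layers, IDictionary<string, double> layerWeights)
{
    // layers maps a layer name such as "collocation" or "semantics" to that layer's word vector
    double[] combined = null;
    foreach (var layer in layers)
    {
        var v = layer.Value;
        if (combined == null) combined = new double[v.Length];   // assumes equal lengths across layers
        double w = layerWeights.ContainsKey(layer.Key) ? layerWeights[layer.Key] : 1.0;
        for (int i = 0; i < v.Length; i++)
            combined[i] += w * v[i];   // weighted superimposition across the layers
    }
    return combined;
}
Setting every weight to the same value gives the equal-contribution case described above, while boosting the weight of the tagging layer is the skew toward domain-specific tuning.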

#codingexercise
Find the minimum number of swaps needed to balance brackets. For example:
[]][][ => 1
[[][]] => 0
int GetCountSwaps(string p)
{
    validate(p);
    Stack<char> open = new Stack<char>();
    Stack<char> close = new Stack<char>();
    for (int i = 0; i < p.Length; i++)
    {
        if (p[i] == '[') {
            open.Push('[');
        }
        if (p[i] == ']') {
            if (open.Count > 0) {
                open.Pop(); // found a match
            } else {
                close.Push(']');
            }
        }
    }
    // each swap pairs one leftover ']' with one leftover '['
    return Math.Min(close.Count, open.Count);
}
The GeekToGeek solution for this appears incorrect.

Sunday, July 8, 2018


#codingexercise
Implement a virtual pottery wheel method that conserves mass but shapes it according to an external factor (the touches).
List<int> ShapeOnPotteryWheel(List<int> diameters, List<int> touches)
{
    Assert(diameters.Count == touches.Count); // one entry per unit height of clay
    Assert(touches.All(x => x >= 0));
    double volume = 0; // clay displaced by the touches
    for (int i = 0; i < touches.Count; i++)
    {
        var before = diameters[i];
        diameters[i] -= 2 * touches[i]; // a touch thins the wall on both sides
        var after = diameters[i];
        Assert(after > 0);
        volume += (Math.PI / 4) * (before * before - after * after);
    }
    Assert(volume >= 0);
    // the displaced volume adds up at the top with the last diameter
    var last = diameters.Last();
    int count = (int)(volume / ((Math.PI / 4) * last * last));
    if (count > 0)
        diameters.AddRange(Enumerable.Repeat(last, count));
    return diameters;
}


Saturday, July 7, 2018

When we iterate namespaces, buckets and objects, we often have to rely on visiting each one of them sequentially. There is no centralized data structure that speeds this up, nor are they organized in a sorted manner unlike the indexes. These S3 artifacts may be stored over a layer that provides data structures that speed up lookups. One such example is a B-plus tree, a data structure that stores ranges by their keys. Another example is a skip list, a data structure that relies on links not only between adjacent records but also ones that skip adjacencies, usually by powers of two. Such techniques improve lookup because they avoid having to visit each element one after the other.
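As a small illustration of why such structures help, the sketch below uses a sorted key list and a binary search to stand in for the B-plus tree or skip list; the key names and range bounds are hypothetical.
// assumes sortedKeys is sorted with StringComparer.Ordinal
IEnumerable<string> RangeLookup(List<string> sortedKeys, string fromKey, string toKey)
{
    // the binary search finds the first candidate in O(log n), the analogue of what a
    // B-plus tree or skip list provides, instead of visiting every key sequentially
    int start = sortedKeys.BinarySearch(fromKey, StringComparer.Ordinal);
    if (start < 0) start = ~start;   // first key >= fromKey
    for (int i = start; i < sortedKeys.Count; i++)
    {
        if (string.CompareOrdinal(sortedKeys[i], toKey) > 0) yield break;
        yield return sortedKeys[i];
    }
}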
Perhaps we need to put the improvements from these data structures in perspective. Many commercial products are content with iterating over collections, simply by virtue of scoping and delegating the bulk of data processing to downstream layers. This brings up the salient notion that no layer alone suffices to meet the operational gains expected by the user. Data infrastructure management spans the whole stack. For example, the presence of local storage and processing capability is required to reduce expensive communication. If the lower layer performs redundant and repeated requests and responses, that is going to cost much more than the benefits of a speedier lookup. The ability to delegate querying to lower layers is referred to as push-down technology. The ability to push down filters, for instance, saves on communication costs and the need for maintaining complex data structures at the upper layers.
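A minimal sketch of the push-down idea with a hypothetical IObjectStore interface; the point is only that the filter travels down with the request so that the lower layer ships back matches rather than everything.
interface IObjectStore
{
    // the lower layer evaluates the predicate locally and returns only the matches
    IEnumerable<string> List(string bucket, Func<string, bool> pushedDownFilter);
}

IEnumerable<string> ListLogs(IObjectStore store, string bucket)
{
    // without push-down we would list everything and filter up here, paying the
    // communication cost for objects that are immediately discarded
    return store.List(bucket, name => name.EndsWith(".log"));
}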
Another important class of query processing is the notion of distributing the queries to the relevant data sources. This acquisitional query processing is about determining which node has the data to acquire, and even being intelligent about the choice of node to reduce costs. Moreover, the data and the attributes required to query it may also be chosen in a way that answers queries with the minimum cost. Therefore, there can be hierarchical query processing so long as we keep the queries and the data that they operate on well-matched and local to a node.
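A rough sketch of that node selection, with a hypothetical Node type carrying the attributes it can acquire locally and an assumed acquisition cost; a real acquisitional query processor would also weigh factors such as sampling rates and lifetimes.
class Node
{
    public string Name;
    public HashSet<string> Attributes;   // attributes this node can acquire locally
    public double AcquisitionCost;       // e.g. energy or communication cost
}

Node ChooseNode(IEnumerable<Node> nodes, ISet<string> requiredAttributes)
{
    Node best = null;
    foreach (var n in nodes)
    {
        if (!requiredAttributes.IsSubsetOf(n.Attributes)) continue;   // cannot answer locally
        if (best == null || n.AcquisitionCost < best.AcquisitionCost)
            best = n;                                                 // cheapest capable node
    }
    return best;   // null means the query must be decomposed across several nodes
}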
The distribution of queries brings us to a relevant discussion on the network-first or storage-first design principle and the suitable products associated with it. The insatiable hunger for data storage has spurred the growth of virtualized, elastic and software-defined storage that hides the networking to give the notion of seemingly unbounded storage, specifically volume, file and object storage, to users. In such cases, the compute layer over this kind of storage is rather general purpose. An alternate form of query processing is the view that the network is a database where the queries are pushed all the way to the storage at each of the peers. A Peer-to-Peer network can be considered a distributed hash table that simplifies the delegation of locally oriented, dedicated storage and compute.
A seemingly hybrid solution could be to take the object storage and the associated compute library at each of the participating peers in a massive peer-to-peer network. This top-heavy processing allows the storage to be redundant no matter its size while keeping the compute relatively local.