Wednesday, October 10, 2018


Now that we have looked at marking the object collection in the cache for eviction, let us look at a few techniques to improve the information passed from the garbage collector to the cache. We maintained that the cache need not implement any strategy such as least-recently-used or time-to-live. The garbage collector already maintains a distinct set of generations and grades objects down through them, so the information passed to the cache need not be a mere mark-to-delete. It can be transparent to the cache, carrying all the statistics gathered during the collection run. The cache may then be able to determine the next steps even if the garbage collector suddenly disappeared. This information includes everything from counts of accesses to the object, the initialization time, the last modified time and so on. Although we use the term garbage collector, it is really a metadata collector on the objects, because a traditional garbage collector relies extensively on the root object hierarchy and the scopes introduced by a predetermined set of instructions. Here, we are utilizing the metadata for a policy which the cache need not implement itself. Therefore, all information from the layer above may be saved so that the cache can use it just the same in the absence of any direct instruction on which objects to evict. Finally, the service for the cache may be able to bring the policy into the cache layer itself.
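As a sketch of what such transparent statistics could look like, the collector might publish per-object records that the cache can rank on its own even after the collector is gone. The record fields, names and ranking rule below are illustrative assumptions, not a prescribed schema:

```python
from dataclasses import dataclass, field
import time

# Hypothetical per-object statistics published by the collector to the cache.
@dataclass
class ObjectStats:
    name: str
    generation: int          # grading assigned by the collector
    access_count: int = 0
    initialized_at: float = field(default_factory=time.time)
    last_modified_at: float = field(default_factory=time.time)

def eviction_candidates(stats, max_generation):
    """Rank eviction candidates from the published statistics alone:
    oldest generation first, least-accessed and least-recently-modified first."""
    return sorted(
        (s for s in stats if s.generation >= max_generation),
        key=lambda s: (s.access_count, s.last_modified_at))

stats = [ObjectStats("a", 2, access_count=1),
         ObjectStats("b", 0, access_count=9),
         ObjectStats("c", 2, access_count=5)]
print([s.name for s in eviction_candidates(stats, 2)])  # ['a', 'c']
```

The point is that the cache needs no policy of its own; it only orders what the layer above already measured.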
Let us now look at the topology of the cache. Initially we suggested that a set of servers can participate in the cache if the objects are distributed among them. In such a case, the cache was distributed among n servers as hash(o) modulo n. This had the nasty side-effect that when one or more servers went down or were added to the pool, nearly every object in the cache would map to a different server because the variable n changed. Consistent hashing instead accommodates new servers and takes old servers offline by arranging the hashes around a circle with cache points. When a cache point is removed or added, only the objects whose hashes fall along the affected arc of the circle are moved clockwise to the next cache point. It also introduced “virtual nodes”, which are replicas of cache points on the circle. Since the caches may have a non-uniform distribution of objects, the virtual nodes carry replicas of objects from a number of cache points and smooth out that distribution.
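The ring can be sketched in a few lines. The server names, replica count and hash choice below are illustrative assumptions:

```python
import bisect
import hashlib

def _hash(key):
    # Stable hash onto the ring; md5 keeps the sketch deterministic across runs.
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class ConsistentHashRing:
    """Consistent hashing with virtual nodes: cache points (and their
    replicas) sit on a circle; an object maps to the first cache point
    clockwise from its hash."""
    def __init__(self, replicas=4):
        self.replicas = replicas
        self.ring = []          # sorted list of (hash, server)

    def add(self, server):
        for i in range(self.replicas):
            bisect.insort(self.ring, (_hash(f"{server}#{i}"), server))

    def remove(self, server):
        self.ring = [(h, s) for h, s in self.ring if s != server]

    def lookup(self, obj):
        h = _hash(obj)
        idx = bisect.bisect(self.ring, (h, "")) % len(self.ring)  # wrap around
        return self.ring[idx][1]

ring = ConsistentHashRing()
for s in ("cache-a", "cache-b", "cache-c"):
    ring.add(s)
before = {o: ring.lookup(o) for o in ("obj1", "obj2", "obj3", "obj4")}
ring.remove("cache-b")            # take one server offline
after = {o: ring.lookup(o) for o in before}
moved = [o for o in before if before[o] != after[o]]
print(moved)                      # only objects that were on cache-b move
```

Note that removing a server disturbs only the objects that hashed to it; everything else keeps its placement, which is exactly the property modulo hashing lacked.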
#codingexercise
Find the minimum number of subset elements of an integer array that sum to a given number
This follows the dynamic programming recurrence:
return min(1+ recursion_with_the_candidate_and_sum_minus_candidate, recursion_without_candidate_and_same_sum)
We add the validations for the terminal conditions.
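A minimal sketch of this recurrence, assuming each array element may be used at most once and returning -1 when no subset reaches the sum; the function name is illustrative:

```python
from functools import lru_cache

def min_elements_for_sum(candidates, total):
    """Top-down DP mirroring the recurrence above:
    min(1 + solve(rest, sum - candidate), solve(rest, same sum))."""
    INF = float("inf")

    @lru_cache(maxsize=None)
    def solve(i, remaining):
        if remaining == 0:                       # terminal: sum reached
            return 0
        if i == len(candidates) or remaining < 0:  # terminal: no solution
            return INF
        with_it = 1 + solve(i + 1, remaining - candidates[i])
        without_it = solve(i + 1, remaining)
        return min(with_it, without_it)

    result = solve(0, total)
    return -1 if result == INF else result

print(min_elements_for_sum((1, 2, 3, 7), 10))   # 2, via 3 + 7
```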

Tuesday, October 9, 2018

We were discussing cache policy for aging. While there can be other mechanisms that directly translate to a table and query for the cache, grading and shifting objects is sufficient to achieve aging and compaction. This then translates to an effective cache policy.
Another strategy for the interaction between garbage collection and the cache is for the cache to merely hold a table of objects and their status. The status is always progressive from initialized->active->marked-for-eviction->deleted. The state of the objects is determined by the garbage collector. Therefore, the net result of garbage collection is a set of new entries in the list of objects for the cache.
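The progressive status table might be sketched as follows. The function name and the rule that a status never regresses are assumptions drawn from the "always progressive" description above:

```python
from enum import Enum

class Status(Enum):
    INITIALIZED = 0
    ACTIVE = 1
    MARKED_FOR_EVICTION = 2
    DELETED = 3

def advance(table, name, new_status):
    """Record the collector's verdict in the cache's object table,
    rejecting any backward move in status."""
    current = table.get(name, Status.INITIALIZED)
    if new_status.value < current.value:
        raise ValueError(f"{name}: cannot move {current.name} -> {new_status.name}")
    table[name] = new_status

table = {}
advance(table, "obj1", Status.ACTIVE)
advance(table, "obj1", Status.MARKED_FOR_EVICTION)
print(table["obj1"].name)   # MARKED_FOR_EVICTION
```

The net result of a collection run is then just a batch of `advance` calls against the table, with no object movement.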
Also, items marked for eviction or deleted from the cache may increase over time. These may then be archived on a periodic rolling basis into the object storage so that the cache focuses only on the un-evicted items. The cache therefore sees only a window of objects over time, and this is quite manageable because objects are guaranteed to expire. The garbage collector publishes the result to the cache as a list without requiring any object movement.
def reaper(objects, graveyard):
    for dead in expired(objects):
        if dead not in graveyard:
            graveyard += [dead]
        # the dead object leaves the working set either way
        objects = list(set(objects) - {dead})
    return objects

def setup_periodic_reaping(sender, **kwargs):
    sender.add_periodic_task(10.0, reaper(kwargs['objects'], kwargs['graveyard']), name='reaping')

#codingexercise

Find the length of the largest dividing subsequence of a number array. A dividing subsequence is one where the elements appearing prior to an element in the subsequence are proper divisors of that element. For example, 2, 4, 8 has a largest dividing subsequence of length 3, and that of 2, 4, 6 is of length 2.
public static void printLDS(List<Integer> a) {
    if (a == null || a.size() == 0) { System.out.println("0"); return; }
    if (a.size() == 1) { System.out.println("1"); return; }
    int[] lds = new int[a.size()];
    Arrays.fill(lds, 1);   // every element alone is a dividing subsequence of length 1
    for (int i = 1; i < a.size(); i++) {
        for (int j = 0; j < i; j++) {
            if (a.get(j) != 0 && a.get(i) > a.get(j) && a.get(i) % a.get(j) == 0) {
                lds[i] = Math.max(lds[i], lds[j] + 1);
            }
        }
    }
    System.out.println(Arrays.stream(lds).max().getAsInt());
}
The above method can be modified for any longest increasing subsequence.


The subsequence of divisors also forms an increasing subsequence.

Monday, October 8, 2018

We were discussing garbage collection and compaction. With objects, we don't have to create the graph. Instead we have to find ways to classify them by generation. Any classification of the objects is considered temporary until the next usage. At that point, the same objects may need reclassification. Since the classification is temporal, we run the risk of mislabeling generations and consequently reclaiming an object when it wasn't supposed to be reclaimed.
We can also consider a markdown approach where, after labeling the objects, we progressively mark them down so that we take actions only on labels older than the oldest classification. This keeps the classification and the reaping separate. It allows no action to be taken when just the classification is needed, and it helps with taking different actions on the different tiers. For example, evict and delete may be considered separate actions.
The unit of the moving operation is an object here only because applications use objects. However, the cache may choose to stash swaths of objects at a time for shifting or ignoring. For example, we could use a heap to keep track of the biggest available swath to roll for objects of arbitrary size.
We may also keep a compact version for a running track of all the swaths. For example, if they are consecutive, we could coalesce them. If there is a set of active objects that can be merged into a swath, it can be marked as busy and ignored. If there are too many alternating free and busy swaths, they can be retained as such and excluded from coalescing.
Busy swaths may need to be updated as candidate for conversion to free whenever objects are moved out of it.
Busy swaths can also be coalesced if they permit it. This is usually not necessary because it is too aggressive and covers only a tiny fraction of the overall collection.
Movements of free and busy swaths are always opposite to each other. The free swaths move towards the younger generation where they can be utilized for gen 0.
The movement operation itself can be optimized by efficient bookkeeping so that it translates only to updates in the bookkeeping.
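As a sketch of such bookkeeping, free swaths could be kept as (start, length) intervals that are coalesced when consecutive, with a heap tracking the biggest available swath. The interval representation and function names are assumptions:

```python
import heapq

def coalesce(swaths):
    """Merge consecutive (start, length) free swaths so that movement
    reduces to an update of this book-keeping, not an object copy."""
    merged = []
    for start, length in sorted(swaths):
        if merged and merged[-1][0] + merged[-1][1] == start:
            merged[-1][1] += length          # consecutive: coalesce
        else:
            merged.append([start, length])
    return [tuple(s) for s in merged]

def biggest_swath(swaths):
    """Max-heap (negated lengths) gives the largest swath to roll into."""
    heap = [(-length, start) for start, length in swaths]
    heapq.heapify(heap)
    length, start = heap[0]
    return (start, -length)

free = coalesce([(0, 4), (4, 2), (10, 3)])   # (0,4) and (4,2) merge
print(free)                  # [(0, 6), (10, 3)]
print(biggest_swath(free))   # (0, 6)
```

Busy swaths would get the symmetric treatment, flipping to free when their last object moves out.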

Sunday, October 7, 2018

The idea of compaction is that objects are allocated contiguously assuming infinite caching. When they are no longer in use or can be decommissioned, they are evicted from the cache. The compaction can be done in several sweeps, and the sweeps can be triggered not just by the allocation of a new object but also periodically for the health of the cache.
With the help of the generations, the objects that are least likely to be used, and hence good candidates to be reclaimed, are close together on one end while the new ones are on the other end. With this gradation, we can efficiently perform delete operations on one end and additions on the other.
The process of shifting objects is compaction. As we walk through the cache linearly, we look for contiguous blocks of 'garbage' objects that can now be considered free space. The garbage collector then shifts the non-garbage objects down in the cache, removing all of the gaps. The cache performs this operation and we call it rolling. Rolling introduces no dependencies, since objects remain reachable by their name or address.
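Rolling can be sketched as a single linear pass that shifts live objects down over the gaps. The list representation and the name `roll` are illustrative:

```python
def roll(cache, garbage):
    """Walk the cache linearly, shifting non-garbage objects down so that
    all gaps end up as one contiguous free region at the tail."""
    write = 0
    for read in range(len(cache)):
        if cache[read] not in garbage:
            cache[write] = cache[read]   # shift live object down
            write += 1
    del cache[write:]                    # the tail is now free space
    return cache

cache = ["a", None, "b", None, None, "c"]
print(roll(cache, {None}))   # ['a', 'b', 'c']
```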
Traditionally, garbage collectors have been implemented with a graph representation of the objects to be reclaimed, following the principles of garbage collection from distributed computing. In this approach, all the objects of a heap derive from a base object which leads to a common root. During a collection, the entire set of objects on the heap translates to an object graph which is finite and directed, with a fixed set of vertices representing the objects but not the edges. The edges are added and deleted by a mutator process that modifies the graph. Objects that are connected can be called food and those that are not can be called garbage, so we know which objects are to be collected. The mutator only adds an edge if its target has been marked as food.

Application of deep learning in text summarization : https://1drv.ms/w/s!Ashlm-Nw-wnWt1CQZVSOUjzUqITP

Saturday, October 6, 2018

This post is in continuation of the design for an object cache layer as described here. A specific use case now explains the aging of objects in the cache. While the object storage saved the objects for durability, the cache made them available for workloads to save on the cost of going deep into object storage. The cache also translated the user workloads into an append-only workload for the object storage. This simplifies versioning and replication, which can now be done in object storage at little or no cost. A garbage collection system demonstrates this aging. While the cache can implement any one of the caching policies for retention of objects, the policy can also be delegated to the user workload, where the workload specifies which objects have aged. Then the cache merely schedules the storage of these objects into the object storage. Such a policy is best demonstrated in a user workload that implements a garbage collection system. And once it works well in the user workload, the logic can then be moved into the cache layer.
In this post and the next few, we bring up the .Net garbage collection as an example of such an aging policy. The .Net garbage collection is a generational markdown compactor. The compactor uses the notion of a generation to identify objects by their usage. The most used objects belong to a younger generation and those that are less used belong to the older generation. The generation gives us an indication of age. The age of the object, on the other hand, is the time since last use. We will describe the sweep and the collection shortly, but we proceed with the notion that there is a metric that lets us identify the age of the object, which the cache then rolls to object storage.
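A minimal sketch of such an age metric, with illustrative thresholds mapping time-since-last-use to a generation that cues the cache to roll the object to object storage; the real collector's gradation is richer than this:

```python
import time

# Illustrative thresholds (seconds): gen 0 below 60s, gen 1 below 600s, else gen 2.
GENERATION_THRESHOLDS = (60.0, 600.0)

def generation(last_used_at, now=None):
    """Return 0 for the youngest generation; the oldest generation is the
    cache's cue to roll the object to object storage."""
    age = (now or time.time()) - last_used_at
    for gen, threshold in enumerate(GENERATION_THRESHOLDS):
        if age < threshold:
            return gen
    return len(GENERATION_THRESHOLDS)

print(generation(990.0, now=1000.0))   # 0: used 10 seconds ago
```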

#codingexercise
Check if a number has fewer set bits than unset bits.
boolean hasLessSetThanUnsetBits(int n)
{
    int set = 0;
    int unset = 0;
    while (n != 0)
    {
        if ((n & 0x1) == 1) {
            set++;
        } else {
            unset++;
        }
        n = n >>> 1;   // unsigned shift so negative inputs also terminate
    }
    return (set < unset);
}

Friday, October 5, 2018

We were discussing the queue layer over Object Storage.

The reserving of certain queues in the queue layer and of certain object storage namespaces alleviates the need for special-purpose stacks. In addition, it reuses the existing infrastructure for concerns such as monitoring, reporting, and background processes. There are no additional artifacts here other than the reservation, so that they don’t mix with user data.

User queues and object storage could be made programmatically available collectively just as they are made available individually. Many message queuing stacks provide APIs to manage the queues and they follow similar patterns. Unlike other SDKs, the queue layer does not merely provide the APIs associated with the queue. It consolidates functionality through the queue layer on the object storage that would otherwise have been directly accessible. Such automation of common tasks associated with the object storage for the purpose of queue management is a useful addition to the SDK.

While queues serve a variety of purposes, they come in especially useful to capture traffic. From network packets and API messages to just about anything that has a continuous flow, a queue helps record the traffic regardless of the flow. There are no upper limits to the number of messages in a queue and there is no limit to what can be stored in the object storage, so these will serve almost any traffic.

Queues do not need to actively participate in standard query operations on the captured traffic. They merely transfer the data to the object storage, which can then serve the queries with the help of a query execution engine, such as a log parser, that uses the available data as an enumerable. The query operations could include selection, aggregation, grouping and ordering.
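As a sketch, the captured records can be treated as a plain enumerable and queried with selection, grouping, aggregation and ordering; the record fields below are illustrative, not a fixed schema:

```python
# Hypothetical traffic records transferred from the queue to object storage.
records = [
    {"source": "api", "bytes": 120},
    {"source": "net", "bytes": 300},
    {"source": "api", "bytes": 80},
]

selected = [r for r in records if r["bytes"] > 50]        # selection
by_source = {}
for r in selected:                                        # grouping + aggregation
    by_source[r["source"]] = by_source.get(r["source"], 0) + r["bytes"]
totals = sorted(by_source.items())                        # ordering
print(totals)    # [('api', 200), ('net', 300)]
```

The queue never takes part in this; it only delivered the records that the engine later enumerates.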

Thursday, October 4, 2018

We were discussing the queue layer over Object Storage.

The reserving of certain queues in the queue layer and of certain object storage namespaces alleviates the need for special-purpose stacks. In addition, it reuses the existing infrastructure for concerns such as monitoring, reporting, and background processes. There are no additional artifacts here other than the reservation, so that they don’t mix with user data.

Securing system artifacts is just as necessary as preventing data corruption. This can be achieved by isolation and encryption. Audit logs, for example, should not be tampered with. Therefore, they may be encrypted so that the data cannot be modified even if it is leaked. System artifacts can use specific prefixes so that they are differentiated from user namespaces. In addition, these namespaces may not be allowed to be created by the user. Finally, data in transit must be secured just as much as data at rest.

On the other hand, user queues and object storage must be made programmatically available collectively just as they are made available individually. Many message queuing stacks provide APIs to manage the queues and they follow similar patterns. These APIs are made available over the web so they can be invoked remotely via direct calls. In addition, if SDKs are made available, then they can improve programmability. SDKs make working with queues and objects easier in the programming language of the user. They don’t need any extra setup and facilitate the calls made to the APIs.

Unlike other SDKs, the queue layer does not merely provide the APIs associated with the queue. It consolidates functionality through the queue layer on the object storage that would have otherwise been directly accessible. Such automation of common tasks associated with the object storage for the purpose of queue management is a useful addition to the SDK.