Friday, August 31, 2018

The gateway as a classifier.
The rules of a gateway need not be a mere regex translation of an incoming address to another site-specific address. We are dealing with objects, and any part of the object endpoint address, such as the hierarchical namespace – bucket – object, may be translated to an altogether different address that still points to the same copy of the object. For that matter, hashes of web addresses may be translated, so that the caller needs only a tiny url to access an object while, internally, the same copy of the object is served at lightning speed from site-specific buckets. We are not just putting the gateway on steroids; we are also making it smarter by allowing the user to customize the rules. These rules can be authored as expressions and statements, much like a program with many if-then conditions ordered by their execution sequence. The gateway is more than an http proxy or a message queue server. It is a lookup of objects without sacrificing performance and without restrictions on how objects are organized within or across distributed stores. It works much like a router, and although we have referred to the gateway as a networking layer over storage, it provides a query execution service as well. All the queries are similar in nature: they are mostly web addresses of objects. The storage server only knows about the three internal copies of an object kept for durability. These copies share the same address, and different objects have different web addresses. What a storage server considers different objects may even be the same object to the user. How the user organizes objects in namespaces and buckets may be based on her own rules that go beyond site replication. If the gateway can route requests for the same object to different sites, nothing prevents it from letting the user add custom rules that utilize this address translation for purposes other than geography-based content distribution. 
Fundamentally, a specific address for each object does not benefit the customer when she wants to hand out one address for content that is served by two or more identical objects. Where those objects are located and how the address translation works may be based on static site-based routing via regex or on dynamic routing based on rules and programs. Moreover, the gateway has the ability to interpret aliases of addresses that the object storage cannot. 
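The ordered if-then rules described above can be sketched as a tiny classifier. This is a hypothetical illustration only; the rule table, site names and the tiny-url pattern are all invented for the example, not taken from any particular product.

```python
import re

# Hypothetical rule table: each rule pairs a predicate on the incoming
# address with a rewrite that yields a site-specific address. Rules are
# evaluated in order, like the if-then statements described above.
RULES = [
    # tiny-url alias: a short hash maps directly to an internal address
    (lambda addr: re.fullmatch(r"/t/[0-9a-f]{8}", addr) is not None,
     lambda addr: "/us-east/ns1/bucket1/object-" + addr[3:]),
    # namespace/bucket/object hierarchy rewritten to the nearest site
    (lambda addr: addr.startswith("/global/"),
     lambda addr: addr.replace("/global/", "/us-west/", 1)),
]

def classify(addr: str) -> str:
    """Return the site-specific address for an incoming address."""
    for matches, rewrite in RULES:
        if matches(addr):
            return rewrite(addr)
    return addr  # fall through: address is already site-specific

assert classify("/global/ns1/bucket1/photo.jpg") == "/us-west/ns1/bucket1/photo.jpg"
```

Because the rules are plain callables evaluated in sequence, a user-supplied module could append or reorder them, which is what makes the gateway customizable rather than a fixed regex map.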

Thursday, August 30, 2018

The case of the Cloud Gateways for storage. 
Some view the cloud gateway as a device placed on the customer’s premises that translates low-level file commands into high-level http requests against cloud storage. Public cloud providers stretch this further by offering the gateway from the cloud itself. Gateways offer easy integration into existing infrastructure because they route requests between storage options. Direct integration can be very expensive, requiring manipulation of APIs for create, update and delete. Gateways, on the other hand, act as adapters and do away with the cost of integration by leveraging existing commands.  
Others use gateways for segregating their workloads. Not every store in an organization is used uniformly, and gateways help consolidate the infrastructure behind a common entry point. Users keep the same constructs they already have, while planners separate the storage into high- and low-usage cases.  
Cloud gateways can also be used for heterogeneous stores, where data existing on one storage need not be replicated to another as long as both are accessible from the same common entry point. 
Regardless of what a gateway means to someone, its utility has universal appeal. Gateways distribute traffic. They work exceptionally well when routing requests to on-premise or cloud object stores. The on-premise store helps with closer access to data. The same concept may apply to geographical distribution of similar content, where each object store serves a specific region. In this case replication may need to be set up between the different object stores; we could leverage an object storage replication group to do automatic replication. It might be considered a bottleneck if the same object store is used. This is different from redirecting requests to separate servers/caches. However, shared services may offer a service level agreement at par with an individual service. Since a gateway sees no performance degradation when sending to a proxy server or a shared dedicated store, it works in both cases. Replacing a shared dedicated store with shared dedicated storage such as an object store is therefore also a practical option. Moreover, a cache generally improves performance over the cost of going to the backend. That is why different proxy servers behind a gateway could maintain their own caches. A dedicated cache service like AppFabric may also be sufficient to handle requests; in that case we are consolidating proxy server caches into a dedicated cache. 
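The region-based routing described above reduces to a small lookup. A minimal sketch, with invented site urls and region names; a real gateway would also consult availability and load.

```python
# Hypothetical map from caller region to the object store serving it.
SITES = {"eu": "https://eu.store.example.com",
         "us": "https://us.store.example.com"}
DEFAULT_SITE = "https://us.store.example.com"

def route(region: str, object_path: str) -> str:
    """Prefix the object path with the regional site, or a default one."""
    base = SITES.get(region, DEFAULT_SITE)
    return base + object_path

assert route("eu", "/ns1/bucket1/obj") == "https://eu.store.example.com/ns1/bucket1/obj"
```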
#codingexercise
Determine if a number is perfect. A perfect number equals the sum of its proper divisors.
bool IsPerfect(uint n)
{
    if (n < 2) return false; // 1 has no proper divisors below itself
    var factors = GetFactors(n);
    return n == factors.Sum();
}
List<uint> GetFactors(uint n)
{
    var ret = new List<uint>();
    ret.Add(1);
    for (uint i = 2; i * i <= n; i++)
    {
        if (n % i == 0)
        {
            ret.Add(i);                     // add low factor
            if (n / i != i) ret.Add(n / i); // add high factor
        }
    }
    return ret;
}

Wednesday, August 29, 2018

We discussed that a gateway is supposed to distribute the traffic. It works exceptionally well when it routes requests to on-premise or cloud object stores. The on-premise store helps with closer access to data. The same concept may apply to geographical distribution of similar content, where each object store serves a specific region. In this case replication may need to be set up between the different object stores; we could leverage an object storage replication group to do automatic replication. It might be considered a bottleneck if the same object store is used. This is different from redirecting requests to separate servers/caches. However, shared services may offer a service level agreement at par with an individual service. Since a gateway sees no performance degradation when sending to a proxy server or a shared dedicated store, it works in both cases. Replacing a shared dedicated store with shared dedicated storage such as an object store is therefore also a practical option. Moreover, a cache generally improves performance over the cost of going to the backend. That is why different proxy servers behind a gateway could maintain their own caches. A dedicated cache service like AppFabric may also be sufficient to handle requests; in that case we are consolidating proxy server caches into a dedicated cache.
There is a tradeoff when we address gateway logic, replication logic, and storage server logic independently. While it is modular to visualize each layer as a separation of concerns, there is no necessity to house them in different products. They can instead be viewed together as storage server logic and moved into the storage server. The tradeoff is that when these layers are consolidated, they do not facilitate testing. They also become more dedicated to the storage and leave the onus on the owner to make copies of the content as necessary for the geographical regions. However, we argued that storage and replication are handled well within object storage and that what was missing was just the gateway feature. This gateway feature can be made extensible, but it would be sufficient to let the user store content once, have the same content made available from each geographical region, and have each request routed to the nearest region. Further, the address translation need not be specific to a region; it can be made granular to objects. If we take the example of the url an object store exposes for an object over http, it usually has a namespace, bucket and object name as a hierarchy. This is the only input from the user, and this component does not change. The gateway rules previously translated the server address, but now they can translate the object naming hierarchy to the nearest site.
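The per-object translation just described can be sketched as follows: the namespace/bucket/object hierarchy from the url stays intact while only the site prefix changes. The bucket-to-sites map and the site names are assumptions made up for the example.

```python
# Hypothetical map from (namespace, bucket) to the sites holding replicas.
BUCKET_SITES = {("ns1", "bucket1"): ["eu-site", "us-site"],
                ("ns1", "bucket2"): ["us-site"]}

def translate(url_path: str, caller_region: str) -> str:
    """Rewrite a universal object address to the nearest site's address."""
    namespace, bucket, obj = url_path.strip("/").split("/", 2)
    sites = BUCKET_SITES.get((namespace, bucket), ["us-site"])
    # prefer a site matching the caller's region, else the first replica
    site = next((s for s in sites if s.startswith(caller_region)), sites[0])
    return f"/{site}/{namespace}/{bucket}/{obj}"

assert translate("/ns1/bucket1/a/b.jpg", "eu") == "/eu-site/ns1/bucket1/a/b.jpg"
```

Note that the user-visible portion (namespace, bucket, object) is never altered; only the routing prefix is, which is what keeps the address stable in appearance.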

Tuesday, August 28, 2018

We discussed that a gateway is supposed to distribute the traffic. If it sends everything to the same single point of contention, it is not very useful. When requests are served from separate caches, performance generally improves over the cost of going to the backend. That is why different proxy servers behind a gateway could maintain their own caches. A dedicated cache service like AppFabric may also be sufficient to handle requests; in that case we are consolidating proxy server caches into a dedicated cache. This does not necessarily mean a single point of contention. Shared services may offer a service level agreement at par with an individual service. Since a gateway sees no performance degradation when sending to a proxy server or a shared dedicated cache, it works in both cases. Replacing a shared dedicated cache with shared dedicated storage such as an object store is therefore also a practical option.
While gateways route requests, they could be replaced with a networking layer that enables a P2P network of different object stores, which could be on-premise or in the cloud. A distributed hash table in this case determines the store to go to. The location information for the data objects is deterministic, as the peers are chosen with identifiers corresponding to the data object's unique key. Content therefore goes to specified locations, which makes subsequent requests easier. Unstructured P2P is composed of peers joining based on some rules, usually without any knowledge of the topology. In this case the query is broadcast, and peers that have matching content return the data to the originating peer. This is useful for highly replicated items. P2P provides a good base for large-scale data sharing. Some of the desirable features of P2P networks include selection of peers, redundant storage, efficient location, hierarchical namespaces, authentication, and anonymity of users. In terms of performance, P2P has desirable properties such as efficient routing, self-organization, massive scalability, robustness in deployments, fault tolerance, load balancing and explicit notions of locality. Perhaps the biggest takeaway is that P2P is an overlay network with no restriction on size, and there are two classes: structured and unstructured. Structured P2P means that the network topology is tightly controlled and the content is placed not on random peers but at specified locations, which makes subsequent requests more efficient.
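The deterministic placement in a structured P2P overlay can be illustrated with a minimal consistent-hashing ring: an object key hashes to a point on the ring, and the next peer clockwise is responsible for it. The peer names are invented; a real DHT such as Chord or Kademlia adds routing tables and churn handling on top of this idea.

```python
import hashlib
from bisect import bisect

def _hash(key: str) -> int:
    """Deterministic integer hash of a key (peer id or object key)."""
    return int(hashlib.sha256(key.encode()).hexdigest(), 16)

class HashRing:
    def __init__(self, peers):
        # place each peer on the ring at the position of its hashed id
        self.ring = sorted((_hash(p), p) for p in peers)

    def lookup(self, key: str) -> str:
        """Object keys map deterministically to the next peer on the ring."""
        positions = [h for h, _ in self.ring]
        i = bisect(positions, _hash(key)) % len(self.ring)
        return self.ring[i][1]

ring = HashRing(["peer-a", "peer-b", "peer-c"])
# the same key always resolves to the same peer, so location is deterministic
assert ring.lookup("ns1/bucket1/obj") == ring.lookup("ns1/bucket1/obj")
```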


Monday, August 27, 2018

We were discussing anecdotal quotes from industry experts on gateway for object storage.
They cited gateways for object storage as provided by public cloud providers. This is a convenience for using on-premise and cloud storage, which shows that there is value in the proposition. In addition, our approach is novel in using the gateway for a Content Distribution Network and in proposing that it be built into the object storage as a service.
Some experts argued that a gateway is practical only for small and medium businesses with small-scale requirements. This means gateways are stretched at large scale, while object storage deployments are not necessarily restricted in size. These experts argued that the problem with a gateway is that it adds more complexity and limits performance.
When gateways solve problems where data does not have to move, they are very appealing to many usages across companies that use cloud providers. Several vendors have raced to find this niche. In our case, the http references to copies of objects versus the same object are a way to do just that. With object storage requiring no maintenance or administration and providing the ability to store as much content as necessary, this gateway service becomes useful for content distribution network purposes.
Some experts commented that public cloud storage gateways are able to mirror a volume to a cloud, but they are still just building blocks in the cloud. They do not scale capacity or share data to multiple locations. This is exactly what we try to do with a gateway from object storage.
A gateway is supposed to distribute the traffic. If it sends everything to the same single point of contention, it is not very useful. When requests are served from separate caches, performance generally improves over the cost of going to the backend. That is why different proxy servers behind a gateway could maintain their own caches. A dedicated cache service like AppFabric may also be sufficient to handle requests; in that case we are consolidating proxy server caches into a dedicated cache. This does not necessarily mean a single point of contention. Shared services may offer a service level agreement at par with an individual service. Since a gateway sees no performance degradation when sending to a proxy server or a shared dedicated cache, it works in both cases. Replacing a shared dedicated cache with shared dedicated storage such as an object store is therefore also a practical option.
#codingexercise
Print all the combinations of a string in sorted order.
void PrintSortedCombinations(string a)
{
    var chars = a.ToCharArray();
    Array.Sort(chars); // strings are immutable, so sort a copy of the characters
    PrintCombinations(new string(chars));
    // uses the Combine() method implemented earlier
}

Sunday, August 26, 2018

Anecdotal quotes from industry on gateway for object storage.
We know gateways for object storage are provided as a convenience by public cloud providers. Therefore, there is value in that proposition. In addition, our approach is novel in using the gateway for a Content Distribution Network and in proposing that it be built into the object storage as a service. Today we use anecdotal quotes from industry in this regard.
They mention that gateways help connect systems that would otherwise require a lot of code to wire the APIs for data flow. This arduous task of rewriting applications to support web interfaces applies to those wanting to migrate to different object storage stacks. It does not really apply in our case.
Some experts argued that a gateway is practical only for small and medium businesses with small-scale requirements. This means gateways are stretched at large scale, while object storage deployments are not necessarily restricted in size; they are true cloud storage. These experts point out that object storage is best for backup and archiving. Tools like duplicity use S3 APIs to persist data in object storage, and here we include any workflow for backup and archiving. These workflows do not require modifications of data objects, which makes object storage perfect for them. These experts argued that the problem with a gateway is that it adds more complexity and limits performance. It is not used with primary storage applications, which are more read- and write-intensive and do not tolerate latency. Some even argued that the gateway is diminished in significance when the object storage itself is considered raw.
On the other hand, other experts argued that gateways give predictable performance between on-premises infrastructure and a public cloud storage provider. They offer easy integration into existing infrastructure and the ability to integrate on a protocol-by-protocol basis. This may be true for cloud gateways in general, but our emphasis is on virtual http endpoints within a single object store.
When gateways solve problems where data does not have to move, they are very appealing to many usages across companies that use cloud providers. Several vendors have raced to find this niche. In our case, the http references to copies of objects versus the same object are a way to do just that. With object storage requiring no maintenance or administration and providing the ability to store as much content as necessary, this gateway service becomes useful for content distribution network purposes.
Some experts commented that public cloud storage gateways are able to mirror a volume to a cloud, but they are still just building blocks in the cloud. They do not scale capacity or share data to multiple locations. This is exactly what we try to do with a gateway from object storage.



Saturday, August 25, 2018

We said we could combine gateway and http proxy services within the object storage for the site-specific http addresses of objects. The gateway also acts as an http proxy. Any implementation of a gateway has to maintain a registry of destination addresses. As http-enabled objects proliferate with their geo-replications, this registry becomes granular at the object level while enabling rules to determine the site from which objects should be accessed. Finally, gateways gather access statistics and metrics that are very useful for understanding the http accesses of specific content within the object storage. 
Both of the above functionalities can be elaborate, allowing the gateway service to provide immense benefit per deployment.
The advantages of an http proxy include aggregation of usages. There can be detailed counts of calls in terms of success and failure. Moreover, the proxy could include all the features of a conventional http service like Mashery, such as client-based caller information, destination-based statistics, per-object statistics, categorization by cause, and a RESTful api service over the features gathered.

Friday, August 24, 2018

We were saying there are advantages to writing Gateway Service within Object Storage. These included:
First, the address mapping is not at site level. It is at object level.  
Second, the address of the object, both universal and site-specific, is maintained along with the object as part of its location information. 
Third, instead of internalizing a table of rules from the external gateway, a lookup service can translate universal object address to the address of the nearest object. This service is part of the object storage as a read only query. Since object name and address is already an existing functionality, we only add the ability to translate universal address to site specific address at the object level.  
Fourth, the gateway functionality exists as a microservice. It can do more than static lookup of the physical location of an object given a universal address instead of the site-specific address. It has the ability to generate tiny urls for the objects based on hashing. This adds aliases to the address as opposed to the conventional domain-based address. The hashing is at the object level, and since we can store billions of objects in the object storage, a url shortening feature is a significant offering from the gateway service within the object storage. It has the potential to morph into other services beyond a mere translator of object addresses. The design of a url hashing service was covered earlier. 
Fifth, the conventional gateway functionality of load balancing can also be handled with an elastic scale-out of just the gateway service within the object storage.  
Sixth, this gateway can also improve access to the object by making more copies of the object elsewhere and adding the superfluous mapping for the duration of the traffic. It need not even interpret the originating ip addresses to determine the volume as long as it can keep track of the number of read requests against existing address of the same object.  
In addition, this gateway service within object storage may be written in a form that allows rules to be customized. Moreover, rules need not be written as declarative configuration; they can be dynamic, in the form of a module. As a forwarder, a gateway may leverage rules determined by the deployment. Expressions for rules may borrow features from IPSec rules, the well-known rules that govern whether a connection over the Internet is permitted into a domain. 
With the help of a classifier, these rules may even be evaluated dynamically. 
The gateway also acts as an http proxy. Any implementation of a gateway has to maintain a registry of destination addresses. As http-enabled objects proliferate with their geo-replications, this registry becomes granular at the object level while enabling rules to determine the site from which objects should be accessed. Finally, gateways gather access statistics and metrics that are very useful for understanding the http accesses of specific content within the object storage. 
Both of the above functionalities can be elaborate, allowing the gateway service to provide immense benefit per deployment.
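The tiny-url idea from the fourth point above can be sketched with a hash-based alias table. The registry dict stands in for whatever persistent alias store the gateway would use; the prefix length and collision handling are assumptions for illustration.

```python
import hashlib

# Hypothetical in-memory alias table: short alias -> full object address.
_registry = {}

def shorten(object_url: str, length: int = 8) -> str:
    """Hash the full object address and use a digest prefix as the alias."""
    digest = hashlib.sha256(object_url.encode()).hexdigest()
    alias = digest[:length]
    # on the (rare) prefix collision, extend the alias until it is unique
    while _registry.get(alias, object_url) != object_url:
        length += 1
        alias = digest[:length]
    _registry[alias] = object_url
    return alias

def resolve(alias: str) -> str:
    """Translate a tiny-url alias back to the full object address."""
    return _registry[alias]

a = shorten("/ns1/bucket1/reports/2018/q3.pdf")
assert resolve(a) == "/ns1/bucket1/reports/2018/q3.pdf"
```

Since the hash is computed over the object address, the same object always yields the same alias, which keeps the shortening service stateless apart from the registry itself.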

Thursday, August 23, 2018

We were discussing gateway-like functionality from object storage. While a gateway maintains address mappings for several servers, where routes translate to physical destinations based on, say, regex, here we give each object the ability to record its virtual canonical address along with its physical location, so that each object and its geographically replicated copies may be addressed specifically. When an object is accessed by its address, the gateway used to forward the request to the concerned site based on a set of static rules, say at the web server and usually based on regex. With the gateway functionality now merged into the object storage, a few advantages come our way: 
First, the address mapping is not at site level. It is at object level.  
Second, the address of the object, both universal and site-specific, is maintained along with the object as part of its location information. 
Third, instead of internalizing a table of rules from the external gateway, a lookup service can translate universal object address to the address of the nearest object. This service is part of the object storage as a read only query. Since object name and address is already an existing functionality, we only add the ability to translate universal address to site specific address at the object level.  
Fourth, the gateway functionality exists as a microservice. It can do more than static lookup of the physical location of an object given a universal address instead of the site-specific address. It has the ability to generate tiny urls for the objects based on hashing. This adds aliases to the address as opposed to the conventional domain-based address. The hashing is at the object level, and since we can store billions of objects in the object storage, a url shortening feature is a significant offering from the gateway service within the object storage. It has the potential to morph into other services beyond a mere translator of object addresses. The design of a url hashing service was covered earlier. 
Fifth, the conventional gateway functionality of load balancing can also be handled with an elastic scale-out of just the gateway service within the object storage.  
Sixth, this gateway can also improve access to the object by making more copies of the object elsewhere and adding the superfluous mapping for the duration of the traffic. It need not even interpret the originating ip addresses to determine the volume as long as it can keep track of the number of read requests against existing address of the same object.  
These advantages can improve the usability of the objects and their copies by providing as many as needed, along with a scalable service that can translate incoming universal addresses of objects to site-specific location information. 
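The sixth point above, detecting a hot object purely from read counts against its address, can be sketched as a counter with a threshold. The threshold value is illustrative; a real service would also age counts out over time.

```python
from collections import Counter

# Hypothetical per-address read counter for spotting hot objects.
_reads = Counter()
HOT_THRESHOLD = 1000  # illustrative value, not from the original post

def record_read(address: str) -> bool:
    """Count a read; return True when the object should get an extra copy."""
    _reads[address] += 1
    return _reads[address] == HOT_THRESHOLD

# the first 999 reads do not trigger a copy; the 1000th does
for _ in range(999):
    assert record_read("/ns1/bucket1/popular.jpg") is False
assert record_read("/ns1/bucket1/popular.jpg") is True
```

Note that no originating ip addresses are inspected: the decision rests entirely on the volume of reads against the existing address, as the text suggests.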

Wednesday, August 22, 2018

The nodes in a storage pool assigned to the VDC may have fully qualified names and public ip addresses. Although these names and ip addresses are not shared with anyone, they serve to represent the physical location of the fragments of an object. Generally, an object is written across three such nodes. The storage engine gets a request to write an object. It writes the object to one chunk, but the chunk may be physically located on three separate nodes, and the writes to these three nodes may even happen in parallel. The object location index of the chunk and the disk locations corresponding to the chunk are also artifacts that need to be written; for this purpose, too, three separate nodes may be chosen and the location information written there. The storage engine records the disk locations of the chunk in a chunk location index, itself written to three different disks/nodes. The index locations are chosen independently from the object chunk locations. Therefore, we already have a mechanism to store locations. When these locations have representations for the node and the site, a copy of an object served over the web has a physical internal location. Even when objects are geo-replicated, the object and the location information are updated together. Mapping a virtual address for an object to its different physical copies and their locations is therefore a matter of looking them up in an index, the same way we look up the chunks for an object. We just need more information in the location part of the object, and the replication group automatically takes care of keeping locations and objects updated as they are copied.
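The lookup described above can be sketched with two small indexes: one from object to chunk, and one from chunk to its three disk/node locations. The index contents, node names and paths are invented for the example; the triple-write layout follows the description in the text.

```python
# Hypothetical object index: object key -> chunk id.
object_index = {"ns1/bucket1/obj": "chunk-42"}

# Hypothetical chunk location index: chunk id -> (node, disk path) triples.
chunk_location_index = {
    "chunk-42": [("node-1", "/disk0/c42"),
                 ("node-2", "/disk3/c42"),
                 ("node-3", "/disk1/c42")],
}

def locate(object_key: str):
    """Resolve an object to the physical locations of its chunk."""
    chunk = object_index[object_key]
    return chunk_location_index[chunk]

assert len(locate("ns1/bucket1/obj")) == 3  # one chunk, three node copies
```

When a site component is added to each location tuple, the same lookup answers the gateway's question of which geographic copy to serve.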


Tuesday, August 21, 2018

We were discussing gateways and object storage. We wanted to create a content distribution network from the object storage itself, using gateway-like functionality over objects and their geo-redundant copies. The storage engine layer responsible for the creation of objects automatically takes care of the replication of the objects. Storage engines generally have notions of a virtual data center and a replication group. An object created within a virtual data center is owned by that virtual data center. If there is more than one virtual data center within a replication group, the owning virtual data center will replicate the object to the others, usually after the object has been written. At the end of the replication, each virtual data center has a readable copy of the object. The location information is internal to the storage engine, unlike the internet-accessible address. The address is just another attribute of the object. Since the address has no geography-specific information, as per our design of the gateway, the rules for the gateway can route a read request to the relevant virtual data center, which uses the address to identify the object and its location to read it. 
Together, the gateway and the storage engine provide addresses and copies of objects to facilitate access via a geographically close location. We are suggesting native gateway functionality in object storage in a way that promotes this Content Distribution Network. Since we have copies of the object, we don’t need to give an object multiple addresses for access from different geographical regions.  
The object storage has an existing concept of a replication group. Its purpose is to define a logical boundary within which storage pool content is protected. These groups can be local or global. A local replication group protects objects within the same virtual data center; global replication groups protect objects against disk, node and site failures. The replication strategy is inherent to the object storage, and the copies made for an object are within its replication group. In a multi-site content distribution network, copies may exist outside of the local replication group; such copies are made outside the replication group, creating new, isolated objects. In those cases, a replication strategy for the content distribution network may kick in to keep the contents the same. However, here too, we don’t have to leverage external technologies to configure a replication strategy different from that of the object storage. A multi-site virtual data center collection may be put under the same replication group, and this should suffice to create enough copies across sites where sites are earmarked for different geographies. 
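The behavior of a multi-site replication group, an owning VDC writing first and the group fanning the object out to every member, can be sketched as follows. The VDC names and the in-memory bookkeeping are assumptions for illustration.

```python
# Hypothetical sketch of a replication group spanning multiple sites: the
# owning virtual data center writes first, then the group replicates the
# object to every other member, so each geography gets a readable copy.
class ReplicationGroup:
    def __init__(self, vdcs):
        self.vdcs = vdcs     # e.g. one VDC per geography
        self.copies = {}     # object key -> set of VDCs holding a copy

    def write(self, owner, obj):
        assert owner in self.vdcs
        self.copies[obj] = {owner}
        # post-write replication to the remaining group members
        for vdc in self.vdcs:
            self.copies[obj].add(vdc)

group = ReplicationGroup(["vdc-us", "vdc-eu", "vdc-ap"])
group.write("vdc-us", "ns1/bucket1/obj")
assert group.copies["ns1/bucket1/obj"] == {"vdc-us", "vdc-eu", "vdc-ap"}
```

Putting all geography-earmarked sites in one group is exactly the configuration the paragraph recommends: replication alone then yields the per-region copies a CDN needs.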

  

Monday, August 20, 2018

We were discussing gateways and object storage. We wanted to create a content distribution network from the object storage itself, using gateway-like functionality over objects and their geo-redundant copies. The storage engine layer responsible for the creation of objects automatically takes care of the replication of the objects. Storage engines generally have notions of a virtual data center and a replication group. An object created within a virtual data center is owned by that virtual data center. If there is more than one virtual data center within a replication group, the owning virtual data center will replicate the object to the others, usually after the object has been written. At the end of the replication, each virtual data center has a readable copy of the object. Erasure codes help protect the object without additional software or services, because the data and code fragments of the object are so formed that the entire object can be reconstructed. Internal to each virtual data center, there may be a pool of cluster nodes, and the object may have been written across three chosen nodes. Since each virtual data center needs to know the location of the object, the location information itself may be persisted the same way as an object. The location information is internal to the storage engine, unlike the internet-accessible address. The address is just another attribute of the object. Since the address has no geography-specific information, as per our design of the gateway, the rules for the gateway can route a read request to the relevant virtual data center, which uses the address to identify the object and its location to read it. Copying and caching by a non-owner virtual data center is entirely at its discretion, because those are enhancements to the existence of the object in each virtual data center within the same replication group. 
Traditionally, replication groups were used for outages, but the same mechanism may be leveraged for a content distribution network, with the gateway deciding to route each request to one of the virtual data centers. 
Together, the gateway and the storage engine provide addresses and copies of objects to facilitate access via a geographically close location. We are suggesting native gateway functionality in object storage in a way that promotes this Content Distribution Network. Since we have copies of the object, we don’t need to give an object multiple addresses for access from different geographical regions.

Sunday, August 19, 2018


The design of content distribution network with object storage 
The primary question we answer in this article is why objects don’t have multiple addresses for access from geographically closer regions. We know that there is more than one copy of an object, and the copies are geographically replicated. A content distribution network intends to do something very similar: content is designated to proxy servers whose purpose is to make it available at the nearest location. This mirrored content enables faster access over the network simply by reducing the round-trip time. That is how a content distribution network positions itself.   
Object storage also has geo-redundant replication, and there are secondary addresses for read access to the replicated data. This means data remains available even during a failover. The question becomes clearer when we refer to geographically close primary addresses served from the same object storage. As long as the user does not have to switch to a secondary address, and the primary address is already equivalent in performance to one from a distribution network, the user has no justification to use a content distribution network.  
With this context, let us delve into the considerations for enabling such an address for an object exposed over the object storage. We know gateways perform the equivalent of routing to designated servers, and the object merely needs a virtual address, one that does not change in appearance to the user. Internally, the address may be interpreted and routed to designated servers based on routing rules, availability and load. Therefore, the address works well as a primary address for the object. Gateway-like functionality already works for web servers, so its design is established and well known. The availability of object storage as the unified storage for content, regardless of copies or versions, is also well established. The purpose of the copies of an object may merely be redundancy, but there is no restriction on keeping copies of the same object for geographical purposes. This means we can have an adequate number of objects for as many geography-based accesses as needed. We have now resolved the availability of objects and their access using a primary, distribution-network-like address. 
#codingexercise
Determine if a number is divisible by 221.
boolean isDivisibleBy221(uint n)
{
    // 221 = 13 * 17, and 13 and 17 are coprime,
    // so n is divisible by 221 iff it is divisible by both
    return isDivisibleBy13(n) && isDivisibleBy17(n);
}
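The helpers isDivisibleBy13 and isDivisibleBy17 referenced above are not shown in the post; a minimal stand-in sketch using plain modulo arithmetic follows. Since 221 = 13 * 17 and 13 and 17 are coprime, a number is divisible by 221 exactly when it is divisible by both.

```python
def is_divisible_by_13(n: int) -> bool:
    # stand-in for the isDivisibleBy13 helper referenced in the post
    return n % 13 == 0

def is_divisible_by_17(n: int) -> bool:
    # stand-in for the isDivisibleBy17 helper referenced in the post
    return n % 17 == 0

def is_divisible_by_221(n: int) -> bool:
    # 221 = 13 * 17, with 13 and 17 coprime
    return is_divisible_by_13(n) and is_divisible_by_17(n)

assert is_divisible_by_221(221) is True
assert is_divisible_by_221(13 * 5) is False
```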