Wednesday, September 19, 2018

Introduction:
This article builds on the notion of a Cache Layer for Object Storage: a layer that caches objects and translates workloads into frequently backed-up objects so that changes are routinely persisted into the Object Storage. The notion that data can be allowed to age before making its way into the object storage is not a limiting factor. Object Storage, just like file storage and especially when file-system enabled, allows direct access for persistence anyway. The previous article referenced here merely pointed to the use cases where reads and writes to objects are so frequent that something shallower than an Object Storage benefits them immensely.
Therefore, this article merely looks at the notion of lazy replication. If we use the cache layer and regularly save the objects from the cache into the Object Storage, it is no different from using a local filesystem for persistence and then frequently backing it up into the Cloud. We have tools like duplicity that frequently back up a filesystem into object storage. Although they use archives for compaction, this is no different from copying the data from source to destination, even if the source is a file system and the destination is an object store. The schedule of this copying can be made as frequent as necessary to ensure the propagation of all changes within a maximum time limit.
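A minimal sketch of such scheduled copying, assuming a local staging directory and a hypothetical bucket name; boto3 stands in here for any S3-compatible client, and the schedule itself would come from cron or a similar timer:

import os
import boto3

def backup_staging_to_object_storage(staging_dir="/var/cache/staging", bucket="backup-bucket"):
    # copy every file under the staging directory into the bucket,
    # keyed by its path relative to the staging root
    s3 = boto3.client("s3")
    for root, _, files in os.walk(staging_dir):
        for name in files:
            path = os.path.join(root, name)
            key = os.path.relpath(path, staging_dir)
            s3.upload_file(path, bucket, key)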
Let us now look at the replication within the Object Storage. After all, replication is essentially copying objects across the sites within the storage. This copying was intended for durability. When we set up multiple sites within a replication group, the object gets copied to these sites so that it remains durable against loss. This copying is almost immediate and is handled within the put method of the S3 API that is used to upload objects into the object storage. Therefore, there is a multi-zone update of the object in a single put command when the replication group spans sites. When the object is uploaded, it may be saved in parts, and the bookkeeping for those parts is also safeguarded for durability. Both the object data and the part location information are treated as logically representing the object. There are three copies of such a representation so that if one copy is lost, another can be used. In addition, erasure codes may allow the reconstruction of an object, so the copy operation may not necessarily be a straightforward byte-range copy.
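For reference, the upload itself is a single call and the durability handling described above happens behind it; the bucket and key below are only placeholders:

import boto3

s3 = boto3.client("s3")
# one put; the object storage copies the representation across the sites
# in the replication group and safeguards the part bookkeeping
s3.put_object(Bucket="replicated-bucket", Key="reports/daily.json", Body=b'{"status": "ok"}')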
Lazy replication allows for copying beyond these durability semantics. It allows for copying on a scheduled basis by allowing the data to age. There may be many updates to the object between two copy operations, and this is tolerated because there is no semantic difference between the objects as long as they are eventually copied elsewhere. Said another way, this is the equivalent of chaining object stores so that the cache layer is an object storage in itself with direct access to the data persistence, and the object storage behind it is the one receiving copies of the objects that are allowed to age. Since the copy operations occur at every time interval, there is little or no data loss between the primary and the secondary object storages. We just need a connector that transfers some or all objects in a bucket in one namespace to an altogether different bucket in a different namespace, possibly in a different object storage. This is similar to file sync operations between a local and a remote file system, which also allow for offline work to happen. The difference between a file sync operation and lazy replication is probably just the strategy. Replication has several strategies, even from databases, where logs are used to replay the same changes in a destination database. The choice of strategy and frequency is not necessary for the observation that objects can be copied across object storages.
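A minimal sketch of such a connector, assuming two S3-compatible endpoints and hypothetical bucket names; the prefix parameter stands in for "some or all objects in a bucket":

import boto3

def lazy_replicate(src, dst, src_bucket="primary-bucket", dst_bucket="secondary-bucket", prefix=""):
    # copy every object under the prefix from the source bucket to the
    # destination bucket; run on a schedule to bound how stale the copy gets
    paginator = src.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=src_bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            body = src.get_object(Bucket=src_bucket, Key=obj["Key"])["Body"].read()
            dst.put_object(Bucket=dst_bucket, Key=obj["Key"], Body=body)

# the two clients may point at altogether different object storages
source = boto3.client("s3", endpoint_url="http://cache-layer.local:9000")
destination = boto3.client("s3")
lazy_replicate(source, destination)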
When Object Storages are linked this way, it may seem contrary to the notion that a single Object Storage represents limitless storage with zero maintenance, so that an object, once saved, will always be found without the need for unnecessary copies. However, the performance impact of using an Object Storage directly, as opposed to a local file system, may affect certain workloads, where it may be easier to stage the data prior to saving it in the Object Storage. Therefore this lazy replication may come in helpful to broaden the use cases of the Object Storage.

Tuesday, September 18, 2018

We were discussing the role of Cache Service with Object Storage. The requirements for object storage need not even change while the reads and writes from the applications can be handled. There can be a middle layer as a proxy for a file system to the application while utilizing the object storage for persistence.  This alleviates performance considerations to read and write deep into the private cloud each time. That is how this Cache Layer positions itself. It offers the same performance as query plan caching does to handle the workload and while it may use its own intermediate storage, it works as a staging for the data so that the data has a chance to age and persist in object storage.
The Cache can be used independently of the Object Storage. 

The use of a Cache facilitates server load balancing, request routing and batched writes. It can be offloaded to hardware. Caches may utilize Message Queuing. They need not be real web servers and can route traffic over sockets on steroids. They may be deployed globally or partitioned per server.
Moreover, not all the requests need to reach the object storage. In some cases, the web cache may use temporary storage from hybrid choices. The benefits of using a web cache include saving bandwidth, reducing server load, and improving request-response time. If a dedicated content store is required, typically the caching and the server are encapsulated into a content server. This is quite the opposite paradigm from using object storage and replicated objects to serve the content directly from the store. The distinction here is that there are two layers of functions. The first layer is the Cache layer, which solves distribution using techniques such as caching, asset copying and load balancers. The second layer is the compute and storage bundling in the form of a server or a store, with shifting emphasis on code and storage. We will call this the storage engine and will get to it shortly.
The Cache would do the same as an asynchronous write without any change in the application logic.    
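A minimal sketch of that write-behind behavior, assuming an in-memory dictionary as the cache and a hypothetical flush interval; the application only ever calls put() and never waits on the object storage:

import threading
import boto3

class WriteBehindCache:
    # writes land in the cache and return immediately; a background timer
    # flushes the dirty entries to object storage at the configured interval
    def __init__(self, bucket="staging-bucket", flush_interval=60):
        self.bucket = bucket
        self.flush_interval = flush_interval
        self.dirty = {}
        self.lock = threading.Lock()
        self.s3 = boto3.client("s3")

    def put(self, key, value):
        with self.lock:
            self.dirty[key] = value

    def flush(self):
        with self.lock:
            pending, self.dirty = self.dirty, {}
        for key, value in pending.items():
            self.s3.put_object(Bucket=self.bucket, Key=key, Body=value)
        threading.Timer(self.flush_interval, self.flush).start()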

Monday, September 17, 2018

We were discussing the role of Cache Service with Object Storage. The requirements for object storage need not even change while the reads and writes from the applications can be handled. There can be a middle layer as a proxy for a file system to the application while utilizing the object storage for persistence.  This alleviates performance considerations to read and write deep into the private cloud each time. That is how this Cache Layer positions itself. It offers the same performance as query plan caching does to handle the workload and while it may use its own intermediate storage, it works as a staging for the data so that the data has a chance to age and persist in object storage.
Object Storage is a limitless storage. Data from any workload is anticipated to grow over time if it is saved continuously. Consequently the backend and particularly the cloud services are better prepared for this task. While a flush to local file system with an asynchronous write may be extremely cheap compared to persistence in the cloud as an S3 object, there is no reason to keep rolling over local filesystem data by hand to object storage. 
Object storage is a zero-maintenance storage. There is no planning for capacity, and the elastic nature of its services may be taken for granted. The automation of asynchronous writes, flushes to object storage and the sync of data in the cache with that in the object storage is now self-contained and packaged into this cache layer. 
The cloud services are elastic. Both the storage and the cache services could be deployed in the cloud which not only gives the same benefits to one application or client but also to every department, organization, application, service and workload. 
Object storage coupled with this cache layer is also suited to dynamically address the needs of client and application, because the former may be a global store while the latter can determine the frequency depending on the workload. Different applications may tune the caching to their requirements. 
Performance increases dramatically when the results are returned as close to the origin of the requests as possible, instead of going deep into the technology stack. This has been one of the arguments for the web cache and the web proxy in general. 
Such a service is hard to mimic individually within each application or client. Moreover, optimizations only happen when the compute and storage are elastic, where the workloads can be studied, cached, and replayed independently of the applications. 
The move from tertiary to secondary storage is not a straightforward shift from NAS storage to object storage without some form of chores over time. A dedicated product like this takes that concern out of the picture. 
#linear regression for bounding box adjustment in regions of interest.

This fits a line to the data points.
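A minimal sketch of the fit with ordinary least squares; the points are whatever (x, y) pairs the region of interest provides, for example box corner coordinates to be adjusted:

def fit_line(points):
    # ordinary least squares for y = slope * x + intercept
    n = len(points)
    sum_x = sum(x for x, _ in points)
    sum_y = sum(y for _, y in points)
    sum_xy = sum(x * y for x, y in points)
    sum_xx = sum(x * x for x, _ in points)
    slope = (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x ** 2)
    intercept = (sum_y - slope * sum_x) / n
    return slope, intercept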

Sunday, September 16, 2018

Object Storage is perceived as backup and tertiary storage. This may come from the interpretation that this storage is not suitable for read- and write-intensive data transfers that are generally handled by a filesystem or database. However, not all data needs to be written deep into the object storage at once. The requirements for object storage need not even change while the reads and writes from the applications can be handled. There can be a middle layer as a proxy for a file system to the application while utilizing the object storage for persistence.  This alleviates performance considerations to read and write deep into the private cloud each time. That is how this Cache Layer positions itself. It offers the same performance as query plan caching does to handle the workload and while it may use its own intermediate storage, it works as a staging for the data so that the data has a chance to age and persist in object storage.
Cache service has been a commercially viable offering. AppFabric is an example of a cache service that has shown substantial improvements to APIs. Since objects are accessed via S3 APIs, the use of such a cache service works very well. However, traditional cache services have usually replayed previous requests with the help of amortized results, and cache writes have mostly been write-throughs which reach all the way to the disk. This service may be looked at in the form of a cloud service that not only maintains a proxy to the object storage but is also a smart as well as massive service that maintains its own storage as necessary.
Cache Service works closely with a web proxy, and traditionally both have been long-standing products in the marketplace. Mashery is an http proxy that studies web traffic to provide charts and dashboards for monitoring and statistics. This cache layer is well-positioned for web application traffic as well as for traffic that utilizes S3 APIs directly. It need not even require identifying callers and clients through API keys over the S3 APIs. Moreover, it can leverage geographical replication of objects within the object storage by routing to or reserving dedicated virtual data center sites and zones for its storage. As long as this caching layer establishes a sync between, say, a distributed or cluster file system and object storage with duplicity-like logic, it can roll over all data eventually to persistence.

#codingexercise
import math

def discounted_cumulative_gain(relevance, p):
    # DCG_p = relevance(1) + sum of relevance(i) / log2(i) for i = 2 .. p
    # relevance(i) returns the graded relevance of the result at rank i
    sum_p = relevance(1)
    for i in range(2, p + 1):
        sum_p += relevance(i) / math.log2(i)
    return sum_p

Saturday, September 15, 2018


Detecting objects using mAP metric: 

def precision(relevant, retrieved):
    return len(set(relevant).intersection(retrieved)) / len(list(retrieved))

def recall(relevant, retrieved):
    # recall divides by the number of relevant items, not the number retrieved
    return len(set(relevant).intersection(retrieved)) / len(list(relevant))
  
def average_precision(retrieved, relevant, n):
    # AP = sum over ranks k of ( precision at cutoff k * rel(k) ) / number of relevant items
    # where rel(k) is 1 when the item retrieved at rank k is relevant, 0 otherwise
    total = 0
    for k in range(1, n + 1):
        precision_at_cutoff_k = precision(relevant, retrieved[:k])
        delta_relevant = 1 if retrieved[k - 1] in relevant else 0
        total += precision_at_cutoff_k * delta_relevant
    return total / len(relevant)
def mean_average_precision(retrieved, relevant, n, queries):  # mAP
    # mAP is the mean of the per-query average precision values
    if len(queries) == 0:
        return 0
    total = 0
    for query in queries:
        total += average_precision(get_retrieved_for_query(retrieved, query),
                                   get_relevant_for_query(relevant, query),
                                   get_count_for_query(n, query))
    return total / len(queries)
  
                                                                          


Friday, September 14, 2018

We were discussing the choice of Query Language for search over object storage.
The use of user-defined operators and computations to perform the work associated with the data is well known in querying. Such custom operators enable intensive and involved queries to be written. They have resulted in stored logic such as stored procedures, which are written in a variety of languages. With the advent of machine learning and data mining algorithms, query tools have added support for new languages and packages as well as algorithms that are now available right out of the box and ship with their respective tools.
While some graph databases have to catch up on support for streaming operations, Microsoft facilitated it with StreamInsight queries. The Microsoft StreamInsight Queries follow a five-step procedure:

1)     Define events in terms of the payload as the data values of the event and the shape as the lifetime of the event along the time axis.

2)     Define the input streams of the event as a function of the event payload and shape. For example, this could be a simple enumerable over some time interval.

3)     Based on the event definitions and the input stream, determine the output stream and express it as a query. In a way this describes a flow chart for the query.

4)     Bind the query to a consumer. This could be a console. For example:

        var query = from win in inputStream.TumblingWindow(TimeSpan.FromMinutes(3)) select win.Count();
5)     Run the query and evaluate it based on time.
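As an illustration of the tumbling window in the example above, here is a minimal Python analogue that counts events per fixed window; it mirrors only the idea, not the StreamInsight API:

from collections import Counter

def tumbling_window_counts(events, window_seconds=180):
    # events are (timestamp_in_seconds, payload) pairs; each event falls
    # into exactly one fixed, non-overlapping window of the given width
    counts = Counter()
    for timestamp, _ in events:
        window_start = (timestamp // window_seconds) * window_seconds
        counts[window_start] += 1
    return dict(counts)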
The query execution engine is different for large distributed databases. For example, Horton has four components: the graph client library, the graph coordinator, the graph partitions and the graph manager. The graph client library sends queries to the graph coordinator, which prepares an execution plan for the query. A graph partition manages a set of graph nodes and edges. Horton is able to scale out mainly because of graph partitions. The graph manager provides an administrative interface to manage the graph with chores like loading the graph and adding or removing servers. But the queries that are written for Horton are not necessarily the same as SQL.
While Horton's approach is closer to SQL, Cypher has deviated from it. Graph databases evolved their own query languages, such as Cypher, to make it easy to work with graphs. Graph databases perform better than relational databases on highly interconnected data where a nearly online data warehouse is required. Object Storage could have standard query operators for its query language if the entire data were to be treated as an enumerable.
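A minimal sketch of that idea, treating a bucket listing as the enumerable and applying filter, projection and ordering as standard query operators; the bucket name, prefix and size predicate are only placeholders:

import boto3

def query_objects(bucket="data-bucket", prefix="logs/"):
    # enumerate the bucket lazily and apply standard query operators over the stream
    s3 = boto3.client("s3")
    paginator = s3.get_paginator("list_objects_v2")
    objects = (obj
               for page in paginator.paginate(Bucket=bucket, Prefix=prefix)
               for obj in page.get("Contents", []))
    large = filter(lambda o: o["Size"] > 1024 * 1024, objects)   # where
    keys = map(lambda o: o["Key"], large)                        # select
    return sorted(keys)                                          # order by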
In order to collapse the enumeration, efficient lookup data structures such as B+ trees are used. These indexes can be saved right in the object storage to enable faster lookups later. Similarly, logs for query engine operations as well as tags and metadata for objects may also be persisted in object storage. The storage forms a layer with the query engine compute layer stacked over it.

void GenerateEvenFibonacci() {
    var fibonacci = GetFibonacciNumbers();
    foreach (var e in fibonacci) {
        if (e % 2 == 0) { Console.WriteLine(e); }   // print the even-valued Fibonacci numbers
    }
}

Thursday, September 13, 2018

We were discussing the suitability of object storage for deep learning. There were several advantages. The analysis can be run on all data at once, and this storage is one of the biggest. The cloud services are elastic and can pull in as much resource as needed. As the backend, the processing is done once for all clients. The performance increases dramatically when the computations are as close to the data as possible. Such compute- and data-intensive operations are hardly required on the frontend. Moreover, optimization is possible when the compute and storage are elastic, where the workloads can be studied, cached, and replayed. Complex queries can already be reduced to use a few primitives, leaving the choice to implement higher-order query operators to users.
The use of user-defined operators and computations to perform the work associated with the data is well known in querying. Such custom operators enable intensive and involved queries to be written. They have resulted in stored logic such as stored procedures, which are written in a variety of languages. With the advent of machine learning and data mining algorithms, query tools have added support for new languages and packages as well as algorithms that are now available right out of the box and ship with their respective tools. 
If the query language allowed implicit data extract, transform and piping of data, it would become even more interactive. Previously, temporary data was held in temporary databases, tables or in memory, but there was no way to offload it to the cloud as S3 files or blobs so that the query language could become even more purposeful as an interactive language. Object storage serves this purpose very well and enables user-oriented, interactive, at-scale data ETL and operations via ad hoc queries. Perhaps the interactive IDE or browser for the query language may make use of the cloud storage in the future.
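A minimal sketch of that offload, assuming the intermediate result is serializable and the bucket is a scratch area reserved for the interactive session:

import json
import boto3

def stash_intermediate(result, name, bucket="query-scratch"):
    # persist an intermediate query result so a later step can pick it up
    s3 = boto3.client("s3")
    s3.put_object(Bucket=bucket, Key="intermediate/" + name + ".json",
                  Body=json.dumps(result).encode("utf-8"))

def load_intermediate(name, bucket="query-scratch"):
    s3 = boto3.client("s3")
    body = s3.get_object(Bucket=bucket, Key="intermediate/" + name + ".json")["Body"].read()
    return json.loads(body)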