Monday, September 17, 2018

We were discussing the role of a Cache Service with Object Storage. The requirements for object storage need not change at all while the reads and writes from applications are handled. A middle layer can act as a proxy for a file system to the application while utilizing object storage for persistence. This alleviates the performance concern of reading and writing deep into the private cloud each time. That is how this Cache Layer positions itself. It offers the same kind of performance benefit that query plan caching brings to query workloads, and while it may use its own intermediate storage, it acts as a staging area so that the data has a chance to age before it persists in object storage.
Object Storage is limitless storage. Data from any workload is expected to grow over time if it is saved continuously, so the backend, and particularly the cloud services, are better prepared for this task. While a flush to the local file system with an asynchronous write may be extremely cheap compared to persisting in the cloud as an S3 object, there is no reason to keep rolling local filesystem data over to object storage by hand.
Object storage is zero-maintenance storage. There is no capacity planning, and the elastic nature of its services may be taken for granted. The automation of asynchronous writes, the flush to object storage, and the sync of data in the cache with data in object storage is now self-contained and packaged into this cache layer.
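A minimal sketch of this staging behavior, assuming boto3 for the S3 API and using a hypothetical bucket, key and local path (an illustration of the idea, not the cache layer's actual implementation): writes land on the local file system immediately while a background thread flushes them to object storage.

import threading

import boto3  # assumption: object storage is reachable over the S3 API

s3 = boto3.client("s3")

def cached_write(local_path, data, bucket, key):
    # write to the local file system first, so the caller only sees file system latency
    with open(local_path, "wb") as f:
        f.write(data)
    # flush to object storage asynchronously, so the data ages into persistence
    threading.Thread(target=s3.upload_file, args=(local_path, bucket, key)).start()

cached_write("/tmp/order-1.json", b'{"id": 1}', "my-bucket", "orders/1.json")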
The cloud services are elastic. Both the storage and the cache services could be deployed in the cloud, which extends the same benefits not just to one application or client but to every department, organization, application, service and workload.
Object storage coupled with this cache layer is also suited to dynamically addressing the needs of clients and applications, because the former may be a global store while the latter can determine access frequency depending on the workload. Different applications may tune the caching to their requirements.
Performance increases dramatically when the results are returned as close to the origin of the requests as possible instead of going deep into the technology stack. This has been one of the arguments for the web cache and the web proxy in general.
Such a service is hard to mimic individually within each application or client. Moreover, optimizations only happen when the compute and storage are elastic, where the workloads can be studied, cached, and replayed independent of the applications.
The move from tertiary to secondary storage is not a straightforward shift from NAS storage to object storage; it involves some form of chores over time. A dedicated product like this takes that concern out of the picture.
#linear regression for bounding box adjustment in regions of interest.

This fits a line to the data points.
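A minimal sketch of such a fit using numpy's least squares routine; the coordinate pairs below are made up for illustration and do not come from any particular detector:

import numpy as np

# made-up (proposal coordinate, ground-truth coordinate) pairs for one box dimension
x = np.array([12.0, 30.0, 45.0, 60.0, 75.0])
y = np.array([15.0, 33.0, 49.0, 62.0, 79.0])

# fit y = slope * x + intercept by least squares
slope, intercept = np.polyfit(x, y, 1)
adjusted = slope * 40.0 + intercept  # adjust a new proposal coordinate
print(slope, intercept, adjusted)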

Sunday, September 16, 2018

Object Storage is perceived as backup and tertiary storage. This may come from the interpretation that this storage is not suitable for the read and write intensive data transfers that are generally handled by a filesystem or a database. However, not all data needs to be written deep into the object storage at once. The requirements for object storage need not change at all while the reads and writes from applications are handled. A middle layer can act as a proxy for a file system to the application while utilizing object storage for persistence. This alleviates the performance concern of reading and writing deep into the private cloud each time. That is how this Cache Layer positions itself. It offers the same kind of performance benefit that query plan caching brings to query workloads, and while it may use its own intermediate storage, it acts as a staging area so that the data has a chance to age before it persists in object storage.
Cache service has been a commercially viable offering. AppFabric is an example of a cache service that has shown substantial improvements to APIs. Since objects are accessed via S3 APIs, such a cache service works very well here. However, traditional cache services have usually replayed previous requests with the help of amortized results, and cache writes have mostly been write-throughs that reach all the way to the disk. This service may instead be viewed as a cloud service that not only maintains a proxy to the object storage but is also a smart as well as massive service that maintains its own storage as necessary.
A Cache Service works closely with a web proxy, and traditionally both have been long-standing products in the marketplace. Mashery is an http proxy that studies web traffic to provide charts and dashboards for monitoring and statistics. This cache layer is well-positioned for web application traffic as well as for traffic that uses S3 APIs directly. It need not even require callers and clients to identify themselves with apikeys over the S3 APIs. Moreover, it can leverage geographical replication of objects within the object storage by routing to or reserving dedicated virtual data center sites and zones for its storage. As long as this caching layer establishes a sync between, say, a distributed or cluster file system and object storage with duplicity-like logic, it can eventually roll all data over to persistence.

#codingexercise
import math

def discounted_cumulative_gain(relevance, p):
    # DCG at position p: rel_1 + sum of rel_i / log2(i) for i = 2..p
    dcg = relevance(1)
    for i in range(2, p + 1):
        dcg += relevance(i) / math.log2(i)
    return dcg
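As a quick usage check, with a made-up graded relevance list for the top five results of a query (the values are illustrative only):

rels = [3, 2, 3, 0, 1]
print(discounted_cumulative_gain(lambda i: rels[i - 1], 5))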

Saturday, September 15, 2018


Evaluating object detection using the mAP metric:

def precision(relevant, retrieved):
    # fraction of the retrieved results that are relevant
    return len(set(relevant).intersection(retrieved)) / len(retrieved)

def recall(relevant, retrieved):
    # fraction of the relevant results that were retrieved
    return len(set(relevant).intersection(retrieved)) / len(relevant)
  
def average_precision(precision, recall, retrieved, relevant, n):
    # AP = sum over ranks k of (precision at cutoff k) * (change in recall at rank k)
    total = 0
    for k in range(1, n + 1):
        precision_at_cutoff = precision(relevant, retrieved[:k])
        delta_recall = recall(relevant, retrieved[:k]) - recall(relevant, retrieved[:k - 1])
        total += precision_at_cutoff * delta_recall
    return total
def mean_average_precision(precision, recall, retrieved, relevant, n, queries):  # mAP
    # mean of the per-query average precision values
    if len(queries) == 0:
        return 0
    total = 0
    for query in queries:
        total += average_precision(get_precision_for_query(precision, query),
                                   get_recall_for_query(recall, query),
                                   get_retrieved_for_query(retrieved, query),
                                   get_relevant_for_query(relevant, query),
                                   get_count_for_query(n, query))
    return total / len(queries)
  
                                                                          


Friday, September 14, 2018

We were discussing the choice of Query Language for search over object storage.
The use of user-defined operators and computations to perform the work associated with the data is well known in querying. Such custom operators enable intensive and involved queries to be written. They have resulted in stored logic such as stored procedures, which are written in a variety of languages. With the advent of machine learning and data mining algorithms, there is support for new languages and packages as well as algorithms that are now available right out of the box and shipped with their respective tools.
While some graph databases have yet to catch up on support for streaming operations, Microsoft facilitated it with StreamInsight queries. Microsoft StreamInsight queries follow a five-step procedure:

1) Define events in terms of the payload as the data values of the event and the shape as the lifetime of the event along the time axis.

2) Define the input streams of the event as a function of the event payload and shape. For example, this could be a simple enumerable over some time interval.

3) Based on the event definitions and the input stream, determine the output stream and express it as a query. In a way, this describes a flow chart for the query.

4) Bind the query to a consumer. This could be to a console, for example:

        var query = from win in inputStream.TumblingWindow(TimeSpan.FromMinutes(3)) select win.Count();
5) Run the query and evaluate it based on time.
The query execution engine is different for large distributed databases. For example, Horton has four components - the graph client library, the graph coordinator, graph partitions and the graph manager. The graph client library sends queries to the graph coordinator, which prepares an execution plan for the query. A graph partition manages a set of graph nodes and edges; Horton is able to scale out mainly because of graph partitions. The graph manager provides an administrative interface to manage the graph with chores like loading the graph and adding or removing servers. But the queries written for Horton are not necessarily the same as SQL.
While Horton's approach is closer to SQL, Cypher's language has deviated from SQL. Graph databases evolved their own query languages, such as Cypher, to make it easy to work with graphs. Graph databases perform better than relational databases on highly interconnected data where a nearly online data warehouse is required. Object Storage could have standard query operators for the query language if the entire data were to be considered as enumerable.
In order to collapse the enumeration, efficient lookup data structures such as the B+ tree are used. These indexes can be saved right in the object storage to enable faster lookups later. Similarly, logs for query engine operations and tags and metadata for objects may also be persisted in object storage. The storage forms a layer with the query engine compute layer stacked over it.
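If the entire store is treated as an enumerable, standard query operators reduce to filters, maps and sorts over that enumeration. A minimal sketch, assuming boto3 and hypothetical bucket and prefix names (an illustration only, not the query engine described above):

import boto3  # assumption: the object storage is reachable over the S3 API

s3 = boto3.client("s3")

def enumerate_objects(bucket, prefix=""):
    # lazily enumerate object summaries page by page
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            yield obj

# a standard "where" and "order by" expressed over the enumeration
large_keys = sorted(o["Key"] for o in enumerate_objects("my-bucket", "logs/") if o["Size"] > 1024)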

void GenerateEvenFibonacci() {
    var fibonacci = GetFibonacciNumbers();
    // keep only the even-valued Fibonacci numbers
    foreach (var e in fibonacci.Where(e => e % 2 == 0)) {
        Console.WriteLine(e);
    }
}

Thursday, September 13, 2018

We were discussing the suitability of object storage for deep learning. There were several advantages. The analysis can be run on all data at once, and this storage is one of the biggest. The cloud services are elastic and can pull in as much resource as needed. As the backend, the processing is done once for all clients. The performance increases dramatically when the computations are as close to the data as possible; such compute and data intensive operations are hardly required on the frontend. Moreover, optimization is possible when the compute and storage are elastic, where the workloads can be studied, cached, and replayed. Complex queries can already be reduced to use a few primitives, leaving users the choice to implement higher order query operators themselves.
The use of user-defined operators and computations to perform the work associated with the data is well known in querying. Such custom operators enable intensive and involved queries to be written. They have resulted in stored logic such as stored procedures, which are written in a variety of languages. With the advent of machine learning and data mining algorithms, there is support for new languages and packages as well as algorithms that are now available right out of the box and shipped with their respective tools.
If the query language allows implicit extract-transform of data and piping of data, it becomes even more interactive. Previously, temporary data was held in temporary databases or tables or in memory, and there was no way to offload it to the cloud as S3 files or blobs so that the query language could become even more purposeful as an interactive language. Object storage serves this purpose very well and enables user-oriented, interactive, at-scale data ETL and operations via ad hoc queries. Perhaps the interactive IDE or browser for the query language may make use of cloud storage in the future.

Wednesday, September 12, 2018

Using Machine Learning with Object Storage:
Machine learning packages such as sklearn and the Microsoft ML package, as well as complex reporting queries for, say, dashboards, can also utilize object storage. These analytical capabilities are leveraged inside database systems, but there is no limitation to applying them over objects in object storage.
This works very well for several reasons (a brief sketch follows this list):
1) The analysis can be run on all data at once. The more the data, the better the analysis, and object storage is one of the biggest stores possible. Consequently the backend, and particularly the cloud services, are better prepared for this task.
2) The cloud services are elastic - they can pull in as much resource as needed for the execution of the queries, and this works well for map-reduce processing.
3) Object storage is also suited to doing this processing once for every client and application. Different views and viewmodels can use the same computation so long as the results are part of the storage.
4) Performance increases dramatically when the computations are as close to the data as possible. This has been one of the arguments for pushing the machine learning package into SQL Server, for example.
5) Such compute and data intensive operations are hardly required on the frontend, where the data may be very limited on a given page. Moreover, optimizations only happen when the compute and storage are elastic, where the workloads can be studied, cached, and replayed.
6) Complex queries can already be reduced to use a few primitives, which can be made available as query operators over object storage, leaving users the choice to implement higher order operators themselves using these primitives or their own custom operators.
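As a minimal sketch of the idea, assuming boto3, pandas and scikit-learn are available and using hypothetical bucket, key and column names, a model can be fit directly on data pulled from object storage:

import io

import boto3
import pandas as pd
from sklearn.linear_model import LinearRegression

s3 = boto3.client("s3")

# pull a CSV object out of object storage into a dataframe
body = s3.get_object(Bucket="my-bucket", Key="datasets/sales.csv")["Body"].read()
frame = pd.read_csv(io.BytesIO(body))

# fit a simple model on the retrieved data; the feature and label column names are assumptions
model = LinearRegression().fit(frame[["feature1", "feature2"]], frame["label"])
print(model.coef_, model.intercept_)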
#codingexercise
Determine if a sum corresponds to a perfect number. A perfect number is one whose proper divisors add up to the number itself.
bool IsPerfectSum(ref List<int> factors, int sum)
{
    // true when the remaining factors add up exactly to the sum
    if (sum == 0 && factors.Count() > 0) return false;
    if (sum == 0) return true;
    if (sum < 0) return false;
    if (factors.Count() == 0 && sum != 0) return false;
    // sum > 0 and factors.Count() > 0
    var last = factors.Last();
    factors.RemoveAt(factors.Count() - 1);
    if (last > sum)
    {
        return false;
    }
    return IsPerfectSum(ref factors, sum - last);
}