Sunday, September 16, 2018

Object Storage is often perceived as backup and tertiary storage. This perception may come from the interpretation that such storage is not suitable for the read- and write-intensive data transfers that are generally handled by a filesystem or database. However, not all data needs to be written deep into the object storage at once, and the object storage itself need not change in order to handle the reads and writes from applications. A middle layer can act as a proxy for a file system to the application while utilizing the object storage for persistence. This alleviates the performance cost of reading and writing deep into the private cloud each time. That is how this cache layer positions itself. It offers the same kind of benefit that query plan caching brings to query workloads, and while it may use its own intermediate storage, it works as a staging area so that the data has a chance to age before persisting in object storage.
Cache service has been a commercially viable offering. AppFabric is an example of a cache service that has shown substantial improvements to APIs. Since objects are accessed via S3 APIs, the use of such a cache service works very well. However, traditional cache services have usually replayed previous requests with the help of amortized results, and cache writes have mostly been write-throughs that reach all the way to the disk. This service may instead be viewed as a cloud service that not only maintains a proxy to the object storage but is also a smart as well as massive service that maintains its own storage as necessary.
A cache service works closely with a web proxy, and traditionally both have been long-standing products in the marketplace. Mashery is an HTTP proxy that studies web traffic to provide charts and dashboards for monitoring and statistics. This cache layer is well positioned for web application traffic as well as for traffic that utilizes S3 APIs directly. It need not even require identifying callers and clients with API keys over the S3 APIs. Moreover, it can leverage geographical replication of objects within the object storage by routing to, or reserving, dedicated virtual data center sites and zones for its storage. As long as this caching layer establishes a sync between, say, a distributed or cluster file system and the object storage with duplicity-like logic, it can roll over all data to persistence eventually.
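As a minimal sketch of this staging idea, and assuming an S3-compatible object store accessed through boto3 (the class, bucket and key names below are hypothetical), a cache layer might serve reads and writes from its own staging area and flush aged data to the object store:

import boto3

class CachingProxy:
    def __init__(self, bucket, endpoint_url=None):
        # Hypothetical proxy that fronts an S3-compatible object store.
        self.s3 = boto3.client("s3", endpoint_url=endpoint_url)
        self.bucket = bucket
        self.staged = {}  # in-memory staging area keyed by object name

    def write(self, key, data):
        # Serve the write from the staging area; persistence is deferred.
        self.staged[key] = data

    def read(self, key):
        # Prefer the staged copy; fall back to the object store.
        if key in self.staged:
            return self.staged[key]
        return self.s3.get_object(Bucket=self.bucket, Key=key)["Body"].read()

    def flush(self):
        # Age staged data out to the object store for durability.
        for key, data in self.staged.items():
            self.s3.put_object(Bucket=self.bucket, Key=key, Body=data)
        self.staged.clear()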

#codingexercise
import math

def discounted_cumulative_gain(relevance, p):
    # DCG_p = rel_1 + sum of rel_i / log2(i) for i = 2..p
    dcg = relevance(1)
    for i in range(2, p + 1):
        dcg += relevance(i) / math.log2(i)
    return dcg

Saturday, September 15, 2018


Detecting objects using mAP metric: 

def precision(relevant, retrieved):
    return len(set(relevant).intersection(retrieved)) / len(retrieved)

def recall(relevant, retrieved):
    # recall divides by the number of relevant items, not the retrieved ones
    return len(set(relevant).intersection(retrieved)) / len(relevant)
  
def average_precision(precision, recall, retrieved, relevant, n):
    total = 0
    for k in range(1, n + 1):
        # precision at the cutoff rank k of the retrieved results
        precision_at_cutoff_k = precision(get(sorted(retrieved), k))
        # change in relevance between rank k and rank k-1
        delta_relevant = abs(relevant(get(retrieved, k)) - relevant(get(retrieved, k - 1)))
        total += precision_at_cutoff_k * delta_relevant
    return total / len(relevant)
def mean_average_precision(precision, recall, retrieved, relevant, n, queries):  # mAP
    if len(queries) == 0:
        return 0
    total = 0
    for query in queries:
        total += average_precision(get_precision_for_query(precision, query),
                                   get_recall_for_query(recall, query),
                                   get_retrieved_for_query(retrieved, query),
                                   get_relevant_for_query(relevant, query),
                                   get_count_for_query(n, query))
    return total / len(queries)
  
                                                                          


Friday, September 14, 2018

We were discussing the choice of Query Language for search over object storage.
The use of user-defined operators and computations to perform the work associated with the data is well known in querying. Such custom operators enable intensive and involved queries to be written. They have resulted in stored logic, such as stored procedures, written in a variety of languages. With the advent of machine learning and data mining algorithms, databases have added support for new languages and packages, as well as algorithms that are now available out of the box and shipped with their respective tools.
While some graph databases have to catch up on support for streaming operations, Microsoft facilitated it with StreamInsight. StreamInsight queries follow a five-step procedure:

1)     define events in terms of payload as the data values of the event and the shape as the lifetime of the event along the time axis

2)     define the input streams of the event as a function of the event payload and shape. For example, this could be a simple enumerable over some time interval

3)     Based on the events definitions and the input stream, determine the output stream and express it as a query. In a way this describes a flow chart for the query

4)     Bind the query to a consumer. This could be a console, for example:

        var query = from win in inputStream.TumblingWindow(TimeSpan.FromMinutes(3)) select win.Count();
5)     Run the query and evaluate it based on time.
The query execution engine is different for large distributed databases. For example, Horton has four components: the graph client library, the graph coordinator, the graph partitions and the graph manager. The graph client library sends queries to the graph coordinator, which prepares an execution plan for the query. Each graph partition manages a set of graph nodes and edges, and Horton is able to scale out mainly because of these partitions. The graph manager provides an administrative interface to manage the graph, with chores like loading the graph and adding and removing servers. But the queries written for Horton are not necessarily the same as SQL.
While Horton's approach is closer to SQL, Cypher's language has deviated from it. Graph databases evolved their own query languages, such as Cypher, to make it easy to work with graphs. Graph databases perform better than relational databases on highly interconnected data where a nearly online data warehouse is required. Object storage could offer standard query operators for its query language if the entire data were treated as an enumerable.
To collapse the enumeration, efficient lookup data structures such as B+ trees are used. These indexes can be saved right in the object storage, enabling faster lookups later. Similarly, logs for query engine operations, as well as tags and metadata for objects, may also be persisted in object storage. The storage forms a layer with the query engine compute layer stacked over it.
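As a small sketch of that idea, assuming boto3 against an S3-compatible store (a sorted key list with binary search stands in for a B+ tree, and the bucket and key names are made up), the index itself can be persisted as just another object:

import bisect
import json
import boto3

s3 = boto3.client("s3")
BUCKET = "analytics"  # hypothetical bucket
INDEX_KEY = "indexes/name-index.json"

def build_index(entries):
    # entries: list of (lookup_key, object_name) pairs, persisted in sorted order.
    entries.sort(key=lambda pair: pair[0])
    s3.put_object(Bucket=BUCKET, Key=INDEX_KEY, Body=json.dumps(entries).encode())

def lookup(key):
    # Binary search over the persisted index avoids enumerating every object.
    entries = json.loads(s3.get_object(Bucket=BUCKET, Key=INDEX_KEY)["Body"].read())
    keys = [pair[0] for pair in entries]
    position = bisect.bisect_left(keys, key)
    if position < len(keys) and keys[position] == key:
        return entries[position][1]  # the object that holds the data for this key
    return None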

// Prints the even-valued numbers from the Fibonacci sequence.
void GenerateEvenFibonacci()
{
    var fibonacci = GetFibonacciNumbers();
    fibonacci.Enumerate((i, e) => { if (e % 2 == 0) { Console.WriteLine(e); } });
}

Thursday, September 13, 2018

We were discussing the suitability of object storage for deep learning. There are several advantages. The analysis can be run on all data at once, and this storage is one of the biggest available. The cloud services are elastic and can pull in as much resource as needed. As the backend, the processing is done once for all clients. Performance increases dramatically when the computations are as close to the data as possible, and such compute- and data-intensive operations are hardly required on the frontend. Moreover, optimization is possible when the compute and storage are elastic, because operations can be studied, cached, and replayed. Complex queries can already be reduced to a few primitives, leaving users the choice to implement higher-order query operators themselves.
The use of user-defined operators and computations to perform the work associated with the data is well known in querying. Such custom operators enable intensive and involved queries to be written. They have resulted in stored logic, such as stored procedures, written in a variety of languages. With the advent of machine learning and data mining algorithms, databases have added support for new languages and packages, as well as algorithms that are now available out of the box and shipped with their respective tools.
If the query language allowed implicit data extract-transform and piping of data, it would become even more interactive. Previously, temporary data was held in temporary databases or tables or in memory, but there was no way to offload it to the cloud as S3 files or blobs so that the query language could become even more purposeful as an interactive language. Object storage serves this purpose very well and enables user-oriented, interactive, at-scale data ETL and operations via ad hoc queries. Perhaps the interactive IDE or browser for a query language may make use of cloud storage in the future.
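A small sketch of that offloading, assuming boto3 against an S3-compatible store (the bucket and key names are made up for illustration), might stage each intermediate result set of an ad hoc pipeline as an object so a later step, or another user, can pick it up:

import json
import boto3

s3 = boto3.client("s3")
BUCKET = "scratch"  # hypothetical staging bucket

def stage(step_name, rows):
    # Persist an intermediate result set instead of a temporary table.
    s3.put_object(Bucket=BUCKET, Key="etl/" + step_name + ".json",
                  Body=json.dumps(rows).encode())

def load_stage(step_name):
    # Pick the staged result back up for the next step of the pipeline.
    body = s3.get_object(Bucket=BUCKET, Key="etl/" + step_name + ".json")["Body"].read()
    return json.loads(body)

# Example flow: stage("extract", raw_rows) in one query session,
# then transform(load_stage("extract")) in the next step.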

Wednesday, September 12, 2018

Using Machine Learning with Object Storage:
Machine learning packages such as sklearn and the Microsoft ML package, or complex reporting queries for, say, dashboards, can also utilize object storage. These analytical capabilities are leveraged in database systems, but there is no limitation to applying them over objects in object storage.
This works very well for several reasons:
1) The analysis can be run on all data at once. The more the data, the better the analysis, and object storage is one of the biggest stores possible. Consequently the backend, and particularly the cloud services, are better prepared for this task
2) The cloud services are elastic - they can pull in as much resource as needed for the execution of the queries and this works well for map-reduce processing
3) Object storage is also suited to doing this processing once for every client and application. Different views and viewmodels can use the same computation so long as the results are part of the storage.
4) Performance increases dramatically when the computations are as close to the data as possible. This has been one of the arguments for pushing the machine learning package into SQL Server, for example.
5) Such compute- and data-intensive operations are hardly required on the frontend, where the data may be very limited on a given page. Moreover, optimizations are only possible when the compute and storage are elastic, so that operations can be studied, cached, and replayed.
6) Complex queries can already be reduced to a few primitives which can be made available as query operators over object storage, leaving users the choice to implement higher-order operators themselves using these primitives or their own custom operators. The sketch after this list illustrates such an analysis running directly over objects in the store.
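As a minimal sketch, assuming boto3 against an S3-compatible store and scikit-learn (the bucket, prefix and field names are hypothetical), an analysis can read all the records from the store, run once on the backend, and write its result back as an object for every client to reuse:

import json
import boto3
from sklearn.cluster import KMeans

s3 = boto3.client("s3")
BUCKET = "telemetry"  # hypothetical bucket

def load_feature_vectors(prefix):
    # Pull every record under the prefix; the "features" field is an assumed layout.
    vectors = []
    for item in s3.list_objects_v2(Bucket=BUCKET, Prefix=prefix).get("Contents", []):
        record = json.loads(s3.get_object(Bucket=BUCKET, Key=item["Key"])["Body"].read())
        vectors.append(record["features"])
    return vectors

X = load_feature_vectors("events/2018/")
model = KMeans(n_clusters=3).fit(X)

# Store the computed result back so different views and viewmodels can reuse it.
s3.put_object(Bucket=BUCKET, Key="results/cluster-labels.json",
              Body=json.dumps(model.labels_.tolist()).encode())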
#codingexercise
Determine if a number is perfect. A perfect number is one whose factors add up to the number.
bool IsPerfectSum(ref List<int> factors, int sum)
{
    if (sum == 0 && factors.Count() > 0) return false;
    if (sum == 0) return true;
    if (sum < 0) return false;
    if (factors.Count() == 0 && sum != 0) return false;
    // sum > 0 and factors.Count() > 0
    var last = factors.Last();
    factors.RemoveAt(factors.Count() - 1);
    if (last > sum)
    {
        return false;
    }
    return IsPerfectSum(ref factors, sum - last);
}

Tuesday, September 11, 2018

Object versioning versus naming conventions
Introduction: When an object is created, it has a lifetime and purpose. If it undergoes modifications, the object is no longer the same. Even though the modified object may serve the same purpose, it will be a different entity.  These objects can co-exist either as separate versions or with separate names.
In object storage we have a similar concept. When we enable versioning, we have the option to track changes to an object. Any modification of the object via upload increments its version. This follows the "copy-on-write" principle. Generally, in-place editing of an object is not recommended. There are several reasons for this. First, an object may be considered a sequence of bytes. If we overwrite a byte range that is yet to be read, the reader may not know what the original object was. Second, the size of the byte range may shrink or expand on editing. The caller may never be able to use the size as an attribute of the object if it keeps changing. Third, just as we have caveats with memcpy operations on byte ranges, we exercise similar caution on the changes for readers and writers. If we had made a copy, the readers could continue to read the old copy without any concern about writers, and vice versa. The changes to the object could also leave the object in an inconsistent or unusable state. Therefore, editing an object is not preferred. Unless it is done for debugging or other forms of forensics or reverse engineering, an object is best served by having a different version. Versioning is automatic, and it is possible to go forward or backward between versions. The versioning may even come with descriptions of each version. All versions of the object have the same name. When an object is referred to by its name, the latest version is retrieved.
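For an S3-compatible store accessed through boto3 (the bucket and key names are hypothetical), the behavior described above looks roughly like this: versioning is enabled once on the bucket, every upload of the same name creates a new version, and a plain GET by name returns the latest:

import boto3

s3 = boto3.client("s3")
BUCKET = "documents"  # hypothetical bucket

# Enable versioning once; subsequent uploads of the same key become new versions
# instead of in-place edits.
s3.put_bucket_versioning(Bucket=BUCKET,
                         VersioningConfiguration={"Status": "Enabled"})

s3.put_object(Bucket=BUCKET, Key="report.txt", Body=b"first draft")
s3.put_object(Bucket=BUCKET, Key="report.txt", Body=b"second draft")

# A GET by name retrieves the latest version; older ones stay addressable by version id.
for version in s3.list_object_versions(Bucket=BUCKET, Prefix="report.txt")["Versions"]:
    print(version["VersionId"], version["IsLatest"])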
When objects have different names, they may follow patterns for their organization. When there are a large number of objects, having a prefix and a naming convention is standard practice in many Information Technology departments. These names, with wildcard patterns, can then be used in search commands that span some or all of these objects. This may not be easy to do with past versions unless the search command iterates over all versions of an object. Moreover, copies of the same object with different filenames can each undergo different modifications and maintain their own version history. The names may even involve tags that let objects be grouped, ranked and sorted.
There are other forms of modification that are not covered by the techniques above. For example, there is an edit-original-and-copy-later technique where two copies are maintained simultaneously, the edit of one copy is allowed, and the other copy serves as a restore point. An undo-like technique is also possible where the incremental changes are captured and undone by replacing them with copies of the originals. In fact, all in-place editing can be handled automatically with the help of some form of rollback behavior involving either discarding the writes or overwriting with the original. Locking and logging are two popular techniques to help with atomicity, consistency, isolation and durability.

Monday, September 10, 2018

Object Storage as a query store
Introduction: Users are able to search and query files in a file system or unstructured data stores. Object storage is not only a replacement for file storage but is also an unstructured data store promoting enumeration of objects with a simple namespace, bucket and object hierarchy. This article looks at enabling not just querying over object storage but also search and mining techniques.
Description:
1) Object Storage as  a SQL store:
This technique utilizes a SQL engine over enumerables:

Object Storage data is searchable as a COM input to the log parser. A COM input simply implements a few methods for the log parser and abstracts the data store. These methods are:
OpenInput: Opens your data source and sets up any initial environment settings
GetFieldCount: returns the number of fields that the plugin provides
GetFieldName: returns the name of a specified field
GetFieldType : returns the datatype of a specified field
GetValue : returns the value of a specified field
ReadRecord : reads the next record from your data source
CloseInput: closes the data source and cleans up any environment settings
Here we are saying that Object Storage acts as a data store for the COM input to the log parser, which can then be queried in SQL for the desired output.
There are two different forms of expression enabled for SQL queries.
First, the expression in the form of standard query operators, which became popular across languages, such as .Where() and .Sum() in LINQ. This builds on tried, tested and well-established SQL query language features. The language is very expressive, allowing queries to be written in a succinct manner and often enabling all aspects of data manipulation to refine and improve result sets.
The second form of expression is the search query language, which has a rich history in shell scripting and log analysis, where the results of one command are piped into another command for transformation and analysis. Although similar in nature to chaining operators, expressions of this form involve more search-like capabilities across heterogeneous data, such as the use of regular expressions for detecting patterns. This form of expression not only involves powerful query operators but also facilitates data extract, transform and load as the data makes its way through the expressions.
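A rough sketch of both styles, using Python as a stand-in and assuming the objects are JSON records enumerated from an S3-compatible bucket through boto3 (all names below are illustrative), might look like this:

import json
import re
import boto3

s3 = boto3.client("s3")
BUCKET = "logs"  # hypothetical bucket

def objects(prefix):
    # Enumerate the objects under a prefix as parsed records.
    for item in s3.list_objects_v2(Bucket=BUCKET, Prefix=prefix).get("Contents", []):
        yield json.loads(s3.get_object(Bucket=BUCKET, Key=item["Key"])["Body"].read())

# Standard-query-operator style: the equivalent of chained .Where() and .Sum().
total_errors = sum(1 for rec in objects("web/") if rec.get("status", 0) >= 500)

# Piped, search-like style: pattern matching followed by a transformation,
# much as one command's output is piped into the next.
slow_pages = [rec["url"] for rec in objects("web/")
              if re.search(r"/api/", rec.get("url", "")) and rec.get("latency_ms", 0) > 1000]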


2) Object Storage as a search index:


Here we utilize the contents of the objects to build an index. The index may reside in a database, but there is no restriction on storing it as objects in the object store if the performance is tolerable.

Sample program available here: https://github.com/ravibeta/csharpexamples/blob/master/SourceSearch/SourceSearch/SourceSearch/Program.cs
3) Object Storage for deep learning:


Here we utilize the tags associated with the objects, which may be assigned once when the content is classified. The operators used here can be expanded to include more involved forms of computation such as grouping, ranking, sorting and similar analysis.


There are three scenarios for showcasing the breadth of the query language: a cognitive example, a text analysis example and a JSON processing example.
The cognitive example identifies objects in images. This kind of example shows how the entire image processing on image files can be considered custom logic and used with the query language. As long as we define the objects, the input and the logic to analyze the objects, it can be made part of the query to extract the desired output dataset.
The text analysis example is also similar where we can extract the text prior to performing the analysis. It is interesting to note that the classifier used to tag the text can be written in R language and is not dependent on the query.
JSON processing is another example that is referenced often, probably because extract-transform-load has become important in analytical processing, whether it is a cloud data warehouse or big data operations. This "schema later" approach is popular because it decouples producers and consumers, which saves coordination and time-consuming operations between, say, departments. In all three scenarios, the object storage can work effectively as a storage layer.
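A small sketch of the schema-later idea, again assuming boto3 against an S3-compatible bucket with hypothetical names: the producer writes raw JSON documents as objects, and the consumer projects only the fields it cares about at read time:

import json
import boto3

s3 = boto3.client("s3")
BUCKET = "ingest"  # hypothetical bucket

def project(prefix, fields):
    # Apply the schema on read: keep only the requested fields and tolerate
    # documents that lack some of them.
    for item in s3.list_objects_v2(Bucket=BUCKET, Prefix=prefix).get("Contents", []):
        doc = json.loads(s3.get_object(Bucket=BUCKET, Key=item["Key"])["Body"].read())
        yield {name: doc.get(name) for name in fields}

# rows = list(project("orders/2018/", ["order_id", "amount", "region"]))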
Conclusion:
All aspects of a universal query language may be applicable to object stores just as if the content was available from file or document stores.

Furthermore, the metadata and the indexes may be stored in dedicated objects within the object storage.