Thursday, July 26, 2018

We were discussing the use of object storage as a time-series data store. The notion of buckets in a time-series database translates well to object storage. As one bucket gets filled, data can start filling another. With the help of cold, warm and hot labels, it is easy to maintain the progression of data. This data can then serve all the search queries over the time series, just like events in a time-series database.

#codingexercise
Counting inversions using merge sort:
// Counts inversions in A using merge sort; B is scratch space of the same size.
int GetCountInversionsByMergeSort(List<int> A, List<int> B, int left, int right)
{
    int count = 0;
    if (right > left)
    {
        int mid = (left + right) / 2;
        count = GetCountInversionsByMergeSort(A, B, left, mid);
        count += GetCountInversionsByMergeSort(A, B, mid + 1, right);
        count += GetCountByMerge(A, B, left, mid + 1, right);
    }
    return count;
}

int GetCountByMerge(List<int> A, List<int> B, int left, int mid, int right)
{
    int count = 0;
    int i = left; // index into the left half
    int j = mid;  // index into the right half
    int k = left; // index into the merged output

    while ((i <= mid - 1) && (j <= right))
    {
        if (A[i] <= A[j])
        {
            B[k] = A[i];
            k++;
            i++;
        }
        else
        {
            B[k] = A[j];
            k++;
            j++;
            // every element still remaining in the left half is greater than A[j]
            count = count + (mid - i);
        }
    }

    while (i <= mid - 1)
    {
        B[k] = A[i];
        k++;
        i++;
    }

    while (j <= right)
    {
        B[k] = A[j];
        j++;
        k++;
    }

    // copy the merged result back into A
    for (int m = left; m <= right; m++)
        A[m] = B[m];

    return count;
}
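
A quick usage sketch, assuming the two methods above are in the same class; B starts as a copy of A and serves as scratch space for merging:

var A = new List<int> { 2, 4, 1, 3, 5 };
var B = new List<int>(A);
int inversions = GetCountInversionsByMergeSort(A, B, 0, A.Count - 1);
Console.WriteLine(inversions); // 3, from the pairs (2,1), (4,1) and (4,3)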


Wednesday, July 25, 2018

We were discussing the use of object storage as a time-series data store. The notion of buckets in a time-series database translates well to object storage. As one bucket gets filled, data can start filling another. With the help of cold, warm and hot labels, it is easy to maintain the progression of data. This data can then serve all the search queries over the time series, just like events in a time-series database.
Most time-series databases prefer to use the filesystem directly for their index and event store without requiring a separate NoSQL database. An NFS file system can also be exported as an object store. This means existing filesystem-based data files can be served as buckets and objects once they are set up to do so. Object storage products allow a filesystem to be used this way.
This is helpful for existing data. In addition, time-series database products such as log stores can also write their indexes directly to object storage products, which then provide more benefits than the filesystems did.
Since time-series databases make progressive buckets as they fill events into each bucket, they are mostly concerned with individual buckets. There is no nesting of buckets; it is a progression. This suits the hierarchy of buckets and objects. Most time-series buckets are allocated in user-defined indexes. This is very similar to the namespaces in an object store. There does not need to be a direct mapping between an object storage bucket and a time-series bucket. The latter may even appear as objects within an object store; the emphasis here is the one-level hierarchy between buckets and events. The format of the events stored may be proprietary, so their storage as objects is opaque to the storage world. Their promotion to object stores not only improves storage but also offers them directly over HTTP without having to route the request through the time-series database's controllers, which removes some onus from that layer of the time-series database and even facilitates querying.
There may be some concern over moving data up and down the protocol layers to be able to serve it over HTTP. In addition, there may be copying of remote data onto local storage in order to search it. However, these can be delegated to the object storage and its query package so that the time-series database merely focuses on the semantics, leaving the optimization to the storage. Most time-series databases shy away from conventional database products simply for the dedicated nature of their offering and the scale of billions of events. Here the entire object storage can be local and serve in place of the filesystem that the time-series database uses.
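
As a minimal sketch of the labeling idea with the AWS SDK for .NET (the bucket name, key scheme, and "temperature" metadata key are assumptions for illustration), an event could be written to the current hot bucket and its label later flipped in place as the bucket ages:

using System;
using System.Threading.Tasks;
using Amazon;
using Amazon.S3;
using Amazon.S3.Model;

class TimeSeriesBuckets
{
    static async Task Main()
    {
        var client = new AmazonS3Client(RegionEndpoint.EUWest1);

        // write an event to the bucket currently labeled hot (names are hypothetical)
        var put = new PutObjectRequest
        {
            BucketName = "timeseries-2018-07",
            Key = "events/" + DateTime.UtcNow.Ticks,
            ContentBody = "{ \"metric\": \"cpu\", \"value\": 42 }"
        };
        put.Metadata.Add("x-amz-meta-temperature", "hot");
        await client.PutObjectAsync(put);

        // when the bucket fills, demote its objects by rewriting the label in place
        var copy = new CopyObjectRequest
        {
            SourceBucket = "timeseries-2018-07",
            SourceKey = put.Key,
            DestinationBucket = "timeseries-2018-07",
            DestinationKey = put.Key,
            MetadataDirective = S3MetadataDirective.REPLACE
        };
        copy.Metadata.Add("x-amz-meta-temperature", "warm");
        await client.CopyObjectAsync(copy);
    }
}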

Tuesday, July 24, 2018

Sample program to demonstrate a search operation over buckets and objects:

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using Amazon;
using Amazon.S3;
using Amazon.S3.Model;
using Lucene.Net;
using Lucene.Net.Analysis;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.QueryParsers;
using Lucene.Net.Store;
using Lucene.Net.Analysis.Standard;

namespace SourceSearch
{
    class Program
    {
        private const string bucketName = "ravi-rajamani-shared";
        private const string keyName1 = "searchIndex";
        private const string filePath = @"C:\Code\Index2";
        private const string sourcePath = @"\code\API";
        private static readonly RegionEndpoint bucketRegion = RegionEndpoint.EUWest1;

        private static IAmazonS3 client;
        static async Task Main(string[] args)
        {
            if (args.Count() != 1)
            {
                Console.WriteLine("Usage: SourceSearch <term>");
                return;
            }

            client = new AmazonS3Client(bucketRegion);
            var indexAt = SimpleFSDirectory.Open(new DirectoryInfo(filePath));
            var analyzer = new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_30);
            using (var indexer = new IndexWriter(
                indexAt,
                analyzer, true,
                IndexWriter.MaxFieldLength.UNLIMITED))
            {

                var src = new DirectoryInfo(sourcePath);

                foreach (var x in src.EnumerateFiles("*.cs", SearchOption.AllDirectories))
                {
                            using (var reader = File.OpenText(x.FullName))
                            {
                                var doc = new Document();
                                // tee the token stream so a sink can re-consume the tokens; the indexed field sees lower-cased tokens
                                TeeSinkTokenFilter tfilter = new TeeSinkTokenFilter(new WhitespaceTokenizer(reader));
                                TeeSinkTokenFilter.SinkTokenStream sink = tfilter.NewSinkTokenStream();
                                TokenStream final = new LowerCaseFilter(tfilter);
                                doc.Add(new Field("contents", final));
                                doc.Add(new Field("title", x.FullName, Field.Store.YES, Field.Index.ANALYZED));
                                indexer.AddDocument(doc);

                                // we persist this in the object store:
                                // 1. Put object - specify only the key name for the new object.
                                try
                                {
                                    var putRequest1 = new PutObjectRequest
                                    {
                                        BucketName = bucketName,
                                        Key = x.FullName,
                                        ContentBody = doc.ToString()
                                    };

                                    putRequest1.Metadata.Add("x-amz-meta-title", x.FullName);
                                    PutObjectResponse response1 = await client.PutObjectAsync(putRequest1);
                                }
                                catch (Exception e)
                                {
                                    Console.WriteLine(
                                    "Error encountered ***. Message:'{0}' when writing an object"
                                    , e.Message);
                                }
                            }
                }

                indexer.Optimize();
                Console.WriteLine("Total number of files indexed : " + indexer.MaxDoc());
            }

            using (var reader = IndexReader.Open(indexAt, true))
            {
                var pos = reader.TermPositions(new Term("contents", args.First().ToLower()));
                while (pos.Next())
                {
                    Console.WriteLine("Match in document " + reader.Document(pos.Doc).GetValues("title").FirstOrDefault());
                }
            }
        }
    }
}
// Reference: https://1drv.ms/w/s!Ashlm-Nw-wnWtyVeqoXu7U9zEKuT

Monday, July 23, 2018

There are a few differences between filesystems and object storage; file operations such as find and grep are not well suited to object storage. However, the ability to search object storage is not limited by the API. The S3 command line can be used with commands such as cp to dump the contents of an object to stdout, where it can be piped to tools like grep. In these cases, it becomes useful to extend the APIs.
The extensions to the APIs may involve standard query operators, which enable most browsing and search operations. These make object storage just as searchable as a database. Although the operations may enumerate the objects, there is nothing preventing an overlay of metadata about other objects in the bucket if the current metadata does not suffice.
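
For instance, here is a minimal sketch of emulating grep over a bucket with the AWS SDK for .NET; the bucket name, prefix, and search term are assumptions for illustration:

using System;
using System.IO;
using System.Threading.Tasks;
using Amazon;
using Amazon.S3;
using Amazon.S3.Model;

class ObjectGrep
{
    static async Task Main()
    {
        var client = new AmazonS3Client(RegionEndpoint.EUWest1);
        var term = "timeout"; // hypothetical search term
        var request = new ListObjectsV2Request { BucketName = "ravi-rajamani-shared", Prefix = "logs/" };
        ListObjectsV2Response response;
        do
        {
            response = await client.ListObjectsV2Async(request);
            foreach (var entry in response.S3Objects)
            {
                // stream each object and scan it line by line, like cp to stdout piped to grep
                using (var obj = await client.GetObjectAsync(request.BucketName, entry.Key))
                using (var reader = new StreamReader(obj.ResponseStream))
                {
                    string line;
                    while ((line = await reader.ReadLineAsync()) != null)
                    {
                        if (line.Contains(term))
                            Console.WriteLine(entry.Key + ": " + line);
                    }
                }
            }
            request.ContinuationToken = response.NextContinuationToken;
        } while (response.IsTruncated);
    }
}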
Another usage of object storage is that it can help virtualize storage on existing devices, local and remote, which enables the object storage to form a layer between the application and the store so that the application can conveniently move between clouds so long as the S3 interface remains the same. Most developers prefer the filesystem for the ability to save with a name and a hierarchy. In addition, some set up watches on the file systems. In object storage, we have equivalents of paths, and we can enable versioning as well as retention. As long as there are tools, SDKs and APIs promoting the object storage, we have the ability to establish it as a storage tier as popular as the filesystem. There is no longer a chore to maintain a file-system mount and the location that it points to. The storage is also virtual since it can be stretched over many virtual datacenters. Completing some tools, such as an equivalent for grep, along with SDKs and connectors will improve the usage considerably.
Perhaps a new usage for object storage would be to use it as the base for content stores such as SharePoint and InfoPath. Currently they use a database server, but most of the operations they perform are very similar to browsing an object store. Therefore, the database can be substituted in favor of object storage, keeping the content consistent regardless of where it is stored and allowing the library to be migrated with ease.

Sunday, July 22, 2018

Object storage can power not just websites with static resources but also serve as the intermediary for data in mass migrations. Today many storage appliances for backup and de-duplication transfer data between drives and their own storage tier or a hybrid vendor store. These backup and recovery items may be quite large, with sizes in terabytes. When these data transfers occur from one disk to another, they are opportunities to move the data to object storage.
It is relatively easy to use object storage as time-series buckets so that as one gets filled, data can start filling another. With the help of cold, warm and hot labels, it is easy to maintain the progression of data. This data can then serve all the search queries over the time series, just like events in a time-series database.
Today most organizations using private datacenters like to store large files and archives in those datacenters. These datastores become increasingly unmanageable or costly to manage. Object storage offers a convenient way to save the data with the added benefit of programmability via the S3 API.
Large staging areas for data transfers and migrations are seen not only in storage appliances but also in workflows involving storage. File-shares are a great example of saving data, and there is already an easy conversion of file-shares and files to buckets and objects in object storage. Since file-shares have traditionally been used for data, they now become an easy candidate for object storage.
Another area of usage for object storage is massive extract-transform-load operations such as Data Warehouse tasks. These are certainly a large source of data and are often used for analytical purposes where the results of an analysis may be used subsequently. Consequently, the data needs to be saved, and object storage can be used for this purpose.
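
A minimal sketch of staging an ETL output file to a bucket with the AWS SDK's TransferUtility follows; the local path and bucket name are assumptions for illustration:

using System.Threading.Tasks;
using Amazon;
using Amazon.S3;
using Amazon.S3.Transfer;

class EtlStaging
{
    static async Task Main()
    {
        var client = new AmazonS3Client(RegionEndpoint.EUWest1);
        var transfer = new TransferUtility(client);

        // upload the transformed output produced by the ETL step (hypothetical path and bucket)
        await transfer.UploadAsync(@"C:\Staging\sales-2018-07.csv",
                                   "warehouse-staging",
                                   "etl/sales-2018-07.csv");
    }
}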
More discussion is included here: https://1drv.ms/w/s!Ashlm-Nw-wnWtyVeqoXu7U9zEKuT
The earlier example could be modified to use buckets and objects to stash the index generated from the content. The AWS .NET SDK provides a way to save data to object storage in the cloud.

Saturday, July 21, 2018

Improving reachability of Object Storage
Object storage is highly scalable and extremely durable storage for just about any digital content you want to preserve. Typically used with static resources, it is often used for serving data over HTTP. With direct access to the data from anywhere over the internet, object storage becomes very popular.
S3 is a popular set of APIs that enables many vendors to provide object storage both on-premises and in the cloud. This set of APIs is widely used in a variety of programming languages. The rate of adoption could, however, be increased by moving upstream into more software development kits, or SDKs. Popular tools like duplicity and s3cmd already use object storage from the command line, but it is the adoption in various programmability components that will lead to increased usage of S3.
The erstwhile notion of storing data on a disk and file system no longer looks promising. With the cloud relying increasingly on object storage, both private and public clouds are increasingly looking to standardize object storage as the destination for saving their digital content. This stands in line with the expectations that the data should be durable, redundant and highly available in every geographical region.
However, most applications and software products, especially those groaning with legacy components, have not yet shaken off their dependence on the filesystem and local files. For example, configuration is often read from the local filesystem instead of from an object store. This is primarily because it has been convenient to stash data with the installation, usually on the owner's computer, and there is no solution for running applications remotely with only a partial installation or footprint on the local computer. It is either all remote deployment, as in the case of software as a service, or all local installation, as in the case of software products on the owner's compute resource. The operating system is also required to be entirely local to the device with the owner. If it could be made more modular, then it could be used with storage in the cloud, and it would set a precedent for applications to use installations that are spread out over local and remote. Most software deployments are also available as a download, usually with the help of an installer that downloads in multiple parts but with one request. The same installer then removes the reliance on remote files so that the installation can operate locally. Setup and deployment happen to be a classical software requirement that is probably most tolerant of performance as long as it works. The standalone file sourcing for setup therefore does not care whether it is local or remote. If we could get one application, or an entire operating system, to run without any care for whether it is operating locally or remotely, we could do with much more object storage.
Applications are already using remote services for powering the user experience when they are not outright served as software as a service. More and more data is saved in cloud storage. If the compute could reduce the hardware a device requires by moving not just the whole virtual machine but also parts of it, then the compute can be a mere runtime that can be hosted or ported across the devices in a personal ecosystem.
Finally, traditional concerns of personal storage will be eliminated with more adoption of technologies for saving to and reading from the cloud instead of local resources.

#codingexercise
Find the number of equality paths in a matrix, i.e. paths between adjacent cells holding equal values.
For example, the matrix
1 2
1 3
has two adjacent cells containing 1, giving the path 1, 1 forwards and backwards, in addition to each cell by itself.
We can do this with recursion and memoization:
int getCount(int[,] matrix, int[,] dp, int x, int y)
{
  if (dp[x, y] != -1)
      return dp[x, y];
  dp[x, y] = 1; // mark this cell as visited so equal neighbors do not recurse back forever
  int[] dx = new int[] { 0, 1, -1, 0 };
  int[] dy = new int[] { 1, 0, 0, -1 };
  int result = 1; // the element by itself
  for (int i = 0; i < dx.Length; i++)
  {
     int m = x + dx[i];
     int n = y + dy[i];
     if (isValid(m, n, matrix) && matrix[m, n] == matrix[x, y]) {
          result += getCount(matrix, dp, m, n);
    }
   }
  dp[x, y] = result;
  return result;
}

bool isValid(int m, int n, int[,] matrix)
{
  return m >= 0 && m < matrix.GetLength(0) && n >= 0 && n < matrix.GetLength(1);
}
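
A usage sketch, assuming the methods above are in scope: initialize the memo table to -1 and accumulate the count over every cell.

int[,] matrix = { { 1, 2 }, { 1, 3 } };
int rows = matrix.GetLength(0);
int cols = matrix.GetLength(1);
var dp = new int[rows, cols];
for (int r = 0; r < rows; r++)
    for (int c = 0; c < cols; c++)
        dp[r, c] = -1;
int total = 0;
for (int r = 0; r < rows; r++)
    for (int c = 0; c < cols; c++)
        total += getCount(matrix, dp, r, c);
Console.WriteLine(total);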