Wednesday, September 12, 2018

Using Machine Learning with Object Storage:
Machine learning packages such as sklearn and the Microsoft ML package, as well as complex reporting queries, say for dashboards, can also utilize object storage. These analytical capabilities are usually leveraged inside database systems, but there is no limitation preventing them from being applied over objects in object storage.
This works very well for several reasons:
1) The analysis can be run on all data at once. The more the data, the better the analysis, and object storage is one of the biggest stores possible. Consequently, the backend, and particularly the cloud services, are better prepared for this task.
2) The cloud services are elastic - they can pull in as many resources as needed for the execution of the queries, and this works well for map-reduce processing.
3) Object storage is also suited to doing this processing once on behalf of every client and application. Different views and view models can use the same computation so long as the results are part of the storage.
4) Performance increases dramatically when the computations are as close to the data as possible. This has been one of the arguments for pushing the machine learning package into SQL Server, for example.
5) Such compute- and data-intensive operations are hardly required on the frontend, where the data may be very limited on a given page. Moreover, optimizations only happen when the compute and storage are elastic, where the computations can be studied, cached, and replayed.
6) Complex queries can already be reduced to a few primitives which can be made available as query operators over object storage, leaving callers the choice to implement higher-order operations themselves using these primitives or their own custom operators, as sketched below.
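
As a sketch of point 6, a filter and a sum primitive could be exposed as query operators over whatever listing call the object storage provides, and higher-order queries composed from them. The ObjectInfo type, its fields, and the composition below are hypothetical stand-ins, not an actual object storage API.

using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical descriptor for an object returned by a listing call.
public class ObjectInfo
{
    public string Name { get; set; }
    public long Size { get; set; }
    public string Tag { get; set; }
}

public static class ObjectQueryPrimitives
{
    // Primitive: filter objects by a predicate.
    public static IEnumerable<ObjectInfo> Filter(this IEnumerable<ObjectInfo> objects, Func<ObjectInfo, bool> predicate)
        => objects.Where(predicate);

    // Primitive: total size of a set of objects.
    public static long TotalSize(this IEnumerable<ObjectInfo> objects)
        => objects.Sum(o => o.Size);

    // A higher-order query composed from the primitives:
    // total size per tag for objects larger than 1 MB.
    public static Dictionary<string, long> SizeByTag(IEnumerable<ObjectInfo> objects)
        => objects.Filter(o => o.Size > 1024 * 1024)
                  .GroupBy(o => o.Tag)
                  .ToDictionary(g => g.Key, g => g.TotalSize());
}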
#codingexercise
Determine if a sum makes a perfect number. A perfect number is one whose factors add up exactly to the number.
bool IsPerfectSum(ref List<int> factors, int sum)
{
    // the factors must be consumed exactly when the sum reaches zero
    if (sum == 0 && factors.Count > 0) return false;
    if (sum == 0) return true;
    if (sum < 0) return false;
    if (factors.Count == 0 && sum != 0) return false;
    // sum > 0 and factors.Count > 0
    var last = factors[factors.Count - 1];
    factors.RemoveAt(factors.Count - 1);
    if (last > sum)
    {
        return false;
    }
    return IsPerfectSum(ref factors, sum - last);
}

Tuesday, September 11, 2018

Object versioning versus naming conventions
Introduction: When an object is created, it has a lifetime and purpose. If it undergoes modifications, the object is no longer the same. Even though the modified object may serve the same purpose, it will be a different entity.  These objects can co-exist either as separate versions or with separate names.
In object storage we have a similar concept. When we enable versioning, we have the option to track changes to an object. Any modification of the object via upload increments its version. This follows the "copy-on-write" principle. Generally, in-place editing of an object is not recommended. There are several reasons for this. First, an object may be considered a sequence of bytes. If we overwrite a byte range that is yet to be read, the reader may not know what the original object was. Second, the size of the byte range may shrink or expand on editing. The caller may never be able to use the size as an attribute of the object if it keeps changing. Third, just as we have caveats with memcpy operations on byte ranges, we exercise similar caution on the changes for readers and writers. If we had made a copy, the readers could continue to read the old copy without any concern about writers, and vice versa. The changes to the object could also leave the object in an inconsistent or unusable state. Therefore, editing an object is not preferred. Unless it is done for debugging or other forms of forensics or reverse engineering, an object is best served by a new version. Versioning is automatic, and it is possible to go forward or backward between versions. The versioning may even come with descriptions that talk about each version. All versions of the object have the same name. When an object is referred to by its name, the latest version is retrieved.
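
As a minimal in-memory sketch of this copy-on-write behaviour (an illustration, not an object storage API), every upload appends a fresh copy and the name always resolves to the latest version:

using System;
using System.Collections.Generic;
using System.Linq;

// In-memory illustration of versioned objects: every upload appends a copy,
// and a name always resolves to the latest version.
public class VersionedStore
{
    private readonly Dictionary<string, List<byte[]>> _versions = new Dictionary<string, List<byte[]>>();

    // Upload never edits in place; it appends a fresh copy (copy-on-write).
    public int Upload(string name, byte[] content)
    {
        if (!_versions.ContainsKey(name))
            _versions[name] = new List<byte[]>();
        _versions[name].Add((byte[])content.Clone());
        return _versions[name].Count - 1;   // the new version number
    }

    // Referring to the object by name retrieves the latest version.
    public byte[] GetLatest(string name) => _versions[name].Last();

    // Going backward or forward is just picking a version number.
    public byte[] GetVersion(string name, int version) => _versions[name][version];
}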
When objects have different names, the names may follow patterns for their organization. When there are a large number of objects, having a prefix and a naming convention is standard practice in many Information Technology departments. These names with wildcard patterns can then be used in search commands that span some or all of these objects. This may not be easy to do with past versions unless the search command iterates over all versions of an object. Moreover, copies of the same object with different filenames can each undergo different modifications and maintain their own version history. The names may also involve tags that let objects be grouped, ranked and sorted.
There are other forms of modification that are not covered by the techniques above. For example, there is an edit-original-and-copy-later technique where two copies are maintained simultaneously and editing one copy is allowed, with restores made from the other copy. An undo-like technique is also possible, where the incremental changes are captured and undone by replacing them with copies of the originals. In fact, all in-place editing can be handled automatically with the help of some form of rollback behaviour involving either discarding the writes or overwriting with the original, as sketched below. Locking and logging are two popular techniques to help with atomicity, consistency, isolation and durability.
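
A small sketch of that rollback idea, with a plain dictionary standing in for the store: keep a copy of the original before the in-place edit and restore it if the edit fails.

using System;
using System.Collections.Generic;

public static class InPlaceEdit
{
    // Edit an object in place, but keep a copy of the original and roll back
    // if the edit throws. The dictionary here is just a stand-in for the store.
    public static void EditWithRollback(IDictionary<string, byte[]> store, string name, Func<byte[], byte[]> edit)
    {
        var original = (byte[])store[name].Clone();   // copy before editing
        try
        {
            store[name] = edit(store[name]);          // in-place edit
        }
        catch
        {
            store[name] = original;                   // discard the writes
            throw;
        }
    }
}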

Monday, September 10, 2018

Object Storage as a query store
Introduction: Users are able to search and query files in a file system or unstructured data stores. Object storage is not only a replacement for file storage but is also an unstructured data store promoting enumeration of objects with a simple namespace, bucket and object hierarchy. This article looks at enabling not just querying over Object Storage but also search and mining techniques.
Description:
1) Object Storage as a SQL store:
This technique utilizes a SQL engine over enumerables:

Object Storage data is searchable as a COM input to Log Parser. A COM input simply implements a few methods for the log parser and abstracts the data store. These methods are:
OpenInput: opens your data source and sets up any initial environment settings
GetFieldCount: returns the number of fields that the plugin provides
GetFieldName: returns the name of a specified field
GetFieldType: returns the datatype of a specified field
GetValue: returns the value of a specified field
ReadRecord: reads the next record from your data source
CloseInput: closes the data source and cleans up any environment settings
Here we are saying that Object Storage acts as a data store for COM input to log parser which can then be queried in SQL for the desired output.
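
A skeleton of such a plugin is sketched below. The seven method names come from the list above; the ListObjects enumeration and the Name/Size fields are hypothetical, the type codes are placeholders, and the COM registration details and exact signatures that Log Parser expects are not reproduced here.

using System;
using System.Collections.Generic;
using System.Runtime.InteropServices;

// Sketch of a COM input plugin that lets Log Parser enumerate objects
// from an object store as records with two fields: Name and Size.
[ComVisible(true)]
public class ObjectStorageInput
{
    private IEnumerator<KeyValuePair<string, long>> _objects;
    private KeyValuePair<string, long> _current;

    // Hypothetical listing call against the object store.
    private static IEnumerable<KeyValuePair<string, long>> ListObjects(string bucket)
    {
        yield return new KeyValuePair<string, long>("sample-object", 1024);
    }

    public void OpenInput(string bucket) => _objects = ListObjects(bucket).GetEnumerator();

    public int GetFieldCount() => 2;

    public string GetFieldName(int index) => index == 0 ? "Name" : "Size";

    // Placeholder type codes; a real plugin maps these to Log Parser's field-type constants.
    public int GetFieldType(int index) => index == 0 ? 1 : 2;

    public object GetValue(int index) => index == 0 ? (object)_current.Key : (object)_current.Value;

    public bool ReadRecord()
    {
        if (_objects == null || !_objects.MoveNext()) return false;
        _current = _objects.Current;
        return true;
    }

    public void CloseInput(bool abort) => _objects?.Dispose();
}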
There are two different forms of expression enabled by such SQL queries.
First - this expression is in the form of standard query operators that became popular across languages, such as .Where() and .Sum() in LINQ. These are tried, tested and well-established SQL query language features. The language makes it easy to express queries in a succinct manner, often enabling all aspects of data manipulation to refine and improve result sets.
The second form of expression is the search query language, which has had a rich history in shell scripting and log analysis, where the results of one command are piped into another command for transformation and analysis. Although similar in nature to chaining operators, expressions of this form involve more search-like capabilities across heterogeneous data, such as the use of regular expressions for detecting patterns. This form of expression not only involves powerful query operators but also facilitates data extract, transform and load as the data makes its way through the expressions.
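
A small sketch of the contrast, assuming the object contents have already been read into an enumeration of log lines: the first method composes standard query operators, while the second mimics a search-style pipeline with a regular-expression stage followed by a projection.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class TwoQueryForms
{
    // First form: standard query operators chained LINQ-style.
    public static int CountLongLines(IEnumerable<string> lines)
        => lines.Where(l => l.Length > 80).Count();

    // Second form: a search-style pipeline where each stage transforms the stream,
    // here a regex "grep" followed by a field extraction, mimicking command | grep | cut.
    public static IEnumerable<string> GrepAndExtract(IEnumerable<string> lines, string pattern)
        => lines.Where(l => Regex.IsMatch(l, pattern))
                .Select(l => l.Split(' ').First());
}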


2) Object Storage as a search index:


Here we utilize the contents of the objects to build an index. The index may reside in a database, but there is no restriction against storing it as objects in the object store if the performance is tolerable.

Sample program available here: https://github.com/ravibeta/csharpexamples/blob/master/SourceSearch/SourceSearch/SourceSearch/Program.cs
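
Independent of the linked sample, a minimal sketch of such an index is shown below: object contents are tokenized and each term is mapped to the names of the objects that contain it. The contents dictionary stands in for whatever read API the object store exposes.

using System;
using System.Collections.Generic;

public static class ObjectIndexer
{
    // Build an inverted index: term -> names of objects whose content contains it.
    // contentsByName stands in for reading object contents from the store.
    public static Dictionary<string, HashSet<string>> BuildIndex(IDictionary<string, string> contentsByName)
    {
        var index = new Dictionary<string, HashSet<string>>(StringComparer.OrdinalIgnoreCase);
        foreach (var entry in contentsByName)
        {
            var terms = entry.Value.Split(new[] { ' ', '\t', '\r', '\n', '.', ',' }, StringSplitOptions.RemoveEmptyEntries);
            foreach (var term in terms)
            {
                if (!index.TryGetValue(term, out var names))
                    index[term] = names = new HashSet<string>();
                names.Add(entry.Key);
            }
        }
        return index;
    }

    // Lookup returns the objects containing a term, or an empty set.
    public static IEnumerable<string> Lookup(Dictionary<string, HashSet<string>> index, string term)
        => index.TryGetValue(term, out var names) ? names : new HashSet<string>();
}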
3) Object Storage for deep learning:


Here we utilize the tags associated with the objects, which may be assigned once when the content is classified. The operators used here can be expanded to include more involved forms of computation such as grouping, ranking and sorting.


There are three scenarios for showcasing the breadth of the query language: a cognitive example, a text analysis example and a JSON processing example.
The cognitive example identifies objects in images. This kind of example shows how the entire image processing on image files can be treated as custom logic and used with the query language. As long as we define the objects, the input and the logic to analyze the objects, it can be made part of the query to extract the desired output dataset.
The text analysis example is similar in that we can extract the text prior to performing the analysis. It is interesting to note that the classifier used to tag the text can be written in the R language and is not dependent on the query.
JSON processing is another example that is referenced often, probably because extract-transform-load has become important in analytical processing, whether in a cloud data warehouse or in big data operations. This "schema later" approach is popular because it decouples producers and consumers, which saves coordination and time-consuming operations between, say, departments. In all three scenarios, object storage can work effectively as the storage layer.
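
As a minimal sketch of the JSON scenario, each object's JSON content can be parsed at read time and only the fields of interest projected out, deferring the schema to the reader. This assumes the Newtonsoft.Json package, and the field names are placeholders.

using System.Collections.Generic;
using System.Linq;
using Newtonsoft.Json.Linq;

public static class JsonProjection
{
    // "Schema later": each object's JSON content is parsed at read time and
    // only the fields of interest are projected out. Field names are placeholders.
    public static IEnumerable<KeyValuePair<string, string>> ProjectNameAndCity(IEnumerable<string> jsonContents)
        => jsonContents.Select(JObject.Parse)
                       .Select(doc => new KeyValuePair<string, string>(
                           (string)doc["name"],
                           (string)doc["address"]?["city"]));
}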
Conclusion:
All aspects of a universal query language may be applicable to object stores just as if the content was available from file or document stores.

Furthermore, the metadata and the indexes may be stored in dedicated objects within the object storage.

Sunday, September 9, 2018

We were discussing a connector to object storage.
The workflows involving use of the backup data do not require modifications of data objects, and this makes object storage perfect for them. On the other hand, applications which are more read- and write-intensive and do not tolerate latency are not suited for object storage. As we progress from online transactions to analytical processing to object storage, the data is less susceptible to modifications. This does not mean we use Object Storage merely for backup and recovery. Compared to other storage, Object Storage provides immense scalability. This is true even if the source data stores are already in the cloud, because those data stores are primarily focused on transactional and analytics purposes while object storage works exclusively as a store. Moreover, there is very little hierarchy in storing data. Namespaces, buckets and objects serve as enough organizational tools for the bulk of the storage. This removes the restrictions, and the data can be stored on limitless storage. Moreover, object storage can be file-system enabled. Finally, there's plenty of cost-effectiveness to this storage, which makes it more appealing than other forms of storage.
Object Storage can also be a SQL store.

Object Storage data is searchable as a COM input to Log Parser. A COM input simply implements a few methods for the log parser and abstracts the data store. These methods are:
OpenInput: opens your data source and sets up any initial environment settings
GetFieldCount: returns the number of fields that the plugin provides
GetFieldName: returns the name of a specified field
GetFieldType: returns the datatype of a specified field
GetValue: returns the value of a specified field
ReadRecord: reads the next record from your data source
CloseInput: closes the data source and cleans up any environment settings
Architecture:
Here we are saying that Object Storage acts as a data store for COM input to log parser which can then be queried in SQL for the desired output.

Saturday, September 8, 2018

The bridge between relational and object storage   

We review the data transfer between relational and object storage. We argue that the natural migration of relational data to object storage is via a data warehouse.    We further state that this bridge is more suited for star schema design as input than other forms of structured data. Finally, all three data stores may exist in the cloud as virtual stores and can each be massively scalable.

Role of Object Storage: 

Object storage is very good for durability of data. Virtual data warehouses often directly use the S3 API to stash interim analytical byproducts. The nature of these warehouses is that they accumulate data over time for analytics. Since the S3 APIs are a popular programmatic means to store files in object storage, most of the data in the warehouses may often be translated to smaller files or blobs and then stored in object storage. A connector works similarly, but when the tables are in a star schema, it is easier to migrate the tables. The purpose of the connector is to move data quickly and steadily between the source as a structured store and the destination as object storage. The connector therefore needs to fragment the data and upload it in multiple parts into the object storage, and the S3 API's multipart upload feature is a convenience here. The connector merely automates these transfers, which would otherwise have been pushed rather than pulled by the object storage. This difference is enormous in terms of convenience to stash and reusability of the objects for reading. The automation may also be flexible enough to store the fragments of data in the form most likely to be used during repeated analytics. The star schema is a way of saying that there are many tables and they are joined to a central table for easier growth and dimensional analysis. This makes the data stand out as independent dimensions or facts which can be migrated in parallel. In addition to the programmatic access for stashing, object storage is widely popular over conventional file storage, making it all the more appealing for durable stashes or full copies of data. While transactional data may have highly normalized tables, there is nothing preventing it from being translated into an unraveled star schema. Besides, these tables are already part of the chain by which transactional data ends up in a warehouse. The object storage does not have to concern itself with the data store directly and can exist and operate without any impact to the warehouse, and the object storage may well be a lot bigger than the database.
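
A minimal sketch of the multipart upload such a connector could lean on is shown below, using the AWS SDK for .NET S3 client; the bucket, key and file path are caller-supplied placeholders, and error handling and abort logic are omitted.

using System;
using System.Collections.Generic;
using System.IO;
using System.Threading.Tasks;
using Amazon.S3;
using Amazon.S3.Model;

public static class MultipartStash
{
    // Upload a large warehouse extract to object storage in fragments.
    public static async Task UploadInPartsAsync(IAmazonS3 client, string bucket, string key, string filePath)
    {
        const long partSize = 5 * 1024 * 1024;   // 5 MB minimum part size
        var init = await client.InitiateMultipartUploadAsync(
            new InitiateMultipartUploadRequest { BucketName = bucket, Key = key });

        var partResponses = new List<UploadPartResponse>();
        long fileLength = new FileInfo(filePath).Length;
        int partNumber = 1;
        for (long offset = 0; offset < fileLength; offset += partSize, partNumber++)
        {
            partResponses.Add(await client.UploadPartAsync(new UploadPartRequest
            {
                BucketName = bucket,
                Key = key,
                UploadId = init.UploadId,
                PartNumber = partNumber,
                PartSize = Math.Min(partSize, fileLength - offset),
                FilePath = filePath,
                FilePosition = offset
            }));
        }

        var complete = new CompleteMultipartUploadRequest { BucketName = bucket, Key = key, UploadId = init.UploadId };
        complete.AddPartETags(partResponses);   // collect the ETags of the uploaded parts
        await client.CompleteMultipartUploadAsync(complete);
    }
}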

With this context, let us consider how to write such a connector. In this regard, we are lured by the dimensions of the snowflake design being translated into objects. There are only two aspects to be concerned about. The first is the horizontal and vertical partitioning of a dimension, and the second is the order and coordination of transfers from all the dimensions. The latter may be done in parallel and with the expertise of database migration wizards. Here the emphasis is on the checklist of migrations so that the transfer is merely in the form of archival. Specifically, in transferring a table we repeat the following steps: while there are records to be migrated between the source and the destination, select a record and check if it exists in the destination. If it doesn't, we write it as a key-value and repeat the process. We make this fail-proof by checking at every step. Additionally, where the source is a staging area, we may choose to delete from the staging so that we can keep it trimmed. This technique is sufficient to write data in the form of key-values, which are then more suited for object storage, as sketched below.
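
A sketch of that loop follows, with hypothetical IRecordSource and IObjectDestination interfaces standing in for the staging table and the object store: each record is skipped if the destination already has it, written as a key-value otherwise, and optionally trimmed from the staging area.

using System.Collections.Generic;

// Hypothetical source (e.g., a staging table) and destination (an object store).
public interface IRecordSource
{
    IEnumerable<KeyValuePair<string, byte[]>> Records();
    void Delete(string key);
}

public interface IObjectDestination
{
    bool Exists(string key);
    void Put(string key, byte[] value);
}

public static class Connector
{
    // Migrate records one at a time: skip what already exists, write the rest
    // as key-values, and optionally trim the staging source after each record.
    public static void Migrate(IRecordSource source, IObjectDestination destination, bool trimStaging)
    {
        foreach (var record in source.Records())
        {
            if (!destination.Exists(record.Key))
            {
                destination.Put(record.Key, record.Value);
            }
            if (trimStaging)
            {
                source.Delete(record.Key);
            }
        }
    }
}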

Conclusion:

Data is precious and almost all platforms compete for a slice of the data. Data is sticky so the platform and tools built around the data increase in size. Object Storage is a universally accepted store for its purpose and the connector improves the migration of this data to the store. Although the word migration is used to indicate transfer, it does not necessarily mean taking anything away from the origin.

Friday, September 7, 2018

We were discussing the stochastic gradient method for text mining.
Stochastic gradient descent also hopes to avoid the batch processing and the repeated iterations over the batch that are needed to refine the gradient. The overall cost function is written with a term that measures how well the hypothesis holds on the current data point. To fit the next data point, the parameter is modified. As it scans the data points, it may wander off but eventually reaches near the optimum. The iteration over the scan is repeated so that the dependence on the sequence of data points is eliminated. For this reason, the data points are initially shuffled, which goes along the lines of a bag of word vectors rather than the sequence of words in a narrative. Generally, the overall iterations over all data points are restricted to a small number, say 10, and the algorithm makes improvements to the gradient from one data point to another without having to wait for the whole batch.
To utilize this method for text mining, we first perform feature extraction on the text. Then we use the stochastic gradient descent method as a classifier to learn the distinct features, and finally we measure the performance using the F1 score. Again, vectorizer, classifier and evaluator become the three-stage processing in this case.
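
A minimal sketch of the classifier stage, assuming feature extraction has already produced fixed-length bag-of-words vectors with binary labels: logistic regression trained by stochastic gradient descent, one shuffled data point at a time, for a small number of epochs.

using System;
using System.Linq;

public static class SgdTextClassifier
{
    // Train logistic-regression weights by stochastic gradient descent:
    // shuffle the data points, then update the weights one point at a time
    // for a small number of epochs (no batch accumulation).
    public static double[] Train(double[][] features, int[] labels, int epochs = 10, double learningRate = 0.1)
    {
        var rng = new Random(42);
        var weights = new double[features[0].Length];

        for (int epoch = 0; epoch < epochs; epoch++)
        {
            // Shuffle so the updates do not depend on the order of the narrative.
            var order = Enumerable.Range(0, features.Length).OrderBy(_ => rng.Next()).ToArray();
            foreach (var i in order)
            {
                double z = 0;
                for (int j = 0; j < weights.Length; j++) z += weights[j] * features[i][j];
                double prediction = 1.0 / (1.0 + Math.Exp(-z));   // sigmoid hypothesis
                double error = labels[i] - prediction;            // how the hypothesis holds on this point
                for (int j = 0; j < weights.Length; j++)
                    weights[j] += learningRate * error * features[i][j];
            }
        }
        return weights;
    }
}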
We explore regions-of-interest detection using a neural net. Here we begin with a feature map. A residual learning network can take the initial sample as input and output feature maps where each feature map is specialized in the detection of a topic. If the model allows backpropagation for training and inference, we could simultaneously detect topics, which are location independent, as well as their positions of occurrence.

With the feature map, two fully connected layers are formed: one for box regression and another for box classification.
The bounding boxes are the proposals. Each box from the feature map is evaluated once for regression and once for classification.
The classifier detects the topic, and the regressor adjusts the coordinates of the bounding box.
We mentioned that the text is a bag of words and we don't have the raster data that is typical of image data. The notion here is merely the replacement of a box with a bag so that we can propose different clusters. If a particular bag emphasizes one and only one cluster, then it is said to have detected a topic. The noise is avoided or may even form its own cluster.

Thursday, September 6, 2018

The streaming algorithms are helpful so long as the incoming stream is viewed as a sequence of word vectors. However, word vectorization itself must happen prior to clustering. Some view this as a drawback of this class of algorithms, because word vectorization is done with the help of neural nets and a softmax classifier over the entire document, and there are ways to use different layers of the neural net to form regions of interest. To date there has been no application of a hybrid mix of detecting regions of interest in a neural net layer with the help of stream-based clustering. There is, however, a way to separate the stage of word vector formation from the stream-based clustering if all the words have previously well-known word vectors that can be looked up from something similar to a dictionary.

The kind of algorithms that work on a stream of data points are not restricted to the above algorithms. They could involve a cost function. For example, stochastic gradient descent also hopes to avoid the batch processing and the repeated iterations over the batch that are needed to refine the gradient. The overall cost function is written with a term that measures how well the hypothesis holds on the current data point. To fit the next data point, the parameter is modified. As it scans the data points, it may wander off but eventually reaches near the optimum. The iteration over the scan is repeated so that the dependence on the sequence of data points is eliminated. For this reason, the data points are initially shuffled, which goes along the lines of a bag of word vectors rather than the sequence of words in a narrative. Generally, the overall iterations over all data points are restricted to a small number, say 10, and the algorithm makes improvements to the gradient from one data point to another without having to wait for the whole batch.

To utilize this method for text mining, we first perform feature extraction on the text. Then we use the stochastic gradient descent method as a classifier to learn the distinct features, and finally we measure the performance using the F1 score. Again, vectorizer, classifier and evaluator become the three-stage processing in this case.