Friday, May 11, 2018

In large social networking graphs, updates arrive at a rate of about 86,400 operations per second. Even if we keep the journal in a cloud database where storage is not a constraint, fetching and updating state on the graph may take a long while. In such cases, we can take snapshots of the graph and pin them against timestamps for consistency. Next, we separate the snapshot, replay, and consistency checks as offline activities. In such cases, it becomes important to perform analytics on dynamic graphs instead of static graphs. The store-and-static-compute model worked because updates were batched and graph processing was then applied to static snapshots from different points in time. It worked so long as graph modifications were less frequent than the static processing. With dynamic graph processing, we need a new framework. One such proposed framework is GraphIn, which introduces a programming model called Incremental-Gather-Apply-Scatter. In the gather phase, incoming messages are processed and combined into one message. In the apply phase, vertices use the combined message to update their state. In the scatter phase, vertices send messages to their neighbors along their edges.

This framework divides the continuous stream of updates into fixed-size batches that are processed in the order of their arrival, as sketched below.
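To make the model concrete, here is a minimal sketch of one Incremental-Gather-Apply-Scatter superstep applied after a batch of edge updates. The Graph class, its method names, and the numeric per-vertex state are illustrative assumptions for this post, not GraphIn's actual API.

from collections import defaultdict

class Graph:
    def __init__(self):
        self.adj = defaultdict(set)      # vertex -> neighbors
        self.state = defaultdict(float)  # vertex -> per-vertex state
        self.inbox = defaultdict(list)   # vertex -> incoming messages

    def apply_batch(self, batch):
        """Apply one fixed-size batch of (op, src, dst) edge updates in arrival order."""
        for op, src, dst in batch:
            if op == "add":
                self.adj[src].add(dst)
            elif op == "remove":
                self.adj[src].discard(dst)

    def superstep(self, affected):
        # Gather: combine the incoming messages of each affected vertex into one.
        combined = {v: sum(self.inbox.pop(v, [])) for v in affected}
        # Apply: each affected vertex updates its state with the combined message.
        for v, msg in combined.items():
            self.state[v] += msg
        # Scatter: affected vertices send messages to their neighbors along their edges.
        next_affected = set()
        for v in affected:
            for n in self.adj[v]:
                self.inbox[n].append(self.state[v])
                next_affected.add(n)
        return next_affected

# Process one batch, then run a superstep over the vertices the batch touched.
g = Graph()
g.apply_batch([("add", 1, 2), ("add", 2, 3)])
frontier = g.superstep({1, 2})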

If the updates were recorded in a table as described by the tuple above, there would be a range of entries for which updates are still pending and need to be applied to the graph. The completed updates are available for analysis, and they can also be aged and archived. Although there may be billions of rows of entries, we can apply window functions with the OVER clause in SQL queries to work on the equivalent of fixed-size batches of records, but in a streaming manner.

For example: 

SELECT COUNT(*)
  OVER (PARTITION BY FLOOR(u.created / (60*60*24))) AS updates_in_day
FROM graphupdate u;

Full discussion : https://1drv.ms/w/s!Ashlm-Nw-wnWtkxOVeU-mbfydKxs 

Thursday, May 10, 2018

We were discussing incremental graph transformations:
One way to do this was to use tables for transformation rules. A graph transformation rule consists of a (LHS, RHS, NAC) where  

LHS: left hand side graph 

RHS: right hand side graph 

NAC: Negative application condition 

A match of the LHS is rewritten to the RHS, while the NAC prohibits the presence of certain objects and links. By defining the locations in the graph where the rules apply, the tables themselves go through very few changes. However, storing transformation rules with the help of metamodels and instance models requires a lot of preprocessing.

The notion of using a database to store the updates for the graph is a very powerful technique. The strategy above is quite involved in that it requires a lot of pre-processing, with the hope that subsequent processing will be incremental and cheap. This is in fact the case for existing large graphs. However, reconstructing the graph is equally feasible, just as we reconstruct even large master databases such as product catalogs.

Consider a table of updates in which each node or edge to add or delete is recorded as the tuple <Type, Operation, Src, Destination, Value, Status, Created, Modified>, where Type = node or edge, Operation = add or remove, Src and Destination are node IDs for an edge and null otherwise, and Value is the node data for node additions. The Status indicates the set of progressive states associated with the entry, such as initialized -> in progress -> completed/error, and the timestamps mark the positions on the timeline as entries arrive or are acted upon.

Each tuple corresponds to a single node or edge addition or removal and is atomic and independent of the others. The sequence of operations in the list of tuples is the order in which the graph was updated. When the operations are journaled, the journal serves as a replay log for reconstructing the graph, and a subset of this table gives the incremental changes between two states.
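As a rough sketch of how the journal doubles as a replay log, the snippet below rebuilds the graph from completed entries up to a chosen timestamp. The tuple fields mirror the description above; the GraphUpdate and replay names, and the use of Value as the node identifier, are assumptions made only for illustration.

from collections import namedtuple

# One journal entry per atomic update, mirroring the tuple described above.
GraphUpdate = namedtuple(
    "GraphUpdate",
    ["type", "operation", "src", "dst", "value", "status", "created", "modified"])

def replay(journal, upto=None):
    """Rebuild the graph from the journal; 'upto' limits replay to a timestamp,
    which yields the state of the graph at that point on the timeline."""
    nodes, edges = set(), set()
    for entry in journal:
        if upto is not None and entry.created > upto:
            break
        if entry.status != "completed":
            continue  # pending or errored entries are not part of the graph yet
        if entry.type == "node":
            if entry.operation == "add":
                nodes.add(entry.value)    # Value doubles as the node in this sketch
            else:
                nodes.discard(entry.value)
        else:  # edge
            if entry.operation == "add":
                edges.add((entry.src, entry.dst))
            else:
                edges.discard((entry.src, entry.dst))
    return nodes, edges

Replaying two different prefixes of the journal and diffing the results gives exactly the incremental changes between the two corresponding states.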

In large social networking graphs, updates arrive at a rate of about 86,400 operations per second. Even if we keep the journal in a cloud database where storage is not a constraint, fetching and updating state on the graph may take a long while. In such cases, we can take snapshots of the graph and pin them against timestamps for consistency. Next, we separate the snapshot, replay, and consistency checks as offline activities.

Wednesday, May 9, 2018

Yesterday we started discussing incremental graph transformations:
Incremental Graph Transformations: 

First, let us review the API for requesting graph transformations. It generally works in the following manner; the example below is taken from Microsoft Graph:

1. Initially, a request is made for the resource.
2. Subsequent pages of the resource are retrieved using a keyword like nextLink.
3. A final response that carries a deltaLink instead of a nextLink indicates the end of the original collection, which is usually from a snapshot.
4. The delta query is then issued with the deltaLink keyword, which enables additions, deletions, or updates to be queried.
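As a client-side sketch of this flow, the snippet below pages through the Microsoft Graph /users/delta endpoint with the requests library and then reissues the deltaLink to fetch only the changes. Authentication on the session, throttling, and error handling are omitted, and the helper names are assumptions for illustration.

import requests

GRAPH = "https://graph.microsoft.com/v1.0"

def initial_sync(session):
    """Page through the full collection; the final page carries a deltaLink."""
    url = f"{GRAPH}/users/delta"
    items, delta_link = [], None
    while url:
        page = session.get(url).json()
        items.extend(page.get("value", []))
        url = page.get("@odata.nextLink")                         # more pages of the snapshot
        delta_link = page.get("@odata.deltaLink") or delta_link   # present on the last page
    return items, delta_link

def incremental_sync(session, delta_link):
    """Reissue the deltaLink to retrieve only additions, deletions, and updates."""
    changes, url = [], delta_link
    while url:
        page = session.get(url).json()
        changes.extend(page.get("value", []))
        url = page.get("@odata.nextLink")
        delta_link = page.get("@odata.deltaLink") or delta_link
    return changes, delta_link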


The delta query needs to detect the incremental changes in the graph. 

One way to provide incremental updates is to keep track of graph transformations in materialized views with the help of transformation rules that can be pattern matched with the graph.
A graph transformation rule consists of a (LHS, RHS, NAC) where 
LHS : left hand side graph
RHS: right hand side graph
NAC: Negative application condition
A match of the LHS is rewritten to the RHS, while the NAC prohibits the presence of certain objects and links.
The rules are formed using metamodels and instance models:
The metamodel is represented by a type graph where the nodes are called classes. A class may have attributes and participates in inheritance and associations.
The instance model describes instances of the metamodel, where the nodes are objects and the edges are links.
A class is mapped to a table class(I) with a single column that stores the identifiers of the objects of the specified class.
An association is mapped to a table assoc(I,S,T) with three columns containing the identifiers of the link, source, and target objects.
An attribute is mapped to a table attr(I,V) with two columns for the object identifier and the attribute value.
By defining the locations in the graph where the rules apply, the tables themselves go through very few changes.
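A minimal sketch of that relational encoding, using an in-memory SQLite database; the Person class, the knows association, and the name attribute are hypothetical examples rather than part of any particular metamodel.

import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# class(I): identifiers of the objects of a given class, e.g. Person.
cur.execute("CREATE TABLE person (i INTEGER PRIMARY KEY)")

# assoc(I, S, T): link identifier plus source and target object identifiers.
cur.execute("CREATE TABLE knows (i INTEGER PRIMARY KEY, s INTEGER, t INTEGER)")

# attr(I, V): object identifier and attribute value.
cur.execute("CREATE TABLE person_name (i INTEGER, v TEXT)")

# A tiny instance model: two Person objects, one 'knows' link, one attribute.
cur.execute("INSERT INTO person VALUES (1), (2)")
cur.execute("INSERT INTO knows VALUES (10, 1, 2)")
cur.execute("INSERT INTO person_name VALUES (1, 'Alice')")
conn.commit()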

#codingexercise
Given two sorted arrays of vastly different lengths, what is the lowest-complexity way to find the intersection elements? Find Math.Max(start1, start2) and Math.Min(end1, end2) over the values in the two arrays. Then, for this sub-range in the smaller array, perform a binary search for each element in the longer array. If the lengths of the two arrays are comparable, sequentially traverse the elements of both arrays in sorted order, printing the elements that appear in both. A sketch of both strategies follows.
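The sketch below implements both strategies in Python; the factor used to switch between them is an arbitrary choice for illustration.

from bisect import bisect_left

def intersect_sorted(a, b):
    """Intersection of two sorted arrays, choosing the strategy by relative size."""
    if not a or not b:
        return []
    if len(a) > len(b):
        a, b = b, a  # make 'a' the smaller array
    if len(b) > 16 * len(a):
        # Vastly different lengths: binary-search each candidate from the smaller
        # array, restricted to the overlapping value range, in the larger array.
        lo, hi = max(a[0], b[0]), min(a[-1], b[-1])
        result = []
        for x in a:
            if lo <= x <= hi:
                j = bisect_left(b, x)
                if j < len(b) and b[j] == x:
                    result.append(x)
        return result
    # Comparable lengths: a merge-style sequential scan of both arrays is O(m + n).
    i = j = 0
    result = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return result

print(intersect_sorted([2, 3, 5, 8], [1, 2, 3, 4, 5, 6, 7, 8, 9]))  # [2, 3, 5, 8]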

Tuesday, May 8, 2018

Today we discuss the benefits of a jQuery plugin or JavaScript SDK for representing analytical capabilities.
We are well familiar with the analytical capabilities that come with, say, machine learning packages such as sklearn and the Microsoft ML package, or with complex reporting queries for dashboards. We have traditionally leveraged such analytical capabilities in the backend.
This works very well for several reasons:
1) The analysis can be run on Big Data. The more the data, the better the analysis. Consequently the backend and particularly the cloud services are better prepared for this task
2) The cloud services are elastic - they can pull in as much resource as needed for the execution of the queries and this works well for map-reduce processing
3) The backend is also suited to do this processing once for every client and application.  Different views and viewmodels can use the same computation
4) Performance increases dramatically when the computations are as close to the data as possible. This has been one of the arguments for pushing the machine learning package into SQL Server.
5) Such compute- and data-intensive operations are hardly required on the frontend, where the data on a given page may be very limited. Moreover, optimizations only happen when the compute and storage are consolidated in the backend, where they can be studied, cached, and replayed.
6) Complex queries can already be reduced to a few primitives which are available as query operators in both the backend and the frontend. An application or client that needs to do client-side processing of such queries has the choice of implementing it with these primitives.

Having reviewed these reasons, we may ask whether we have enough support on the client side and whether a new plugin is justified. Let us make the purpose of the plugin clear. It provides the convenience of reusing some of the common patterns seen in analysis queries across charts and dashboards in different domains. Well-known patterns such as pivot operations, as well as esoteric ones from advanced dashboarding tools, may be consolidated into a standard plugin or SDK.
The QueryBuilder plugin allows conditions to be specified, which is great for building conditions into a query. If we could also introduce piping of results into operations, that would help as well.
The QueryBuilder can either build a query or allow different queries to work in sequence, where the results of one query are piped to the next. Queries written as shell commands with piped execution are a form of the latter, while complex predicates in SQL queries are an example of the former.
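The contrast between the two styles can be sketched outside the plugin itself. The Python illustration below uses hypothetical where and select helpers and sample records; it is not the jQuery QueryBuilder API, only a picture of building a compound predicate versus piping the results of one step into the next.

records = [
    {"name": "alice", "age": 34, "city": "Seattle"},
    {"name": "bob", "age": 41, "city": "Portland"},
    {"name": "carol", "age": 29, "city": "Seattle"},
]

# Style 1: a QueryBuilder-like compound condition evaluated in a single pass.
predicate = lambda r: r["city"] == "Seattle" and r["age"] > 30
built = [r for r in records if predicate(r)]

# Style 2: piping, where each stage consumes the previous stage's results.
def where(rows, cond):
    return (r for r in rows if cond(r))

def select(rows, *fields):
    return ({f: r[f] for f in fields} for r in rows)

piped = list(select(where(records, lambda r: r["age"] > 30), "name"))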
#codingexercise https://ideone.com/kAibv2
#OpenAI has more algorithms than what the ML package and Turi implement
We will discuss incremental graph transformations next  : https://1drv.ms/w/s!Ashlm-Nw-wnWtkxOVeU-mbfydKxs

Monday, May 7, 2018

We were discussing the differences between full-text search and text analysis in modern day cloud databases.
One application of this is in the use of a code visualization tool.
Software engineering results in a high-value asset called source code, which usually comprises millions and millions of lines of instructions that computers can understand. While software engineers put a lot of effort into organizing and making sense of this code, the interpretation usually lives as tribal knowledge specific to the teams associated with chunks of the code. As engineers grow in number and rotate, this knowledge is quick to disappear and takes a lot of onboarding effort to rebuild. Manifesting this knowledge is surprisingly hard, because it often results in a large, complicated picture that is difficult to understand, comprehend, and remember. Documentation tools such as function call graphs, dependency graphs, and documentation generators have tried different angles on this problem space, but their effectiveness for engineers falls well short of the simple architecture diagrams or data and control flow diagrams that architects draw.
With the help of an index and a graph, we have all the necessary data structures to write a code visualization tool.
The key challenge in a code visualization tool is keeping the inference in sync with the code. While periodic evaluation and global updates to the inference may be feasible, we do better by persisting all relationships in a graph database. Moreover, if we apply incremental, consistent updates based on code-change triggers, the number of writes is far smaller than with global updates. This also brings us to the scope of visualization. Tools like CodeMap allow us to visualize based on the current context in the code, and the end result is a single picture. However, we do not restrict the display options to the current context or to the queries used for rendering. Instead we allow dynamic visualization depending on the queries and overlays involved. For large graphs, determining incremental updates is an interesting problem space, with examples in different domains such as software-defined networking. If we take the approach that all graph transformations can be represented by metadata in the form of rules, we can keep track of these rules and ensure that they are incremental.
By keeping the rules in a table, we can make sure the updates are incremental. The table keeps track of all the matching rules, which makes pattern matching fast. Updates to the table require very few changes because most modifications are local to the graph and the rules specify the locations.
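As a small illustration of the kind of relationships such a tool would persist in the graph, the sketch below extracts caller-callee edges from Python source with the standard ast module; the module name and the edge format are arbitrary choices for this example.

import ast

def call_edges(source, module="example"):
    """Extract (caller, callee) edges - relationships a code visualization
    tool could store in a graph database and update on code-change triggers."""
    tree = ast.parse(source)
    edges = []
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef):
            caller = f"{module}.{node.name}"
            for child in ast.walk(node):
                if isinstance(child, ast.Call) and isinstance(child.func, ast.Name):
                    edges.append((caller, child.func.id))
    return edges

sample = """
def parse(text):
    return tokenize(text)

def tokenize(text):
    return text.split()
"""
print(call_edges(sample))  # [('example.parse', 'tokenize')]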
#codingexercise https://ideone.com/yqnBiR 
Sample application: http://52.191.138.87:8668/upload/ for direct access to text summarization

Sunday, May 6, 2018

Full text search and text summarization from relational stores
Many databases in the 1990s started providing full-text search. It is a text retrieval technique that relies on searching all of the words in every document stored in a specific full-text database. Full-text search is a single strategy; it is not changeable.
Text analysis models, by contrast, are a manifestation of a strategy. We can train and test different models for the text analysis. With the migration of existing databases to the cloud, we can now import the data we want analyzed into, say, Azure Machine Learning Studio. Then we can use one or more of the strategies to evaluate what works best for us.
This solves two problems: first, we do not rely on a one-size-fits-all strategy, and second, we can continuously improve the model we train on our data by tuning its parameters.
The difference in techniques between full-text search and NLP lies largely in the data structures used. While one uses inverted document lists and indexes, the other makes use of word vectors. The latter can be used to create similarity graphs and combined with inference techniques. The word vectors can also be used with a variety of data mining techniques. With the availability of managed database services in the public clouds, we are now empowered to create NLP databases in the cloud.
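A minimal sketch of such a similarity graph, built here from TF-IDF vectors with scikit-learn and a cosine-similarity cutoff; the sample documents and the threshold are arbitrary illustrations.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "full text search uses inverted indexes",
    "word vectors support similarity and inference",
    "inverted indexes list documents per term",
]
vectors = TfidfVectorizer().fit_transform(docs)
sim = cosine_similarity(vectors)

# Keep an edge between two documents when their similarity clears the cutoff.
threshold = 0.1
edges = [(i, j, round(float(sim[i, j]), 2))
         for i in range(len(docs)) for j in range(i + 1, len(docs))
         if sim[i, j] > threshold]
print(edges)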
If we are going to operationalize any of these algorithms, we might benefit from using the strategy design pattern in implementing our service. We spend a lot of time in data science coming up with the right model and parameters, but it is always good to have full-text search as a fallback option, especially if the applications of our text analysis are going to change; a sketch of this pattern follows.
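A sketch of that strategy pattern with full-text search as the fallback; every class and method name here is hypothetical.

from abc import ABC, abstractmethod

class TextAnalysisStrategy(ABC):
    @abstractmethod
    def search(self, query, documents):
        ...

class FullTextSearch(TextAnalysisStrategy):
    def search(self, query, documents):
        # Plain keyword matching stands in for a full-text index here.
        return [d for d in documents if query.lower() in d.lower()]

class WordVectorSearch(TextAnalysisStrategy):
    def __init__(self, model):
        self.model = model  # e.g. a trained vectorizer plus a similarity scorer

    def search(self, query, documents):
        return self.model.most_similar(query, documents)

class TextService:
    def __init__(self, strategy: TextAnalysisStrategy):
        self.strategy = strategy  # swapping this swaps the analysis strategy

    def query(self, text, documents):
        return self.strategy.search(text, documents)

# Falling back to full-text search is just a strategy swap.
service = TextService(FullTextSearch())
print(service.query("graph", ["graph databases", "text analysis"]))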
The Microsoft ML package is used to push algorithms such as multiclass logistic regression into database servers. The package is available out of the box from R as well as from Microsoft SQL Server. One of its applications is text classification. In this application, the Newsgroup20 corpus is used to train the model. Newsgroup20 defines the subject and the text separately, so when the word vectors are computed they are sourced from the subject and the text separately. The model is saved to SQL Server, where it can then be used on test data. This kind of analysis works well with on-premise data. If the data lives in the cloud, such as in Cosmos DB, it must be imported into Azure Machine Learning Studio for use in an experiment. All we need is the Database ID for the name of the database to use, the DocumentDB key to be pasted in, and the Collection ID for the name of the collection. A SQL query and its parameters can then be used to filter the data from the database.
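The MicrosoftML and R workflow itself is not reproduced here, but an equivalent sketch with sklearn (mentioned earlier in this series) trains a multiclass logistic regression on the same Newsgroup20 corpus. By default the posts keep their Subject headers, so the subject and the body both contribute to the word vectors.

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train = fetch_20newsgroups(subset="train")
test = fetch_20newsgroups(subset="test")

# TF-IDF word vectors feeding a multiclass logistic regression classifier.
model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(train.data, train.target)
print("test accuracy:", model.score(test.data, test.target))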
# codingexercise added file uploader to http://shrink-text.westus2.cloudapp.azure.com:8668/add

Saturday, May 5, 2018

Enumeration of benefits of full-service and managed service solutions to Do-it-yourself machine-learning applications:
1. Full service is more than software logistics. It ensures streamlined operations of applications
2. It provides advanced level of services where none existed before
3. It complements out-of-box technologies that are otherwise inadequate for the operations
4. It provides a streamlined processing of all applications 
5. A full service can manifest as a managed service or a Software as a service so that it is not limited to specific deployments on the Customer site.
6. Full service generally has precedent: the same kind of problem may have been solved for other customers, which brings in expertise
7. Full service offers support and maintenance, although the cost of ownership may increase significantly for the customer.
8. It offers a change in perspective to cost and pricing by shifting the paradigm in favor of the customer
9. It offers scaling of infrastructure and services to meet the need for growth
10. It offers multi-faceted care for services such as backups and disaster recovery, which home-grown projects may not take care of.
11. The service level agreements are better articulated in a full-service.
12. The measurements and metrics are nicely captured enabling continuous justification of the engagement
13. The managed services offering generally improves analysis and reporting stacks.
14. Managed service model comes with a subscription and lease renewal that makes it effective to manage the lifetime of resources
15. It facilitates updates, patching, reconfiguration, and routine software deployment tasks, which frees up customers to focus on just their application development
16. Even clouds can be switched in managed services and they facilitate either a transparent chargeback or a consolidated license fee
17. The managed service and software-as-a-service options enable significant improvements to client appeal and audience reach, including mobile device support
18. A full service can leverage one or more managed services, cloud services or software as a service to provide the seamless experience that the customer needs.
19. It performs integrations of technologies as well as automation of tasks in a way that is meant to improve client satisfaction. A full service might even go beyond a solutions provider by enabling turnkey delivery with only minimal touchpoints for the customer.
20. Since customer obsession is manifested in the service engagement, it improves on what has been achieved via existing applications and services, with opportunities to grow via the addition of, say, microservices to its portfolio.
21. Sample application: http://52.191.138.87:8668/upload/ for direct access to text summarization