Monday, May 7, 2018

We were discussing the differences between full-text search and text analysis in modern-day cloud databases.
One application of this is a code visualization tool.
Software engineering produces a high-value asset called source code, which usually comprises millions and millions of lines of instructions that computers can understand. While software engineers put a lot of effort into organizing and making sense of this code, that interpretation usually lives as tribal knowledge specific to the teams that own chunks of the code. As engineers grow in number and rotate, this knowledge disappears quickly and takes a lot of onboarding effort to rebuild. Manifesting this knowledge is surprisingly hard, because it often results in a large, complicated picture that is difficult to understand, comprehend and remember. Documentation tools such as function call graphs, dependency graphs and documentation generators have tried different angles on this problem space, but their usefulness to engineers falls well short of the simple architecture diagrams or data and control flow diagrams that architects draw.
With the help of an index and a graph, we have all the necessary data structures to write a code visualization tool.
The key challenge in a code visualization tool is keeping the inference in sync with the code. While periodic evaluation and global updates to the inference may be feasible, we do better by persisting all relationships in a graph database. Moreover, if we apply incremental, consistent updates based on code-change triggers, the number of writes is far smaller than with global recomputation. This also brings us to the scope of visualization. Tools like CodeMap allow us to visualize based on the current context in the code, and the end result is a single picture. However, we reserve the display options to be limited neither to the context nor to the queries used for rendering. Instead we allow dynamic visualization depending on the queries and overlays involved. For large graphs, determining incremental updates is an interesting problem space, with examples in domains such as software-defined networking. If we take the approach that all graph transformations can be represented by metadata in the form of rules, we can keep track of these rules and ensure that they are incremental.
By keeping the rules in a table, we can make sure the updates are incremental. The table keeps track of all the matching rules, which makes pattern matching fast. Updates to the table require very few changes because most modifications are local to the graph and the rules mention the locations, as the sketch below illustrates.
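As a rough illustration, here is a minimal Python sketch of such a rule table; the Rule and RuleTable names, the location keys and the change payload are all hypothetical, chosen only to show that indexing rules by graph location keeps each code-change trigger's update local:

from collections import defaultdict

class Rule:
    """A graph transformation rule scoped to a location (e.g. a module or file)."""
    def __init__(self, name, location, apply_fn):
        self.name = name
        self.location = location      # the part of the graph this rule touches
        self.apply_fn = apply_fn      # callable(graph, change) -> None

class RuleTable:
    """Indexes rules by location so a code-change trigger only replays local rules."""
    def __init__(self):
        self.by_location = defaultdict(list)

    def register(self, rule):
        self.by_location[rule.location].append(rule)

    def on_change(self, graph, change):
        # Only the rules registered at the changed location fire,
        # which keeps the graph update incremental rather than global.
        for rule in self.by_location[change["location"]]:
            rule.apply_fn(graph, change)

# Usage: fire the rules for one file instead of recomputing the whole graph.
table = RuleTable()
table.register(Rule("update-call-edges", "src/parser.py",
                    lambda g, c: g.setdefault(c["location"], []).append(c["edge"])))
graph = {}
table.on_change(graph, {"location": "src/parser.py", "edge": ("parse", "tokenize")})
print(graph)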
#codingexercise https://ideone.com/yqnBiR 
Sample application: http://52.191.138.87:8668/upload/ for direct access to text summarization

Sunday, May 6, 2018

Full text search and text summarization from relational stores
Many databases in the 1990s started providing full-text search techniques. It was a text retrieval technique that relied on searching all of the words in every document stored in a specific full-text database. Full-text search is a single, fixed strategy; it cannot be changed or retrained.
Text analysis models, by contrast, are a manifestation of strategy: we can train and test different models in the text analysis. With the migration of existing databases to the cloud, we can now import the data we want analyzed into, say, Azure Machine Learning Studio. Then we can use one or more of the strategies to evaluate what works best for us.
This solves two problems: one, we do not rely on a one-size-fits-all strategy, and two, we can continuously improve the model we train on our data by tuning its parameters.
The difference in techniques between full-text search and NLP lies largely in the data structures used. While one uses inverted document lists and indexes, the other makes use of word vectors. The latter can be used to create similarity graphs and combined with inference techniques. The word vectors can also be used with a variety of data mining techniques. With the availability of managed-service databases in the public clouds, we are now empowered to create NLP databases in the cloud.
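To make the data-structure contrast concrete, here is a toy Python sketch; the documents and the hand-rolled three-dimensional "word vectors" are illustrative stand-ins, since a real system would use an index built by the database and trained embeddings:

import math

docs = {1: "cloud database search", 2: "text analysis model"}

# Full-text style: an inverted index mapping each word to the documents containing it.
inverted = {}
for doc_id, text in docs.items():
    for word in text.split():
        inverted.setdefault(word, set()).add(doc_id)
print(inverted["search"])          # exact word lookup -> {1}

# NLP style: word vectors compared by cosine similarity (toy 3-d vectors).
vectors = {"search": [0.9, 0.1, 0.0], "retrieval": [0.8, 0.2, 0.1], "model": [0.0, 0.9, 0.4]}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Unlike the index, vectors can score near-synonyms that share no characters.
print(cosine(vectors["search"], vectors["retrieval"]))  # high similarity
print(cosine(vectors["search"], vectors["model"]))      # low similarity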
If we are going to operationalize any of these algorithms, we might benefit from using the strategy design pattern in implementing our service. We spend a lot of time in data science coming up with the right model and parameters, but it is always good to have full-text search as a fallback option, especially if we are going to be changing the applications of our text analysis.
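A minimal sketch of that strategy pattern, assuming hypothetical FullTextSearch and ModelSummarizer strategies (the class and method names are illustrative, not an existing API):

from abc import ABC, abstractmethod

class TextAnalysisStrategy(ABC):
    @abstractmethod
    def analyze(self, text: str) -> str: ...

class FullTextSearch(TextAnalysisStrategy):
    """Fallback strategy: always available, no training required."""
    def analyze(self, text: str) -> str:
        return "fulltext:" + text.split()[0]

class ModelSummarizer(TextAnalysisStrategy):
    """Trained-model strategy that may fail if the model is stale or unavailable."""
    def analyze(self, text: str) -> str:
        raise RuntimeError("model not loaded")

class TextService:
    def __init__(self, primary, fallback):
        self.primary, self.fallback = primary, fallback

    def analyze(self, text):
        try:
            return self.primary.analyze(text)
        except RuntimeError:
            # Swap strategies at runtime instead of changing the service.
            return self.fallback.analyze(text)

service = TextService(ModelSummarizer(), FullTextSearch())
print(service.analyze("cloud databases and text analysis"))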
The Microsoft ML package is used to push algorithms such as multiclass logistic regression into the database server. This package is available out of the box from R as well as Microsoft SQL Server. One of its applications is text classification. In this application the Newsgroup20 corpus is used to train the model. Newsgroup20 defines the subject and the text separately, and when the word vectors are computed they are sourced from the subject and the text separately. The model is saved to SQL Server, where it can then be used on test data. This kind of analysis works well with on-premise data. If the data lives in the cloud, such as in Cosmos DB, it has to be imported into Azure Machine Learning Studio for use in an experiment. All we need is the Database ID for the name of the database to use, the DocumentDB key to be pasted in, and the Collection ID for the name of the collection. A SQL query and its parameters can then be used to filter the data from the database.
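For readers without SQL Server at hand, a rough open-source analogue of that experiment (scikit-learn, not the Microsoft ML API itself) trains a multiclass logistic regression on the same twenty-newsgroups corpus, with headers such as the subject contributing features alongside the body:

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Headers (including Subject:) are kept by default, so subject and body both contribute features.
train = fetch_20newsgroups(subset="train")
test = fetch_20newsgroups(subset="test")

model = make_pipeline(
    TfidfVectorizer(max_features=20000),
    LogisticRegression(max_iter=1000),  # handles the 20 classes multinomially
)
model.fit(train.data, train.target)
print("test accuracy:", model.score(test.data, test.target))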
# codingexercise added file uploader to http://shrink-text.westus2.cloudapp.azure.com:8668/add

Saturday, May 5, 2018

Enumeration of the benefits of full-service and managed-service solutions over do-it-yourself machine-learning applications:
1. Full service is more than software logistics. It ensures streamlined operations of applications
2. It provides advanced level of services where none existed before
3. It complements out-of-box technologies that are otherwise inadequate for the operations
4. It provides streamlined processing of all applications
5. A full service can manifest as a managed service or a Software as a service so that it is not limited to specific deployments on the Customer site.
6. Full-Service generally has precedence and the same kind of problem may have been solved for other customers which brings in expertise
7. Full service offers support and maintenance, even though the cost of ownership may increase significantly for the customer.
8. It offers a change in perspective to cost and pricing by shifting the paradigm in favor of the customer
9. It offers scaling of infrastructure and services to meet the need for growth
10. It offers multi-faceted care for services such as backups and disaster management, which home-grown projects may not take care of.
11. The service level agreements are better articulated in a full-service.
12. The measurements and metrics are nicely captured enabling continuous justification of the engagement
13. The managed services offering generally improves analysis and reporting stacks.
14. The managed-service model comes with subscription and lease renewals that make it effective to manage the lifetime of resources
15. It facilitates updates, patching, reconfiguration and routine software deployment tasks, freeing customers up to focus on just their application development
16. Even clouds can be switched in managed services and they facilitate either a transparent chargeback or a consolidated license fee
17. Managed service and software as a service option enables significant improvements to client appeal and audience including mobile device support
18. A full service can leverage one or more managed services, cloud services or software as a service to provide the seamless experience that the customer needs.
19. It performs integrations of technologies as well as automation of tasks in a way that is meant to improve client satisfaction. A full service might even go beyond a solutions provider by enabling turnkey delivery with only minimal touchpoints with the customer.
20. Since customer obsession is manifested in the service engagement, it improves on what existing applications and services achieve, with opportunities to grow via the addition of, say, microservices to its portfolio.
21. Sample application: http://52.191.138.87:8668/upload/ for direct access to text summarization

Friday, May 4, 2018

Introduction:  
We were discussing full-service options for NLP programming for any organization. I used an example to discuss points in favor of such a service: http://shrink-text.westus2.cloudapp.azure.com:8668/add. Here we try to illustrate that a file uploader, for instance, is an improvement over raw text processing.
There are several reasons that full service is more appealing than out-of-box capabilities. For example, the connectors to data sources will need to be authored. Automation and scheduling of jobs and intakes are going to be necessary for continuous processing. Error handling, reports and notifications will be required for administration purposes anyway.
Full-service options also involve compute and storage handling for all the near- and long-term needs surrounding the processes for the NLP task involved. Artifacts produced as a result of the processing may need to be archived. Aged artifacts may need to be tiered. Retrieval systems can be built on top of the collections made so far. At any time, a full-service solution at the very least provides answers to questions that generally take a lot of effort with boxed solutions.
Moreover, it is not just the effort involved with out-of-box features; it is the complete ownership of associated activities and the convenience brought into the picture. The availability of queuing services and asynchronous processing for all backlogs adds more value to the full service. Reports and dashboards become more meaningful with full-service solutions. The impact on, and the feedback from, the audience improves with a full-service solution. A full-service solution goes a long way toward improving customer satisfaction.
#codingexercise
Tests for https://ideone.com/6ms4Vz

Thursday, May 3, 2018

Introduction:  
The previous article introduced some of the essential steps in getting a small, lightweight eCommerce website up and running in minimal time. We mentioned that the users of this website may need to be recognized, their transactions on the website may need to be remembered, and the processing for each order may need to be transparent to the user. This works well for in-house software development across industries for a variety of applications using off-the-shelf products, cloud services, and development frameworks and tools. Web applications and services fall into this category. Most software engineering development in industries such as finance, retail, telecommunications and insurance has a significant amount of domain expertise behind it, and picking the right set of tools, resources and processes is easy for the business sponsors and their implementors.
However, when the domain remains the same but we apply new computational capabilities that require significant new knowledge and expertise, such as machine learning, then it is slower to onboard new team members and expect them to realize the applications. In such cases, the machine learning toolkit provider may only be able to put out samples, community news and updates. The companies are then best served by a white-glove service that not only brings in the expertise but also delivers on the execution. First, it reduces the time to implementation because the skills, resources and best practices are available. Second, the challenges are no longer unknown, having been dealt with earlier at other companies. Together these argue for specialized consultancy services in machine learning development in most verticals. Even web application development started out this way in many organizations before in-house employees assumed all application development effort. Some organizations may want to have both: the expediency to realize near-term goals and the investment to build long-term capabilities.
I have a sample application, http://shrink-text.westus2.cloudapp.azure.com:8668/add, to illustrate an example. Suppose we want to use this as a sample within a particular domain; then we would need to justify it over, say, SQL Server's text classification capability. It goes without saying that the above processing does not require text data to make its way into SQL Server for the service to be used. Instead, we focus on the model tuning and customization we can do for the same algorithm as in SQL Server, as well as model enhancement with other algorithms, say from an R package, while allowing operation on data in transit in both cases.
#codingexercise
Tests for https://ideone.com/6ms4Vz

Wednesday, May 2, 2018

Introduction:  
This article narrates some of the essential steps in getting a small lightweight eCommerce website up and running in minimal time. We assume that the users for this website will be recognized, their transactions on the website will be remembered and the processing for each order will be transparent to the user.  
The registration of the user occurs with a membership provider – this can be an ASP.Net membership provider, a third-party identity provider such as Login with Google, or an IAM vendor that honors authentication and authorization protocols such as OAuth or SAML.
Assuming a simple python Django application suffices as a middle-tier REST service API, we can rely on Django’s native support for different authentication backends such as model based authentication backend or remote user backend. To support Google user automatic recognition, we just include the markup in the user interface as  
<div class="g-signin2" data-onsuccess="onSignIn"></div>
<script>
function onSignIn(googleUser) {
  var profile = googleUser.getBasicProfile();
  console.log('ID: ' + profile.getId()); // Do not send to your backend! Use an ID token instead.
  console.log('Name: ' + profile.getName());
  console.log('Image URL: ' + profile.getImageUrl());
  console.log('Email: ' + profile.getEmail()); // This is null if the 'email' scope is not present.
}
</script>
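On the server side, the comment in the snippet matters: the browser-visible profile should not be trusted. Here is a minimal sketch, assuming the google-auth Python library and a placeholder client ID, of how the Django backend might verify the ID token instead:

# pip install google-auth  (assumption: using the google-auth library)
from google.oauth2 import id_token
from google.auth.transport import requests as google_requests

GOOGLE_CLIENT_ID = "YOUR_CLIENT_ID.apps.googleusercontent.com"  # placeholder value

def verify_google_token(token: str) -> dict:
    """Verify the ID token sent by the sign-in widget and return its claims."""
    # Raises ValueError if the token is expired, forged, or issued for another client.
    idinfo = id_token.verify_oauth2_token(token, google_requests.Request(), GOOGLE_CLIENT_ID)
    return {"user_id": idinfo["sub"], "email": idinfo.get("email")}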
The transactions on the website for a recognized user are maintained with the help of session management.

Django has native support for session management and, in addition, allows us to write our own middleware.
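For instance, a minimal custom middleware in Django's documented style might stamp each session with a running request count; the RequestCountMiddleware name and the session key are made up for illustration:

class RequestCountMiddleware:
    """Illustrative middleware: counts requests per session (name and key are hypothetical)."""
    def __init__(self, get_response):
        self.get_response = get_response

    def __call__(self, request):
        # Django's SessionMiddleware must run before this one.
        request.session["request_count"] = request.session.get("request_count", 0) + 1
        return self.get_response(request)

# settings.py: list "myapp.middleware.RequestCountMiddleware" after SessionMiddleware.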
The order history is maintained in the form of relevant details from the orders in the Order table. Creates, updates and deletes of orders are tracked from this table. The status field on the order table is progressive, taking the values initialized, processing, completed and canceled. Timestamps are maintained for creation as well as modification.
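A Django model sketch of such an Order table might look like the following; the field names are one plausible reading of the description above, not a prescribed schema:

from django.db import models

class Order(models.Model):
    STATUS_CHOICES = [
        ("initialized", "Initialized"),
        ("processing", "Processing"),
        ("completed", "Completed"),
        ("canceled", "Canceled"),
    ]
    user = models.ForeignKey("auth.User", on_delete=models.CASCADE)
    status = models.CharField(max_length=16, choices=STATUS_CHOICES, default="initialized")
    created_at = models.DateTimeField(auto_now_add=True)   # set once on create
    modified_at = models.DateTimeField(auto_now=True)      # updated on every save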
Sample App: http://shrink-text.westus2.cloudapp.azure.com:8668/add 
#codingexercise 
https://ideone.com/6ms4Vz

Tuesday, May 1, 2018

Today we discuss the AWS Database Migration Service (DMS). This service allows consolidation, distribution, and replication of databases. The source database remains fully operational during the migration, minimizing downtime to applications that rely on the database. It supports almost all the major brands of databases. It can also perform heterogeneous migrations, such as from Oracle to Microsoft SQL Server.
When the databases are different, the AWS Schema Conversion Tool is used. The steps for conversion include: assessment, database schema conversion, application conversion, scripts conversion, integration with third-party applications, data migration, functional testing of the entire system, performance tuning, integration and deployment, training and knowledge transfer, documentation and version control, and post-production support. The Schema Conversion Tool assists with the first few steps, up to the data migration step. Database objects such as tables, views, indexes, code, user-defined types, aggregates, stored procedures, functions, triggers, and packages can be moved with the SQL from the Schema Conversion Tool. The tool also provides an assessment report and an executive summary. As long as the tool has the drivers for the source and destination databases, we can rely on the migration automation performed this way. Subsequently, configuration and settings need to be specified on the target database. These settings include performance, memory, assessment report, etc. The number of tables, schemas, and users/roles/permissions determines the duration of the migration.
The DMS differs from the schema conversion tool in that it is generally used for data migration instead of schema migration.
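As a sketch of what kicking off such a data migration looks like programmatically, the boto3 call below creates a DMS replication task, assuming the replication instance and the source and target endpoints have already been provisioned (the ARNs and region are placeholders):

import json
import boto3

dms = boto3.client("dms", region_name="us-west-2")  # region is an example

# Placeholder ARNs: a real task needs an existing replication instance and endpoints.
task = dms.create_replication_task(
    ReplicationTaskIdentifier="oracle-to-sqlserver-full-load",
    SourceEndpointArn="arn:aws:dms:us-west-2:123456789012:endpoint:SOURCE",
    TargetEndpointArn="arn:aws:dms:us-west-2:123456789012:endpoint:TARGET",
    ReplicationInstanceArn="arn:aws:dms:us-west-2:123456789012:rep:INSTANCE",
    MigrationType="full-load",  # or "full-load-and-cdc" to keep replicating changes
    TableMappings=json.dumps({
        "rules": [{
            "rule-type": "selection", "rule-id": "1", "rule-name": "all-tables",
            "object-locator": {"schema-name": "%", "table-name": "%"},
            "rule-action": "include",
        }]
    }),
)
print(task["ReplicationTask"]["Status"])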
#codingexercises
Sierpinski triangle:
double GetCountRepeated(int n)
{
    double result = 1;
    for (int i = 0; i < n; i++)
    {
        result = 3 * result + 2; // recurrence T(n) = 3*T(n-1) + 2, with T(0) = 1
    }
    return result;
}
which can also be written recursively
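For instance, a minimal recursive Python sketch of the same recurrence (the function name is mine, not part of the exercise):

def get_count_repeated(n: int) -> int:
    # same recurrence as the loop above: T(0) = 1, T(n) = 3*T(n-1) + 2
    if n <= 0:
        return 1
    return 3 * get_count_repeated(n - 1) + 2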
another: https://ideone.com/F6QWcu
and finally: Text Summarization app: http://shrink-text.westus2.cloudapp.azure.com:8668/add