Friday, February 19, 2021

 Tracking changes in scripts from customer’s data: 

 

Problem Statement: Scripts are saved in the database as text columns. The database provides many advantages but fails as a source control system: it is hard to visualize the line-by-line differences between versions of the same script, much less the differences between folders in a recursive manner. Determining the differences between scripts cuts down on the cost of alternate means of identifying the root cause of symptoms found in a customer's data; the cost of testing, for instance, increases dramatically with every added dimension of the test matrix. Scripts are the customer's logic stored in their instance of a multi-tenant platform. The service desk technical support engineer has no easy way to study changes in a script without writing custom code to retrieve scripts via SQL queries. If these scripts were exported into a well-known source control tracking mechanism as files, even if only for the duration of an investigation, the engineer could leverage well-known practices for viewing and determining changes to scripts, streamlining processes around investigations involving source code. Some automation is needed to export the scripts from the customer's data into a source control mechanism. This article provides one such mechanism. 

 

Solution: Data import/export from a database and from a source control system are not new features. The language of import/export has always been text-based artifacts such as files. Extract-Transform-Load operations or ad hoc SQL queries can export data from the database, and source control software can export the contents of a repository as a compressed archive of files. What was missing was a layout of the data on disk that mirrors its organization in the database, so that the data becomes easy to differentiate and work with rather than remaining hidden within the database. When the cells, rows, and tables are organized so that they are easy to locate in a folder/file hierarchy, they become easy targets for command-line tools that understand such hierarchies and can recursively differentiate instances of data. 

The engineers who investigate issues with a customer's data often post work notes and attachments to the case. The number of work notes and the size of attachments are usually not a concern, as they can be rolled over as many times as needed. Engineers often like to get the entirety of scripts from a customer's instance in one shot, which they can then selectively study with the tools at hand. When the attachment is a compressed archive of files, it is easy to decompress and register as a source repository in order to use the popular source-differencing tools. Shell and command-line utilities alone do not deliver the kind of rich experience that source control systems like Git provide.  

When the scripts change often, versioning plays an important role in the investigation. Fortunately, source control software such as Git is well suited for versioning overlays of source from decompressed archives that maintain part or all of the file layout structure. Git is not only ubiquitous as source control software; the ecosystem around it also offers versioning strategies such as GitFlow and GitHub Flow, and tooling whose configuration can be tweaked and standardized, even to follow Semantic Versioning 2.0. Git's use of content hashes and commit manipulations allows forward and backward traversal of the revision history. If the engineer merely has to decompress an archive packed by automation onto a source repository in order to close in on a single-line change in a script between versions or over a time range, it becomes more than a convenience. 
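
As a rough sketch of that automation, each exported archive could be unpacked onto a working tree and committed, so that standard Git commands surface the line-by-line differences between any two snapshots. The archive names, repository path, and file layout below are hypothetical.

# snapshot_to_git.py - minimal sketch: commit successive script exports into a Git repository
# Assumptions: each export is an archive such as scripts_2021-02-19.zip whose contents
# mirror the folder/file hierarchy chosen for the database tables (names are hypothetical).
import pathlib
import subprocess
import sys
import zipfile

REPO = pathlib.Path("script_history")            # local investigation repository

def run(*args):
    subprocess.run(args, cwd=REPO, check=True)

def commit_snapshot(archive_path, label):
    REPO.mkdir(exist_ok=True)
    if not (REPO / ".git").exists():
        run("git", "init")
    with zipfile.ZipFile(archive_path) as archive:
        archive.extractall(REPO)                 # overlay the export onto the working tree
    run("git", "add", "-A")
    run("git", "commit", "-m", f"snapshot {label}", "--allow-empty")

if __name__ == "__main__":
    for archive in sys.argv[1:]:                 # e.g. scripts_2021-02-18.zip scripts_2021-02-19.zip
        commit_snapshot(archive, pathlib.Path(archive).stem)
    run("git", "diff", "HEAD~1", "HEAD")         # line-by-line differences between the last two snapshots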

Automations are savvy about making attachments to a case because they must do so for runtime operations such as taking a memory dump of a running process or exporting logs, metrics, events, and other useful information. What they additionally need is a standard way of fetching the data relevant to the goal stated in this article. ETL wizards are only part of the solution because they export the data but not its organization. OData popularized exposing data to the web using a conventional web Application Programming Interface (API). This data could then be imported via a command-line tool such as curl, and the JavaScript Object Notation (JSON) format made it universally acceptable for programmability. Yet the organization of the data was largely left to the caller, and the onus of making hundreds of calls did not address the need described here. The S3 API, with its hierarchy of namespace, bucket, and object, was a novel technique because each object is web accessible. If the data could be exported from a database into these containers, command-line tools would become unnecessary, since downstream systems such as event processors could read the data directly without staging it on disk between processing stages, or leverage the folder-prefix organization to export some or all of the data as file archives.  

The filesystem remains universally appealing to all shell-based command-line tools as well as to source control tracking mechanisms. Writing custom program-based tools that can download data from a database and write it to disk in the form of folders and files is therefore the most complete way to differentiate instances of data between database snapshots. This option has traditionally been left to individual developers to implement for their own use cases, because the size of a production database can be upwards of terabytes, and filesystems, even stretched across clusters, were not considered practical at that scale. But scripts form a very small fraction of such production data and can be usefully exported and imported with reliable automation. 
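
A minimal sketch of such an export, assuming a hypothetical customer_scripts table with table_name, name, and script columns, and using sqlite3 merely as a stand-in for whatever database driver the platform actually requires, could look like this:

# export_scripts.py - lay database rows out on disk as folders and files
# Assumptions: the table and column names are hypothetical, and the .js extension
# is an assumption about the scripting language stored in the text column.
import pathlib
import sqlite3

EXPORT_ROOT = pathlib.Path("export")             # becomes the root of the archive / repository

def export_scripts(db_path):
    conn = sqlite3.connect(db_path)
    rows = conn.execute("SELECT table_name, name, script FROM customer_scripts")
    for table_name, name, script in rows:
        # one folder per table and one file per row keeps diffs recursive and line-oriented
        target = EXPORT_ROOT / table_name / f"{name}.js"
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(script or "")
    conn.close()

if __name__ == "__main__":
    export_scripts("customer_instance.db")       # hypothetical snapshot of the customer data

The resulting export folder can then be compressed, attached to the case, and committed as a snapshot with the earlier sketch.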

Conclusion: With the ability to bridge a relational store and a source control mechanism using a folder/file layout, the cost of resolving incidents that originate from the customer's logic stored in their data can be reduced significantly with reliable automation. 

Thursday, February 18, 2021

Writing an item-based and a user-based recommender using TensorFlow.js


Introduction: TensorFlow is a machine learning framework for JavaScript applications. It helps us build models that can be used directly in the browser or in a Node.js server. We use this framework to build an application that recommends items using deep learning, finding the best results among millions of candidates. 

Description: The recommender support in TensorFlow is built on two core features: one that supports fast approximate retrieval and another that supports better techniques for modeling feature interactions. A SavedModel is exported for the recommender; it takes query features as input and provides recommendations as output. Feature interactions are modeled with deep and cross networks, which are efficient architectures for deep learning. Cross features are essential to span the large and sparse feature spaces that are typical of datasets used with recommenders. ScaNN is a state-of-the-art nearest neighbor search (NNS) library, and it integrates with TensorFlow Recommenders.  

For example, we can say: 

scann = tfrs.layers.factorized_top_k.ScaNN(model.user_model)
scann.index(movies.batch(100).map(model.movie_model), movies)
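
Once the index is built, a lookup might look like the following; the user id value is illustrative, and scann is the index constructed above.

import tensorflow as tf
# query the ScaNN index built above for a hypothetical user id "42"
_, titles = scann(tf.constant(["42"]))
print(f"Top recommendations: {titles[0, :3]}")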

Collaborative Filtering involves making recommendations based on a user group. To make a recommendation, first a group sharing similar taste is found, and then the preferences of the group are used to make a ranked list of suggestions. This technique is called collaborative filtering. A common data structure that helps keep track of people and their preferences is a nested dictionary. This dictionary could use a quantitative ranking, say on a scale of 1 to 5, to denote the preferences of the people in the selected group. To find similar people to form a group, we use some form of a similarity score. One way to calculate this score is to plot the items that the people have ranked in common and use them as axes in a chart; the people who are close together on the chart can then form a group. Another approach is to use item-based filtering, which works like the user-based approach except that similarity is computed between items rather than between users. It is significantly faster than the user-based approach but requires storage for an item-similarity table. TensorFlow supports recommenders with simple vectors involving high-dimensional features and embedding projectors for visualization. Recommenders also work very well with Matrix Factorization. 
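
A minimal sketch of the nested dictionary and a Euclidean-distance-based similarity score might look like this; the people, items, and rankings are made up.

# user_similarity.py - sketch of user-based collaborative filtering with a nested dictionary
from math import sqrt

# rankings on a 1-5 scale; the people and scores are purely illustrative
prefs = {
    "Alice": {"Movie A": 4, "Movie B": 3, "Movie C": 5},
    "Bob":   {"Movie A": 5, "Movie B": 1, "Movie D": 4},
    "Carol": {"Movie B": 4, "Movie C": 4, "Movie D": 2},
}

def sim_distance(prefs, person1, person2):
    # similarity score based on Euclidean distance over commonly ranked items
    shared = [item for item in prefs[person1] if item in prefs[person2]]
    if not shared:
        return 0.0
    sum_of_squares = sum((prefs[person1][item] - prefs[person2][item]) ** 2 for item in shared)
    return 1 / (1 + sqrt(sum_of_squares))

print(sim_distance(prefs, "Alice", "Bob"))   # a higher value means closer tastes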

Keras serves as the model-authoring layer and can run in a Colab environment, where the model can be trained on a GPU. Once training is done, the model can be loaded and run anywhere else, including a browser. The power of TensorFlow.js is in its ability to load the model and make predictions in the browser itself. 

While brute-force approaches make fewer inferences per second, ScaNN is quite scalable and efficient. 

As with any machine learning example, the data is split into a 70% training set and a 30% test set. There is no inherent order to the data, so the split is taken over a randomly shuffled set.   
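
A sketch of such a random split with tf.data, using the MovieLens 100K ratings purely as an illustrative dataset, could be:

import tensorflow_datasets as tfds

# MovieLens 100K ratings used only as an example dataset (~100,000 examples)
ratings = tfds.load("movielens/100k-ratings", split="train")
shuffled = ratings.shuffle(100_000, seed=42, reshuffle_each_iteration=False)
train = shuffled.take(70_000)                  # roughly 70% for training
test = shuffled.skip(70_000).take(30_000)      # remaining 30% for testing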

TensorFlow makes it easy to construct this model using the Keras API. The model can only produce meaningful output after it has been trained, and in this case the training data must have labels assigned, which might be done by hand. A model with fewer parameters is generally easier to train. A summary of the model can be printed for inspection, and the model can then be trained over a chosen number of epochs and batches. 
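
A condensed sketch of a two-tower retrieval model along the lines of the TensorFlow Recommenders MovieLens tutorials is shown below; the vocabularies are illustrative and would normally be computed from the dataset.

import numpy as np
import tensorflow as tf
import tensorflow_recommenders as tfrs

# illustrative vocabularies; in practice these come from the dataset itself
unique_user_ids = np.array(["1", "2", "3"])
unique_movie_titles = np.array(["Movie A", "Movie B", "Movie C"])
movies = tf.data.Dataset.from_tensor_slices(unique_movie_titles)

embedding_dim = 32
user_model = tf.keras.Sequential([
    tf.keras.layers.experimental.preprocessing.StringLookup(
        vocabulary=unique_user_ids, mask_token=None),
    tf.keras.layers.Embedding(len(unique_user_ids) + 1, embedding_dim),
])
movie_model = tf.keras.Sequential([
    tf.keras.layers.experimental.preprocessing.StringLookup(
        vocabulary=unique_movie_titles, mask_token=None),
    tf.keras.layers.Embedding(len(unique_movie_titles) + 1, embedding_dim),
])

class MovielensModel(tfrs.Model):
    def __init__(self):
        super().__init__()
        self.user_model = user_model
        self.movie_model = movie_model
        self.task = tfrs.tasks.Retrieval(
            metrics=tfrs.metrics.FactorizedTopK(
                candidates=movies.batch(128).map(movie_model)))

    def compute_loss(self, features, training=False):
        # the retrieval task scores user embeddings against positive movie embeddings
        user_embeddings = self.user_model(features["user_id"])
        movie_embeddings = self.movie_model(features["movie_title"])
        return self.task(user_embeddings, movie_embeddings)

model = MovielensModel()
model.compile(optimizer=tf.keras.optimizers.Adagrad(learning_rate=0.1))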

With the model and the training/test sets defined, it is now just as easy to evaluate the model and run inference. The model can also be saved and restored. Training and inference execute faster when a GPU is added to the computing environment. 

Training is done in batches of a predefined size, and the number of passes over the entire training dataset, called epochs, can also be set up front. These are the model's tuning parameters. Every model trades off speed against measures such as Mean Average Precision and the quality of its output; generally, the higher the precision, the lower the speed. It is helpful to visualize training with a chart that is updated with the loss after each epoch. Usually there is a downward trend in the loss, which is referred to as the model converging. 
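
Putting the tuning parameters together, training, evaluation, and saving could be sketched as follows, assuming the model and the train/test sets from the sketches above; the batch sizes, epoch count, and export path are arbitrary.

# assuming `model`, `train` and `test` from the sketches above
cached_train = train.shuffle(100_000).batch(8192).cache()
cached_test = test.batch(4096).cache()

history = model.fit(cached_train, epochs=3)             # epochs and batch size are the tuning parameters
print(history.history["loss"])                          # per-epoch loss, useful for charting convergence

print(model.evaluate(cached_test, return_dict=True))    # held-out metrics

# export the query tower so it can be served elsewhere, e.g. converted for TensorFlow.js
tf.saved_model.save(model.user_model, "export/user_model")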

Training the model might take a long time, say about four hours. When the test data has been evaluated, the model's effectiveness can be summarized using precision and recall: precision is the fraction of the model's positive inferences that were indeed positive, and recall is the fraction of all actual positives that the model managed to find. 
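
With the usual definitions, precision and recall reduce to simple ratios over the confusion counts; the numbers here are made up.

true_positives, false_positives, false_negatives = 90, 10, 30     # illustrative counts
precision = true_positives / (true_positives + false_positives)   # 0.90: correct among predicted positives
recall = true_positives / (true_positives + false_negatives)      # 0.75: found among actual positives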

Conclusion: TensorFlow.js is becoming a standard for implementing machine learning models. Its usage is simple, but the choice of model and the preparation of data take significantly more time than setting it up, evaluating it, and using it. 

Wednesday, February 17, 2021

User interface and API design for Streaming Queries (continued from previous post ...)

Analytical applications that perform queries on data are best visualized via catchy charts and graphs. Most applications using storage and analytical products have a user interface that lets customers create containers for their data but offers very little leverage for authoring streaming queries. Even the option to generate a query is often nothing more than uploading a compiled-language program artifact. The application reuses a template to make requests and responses, but this is very different from the long-running, streaming responses that are typical of these queries. A solution is presented in this document. 

Query Interface is not about keywords. 

When it comes to a user interface for querying data, people often associate the Google search interface with query terms. That is not necessarily the right interface for streaming queries: if anything, it is simplistic and keyword-based rather than operator-based. The Splunk interface is much cleaner, since it has matured around search over machine data. Streaming queries are not table queries, and they do not operate only on big data in the Hadoop file system. They do have standard operators that can translate directly to operator keywords and can be chained on the search bar with pipes, in the same manner as they are chained in a Java file. This translation of query-operator expressions on the search bar into a Java-language Flink streaming query is a patentable design for a streaming user interface, and one that will prove immensely flexible for direct execution of user queries as compared to packaged queries. 
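
A toy sketch of this translation, mapping a pipe-delimited search-bar expression onto a chain of streaming operators, is shown below; the query syntax and operator set are invented for illustration, and a real implementation would emit Flink DataStream calls rather than Python generators.

# pipe_query.py - toy translation of search-bar pipe syntax into chained stream operators
# The grammar and the operator names here are invented for illustration only.
def tokenize(query):
    return [stage.strip() for stage in query.split("|")]

def apply_stage(stream, stage):
    op, _, arg = stage.partition(" ")
    if op == "filter":                    # e.g. "filter error"
        return (event for event in stream if arg in event)
    if op == "head":                      # e.g. "head 2"
        return (event for i, event in enumerate(stream) if i < int(arg))
    raise ValueError(f"unknown operator: {op}")

def run_query(stream, query):
    for stage in tokenize(query):
        stream = apply_stage(stream, stage)
    return list(stream)

events = ["login ok", "disk error", "network error", "logout ok"]
print(run_query(events, "filter error | head 1"))   # ['disk error']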


Use of off-the-shelf JavaScript libraries for data visualization. 

 

There are several off-the-shelf libraries for data visualization, such as D3.js and Recharts, as well as admin-UI libraries for dashboards. They are used on a case-by-case basis. A user interface for a dedicated purpose requires a lot of customization: sometimes it is easy to work with off-the-shelf libraries; other times it is simpler to rewrite the foundation for the essential and long-standing use cases.  

 

Conclusion: 

User interface development for streaming queries should bring the best usability practices from object stores and databases while bringing out the best of the event-by-event continuity that is the hallmark of stream processing. Paginated results are the hallmark of querying over big data, where the more relevant results are brought up first even if all the results follow later. Results from streaming queries can also be shown as they arrive, so that users have a chance to pause once they already have what they need. Users will also benefit from piped operators that can be entered in the search bar just as keywords are, permitting the results to be extracted, transformed, and loaded into forms that appeal better to the user.