Saturday, February 20, 2021

 The file layout for scripts exported from a database 

 

Introduction: The article on exporting scripts from a database discussed the business justification but not the implementation. Databases lay out data in the form of tables, rows and columns. That article referred to the data being written out to files but omitted the file layout. This article touches on that topic. 


Description: When the text columns are consistent across all tables, they tend to follow a reliable pattern. This data can be reorganized into a table that is suitable for staging before the data is exported to files. Creating staging tables also helps background jobs, because they can leverage that organization to pause and resume their long-running activities.  
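
As a minimal sketch of this idea, assuming a SQLite database and a hypothetical script_staging table (the source table and column names are made up for illustration):

import sqlite3

conn = sqlite3.connect("tenant.db")   # hypothetical database file
conn.execute("""
CREATE TABLE IF NOT EXISTS script_staging (
    seq           INTEGER PRIMARY KEY AUTOINCREMENT,
    source_table  TEXT NOT NULL,
    source_row    TEXT NOT NULL,
    source_column TEXT NOT NULL,
    content       TEXT,
    exported      INTEGER DEFAULT 0   -- lets a background job pause and resume
)""")

# A background job could then copy script-bearing text columns in, for example:
# INSERT INTO script_staging (source_table, source_row, source_column, content)
# SELECT 'business_rules', id, 'script_body', script_body FROM business_rules;
conn.commit()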


Some text columns may be small, others large, and both kinds may appear in the same table. They can even be large-text columns. Scripts do not necessarily have a size limit, and not all text columns store scripts. There is no pattern that can be interpreted from the contents of a column alone.  


If the content varies a lot, it can still be saved as independent files as long as we can selectively filter out the scripts from the relational tables. The order in which they are retrieved can be determined by the order in which they are listed in the tables. Each table can then correspond to a folder. The folders can be independent, just like the tables, but they can also be nested to determine the layout for tables and columns. Identifiers for the files and folders can be a simple name or, preferably, a multi-part identifier that comprises not only the sequence number but also the name of the column or the table, if not the content of the cell identified by the row and column. 
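
A small sketch of one such multi-part naming scheme, with hypothetical table, column and extension choices:

import os

def script_path(root, table, row_id, column, seq):
    # One folder per table; the file name is a multi-part identifier built
    # from the sequence number, the column name and the row identifier.
    folder = os.path.join(root, table)
    os.makedirs(folder, exist_ok=True)
    return os.path.join(folder, f"{seq:06d}_{column}_{row_id}.js")

# e.g. script_path("export", "business_rules", "42", "script_body", 7)
#      -> export/business_rules/000007_script_body_42.js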


Some metadata can also be packed along with the data so that the entries can be corroborated upon unpacking. Checksums or digests help in this regard, and a listing of the entries or their sizes can also be included. 

The data in the file layout should be compressed into an archive. Archiving automatically prepares the list of files that are included, and the packing is efficient in terms of storage and portability. 
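
One way, sketched here rather than prescribed, to pack the exported tree together with a manifest of sizes and digests that can be corroborated on unpacking:

import hashlib, json, os, tarfile

def pack(root, archive_path):
    manifest = {}
    for dirpath, _, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            with open(path, "rb") as f:
                data = f.read()
            manifest[os.path.relpath(path, root)] = {
                "size": len(data),
                "sha256": hashlib.sha256(data).hexdigest(),
            }
    # Write the manifest alongside the files, then compress the whole tree.
    with open(os.path.join(root, "manifest.json"), "w") as f:
        json.dump(manifest, f, indent=2)
    with tarfile.open(archive_path, "w:gz") as tar:
        tar.add(root, arcname=os.path.basename(root))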


Conclusion: The scripts from the database for a customer, say in a multi-tenant cloud database, can be extracted as files and then checked into a source control repository by simply overlaying them over existing scripts by virtue of the file layout. This helps with determining differences, history and versioning, which reduces the cost of investigations pertaining to this form of data in the customer's instance. 

 

Friday, February 19, 2021

 Tracking changes in scripts from a customer's data 

 

Problem Statement: Scripts are saved in the database as text columns. The database provides many advantages but fails as a source control system. It is hard to visualize the line-by-line differences between versions of the same script, much less the differences between folders in a recursive manner. Determining the differences between scripts cuts down on costs from alternate means of identifying the root cause of symptoms found with a customer's data. For instance, the cost of testing increases dramatically with more dimensions in the test matrix. Scripts are the customer's logic stored in their instance of a multi-tenant platform. The service desk technical support engineer has no easy way to study changes in a script without writing custom code to retrieve scripts via SQL queries. If these scripts were exported as files into a well-known source control tracking mechanism, albeit only for the duration of the investigation, it would leverage well-known practices for viewing and determining changes to scripts and streamline processes around investigations involving source code. Some automation is needed to export the scripts from the customer's data into a source control mechanism. This article provides one such mechanism. 

 

Solution: Data import/export from a database and from a source control system are not new features. The language of import/export has been text-based artifacts such as files. Extract-Transform-Load operations or ad hoc SQL queries can export data from the database, and source control software can export the contents of a repository as a compressed archive of files. What was missing was a layout of the data that reflects its organization in the database, so that the data becomes easy to differentiate and work with on disk as opposed to remaining hidden within the database. When the cells, the rows and the tables are organized so that they are easy to locate via a folder/file hierarchy, they become easy to process with command-line tools, which are savvy about folder/file hierarchies and can recursively differentiate instances of data. 

The engineers who investigate issues with a customer's data often post work notes and attachments to the case. The number of work notes and the size of attachments are usually not a concern, as they can be rolled over as many times as needed. Engineers often like to get the entirety of the scripts from a customer's instance in one shot, which they can then selectively study with the tools at hand. When the attachment is a compressed archive of files, it is easy to decompress and register it as a source repository in order to utilize the popular source-differencing tools. Shell and command-line utilities do not deliver the kind of rich experience that source control systems like Git provide.  

When the scripts change often, versioning plays an important role in the investigation. Fortunately, source control software such as Git is well suited for versioning overlays of source from decompressed archives that maintain part or all of the file layout structure. Git is not only ubiquitous as source control software, it also supports well-established branching and versioning workflows such as GitFlow and GitHub Flow, and its configuration lets many aspects of versioning be tweaked and standardized, even to support Semantic Versioning 2.0 via tags. The use of commit hashes and commit manipulations allows forward and backward traversal of the revision history. If the engineer merely has to decompress an automation-packed archive onto a source repository to close in on a single-line change in a script between versions or within a time range, it becomes more than a convenience. 
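
A sketch of that overlay step, assuming the archive has already been produced, that its entries are rooted at the script folders, and that a local Git repository already exists (paths and messages are hypothetical):

import subprocess, tarfile

def overlay_and_commit(archive_path, repo_dir, message):
    # Unpack the archive over the existing checkout so only changed scripts
    # show up as modifications, then commit the overlay as a new version.
    with tarfile.open(archive_path, "r:gz") as tar:
        tar.extractall(repo_dir)
    subprocess.run(["git", "-C", repo_dir, "add", "-A"], check=True)
    subprocess.run(["git", "-C", repo_dir, "commit", "-m", message], check=True)

# A later "git -C <repo> diff HEAD~1 HEAD" then shows the line-by-line changes.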

Automations are savvy about making attachments to a case because they must do so for runtime operations such as taking a memory dump of a running process, or exporting certain logs, metrics, events and other useful information. What they need additionally is a standardized way of fetching data relevant to the goal stated in this article. ETL wizards are only part of the solution because they export the data but not the organization. OData popularized exposing data to the web using conventional web Application Programming Interfaces (APIs). This data could then be imported via a command-line tool such as curl, and JavaScript Object Notation (JSON) made it universally acceptable for programmability. Yet the organization information was largely left to the caller, and the onus of making hundreds of calls did not address the need described here. The S3 API, with its hierarchy of namespace, bucket and object, was a novel technique because each object is web accessible. If the data could be exported from a database into these containers, command-line tools become unnecessary, as downstream systems such as event processors can read the data directly without having to put it on disk between stages of processing, or they can leverage the folder-prefix organization to export some or all of the data as file archives.  
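
A sketch of exporting with folder-style prefixes to S3 instead of local disk, assuming the boto3 library and a hypothetical bucket name:

import boto3

s3 = boto3.client("s3")

def upload_script(bucket, table, row_id, column, content):
    # The key's folder-style prefix preserves the table/column organization.
    key = f"scripts/{table}/{column}/{row_id}.js"
    s3.put_object(Bucket=bucket, Key=key, Body=content.encode("utf-8"))

# e.g. upload_script("tenant-exports", "business_rules", "42", "script_body", text)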

The filesystem remains universally appealing to shell-based command-line tools as well as source control tracking mechanisms. Writing program-based tools that can download data from a database and write it to disk in the form of folders and files is therefore the only complete solution for differentiating instances of data between database snapshots. This option was always left to individual developers to implement for their use case, because the size of a production database can be upwards of terabytes, and filesystems, even stretched over clusters, were not considered practical for production. But scripts form a very small fraction of such production data and can be helpfully exported and imported with reliable automation. 

Conclusion: With the ability to bridge a relational store and a source control mechanism using a folder/file layout, the cost of resolving incidents involving a customer's data that originate from their logic can be reduced significantly with reliable automation. 

Thursday, February 18, 2021

Writing an item-based and a user-based recommender using TensorFlow.js


 Introduction: TensorFlow.js is a machine learning framework for JavaScript applications. It helps us build models that can be used directly in the browser or in a Node.js server. We use this framework to build an application that makes recommendations using deep learning, which can find the best results among millions of candidates. 

Description: The recommender from TensorFlow is built on two core features: one that supports fast approximate retrieval and another that supports better techniques for modeling feature interactions. A SavedModel object is instantiated for the recommender, which takes query features as input and provides the recommendations as output. Feature interactions are based on deep and cross networks, which are efficient architectures for deep learning. Cross features are essential to span the large and sparse feature spaces that are typical of datasets used with recommenders. ScaNN is a state-of-the-art nearest neighbor search (NNS) library, and it integrates with TensorFlow Recommenders.  

For example, we can build the approximate retrieval index with ScaNN (assuming a trained two-tower model with user_model and movie_model sub-models and a movies dataset of candidate titles): 

import tensorflow_recommenders as tfrs

# Index the candidate (movie) embeddings against the query (user) model.
scann = tfrs.layers.factorized_top_k.ScaNN(model.user_model) 
scann.index(movies.batch(100).map(model.movie_model), movies) 

Collaborative filtering involves making recommendations based on a user group. To make a recommendation, first a group sharing similar tastes is found, and then the preferences of the group are used to produce a ranked list of suggestions. A common data structure that helps with keeping track of people and their preferences is a nested dictionary. This dictionary can use a quantitative ranking, say on a scale of 1 to 5, to denote the preferences of the people in the selected group. To find similar people to form a group, we use some form of similarity score. One way to calculate this score is to plot the items that the people have ranked in common and use them as axes in a chart; the people who are close together on the chart can then form a group. Another approach is item-based filtering, which works like the user-based approach except that similarity is computed between items rather than users. It is significantly faster than the user-based approach but requires storage for an item similarity table. TensorFlow supports recommenders with simple vectors involving high-dimensional features and embedding projectors for visualization. Recommenders work very well with matrix factorization. 
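
A minimal sketch of the nested-dictionary structure and a distance-based similarity score for the user-based approach; the names and ratings are made up:

import math

# Nested dictionary: person -> {item: rating on a 1-5 scale}
prefs = {
    "alice": {"Matrix": 5, "Inception": 4, "Up": 2},
    "bob":   {"Matrix": 4, "Inception": 5, "Up": 1},
    "carol": {"Matrix": 1, "Up": 5},
}

def similarity(prefs, a, b):
    # Score the two people on the items they have ranked in common.
    shared = [item for item in prefs[a] if item in prefs[b]]
    if not shared:
        return 0.0
    dist = math.sqrt(sum((prefs[a][i] - prefs[b][i]) ** 2 for i in shared))
    return 1.0 / (1.0 + dist)

# similarity(prefs, "alice", "bob") is close to 1; alice and carol score much lower.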

Keras serves as the high-level modeling API and runs well in a Colab environment. Keras can help author the model and deploy it to an environment such as Colab, where the model can be trained on a GPU. Once the training is done, the model can be loaded and run anywhere else, including a browser. The power of TensorFlow.js is in its ability to load the model and make predictions in the browser itself. 

While brute-force approaches make fewer inferences per second, ScaNN is quite scalable and efficient. 

As with any machine learning example, the data is split into a 70% training set and a 30% test set. There is no inherent order to the data, so the split is taken over a random shuffle.   
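
A small sketch of that random 70/30 split, assuming the examples are held in a Python sequence:

import random

def split_examples(examples, train_fraction=0.7, seed=42):
    shuffled = list(examples)                 # no inherent order, so shuffle a copy
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]     # 70% training, 30% test

train_set, test_set = split_examples(range(1000))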

TensorFlow makes it easy to construct this model using the Keras API. The model can only produce meaningful output after it is trained, so in this case it must be run after the training data has labels assigned, which might be done by hand. The model generally works better with fewer parameters. A summary of the model can be printed for review, and with a chosen number of epochs and batch size the model can be trained, as in the sketch below. 
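
A minimal Keras sketch with made-up layer sizes, feature dimensions and placeholder labeled data, showing how such a model is constructed, summarized and trained:

import numpy as np
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(20,)),                       # assumed 20 input features
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),    # assumed binary label
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()                                        # print the model summary

x_train = np.random.rand(700, 20)                      # placeholder features
y_train = np.random.randint(0, 2, 700)                 # placeholder labels
model.fit(x_train, y_train, epochs=10, batch_size=32)  # epochs and batch size up front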

With the model and training/test sets defined, it is now easy to evaluate the model and run inference.  The model can also be saved and restored. It executes faster when a GPU is added to the compute. 
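
Continuing the sketch above (same model and placeholder shapes), evaluation, inference and save/restore might look like this:

x_test = np.random.rand(300, 20)                                  # placeholder test data
y_test = np.random.randint(0, 2, 300)

loss, accuracy = model.evaluate(x_test, y_test, batch_size=32)    # evaluate on the test set
predictions = model.predict(x_test[:5])                           # run inference on a few rows

model.save("recommender_model.h5")                                # save ...
restored = tf.keras.models.load_model("recommender_model.h5")     # ... and restore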

When the model is trained, it is done in batches of a predefined size. The number of passes over the entire training dataset, called epochs, can also be set up front. These are model tuning parameters. Every model has a speed, a Mean Average Precision, and an output; the higher the precision, the lower the speed. It is helpful to visualize the training with a chart that is updated with the loss after each epoch. Usually there will be a downward trend in the loss, which is referred to as the model converging. 
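
Continuing the same sketch, one way to capture the loss after each epoch so it can be fed to whatever charting library is at hand:

class LossLogger(tf.keras.callbacks.Callback):
    def on_epoch_end(self, epoch, logs=None):
        # Push logs["loss"] to a live chart here; printing stands in for that.
        print(f"epoch {epoch}: loss={logs['loss']:.4f}")

model.fit(x_train, y_train, epochs=10, batch_size=32,
          callbacks=[LossLogger()], verbose=0)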

Training the model might take a long time, say about four hours. Once the test data has been evaluated, the model's effectiveness can be measured using precision and recall: precision refers to the fraction of the model's positive inferences that were indeed positive, and recall to the fraction of actual positives that the model identified. 
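
For concreteness, a small worked computation of precision and recall from counts of true positives, false positives and false negatives (the numbers are made up):

def precision_recall(true_pos, false_pos, false_neg):
    precision = true_pos / (true_pos + false_pos)   # share of flagged positives that were right
    recall = true_pos / (true_pos + false_neg)      # share of actual positives that were found
    return precision, recall

# e.g. 80 true positives, 20 false positives, 40 false negatives:
# precision = 80 / 100 = 0.80, recall = 80 / 120 ≈ 0.67
print(precision_recall(80, 20, 40))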

Conclusion: TensorFlow.js is becoming a standard for implementing machine learning models. Its usage is simple, but the choice of model and the preparation of data take significantly more time than setting it up, evaluating it, and using it.