Friday, February 19, 2021

Tracking changes in scripts from a customer's data

 

Problem Statement: Scripts are saved in the database as text columns. The database provides many advantages but fails as a source control system. It is hard to visualize the line-by-line differences between versions of the same script, much less the differences between folders in a recursive manner. Determining the differences between scripts cuts down on the cost of alternate means of identifying the root cause of symptoms found in a customer's data; for instance, the cost of testing increases dramatically with every added dimension of the test matrix. Scripts are the customer's logic stored in their instance of a multi-tenant platform, and a service desk technical support engineer has no easy way to study changes in a script without writing custom code to retrieve it via SQL queries. If these scripts were exported as files into a well-known source control tracking mechanism, even if only for the duration of an investigation, engineers could leverage well-known practices for viewing and determining changes to source code, streamlining investigations that involve customer scripts. Some automation is needed to export the scripts from the customer's data into such a source control mechanism. This article provides one such mechanism.

 

Solution: Data import and export from a database or from a source control system are not new features. The common language of import/export has been text-based artifacts such as files: Extract-Transform-Load operations or ad hoc SQL queries can export data from the database, and source control software can export the contents of a repository as a compressed archive of files. What has been missing is a layout that preserves the organization of the data as it exists in the database, so that the data becomes easy to differentiate and work with on disk instead of remaining hidden within the database. When cells, rows, and tables are mapped onto a folder/file hierarchy where they are easy to locate, they become accessible to command-line tools that understand that hierarchy and can recursively differentiate instances of the data.
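As a minimal sketch of that export step, assuming a hypothetical customer_scripts table with app_name, script_name, and script_body columns, and using sqlite3 only as a stand-in for whatever database driver the platform actually uses, the layout onto disk might look like this:

import os
import sqlite3  # stand-in for the platform's actual database driver

# Hypothetical schema: one row per script, keyed by application and script name.
QUERY = "SELECT app_name, script_name, script_body FROM customer_scripts"

def export_scripts(db_path, out_dir):
    """Lay each script out on disk as <out_dir>/<app_name>/<script_name>.js."""
    conn = sqlite3.connect(db_path)
    try:
        for app_name, script_name, script_body in conn.execute(QUERY):
            app_dir = os.path.join(out_dir, app_name)
            os.makedirs(app_dir, exist_ok=True)
            with open(os.path.join(app_dir, script_name + ".js"), "w", encoding="utf-8") as f:
                f.write(script_body or "")
    finally:
        conn.close()

if __name__ == "__main__":
    export_scripts("customer_instance.db", "exported_scripts")

The table, column, and file names above are illustrative; the point is only that each cell lands at a predictable path derived from its row and table.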

Engineers who investigate issues with a customer's data often post work notes and attachments to the case. The number of work notes and the size of attachments are usually not a concern, as they can be rolled over as many times as needed. Engineers often like to get the entirety of the scripts from a customer's instance in one shot and then selectively study them with the tools at hand. When the attachment takes the form of a compressed archive of files, it is easy to decompress it and register it as a source repository, putting the popular source-differencing tools to work; shell and command-line utilities alone do not deliver the rich experience that source control systems like Git provide.
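A sketch of that round trip, assuming the exported folder from the previous step, the standard git command-line tool on the investigator's machine, and a configured Git user identity, might be:

import shutil
import subprocess
import zipfile

def pack_for_attachment(export_dir, archive_base):
    """Compress the exported folder tree so automation can attach it to the case."""
    return shutil.make_archive(archive_base, "zip", export_dir)

def register_as_repo(archive_path, work_dir):
    """Decompress the attachment and turn it into a Git repository for diffing."""
    with zipfile.ZipFile(archive_path) as zf:
        zf.extractall(work_dir)
    subprocess.run(["git", "init"], cwd=work_dir, check=True)
    subprocess.run(["git", "add", "-A"], cwd=work_dir, check=True)
    subprocess.run(["git", "commit", "-m", "Baseline snapshot of customer scripts"],
                   cwd=work_dir, check=True)

The function and path names are placeholders; any archive format and any Git front end would serve the same purpose.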

When the scripts change often, versioning plays an important role in the investigation. Fortunately, source control software such as Git is well suited to versioning overlays of source from decompressed archives that preserve part or all of the file layout structure. Git is not only ubiquitous as source control software, but the versioning tooling built around it has also moved from prescribed workflows such as GitFlow and GitHubFlow toward configuration-driven approaches; with configuration, many aspects of versioning can be tweaked, standardized, and even made to follow Semantic Versioning 2.0. Commit hashes and commit manipulations allow forward and backward traversal of the revision history. If the engineer merely has to decompress an archive that automation has packed onto a source repository in order to close in on a single-line change in a script between versions or within a time range, that is more than a convenience.
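One way to make that traversal concrete, assuming each exported snapshot is overlaid onto the same working tree, is to commit every snapshot with a label and lean on ordinary Git queries afterwards; the directory names and labels below are hypothetical:

import subprocess

def commit_snapshot(repo_dir, snapshot_label):
    """Record the current overlay of exported scripts as a single commit."""
    subprocess.run(["git", "add", "-A"], cwd=repo_dir, check=True)
    subprocess.run(["git", "commit", "-m", "Snapshot " + snapshot_label],
                   cwd=repo_dir, check=True)

def changes_between(repo_dir, older_rev, newer_rev):
    """Return the line-by-line differences between two snapshot commits."""
    result = subprocess.run(["git", "diff", older_rev, newer_rev],
                            cwd=repo_dir, check=True,
                            capture_output=True, text=True)
    return result.stdout

# After overlaying a fresh export onto repo_dir:
#   commit_snapshot("work/customer_scripts", "2021-02-19T10:00Z")
#   print(changes_between("work/customer_scripts", "HEAD~1", "HEAD"))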

Automations already know how to attach artifacts to a case because they must do so for runtime operations such as taking a memory dump of a running process or exporting logs, metrics, events, and other useful information. What they additionally need is a standard way to fetch the data relevant to the goal stated in this article. ETL wizards are only part of the solution because they export the data but not its organization. OData popularized exposing data to the web through a conventional web Application Programming Interface (API); that data could then be imported with a command-line tool such as curl, and JavaScript Object Notation (JSON) made it universally acceptable for programmability. Yet the organization of the data was largely left to the caller, and the onus of making hundreds of calls did not address the need described here. The S3 API, with its three-level hierarchy of namespace, bucket, and object, was a novel alternative because each object is web accessible. If the data could be exported from a database into these containers, command-line tools become unnecessary: downstream systems such as event processors can read the data directly without staging it on disk between processing stages, or they can leverage the folder-prefix organization to export some or all of the data as file archives.
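As a sketch of the S3 variant, assuming the boto3 client library with AWS credentials already configured, and with the bucket and prefix names below invented purely for illustration, the on-disk hierarchy can be mirrored as key prefixes:

import os
import boto3  # assumes AWS credentials are already configured in the environment

def upload_export(export_dir, bucket, prefix):
    """Mirror the on-disk folder hierarchy as key prefixes under one S3 bucket."""
    s3 = boto3.client("s3")
    for root, _dirs, files in os.walk(export_dir):
        for name in files:
            local_path = os.path.join(root, name)
            rel_path = os.path.relpath(local_path, export_dir)
            key = "/".join([prefix] + rel_path.split(os.sep))
            s3.upload_file(local_path, bucket, key)

# Usage (bucket and prefix names are hypothetical):
#   upload_export("exported_scripts", "case-attachments", "case-12345/scripts")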

The filesystem remains universally appealing to shell-based command-line tools as well as to source control tracking mechanisms. Writing programmatic tools, whether in C or another language, that download data from a database and write it to disk as folders and files is therefore the only complete solution for differentiating instances of data between database snapshots. This option has traditionally been left to individual developers to implement for their own use cases, because a production database can be upwards of terabytes in size, and filesystems, even stretched across clusters, were not considered practical for production. But scripts form a very small fraction of such production data and can be imported and exported helpfully with reliable automation.
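Once two snapshots are on disk, even the differencing outside of Git is straightforward; a minimal sketch using the standard library, with the snapshot folder names invented for illustration:

import filecmp

def report_differences(old_dir, new_dir, rel=""):
    """Recursively report which script files were added, removed, or changed."""
    cmp = filecmp.dircmp(old_dir, new_dir)
    for name in cmp.left_only:
        print("removed: " + rel + name)
    for name in cmp.right_only:
        print("added:   " + rel + name)
    for name in cmp.diff_files:
        print("changed: " + rel + name)
    for name, sub in cmp.subdirs.items():
        report_differences(sub.left, sub.right, rel + name + "/")

# Usage: compare two exported snapshots on disk
#   report_differences("snapshot_monday", "snapshot_friday")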

Conclusion: With the ability to bridge a relational store and a source control mechanism through a folder/file layout, reliable automation can significantly reduce the cost of resolving incidents in a customer's data that originate from the customer's own logic.
