Sunday, February 14, 2021

Importer-Exporter


Problem Statement: Data held in a database poses a challenge to the utilities that have matured as command-line tools, because it is not in a declarative form they can read. The moment data is exported, it becomes an instant legacy. On the other hand, when data is exported as files, objects, or events, businesses are empowered to use non-relational technologies such as Big Data analysis stacks, the Snowflake warehouse, and stream analytics. What is needed is an import-export tool, and this article discusses the options available for building one.

 

Solution: Data import/export is not a new feature. Extract-Transform-Load (ETL) operations can export data in a variety of formats such as CSV or JSON. What has been missing is a layout on disk that mirrors the organization of the data within the database, so that the data becomes easy to differentiate and work with rather than remaining hidden inside the database. When the cells, rows, and tables are organized so that they are easy to locate in a folder/file hierarchy, they become accessible to command-line tools that understand such hierarchies and can recursively differentiate instances of data.
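As a minimal sketch of such a layout, the following Python snippet assumes a local SQLite database and writes one folder per table and one JSON file per row; the database file name and output path are placeholders, and a real exporter would choose its own naming and key strategy.

# Minimal sketch: export each table to a folder and each row to a JSON file.
# The database path and snapshot directory below are illustrative only.
import json
import sqlite3
from pathlib import Path

def export_snapshot(db_path: str, out_dir: str) -> None:
    """Flatten a database snapshot into a folder/file hierarchy."""
    conn = sqlite3.connect(db_path)
    conn.row_factory = sqlite3.Row
    tables = [r["name"] for r in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'")]
    for table in tables:
        table_dir = Path(out_dir) / table
        table_dir.mkdir(parents=True, exist_ok=True)
        # rowid gives every row a stable file name within the snapshot
        for row in conn.execute(f"SELECT rowid, * FROM {table}"):
            record = {k: row[k] for k in row.keys() if k != "rowid"}
            (table_dir / f"{row['rowid']}.json").write_text(
                json.dumps(record, indent=2, default=str))
    conn.close()

export_snapshot("inventory.db", "snapshots/2021-02-14")

With the rows on disk in this form, ordinary shell tools such as find, grep, and diff can locate and compare individual instances of data.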

  

The ETL wizards are only part of the solution because they export the data but not its organization. Even if they exported the schema, that would help organize the data but would not be sufficient for the end-to-end capability required of a tool that differentiates instances of data between database snapshots.

 

OData popularized exposing data to the web through a conventional web Application Programming Interface (API). That data could then be imported with a command-line tool such as curl, and the JavaScript Object Notation (JSON) format made it universally acceptable for programmability. Yet the organization information was largely left to the caller, and the onus of making hundreds of calls does not address the need described here. This, too, was an insufficient solution, but certainly a helpful step.
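A hedged sketch of that caller-side work, assuming an OData v4 service that pages results through @odata.nextLink; the service URL and entity set name are placeholders, and Python stands in for curl here.

# Walk one OData collection page by page; URL and entity set are hypothetical.
import json
import requests

def fetch_odata_collection(base_url: str, entity_set: str) -> list:
    """Follow @odata.nextLink until the server stops paging."""
    rows, url = [], f"{base_url.rstrip('/')}/{entity_set}?$format=json"
    while url:
        payload = requests.get(url, timeout=30).json()
        rows.extend(payload.get("value", []))
        url = payload.get("@odata.nextLink")  # absent on the last page
    return rows

products = fetch_odata_collection("https://example.com/odata", "Products")
print(json.dumps(products[:1], indent=2))

Even with paging handled, the caller still has to decide how the returned entities map onto folders, files, or objects, which is exactly the gap noted above.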

 

The S3 API, with its three-level hierarchy of Namespace, Bucket, and Object, was an interesting evolution because each object is web-accessible. If the data could be exported from a database into these containers, command-line tools became unnecessary, since downstream systems such as event processors could read the data directly without staging it on disk between processing steps. The ability to differentiate instances of data between database snapshots could then be modified to iterate over the remote web locations, which incidentally maintain a folder-like prefix organization. Whether a database, table, and column translate to a Namespace, Bucket, and Object remains at the discretion of the caller and depends on the nature and size of the data. This mapping, even if predetermined, cannot be found in the existing pieces of the solution offered by ETL wizards, OData, or S3 exporters and importers.
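One possible mapping is sketched below with boto3 against a local SQLite source: each table becomes a key prefix and each row becomes an object. The bucket and table names are illustrative, and a real exporter would have to pick its own granularity and credentials handling.

# Sketch: database table -> S3 key prefix, row -> S3 object (names are illustrative).
import json
import sqlite3
import boto3

def export_table_to_s3(db_path: str, table: str, bucket: str) -> None:
    """Write each row of one table as a JSON object under a folder-style prefix."""
    s3 = boto3.client("s3")
    conn = sqlite3.connect(db_path)
    conn.row_factory = sqlite3.Row
    for row in conn.execute(f"SELECT rowid, * FROM {table}"):
        record = {k: row[k] for k in row.keys() if k != "rowid"}
        s3.put_object(
            Bucket=bucket,
            Key=f"{table}/{row['rowid']}.json",  # prefix mimics a folder
            Body=json.dumps(record, default=str).encode("utf-8"),
        )
    conn.close()

export_table_to_s3("inventory.db", "orders", "inventory-snapshots")

Because each object is individually addressable over the web, downstream consumers can read the exported rows directly without an intermediate copy on disk.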

 

The filesystem remains universally appealing to shell-based command-line tools. Writing C program-based tools that download data from a database and write it to disk as folders and files is, therefore, the only complete solution for differentiating instances of data between database snapshots. This option was always left to individual developers to implement for their use case, because a production database can run upwards of terabytes, and filesystems, even stretched across clusters, were not considered practical for production. Yet these same developers proved that data is not sticky to relational databases when they found and promoted batch and stream-based processing that avoids the centralized store.
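Once two snapshots sit on disk as folder/file hierarchies, differentiating them is a recursive walk. The sketch below uses Python's filecmp module on two placeholder snapshot directories; a shell user could reach for diff -r instead.

# Report rows added, removed, or changed between two exported snapshots.
import filecmp
from pathlib import Path

def diff_snapshots(old_dir: str, new_dir: str, prefix: str = "") -> None:
    """Recursively compare two snapshot folders produced by the exporter."""
    cmp = filecmp.dircmp(old_dir, new_dir)
    for name in cmp.left_only:
        print(f"removed  {prefix}{name}")
    for name in cmp.right_only:
        print(f"added    {prefix}{name}")
    for name in cmp.diff_files:
        print(f"changed  {prefix}{name}")
    for name in cmp.common_dirs:
        diff_snapshots(str(Path(old_dir) / name),
                       str(Path(new_dir) / name),
                       prefix=f"{prefix}{name}/")

diff_snapshots("snapshots/2021-02-13", "snapshots/2021-02-14")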

 

Conclusion: With the ability to connect to a relational store and a predetermined strategy for flattening database snapshots into a folder/file layout, a shell-based command-line tool can enable source-control tracking of data, allowing data revisions to be visualized and compared more easily.
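As a rough illustration of that tracking, assuming git as the source-control system and a placeholder snapshot directory, each export could simply be committed as a new revision.

# Put exported snapshots under git; the directory and message are placeholders.
import subprocess
from pathlib import Path

def commit_snapshot(snapshot_dir: str, message: str) -> None:
    """Initialize a repository if needed, then commit the current export."""
    repo = Path(snapshot_dir)
    if not (repo / ".git").exists():
        subprocess.run(["git", "init"], cwd=repo, check=True)
    subprocess.run(["git", "add", "-A"], cwd=repo, check=True)
    # Assumes the new export changed at least one file; otherwise commit fails.
    subprocess.run(["git", "commit", "-m", message], cwd=repo, check=True)

commit_snapshot("snapshots", "export taken 2021-02-14")

With revisions recorded this way, git log and git diff give the visualization and comparison of data changes that the conclusion calls for.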
