Thursday, June 13, 2024

This article continues a series on IaC shortcomings and resolutions. It describes how data scientists who use cloud infrastructure tend to think in terms of individual files and archives rather than filesystems. Most of the data they use to train models lives in a remote blob store, a file store, or some other data store such as a structured or unstructured database or a virtual data warehouse. Distributed file systems and interoperability protocols between heterogeneous operating systems such as Linux and Windows have long made it possible to view remote file systems as local paths through mounts and mapped drives, yet the diligence needed to set up and tear down entire filesystems on local compute instances and clusters is often skipped.

Part of the reason for this narrow focus on files and archives is the popularity of signed URIs for remote files, which make sharing easy on a file-by-file basis, together with the adoption of formats such as Parquet files and zip archives for convenient data transfer. When changes are made to these files, they often have to be unpacked, repacked, and then updated at the remote location in a one-time operation.
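The unpack-and-repack cycle mentioned above can be sketched with Python's standard zipfile module. This is only an illustration under stated assumptions: the function name update_member and the idea of replacing a single text member are hypothetical, and the final upload to the remote location is left out.

```python
import os
import tempfile
import zipfile


def update_member(archive_path, member, new_text):
    """Replace one member of a zip archive by unpacking and repacking it.

    Hypothetical helper for illustration; not part of any library.
    """
    with tempfile.TemporaryDirectory() as workdir:
        # Unpack everything, because a zip member cannot be edited in place.
        with zipfile.ZipFile(archive_path) as zf:
            zf.extractall(workdir)

        # Change the one file we actually care about.
        with open(os.path.join(workdir, member), "w", encoding="utf-8") as f:
            f.write(new_text)

        # Repack the whole tree into a fresh archive next to the original.
        repacked = archive_path + ".tmp"
        with zipfile.ZipFile(repacked, "w", zipfile.ZIP_DEFLATED) as zf:
            for root, _dirs, files in os.walk(workdir):
                for name in files:
                    full = os.path.join(root, name)
                    zf.write(full, os.path.relpath(full, workdir))

        # Swap the new archive in; it would still need a one-time upload.
        os.replace(repacked, archive_path)
```

Even a one-line change rewrites the whole archive, which is the overhead that a mounted filesystem avoids.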

With BlobFuse2, mounted file systems can persist changes to the remote location near-instantaneously, and they are available for blob stores just as they are for file stores. BlobFuse is a virtual file system driver for Azure Blob Storage that exposes existing blob data through the Linux file system. Page blobs are not supported. It uses the open-source libfuse library to communicate with the Linux FUSE kernel module, and it implements filesystem operations by calling the Azure Storage REST APIs. Local file caching improves subsequent access times. A blob container in a remote Azure Data Lake Storage Gen2 account can be mounted on Linux, and its activity and resource usage can be monitored. Version 2 adds more management support through its command-line interface.
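As a rough sketch of how this looks in practice, the snippet below builds the BlobFuse2 mount and unmount commands and then treats the mount point as an ordinary directory. The mount path and config file path are placeholders, the helper names are hypothetical, and the commands are only assembled, not executed. How quickly a write reaches the container depends on the caching settings in the config file.

```python
import os
import shlex


def blobfuse2_mount_command(mount_path, config_file):
    """Build the CLI call that mounts a container described in config_file."""
    return ["blobfuse2", "mount", mount_path, f"--config-file={config_file}"]


def blobfuse2_unmount_command(mount_path):
    """Build the CLI call that tears the mount down again."""
    return ["blobfuse2", "unmount", mount_path]


def append_line(mount_path, relative_name, line):
    """Append to a file under the mount point using plain file I/O.

    Once the container is mounted, no storage SDK call is needed here.
    """
    target = os.path.join(mount_path, relative_name)
    with open(target, "a", encoding="utf-8") as f:
        f.write(line + "\n")
    return target


if __name__ == "__main__":
    # Placeholder paths, assumed for illustration only.
    print(shlex.join(blobfuse2_mount_command("/mnt/blob", "./config.yaml")))
    print(shlex.join(blobfuse2_unmount_command("/mnt/blob")))
```

The commands could be passed to subprocess.run on a machine where BlobFuse2 is installed and the config file holds valid storage account details.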

Azure File Storage, on the other hand, offers file shares in the cloud over the standard SMB protocol, which eases the transition for enterprise applications that rely on file servers. File shares can be mounted from virtual machines running in Azure as well as from on-premises applications that support SMB 3.0.

To mount a file share from a virtual machine running Linux, an SMB/CIFS client must be installed; if the distribution does not ship one, the cifs-utils package provides it. The mount command then takes the filesystem type, the remote location, the options, and the local mount point as parameters. Mounts can be persisted across reboots by adding an entry to the /etc/fstab file.
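The mount command and the /etc/fstab entry described above might be assembled as follows. The storage account name, share name, mount point, and credentials file path are placeholders, the helper names are hypothetical, and the option string is one commonly used combination rather than a required one. The remote path assumes the public Azure endpoint suffix file.core.windows.net.

```python
def _remote(account, share):
    # Assumed endpoint format for the public Azure cloud.
    return f"//{account}.file.core.windows.net/{share}"


def _options(credentials_file):
    # SMB 3.0 with credentials read from a root-only file, not the command line.
    return (
        f"vers=3.0,credentials={credentials_file},"
        "dir_mode=0777,file_mode=0777,serverino"
    )


def cifs_mount_command(account, share, mount_point, credentials_file):
    """Build: mount -t <type> <remote location> <local path> -o <options>."""
    return [
        "sudo", "mount", "-t", "cifs",
        _remote(account, share), mount_point,
        "-o", _options(credentials_file),
    ]


def fstab_entry(account, share, mount_point, credentials_file):
    """Build the /etc/fstab line that re-creates the mount after a reboot."""
    # nofail keeps the boot going if the share is unreachable.
    return (
        f"{_remote(account, share)} {mount_point} cifs "
        f"nofail,{_options(credentials_file)} 0 0"
    )


if __name__ == "__main__":
    args = ("mystorageacct", "myshare", "/mnt/myshare",
            "/etc/smbcredentials/mystorageacct.cred")
    print(" ".join(cifs_mount_command(*args)))
    print(fstab_entry(*args))
```

Keeping the account key in a credentials file rather than in the option string keeps it out of shell history and out of the world-readable /etc/fstab.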

Lastly, as with all cloud resources and operations, all activity can be logged and monitored. These resources support role-based access control, which needs only a one-time setup, and control plane operations can be automated through the command-line interface, REST API calls, user-interface automation, and Software Development Kits in various languages.

Previous write-up: IaCResolutionsPart135.docx

