Saturday, March 23, 2024
This is a continuation of a series of articles on Infrastructure-as-Code (IaC), its shortcomings and resolutions. Full-service IaC does not stop at the provisioning of resources. The trust that clients place in an IaC-based deployment service is that their use cases will be enabled and remain operational without hassle. As an example, since we were discussing the Azure Machine Learning workspace, one of the use cases is to draw data from sources other than Azure-provided storage accounts, such as Snowflake. Running Snowflake workloads on this workspace requires the PySpark library, support from the Java and Scala runtimes, and jars specific to Snowflake.

This means that the workspace deployment is only complete when the necessary prerequisites are installed. If the built-in images do not support them, some customization is required, and in many cases this comes back to IaC configuration, with as much automation as possible through the inclusion of scripts.

In the case of the machine learning workspace, a custom kernel might be required to support Snowflake workloads. Such a kernel can be installed by passing in an initialization script that writes out a kernel specification as a YAML file, which can in turn be used to create and activate the kernel. Additionally, the Snowflake-specific jars can be downloaded, including their common library, support for Spark code execution, and the official Scala language jars.

Such a kernel specification might look something like this (note that pyspark is pinned once, under pip, rather than listed as both a conda and a pip dependency):

name: customkernel
channels:
  - conda-forge
  - defaults
dependencies:
  - python=3.11
  - numpy
  - pip
  - pip:
    - azureml-core
    - ipython
    - ipykernel
    - pyspark==3.5.1
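As a minimal sketch, an initialization script along these lines could write the specification above and register the kernel. The file paths, the REGISTER_KERNEL switch, and the jar-download step are assumptions for illustration, not part of the workspace defaults:

```shell
#!/bin/bash
# Sketch of a compute-instance initialization script.
# Paths and the REGISTER_KERNEL flag are assumptions for this example.
set -euo pipefail

ENV_FILE="${ENV_FILE:-./customkernel.yml}"
JAR_DIR="${JAR_DIR:-./snowflake-jars}"

# Write the conda environment specification shown above.
cat > "$ENV_FILE" <<'EOF'
name: customkernel
channels:
  - conda-forge
  - defaults
dependencies:
  - python=3.11
  - numpy
  - pip
  - pip:
    - azureml-core
    - ipython
    - ipykernel
    - pyspark==3.5.1
EOF

# Create the environment and register it as a Jupyter kernel.
# Set REGISTER_KERNEL=1 on the compute instance; left off by default
# so the script can be dry-run where conda is unavailable.
if [ "${REGISTER_KERNEL:-0}" = "1" ] && command -v conda >/dev/null 2>&1; then
  conda env create -f "$ENV_FILE"
  conda run -n customkernel python -m ipykernel install \
    --user --name customkernel --display-name "Python (customkernel)"
fi

# The Snowflake jars would be downloaded into this directory, e.g. with curl;
# the exact artifact names and versions depend on the Spark and Scala versions in use.
mkdir -p "$JAR_DIR"

echo "kernel spec written to $ENV_FILE"
```

On an Azure Machine Learning compute instance, a script like this would typically run as the creation or startup script, so the kernel is ready before the first notebook opens.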

When the Spark session is started, its configuration can include the path to these jars. These additional steps must be taken to go the full length of onboarding customer workloads. Previous article references: IacResolutionsPart97.docx
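A sketch of passing the jar paths into the session configuration follows; the jar directory and file names are placeholders, and the session itself would only be started on the workspace where pyspark is installed:

```python
import os

# Hypothetical jar directory and artifact names; the real file names and
# versions must match the Spark and Scala versions on the workspace.
def snowflake_spark_conf(jar_dir="/opt/snowflake/jars"):
    jars = [
        os.path.join(jar_dir, name)
        for name in ("snowflake-jdbc.jar", "spark-snowflake.jar")
    ]
    # Spark loads extra jars from the comma-separated spark.jars setting.
    return {"spark.jars": ",".join(jars)}

conf = snowflake_spark_conf()
print(conf["spark.jars"])

# On the workspace itself, the session would be built with this configuration:
try:
    from pyspark.sql import SparkSession
    builder = SparkSession.builder.appName("snowflake-demo")
    for key, value in conf.items():
        builder = builder.config(key, value)
    # spark = builder.getOrCreate()  # start the session on the compute instance
except ImportError:
    pass  # pyspark not installed in this environment
```

With the session running, reads against Snowflake would then go through the connector's data source format, with account, warehouse, and credential options supplied per workload.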
