This is a continuation of previous articles on enabling Spark, Scala, and Snowflake code to execute on Azure Machine Learning Compute. The built-in Jupyter kernel of “Azure ML – Python 3.8” does not include PySpark, and we discussed the options of downloading version-compatible jars as well as alternative code to get data from Snowflake.
In this article, we review the steps to set up a Jupyter notebook for Snowpark Scala. The Almond kernel adds Scala support to Jupyter, and coursier can be used to install Almond with a supported version of Scala. The Almond kernel requires that a Java Virtual Machine be installed on the system, which can be done by installing AdoptOpenJDK version 8. Coursier is a Scala application that makes it easy to manage artifacts; it can set up the Scala development environment by downloading and caching artifacts from the web. Almond is then installed by downloading the coursier release and running the executable with command-line parameters that select the Almond and Scala versions.
The Jupyter notebook for Snowpark can then be configured by defining a variable that holds the path of a directory for the classes generated by the Scala REPL, and creating that directory. The Scala REPL compiles the Scala code that the user writes into classes, so the compiler must be pointed at this directory, and the configuration is not complete until the directory is also added as a dependency of the REPL interpreter.
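A minimal sketch of this configuration is shown below; it assumes the interp and os objects that the Almond/Ammonite kernel exposes in a notebook cell, and the directory name is arbitrary:

import java.nio.file.Files

// Create a directory to hold the classes that the Scala REPL generates.
val replClassPathObj = Files.createTempDirectory("repl_classes")
val replClassPath = replClassPathObj.toString

// interp and os are provided by the Almond/Ammonite kernel.
// Point the compiler's output at that directory.
interp.configureCompiler(_.settings.outputDirs.setSingleOutput(replClassPath))

// Add the directory as a dependency of the REPL interpreter.
interp.load.cp(os.Path(replClassPathObj))

Next, we create a new Snowpark session, as in the example below: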
import $ivy.`com.snowflake:snowpark:1.12.0`
import com.snowflake.snowpark._
import com.snowflake.snowpark.functions._

val session = Session.builder.configs(Map(
  "URL" -> "https://<account_identifier>.snowflakecomputing.com",
  "USER" -> "<username>",
  "PASSWORD" -> "<password>",
  "ROLE" -> "<role_name>",
  "WAREHOUSE" -> "<warehouse_name>",
  "DB" -> "<database_name>",
  "SCHEMA" -> "<schema_name>"
)).create

session.addDependency(replClassPath)
The Ammonite kernel classes must then also be added as dependencies, so that the classes behind the code written in the notebook are available when it executes in Snowflake.
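One way to do this is with a small helper that locates the jar containing a given class and registers it as a dependency; the class names passed to it below are assumptions that depend on the installed Almond/Ammonite version:

// Find the jar (or class directory) that contains the named class and
// register it as a Snowpark dependency; returns the resolved path.
def addClass(session: Session, className: String): String = {
  val cls = Class.forName(className)
  val resourceName = "/" + cls.getName.replace(".", "/") + ".class"
  val url = cls.getResource(resourceName)
  // A URL such as jar:file:/path/to/some.jar!/pkg/Cls.class reduces to /path/to/some.jar
  val path = url.getPath.split(":").last.split("!").head
  session.addDependency(path)
  path
}

// Kernel classes assumed to be needed; adjust to the installed Almond version.
addClass(session, "ammonite.repl.ReplBridge$")
addClass(session, "ammonite.interp.api.APIHolder")
addClass(session, "pprint.TPrintColors")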
The session can then be used to run a SQL query and populate a DataFrame, which can be used independently of the data source.
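For example, an illustrative query (any valid SQL against the configured database would do) can be collected and displayed as follows:

// Run a query against Snowflake and bring the results back as a DataFrame.
val df = session.sql("SELECT CURRENT_WAREHOUSE(), CURRENT_DATABASE(), CURRENT_SCHEMA()")
df.show()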
Previous articles: IaCResolutionsPart107.docx