This is a continuation of previous articles on enabling Spark, Scala, and Snowflake code to execute on Azure Machine Learning Compute. The built-in Jupyter kernel of “Azure ML – Python 3.8” does not include PySpark, and we discussed the options of downloading version-compatible jars as well as alternative code to get data from Snowflake.
In this article, we review the steps to set up a Jupyter notebook for Snowpark Scala. The Almond kernel adds Scala support to Jupyter, and coursier can be used to install Almond with a supported version of Scala. The Almond kernel requires that a Java Virtual Machine be installed on the system, which can be done by installing AdoptOpenJDK version 8. Coursier is a Scala application that makes it easy to manage artifacts; it can set up the Scala development environment by downloading and caching artifacts from the web. Almond is then installed by downloading the coursier release and running the executable with command-line parameters that select the Almond and Scala versions.
The Jupyter notebook for Snowpark can then be configured by defining a variable that holds the path of a directory for the classes generated by the Scala REPL, and creating that directory. The Scala REPL compiles the Scala code that the user writes into classes, so the compiler must be pointed at this directory, and the configuration is not complete until the directory is also added as a dependency of the REPL interpreter.
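A minimal sketch of this configuration is shown below; it assumes the interp and os objects that the Almond/Ammonite kernel exposes in a notebook cell, and the directory name is arbitrary:

import java.nio.file.Files

// Create a directory to hold the classes that the Scala REPL generates.
val replClassPathObj = Files.createTempDirectory("repl_classes")
val replClassPath = replClassPathObj.toString

// interp and os are provided by the Almond/Ammonite kernel.
// Point the compiler's output at that directory.
interp.configureCompiler(_.settings.outputDirs.setSingleOutput(replClassPath))

// Add the directory as a dependency of the REPL interpreter.
interp.load.cp(os.Path(replClassPathObj))

Next, we create a new Snowpark session, as in the example below: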
import $ivy.`com.snowflake:snowpark:1.12.0`
import com.snowflake.snowpark._
import com.snowflake.snowpark.functions._

val session = Session.builder.configs(Map(
  "URL" -> "https://<account_identifier>.snowflakecomputing.com",
  "USER" -> "<username>",
  "PASSWORD" -> "<password>",
  "ROLE" -> "<role_name>",
  "WAREHOUSE" -> "<warehouse_name>",
  "DB" -> "<database_name>",
  "SCHEMA" -> "<schema_name>"
)).create

session.addDependency(replClassPath)
The Ammonite kernel classes must then also be added as dependencies, so that the classes behind the code written in the notebook are available when it executes in Snowflake.
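One way to do this is with a small helper that locates the jar containing a given class and registers it as a dependency; the class names passed to it below are assumptions that depend on the installed Almond/Ammonite version:

// Find the jar (or class directory) that contains the named class and
// register it as a Snowpark dependency; returns the resolved path.
def addClass(session: Session, className: String): String = {
  val cls = Class.forName(className)
  val resourceName = "/" + cls.getName.replace(".", "/") + ".class"
  val url = cls.getResource(resourceName)
  // A URL such as jar:file:/path/to/some.jar!/pkg/Cls.class reduces to /path/to/some.jar
  val path = url.getPath.split(":").last.split("!").head
  session.addDependency(path)
  path
}

// Kernel classes assumed to be needed; adjust to the installed Almond version.
addClass(session, "ammonite.repl.ReplBridge$")
addClass(session, "ammonite.interp.api.APIHolder")
addClass(session, "pprint.TPrintColors")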
The session can then be used to run a SQL query and populate a DataFrame, which can be used independently of the data source.
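For example, an illustrative query (any valid SQL against the configured database would do) can be collected and displayed as follows:

// Run a query against Snowflake and bring the results back as a DataFrame.
val df = session.sql("SELECT CURRENT_WAREHOUSE(), CURRENT_DATABASE(), CURRENT_SCHEMA()")
df.show()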
Previous articles: IaCResolutionsPart107.docx