Thursday, April 18, 2024


This is a continuation of previous articles on IaC shortcomings and resolutions. No infrastructure is useful without consideration for usability. As with the earlier example of using an Azure Machine Learning workspace to train models against a Snowflake data source, some thought must be given to how end-users connect to the data source and import data. We cited the need to reconcile versions among the Spark, Scala, and Snowflake libraries within the kernel so that data can be imported into a dataframe for use with SQL, and this can be difficult for end-users if they must locate and download the jars themselves.

One way to resolve this would be to use a different coding style as shown below:

import snowflake.connector

# Connect directly with the Snowflake Python connector using
# key-pair authentication; no Spark or jar dependencies required.
conn = snowflake.connector.connect(
    account='<snowflake_account>',
    host='<account>.east-us-2.azure.snowflakecomputing.com',
    user='<login>',
    private_key=<bytes-to-private_key>,
    role='<data-scientist-role>',
    warehouse='<name-of-warehouse>',
    database='<demo_db>',
    schema='<demo_schema>'
)

cursor = conn.cursor()
cursor.execute('select ID from <demo_table> limit 10')
rows = cursor.fetchall()
for row in rows:
    print(row)
cursor.close()
conn.close()
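If pandas is available in the kernel, the connector can also return the result set directly as a dataframe, which serves the dataframe-and-SQL use case mentioned earlier. The following is a minimal sketch, assuming a connection conn opened as above (and run before conn.close()) and that snowflake-connector-python was installed with the [pandas] extra:

# Assumes conn is the connection created above and that
# snowflake-connector-python was installed with the [pandas]
# extra, which provides fetch_pandas_all().
cursor = conn.cursor()
cursor.execute('select ID from <demo_table> limit 10')
df = cursor.fetch_pandas_all()  # result set as a pandas DataFrame
print(df.head())
cursor.close()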


When compared to the following code:

from pyspark import SparkConf
from pyspark.sql import SparkSession

# The SparkConf carries any additional cluster settings; it is
# defined here only so the builder below is self-contained.
conf = SparkConf()

spark = (
    SparkSession.builder
    .appName('SnowflakeSample')
    .config("spark.jars", ",".join([
        "/anaconda/envs/azureml_py38/lib/python3.8/site-packages/pyspark/jars/snowflake-jdbc-3.12.2.jar",
        "/anaconda/envs/azureml_py38/lib/python3.8/site-packages/pyspark/jars/snowflake-ingest-sdk-0.9.6.jar",
        "/anaconda/envs/azureml_py38/lib/python3.8/site-packages/pyspark/jars/spark-snowflake_2.13-2.11.3-spark_3.3.jar",
        "/anaconda/envs/azureml_py38/lib/python3.8/site-packages/pyspark/jars/scala-library-2.12.19.jar",
        "/anaconda/envs/azureml_py38/lib/python3.8/site-packages/pyspark/jars/hadoop-azure-3.2.1.jar",
        "/anaconda/envs/azureml_py38/lib/python3.8/site-packages/pyspark/jars/azure-storage-7.0.0.jar"]))
    .config(conf=conf)
    .getOrCreate()
)
print(spark.version)

sfOptions = {
    "sfUser": "<login>",
    "sfURL": "<account>.east-us-2.azure.snowflakecomputing.com",
    "sfRole": "<data-scientist-role>",
    "sfWarehouse": "<warehouse>",
    "sfDatabase": "<demo_database>",
    "sfSchema": "<demo_schema>",
    "pem_private_key": <private-key-bytes>
}

SNOWFLAKE_SOURCE_NAME = "net.snowflake.spark.snowflake"
query = 'select ID from <demo_table> limit 10'
df = spark.read.format(SNOWFLAKE_SOURCE_NAME) \
    .options(**sfOptions) \
    .option("query", query) \
    .load()
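Once loaded, the dataframe can be registered as a temporary view and queried with Spark SQL, which is the end-user workflow described earlier (the view name snowflake_sample below is arbitrary):

# Register the dataframe as a temporary view so it can be queried
# with standard SQL through the same Spark session.
df.createOrReplaceTempView("snowflake_sample")
spark.sql("select ID from snowflake_sample limit 10").show()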


It becomes clear that the spark.read.format call is difficult to run without the proper jars passed in the configuration.
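One way a provided sample can reduce that burden is to let Spark resolve the dependencies from Maven coordinates rather than local jar paths. This is a sketch, not the configuration cited above; it assumes the compute has outbound access to Maven Central and that the coordinates (derived from the jar names above) match the kernel's Spark and Scala versions:

from pyspark.sql import SparkSession

# spark.jars.packages downloads the connector and JDBC driver from
# Maven Central at session startup, so end-users do not have to
# locate and copy jars themselves.
spark = (
    SparkSession.builder
    .appName('SnowflakeSample')
    .config("spark.jars.packages",
            "net.snowflake:spark-snowflake_2.13:2.11.3-spark_3.3,"
            "net.snowflake:snowflake-jdbc:3.12.2")
    .getOrCreate()
)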

Therefore, working samples are an important deliverable to provide along with the infrastructure.

Previous articles: IaCResolutionsPart106.docx
