This article continues the previous articles on IaC shortcomings and resolutions. No infrastructure is useful without consideration for usability. As with the earlier example of using an Azure Machine Learning workspace to train models against a Snowflake data source, some thought must be given to how end-users connect to the data source and import data. We previously cited resolving versions among the Spark, Scala, and Snowflake libraries within the kernel so that data could be imported into a dataframe for use with SQL, and this can be difficult for end-users if they must locate and download the jars themselves.
One way to resolve this is to use a different coding style that relies on the Snowflake Python connector instead of Spark, as shown below:
import snowflake.connector

# Connect with the Snowflake Python connector using key-pair authentication.
conn = snowflake.connector.connect(
    account='<snowflake_account>',
    host='<account>.east-us-2.azure.snowflakecomputing.com',
    user='<login>',
    private_key=<bytes-to-private_key>,
    role='<data-scientist-role>',
    warehouse='<name-of-warehouse>',
    database='<demo_db>',
    schema='<demo_schema>'
)
cursor = conn.cursor()
cursor.execute('select ID from <table> limit 10')
rows = cursor.fetchall()
for row in rows:
    print(row)
cursor.close()
conn.close()
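Since the end goal is a dataframe, the same connector can also return query results directly as a pandas DataFrame. The following is a minimal sketch, assuming snowflake-connector-python is installed with its pandas extra (pip install "snowflake-connector-python[pandas]"); the placeholders mirror the snippet above:

import snowflake.connector

# Sketch: fetch query results straight into a pandas DataFrame,
# avoiding Spark and its jar dependencies entirely.
# Assumes: pip install "snowflake-connector-python[pandas]"
conn = snowflake.connector.connect(
    account='<snowflake_account>',
    user='<login>',
    private_key=<bytes-to-private_key>,
    role='<data-scientist-role>',
    warehouse='<name-of-warehouse>',
    database='<demo_db>',
    schema='<demo_schema>'
)
cursor = conn.cursor()
cursor.execute('select ID from <table> limit 10')
df = cursor.fetch_pandas_all()  # returns a pandas DataFrame
print(df.head())
cursor.close()
conn.close()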
When compared to the following code:
from pyspark.sql import SparkSession
from pyspark import SparkConf

# Placeholder for any additional Spark settings required by the workspace.
conf = SparkConf()
spark = (
    SparkSession.builder
    .appName('SnowflakeSample')
    .config("spark.jars", "/anaconda/envs/azureml_py38/lib/python3.8/site-packages/pyspark/jars/snowflake-jdbc-3.12.2.jar,/anaconda/envs/azureml_py38/lib/python3.8/site-packages/pyspark/jars/snowflake-ingest-sdk-0.9.6.jar,/anaconda/envs/azureml_py38/lib/python3.8/site-packages/pyspark/jars/spark-snowflake_2.13-2.11.3-spark_3.3.jar,/anaconda/envs/azureml_py38/lib/python3.8/site-packages/pyspark/jars/scala-library-2.12.19.jar,/anaconda/envs/azureml_py38/lib/python3.8/site-packages/pyspark/jars/hadoop-azure-3.2.1.jar,/anaconda/envs/azureml_py38/lib/python3.8/site-packages/pyspark/jars/azure-storage-7.0.0.jar")
    .config(conf=conf)
    .getOrCreate()
)
print(spark.version)
sfOptions = {
    "sfUser": "<login>",
    "sfURL": "<account>.east-us-2.azure.snowflakecomputing.com",
    "sfRole": "<data-scientist-role>",
    "sfWarehouse": "<warehouse>",
    "sfDatabase": "<demo_database>",
    "sfSchema": "<demo_schema>",
    "pem_private_key": "<pem-private-key>"
}
SNOWFLAKE_SOURCE_NAME = "net.snowflake.spark.snowflake"
query = 'select ID from <table> limit 10'
df = spark.read.format(SNOWFLAKE_SOURCE_NAME) \
    .options(**sfOptions) \
    .option("query", query) \
    .load()
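Once loaded, the DataFrame can be registered as a temporary view so the rest of the notebook can query it with Spark SQL, which is the usage the introduction alluded to. A short sketch follows; the view name is illustrative:

# Register the loaded DataFrame for Spark SQL; the view name is illustrative.
df.createOrReplaceTempView("snowflake_sample")
spark.sql("select count(*) from snowflake_sample").show()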
Even so, it becomes clear that the spark.read.format approach is difficult to run without the proper jars passed in the Spark configuration. Working samples like these should therefore be provided along with the infrastructure.
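One way the infrastructure can ship such a sample without asking end-users to locate and download jars themselves is to let Spark resolve them from Maven at session startup via spark.jars.packages. A sketch follows; the Maven coordinates are illustrative and must be matched to the kernel's Spark and Scala versions:

from pyspark.sql import SparkSession

# Sketch: resolve the Snowflake connector jars from Maven Central at startup
# instead of pointing spark.jars at locally downloaded files. The coordinates
# below are illustrative and must match the kernel's Spark/Scala versions.
spark = (
    SparkSession.builder
    .appName('SnowflakeSample')
    .config("spark.jars.packages",
            "net.snowflake:spark-snowflake_2.12:2.11.3-spark_3.3,"
            "net.snowflake:snowflake-jdbc:3.13.30")
    .getOrCreate()
)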
Previous articles: IaCResolutionsPart106.docx