Saturday, April 20, 2024

This is a continuation of previous articles on IaC shortcomings and their resolutions. No infrastructure is useful without consideration for usability. As with the earlier example of using an Azure Machine Learning workspace to train models against a Snowflake data source, some thought must be given to connecting to the data source and importing its data. We cited resolving versions between the Spark, Scala, and Snowflake libraries within the kernel so that data can be imported into a dataframe and queried with SQL, and this can be difficult for end-users if they must locate and download the jars themselves. While the infrastructure could provide pre-configured kernels, such as the Almond kernel with the appropriate Scala jars, a few samples might ease the task for data scientists wrangling Snowflake data on existing workspaces.
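
For instance, a SparkSession can pull the Spark-Snowflake connector and JDBC driver at startup and read a table directly into a dataframe for SQL use. The following is only a minimal sketch: the package versions, account URL, credentials, and table name are placeholders, and the connector version chosen must match the Spark and Scala versions on the kernel.

from pyspark.sql import SparkSession

# Placeholder coordinates: pick connector/driver versions that match the
# kernel's Spark and Scala build.
packages = "net.snowflake:spark-snowflake_2.12:<version>,net.snowflake:snowflake-jdbc:<version>"

spark = (SparkSession.builder
         .appName("SnowflakeDirectRead")
         .config("spark.jars.packages", packages)
         .getOrCreate())

# Placeholder connection options for the Snowflake Spark connector.
sf_options = {
    "sfURL": "<account>.snowflakecomputing.com",
    "sfUser": "<user>",
    "sfPassword": "<password>",
    "sfDatabase": "<database>",
    "sfSchema": "<schema>",
    "sfWarehouse": "<warehouse>",
}

df = (spark.read
      .format("net.snowflake.spark.snowflake")
      .options(**sf_options)
      .option("dbtable", "<table>")
      .load())

# Register the dataframe so it can be queried with SQL.
df.createOrReplaceTempView("sf_table")
spark.sql("SELECT * FROM sf_table LIMIT 10").show()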

For example, they could stage their work in multiple steps: first pulling the data from Snowflake and then loading it into a dataframe.

Here is sample code to do so:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, BinaryType

spark = SparkSession.builder.appName("BytesToDataFrame").getOrCreate()

# Sample raw bytes (replace with your actual data from Snowflake using a
# snowflake-connector cursor)
# https://docs.snowflake.com/en/developer-guide/python-connector/python-connector-example
# either using
# a)    df = pd.DataFrame(cursor.fetchall())
# or
# b)    df = cursor.fetch_pandas_all()
# or
# c)    raw bytes, as in the example below
raw_bytes = [b'\xba\xed\x85\x8e\x91\xd4\xc7\xb0', b'\xba\xed\x85\x8e\x91\xd4\xc7\xb1']

# Each row must be a tuple to match the one-column schema, and raw bytes map
# to BinaryType rather than StringType.
schema = StructType([StructField("id", BinaryType(), True)])
rdd = spark.sparkContext.parallelize([(value,) for value in raw_bytes])
df = spark.createDataFrame(rdd, schema=schema)
df.show()


In this example, placeholder bytes stand in for data that would first be retrieved with a cursor and then loaded into a dataframe.
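
To make the cursor route concrete, the following sketch combines option b) above with the same SparkSession. The connection parameters and query are placeholders, and it assumes snowflake-connector-python is installed with its pandas extras.

import snowflake.connector
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SnowflakeCursorToDataFrame").getOrCreate()

# Placeholder credentials; in practice these would come from a secret store.
conn = snowflake.connector.connect(
    account="<account>",
    user="<user>",
    password="<password>",
    warehouse="<warehouse>",
    database="<database>",
    schema="<schema>",
)

try:
    cursor = conn.cursor()
    cursor.execute("SELECT * FROM <table> LIMIT 1000")  # placeholder query
    pandas_df = cursor.fetch_pandas_all()               # option b) above
finally:
    conn.close()

# Convert the pandas result into a Spark dataframe for SQL-style use.
df = spark.createDataFrame(pandas_df)
df.show()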

