Tuesday, April 9, 2024

Question: How do you execute Apache Spark code as jobs on a non-interactive cluster in an Azure Machine Learning Workspace?

Answer: Unlike compute instances in an Azure Machine Learning workspace, compute clusters used for non-interactive jobs do not accept initialization or startup scripts to configure libraries or packages on the nodes. Running a sample Spark snippet such as the one below on such a cluster will therefore typically fail with a JAVA_HOME-not-set error or a JAVA_GATEWAY_EXITED error.

from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# check that it really works by running a job
# example from http://spark.apache.org/docs/latest/rdd-programming-guide.html#parallelized-collections
data = range(10000)
distData = sc.parallelize(data)
result = distData.filter(lambda x: not x & 1).take(10)
print(result)
# Out: [0, 2, 4, 6, 8, 10, 12, 14, 16, 18]

To execute Apache Spark code in non-interactive jobs in an Azure Machine Learning Workspace, we build custom environments. Custom environments let us specify the dependencies, packages, and configuration needed to run our Spark code.

Here's a step-by-step guide on how to build custom environments for executing Apache Spark code in Azure Machine Learning Workspace:

Define the environment: Start by defining the environment dependencies in a conda or pip environment file. Specify the required Python version, Spark version, and any additional packages or libraries your code needs. For example, when creating an environment based on an existing curated one, choose mldesigner:23 and customize its Conda specification as follows:

name: MyCustomEnvironment
channels:
  - conda-forge
  - defaults
dependencies:
  - python=3.8
  - numpy
  - pyspark
  - pip
  - pip:
    - azureml-core
    - ipython
    - ipykernel
    - pyspark
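
A minimal sketch of turning this specification into an Azure Machine Learning environment with the azureml-core (v1) SDK is shown below; the file name conda.yml is an assumption for illustration.

from azureml.core import Workspace, Environment

# Connect to the workspace described by the local config.json
ws = Workspace.from_config()

# Build an environment object from the conda specification file above
# (assumed here to be saved locally as conda.yml)
custom_env = Environment.from_conda_specification(
    name="MyCustomEnvironment",
    file_path="conda.yml",
)

# Registering the environment makes it reusable across jobs in the workspace
custom_env.register(workspace=ws)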


Specify the environment in the job configuration: When submitting a Spark job in the Azure Machine Learning Workspace, we reference the custom environment in the job configuration so that the job executes with the desired environment. Jobs must be submitted as part of an experiment, which helps with organizing and locating jobs in their listing; note that Experiments are distinct from Environments.
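
As a sketch (the script name spark_job.py and the compute cluster name cpu-cluster below are hypothetical placeholders), the custom environment is attached to the run configuration like this:

from azureml.core import ScriptRunConfig

run_config = ScriptRunConfig(
    source_directory=".",          # folder containing the Spark script
    script="spark_job.py",         # hypothetical script holding the PySpark code shown earlier
    compute_target="cpu-cluster",  # hypothetical name of the non-interactive compute cluster
    environment=custom_env,        # the custom environment registered above
)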

Execute the job: Submit the Spark job using the Azure Machine Learning SDK or the Azure portal. The job runs in the specified environment, so all required dependencies are available.
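
Continuing the same sketch, the run configuration is submitted under an experiment (the experiment name below is arbitrary), and the job then runs on the cluster with the custom environment:

from azureml.core import Experiment

experiment = Experiment(workspace=ws, name="spark-on-compute-cluster")
run = experiment.submit(run_config)

# Stream the job logs until it completes
run.wait_for_completion(show_output=True)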

By building custom environments, we ensure that our Spark code runs consistently and reproducibly in the Azure Machine Learning Workspace, regardless of the underlying infrastructure or dependencies.

The Azure Machine Learning Workspace also provides pre-built, curated environments with popular data science and machine learning frameworks such as Spark, TensorFlow, and PyTorch. These environments are optimized, ready to use out of the box, and helpful for training models. Preferring a built-in environment over a custom one also means it is maintained and updated automatically.

