Sunday, March 24, 2024

 

Spark code execution on an Azure Machine Learning workspace allows us to leverage the power of Apache Spark for big data processing and analytics tasks. Here are some key points to know about Spark code execution on an Azure Machine Learning workspace:

1. Integration: The Azure Machine Learning workspace provides seamless integration with Apache Spark, allowing us to run Spark code within the workspace. This integration simplifies the process of running Spark jobs, as we don't need to set up and manage a separate Spark cluster.

2. Scalability: The Azure Machine Learning workspace enables us to scale our Spark jobs easily. We can choose the appropriate cluster size based on our workload requirements, and Azure will automatically provision and manage the necessary resources. This scalability ensures that we can handle large-scale data processing tasks efficiently.

3. Notebook support: The Azure Machine Learning workspace supports Jupyter notebooks, which are commonly used for interactive data exploration and analysis with Spark. We can write and execute Spark code in a Jupyter notebook within the workspace, making it convenient to prototype and experiment with our Spark code.

4. Parallelism and distributed computing: Spark code execution on an Azure Machine Learning workspace takes advantage of the parallel processing capabilities of Spark. It allows us to distribute our data across multiple nodes in a cluster and perform computations in parallel, thereby accelerating the processing of large datasets.

5. Data integration: The Azure Machine Learning workspace provides easy integration with various data sources, including Azure Data Lake Storage, Azure Blob Storage, and Azure SQL Database. We can seamlessly read data from these sources into Spark, perform transformations and analytics, and write the results back to the desired output location, as in the sketch after this list.

6. Monitoring and management: The Azure Machine Learning workspace offers monitoring and management capabilities for Spark code execution. We can track the progress of our Spark jobs, monitor resource usage, and diagnose any issues that may arise. Additionally, we can schedule and automate the execution of Spark jobs using Azure Machine Learning pipelines.

7. Collaboration and version control: The Azure Machine Learning workspace enables collaboration and version control for Spark code. We can work with our team members on Spark projects, track changes made to the code, and manage different versions of our Spark scripts. This facilitates teamwork and ensures that we can easily revert to previous versions if needed.
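To illustrate the data integration point, the following is a minimal sketch of reading a CSV file from Azure Data Lake Storage Gen2 into Spark. The storage account, container, path, and key are placeholders; account-key authentication is shown only for brevity (managed identity or SAS tokens are alternatives).

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("BlobReadSample").getOrCreate()

# Hypothetical storage account and key; replace with your own values.
spark.conf.set(
    "fs.azure.account.key.<storageaccount>.dfs.core.windows.net",
    "<account-key>",
)

# Read a CSV file from an ADLS Gen2 container into a DataFrame.
df = spark.read.csv(
    "abfss://<container>@<storageaccount>.dfs.core.windows.net/data/input.csv",
    header=True,
    inferSchema=True,
)
df.show()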

Overall, Spark code execution on an Azure Machine Learning workspace provides a powerful and flexible platform for running large-scale data processing and analytics workloads using Apache Spark. It simplifies the management of Spark clusters, provides integration with other Azure services, and offers monitoring and collaboration capabilities to streamline our Spark-based projects.

A sample Spark session with the downloaded jars can be created using:

from pyspark import SparkConf
from pyspark.sql import SparkSession

# Any additional settings can be collected in a SparkConf object first.
conf = SparkConf()

spark = (
    SparkSession.builder
    .appName('SnowflakeSample')
    .config("spark.jars", "/anaconda/envs/azureml_py38/lib/python3.8/site-packages/pyspark/jars/snowflake-jdbc-3.13.29.jar,/anaconda/envs/azureml_py38/lib/python3.8/site-packages/pyspark/jars/spark-snowflake_2.13-2.13.0-spark_3.3.jar,/anaconda/envs/azureml_py38/lib/python3.8/site-packages/pyspark/jars/snowflake-common-3.1.19.jar,/anaconda/envs/azureml_py38/lib/python3.8/site-packages/pyspark/jars/scala-library-2.13.9.jar")
    .config(conf=conf)
    .getOrCreate()
)
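With the session in place, data can be read from Snowflake through the connector. Below is a minimal sketch assuming placeholder connection details; the option names belong to the Snowflake Spark connector, and the query is illustrative.

sf_options = {
    "sfURL": "<account>.snowflakecomputing.com",
    "sfUser": "<user>",
    "sfPassword": "<password>",
    "sfDatabase": "<database>",
    "sfSchema": "<schema>",
    "sfWarehouse": "<warehouse>",
}

# Read the result of a query into a Spark DataFrame via the connector.
df = (
    spark.read.format("net.snowflake.spark.snowflake")
    .options(**sf_options)
    .option("query", "SELECT CURRENT_TIMESTAMP()")
    .load()
)
df.show()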

 

Saturday, March 23, 2024

 

This is a continuation of a series of articles on Infrastructure-as-Code (IaC), its shortcomings and resolutions. IaC full service does not stop at just the provisioning of resources. The trust that clients place in an IaC-based deployment service is that their use cases will be enabled and remain operational without hassle. As an example, since we were discussing the Azure Machine Learning workspace, one of the use cases is to draw data from sources other than Azure-provided storage accounts, such as Snowflake. Running Snowflake workloads on this workspace requires the PySpark library, support from Java and Scala, and jars specific to Snowflake.

This means that the workspace deployment is only complete when the necessary prerequisites are installed. If the built-in environment does not support them, some customization is required, and in many cases this comes back to IaC configuration even where automation is possible via the inclusion of scripts.

In the case of the machine learning workspace, a custom kernel might be required to support Snowflake workloads. Such a kernel can be installed by passing in an initialization script that writes out a kernel specification as a yaml file, which can in turn be used to initialize and activate the kernel. Additionally, the Snowflake-specific jars can be downloaded, including their common library, the Spark connector, and the official Scala language jars.

Such a kernel specification might look something like this:

name: customkernel
channels:
  - conda-forge
  - defaults
dependencies:
  - python=3.11
  - numpy
  - pip
  - pip:
    - azureml-core
    - ipython
    - ipykernel
    - pyspark==3.5.1
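An initialization script along these lines could then create the environment and register the kernel. This is a sketch assuming the specification above is saved as customkernel.yml; the conda and ipykernel invocations are standard, but the file name and display name are illustrative.

import subprocess

# Create the conda environment from the specification file above.
subprocess.run(["conda", "env", "create", "-f", "customkernel.yml"], check=True)

# Register the environment as a Jupyter kernel so notebooks can select it.
subprocess.run(
    [
        "conda", "run", "-n", "customkernel",
        "python", "-m", "ipykernel", "install",
        "--user", "--name", "customkernel",
        "--display-name", "Custom Kernel (Snowflake)",
    ],
    check=True,
)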

When the Spark session is started, the configuration specified can include the path to the jars, as in the sample session shown earlier. These additional steps must be taken to go the full length of onboarding customer workloads. Previous article references: IacResolutionsPart97.docx

Friday, March 22, 2024

 

You are given a 0-indexed integer array nums of length n.

You can perform the following operation as many times as you want:

  • Pick an index i that you haven’t picked before, and pick a prime p strictly less than nums[i], then subtract p from nums[i].

Return true if you can make nums a strictly increasing array using the above operation and false otherwise.

A strictly increasing array is an array in which each element is strictly greater than its preceding element.

 

Example 1:

Input: nums = [4,9,6,10]

Output: true

Explanation: In the first operation: Pick i = 0 and p = 3, and then subtract 3 from nums[0], so that nums becomes [1,9,6,10].

In the second operation: i = 1, p = 7, subtract 7 from nums[1], so nums becomes equal to [1,2,6,10].

After the second operation, nums is sorted in strictly increasing order, so the answer is true.

Example 2:

Input: nums = [6,8,11,12]

Output: true

Explanation: Initially nums is sorted in strictly increasing order, so we don't need to make any operations.

Example 3:

Input: nums = [5,8,3]

Output: false

Explanation: It can be proven that there is no way to perform operations to make nums sorted in strictly increasing order, so the answer is false.

 

Constraints:

  • 1 <= nums.length <= 1000
  • 1 <= nums[i] <= 1000
  • nums.length == n

class Solution {
    public boolean primeSubOperation(int[] nums) {
        for (int i = 0; i < nums.length; i++) {
            // Each element must end up strictly greater than its predecessor.
            int min = 0;
            if (i > 0) min = Math.max(nums[i - 1], 0);
            int max = nums[i];
            // Greedily subtract the largest prime that keeps nums[i] above min,
            // leaving as much room as possible for the elements that follow.
            int prime = getPrime(min, max);
            nums[i] -= prime;
        }
        return isIncreasing(nums);
    }

    public boolean isIncreasing(int[] nums) {
        for (int i = 1; i < nums.length; i++) {
            if (nums[i] <= nums[i - 1]) {
                return false;
            }
        }
        return true;
    }

    // Largest prime p < max such that max - p > min; 0 if no such prime exists.
    public int getPrime(int min, int max) {
        for (int i = max - 1; i >= 2; i--) {
            if (isPrime(i) && (max - i > min)) {
                return i;
            }
        }
        return 0;
    }

    public boolean isPrime(int n) {
        if (n < 2) return false;
        for (int i = 2; i * i <= n; i++) {
            if (n % i == 0) {
                return false;
            }
        }
        return true;
    }
}
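For quick experimentation, the same greedy idea can be sketched in Python and checked against the three examples above; the function name here is illustrative, not part of any library.

def prime_sub_operation(nums):
    def is_prime(n):
        if n < 2:
            return False
        i = 2
        while i * i <= n:
            if n % i == 0:
                return False
            i += 1
        return True

    prev = 0
    for x in nums:
        # Subtract the largest prime p < x that keeps x - p above prev.
        for p in range(x - 1, 1, -1):
            if x - p > prev and is_prime(p):
                x -= p
                break
        if x <= prev:
            return False
        prev = x
    return True

assert prime_sub_operation([4, 9, 6, 10])
assert prime_sub_operation([6, 8, 11, 12])
assert not prime_sub_operation([5, 8, 3])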