An Azure Machine Learning workspace integrates with an Azure Container Registry, an Azure Key Vault, an Azure Storage account, and Azure Application Insights. Model building requires exploratory data analysis, data preprocessing, and prototyping to validate hypotheses. What sets it apart from other interactive, experimentation-oriented ML platforms such as the Databricks workspace is that it aims to provide a unified, seamless experience, with its own libraries that automate much of the work needed to build a machine learning model that serves the business needs.
For example, the following code automates the creation of the compute needed to build a model:

from azureml.core import Workspace
from azureml.core.compute import AmlCompute, ComputeTarget

# load the workspace from the config file on disk
ws = Workspace.from_config()
amlcompute_cluster_name = "cpu-cluster"
provisioning_config = AmlCompute.provisioning_configuration(vm_size="STANDARD_D2_V2",
                                                            max_nodes=1)
compute_target = ComputeTarget.create(ws, amlcompute_cluster_name, provisioning_config)
compute_target.wait_for_completion(show_output=True, min_node_count=None,
                                   timeout_in_minutes=20)
The compute is part of the workspace, so the Identity and Access Management configured on the workspace is sufficient to create the compute.
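Since the workspace tracks its compute targets, a common pattern is to reuse a cluster when it already exists instead of re-creating it. A minimal sketch of that check, assuming the ws, amlcompute_cluster_name, and provisioning_config variables from the example above:

from azureml.core.compute import ComputeTarget
from azureml.core.compute_target import ComputeTargetException

try:
    # returns the existing cluster if the workspace already has one by this name
    compute_target = ComputeTarget(workspace=ws, name=amlcompute_cluster_name)
    print("Found existing cluster; reusing it.")
except ComputeTargetException:
    compute_target = ComputeTarget.create(ws, amlcompute_cluster_name, provisioning_config)
    compute_target.wait_for_completion(show_output=True)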
Storage accounts, on the other hand, are external and might have their own access restrictions. There are quite a few ways to connect to an external Azure Storage account from an Azure Machine Learning workspace, and this section reviews some of them. All of them require instantiating a Datastore class, which stores the connection information for Azure Storage services.
The first method uses the azureml.core library to instantiate a datastore as follows:

from azureml.core import Workspace, Datastore
from azureml.core.dataset import Dataset
from azureml.data.datapath import DataPath

ws = Workspace.from_config()
datastore = Datastore.register_azure_blob_container(ws,
                                                    datastore_name="ds1",
                                                    container_name="temporary",
                                                    account_name="somestorageaccount",
                                                    sas_token="<for-connecting-with-SAS-URL>")
dataset = Dataset.Tabular.from_parquet_files(
    path=[(datastore, 'temporary/yellow_tripdata_2023-08.parquet')])
# preview the first 3 rows of the dataset
dataset.take(3).to_pandas_dataframe()
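The same Datastore class can register other storage types with other credential kinds. The sketch below is illustrative only; the datastore names, file share name, and account key are placeholders rather than values from this article:

# a blob container secured with an account key instead of a SAS token
Datastore.register_azure_blob_container(ws,
                                        datastore_name="ds2",
                                        container_name="temporary",
                                        account_name="somestorageaccount",
                                        account_key="<storage-account-key>")

# an Azure file share on the same account
Datastore.register_azure_file_share(ws,
                                    datastore_name="ds3",
                                    file_share_name="myshare",
                                    account_name="somestorageaccount",
                                    account_key="<storage-account-key>")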
The Datastore is a resource shared across many usages; it is registered with the ML workspace along with the credentials required to connect to the external storage account.
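Once registered, a datastore can be looked up by name from anywhere in the workspace without re-supplying credentials, as in this minimal sketch:

# retrieve a registered datastore by name, or fall back to the workspace default
datastore = Datastore.get(ws, datastore_name="ds1")
default_ds = ws.get_default_datastore()
# list the names of all datastores registered with the workspace
print(list(ws.datastores))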
The fully qualified URL for locating a blob on the storage account associated with the ML workspace is:

uri = f'azureml://subscriptions/{subscription}/resourcegroups/{resource_group}/workspaces/{workspace}/datastores/{datastore_name}/paths/{path_on_datastore}'
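Assuming the azureml-fsspec package is installed, pandas can read such a URI directly because the package registers a filesystem for the azureml:// scheme; a hedged sketch:

%pip install azureml-fsspec
import pandas as pd

# pandas resolves the azureml:// scheme through fsspec,
# provided path_on_datastore points at a parquet file
df = pd.read_parquet(uri)
df.head(3)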
The notebook executes with the default credentials of the logged-in user, so it is possible to omit the credentials when creating the datastore:
%pip install azure-ai-ml

from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential
from azure.ai.ml import command, Input
from azure.ai.ml.entities import AzureBlobDatastore
from azure.ai.ml.entities import Environment

# subscription, resource_group, workspace and account_name are assumed to be set
ml_client = MLClient(DefaultAzureCredential(), subscription, resource_group, workspace)
blob_credless_datastore = AzureBlobDatastore(
    name="ds4",
    description="Credential-less datastore pointing to a blob container.",
    account_name=account_name,
    container_name="temporary",
)
ml_client.create_or_update(blob_credless_datastore)
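As a quick check, the registration can be read back through the same client; this sketch assumes the ml_client from above. For a credential-less datastore, data access is authorized through the caller's identity, which needs a data-plane role such as Storage Blob Data Reader on the storage account:

# confirm the credential-less datastore was registered
ds = ml_client.datastores.get("ds4")
print(ds.name, ds.account_name, ds.container_name)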
With the help of the datastore, accessing a dataset is as simple as:

datastore = Datastore.get(ws, datastore_name="ds4")
dataset = Dataset.Tabular.from_parquet_files(
    path=[(datastore, 'temporary/yellow_tripdata_2023-08.parquet')])
# preview the first 3 rows of the dataset
dataset.take(3).to_pandas_dataframe()
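If the dataset will be reused, it can also be registered with the workspace under a name so that later runs retrieve it without rebuilding the path; a minimal sketch, where the dataset name is an illustrative choice:

# register the dataset so it can be retrieved by name and versioned
dataset = dataset.register(workspace=ws,
                           name="yellow-tripdata-2023-08",
                           create_new_version=True)
retrieved = Dataset.get_by_name(ws, name="yellow-tripdata-2023-08")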