One of the more recent additions to Azure resources is Azure Machine Learning Studio,
a managed machine learning environment that allows you to create, manage, and
deploy ML models and applications. It is part of the Azure AI resource, which
provides access to multiple Azure AI services with a single setup. Some of the
features of the Azure Machine Learning Studio resource are:
- It has a GUI-based integrated development environment for building machine learning workflows on Azure.
- It supports both no-code and code-first experiences for data science.
- It lets you use various tools and components for data preparation, feature engineering, model training, evaluation, and deployment.
- It enables you to collaborate with your team and share datasets, models, and projects.
- It allows you to configure and manage compute targets, security settings, and external connections.
When provisioning this resource for use by data scientists, it is important to
consider the following best practices:
- The workspace itself must allow outbound connectivity to the public network. One way to do this is to make it accessible from all or selected public IP addresses.
- The clusters must be provisioned with no public IP addresses on their nodes. This conforms to the well-known no-public-IP (NPIP) pattern. It is done by adding the compute to a subnet in a virtual network that has service endpoints for Azure Storage, Key Vault, and Container Registry, along with default routing.
- Since the workspace and its dependent resources, namely the storage account, key vault, container registry, and Application Insights, are created independently, it is helpful to associate the same user-assigned managed identity with all of them. This also makes it convenient to customize data plane access to resources outside this list, such as a different storage account or key vault. The same goes for compute, which can also be launched with this identity.
- Because this is a shared workspace, the permissions granted to the various roles on this resource can be customized and restricted further.
- Code that data scientists execute in this studio falls into one of several categories, such as regular interactive Python notebooks, Spark code, and non-interactive jobs. The permissions necessary to run each of them must be tested independently.
- Various kernels and serverless Spark compute are available to execute the user-defined code in a notebook. The user-assigned managed identity that facilitates data access for this code must have both control plane read access, to perform actions such as getAccessControl, and data plane access, for operations such as blob data read and write. The logged-in user's credentials are then used automatically over the session created with the managed identity, so that the user can perform the data access.
- Non-interactive jobs require a specific permission that lets a user submit a run within an experiment.
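The networking and identity practices above lend themselves to a quick check before anything is provisioned. What follows is only a minimal sketch in plain Python, not an Azure SDK call: the `plan` dictionary, its keys, and the `check_plan` helper are hypothetical names made up for illustration. The service endpoint names follow Azure's naming, while the choice of the built-in roles `Reader` and `Storage Blob Data Contributor` to stand for control plane and data plane access is an assumption that should be adjusted to your own role design.

```python
# Illustrative pre-provisioning checklist. The plan layout is hypothetical,
# not an Azure API schema.

# Service endpoints the compute subnet is expected to carry.
REQUIRED_ENDPOINTS = {
    "Microsoft.Storage",
    "Microsoft.KeyVault",
    "Microsoft.ContainerRegistry",
}

# Assumed built-in roles for control plane read and data plane read/write.
CONTROL_PLANE_ROLE = "Reader"
DATA_PLANE_ROLE = "Storage Blob Data Contributor"


def check_plan(plan):
    """Return a list of findings; an empty list means the plan passes."""
    findings = []
    compute = plan.get("compute", {})

    # NPIP: treat a missing setting as "public IP enabled" to stay on the safe side.
    if compute.get("node_public_ip", True):
        findings.append("compute nodes must not have public IP addresses")

    missing = REQUIRED_ENDPOINTS - set(compute.get("subnet_service_endpoints", []))
    if missing:
        findings.append(
            "subnet is missing service endpoints: " + ", ".join(sorted(missing))
        )

    # The workspace dependencies and the compute should share one identity.
    identities = {r.get("identity") for r in plan.get("resources", [])}
    identities.add(compute.get("identity"))
    if len(identities) != 1 or None in identities:
        findings.append(
            "workspace dependencies and compute should share one "
            "user-assigned managed identity"
        )

    # The shared identity needs both kinds of access.
    roles = set(plan.get("identity_roles", []))
    for role in (CONTROL_PLANE_ROLE, DATA_PLANE_ROLE):
        if role not in roles:
            findings.append("identity is missing role: " + role)

    return findings


if __name__ == "__main__":
    example_plan = {
        "compute": {
            "node_public_ip": False,
            "subnet_service_endpoints": sorted(REQUIRED_ENDPOINTS),
            "identity": "uami-ml",  # hypothetical identity name
        },
        "resources": [
            {"name": "storage", "identity": "uami-ml"},
            {"name": "keyvault", "identity": "uami-ml"},
        ],
        "identity_roles": [CONTROL_PLANE_ROLE],  # data plane role left out on purpose
    }
    for finding in check_plan(example_plan):
        print(finding)
```

Running the script reports the one gap left in the example plan, the missing data plane role. A check like this does not replace trying out each kind of code (notebook, Spark, non-interactive job) against the real workspace; it only catches the obvious omissions early.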
Together, the built-in features and the customizations of this resource can
greatly benefit data scientists as they train their models.
Previous articles: IacResolutionsPart77.docx