Sunday, July 2, 2023

Problem Statement: You are given a 0-indexed integer array nums.

Swaps of adjacent elements may be performed on nums.

A valid array meets the following conditions:

·        The largest element (any of the largest elements if there are multiple) is at the rightmost position in the array.

·        The smallest element (any of the smallest elements if there are multiple) is at the leftmost position in the array.

Return the minimum swaps required to make nums a valid array.

 

Example 1:

Input: nums = [3,4,5,5,3,1]

Output: 6

Explanation: Perform the following swaps:

- Swap 1: Swap the 3rd and 4th elements, nums is then [3,4,5,3,5,1].

- Swap 2: Swap the 4th and 5th elements, nums is then [3,4,5,3,1,5].

- Swap 3: Swap the 3rd and 4th elements, nums is then [3,4,5,1,3,5].

- Swap 4: Swap the 2nd and 3rd elements, nums is then [3,4,1,5,3,5].

- Swap 5: Swap the 1st and 2nd elements, nums is then [3,1,4,5,3,5].

- Swap 6: Swap the 0th and 1st elements, nums is then [1,3,4,5,3,5].

It can be shown that 6 is the minimum number of swaps required to make a valid array.

Example 2:

Input: nums = [9]

Output: 0

Explanation: The array is already valid, so we return 0.

 

Constraints:

·         1 <= nums.length <= 10^5

·         1 <= nums[i] <= 10^5

Solution:

import java.util.Arrays;
import java.util.stream.Collectors;

class Solution {

    public int minimumSwaps(int[] nums) {
        int min = Arrays.stream(nums).min().getAsInt();
        int max = Arrays.stream(nums).max().getAsInt();
        int count = 0;

        // Keep going while either end of the array is not yet in place.
        // A single pass of the two loops below is enough to make the array valid.
        while (nums[0] != min || nums[nums.length - 1] != max) {

            // Bubble the rightmost occurrence of the maximum to the end.
            var numsList = Arrays.stream(nums).boxed().collect(Collectors.toList());
            var end = numsList.lastIndexOf(max);
            for (int i = end; i < nums.length - 1; i++) {
                swap(nums, i, i + 1);
                count++;
            }

            // Bubble the leftmost occurrence of the minimum to the front.
            numsList = Arrays.stream(nums).boxed().collect(Collectors.toList());
            var start = numsList.indexOf(min);
            for (int j = start; j >= 1; j--) {
                swap(nums, j, j - 1);
                count++;
            }
        }

        return count;
    }

    public void swap(int[] nums, int i, int j) {
        int temp = nums[j];
        nums[j] = nums[i];
        nums[i] = temp;
    }
}

 

Input: nums = [3,4,5,5,3,1]
Output: 6
Expected: 6

Input: nums = [9]
Output: 0
Expected: 0
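
The swaps never actually need to be simulated: the rightmost maximum needs (n - 1 - its index) moves to reach the end, the leftmost minimum needs (its index) moves to reach the front, and one swap is saved when the minimum starts to the right of the maximum because the two paths share a swap. A minimal sketch of this direct O(n) computation, written in Python for brevity (the function name minimum_swaps is just illustrative):

def minimum_swaps(nums):
    # Leftmost occurrence of the minimum and rightmost occurrence of the maximum.
    min_idx = nums.index(min(nums))
    max_idx = len(nums) - 1 - nums[::-1].index(max(nums))

    # Moves to bring the minimum to the front plus moves to push the maximum to the back.
    swaps = min_idx + (len(nums) - 1 - max_idx)

    # If the minimum starts to the right of the maximum, one swap does double duty.
    if min_idx > max_idx:
        swaps -= 1
    return swaps

print(minimum_swaps([3, 4, 5, 5, 3, 1]))  # 6
print(minimum_swaps([9]))                 # 0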

 

Saturday, July 1, 2023

Problem statement: There is a growing need for dynamic, reliable and repeatable infrastructure as the scope expands from small-footprint deployments to cloud scale. Some of the manual approaches and management practices cannot keep up. There are two popular ways to meet these demands on the Azure public cloud: Terraform and ARM templates. This article compares the two frameworks and their use cases. Specifically, we include a use case for DevSecOps and its applicability to the development and operation of trustworthy infrastructure-as-code.

Terraform is universally extendable through providers that furnish IaC for resource types. It is a one-stop shop for any infrastructure, service, and application configuration. It can handle complex order-of-operations and the composability of individual resources and encapsulated models. It is also backed by an open-source community for many providers and their modules, with public documentation and examples. Microsoft also works directly with HashiCorp, the maker of Terraform, on building and maintaining the related providers, and this partnership has gained widespread acceptance and usage. Perhaps one of the best features is that Terraform tracks the state of the real-world resources, which makes Day-2 and onward operations easier and more powerful.

ARM templates come entirely from Microsoft and are consumed both internally and externally as the de facto standard for describing resources on Azure, with import and export options. A dedicated cloud service, the Azure Resource Manager service, expects and enforces this convention for all resources to provide effective validation, idempotency and repeatability.

Azure Blueprints can be leveraged to allow an engineer or architect to sketch a project's design parameters and define a repeatable set of resources that implements and adheres to an organization's standards, patterns and requirements. It is a declarative way to orchestrate the deployment of various resource templates and other artifacts such as role assignments, policy assignments, ARM templates, and resource groups. Blueprint objects are stored in Cosmos DB and replicated to multiple Azure regions. Since it is designed to set up the environment, it is different from resource provisioning. This package fits nicely into a CI/CD pipeline.

With ARM templates, one or more Azure resources can be described in a document, but the template does not exist natively in Azure and must be stored locally or in source control. Once those resources are deployed, there is no active connection or relationship to the template.

Other IaC providers like Terraform track the state of the real-world resources, which makes Day-2 and onward operations easier and more powerful, and with Azure Blueprints the relationship between what should be deployed and what was deployed is preserved. This connection supports improved tracking and auditing of deployments, and it even works across several subscriptions with the same blueprint.

Typically, the choice is not between a blueprint and a resource template, because one comprises the other, but between an Azure Blueprint and a Terraform tfstate. They differ in their organization methodology as top-down or bottom-up. Blueprints are great candidates for compliance and regulations, while Terraform is preferred by developers for its flexibility. Blueprints manage Azure resources only, while Terraform can work with various resource providers.

Once the choice is made, some challenges must be tackled next. The account with which the IaC is deployed, and the secrets it must know for those deployments to occur correctly, are best handled centrally rather than by individual end-users. Packaging and distributing solutions for end-users is easier when these can be read from a single source of truth in the cloud, so at least the location in the cloud from which the solution reads and deploys the infrastructure must be known beforehand.

The DevSecOps workflow has a double loop between various stages: create -> plan -> monitor -> configure -> release -> package -> verify, where the create, plan, verify and package stages belong to Dev, or design time, and the monitor, configure and release stages belong to Ops, or runtime. SecOps sits at the cusp between these two halves of Dev and Ops and participates in the plan, package and release stages.

Some of the greatest challenges of DevSecOps are, first, cultural, stemming from market fragmentation in terms of IaC providers, and second, the wide variety of skills required for such IaC. Others include the definition of well-known code or design patterns, difficulty in replicating errors, IaC language specifics and a diverse toolset, security and trustworthiness, configuration drift, and changing infrastructure requirements.


Friday, June 30, 2023

 

How to enable Unity Catalog for Azure Databricks?

Azure Databricks is an Azure managed service for provisioning Databricks instances, a platform that unifies data, analytics and AI. Users who have previously worked with older versions of Databricks may not have migrated to Unity Catalog, its centralized administration module. This article explains how to enable and work with Unity Catalog.

Databricks does not force us to migrate our data into proprietary storage systems to use the platform. Instead, it allows us to integrate the platform with external storage and deploys compute to process the data. We control the integrations and manage permissions. Unity Catalog further extends this relationship by managing permissions for accessing data using SQL syntax from within Azure Databricks.

The primary purpose is integrated access control:

·       Unity Catalog provides centralized access control, auditing, lineage, and data discovery capabilities across Azure Databricks workspaces. It offers a single place to administer data access policies that apply across all workspaces and personas. It automatically captures user-level audit logs that record access to your data. Unity Catalog also captures lineage data that tracks how data assets are created and used across all languages and personas.

 

·       An Azure managed identity can access external storage on behalf of Unity Catalog users. Managed identities provide an identity for applications to use when they connect to resources that support Azure Active Directory (Azure AD) authentication.

 

Unity Catalog comprises a hierarchy with the Metastore at the top level, followed by Catalogs, then Schemas, and Tables and Views at the leaf level. All items are referenced via a three-level namespace in the format catalog.schema.table. The metastore is the top-level container for metadata. Besides the metastore, Unity Catalog also includes a user management module.
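
As a small illustration of this hierarchy, the sketch below creates and queries objects using three-level names from a Databricks notebook, where the spark session is predefined; the catalog, schema, table and group names are made up for this example:

# Hypothetical names used for illustration only.
spark.sql("CREATE CATALOG IF NOT EXISTS sales")
spark.sql("CREATE SCHEMA IF NOT EXISTS sales.bronze")
spark.sql("""
    CREATE TABLE IF NOT EXISTS sales.bronze.orders (
        order_id INT,
        amount   DOUBLE
    )
""")

# Permissions are managed with SQL syntax as well.
spark.sql("GRANT USE CATALOG ON CATALOG sales TO `data-analysts`")
spark.sql("GRANT SELECT ON TABLE sales.bronze.orders TO `data-analysts`")

# Every reference uses the catalog.schema.table form.
spark.sql("SELECT order_id, amount FROM sales.bronze.orders").show()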

The steps to follow to set up Unity Catalog are:

1.       Configure a storage container and Azure managed identity with read-write access to it.

2.       Create a metastore

3.       Attach workspaces to the metastore

4.       Add users, groups and service principals to the Azure Databricks account.

Many people struggle to follow these steps because the navigation to get started is hidden behind their user icon on the account admin portal, under the menu item “Manage Account”. Once they find this item, it is easy to follow the getting-started tutorial to create a metastore and set up Unity Catalog as directed.

The steps to follow for setting up an integration of a fresh instance with Azure Data Lake Storage (ADLS) are:

1.       Create an Azure Databricks instance in a VNet.

2.       Create an Azure Databricks access connector resource for ADLS.

3.       Use the access connector's managed identity to access the Unity Catalog root storage account by specifying the access connector id under Data -> Metastore.

4.       Create a storage credential in Unity Catalog for this managed identity.

5.       Set up your data lake storage account with a storage firewall that allows only Optum IPs.

6.       Grant access to this storage account by allowing access from a specific resource type, namely the Databricks instance.

7.       Set up the storage credential with external location mappings and access control policies for users and groups in Unity Catalog, as sketched below.
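
Assuming the storage credential from step 4 was named adls_cred, step 7 can be expressed with SQL from a notebook; the external location name, container URL and group name below are placeholders:

# Map the ADLS container to an external location backed by the storage credential.
spark.sql("""
    CREATE EXTERNAL LOCATION IF NOT EXISTS lake_landing
    URL 'abfss://landing@mydatalake.dfs.core.windows.net/'
    WITH (STORAGE CREDENTIAL adls_cred)
""")

# Access control policies for users and groups on the new location.
spark.sql("GRANT READ FILES, WRITE FILES ON EXTERNAL LOCATION lake_landing TO `data-engineers`")
spark.sql("GRANT CREATE EXTERNAL TABLE ON EXTERNAL LOCATION lake_landing TO `data-engineers`")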

 



Thursday, June 29, 2023

Workflows with Airflow:

Apache Airflow is a platform used to build and run workflows. A workflow is represented as a Directed Acyclic Graph (DAG) where the nodes are the tasks and the edges are the dependencies. This helps determine the order in which to run the tasks and how to retry them. Tasks are self-describing. An Airflow deployment consists of a scheduler to trigger scheduled workflows and submit tasks to the executor, an executor to run the tasks, a web server for a management interface, a folder for the DAG artifacts, and a metadata database to store state. Workflows do not restrict what can be specified as a task: it can be an Operator, a predefined task using, say, Python; a Sensor, which does nothing but wait for an external event to happen; or a custom task specified as a Python function decorated with @task.
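
As a small sketch of these pieces, assuming a recent Airflow 2.x release with the TaskFlow API, the DAG below wires two @task functions together; the DAG id, schedule and task bodies are made up:

from datetime import datetime

from airflow.decorators import dag, task


@dag(schedule="@daily", start_date=datetime(2023, 6, 1), catchup=False)
def example_pipeline():
    @task
    def extract():
        # A custom task defined as a plain Python function.
        return [1, 2, 3]

    @task
    def load(rows):
        print(f"loading {len(rows)} rows")

    # Calling load on extract's result creates the edge extract -> load.
    load(extract())


example_pipeline()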

 

Runs of the tasks in a workflow can occur repeatedly by processing the DAG and can occur in parallel. Edges can be modified by setting the upstream and downstream relationships between a task and its dependency. Data can be passed between tasks using an XCom, a cross-communications system for exchanging state; by uploading to and downloading from external storage; or via implicit exchanges.
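
With classic operators, the same ideas look like the hedged sketch below (again assuming a recent Airflow 2.x release): the >> operator declares the upstream/downstream edge, the producing task's return value is stored as an XCom, and the consumer reads it back with xcom_pull. The DAG id, task ids and the returned path are illustrative:

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def produce(**context):
    # The return value is stored as an XCom under the default "return_value" key.
    return "landing/2023-06-29/orders.csv"


def consume(**context):
    path = context["ti"].xcom_pull(task_ids="produce")
    print(f"processing {path}")


with DAG("xcom_example", start_date=datetime(2023, 6, 1), schedule=None, catchup=False):
    produce_task = PythonOperator(task_id="produce", python_callable=produce)
    consume_task = PythonOperator(task_id="consume", python_callable=consume)

    # Explicit edge: produce runs before consume.
    produce_task >> consume_task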

 

Airflow sends tasks out to run on workers as space becomes available, so individual runs can fail, but the tasks will eventually complete. Notions of sub-DAGs and TaskGroups are introduced for better manageability.
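
For instance, a TaskGroup collapses related tasks into a single node in the UI; a hedged sketch, assuming an Airflow release where EmptyOperator is available, with made-up DAG and task names:

from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator
from airflow.utils.task_group import TaskGroup


with DAG("taskgroup_example", start_date=datetime(2023, 6, 1), schedule=None, catchup=False):
    start = EmptyOperator(task_id="start")

    # The group appears as one collapsible node; its tasks keep their own edges.
    with TaskGroup("transform") as transform:
        clean = EmptyOperator(task_id="clean")
        enrich = EmptyOperator(task_id="enrich")
        clean >> enrich

    end = EmptyOperator(task_id="end")

    start >> transform >> end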

 

One of the characteristics of Airflow is that it prioritizes flow: there is no need to describe data inputs or outputs, and all aspects of the flow can be visualized, whether they are pipeline dependencies, progress, logs, code, tasks or success status.

Airflow is in use by over ten thousand organizations, with popular use cases involving orchestrating batch ETL jobs; organizing, executing and monitoring data flows; building ETL pipelines for extracting batch data from hybrid data sources and running Spark jobs; training machine learning models; generating automated reports; and performing backups and other DevOps tasks. It might not be ideal for streaming events, because the scheduling required differs between batch and stream processing. Airflow also does not offer versioning of pipelines, so source control may become necessary in such cases. Airflow epitomizes pipeline-as-code, with artifacts described in Python for creating jobs, stitching jobs together, programming other necessary data pipelines, and debugging and troubleshooting.