Sunday, February 18, 2024

 

The Databricks Unity Catalog offers centralized access control, auditing, lineage, and data discovery across multiple Databricks workspaces. It spans user management, the metastore, clusters, and SQL warehouses, and enforces a standards-compliant security model based on ANSI SQL. Built-in auditing and lineage provide user-level audit logs and support data discovery. The metastore is the top-level container for metadata, and the data beneath it is addressed through a three-level namespace: catalog.schema.table. Catalog Explorer allows for the creation of tables and views, while volumes provide governance for non-tabular data. The catalog is multi-cloud friendly, allowing federation across multiple cloud vendors and unified access. The idea here is that you can define once and secure anywhere.

Databricks Unity Catalog consists of a metastore and catalogs. The metastore is the top-level logical container for metadata; it stores data assets such as tables or models, defines the namespace hierarchy, and handles access control policies and auditing. A catalog is the first-level organizational unit within the metastore, grouping related data assets and providing access controls. Only one metastore is used per region: each Databricks region requires its own Unity Catalog metastore.

There is a Unity Catalog quick start notebook in Python. The key steps include creating a workspace with a Unity Catalog metastore, creating a catalog, creating a managed schema, creating and managing a table, and using Unity Catalog with the pandas API on Spark. The code starts by creating a catalog and showing it, then creates a managed schema, extends it, and grants permissions on it. A table is then created within that schema, shown, and listed among all available tables. The final step uses the pandas API on Spark, which is covered in the official Databricks documentation. The quick start is a great way to get a feel for the process while toggling back and forth between the key steps and the code; a condensed sketch follows.
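
The sketch below strings those steps together as they might appear in notebook cells on a Unity Catalog-enabled cluster (where the `spark` session is predefined); the catalog, schema, table, and group names are placeholders, not the ones in the official notebook.

```python
import pyspark.pandas as ps

# Create a catalog and confirm it exists.
spark.sql("CREATE CATALOG IF NOT EXISTS quickstart_catalog")
spark.sql("SHOW CATALOGS").show()

# Create a managed schema in the catalog and grant a group permissions on it.
spark.sql("CREATE SCHEMA IF NOT EXISTS quickstart_catalog.quickstart_schema")
spark.sql("GRANT USE SCHEMA, CREATE TABLE ON SCHEMA "
          "quickstart_catalog.quickstart_schema TO `data-engineers`")

# Create a managed table using the three-level namespace, then list tables.
spark.sql("CREATE TABLE IF NOT EXISTS "
          "quickstart_catalog.quickstart_schema.quickstart_table "
          "(columnA INT) USING DELTA")
spark.sql("INSERT INTO quickstart_catalog.quickstart_schema.quickstart_table "
          "VALUES (1)")
spark.sql("SHOW TABLES IN quickstart_catalog.quickstart_schema").show()

# The pandas API on Spark addresses the same table by its three-level name.
psdf = ps.read_table("quickstart_catalog.quickstart_schema.quickstart_table")
print(psdf.head())
```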

The Unity Catalog system employs object-security best practices, including access control lists (ACLs) for granting or restricting access to specific users and groups on securable objects. ACLs provide fine-grained control over access to sensitive data and objects. Least privilege is applied, limiting access to the minimum required and avoiding broad groups like All Users unless necessary. Access is revoked once its purpose is served, and policies are reviewed regularly for relevance. This technique enhances data security and compliance, prevents unnecessarily broad access, and limits the blast radius in case of a security breach.
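
A minimal illustration of this lifecycle in Databricks SQL, run here through `spark.sql`; the table and group names are hypothetical.

```python
# Least privilege: grant the narrowest privilege to the specific group that
# needs it, rather than to a broad group such as All Users.
spark.sql("GRANT SELECT ON TABLE main.sales.orders TO `fraud-analysts`")

# Review the standing policies on the object regularly.
spark.sql("SHOW GRANTS ON TABLE main.sales.orders").show()

# Revoke the access once its purpose is served.
spark.sql("REVOKE SELECT ON TABLE main.sales.orders FROM `fraud-analysts`")
```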

The Databricks Unity Catalog system offers best practices for catalogs. First, create a separate catalog for loose coupling, managing access and compliance at the catalog level. Align catalog boundaries with business domains or applications, such as marketing analytics or HR. Customize security policies and governance within the catalog to drill down into specific domains. Create access control groups and roles specific to a catalog, fine-tune read/write privileges, and customize settings such as resource quotas. These fine-grained policies provide the best of both security and functionality in catalogs.
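
A sketch of a domain-aligned catalog with its own groups and a read/write split; the catalog and group names are made up for the example, and privileges granted at the catalog level are inherited by the schemas and tables beneath it.

```python
# One catalog per business domain, for loose coupling.
spark.sql("CREATE CATALOG IF NOT EXISTS marketing_analytics")

# Readers get browse-and-query access across the whole domain.
spark.sql("GRANT USE CATALOG, USE SCHEMA, SELECT ON CATALOG "
          "marketing_analytics TO `marketing-readers`")

# Engineers additionally get write and create privileges.
spark.sql("GRANT USE CATALOG, USE SCHEMA, SELECT, MODIFY, CREATE TABLE "
          "ON CATALOG marketing_analytics TO `marketing-engineers`")
```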

To ensure security and manage external connections, limit visibility by granting access only to specific users, groups, and roles, applying least privilege. Limit access to only the necessary users and groups using granular access control lists (ACLs). Be aware of team activities and avoid giving teams unnecessary access to external resources. Tag connections for effective discovery using source categories or data classifications, and organize connections by use case for organizational visibility. This approach enhances security, prevents unintended data access, and simplifies the discovery and management of external connections.
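
A hedged sketch of creating and scoping an external connection via Lakehouse Federation follows; the host, secret scope, and group names are placeholders, and tagging connections this way is an assumption in this sketch rather than a confirmed capability.

```python
# Create the external connection with credentials kept in a secret scope.
spark.sql("""
    CREATE CONNECTION IF NOT EXISTS crm_mysql TYPE mysql
    OPTIONS (host 'crm.example.com', port '3306',
             user secret('crm-scope', 'user'),
             password secret('crm-scope', 'password'))
""")

# Limit visibility: only the team that needs the connection may use it.
spark.sql("GRANT USE CONNECTION ON CONNECTION crm_mysql TO `crm-analysts`")

# Tag for discovery by source category and data classification (assumed
# to be supported on connections here).
spark.sql("ALTER CONNECTION crm_mysql "
          "SET TAGS ('source' = 'crm', 'classification' = 'internal')")
```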

Databricks Unity Catalog business-unit best practices emphasize providing dedicated sandboxes for each business unit, allowing independent development environments and preventing interference between different workflows. Centralizing shareable data into production catalogs ensures consistency and reduces the need for duplicate data. Discoverability is crucial, with meaningful naming conventions and metadata best practices. Federated queries via the Lakehouse architecture unify data access across silos, governed securely via contracts and permissions. This approach supports autonomy for the units, increases productivity through reuse, and maintains consistency through collaborative governance.

In conclusion, Unity Catalog enables centralized data governance and comes with best practices for catalogs, connections, and business units.

https://docs.databricks.com/en/data-governance/unity-catalog/enable-workspaces.html#enable-workspace

https://docs.databricks.com/en/data-governance/unity-catalog/create-metastore.html

Saturday, February 17, 2024

 

This is a summary of the book “What You Can Change and What You Can’t: The Complete Guide to Successful Self-Improvement,” written by Martin E.P. Seligman and published by Vintage Books, 2007. Seligman, a former president of the American Psychological Association, delivers a pragmatic treatise on problems that are treatable and those that are not. His attitude is candid, and his advice is well-informed albeit generalized. His message is, at times, as simple as: if you are overweight and you have gotten thin, you have beaten the odds. Among his other messages, he believes that radical changes are possible so long as we know which areas are more amenable to change. Problems at the opposite end of the spectrum can be misguidedly deemed treatable by hoaxes. The depth of a problem might indicate its treatability. The past, and particularly mistreatment during childhood, is wrongly blamed. If we are hopeful and optimistic, we can begin to change, but we must be realistic. Therapies and drugs control only the symptoms. Dieting and alcohol rehabilitation do not cure, but obsession, panic attacks, and phobias are highly treatable. Similarly, depression, anxiety, and anger are hard-wired into the human psyche yet treatable. Sometimes the best thing people can do with certain deep-rooted emotional maladies is learn to live with them.

Psychological problems such as depression, addiction, obsessiveness, anxiety, and post-traumatic stress often require courage and self-improvement. Current treatments, such as drugs and psychotherapies, have an effectiveness rate of only about 65% due to the high degree of heritability of problem-causing personality traits. People who suffer from these mental afflictions often hope to live courageously with their problems. It is possible that both Winston Churchill and Abraham Lincoln were unipolar depressives.

People often believe they can benefit from self-improvement efforts such as weight loss, meditation, and suppressing sexual desire. However, these methods often fail due to genetic and biochemical factors. Our personality is more a product of our genes than we would have believed a decade ago.

Most people can change some things, such as depression, sexual dysfunction, mood, and outlook, while factors that people usually cannot change include severe weight problems, alcoholism, and homosexuality. Understanding one's psychological state helps to deal with and potentially change it for the better.

Dysphoric emotions, such as depression, anxiety, and anger, have been used as warnings of danger and loss throughout history. These emotions can be viewed as bad weather and can be managed through natural methods like meditation and progressive relaxation. However, unremitting, intense anxiety can indicate serious disorders like obsession, phobia, and panic, which require therapeutic exorcism.

Panic attacks, which manifest catastrophic thinking, are curable and can be treated with cognitive therapy. Phobias, which stem from evolutionary history, represent unreasonable fears and can be treated through systematic desensitization or "flooding." Optimism is a learned skill that can improve work achievement and physical health. Obsessions involve repetitive thoughts or images that can be depressing, scary, and repugnant, leading to obsessive-compulsive disorder (OCD). Treatment involves exposing the individual to a fearful situation and preventing the ritualistic behavior, eventually disarming the obsession. Understanding and managing these emotions is crucial for overall well-being.

Depression, anger, and post-traumatic stress disorder are common mental illnesses that can lead to feelings of sadness, helplessness, and despair. Bipolar depression, a heritable condition, can be treated with lithium, while unipolar depression stems from loss, pain, and sadness. Treatments include electroconvulsive shock, medication, interpersonal therapy (IPT), and cognitive therapy (CT). Anger, a powerful emotion, can be controlled by counting to 10, but it can also spark violence. Post-traumatic stress disorder (PTSD) involves extraordinary loss or tragedy, and some people may require drugs and psychotherapy. Sex, diet, and alcohol also play a role in these disorders. Transsexuals may believe they are trapped in bodies of the wrong gender, while exclusive homosexuality is deeply ingrained. Therapy can help alter feelings about sexual choice and preferences, but only within specified limits. Successful living often involves learning to make the best of a bad situation.

Dieting is not effective for heavy people, as it can make them more overweight and unhealthier. People who weigh less live longer, but society's emphasis on an "ideal weight" is misplaced. Exercise is good for the body and helps with weight control, but don't diet. Stomach bypass surgery can help the extremely obese. Popular explanations such as overeating, having an "overweight personality," physical inactivity, and weak willpower are not true. Alcoholism is not a physical pathology, and treatments like Alcoholics Anonymous (AA) may not be effective. The depth of a problem should be considered; the common belief that most psychological problems start with childhood trauma is misplaced. It is better to focus on changing the aspects of personality that are within our control and amenable to change. The "Serenity Prayer" counsels courageously changing the things that can be changed while accepting the things that cannot. Therapy or drugs may help if the problem is deep-seated, but they may not be effective in the long run.

Previous book summaries: BookSummary52.docx 
Summarizing Software:
SummarizerCodeSnippets.docx.

Friday, February 16, 2024

 

One of the more recent additions to Azure resources has been Azure Machine Learning Studio. This is a managed machine learning environment that allows you to create, manage, and deploy ML models and applications. It is part of the Azure AI resource, which provides access to multiple Azure AI services with a single setup. Some of the features of the Azure Machine Learning Studio resource are:

- It has a GUI-based integrated development environment for building machine learning workflows on Azure.

- It supports both no-code and code-first experiences for data science.

- It lets you use various tools and components for data preparation, feature engineering, model training, evaluation, and deployment.

- It enables you to collaborate with your team and share datasets, models, and projects.

- It allows you to configure and manage compute targets, security settings, and external connections.

When provisioning this resource for use by data scientists, it is important to consider the following best practices:

- The workspace itself must allow outbound connectivity to the public network. One way to do this is to allow it to be accessible from all or selected public IP addresses.

- The clusters must be provisioned with no node public IP addresses, conforming to the well-known no-public-IP (NPIP) pattern. This is done by adding the compute to a subnet in a virtual network with service endpoints for Azure Storage, Key Vault, and Container Registry, and default routing; a provisioning sketch follows this list.

- Since the workspace and its dependent resources (storage account, key vault, container registry, and Application Insights) are created independently, it is helpful to associate the same user-assigned managed identity with all of them. This also makes it convenient to customize data-plane access to other resources outside this list, such as a different storage account or key vault. The same goes for compute, which can also be launched with this identity.

- Permissions granted to various roles on this resource can be customized to be further restricted since this is a shared workspace.

- Code that data scientists execute in this studio falls into one of several categories, such as interactive Python notebooks, Spark code, and non-interactive jobs. The permissions necessary to run each of them must be tried out independently.

- There are various kernels and serverless Spark compute available to execute the user-defined code in a notebook. The user-assigned managed identity that facilitates data access for this code must have both control-plane read access, to perform actions such as getAccessControl, and data-plane permissions, such as blob data read and write. The logged-in user's credentials are then used over the session created with the managed identity for the user to perform the data access.

- Non-interactive jobs require a specific permission for any user to submit a run within an experiment.
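
As a sketch of the NPIP bullet above, the following provisions a compute cluster with no node public IPs into an existing virtual network subnet using the azure-ai-ml Python SDK; the subscription, workspace, and network names are placeholders, and the subnet is assumed to already carry the service endpoints mentioned earlier.

```python
from azure.ai.ml import MLClient
from azure.ai.ml.entities import AmlCompute, NetworkSettings
from azure.identity import DefaultAzureCredential

# Connect to the already-provisioned workspace.
ml_client = MLClient(
    credential=DefaultAzureCredential(),
    subscription_id="<subscription-id>",
    resource_group_name="<resource-group>",
    workspace_name="<workspace-name>",
)

# NPIP pattern: nodes receive no public IP; traffic stays on the vnet subnet.
compute = AmlCompute(
    name="npip-cluster",
    size="STANDARD_DS3_V2",
    min_instances=0,
    max_instances=4,
    enable_node_public_ip=False,
    network_settings=NetworkSettings(vnet_name="<vnet-name>", subnet="<subnet-name>"),
)
ml_client.compute.begin_create_or_update(compute).result()
```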

Together, the built-in features and customizations of this resource can immensely benefit data scientists in training their models. Previous articles: IacResolutionsPart77.docx

Thursday, February 15, 2024

 

Ways to replicate databases in Azure for DevOps

Azure relational databases often consolidate data access across services and become important for transactional processing. The data stored in these databases is mission-critical, and consequently some steps are needed to ensure business continuity and disaster recovery. Daily backups and continuous replication are two of the most frequently sought-after methods, and organizations often build their own GitOps initiatives to back up and restore across databases and database servers. This article compares these two methods.

Backup and restore is a feature that allows you to create a copy of your server and its databases at a specific point in time, and restore it to a new server if needed. This is useful for recovering from user or application errors, or for migrating data to a different region.

Continuous replication is a feature that allows you to create one or more read-only replicas of your server in the same or different region, and synchronize them with the primary server asynchronously. This is useful for scaling out read workloads, improving availability, and reducing latency.

If a database server is introduced into every Azure subscription that an organization owns, with the sole purpose of receiving replications from other databases and database servers, it can even eliminate the need for backup and restore, or provide different levels of service for source databases. There cannot be multiple database servers in Azure that replicate to the same database server instance, because each replica server must have a unique server ID in a replication topology. But we can have multiple source servers that replicate to different replica servers using the Data-in Replication feature or the read replica feature. These features synchronize data from a source MySQL server to one or more read-only replica servers in the same or a different region. Likewise, we cannot set up multiple replication channels from a single replica to a single source, again because the server IDs of replicas must be unique in a replication topology. But we can set up multiple replication channels from a single source to multiple replicas using the read replica feature in Azure Database for MySQL, which can replicate an Azure Database for MySQL server or flexible server instance to up to five or ten read-only servers, respectively. This can help us scale out read workloads, improve availability, and reduce latency.
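
For the Data-in Replication route, Azure Database for MySQL exposes stored procedures for pointing a replica at its source. The rough Python sketch below drives them with mysql-connector-python; the host names, credentials, and binlog coordinates are placeholders, and the replication login and binlog position are assumed to have been prepared on the source beforehand (e.g., via SHOW MASTER STATUS).

```python
import mysql.connector

# Connect to the replica server that will receive the changes.
replica = mysql.connector.connect(
    host="mysql-replica-1.mysql.database.azure.com",
    user="admin_user",
    password="<password>",
)
cursor = replica.cursor()

# Point the replica at the source server.
cursor.callproc("mysql.az_replication_change_master", (
    "mysql-src-1.mysql.database.azure.com",  # source host
    "repl_user",                             # replication login on the source
    "<password>",                            # replication password
    3306,                                    # port
    "mysql-bin.000001",                      # binlog file captured on the source
    397,                                     # binlog position
    "",                                      # CA certificate ('' = no SSL)
))

# Start the replication channel (az_replication_stop halts it).
cursor.callproc("mysql.az_replication_start")
```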

The main differences between backup and restore and continuous replication are:

    - Backup and restore requires you to manually initiate the restore process, while continuous replication automatically streams the data from the primary to the replicas.

    - Backup and restore has a longer recovery time objective (RTO) and a higher potential data loss (recovery point objective, RPO) than continuous replication, depending on the backup frequency and retention period.

    - Backup and restore is at the server level, not at the database level, while continuous replication allows you to select which databases to replicate.

    - Backup and restore can be configured to use either locally redundant or geographically redundant storage, while continuous replication always uses geographically redundant storage.

    - Backup and restore is included in the service cost, while continuous replication incurs additional charges for the replicas.

Besides these options, the database migration service and mysqldump also provide resiliency.

The Azure Database Migration Service (DMS) instance comes with the following.

Pros:  

1.       One instance works across all subscriptions.  

2.       Can transfer between on-premises and cloud, and from cloud to cloud.

3.       Pay-per-use billing.  

4.       Provides a wizard to create data transfer activity.  

Cons:  

1.       Limited features via IaC as compared to the portal but enough to get by.  

2.       Not recommended for developer instances or tiny databases that can be exported and imported via mysqldump.  

3.       binlog_expire_logs_seconds must be set to a non-zero value on the source server.

4.       Supports only SQL login.

The steps to perform the data transfer activity between the source and destination MySQL servers involve:

1.       Create a source mysql instance mysql-src-1 in rg-mysql-1  

2.       Create a database and add a table to mysql-src-1  

3.       Create a destination mysql instance mysql-dest-1 in rg-mysql-1  

4.       Deploy the DMS service instance and project.  

5.       Create and run an activity to transfer data.  

6.       Verify the data.  

The mysqldump utility, on the other hand, prepares the SQL for replay against the destination server. All the tables of the source database can be exported.

Pros: 

1.       The SQL statements fully describe the source database. 

2.       They can be edited before replaying.

3.       There are many options natively supported by the database server. 

Cons: 

1.       It acquires a global read lock on all the tables at the beginning of the dump; the sketch after this list shows one way around this. 

2.       If long-running update statements are executing when the FLUSH statement is issued, the MySQL server may stall until those statements finish.
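
One way to avoid the global read lock on InnoDB tables is mysqldump's --single-transaction option, which takes a consistent snapshot instead. The sketch below shells out to mysqldump and mysql from Python, reusing the mysql-src-1 and mysql-dest-1 servers from the steps above; the credentials are placeholders.

```python
import subprocess

# Dump the source database; --single-transaction takes a consistent snapshot
# for InnoDB tables instead of acquiring a global read lock.
dump = subprocess.run(
    ["mysqldump",
     "--host", "mysql-src-1.mysql.database.azure.com",
     "--user", "admin_user", "--password=<password>",
     "--single-transaction",
     "--databases", "mydb"],
    capture_output=True, check=True, text=True,
)

# The SQL statements fully describe the source database and can be edited
# before being replayed against the destination server.
subprocess.run(
    ["mysql",
     "--host", "mysql-dest-1.mysql.database.azure.com",
     "--user", "admin_user", "--password=<password>"],
    input=dump.stdout, check=True, text=True,
)
```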

 

These are some of the ways to replicate databases across servers.


Tuesday, February 13, 2024

This is a summary of the book titled “Leading Beyond Change: A Practical Guide to Evolving Business Agility,” written by Michael K. Sahota and Audree Tara Sahota and published by Berrett-Koehler in 2021. As productivity consultants, they propose the “Shift314 Evolutionary Leadership Framework (SELF),” which advocates learning what happens in your organization by listening “to the voice of the system.” It is the next step in your personal leadership journey and for your company. They link effective change to awareness and accountability.

Most change efforts are ineffective, but evolutionary leaders lead beyond change by encompassing results, organizational change, organizational culture, people, use of power, leadership, and an understanding of reality. They must be constantly realistic and open to new ideas. They are empathetic as human beings and are accountable for their choices. It is crucial to focus on achieving "organizational evolution" instead of routine change programs. This requires a shared transition into higher consciousness and evolutionary leadership. Traditional change efforts often fail because leaders mistakenly view their organizations as easy to understand. The "Shift314 Evolutionary Leadership Framework" (SELF) is a strategic approach to seeking evolutionary change. The SELF program consists of four aspects: maps, principles, models, and tools. Maps help leaders visualize patterns and understand the difference between traditional and evolutionary change patterns. Principles guide personal evolution and inspire change in others. Models provide a better understanding of the organization through data collection and analysis. Tools allow leaders to enhance personal and organizational results. Evolutionary leaders prioritize employees' safety and interests, even over shareholder interests.

Evolutionary leadership focuses on system improvement and higher consciousness, leading "beyond change." It involves two dimensions: "Being" and "Doing," where leaders commit to individual improvement that inspires people and leads to team and systemic improvement. Responsibility is the key factor for successful self-organization and self-management. Organizations with high-energy employees and superior performance are ideally equipped to manage evolutionary change. Companies need leaders who develop and evolve their own leadership styles to move beyond conventional business operations.

Leaders must be realistic and accept reality, as their brains tend to generalize, distort, or delete information when overloaded with sensory data. Companies need leaders who accept reality and its consequences to deal with challenges effectively. Open-minded and humble leaders accept that their thoughts may not accurately reflect reality, and are tough-minded enough to question their own models.

Achieving this mindset requires faith in oneself and the process of growth. To learn, keep the mind open to new ideas and avoid believing you know it all. A famous Zen koan teaches that believing in one's own knowledge can hinder clear thinking and openness to new ideas.

The best leaders evolve as aware human beings, embracing self-awareness, emotional awareness, and enlightened wisdom. Self-evolution is crucial for leaders, as it allows them to discover reality, unlearn, understand that external challenges are linked to internal ones, serve a purpose, let go of control, increase freedom, exhibit responsibility, provide psychological safety, and give everyone an equal voice. Leaders must be accountable for the choices that define their leadership, as everything in their sphere at work reflects on their personal leadership. They must be aware of the consequences of their actions, including how they grow as leaders, the organization's processes, corporate changes, employee treatment, and the leaders they appoint and mentor. This journey requires courage and motivation to accept the challenging process of evolving.

Monday, February 12, 2024

 

Sequences from usage

Introduction: Neural networks democratize dimensions, but the pattern of queries served by the neural network informs more about some of them than others over lengthy periods of time. These are like the sequences used in transformers, except that the sequence is not given between successive queries and must instead be learned over time and curated as idiosyncratic to the user sending the queries. Applications of such learners are significant in spaces such as personal assistants, where the accumulation of queries and the determination of these sequences offer insights into the interests of the user. The idea behind capturing sequences over large batches of queries is that they are representational, not actual, and get stored as state rather than as dimensions. There is a need to differentiate the anticipation, prediction, and recommendation of the next query from the determination of user traits from usage over time. The latter plays no role in immediate query responses but adjusts the understanding of whether a response will have high precision and recall for the user. In this way, the curation of user traits differs from running a latent semantic classifier on all incoming requests, because that would be short-term, helping with the next query or response but not assisting with the inference of the user's traits.

If the neurons in a large language model were to light up for each query and response, there would be many flashes across millions of them. While classifiers help with the salience of these flashes for an understanding of the emphasis in the next query and response, it is the capture of repetitive flash sequences over a long enough time span in the same space that indicates recurring themes and a certain way of thinking for the end user. Such repetitions have their origins in the way a person draws upon personal styles, ways of thinking, and vocational habits to form their queries. Even when the subjects a user brings up are disparate, the way they go about exploring them might still draw upon their beliefs, habits, and ways of thinking, leading to sequences that are, at the very least, representative of the user.

Curation and storage of these sequences is like the state stored by transformers in that it must be encoded and decoded. Since these encoded states are independent of the forms in which the input appears, this approach eschews the vector representation and the consequent regression and classification, focusing instead on the capture and possible replay of the sequences. This is more along the lines of sequence databases, whose use is well-known in the industry and which provide opportunities for interpretation not limited to machine learning.

Continuous state aggregation for analysis might also be possible but is not part of this discussion. Natural language processing relies on encoding and decoding to capture and replay state from text. This state is discrete and changes from one set of tokenized input texts to another. As the text is transformed into vectors of a predefined feature length, it becomes available for regression and classification. The state representation remains immutable and is decoded to generate new text. Instead, if the encoded state could be accumulated with that of the subsequent text, it would likely bring out the topic of the text, provided the state accumulation is progressive. A progress indicator could be the mutual information value of the resulting state. If information is gained, the state can continue to aggregate, and this can be stored in memory. Otherwise, the pairing state can be discarded. This results in a final state aggregation that is increasingly inclusive of the topic in the text.
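
A toy sketch of this progressive aggregation, treating embeddings as the encoded state and a cosine-novelty score as a stand-in for the mutual information criterion; the hash-based encoder below is a placeholder for whatever transformer encoder an implementation would actually use.

```python
import numpy as np

def encode(text: str) -> np.ndarray:
    # Placeholder encoder: deterministic pseudo-embedding per text chunk.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(64)
    return v / np.linalg.norm(v)

def information_gain(state: np.ndarray, candidate: np.ndarray) -> float:
    # Stand-in for a mutual information estimate: novelty as 1 - cosine.
    return 1.0 - float(np.dot(state, candidate))

def aggregate(chunks: list[str], threshold: float = 0.3) -> np.ndarray:
    state = encode(chunks[0])
    for chunk in chunks[1:]:
        candidate = encode(chunk)
        if information_gain(state, candidate) > threshold:
            # Information gained: accumulate the pairing state and keep it.
            state = state + candidate
            state = state / np.linalg.norm(state)
        # Otherwise the pairing state is discarded.
    return state  # final aggregation, increasingly inclusive of the topic

topic_state = aggregate(["query one", "query two", "query three"])
```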

NLP has models like BERT that set a precedent for state encoding and decoding, but they act on every input and are required to be contiguous in their processing of the inputs. A secondary pass for selecting and storing the states is suggested in this document. Reference: https://booksonsoftware.com/text

Sunday, February 11, 2024

 


Storing secrets in app_settings versus Key Vault for Azure App Service has some trade-offs: app settings are simple to define and read but live alongside the rest of the app's configuration, while Key Vault takes a little more setup but centralizes the management, access control, and auditing of sensitive values.

In summary, app settings are a good choice for storing non-sensitive application settings, such as endpoint locations, sizing, and flags [4]. Key Vault is a better choice for storing sensitive information, such as encryption keys, certificates, and passwords [4]. You can also use both options together by creating app settings that reference secrets stored in Key Vault [1]. This way, you can maintain secrets apart from your app’s configuration and access them like any other app setting or connection string in your code [1].

 

1.       https://learn.microsoft.com/en-us/azure/app-service/app-service-key-vault-references

2.       https://stackoverflow.com/questions/67722160/what-is-the-point-of-using-azure-key-vault-instead-of-only-app-configuration

3.       https://learn.microsoft.com/en-us/azure/azure-app-configuration/faq

4.       https://learn.microsoft.com/en-us/azure/key-vault/secrets/secrets-best-practices

### Code sample for reading app settings

 

```python

import logging

import os

import azure.functions as func

 

app = func.FunctionApp()

 

@app.function_name(name="HttpTrigger1")

@app.route(route="req")

def main(req: func.HttpRequest) -> func.HttpResponse:

    # Get the setting named 'myAppSetting'

    my_app_setting_value = os.environ["myAppSetting"]

    logging.info(f'My app setting value: {my_app_setting_value}')

    # Your other function logic goes here

    return func.HttpResponse("Function executed successfully!")

 

```

 


### Code sample for reading key vault secrets

 

```python

# Import necessary libraries

import logging

import azure.functions as func

from azure.identity import DefaultAzureCredential

from azure.keyvault.secrets import SecretClient

 

def main(req: func.HttpRequest) -> func.HttpResponse:

    logging.info('Python HTTP trigger function processed a request.')

 

    # Initialize Azure credentials

    credentials = DefaultAzureCredential()

 

    # Create a SecretClient to interact with the Key Vault

    vault_url = "https://your-key-vault.vault.azure.net"  # Replace with your Key Vault URL

    secret_client = SecretClient(vault_url=vault_url, credential=credentials)

 

    # Retrieve the secret by name

    secret_name = 'your-secret-name'  # Replace with your secret name

    secret = secret_client.get_secret(name=secret_name)

 

    # Access the secret value

    secret_value = secret.value

 

    # You can now use the secret value in your function logic

    # For example, return it as an HTTP response

    return func.HttpResponse(f"Secret value: {secret_value}")

 

```

Make sure to replace the placeholders (your-key-vault.vault.azure.net and your-secret-name) with your actual Key Vault URL and secret name. Additionally, ensure that your Azure Function App has the necessary permissions to read from the Key Vault (e.g., a Key Vault Secrets User role assignment, or an access policy granting get on secrets).

Remember to include the required libraries (azure-functions, azure-keyvault-secrets, and azure-identity) in your requirements.txt file for deployment.

 

Previous articles: IaCResolutionsPart73.docx