Wednesday, February 28, 2024

 

Sequences from usage

Introduction: Neural networks democratize dimensions, but the pattern of queries served by the neural network informs more about some of them than others over lengthy periods of time. These are like the sequences used in transformers, except that the sequence is not given between successive queries and instead must be learned over time and curated as idiosyncratic to the user sending the queries. Applications of such learners are significant in spaces such as personal assistants, where the accumulation of queries and the determination of these sequences offer insights into the interests of the user. The idea behind the capture of sequences over large batches of queries is that they are representational, not actual, and get stored as state rather than as dimensions. There is a need to differentiate between anticipating, predicting, or recommending the next query and determining user traits from usage over time. The latter does not play any role in immediate query responses but adjusts the understanding of whether a response will have high precision and recall for the user. In an extension of this distinction, the approach also differs from running a latent semantic classifier on all incoming requests, because that would be short-term, helping with the next query or response but not assisting with the inference of the user’s traits.

If the neurons in a large language model were to light up for the query and responses, there would be many flashes across millions of them. While classifiers help with the salience of these flashes for an understanding of the emphasis in the next query and response, it is the capture of repetitive flash sequences over a long enough time span in the same space that indicates recurring themes and a certain way of thinking for the end-users. Such repetitions have their origins in the way a person draws upon personal styles, ways of thinking, and vocational habits to form their queries. Even when the subjects brought up by the user are disparate, the way of going about exploring them might still draw upon the same beliefs, habits, and ways of thinking, leading to sequences that are, at the very least, representative of the user.

Curation and storage of these sequences is like the state stored by transformers in that it must be encoded and decoded. Since these encoded states are independent of the forms in which the input appears, the approach eschews the vector representation and the consequent regression and classification, focusing instead on the capture and possible replay of the sequences. These are more along the lines of sequence databases, whose use is well-known in the industry and which provide opportunities for interpretation not limited to machine learning.

Continuous state aggregation for analysis might also be possible but is not part of this discussion. Natural language processing relies on encoding-decoding to capture and replay state from text. This state is discrete and changes from one set of tokenized input texts to another. As the text is transformed into vectors of predefined feature length, it becomes available to undergo regression and classification. The state representation remains immutable and is decoded to generate new text. Instead, if the encoded state could be accumulated with the subsequent text, it is likely to bring out the topic of the text, provided the state accumulation is progressive. A progress indicator could be the mutual information value of the resulting state. If there is information gained, the state can continue to aggregate, and this can be stored in memory. Otherwise, the paired state can be discarded. This results in a final state aggregation that continues to be more inclusive of the topic in the text.
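
The accumulation just described can be sketched in a few lines. The following is a minimal sketch, assuming the texts arrive as already-encoded fixed-length vectors; cosine novelty is used here as a stand-in for a true mutual-information estimate of the information gained.

import numpy as np

def aggregate_states(encoded_states, gain_threshold=0.2):
    """Fold each encoded state into a running aggregate only if it adds information."""
    aggregate = None
    kept = 0
    for state in encoded_states:
        state = np.asarray(state, dtype=float)
        if aggregate is None:
            aggregate, kept = state.copy(), 1
            continue
        cos = np.dot(aggregate, state) / (np.linalg.norm(aggregate) * np.linalg.norm(state))
        novelty = 1.0 - cos  # proxy for information gained over the current aggregate
        if novelty > gain_threshold:
            # progressive accumulation: running mean over the accepted states
            aggregate = (aggregate * kept + state) / (kept + 1)
            kept += 1
        # otherwise the pairing is discarded and the aggregate is left unchanged
    return aggregate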

NLP has algorithms like BERT that set the precedent for state encoding and decoding, but they act on every input and are required to be contiguous in their processing of the inputs. A secondary pass for selectivity and storing of the states is suggested in this document. Reference: https://booksonsoftware.com/text

 

Tuesday, February 27, 2024

 

Get the Longest Increasing Subsequence:

For example, the LIS of [10, 9, 2, 5, 3, 7, 101, 18] is [2, 5, 7, 101] and its length is 4.

// Requires java.util.Arrays and java.util.List.
public static int getLIS(List<Integer> A) {
    if (A == null || A.isEmpty()) return 0;

    // best[i] holds the length of the longest increasing subsequence ending at index i.
    int[] best = new int[A.size()];
    Arrays.fill(best, 1);

    for (int i = 1; i < A.size(); i++) {
        for (int j = 0; j < i; j++) {
            if (A.get(i) > A.get(j)) {
                best[i] = Math.max(best[i], best[j] + 1);
            }
        }
    }

    // The answer is the largest value across all ending positions.
    return Arrays.stream(best).max().getAsInt();
}

 

Generate the Nth Fibonacci number.

Fibonacci series is like this: 1, 1, 2, 3, 5, 8, 13, …

int GetFib(int n) {

if (n <= 0) return 0;

if (n == 1) return 1;

if (n == 2) return 1;

return GetFib(n-1) + GetFib(n-2);

}

 

Monday, February 26, 2024

 

Part 2 – Reducing operational costs of chatbot model deployment.

This is the second part of the chatbot application discussion here.

The following strategies are required to reduce operational costs for the deployed chat model; otherwise, even idle deployments can incur about a thousand dollars per month.

1.       The app service plan for the app service that hosts the chat user interface must be reviewed for CPU, memory and storage.

2.       It should be set to scale dynamically.

3.       Caching mechanisms must be implemented to reduce the load on the app service. Azure Cache for Redis can help in this regard (see the sketch after this list).

4.       If the user interface has significant assets in terms of JavaScript and images, Content Delivery Networks (CDNs) could be leveraged.

5.       CDNs reduce latency and offload the traffic from the app service to distributed mirrors.

6.       It might be hard to envision the model as a database, but vector storage is used and there is an index as well; it is not just about the embeddings matrix. Choosing the appropriate database tier and SKU and optimizing the queries can help with the cost.

7.       Monitoring and alerts can help to proactively identify performance bottlenecks, resource spikes and anomalies.

8.       Azure Monitor and application insights can track metrics, diagnose issues, and optimize resource usage.

9.       If the chat model experiences idle periods, then the associated resources can be stopped and scaled down during those times.

10.   You don’t need the OpenAI service APIs. You only need the model APIs. Note the following:

a.       Azure OpenAI Model API: this is the API to the GPT models used for text similarity, chat and traditional completion tasks.

b.       Azure OpenAI service API: this encompasses not just the models but also the security, encryption, deployment and management functionalities to deploy models, manage endpoints and control access.

c.       Azure OpenAI Search API allows the chatbot model to retrieve from various data sources.

11.   Storing the vectors and the embeddings and querying the search APIs does not leverage the service API. The model APIs are a must, so include them in the deployment but trim the data sources to just your data.

12.   Sample deployment:
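
Item 3 above can be illustrated with a minimal caching sketch rather than a complete deployment. It assumes an Azure Cache for Redis instance at a placeholder host name and a hypothetical call_chat_model helper that wraps the deployed model API.

import hashlib
import redis

cache = redis.StrictRedis(
    host="<cache-name>.redis.cache.windows.net",
    port=6380,
    password="<access-key>",
    ssl=True,
)

def cached_chat_response(prompt: str, ttl_seconds: int = 3600) -> str:
    key = "chat:" + hashlib.sha256(prompt.encode("utf-8")).hexdigest()
    hit = cache.get(key)
    if hit is not None:
        return hit.decode("utf-8")         # serve repeated prompts from cache
    answer = call_chat_model(prompt)       # hypothetical call into the deployed model API
    cache.setex(key, ttl_seconds, answer)  # expire entries so stale answers age out
    return answer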

 

Sunday, February 25, 2024

 

Exporting and using Word Embeddings:

The following is the example to use word embeddings externally:

import openai

 

# Set up Azure OpenAI configuration

openai.api_type = "azure"

openai.api_version = "2022-12-01"

openai.api_base = "https://YOUR_RESOURCE_NAME.openai.azure.com"  # Replace with your resource endpoint

openai.api_key = "YOUR_API_KEY"  # Replace with your API key

 

# Optionally, the same model can be wrapped with LangChain's embedding class (not used below):
# from langchain.embeddings import OpenAIEmbeddings
# embeddings_model = OpenAIEmbeddings(model="text-embedding-ada-002", chunk_size=1)

 

# Generate embeddings for your text

response = openai.Embedding.create(input="Sample Document goes here", engine="YOUR_DEPLOYMENT_NAME")

embeddings = response['data'][0]['embedding']

print(embeddings)
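
As a small follow-on, the exported embedding can be compared with another embedding outside the service, for example with cosine similarity. This sketch continues the snippet above; the second document and the comparison are illustrative.

import numpy as np

response2 = openai.Embedding.create(input="Another document goes here", engine="YOUR_DEPLOYMENT_NAME")
other = np.array(response2['data'][0]['embedding'])
first = np.array(embeddings)

# Cosine similarity between the two exported embeddings
cosine = float(np.dot(first, other) / (np.linalg.norm(first) * np.linalg.norm(other)))
print(f"cosine similarity: {cosine:.4f}")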

Saturday, February 24, 2024

 

Azure Machine Learning Datastore differentiations:

This is probably going to be an easy read compared to the previous articles referenced below. The problem an Azure ML workspace administrator wants to tackle is to create different datastore objects so that user A gets one datastore but not the others, and user B gets another datastore but not the others. Permissions are granted by roles, and both users A and B have custom roles that grant the permission to read the datastores with the following enumeration:

-          Microsoft.MachineLearningServices/workspaces/datastores/read

 

This permission does not say Datastore1/read but not Datastore2/read. In fact, both users must get the generic datastores/read permission, which they cannot do without. Access controls cannot be granted on individual datastores the way they can be on files.

The solution to this problem is fairly simple. No datastores are created by the administrator. Instead, the users create the datastores programmatically, passing either a Shared Access Signature (SAS) token for an external data storage or an account key. Either way, they must have access to their storage-account/container/path/to/file and can create the SAS token at their choice of scope.
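
As an illustration, the following is a minimal sketch of user A registering such a datastore, assuming the azure-ai-ml (v2) SDK and placeholder names for the subscription, resource group, workspace, storage account, container, and SAS token.

from azure.ai.ml import MLClient
from azure.ai.ml.entities import AzureBlobDatastore, SasTokenConfiguration
from azure.identity import DefaultAzureCredential

ml_client = MLClient(
    DefaultAzureCredential(),
    subscription_id="<subscription-id>",
    resource_group_name="<resource-group>",
    workspace_name="<workspace>",
)

# User A registers a datastore backed by a container only user A can reach,
# authenticating with a SAS token scoped to that container.
user_a_store = AzureBlobDatastore(
    name="user_a_blobstore",
    account_name="<storage-account>",
    container_name="<user-a-container>",
    credentials=SasTokenConfiguration(sas_token="<sas-token>"),
)
ml_client.create_or_update(user_a_store)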

The creation and use of datastores are just like that of credentials or connection objects required for a database. As long as the users manage it themselves, they can reuse it at their will.

If the administrator must be tasked with isolating access to the users to their workspace components and objects, then two workspaces will be created and assigned to groups to which these users can subscribe individually.

If we refer to copilots for information on this topic, they will suggest, as a false positive, that custom roles and Role-Based Access Control will solve this for you. They will not be wrong in asserting “By properly configuring RBAC, we can control access to datastores and other resources within the workspace,” but they simply do not recognize that the differentiation is being made between objects of the same kind. That said, there will be a full commentary on the other mechanisms available, which include

Role-Based Access Control, access control at the external resource, generating and assigning different SAS tokens as secrets, creating virtual network service endpoints, exposing datastores with fine-grained common access, and using monitoring and alerts to detect and mitigate potential security threats. It is also possible to combine a few of the above techniques to achieve the desired isolation of user access.

Previous articles: IaCResolutionsPart81.docx 

 

Friday, February 23, 2024

 

Shared workspaces and isolation

In a shared Azure Machine Learning workspace, achieving isolation of user access to datastores involves implementing a combination of access control mechanisms. This helps ensure that each user can only access the specific datastores they are authorized to use. Here are the key steps to achieve isolation of user access to datastores in a shared Azure Machine Learning workspace:

1.      Role-based Access Control (RBAC): Azure Machine Learning supports RBAC, which allows us to assign roles to users or groups at various levels of the workspace hierarchy. By properly configuring RBAC, we can control access to datastores and other resources within the workspace. For example:

Built-in role: AzureML Data Scientist

Custom role: AzureML Data Scientist Datastore Access

    actions:

-        Microsoft.MachineLearningServices/workspaces/datastores/listsecrets/action

-        Microsoft.MachineLearningServices/workspaces/datastores/read

    data_actions:

-        Microsoft.MachineLearningServices/workspaces/datastores/write

-        Microsoft.MachineLearningServices/workspaces/datastores/delete

    not_actions: (none)
    not_data_actions: (none)

(A sketch of this custom role as a role-definition payload appears after this list.)

2.      Azure Data Lake Storage (ADLS) Data Access Control: If we're using Azure Data Lake Storage Gen2 as a datastore, we can utilize its built-in access control mechanisms. This includes setting access control lists (ACLs) on directories and files, as well as defining access permissions for users and groups.

3.      Shared Access Signatures (SAS): Azure Blob Storage, another commonly used datastore, supports SAS. SAS allows us to generate a time-limited token that grants temporary access to specific containers or blobs. By using SAS, we can control access to data within the datastore on a per-user or per-session basis.

4.      Virtual Network Service Endpoints: To further isolate access to datastores, we can leverage Azure Virtual Network (VNet) Service Endpoints. By configuring service endpoints, we can ensure that datastores are accessible only from specific VNets, thereby restricting access from outside the network.

5.      Workspace-level Datastore Configuration: Within the Azure Machine Learning workspace, we can define multiple datastores and associate them with specific storage accounts or services. By carefully configuring each datastore's access control settings, we can enforce granular access controls and limit user access to specific datastores.

6.      Monitoring and Auditing: It's important to monitor and audit user access to datastores within the shared Azure Machine Learning workspace. Azure provides various monitoring and auditing tools, such as Azure Monitor and Azure Sentinel, which can help us track and analyze access patterns and detect any potential security threats or unauthorized access attempts.

By following these steps and implementing a combination of RBAC, access control mechanisms within datastores, and network-level isolation, we can achieve effective isolation of user access to datastores in a shared Azure Machine Learning workspace.
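
Returning to the custom role in step 1, the following is a minimal sketch of that role expressed as the JSON-style payload used for Azure custom role definitions. The action groupings mirror the listing above, and the subscription id is a placeholder.

import json

custom_role = {
    "Name": "AzureML Data Scientist Datastore Access",
    "IsCustom": True,
    "Description": "Read datastore metadata and secrets; write and delete datastores.",
    "Actions": [
        "Microsoft.MachineLearningServices/workspaces/datastores/listsecrets/action",
        "Microsoft.MachineLearningServices/workspaces/datastores/read",
    ],
    "NotActions": [],
    "DataActions": [
        "Microsoft.MachineLearningServices/workspaces/datastores/write",
        "Microsoft.MachineLearningServices/workspaces/datastores/delete",
    ],
    "NotDataActions": [],
    "AssignableScopes": ["/subscriptions/<subscription-id>"],
}

# Save the definition so it can be supplied to the role-definition creation
# workflow of choice (portal, CLI, or SDK).
with open("datastore_role.json", "w") as f:
    json.dump(custom_role, f, indent=2)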

 

Previous articles: IaCResolutionsPart81.docx 

Thursday, February 22, 2024

 

This is a summary of a book titled “How Successful Engineers Become Great Business Leaders,” written by Paul Rulkens and published by BEP, 2018. As an engineer who transitioned to becoming a boardroom advisor, he draws on his experience to provide tips and valuable insights to others making the leap from the technical to the business domain. He proposes three power laws as a framework and uses them as building blocks for “clarity, focus and execution” to achieve business goals. His tools leverage the pragmatism that engineers are trained in, directing it toward excelling in a business world that is rightfully focused on revenue growth. He explains that problem solving is central to both disciplines.
Nearly one in three Fortune 500 CEOs have an engineering background, and they can become business leaders by gaining non-engineering business experience or broadening their knowledge with additional education or training, such as obtaining an MBA. However, both career strategies can carry major downsides for engineers, such as the need for decades of hard work and cookie-cutter curricula. Engineers can make a smooth transition from the engineering side to the business side by carefully positioning themselves within their corporation or industry. To leverage their engineering talents and skills in business, engineers should embrace three "power laws": "prime location, prime time, and prime knowledge."
Prime location refers to where skills can have the greatest impact and gain the greatest recognition. Prime time refers to when and how your skills can have the greatest impact, while prime knowledge is the value of extra know-how that can have a multiplier effect on your business leadership career.
To achieve business goals, engineers should have clarity, focus, and execution. Achieving ambitious goals requires better skills and behaviors to solve different problems. In today's business world, corporate leaders, including engineers, should focus on revenue growth, strategic planning, innovative practices, and organizational performance. Engineers can excel in the business world due to their practical nature and ability to help organizations execute. However, developing effective execution cultures requires considerable planning, vision, and communication. Engineers can use storytelling and evocative language to encourage an execution culture and become models of attuned, disciplined, aware, and focused executive behavior. Regularly testing their developmental abilities and monitoring internal activities can help them become accomplished business leaders. Strategic quitting, a process where a company abandons a failed project, is essential when things don't work out as expected. Engineers are strategic problem solvers, making them perfect for executives who have obstacles to overcome and challenges to surmount. Control is essential in engineering training, but ambitious engineers need to let go of control and step into the unknown to achieve business success.
Engineers should identify their best fit for business and focus on their controllable talents and skills. Focus on higher-risk development activities, recognizing that larger goals require more obstacles. Expand their capabilities in reality-based thinking, process design, and accelerated learning. Improve leadership behavior by adopting new conduct and modeling it to employees. Build a referral network to secure new customers. Be mindful of biases and achieve strategic goals quickly with minimal resources and energy. As a business leader, consider available time, extra knowledge, business operations, employee rewards, legacy, and growth goals. Monitor progress, provide value to customers, and abandon dogmatic thinking. Embrace the importance of establishing a legacy, achieving one growth goal, and implementing single behaviors to achieve strategic goals.
Summarizing Software: SummarizerCodeSnippets.docx.

Tuesday, February 20, 2024

 

Databricks is a unified data analytics platform that combines big data processing, machine learning, and collaborative analytics tools in a cloud-based environment. As a collaborative workspace for authoring data-driven workflows, it is usually adopted quickly in any organization and is prone to staggering costs as deployments age. This article explains that it need not be so and describes the optimizations and best practices that can reduce infrastructure costs.

One of the advantages of being a cloud resource is that Databricks workspaces can be spun up as many times and for as many purposes as needed. Given the large number of features, the mixed-use cases of data engineering and analytics, and diverse compute and storage intensive usages such as machine learning and ETL, some segregation of workloads to workspaces even within business divisions and workspace lifetimes is called for.

Usage of Databricks to leverage Apache Spark, a powerful open-source distributed computing framework, is significantly different from usage involving Delta Lake, an open-source storage layer that brings ACID transactions over heterogeneous data sources. The former drives compute utilization costs, and the latter drives Databricks Units and network costs. Some of the problems encountered include:

Data skew, with uneven distribution of data over partitions, leads to bottlenecks and poor execution. This is often addressed by repartitioning using salting or bucketing.

Inefficient data formats increase overhead and prolong query completion. This is addressed with more efficient formats such as Parquet, which offers built-in compression, columnar storage, and predicate pushdown.

Inadequate caching leads to repeated disk access that is costly. Spark’s in-memory caching features can help speed up iterative algorithms.

Large shuffles lead to network congestion, latency, and slower execution. This can be resolved with broadcast joins, filtering data early and using partition aware operations.

Inefficient queries occur when the query parameters and hints are not fully leveraged. Predicate pushdowns, partition pruning and query rewrites can resolve these.

Suboptimal resource allocation occurs when CPU, memory or storage is constrained. Monitoring resource usage and adjusting resource limits accordingly mitigate this.

Garbage collection settings that are not properly tuned degrade performance. Much like resources, these can be monitored and tuned.

Outdated versions and missing bug fixes cause avoidable failures. These can be solved with patching and upgrades.
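
A few of these mitigations can be sketched in PySpark. The example below assumes a SparkSession named spark and two hypothetical Parquet inputs, a large fact table and a small dimension table; paths and column names are placeholders.

from pyspark.sql import functions as F

facts = spark.read.parquet("/mnt/data/facts")      # large, possibly skewed
dims = spark.read.parquet("/mnt/data/dimensions")  # small lookup table

# Filter early and cache the trimmed, reused dataset to avoid repeated disk access.
recent = facts.filter(F.col("event_date") >= "2024-01-01").cache()

# Broadcast the small side to avoid a large shuffle during the join.
joined = recent.join(F.broadcast(dims), on="dim_id", how="left")

# Salt a skewed key before a wide aggregation so partitions are more even.
counts = (joined
          .withColumn("salt", (F.rand() * 16).cast("int"))
          .groupBy("customer_id", "salt").agg(F.count("*").alias("partial_count"))
          .groupBy("customer_id").agg(F.sum("partial_count").alias("event_count")))

counts.write.mode("overwrite").parquet("/mnt/data/event_counts")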

Similarly, the best practices can be enumerated as:

Turning off compute that is not in use and enabling auto-termination (see the sketch after this list).

Sharing compute between different groups via consolidation at relevant scope and level.

Tracking costs against usages so that they can be better understood.

Auditing usages against users and principals to take corrective action.

Leveraging spot instances for compute that come with a discount.

Using Photon acceleration, which speeds up SQL queries and the Spark SQL API.

Using built-in and custom mitigations for patterns of problems encountered at resource and component levels.

Lastly, turning off features that are not actively used and using appropriate features for their recommended use also help significantly.
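
As a sketch of the auto-termination and spot-instance practices above, the following creates a cluster through the Databricks Clusters REST API. The workspace URL, token, and node type are placeholders, and the values should be tuned to the workload.

import requests

workspace_url = "https://<databricks-instance>.azuredatabricks.net"
token = "<personal-access-token>"

payload = {
    "cluster_name": "cost-conscious-cluster",
    "spark_version": "13.3.x-scala2.12",
    "node_type_id": "Standard_DS3_v2",
    "num_workers": 2,
    "autotermination_minutes": 30,  # turn off idle compute automatically
    "azure_attributes": {
        "first_on_demand": 1,       # keep the driver on on-demand capacity
        "availability": "SPOT_WITH_FALLBACK_AZURE",
        "spot_bid_max_price": -1,   # pay up to the on-demand price
    },
}

resp = requests.post(
    f"{workspace_url}/api/2.0/clusters/create",
    headers={"Authorization": f"Bearer {token}"},
    json=payload,
)
resp.raise_for_status()
print(resp.json()["cluster_id"])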


Monday, February 19, 2024

 

One of the more recent additions to Azure resources has been Azure Machine Learning Studio. This is a managed machine learning environment that allows you to create, manage, and deploy ML models and applications. It is part of the Azure AI resource, which provides access to multiple Azure AI services with a single setup. Some of the features of Azure Machine Learning Studio resource are:

- It has a GUI-based integrated development environment for building machine learning workflows on Azure.

- It supports both no-code and code-first experiences for data science.

- It lets you use various tools and components for data preparation, feature engineering, model training, evaluation, and deployment.

- It enables you to collaborate with your team and share datasets, models, and projects.

- It allows you to configure and manage compute targets, security settings, and external connections.

When provisioning this resource for use by data scientists, it is important to consider the following best practices:

- The workspace itself must allow outbound connectivity to the public network. One of the ways to do this is to allow it to be accessible from all or selected public IP addresses.

- The clusters must be provisioned with no node public IP addresses, conforming to the well-known no-public-IP (NPIP) pattern. This is done by adding the compute to a subnet in a virtual network with service endpoints for Azure Storage, Key Vault, and container registry, and with default routing (see the sketch after this list).

- Since the workspace and its dependent resources namely storage account, key vault, container registry and application insights are independently created, it is helpful to have the same user-assigned managed identity associated with them, which also makes it convenient to customize data plane access to other resources not part of this list such as a different storage account or key vault. The same goes for compute which can also be launched with this identity.

- Permissions granted to various roles on this resource can be customized to be further restricted since this is a shared workspace.

- Code that is executed by data scientists in this studio can be categorized as interactive Python notebooks, Spark code, or non-interactive jobs. Permissions necessary to run each of them must be independently tried out.

- There are various kernels and serverless Spark compute available to execute the user-defined code in a notebook. The user-assigned managed identity used to facilitate data access for this code must have both control plane read access to perform actions such as getAccessControl and data plane permissions such as blob data read and write. The logged-in user's credentials are automatically used over this session created with the managed identity for the user to perform the data access.

- The non-interactive jobs require a specific permission for any user to submit a run within an experiment.

Together, the built-in capabilities and customizations of this resource can immensely benefit data scientists in training their models. Previous articles: IacResolutionsPart77.docx
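
The NPIP compute and shared user-assigned identity practices above can be sketched as follows, assuming the azure-ai-ml (v2) SDK, an existing virtual network with the required service endpoints, and a pre-created user-assigned managed identity; all names are placeholders.

from azure.ai.ml import MLClient
from azure.ai.ml.entities import (
    AmlCompute,
    IdentityConfiguration,
    ManagedIdentityConfiguration,
    NetworkSettings,
)
from azure.identity import DefaultAzureCredential

ml_client = MLClient(
    DefaultAzureCredential(),
    subscription_id="<subscription-id>",
    resource_group_name="<resource-group>",
    workspace_name="<workspace>",
)

cluster = AmlCompute(
    name="npip-cpu-cluster",
    size="STANDARD_DS3_V2",
    min_instances=0,
    max_instances=4,
    idle_time_before_scale_down=1800,
    # No node public IPs (NPIP); nodes join the delegated subnet instead.
    enable_node_public_ip=False,
    network_settings=NetworkSettings(vnet_name="<vnet-name>", subnet="<subnet-name>"),
    # Same user-assigned managed identity as the workspace and its dependent resources.
    identity=IdentityConfiguration(
        type="user_assigned",
        user_assigned_identities=[
            ManagedIdentityConfiguration(resource_id="<identity-resource-id>")
        ],
    ),
)
ml_client.begin_create_or_update(cluster).result()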

Sunday, February 18, 2024

 

The Databricks Unity Catalog offers centralized access control, auditing, lineage, and data discovery capabilities across multiple Databricks workspaces. It includes user management, a metastore, clusters, SQL warehouses, and a standards-compliant security model based on ANSI SQL. The catalog also includes built-in auditing and lineage, allowing for user-level audit logs and data discovery. The metastore is the top-level container, while the data catalog has a three-level namespace, namely catalog.schema.table. The Catalog Explorer allows for the creation of tables and views, while volumes provide governance for nontabular data. The catalog is multi-cloud friendly, allowing for federation across multiple cloud vendors and unified access. The idea here is that you can define once and secure anywhere.

Databricks Unity Catalog consists of a metastore and a catalog. The metastore is the top-level logical container for metadata, storing data assets like tables or models and defining the namespace hierarchy. It handles access control policies and auditing. The catalog is the first-level organizational unit within the metastore, grouping related data assets and providing access controls. However, only one metastore per deployment is used. Each Databricks region requires its own Unity Catalog metastore.             

There is a Unity Catalog quick start notebook in Python. The key steps include creating a workspace with the Unity Catalog metastore, creating a catalog, creating a managed schema, managing a table, and using the Unity Catalog with the Pandas API on Spark. The code starts by creating a catalog, showing it, and then creating a managed schema. The next steps involve creating and managing schemas, extending them, and granting permissions. A managed table is then created using the schema from the earlier step, and the table along with all available tables is shown. The final step involves using the Pandas API on Spark, which can be found in the official documentation for Databricks. This quick start is a great way to get a feel for the process and to toggle back and forth with the key steps inside the code.
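
A minimal sketch of those quick start steps, intended to run in a notebook attached to a Unity Catalog enabled cluster; the catalog, schema, table, and grantee group names are placeholders.

spark.sql("CREATE CATALOG IF NOT EXISTS quickstart_catalog")
spark.sql("SHOW CATALOGS").show()

spark.sql("USE CATALOG quickstart_catalog")
spark.sql("CREATE SCHEMA IF NOT EXISTS quickstart_schema")
spark.sql("GRANT USE CATALOG ON CATALOG quickstart_catalog TO `data-analysts`")
spark.sql("GRANT USE SCHEMA, SELECT ON SCHEMA quickstart_schema TO `data-analysts`")

# Create and inspect a managed table under the three-level namespace.
spark.sql("""
    CREATE TABLE IF NOT EXISTS quickstart_catalog.quickstart_schema.quickstart_table
    (id INT, value STRING)
""")
spark.sql("SHOW TABLES IN quickstart_catalog.quickstart_schema").show()

# The Pandas API on Spark can read the same governed table.
import pyspark.pandas as ps
df = ps.read_table("quickstart_catalog.quickstart_schema.quickstart_table")
print(df.head())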

The Unity Catalog system employs object security best practices, including access control lists (ACLs) for granting or restricting access to specific users and groups on securable objects. ACLs provide fine-grain control, ensuring access to sensitive data and objects. Less privilege is used, limiting access to the minimum required, avoiding broad groups like All Users unless necessary. Access is revoked once the purpose is served, and policies are reviewed regularly for relevance. This technique enhances data security and compliance, prevents unnecessary broad access, and controls a blast radius in case of security breaches.

The Databricks Unity Catalog system offers best practices for catalogs. First, create a separate catalog for loose coupling, managing access and compliance at the catalog level. Align catalog boundaries with business domains or applications, such as marketing analytics or HR. Customize security policies and governance within the catalog to drill down into specific domains. Create access control groups and roles specific to a catalog, fine-tune read-write privileges, and customize settings like resource quotas and scrum rules. These fine-grain policies provide the best of security and functionality in catalogs.

To ensure security and manage external connections, limit visibility by granting access only to specific users, groups, and roles, and apply least privilege. Limit access to only the necessary users and groups using granular access control lists, or ACLs. Be aware of team activities and avoid giving teams unnecessary access to external resources. Tag connections effectively for discovery using source categories or data classifications, and discover connections by use case for organizational visibility. This approach enhances security, prevents unintended data access, and simplifies external connection discovery and management.

Databricks Unity Catalog business unit best practices emphasize the importance of providing dedicated sandboxes for each business unit, allowing independent development environments and preventing interference between different workflows. Centralizing shareable data into production catalogs ensures consistency and reduces the need for duplicate data. Discoverability is crucial, with meaningful naming conventions and metadata best practices. Federated queries via the Lakehouse architecture unify data access across silos, governed securely via contracts and permissions. This approach supports autonomy for units, increases productivity through reuse, and maintains consistency with collaborative governance.

In conclusion, the Unity catalog standard allows centralized data governance and best practices for catalogs, connections, and business units.

https://docs.databricks.com/en/data-governance/unity-catalog/enable-workspaces.html#enable-workspace

https://docs.databricks.com/en/data-governance/unity-catalog/create-metastore.html

Saturday, February 17, 2024

 

This is a summary of the book “What You Can Change and What You Can’t – The Complete Guide to Successful Self-Improvement,” written by Martin E.P. Seligman and published by Vintage Books, 2007. A former president of the American Psychological Association, he delivers a pragmatic treatise on problems that are treatable and those that are not. His attitude is candid, and his advice is well-informed albeit generalized. His message is, at times, as simple as: if you are overweight and you have gotten thin, you have beaten the odds. Among his other messages, he believes that radical changes are possible so long as we know which areas are more amenable to change. Those on the opposite side of the spectrum can be misguidedly deemed treatable by hoaxes. The degree to which a problem exists might indicate its treatability. The past, and particularly mistreatment during childhood, is wrongly blamed. If we are hopeful and optimistic, we can begin change, but we must be realistic. Therapies and drugs control only the symptoms. Dieting and alcohol rehabilitation do not cure, but obsession, panic attacks, and phobias are highly treatable. Similarly, depression, anxiety, and anger are hard-wired into the human psyche but treatable. Sometimes the best thing people can do with certain deep-rooted emotional maladies is to learn to live with them.

Psychological problems such as depression, addiction, obsessiveness, anxiety, and post-traumatic stress often require courage and self-improvement. Current treatments, such as drugs and psychotherapies, have an effectiveness rate of only about 65% due to the high degree of heritability of problem-causing personality traits. People who suffer from these mental afflictions often hope to live courageously with their problems. It is possible that both Winston Churchill and Abraham Lincoln were unipolar depressives.

People often believe they can benefit from self-improvement, such as weight loss, meditation, and suppressing sexual desire. However, these methods often fail due to genetic factors and biochemical factors. Our personality is more a product of our genes than we would have believed a decade ago.

Most people can change some things, such as depression, sexual dysfunction, mood, and outlook, while factors that people usually cannot change include severe weight problems, alcoholism, and homosexuality. Understanding one's psychological state helps to deal with and potentially change it for the better.

Dysphoric emotions, such as depression, anxiety, and anger, have been used as warnings of danger and loss throughout history. These emotions can be viewed as bad weather and can be managed through natural methods like meditation and progressive relaxation. However, unremitting, intense anxiety can indicate serious disorders like obsession, phobia, and panic, which require therapeutic exorcism.

Panic attacks, which manifest catastrophic thinking, are curable and can be treated with cognitive therapy. Phobias, which stem from evolutionary history, represent unreasonable fears and can be treated through systematic desensitization or "flooding." Optimism is a learned skill that can improve work achievement and physical health. Obsessions involve repetitive thoughts or images that can be depressing, scary, and repugnant, leading to obsessive-compulsive disorder (OCD). Treatment involves exposing the individual to a fearful situation and preventing the ritualistic behavior, eventually disarming the obsession. Understanding and managing these emotions is crucial for overall well-being.

Depression, anger, and post-traumatic stress disorder are common mental illnesses that can lead to feelings of sadness, helplessness, and despair. Bipolar depression, a heritable condition, can be treated with lithium, while unipolar depression stems from loss, pain, and sadness. Treatments include electroconvulsive shock, medication, interpersonal therapy (IPT), and cognitive therapy (CT). Anger, a powerful emotion, can be controlled by counting to 10, but it can also spark violence. Post-traumatic stress disorder (PTSD) involves extraordinary loss or tragedy, and some people may require drugs and psychotherapy. Sex, diet, and alcohol also play a role in these disorders. Transsexuals may believe they are trapped in bodies of the wrong gender, while exclusive homosexuality is deeply ingrained. Therapy can help alter feelings about sexual choice and preferences, but only within specified limits. Successful living often involves learning to make the best of a bad situation.

Dieting is not effective for heavy people, as it can make them more overweight and unhealthier. People who weigh less live longer, but society's emphasis on an "ideal weight" is misplaced. Exercise is good for the body and helps with weight control, but don't diet. Stomach bypass surgery can help the extremely obese. Common explanations such as overeating, having an "overweight personality," physical inactivity, and weak willpower do not hold up. Alcoholism is not a physical pathology, and treatments like Alcoholics Anonymous (AA) may not be effective. The depth of the problem should be considered, as most psychological problems start with childhood trauma. Focusing on changing aspects of personality that are within control and amenable to change is better. The "Serenity Prayer" counsels courageously changing the things that can be changed while accepting the things that cannot. Therapy or drugs may help if the problem is deep-seated, but they may not be effective in the long run.

Previous book summaries: BookSummary52.docx 
Summarizing Software:
SummarizerCodeSnippets.docx.


Thursday, February 15, 2024

 

Ways to replicate databases in Azure for DevOps

Azure relational databases often consolidate data access across services and become important for transactional processing. The data stored in these databases is mission-critical, and consequently some steps to ensure business continuity and disaster recovery are needed. Daily backups and continuous replication are some of the most frequently sought-after methods to do that, and organizations often build their own GitOps initiatives to automate backup and restore across databases and database servers. This article compares these two methods.

Backup and restore is a feature that allows you to create a copy of your server and its databases at a specific point in time, and restore it to a new server if needed. This is useful for recovering from user or application errors, or for migrating data to a different region.

Continuous replication is a feature that allows you to create one or more read-only replicas of your server in the same or different region, and synchronize them with the primary server asynchronously. This is useful for scaling out read workloads, improving availability, and reducing latency.

If a database server is introduced into every Azure subscription that an organization owns, with the sole purpose of receiving replications from other databases and database servers, then it can even eliminate the need for a backup and restore, or provide different levels of service for source databases.

There cannot be multiple database servers in Azure that replicate to the same database server instance, because each replica server must have a unique server ID in a replication topology. But we can have multiple source servers that replicate to different replica servers using the Data-in Replication feature or the read replica feature. These features allow us to synchronize data from a source MySQL server to one or more read-only replica servers in the same or a different region. Also, we cannot set up multiple replication channels from a single replica to a single source, again because the server IDs of replicas must be unique in a replication topology. But we can set up multiple replication channels from a single source to multiple replicas using the read replica feature in Azure Database for MySQL. This feature allows us to replicate data from an Azure Database for MySQL single server or flexible server instance to up to five or ten read-only servers, respectively. This can help us scale out read workloads, improve availability, and reduce latency.

The main differences between backup and restore and continuous replication are:

    - Backup and restore requires you to manually initiate the restore process, while continuous replication automatically streams the data from the primary to the replicas.

    - Backup and restore has a longer recovery time objective (RTO) and a higher potential for data loss (a larger recovery point objective, RPO) than continuous replication, depending on the backup frequency and retention period.

    - Backup and restore is at the server level, not at the database level, while continuous replication allows you to select which databases to replicate.

    - Backup and restore can be configured to use either locally redundant or geographically redundant storage, while continuous replication always uses geographically redundant storage.

    - Backup and restore is included in the service cost, while continuous replication incurs additional charges for the replicas.

Besides these options, the database migration service and mysqldump also provide resiliency.

The Azure Database Migration Service (DMS) instance comes with the following.   

Pros:  

1.       One instance works across all subscriptions.  

2.       Can transfer between on-premises and cloud and cloud to cloud.  

3.       Pay-per-use billing.  

4.       Provides a wizard to create data transfer activity.  

Cons:  

1.       Limited features via IaC as compared to the portal but enough to get by.  

2.       Not recommended for developer instances or tiny databases that can be exported and imported via mysqldump.  

3.       binlog_expire_logs_seconds must be set to a non-zero value on the source server.  

4.       Supports only SQL login.

The steps to perform the data transfer activity between the source and destination MySQL servers involves:  

1.       Create a source mysql instance mysql-src-1 in rg-mysql-1  

2.       Create a database and add a table to mysql-src-1  

3.       Create a destination mysql instance mysql-dest-1 in rg-mysql-1  

4.       Deploy the DMS service instance and project.  

5.       Create and run an activity to transfer data.  

6.       Verify the data.  
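
Step 6 can be sketched with a quick row-count comparison, assuming mysql-connector-python and placeholder credentials, database, and table names.

import mysql.connector

def row_count(host, table):
    conn = mysql.connector.connect(
        host=host,
        user="<admin-user>",
        password="<password>",
        database="<database>",
    )
    try:
        cur = conn.cursor()
        cur.execute(f"SELECT COUNT(*) FROM {table}")
        return cur.fetchone()[0]
    finally:
        conn.close()

source = row_count("mysql-src-1.mysql.database.azure.com", "<table>")
destination = row_count("mysql-dest-1.mysql.database.azure.com", "<table>")
print(f"source rows: {source}, destination rows: {destination}, match: {source == destination}")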

The mysqldump utilities, on the other hand, prepare the SQL for replay against the destination server. All the tables of the source database can be exported. 

Pros: 

1.       The SQL statements fully describe the source database. 

2.       They can be edited before replaying 

3.       There are many options natively supported by the database server. 

Cons: 

1.       It acquires a global read lock on all the tables at the beginning of the dump. 

2.       If long updating statements are running when the FLUSH statement is issued, the MySQL server may stall until those statements finish. 

 

These are some of the ways to replicate databases across servers.