Sunday, November 12, 2023

 

Using FastText word embeddings to improve text summarization.

Word vectors are commonly used to represent the associations of a word with other words. The vector form is useful for classification and regression tasks. Two popular forms of word vectors are FastText and Word2Vec. FastText treats each word as composed of character n-grams, while Word2Vec treats the text as a bag of words. A character n-gram is a contiguous sequence of n characters from a given word. For example, the trigrams (n=3) of the word “where” are <wh, whe, her, ere, re>. FastText includes the character n-grams as well as the word itself, which means the input for “where” is <wh, whe, her, ere, re> together with <where>.
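As a quick illustration, the boundary-marked character n-grams of a word can be generated with a few lines of Python (this helper is purely illustrative and is not part of FastText or any library mentioned here):

```python
def char_ngrams(word, n=3):
    """Return the boundary-marked character n-grams of a word, plus the word itself."""
    marked = f"<{word}>"
    ngrams = [marked[i:i + n] for i in range(len(marked) - n + 1)]
    return ngrams + [marked]

print(char_ngrams("where"))
# ['<wh', 'whe', 'her', 'ere', 're>', '<where>']
```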

Since the objective of text summarization is to bring out the salient topic in the text, FastText is better suited than Word2Vec to summarize an entire text input down to as little as a single topic word. As with most machine learning models, training the model takes more compute resources than executing it for prediction. A trained FastText model is lightweight enough to be hosted on a variety of devices.
Let us take an example for this extreme summarization with FastText.

```python
from nessvec.indexers import Index

index = Index(num_vecs=200_000)  # the default is 100_000

index.extreme_summarize('hello and goodbye')
>>> array(['hello'], dtype=object)

index.query_series(index[0])
,      1.92093e-07
and    3.196178e-01
(      3.924445e-01
)      4.218287e-01
23     4.463376e-01
22     4.471740e-01
18     4.490819e-01
19     4.515444e-01
21     4.544248e-01
but    4.546938e-01
dtype: float64
```

If we were to take the average of index['hello'] and index['goodbye'], the result would be closer to ‘goodbye’.

If we were to normalize the average, say with:

```python
import numpy as np

avg = (index['hello'] + index['goodbye']) / 2

index.query_series(avg / np.linalg.norm(avg))
>>> # this would be closer to goodbye
```
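One way to see how close the call is: compare the cosine similarity of the averaged vector against each constituent word. A short sketch, assuming index['hello'] and index['goodbye'] return NumPy arrays as the examples above suggest:

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

avg = (index['hello'] + index['goodbye']) / 2

# If these two similarities are nearly equal, small weighting or rounding changes
# can flip which word the averaged vector lands closest to.
print(cosine(avg, index['hello']), cosine(avg, index['goodbye']))
```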

This suggests that numerical rounding and weighting can change the outcome of the extreme summarization. It is better not to impose arithmetic on the vectors and to use them merely for their latent semantics. A sample application is available at https://booksonsoftware.com/text

Saturday, November 11, 2023

 

This is a summary of the podcast episode “Think Like a Rocket Scientist” from The Innovation Show (2021), with Aidan McCullen as host and Ozan Varol as guest. Mr. Varol is a speaker, author, and former rocket scientist who brings insights to the episode from his book “Think Like a Rocket Scientist: Simple Strategies for Giant Leaps in Work and Life.” He aims to help people boost their innovation and creativity so that they can take leaps in their personal lives, careers, or businesses.

The main takeaways from his talk are as follows: stay curious and take time for play to boost creativity, shed your old skin to interrupt the power of the status quo, use first-principles thinking to get back to fundamentals, look out for invisible rules that limit your thinking, welcome uncertainty despite the fear, and value questions more than answers.

The most innovative thinkers hold on to their curiosity as adults, retaining the ability to think beyond the status quo. He says that just as both work and play are important, deliberate practice has its place in helping people refine a skill and become experts, but deliberate play nurtures creativity and helps people find new ways forward. When we take small breaks during the day, we allow innovative insights to come. Varol calls this “airplane mode” and even budgets time for it during the day.

Taking the example of a snake that sheds its skin, Varol says that we must let go of what’s no longer serving us to be able to do the next thing. The owners of the Chicago restaurant Alinea understood that restaurants struggle to survive their own success. In their most profitable year to date, they chose to rebuild the restaurant from scratch. The existing skill base and credentials did not go to waste.

“First principles” thinking is a way of returning to fundamentals, especially when discovering a new way to execute the original vision. He also advises looking out for invisible rules that limit one’s thinking; these are the ones that persist past their usefulness. Questioning why we are doing what we are doing can yield productive insights. He asks us to welcome uncertainty despite the fear and adds that when certainty ends, progress begins. Insisting on certainty before we make a move will keep us mired in the status quo. The same goes for companies.

We are often accustomed to spotting the right answer or solution and spend a lot of energy trying to refine it from our environment and sources. Instead, he suggests asking questions, because questions carry a lot of value. The simple act of asking a question can reframe a problem and reveal previously hidden answers. The development of the 2003 Mars rovers started with someone asking: why not send two rovers instead of one? That question transformed the mission and led to its enormous success.

Ozan Varol is a professor of law at Lewis & Clark Law School and the author of “Think Like a Rocket Scientist: Simple Strategies for Giant Leaps in Work and Life.”

Reference to summarizing software: https://booksonsoftware.com/text

CodingExercise-11-11-2023.docx

Friday, November 10, 2023

 

This is a summary of a fun read, the book “The Everyday Warrior: A No-Hack, Practical Approach to Life,” written by Mike Sarraille in 2022. He talks about the valuable lessons life holds along the way to achieving our goals, and he even calls out failures as a source of encouragement rather than discouragement. Using military terminology, he teaches soft skills such as drive, resilience, and a positive attitude. He says maintaining a balance between physical, mental, and emotional needs is important. Finally, he talks about why shortcuts disappoint and why the journey is often its own reward.

The main takeaways are that the “Everyday Warrior” uses failure as a teacher and motivator. Mentally fit people treat the brain as a muscle they condition, exercise, and rest. Time constraints, fear, doubt, and weak initiative prevent people from committing to achieving their goals. Instant gratification must be avoided, and a step-by-step technique can be adopted instead to pursue one’s goals. People yearn for connection via a social circle or tribe. Time must be taken to rest and reflect.

Maintaining a balance when striving towards goals is an art and a science. We cannot let anxiety, depression, social isolation, apathy and frustration deter us. The Everyday warrior’s traits are instead resilience, confidence, positive attitude, and a drive to achieve and improve. This mindset makes us accountable, disciplined, pragmatic, vulnerable, humble and capable of honest self-assessment.

Consider that Michael Phelps, the Olympic swimmer, won 23 gold medals and constantly sought success in the Olympics at the cost of a balanced boyhood, resulting in subsequent struggles with alcohol and drug abuse. The “whole person concept,” on the other hand, is practiced by Army Green Berets and Navy SEAL recruiters, who excel in volatile, uncertain, complex, and ambiguous (VUCA) situations.

An everyday warrior knows that resting and caring for oneself is as important as fighting the battle. Acquiring self-knowledge is challenging because media, school, families, and other “external influences” shape what people think they want. Comparing oneself with others or choosing goals based on how others will perceive them is not the right approach. Instead, we must determine our own definition of success. An example of this is Marine Rob Jones, who lost both his legs to an IED on a tour in Afghanistan but, when he returned, made it his goal to raise funds for veterans. As part of one of his campaigns, he ran 31 marathons in 31 days. There is a famous saying attributed to the Stoic emperor Marcus Aurelius that we cannot control events, but we can control our mind. We must shift away from a victim mindset. People fail to reach their goals only out of doubt, fear, time constraints, and unwillingness to make the required effort.

The five steps to succeed are: 1. set a smart goal, 2. develop a plan of sequential tasks or “small victories,” 3. take actions that disrupt the comfort of old habits, 4. make time for introspection, and 5. repeat the process until we are “more accomplished and capable.”

We must avoid shortcuts in pursuing our goals. Building resistance to instant gratification by showing gratitude and training oneself to avoid instant rewards will help us in the long run.

Thursday, November 9, 2023

 

Azure Machine Learning workspace integrates with Azure Container Registry, Azure Key Vault, an Azure Storage account, and Azure Application Insights. Model building requires exploratory data analysis, data preprocessing, and prototyping to validate hypotheses. What differentiates it as an interactive and experimentation ML platform from others, such as the Databricks workspace, is that it aims to provide a unified, seamless experience with its own libraries that automate many of the tasks needed to build a machine learning model that serves the business needs.
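For instance, provisioning a workspace with the v1 SDK also provisions these associated resources; a minimal sketch, with placeholder names for the workspace, subscription, and resource group:

```python
from azureml.core import Workspace

# Creating the workspace provisions the associated Key Vault, Storage account,
# Application Insights, and container registry (the registry may be created lazily).
ws = Workspace.create(name="aml-temp",                      # placeholder workspace name
                      subscription_id="<subscription-id>",  # placeholder
                      resource_group="rg-temp",             # placeholder
                      location="centralus",
                      create_resource_group=True)

ws.write_config()  # saves config.json so later snippets can call Workspace.from_config()
```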

For example, the following code automates creation of compute needed to build a model.

from azureml.core import Workspace

from azureml.core.compute import AmlCompute, ComputeTarget

workspace = Workspace.from_config()

amlcompute_cluster_name = "cpu-cluster"

provisioning_config = AmlCompute.provisioning_configuration(vm_size="STANDARD_D2_V2", max_nodes=1)

compute_target = ComputeTarget.create(workspace, amlcompute_cluster_name, provisioning_config)

compute_target.wait_for_completion(show_output=True, min_node_count=None, timeout_in_minutes=20)

 

The compute is part of the workspace so the Identity and Access Management provided on the workspace is sufficient to create the compute.

Storage accounts, on the other hand, are external and might have their own access restrictions. There are quite a few ways to connect to an external Azure storage account from an Azure Machine Learning workspace. This section reviews some of them. All of these require a Datastore object to be instantiated that stores the connection information for the Azure Storage service.

The first method involves the azureml.core library to instantiate a datastore class as follows:

from azureml.core import Workspace, Datastore

from azureml.core.dataset import Dataset

from azureml.data.datapath import DataPath

ws = Workspace.from_config()

datastore = Datastore.register_azure_blob_container(ws, datastore_name="ds1", container_name="temporary", account_name="somestorageaccount", sas_token="<for-connecting-with-SAS-URL>")

dataset = Dataset.Tabular.from_parquet_files(path = [(datastore, 'temporary/yellow_tripdata_2023-08.parquet')])

# preview the first 3 rows of the dataset

dataset.take(3).to_pandas_dataframe()

 

The Datastore is a common resource across many usages and is registered with the ML workspace with the credentials required to connect to the external storage account.
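Once registered, the same datastore can be retrieved by name from anywhere that has a handle to the workspace; a brief sketch with the v1 SDK (the name ds1 matches the registration above):

```python
from azureml.core import Workspace, Datastore

ws = Workspace.from_config()

# Look up the previously registered datastore by name
datastore = Datastore.get(ws, datastore_name="ds1")

# The workspace also exposes its default datastore and all registered datastores
default_datastore = ws.get_default_datastore()
print(list(ws.datastores.keys()))
```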

The fully qualified url for locating a blob on the storage account associated with the ML workspace is:
uri = f'azureml://subscriptions/{subscription}/resourcegroups/{resource_group}/workspaces/{workspace}/datastores/{datastore_name}/paths/{path_on_datastore}'
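If the azureml-fsspec package is installed in the notebook environment, such a datastore URI can be read directly with pandas; a minimal sketch, assuming the placeholders in the URI above have been filled in and a parquet engine such as pyarrow is available:

```python
%pip install azureml-fsspec

import pandas as pd

# pandas resolves the azureml:// URI through the azureml-fsspec filesystem and
# authenticates with the identity the notebook is running under.
df = pd.read_parquet(uri)
print(df.head(3))
```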

The notebook executes with the default credentials of the logged in user, so it is possible to not specify the credentials when creating the datastore.

%pip install azure-ai-ml

from azure.ai.ml import MLClient

from azure.identity import DefaultAzureCredential

from azure.ai.ml import command, Input

from azure.ai.ml.entities import AzureBlobDatastore

from azure.ai.ml.entities import Environment

ml_client = MLClient(

    DefaultAzureCredential(), subscription, resource_group, workspace

)

blob_credless_datastore = AzureBlobDatastore(

    name="ds4",

    description="Credential-less datastore pointing to a blob container.",

    account_name=account_name,

    container_name="temporary",

)

 

ml_client.create_or_update(blob_credless_datastore)
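The command and Input imports above can then be used to submit a job that consumes data from the registered datastore through a datastore-relative URI; a sketch in which the source folder, training script, and curated environment name are illustrative assumptions:

```python
job = command(
    code="./src",  # illustrative local folder containing train.py
    command="python train.py --trips ${{inputs.trips}}",
    inputs={
        "trips": Input(
            type="uri_file",
            path="azureml://datastores/ds4/paths/temporary/yellow_tripdata_2023-08.parquet",
        )
    },
    environment="AzureML-sklearn-1.0-ubuntu20.04-py38-cpu@latest",  # assumed curated environment
    compute="cpu-cluster",
)

ml_client.jobs.create_or_update(job)
```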

 

With the help of datastore, accessing a dataset is as simple as:

datastore = Datastore.get(ws, datastore_name="ds4")

dataset = Dataset.Tabular.from_parquet_files(path = [(datastore, 'temporary/yellow_tripdata_2023-08.parquet')])

# preview the first 3 rows of the dataset

dataset.take(3).to_pandas_dataframe()

 

Reference:

Different types of algorithms for models: MLRxFastLinear.docx

Wednesday, November 8, 2023

Testing an Azure ML workspace deployment

 

This is a sample script to run for testing an Azure ML workspace deployment. For further information about the workspace, please visit IaCResolutionsPart38.docx

import azureml.core

import pandas as pd

import numpy as np

import logging

print("Azure ML SDK Version:", azureml.core.VERSION)

 

from azureml.core import Workspace, Experiment

workspace = Workspace.from_config()

experiment = Experiment(workspace, "automl_bikeshare_forecast")

 

from azureml.core.compute import AmlCompute, ComputeTarget

amlcompute_cluster_name="cpu-cluster"

provisioning_config = AmlCompute.provisioning_configuration(vm_size="STANDARD_D2_V2", max_nodes=1)

compute_target = ComputeTarget.create(workspace, amlcompute_cluster_name, provisioning_config)

compute_target.wait_for_completion(show_output=True, min_node_count=None, timeout_in_minutes=20)

 

compute_target.delete()

compute_target.wait_for_completion(show_output=True, min_node_count=None, timeout_in_minutes=20)

The result of running a script like above should be a successful creation of a compute cluster similar to what’s shown here:

{'id': '/subscriptions/656e67c6-f810-4ea6-8b89-636dd0b6774c/resourceGroups/rg-temp/providers/Microsoft.MachineLearningServices/workspaces/amltemp1/computes/cpu-cluster',

 'name': 'cpu-cluster',

 'type': 'Microsoft.MachineLearningServices/workspaces/computes',

 'location': 'centralus',

 'tags': {},

 'properties': {'createdOn': '2023-11-06T00:21:48.6348786+00:00',

  'modifiedOn': '2023-11-06T00:22:13.9257782+00:00',

  'disableLocalAuth': False,

  'description': None,

  'resourceId': None,

  'computeType': 'AmlCompute',

  'computeLocation': 'centralus',

  'provisioningState': 'Succeeded',

  'provisioningErrors': None,

  'provisioningWarnings': {},

  'isAttachedCompute': False,

  'properties': {'vmSize': 'STANDARD_D2_V2',

   'vmPriority': 'Dedicated',

   'scaleSettings': {'maxNodeCount': 1,

    'minNodeCount': 0,

    'nodeIdleTimeBeforeScaleDown': 'PT30M'},

   'subnet': None,

   'currentNodeCount': 0,

   'targetNodeCount': 0,

   'nodeStateCounts': {'preparingNodeCount': 0,

    'runningNodeCount': 0,

    'idleNodeCount': 0,

    'unusableNodeCount': 0,

    'leavingNodeCount': 0,

    'preemptedNodeCount': 0},

   'allocationState': 'Steady',

   'allocationStateTransitionTime': '2023-11-06T00:22:12.801+00:00',

   'errors': None,

   'remoteLoginPortPublicAccess': 'Enabled',

   'osType': 'Linux',

   'virtualMachineImage': None,

   'enableBatchPrivateLink': False}}}

Tuesday, November 7, 2023

 

Applying MicrosoftML rxFastLinear algorithm:  

 

While rxLogisticRegression is a binary and multiclass classifier that uses a regular regression model, the rxFastLinear algorithm is a fast linear model trainer based on the Stochastic Dual Coordinate Ascent (SDCA) method. It combines the capabilities of logistic regression and SVM algorithms. The dual problem is solved by dual ascent: the dual objective, built from scalar convex conjugates of the per-example losses and adjusted by the regularization of the weight vector, is maximized one coordinate at a time. It supports three types of loss functions - log loss, hinge loss, and smoothed hinge loss - and is used for applications such as payment default prediction and email spam filtering.
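For a binary task such as payment-default prediction, the loss is chosen at training time. A tentative sketch, assuming a pandas DataFrame named training_frame with a 0/1 label column named default, and assuming the loss helpers (log_loss, hinge_loss, smoothed_hinge_loss) that the microsoftml package exports:

```python
from microsoftml import rx_fast_linear, smoothed_hinge_loss

# Illustrative only: the DataFrame and column names are placeholders.
model = rx_fast_linear(
    formula="default ~ balance + income + utilization",
    data=training_frame,
    method="binary",
    loss_function=smoothed_hinge_loss(),  # alternatives: log_loss(), hinge_loss()
)
```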

 

This form of regression uses statistical measures, is highly flexible, accepts many kinds of input, and supports different analytical tasks. It dampens the effect of extreme values and evaluates several factors that affect a pair of outcomes. Regression is very useful for calculating a linear relationship between a dependent and an independent variable and then using that relationship for prediction. Errors show up as elongated scatter plots within specific categories, and even when the errors carry different details within the same category, they can be plotted and correlated. This makes the technique suitable for analyzing specific error categories from an account.

 

Default detection rates can be boosted, and false positives reduced, using real-time behavioral profiling as well as historical profiling. Big data, commodity hardware, and historical data going back as far as three years help with accuracy. This enables payment default detection to happen almost as early as when the default is committed. True real-time processing implies stringent response times.

 

The algorithm for least-squares regression with boosting can be written as follows (a code sketch of these steps appears after the list):

 

1. Set the initial approximation.

2. For a set of successive increments or boosts, each based on the preceding iterations, do:

3. Calculate the new residuals.

4. Find the line of search by aggregating and minimizing the residuals.

5. Perform the boost along the line of search.

6. Repeat steps 3, 4, and 5 for each boost in step 2.
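A minimal NumPy sketch of these steps, assuming a single numeric feature and regression stumps as the base learner (both are illustrative choices and not part of rxFastLinear itself):

```python
import numpy as np

def fit_stump(x, residuals):
    """Step 4: find the split that minimizes the squared residuals."""
    best = None
    for threshold in np.unique(x)[:-1]:
        left, right = residuals[x <= threshold], residuals[x > threshold]
        pred = np.where(x <= threshold, left.mean(), right.mean())
        sse = np.sum((residuals - pred) ** 2)
        if best is None or sse < best[0]:
            best = (sse, threshold, left.mean(), right.mean())
    _, threshold, left_value, right_value = best
    return lambda z: np.where(z <= threshold, left_value, right_value)

def boosted_least_squares(x, y, n_boosts=100, learning_rate=0.1):
    f0 = y.mean()                                   # step 1: initial approximation
    stumps = []
    prediction = np.full_like(y, f0, dtype=float)
    for _ in range(n_boosts):                       # step 2: successive boosts
        residuals = y - prediction                  # step 3: new residuals
        stump = fit_stump(x, residuals)             # step 4: line of search
        prediction += learning_rate * stump(x)      # step 5: boost along it
        stumps.append(stump)                        # step 6: repeat
    return lambda z: f0 + learning_rate * sum(s(z) for s in stumps)

# Example usage on a noisy quadratic
rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, size=200)
y = x ** 2 + rng.normal(scale=0.1, size=200)
model = boosted_least_squares(x, y)
print(model(np.array([0.0, 1.5])))  # predictions should roughly follow z ** 2
```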

 

Conjugate gradient descent, given an input matrix A, a vector b, a starting value x, a maximum number of iterations i-max, and an error tolerance epsilon < 1, can be described in this way:

 

set i to 0
set residual to b - Ax
set search-direction to residual
set delta-new to the dot-product of residual-transposed and residual
initialize delta-0 to delta-new
while i < i-max and delta-new > epsilon^2 * delta-0 do:
    q = dot-product(A, search-direction)
    alpha = delta-new / dot-product(search-direction-transposed, q)
    x = x + alpha * search-direction
    if i is divisible by 50
        residual = b - Ax
    else
        residual = residual - alpha * q
    delta-old = delta-new
    delta-new = dot-product(residual-transposed, residual)
    beta = delta-new / delta-old
    search-direction = residual + beta * search-direction
    i = i + 1
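A direct translation of this pseudocode into NumPy might look as follows; the variable names mirror the steps above, and the small system at the end is only for illustration:

```python
import numpy as np

def conjugate_gradient(A, b, x, i_max=1000, epsilon=1e-8):
    """Solve Ax = b for a symmetric positive-definite A, following the steps above."""
    i = 0
    residual = b - A @ x
    search_direction = residual.copy()
    delta_new = residual @ residual
    delta_0 = delta_new
    while i < i_max and delta_new > epsilon ** 2 * delta_0:
        q = A @ search_direction
        alpha = delta_new / (search_direction @ q)
        x = x + alpha * search_direction
        if i % 50 == 0:
            # periodically recompute the residual from scratch to limit round-off drift
            residual = b - A @ x
        else:
            residual = residual - alpha * q
        delta_old = delta_new
        delta_new = residual @ residual
        beta = delta_new / delta_old
        search_direction = residual + beta * search_direction
        i += 1
    return x

# Example: a small symmetric positive-definite system
A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
x = conjugate_gradient(A, b, x=np.zeros(2))
print(x, A @ x)  # A @ x should be close to b
```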

 

Sample application:  

#! /bin/python  
import numpy

import pandas

from microsoftml import rx_fast_linear, rx_predict

from revoscalepy.etl.RxDataStep import rx_data_step

from microsoftml.datasets.datasets import get_dataset

 

attitude = get_dataset("attitude")

 

import sklearn

if sklearn.__version__ < "0.18":

    from sklearn.cross_validation import train_test_split

else:

    from sklearn.model_selection import train_test_split

 

attitudedf = attitude.as_df()

data_train, data_test = train_test_split(attitudedf)

 

model = rx_fast_linear(

    formula="rating ~ complaints + privileges + learning + raises + critical + advance",

    method="regression",

    data=data_train)

   

# RuntimeError: The type (RxTextData) for file is not supported.

score_ds = rx_predict(model, data=data_test,

                     extra_vars_to_write=["rating"])

                    

# Print the first five rows

print(rx_data_step(score_ds, number_rows_read=5))

 

Sunday, November 5, 2023

 

Azure Databricks aka dbx and Public Egress IP:

The secure way to deploy an Azure Databricks workspace is with no public IPs (NPIP) and VNet injection enabled. This is also the most flexible way of deploying. When Azure resources receive traffic from Databricks clusters, they must allow that traffic by IP address or CIDR range, and it becomes difficult to find the egress IP of the clusters. This article explains how to configure the egress IP and the IP access restriction rules on traffic originating from the Databricks workspace and destined for Azure resources with access restrictions.

Egress for a deployment without secure cluster connectivity (SCC, aka NPIP) and VNet injection is different from that with SCC and VNet injection. Without SCC/NPIP, there is a control-plane NAT IP; with SCC, there is an SCC relay IP. The relay refers to the tunneling of traffic through a relay in the control plane for deployments with no public IPs and with public and private subnets. With SCC enabled, the cluster initiates a connection to the SCC relay during cluster creation over port 443 and uses a different application than the one used for the web application and REST API. Cluster administration tasks reach the cluster through this tunnel.

With SCC enabled, both workspace subnets are effectively private subnets, since cluster nodes do not have public IP addresses. The network egress will vary depending on whether the Databricks workspace is deployed in the default managed VNet or in your own virtual network (VNet injection). In the managed network, a default NAT gateway is automatically created within the resource group associated with the Databricks workspace. If we use secure cluster connectivity with VNet injection, then we must ensure that the workspace has a stable egress public IP using one of the following options:

-          Choose an egress load balancer (outbound load balancer) by providing loadBalancerName, loadBalancerBackendPoolName, loadBalancerFrontendConfigName, and loadBalancerPublicIPName in the workspace parameters. The load balancer configuration is not customizable and is tightly controlled by the Databricks workspace.

-          Choose a NAT gateway and configure it on both of the workspace’s subnets. Clusters then have a stable egress public IP, and this can be done via the portal or IaC.

-          Choose an egress firewall if complex routing is involved. User-defined routes (UDRs) ensure that network traffic is routed correctly for the workspace, either directly to the required endpoints or through an egress firewall. The allowed firewall rules then need to be specified.

So the simplest approach for VNet-injection cases is to use a NAT gateway and add the public IP address of the NAT gateway to the access restriction rules of the target Azure resources that must be accessed from jobs and notebooks within the workspace.

References:

-          https://learn.microsoft.com/en-us/azure/databricks/security/network/secure-cluster-connectivity

-          https://learn.microsoft.com/en-us/azure/databricks/security/network/secure-cluster-connectivity#egress-with-vnet-injection

-          https://learn.microsoft.com/en-us/azure/databricks/administration-guide/cloud-configurations/azure/udr