Friday, June 28, 2024

 This is a continuation of IaC shortcomings and resolutions. In Azure, a storage account without private endpoints can be accessed by compute resources that do not have public IP addresses through the use of Azure's internal networking capabilities. Here's how it works:

1. Virtual Network (VNet): Both the storage account and the compute resources reside within an Azure VNet, which is a private network within Azure.

2. Service Endpoints: While private endpoints are not used, we can enable service endpoints for Azure Storage within the VNet. This allows us to secure our storage account so that it can only be accessed from specific subnets within the VNet.

3. Network Security Groups (NSGs): NSGs are used to control inbound and outbound traffic to network interfaces (NICs), VMs, and subnets. We can configure NSGs to allow traffic between the compute resources and the storage account within the VNet.

4. Azure Bastion: For secure, remote access to the compute resources from outside the VNet, we can use Azure Bastion, which provides RDP and SSH connectivity via the Azure portal without the need for public IP addresses.

5. VPN Gateway or ExpressRoute: To connect to the Azure VNet from on-premises networks securely, we can use a VPN Gateway or ExpressRoute with private peering. This allows on-premises compute resources to access the Azure storage account as if they were part of the same local network.

6. DNS Configuration: Proper DNS configuration is necessary so that compute resources within the Azure VNet can resolve the storage account's name. Azure provides built-in DNS services for name resolution within VNets. Note that with service endpoints, unlike private endpoints, the storage account name still resolves to its public endpoint; traffic from an enabled subnet travels over the Azure backbone and is admitted by the storage account's firewall based on the source subnet. A compute resource in a different virtual network needs the service endpoint enabled on its own subnet and a matching virtual network rule on the storage account.

7. Outbound Connectivity: If the compute resources need to access the internet, we can configure outbound connectivity using Azure NAT Gateway or Load Balancer outbound rules, even if the compute resources don't have public IP addresses.

By configuring the VNet, NSGs, and DNS settings correctly, and using service endpoints, we can ensure that compute resources without public IP addresses can securely access an Azure storage account without private endpoints. This setup maintains the security and isolation of our resources within Azure while allowing necessary communication between them.
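The subnet scoping that service endpoints and storage firewall rules enforce can be illustrated with a small sketch. The real enforcement happens in Azure's network fabric, not in application code, and the subnet ranges below are hypothetical; the point is simply that a source address is admitted only if it falls inside one of the allowed subnets:

```python
import ipaddress

# Hypothetical allowed subnets, mirroring storage-account network rules
# that admit traffic only from specific VNet subnets.
ALLOWED_SUBNETS = [
    ipaddress.ip_network("10.0.1.0/24"),  # compute subnet
    ipaddress.ip_network("10.0.2.0/24"),  # app subnet
]

def is_admitted(source_ip: str) -> bool:
    """Return True if the source address falls within an allowed subnet."""
    addr = ipaddress.ip_address(source_ip)
    return any(addr in subnet for subnet in ALLOWED_SUBNETS)

print(is_admitted("10.0.1.17"))    # inside the compute subnet
print(is_admitted("192.168.0.5"))  # outside every allowed subnet
```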



Thursday, June 27, 2024

 Even when a vector database is a straightforward choice for specific use cases involving drone data, the choice of vector database matters. For example, storing vector embeddings and running vector similarity search are two different use cases. An embedding model is a neural network that transforms raw data into a vector embedding, a vector of numbers that represents the original data. Querying the vector database requires similarity search between the query vector and the vectors in the database, with the most relevant vectors returned as results. The scope of the search can be limited to a subset of the stored vectors with the help of metadata filtering. The difference between the two use cases, then, is that the first is geared toward storing and retrieving large numbers of high-dimensional numerical vectors, while the latter optimizes for selectivity and heavy computation over a subset of the data. Metadata might include dates, times, genres, categories, names, types, and descriptions, or, depending on the use case, something custom such as tags and labels. Frameworks like LangChain and LlamaIndex offer capabilities to automatically tag incoming queries with metadata. Cloud vector search services like Azure Cognitive Search can automatically index vector data from two primary sources: Azure Blob indexers and Azure Cosmos DB for NoSQL indexers. Azure Cognitive Search also includes scoring algorithms for vector search, primarily of two types: exhaustiveKnn, which calculates the distance between the query vector and every data point, and Hierarchical Navigable Small World (HNSW), which organizes high-dimensional data points into a hierarchical graph structure. Amazon also offers a broad set of cloud resources for varying purposes, though they are not as tightly integrated into a single platform as Vertex AI, Databricks, or Snowflake. A large number of Databricks users in organizations also use Snowflake.
Vector databases also come in several forms: pure-play vector databases such as Pinecone, full-text search databases like Elasticsearch, vector libraries like Faiss, Annoy, and Hnswlib, vector-capable NoSQL databases such as MongoDB, Cosmos DB, and Cassandra, and vector-capable SQL databases like SingleStoreDB and PostgreSQL. Rockset is a leader in this quadrant.
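The exhaustive-kNN scoring and metadata filtering described above can be sketched in a few lines of plain Python. The toy vectors and metadata below are hypothetical; a real system would use an indexed store such as Azure Cognitive Search or Pinecone:

```python
import math

# Toy corpus: each record has a vector and metadata (hypothetical values).
records = [
    {"id": "a", "vector": [1.0, 0.0], "category": "telemetry"},
    {"id": "b", "vector": [0.9, 0.1], "category": "imagery"},
    {"id": "c", "vector": [0.0, 1.0], "category": "telemetry"},
]

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(y * y for y in v))
    return dot / (nu * nv)

def exhaustive_knn(query, k=2, category=None):
    # Metadata filtering first reduces the search space...
    candidates = [r for r in records if category is None or r["category"] == category]
    # ...then exhaustive kNN scores every remaining vector against the query.
    scored = sorted(candidates, key=lambda r: cosine(query, r["vector"]), reverse=True)
    return [r["id"] for r in scored[:k]]

print(exhaustive_knn([1.0, 0.1]))                        # unfiltered search
print(exhaustive_knn([1.0, 0.1], category="telemetry"))  # metadata-filtered search
```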

When functional requirements are met, choices are often prioritized by efficient storage, high-performance storing and retrieving, and the variety of metrics that can be used to perform similarity searches. Pure vector databases provide efficient similarity search through indexing techniques, scalability for large datasets and high query workloads, support for high-dimensional data, HTTP- and JSON-based APIs, and native support for vector operations including dot products. Their main drawback is usually that indexing is time consuming, especially given that there may be several indexing parameters, and incorrect values can introduce inefficiencies. Full-text search databases work great for text and pair well with indexing libraries like Apache Lucene and with vector libraries. If we want off-the-shelf vector computations such as fast nearest-neighbor search, recommendation systems, image search, and NLP, vector libraries are useful, and more are being added to open source continually. Their main drawback is that we must bring our own infrastructure.


Wednesday, June 26, 2024

 Gen AI created a new set of applications that require a different data architecture than traditional systems. Traditional databases cannot keep pace with innovation in this space, and existing applications are now being enhanced with AI. The most important need of the hour is a data architecture that lets people build applications quickly and efficiently at scale. Even the data structures expected to store records are changing: a search analytics database stores and indexes vector embeddings so that value can be extracted from both structured and unstructured data. Observability is also needed along with the databases. The data stores that power these AI-era applications must satisfy multiple requirements. There are structure and data governance requirements surrounding the storing and use of this data, especially with renewed emphasis on building trust by leveraging privacy and data protection capabilities. There is also a need to unify data, whether it arrives as event streams bringing real-time data into the system, or as transactions or change data capture. Performance considerations have also changed, with older benchmarks looking obsolete in the face of what is required to train models. From 2015 to today, the main emphasis in data architectures has been the separation of compute from storage at cloud scale, which is evident in the way models are trained, tested, and released today, as well as in the success of products like Snowflake and Databricks. This is going to change for AI application data architectures whose primary use case is Uber-like applications, because the data is real-time, concerns people, places, and things, and unifies many different data sources.
There are two important sides. One involves training or tuning with proprietary data sets, with infrastructure that allows us to aggregate all this data and build efficient models; the other is the inference side, where we take these models and extract embeddings, with a serving tier on which to build AI-enhanced applications. Both need to support very fast iterative cycles. One more aspect of building both subsystems is the enablement of real-time data collection and analysis. Temporal and spatial capabilities also matter in this data architecture. Vectors are important for identifying context, but a new kind of data set needs to be behavioral, which comes from metadata filtering where the search space is reduced. Applications that empower drones include Retrieval Augmented Generation, pattern matching, anomaly detection, and recommendation systems, just like many other AI applications. Contextual, behavioral, accurate, and personalized data and search characterize this architecture.

Reference: DroneData.docx

Tuesday, June 25, 2024

 Challenges in storing drone data.

Unlike traditional data architectures, with an online transaction processing system whose atomicity, consistency, isolation, and durability guarantees enable proper inventory registration and calculations, and an online analytical processing system with its reporting, temporal aggregation, and analysis capabilities, the lines between transactions and analysis blur for near real-time analysis of drone data, as both the inventory and the associated processing must continuously adapt. It can be compared to the data and event pipelines built for large-scale commercial applications such as Airbnb, with the potential to become real-time processing by eliminating the costs of network latency and storage access.

Flights for drones remain undisturbed, whether stationary or linear, until the next update from the controller or a local flight-path change determination. Clearly, the capabilities of the drone units might vary from fleet to fleet. The individual drone, its degrees of freedom, its motion capabilities, and the variety of non-flight actions the unit can take must be differentiated so they can be used selectively. When the entire fleet moves as one, like a single unit, there are far fewer updates to the stored drone data than otherwise. With updates varying from few to large scale, rare to frequent, the data and the events generated must be handled at any volume and rate. At all processing modules, virtualizations that cover the variety of drone models and types mandate consistency in the data types used to represent them, so that the interface remains clean with just the right levers and validations for the associated concerns. A unified API is not just an evolutionary step for drone data management but a necessity from the start. It might be customary to build the data pipeline on Spark and Scala, with bookkeeping and double entries for tallying so that the accounting information can be segregated. Drone data and events have a lot in common with edge computing and streaming data pipelines and stores. The need for these approaches must be balanced against hard performance goals for regular routines.
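A unified interface of the kind described above might be sketched as follows. All type and field names here are hypothetical, chosen only to illustrate consistent data types and capability validation across drone models:

```python
from dataclasses import dataclass, field
from enum import Enum, auto

class Capability(Enum):
    """Non-flight actions a unit may support, differentiated per model."""
    CAMERA = auto()
    LIDAR = auto()
    PAYLOAD_DROP = auto()

@dataclass
class DroneState:
    """One consistent representation for every drone model and type."""
    unit_id: str
    position: tuple       # (latitude, longitude, altitude) -- hypothetical units
    degrees_of_freedom: int
    capabilities: frozenset = field(default_factory=frozenset)

    def can(self, capability: Capability) -> bool:
        """Validation lever: callers select actions only when supported."""
        return capability in self.capabilities

# Two different models expose the same clean interface.
scout = DroneState("scout-1", (47.6, -122.3, 120.0), 6,
                   frozenset({Capability.CAMERA}))
hauler = DroneState("hauler-1", (47.6, -122.3, 80.0), 4,
                    frozenset({Capability.PAYLOAD_DROP}))

print(scout.can(Capability.CAMERA))   # True
print(hauler.can(Capability.CAMERA))  # False
```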

Cloud databases such as Azure Cosmos DB provide general-purpose and easily programmable starter stores that can scale while keeping response times low. They remain starter stores because little forethought is given to the specialized routines needed for data access and storage of drone data and metadata. That said, rapidly changing location information for drones can be continuously updated and maintained in these cloud databases. There is convenience in adding various dimensions to the same store, as warehouses have taught, and such a schema works well for drones too. But as noted earlier, specializations in components may mandate hybrid data architectures that do not all fit nicely within a single product, even if that product is a good starting point.

In conclusion, drone data architectures must be articulated with the full palette of storage options, microservices, shared architectures, events, pipelines, and automation, and may become just as big as Master Data Management systems are today. The good news is that this needs to be done only once for multi-tenant systems.


Monday, June 24, 2024

 A vector database and search for positioning drones involves the following steps:

The first step is to install the required packages and import the libraries. We use Python in this sample:

import warnings 

warnings.filterwarnings('ignore') 

from datasets import load_dataset 

from pinecone import Pinecone, ServerlessSpec 

from DLAIUtils import Utils 

from sentence_transformers import SentenceTransformer 

import os 

import time 

import torch 

from tqdm.auto import tqdm 

model = SentenceTransformer('all-MiniLM-L6-v2')  # maps text to a 384-dimensional dense vector space 

We assume the elements are mapped as embeddings in a 384-dimensional dense vector space. 

A sample query would appear like this: 

query = 'what is node nearest this element?' 

xq = model.encode(query) 

xq.shape 

(384,) 

The next step is to set up the Pinecone vector database and upsert embeddings into it. The database indexes the vectors, making search and retrieval easy by comparing values and finding those most like one another. 

utils = Utils() 

PINECONE_API_KEY = utils.get_pinecone_api_key() 

pinecone = Pinecone(api_key=PINECONE_API_KEY) 

INDEX_NAME = 'drone-elements'  # any valid index name; this one is illustrative 

if INDEX_NAME in [index.name for index in pinecone.list_indexes()]: 

       pinecone.delete_index(INDEX_NAME) 

print(INDEX_NAME) 

pinecone.create_index(name=INDEX_NAME, dimension=model.get_sentence_embedding_dimension(), metric='cosine', spec=ServerlessSpec(cloud='aws', region='us-west-2')) 

index = pinecone.Index(INDEX_NAME) 

print(index) 

Then, the next step is to create embeddings for all the elements in the sample space and upsert them to Pinecone. 

batch_size=200 

vector_limit=10000 

elements = elements[:vector_limit]  # truncate the element list loaded earlier from the dataset 

import json 

for i in tqdm(range(0, len(elements), batch_size)): 

        i_end = min(i+batch_size, len(elements)) 

        ids = [str(x) for x in range(i, i_end)] 

        metadata = [{'text': text} for text in elements[i:i_end]] 

        xc = model.encode(elements[i:i_end]) 

        records = zip(ids, xc, metadata) 

        index.upsert(vectors=records) 

index.describe_index_stats() 

Then the query can be run on the embeddings and the top matches can be returned. 

def run_query(query): 

        embedding = model.encode(query).tolist() 

        results = index.query(top_k=10, vector=embedding, include_metadata=True, include_values=False) 

        for result in results['matches']: 

                print(f"{round(result['score'], 2)}: {result['metadata']['text']}") 

run_query("what is node nearest this element?") 

With this, the embeddings-based search over elements is ready. In Azure, Cosmos DB offers a similar semantic search and can serve as a similar vector database. 

The following code outlines the steps using Azure AI Search 

# configure the vector store settings; the vector name is the name of the index in the search service

from azure.core.credentials import AzureKeyCredential

from azure.search.documents import SearchClient

from langchain_openai import AzureOpenAIEmbeddings

from langchain_community.vectorstores.azuresearch import AzureSearch

endpoint: str = "<AzureSearchEndpoint>"

key: str = "<AzureSearchKey>"

index_name: str = "<VectorName>"

credential = AzureKeyCredential(key)

client = SearchClient(endpoint=endpoint,

                      index_name=index_name,

                      credential=credential)


# create embeddings 

embeddings: AzureOpenAIEmbeddings = AzureOpenAIEmbeddings(

    azure_deployment=azure_deployment,

    openai_api_version=azure_openai_api_version,

    azure_endpoint=azure_endpoint,

    api_key=azure_openai_api_key,

)

# create vector store

vector_store = AzureSearch(

    azure_search_endpoint=endpoint,

    azure_search_key=key,

    index_name=index_name,

    embedding_function=embeddings.embed_query,

)

# create a query

docs = vector_store.similarity_search(

    query=userQuery,

    k=3,

    search_type="similarity",

)

# optionally persist the matched documents to a separate store
# (assumes `collections` is a previously created collection handle, e.g., from pymongo)
collections.insert_many(docs)

reference: https://github.com/ravibeta/Node-Element-Predictions


Sunday, June 23, 2024

 Some of the fleet management data science algorithms can be captured via a comparison table of well-known data mining algorithms, as follows:

Data Mining Algorithm / Description / Use Case

Classification algorithms: These are useful for finding similar groups based on discrete variables.

They are used for true/false binary classification, and multiple-label classification is also supported. There are many techniques, but the data should either have distinct regions on a scatter plot with their own centroids or, if that is hard to tell, a breadth-first scan for neighbors within a given radius can form trees, or leaves if they fall short.

Useful for categorization of fleet path changes beyond the nomenclature. The primary use case is to see clusters of service requests that match based on features. By translating to a vector space and assessing the quality of a cluster with a sum of squared errors, it is easy to analyze a large number of changes as belonging to specific clusters from a management perspective.

Regression algorithms: These are very useful for calculating a linear relationship between a dependent and an independent variable and then using that relationship for prediction. Fleet path changes demonstrate elongated scatter plots in specific categories. Even when path changes demand different formations in the same category, the reorientation times are bounded and can be plotted along the timeline. One of the best advantages of linear regression is prediction with time as the independent variable. When data points have many factors contributing to their occurrence, a linear regression gives an immediate ability to predict where the next occurrence may happen. This is far easier than coming up with a model that fits all the data points well.
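As a minimal sketch of the regression use case, an ordinary least-squares fit with time as the independent variable looks like this (the reorientation-time samples are hypothetical):

```python
# Hypothetical samples: (time step, reorientation seconds) for one category.
points = [(1, 2.1), (2, 2.9), (3, 4.2), (4, 4.8), (5, 6.1)]

def least_squares(points):
    """Fit y = a*x + b by ordinary least squares over (x, y) pairs."""
    n = len(points)
    sx = sum(x for x, _ in points)
    sy = sum(y for _, y in points)
    sxx = sum(x * x for x, _ in points)
    sxy = sum(x * y for x, y in points)
    a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    b = (sy - a * sx) / n
    return a, b

a, b = least_squares(points)
prediction = a * 6 + b  # predict where the next occurrence may land, at time step 6
print(round(a, 2), round(b, 2), round(prediction, 2))
```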

Segmentation algorithms: A segmentation algorithm divides data into groups, or clusters, of items that have similar properties. Segmenting path-change stimuli based on the fleet path-change feature set is a very common application of this algorithm. It helps prioritize the response to certain stimuli.

Association algorithms: These are used for finding correlations between different attributes in a data set. Association data mining allows users to see helpful messages such as "stimuli that caused a path change for this fleet type also caused a path change for this other fleet formation."

Sequence analysis algorithms: These are used for finding groups via paths in sequences. A sequence clustering algorithm is like the clustering algorithms mentioned above, but instead of finding groups based on similar attributes, it finds groups based on similar paths in a sequence. A sequence is a series of events; for example, a series of web clicks by a user is a sequence. It can also be compared to the IDs of any sortable data maintained in a separate table. Usually, there is support for a sequence column. The sequence data has a nested table that contains a sequence ID, which can be any sortable data type. This is very useful for finding sequences of fleet path changes opened across customers. Generally, a transit failure could result in a cascading failure across the transport network. This sort of data-driven sequence determination helps find new sequences and target them actively, even suggesting them to the stimuli that cause path changes to the fleet formations, so that they can be better prepared for failures across relays.


Sequence Analysis also helps with interactive formation changes as described here.


Outliers mining algorithm: Outliers are the rows that are most dissimilar. Given a relation R(A1, A2, ..., An) and a similarity function between rows of R, find the rows in R that are dissimilar to most points in R. The objective is to maximize the dissimilarity function, with a constraint on the number of outliers, or on significant outliers if given.

The choices for similarity measures between rows include distance functions such as Euclidean, Manhattan, string-edit, and graph distance, as well as L2 metrics. The choices for aggregate dissimilarity measures are the distance to the K nearest neighbors, the density of the neighborhood outside the expected range, and the attribute differences with nearby neighbors. The steps to determine outliers can be listed as: 1. cluster the regular rows, for example via K-means, 2. compute the distance of each tuple in R to the nearest cluster center, and 3. choose the top-K rows, or those with scores outside the expected range. Finding outliers manually is sometimes impossible because the number of path changes can be quite high. Outliers are important for discovering new strategies to encompass them. If there are numerous outliers, they will significantly increase costs; if not, the patterns help identify efficiencies.
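The three steps above can be sketched directly. The 2-D feature vectors below are hypothetical, and the tiny K-means is deliberately naive; a real pipeline would use a library such as scikit-learn:

```python
import math

# Hypothetical 2-D feature vectors for path changes; the last one is unusual.
rows = [(1.0, 1.1), (0.9, 1.0), (1.1, 0.9), (5.0, 5.2), (5.1, 4.9), (20.0, 0.0)]

def dist(p, q):
    return math.dist(p, q)

def kmeans(rows, k=2, iters=10):
    """Step 1: cluster the regular rows with a tiny K-means (naive init)."""
    centers = list(rows[:k])
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for r in rows:
            clusters[min(range(k), key=lambda i: dist(r, centers[i]))].append(r)
        centers = [tuple(sum(c) / len(c) for c in zip(*cl)) if cl else centers[i]
                   for i, cl in enumerate(clusters)]
    return centers

centers = kmeans(rows)
# Step 2: distance of each row to its nearest cluster center.
scores = {r: min(dist(r, c) for c in centers) for r in rows}
# Step 3: choose the top-K rows by score as outliers (here K = 1).
outliers = sorted(rows, key=lambda r: scores[r], reverse=True)[:1]
print(outliers)
```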

Decision tree: This is probably one of the most heavily used and easy-to-visualize mining algorithms. A decision tree can serve as both a classification and a regression tree. A function divides the rows into two datasets based on the value of a specific column: one returned set matches the criteria for the split while the other does not. When the attribute to be chosen is clear, this works well. A decision tree algorithm uses the attributes of the external stimuli to make a prediction, such as the reorientation time on the next path change. The ease of visualizing the split at each level helps throw light on the importance of those attributes. This information becomes useful for pruning and for drawing the tree.
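The split function described above can be written as a short helper. The row layout and the split-on-greater-or-equal convention for numeric values are assumptions for illustration:

```python
def divide_rows(rows, column, value):
    """Split rows into (matching, non-matching) sets on one column's value.

    One common convention: numeric values split on >=, others on equality.
    """
    if isinstance(value, (int, float)):
        matches = lambda row: row[column] >= value
    else:
        matches = lambda row: row[column] == value
    set1 = [row for row in rows if matches(row)]
    set2 = [row for row in rows if not matches(row)]
    return set1, set2

# Hypothetical rows: (stimulus category, fleet size, reorientation seconds)
rows = [("weather", 4, 3.5), ("obstacle", 12, 1.2), ("weather", 9, 2.8)]
large, small = divide_rows(rows, 1, 9)  # split on fleet size >= 9
print(large)
print(small)
```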

Logistic regression: This is a form of regression that supports binary outcomes. It uses statistical measures, is highly flexible, takes any kind of input, and supports different analytical tasks. This regression folds in the effects of extreme values and evaluates several factors that affect a pair of outcomes. Path changes based on stimulus category can be used to predict the likelihood of a path change from a given category of stimuli. It can also be used for finding repetitions in requests.

Neural network: This is a widely used method for machine learning involving neurons that have one or more gates for input and output. Each neuron assigns a weight, usually based on probability, to each feature, and the weights are normalized, resulting in a weight matrix that articulates the underlying model in the training dataset. The model can then be used with a test data set to predict outcome probabilities. Neurons are organized in layers; each layer is independent of the others, and layers can be stacked so that the output of one becomes the input to the next. This is widely used for the SoftMax classifier in NLP associated with fleet path changes. Since descriptions of stimuli, fleet formation changes, path adjustments, and the adjustment time to a modified path and formation, captured spatially and temporally, conform to narratives with metric-like quantizations, natural language processing could become a significant part of the data mining and ML portfolio.

Naïve Bayes algorithm: This is probably the most straightforward statistical, probability-based data mining algorithm compared to the others.

The probability is a mere fraction of interesting cases over total cases. Bayes probability is conditional probability, which adjusts the probability based on the premise. This is widely used for cases where conditions apply, especially binary conditions such as with or without. If the input variables are independent, their states can be calculated as probabilities, and if there is at least one predictable output, this algorithm can be applied. The simplicity of computing states by counting per class for each input variable, and then displaying those states against those variables for a given value, makes this algorithm easy to visualize, debug, and use as a predictor.
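The counting-by-class computation can be sketched as follows, with hypothetical labeled observations of stimuli and outcomes, and Laplace smoothing added so unseen combinations do not zero out a class:

```python
from collections import Counter, defaultdict

# Hypothetical observations: (stimulus category, path change occurred?)
observations = [("weather", True), ("weather", True), ("obstacle", False),
                ("weather", False), ("obstacle", True), ("obstacle", False)]

def naive_bayes_predict(observations, stimulus):
    """Pick the outcome maximizing P(outcome) * P(stimulus | outcome) by counting."""
    class_counts = Counter(outcome for _, outcome in observations)
    cond_counts = defaultdict(Counter)
    for stim, outcome in observations:
        cond_counts[outcome][stim] += 1
    total = len(observations)
    scores = {}
    for outcome, count in class_counts.items():
        prior = count / total
        # Laplace smoothing with two stimulus categories assumed.
        likelihood = (cond_counts[outcome][stimulus] + 1) / (count + 2)
        scores[outcome] = prior * likelihood
    return max(scores, key=scores.get)

print(naive_bayes_predict(observations, "weather"))
print(naive_bayes_predict(observations, "obstacle"))
```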

Plugin algorithms: Several algorithms get customized to the domain they are applied to, resulting in unconventional or new algorithms. For example, a hybrid approach to association clustering can help determine relevant associations when the matrix is quite large and has a long tail of irrelevant associations from the Cartesian product. In such cases, clustering can be done prior to association, to determine the key items before the market-basket analysis. Fleet path changes are notoriously susceptible to variation even when pertaining to the same category. These path changes do not have pre-populated properties from a template, and spatial and temporal changes can vary drastically along one or both dimensions. Using a hybrid approach, it is possible to preprocess these path changes with clustering before analyzing them, such as with association clustering.

Simultaneous classifiers and regions-of-interest regressors: Neural net algorithms typically involve a classifier for use with the tensors or vectors, but regions-of-interest regressors provide bounding-box localizations. This form of layering allows incremental semantic improvements to the underlying raw data. Fleet path changes are time-series data, and as more are applied, specific time ranges become as important as the semantic classification of the origin of path changes and their descriptions. Using this technique, underlying issues can be discovered and tied to internal or external factors. Determining the root cause behind a handful of path changes is valuable information.


Collaborative filtering: Recommendations include suggestions for a knowledge base or help in finding model service requests. To make a recommendation, first a group sharing similar taste is found, and then the preferences of the group are used to make a ranked list of suggestions. This technique is called collaborative filtering. A common data structure that helps keep track of people and their preferences is a nested dictionary. This dictionary could use a quantitative ranking, say on a scale of 1 to 5, to denote the preferences of the people in the selected group. To find similar people to form a group, some form of similarity score is used. One way to calculate this score is to plot the items that the people have ranked in common and use them as axes in a chart; people who are close together on the chart can then form a group. Several approaches mentioned earlier provide a perspective for solving a problem. This one differs in that opinions from multiple participants or sensors in a stimulus-creation or recognition agent are taken to determine the best set of fleet formation or path changes to recommend.
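The nested-dictionary preference structure and a distance-based similarity score can be sketched as follows. The sensor names and 1-to-5 path-change rankings are hypothetical:

```python
import math

# Hypothetical 1-to-5 rankings of candidate path changes by sensor agents.
prefs = {
    "sensor-a": {"route-1": 4, "route-2": 1, "route-3": 5},
    "sensor-b": {"route-1": 4, "route-2": 2, "route-3": 5},
    "sensor-c": {"route-1": 1, "route-2": 5},
}

def similarity(prefs, p1, p2):
    """Similarity over commonly ranked items: 1 / (1 + Euclidean distance)."""
    shared = [item for item in prefs[p1] if item in prefs[p2]]
    if not shared:
        return 0.0
    sq = sum((prefs[p1][i] - prefs[p2][i]) ** 2 for i in shared)
    return 1 / (1 + math.sqrt(sq))

print(similarity(prefs, "sensor-a", "sensor-b"))  # close in taste
print(similarity(prefs, "sensor-a", "sensor-c"))  # far apart
```

Agents with high mutual similarity form the group whose pooled rankings drive the recommendation.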

Collaborative filtering via item-based filtering: This filtering is like the previous one, except that that was a user-based approach and this is item-based. It is significantly faster than the user-based approach but requires storage for an item-similarity table. In certain filtering cases, divulging which stimuli or sensors go with which formation or path change is helpful to the fleet manager or participants. At other times, it is preferable to use item (flight-path) based similarity. Similarity scores are computed in both cases. All other considerations being the same, the item-based approach is better for sparse datasets; both the stimulus-based and item-based approaches perform similarly for dense datasets.

Hierarchical clustering: Although classification algorithms vary quite a lot, the hierarchical algorithm stands out and is called out separately in this category. It creates a dendrogram, where the nodes are arranged in a hierarchy. Specific domain-based ontology in the form of a dendrogram can be quite helpful to mining algorithms.

NLP algorithms: Popular NLP models like BERT can be used for text mining. NLP models are very useful for processing flight-path commentary and associated artifacts in fleet flight management.

Algorithm Implementations:

https://jsfiddle.net/g2snw4da/

https://jsfiddle.net/jmd62ap3/

https://jsfiddle.net/hqs4kxrf/ 

https://jsfiddle.net/xdqyt89a/

#codingexercise https://1drv.ms/w/s!Ashlm-Nw-wnWhPAI9qa_UY0gWf8ZPA?e=PRMYxU


Saturday, June 22, 2024

 

 This is a summary of the book titled “Glad We Met – The Art and Science of 1:1 Meetings”, written by Steven G. Rogelberg and published by Oxford UP in 2024. The workplace involves plenty of 1:1 meetings, and almost half of them do not achieve the desired results. Drawing on extensive research, the author provides a framework for setting up, conducting, and following through on one-on-one meetings. Since career advancement depends on the manager's evaluation of his or her reports' performance, the author encourages managers to ask the right questions, foster engagement, and illuminate each person's progress. It works both ways: the manager educates the reports while furthering their own leadership journey.

These one-on-one meetings benefit from a framework, argues the author, and those between a manager and direct reports should come with an agenda. Weekly sessions are helpful to managers, and the meeting locations and the questions to ask must be planned. Staying positive, sharing mutual priorities, covering new material, asking for feedback, and saying thank you are all part of it. Regularly conducting these sessions gives more practice to both parties.

One-on-one meetings are crucial for team members and organizations, as they address priorities, goals, problems, productivity, and employee development. With about a billion business meetings daily, of which 20% to 50% are one-on-ones, these sessions could cost $1.25 billion daily. However, participants often report suboptimal results. Managers can improve their one-on-one meetings to gain a better return on time and money. These meetings strengthen ties within teams and organizations, supplement performance appraisals, and fuel communication between direct reports and managers. To maximize the benefits of one-on-one meetings, create an agenda using the "listings" approach, with the employee covering their list first and the team leader going through theirs next. This approach covers immediate work issues as well as long-range topics such as career growth and development.

One-on-one meetings are crucial for managerial success, team success, and employee learning and engagement. They promote diversity and inclusion, strengthen relationships, and produce better outcomes. Before meetings, provide context for the topics and ask usual questions. Establish a routine for meetings and explain that they represent the manager's decision to prioritize employees' needs. Stay open-minded and explain shared objectives.

Hold weekly sessions, especially with remote employees, to avoid micromanagement. Choose a schedule that aligns with your needs and preferences, giving your directs some agency. If employees operate from the same office, consider deferring to their preferences.

Plan the location and questions for the meetings, including the office, direct report's office, or outdoor locations. Involve your employee in planning the setting and direct the conversation. The quality of questions asked will determine the quality of dialogue.

Effective one-on-one meetings are essential for team success, fostering better outcomes, strengthening relationships, and promoting diversity and inclusion. Focus on building connections with your employees, their engagement, setting priorities, giving feedback, and fostering career growth and development. Avoid asking personal questions or gossip and maintain a cheerful outlook. Take notes, cover new ground, and ask for feedback. Work through tactical and personal issues, ask for candid feedback, and implement "five key behaviors" to improve your performance. Both parties should feel free to ask for help, and the meeting should end on time. Wrap up the meeting and record important takeaways. Follow up on all commitments made during the meeting.

One-on-one meetings can occur between managers and their direct reports, or with employees meeting individually with their managers' manager or a higher-up executive. Regular one-on-one sessions help ensure your success as a leader, as they provide valuable insights, foster relationships, and help you make better decisions. The Chinese proverb "If you want happiness for an hour, take a nap, go fishing, inherit a fortune, but if you want happiness for a lifetime, help somebody" suggests that fostering relationships and helping others can lead to long-term happiness.

Previous book articles: BookSummary111.docx