Tuesday, May 30, 2023

Pattern #1: Back up your MySQL database

  • Overview/Purpose:

This pattern describes how to provide business continuity and disaster recovery for MySQL databases deployed on a single server or on a cluster in Azure, so that the data can be recovered after a user or application error, a regional data center outage, or other unplanned disruptions.

  • Concepts to Understand

Paired Region: Azure supports cross-region replication pairings for most geographies. Regions are paired for cross-region replication based on proximity and other factors. The Azure regional pairs in North America include East US – West US, East US 2 – Central US, North Central US – South Central US, West US 2 – West Central US, and West US 3 – East US. One benefit of choosing from these pairings is that if there is a broad outage, recovery of at least one region in the pair is prioritized. Central US is the default region across many deployments; wherever a region has no pair, geo-redundant storage is not available, and high availability there must be achieved via availability zones and locally redundant or zone-redundant storage.

Geo-Restore: A feature of Azure Database for MySQL that allows a server to be restored from geo-redundant backups. These backups are hosted in the server’s paired region.

RTO: The Recovery Time Objective is the maximum amount of time that the resource can be down without causing significant damage to the business, that is, the time allowed for restoring it to normal operations after the incident.

RPO: The Recovery Point Objective is the maximum amount of time that can pass during a disruption before the quantity of data lost over that period exceeds the allowable threshold.

  • Solution Design

Configure your MySQL server to either:

1.     take geo-redundant backups with the ability to initiate geo-restore, or

2.     deploy read replicas in a different region.

With geo-restore, a new server is created using the backup data that is replicated from another region. The overall time it takes to restore and recover depends on the size of the database and the number of logs to replay, and typically ranges from a few minutes to a few hours.

With read replicas, transaction logs from the primary are asynchronously streamed to the replica. In the event of a primary database outage due to a zone-level or region-level fault, failing over to the replica provides a shorter RTO and reduced data loss.

Feature       | Cost                                          | RTO                                                              | RPO
Geo-restore   | Only on General-purpose/Memory-optimized SKU  | Varies                                                           | < 1 h
Read replicas | Available on Basic                            | Minutes, but depends on latency, data size, and write workload   | < 5 min

 

Terraform to apply:

Option 1:

resource "azurerm_mysql_flexible_server" "default" {

:

  create_mode: “GeoRestore”

  geo_redundant_backup_enabled = true

  source_server_id: “other_server”

:

}

 

Changing the backup setting from the default of locally redundant to geo-redundant via Terraform, so that there is protection against region-level failures, forces the existing instance to be destroyed and recreated.

Option 2:

resource "azurerm_mysql_flexible_server" "example" {

:

  create_mode: “Replica”

  source_server_id: “other_server”

  sku_name               = "B_Standard_B1s"

}

 

·        If possible, run a test drill for your changes.

Recovery plan: Applications do not see the failure of a database or storage because the configured MySQL server recovers automatically, but user action is required when there is a region failure or a user error. A region failure is a rare event and requires the promotion of a read replica to primary: replication to the replica is stopped and the replica is then promoted.

This pattern holds true for a Cassandra cluster as well, where backups are taken continuously and hours_between_backups (defaulting to 24 hours) can be specified. Paired-region support is also available for Kubernetes clusters and persistent volumes.

Note that the databases are typically backed up automatically every day; we only need to choose between geo-restoring from a backup and linking a replica to the original server. This pattern works for both a single-server instance and a high-availability flexible-server instance.

Monday, May 29, 2023

 

This summary of a book on how to improve conversations doesn't do justice to it, but it certainly tries to spread across all the chapters because each one is as good as the previous. The book is written by Professor Anne Curzan, who has been teaching English linguistics for several years. She starts out by saying that a person speaks an average of 16,000 words a day, regardless of gender. Many a time the conversation is either not communicated properly or not received properly, and she makes the case that participants on either side can multitask between listening and speaking in a way that keeps the conversation load-balanced and engaging for both. She points out that conversations where one party does most of the heavy lifting to keep them going are subject to a lot of factors that cannot always be understood in time by either party, so the best they can do is to not let a topic die; any topic mentioned deserves attention for a possible follow-up. With this premise, she brings in a lot of stories, actual conversations, references from researchers, and quite a few demonstrated improvements using the techniques she discusses in her book.

The author points out that conversation doesn't always happen between like-minded people, and some differences are well known. For example, women tend to speak amongst themselves in a cooperative manner, while men speak among themselves in statements. Women ask twice as many questions as men. These differences aside, both men and women are from the same planet and have similar difficulties in making their conversations more effective. She suggests using all forms of expression to make intentions known, including non-verbal cues, hand movements, simultaneous talking, back-channeling speech acts, and many other instruments. She says that manners of speaking such as “you know, I have been thinking…”, “I have been thinking about…”, “on my way to work…”, a “you know” with a dragging o, an “I mean” with a dragging n, a “so” with a dragging o, and others are all effective conversation starters and endings. She points out that hand movements help the brain release some of the thoughts it is trying hard to find language for, and they can be made while the speaker is speaking. She doesn't rule out any form of language that helps a person become more engaged in the conversation and elicits the same from the audience. She recognizes the power of storytelling in all forms of conversation, whether casual or business. She compares stories to lists and observes that popular software tools make lists easy and popular, but she advises against relying on them, saying that stories are more powerful than lists.

The author also cites both direct and indirect forms of communication and their utility in various scenarios. She suggests that direct forms of communication can be literal and factual, while indirect forms play a role in requests and in asking a favor. Recognizing the possible differences in the social, political, cultural, and economic standing of the listener and the speaker, she suggests indirect forms of communication to make the conversation more polite. She calls out variations of conversations in the form of face-threatening acts, damage to one's public impression, and imposing or intruding into the personal space of others, and how to respond to these. Some techniques involve leveling the field with words that show camaraderie, hedging against possible refusal, understanding or stating the premises and their limitations, showing politeness, and apologizing for imposing into an area of possible discomfort.

On the topic of ingratiating oneself to another for business or social reasons, she provides plenty of examples, from subtle to clear, that make it heartfelt rather than perfunctory. Likewise, she explains how to accept compliments with a full range of surprise, graceful acceptance, self-deprecation, deflection, and care to return to the more substantial matter at hand. Towards the end of the book, the examples become lengthier and more meaningful in context. One cannot help but close the book with a resolve to be more honest and engaging the next time one has an important conversation.

Sunday, May 28, 2023

 

Hot Desking:

A hot, or shared, desk is one that is reserved for a block of time rather than dedicated to one person's own space. When workers had an office space prior to the pandemic, it came with a dedicated desk that provided a significant sense of safety, comfort, and security, which raised employee productivity. Companies, big and small, could not do without it to retain talent and improve their balance sheets.

The biggest detriment to its continuation was the soaring price of real estate and the cost of assigning a desk to every new worker. Shared office space and even shared desks became common practice. Hot desks were introduced nearly twenty years ago, and throughout that time they have never lost the dislike and far-flung frustration they provoke among the workforce. They have caused worker agitation and lawsuits and have even made it onto union agendas. From the eastern to the western hemisphere, the pain point is well known and high on the hate lists.

Employees could not abide the additional step of reserving time or space for a shared commodity. It made them feel worthless, not to mention the failure to find a desk at the hour of need. Not much has changed on this front, and this plight with hot desks manifests in many polls and surveys, so much so that the notion that hot-desking becomes acceptable over time has been refuted.

The pandemic has had a two-fold impact on hot desks. It has made workers rely less on their assigned desks and prefer flexible workspaces, while companies have found a growing cost in unused furniture and real estate. Some have already begun to revise their post-pandemic plans to increase the use of co-working. Scheduling applications and software are recognizing the notion of reserving a desk as much as reserving a conference room or virtual meeting space. Desk reservations will likely find a better home in office-productivity and team-collaboration tools than in dedicated software that gives workers yet another tool to manage. That said, niche software will continue to expand these capabilities. For example, Envoy, a software maker, provides an application to book a hot desk.

Employers are recognizing the hot desk not merely as a proven agent of worker stress but also as an opportunity to cut costs, if there were an acceptable remediation to bridge the gap. Office properties in big cities are so expensive that they cost upwards of a billion dollars a year in a single city alone. Many businesses do not see an end to work from home, which precipitates the need to do without the costs of unused space and furniture. If the workforce is only going to appear two or three days a week, there will be a lot of room for improvement. The hot desk will tend to become a necessity despite the criticism.

Workers find changes in process the least palatable. Hot desking will surely ratchet up a lot of gripes, but companies will be eager to address the criticism rather than carry the overhead. Even though the protests might not be the last, organizations from corporations to governments will run into the same debate, timeless cost-cutting fanatics will tout the idea, and unused office space is fast becoming a bigger problem. People's return to their office space as the pandemic has let up has not matched the pre-pandemic utility, value, and purpose of the desks. The cost and discomfort of using hot desks is still manageable, albeit not cost-free. A personal desk is not only ergonomically sound and customized; it also carries a sense of belonging and value, and it brings recognition as well as loyalty and appreciation from workers.

Saturday, May 27, 2023

 

This article introduces my “skip graphs”: a data structure that enables improved performance over traditional graphs in Computer Science. It introduces layering based on selective nodes. For example, we can choose the nodes whose centrality falls in a given range as a subgraph with which to capture the connectivity at that layer. Edges may need to be "added" between these nodes to represent underlying connectivity through lower-level (lower-centrality) nodes. We can consider these not as skip lists but as skip graphs. The benefit is that graph algorithms can continue to operate on a smaller subset of the original graph, so long as this subgraph can be completed with vertex selections and the addition of edges. Using a metric such as centrality and this technique of skipping graphs, we can improve performance in almost all applications of the graph where the size is very large.

In this technique, we introduce a hierarchy of centrality ranges. Starting from the lowest centrality of 1 at the core, the graph spreads out spatially from the origin with increasing radius of higher centrality ranges. At each layer of the sphere, the subgraphs are fully formed with their own edges. These edges do not interfere with the edges that are originally present in the graph.

Selection of the nodes does not change existing vertices and edges either linearly or spatially. It merely rearranges the graph along the z-axis, where the z-axis is the centrality measure and the xy-plane is where the graph is depicted. By selecting the more connected nodes from the graph and adding edges between them where there is underlying connectivity through less connected nodes, we create a different subgraph. Selected nodes may have many paths between each other, so there can be multiple edges between these selected vertices. These edges are maintained physically but bundled into a single logical edge. The logical edge helps with general graph algorithms on the subgraph of selected nodes so that we have fewer data points for answering common questions, while the physical edges help with finding reachability through one or more physical edges. So this has limited application, in that we use the selected vertices and the new edges between them only for problems where they are helpful and not for problems that require graph algorithms over all vertices.
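As a rough illustration, here is a minimal Python sketch of building one such layer. It assumes an undirected graph given as an adjacency list and uses plain degree as a stand-in for the centrality measure; the function names and the threshold are illustrative and not part of the original design.

from collections import deque

def degree_centrality(adj):
    # Degree of each node, used as a simple stand-in for a centrality measure.
    return {v: len(nbrs) for v, nbrs in adj.items()}

def connected_through_lower(adj, selected, src, dst):
    # True if src reaches dst using only non-selected (lower-centrality)
    # intermediate nodes: the underlying connectivity a logical edge represents.
    seen = {src}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        for nbr in adj[node]:
            if nbr == dst:
                return True
            if nbr not in seen and nbr not in selected:
                seen.add(nbr)
                queue.append(nbr)
    return False

def build_skip_layer(adj, threshold):
    # Select the better-connected nodes and add a logical edge between any pair
    # that is connected directly or through lower-centrality nodes.
    centrality = degree_centrality(adj)
    selected = {v for v, c in centrality.items() if c >= threshold}
    layer = {v: set() for v in selected}
    nodes = sorted(selected)
    for i, u in enumerate(nodes):
        for w in nodes[i + 1:]:
            if w in adj[u] or connected_through_lower(adj, selected, u, w):
                layer[u].add(w)
                layer[w].add(u)
    return layer

# Small example: b and d are the best-connected nodes and end up joined
# by a logical edge that stands for the paths b-e-d and b-c-d.
adj = {
    "a": {"b"}, "b": {"a", "c", "e"}, "c": {"b", "d"},
    "d": {"c", "e", "f"}, "e": {"b", "d"}, "f": {"d"},
}
print(build_skip_layer(adj, threshold=3))

Graph algorithms that only need the well-connected core can then run over the much smaller layer instead of the full adjacency list.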

There is significant cost in building such subgraphs and maintaining them. The idea is that, for graphs that do not change but require many analytical computations, we can do this once and not have to edit it.

The subgraph-building operation is offline and does not interfere with conventional approaches to using the graph otherwise. By automating and rendering these additional data structures, we can improve the performance of computations that would otherwise require all the vertices. This is a separation of one-time setup cost from the running cost of computations.

One example of using such graphs is for time-series data. The data is a collection over time, so its history is not susceptible to change; it only accumulates more new data, and the graphs can be built incrementally. Such time-series data is often found in data warehouses.

There can be many interpretations of skip graphs. A simple graph with fewer nodes is a new graph over existing nodes and has no overlap with the original graph. It answers some queries directly with the nodes and edges in the graph using traditional graph algorithms. Different techniques yield different graphs, and they can include a metric like the goodness of fit for clusters or an F-score. A skip list is a tower of increasingly sparse linked lists sorted by key. The lists at higher levels act as express lanes that allow the sequence of nodes to be traversed quickly. A skip graph, as per James Aspnes and Gauri Shah, is distinguished from a skip list in that there may be many lists at a particular level, and every node participates in one of these lists, until the nodes are splintered into singletons. A skip graph is equivalent to a collection of up to n skip lists that happen to share some of their lower levels. In distributed computing, the graph must tolerate arbitrary node failures and should be usable even when a huge fraction of the nodes fails. Whether a node appears in a higher-level skip list is determined by a membership vector, generally using prefix membership of a word. This form of membership makes a skip graph a trie of skip lists. A membership vector is a random variable. There is an edge between x and y whenever x and y are adjacent in some list.
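To make the express-lane idea concrete, here is a minimal skip-list sketch in Python using the usual coin-flip promotion. It is illustrative only and does not implement the distributed membership vectors that distinguish Aspnes and Shah's skip graphs.

import random

class Node:
    def __init__(self, key, level):
        self.key = key
        self.forward = [None] * level   # one pointer per level ("lane") this node joins

class SkipList:
    MAX_LEVEL = 16

    def __init__(self):
        self.head = Node(None, self.MAX_LEVEL)

    def search(self, key):
        node = self.head
        # Start in the sparsest lane; drop a level whenever the next key overshoots.
        for level in reversed(range(self.MAX_LEVEL)):
            while node.forward[level] and node.forward[level].key < key:
                node = node.forward[level]
        node = node.forward[0]
        return node is not None and node.key == key

    def insert(self, key):
        update = [self.head] * self.MAX_LEVEL
        node = self.head
        for level in reversed(range(self.MAX_LEVEL)):
            while node.forward[level] and node.forward[level].key < key:
                node = node.forward[level]
            update[level] = node
        # Coin flips decide how many levels the new node joins, which is what
        # keeps the higher lists increasingly sparse.
        level = 1
        while level < self.MAX_LEVEL and random.random() < 0.5:
            level += 1
        new_node = Node(key, level)
        for i in range(level):
            new_node.forward[i] = update[i].forward[i]
            update[i].forward[i] = new_node

sl = SkipList()
for k in [3, 7, 9, 12, 19, 25]:
    sl.insert(k)
print(sl.search(12), sl.search(13))   # True False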

Friday, May 26, 2023

 

Moving cloud resources with Terraform:

Introduction: Cloud resources often outgrow the initial definition with which they are created. When changes are applied to a resource, it is often updated in place, or it might be destroyed and recreated. When Terraform keeps track of the resource, the changes are seamless to the end user, and each instance is associated with a configured resource at a specific resource address. When an existing infrastructure object must be associated with a different configured resource, Terraform can be informed about the history with a declaration block indicating the from and the to configured resources. Terraform reacts to this information during the planning step, in between the init and apply steps of the three-step deployment process.
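As a minimal sketch of such a declaration block, assuming a server previously tracked at azurerm_mysql_flexible_server.default is now defined inside a module named database (both addresses are illustrative):

moved {
  # The existing real-world object keeps its state; only the address changes,
  # so the next plan reports a move instead of a destroy-and-create.
  from = azurerm_mysql_flexible_server.default
  to   = module.database.azurerm_mysql_flexible_server.default
}

Running terraform plan after adding this block reports the move rather than a destroy and recreate.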

Common operations like renaming a resource or moving it into and out of a module can also be performed with the terraform state mv command, which changes which resource address in the configuration is associated with a particular real-world object. In this case, you are required to come up with a new configuration that will instantiate the desired resource and then provide the old and new addresses to the move command.

Changing a resource’s resource group or location requires you to come up with a new configuration, and it is something that Terraform cannot do automatically yet. This limitation is not imposed by Terraform but by the public cloud providers, which want an entirely new instance to be created. Even when a cloud provider says it supports such an operation, for example, Azure can change the resource-group membership of a virtual machine, the Terraform provider still requires a new one to be created.

One way to reconcile such a move while keeping track of it in infrastructure as code involves the following three steps. First, use the cloud provider’s own tooling to move the resource to the target region or resource group. Second, modify the Terraform configuration to use the new attribute value. Third, run terraform refresh against the workspace or configuration to force Terraform to update the state file with the new attribute value. The next plan step will then find a match between the new location or group and what is defined in the configuration, and will result in no changes being required.
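For instance, a minimal sketch of the second step, assuming the server was moved to a resource group named rg-dr in the westus2 region (the names are placeholders for this illustration):

resource "azurerm_mysql_flexible_server" "default" {
  name                = "example-mysql"
  resource_group_name = "rg-dr"     # updated to match the move performed outside Terraform
  location            = "westus2"   # updated to match the move performed outside Terraform
  # ...
}

After terraform refresh, the state carries the same values, so the subsequent plan should show no changes.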

Moving resources in production that require recreation often involves moving production data. This implies there will be downtime, but it is usually the simplest way to move a resource to a target location or group. Some tools, like aztfmove, can help move resources based on Terraform state to different resource groups and subscriptions in Azure. One limitation of this tool is that it cannot alter just the region.

Azure Resource Mover can help with moving resources from one region to another, and it performs this successfully via UI, CLI, and scripting for compute resources such as virtual machines, network resources, and relational databases. It gives ample time to test and validate the migration, allows the move to be rolled back until the final commit, and lets source resources be deleted after the move. But it does not support advanced hosts and runtimes such as Kubernetes. In fact, the Azure public cloud offers no support for moving Kubernetes instances between regions, resource groups, and subscriptions.

Consequently, the only available choice is for end users to take this on themselves: recreate a new instance in the destination and then inform Terraform about the change.

Wednesday, May 24, 2023

 

In continuation of the preceding post on leveraging Kusto databases for querying, even when it involves importing data from external sources, the following Kusto queries help with useful analysis across the ITOM and ITSM landscape.

1.       When log entries do not have function names, scopes, or durations of calls:

source
| where description contains "<string-before-scope-of-execution>"
| project SessionId, StartTime=timestamp
| join (source
    | where description contains "<string-after-scope-of-execution>"
    | project StopTime=timestamp, SessionId)
    on SessionId
| project SessionId, StartTime, StopTime, duration = StopTime - StartTime
| summarize count() by duration = bin(duration / 1s, 10)
| sort by duration asc
| render barchart

 

2.       Since the duration column is also relevant to other queries later, it can be computed once:

source | extend duration = endTime - sourceTime

 

3.       When the log entries do not have an exact match for a literal: 

source 

| filter EventText contains "NotifyPerformanceCounters"

| extend Tenant = extract("tenantName=([^,]+),", 1, EventText) 

 

4.       If we wanted to parse fields out of EventText with a pattern:

source
| parse EventText with * "resourceName=" resourceName ", totalSlices=" totalSlices:long * "releaseTime=" releaseTime:date ")" *
| where valid in~ ("true", "false")

5.       If we wanted to read signin logs: 

source                              

| evaluate bag_unpack(LocationDetails)  

| where RiskLevelDuringSignIn == 'none'  

   and TimeGenerated >= ago(7d) 

| summarize Count = count() by city 

| sort by Count desc 

| take 5 

6.       If we wanted to time-bucket the log entries:

source 

| where Modified > ago(7d) 

| summarize count() by bin(Modified, 1h) 

| render columnchart 

 

7.       If we wanted to derive just the ids:

source

| where Modified > ago(7d) 

| project  Id 

 

8.       If we wanted to find the equivalent of a SQL query, we could use an example as follows:

EXPLAIN

SELECT COUNT_BIG(*) as C FROM StormEvents 

Which results in 

StormEvents 

| summarize C = count() 

| project C 

 

This works even for wildcard characters, such as:

EXPLAIN 

SELECT * from dependencies where type like 'Azure%'

Which results in 

dependencies 

| where type startswith "Azure"

 

And for extensibility as follows: 

EXPLAIN 

SELECT operationName as Name, AVG(duration) as AvgD FROM dependencies GROUP BY name 

Which results in 

dependencies  

| summarize AvgD = avg(duration) by Name = operationName 

 

A few questions are included below for incident-dashboard purposes; sketches of queries for some of them follow the list:

-          How many incidents did customers open this week, month, year-to-date?

-          How many incidents did a team resolve and close in those periods?

-          What is the mean time to resolution?

-          What is the breakdown of incidents by category?

-          What are the largest clusters of incidents?

-          What are the bounding boxes by time or resource groups for the highest number of incidents?

-          What are the outlying incidents?

-          What are the splits for a decision tree?

-          What clusters are formed from co-occurring incidents when they are run through a neural net?
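A minimal KQL sketch for the first few of these, assuming an incidents table with CreatedAt, ResolvedAt, State, and Category columns (all of these names are assumptions for illustration):

// Incidents opened this week
incidents
| where CreatedAt >= startofweek(now())
| summarize OpenedThisWeek = count()

// Incidents closed this month and the mean time to resolution
incidents
| where State == "Closed" and ResolvedAt >= startofmonth(now())
| extend TimeToResolve = ResolvedAt - CreatedAt
| summarize ClosedThisMonth = count(), MeanTimeToResolution = avg(TimeToResolve)

// Breakdown of incidents by category
incidents
| summarize Count = count() by Category
| sort by Count desc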

Kusto queries can involve external data sources, and it is easy to integrate with APIs from external services. This expands what can be processed and relaxes the requirement that everything be inlined into the query.
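For example, a minimal sketch using the externaldata operator to pull a CSV export of incidents into a query (the storage URL and column names are placeholders):

externaldata (IncidentId: string, Category: string, OpenedAt: datetime)
[@"https://example.blob.core.windows.net/exports/incidents.csv"]
with (format="csv", ignoreFirstRecord=true)
| summarize Count = count() by Category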

Tuesday, May 23, 2023

Creating and curating Kusto databases for integration with ServiceNow incidents tables:

Azure as a public cloud aggregates functionality and traffic from distributed on-premises and private datacenters, with the benefits of managed resources, their lifecycle management, elasticity, scalability, and efficient operations management. As it attracts on-premises applications and data, owning teams are overwhelmed by the many organizational units, such as tenants, subscriptions, and resource groups, and resources such as dashboards, alert groups, identities, roles, and permissions. An inventory of these resources usually finds its place in the Azure Data Explorer in the form of one or more Kusto databases. Together with Microsoft Graph, and with the Kusto Query Language as the universal language of querying across the resource and data explorers, cloud engineers gravitate to these query editors to get meaningful, virtualized information across diverse datasets. One such dataset that continues to remain external is the incidents table from ITSM products.

Azure resources are as important to IT service management as any other on-premises resources and enterprise applications. Taking ServiceNow as an example of an ITSM product with robust ITSM capabilities, there are often discussions about integration points between the two. ServiceNow enables its users to automate incident and issue management. Additionally, it provides users with access to real-time performance analysis and change management. The integration points between the two products comprise connectors for Log Analytics and Azure Monitor, linking action groups to incident creation or update, webhooks and Azure AD authentication, runbook- and script-based automations, and Azure DevOps and service-management ticketing.

Many of the integration points establish connections from the cloud to the ITSM, such that intelligent queries written in KQL in the cloud serve to trigger actions for service management. The reverse direction, propagating incident information and history to databases on Kusto clusters, is considered upstream, but it occurs especially when the deprecation of the ITSM in favor of native cloud tools and practices amounts to more than a technological and cultural shift.

Azure Data Factory facilitates data transfers between diverse sources and destinations. The Copy and Lookup activities in the data factory are sufficient to copy with incremental progress and at scale. The rich information captured by the ITSM brings to Azure an unparalleled opportunity for analyzing resources and pain points, along with a deeper understanding of trends, predictions, and quality of service for customer usage.

Kusto is popular both with Azure Monitor and with Azure Data Explorer. A Kusto query is a read-only request to process data and return results in plain text. It uses a data-flow model that is remarkably like the slice-and-dice operators in shell commands. It can work with structured data with the help of tables, rows, and columns, but it is not restricted to schema-based entities; it can be applied to unstructured data such as telemetry data. A query consists of a sequence of statements delimited by a semicolon and has at least one tabular query operator. The name of a table is sufficient to stream its rows to a pipeline operator that separates the filtering into its own stage with the help of a SQL-like where clause. Sequences of where clauses can be chained to produce a more refined set of resulting rows. A query can be as short as a tabular query operator, a data source, and a transformation. Any creation of new tables, rows, and columns requires the use of control commands, which are differentiated from Kusto queries because they begin with a dot character. The separation of these control commands helps with the security of the overall data-analysis routines; administrators will have less hesitation about letting Kusto queries run on their data. Control commands also help to manage entities or discover their metadata. A sample control command is the “.show” command, which shows all the tables in the current database.
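As a small illustration of this shape, assuming the familiar StormEvents sample table, a query chains filters into stages, while a control command starts with a dot:

StormEvents
| where State == "TEXAS"
| where EventType == "Flood"
| summarize count() by bin(StartTime, 1d)

.show tables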

This article explores the integrations between the public cloud and the ITSM, with the benefit of bringing together the best of both worlds.