Sunday, May 14, 2023

Queries for Operational Engineering of Cloud Resources

Introduction: This article focuses on the operational engineering aspects of Cloud DevOps and solutions development. While the public cloud and the technology landscape owned by companies each play up their strengths for operational engineering, there is no single solution or product that addresses all the concerns of this discipline. The public cloud does come with immense capabilities for monitoring and management of resources, but it leaves the authoring of rules, alerts, action groups, and dashboards, and their organization by scope and level, to the teams that use them. Individual teams end up with a custom practice and approach for a single pane of management, or often do without one by resorting to the built-in products and to repeated effort, usually by engineers on rotation.

Operational engineering is just as much about asking the right questions for the smooth running of operations as it is about troubleshooting and remediation for current and future deployments. Instead of running the same questions time and again through different sets of tools and personnel, it might be better to articulate them and invest in systems that can automate the effort and increase the insight available to operational engineers. While different companies may already have developed tools and techniques for this purpose, this article suggests that the purpose of those systems is to answer questions that can be curated and automated.

With different products in the technology landscape vying for user acceptance in answering some of these questions, the notion of developing a reporting stack, regardless of the size and scale of the deployments it provides information for, might seem like overhead. But it is precisely the limitations of those products, together with the convenience and consistency of the answers, that let a system built to answer these questions bring additional value and even become a staple.

Another angle of a custom approach to meeting the operational engineering needs of deployments from various teams is a virtualized view over the different data sources, so that queries do not lose their relevance to the quirks and demands of the products that store the data. Say the incidents stored in an IT Operations Management database served by a SaaS provider need to be related to the resources in the different subscriptions and resource groups of the cloud provider for a holistic view of the pain-point resources. In such a case, no correlation is possible because there is simply no single data store or common query that can answer both. A data store provides order, and a query provides a method to come up with the answers that management might want, such as which resource types have caused the most incidents in the last N days, or how those incidents correlate with costs. The information could very well be answered in parts by the SaaS provider or the cloud provider or both, but not as a complete answer.
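
For instance, if both datasets have already been landed in the same Azure Data Explorer database, the kind of question management asks becomes a single join. The sketch below assumes hypothetical tables named SnowIncidents and AzureResources sharing a ResourceId column; the cluster URL, database, table, and column names are placeholders, not an existing schema.

```python
# Sketch: rank resource types by incident count over the last N days,
# assuming ServiceNow incidents and the Azure resource inventory have both
# been ingested into one Azure Data Explorer database. All names below
# (cluster, database, tables, columns) are illustrative placeholders.
from azure.kusto.data import KustoClient, KustoConnectionStringBuilder

CLUSTER = "https://<your-cluster>.kusto.windows.net"   # placeholder
DATABASE = "ops"                                       # placeholder
N_DAYS = 30

QUERY = f"""
SnowIncidents
| where OpenedAt > ago({N_DAYS}d)
| join kind=inner (AzureResources) on ResourceId
| summarize IncidentCount = count() by ResourceType, Subscription
| order by IncidentCount desc
"""

kcsb = KustoConnectionStringBuilder.with_az_cli_authentication(CLUSTER)
client = KustoClient(kcsb)
response = client.execute(DATABASE, QUERY)
for row in response.primary_results[0]:
    print(row["ResourceType"], row["Subscription"], row["IncidentCount"])
```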

Saturday, May 13, 2023

ServiceNow Incidents and Azure Data Explorer (Kusto query language):

 


Introduction:

Azure resources are as important to IT operations management as any other on-premises resources and enterprise applications. ServiceNow provides robust ITOM capabilities, while Microsoft Graph and the Kusto Query Language empower intelligent experiences. The Graph or a Kusto database just needs mechanisms to bring in content from external services, and connectors offer a simple and intuitive way to do exactly that. For example, data brought in from the organization can appear in Microsoft Search results, which expands the type of content sources that are searchable in Microsoft 365 productivity applications and the broader ecosystem. There are over a hundred connectors currently available from Microsoft and partners, including Azure services and ServiceNow.

Kusto is popular with both Azure Monitor and Azure Data Explorer. A Kusto query is a read-only request to process data and return results in plain text. It uses a data flow model that is remarkably like the slice-and-dice operators in shell commands. It can work with structured data with the help of tables, rows, and columns, but it is not restricted to schema-based entities; it can also be applied to unstructured data such as telemetry. A query consists of a sequence of statements delimited by semicolons and has at least one tabular query operator. The name of a table is sufficient to stream its rows to a pipe operator that separates the filtering into its own stage with the help of a SQL-like where clause. Sequences of where clauses can be chained to produce a more refined set of resulting rows, and a query can be as short as a tabular data source and a transformation. Any creation of new tables, rows, and columns requires control commands, which are differentiated from Kusto queries because they begin with a dot character. This separation of control commands helps with the security of the overall data analysis routines, so administrators will have less hesitation in letting Kusto queries run on their data. Control commands also help to manage entities or discover their metadata; a sample control command is ".show tables", which shows all the tables in the current database.
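
As a small illustration of these mechanics, the sketch below issues a chained-where query and a ".show tables" control command from Python through the azure-kusto-data client; the cluster, database, table, and column names are placeholders and the query itself is only illustrative.

```python
# Sketch: a Kusto query as a pipeline of chained where clauses, plus a
# ".show" control command to list tables. Cluster, database, table, and
# column names are illustrative placeholders.
from azure.kusto.data import KustoClient, KustoConnectionStringBuilder

CLUSTER = "https://<your-cluster>.kusto.windows.net"  # placeholder
DATABASE = "ops"                                      # placeholder

# A tabular query: a data source streamed through pipe-separated stages.
QUERY = """
Incidents
| where Severity <= 2
| where OpenedAt > ago(7d)
| summarize Count = count() by Category
| order by Count desc
"""

kcsb = KustoConnectionStringBuilder.with_az_cli_authentication(CLUSTER)
client = KustoClient(kcsb)

# Control commands begin with a dot and are issued separately from queries.
tables = client.execute_mgmt(DATABASE, ".show tables")
print([row["TableName"] for row in tables.primary_results[0]])

# The read-only query itself.
result = client.execute_query(DATABASE, QUERY)
for row in result.primary_results[0]:
    print(row["Category"], row["Count"])
```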

The power of querying ServiceNow Incidents in Kusto Query Language is unparalleled for Azure resources. This article explains one such method.

 

Method:

Here is one method to integrate ServiceNow with Azure DevOps, to be followed by Kusto.

1. The first step requires access to the SNOW portal for ServiceNow.

2. Then the DevOps integration application (plugin) is installed.

3. The next step is to navigate to: Search > Connection & credential aliases > New > Name = "Azuredemo1" > Submit.

4. This is followed by navigation to Search > Credentials > New > Basic Auth > name = "Azuredevops1" > username = "AzureDevOps1". For the password, go to Azure DevOps and create a new personal access token (in the top right corner select User settings > Personal access tokens).

5. Then we copy and paste this token into the SNOW credentials password field.

6. And click Submit.

7. This is followed by navigation to: SNOW portal > Search > Connection > New > HTTP(s) > name = "AzureDevOps1" > Credentials = select the one created in the previous step ("Azuredevops1") > Connection alias = select the alias created before ("Azuredemo1") > Connection URL = go to Azure DevOps > Org settings > copy the URL from the Overview tab and paste it into the SNOW portal > Submit.

8. It is followed by SNOW portal > Search > Azure DevOps Instance > New > name = "AzureDemo1" > Connection alias = select the alias created before ("AzureDemo1") > Version = a compatible one > Submit.

9. Then the Azure DevOps Instance dashboard is accessed and AzureDemo1 (the new instance that was just created) is selected > click Connect > once we do that, the state changes to "Connected".

10. Then Create Mapping is selected > "map is created successfully".

11. Then Discover Projects is selected > under the Azure DevOps Project tab we should see our project from Azure DevOps (i.e., DCP).

12. Now the project (DCP) is clicked > Register Webhooks > this enables the connection between Azure DevOps and SNOW.

13. This is followed by navigation to Team integration settings > New > Assignment group = "select your agile group" (we can create our own agile group from Search > Agile Azure DevOps integration > Create agile group) > Team = "select our Azure DevOps team" (imported from Azure) > Submit.

14. This lets us create, delete, or modify a user story or feature from either Azure DevOps or the SNOW portal, and the changes will be synchronized automatically.
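
The portal steps above wire up the Azure DevOps side; to put incident records where a Kusto query can reach them, one complementary path is to pull them from ServiceNow's REST Table API and stage them for ingestion into Azure Data Explorer. The following is a minimal sketch under that assumption; the instance name, credentials, and field list are placeholders, and the actual ingestion into a Kusto table (for example with the azure-kusto-ingest package) is left out.

```python
# Sketch: pull recent incidents from the ServiceNow Table API and write
# them to a line-delimited JSON file that can later be ingested into an
# Azure Data Explorer table. Instance name, credentials, and the field
# list are illustrative placeholders.
import json
import requests

INSTANCE = "https://<your-instance>.service-now.com"  # placeholder
USER, PASSWORD = "integration.user", "<secret>"       # placeholders

def fetch_incidents(limit=1000):
    """Fetch recent incident records via the ServiceNow Table API."""
    url = f"{INSTANCE}/api/now/table/incident"
    params = {
        "sysparm_limit": limit,
        "sysparm_query": "active=true^ORDERBYDESCopened_at",
        "sysparm_fields": "number,short_description,cmdb_ci,severity,opened_at",
    }
    resp = requests.get(url, auth=(USER, PASSWORD), params=params,
                        headers={"Accept": "application/json"}, timeout=30)
    resp.raise_for_status()
    return resp.json()["result"]

if __name__ == "__main__":
    incidents = fetch_incidents()
    # One JSON object per line, a format Azure Data Explorer ingests natively.
    with open("incidents.multijson", "w") as f:
        for record in incidents:
            f.write(json.dumps(record) + "\n")
    print(f"wrote {len(incidents)} incidents")
```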

Friday, May 12, 2023

 



#codingexercise 

https://1drv.ms/w/s!Ashlm-Nw-wnWhMpiXxGeVryG7eiFXA


Thursday, May 11, 2023

 

When a private node behind Network Address Translation (NAT) requests information from a public node, it specifies a shuffle request. In response, the public node sends a shuffled response, and both the private node and the public node update their states. If another node on the Internet tries to issue a shuffle request to the private node behind the NAT, the request will not pass through the NAT. This is the expected behavior for epidemic algorithms when nodes sit behind a NAT.

Communicating with private nodes in this setting can be achieved in one of two ways. The first technique relays the communication to the private node through a public relay node. The second uses a NAT hole-punching algorithm to establish a direct connection to the private node with the help of a public rendezvous node. The choice between the two is determined by latency and load. Relaying gives lower-latency message exchange, enables shorter gossip cycle periods, and is also necessary in dynamic networks. Hole punching decreases the load on the public nodes, but not always: the load does not decrease if the shuffle messages are small.

Gozar is a NAT-aware peer sampling algorithm. Each private node connects to one or more public nodes, called partners, that act as a relay or rendezvous server on behalf of the private node. A node's descriptor consists of its own address, its NAT type, and its partners' addresses at the time the descriptor was created. When a node wants to gossip with a private node, it uses the partner addresses in that node's descriptor to communicate with it.
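
A minimal sketch of such a descriptor and the gossip-target resolution it enables might look like the following; the field names and the partner-selection policy are illustrative assumptions rather than Gozar's exact data structures.

```python
# Sketch: a Gozar-style node descriptor carrying the node's own address,
# its NAT type, and its partners' (public relay/rendezvous) addresses,
# plus the address a gossiping node would actually contact.
# Field names and the selection policy are illustrative assumptions.
import random
from dataclasses import dataclass, field
from typing import List

@dataclass
class NodeDescriptor:
    address: str                    # "host:port" of the node itself
    nat_type: str                   # e.g. "public" or "private"
    partners: List[str] = field(default_factory=list)  # public partner addresses
    age: int = 0                    # gossip rounds since creation

def contact_address(desc: NodeDescriptor) -> str:
    """Return the address to send a shuffle request to.

    Public nodes are contacted directly; private nodes are reached through
    one of the partners recorded in their descriptor, which relays the
    message (or acts as a rendezvous for hole punching) on their behalf.
    """
    if desc.nat_type == "public" or not desc.partners:
        return desc.address
    return random.choice(desc.partners)

# Example: a private node advertises two public partners.
private_node = NodeDescriptor("10.0.0.7:7000", "private",
                              ["198.51.100.4:7000", "203.0.113.9:7000"])
print(contact_address(private_node))  # one of the partner addresses
```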

Epidemic algorithms are important techniques for solving problems in dynamic, large-scale systems. They are scalable, simple, and robust to node failures, message loss, and transient network disruptions. Their applications include aggregation, membership management, and topology management.


Wednesday, May 10, 2023

 

Application-specific conflict detection is accomplished in the Bayou system with dependency checks. An application can specify a check in the form of a query and an expected result that the server runs against its current data. A conflict is detected if the actual result does not match the expected result specified by the application. If this precondition fails, the requested update is not performed, and the server invokes a procedure to resolve the detected conflict.

Once a conflict is detected, a merge procedure is run by the Bayou server in an attempt to resolve the conflict. Merge procedures included with each write operation are routines written in a high-level language. These procedures can have embedded data or application specific logic related to the update that was being attempted. The merge procedure associated with the write is responsible for resolving any conflicts detected by its dependency check and for producing a revised update to apply. The complete process of detecting a conflict, running a merge procedure, and applying the revised update is performed atomically as part of executing a write.

The meeting room scheduling application provides a good reference for the dependency check and the merge procedure. In this application, users understand that their reservations may be invalidated by other concurrent users, and they can specify alternate scheduling choices as part of their original scheduling updates. Since these alternates are encoded into the merge procedure, it attempts to reserve one of the alternate meeting times if the original time is found to conflict with an existing reservation.
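
A minimal sketch of that flow, with the dependency check expressed as a query plus expected result and a merge procedure that falls back to the alternates, might look like this (the schedule representation and function names are illustrative and not Bayou's actual interfaces):

```python
# Sketch of Bayou-style conflict handling for meeting-room scheduling.
# The schedule representation, dependency check, and merge procedure are
# illustrative placeholders; in Bayou they travel with each write.

schedule = {}   # (room, hour) -> owner; stands in for the server's current data

def dependency_check(room, hour):
    """The application's query plus expected result: the slot should be free."""
    return schedule.get((room, hour)) is None

def merge_procedure(room, alternates):
    """Resolve a detected conflict by trying the user's alternate choices."""
    for hour in alternates:
        if dependency_check(room, hour):
            return hour
    return None   # no automatic resolution; log for manual intervention

def apply_write(room, hour, alternates, owner):
    """Detection, merge, and the revised update happen atomically in Bayou."""
    if dependency_check(room, hour):
        schedule[(room, hour)] = owner
        return ("reserved", hour)
    chosen = merge_procedure(room, alternates)
    if chosen is None:
        return ("conflict logged for manual resolution", None)
    schedule[(room, chosen)] = owner
    return ("reserved alternate", chosen)

print(apply_write("room-5", 10, [11, 14], "alice"))   # ('reserved', 10)
print(apply_write("room-5", 10, [11, 14], "bob"))     # ('reserved alternate', 11)
```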

In case automatic conflict resolution is not possible, the merge procedure still runs to completion, but it logs the detected conflict in a fashion that enables manual intervention to resolve it later.

Although the replicas held by two servers at any time may vary in their contents because they have received and processed different writes, a fundamental property of the Bayou design is that all servers move towards eventual consistency. The Bayou system guarantees that all servers eventually receive all writes via the pair-wise exchanges and that two servers holding the same set of writes will have the same data contents. But it cannot enforce strict bounds on write propagation delays, since these depend on network connectivity factors that are outside of Bayou's control. Two important features of the Bayou system design allow the servers to achieve eventual consistency. First, writes are performed in the same well-defined order at all servers. Second, the conflict detection and merge procedures are deterministic, so the servers resolve the same conflicts in the same manner.

Tuesday, May 9, 2023

Epidemic Algorithms Part 2

An earlier article introduced epidemic algorithms. This continues the discussion with some examples of those algorithms. Cyclon is used for membership management while T-Man is used for topology management. Both demonstrate peer sampling, which we might recall from the earlier article as one of the dimensions of the design space for such algorithms. Every node maintains a relatively small local membership table that provides a partial view of the complete set of nodes, and it periodically refreshes the table using a gossiping procedure.

The generic peer sampling framework executes a handler for a timed event that runs every T time units. It involves selecting peers into a view, permuting the view, moving the oldest pre-defined number of items to the end of the view, adding the others to a buffer initialized with the current node's address, sending the buffer to the selected peers, receiving the corresponding buffers from those peers, refreshing the view with these buffers, and increasing the age of the view. The framework also includes a receiver handler that does what the timed event handler does, except that it does not select peers, since the senders are known from what is received. Refreshing the view in these handlers starts by appending the received buffers to the view, then removing duplicates, removing old items, removing the initial few as necessary, and removing some items at random.
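
A compact sketch of that active (timed) handler is shown below; the view size and the healer and swapper parameters are left open by the framework, and this rendering is a deliberately simplified assumption rather than a reference implementation.

```python
# Sketch of the generic peer-sampling active thread: select a peer, permute
# the view, build a buffer seeded with our own fresh descriptor, exchange
# buffers, refresh the view (duplicates, oldest, head, random), and age it.
# C (view size), H (healer), and S (swapper) are illustrative parameters.
import random

C, H, S = 8, 2, 3
MY_ADDRESS = "node-0"

def oldest(view):
    """Tail policy: the descriptor with the highest age."""
    return max(view, key=lambda d: d["age"])

def remove_duplicates(descriptors):
    """Keep only the freshest descriptor per address."""
    freshest = {}
    for d in descriptors:
        prev = freshest.get(d["addr"])
        if prev is None or d["age"] < prev["age"]:
            freshest[d["addr"]] = d
    return list(freshest.values())

def refresh(view, received):
    """Append the received buffer, then trim old, head, and random items to size C."""
    merged = remove_duplicates(view + received)
    merged.sort(key=lambda d: d["age"])                          # youngest first
    del merged[len(merged) - min(H, max(len(merged) - C, 0)):]   # drop oldest
    del merged[:min(S, max(len(merged) - C, 0))]                 # drop from the head
    while len(merged) > C:                                       # drop at random
        merged.pop(random.randrange(len(merged)))
    return merged

def on_timer(view, send, receive):
    """Active thread, run every T time units; send/receive are injected transports."""
    if not view:
        return view
    peer = oldest(view)                                          # select a peer
    random.shuffle(view)                                         # permute the view
    buffer = [{"addr": MY_ADDRESS, "age": 0}] + view[:C // 2]    # fresh self + sample
    send(peer["addr"], buffer)
    received = receive(peer["addr"])                             # push-pull exchange
    view = refresh(view, received)
    for d in view:
        d["age"] += 1                                            # age the view
    return view
```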

Cyclon, as a peer sampling example, uses a highest-age (tail) strategy for peer selection, a push-pull mode for view propagation, and a swapper policy for view selection. A Cyclon peer picks the oldest peer from its view and removes it from the view. It exchanges some of its neighbors with that peer (the swap policy), and the active peer sends its own fresh address. If there are no network disruptions, no peer becomes disconnected in the undirected graph. Pointers are updated so that peers change from being the neighbor of one peer to being the neighbor of another. The algorithm starts from a state where peers are connected in a chain and converges to a graph that has the same average path length as a random graph.
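
A sketch of a single Cyclon shuffle under the swapper policy, assuming the views are simple address-to-age maps and the shuffle length is a free parameter, might look like this:

```python
# Sketch of one Cyclon shuffle between an active peer A and a passive peer B.
# Views map neighbor address -> age; SHUFFLE_LENGTH is an illustrative parameter.
import random

SHUFFLE_LENGTH = 3

def cyclon_shuffle(view_a, addr_a, view_b, addr_b):
    """The caller is assumed to have chosen B as A's oldest neighbor (tail policy)."""
    view_a.pop(addr_b, None)                       # A removes the oldest entry (B)

    # A sends a sample of its view plus a fresh descriptor of itself (age 0).
    sent_by_a = dict(random.sample(sorted(view_a.items()),
                                   min(SHUFFLE_LENGTH - 1, len(view_a))))
    sent_by_a[addr_a] = 0

    # B replies with a sample of its own view (push-pull propagation).
    sent_by_b = dict(random.sample(sorted(view_b.items()),
                                   min(SHUFFLE_LENGTH, len(view_b))))

    # Swapper policy: each side discards what it sent and keeps what it received.
    for addr in sent_by_a:
        view_a.pop(addr, None)
    for addr in sent_by_b:
        view_b.pop(addr, None)
    view_a.update({a: age for a, age in sent_by_b.items() if a != addr_a})
    view_b.update({a: age for a, age in sent_by_a.items() if a != addr_b})
    return view_a, view_b

# Example: node "a" shuffles with its oldest neighbor "b".
a_view = {"b": 5, "c": 2, "d": 1}
b_view = {"e": 4, "f": 0, "a": 3}
print(cyclon_shuffle(a_view, "a", b_view, "b"))
```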

T-Man is a protocol that can construct and maintain any topology with the help of a ranking function. The ranking function orders any set of nodes according to their desirability as neighbors of a given node. As with generic peer sampling algorithms, T-Man has a timed event that runs every T time units. It involves selecting peers into a view, permuting the view, merging a descriptor formed from the current node's address and profile into a buffer, merging a random sample of the selected peers into the buffer, sending the buffer to the selected peers, receiving the corresponding buffers from those peers, refreshing the view with these buffers by the same merge operation, and selecting the view as a subset of the buffer. It also includes a receiver handler that does what the timed event handler does, except that it does not select peers, since the senders are known from what is received. Selecting the view from the buffer sorts all the nodes in the buffer by the ranking function and picks out the highest-ranked nodes. The ranking function could, for example, be based on distance along a line or around a ring.
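
A sketch of the ranking-driven selection for a ring topology is shown below; the id space, view size, and distance function are illustrative choices, not prescribed by T-Man.

```python
# Sketch of T-Man's ranking-driven view selection for a ring topology.
# Node ids, view size, and the distance-based ranking are illustrative.
ID_SPACE = 2 ** 16
VIEW_SIZE = 6

def ring_distance(a, b):
    """Ranking function for a ring topology: lower distance is a better neighbor."""
    d = abs(a - b) % ID_SPACE
    return min(d, ID_SPACE - d)

def select_view(my_id, candidates):
    """Sort all candidate descriptors by rank and keep the best VIEW_SIZE."""
    unique = {c["id"]: c for c in candidates if c["id"] != my_id}
    ranked = sorted(unique.values(), key=lambda c: ring_distance(my_id, c["id"]))
    return ranked[:VIEW_SIZE]

def build_buffer(my_id, my_profile, view):
    """Buffer sent to a selected peer: our own descriptor merged with the view."""
    return [{"id": my_id, "profile": my_profile}] + view

# A node with id 100 refreshes its view with a buffer received from a peer.
view = [{"id": 9000}, {"id": 220}, {"id": 64000}]
received = [{"id": 120}, {"id": 95}, {"id": 30000}, {"id": 101}]
view = select_view(100, view + received)
print([d["id"] for d in view])            # ids closest to 100 on the ring
outgoing = build_buffer(100, None, view)  # what this node would send next round
```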

Epidemic algorithms are important techniques for solving problems in dynamic, large-scale systems; they are scalable, simple, and robust to node failures. Their applications include aggregation, membership management, and topology management.

Monday, May 8, 2023

 

Disaster recovery:

Preparedness in the event of an emergency arising from a region-wide failure constitutes a disaster recovery plan. The approaches can be broadly categorized into four, ranging from the low cost and complexity of taking backups to more complex and costlier options involving partial or fully redundant deployments.

On a sliding scale of the tradeoffs mentioned, these can be pegged as follows:

- Backup & Restore: RTO/RPO in hours; for lower-priority workloads; lowest cost ($).
- Pilot Light: RTO/RPO of roughly an hour; keeps live data replicated; low cost ($$).
- Warm Standby: RTO/RPO on the order of minutes; for business-critical workloads; higher cost ($$$).
- Multi-region active-active: near real-time recovery, towards zero data loss; highest cost ($$$$).

One thing to call out here is that it is a myth that the price increases linearly from one option to the next, because the unit is not a full deployment stamp comprising multiple resource types. Instead, the geo-DR or geo-redundant features are built into the resource types, and by careful selection the price of one option can be lower than another for different baskets of selections.

The following are questions to ask when drawing up a DR plan for some of the cloud service products.

Databricks:

How many notebooks, repos, clusters, jobs does your Databricks instance have?

Has all the data been backed up to Azure Storage Accounts?

Do you export your workspace to a repo with the Databricks workspace CLI command? (A sketch of such an export appears after these questions.)

Do you have many upstream data sources?

Do you need to replicate a lot of data between failover and failback? Would it be possible to be selective about your data or to leverage storage accounts that are accessible from other regions? Please exclude data in clusters/instances and consider only data sources.

Do you have any control plane data such as notebooks, source code, job configurations, cluster management, and user/group ACL data that is not already in IaC or GitHub and needs to be replicated to the secondary region?

Do you have any network configuration in the data plane, such as firewall rules or NAT configuration, that must be replicated to the secondary region and is part of tfstate files?

Do you have a priority order for the processes that are critical to the business and must be replicated in the event of a regional, service-wide cloud provider outage?

Do you have streaming data that you ingest via, say, Kafka queues, change data capture streams, file-based continuous processing, or trigger-once file processing? Checkpoints can be replicated too if they are not already in managed storage such as a managed disk or a storage account.

Do you use Managed Disks that are greater than 32 TiB in size?

Do you want to participate in a drill and recovery procedure?
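
As referenced in the workspace-export question above, here is a minimal sketch of one way to take such a backup using the Databricks workspace export REST endpoint (the workspace CLI export command is an alternative); the host, token, and paths are placeholders, and it exports a single folder in DBC format for brevity.

```python
# Sketch: back up a Databricks workspace folder for DR by exporting it in
# DBC format via the workspace REST API. Host, token, and paths are
# illustrative placeholders.
import base64
import requests

HOST = "https://<your-workspace>.azuredatabricks.net"   # placeholder
TOKEN = "<personal-access-token>"                        # placeholder

def export_folder(workspace_path, out_file):
    resp = requests.get(
        f"{HOST}/api/2.0/workspace/export",
        headers={"Authorization": f"Bearer {TOKEN}"},
        params={"path": workspace_path, "format": "DBC"},
        timeout=60,
    )
    resp.raise_for_status()
    # Without direct download, the API returns the archive base64-encoded.
    with open(out_file, "wb") as f:
        f.write(base64.b64decode(resp.json()["content"]))

if __name__ == "__main__":
    export_folder("/Shared", "shared-folder-backup.dbc")
```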