Sunday, March 20, 2022

 Azure Service Fabric: 

Introduction: This is a continuation of a series of articles on Azure services from an operational engineering perspective and the role their design and algorithms play in the field. Most recently we discussed Azure Functions, linked here. This article turns to Microsoft Service Fabric, with the intention of comparing Paxos, Raft, and Service Fabric's own membership protocol.

Discussion: 

Service Fabric is Microsoft's distributed platform for building, running, and maintaining microservice applications in the cloud. It is a container orchestrator, and it can provide quality of service to microservice framework models such as stateful, stateless, and actor-based services. It differs from Azure Container Service in that it is a Platform-as-a-Service offering rather than an Infrastructure-as-a-Service offering, and there is also a Service Fabric Mesh offering that provides a fully managed experience for Service Fabric applications. Service Fabric provides its own specific programming model, allows guest executables, and orchestrates Docker containers. It supports both Windows and Linux but is primarily suited to Windows, and it can scale to handle Internet-of-Things traffic. It is open to all workloads and technology-agnostic: it relies on Docker and supports both Windows and Linux containers, while providing a more integrated, feature-rich orchestration with greater openness and flexibility.

Cloud-native container applications are evaluated against the twelve-factor methodology for building web applications and software-as-a-service, which demands the following:

  • Use of declarative formats for setup automation, minimizing time and cost for new developers joining the project

  • Use of a clean contract with the underlying operating system, offering maximum portability between execution environments

  • Suitability for deployment on modern cloud platforms, avoiding the need for servers and system administration

  • Ability to minimize divergence between development and production, enabling continuous deployment for maximum agility

  • Ability to scale up without significant changes to tooling, architecture, or development practices

Service Fabric encourages all of these so that its workloads can focus more on their business requirements. 

 

It is often compared to Kubernetes, which is also a container-orchestration framework that hosts applications. Kubernetes extends the idea of app-plus-container all the way to the cluster, where the hosts are the nodes of the cluster. Kubernetes evolved as an industry effort from the operating system's native Linux container support, and it can be considered a step toward a truly container-centric development environment. Containers decouple applications from infrastructure, which separates dev from ops, and they demonstrate better resource isolation and improved resource utilization. Kubernetes is not a traditional, all-inclusive PaaS. Unlike a PaaS, which restricts applications, dictates the choice of application frameworks, restricts supported language runtimes, or distinguishes apps from services, Kubernetes aims to support an extremely diverse variety of workloads: if an application can run in a container, it will run on Kubernetes. A PaaS provides databases, message buses, and cluster storage systems, but those can run on Kubernetes. There is also no click-to-deploy service marketplace. Kubernetes does not build user code or deploy it; however, it facilitates CI workflows running on it. Kubernetes lets users choose their own logging, monitoring, and alerting stacks. It also does not require a comprehensive application language or system, and it is independent of machine configuration or management. Meanwhile, a PaaS can run on Kubernetes and extend its reach to different clouds.

Service Fabric is used by some of the largest services in the Azure cloud portfolio, but it comes with a different history, different goals, and a different design. The entire Microsoft Azure Stack hybrid offering relies on Service Fabric to run all of the platform's core services. Kubernetes is a heavily centralized system: it has an API server in the middle and agents called kubelets installed on all worker nodes. All kubelets communicate with the API server, which saves cluster state in a centralized repository – the etcd cluster, a Raft-backed distributed key-value store. Cluster membership is maintained by requiring kubelets to keep a connection to the API server and send periodic heartbeats. Service Fabric avoids this with a decentralized ring and failure detection. It is motivated by the adage that distributed consensus is at the heart of numerous coordination problems, but that consensus has a trivially simple solution if a reliable failure-detection service exists. Service Fabric does not use a distributed consensus protocol like Raft or Paxos, nor a centralized store for cluster state. Instead, its Federation subsystem answers the most important membership question: whether a specific node is part of the system. The nodes are organized in a ring, and heartbeats are sent only to a small subset of nodes called the neighborhood. The arbitration procedure involves more nodes besides the monitor, but it is executed only on missed heartbeats. A quorum of privileged nodes in an arbitration group helps resolve possible failure-detection conflicts, isolate the appropriate nodes, and maintain a consistent view of the ring.
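
A minimal sketch of these ideas follows. The types, neighborhood size, lease timeout, and quorum rule are illustrative assumptions, not the actual Federation subsystem implementation.

using System;
using System.Collections.Generic;
using System.Linq;

public sealed class RingNode
{
    public ulong Id { get; }                    // the node's position on the ring
    public DateTime LastHeartbeat { get; set; } // last lease renewal seen from this node
    public RingNode(ulong id) { Id = id; LastHeartbeat = DateTime.UtcNow; }
}

public sealed class FederationRing
{
    private readonly List<RingNode> _ring;                              // sorted by ring position
    private readonly TimeSpan _leaseTimeout = TimeSpan.FromSeconds(30); // assumed lease duration

    public FederationRing(IEnumerable<ulong> nodeIds) =>
        _ring = nodeIds.OrderBy(id => id).Select(id => new RingNode(id)).ToList();

    // Heartbeats go only to a small neighborhood: the nearest successors and
    // predecessors on the ring, not to every node or to a central server.
    public IEnumerable<RingNode> Neighborhood(ulong nodeId, int span = 2)
    {
        int i = _ring.FindIndex(n => n.Id == nodeId);
        if (i < 0) yield break;
        for (int d = 1; d <= span; d++)
        {
            yield return _ring[(i + d) % _ring.Count];                // successor
            yield return _ring[(i - d + _ring.Count) % _ring.Count];  // predecessor
        }
    }

    // On missed heartbeats, the monitor does not evict the suspect on its own:
    // it consults an arbitration group and acts only on a quorum of votes.
    public bool ShouldEvict(RingNode suspect, IReadOnlyList<bool> votesToEvict)
    {
        bool leaseExpired = DateTime.UtcNow - suspect.LastHeartbeat > _leaseTimeout;
        return leaseExpired && votesToEvict.Count(v => v) > votesToEvict.Count / 2;
    }
}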

Conclusion: 

Service Fabric recognizes different types of workloads and is particularly well-suited for stateful workloads. 

 

Saturday, March 19, 2022

 

Identity claims:      

This is a continuation of a series of articles on Azure services from an operational engineering perspective, with the most recent introduction to Azure Functions linked here. This article discusses the Microsoft identity model.

One of the most important aspects of code that executes in the cloud is the identity invoking it, along with the claims it presents. Identity refers to a security principal that is protected and whose access is well managed. The data associated with this entity and its representation fall under the identity and access management routines of a cloud service. While executing code to provide certain functionality requires privileges, and an identity must present claims to show that the code can be executed, services operate with a least-privilege policy so that unauthenticated users are rejected and unauthorized access is forbidden.

The use of principals and access control is as critical to Azure Functions as it is to any other service. Code must not only demand privilege; the identity must also present adequate claims. This is particularly relevant to user management.

 

Roles provide a convenient way of bundling these privileges and enabling access control. Roles can also be graded to carry incremental privileges, and certain deployment actions can likewise be performed with incremental privileges.

One of the fundamental aspects of privileged code execution is the ability to add privilege on demand.

The following code illustrates a way to do this programmatically:

 

using System.Collections.Generic;
using System.Threading;
using Microsoft.IdentityModel.Claims;
using IdentityClaim = Microsoft.IdentityModel.Claims.Claim;
using IdentityClaimTypes = Microsoft.IdentityModel.Claims.ClaimTypes;
using IdentityClaimsPrincipal = Microsoft.IdentityModel.Claims.ClaimsPrincipal;
using ClaimsIdentityCollection = Microsoft.IdentityModel.Claims.ClaimsIdentityCollection;

// Copy the current thread's identity so its existing claims are preserved.
IClaimsIdentity claimsIdentity = new ClaimsIdentity(Thread.CurrentPrincipal.Identity);

// Compose a role claim for the privileged operation on the target resource.
var claimValue = string.Format("claim://{0}@{1}", DsmsResourceRole.PrivilegedDeploymentOperator, "sample-resource-folder-test");
var identityClaim = new IdentityClaim(IdentityClaimTypes.Role, claimValue);
claimsIdentity.Claims.Add(identityClaim);

// Build a new principal from the augmented identity and make it current.
ClaimsIdentityCollection claimsIdentityCollection = new ClaimsIdentityCollection(new List<IClaimsIdentity>() { claimsIdentity });
var newIcp = IdentityClaimsPrincipal.CreateFromIdentities(claimsIdentityCollection);
Thread.CurrentPrincipal = newIcp;

 

 

The above example uses the Microsoft.IdentityModel namespace to elevate the privileges of the current principal before running some code.

 

Friday, March 18, 2022

 Azure Functions:      

This is a continuation of a series of articles on Azure services from an operational engineering perspective, with the most recent introduction to Azure Functions linked here. This article continues to discuss Azure Functions' best practices with a focus on organization and scalability.

 

Performance and scalability concerns are clearer with serverless function apps. Large, long-running functions can cause unexpected timeout issues. Bloat is noticeable when an application includes many libraries; for example, a Node.js function application can have dozens of dependencies, and importing them increases load times, which results in unexpected timeouts. The dependency tree can be of arbitrary breadth and depth. Whenever possible, it is best to split function applications into sets that work together and return responses quickly. One way of doing this is to keep an HTTP trigger function separate from a queue trigger function, such that the HTTP trigger returns an acknowledgment while the payload is placed on a queue to be processed subsequently, as the sketch below illustrates.
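
The following is a minimal sketch of that pattern using the in-process C# programming model; the function names, queue name, and payload handling are illustrative assumptions.

using System.IO;
using System.Threading.Tasks;
using Microsoft.AspNetCore.Http;
using Microsoft.AspNetCore.Mvc;
using Microsoft.Azure.WebJobs;
using Microsoft.Azure.WebJobs.Extensions.Http;
using Microsoft.Extensions.Logging;

public static class OrderIntake
{
    // Fast path: accept the request, enqueue the payload, return 202 immediately.
    [FunctionName("AcceptOrder")]
    public static async Task<IActionResult> AcceptOrder(
        [HttpTrigger(AuthorizationLevel.Function, "post")] HttpRequest req,
        [Queue("orders")] IAsyncCollector<string> queue,
        ILogger log)
    {
        string payload = await new StreamReader(req.Body).ReadToEndAsync();
        await queue.AddAsync(payload);
        log.LogInformation("Order accepted and queued.");
        return new AcceptedResult();
    }

    // Slow path: the heavy processing runs here, decoupled from the HTTP response.
    [FunctionName("ProcessOrder")]
    public static void ProcessOrder([QueueTrigger("orders")] string payload, ILogger log)
    {
        log.LogInformation("Processing queued order: {payload}", payload);
    }
}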

 

Function applications in production should be as overhead-free as possible. There should not be any test-related functions in them. Shared code between functions should live in its own folder; otherwise multiple versions of it will proliferate. Memory usage is averaged across functions, so the fewer there are, the better. Verbose logging in production code has a negative impact on performance.

Asynchronous calls can be used to avoid blocking. Asynchronous programming is a recommended best practice, especially when blocking I/O operations are involved. Each host instance for functions uses a single worker process by default. The FUNCTIONS_WORKER_PROCESS_COUNT setting can increase the number of worker processes per host up to 10, and function invocations are distributed across these workers.
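
A minimal sketch of the non-blocking style, with an illustrative function name:

using System.IO;
using System.Threading.Tasks;
using Microsoft.AspNetCore.Http;
using Microsoft.AspNetCore.Mvc;
using Microsoft.Azure.WebJobs;
using Microsoft.Azure.WebJobs.Extensions.Http;

public static class EchoFunction
{
    [FunctionName("EchoAsync")]
    public static async Task<IActionResult> Run(
        [HttpTrigger(AuthorizationLevel.Function, "post")] HttpRequest req)
    {
        // Good: the worker thread is released while the body is read.
        using var reader = new StreamReader(req.Body);
        string payload = await reader.ReadToEndAsync();
        // Avoid: reader.ReadToEndAsync().Result, which blocks the worker thread.
        return new OkObjectResult(payload);
    }
}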

Messages can be batched, which leads to better performance. This can be configured in the host.json file associated with the function application. C# functions can change the binding type to a strongly typed array to denote batching; other languages might require setting the cardinality in function.json.
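
As a minimal sketch in C#, binding an Event Hubs trigger to an array receives events in batches; the hub name and connection setting are illustrative assumptions.

using Microsoft.Azure.WebJobs;
using Microsoft.Extensions.Logging;

public static class TelemetryBatch
{
    [FunctionName("ProcessTelemetryBatch")]
    public static void Run(
        [EventHubTrigger("telemetry", Connection = "EventHubConnection")] string[] messages, // array type opts into batching
        ILogger log)
    {
        foreach (var message in messages)
            log.LogInformation("Received event: {message}", message);
    }
}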

Host behaviors can be configured to better handle concurrency. Host runtime and trigger behaviors can be configured in the host.json file. Concurrency can be managed for a number of triggers, and these settings strongly influence how function applications scale.

Settings apply across all functions in the application for a single instance of that application. For example, in a function application with two HTTP functions and maxConcurrentRequests set to 25, a request to either trigger counts toward the shared concurrency limit, and when the application is scaled out to, say, ten instances, the effective maximum is 250 concurrent requests across those instances.
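
The corresponding host.json fragment for that limit would look like the following sketch:

{
  "version": "2.0",
  "extensions": {
    "http": {
      "maxConcurrentRequests": 25
    }
  }
}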

Thursday, March 17, 2022

 

Azure Functions:    

This is a continuation of a series of articles on Azure services from an operational engineering perspective, with the most recent introduction to Azure Functions linked here. This article continues to discuss Azure Functions' best practices with a focus on organization and scalability.

 

Performance and scalability concerns are clearer with serverless function apps. Large, long-running functions can cause unexpected timeout issues. Bloat is noticeable when an application includes many libraries; for example, a Node.js function application can have dozens of dependencies, and importing them increases load times, which results in unexpected timeouts. The dependency tree can be of arbitrary breadth and depth. Whenever possible, it is best to split function applications into sets that work together and return responses quickly. One way of doing this is to keep an HTTP trigger function separate from a queue trigger function, such that the HTTP trigger returns an acknowledgement while the payload is placed on a queue to be processed subsequently.

Scalability is also improved when functions are stateless. When state is stored with the data rather than in the function, the function itself remains stateless. For example, an order carries its own state, so repeated processing transitions the order from one state to another while the functions remain stateless, as sketched below.
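
A minimal sketch of this idea follows; the Order type, its states, and the queue names are hypothetical.

using Microsoft.Azure.WebJobs;
using Microsoft.Extensions.Logging;

public enum OrderState { Received, Validated, Shipped }

public class Order
{
    public string Id { get; set; }
    public OrderState State { get; set; }
}

public static class OrderProcessor
{
    // The function holds no state of its own: the order carries its state,
    // so any instance can process it and retries are safe.
    [FunctionName("AdvanceOrder")]
    [return: Queue("orders-next")]
    public static Order Run([QueueTrigger("orders")] Order order, ILogger log)
    {
        order.State = order.State switch
        {
            OrderState.Received => OrderState.Validated,
            OrderState.Validated => OrderState.Shipped,
            _ => order.State
        };
        log.LogInformation("Order {id} advanced to {state}", order.Id, order.State);
        return order;
    }
}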

When a solution comprises multiple functions, they are often combined into a single function app, but they can also run in separate function applications. Multiple function applications can also share the same resources by running in the same plan if a Premium or Dedicated hosting plan has been selected. Each function has a memory footprint, and when there are many functions in the same application, the application might load more slowly than a single-function application. Different functions might also have disproportionate requirements, and those with excessive usage can be put in their own dedicated function application. Grouping functions in a function application can also be motivated by load profiles: if one function receives frequent messages and another is more resource-intensive, they can be put in separate function applications.

Function applications have a host.json file that can be used to configure advanced behavior of function triggers and the Azure Functions runtime. Changes to this file apply to all functions in the application, so organizing functions based on this shared configuration can alleviate operational concerns.

Functions can also be organized by privilege. Connection strings and other credentials stored in application settings need not be accessible to all functions; the functions that use them can be grouped into one function application.

Function scaling depends a great deal on managing connections and storage accounts. It is best to avoid sharing storage accounts between function applications. Test and production code should not be mixed in the same function app, and it is preferable not to have verbose logging in production code because it affects performance.
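
On the connection-management point, a minimal sketch: reuse a single static client across invocations so that connections are pooled rather than exhausted as the application scales out. The downstream endpoint is illustrative.

using System.Net.Http;
using System.Threading.Tasks;
using Microsoft.AspNetCore.Http;
using Microsoft.AspNetCore.Mvc;
using Microsoft.Azure.WebJobs;
using Microsoft.Azure.WebJobs.Extensions.Http;

public static class RelayFunction
{
    // Created once per worker process and shared by every invocation.
    private static readonly HttpClient Client = new HttpClient();

    [FunctionName("Relay")]
    public static async Task<IActionResult> Run(
        [HttpTrigger(AuthorizationLevel.Function, "get")] HttpRequest req)
    {
        var payload = await Client.GetStringAsync("https://example.com/data"); // hypothetical endpoint
        return new OkObjectResult(payload);
    }
}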

 

 

 

Wednesday, March 16, 2022

 

Azure Functions:     

This is a continuation of a series of articles on Azure services from an operational engineering perspective, with the most recent introduction to Azure Functions linked here. This article continues to discuss Azure Functions' best practices with a focus on performance and reliability.

Performance and scalability concerns are clearer with serverless function apps. Large, long-running functions can cause unexpected timeout issues. Bloat is noticeable when an application includes many libraries; for example, a Node.js function application can have dozens of dependencies, and importing them increases load times, which results in unexpected timeouts. The dependency tree can be of arbitrary breadth and depth. Whenever possible, it is best to split function applications into sets that work together and return responses quickly. One way of doing this is to keep an HTTP trigger function separate from a queue trigger function, such that the HTTP trigger returns an acknowledgement while the payload is placed on a queue to be processed subsequently.

State transitions and communication between multiple functions should be managed with Durable Functions or Azure Logic Apps; otherwise, a queue can be used to store messages for cross-function communication. The main reason is that storage queues are cheaper and much easier to provision than other storage options.

Individual messages in a storage queue are limited in size to 64 KB. Azure Service Bus queues can carry messages up to 256 KB in the Standard tier and up to 100 MB in the Premium tier. Use Service Bus topics when messages need to be filtered. Event Hubs is useful for supporting high-volume communications.

Scalability is also improved when functions are stateless. When state is stored with the data rather than in the function, the function itself remains stateless. For example, an order carries its own state, so repeated processing transitions the order from one state to another while the functions remain stateless.

Idempotent functions are recommended for use with timer triggers because they make retry behavior correct. If a function must run once a day, it can be written so that running it at any time of day yields the same results: it can exit when there is no work for a particular day, and it can resume if the previous run failed, as in the sketch below.
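
A minimal sketch of such a function follows; the schedule and the in-memory checkpoint are illustrative, and production code would persist the checkpoint durably.

using System;
using Microsoft.Azure.WebJobs;
using Microsoft.Extensions.Logging;

public static class DailyJob
{
    private static DateTime _lastCompletedDay = DateTime.MinValue; // placeholder checkpoint

    [FunctionName("DailyCleanup")]
    public static void Run([TimerTrigger("0 0 3 * * *")] TimerInfo timer, ILogger log) // 03:00 daily
    {
        var today = DateTime.UtcNow.Date;
        if (_lastCompletedDay == today)
        {
            log.LogInformation("Work for {day} is already done; exiting.", today);
            return; // idempotent: re-running on the same day is a harmless no-op
        }
        // ... do the day's work here ...
        _lastCompletedDay = today; // checkpoint only after success, so a failed run is retried
    }
}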

Defensive functions are built to handle exceptions, which enables a function to continue from a previous failure point on the next run. Processing that involves a large number of records from a database must handle a variety of faults, such as those caused by network outages or quota limits, and tracking the progress of execution is especially helpful under such circumstances.
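
A minimal sketch of such defensive, resumable processing follows; the data-access stubs and the in-memory checkpoint are hypothetical placeholders.

using System;
using System.Collections.Generic;
using System.Threading.Tasks;
using Microsoft.Azure.WebJobs;
using Microsoft.Extensions.Logging;

public static class MigrationJob
{
    private static int _lastProcessedIndex = 0; // placeholder; persist in a blob or table in practice

    [FunctionName("MigrateRecords")]
    public static async Task Run([TimerTrigger("0 0 * * * *")] TimerInfo timer, ILogger log)
    {
        IReadOnlyList<string> records = await FetchRecordsAsync(_lastProcessedIndex);
        foreach (var record in records)
        {
            try
            {
                await ProcessRecordAsync(record);  // may fail on outages or quota limits
                _lastProcessedIndex++;             // track progress after each success
            }
            catch (Exception ex)
            {
                log.LogError(ex, "Stopped at record {index}; the next run resumes here.", _lastProcessedIndex);
                return; // exit cleanly; progress is preserved for the next run
            }
        }
    }

    // Hypothetical data-access stubs, for illustration only.
    private static Task<IReadOnlyList<string>> FetchRecordsAsync(int startIndex) =>
        Task.FromResult<IReadOnlyList<string>>(new List<string>());
    private static Task ProcessRecordAsync(string record) => Task.CompletedTask;
}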

 

 

Tuesday, March 15, 2022

 Azure Functions:    

This is a continuation of a series of articles on Azure services from an operational engineering perspective, with the most recent introduction to Azure Functions linked here. This article continues to discuss Azure Functions' best practices with a focus on retry and error handling.

When we want code to be triggered by events, Azure Functions is very useful because it provides a compute-on-demand experience. It extends the existing Azure App Service platform with capabilities to run code triggered by events occurring in Azure, in third-party services, and in on-premises systems. Functions can be built to be reactive, but they are also useful for processing data from various data sources. Functions are hosted in environments that, like VMs, are susceptible to faults such as restarts, moves, or upgrades. Functions may also only be as reliable as the APIs they invoke, but they can scale out so that they never become a bottleneck. The previous articles discussed best practices around security and concurrency; this article continues with retry and error handling.

Intermittent failures can be overcome with retries. For example, a database may not be reachable within the limited timeout that is usually set for scripts, and nested commands may each consume their own timeouts. Such errors do not occur under normal conditions, so they escape functional tests. If a limited number of retries is attempted, the code executes successfully and the issue does not recur. Retry also has the side effect of reducing the cost of detecting and investigating spurious errors in logs, incidents, and false defects.
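
A minimal sketch of a declarative retry policy follows, using the FixedDelayRetry attribute available in recent versions of the Functions runtime; the schedule and the work itself are placeholders.

using System.Threading.Tasks;
using Microsoft.Azure.WebJobs;
using Microsoft.Extensions.Logging;

public static class SyncJob
{
    // Up to five attempts, ten seconds apart, before the invocation is failed.
    [FunctionName("SyncRecords")]
    [FixedDelayRetry(5, "00:00:10")]
    public static async Task Run([TimerTrigger("0 */5 * * * *")] TimerInfo timer, ILogger log)
    {
        // A transient fault thrown here (for example, a database timeout) fails
        // this attempt, and the runtime re-invokes the function per the policy.
        await Task.CompletedTask; // placeholder for the real work
    }
}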

Error-handling practices are important to avoid loss of data or missed messages. The recommended error-handling best practices include enabling Application Insights, using structured error handling, designing for idempotency, and implementing retry policies. For example, the topmost level of any function's code should have a try/catch block, and the catch block should log errors.
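
A minimal sketch of that top-level structure; the queue name and the processing step are illustrative.

using System;
using Microsoft.Azure.WebJobs;
using Microsoft.Extensions.Logging;

public static class WorkItemFunction
{
    [FunctionName("ProcessWorkItem")]
    public static void Run([QueueTrigger("work-items")] string message, ILogger log)
    {
        try
        {
            // ... process the message here ...
        }
        catch (Exception ex)
        {
            // Structured logging flows to Application Insights when it is enabled.
            log.LogError(ex, "Processing failed for message {message}", message);
            throw; // rethrow so retry policies and poison-queue handling still apply
        }
    }
}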