Thursday, March 24, 2022

 

Service Fabric (continued)   

Part 2 compared Paxos and Raft. Part 3 discussed SF-Ring.

This article continues the discussion on Service Fabric with a focus on its architecture. Service Fabric is built with layered subsystems that enable us to write applications that are highly available, scalable, manageable, and testable.

The major subsystems in Service Fabric include the following:

-          the transport subsystem, which secures point-to-point communication and forms the base layer

-          the federation subsystem, which federates a set of nodes to form a consistent, scalable fabric

-          the communication subsystem, which performs service discovery

-          the reliability subsystem, which provides reliability, availability, replication, and service orchestration

-          the hosting and activation subsystem, which manages the application lifecycle on a node

-          the management subsystem, which performs deployment, upgrade, and monitoring

-          the testability subsystem, which performs fault injection and testing in production

-          the application model, which provides a declarative application description

-          the native and managed APIs, which support reliable, scalable applications

Service Fabric resolves service locations through its communication subsystem. The application programming models exposed to developers are layered on top of these subsystems, along with the application model, to enable tooling.

 

The transport subsystem implements a point-to-point datagram communication channel used for communication within Service Fabric clusters and between the cluster and its clients. It also supports broadcast and multicast in the federation layer and provides encrypted communication. It is not exposed to users.
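To make the idea of a point-to-point datagram channel concrete, here is a minimal sketch in Python using plain UDP sockets. This is purely illustrative; Service Fabric's actual transport is internal, secured, and not exposed to users.

```python
# A minimal sketch of a point-to-point datagram channel, for illustration only;
# Service Fabric's real transport is internal and not exposed to users.
import socket

def send_datagram(host: str, port: int, payload: bytes) -> None:
    """Send a single datagram to a peer node (illustrative, no encryption)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))

def receive_datagram(port: int, bufsize: int = 4096) -> bytes:
    """Block until one datagram arrives on the given port and return its payload."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.bind(("0.0.0.0", port))
        data, _addr = sock.recvfrom(bufsize)
        return data
```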

 

The federation subsystem stitches the various nodes into a single unified cluster. It provides the distributed systems primitives needed by the other subsystems: failure detection, leader election, and consistent routing. It is built on a distributed hash table with a 128-bit token space, arranged as a ring topology over the nodes.

The reliability subsystem consists of the Replicator, the Failover Manager, and the Resource Balancer. The Replicator ensures that state changes in the primary service replica are replicated to the secondaries and kept in sync. The Failover Manager ensures that load is automatically redistributed across nodes when nodes are added or removed. The Resource Balancer places service replicas across failure domains in the cluster and ensures that all failover units are operational.
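A rough sketch of the idea behind quorum-acknowledged replication from a primary to its secondaries is shown below. The `Replica` and `replicate` names are hypothetical and heavily simplified; they are not the actual Replicator API.

```python
# Hypothetical sketch of quorum-acknowledged replication from a primary replica
# to its secondaries; Replica and replicate() are illustrative names, not the
# actual Service Fabric Replicator API.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Replica:
    name: str
    log: List[str] = field(default_factory=list)

    def apply(self, operation: str) -> bool:
        """Append the replicated operation and acknowledge."""
        self.log.append(operation)
        return True

def replicate(primary: Replica, secondaries: List[Replica], operation: str) -> bool:
    """Commit only when a write quorum (majority of all replicas) has applied it."""
    primary.apply(operation)
    acks = 1  # the primary's own copy
    for secondary in secondaries:
        if secondary.apply(operation):
            acks += 1
    quorum = (len(secondaries) + 1) // 2 + 1
    return acks >= quorum

# Example: one primary and two secondaries form a quorum of 2
primary = Replica("P")
secondaries = [Replica("S1"), Replica("S2")]
print(replicate(primary, secondaries, "set x=1"))  # True
```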

The management subsystem consists of the Cluster Manager, the Health Manager, and the Image Store. The Cluster Manager places applications on nodes based on service placement constraints. The Health Manager enables health monitoring of applications, services, and cluster entities. The Image Store service provides storage and distribution of the application binaries.

The hosting subsystem manages the lifecycle of an application on a node.

The communication subsystem provides reliable messaging within the cluster and service discovery through the Naming service.
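The following is a hypothetical, in-memory sketch of the kind of name-to-endpoint resolution the Naming service performs; `NamingRegistry` and its methods are illustrative and not the actual Service Fabric client API.

```python
# Hypothetical in-memory sketch of name -> endpoint resolution, illustrating the
# role the Naming service plays; this is not the actual Service Fabric client API.
from typing import Dict, List, Optional

class NamingRegistry:
    def __init__(self) -> None:
        self._endpoints: Dict[str, List[str]] = {}

    def register(self, service_name: str, endpoint: str) -> None:
        """Record an endpoint for a service name (e.g. when a replica is placed)."""
        self._endpoints.setdefault(service_name, []).append(endpoint)

    def resolve(self, service_name: str) -> Optional[str]:
        """Return one known endpoint for the service, or None if unknown."""
        endpoints = self._endpoints.get(service_name, [])
        return endpoints[0] if endpoints else None

registry = NamingRegistry()
registry.register("fabric:/MyApp/MyService", "http://10.0.0.4:8080")
print(registry.resolve("fabric:/MyApp/MyService"))  # http://10.0.0.4:8080
```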

The Testability subsystem is a suite of tools specifically designed for testing services built on Service Fabric.
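As a rough illustration of fault injection, the sketch below randomly restarts nodes in a loop; `inject_faults` and `restart_node` are hypothetical names and do not correspond to the actual Testability APIs.

```python
# Hypothetical sketch of a chaos-style fault injection loop; the function names
# are illustrative and not the actual Testability subsystem API.
import random
from typing import Callable, List

def inject_faults(nodes: List[str],
                  restart_node: Callable[[str], None],
                  iterations: int = 3,
                  seed: int = 42) -> None:
    """Pick a random node each iteration and restart it to exercise failover paths."""
    rng = random.Random(seed)
    for _ in range(iterations):
        victim = rng.choice(nodes)
        restart_node(victim)

# Example with a stubbed restart action
inject_faults(["node0", "node1", "node2"], lambda n: print(f"restarting {n}"))
```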

Wednesday, March 23, 2022

 

Service Fabric (continued) 

Part 2 compared Paxos and Raft. Part 3 discussed SF-Ring. This article continues the discussion on Service Fabric with its support for microservices. Microsoft Azure Service Fabric is a distributed systems platform for packaging, deploying, and managing scalable and reliable microservices and containers while supporting native cloud development. Service Fabric helps developers and administrators focus on implementing workloads that are scalable, reliable, and manageable by avoiding the issues regularly caused by complex infrastructure. The major benefits it provides include deploying and evolving services at low cost and high velocity, lowering the cost of adapting to changing business requirements, exploiting the widespread skills of developers, and decoupling packaged applications from workflows and user interactions.

SF provides first-class support for full Application Lifecycle Management (ALM) of cloud applications, from development and deployment through daily management to eventual decommissioning. It provides system services to deploy, upgrade, and detect and restart failed services; discover service locations; manage state; and monitor health. In production environments there can be hundreds of thousands of microservices running in an unpredictable cloud environment, and SF automates the complex task of managing them.

An application in Service Fabric is a collection of constituent microservices (stateful or stateless). Each microservice performs a complete, standalone function and is composed of code, configuration, and data. The code consists of the executable binaries, the configuration consists of service settings that can be loaded at run time, and the data consists of arbitrary static data to be consumed by the microservice. A powerful feature of SF is that each component in the hierarchical application model can be versioned and upgraded independently.
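A minimal sketch of this hierarchical, independently versioned application model might look like the following; the class names and fields are hypothetical and only illustrate the code/configuration/data decomposition.

```python
# Illustrative data model for the hierarchical application description: an
# application is a versioned collection of services, each composed of versioned
# code, configuration, and data packages. All names here are hypothetical.
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class ServiceManifest:
    name: str
    version: str
    code_package: str                                              # executable binaries
    config_package: Dict[str, str] = field(default_factory=dict)   # run-time settings
    data_package: List[str] = field(default_factory=list)          # static data files

@dataclass
class ApplicationManifest:
    name: str
    version: str
    services: List[ServiceManifest] = field(default_factory=list)

app = ApplicationManifest(
    name="fabric:/MyApp",
    version="1.0.0",
    services=[ServiceManifest(name="Orders", version="1.2.0",
                              code_package="Orders.exe",
                              config_package={"QueueLength": "100"})],
)
# Each component can be upgraded independently, e.g. bump only one service version.
app.services[0].version = "1.3.0"
```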

Service Fabric distinguishes itself with support for strong consistency and for stateful microservices. Each of the SF components offers strongly consistent behavior. There were two ways to achieve this: build consistent applications on top of inconsistent components, or build the components themselves to be consistent from the ground up. The end-to-end principle dictates that functionality should be built into the middle of a system only when the performance gain is worth the cost. If consistency were built only at the application layer, each distinct application would bear significant maintenance and reliability costs. If instead consistency is supported at each layer, higher layers can focus on their own relevant notion of consistency, and both weakly consistent and strongly consistent applications can be built on top of Service Fabric. This is easier than building consistent applications over an inconsistent substrate.

Support for stateful microservices, which maintain mutable authoritative state beyond the service request and its response, is a notable achievement of Service Fabric. Stateful microservices can deliver high-throughput, low-latency, fault-tolerant online transaction processing by keeping code and data close together on the same machine. Keeping state local also simplifies application design by removing the need for additional queues and caches.
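The sketch below illustrates the idea of keeping mutable, authoritative state local to the service replica rather than in an external queue or cache; it mimics the spirit of a reliable collection but is not the actual API.

```python
# Hypothetical sketch of a stateful service that keeps its authoritative state
# local to the replica instead of in an external cache or queue; this echoes the
# spirit of a reliable collection but is not the actual Service Fabric API.
from typing import Dict

class CounterService:
    def __init__(self) -> None:
        self._state: Dict[str, int] = {}  # local, mutable, authoritative state

    def increment(self, key: str) -> int:
        """Handle a request and persist the result in local state."""
        self._state[key] = self._state.get(key, 0) + 1
        return self._state[key]

svc = CounterService()
print(svc.increment("visits"))  # 1
print(svc.increment("visits"))  # 2, served from state kept next to the code
```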

Tuesday, March 22, 2022

 

Service Fabric (continued)

Part 2 compared Paxos and Raft. This article continues the discussion with a comparison to Service Fabric.

Kubernetes requires distributed consensus, but Service Fabric relies on its ring. Distributed consensus algorithms like Paxos and Raft must perform leader election; Service Fabric does not. Service Fabric avoids this with a decentralized ring and failure detection. It is motivated by the adage that distributed consensus is at the heart of numerous coordination problems, but these have a trivially simple solution if there is a reliable failure detection service. Service Fabric does not use a distributed consensus protocol like Raft or Paxos, or a centralized store for cluster state. Instead, it proposes a federation subsystem that answers the most important membership question: whether a specific node is part of the system. The nodes are organized in a ring, and heartbeats are sent only to a small subset of nodes called the neighborhood. The arbitration procedure involves more nodes besides the monitor, but it is executed only on missed heartbeats. A quorum of privileged nodes in an arbitration group helps resolve possible failure detection conflicts, isolate the appropriate nodes, and maintain a consistent view of the ring.

Nodes in a Service Fabric cluster are organized in a virtual ring with 2^m points, where m = 128. Nodes and keys are mapped onto points on the ring. A key is owned by the node closest to it, with ties won by the predecessor. Each node keeps track of a configurable number of its immediate successor and predecessor nodes in the ring, called its neighborhood set. Neighbors are used to run SF's membership and failure detection protocol.
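A small sketch of this ownership rule, with a tiny ring (m = 6 instead of 128) for readability, might look like the following; the function names are illustrative.

```python
# Sketch of key ownership on an SF-style ring: a key is owned by the closest
# node, with ties going to the predecessor. m is kept small for readability;
# Service Fabric uses m = 128.
from typing import List

M = 6                      # ring has 2**M points
RING = 2 ** M

def ring_distance(a: int, b: int) -> int:
    """Shortest distance between two points on the ring, in either direction."""
    d = abs(a - b) % RING
    return min(d, RING - d)

def owner(key: int, nodes: List[int]) -> int:
    """Node closest to the key; on a tie, the predecessor (counterclockwise) wins."""
    def tiebreak(node: int) -> int:
        # clockwise distance from node to key; the predecessor has the smaller value
        return (key - node) % RING
    return min(nodes, key=lambda n: (ring_distance(n, key), tiebreak(n)))

nodes = [8, 24, 40, 56]
print(owner(30, nodes))    # 24, the closest node
print(owner(32, nodes))    # tie between 24 and 40; the predecessor 24 wins
```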

Membership and failure detection in Service Fabric rely on two key design principles: (1) a notion of strongly consistent membership, and (2) decoupling failure detection from failure decision. All nodes responsible for monitoring a node X must agree on whether X is up or down. In the SF-Ring, all predecessors and successors in the neighborhood of a node X agree on X's status; this forms a consistent neighborhood. Failure detection protocols can lead to conflicting decisions, so the decision about which nodes have failed is decoupled from the failure detection itself.

Monitoring and leasing make this detection periodic. Heartbeating is fully decentralized: each node is monitored by a subset of other nodes called its monitors. Node X periodically sends a lease renewal request, a heartbeat message with a unique sequence number, to each of its monitors. Each acknowledgment carries a timeout value T0, so a node can wait for T0 before taking action with respect to a suspected neighbor. If a node X suspects a neighbor Y, it sends a fail(Y) message to the arbitrator, and even after receiving the accept(.) message it waits for T0 before reclaiming Y's portion of the ring. Any routing requests received for Y are queued and processed only after the range has been inherited by Y's neighbors.
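The following is a simplified, hypothetical sketch of the monitor-side decision after a missed heartbeat: the failure report goes to the arbitration group, and the monitor acts only after a quorum accepts and the timeout T0 elapses. The names and the rejected-report behavior are simplifications, not the actual protocol.

```python
# Simplified, hypothetical sketch of lease-based failure detection with arbitration.
# A monitor that misses heartbeats from a neighbor asks the arbitration group; only
# after a quorum of arbitrators accepts fail(Y), and the timeout T0 has elapsed,
# does it reclaim Y's portion of the ring. Not the actual protocol or API.
from typing import List

T0_SECONDS = 30  # illustrative lease timeout

def quorum_accepts(arbitrator_votes: List[bool]) -> bool:
    """True if a majority of the arbitration group accepts the failure report."""
    return sum(arbitrator_votes) > len(arbitrator_votes) // 2

def handle_missed_heartbeat(suspect: str, arbitrator_votes: List[bool]) -> str:
    """Decide what the monitor does after missing heartbeats from `suspect`."""
    if quorum_accepts(arbitrator_votes):
        # Queue routing requests for `suspect`, wait T0, then reclaim its token range.
        return f"wait {T0_SECONDS}s, then reclaim {suspect}'s range"
    # Simplifying assumption: a rejected report means the monitor's own view is
    # suspect and it should not act on the failure.
    return "report rejected; take no action"

print(handle_missed_heartbeat("Y", [True, True, False]))
```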

The SF-Ring is a distributed hash table. It provides a seamless way of scaling from small groups to large groups. SF-Ring was developed at around the same time as P2P DHTs like Pastry, Chord, and others. SF-Ring is unique because (1) routing table entries are bidirectional and symmetric, (2) routing is bidirectional, (3) routing tables are eventually convergent, (4) the mapping of nodes and keys is decoupled, and (5) there are consistent routing tokens.

1.       SF-Ring maintains routing partners in routing tables at exponentially increasing distances in the ring. Routing partners are maintained both clockwise and anticlockwise. Most routing partners are symmetric due to this bidirectionality.

2.       A bidirectional routing table enables a node forwarding a message for a key to find another node whose ID is closest to the key and hand the message to it. This is a distributed, greedy form of binary search; a sketch follows this list.

3.       SF nodes use a chatter protocol to continuously exchange routing table information. The symmetric nature of the routing enables failure information to propagate quickly, which leads to eventual convergence of the affected routing table entries.

4.       Nodes and services are mapped onto the ring in a way that is decoupled from the ring structure. Nearby nodes on the ring are selected from different fault domains, and services are mapped in a near-optimal, load-balanced way.

5.       Each SF node owns a portion of the ring, encoded in its token. There is no overlap among tokens owned by nodes, and every token range is eventually owned by at least one node. When a node leaves, its successor and predecessor split its range between them halfway. With these invariants, SF routing will eventually succeed.
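Here is the sketch referenced in point 2: greedy routing over a small ring where each node keeps partners at exponentially increasing distances in both directions. It assumes a toy eight-node ring and illustrative function names, not the actual SF-Ring routing code.

```python
# Simplified sketch of greedy bidirectional routing toward a key's owner. Each hop
# forwards to the routing partner closest to the key in either direction, which
# behaves like a distributed binary search. Toy example, illustrative names only.
from typing import Dict, List

M = 6
RING = 2 ** M
NODES = [0, 8, 16, 24, 32, 40, 48, 56]

def ring_distance(a: int, b: int) -> int:
    d = abs(a - b) % RING
    return min(d, RING - d)

def partners_of(node: int) -> List[int]:
    """Routing partners at exponentially increasing distances, both directions."""
    out = set()
    for hop in (8, 16, 32):
        out.add((node + hop) % RING)   # clockwise partner
        out.add((node - hop) % RING)   # anticlockwise partner
    return sorted(out)

def route(start: int, key: int, routing_partners: Dict[int, List[int]]) -> List[int]:
    """Return the hop sequence from `start` toward the node closest to `key`."""
    path, current = [start], start
    while True:
        best = min(routing_partners[current], key=lambda n: ring_distance(n, key))
        if ring_distance(best, key) >= ring_distance(current, key):
            return path            # no partner is closer; `current` owns the key
        path.append(best)
        current = best

routing_partners = {n: partners_of(n) for n in NODES}
print(route(0, 55, routing_partners))   # [0, 56]: one greedy hop reaches the key's owner
```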

Monday, March 21, 2022

 

Service Fabric (continued)

Part 1 introduced the Service Fabric. This article continues the discussion with a comparison of the messaging protocols:

Service Fabric is not like Kubernetes, which is a heavily centralized system relying on an API server, multiple Kubelets, a central etcd cluster repository, and heartbeats collected every period. Service Fabric avoids this with a decentralized ring and failure detection. It is motivated by the adage that distributed consensus is at the heart of numerous coordination problems, but these have a trivially simple solution if there is a reliable failure detection service. Service Fabric does not use a distributed consensus protocol like Raft or Paxos, or a centralized store for cluster state. Instead, it proposes a federation subsystem that answers the most important membership question: whether a specific node is part of the system. The nodes are organized in a ring, and heartbeats are sent only to a small subset of nodes called the neighborhood. The arbitration procedure involves more nodes besides the monitor, but it is executed only on missed heartbeats. A quorum of privileged nodes in an arbitration group helps resolve possible failure detection conflicts, isolate the appropriate nodes, and maintain a consistent view of the ring.

Kubernetes requires distributed consensus, but Service Fabric relies on its ring. Distributed consensus algorithms have evolved into many forms, but two primarily dominate in practice: Paxos and Raft. Both take a similar approach, differing mainly in their approach to leader election. Raft only allows servers with up-to-date logs to become leaders, whereas Paxos allows any server to become leader provided it then updates its log to ensure it is up-to-date. Raft is more efficient since it does not require log entries to be exchanged during leader elections.

Some of the other differences can be compared along the following dimensions; a short sketch contrasting the leader-election rules follows the list:

1.       How does it ensure that each term has at most one leader?

a.       Paxos: A server s can only be a candidate in a term t if t mod n = s. There will be only one candidate per term, so there will be only one leader per term.

b.       Raft: A follower can become a candidate in any term. Each follower will only vote for one candidate per term, so only one candidate can get a majority of votes and become the leader.

2.       How does it ensure that a new leader’s log contains all committed entries?

a.       Paxos: Each RequestVote reply includes the follower's log entries. Once a candidate has received RequestVote responses from a majority of followers, it adds the entries with the highest term to its log.

b.       Raft: A vote is granted only if the candidate's log is at least as up-to-date as the follower's. This ensures that a candidate becomes leader only if its log is at least as up-to-date as those of a majority of its followers.

3.       How does it ensure that the leader safely commits log entries from previous terms?

a.       Paxos: Log entries from previous terms are added to the leader's log with the leader's current term. The leader then replicates these log entries as if they were from its own term.

b.       Raft: The leader replicates the log entries to the other servers without changing the term. The leader cannot consider those entries committed until it has replicated a subsequent log entry from its own term.
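A toy sketch contrasting the first dimension, how each protocol keeps at most one leader per term, is shown below; it is a simplification for illustration, not a full implementation of either protocol.

```python
# Toy sketch contrasting how each protocol limits a term to at most one leader
# (dimension 1 above). Simplified for illustration; not a full implementation.
from typing import Dict

N = 5  # number of servers, ids 0..4

def paxos_can_be_candidate(server_id: int, term: int) -> bool:
    """Paxos variant: server s may stand in term t only if t mod n == s."""
    return term % N == server_id

class RaftFollower:
    """Raft: any server may stand, but each follower votes at most once per term."""
    def __init__(self) -> None:
        self.voted_for: Dict[int, int] = {}   # term -> candidate id

    def request_vote(self, term: int, candidate_id: int) -> bool:
        if term not in self.voted_for:
            self.voted_for[term] = candidate_id
            return True
        return self.voted_for[term] == candidate_id

# Paxos: exactly one eligible candidate in term 7
print([s for s in range(N) if paxos_can_be_candidate(s, 7)])   # [2]

# Raft: a follower grants its term-3 vote to the first candidate that asks
f = RaftFollower()
print(f.request_vote(3, candidate_id=1))   # True
print(f.request_vote(3, candidate_id=4))   # False, already voted in term 3
```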

The Raft algorithm was proposed to address long-standing issues with the understandability of the widely studied Paxos algorithm. It has a clear abstraction and presentation and can be seen as a simplified version of the original Paxos algorithm. Specifically, Paxos divides terms between servers, whereas Raft allows a follower to become a candidate in any term but lets each follower vote for only one candidate per term. Paxos followers will vote for any candidate, whereas Raft followers will only vote for a candidate whose log is at least as up-to-date as their own. If a leader has uncommitted log entries from a previous term, Paxos will replicate them in the current term, whereas Raft will replicate them in their original term. Raft's leader election is therefore lightweight compared to Paxos's.

 

Sunday, March 20, 2022

 Azure Service Fabric: 

Introduction: This is a continuation of a series of articles on Azure services from an operational engineering perspective and the role their design and algorithms play in the field. Most recently we discussed Azure Functions with the link here. This article turns to Microsoft Service Fabric with an intention to discuss the comparisons between Paxos, Raft and Service Fabric’s protocol. 

Discussion: 

Service Fabric is Microsoft's distributed platform for building, running, and maintaining microservice applications in the cloud. It is a container orchestrator and is able to provide quality of service to microservice framework models such as stateful, stateless, and actor. It differs from Azure Container Service in that it is an Infrastructure-as-a-Service offering rather than a Platform-as-a-Service offering. There is also a Service Fabric Mesh offering that provides a PaaS service for Service Fabric applications. Service Fabric provides its own specific programming model, allows guest executables, and orchestrates Docker containers. It supports both Windows and Linux but is primarily suited for Windows. It can scale to handle Internet-of-Things traffic. It is open to workloads and is technology-agnostic. It relies on Docker and supports both Windows and Linux containers, while providing a more integrated, feature-rich orchestration with more openness and flexibility.

Cloud-native container applications are evaluated on the 12-factor methodology for building web applications and software-as-a-service, which demands the following:

  • Use of declarative frameworks for setup automation, minimizing time and cost for new developers joining the project 

  • Use of a clean contract with the underlying operating system, offering maximum portability between execution environments

  • Suitability for deployment on modern cloud platforms and avoiding the need for servers and system administration 

  • Ability to minimize divergence between development and production and to enable continuous deployment for maximum agility 

  • Ability to scale up without significant changes to tooling, architecture or development practices. 

Service Fabric encourages all of these so that its workloads can focus more on their business requirements. 

 

It is often compared to Kubernetes, which is also a container orchestration framework that hosts applications. Kubernetes extends the idea of app+container all the way up, so that the hosts can be nodes of a cluster. Kubernetes evolved as an industry effort from the operating system's native Linux container support. It can be considered a step towards a truly container-centric development environment. Containers decouple applications from infrastructure, which separates dev from ops. Containers demonstrate better resource isolation and improved resource utilization. Kubernetes is not a traditional, all-inclusive PaaS. Unlike a PaaS, which restricts applications, dictates the choice of application frameworks, restricts supported language runtimes, or distinguishes apps from services, Kubernetes aims to support an extremely diverse variety of workloads. If an application has been compiled to run in a container, it will work with Kubernetes. A PaaS provides databases, message buses, and cluster storage systems, but those can also run on Kubernetes. There is also no click-to-deploy service marketplace. Kubernetes does not build user code or deploy it; however, it facilitates CI workflows that run on it. Kubernetes allows users to choose their own logging, monitoring, and alerting. Kubernetes also does not require a comprehensive application language or system, and it is independent of machine configuration or management. But a PaaS can run on Kubernetes and extend its reach to different clouds.

Service Fabric is used by some of the largest services in the Azure cloud service portfolio, but it comes with a different history, different goals, and a different design. The entire Microsoft Azure Stack hybrid offering relies on Service Fabric to run all the platform core services. Kubernetes is a heavily centralized system. It has an API server in the middle and agents called Kubelets installed on all worker nodes. All Kubelets communicate with the API server, which saves cluster state in a centralized repository: the etcd cluster, a Raft-backed distributed key-value store. Cluster membership is maintained by requiring Kubelets to hold a connection with the API server and send heartbeats every period. Service Fabric avoids this with a decentralized ring and failure detection. It is motivated by the adage that distributed consensus is at the heart of numerous coordination problems, but these have a trivially simple solution if there is a reliable failure detection service. Service Fabric does not use a distributed consensus protocol like Raft or Paxos, or a centralized store for cluster state. Instead, it proposes a federation subsystem that answers the most important membership question: whether a specific node is part of the system. The nodes are organized in a ring, and heartbeats are sent only to a small subset of nodes called the neighborhood. The arbitration procedure involves more nodes besides the monitor, but it is executed only on missed heartbeats. A quorum of privileged nodes in an arbitration group helps resolve possible failure detection conflicts, isolate the appropriate nodes, and maintain a consistent view of the ring.

Conclusion: 

Service Fabric recognizes different types of workloads and is particularly well-suited for stateful workloads.