Wednesday, April 6, 2022

Service Fabric (continued)    

Part 2 compared Paxos and Raft. Part 3 discussed SF-Ring and Part 4 discussed its architecture. This article describes monitoring and diagnostics.

Service Fabric provides monitoring at three levels: application monitoring, cluster monitoring, and infrastructure monitoring.

Application monitoring allows us to study the performance of the features and components of an application. It is the responsibility of the developers of the application and its services. It answers questions such as how much traffic is flowing to the application, whether the calls made to its services succeed, what actions users take in the application, whether the application is throwing unhandled exceptions, and whether the services are running correctly within their containers.

Service Fabric makes applications resilient to hardware failures. Failures are detected rapidly, and workloads fail over to other nodes. Cluster monitoring is critical to this end. The platform emits a comprehensive set of events out of the box, which can be accessed through the EventStore or the operational channel. Service Fabric events come from a single ETW provider, with a set of relevant filters to separate them. The EventStore makes these events available in Service Fabric Explorer and through the REST API.
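As a rough illustration of reaching the EventStore over REST, the sketch below only builds the query URL for node events in a time window. The base address http://localhost:19080, the helper name node_events_url and the api-version value are assumptions made for this example; check them against the cluster's actual management endpoint and the REST API reference before use.

```python
from datetime import datetime, timezone
from urllib.parse import urlencode


def node_events_url(base_url, start, end, api_version="6.4"):
    """Build an EventStore query URL for node events between start and end.

    start and end are expected to be timezone-aware datetimes.
    The helper name and the default api_version are assumptions for this sketch.
    """
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    query = urlencode({
        "api-version": api_version,
        "StartTimeUtc": start.astimezone(timezone.utc).strftime(fmt),
        "EndTimeUtc": end.astimezone(timezone.utc).strftime(fmt),
    })
    return f"{base_url.rstrip('/')}/EventsStore/Nodes/Events?{query}"


if __name__ == "__main__":
    window_start = datetime(2022, 4, 5, tzinfo=timezone.utc)
    window_end = datetime(2022, 4, 6, tzinfo=timezone.utc)
    # Assumed local development cluster address.
    print(node_events_url("http://localhost:19080", window_start, window_end))
```

The resulting URL can be passed to any HTTP client; a secured cluster would additionally need a client certificate or token, which is left out here.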

These events describe the actions taken by the platform on different entities such as nodes, applications, services, and partitions. If a node goes down, the platform emits a NodeDown event, and a tool of choice can use that event to generate notifications. These events are available on both Windows and Linux clusters.
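The NodeDown case can be sketched as a small filter over a list of events that has already been fetched. The event shape used here (dictionaries with Kind, NodeName and TimeStamp keys) and the function name node_down_alerts are assumptions for illustration; the exact fields should be confirmed against the events a real cluster returns.

```python
def node_down_alerts(events):
    """Return one notification message per NodeDown event.

    events is a list of dicts; the key names are assumed for this sketch.
    """
    alerts = []
    for event in events:
        if event.get("Kind") == "NodeDown":
            name = event.get("NodeName", "<unknown node>")
            when = event.get("TimeStamp", "<unknown time>")
            alerts.append(f"Node {name} went down at {when}")
    return alerts


if __name__ == "__main__":
    sample = [
        {"Kind": "NodeUp", "NodeName": "node0", "TimeStamp": "2022-04-06T10:00:00Z"},
        {"Kind": "NodeDown", "NodeName": "node1", "TimeStamp": "2022-04-06T10:05:00Z"},
    ]
    for message in node_down_alerts(sample):
        print(message)
```

In practice the messages would be handed to whatever notification channel the team already uses, such as email or a chat webhook.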

The platform includes a health model that provides extensible health reporting for the status of the entities in a cluster. Each node, application, service, partition, replica, or instance has a continuously updatable health status. Whenever the health of an entity transitions, a corresponding event is emitted. Queries and alerts can be set up on a dashboard with these events.
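To make the idea of health states and transitions concrete, here is a deliberately simplified sketch. It treats the health of a parent as the worst state among its children and reports which entities changed state between two snapshots. The real platform evaluates health through configurable health policies, so the worst-of rule, the HealthState enum and both function names are assumptions of this example and not the platform's algorithm.

```python
from enum import IntEnum


class HealthState(IntEnum):
    """Health states ordered from best to worst (simplified)."""
    OK = 0
    WARNING = 1
    ERROR = 2


def aggregate(child_states):
    """Worst-of aggregation: a parent is as unhealthy as its worst child."""
    return max(child_states, default=HealthState.OK)


def health_transitions(previous, current):
    """Return (entity, old_state, new_state) for every entity whose state changed."""
    changes = []
    for entity, new_state in current.items():
        old_state = previous.get(entity)
        if old_state is not None and old_state != new_state:
            changes.append((entity, old_state, new_state))
    return changes


if __name__ == "__main__":
    before = {"replica-1": HealthState.OK, "replica-2": HealthState.OK}
    after = {"replica-1": HealthState.OK, "replica-2": HealthState.ERROR}
    print(aggregate(after.values()).name)
    for entity, old, new in health_transitions(before, after):
        print(f"{entity}: {old.name} -> {new.name}")
```

Each tuple returned by health_transitions corresponds to the kind of transition that a dashboard query or alert would key on.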

Users can also override health for entities. If an application is going through an upgrade and its validation tests fail, a report can be written to Service Fabric health through the Health API to indicate that the application is no longer healthy, and Service Fabric will automatically roll back the upgrade.
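A health report of this kind can be sketched as an HTTP request to the cluster's REST endpoint. The example below only builds the URL and the JSON body and leaves the actual POST to the caller. The cluster address, the application id format, the source and property names, and the helper name application_health_report are all assumptions for illustration; verify the route and body fields against the REST API reference before relying on them.

```python
import json


def application_health_report(cluster_url, application_id, source_id,
                              health_property, state, description):
    """Build the URL and JSON body for reporting health on an application.

    All argument values are supplied by the caller; nothing is sent here.
    """
    url = (f"{cluster_url.rstrip('/')}/Applications/{application_id}"
           "/$/ReportHealth?api-version=6.0")
    body = json.dumps({
        "SourceId": source_id,          # who is reporting, e.g. a test runner
        "Property": health_property,    # what aspect is being reported on
        "HealthState": state,           # "Ok", "Warning" or "Error"
        "Description": description,
    })
    return url, body


if __name__ == "__main__":
    # Hypothetical values for a failed post-upgrade validation run.
    url, body = application_health_report(
        "http://localhost:19080", "myapp", "UpgradeValidationTests",
        "SmokeTests", "Error", "Validation tests failed after upgrade")
    print(url)
    print(body)
```

Sending the body with an HTTP POST and a JSON content type would mark the application as unhealthy, which is the signal that the upgrade health checks described above react to.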

A watchdog is a separate service that watches health and load across services, pings endpoints, and reports unexpected health events in the cluster. This helps catch errors that would not be detected from the performance of a single service alone. Watchdogs are also a good place to host code that performs remedial actions that don’t require user involvement, such as archiving older log entries.
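The endpoint-pinging part of a watchdog can be sketched in a few lines. The probe below takes the ping function as a parameter so that it can be exercised without a network; http_ping is one possible implementation based on the standard library. The function names, the report shape and the example URLs are assumptions for this sketch, and a real watchdog would forward the reports to the cluster's health store instead of printing them.

```python
import urllib.request


def http_ping(url, timeout=2.0):
    """Return True if the endpoint answers with a 2xx status."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return 200 <= response.status < 300


def probe_endpoints(endpoints, ping=http_ping):
    """Ping each endpoint and return one health report per service.

    endpoints maps a service name to a URL. Any exception raised by the
    ping (timeout, connection refused, HTTP error) counts as unhealthy.
    """
    reports = []
    for service, url in endpoints.items():
        try:
            healthy = ping(url)
        except Exception:
            healthy = False
        reports.append({"service": service,
                        "state": "Ok" if healthy else "Error"})
    return reports


if __name__ == "__main__":
    # Hypothetical endpoints; replace with the services being watched.
    targets = {"orders": "http://localhost:8080/health",
               "billing": "http://localhost:8081/health"}
    for report in probe_endpoints(targets):
        print(report)
```

Running the probe on a timer and reporting only the Error results is one simple way to surface problems that no single service would report about itself.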

Infrastructure monitoring is also referred to as performance monitoring, since it pertains to system performance, which depends on many factors. These factors are typically measured through performance counters, which can come from a variety of sources including the operating system, the .NET framework, or the Service Fabric platform itself. Performance counters are also available for the Reliable Services and Reliable Actors programming models.

Application Insights is used for application monitoring, a diagnostics agent is used for cluster monitoring, and Azure Monitor logs are used for infrastructure monitoring.
