Service Fabric (continued)
Part 2 compared Paxos and Raft. Part 3 discussed SF-Ring, and Part 4
discussed its architecture. This article describes monitoring and diagnostics.
Service Fabric provides monitoring at various levels, including
application monitoring, cluster monitoring, and infrastructure monitoring.
Application monitoring allows us to study the performance of
the features and components of an application. Responsibility for application
monitoring lies with the developers of the application and its services. It
answers questions such as how much traffic is flowing to the application,
whether calls to the services succeed, what actions users take in the
application, whether the application is throwing unhandled exceptions, and
whether the services are running properly within their containers.
Service Fabric makes applications resilient to hardware
failures. Failures are rapidly detected, and workloads fail over to
other nodes. Cluster monitoring is critical to this end. A comprehensive
set of events is available out of the box. These events can be accessed
through the EventStore or the operational channel. Service Fabric events are
available from a single ETW provider with a set of relevant filters to separate
them. The EventStore makes these events available in Service Fabric Explorer
and through the REST API.
These events describe actions taken by the platform on
different entities such as nodes, applications, services, and partitions. If a
node were to go down, the platform would emit a NodeDown event, and a tool of
choice could use that event to generate notifications. These events are
available on both Windows and Linux clusters.
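As a rough illustration of how a tool might turn platform events into notifications, the sketch below filters a batch of event records for NodeDown and formats one alert per match. The record layout (the `Kind`, `NodeName`, and `TimeStamp` keys), the sample values, and the helper name `node_down_alerts` are assumptions made for this example rather than the documented event schema. A real tool would read the records from the EventStore or the operational channel.

```python
from typing import Iterable


def node_down_alerts(events: Iterable[dict]) -> list[str]:
    """Return one alert line per NodeDown event, in input order."""
    alerts = []
    for event in events:
        # Skip every event kind except the one we notify on.
        if event.get("Kind") != "NodeDown":
            continue
        node = event.get("NodeName", "<unknown node>")
        when = event.get("TimeStamp", "<unknown time>")
        alerts.append(f"ALERT: node {node} went down at {when}")
    return alerts


# Hypothetical records; the field names are assumptions for illustration.
sample_events = [
    {"Kind": "NodeUp", "NodeName": "_Node_0", "TimeStamp": "2024-01-01T10:00:00Z"},
    {"Kind": "NodeDown", "NodeName": "_Node_1", "TimeStamp": "2024-01-01T10:05:00Z"},
]

for line in node_down_alerts(sample_events):
    print(line)
```

Keeping the filter separate from the delivery step means the same function can feed email, chat, or a paging system.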
The platform includes a health model which provides
extensible health reporting for the status of the entities in a cluster. Each
node, application, service, partition, replica or instance has a continuously
updatable health status. Whenever the health of a particular entity
transitions, a corresponding event is also emitted. Queries and alerts
can be set up on a dashboard with these events.
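The idea that an event accompanies each health transition, and not each health report, can be sketched as follows. This is a toy model, not the platform's implementation: the `Entity` class, its `report` method, and the tuple layout of the emitted events are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass, field
from enum import Enum


class HealthState(Enum):
    OK = "Ok"
    WARNING = "Warning"
    ERROR = "Error"


@dataclass
class Entity:
    """A cluster entity (node, service, replica, ...) with a health status."""
    name: str
    state: HealthState = HealthState.OK
    emitted: list = field(default_factory=list)

    def report(self, new_state: HealthState) -> None:
        """Record a health report; emit an event only when the state changes."""
        if new_state is self.state:
            return  # same state as before: no transition, so no event
        self.emitted.append((self.name, self.state.value, new_state.value))
        self.state = new_state


replica = Entity("replica-1")
replica.report(HealthState.OK)       # no change, nothing emitted
replica.report(HealthState.WARNING)  # Ok -> Warning
replica.report(HealthState.WARNING)  # repeated report, nothing emitted
replica.report(HealthState.ERROR)    # Warning -> Error
print(replica.emitted)
```

A dashboard alert then only needs to watch the emitted transitions and can ignore repeated reports of an unchanged state.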
Users can also override health for entities. If the
application is going through an upgrade and the validation tests are failing,
a health report can be written to Service Fabric through the Health
API to indicate that the application is no longer healthy, and Service
Fabric will automatically roll back the upgrade.
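A health report of this kind might be assembled as in the sketch below, which only builds the request and does not send it. The endpoint path, the `api-version` value, and the body field names (`SourceId`, `Property`, `HealthState`, `Description`) are assumptions here and should be verified against the Service Fabric REST API reference before use. The function name `build_health_report` and the sample values are hypothetical.

```python
import json
from urllib.parse import quote


def build_health_report(application_id: str, source_id: str,
                        prop: str, description: str) -> tuple[str, str]:
    """Build the path and JSON body for an Error health report on an application.

    The path and field names are assumptions to be checked against the
    REST API reference; nothing is sent over the network.
    """
    path = (f"/Applications/{quote(application_id, safe='')}"
            "/$/ReportHealth?api-version=6.0")
    body = {
        "SourceId": source_id,    # who is reporting, e.g. the validation suite
        "Property": prop,         # which aspect of the entity is being reported
        "HealthState": "Error",   # marks the application as unhealthy
        "Description": description,
    }
    return path, json.dumps(body)


path, body = build_health_report(
    "MyApp", "UpgradeValidation", "SmokeTests", "Validation tests failed")
print(path)
print(body)
```

Whether such a report actually triggers a rollback depends on how the upgrade was started, in particular on the upgrade being monitored and on its configured failure action and health policy.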
A watchdog is a separate service that watches
health and load across the services, pings endpoints and reports unexpected
health events in the cluster. This can help catch errors that would go
undetected if only the performance of a single service were considered. Watchdogs
are also a good place to host code that performs remedial actions that don’t
require user involvement, such as archiving older log entries.
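One pass of such a watchdog could look like the sketch below. It is a minimal illustration under stated assumptions: the function name `run_watchdog_pass`, the endpoint map, and the "Ok"/"Error" result strings are hypothetical, and the probe is passed in as a function so that the example runs without a network. A real watchdog would issue HTTP requests and submit the results as health reports.

```python
from typing import Callable, Dict


def run_watchdog_pass(endpoints: Dict[str, str],
                      probe: Callable[[str], bool]) -> Dict[str, str]:
    """Probe each endpoint once and return a health state per service."""
    results = {}
    for service, url in endpoints.items():
        try:
            healthy = probe(url)
        except Exception:
            # A probe that raises (timeout, refused connection, ...) counts
            # as an unhealthy endpoint and must not crash the watchdog.
            healthy = False
        results[service] = "Ok" if healthy else "Error"
    return results


# Hypothetical endpoints and a fake probe standing in for an HTTP ping.
endpoints = {
    "orders": "http://orders.example/health",
    "billing": "http://billing.example/health",
}


def fake_probe(url: str) -> bool:
    if "billing" in url:
        raise TimeoutError("no response")
    return True


print(run_watchdog_pass(endpoints, fake_probe))
```

Injecting the probe also makes it easy to swap in load checks or other signals alongside the endpoint ping.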
Infrastructure monitoring is also referred to as performance
monitoring, since it pertains to system performance, which depends on many
factors. These factors are typically measured through performance counters,
which can come from a variety of sources, including the operating system, the
.NET Framework, or the Service Fabric platform itself. Performance counters are
also available for the Reliable Services and Reliable Actors programming models.
Application Insights is used for application monitoring,
the diagnostics agent is used for cluster monitoring, and Azure Monitor logs
is used for infrastructure monitoring.