This is a continuation of a series of articles on the operational engineering aspects of the Azure public cloud, the most recent of which discussed Azure Data Lake, a full-fledged, generally available service that provides Service Level Agreements comparable to others in its category. Azure Data Lake is suited to storing and handling Big Data. It is built over Azure Blob Storage, so it provides native support for web-accessible documents. It is not a massive virtual data warehouse, but it powers a lot of analytics and is the centerpiece of most solutions that conform to the Big Data architectural style. In this article, we focus on Data Lake monitoring.
As we might expect from the use of an Azure storage account, monitoring for Azure Data Lake leverages the monitoring of the underlying storage account. Azure Storage Analytics performs logging and provides metrics data for a storage account. This data can be used to trace requests, analyze usage trends, and diagnose issues with the storage account.
Storage Analytics must be enabled individually for each service that needs to be monitored; the Blob, Queue, Table, and File services are all subject to monitoring. The aggregated log data is stored in blobs in a well-known container designated for logging ($logs), and the metrics are stored in well-known tables; both can be accessed using the Blob service and Table service APIs. There is a 20 TB limit for the aggregated analytics data, and this is separate from the capacity provisioned for our own data, so we do not have to worry about resizing. When we monitor a storage service, we study its health, capacity, availability, and performance. The service health can be observed from the portal, and notifications can be subscribed to.
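If we were scripting this rather than clicking through the portal, a minimal sketch along the following lines could turn on logging and metrics for the Blob service; it assumes the azure-storage-blob v12 Python SDK and a connection string exposed through an AZURE_STORAGE_CONNECTION_STRING environment variable (both assumptions of mine), and the Queue and Table service clients expose analogous properties:

import os

from azure.storage.blob import (
    BlobAnalyticsLogging,
    BlobServiceClient,
    Metrics,
    RetentionPolicy,
)

client = BlobServiceClient.from_connection_string(
    os.environ["AZURE_STORAGE_CONNECTION_STRING"]
)

# Keep seven days of analytics data; Storage Analytics prunes older entries.
retention = RetentionPolicy(enabled=True, days=7)

client.set_service_properties(
    # Log read, write, and delete requests to the well-known $logs container.
    analytics_logging=BlobAnalyticsLogging(
        read=True, write=True, delete=True, retention_policy=retention
    ),
    # Record hourly and per-minute transaction metrics, including per-API rows.
    hour_metrics=Metrics(enabled=True, include_apis=True, retention_policy=retention),
    minute_metrics=Metrics(enabled=True, include_apis=True, retention_policy=retention),
)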
Capacity data for the Blob service is recorded in the well-known $MetricsCapacityBlob table. Storage Metrics records this data once per day; the capacity is measured in bytes, and both a ContainerCount and an ObjectCount are available in each daily entry.
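Purely as an illustration, and assuming the azure-data-tables Python SDK can address the metrics tables through the account's Table service endpoint (the column names follow the documented Storage Analytics schema, while the filter value is my assumption), the daily capacity entries could be read like this:

import os

from azure.data.tables import TableServiceClient

tables = TableServiceClient.from_connection_string(
    os.environ["AZURE_STORAGE_CONNECTION_STRING"]
)
capacity = tables.get_table_client("$MetricsCapacityBlob")

# RowKey 'data' rows describe user data; 'analytics' rows describe the
# analytics data itself. PartitionKey encodes the UTC day of the entry.
for entity in capacity.query_entities("RowKey eq 'data'"):
    print(
        entity["PartitionKey"],
        entity["Capacity"],        # bytes
        entity["ContainerCount"],
        entity["ObjectCount"],
    )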
Availability is monitored in the hourly and minute metrics tables that record primary transactions against blobs, tables, and queues; the availability figure is a column in these tables. Performance is measured in the AverageE2ELatency and AverageServerLatency columns.
AverageE2ELatency is recorded only for successful requests and includes the time the client takes to send the data and to receive acknowledgements from the storage service, whereas AverageServerLatency covers only the processing time within the service. A high value for the first and a low value for the second implies that the client is slow or the network connectivity is poor.
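A similar sketch (under the same assumptions as above) can scan the hourly Blob transaction metrics and flag hours where end-to-end latency is far above server latency, which is exactly the client-or-network symptom just described; the 'user;All' RowKey aggregates all user requests:

import os

from azure.data.tables import TableServiceClient

tables = TableServiceClient.from_connection_string(
    os.environ["AZURE_STORAGE_CONNECTION_STRING"]
)
hourly = tables.get_table_client("$MetricsHourPrimaryTransactionsBlob")

for entity in hourly.query_entities("RowKey eq 'user;All'"):
    e2e = entity.get("AverageE2ELatency", 0)
    server = entity.get("AverageServerLatency", 0)
    # A wide gap between the two points at the client or the network rather
    # than the storage service itself.
    if server and e2e > 2 * server:
        print(entity["PartitionKey"], "E2E", e2e, "ms vs server", server, "ms,",
              "availability", entity.get("Availability"))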
Nagle's algorithm is a TCP optimization on the sender, designed to reduce network congestion by coalescing small send requests into larger TCP segments, so small segments are held back until a larger segment is available to send. It does not work well with delayed acknowledgements, which are an optimization on the receiver side: when the receiver delays the ack and the sender waits for that ack before sending a small segment, the data transfer stalls. Turning off these optimizations can improve table, blob, and queue usage, particularly for workloads with many small requests.
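The knob itself lives at the socket layer; in the .NET storage client the switch is ServicePointManager.UseNagleAlgorithm. Purely as a conceptual sketch, this is what disabling Nagle looks like on a raw socket in Python:

import socket

# TCP_NODELAY turns off Nagle's algorithm so small writes (small table
# entities, queue messages) are sent immediately instead of being held back
# while the peer's delayed ACK is outstanding.
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)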
The requests to create blobs for logging and to create table entities for metrics are billable. Older logging and metrics data can be archived or deleted; this is governed by the retention policy set as part of the analytics configuration for each service. Metrics are stored both for the service as a whole and for the individual API operations of that service, including the counts and percentages of certain status messages. These features can help analyze the cost aspect of the usage.
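When no retention policy is configured, the logs can also be trimmed by hand. A sketch (again assuming the azure-storage-blob v12 SDK; the 30-day cutoff is an arbitrary choice) that deletes aging blobs from the well-known $logs container and so caps the billable storage the logs consume:

import os
from datetime import datetime, timedelta, timezone

from azure.storage.blob import BlobServiceClient

client = BlobServiceClient.from_connection_string(
    os.environ["AZURE_STORAGE_CONNECTION_STRING"]
)
logs = client.get_container_client("$logs")
cutoff = datetime.now(timezone.utc) - timedelta(days=30)

for blob in logs.list_blobs():
    # last_modified is a timezone-aware datetime on the blob properties.
    if blob.last_modified < cutoff:
        logs.delete_blob(blob.name)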