Sunday, December 12, 2021

 

This is a continuation of an article that describes operational considerations for hosting solutions on the Azure public cloud. 

This series of articles draws on the documentation for the Azure public cloud and refers to its best practices throughout. The previous article focused on antipatterns to avoid, specifically the noisy neighbor antipattern. This article focuses on performance tuning for applications that use CosmosDB.

An example of an application using CosmosDB is a drone delivery application that runs on Azure Kubernetes Service. When a fleet of drones sends position data in real time to Azure IoT Hub, a Functions app receives the events, transforms the data into GeoJSON format, and writes it to CosmosDB. The geospatial data in CosmosDB can be indexed for efficient spatial queries, which enables a client application to find all drones within a given distance of a location or all drones inside a certain polygon. Azure Functions is used to write the data because it is lightweight: the workload does not need a full-fledged stream processing engine that joins streams, aggregates data, or processes across time windows. CosmosDB, for its part, can support the high write throughput.
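The transformation step can be sketched as follows. This is a minimal illustration and not the reference application's code: the function name to_geojson_point and the field names deviceId and altitude are assumptions made for the example. The one detail that matters is that GeoJSON orders coordinates as longitude first, then latitude.

```python
def to_geojson_point(device_id, latitude, longitude, altitude=None):
    """Build a document with a GeoJSON Point from one drone position reading.

    The field names (deviceId, altitude) are hypothetical; only the
    "location" value follows the GeoJSON Point layout.
    """
    if not -90 <= latitude <= 90:
        raise ValueError(f"latitude out of range: {latitude}")
    if not -180 <= longitude <= 180:
        raise ValueError(f"longitude out of range: {longitude}")
    return {
        "deviceId": device_id,
        "location": {
            "type": "Point",
            # GeoJSON expects [longitude, latitude], not [latitude, longitude].
            "coordinates": [longitude, latitude],
        },
        "altitude": altitude,
    }


if __name__ == "__main__":
    print(to_geojson_point("drone-1", 47.6, -122.3, altitude=120.0))
```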

Monitoring data for CosmosDB can show 429 error codes in responses. CosmosDB returns this error when it is temporarily throttling requests, usually because the caller is consuming more request units (RUs) than were provisioned. It should not be confused with the error returned when an item to be created already exists in the store, which is reported as a 409 conflict instead.

When a 429 error is followed by a wait of about 600 ms before the operation is retried, the trace shows idle time with no corresponding activity: the client is simply backing off. A chart of request unit consumption per partition against provisioned request units per partition helps identify the original cause of the 429 error that preceded the wait. It may show that consumption has exceeded the provisioned request units.
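The wait-and-retry behavior described above can be sketched in a few lines. This is an illustration under stated assumptions and not the SDK's implementation: ThrottledError is a hypothetical stand-in for a 429 response, and its retry_after_ms attribute models the retry hint that the service sends back with the throttled response (the x-ms-retry-after-ms header). The official SDKs already retry throttled requests on their own, so a loop like this mainly helps to reason about where the idle time comes from.

```python
import time


class ThrottledError(Exception):
    """Hypothetical stand-in for a 429 response carrying the server's retry hint."""

    def __init__(self, retry_after_ms):
        super().__init__(f"429 Too Many Requests, retry after {retry_after_ms} ms")
        self.retry_after_ms = retry_after_ms


def call_with_retry(operation, max_attempts=5, sleep=time.sleep):
    """Run operation(), waiting for the server-suggested delay after each 429."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except ThrottledError as err:
            if attempt == max_attempts:
                raise  # give up and surface the throttling to the caller
            # Nothing useful happens during this wait; it is pure idle time.
            sleep(err.retry_after_ms / 1000.0)
```

Passing the sleep function as a parameter keeps the sketch easy to exercise without real delays.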

Another likely cause of CosmosDB errors is incorrect usage of partition keys. Queries that do not include a partition key become cross-partition queries, which are quite inefficient and can lead to high latency when multiple database partitions are queried serially. On the write side, hot partitions may result when documents are written without a partition key value, or with a key whose values are poorly distributed, because the writes then concentrate on a few partitions. A partition heat map can assist in this regard because it shows the headroom between allocated and consumed request units for each partition.
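The headroom that a heat map visualizes is just the difference between provisioned and consumed request units for each partition. The sketch below computes it from plain numbers; the function names, the input shape (a dictionary of partition id to consumed RU/s) and the 90 percent threshold are assumptions chosen for illustration, not part of any Azure API.

```python
def partition_headroom(consumed_ru, provisioned_ru_per_partition):
    """Return {partition_id: provisioned - consumed}. Negative means over budget."""
    return {
        pid: provisioned_ru_per_partition - used
        for pid, used in consumed_ru.items()
    }


def hot_partitions(consumed_ru, provisioned_ru_per_partition, threshold=0.9):
    """List partitions whose consumption is at or above threshold of their budget."""
    limit = threshold * provisioned_ru_per_partition
    return sorted(pid for pid, used in consumed_ru.items() if used >= limit)


if __name__ == "__main__":
    sample = {"p0": 950, "p1": 200, "p2": 1200}  # hypothetical RU/s readings
    print(partition_headroom(sample, 1000))
    print(hot_partitions(sample, 1000))
```

A skewed result, where one partition is over budget while the others sit nearly idle, suggests that the partition key is the problem and not the overall provisioned throughput.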

CosmosDB provides snapshot isolation for transactions within a partition, and it relies on optimistic concurrency control to prevent lost updates between separate requests, so it is important to include a version with update operations. There is a system-defined _etag property that is automatically generated and updated by the server every time the item is updated. The _etag value can be sent in the client-supplied If-Match request header to allow the server to decide whether an item can be conditionally updated. Because the value changes on every update, a mismatch causes the server to reject the write, and the application can rely on that rejection as a signal to re-read the item, reapply its updates, and retry the original client request.
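The conditional update can be illustrated with a small in-memory simulation. This is not the CosmosDB SDK: InMemoryStore, PreconditionFailed and update_with_retry are hypothetical names, and the store only mimics the behavior described above, namely a new _etag on every write and a rejection when the supplied if_match value is stale (the service reports that case as HTTP 412).

```python
import uuid


class PreconditionFailed(Exception):
    """Stand-in for the rejection returned when the If-Match value is stale."""


class InMemoryStore:
    """Toy store that assigns a fresh _etag on every write, as the server does."""

    def __init__(self):
        self._items = {}

    def read(self, item_id):
        return dict(self._items[item_id])

    def upsert(self, item, if_match=None):
        current = self._items.get(item["id"])
        if if_match is not None and (current is None or current["_etag"] != if_match):
            raise PreconditionFailed(item["id"])
        stored = dict(item)
        stored["_etag"] = uuid.uuid4().hex  # the version changes on every update
        self._items[item["id"]] = stored
        return dict(stored)


def update_with_retry(store, item_id, mutate, max_attempts=3):
    """Read, apply mutate(), and write back conditionally; re-read on conflict."""
    for _ in range(max_attempts):
        current = store.read(item_id)
        updated = mutate(dict(current))
        try:
            return store.upsert(updated, if_match=current["_etag"])
        except PreconditionFailed:
            continue  # someone else wrote first: re-read and reapply
    raise PreconditionFailed(item_id)
```

The important design point is that the retry re-reads the item before reapplying the change; retrying the same stale write would fail again.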
