Tuesday, November 30, 2021

 This is a continuation of an article that describes operational considerations for hosting solutions on the Azure public cloud. 

There are several references to best practices throughout the series of articles we wrote from the documentation for the Azure Public Cloud. The previous article focused on the antipatterns to avoid, specifically the noisy neighbor antipattern. This article focuses on performance tuning for multiple backend services. 

An example of an application using multiple backend services is a drone delivery application that runs on Azure Kubernetes Service. Customers use a web application to schedule deliveries by drone. The backend services include a delivery service manager that manages deliveries, a drone scheduler that schedules drones for pickup, and a package service manager that manages packages. The orders are not processed synchronously: an ingestion service puts the orders on a queue for processing, and a workflow service coordinates the steps in the workflow. Clients call a REST API to get their latest invoice, which includes a summary of deliveries, packages, and total drone utilization. The information is retrieved from multiple backend services and the results are then aggregated for the user. The clients do not call the backend services directly; instead, the application implements the Gateway Aggregation pattern.
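The pattern lends itself to a short illustration. Below is a minimal sketch, using hypothetical typed clients for the three backend services (the interface and method names are my assumptions, not the application's actual code), of how a gateway fans the backend calls out in parallel and aggregates the results into one response.

```csharp
// Hypothetical backend clients; the real application's interfaces are not shown in the article.
using System.Threading.Tasks;

public interface IDeliveryClient { Task<string> GetDeliveriesAsync(string customerId); }
public interface IPackageClient { Task<string> GetPackagesAsync(string customerId); }
public interface IDroneSchedulerClient { Task<double> GetDroneUtilizationAsync(string customerId); }

public record InvoiceSummary(string Deliveries, string Packages, double DroneUtilization);

public class InvoiceGateway
{
    private readonly IDeliveryClient _deliveries;
    private readonly IPackageClient _packages;
    private readonly IDroneSchedulerClient _drones;

    public InvoiceGateway(IDeliveryClient deliveries, IPackageClient packages, IDroneSchedulerClient drones)
    {
        _deliveries = deliveries;
        _packages = packages;
        _drones = drones;
    }

    public async Task<InvoiceSummary> GetInvoiceAsync(string customerId)
    {
        // Fan out to the backends in parallel; the slowest call
        // (GetDroneUtilization in this walkthrough) bounds the overall latency.
        Task<string> deliveries = _deliveries.GetDeliveriesAsync(customerId);
        Task<string> packages = _packages.GetPackagesAsync(customerId);
        Task<double> utilization = _drones.GetDroneUtilizationAsync(customerId);

        await Task.WhenAll(deliveries, packages, utilization);

        // Aggregate the three results into a single payload for the client.
        return new InvoiceSummary(deliveries.Result, packages.Result, utilization.Result);
    }
}
```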

Performance tuning begins with a baseline, usually established with a load test. In this case, a six-node AKS cluster with three replicas for each microservice was deployed for a step load test in which the number of simulated users was stepped up from two to forty over a total duration of 8 minutes. It is observed that as the user load increases, the throughput (average requests per second) does not keep up. While no errors are returned to the user, the throughput peaks about halfway through the test and then drops off for the remainder. Resource contention, transient errors, and an increase in the rate of exceptions can all contribute to this pattern.

One way to tackle this bottleneck is to review the monitoring data. The average duration of the HTTP calls from the gateway to the backend services is noted. When the durations of the different backend calls are charted, it shows that the GetDroneUtilization operation takes much longer on average, by an order of magnitude. The gateway makes the calls to the backends in parallel, so the slowest operation determines how long the entire request takes to complete.

As the investigation narrows down to the GetDroneUtilization operation, Azure Monitor for Containers is used to pull up CPU and memory utilization for the relevant containers. Both the average and the maximum values are needed, because the average can hide spikes in utilization. If the overall utilization remains under 80%, this is not likely to be the issue.

Another chart, showing the response codes from the Delivery service's backend database, reveals that a considerable number of 429 error codes are returned from the calls made to the database. Cosmos DB, which is the backend database in this case, returns this error when it is temporarily throttling requests, usually because the caller is consuming more request units (RUs) than provisioned.

Fortunately, this level of focus comes with specific tools to assist with inferences. Application Insights provides end-to-end telemetry for a representative sample of requests. The call to the GetDroneUtilization operation is analyzed for external dependencies. It shows that Cosmos DB returns the 429 error code and the client waits 672 ms before retrying the operation. This means most of the delay is coming from waits without any corresponding activity. Another chart, plotting request unit consumption per partition against provisioned request units per partition, helps identify the original cause of the 429 errors that precede the waits. It turns out that nine partitions were provisioned with 100 request units each, and while the database spreads the operations across the partitions, the request unit consumption has exceeded the provisioned request units.
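For context, a brief sketch of where the observed retry waits come from: the Cosmos DB .NET SDK retries 429 responses automatically after the server-suggested interval, and that behavior can be tuned through CosmosClientOptions. This is an illustration of the knobs involved, not the article's code; the real fix in this scenario is to provision enough request units (or choose a better partition key) so that throttling does not occur in the first place.

```csharp
// Illustrative only: tuning client-side retry behavior for 429 (request rate too large).
using System;
using Microsoft.Azure.Cosmos;

public static class CosmosClientFactory
{
    public static CosmosClient Create(string endpoint, string key)
    {
        var options = new CosmosClientOptions
        {
            // How many times a throttled request is retried before the error surfaces.
            MaxRetryAttemptsOnRateLimitedRequests = 5,
            // Upper bound on the cumulative time spent waiting between retries.
            MaxRetryWaitTimeOnRateLimitedRequests = TimeSpan.FromSeconds(10)
        };
        return new CosmosClient(endpoint, key, options);
    }
}
```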


Monday, November 29, 2021

 

This is a summary of the book titled “13 Things Mentally Strong People Don’t Do” by Amy Morin. The author is a licensed clinical social worker, college psychology instructor, and psychotherapist, and the book is dedicated to all those who strive to become better today than they were yesterday. She cuts to the chase with clear and precise instructions. Some excerpts follow in this summary:

Thoughts, behaviors, and feelings are intertwined. When they work together, a “think positive” approach propels us forward; otherwise, they can create a downward spiral. The points mentioned below are behaviors associated with people who understand this intertwining and become mentally strong. They need not appear tough or ignore their emotions, but they are resilient, more satisfied, and demonstrate enhanced performance.

1.       They don’t waste time feeling sorry for themselves. Self-pity is the classic symptom of the weak and to gain strength, they must avoid this self-destructive behavior by behaving in a manner that makes it hard to feel sorry for themselves. One way to do this is to exchange self-pity for gratitude. The more they journal their gratitude, the stronger they become.

2.       They don’t give away their power.  There is always a buffer between the stimulus and their response. They do not let others offend them, turn them or trigger a knee-jerk reaction. Retaining their power is about being confident about who they are and the choices they make. Identifying the people who have taken their power and reframing their language helps them in this regard.

3.       They don’t shy away from change. Managing change can be daunting but the successful create a success plan for the change. They behave like the person they want to become. Balancing emotions and rational thoughts help make it easier.

4.       They don’t focus on things they can’t control. They develop a balanced sense of control. They identify their fears. They focus on what they can do which includes influencing people even without controlling them. Insisting on doing everything by themselves goes against their practice.

5.       They don’t worry about pleasing everyone. They identify their values and behave accordingly. They make a note of who they want to please, and it does not include everybody. They practice tolerating uncomfortable emotions.

6.       They don’t fear taking calculated risks. They are aware of the emotional reactions to risk taking and they identify the types of risks that are particularly challenging. They analyze risks before they decide. They also monitor the results so they can learn from each risk.

7.       They don’t dwell on the past. They reflect on the past just enough to learn from it. They move forward even if it is painful. Working through the grief lets them focus on the present and plan. They also find ways to make peace with the past, but they never pretend that it did not happen. They don’t try to undo the past or make up for past mistakes.

8.       They don’t make the same mistakes repeatedly. They acknowledge their personal responsibility for each mistake and even create a written plan to avoid repeating it. They identify the triggers and the warning signs for old behavior patterns and practice self-discipline strategies. They never make excuses or respond impulsively. They never put themselves in situations where they are likely to fail. Resisting temptation is one way to avoid repeating mistakes.

9.       They don’t resent other people’s success. They replace negative thoughts that breed resentment. They celebrate accomplishments, focus on strengths and co-operate rather than compete with everyone. They do not compare themselves to everyone around them or treat them as direct competition.

10.   They don’t give up after the first failure. They view failure as a learning opportunity, and they resolve to try again. They identify and replace irrational thoughts, and they focus on improving their skills rather than showing them off. They do not quit or assume that future attempts will turn out the same as past ones.

11.   They don’t fear alone time. They learn how to appreciate silence and to be alone with their thoughts. They schedule a date with themselves at least once a month and practice mindfulness and meditation regularly. They do not indulge in beliefs that limit them and they do not always keep background noise on.

12.   They don’t feel the world owes them anything. They develop healthy amounts of self-esteem, and they recognize areas of life where they believe they are superior. They focus on what they must give rather than what they must take. They think about other people’s feelings. They are certainly not selfish or egotistical.

13.   They don’t expect immediate results. Instead, they create realistic expectations, find accurate ways to measure progress, and celebrate milestones along the way. They don’t limit themselves to believing that if it is not working for them now, they are not making progress. They don’t look for shortcuts.

And finally, a conclusion on maintaining mental strength: it is a continuous process in which they monitor their behavior, regulate their emotions, and reflect on their thoughts.

Sunday, November 28, 2021

Part-2

 The problem might not be the cluster nodes but the containers or pods which might be resource-constrained. If the pods also appear healthy, then adding more pods will not solve the problem.

Application Insights might show that the duration of the workflow service's process operation is 246 ms. The query can even request a breakdown of the processing time for each of the calls to the three backend services. The individual processing times for each of these services may also appear reasonable, leaving the shortfall in request processing unexplained. One key observation is that an overall processing time of about 250 ms represents a fixed cost that puts an upper bound on how fast messages can be processed in serial. The key to increasing throughput is therefore to process messages in parallel. The delays appear to come from network round-trip time spent waiting on I/O completions and, fortunately, the orders in the request queue are independent of each other. These two factors allow parallelism to be increased, which is demonstrated by raising MaxConcurrentCalls from its initial value of 1 to 20 and PrefetchCount from its initial value of 0 to 3000 so that messages are buffered in a local cache.

The best practices for performance improvement with Service Bus messaging recommend keeping an eye on the dead-letter queue. Service Bus atomically retrieves and locks a message as it is processed so that it is not delivered to other receivers. When the lock expires, the message can be delivered to other receivers, and after a configurable maximum number of delivery attempts, Service Bus puts the message in the dead-letter queue for later examination. The workflow service is now prefetching a large batch of 3000 messages, so the total time to process each message grows, messages time out, go back into the queue, and eventually reach the dead-letter queue. This behavior can also be tracked via the MessageLockLostException. The symptom is mitigated by setting the lock duration to 5 minutes to prevent lock timeouts. The plot of incoming and outgoing messages then confirms that the system is keeping up with the rate of incoming messages. The results of the performance load test show that over the total duration of 8 minutes, the application completed 25,000 operations with a peak throughput of 72 operations per second, representing a 400% increase in maximum throughput.
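As a rough sketch of what those settings look like in code (the article does not show its implementation, and the SDK version is an assumption on my part), here is how a message processor built on the Azure.Messaging.ServiceBus SDK would raise concurrency and prefetch while keeping message locks renewed:

```csharp
// Illustrative sketch: parallel message processing with prefetch and lock renewal.
using System;
using System.Threading.Tasks;
using Azure.Messaging.ServiceBus;

public static class WorkflowHost
{
    public static async Task StartAsync(string connectionString, string queueName)
    {
        var client = new ServiceBusClient(connectionString);

        var options = new ServiceBusProcessorOptions
        {
            MaxConcurrentCalls = 20,    // was effectively 1: now process up to 20 messages in parallel
            PrefetchCount = 3000,       // buffer a large batch locally instead of fetching one message at a time
            // Client-side lock renewal, complementing the queue-level lock duration
            // mentioned above, so in-flight messages do not hit MessageLockLostException.
            MaxAutoLockRenewalDuration = TimeSpan.FromMinutes(5),
            AutoCompleteMessages = true
        };

        ServiceBusProcessor processor = client.CreateProcessor(queueName, options);

        processor.ProcessMessageAsync += async args =>
        {
            // Coordinate the workflow steps here: call the package, delivery,
            // and drone scheduler services for the order in args.Message.
            await Task.CompletedTask;
        };

        processor.ProcessErrorAsync += args =>
        {
            Console.WriteLine(args.Exception);
            return Task.CompletedTask;
        };

        await processor.StartProcessingAsync();
    }
}
```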

While this solution clearly works, repeating the experiment over a much longer period shows that the application cannot sustain this rate. The container metrics show that the maximum CPU utilization was close to 100%; at this point, the application appears to be CPU bound. So scaling out the cluster might now increase performance, unlike earlier. The new configuration includes 12 cluster nodes, with 3 replicas for the ingestion service, 6 replicas for the workflow service, and 9 replicas for the package, delivery, and drone scheduler services.

To recap, the bottlenecks identified include out-of-memory exceptions in Azure Cache for Redis, lack of parallelism in message processing, insufficient message lock duration (leading to lock timeouts and messages being placed in the dead-letter queue), and CPU exhaustion. The metrics used to detect these bottlenecks include the rate of incoming and outgoing Service Bus messages, the application map in Application Insights, errors and exceptions, custom Log Analytics queries, and CPU and memory utilization for containers.

Saturday, November 27, 2021

Part-1

 This is a continuation of an article that describes operational considerations for hosting solutions on the Azure public cloud.

There are several references to best practices throughout the series of articles we wrote from the documentation for the Azure Public Cloud. The previous article focused on the antipatterns to avoid, specifically the noisy neighbor antipattern. This article focuses on performance tuning for distributed business transactions.

An example of an application using distributed transactions is a drone delivery application that runs on Azure Kubernetes Service.  Customers use a web application to schedule deliveries by drone. The backend services include a delivery service manager that manages deliveries, a drone scheduler that schedules drones for pickup, and a package service manager that manages packages. The orders are not processed synchronously.  An ingestion service puts the orders on a queue for processing and a workflow service coordinates the steps in the workflow.

Performance tuning begins with a baseline usually established with a load test. In this case, a six node AKS cluster with three replicas for each microservice was deployed for a step load test where the number of simulated users was stepped up from two to forty.

Since users get back a response the moment their request is put on a queue, request latency is not the useful thing to study; instead, performance improvements become necessary when the backend cannot keep up with the request rate as the number of users increases. A plot of incoming and outgoing messages serves this purpose. When outgoing messages fall severely behind incoming messages, action needs to be taken, and which action depends on the errors encountered at the time, since they indicate ongoing systemic issues. For example, the workflow service might be getting errors from the Delivery service. Let us say the errors indicate that an exception is being thrown due to memory limits in Azure Cache for Redis.
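To make the shape of the system concrete, here is a minimal sketch of the ingestion side, assuming an ASP.NET Core controller and the Azure.Messaging.ServiceBus SDK (the request type and route are placeholders of my own): the order is enqueued and the caller gets an immediate acknowledgement, which is why queue depth rather than request latency is the signal to watch.

```csharp
// Illustrative ingestion endpoint: enqueue the order and return right away.
using System.Text.Json;
using System.Threading.Tasks;
using Azure.Messaging.ServiceBus;
using Microsoft.AspNetCore.Mvc;

public record DeliveryRequest(string OrderId, string Pickup, string Dropoff);  // placeholder shape

[ApiController]
[Route("api/deliveryrequests")]
public class IngestionController : ControllerBase
{
    private readonly ServiceBusSender _sender;

    public IngestionController(ServiceBusSender sender) => _sender = sender;

    [HttpPost]
    public async Task<IActionResult> Post([FromBody] DeliveryRequest request)
    {
        // Put the order on the queue; the workflow service picks it up later.
        var message = new ServiceBusMessage(JsonSerializer.Serialize(request));
        await _sender.SendMessageAsync(message);

        // The user gets a response the moment the request is queued.
        return Accepted();
    }
}
```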

Once the cache issue is resolved, a lot of the internal errors seen in the log go away, but the outbound responses still lag the incoming requests by an order of magnitude. A Kusto query over the logs, computing the throughput of completed messages from data points sampled at 5-second intervals, indicates that the backend is a bottleneck. This can be alleviated by scaling out the backend services (package, delivery, and drone scheduler) to see if throughput increases. The number of replicas is increased from 3 to 6. The load test shows only modest improvement: outgoing messages are still not keeping up with incoming messages. Azure Monitor for Containers indicates that the problem is not resource exhaustion, because CPU is underutilized at less than 40% even at the 95th percentile and memory utilization is under 20%. The problem might not be the cluster nodes but the containers or pods, which might be resource-constrained. If the pods also appear healthy, then adding more pods will not solve the problem.


Friday, November 26, 2021

This is a continuation of an article that describes operational considerations for hosting solutions on Azure public cloud.    

There are several references to best practices throughout the series of articles we wrote from the documentation for the Azure Public Cloud. The previous article focused on the antipatterns to avoid, specifically the improper instantiation antipattern. This one focuses on busy database antipattern. 

A database stores data, but frequently some code that performs calculations on that data is stored alongside it, in the form of stored procedures and triggers. There are many advantages to running code local to the data, since it avoids transmitting the data to a client application for processing. But overuse of this feature can hurt performance, because the server spends more time processing and less time accepting new client requests and fetching data. A database is also a shared resource, and it might deny resources to other requests when one of them consumes a lot of compute. Runtime costs might shoot up if the database is metered. A database may have finite capacity to scale up; compute resources are better suited to hosting complicated logic, while storage products are optimized for large disk space. The antipattern occurs when the database is used to host a service rather than act as a repository, or when it is used to format data, manipulate data, or perform complex calculations. Developers trying to overcompensate for the extraneous fetching antipattern often write complex queries that take significantly longer to run but produce a small amount of data. Stored procedures are used to encapsulate business logic because they are considered easier to maintain and update, and this is how they lead to the antipattern.

This antipattern can be fixed in one of several ways. First, the processing can be moved out of the database into an Azure Function or some application tier. As long as the database is confined to data access operations, using only the capabilities the database is optimized for, it will not manifest this antipattern. Queries can be simplified to a proper select statement that merely retrieves the data, with the help of joins where necessary. The application then uses the .NET Framework APIs to run standard query operators on that data.
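A small sketch of what that split can look like, assuming a hypothetical Orders table mapped with Entity Framework Core (none of these names come from the article): the database runs a plain, index-friendly SELECT, and the application tier applies the standard LINQ query operators.

```csharp
// Illustrative only: keep the database query simple; compute in the application tier.
using System;
using System.Linq;
using Microsoft.EntityFrameworkCore;

public class Order
{
    public int Id { get; set; }
    public DateTime PlacedAt { get; set; }
    public decimal Total { get; set; }
}

public class StoreContext : DbContext
{
    public StoreContext(DbContextOptions<StoreContext> options) : base(options) { }
    public DbSet<Order> Orders => Set<Order>();
}

public static class SalesReport
{
    public static decimal MonthlyTotal(StoreContext db, int year, int month)
    {
        // The database only filters and returns rows; no procedural logic runs server-side.
        var totals = db.Orders
            .Where(o => o.PlacedAt.Year == year && o.PlacedAt.Month == month)
            .Select(o => o.Total)
            .ToList();

        // Business calculations and formatting happen in the application tier.
        return totals.Sum();
    }
}
```

Whether the summation itself stays in the database or moves out is a tradeoff; the point here is that complex formatting and business logic no longer run inside stored procedures.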

Database tuning is an important routine for many organizations. The introduction of long-running queries and stored procedures often works against the benefits of a tuned database. If the processing is already well served by database tuning techniques, it should not be moved.

Avoiding unnecessary data transfer solves both this antipattern as well as chatty I/O antipattern. When the processing is moved to the application tier, it provides the opportunity to scale out rather than require the database to scale up. 

Detection of this antipattern is easier with the monitoring tools and the built-in supportability features of the database. If the database activity reveals significant processing and very low data emission, it is likely that this antipattern is manifesting. 

Examining the work performed by the database in terms of transaction units, the number of queries processed, and the data throughput, narrowed down by caller, may reveal the database objects that are likely to be causing this antipattern.

Finally, periodic assessments must be performed with the database. 

Thursday, November 25, 2021

 This is a continuation of an article that describes operational considerations for hosting solutions on the Azure public cloud.  

There are several references to best practices throughout the series of articles we wrote from the documentation for the Azure Public Cloud. The previous article focused on the antipatterns to avoid, specifically the retry storm antipattern. This one focuses on the noisy neighbor antipattern. 


This antipattern occurs when one or more tenants starve other tenants by holding a disproportionate share of critical resources from a shared pool meant for all tenants. The noisy neighbor problem occurs when one tenant causes problems for another. Some common examples of resource-intensive operations include retrieving or persisting data to a database, sending a request to a web service, posting a message to or retrieving a message from a queue, and writing to or reading from a file in a blocking manner. There are advantages to running dedicated calls, especially for debugging and troubleshooting purposes, because the calls do not interfere with one another, but multi-tenancy enables reuse of the same components. Overuse of shared components can hurt performance when some tenants consume resources in a way that starves other tenants. It appears notably when components perform blocking, synchronous I/O, for example when the application uses a library that exposes only synchronous methods. The base tier may have finite capacity to scale up. Compute resources are better suited to scale out rather than scale up, and one of the primary advantages of a clean separation of layers with asynchronous processing is that they can be hosted independently. Container orchestration frameworks facilitate this very well. As an example, the frontend can issue a request and wait for a response without having to delay the user experience. It can use model-view-controller paradigms so that views are not only fast but can also be hosted such that tenants using one view model do not affect the others.

This antipattern can be fixed in one of several ways. First, the processing can be moved out of the application tier into an Azure Function or some background API layer. Tenants are given service-level promises and are actively monitored. If the application frontend is confined to data input and output display operations, using only the capabilities that the frontend is optimized for, it will not manifest this antipattern. APIs and queries can articulate the business-layer interactions so that tenants find the system responsive while it retains control over how the work is performed. Many libraries and components provide both synchronous and asynchronous interfaces; these can be used judiciously, with the asynchronous pattern working for most API calls. Finally, limits and throttling can be applied; application gateway and firewall rules can handle restrictions for specific tenants.
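Limits and throttling per tenant can be as simple as bounding concurrency. The following is a minimal, self-contained sketch (not an Azure feature, just an illustration of the idea) that gives each tenant its own semaphore so that one noisy tenant cannot monopolize calls into a shared backend.

```csharp
// Illustrative per-tenant concurrency cap.
using System;
using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Tasks;

public class TenantThrottle
{
    private readonly int _maxConcurrentPerTenant;
    private readonly ConcurrentDictionary<string, SemaphoreSlim> _gates = new();

    public TenantThrottle(int maxConcurrentPerTenant = 5)
        => _maxConcurrentPerTenant = maxConcurrentPerTenant;

    public async Task<T> RunAsync<T>(string tenantId, Func<Task<T>> operation)
    {
        var gate = _gates.GetOrAdd(tenantId, _ => new SemaphoreSlim(_maxConcurrentPerTenant));
        await gate.WaitAsync();
        try
        {
            // The shared resource (database, downstream API, queue) is called here.
            return await operation();
        }
        finally
        {
            gate.Release();
        }
    }
}
```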

The introduction of long running queries and stored procedures, blocking I/O and network waits often goes against the benefits of a responsive multi-tenant service. If the processing is already under the control of the service, then it can be optimized further. 

There are several ways to fix this antipattern, and they are about detection and remedy. The remedies include capping the number of tenant attempts and preventing retrying over a long period of time. Tenant calls can use an exponential backoff strategy that increases the duration between successive calls, handle errors gracefully, and apply the circuit breaker pattern, which is specifically designed to break a retry storm. Official SDKs for communicating with Azure services already include sample implementations of retry logic. When I/O requests are numerous, they can be batched into coarser requests. The database can be read with one query substituting for many queries, which also gives the database an opportunity to execute the work better and faster. Web APIs can be designed with REST best practices: instead of separate GET methods for different properties, there can be a single GET method for the resource representing the object, and even if the response body is large, it will likely be a single request. File I/O can be improved with buffering and caching, and files need not be opened and closed repeatedly, which also helps reduce fragmentation of the file on disk.

When more information is retrieved via fewer I/O calls and fewer retries, this operational necessary evil becomes less risky, but there is also a risk of falling into the extraneous fetching antipattern. The right tradeoff depends on the usage. It is also important to read only as much as necessary, to keep down both the size and the frequency of calls and their retries for tenants. Sometimes, data can also be partitioned into two chunks: frequently accessed data that accounts for most requests, and less frequently accessed data that is used rarely. When data is written, resources need not be locked at too large a scope or for too long a duration. Tenant calls, limits, and throttling can also be prioritized so that only the higher-priority tenant calls go through.


Wednesday, November 24, 2021

 

This is a continuation of an article that describes operational considerations for hosting solutions on the Azure public cloud. 

There are several references to best practices throughout the series of articles we wrote from the documentation for the Azure Public Cloud. The previous article focused on the antipatterns to avoid, specifically the monolithic persistence antipattern. This one focuses on the Retry storm antipattern.

When I/O requests fail due to transient errors, services must retry their calls. Retries help overcome errors, throttling, and rate limits, and avoid surfacing operational errors that would require user intervention. But when the number or duration of retries is not governed, the retries become frequent and numerous, which can have a significant impact on performance and responsiveness. Network calls and other I/O operations are much slower than compute tasks. Each I/O request has significant overhead as it travels up and down the networking stack on the local and remote hosts, including the round-trip time, and the cumulative effect of numerous I/O operations can slow down the system. There are some manifestations of the retry storm.

Reading and writing individual records to a database as distinct requests – When records are fetched one at a time, a series of queries is run one after another to get the information. The problem is exacerbated when an Object-Relational Mapper hides this behavior underneath the business logic and each entity is retrieved over several queries; the same can happen when writing an entity. When each of these queries is wrapped in its own retry logic, failures can amplify into a large number of calls and cause severe problems.

Implementing a single logical operation as a series of HTTP requests – This occurs when objects residing on a remote server are represented as a proxy in the memory of the local system. The code appears as if an object is modified locally, when in fact every modification comes with at least the cost of a round trip. When there are many network round trips, the cost is cumulative and can even be prohibitive. It is easily observable when a proxy object has many properties and each property get/set requires a relay to the remote object. In such a case, there is also the requirement to perform validation after every access.

Reading and writing to a file on disk – File I/O also hides the distributed nature of interconnected file systems. Every byte written to a file on a mount must be relayed to the original on the remote server. When the writes are numerous, the cost accumulates quickly, and it is even more noticeable when the writes are only a few bytes and frequent. When individual requests are wrapped in a retry, the number of calls can rise dramatically.

There are several ways to fix the problem, and they are about detection and remedy. The remedies include capping the number of retry attempts and preventing retrying over a long period of time. The retries can use an exponential backoff strategy that increases the duration between successive calls, handle errors gracefully, and apply the circuit breaker pattern, which is specifically designed to break the retry storm. Official SDKs for communicating with Azure services already include sample implementations of retry logic. When I/O requests are numerous, they can be batched into coarser requests. The database can be read with one query substituting for many queries, which also gives the database an opportunity to execute the work better and faster. Web APIs can be designed with REST best practices: instead of separate GET methods for different properties, there can be a single GET method for the resource representing the object, and even if the response body is large, it will likely be a single request. File I/O can be improved with buffering and caching, and files need not be opened and closed repeatedly, which also helps reduce fragmentation of the file on disk.
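The capped, exponentially backed-off retry described above can be sketched generically as follows (the official Azure SDKs ship their own retry policies; this is only an illustration of the shape, with hypothetical defaults):

```csharp
// Illustrative capped retry with exponential backoff.
using System;
using System.Threading.Tasks;

public static class Retry
{
    public static async Task<T> ExecuteAsync<T>(
        Func<Task<T>> operation,
        int maxAttempts = 4,       // cap the number of attempts
        int baseDelayMs = 200)     // first backoff interval
    {
        for (int attempt = 1; ; attempt++)
        {
            try
            {
                return await operation();
            }
            catch (Exception) when (attempt < maxAttempts)
            {
                // In real code, only transient errors should be retried.
                // The backoff doubles each time: 200 ms, 400 ms, 800 ms, ...
                var delay = TimeSpan.FromMilliseconds(baseDelayMs * Math.Pow(2, attempt - 1));
                await Task.Delay(delay);
            }
        }
    }
}
```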

When more information is retrieved via fewer I/O calls and fewer retries, this operational necessary evil becomes less risky, but there is also a risk of falling into the extraneous fetching antipattern. The right tradeoff depends on the usage. It is also important to read only as much as necessary, to keep down both the size and the frequency of calls and their retries. Sometimes, data can also be partitioned into two chunks: frequently accessed data that accounts for most requests, and less frequently accessed data that is used rarely. When data is written, resources need not be locked at too large a scope or for too long a duration. Retries can also be prioritized so that only the lower-scope retries are issued for idempotent workflows.

Monday, November 22, 2021

This is a continuation of an article that describes operational considerations for hosting solutions on Azure public cloud.   

There are several references to best practices throughout the series of articles we wrote from the documentation for the Azure Public Cloud. The previous article focused on the antipatterns to avoid, specifically the busy frontend antipattern. This one focuses on monolithic persistence antipattern.

This antipattern occurs when a single data store hurts performance due to resource contention. Additionally, the use of multiple data stores can help with virtualization of data and queries.

A specific example of this antipattern is when applications save transactional records, logs, metrics and events to the same database. The online transaction processing benefits from a relational store but logs and metrics can be moved to a log index store and time-series database respectively. Usually, a single datastore works well for transactional data but this does not mean documents need to be stored in the same data store. An object storage or document database can be used in addition to a regular transactional database to allow individual documents to be shared without any impact to the business operations. Each document can then have its own web accessible address.

This antipattern can be fixed in one of several ways. First, the data types must be listed and their corresponding data stores assigned. Many data types can be bound to the same database, but when they are different, they should be sent to the data stores that handle them best. Second, the data access patterns for each data type must be analyzed; if the data type is a document, a Cosmos DB instance is a good choice. Third, if a database instance is not suitable for all the data access patterns of a given data type, it must be scaled up; a premium SKU will likely benefit this case.

Detection of this antipattern is easier with the monitoring tools and the built-in supportability features of the database layer. If the database activity reveals significant processing, contention and very low data rate, it is likely that this antipattern is manifesting.

Examining the work performed by the database in terms of data types, narrowed down by callers and scenarios, may reveal the culprits that are likely to be causing this antipattern.

Finally, periodic assessments must be performed on the data storage tier.

 

Sunday, November 21, 2021

 

This is a continuation of an article that describes operational considerations for hosting solutions on Azure public cloud.   

There are several references to best practices throughout the series of articles we wrote from the documentation for the Azure Public Cloud. The previous article focused on the antipatterns to avoid, specifically the busy database antipattern. This one focuses on busy frontend antipattern.

This antipattern occurs when many background threads starve foreground tasks of resources, which decreases response times to unacceptable levels. There are many advantages to running background jobs, which avoid tying up interactive processing and can be scheduled asynchronously. But overuse of this feature can hurt performance, because the tasks consume resources that foreground workers need for interactivity with the user, leading to spinning waits and frustration for the user. It appears notably when the frontend is monolithic, compressing the business tier into the application frontend. Runtime costs might shoot up if this tier is metered. An application tier may have finite capacity to scale up. Compute resources are better suited to scale out rather than scale up, and one of the primary advantages of a clean separation of layers and components is that they can be hosted independently. Container orchestration frameworks facilitate this very well. The frontend can be kept as lightweight as possible and built on model-view-controller or similar paradigms so that it is not only fast but can also be hosted in separate containers that scale out.

This antipattern can be fixed in one of several ways. First, the processing can be moved out of the application tier into an Azure Function or some background API layer. If the application frontend is confined to data input and output display operations, using only the capabilities that the frontend is optimized for, it will not manifest this antipattern. APIs and queries can articulate the business-layer interactions. The application then uses the .NET Framework APIs to run standard query operators on the data for display purposes.

The UI is designed for purposes specific to the application. The introduction of long-running queries and stored procedures often works against the benefits of a responsive application. If the processing is already handled appropriately within the application, it should not be moved.

Avoiding unnecessary data transfer solves both this antipattern as well as chatty I/O antipattern. When the processing is moved to the business tier, it provides the opportunity to scale out rather than require the frontend to scale up.

Detection of this antipattern is easier with the monitoring tools and the built-in supportability features of the application layer. If the frontend activity reveals significant processing and very low data emission, it is likely that this antipattern is manifesting.

Examining the work performed by the frontend in terms of latency and page load times, narrowed down by callers and scenarios, may reveal the view models that are likely to be causing this antipattern.

Finally, periodic assessments must be performed on the application tier.

Saturday, November 20, 2021

 

This is a continuation of an article that describes operational considerations for hosting solutions on Azure public cloud.

There are several references to best practices throughout the series of articles we wrote from the documentation for the Azure Public Cloud. The previous article focused on the antipatterns to avoid, specifically the cloud readiness antipatterns. This article focuses on the extraneous fetching antipattern.

When services call data stores, they retrieve data for a business operation, but they often incur unnecessary I/O overhead and reduced responsiveness. This antipattern can occur if the application tries to save on the number of requests by fetching more than is required. This is a form of overcompensation and is commonly seen with catalog operations because the filtering is delegated to the middle tier. For example, a user may need to see only a subset of the details and probably does not need to see all the products at once, yet a large dataset is retrieved from the catalog. Even if the user is browsing the entire catalog, paginating the results avoids this antipattern.

Another example of this problem is an inappropriate choice in design or code where, for example, a service gets all the product details via the Entity Framework and then keeps only a subset of the fields while discarding the rest. Yet another example is when the application retrieves data to perform an aggregation that could be done by the database instead: the application calculates total sales by getting every record for all orders sold, instead of executing a query where the predicates are pushed down to the store. Similar manifestations can come about when Entity Framework uses LINQ to Entities. In that case, the filtering is done in memory, by retrieving the full results from the table, because a certain method in the predicate could not be translated to a query. The call to AsEnumerable is a hint that there is a problem, because filtering based on IEnumerable is done on the client side rather than in the database. The default for LINQ to Entities is IQueryable, which pushes the filters down to the data source.
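The difference is easy to see side by side. Below is a minimal sketch with an assumed Product entity and EF Core context (not the article's code): the first query calls AsEnumerable, so the whole table is materialized and filtered in application memory; the second stays IQueryable, so the filter is translated to SQL and runs in the database.

```csharp
// Illustrative contrast between client-side and database-side filtering.
using System.Linq;
using Microsoft.EntityFrameworkCore;

public class Product
{
    public int Id { get; set; }
    public string Name { get; set; } = "";
    public decimal Price { get; set; }
}

public class CatalogContext : DbContext
{
    public CatalogContext(DbContextOptions<CatalogContext> options) : base(options) { }
    public DbSet<Product> Products => Set<Product>();
}

public static class CatalogQueries
{
    // Antipattern: the entire Products table is fetched before filtering.
    public static int CountCheapProductsInMemory(CatalogContext db, decimal limit)
        => db.Products.AsEnumerable().Where(p => p.Price < limit).Count();

    // Fix: the predicate remains part of the IQueryable and executes in the database.
    public static int CountCheapProductsInDatabase(CatalogContext db, decimal limit)
        => db.Products.Where(p => p.Price < limit).Count();
}
```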

Fetching all the columns from a table when only a few are relevant is another classic example of this antipattern; it might have worked when the table was only a few columns wide, but it changes the game when the table grows by several more columns. Similarly, performing aggregation in the database instead of in memory on the application side overcomes this antipattern.
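Continuing the assumed CatalogContext sketch from above, projection and server-side aggregation look like this: only the needed columns are selected, and the sum is computed by the database rather than in application memory.

```csharp
// Illustrative projection and database-side aggregation.
using System.Collections.Generic;
using System.Linq;

public static class CatalogReports
{
    public class ProductListItem
    {
        public int Id { get; set; }
        public string Name { get; set; } = "";
    }

    // Only Id and Name appear in the generated SELECT, not every column of the table.
    public static List<ProductListItem> GetProductList(CatalogContext db)
        => db.Products
             .Select(p => new ProductListItem { Id = p.Id, Name = p.Name })
             .ToList();

    // Translated to SELECT SUM(Price): the aggregate is computed by the database.
    public static decimal TotalCatalogValue(CatalogContext db)
        => db.Products.Sum(p => p.Price);
}
```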

As with data access best practices, some considerations for performance hold true here as well. Partitioning data horizontally may reduce contention. Operations that support unbounded queries can implement pagination. Features that are built into the data store can be leveraged. Some calculations, especially summations, need not be repeated. Queries that return a lot of results can be further filtered. Not all operations can be offloaded to the database, but those for which the database is highly optimized should be.

A few ways to detect this antipattern include identifying slow workloads or transactions, observing the behavioral patterns exhibited by the system due to limits, correlating the instances of slow workloads with those patterns, identifying the data stores being used, identifying any slow-running queries that reference those data sources, and performing a resource-specific analysis of how the data is used and consumed.

These are some of the ways to mitigate this antipattern.

Some of the metrics that help with detecting and mitigation of extraneous fetching antipattern include total bytes per minute, average bytes per transaction and requests per minute.

Friday, November 19, 2021

 

This is a continuation of an article that describes operational considerations for hosting solutions on Azure public cloud. 

There are several references to best practices throughout the series of articles we wrote from the documentation for the Azure Public Cloud. The previous article focused on the antipatterns to avoid, specifically the Chatty I/O antipattern. This one focuses on the improper instantiation antipattern.

When new instances of classes are continually created instead of being created once and reused, they can have a significant impact on performance and responsiveness. Connections and clients are costly resources to set up; they should be created once and reused. Each connection or client instantiation requires server handshakes, which not only incur network delay but also involve memory usage, and the cumulative effect of numerous setup requests can slow down the system. There are some common causes of improper instantiation, which include:

Connections and clients created for the purpose of a single data access request – When they are scoped to one request for the sake of cleanup, every request forces the server to perform a new handshake. Reading and writing individual records to a database as distinct requests makes this worse: when records are fetched one at a time, a series of queries is run one after another to get the information, and when the shared libraries in use hide this behavior, each access request recreates a connection or a client. The same might happen on write requests.

Implementing a single logical operation as a series of data access requests – This occurs when objects use wrappers for connections and clients that are scoped to the methods invoking them, which results in connections and clients being disposed of often. The code appears as if a wrapper is used locally, when in fact every instantiation of the wrapper comes with at least the cost of a round trip. When there are many network round trips, the cost is cumulative and can even be prohibitive. It is easily observable when a wrapper has many instantiations and each one creates a connection or client. In such a case, there is also the requirement to perform validation after every access.

Reading and writing to a file on disk – File I/O also hides the distributed nature of interconnected file systems.  Every byte written to a file on a mount must be relayed to the original on the remote server. When the writes are several, the cost accumulates quickly. It is even more noticeable when the writes are only a few bytes and frequent. If each access requires its own connection or client, the application might not even know the high number of connections it is making.

There are several ways to fix the problem, and they are about detection and remedy. When server handshakes are numerous, they can be reduced by reusing connections through shared clients or connection pooling. The database can be read with a shared, reusable connection pool rather than a new connection per request, which also gives the database an opportunity to free up memory corresponding to client connections. Web APIs can be designed with REST best practices: instead of separate GET methods for different properties, there can be a single GET method for the resource representing the object.
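The simplest version of the fix is a shared, long-lived client. The sketch below (the endpoint is a placeholder) contrasts a reused HttpClient with the per-request instantiation that causes the problem.

```csharp
// Illustrative shared client: created once per process and reused.
using System;
using System.Net.Http;
using System.Threading.Tasks;

public static class CatalogServiceClient
{
    // HttpClient is designed to be shared; the underlying connections are pooled and reused.
    private static readonly HttpClient Http = new HttpClient
    {
        BaseAddress = new Uri("https://example.org/")   // placeholder endpoint
    };

    public static Task<string> GetProductAsync(string id)
        => Http.GetStringAsync($"api/products/{id}");

    // Antipattern, for contrast: a new client (handshake, socket, memory) on every call.
    public static async Task<string> GetProductTheWrongWayAsync(string id)
    {
        using var client = new HttpClient();
        return await client.GetStringAsync($"https://example.org/api/products/{id}");
    }
}
```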

When more information is retrieved via fewer connection and client instantiations, there is a risk of falling into the extraneous fetching antipattern by trying to prefetch more than is necessary. The right tradeoff depends on the usage. It is also important to read only as much as necessary, to keep down both the size and the frequency of connections. Sometimes, connections and clients can also use a mixed mode: shared for the accounts with the most requests and dedicated for everything else. When a connection is reused from a shared pool, it need not be locked at too large a scope or for too long a duration.

 

 

 

 

Thursday, November 18, 2021

 

This is a continuation of an article that describes operational considerations for hosting solutions on Azure public cloud.

There are several references to best practices throughout the series of articles we wrote from the documentation for the Azure Public Cloud. The previous article focused on the antipatterns to avoid, specifically the extraneous fetching antipattern. This one focuses on the Chatty I/O antipattern.

When I/O requests are frequent and numerous, they can have a significant impact on performance and responsiveness. Network calls and other I/O operations are much slower compared to compute tasks. Each I/O request has a significant overhead as it travels up and down the networking stack on local and remote and includes the round trip time, and the cumulative effect of numerous I/O operations can slow down the system. There are some common causes of chatty I/O which include:

Reading and writing individual records to a database as distinct requests – When records are often fetched one at a time, then a series of queries are run one after the other to get the information. It is exacerbated when the Object-Relational Mapping hides the behavior underneath the business logic and each entity is retrieved over several queries. The same might happen on write for an entity.

Implementing a single logical operation as a series of HTTP requests – This occurs when objects residing on a remote server are represented as a proxy in the memory of the local system. The code appears as if an object is modified locally, when in fact every modification comes with at least the cost of a round trip. When there are many network round trips, the cost is cumulative and can even be prohibitive. It is easily observable when a proxy object has many properties and each property get/set requires a relay to the remote object. In such a case, there is also the requirement to perform validation after every access.

Reading and writing to a file on disk – File I/O also hides the distributed nature of interconnected file systems.  Every byte written to a file on a mount must be relayed to the original on the remote server. When the writes are several, the cost accumulates quickly. It is even more noticeable when the writes are only a few bytes and frequent.

There are several ways to fix the problem, and they are about detection and remedy. When I/O requests are numerous, they can be batched into coarser requests. The database can be read with one query substituting for many queries, which also gives the database an opportunity to execute the work better and faster. Web APIs can be designed with REST best practices: instead of separate GET methods for different properties, there can be a single GET method for the resource representing the object, and even if the response body is large, it will likely be a single request. File I/O can be improved with buffering and caching, and files need not be opened and closed repeatedly, which also helps reduce fragmentation of the file on disk.
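The buffering remedy for file I/O can be shown in a few lines. In this sketch (the path and record shape are illustrative), the chatty version opens and closes the file once per record, while the batched version writes the whole batch through one buffered writer.

```csharp
// Illustrative contrast between chatty and buffered file writes.
using System.Collections.Generic;
using System.IO;
using System.Threading.Tasks;

public static class DeliveryLog
{
    // Chatty version: one open/write/close cycle per record.
    public static async Task AppendOneAtATimeAsync(string path, IEnumerable<string> records)
    {
        foreach (var record in records)
            await File.AppendAllTextAsync(path, record + "\n");
    }

    // Batched version: a single buffered writer handles the whole batch,
    // flushing and closing once at the end.
    public static async Task AppendBatchAsync(string path, IEnumerable<string> records)
    {
        using var writer = new StreamWriter(path, append: true);
        foreach (var record in records)
            await writer.WriteLineAsync(record);
    }
}
```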

When more information is retrieved via fewer I/O calls, there is a risk of falling into the extraneous fetching antipattern. The right tradeoff depends on the usage. It is also important to read only as much as necessary, to keep down both the size and the frequency of calls. Sometimes, data can also be partitioned into two chunks: frequently accessed data that accounts for most requests, and less frequently accessed data that is used rarely. When data is written, resources need not be locked at too large a scope or for too long a duration.

 

 

Wednesday, November 17, 2021

 

This is a continuation of an article that describes operational considerations for hosting solutions on Azure public cloud.

There are several references to best practices throughout the series of articles we wrote from the documentation for the Azure Public Cloud. The previous article focused on the antipatterns to avoid, specifically the cloud readiness antipatterns. This one talks about design principles and advanced operations.

A management baseline provides a minimum level of business commitment for all supported workloads. It includes a standard business commitment to minimize business interruptions and accelerate recovery if service is interrupted. Usually it includes inventory and visibility, operational compliance, and protection and recovery – all of which provide streamlined operational management. It does not apply to mission critical workloads, but it covers 80% of the less critical workloads.

There are a few ways to go beyond the management baseline; these include an enhanced baseline, platform specialization, and workload specialization.

The enhanced management baseline uses cloud-native tools to improve uptime and decrease recovery times. It significantly reduces cost and implementation time.

Management specializations are aspects of workload and platform operations that require changes to design and architecture principles; these could take time and result in increased operating expenses. The enhanced management baseline applies broadly to many workloads, while specialization applies to specific cases. There are two areas of specialization: 1) platform specialization and 2) workload specialization. The former resolves key pain points in the platform and distributes the investment across multiple workloads, and the latter involves ongoing operations of a specific mission-critical workload.

In addition to these management baselines, there are a few steps that apply to each specialization process. These include improved system design, automated remediation, scaled solution, and continuous improvement. Improved system design is the most effective approach among these, and it applies universally to most operations of any platform. It increases stability and decreases impact from changes in business operations. Both the Cloud Adoption Framework and the Azure Well-architected framework provide guiding tenets for improving the quality of a platform or a specific workload with the five pillars of architecture excellence which include cost optimization, operational excellence, performance efficiency, reliability, and security.

Business interruptions create technical debt, and when it cannot be resolved through improved system design, automated remediation is an alternative. Using Azure Automation and Azure Monitor to detect trends and provide automated remediation is the most common approach. Similarly, a service catalog can list applications that can be deployed for internal consumption; a platform can then maximize adoption and minimize maintenance overhead through use of the service catalog.

 

 

Tuesday, November 16, 2021

 

This is a continuation of an article that describes operational considerations for hosting solutions on Azure public cloud.

There are several references to best practices throughout the series of articles we wrote from the documentation for the Azure Public Cloud. The previous article focused on the antipatterns to avoid, specifically the cloud readiness antipatterns. This article focuses on the no-caching antipattern.

 A no-caching antipattern occurs when a cloud application handles many concurrent requests, and they fetch the same data. Since there is contention for the data access, it can reduce performance and scalability. When the data is not cached, it leads to many manifestations of areas for improvement.

First, the fetching of data can traverse several layers and go deep into the stack taking significant resource consumption and increasing costs in terms of I/O overhead and latency. It repeatedly constructs the same objects or data structures.

Second, it makes excessive calls to a remote service that has a service quota and throttles clients past a certain limit.

Both these can lead to degradation in response times, increased contention, and poor scalability.

Examples of the no-caching antipattern are easy to spot. Entity Framework calls that repeatedly fetch the same read-only data fit this antipattern. The use of a cache might simply have been overlooked, but usually the cache could not be included in the design because of some unknowns: the benefits and drawbacks of using a cache were not clear, or there was concern about the accuracy and freshness of the cached data.

Other times, the cache was left out because the application was migrated from on-premises where network latency and response times were controlled. The system might have been running on expensive high-performance hardware unlike the commodity cloud virtual machine scale sets.

Rarely, it might even be the case that caching was simply left out of the architecture design, with the expectation that operations would add it later via standalone products, and this was not clearly communicated. Other times, the introduction of a cache might increase latency, maintenance, and ownership costs, and decrease overall availability. It might also interfere with existing caching strategies and expiration policies of the underlying systems. Some might prefer not to add an external cache to a database and to use one only as a sidecar for the web services. It is true that databases can cache even materialized views for a connection, but a cache lookup is cheap in all the cases where the compute in the deeper systems is costly and can be avoided.

There are two strategies to fix the problem. The first is the on-demand, or cache-aside, strategy: when the application tries to read the data, it checks the cache first, and if the data is not there, it retrieves the data from the source and puts it in the cache. When the application writes a change, it writes directly to the data source and removes the stale value from the cache; the cache entry is refilled the next time it is required.
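A minimal cache-aside sketch, assuming Microsoft.Extensions.Caching.Memory for the cache and hypothetical delegates for the underlying data access (the article does not prescribe a particular cache), looks like this:

```csharp
// Illustrative cache-aside: read through the cache, invalidate on write.
using System;
using System.Threading.Tasks;
using Microsoft.Extensions.Caching.Memory;

public class ProductReader
{
    private readonly IMemoryCache _cache;
    private readonly Func<string, Task<string>> _loadFromSource;    // hypothetical data access
    private readonly Func<string, string, Task> _saveToSource;      // hypothetical data access

    public ProductReader(IMemoryCache cache,
                         Func<string, Task<string>> loadFromSource,
                         Func<string, string, Task> saveToSource)
    {
        _cache = cache;
        _loadFromSource = loadFromSource;
        _saveToSource = saveToSource;
    }

    public async Task<string> GetAsync(string id)
    {
        // Try the cache first.
        if (_cache.TryGetValue(id, out string cached))
            return cached;

        // On a miss, fall back to the data source, then populate the cache.
        var value = await _loadFromSource(id);
        _cache.Set(id, value, TimeSpan.FromMinutes(5));
        return value;
    }

    public async Task UpdateAsync(string id, string newValue)
    {
        // Write to the source, then drop the stale cache entry;
        // the next read repopulates it.
        await _saveToSource(id, newValue);
        _cache.Remove(id);
    }
}
```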

Another strategy might be to always keep static resources in the cache with no expiration date. This is equivalent to CDN usage although CDNs are for distribution.  Applications that cache dynamic data should be designed to support eventual consistency.

No matter how the cache is implemented, it must support falling back to the underlying data access when the data is not available in the cache; combined with a circuit breaker, this fallback avoids overwhelming the data source.