Tuesday, May 17, 2022

This article continues a series on a crowdsourcing application and follows up on the most recent article in that series. The original problem statement is included again for context.

Social engineering applications provide a wealth of information to the end user, but the questions and answers exchanged on them are always limited to just that – the social circle. Advice solicited for personal circumstances is not appropriate for forums that remain in public view. It is also difficult to find the right forum or audience from which responses can be obtained quickly. When we want more opinions gathered discreetly, without the knowledge of those around us, the options become fewer and fewer. In addition, crowdsourcing opinions on a personal topic is not readily available via applications. This document tries to envision an application to meet this requirement.

The previous article continued the discussion of public cloud services for provisioning the queue, the document store, and the compute, and touched on the messaging platform required to support this social-engineering application. The problems encountered are well-defined and have precedents in various commercial applications. They are primarily about the feed for each user and the propagation of solicitations to the crowd. The previous article described selective fan-out. When clients wake up, they can request that their state be refreshed; this avoids pushing updates on every write, because the data does not need to be sent out immediately. If the queue sends messages back to the clients, it is a fan-out process. The devices can choose to check in at selective times, and the server can be selective about which clients to update. Both methods work well in certain situations. The fan-out happens on both writes and reads, and it can be made selective in either direction, limited during both push and pull. Disabling writes to all devices can significantly reduce cost; those devices simply load the updates the next time they read. It also helps to keep track of which clients have been active over a period so that only those clients get preference.
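As a minimal sketch of that selective fan-out, consider the following; the type and member names here (Update, SelectiveFanOut, the activity window) are hypothetical stand-ins rather than any particular library's API. On write, only recently active clients are pushed to, and everyone else picks up missed updates on their next check-in.

    // Minimal sketch of selective fan-out; all names here are hypothetical.
    using System;
    using System.Collections.Generic;
    using System.Linq;

    public record Update(string Topic, string Payload, DateTime CreatedAt);

    public class SelectiveFanOut
    {
        private readonly Dictionary<string, DateTime> _lastSeen = new();  // client id -> last check-in
        private readonly List<Update> _log = new();                       // retained updates for pull
        private readonly TimeSpan _activeWindow = TimeSpan.FromMinutes(15);

        public void RecordCheckIn(string clientId) => _lastSeen[clientId] = DateTime.UtcNow;

        // Fan-out on write: return only the clients active within the window;
        // the caller pushes to these, while everyone else loads lazily later.
        public IEnumerable<string> Publish(Update update)
        {
            _log.Add(update);
            var cutoff = DateTime.UtcNow - _activeWindow;
            return _lastSeen.Where(kv => kv.Value >= cutoff).Select(kv => kv.Key);
        }

        // Fan-out on read: a waking client pulls whatever it missed since its last sync.
        public IEnumerable<Update> Refresh(string clientId, DateTime since)
        {
            RecordCheckIn(clientId);
            return _log.Where(u => u.CreatedAt > since);
        }
    }

The design choice is that the write path stays cheap regardless of how many devices exist, and the cost of the remaining deliveries is paid only by the devices that actually come back to read.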

In this section, we talk about extraneous fetching. When services call data stores, they retrieve data for a business operation, but these calls often result in unnecessary I/O overhead and reduced responsiveness. This antipattern can occur when the application tries to save on the number of requests by fetching more than required. It is a form of overcompensation and is commonly seen with catalog operations, because the filtering is delegated to the middle tier. For example, a user may need to see only a subset of the details and probably does not need to see all the responses at once, yet a large dataset from the campaign is retrieved. Even when the user is browsing the entire campaign, paginating the results avoids this antipattern, as the sketch below shows.
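Here is a minimal pagination sketch assuming EF Core; CampaignContext, Response, and their members are hypothetical stand-ins for this application's store, and they are reused in the sketches that follow.

    // Hypothetical entity and context for the campaign responses.
    using System;
    using System.Collections.Generic;
    using System.Linq;
    using System.Threading.Tasks;
    using Microsoft.EntityFrameworkCore;

    public class Response
    {
        public int Id { get; set; }
        public int CampaignId { get; set; }
        public string Excerpt { get; set; } = "";
        public DateTime CreatedAt { get; set; }
    }

    public class CampaignContext : DbContext
    {
        public DbSet<Response> Responses => Set<Response>();
    }

    public static class ResponsePages
    {
        // Returns one bounded page instead of the whole campaign's responses.
        public static Task<List<Response>> GetPageAsync(
            CampaignContext db, int campaignId, int pageIndex, int pageSize = 50) =>
            db.Responses
              .Where(r => r.CampaignId == campaignId)
              .OrderBy(r => r.CreatedAt)        // a stable order keeps pages deterministic
              .Skip(pageIndex * pageSize)       // translates to OFFSET in SQL
              .Take(pageSize)                   // translates to FETCH NEXT / LIMIT
              .ToListAsync();
    }

Each page is then a single bounded round trip, no matter how large the campaign grows.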

Another example of this problem is an inappropriate choice in design or code where, for instance, a service fetches all the response details via Entity Framework and then keeps only a subset of the fields while discarding the rest. Yet another example is when the application retrieves data to perform an aggregation, such as a count of responses, that could be done by the database instead: the application calculates total sales by getting every record for all orders sold, rather than executing a query where the predicates are pushed down to the store. Similar manifestations can come about when Entity Framework uses LINQ to Entities. In that case, the filtering is done in memory by retrieving all the results from the table, because a certain method in the predicate could not be translated to a query. A call to AsEnumerable is a hint that there is a problem, because filtering over IEnumerable is done on the client side rather than in the database. The default for LINQ to Entities is IQueryable, which pushes the filters down to the data source.
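To make the AsEnumerable pitfall concrete, the following contrast reuses the hypothetical CampaignContext from the pagination sketch above. The first query filters in memory after pulling every row; the second stays on IQueryable so the database does the work.

    using System.Linq;

    public static class FilterExamples
    {
        // Anti-pattern: AsEnumerable() switches to LINQ to Objects, so every row
        // of Responses is streamed from the database before the Where clause runs.
        public static int CountInMemory(CampaignContext db, int campaignId) =>
            db.Responses
              .AsEnumerable()
              .Where(r => r.CampaignId == campaignId)
              .Count();

        // Remedy: stay on IQueryable so the filter and the count compose into a
        // single SELECT COUNT(*) ... WHERE CampaignId = @p executed by the store.
        public static int CountInDatabase(CampaignContext db, int campaignId) =>
            db.Responses.Count(r => r.CampaignId == campaignId);
    }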

Fetching only the relevant columns from a table, as compared to fetching all the columns, addresses another classic form of this antipattern; even though fetching everything might have worked when the table was only a few columns wide, it changes the game when the table adds several more columns. Similarly, performing aggregation in the database, instead of in memory on the application side, overcomes this antipattern, as illustrated below.
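The following sketch, again against the hypothetical CampaignContext, illustrates both remedies: projecting only the columns a view needs, and letting the database perform the aggregation.

    using System.Collections.Generic;
    using System.Linq;

    // Hypothetical DTO holding only what the view displays.
    public class ResponseSummary
    {
        public int Id { get; set; }
        public string Excerpt { get; set; } = "";
    }

    public static class ProjectionExamples
    {
        // Fetch only the two columns the view needs; the generated SQL selects
        // Id and Excerpt instead of every column on the table.
        public static List<ResponseSummary> GetSummaries(CampaignContext db, int campaignId) =>
            db.Responses
              .Where(r => r.CampaignId == campaignId)
              .Select(r => new ResponseSummary { Id = r.Id, Excerpt = r.Excerpt })
              .ToList();

        // Aggregate in the database: this translates to GROUP BY with COUNT(*)
        // rather than materializing rows and counting them in memory.
        public static Dictionary<int, int> CountByCampaign(CampaignContext db) =>
            db.Responses
              .GroupBy(r => r.CampaignId)
              .Select(g => new { g.Key, Total = g.Count() })
              .ToDictionary(x => x.Key, x => x.Total);
    }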

As with data-access best practices, some considerations for performance hold true here as well. Partitioning data horizontally may reduce contention. Operations that support unbounded queries can implement pagination. Features that are built right into the data store can be leveraged. Some calculations need not be repeated, especially those with summation forms. Queries that return a lot of results can be further filtered. Not all operations can be offloaded to the database, but those for which the database is highly optimized should be.
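For the point about not repeating calculations, one common approach is a short-lived cache in front of the aggregate. The sketch below assumes the Microsoft.Extensions.Caching.Memory package and the same hypothetical context; the database still computes the count, but repeated requests within the window reuse the cached value instead of re-running the summation.

    using System;
    using System.Threading.Tasks;
    using Microsoft.EntityFrameworkCore;
    using Microsoft.Extensions.Caching.Memory;

    public class CachedCounts
    {
        private readonly IMemoryCache _cache = new MemoryCache(new MemoryCacheOptions());

        // Serve the aggregate from cache for up to a minute before recomputing.
        public Task<int> GetResponseCountAsync(CampaignContext db, int campaignId) =>
            _cache.GetOrCreateAsync($"responses:{campaignId}", entry =>
            {
                entry.AbsoluteExpirationRelativeToNow = TimeSpan.FromMinutes(1);
                return db.Responses.CountAsync(r => r.CampaignId == campaignId);
            });
    }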

A few ways to detect this antipattern include identifying slow workloads or transactions, looking for behavioral patterns exhibited by the system due to limits, correlating the instances of slow workloads with those patterns, identifying the data stores being used, identifying any slow-running queries that reference those data stores, and performing a resource-specific analysis of how the data is used and consumed. One lightweight way to surface the slow queries is shown below.
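EF Core 5 and later can log each executed command along with its elapsed time, which makes overly broad or slow queries stand out during a resource-specific analysis. The sketch below assumes the SQL Server provider and extends the hypothetical context from the earlier sketches; the connection string is a placeholder.

    using System;
    using Microsoft.EntityFrameworkCore;
    using Microsoft.Extensions.Logging;

    public class LoggingCampaignContext : CampaignContext
    {
        protected override void OnConfiguring(DbContextOptionsBuilder options) =>
            options
                .UseSqlServer("<connection string>")              // placeholder; any provider works
                .LogTo(Console.WriteLine, LogLevel.Information);  // logs each command with its elapsed time
    }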

The remedies described above – fetching only the required columns, paginating results, and pushing filtering and aggregation down to the database – are some of the ways to mitigate this antipattern.

Some of the metrics that help with the detection and mitigation of the extraneous-fetching antipattern include total bytes per minute, average bytes per transaction, and requests per minute.

 
