Cluster computing

This article continues a series on a crowdsourcing application and follows on from the most recent article. The original problem statement is included again for context.

Social engineering applications provide a wealth of information to the end user, but the questions and answers received on them are always limited to just that: the social circle. Advice solicited for personal circumstances is never appropriate for forums that remain in public view. It is also difficult to find the right forum or audience where responses can be obtained in a short time. When we want more opinions in a discreet manner, without the knowledge of those who surround us, the options become fewer and fewer. In addition, crowdsourcing opinions on a personal topic is not easily available via applications. This document tries to envision an application to meet this requirement.

The previous article continued the elaboration on the use of public cloud services for provisioning the queue, document store and compute. It talked a bit about the messaging platform required to support this social-engineering application. The problems encountered with social engineering are well-defined and have precedents in various commercial applications. They are primarily about the feed for each user and the propagation of solicitations to the crowd.

The previous article also described selective fan-out. When the clients wake up, they can request that their state be refreshed. This avoids the write update, because the data does not need to be pushed out. If the queue sends messages back to the clients, it is a fan-out process. The devices can choose to check in at selective times, and the server can be selective about which clients to update. Both methods work well in certain situations. The fan-out happens on both writing and loading, and it can be made selective in either case, that is, limited during both pull and push. Disabling the writes to all devices can significantly reduce the cost; other devices can then load these updates only when reading. It is also helpful to keep track of which clients are active over a period so that only those clients get preference.
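The selective fan-out described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the design from the previous article: the `FeedServer` class, its method names, and the `active_window` threshold are all hypothetical, and the feed is kept in memory instead of in a queue or document store. Updates are pushed only to clients that checked in recently; every other client pulls what it missed when it wakes up.

```python
class FeedServer:
    """Hypothetical in-memory feed illustrating selective fan-out."""

    def __init__(self, active_window=3600.0):
        # Seconds since last check-in during which a client counts as active
        # (an assumed threshold, chosen only for illustration).
        self.active_window = active_window
        self.log = []        # append-only list of updates
        self.cursor = {}     # client_id -> index of the next unread update
        self.last_seen = {}  # client_id -> time of the last check-in

    def _drain(self, client_id):
        """Return the updates the client has not seen and advance its cursor."""
        start = self.cursor.get(client_id, 0)
        unread = self.log[start:]
        self.cursor[client_id] = len(self.log)
        return unread

    def check_in(self, client_id, now):
        """Pull path: a client wakes up and asks for its state to be refreshed."""
        self.last_seen[client_id] = now
        return self._drain(client_id)

    def publish(self, update, now):
        """Push path: fan out only to clients that were active recently."""
        self.log.append(update)
        pushed = {}
        for client_id, seen in self.last_seen.items():
            if now - seen <= self.active_window:
                pushed[client_id] = self._drain(client_id)
        return pushed


server = FeedServer(active_window=60)
server.check_in("a", now=0)
server.check_in("b", now=0)
first = server.publish("q1", now=30)    # both clients are still active
server.check_in("a", now=100)           # client "a" stays active
second = server.publish("q2", now=120)  # only "a" receives the push
late = server.check_in("b", now=200)    # "b" pulls what it missed
```

In this sketch the cost of a write grows with the number of active clients only; inactive clients cost nothing until they read.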

We talk about databases to meet the transactional aspects of the processing on both the campaign generation side and the response accumulation side. The relational data from both sides will need a warehouse where analytical queries can be run for reporting stacks. Separating the read-only store from the read-write store helps with both performance and security.

The choice of relational/cloud databases is left outside this discussion. Instead, we focus on the choice of the warehouse. There are five major players: Azure, BigQuery, Presto, Redshift and Snowflake. The accumulation of responses is inherently tied to users, and the warehouse can expect a lot of users to be differentiated based on their campaigns and responses. The queries invoked on the data are relevant only to the accumulated responses, not to the stream of responses as they arrive. One response is just like another, and there is little or no benefit to processing them in a stream-like manner as opposed to processing them after their accumulation, both from the individual's point of view and from the administrator's point of view. The warehouse is also able to reconcile campaign and response activities so that it remains a source of truth and maintains accuracy on the tally. It provides the ability to write queries in simple SQL and requires no maintenance on our part when hosted in the cloud, regardless of the size of the data accrued.

Picking any one of these warehouses enables separation of the reporting stack and fosters the other microservices that may be envisioned for future offerings. For example, a campaign based on accumulated responses could be forked as its own campaign management microservice utilizing only the database and a message broker. The microservice model is also best suited for separation of concerns in promoting offerings from this one-stop shop for responses while the data layer remains the same. All the microservices are expected to be slim because they only facilitate a connection between producers and consumers of responses. A virtual elastic warehouse is the right choice to make this connection because it facilitates all kinds of workflows associated with the data, most of which are independent of the transactional processing. Even message brokers work well with warehouses when the warehouse accepts JSON.
The archiving of accumulated responses mentioned earlier can now be redirected to the virtual data warehouse using an automated ingestion capability.
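As a rough illustration of JSON messages landing in a warehouse and being tallied with plain SQL, here is a minimal sketch. It makes several assumptions: Python's built-in `sqlite3` module stands in for the cloud warehouse, the `responses` table and the message fields (`response_id`, `campaign_id`, `choice`) are hypothetical, and the JSON is parsed in Python instead of by a warehouse's own ingestion feature. The point is only that the tally is a simple aggregate over accumulated rows, and that a unique response identifier keeps the count accurate if the broker redelivers a message.

```python
import json
import sqlite3


def make_warehouse():
    """Create an in-memory stand-in for the warehouse (assumption: sqlite3)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE responses ("
        " response_id TEXT PRIMARY KEY,"
        " campaign_id TEXT NOT NULL,"
        " choice TEXT NOT NULL)"
    )
    return conn


def ingest(conn, raw_messages):
    """Load JSON messages from the broker; redelivered messages are ignored."""
    rows = []
    for raw in raw_messages:
        msg = json.loads(raw)
        rows.append((msg["response_id"], msg["campaign_id"], msg["choice"]))
    # The primary key plus INSERT OR IGNORE makes ingestion idempotent.
    conn.executemany("INSERT OR IGNORE INTO responses VALUES (?, ?, ?)", rows)
    conn.commit()


def tally(conn, campaign_id):
    """Count accumulated responses per choice for one campaign."""
    cur = conn.execute(
        "SELECT choice, COUNT(*) FROM responses"
        " WHERE campaign_id = ? GROUP BY choice ORDER BY choice",
        (campaign_id,),
    )
    return dict(cur.fetchall())


conn = make_warehouse()
messages = [
    '{"response_id": "r1", "campaign_id": "c1", "choice": "yes"}',
    '{"response_id": "r2", "campaign_id": "c1", "choice": "no"}',
    '{"response_id": "r3", "campaign_id": "c1", "choice": "yes"}',
    '{"response_id": "r3", "campaign_id": "c1", "choice": "yes"}',  # redelivery
    '{"response_id": "r4", "campaign_id": "c2", "choice": "yes"}',
]
ingest(conn, messages)
result = tally(conn, "c1")
```

A hosted warehouse would replace `make_warehouse` and `ingest` with its own loading mechanism; the query in `tally` is the part that carries over largely unchanged.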

Sunday, May 8, 2022
