Monday, May 9, 2022

 

This is a continuation of a series of articles on a crowdsourcing application, including the most recent article. The original problem statement is included again for context.

 

Social engineering applications provide a wealth of information to the end user, but the questions and answers received on them are always limited to just that – the social circle. Advice solicited for personal circumstances is never appropriate for forums that remain in public view. It is also difficult to find the right forum or audience where responses can be obtained in a short time. When we want more opinions in a discreet manner, without the knowledge of those who surround us, the options become fewer and fewer. In addition, crowdsourcing opinions on a personal topic is not easily available via applications. This document tries to envision an application to meet this requirement.

 

The previous article continued the elaboration on the usage of public cloud services for provisioning the queue, the document store, and the compute. It talked a bit about the messaging platform required to support this social-engineering application. The problems encountered with social engineering are well-defined and have precedents in various commercial applications. They are primarily about the feed for each user and the propagation of solicitations to the crowd. The previous article described selective fan-out. When the clients wake up, they can request that their state be refreshed. This obviates the write-time update because the data does not need to be pushed out. If the queue sends messages back to the clients, it is a fan-out process. The devices can choose to check in at selected times, and the server can be selective about which clients to update. Both methods work well in certain situations. The fan-out happens in both writing and loading, and it can be made selective in either case; it can be limited during both pull and push. Disabling the writes to all devices can significantly reduce the cost, and those devices can load the updates only when reading. It is also helpful to keep track of which clients are active over a period of time so that only those clients get preference.
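A minimal sketch of what this selective fan-out might look like follows, assuming a hypothetical in-memory registry of client activity and an illustrative pushToClient function; a real implementation would sit on the queue and notification services discussed in the earlier articles.

```typescript
// Sketch of selective fan-out: push only to recently active clients,
// let the rest pull the update the next time they wake up.
// All names here (Client, Update, pushToClient) are illustrative assumptions.

interface Update { campaignId: string; payload: unknown; createdAt: Date; }
interface Client { id: string; lastSeen: Date; pendingPull: Update[]; }

const ACTIVE_WINDOW_MS = 15 * 60 * 1000; // clients seen in the last 15 minutes get pushes

function isActive(client: Client, now: Date): boolean {
  return now.getTime() - client.lastSeen.getTime() < ACTIVE_WINDOW_MS;
}

// Placeholder for a real push over the messaging platform (e.g., a queue subscription).
async function pushToClient(client: Client, update: Update): Promise<void> {
  console.log(`push ${update.campaignId} -> ${client.id}`);
}

async function fanOut(update: Update, clients: Client[]): Promise<void> {
  const now = new Date();
  for (const client of clients) {
    if (isActive(client, now)) {
      await pushToClient(client, update);   // selective push on write
    } else {
      client.pendingPull.push(update);      // deferred: loaded only when the client reads
    }
  }
}

// When an inactive client wakes up, it drains its pending updates (pull on read).
function refreshOnWake(client: Client): Update[] {
  const updates = client.pendingPull;
  client.pendingPull = [];
  client.lastSeen = new Date();
  return updates;
}
```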

 

In this section, we talk about the content delivery network (CDN) on Azure. This is a distributed network of servers that delivers web content for the crowdsourced application to users. It includes resources for web pages such as JavaScript, stylesheets, and HTML. CDN edge servers that are closest to the application or its clients are used so that there is little or no latency. Azure CDN can also accelerate dynamic content, which cannot be cached, by leveraging networking optimizations such as Point-of-Presence (POP) locations and route optimization via the Border Gateway Protocol. Benefits of using Azure CDN include better performance, large-scale capacity, and distribution of user requests.

 

Azure CDN performs geo-replication and automatic synchronization between virtual datacenters, a term used to denote a shared-nothing collection of servers or clusters. It leverages some form of synchronization, say, with the help of a message-based consensus protocol. Web-accessible storage is provided by Azure Storage, but the CDN is hosted as its own service and comes with its own ARM resource. As with all Azure services, the CDN service provisions an Azure resource backed by an Azure Resource Manager template. Azure CDN can be used for enabling faster access to public resources from Azure CDN POP locations, improving the experience for users who are farther away from datacenters, supporting the Internet of Things by scaling to a huge number of devices that access content, and handling traffic surges without requiring the application to scale.
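As a rough illustration of the ARM-backed provisioning mentioned above, the sketch below issues a PUT against the Azure Resource Manager REST endpoint for a Microsoft.Cdn profile. The subscription, resource group, profile name, SKU, api-version, and the bearer token acquisition are all assumptions; an actual deployment would more likely use an ARM/Bicep template or the Azure SDK.

```typescript
// Hedged sketch: create (or update) an Azure CDN profile via the ARM REST API.
// The subscription id, resource group, SKU and api-version below are placeholders.
const subscriptionId = "<subscription-id>";
const resourceGroup = "<resource-group>";
const profileName = "crowdsourcing-cdn";
const apiVersion = "2021-06-01"; // assumed; check the current Microsoft.Cdn api-version

async function createCdnProfile(bearerToken: string): Promise<void> {
  const url =
    `https://management.azure.com/subscriptions/${subscriptionId}` +
    `/resourceGroups/${resourceGroup}/providers/Microsoft.Cdn` +
    `/profiles/${profileName}?api-version=${apiVersion}`;

  const response = await fetch(url, {
    method: "PUT",
    headers: {
      Authorization: `Bearer ${bearerToken}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      location: "global",
      sku: { name: "Standard_Microsoft" }, // assumed SKU for illustration
    }),
  });

  if (!response.ok) {
    throw new Error(`CDN profile provisioning failed: ${response.status}`);
  }
}
```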

 

Some of the challenges in planning a CDN involve deployment considerations about where to deploy the CDN, and a few others. For example, these include versioning and cache control of the content, testing of the resources independent of their publication, search engine optimization, and content security. In addition, the CDN service must provide disaster recovery and backup options so that the data is not lost and remains highly available. Systems engineering design sometimes looks down upon a CDN because of the costs involved: it can be easier to scale the origin servers without planning a content delivery network, which saves costs because the resources are co-located and there are easier options to scale. The customer would integrate the publication of their content, which can be done with the help of the CDN.
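One common way to address the versioning and cache-control challenge above is to fingerprint static asset URLs so that the CDN can cache them aggressively while each new release busts the cache automatically. The helper below is only a sketch under that assumption; the header values are illustrative.

```typescript
import { createHash } from "crypto";

// Sketch: derive a content-hashed asset URL and a long-lived Cache-Control header,
// so CDN edges can cache immutably and a new build naturally invalidates old URLs.
function versionedAssetUrl(path: string, content: Buffer | string): string {
  const hash = createHash("sha256").update(content).digest("hex").slice(0, 8);
  return `${path}?v=${hash}`;
}

// Headers the origin could set for fingerprinted assets served through the CDN.
const staticAssetHeaders = {
  "Cache-Control": "public, max-age=31536000, immutable",
};

// Example usage with an illustrative stylesheet:
const css = "body { margin: 0; }";
console.log(versionedAssetUrl("/static/site.css", css), staticAssetHeaders);
```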

 

Sunday, May 8, 2022

 

This is a continuation of a series of articles on a crowdsourcing application, including the most recent article. The original problem statement is included again for context.

 

Social engineering applications provide a wealth of information to the end user, but the questions and answers received on them are always limited to just that – the social circle. Advice solicited for personal circumstances is never appropriate for forums that remain in public view. It is also difficult to find the right forum or audience where responses can be obtained in a short time. When we want more opinions in a discreet manner, without the knowledge of those who surround us, the options become fewer and fewer. In addition, crowdsourcing opinions on a personal topic is not easily available via applications. This document tries to envision an application to meet this requirement.

 

The previous article continued the elaboration on the usage of public cloud services for provisioning the queue, the document store, and the compute. It talked a bit about the messaging platform required to support this social-engineering application. The problems encountered with social engineering are well-defined and have precedents in various commercial applications. They are primarily about the feed for each user and the propagation of solicitations to the crowd. The previous article described selective fan-out. When the clients wake up, they can request that their state be refreshed. This obviates the write-time update because the data does not need to be pushed out. If the queue sends messages back to the clients, it is a fan-out process. The devices can choose to check in at selected times, and the server can be selective about which clients to update. Both methods work well in certain situations. The fan-out happens in both writing and loading, and it can be made selective in either case; it can be limited during both pull and push. Disabling the writes to all devices can significantly reduce the cost, and those devices can load the updates only when reading. It is also helpful to keep track of which clients are active over a period of time so that only those clients get preference.

 

We talk about databases to meet the transactional aspects of the processing on both sides: the campaign-generation side and the response-accumulation side. The relational data from both sides will need a warehouse where analytical queries can be run for the reporting stacks. Separation of the read-only store from the read-write store helps with both performance and security.
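A small sketch of that read-only versus read-write separation follows, assuming hypothetical connection strings for a primary and a replica; the point is only that reporting queries never touch the transactional store.

```typescript
// Sketch: route writes to the transactional store and reads/reports to a read-only replica
// or warehouse. The connection strings and the Database interface are illustrative assumptions.
interface Database {
  query(sql: string, params?: unknown[]): Promise<unknown[]>;
}

function connect(connectionString: string): Database {
  // Placeholder: a real implementation would use a driver such as pg or mssql.
  return { query: async (sql, params) => { console.log(connectionString, sql, params); return []; } };
}

const readWriteDb = connect("<primary-connection-string>"); // campaigns, responses
const readOnlyDb  = connect("<replica-connection-string>"); // reporting, analytics

export async function recordResponse(campaignId: string, userId: string, choice: string) {
  await readWriteDb.query(
    "INSERT INTO campaign_responses (campaign_id, user_id, choice) VALUES ($1, $2, $3)",
    [campaignId, userId, choice]
  );
}

export async function responseTally(campaignId: string) {
  return readOnlyDb.query(
    "SELECT choice, COUNT(*) AS votes FROM campaign_responses WHERE campaign_id = $1 GROUP BY choice",
    [campaignId]
  );
}
```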

 

The choice of relational/cloud databases is left outside this discussion. Instead, we focus on the choice of the warehouse. There are five major players – Azure, BigQuery, Presto, RedShift, and Snowflake. The response accumulation is inherently tied to users, and the warehouse can expect a lot of users to be differentiated based on their campaigns and responses. The type of queries invoked on the data is relevant only after its accumulation, not in the stream of responses. One response is just like another, and the queries gain little or no benefit from processing them in a stream-like manner as opposed to processing them after their accumulation, both from the individual's point of view and from the administrator's. The warehouse is also able to reconcile campaign and response activities, so it remains a source of truth and maintains accuracy on the tally. It provides the ability to write queries in plain SQL and comes free of maintenance when hosted in the cloud, regardless of the size of the data accrued.

Picking one warehouse or another enables separation of the reporting stack and fosters the other microservices that may be envisioned for future offerings. For example, a campaign based on response accumulation could be forked as its own campaign-management microservice utilizing only the database and a message broker. The microservice model is also best suited for separation of concerns in promoting offerings from this one-stop shop for responses while the data layer remains the same. All the microservices are expected to be slim because there is only a connection facilitated between producers and consumers of responses. A virtual elastic warehouse is the right choice to make this connection because it facilitates all kinds of workflows associated with the data, most of which are independent of the transactional processing. Even message brokers work well with warehouses when the warehouse accepts JSON. The archiving of the response accumulation mentioned earlier can now be automated and redirected to the virtual data warehouse using an automated ingestion capability.
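To make the automated-ingestion idea concrete, the sketch below batches accumulated responses as JSON lines and prepares a COPY-style load statement; the stage directory and the statement syntax are assumptions standing in for whichever warehouse (Snowflake, BigQuery, Redshift, and so on) is eventually chosen.

```typescript
import { mkdirSync, writeFileSync } from "fs";

// Sketch: archive accumulated responses as newline-delimited JSON for warehouse auto-ingestion.
// The stage directory and the COPY statement below are illustrative, not a specific vendor's API.
interface Response { campaignId: string; userId: string; choice: string; respondedAt: string; }

function stageResponses(responses: Response[], stageDir = "/tmp/response-stage"): string {
  mkdirSync(stageDir, { recursive: true });
  const file = `${stageDir}/responses-${Date.now()}.jsonl`;
  writeFileSync(file, responses.map((r) => JSON.stringify(r)).join("\n"));
  return file;
}

// A warehouse that accepts JSON could then ingest the staged file with something like:
const copyStatement = (file: string) =>
  `COPY INTO campaign_responses_history FROM '${file}' FILE_FORMAT = (TYPE = JSON)`; // assumed syntax

const staged = stageResponses([
  { campaignId: "c1", userId: "u1", choice: "yes", respondedAt: new Date().toISOString() },
]);
console.log(copyStatement(staged));
```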

Saturday, May 7, 2022

 

This is a continuation of a series of articles on a crowdsourcing application, including the most recent article. The original problem statement is included again for context.

Social engineering applications provide a wealth of information to the end user, but the questions and answers received on them are always limited to just that – the social circle. Advice solicited for personal circumstances is never appropriate for forums that remain in public view. It is also difficult to find the right forum or audience where responses can be obtained in a short time. When we want more opinions in a discreet manner, without the knowledge of those who surround us, the options become fewer and fewer. In addition, crowdsourcing opinions on a personal topic is not easily available via applications. This document tries to envision an application to meet this requirement.

The previous article continued the elaboration on the usage of public cloud services for provisioning the queue, the document store, and the compute. It talked a bit about the messaging platform required to support this social-engineering application. The problems encountered with social engineering are well-defined and have precedents in various commercial applications. They are primarily about the feed for each user and the propagation of solicitations to the crowd. The previous article described selective fan-out. When the clients wake up, they can request that their state be refreshed. This obviates the write-time update because the data does not need to be pushed out. If the queue sends messages back to the clients, it is a fan-out process. The devices can choose to check in at selected times, and the server can be selective about which clients to update. Both methods work well in certain situations. The fan-out happens in both writing and loading, and it can be made selective in either case; it can be limited during both pull and push. Disabling the writes to all devices can significantly reduce the cost, and those devices can load the updates only when reading. It is also helpful to keep track of which clients are active over a period of time so that only those clients get preference.

The software agents that run on the users' devices, regardless of their form factor or the application they interact with, have a dual role. They need to talk to the responses service and provide up-to-date crowdsourced responses to a campaign. They also need to allow users to create a campaign or post a response.

The former is an API call to the backend services. It has a wide variety of choices in terms of language and technology used to make the calls. The latter is somewhat more customer-specific. Some topics are easy to write while others require embeddings. Active messages can be looked up for user reactions, but more formal means of communication, such as documents, presentations, and other collaboration artifacts, are usually stored in libraries and are reviewed in iterations involving several cycles. Usually, at the end of their review, they are somewhat finalized. It is during these times that a label for recognition can be added. Even links to external data sources can contribute to a campaign.

Tags and labels are not expected to be changed. Even if they do change, their first assignment is sufficient to correlate responses. The add-in changes for responses and campaigns work seamlessly with the library. For example, the SharePoint add-in reads label attributes, which it then classifies based on predetermined rules for campaigns. Then it makes a call to the backend services for accumulated responses. Each such response is translated into a notification for the campaign, which maintains a ledger of campaigns and responses retrieved from the document author and the add-in classification results, respectively. Each recipient is global for an organization and does not require integration with an IAM for the campaign and responses services to filter and route the notifications.

The SharePoint add-in can be written in any language. The JavaScript SDK will be a good choice for this purpose, since it is likely that the campaign and responses service will offer one like it to make it easy to call the APIs of the service.
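A sketch of how such an add-in might call the campaign and responses service is shown below, assuming a hypothetical REST endpoint shaped like the /api/v1/campaign routes discussed later in this series; the SharePoint label-reading call itself is elided because its API surface is not described here.

```typescript
// Sketch of the add-in's backend calls. The base URL, the response shape and the
// function names are assumptions for illustration only.
const BASE_URL = "https://responses.example.com/api/v1";

interface AccumulatedResponse { campaignId: string; yes: number; no: number; }

async function getAccumulatedResponses(campaignId: string): Promise<AccumulatedResponse> {
  const res = await fetch(`${BASE_URL}/campaign/${campaignId}/responses`);
  if (!res.ok) throw new Error(`lookup failed: ${res.status}`);
  return (await res.json()) as AccumulatedResponse;
}

async function postResponse(campaignId: string, userId: string, choice: "yes" | "no"): Promise<void> {
  const res = await fetch(`${BASE_URL}/campaign/${campaignId}/responses`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ userId, choice }),
  });
  if (!res.ok) throw new Error(`post failed: ${res.status}`);
}

// The add-in would call these after classifying the document label into a campaign id.
```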

 

 

Friday, May 6, 2022

The previous article continued the elaboration on the usage of public cloud services for provisioning the queue, the document store, and the compute. It talked a bit about the messaging platform required to support this social-engineering application. The problems encountered with social engineering are well-defined and have precedents in various commercial applications. They are primarily about the feed for each user and the propagation of solicitations to the crowd. The previous article described selective fan-out. When the clients wake up, they can request that their state be refreshed. This obviates the write-time update because the data does not need to be pushed out. If the queue sends messages back to the clients, it is a fan-out process. The devices can choose to check in at selected times, and the server can be selective about which clients to update. Both methods work well in certain situations. The fan-out happens in both writing and loading, and it can be made selective in either case; it can be limited during both pull and push. Disabling the writes to all devices can significantly reduce the cost, and those devices can load the updates only when reading. It is also helpful to keep track of which clients are active over a period of time so that only those clients get preference.


Historical user activity processing, query processing on warehouse data, and workflows involving report generation for campaign collections do not have to be user-interactive. They can be scheduled to run periodically or on demand and may work on the data in batches or in stream mode. Such workflow automation can easily be serviced with a master data management architecture that leverages microservices against the store. With this separation of transactional and analytical storage, individual parts may easily employ one vendor or another while supporting old or new functionality.
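As a simple illustration of these non-interactive workflows, the sketch below schedules a periodic report-generation job against the analytical store; the interval, the query, and the job body are assumptions rather than a prescribed design, and in production this would more likely be a cron-triggered function or a data-pipeline task.

```typescript
// Sketch: a periodic, non-interactive report job over the analytical store.
const SIX_HOURS_MS = 6 * 60 * 60 * 1000; // assumed cadence

async function runReport(): Promise<void> {
  // Placeholder for an analytical query against the warehouse / read-only store.
  const sql =
    "SELECT campaign_id, COUNT(*) AS responses FROM campaign_responses GROUP BY campaign_id";
  console.log(`[${new Date().toISOString()}] running report: ${sql}`);
}

function scheduleReports(): ReturnType<typeof setInterval> {
  return setInterval(() => {
    runReport().catch((err) => console.error("report failed", err));
  }, SIX_HOURS_MS);
}

scheduleReports();
```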

The campaign processing system has the option to be streamlined to be only transactional in nature, where the users consume their campaign responses and no analytics or warehouse is required. Then there is the option for the business to gather more information about the campaigns and use it to better serve the end users.

Some of the advantages of gathering this information and its patterns in the campaign storage include grouping and ranking the top contributors, determining the most used campaigns, assessing the usefulness of certain options in the campaign response features used by the same user, applying data mining techniques such as market-basket analysis to the campaign responses, and so on.
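These analytical questions map to straightforward aggregate queries over the campaign storage; the sketch below shows what the top-contributors and most-used-campaigns queries might look like, with table and column names assumed for illustration.

```typescript
// Sketch: example analytical queries over an assumed campaign_responses table.
// Ranking top contributors by number of responses submitted.
const topContributorsSql = `
  SELECT user_id, COUNT(*) AS contributions
  FROM campaign_responses
  GROUP BY user_id
  ORDER BY contributions DESC
  LIMIT 10`;

// Most-used campaigns by response volume.
const mostUsedCampaignsSql = `
  SELECT campaign_id, COUNT(*) AS responses
  FROM campaign_responses
  GROUP BY campaign_id
  ORDER BY responses DESC
  LIMIT 10`;

console.log(topContributorsSql, mostUsedCampaignsSql);
```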

Any data pertaining to operations, such as logs, metrics, and events, constitutes machine data, and its storage can not only be dedicated but also support its own indexes, queries, and reporting stacks. This is also left out of this discussion, except to mention that each kind of machine data has a well-known solution for its storage and queries. Some of these stacks operate independently from the business and can even be non-intrusive.


Thursday, May 5, 2022

 

This is a continuation of a series of articles on a crowdsourcing application, including the most recent article. The original problem statement is included again for context.

Social engineering applications provide a wealth of information to the end user, but the questions and answers received on them are always limited to just that – the social circle. Advice solicited for personal circumstances is never appropriate for forums that remain in public view. It is also difficult to find the right forum or audience where responses can be obtained in a short time. When we want more opinions in a discreet manner, without the knowledge of those who surround us, the options become fewer and fewer. In addition, crowdsourcing opinions on a personal topic is not easily available via applications. This document tries to envision an application to meet this requirement.

The previous article continued the elaboration on the usage of public cloud services for provisioning the queue, the document store, and the compute. It talked a bit about the messaging platform required to support this social-engineering application. The problems encountered with social engineering are well-defined and have precedents in various commercial applications. They are primarily about the feed for each user and the propagation of solicitations to the crowd. The previous article described selective fan-out. When the clients wake up, they can request that their state be refreshed. This obviates the write-time update because the data does not need to be pushed out. If the queue sends messages back to the clients, it is a fan-out process. The devices can choose to check in at selected times, and the server can be selective about which clients to update. Both methods work well in certain situations. The fan-out happens in both writing and loading, and it can be made selective in either case; it can be limited during both pull and push. Disabling the writes to all devices can significantly reduce the cost, and those devices can load the updates only when reading. It is also helpful to keep track of which clients are active over a period of time so that only those clients get preference.

Partitioning of data, especially horizontal partitioning by user IDs or campaigns, is quite popular because it spreads the load across servers. On the other hand, the campaign-responses table cannot be partitioned as it will grow rapidly and has no predetermined order. It can, however, be archived on a regular basis, where the older data ends up in a warehouse or in a secondary table. The archival requires a table similar to the source, possibly in a different database than the live one. The archival stored procedure could read the records a few at a time from the source, insert them into the destination, and delete the copied records from the source. The insertions and deletes would not be expected to fail since we select only the records that are in the source but not in the destination. This way we remain in a good state between each incremental move of records, which helps when there are so many records that such a stored procedure runs long and becomes prone to interruptions or failures. The archival can resume from where it left off, and the warehouse will separate the read-only stacks that are only interested in the aggregates.
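A sketch of this incremental archival follows, expressed as parameterized SQL driven from application code; the table names, batch size, T-SQL-style dialect, and the executeSql helper are assumptions. The key property is that each batch only copies rows not yet present in the destination, so an interrupted run can resume safely.

```typescript
// Sketch of batched, resumable archival from a live campaign_responses table to an
// archive table. executeSql and the table/column names are illustrative assumptions.
async function executeSql(sql: string): Promise<number> {
  console.log(sql); // placeholder for a real database call returning the affected row count
  return 0;
}

const BATCH_SIZE = 1000;

async function archiveBatch(): Promise<number> {
  // Copy rows that exist in the source but not yet in the destination.
  const copied = await executeSql(`
    INSERT INTO campaign_responses_archive (id, campaign_id, user_id, choice, created_at)
    SELECT TOP ${BATCH_SIZE} s.id, s.campaign_id, s.user_id, s.choice, s.created_at
    FROM campaign_responses AS s
    WHERE NOT EXISTS (SELECT 1 FROM campaign_responses_archive a WHERE a.id = s.id)
    ORDER BY s.created_at`);

  // Delete only what is now safely in the archive.
  await executeSql(`
    DELETE FROM campaign_responses
    WHERE id IN (SELECT id FROM campaign_responses_archive)`);

  return copied;
}

// Keep moving batches until nothing is left to copy; safe to interrupt and resume.
async function archiveAll(): Promise<void> {
  while ((await archiveBatch()) > 0) { /* continue with the next batch */ }
}
```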

A million responses for a campaign can translate to a million entries in the campaign-responses table. If those million records are of particular interest to a query, it can form its own materialized views of these rows based on the owner ID or campaign ID of the responses. The archival policy of the table may be triggered based on the number of rows in the table or the created/modified time of the oldest record, which may not migrate these million rows as soon as they are created. Instead, the approach in such a case would be to convert the million rows into a single row for that campaign, incrementing the votes in the yes or no categories, and this can be done right after all the entries have been added without any loss of responses. Databases, both on-premises and in the cloud, have somewhat alleviated the performance impact of a million rows being added to a table, and such additions can be considered as routine as a handful of entries.
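The conversion of many response rows into a single tally row can be expressed as an aggregate upsert; the sketch below assumes a campaign_tallies table, yes/no choices, and a MERGE-capable dialect, purely for illustration.

```typescript
// Sketch: collapse accumulated responses for a campaign into one tally row.
// Table names, columns and the MERGE-style upsert are assumptions for illustration.
// A real call would bind campaignId as a parameter rather than interpolating it.
const tallyUpsertSql = (campaignId: string) => `
  MERGE INTO campaign_tallies AS t
  USING (
    SELECT campaign_id,
           SUM(CASE WHEN choice = 'yes' THEN 1 ELSE 0 END) AS yes_votes,
           SUM(CASE WHEN choice = 'no'  THEN 1 ELSE 0 END) AS no_votes
    FROM campaign_responses
    WHERE campaign_id = '${campaignId}'
    GROUP BY campaign_id
  ) AS s
  ON t.campaign_id = s.campaign_id
  WHEN MATCHED THEN UPDATE SET t.yes_votes = t.yes_votes + s.yes_votes,
                               t.no_votes  = t.no_votes  + s.no_votes
  WHEN NOT MATCHED THEN INSERT (campaign_id, yes_votes, no_votes)
                        VALUES (s.campaign_id, s.yes_votes, s.no_votes);`;

console.log(tallyUpsertSql("example-campaign-id"));
```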

 

Wednesday, May 4, 2022

 

This is a continuation of a series of articles on a crowdsourcing application, including the most recent article. The original problem statement is included again for context.

 

Social engineering applications provide a wealth of information to the end user, but the questions and answers received on them are always limited to just that – the social circle. Advice solicited for personal circumstances is never appropriate for forums that remain in public view. It is also difficult to find the right forum or audience where responses can be obtained in a short time. When we want more opinions in a discreet manner, without the knowledge of those who surround us, the options become fewer and fewer. In addition, crowdsourcing opinions on a personal topic is not easily available via applications. This document tries to envision an application to meet this requirement.

 

The previous article continued the elaboration on the usage of public cloud services for provisioning the queue, the document store, and the compute. It talked a bit about the messaging platform required to support this social-engineering application. The problems encountered with social engineering are well-defined and have precedents in various commercial applications. They are primarily about the feed for each user and the propagation of solicitations to the crowd. The previous article described selective fan-out. When the clients wake up, they can request that their state be refreshed. This obviates the write-time update because the data does not need to be pushed out. If the queue sends messages back to the clients, it is a fan-out process. The devices can choose to check in at selected times, and the server can be selective about which clients to update. Both methods work well in certain situations. The fan-out happens in both writing and loading, and it can be made selective in either case; it can be limited during both pull and push. Disabling the writes to all devices can significantly reduce the cost, and those devices can load the updates only when reading. It is also helpful to keep track of which clients are active over a period of time so that only those clients get preference.

For crowdsourcing applications where the number of users spans a segment of the planet's population, the storage requirements become like those of the companies offering social engineering applications. For example, the high-volume data can be kept in NoSQL stores while Presto bridges a SQL query over the data. Presto, from Facebook, is a distributed SQL query engine that can operate on streams from various data sources, supporting ad hoc queries in near real time. It does not partition based on MapReduce and executes the query with a custom SQL execution engine written in Java. It has a pipelined data model that can run multiple stages at once while pipelining the data between stages as it becomes available. This reduces end-to-end time while maximizing parallelization via stages on large data sets.
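Since Presto exposes plain SQL over such stores, an ad hoc tally over accumulated responses could be submitted as sketched below through its HTTP statement endpoint; the coordinator URL, catalog, schema, and table names are assumptions, and a client driver or the Presto CLI would normally be used instead.

```typescript
// Sketch: submit an ad hoc query to a Presto coordinator over its HTTP statement API.
// The coordinator URL, catalog/schema and table name are assumptions for illustration.
const PRESTO_URL = "http://presto-coordinator:8080/v1/statement";

async function submitQuery(sql: string): Promise<unknown> {
  const res = await fetch(PRESTO_URL, {
    method: "POST",
    headers: {
      "X-Presto-User": "crowdsourcing-app", // identifies the submitting user
      "X-Presto-Catalog": "hive",           // assumed catalog over the NoSQL/object store
      "X-Presto-Schema": "responses",
    },
    body: sql,
  });
  if (!res.ok) throw new Error(`query submission failed: ${res.status}`);
  // The response contains a nextUri to poll for results; polling is elided in this sketch.
  return res.json();
}

submitQuery(
  "SELECT campaign_id, COUNT(*) AS responses FROM campaign_responses GROUP BY campaign_id"
).then((r) => console.log(r));
```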

Given that users are interested mostly in the accumulated responses, it might be helpful to view the store as a data warehouse, one that can be supported in the cloud in virtual datacenters, preferably one that can support data ingestion in the form of JSON from data pipelines. The ability to perform queries over this warehouse follows the conventional Online Analytical Processing (OLAP) model and serves the campaigns and responses very well. While the choice of an external data store is not ruled out, it must scale. There are cost-benefit ratios to consider when deploying custom stores versus something offered by the public clouds.

Tuesday, May 3, 2022

This is a continuation of a series of articles on a crowdsourcing application, including the most recent article. The original problem statement is included again for context.


Social engineering applications provide a wealth of information to the end user, but the questions and answers received on them are always limited to just that – the social circle. Advice solicited for personal circumstances is never appropriate for forums that remain in public view. It is also difficult to find the right forum or audience where responses can be obtained in a short time. When we want more opinions in a discreet manner, without the knowledge of those who surround us, the options become fewer and fewer. In addition, crowdsourcing opinions on a personal topic is not easily available via applications. This document tries to envision an application to meet this requirement.


The previous article continued the elaboration on the usage of public cloud services for provisioning the queue, the document store, and the compute. It talked a bit about the messaging platform required to support this social-engineering application. The problems encountered with social engineering are well-defined and have precedents in various commercial applications. They are primarily about the feed for each user and the propagation of solicitations to the crowd. The previous article described selective fan-out. When the clients wake up, they can request that their state be refreshed. This obviates the write-time update because the data does not need to be pushed out. If the queue sends messages back to the clients, it is a fan-out process. The devices can choose to check in at selected times, and the server can be selective about which clients to update. Both methods work well in certain situations. The fan-out happens in both writing and loading, and it can be made selective in either case; it can be limited during both pull and push. Disabling the writes to all devices can significantly reduce the cost, and those devices can load the updates only when reading. It is also helpful to keep track of which clients are active over a period of time so that only those clients get preference.

With these, the Campaigns service provides a consistent and uniform responses program that enables individuals to crowdsource advice regardless of the user interface they use. The setup of the campaign responses as correlated data in the store enables faster and easier access to the data for rendering purposes.

The APIs for Campaigns that perform create, update, and delete as well as querying are REST APIs for a resource named Campaigns, which look like the following:

Storing campaigns in a database with user details such as 

o Campaign-form, 

o user relationship, 

o target URL to fulfil the campaign from backend store, and 

o active versus inactive state for the campaign processing

API to create/update/delete campaigns that includes:

o GET /api/v1/campaign/ to list

o POST /api/v1/campaign/ to create

o GET /api/v1/campaign/:id/ to lookup 

o PUT /api/v1/campaign/:id/ to edit

o DELETE /api/v1/campaign/:id to delete a campaign

List and implement campaign types

o a name in the name.verb syntax.

o a payload to simply mirror the representation from the standard API.

Send hooks with a POST to each of the target URLs for each matching campaign (a sketch of this delivery follows the list below):

o compiling and POSTing the combined payload for the triggering resource and hook resource

o sending to known online retail stores with the campaign, where the X-Hook-Secret header carries a unique string that matches what was issued by the backend retail store.

o confirming the hook's legitimacy with an X-Hook-Signature header

o handling responses such as 410 Gone and optionally retrying the connection on other 4xx/5xx errors.
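A sketch of the hook delivery outlined in the last group of bullets: the payload is signed with the shared secret, POSTed to each target URL, and a 410 Gone response deactivates that target while other 4xx/5xx responses are left for retry. The HMAC choice, the target registration shape, and the retry policy are assumptions consistent with the headers named above.

```typescript
import { createHmac } from "crypto";

// Sketch of webhook delivery for a matching campaign. HookTarget, the HMAC-SHA256
// signature and the retry policy are illustrative assumptions.
interface HookTarget { url: string; secret: string; active: boolean; }

async function deliverHook(target: HookTarget, payload: object): Promise<void> {
  if (!target.active) return;
  const body = JSON.stringify(payload);
  const signature = createHmac("sha256", target.secret).update(body).digest("hex");

  const res = await fetch(target.url, {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      "X-Hook-Secret": target.secret,   // matches what the backend store issued
      "X-Hook-Signature": signature,    // lets the receiver confirm legitimacy
    },
    body,
  });

  if (res.status === 410) {
    target.active = false;              // 410 Gone: stop sending to this target
  } else if (!res.ok) {
    // Other 4xx/5xx: leave the target active and let a retry queue pick it up.
    console.warn(`delivery to ${target.url} failed with ${res.status}`);
  }
}
```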