Cluster computing

Saturday, May 7, 2022

This is a continuation of a series of articles on a crowdsourcing application, including the most recent article. The original problem statement is included again for context.

Social networking applications provide a wealth of information to the end user, but the questions and answers exchanged on them are always limited to the user's own social circle. Advice solicited for personal circumstances is not appropriate for forums that remain in public view. It is also difficult to find the right forum or audience from which responses can be obtained in a short time. When we want more opinions in a discreet manner, without the knowledge of those who surround us, the options become fewer and fewer. In addition, crowdsourcing opinions on a personal topic is not readily available via applications. This document tries to envision an application to meet this requirement.

The previous article continued the elaboration on the use of public cloud services for provisioning the queue, document store and compute. It talked a bit about the messaging platform required to support this social networking application. The problems encountered with social networking are well defined and have precedents in various commercial applications. They are primarily about the feed for each user and the propagation of solicitations to the crowd.

The previous article described selective fan-out. When the clients wake up, they can request their state to be refreshed. This avoids the update at write time because the data does not need to be sent out. If the queue sends messages back to the clients, it is a fan-out process. The devices can choose to check in at selective times, and the server can be selective about which clients to update. Both methods work well in certain situations. The fan-out can happen on write as well as on load, and it can be made selective in both the push and the pull case. Disabling the writes to all devices can significantly reduce the cost; the other devices then load these updates only when reading. It is also helpful to keep track of which clients are active over a period so that only those clients get preference.
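A minimal Python sketch of selective fan-out, under stated assumptions: the class name SelectiveFanOut, the active_window parameter and the in-memory log are all hypothetical and stand in for the queue and document store. Updates are written once; only clients that refreshed recently receive a push, and everyone else pulls what they missed on their next refresh.

```python
from collections import defaultdict


class SelectiveFanOut:
    """Push updates to recently active clients; everyone else pulls on refresh."""

    def __init__(self, active_window: float) -> None:
        self.active_window = active_window      # seconds a client counts as active
        self.log: list[str] = []                # every update, written once
        self.cursor: dict[str, int] = {}        # client -> index of next unread update
        self.last_seen: dict[str, float] = {}   # client -> time of last refresh
        self.pushed: dict[str, list[str]] = defaultdict(list)

    def refresh(self, client: str, now: float) -> list[str]:
        """Pull path: the client wakes up and asks for whatever it has missed."""
        # Assumption: a brand-new client starts at the end of the log.
        start = self.cursor.get(client, len(self.log))
        unread = self.log[start:]
        self.cursor[client] = len(self.log)
        self.last_seen[client] = now
        return unread

    def publish(self, update: str, now: float) -> int:
        """Push path: write once, then fan out only to active clients."""
        self.log.append(update)
        delivered = 0
        for client, seen in self.last_seen.items():
            if now - seen <= self.active_window:
                self.pushed[client].append(update)
                self.cursor[client] = len(self.log)
                delivered += 1
        return delivered
```

A client that stays quiet for longer than the window costs nothing at write time; its next refresh returns the updates it skipped.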

The software agents that run on the user’s devices, regardless of their form or the application they interact with, have a dual role. They need to talk to the responses service and provide up-to-date crowdsourced responses to a campaign. They also need to allow users to create a campaign or post a response.

The former is an API call to the backend services, and there is a wide choice of languages and technologies for making such calls. The latter is somewhat more customer-specific. Some topics are easy to write while others require embeddings. Active messages can be looked up for user reactions, but more formal means of communication such as documents, presentations and other forms of collaboration artifacts are usually stored in libraries and are reviewed in iterations involving several cycles. Usually they are more or less finalized at the end of their review, and it is at this point that a label for recognition can be added. Even links to external data sources can contribute to a campaign.

Tags and labels are not expected to change. If they do, their first assignment is sufficient to correlate responses. The add-in for responses and campaigns works seamlessly with the library. For example, the SharePoint add-in reads label attributes, which it then classifies based on predetermined rules for campaigns. Then it makes a call to the backend services for accumulated responses. Each such response is translated to a notification for the campaign, which maintains a ledger of campaigns and responses that are retrieved from the document author and the add-in classification results respectively. Each recipient is global for an organization, so the campaign and responses services do not require integration with an IAM to filter and route the notifications.
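The classification and ledger steps can be illustrated with a short Python sketch. It does not use any SharePoint API; the rule table, the classify function and the CampaignLedger class are hypothetical names, and the responses are passed in as plain strings rather than fetched from a backend service. The point it shows is that the first label assignment for a document is the one that keeps correlating its responses.

```python
from typing import Optional


def classify(labels: list[str], rules: dict[str, str]) -> Optional[str]:
    """Return the campaign for the first label that matches a predetermined rule."""
    for label in labels:
        campaign = rules.get(label.lower())
        if campaign is not None:
            return campaign
    return None


class CampaignLedger:
    """Keeps the first campaign assigned to each document and logs notifications."""

    def __init__(self, rules: dict[str, str]) -> None:
        self.rules = rules                      # label -> campaign id (hypothetical)
        self.assigned: dict[str, str] = {}      # document id -> campaign id
        self.entries: list[tuple[str, str, str]] = []

    def notify(self, doc_id: str, labels: list[str], responses: list[str]):
        # The first assignment wins, even if the labels change later.
        campaign = self.assigned.get(doc_id) or classify(labels, self.rules)
        if campaign is None:
            return []
        self.assigned.setdefault(doc_id, campaign)
        notifications = [(campaign, doc_id, response) for response in responses]
        self.entries.extend(notifications)
        return notifications
```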

The SharePoint add-in can be written in any language. The JavaScript SDK would be a good choice for this purpose, since the campaign and responses service is likely to use a similar SDK to make its APIs easy to call.

Friday, May 6, 2022

Historical user activity processing, query processing on warehouse data, and workflows involving report generation for campaign collection do not have to be interactive. They can be scheduled to run periodically or on demand and may work on the data in batches or in stream mode. Such workflow automation can easily be serviced with a master data management architecture that leverages microservices against the store. With this separation of transactional and analytical storage, individual parts may easily employ one vendor or another while supporting old or new functionality.

The campaign processing system has the option of being streamlined to be purely transactional, where the users work with their campaign responses and no analytics or warehouse is required. Alternatively, the business can gather more information about the campaigns and use it to better serve the end users.

Some of the advantages of gathering this information and its patterns in the campaign storage include grouping and ranking the top contributors, determining the most used campaigns, assessing the usefulness of certain options in the campaign response features used by the same user, applying data mining techniques such as market basket analysis to the campaign responses, and so on.
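Two of these analytics can be sketched in a few lines of Python. The record layout (user, campaign, option) and both function names are assumptions made for illustration. The second function only counts how often two options are chosen by the same user, which is the co-occurrence step of market basket analysis and not a full association-rule miner.

```python
from collections import Counter
from itertools import combinations


def top_contributors(responses, n=3):
    """Rank users by the number of responses they have posted."""
    counts = Counter(user for user, _campaign, _option in responses)
    return counts.most_common(n)


def co_chosen_options(responses, min_support=2):
    """Count pairs of options picked by the same user across campaigns."""
    options_by_user = {}
    for user, _campaign, option in responses:
        options_by_user.setdefault(user, set()).add(option)
    pairs = Counter()
    for options in options_by_user.values():
        # Sort so that each pair is always counted under the same key.
        pairs.update(combinations(sorted(options), 2))
    return {pair: count for pair, count in pairs.items() if count >= min_support}
```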

Any storage pertaining to operations, such as logs, metrics and events, constitutes machine data, and it can not only have dedicated storage but also support its own indexes, queries and reporting stacks. This is left out of this discussion, except to mention that each kind of machine data listed has well-known solutions for its storage and queries. Some of these stacks operate independently of the business and can even be non-intrusive.

Thursday, May 5, 2022

This is a continuation of a series of articles on a crowdsourcing application, including the most recent article.

Partitioning of data, especially horizontal partitioning by userIDs or campaigns, is quite popular because it spreads the load across servers. On the other hand, the campaign-responses table cannot be partitioned this way, as it will grow rapidly and has no predetermined order. It can, however, be archived on a regular basis so that the older data ends up in a warehouse or in a secondary table. The archival requires a table similar to the source, possibly in a different database than the live one. The archival stored procedure could read the records a few at a time from the source, insert them into the destination and delete the copied records from the source. The insertions and deletions would not be expected to fail, since we select only the records that are in the source but not in the destination. This way we will be in a good state between each incremental move of records. This helps when there are many records, which makes such a stored procedure run long and become prone to interruptions or failures. The archival can resume from where it left off, and the warehouse will separately serve the read-only stacks that are only interested in the aggregates.
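The incremental move can be sketched with Python's built-in sqlite3 module in place of a stored procedure. The table names (responses, responses_archive), the columns and the cutoff parameter are assumptions, and both tables live in one database here for brevity. Each batch first removes any rows that an interrupted run had already copied, so the routine can resume safely.

```python
import sqlite3


def create_tables(conn: sqlite3.Connection) -> None:
    """Create a live table and an archive table with the same shape."""
    for table in ("responses", "responses_archive"):
        conn.execute(
            f"CREATE TABLE {table} (id INTEGER PRIMARY KEY, campaign_id TEXT, "
            "answer TEXT, created INTEGER)"
        )


def archive_batch(conn: sqlite3.Connection, cutoff: int, batch_size: int) -> int:
    """Move up to batch_size rows older than cutoff; return how many were moved."""
    with conn:  # commit the batch, or roll it back on error
        # Resume step: drop rows that were copied earlier but never deleted.
        conn.execute(
            "DELETE FROM responses WHERE id IN (SELECT id FROM responses_archive)"
        )
        rows = conn.execute(
            "SELECT id, campaign_id, answer, created FROM responses "
            "WHERE created < ? AND id NOT IN (SELECT id FROM responses_archive) "
            "ORDER BY id LIMIT ?",
            (cutoff, batch_size),
        ).fetchall()
        conn.executemany("INSERT INTO responses_archive VALUES (?, ?, ?, ?)", rows)
        conn.executemany("DELETE FROM responses WHERE id = ?", [(r[0],) for r in rows])
    return len(rows)


def archive_all(conn: sqlite3.Connection, cutoff: int, batch_size: int = 2) -> int:
    """Repeat small batches until nothing older than cutoff remains."""
    moved = 0
    while (count := archive_batch(conn, cutoff, batch_size)) > 0:
        moved += count
    return moved
```

Keeping each batch small bounds the work lost to an interruption to a single batch.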

A million responses for a campaign can translate to a million entries in the campaign-responses table. If those million records are of particular interest to a query, the query can form its own materialized views of these rows based on the ownerID or campaign id of the responses. The archival policy of the table may be triggered by the number of rows in the table or by the created/modified time of the oldest record, so it may not be able to migrate these million rows as soon as they are created. Instead, the approach in such a case would be to convert the million rows to a single row for that campaign, with the votes tallied in yes and no categories; this can be done right after all the entries have been added, without any loss of responses. Databases, both on-premises and in the cloud, have somewhat alleviated the performance impact of a million rows being added to a table, so such an addition can be considered as routine as a handful of entries.
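Collapsing the individual rows into one tally row per campaign might look like the following Python sketch, again using sqlite3. The campaign_totals table and its yes_votes and no_votes columns are hypothetical, and answers are assumed to be the strings 'yes' and 'no'.

```python
import sqlite3


def create_vote_tables(conn: sqlite3.Connection) -> None:
    conn.execute(
        "CREATE TABLE responses (id INTEGER PRIMARY KEY, campaign_id TEXT, answer TEXT)"
    )
    conn.execute(
        "CREATE TABLE campaign_totals (campaign_id TEXT PRIMARY KEY, "
        "yes_votes INTEGER NOT NULL, no_votes INTEGER NOT NULL)"
    )


def roll_up(conn: sqlite3.Connection, campaign_id: str) -> tuple[int, int]:
    """Fold all response rows for a campaign into its single totals row."""
    with conn:  # tally, update and delete are committed together
        yes, no = conn.execute(
            "SELECT COALESCE(SUM(CASE WHEN answer = 'yes' THEN 1 ELSE 0 END), 0), "
            "COALESCE(SUM(CASE WHEN answer = 'no' THEN 1 ELSE 0 END), 0) "
            "FROM responses WHERE campaign_id = ?",
            (campaign_id,),
        ).fetchone()
        # Create the totals row on first use, then increment it.
        conn.execute(
            "INSERT OR IGNORE INTO campaign_totals VALUES (?, 0, 0)", (campaign_id,)
        )
        conn.execute(
            "UPDATE campaign_totals SET yes_votes = yes_votes + ?, "
            "no_votes = no_votes + ? WHERE campaign_id = ?",
            (yes, no, campaign_id),
        )
        conn.execute("DELETE FROM responses WHERE campaign_id = ?", (campaign_id,))
    return yes, no
```

Because the totals are incremented rather than overwritten, responses that arrive after a roll-up are added by the next one and no vote is lost.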

Wednesday, May 4, 2022

This is a continuation of a series of articles on a crowdsourcing application, including the most recent article.

For crowdsourcing applications where the number of users spans a segment of the population of the planet, the storage requirements become like those of the companies offering social networking applications. For example, the high-volume data can be kept in NoSQL stores, with Presto providing the ability to bridge a SQL query over that data. Presto, from Facebook, is a distributed SQL query engine that can operate on streams from various data sources, supporting ad hoc queries in near real time. It does not rely on MapReduce and executes the query with a custom SQL execution engine written in Java. It has a pipelined execution model that can run multiple stages at once, passing data between stages as it becomes available. This reduces end-to-end time while maximizing parallelization via stages on large data sets.
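The idea of passing rows between stages as they become available can be illustrated with Python generators. This is only a single-process analogy with hypothetical stage names; it is not Presto's API and does not reproduce its distributed execution. The trace list records the order in which stages touch each row.

```python
def scan(rows, trace):
    """First stage: emit rows one at a time instead of materializing them."""
    for row in rows:
        trace.append(("scan", row["id"]))
        yield row


def keep_campaign(rows, campaign_id, trace):
    """Second stage: filter rows as they arrive from the previous stage."""
    for row in rows:
        if row["campaign"] == campaign_id:
            trace.append(("filter", row["id"]))
            yield row


def count_answers(rows):
    """Final stage: aggregate whatever the upstream stages stream in."""
    totals = {}
    for row in rows:
        totals[row["answer"]] = totals.get(row["answer"], 0) + 1
    return totals
```

The trace shows the stages interleaving row by row: the filter sees the first row before the scan has read the second.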

Given that users are interested mostly in the accumulated responses, it might be helpful to view the store as a data warehouse, one that can be supported in the cloud in virtual data centers and preferably one that can ingest data in the form of JSON from data pipelines. Queries over this warehouse follow the conventional Online Analytical Processing model and serve the campaigns and responses very well. While the choice of an external data store is not ruled out, it must scale. There are costs and benefits to weigh when deploying custom stores versus using something offered by public clouds.

Tuesday, May 3, 2022

This is a continuation of a series of articles on a crowdsourcing application, including the most recent article.

With these, the Campaigns program service provides a consistent and uniform responses program that enables individuals to crowdsource advice regardless of the user interface or device they use. Setting up the campaign responses as correlated data in the store enables faster and easier access to the data for rendering purposes.

The APIs for Campaigns that perform create, update and delete as well as querying are REST APIs for a resource named Campaigns, which looks like the following:

• Storing campaigns in a database with user details such as

o Campaign-form,

o user relationship,

o target URL to fulfil the campaign from backend store, and

o active versus inactive state for the campaign processing

• API to create/update/delete campaigns that includes:

o GET /api/v1/campaign/ to list

o POST /api/v1/campaign/ to create

o GET /api/v1/campaign/:id/ to lookup

o PUT /api/v1/campaign/:id/ to edit

o DELETE /api/v1/campaign/:id/ to delete a campaign

• List and implement campaign types

o a name in the name.verb syntax.

o a payload to simply mirror the representation from the standard API.

• send hooks with POST to each of the target URLs for each matching campaign

o compiling and POSTing the combined payload for the triggering resource and hook resource

o sending to known online retail stores with the Campaign, where the X-Hook-Secret header carries a unique string that matches what was issued by the backend retail store.

o confirming the hook legitimacy with an X-Hook-Signature header

o handling responses such as 410 Gone, and optionally retrying on connection failures or other 4xx/5xx errors.
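The hook delivery steps in the list above can be sketched in Python using only the standard library. Several choices here are assumptions rather than anything the list specifies: the signature is taken to be a hex HMAC-SHA256 of the request body keyed by the hook secret, the combined payload uses "trigger" and "hook" keys, and the retry policy (retry on 429 and 5xx, give up on other 4xx) is one possible reading of the optional retry.

```python
import hashlib
import hmac
import json


def sign(secret: str, body: bytes) -> str:
    """Hex HMAC-SHA256 of the body; the algorithm choice is an assumption."""
    return hmac.new(secret.encode(), body, hashlib.sha256).hexdigest()


def build_request(secret: str, trigger: dict, hook: dict):
    """Combine the triggering resource and hook resource into one signed POST."""
    body = json.dumps({"trigger": trigger, "hook": hook}, sort_keys=True).encode()
    headers = {
        "Content-Type": "application/json",
        "X-Hook-Signature": sign(secret, body),
    }
    return headers, body


def verify(secret: str, body: bytes, signature: str) -> bool:
    """Receiver side: recompute the signature and compare in constant time."""
    return hmac.compare_digest(sign(secret, body), signature)


def next_action(status: int) -> str:
    """Decide what to do with a delivery based on the HTTP status code."""
    if 200 <= status < 300:
        return "done"
    if status == 410:
        return "deactivate"   # 410 Gone: the target no longer wants this hook
    if status == 429 or status >= 500:
        return "retry"
    return "give_up"          # other 4xx: retrying the same request will not help
```

The POST itself is left out so that the sketch does not depend on a particular HTTP client.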

Monday, May 2, 2022

RFC for the protocol between a voice-activated personal assistant and universal computing

The purpose of this document is to establish a vendor-independent open framework for integration between a front-end personal assistant and back-end universal computing. While writing a new protocol has become less common given the proliferation of web services and the acceptance of Representational State Transfer (REST) APIs over HTTP as the medium of communication, this document assumes and hopes that developers will fall back on these principles just as they did with other open frameworks such as OpenID. Although RFCs usually come with a proof of concept, this document merely tries to enumerate the rules, which have themselves been picked from other standards or have gained acceptance in the industry.

1) Voice is a form of identification as much as tokens are, given the scope in which it is valid. It may be used in conjunction with other credentials such as passwords or PINs. It may also work together with sign-in to associated devices such as the desktop or phone, and it can serve as a standalone point-of-contact authenticator as opposed to a login from a social networking application. In all these cases, the personal assistant must know who the owner is. Consequently, OpenID and extensions that address fragmented identity should be implemented in the personal assistant software.

2) The personal assistant may have a jargon that is either built in or expanded with experience. This may take the form of a rules engine that has nouns and verbs associated, with logic expressed as conditions on the verbs. The jargon itself is open to the owner and customizable via programming on the desktop. It can be backed up in the cloud or exported and imported. Syncing to this jargon happens in an object-oriented way: it may be derived from the publisher, implemented from industry standards, or supported from external clients such as desktops or mobiles. Verbs may involve such things as search, sort and rank, which can use published algorithms or packages that execute remotely. Much of the processing is delegated, so the personal assistant only keeps track of the customizations for the owner.

3) The interfaces implemented by the personal assistant involve reads and writes to universal stores, delegation of commands to other devices and computing, establishing connections with devices and libraries over different networks, and adding learned logic or associations to the jargon. The personal assistant does not need to be enterprise-grade software; rather, it needs to be an agent for enterprise server software that can operate standalone as well as connected to the server. The networks are implemented over Bluetooth for device-to-device connectivity and over Wi-Fi for internet connectivity.

4) The personal assistant also maintains a message bus for relays from the server, other processors and other agents, with a simple and universal envelope that can be implemented even between coordinators. REST services may be helpful here, but the assistant enforces a time-to-live on each envelope so that inbound and outbound queues may be refreshed as per the session with the owner. Since the entire device is for the owner, there is no separation of queues for different users and no priority difference between messages.
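As an illustration of rule 4, the envelope and its time-to-live can be sketched in Python. The field names and the MessageBus class are hypothetical; the document does not define a wire format. There is a single first-in, first-out queue with no per-user separation and no priorities, as the rule describes.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional


@dataclass
class Envelope:
    sender: str
    recipient: str
    body: str
    created: float   # time the envelope was posted, in seconds
    ttl: float       # time-to-live, in seconds

    def expired(self, now: float) -> bool:
        return now - self.created >= self.ttl


class MessageBus:
    """One FIFO queue for the owner: no per-user queues, no priorities."""

    def __init__(self) -> None:
        self.queue = deque()

    def post(self, envelope: Envelope) -> None:
        self.queue.append(envelope)

    def refresh(self, now: float) -> int:
        """Drop expired envelopes and return how many were removed."""
        before = len(self.queue)
        self.queue = deque(e for e in self.queue if not e.expired(now))
        return before - len(self.queue)

    def next_message(self, now: float) -> Optional[Envelope]:
        """Return the oldest live envelope, or None if the queue is empty."""
        self.refresh(now)
        return self.queue.popleft() if self.queue else None
```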

5) The personal assistant exposes its activities by sharing its log that can be exported and archived externally. It may also support points of injection of logic for appropriate use by the publisher and enforcement agents.

Sunday, May 1, 2022

This is a continuation of a series of articles on a crowdsourcing application, including the most recent article.

No cloud service or application is complete without manageability and reporting. The service or application can choose to offer a set of packaged queries for the user to pick from a dropdown menu while internalizing all query processing, execution, and the return of the results. One of the restrictions that comes with packaged queries exposed via REST APIs is their limited ability to scale, since they consume significant resources on the backend and continue to run for a long time. These restrictions cannot be relaxed without some reduction in their resource usage. The API must provide a way for consumers to launch several queries with trackers, and the queries should be completed reliably even if they are executed one by one. This is facilitated with the help of a reference to the query and a progress indicator. The reference is merely an opaque identifier that only the system issues and uses to look up the status. The indicator could be another API that takes the reference and returns the status.

It is relatively easy for the system to separate read-only status information from read-write operations, so that the number of times the status indicator is called does not degrade the rest of the system. There is a clean separation of the status information, which is usually collected periodically or pushed from the rest of the system. The separation of read-write from read-only also allows the two to be treated differently. For example, it is possible to replace the technology for the read-only part separately from the technology for the read-write part, and the read-only technology can be swapped for another to improve that side.
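The reference and progress indicator described above can be sketched in Python. The QueryTracker class, its method names and the state names are assumptions; the packaged queries are plain callables standing in for real backend queries. Launching returns an opaque reference immediately, queries run one at a time, and the status lookup reads only from a separate status table.

```python
import uuid


class QueryTracker:
    """Launch packaged queries by name and poll them by opaque reference."""

    def __init__(self, packaged: dict) -> None:
        self.packaged = packaged   # query name -> callable (hypothetical)
        self.pending = []          # read-write side: work waiting to run
        self.status = {}           # read-only side: reference -> state and result

    def launch(self, name: str) -> str:
        if name not in self.packaged:
            raise KeyError(f"unknown packaged query: {name}")
        reference = uuid.uuid4().hex   # opaque: carries no meaning for the caller
        self.pending.append((reference, name))
        self.status[reference] = {"state": "queued", "result": None}
        return reference

    def run_next(self) -> bool:
        """Execute one queued query; return False when nothing is waiting."""
        if not self.pending:
            return False
        reference, name = self.pending.pop(0)
        self.status[reference] = {"state": "running", "result": None}
        try:
            result = self.packaged[name]()
            self.status[reference] = {"state": "completed", "result": result}
        except Exception as exc:
            self.status[reference] = {"state": "failed", "result": str(exc)}
        return True

    def get_status(self, reference: str) -> dict:
        """Status indicator: a cheap lookup that never touches the work queue."""
        return dict(self.status[reference])
```

Because get_status only reads the status table, it can be called as often as needed, and that table could later be moved to a different store without changing how queries are launched.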

The design of REST APIs generally follows a convention. This practice gives well-recognized URI qualifier patterns, query parameters and methods. Exceptions, errors and logging are typically handled by making the best use of the HTTP protocol.