This is a continuation of a series of articles on a
crowdsourcing application, following up on the most recent article. The
original problem statement is included again for context.
Social engineering applications provide a wealth of
information to the end-user, but the questions and answers exchanged on them are
always limited to just that: the user's social circle. Advice solicited for personal
circumstances is rarely appropriate for forums that remain in public view.
It is also difficult to find the right forum or audience from which responses
can be obtained in a short time. When we want more opinions in a discreet
manner, without the knowledge of those who surround us, the options become fewer
and fewer. In addition, crowdsourcing opinions on a personal topic is not
readily supported by existing applications. This document tries to envision an
application that meets this requirement.
The previous article continued the elaboration on the use
of public cloud services for provisioning the queue, the document store, and the
compute. It also talked a bit about the messaging platform required to support this
social-engineering application. The problems encountered with social
engineering are well-defined and have precedents in various commercial
applications. They are primarily about the feed for each user and the
propagation of solicitations to the crowd. The previous article described
selective fan-out. When the clients wake up, they can request their state to be
refreshed. This avoids the write update because the data does not need to be
sent out until it is requested. If the queue sends messages out to the clients
instead, it is a fan-out process. The devices can choose to check in at selective
times, and the server can be selective about which clients to update. Both
methods work well in certain situations. The fan-out happens on writes as well
as on loads, and it can be made selective in either direction; that is, the
fan-out can be limited during both pull and push. Disabling the writes to all
devices can significantly reduce the cost, and the other devices can load these
updates only when reading. It is also helpful to keep track of which clients
have been active over a recent period so that only those clients get preference.
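A minimal sketch of such selective fan-out follows, assuming hypothetical push_queue and pending_store interfaces for the messaging queue and for per-client deferred updates; the one-day activity window is also an assumption and would be tuned to how often clients check in.

import time

ACTIVE_WINDOW_SECS = 24 * 60 * 60  # hypothetical cutoff: seen within a day counts as active

def fan_out(update, client_ids, last_seen, push_queue, pending_store):
    # Selective fan-out: push to recently active clients and defer the rest
    # until they check in and pull their refreshed state themselves.
    now = time.time()
    for client_id in client_ids:
        if now - last_seen.get(client_id, 0.0) <= ACTIVE_WINDOW_SECS:
            # Active client: pay the write cost now so its feed stays fresh.
            push_queue.send(client_id, update)
        else:
            # Inactive client: no data is sent out; the update is loaded
            # only when this client next reads its feed.
            pending_store.append(client_id, update)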
Partitioning of data, especially horizontal partitioning
by userIDs or campaigns, is quite popular because it spreads the load across
servers. On the other hand, the campaign-responses table cannot be partitioned
this way, as it grows rapidly and has no predetermined order. It can, however, be
archived on a regular basis, with the older data ending up in a warehouse or in a
secondary table.
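As a minimal sketch, the routing behind such horizontal partitioning can be a stable hash of the userID; the shard count here is a hypothetical placeholder.

import hashlib

NUM_SHARDS = 16  # hypothetical number of database servers

def shard_for(user_id: str) -> int:
    # Use a stable hash (not Python's built-in hash(), which varies per
    # process) so the same userID always routes to the same shard.
    digest = hashlib.sha1(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % NUM_SHARDS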
The archival requires a table similar to the source, possibly in a different
database than the live one. The archival stored procedure could read the
records a few at a time from the source, insert them into the destination, and
delete the copied records from the source. The insertions and deletes would not
be expected to fail, since we select only the records that are in the source
but not yet in the destination. This way we remain in a good state between each
incremental move of records. This helps when there are so many records that
such a stored procedure runs long and becomes prone to interruptions or
failures. The archival can resume from where it left off, and the warehouse
can separately serve the read-only stacks that are only interested in the
aggregates.
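The same incremental, idempotent move can be sketched outside a stored procedure as well. The following uses SQLite as a stand-in for the live and archive databases; the campaign_responses table and its id/created columns are hypothetical.

import sqlite3

BATCH_SIZE = 1000  # move a few records at a time so a failure loses little work

def archive_batch(conn: sqlite3.Connection) -> int:
    # One incremental move: copy rows that are in the source but not yet in
    # the destination, then delete only rows that are safely archived.
    cur = conn.cursor()
    cur.execute("""
        INSERT INTO archive.campaign_responses
        SELECT s.* FROM campaign_responses AS s
        WHERE s.id NOT IN (SELECT id FROM archive.campaign_responses)
        ORDER BY s.created
        LIMIT ?""", (BATCH_SIZE,))
    moved = cur.rowcount
    cur.execute("""
        DELETE FROM campaign_responses
        WHERE id IN (SELECT id FROM archive.campaign_responses)""")
    conn.commit()
    # Re-running after an interruption is safe: the NOT IN filter prevents
    # duplicate inserts, so the archival resumes from where it left off.
    return moved

# Hypothetical usage: attach the archive database, then loop until done.
# conn = sqlite3.connect("live.db")
# conn.execute("ATTACH DATABASE 'archive.db' AS archive")
# while archive_batch(conn) > 0:
#     pass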
A million responses for a campaign can translate to a
million entries in the campaign-responses table. If those million records are of
particular interest to a query, the query can be served from materialized views of
these rows keyed by the ownerID or campaignID of the responses. The archival policy
of the table may be triggered by the number of rows in the table or by the
created/modified time of the oldest record, and so it may not migrate
these million rows as soon as they are created. Instead, the practical approach in
such a case is to roll up the million rows into a single row for that campaign,
incrementing the counts of votes in the yes and no categories; this can be done
right after all the entries have been added, without any loss of responses.
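A minimal sketch of that rollup, again assuming hypothetical campaign_responses(campaign_id, vote) and campaign_totals(campaign_id, yes_count, no_count) tables and a sqlite3 connection:

import sqlite3

def rollup_campaign(conn: sqlite3.Connection, campaign_id: int) -> None:
    # Collapse a campaign's individual responses into one tally row; run
    # after all entries have been added so no response is lost mid-rollup.
    cur = conn.cursor()
    cur.execute("""
        INSERT INTO campaign_totals (campaign_id, yes_count, no_count)
        SELECT campaign_id,
               SUM(CASE WHEN vote = 'yes' THEN 1 ELSE 0 END),
               SUM(CASE WHEN vote = 'no' THEN 1 ELSE 0 END)
        FROM campaign_responses
        WHERE campaign_id = ?
        GROUP BY campaign_id""", (campaign_id,))
    cur.execute("DELETE FROM campaign_responses WHERE campaign_id = ?",
                (campaign_id,))
    conn.commit()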
Databases, both on-premises and in the cloud, have largely alleviated the
performance impact of a million rows being added to a table, and such an insert
can be considered about as routine as a handful of entries.