Cluster computing

Design of an Email Campaign system:

Email campaigns are often required in lines of work that involve customers and the resources they request. Such campaigns usually target a subset of people chosen by some criteria. The mails also differ in format and content depending on the nature of the campaign. Furthermore, email templates may need to be filled in before they can be sent out, and retries may be involved.

A system that enables campaigns to be generated is a good candidate for automation as a background job. Let us look at this job in a bit more detail:

1) First we need a way to specify the criteria for selecting the email recipients. This can be done with a set of logical conditions using ‘or’ and ‘and’ operators. Each condition is a rule, and the rules may have some order to them. Consequently, the rules do not necessarily fit in a table. They are best expressed as a stored procedure with versioning. If the data to filter resides in a table, say the customers table, then the stored procedure sits as close to the data as possible.

2) The message may need to be formatted and prepared for each customer, so the messages can be put in a queue where they are filled in and made ready. A message broker comes in very useful in this regard. Preparing the emails is followed by sending them out, so there may be separate queues for preparation and mailing, and the two may be chained.

3) The number of recipients may be quite large and the mails for each of them may need to be prepared and sent out. This calls for parallelization. One way to handle this parallelization would be to have workers spawned to handle the load on the queue. Celery is a good example of such a capability and it works well with a message broker.

4) A web interface to generate campaigns can be useful for administrators to interact with the system.
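
Steps 2 and 3 can be sketched in a few lines. This is only an illustration under stated assumptions: the standard library `queue.Queue` stands in for the message broker, threads stand in for Celery workers, and the names `prepare_worker`, `dispatch_worker` and `run_campaign` are hypothetical. It shows a preparation queue chained to a dispatch queue, each drained by its own pool of workers.

```python
import queue
import threading

STOP = object()  # sentinel telling a worker to exit


def prepare_worker(prepare_q, dispatch_q):
    """Fill in the message for one recipient, then chain it to the dispatch queue."""
    while True:
        recipient = prepare_q.get()
        if recipient is STOP:
            break
        body = f"Hello {recipient}, your resources are ready."
        dispatch_q.put((recipient, body))


def dispatch_worker(dispatch_q, sent, lock):
    """Send one prepared message; here it is only recorded (assumption: no real mail server)."""
    while True:
        item = dispatch_q.get()
        if item is STOP:
            break
        with lock:
            sent.append(item)


def run_campaign(recipients, workers=4):
    prepare_q, dispatch_q = queue.Queue(), queue.Queue()
    sent, lock = [], threading.Lock()
    preparers = [threading.Thread(target=prepare_worker, args=(prepare_q, dispatch_q))
                 for _ in range(workers)]
    dispatchers = [threading.Thread(target=dispatch_worker, args=(dispatch_q, sent, lock))
                   for _ in range(workers)]
    for t in preparers + dispatchers:
        t.start()
    for r in recipients:
        prepare_q.put(r)
    for _ in preparers:        # one sentinel per preparation worker
        prepare_q.put(STOP)
    for t in preparers:
        t.join()
    for _ in dispatchers:      # safe now: nothing more will be prepared
        dispatch_q.put(STOP)
    for t in dispatchers:
        t.join()
    return sorted(sent)
```

Calling `run_campaign(["a@example.com", "b@example.com"])` returns one `(recipient, body)` pair per recipient. With a real broker, the two queues would live on the broker and the workers would run as separate processes.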

The data flow begins with the administrator defining the campaign. At the very least, this consists of the following: a) the email recipients, b) the mail template, c) the data sources from which to populate the template and d) the schedule on which to send out the mails.
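
One possible way to capture such a campaign definition is a small record type. The sketch below is hypothetical: the class name `Campaign` and all of its field names are assumptions made for illustration, with one field for each of the items a) through d).

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class Campaign:
    name: str
    recipient_criteria: str   # a) name of the stored procedure that selects recipients
    template: str             # b) mail template with placeholders
    send_at: datetime         # d) when the mails should go out
    # c) placeholder name -> query that supplies its value
    data_sources: dict = field(default_factory=dict)

    def is_due(self, now):
        """True once the scheduled time has been reached."""
        return now >= self.send_at
```

A background job could poll the stored campaigns and start the mailing process for each one whose `is_due` returns true.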

The email recipients need not always be specified explicitly, especially if they number in the millions. Instead, the recipients may already be listed in a database somewhere, and the campaign may only supply selection criteria for filtering that list. Such criteria are best expressed in the form of a stored procedure. Translating user-defined criteria into a stored procedure is not very hard. The user is given a set of constraints, logical operators and value inputs, and these can be joined to form predicates which are then entered as is into the body of a stored procedure. Each time the criteria are executed through the stored procedure, the result set forms the recipient list. When the criteria change, the stored procedure is changed and this results in a new version. Since the criteria and the stored procedure are independent of message preparation and mailing, they can be handled offline from the mailing process.
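
The translation from user input to predicates can be sketched as follows. Several things here are assumptions: SQLite has no stored procedures, so a plain query stands in for one; the `customers` columns and the helper names `build_where` and `select_recipients` are hypothetical; and values are bound as parameters rather than pasted into the text, which differs from entering the predicates as is.

```python
import sqlite3

ALLOWED_COLUMNS = {"region", "plan"}            # hypothetical customers columns
ALLOWED_OPERATORS = {"=", "!=", "<", ">", "<=", ">="}


def build_where(conditions):
    """Join (connector, column, operator, value) tuples into a WHERE clause.

    The connector of the first condition is ignored. Returns (sql, params).
    """
    if not conditions:
        return "1 = 1", []
    parts, params = [], []
    for i, (connector, column, operator, value) in enumerate(conditions):
        connector = connector.upper()
        if (column not in ALLOWED_COLUMNS or operator not in ALLOWED_OPERATORS
                or connector not in {"AND", "OR"}):
            raise ValueError(f"unsupported condition: {connector} {column} {operator}")
        clause = f"{column} {operator} ?"
        parts.append(clause if i == 0 else f"{connector} {clause}")
        params.append(value)
    return " ".join(parts), params


def select_recipients(conn, conditions):
    """Run the generated predicate; the result set is the recipient list."""
    where, params = build_where(conditions)
    rows = conn.execute(
        f"SELECT email FROM customers WHERE {where} ORDER BY email", params)
    return [email for (email,) in rows]
```

Note that SQL evaluates `AND` before `OR`, so the order in which the user lists the rules matters unless the generated text adds parentheses.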

The mailing process commences with the email list determined as above. The next step is to acquire the data needed for each template. For example, the template may refer to the resources that a recipient holds, but the list of resources may need to be pulled from another database. Ideally this is also expressed as SQL queries that provide the data a task then uses to populate the contents of the email. Since this is done on a per-email basis, it can be parallelized across a worker pool where each worker grabs an email to prepare. An email consists of a recipient and content. Initially the template is dropped on the queue with just the email recipient filled in. The task then converts the template into the actual email message before putting it on the queue for dispatch. The dispatcher simply mails out the prepared email over SMTP.
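
The preparation and dispatch steps might look like the sketch below, which uses only the Python standard library. The template text, the subject line, the sender address and the function names `prepare_email` and `dispatch` are assumptions chosen for illustration; the resource list would come from the data source queries described above.

```python
from email.message import EmailMessage
from string import Template

# Hypothetical template; $name and $resources are filled in per recipient.
TEMPLATE = Template("Hello $name,\n\nYou currently hold: $resources.\n")


def prepare_email(recipient, name, resources, sender="campaigns@example.com"):
    """Convert the template into an actual message for one recipient."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = "Your resources"
    msg.set_content(TEMPLATE.substitute(name=name, resources=", ".join(resources)))
    return msg


def dispatch(msg, smtp):
    """Mail out a prepared message.

    smtp is any object with a send_message method, such as an
    smtplib.SMTP connection.
    """
    smtp.send_message(msg)
```

Keeping `prepare_email` free of any network access lets the preparation workers be retried independently of the dispatcher.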

A task library may hide the message broker behind its parallelization API. Celery, for instance, delegates queuing to a message broker such as RabbitMQ or Redis and can record the status of enqueued tasks in a result backend. However, working directly with a fully fledged message broker and a worker pool is preferred because it gives much more control over the queue and the messages permitted on it. Moreover, journaling and logging can be included with the automation. Messages may be purged from the queue so that the automation stops on user demand.
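
Purging on user demand, together with a journal entry for it, can be illustrated with an in-process queue. This is a sketch only: `queue.Queue` stands in for the broker queue, and the function name `purge` and the logger name `campaign` are hypothetical. A real broker would offer its own purge operation.

```python
import logging
import queue

log = logging.getLogger("campaign")


def purge(q):
    """Drop every pending message so the workers find nothing left to do.

    Returns the number of messages removed.
    """
    dropped = 0
    while True:
        try:
            q.get_nowait()
        except queue.Empty:
            break
        dropped += 1
    log.info("purged %d pending messages", dropped)
    return dropped
```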

Therefore, data flows from the data sources into the emails, which are then mailed out. The task that prepares the emails needs access to the database tables and stored procedures that determine who the recipients are and what the message is. Since the tasks act on individual emails, they scale well.

Conclusion: An email campaign system can be written using a database, a message broker and Celery.

Interleaving a linked list from both ends:

1->2->3->4->5->6 becomes
1->6->2->5->3->4

class Node
{
    int value;
    Node next;
}

class ListInterleave
{
    // Rearranges the list in place: 1->2->3->4->5->6 becomes 1->6->2->5->3->4.
    static void Interleave(Node root)
    {
        Node start = root;
        // Stop once fewer than three nodes remain from start; nothing is left to move.
        while (start != null && start.next != null && start.next.next != null)
        {
            Node end = find_last(start);
            Remove(start, end);
            Insert(start, end);
            start = start.next.next; // skip past the node just inserted
        }
    }

    // Returns the last node of the list beginning at start.
    static Node find_last(Node start)
    {
        Node end = start;
        for (Node cur = start; cur != null; cur = cur.next)
        {
            end = cur;
        }
        return end;
    }

    // Unlinks end from the list beginning at start; end must not be start itself.
    static void Remove(Node start, Node end)
    {
        Node prev = start;
        while (prev.next != end)
        {
            prev = prev.next;
        }
        prev.next = end.next;
    }

    // Links end into the list immediately after position.
    static void Insert(Node position, Node end)
    {
        end.next = position.next;
        position.next = end;
    }
}
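
The approach above walks to the tail once per moved node, so it takes quadratic time. A different, linear-time technique is sketched here in Python: find the middle with slow and fast pointers, reverse the second half, then merge the two halves alternately. This is an alternative to the code above, not a translation of it, and the helper names `from_list` and `to_list` are hypothetical conveniences for trying it out.

```python
class Node:
    def __init__(self, value, next=None):
        self.value = value
        self.next = next


def interleave(head):
    """Reorder 1->2->3->4->5->6 into 1->6->2->5->3->4 in place."""
    if head is None or head.next is None:
        return head
    # Find the end of the first half with slow/fast pointers.
    slow, fast = head, head.next
    while fast is not None and fast.next is not None:
        slow = slow.next
        fast = fast.next.next
    # Detach the second half and reverse it.
    second = slow.next
    slow.next = None
    prev = None
    while second is not None:
        nxt = second.next
        second.next = prev
        prev = second
        second = nxt
    # Merge: take one node from the front, then one from the reversed back.
    first, second = head, prev
    while second is not None:
        first_next, second_next = first.next, second.next
        first.next = second
        second.next = first_next
        first, second = first_next, second_next
    return head


def from_list(values):
    """Build a linked list from a Python list; returns the head or None."""
    head = None
    for v in reversed(values):
        head = Node(v, head)
    return head


def to_list(head):
    """Collect the node values into a Python list."""
    out = []
    while head is not None:
        out.append(head.value)
        head = head.next
    return out
```

For example, `to_list(interleave(from_list([1, 2, 3, 4, 5, 6])))` gives `[1, 6, 2, 5, 3, 4]`, matching the sample at the top of this section.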


Friday, May 6, 2016
