Cluster computing

Tuesday, September 22, 2020

Network engineering continued...

This is a continuation of the article at http://ravinote.blogspot.com/2020/09/best-practice-from-networking.html

Protection against loss – Data at rest may get corrupted. To detect and repair such changes, we keep additional information alongside the data. One such technique is erasure coding: with the additional information, we can not only validate the existing data but may also be able to recreate the original data while tolerating a certain amount of loss. How we store the data and the erasure code also determines the level of redundancy we can use. If the data is in transit, it can be made tamper-evident and uninterpretable with authenticated encryption.
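
As a minimal sketch of the idea, the simplest erasure code is a single XOR parity block: with several equal-sized data blocks plus one parity block, any one lost block can be rebuilt from the rest. The function names below are illustrative only, and production systems typically use stronger codes such as Reed-Solomon that tolerate more than one loss.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, column by column."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def make_parity(data_blocks):
    """Compute one parity block for a list of equal-length data blocks."""
    return xor_blocks(data_blocks)


def recover_missing(surviving_blocks, parity):
    """Rebuild the single missing data block from the survivors and the parity."""
    return xor_blocks(surviving_blocks + [parity])


data = [b"net-", b"work", b"data"]
parity = make_parity(data)
# Pretend the second block was lost and rebuild it from the other two plus parity.
rebuilt = recover_missing([data[0], data[2]], parity)
```

The parity block costs one extra block of storage regardless of how many data blocks it covers, which is the trade-off between redundancy and tolerated loss that the paragraph describes.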

Hot warm cold – Data is treated differently based on how it is accessed. Hot data is actively read and written. Warm and cold indicate progressively less activity on the data. Each label allows different leeway in the treatment of the data and in the cost of network flow.
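
One way to picture the labels is a classifier driven by the time of last access. This is only a sketch: the thresholds are hypothetical, and using the last-access time as the sole signal is an assumption rather than something the text prescribes.

```python
from datetime import datetime, timedelta

# Hypothetical thresholds; real systems tune these to their access patterns.
HOT_WINDOW = timedelta(days=7)
WARM_WINDOW = timedelta(days=90)


def classify_tier(last_access, now):
    """Label data as hot, warm, or cold from how long ago it was last accessed."""
    age = now - last_access
    if age <= HOT_WINDOW:
        return "hot"
    if age <= WARM_WINDOW:
        return "warm"
    return "cold"
```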

The organizational unit of data – Networking is always organized in layers, because each layer separates its own concerns and communicates with a peer at the same level across a hybrid network.

Seal your packet – Every packet has a header that records where the payload starts and how long it is. Even if the data is chunked, each packet has to be well-formed so that any tool or application can validate its representation.
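
A sealed packet can be sketched with a fixed header that any receiver can check. The header layout below (version, payload offset, payload length, CRC32) is a made-up example for illustration and does not correspond to any real protocol.

```python
import struct
import zlib

# Hypothetical header layout: version (1 byte), payload offset (1 byte),
# payload length (2 bytes), CRC32 of the payload (4 bytes), network byte order.
HEADER = struct.Struct("!BBHI")


def seal(payload, version=1):
    """Prefix the payload with a header that describes and checks it."""
    header = HEADER.pack(version, HEADER.size, len(payload), zlib.crc32(payload))
    return header + payload


def unseal(packet):
    """Validate a packet and return its payload, or raise ValueError."""
    if len(packet) < HEADER.size:
        raise ValueError("packet shorter than header")
    version, offset, length, checksum = HEADER.unpack_from(packet)
    if offset < HEADER.size or offset + length != len(packet):
        raise ValueError("payload offset or length does not match the packet")
    payload = packet[offset:offset + length]
    if zlib.crc32(payload) != checksum:
        raise ValueError("payload checksum mismatch")
    return payload
```

Because the check relies only on the header, a validator needs no knowledge of what the payload means.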

Versions and policy – As with most libraries, packet headers can be versioned and versions can be managed with policies. Headers may be static but policies can be dynamic. When the state of a software-defined network is kept as a series of revisions, users can go back in time and track the changes.

Monday, September 21, 2020

Network Engineering (continued) ...

This is a continuation of the article at http://ravinote.blogspot.com/2020/09/best-practice-from-networking.html

WebSocket – facilitates full-duplex communication over a single connection. It starts with an HTTP upgrade handshake but is otherwise independent of HTTP. Both the client and the server can be a producer as well as a consumer, and both can push events.

Address – Universal addressing without exhaustion is possible with IPv6 connectivity. This is independent of the existing IPv4 connectivity that powers the internet as we know it.

Binding – These can be of three types: TCP/IP binding, HTTP binding, and net MSMQ binding. Each of them represents a different way for an endpoint to be set up.

Contract – becomes a descriptor for the service, just like the address and the binding, and tells the client how to connect to the endpoint of the service. Contracts can support stateful protocols, but they are verbose, static, and brittle, and they became less popular in the face of growing competition from stateless designs that use pre-determined and well-accepted verbs.

Stateful and Stateless design – In a stateless design, each request is individually authenticated, authorized, audited, and optionally encrypted. Resources are released cleanly after each request-response exchange. The well-established protocols foster a community of developers, tools, and ecosystems.
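
The per-request authentication of a stateless design can be sketched with a signature that travels with every request, so the server keeps no session between calls. The shared secret, the function names, and the status codes returned are all assumptions made for this illustration.

```python
import hashlib
import hmac

SECRET = b"demo-secret"  # hypothetical shared key, for illustration only


def sign(method, path, body):
    """Return a signature covering the verb, the resource, and the payload."""
    message = b"\n".join([method.encode(), path.encode(), body])
    return hmac.new(SECRET, message, hashlib.sha256).hexdigest()


def handle(method, path, body, signature):
    """Authenticate one request using only what the request itself carries."""
    if not hmac.compare_digest(sign(method, path, body), signature):
        return 401  # rejected: the signature does not match this request
    return 200  # accepted: nothing is remembered for the next request
```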

Sunday, September 20, 2020

Best practice from networking

Introduction: Networking is one of the three pillars of any commercial software. The other two are compute and storage. The three appear as products that implement solutions, as components that make up products, as perspectives on the implementation details of a feature within a product, and so on. Every algorithm that is implemented pays attention to these three perspectives in order to be efficient and correct. We cannot think of distributed or parallel algorithms without network, efficiency without storage, and convergence without compute. Each of these disciplines therefore brings certain best practices from the industry.

We list a few in this article from a networking perspective:

Not a singleton – Most network vendors know that networking is about data communications. Data must not be lost or corrupted. Therefore, network industry vendors go to great lengths to make data safe in transit by not allowing a single point of failure, such as a hop failure. If the data is written to the wire, it is eventually relayed to the recipient.

Chunked data – Packets form the core unit of transmission in any network. If a frame is too long, it may suffer transmission failures and require retries. If the data is chunked instead, a failure affects only one small packet, which alone needs to be resent, while the remaining packets need to be sent only once.
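
The benefit of chunking can be sketched as follows: the data is split into numbered chunks, and only a chunk that fails is sent again. The `send` callback and the retry limit are hypothetical stand-ins for a real transport.

```python
def chunk(data, size):
    """Split data into numbered chunks of at most `size` bytes."""
    return [(seq, data[i:i + size]) for seq, i in enumerate(range(0, len(data), size))]


def send_all(chunks, send, max_attempts=3):
    """Send each chunk, retrying only the chunk that failed.

    `send` is a caller-supplied function (hypothetical here) that returns
    True when the chunk was delivered. Returns the total number of sends.
    """
    attempts = 0
    for seq, payload in chunks:
        for _ in range(max_attempts):
            attempts += 1
            if send(seq, payload):
                break
        else:
            # The inner loop never hit `break`, so every attempt failed.
            raise ConnectionError(f"chunk {seq} failed after {max_attempts} attempts")
    return attempts
```

With three chunks and one transient failure, the total comes to four sends rather than a resend of the whole payload.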

Global connectivity – The public cloud has shown that it is a massive sponge for global traffic, allowing data to be consolidated in the datacenters behind the cloud. This makes networking popular for application and universal connectivity.

Mobile IP – The ability to appear as if working from an office computer, with the same address, while roaming across different networks gives unparalleled mobility that is only possible through networking.

Tunneling – Tunneling wraps an existing packet with an additional header in the same IP protocol, so that packets can safely cross a public network while the endpoints on either end remain part of a secure network. Virtual private network protocols help with this.

Saturday, September 19, 2020

An Email campaign management system

Problem statement

An email campaign management system empowers a user to send automated emails to many recipients. The content and the broadcast of the email are together referred to as a campaign. A sample use case helps describe the problem and this solution. Let us say a job seeker wants to mail out a template with a standard cover letter and resume as a self-introduction and advertisement to all their acquaintances. In this case, the letter and the resume become part of the campaign, and the user may want to change the campaign and the target audience. The ability to do so from a web interface helps make the interaction minimal and error-free. The content can be uploaded as files while the email recipients can be added from the browser. Once the content and the intended group are in place, the candidate can click on a button to mail the recipients using SMTP.

Role of a Database:

A database is useful to keep a table of entries and to support create, update, and delete operations independent of the purpose for which these contacts are accrued. The table serves well for an online transaction processing system and the interface to use such a table follows a standard convention for the usage.
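
A minimal sketch of such a table, using an in-memory SQLite database as a stand-in for whichever database the system actually uses. The table name, column names, and helper functions are assumptions made for this example.

```python
import sqlite3

# In-memory SQLite stands in for the real database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE contacts ("
    "id INTEGER PRIMARY KEY, name TEXT NOT NULL, email TEXT NOT NULL UNIQUE)"
)


def add_contact(name, email):
    """Create a contact and return its generated id."""
    with conn:  # commits on success, rolls back on error
        cursor = conn.execute(
            "INSERT INTO contacts (name, email) VALUES (?, ?)", (name, email)
        )
        return cursor.lastrowid


def update_email(contact_id, email):
    """Update the email address of an existing contact."""
    with conn:
        conn.execute("UPDATE contacts SET email = ? WHERE id = ?", (email, contact_id))


def delete_contact(contact_id):
    """Delete a contact by id."""
    with conn:
        conn.execute("DELETE FROM contacts WHERE id = ?", (contact_id,))


def list_emails():
    """Return all stored email addresses ordered by id."""
    return [row[0] for row in conn.execute("SELECT email FROM contacts ORDER BY id")]
```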

Role of a Message Broker:

A message broker is useful for sending messages to multiple recipients with retries and a dead-letter queue. Besides, it journals the messages and activities for later review. Messaging protocols are well-known and enable scriptability with a variety of libraries and packages.
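
The retry and dead-letter behavior can be sketched in-process with a plain queue. A real broker provides this as a service; the function name, the journal format, and the retry limit here are illustrative assumptions.

```python
from collections import deque


def process_with_retries(messages, handler, max_retries=2):
    """Deliver each message, requeueing failures until retries run out.

    Returns (journal, dead_letters). `handler` is a caller-supplied function
    that raises an exception when delivery fails.
    """
    queue = deque((message, 0) for message in messages)
    journal, dead_letters = [], []
    while queue:
        message, attempt = queue.popleft()
        try:
            handler(message)
            journal.append((message, attempt, "delivered"))
        except Exception as error:
            journal.append((message, attempt, f"failed: {error}"))
            if attempt < max_retries:
                queue.append((message, attempt + 1))  # retry later
            else:
                dead_letters.append(message)  # park for manual review
    return journal, dead_letters
```

The journal records every attempt, which is what makes a later review of the campaign possible.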

Role of a user interface:

The user interface is intended only for one user – the campaign manager. The campaign manager can not only feed the recipients and the content but also review the activity and progress as the campaign is mailed out.

Design:

A system that enables campaigns to be generated is a good candidate for automation as a background job. Let us look at this job in a bit more detail:

1) First, we need a way to specify the criteria for selecting the email recipients. This can be done with a set of logical conditions using ‘or’ and ‘and’ operators. Each condition is a rule and the rules may have some order to them. They are best expressed as a stored procedure with versioning. If the data to filter resides in a table, such as a customers table, then the stored procedure resides as close to the data as possible.

2) The message may need to be formatted and prepared for each customer. Consequently, the messages can be put in a queue where they are appropriately filled and made ready. A message broker is very useful in this regard. Preparation of emails is followed by sending them out, so there may be separate queues for preparation and mailing, and they may be chained.

3) The number of recipients may be quite large and the mails for each of them may need to be prepared and sent out. This calls for parallelization. One way to handle this parallelization would be to have workers spawned to handle the load on the queue. Celery is a good example of such a capability, and it works well with a message broker.

4) A web interface to generate campaigns can be useful for administrators to interact with the system.

The data flow begins with the administrator defining the campaign. This consists of, at the very least, the following: a) the email recipients, b) the mail template, c) the data sources from which to populate the template, and d) the schedule on which to send out the mails.

The email recipients need not always be specified explicitly, especially if they number in the millions. Instead, the recipients may already be listed in a database somewhere, and there may only be selection criteria for filtering the entire list to choose the recipients. Such criteria are best expressed in the form of a stored procedure. The translation of the user-defined criteria into a stored procedure is not very hard. The user is given a set of constraints, logical operators, and input values, and these can be joined to form predicates which are then entered into the body of a stored procedure. Each time the criteria are executed through the stored procedure, the result set forms the recipients' list. When the criteria change, the stored procedure is changed, and this results in a new version. Since the criteria and stored procedure are independent of the message preparation and mailing, they can be offline to the mailing process.
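
The translation from user-defined criteria to predicates can be sketched as below. SQLite has no stored procedures, so this sketch builds a parameterized query instead of a procedure body. The `customers` table, its columns, and the whitelists are assumptions made for the example.

```python
import sqlite3

# Whitelists keep user input out of the SQL text itself; values are bound as parameters.
ALLOWED_COLUMNS = {"city", "segment", "age"}
ALLOWED_OPERATORS = {"=", "<>", "<", ">", "<=", ">="}


def build_query(conditions, joiner="AND"):
    """Turn [(column, operator, value), ...] into SQL text plus bound parameters."""
    if joiner not in ("AND", "OR"):
        raise ValueError("joiner must be AND or OR")
    predicates, params = [], []
    for column, operator, value in conditions:
        if column not in ALLOWED_COLUMNS or operator not in ALLOWED_OPERATORS:
            raise ValueError(f"unsupported condition: {column} {operator}")
        predicates.append(f"{column} {operator} ?")
        params.append(value)
    where = f" {joiner} ".join(predicates) or "1 = 1"
    return f"SELECT email FROM customers WHERE {where} ORDER BY email", params


def select_recipients(conn, conditions, joiner="AND"):
    """Run the generated query and return the recipients' list."""
    sql, params = build_query(conditions, joiner)
    return [row[0] for row in conn.execute(sql, params)]
```

Storing the condition list alongside a version number would give the same "new criteria, new version" behavior that the stored procedure provides.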

The mailing process commences with the email list determined as above. The next step is to acquire the data needed for each template. For example, the template may refer to the resources that the recipients have, but the list of resources may need to be pulled from another database. It would be ideal if this could also be treated as SQL queries which provide the data that a task then uses to populate the contents of the email. Since this is done on a per-email basis, it can be parallelized across a worker pool where each worker grabs an email to prepare. Each email needs a recipient and content. Initially, the template is dropped on the queue with just the email recipient mentioned. The task then manages the conversion of the template to the actual email message before putting it on the queue for dispatch. The dispatcher simply mails out the prepared email with SMTP.
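
The chained preparation and dispatch queues can be sketched with the standard library alone. This is not how Celery or a message broker is wired up; it only illustrates the flow. The `lookup` and `dispatch` callbacks are hypothetical stand-ins for the SQL queries and the SMTP dispatcher.

```python
import queue
import threading
from string import Template

STOP = object()  # sentinel that tells a worker to exit


def prepare_worker(template, prep_queue, mail_queue, lookup):
    """Fill the template for each recipient and pass the result on for dispatch."""
    while (recipient := prep_queue.get()) is not STOP:
        body = template.substitute(lookup(recipient))
        mail_queue.put((recipient, body))


def run_campaign(template_text, recipients, lookup, dispatch, workers=4):
    """Chain a preparation queue into a mailing queue; return the number dispatched."""
    prep_queue, mail_queue = queue.Queue(), queue.Queue()
    template = Template(template_text)
    pool = [
        threading.Thread(target=prepare_worker, args=(template, prep_queue, mail_queue, lookup))
        for _ in range(workers)
    ]
    for thread in pool:
        thread.start()
    for recipient in recipients:
        prep_queue.put(recipient)  # initially only the recipient is on the queue
    for _ in pool:
        prep_queue.put(STOP)  # one sentinel per worker
    for thread in pool:
        thread.join()
    sent = 0
    while not mail_queue.empty():  # safe here: all producers have finished
        dispatch(*mail_queue.get())
        sent += 1
    return sent
```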

A task-parallel library may hide the message broker behind its parallelization. Celery, for example, relies on a message broker such as RabbitMQ or Redis and can record the status of the enqueued tasks. However, working directly with a fully-fledged message broker and a worker pool is preferred because it gives much more control over the queue and the messages permitted on the queue. Moreover, journaling and logging can be automated. Messages may be purged from the queue so that the automation stops on user demand.

Therefore, data flows from data sources into the emails that are then mailed out. The tasks that prepare the emails need access to the database tables and stored procedures that determine who the recipients are and what the message is. Since each task acts on an individual email, the tasks are scalable.

Intelligent Routines:

The ability to form groups of recipients based on classification rules adds intelligence to the system and does away with manual entry of data. The classifiers that form the groups depend on a set of rules that can be specified independently. Each rule can be added via the user interface, and the campaign can then be mailed out to the recipients it selects.

Monitoring the progress:

The message broker helps with queue statistics, where the number of messages remaining on the queue indicates the progress. When a message has been processed, the status of the item reflects it. Subsequent read-only queries on the status give an indication of the progress.
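
A read-only progress query can be sketched as a summary over per-message statuses. The status labels and the shape of the returned report are hypothetical; a real broker would expose its own statistics.

```python
from collections import Counter


def campaign_progress(statuses):
    """Summarize per-message statuses into counts and a completion percentage.

    `statuses` maps a message id to a status string such as "queued",
    "sent", or "failed" (hypothetical labels).
    """
    counts = Counter(statuses.values())
    total = len(statuses)
    done = counts["sent"] + counts["failed"]  # anything no longer waiting
    percent = 100.0 * done / total if total else 0.0
    return {"counts": dict(counts), "percent_complete": percent}
```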

Testing:

Each piece of content and each group should be verified independently. The mailing of a campaign should have a dry run before it is mass-mailed to the intended recipients.
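
A dry run can be sketched as a flag on the mailing step that composes every message but skips the actual send. The function names are illustrative, and `send` is a caller-supplied callback; in production it could wrap smtplib.SMTP.send_message.

```python
from email.message import EmailMessage


def build_message(sender, recipient, subject, body):
    """Compose a plain-text email without sending it."""
    message = EmailMessage()
    message["From"] = sender
    message["To"] = recipient
    message["Subject"] = subject
    message.set_content(body)
    return message


def mail_campaign(messages, send, dry_run=True):
    """Send every message, or only report what would be sent when dry_run is set."""
    report = []
    for message in messages:
        if not dry_run:
            send(message)
        status = "dry-run" if dry_run else "sent"
        report.append((message["To"], message["Subject"], status))
    return report
```

Defaulting `dry_run` to True means a mass mailing has to be requested explicitly.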

Conclusion:

Implementation of an email campaign system allows the flexibility to customize all parts of the campaign process even beyond the capabilities of off-the-shelf automation systems.

Reference: paper titled “Queues are databases” by Jim Gray

Alternatives:

Instead of a database, issue-tracking software such as Jira can also be used together with the message broker. For example: https://github.com/ravibeta/PythonExamples/blob/master/seemq.py