Tuesday, March 9, 2021

   Preparation for deploying API services to the cloud (continued...)

This is a continuation of the previous post

  1. Conditional modifications – ETags: the server avoids sending the full response if the content has not changed.
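A minimal sketch of how the ETag check might look in a Python/Flask handler; the catalog resource, its serialization, and the SHA-256 tag are assumptions for illustration only:

```python
import hashlib
import json

from flask import Flask, Response, request

app = Flask(__name__)
CATALOG = {"items": ["a", "b", "c"]}  # hypothetical resource


@app.route("/catalog")
def get_catalog():
    body = json.dumps(CATALOG, sort_keys=True)
    etag = hashlib.sha256(body.encode()).hexdigest()
    # If the caller already holds the current version, skip the full response.
    if request.headers.get("If-None-Match") == etag:
        return Response(status=304)
    return Response(body, mimetype="application/json", headers={"ETag": etag})
```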


  1. Absolute Redirects – useful for delegation and for automatically enabling clients to fulfill their requests elsewhere.


  1. Link headers, or discoverability through links in the response content – enables callers to discover resources as they make calls and reduces trial and error.


  1. Canonical URLs – enable consistency and resolution, which also works out great for pattern matching.


  1. Chunked transfer encoding – in HTTP/1.1 this is the way to send a response whose length is not known up front.
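A sketch of streaming a response without a known length, assuming a Flask app; whether the wire framing is actually chunked depends on the HTTP/1.1 server sitting in front of the generator:

```python
from flask import Flask, Response

app = Flask(__name__)


@app.route("/export")
def export():
    def generate():
        # Pieces are sent as they are produced; with no Content-Length,
        # an HTTP/1.1 server typically falls back to chunked transfer encoding.
        for i in range(3):
            yield f"row-{i}\n"

    return Response(generate(), mimetype="text/plain")
```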


  1. X-HTTP-Method-Override – very useful for getting past firewalls, since it is relatively easy to modify parameters.
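One way to honor the override header is a small WSGI middleware in front of the app; this sketch assumes only POST requests may be overridden, and only to a fixed set of verbs:

```python
class MethodOverrideMiddleware:
    """Rewrite the request method from the X-HTTP-Method-Override header."""

    ALLOWED = {"PUT", "PATCH", "DELETE"}

    def __init__(self, wsgi_app):
        self.wsgi_app = wsgi_app

    def __call__(self, environ, start_response):
        override = environ.get("HTTP_X_HTTP_METHOD_OVERRIDE", "").upper()
        if environ.get("REQUEST_METHOD") == "POST" and override in self.ALLOWED:
            environ["REQUEST_METHOD"] = override
        return self.wsgi_app(environ, start_response)


# Usage with a hypothetical Flask app:
# app.wsgi_app = MethodOverrideMiddleware(app.wsgi_app)
```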


  1. URLs under 2000 characters – longer URLs are not only an eyesore, they also make typos harder to spot.


  1. Statelessness – leaves the client to maintain state and enables retries.


  1. ?format=json – the content might be the same, but the format guides integration with other systems. For example, virtual data warehouses prefer JSON.

  1. URI Templates – determine patterns that can be exploited.


  1. Semantic interpretation of resources: also helps with semantic search, which goes beyond syntax.


  1. Versioning: features usually span releases. Versioning signals breaking changes and adds discoverable information in logs.


  1. Authorization: the privilege granted is easy to map to response codes 


  1. Bulk operations: reduce individual calls while giving the server the opportunity to handle them differently.
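A sketch of a batch endpoint that reports per-item outcomes in one round trip; the `:batch` route, the `create_item` helper, and the 207 multi-status summary are assumptions for illustration:

```python
from flask import Flask, jsonify, request

app = Flask(__name__)


def create_item(item):
    """Hypothetical per-item handler; rejects items without an id."""
    if "id" not in item:
        raise ValueError("missing id")
    return item


@app.route("/items:batch", methods=["POST"])
def batch_create():
    # One round trip instead of N calls; each element succeeds or fails on its own.
    results = []
    for item in request.get_json() or []:
        try:
            results.append({"item": create_item(item), "status": 201})
        except ValueError as err:
            results.append({"status": 400, "error": str(err)})
    return jsonify(results), 207
```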


  1. Query parameters for limit and offset standardize listing behavior across resources.
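A sketch of uniform limit/offset handling, assuming a Flask app and an in-memory list standing in for the real store; the default page size and the cap are arbitrary choices:

```python
from flask import Flask, jsonify, request

app = Flask(__name__)
ITEMS = [{"id": i} for i in range(1000)]  # hypothetical data


@app.route("/items")
def list_items():
    # Clamp the limit so a single call cannot request an unbounded page.
    limit = min(int(request.args.get("limit", 20)), 100)
    offset = max(int(request.args.get("offset", 0)), 0)
    return jsonify({
        "limit": limit,
        "offset": offset,
        "items": ITEMS[offset:offset + limit],
    })
```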


  1. No Unicode in URLs: enables searchability while reducing errors.  


  1. Error logging: this alone reduces maintenance costs for the organization.


  1. Timestamps: critical for correlation and establishing order among events.
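A one-line sketch of emitting timestamps in a sortable, timezone-unambiguous form (ISO 8601 in UTC):

```python
from datetime import datetime, timezone

print(datetime.now(timezone.utc).isoformat())  # e.g. 2021-03-09T17:00:00.123456+00:00
```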


  1. SSL encryption is a necessity, and it is typically turned on or off at a level above individual requests.


  1. Retry-After protects server health while providing a clear directive to the caller.
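A sketch of a throttled response carrying Retry-After; the 30-second window is an assumption, and how the caller is identified and counted is left out:

```python
from flask import Response


def too_many_requests(seconds: int = 30) -> Response:
    """Return a 429 that tells the caller exactly when to try again."""
    return Response(
        "rate limit exceeded",
        status=429,
        headers={"Retry-After": str(seconds)},
    )
```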


  1. DoS prevention: these security measures help improve the uptime and availability of the server.


  1. CSRF protection: prevents forgery and enables compliance with security standards.
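A minimal sketch of issuing and validating a CSRF token with the standard library; how the session id reaches the handler and where the key is stored are assumptions left to the deployment:

```python
import hmac
import secrets

SECRET_KEY = secrets.token_bytes(32)  # assumption: persisted per deployment


def issue_csrf_token(session_id: str) -> str:
    """Derive a token bound to the caller's session."""
    return hmac.new(SECRET_KEY, session_id.encode(), "sha256").hexdigest()


def verify_csrf_token(session_id: str, token: str) -> bool:
    # Constant-time comparison avoids leaking information through timing.
    return hmac.compare_digest(issue_csrf_token(session_id), token)
```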


  1. Testing: browser-based testing is one of the most popular modes of testing 


  1. Documentation: one of the must-haves to endear the service to the developer audience.


  1. Logs: All local logs drain via Syslog, but the option to use a log index is reserved for large deployments. That instance can be shared across service and application deployments by separating indexes, and an investment in a dedicated log-indexing product will reduce the cost of operations.
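Draining local logs via Syslog can be as small as attaching the standard library handler; the address below assumes a syslog daemon on the default UDP port, from which a log index can pick things up:

```python
import logging
from logging.handlers import SysLogHandler

logger = logging.getLogger("api-service")
logger.setLevel(logging.INFO)
# Assumption: a syslog daemon listening on localhost:514.
logger.addHandler(SysLogHandler(address=("localhost", 514)))

logger.info("order-service started, version=%s", "1.2.3")
```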


  1. Metrics: Metrics don’t just look good on the operations dashboard; they are also valuable from a programmability standpoint. This is easy to achieve with a dedicated Grafana, InfluxDB, and SQL stack. Just like a solution for log indexes, a solution for metrics will lower the reporting and manageability costs of operations.

  1. Events: Earlier, events used to be analyzed exclusively via message brokers. This is now overcome with stream processors such as Apache Flink and Apache Spark and their event-processing APIs. Storage and analytics platforms are also savvy about offering an integrated solution for events.


  1. Notifications: No one should have to watch the dashboard manually for threshold breaches; that is left to the automation of notifications from events. Notifications can be generated from the solutions catering to logs, events, and metrics, and this is a one-time cost.
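A sketch of turning a threshold breach into an automated notification; the webhook URL, the 5% error-rate threshold, and the use of the `requests` library are all assumptions:

```python
import requests

WEBHOOK_URL = "https://example.invalid/alerts"  # placeholder endpoint
ERROR_RATE_THRESHOLD = 0.05                     # hypothetical 5% threshold


def notify_if_breached(error_rate: float) -> None:
    """Post an alert instead of relying on someone watching the dashboard."""
    if error_rate > ERROR_RATE_THRESHOLD:
        requests.post(
            WEBHOOK_URL,
            json={"alert": "error rate breach", "value": error_rate},
            timeout=5,
        )
```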

(...to be continued)


Monday, March 8, 2021

  Preparation for deploying API services to the cloud (continued...)

This is a continuation of the previous post

  1. Create pipelines and dashboards for operations: Continuous Integration, Continuous Deployment, and Continuous Monitoring are core aspects of API service deployments. Investment in tools such as log indexing can automate and enable proper alerts and notifications to tend to the services. 


  1. A pipeline for staged progression of code to the production environment allows multiple opportunities to test and roll back the changes. The tests can even mix in those from other flows so that the code can be vetted against other environmental factors.


  1. The deployment must be automated so that it can be repeated in different stages and environments. With the code propagation streamlined to not require manual intervention, it is possible to have a continuous release. 


  1. Each stage or environment in which the code is tested must have a dashboard so that the operations of the code in that environment can be analyzed. 


  1. Choices of implementation also improve deployment and operations. Let us investigate this a little more closely with the following checklist.


  1. Idempotent methods – when changes to the data occur only once, there is very little room for error and costly investigations. This kind of deterministic behavior is easier to test as well. 
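A sketch of an idempotent PUT keyed on a client-chosen identifier, so a retried request leaves the resource in the same state; the in-memory store is a stand-in for the real persistence layer:

```python
from flask import Flask, jsonify, request

app = Flask(__name__)
ORDERS = {}  # hypothetical store keyed by client-chosen id


@app.route("/orders/<order_id>", methods=["PUT"])
def put_order(order_id):
    # PUT carries the full representation: applying it once or many times
    # leaves the resource in the same state.
    created = order_id not in ORDERS
    ORDERS[order_id] = request.get_json()
    return jsonify(ORDERS[order_id]), 201 if created else 200
```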


  1. Authentication – beyond the usual identity determination, it provides assurance that the request is not forged and determines the authorization permitted. Very useful for scoping and role-based access control, which automatically becomes visible in the logs.


  1. Status codes – Created, Accepted, errors. Everyone uses the 200 status code, but the use of others helps in the reports rolled up for API response breakouts.

  1. Keep-Alive – uses the mechanism to specify a timeout or the number of requests honored, effectively forcing a new handshake after that.
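For the status-codes item above, a sketch of responding with something more truthful than a blanket 200; the queued report job and its identifier are hypothetical:

```python
from flask import Flask, jsonify, request

app = Flask(__name__)


@app.route("/reports", methods=["POST"])
def create_report():
    payload = request.get_json(silent=True)
    if not payload:
        return jsonify({"error": "empty request body"}), 400
    # Assumption: report generation is queued rather than done inline,
    # so 202 Accepted describes the outcome better than 200.
    job_id = "job-123"  # placeholder identifier
    return jsonify({"job": job_id}), 202, {"Location": f"/reports/{job_id}"}
```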


  1. Accept-Encoding & Content-Encoding – used to indicate compression; very useful for reducing payload size for listings.
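A sketch of honoring Accept-Encoding on a listing; compressing only above a size cutoff is an assumption, as is the 1 KB figure:

```python
import gzip
import json

from flask import Flask, Response, request

app = Flask(__name__)


@app.route("/listing")
def listing():
    body = json.dumps({"items": list(range(500))}).encode()
    if "gzip" in request.headers.get("Accept-Encoding", "") and len(body) > 1024:
        return Response(
            gzip.compress(body),
            mimetype="application/json",
            headers={"Content-Encoding": "gzip"},
        )
    return Response(body, mimetype="application/json")
```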


  1. Cache-Control: no-cache affects server performance in addition to deciding when to reach the backend.


  1. Last-Modified Cache Validation reduces load time and latency on page requests.  
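A sketch of Last-Modified validation with If-Modified-Since, assuming the resource tracks its own modification time; the fixed date is a placeholder:

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime

from flask import Flask, Response, request

app = Flask(__name__)
LAST_MODIFIED = datetime(2021, 3, 1, tzinfo=timezone.utc)  # placeholder


@app.route("/profile")
def profile():
    since = request.headers.get("If-Modified-Since")
    if since and parsedate_to_datetime(since) >= LAST_MODIFIED:
        return Response(status=304)
    return Response(
        '{"name": "example"}',
        mimetype="application/json",
        headers={"Last-Modified": format_datetime(LAST_MODIFIED, usegmt=True)},
    )
```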


  1. Conditional modifications – ETags: the server avoids sending the full response if the content has not changed.

Sunday, March 7, 2021

 The choices in data mining algorithms:  

There are several data mining algorithms that can be applied to a given dataset, and the choice is not always obvious. Some exploration of the data becomes necessary in this regard.

If the use case is well articulated, the choice of data mining algorithm becomes immediately clear. The use case becomes clear only when the data is well-known and the business objective is known. Usually, only the latter is mentioned, such as the prediction of an attribute associated with the data.


The dataset may also not be suitable for supervised learning, which requires labels to be already given for some training data. Some techniques are required to determine the rules with which to assign labels to the raw data. If the rules are available for business purposes, then the assignment of labels is merely an automation task and helps prepare the training set for the data.


In the absence of business rules to assign labels to the data, the dataset for data mining is usually large and cannot be compared by mere inspection; some visualization tools are necessary. In this regard, two algorithms stand out for making this task easier. First, the decision tree algorithm can be used to find relationships between the rows and to establish a visualization in the form of the attributes that are significant to the outcome. The tree can be pruned to see which attributes matter and which do not. The split of the nodes at each level helps visualize the relative strength of those attributes across rows. This is very helpful when the tree is generated without supervision.


The other algorithm is the Naive Bayes classifier, which can be used to assign data. This classifier is helpful for exploring data, finding relationships between input columns and predictable columns, and then using the initial exploration to create additional models. Since it compares across columns for a given row, it evaluates the binary probabilities with and without that attribute in each column.


Together these algorithms can help with the initial exploration of data to choose the right one for a given purpose. Usually, the split between training data and test data for the purpose of prediction is 70% for training and 30% for test. The preprocessing and initial exploration, even after extract-transform-load, help prepare the training data. The better the training, the better the result.
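A brief sketch of the exploration described above, using scikit-learn's decision tree and Naive Bayes classifiers with the 70/30 split; the synthetic dataset is only a stand-in for real data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the real dataset.
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

# 70% training data, 30% test data, as mentioned above.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

tree = DecisionTreeClassifier(max_depth=3).fit(X_train, y_train)  # pruning via max_depth
bayes = GaussianNB().fit(X_train, y_train)

print("decision tree accuracy:", tree.score(X_test, y_test))
print("naive Bayes accuracy:  ", bayes.score(X_test, y_test))
print("attribute importances: ", tree.feature_importances_)
```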

Saturday, March 6, 2021

 Preparation for deploying API services to the cloud (continued...)

This is a continuation of the previous post. We mentioned that APIs are desirable features to deploy because they enable automation, programmability, and connectivity from remote devices. Deploying the API to the cloud makes it even more popular now that the clients can reach them from anywhere that has IP connectivity. The public clouds offer immense capabilities to write API services and deploy them, but the preparation is largely left to the source. Some of the mentions we made include:

1) Choose the right technology: There is a variety of stacks depending on the language and platform to choose from. There are side-by-side comparisons available to choose from, and the investment is usually a one-time cost even if the technical debt accrues over time. 

2) Anticipate the load:  Some back-of-the-envelope calculations of the number of servers, based on the total load and the load per server, will help figure out the capacity required, but service-level agreements and performance indicators will help articulate those numbers better (a capacity sketch follows at the end of this list). 


3) Determine the storage:  Disk I/O is one of the most significant cost contributors, especially when it occurs over the network. ACID guarantees are required for some storage; for others, the size of the data matters more. The proper choice of storage, even if it is a public cloud global database, matters.


4) Determine the topology:  The firewall, load-balancers, proxies, and server distributions are only part of the topology. The data and control paths will vary based on topology and the right choices can make them more efficient. 


5) Tooling:  Investing in tooling will reduce troubleshooting costs.


6) Create pipelines and dashboards for operations: Continuous Integration, Continuous Deployment, and Continuous Monitoring are core aspects of API service deployments. Investment in tools such as Splunk can automate and enable proper alerts and notifications to tend to the services. 


7) A pipeline for staged progression of code to the production environment allows multiple opportunities to test and roll back the changes. The tests can even mix in those from other flows so that the code can be vetted against other environmental factors. 


8) The deployment must be automated so that it can be repeated in different stages and environments. With the code propagation streamlined to not require manual intervention, it is possible to have a continuous release.


9) Each stage or environment in which the code is tested must have a dashboard so that the operations of the code in that environment can be analyzed.
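
Referring back to item 2, a back-of-the-envelope capacity sketch; every number here is an assumption to be replaced by the service's own SLA targets and measured throughput:

```python
import math

peak_requests_per_second = 2000  # assumed total load at peak
requests_per_server = 350        # assumed sustainable load per server
headroom = 1.3                   # assumed 30% buffer for spikes and failover

servers_needed = math.ceil(peak_requests_per_second * headroom / requests_per_server)
print(f"servers needed: {servers_needed}")  # 8 with these assumptions
```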