Saturday, September 28, 2019

Humour behind Software engineering practice
This is an outline for a picture book, possibly with caricatures, to illustrate frequently encountered humour in the software engineering industry:
1) Build a Chimera because different parts evolve differently
2) Build a car with the hood first rather than the engine
3) Build a product with dysfunctional parts or with annoying and incredibly funny messages
4) Build a product with clunky parts that break down on the customer site
5) Build a product that forces you to twist your arms as you work with it
6) Build a product that makes you want to buy another from the same maker
7) Build a product that is highly priced to justify the cost of development and have the enterprise foot the bill
8) Build a product that features emerging trends and showcases quirky development rather than the use cases
9) Build a product that requires more manuals to be written but never read
10) Build a product that requires frequent updates with patches for security
11) Build a product that causes outages from frequent updates
12) Build a product that snoops on user activity or collects and sends information to the maker
13) Build a product that won’t play well with others to the detriment of the customer buying the software
14) Build a product that says so much in legal language that it amounts to extortion
15) Build a product that requires the customer to pay for servicing.
16) Build a product that splashes on the newspapers when a security flaw is found.
17) Build a product that finds a niche space and continues to grow by assimilating other products and results in huge debt towards blending the mix
18) Build a product with hard to find defects or those that involve significant costs and pain
19) Build a product with ambitious goals and then retrofit short term realizations
20) Build a product with poorly defined specification that causes immense code churn between releases causing ripples to the customer
21) Build a product with overly designed specification that goes obsolete by the time it hits the shelf.
22) Build a product that fails to comply with regional and global requirements and enforcements causing trouble for customers to work with each other
23) Build a product that creates a stack and ecosystem dedicated to its own growth and with significant impairment for partners and competitors to collaborate
24) Build a product that impairs interoperability to drive business
Software engineers frequently share images that illustrate their ideas, feelings and mechanisms during discussions while building the product. Some of these images resonate across the industry, but no effort has yet been undertaken to unify a comic theme and present it as a comic strip. Surprisingly, there is no dearth of such comic books and illustrations in parenting and those involving kids.

Friday, September 27, 2019

Performance Tuning considerations in Stream queries:
Query languages and query operators have made writing business logic extremely easy and independent of the data source. This suffices for the most part, but there are a few cases where the status quo is just not enough. With real-time processing needs and high-priority queries, the size of the data, the complexity of the computation and the latency of the response begin to become concerns.
Databases have a long and cherished history of encountering and mitigating slow query execution. However, relational databases pose a significantly different domain of considerations than NoSQL storage, primarily because their layered and interconnected data calls for scale-up rather than scale-out technologies. Each has its own performance tuning considerations.
Stream storage is no different in suffering performance issues from disparate queries ranging from small to big. The compounded effect of append-only data and streams that must be evaluated in windows makes iteration difficult. The processing of the streams is also exported out of the storage, which adds significant round-trip time and back-and-forth.
The Apache stack has significantly improved the offerings for stream processing. Apache Kafka and Flink are both able to execute with stateful processing. They can persist their state to allow processing to pick up where it left off. The state also helps with fault tolerance: this persistence protects against failures, including data loss. The consistency of the state can also be independently validated with the checkpointing mechanism available from Flink, which can persist local state to a remote store. Stream processing applications often take their incoming events from an event log. The event log stores and distributes event streams, which are written to a durable, append-only log on tier-2 storage where they remain sequential by time. Flink can recover a stateful streaming application by restoring its state from a previous checkpoint and adjusting the read position on the event log to match that checkpoint. Stateful stream processing is therefore suited not only to fault tolerance but also to reentrant processing and improved robustness, with the ability to make corrections. Stateful stream processing has become the norm for event-driven applications, data pipeline applications and data analytics applications.
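As a minimal sketch of the checkpointing described above, the following Flink job enables periodic checkpoints and points the state backend at durable storage; the interval, timeout and checkpoint path are assumptions chosen for illustration rather than recommendations:

import org.apache.flink.runtime.state.filesystem.FsStateBackend;
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointedJobSetup {
    public static void main(String[] args) throws Exception {
        final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // take a consistent checkpoint of all operator state every ten seconds
        env.enableCheckpointing(10_000L, CheckpointingMode.EXACTLY_ONCE);

        // keep the checkpointed state on durable tier-2 storage so a restarted job
        // can restore it and resume from the matching read position; the path is an assumption
        env.setStateBackend(new FsStateBackend("file:///tmp/flink-checkpoints"));

        // keep checkpointing lightweight: one checkpoint at a time, with a generous timeout
        env.getCheckpointConfig().setMaxConcurrentCheckpoints(1);
        env.getCheckpointConfig().setCheckpointTimeout(60_000L);

        // ... define sources, operators and sinks here, then:
        // env.execute("checkpointed-streaming-job");
    }
}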
Persistence of streams for intermediate executions helps with reusability and improves the pipelining of operations, so that the query operators are small and can be executed by independent actors. If we have the equivalent of lambda processing on persisted streams, pipelining can significantly improve performance over earlier designs where the logic was monolithic and proved slow in progressing from window to window. There is no definite rule of thumb, but fine-grained operators have proven effective since they can be studied independently.
Streams that articulate the intermediary results also help determine what goes into each stage of the pipeline. Watermarks and savepoints are similarly helpful. This kind of persistence proves to be a win-win for parallelizing as well as for subsequent processing, whereas disk access used to be costly in dedicated systems. There is no limit to the number of operators, or to their scale, that can be applied to streams, so proper planning mitigates the effort needed to choreograph a bulky search operation.
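To make the pipelining and watermark points concrete, here is a rough sketch that splits the work into small operators, assigns event-time watermarks and persists an intermediary result; the socket source, the parsing logic, the ten-second out-of-orderness bound and the output path are all assumptions:

import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.core.fs.FileSystem;
import org.apache.flink.streaming.api.TimeCharacteristic;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor;
import org.apache.flink.streaming.api.windowing.time.Time;

public class PipelineSketch {
    public static void main(String[] args) throws Exception {
        final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);

        // stage 1: a small ingest operator (a socket source is used here only as a stand-in)
        DataStream<String> raw = env.socketTextStream("localhost", 9999);

        // stage 2: a small parse operator producing (key, timestamp) pairs
        DataStream<Tuple2<String, Long>> parsed = raw
                .map(line -> {
                    String[] parts = line.split(",");
                    return new Tuple2<>(parts[0], Long.parseLong(parts[1]));
                })
                .returns(Types.TUPLE(Types.STRING, Types.LONG));

        // watermarks: tolerate ten seconds of out-of-order events (an assumed bound)
        DataStream<Tuple2<String, Long>> withWatermarks = parsed
                .assignTimestampsAndWatermarks(
                        new BoundedOutOfOrdernessTimestampExtractor<Tuple2<String, Long>>(Time.seconds(10)) {
                            @Override
                            public long extractTimestamp(Tuple2<String, Long> element) {
                                return element.f1;
                            }
                        });

        // persist the intermediary result so later stages or other jobs can pick it up
        withWatermarks.writeAsCsv("/tmp/intermediate-events", FileSystem.WriteMode.OVERWRITE);

        env.execute("pipeline-sketch");
        // a savepoint of the running job can be taken separately, e.g. with `bin/flink savepoint <jobId>`
    }
}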
These are some of the considerations for the performance improvement of stream processing.

Thursday, September 26, 2019

Since Event is a core Kubernetes resource much the same as the others, it not only carries metadata just like other resources but can also be created, updated and deleted via APIs, just like the others.
func recordEvent(sink EventSink, event *v1.Event, patch []byte, updateExistingEvent bool, eventCorrelator *EventCorrelator) bool { 
: 
newEvent, err = sink.Create(event) 
: 
} 
Events conform to append-only stream storage due to the sequential nature of the events. Events are also processed in windows, making a stream processor such as Flink extremely suitable for them. Stream processors benefit from stream storage, and such storage can be overlaid on any Tier-2 storage. In particular, object storage, unlike file storage, can be very useful for this purpose since the data also becomes web-accessible for other analysis stacks.
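As a rough illustration of window-based processing over such an event stream, the sketch below counts events per reason over one-minute tumbling windows; the socket source stands in for a real stream-storage connector, and the reason-extraction helper is an assumption:

import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.time.Time;

public class EventWindowCount {
    public static void main(String[] args) throws Exception {
        final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // stand-in source: one serialized event per line; a Pravega or Kafka source
        // connector would be used against real stream storage
        DataStream<String> events = env.socketTextStream("localhost", 9999);

        // count events per reason over tumbling one-minute windows
        DataStream<Tuple2<String, Integer>> countsPerReason = events
                .map(line -> new Tuple2<>(extractReason(line), 1))
                .returns(Types.TUPLE(Types.STRING, Types.INT))
                .keyBy(0)
                .timeWindow(Time.minutes(1))
                .sum(1);

        countsPerReason.print();
        env.execute("event-window-count");
    }

    // assumed helper: pulls the event reason out of the serialized event text
    private static String extractReason(String line) {
        int idx = line.indexOf("reason=");
        return idx >= 0 ? line.substring(idx + "reason=".length()).split("\\s")[0] : "unknown";
    }
}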

As compute, network and storage overlap to expand the possibilities in each frontier at cloud scale, message passing has become a ubiquitous functionality. While libraries like protocol buffers and solutions like RabbitMQ are becoming popular, flows and their queues can be given native support in unstructured storage. Messages are also time-stamped and can be treated as events.
Although stream storage is best for events, any time-series database could also work. However, they are not web-accessible unless they are in an object store. Their need for storage is not very different from applications requiring object storage to facilitate store and access. However, as object storage makes inroads into vectorized execution, the data transfers become increasingly fragmented and continuous. At this junction it is important to facilitate data transfer between objects and events, and it is in this space that events and the object store find suitability. Search, browse and query operations are facilitated in a web service using a web-accessible store.

Querying an event stream is easier with Flink than using one's own web service to enumerate the DataSet:
package org.apache.flink.examples.java.eventcount;

import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.api.java.utils.ParameterTool;
import org.apache.flink.util.Collector;

public class EventCount {

    public static void main(String[] args) throws Exception {

        final ParameterTool params = ParameterTool.fromArgs(args);
        final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
        env.getConfig().setGlobalJobParameters(params);

        // read the serialized events, one per line
        DataSet<String> text = env.readTextFile(params.get("input"));

        // extract the hour field from each event and count occurrences per hour
        DataSet<Tuple2<String, Integer>> counts =
                text.flatMap(new Tokenizer())
                    .groupBy(0)
                    .sum(1);
        counts.print();
    }

    public static final class Tokenizer implements FlatMapFunction<String, Tuple2<String, Integer>> {
        @Override
        public void flatMap(String value, Collector<Tuple2<String, Integer>> out) {
            String hours = getSubstring(value, "before");
            out.collect(new Tuple2<>(hours, 1));
        }
    }

    // assumed helper: returns the portion of the event text that precedes the given marker
    private static String getSubstring(String value, String marker) {
        int idx = value.indexOf(marker);
        return idx > 0 ? value.substring(0, idx).trim() : value;
    }
}

Wednesday, September 25, 2019

Logs may also optionally be converted to events, which can then be forwarded to an event gateway. Products like Badger DB are able to retain the events for a certain duration.

Components will then be able to raise events like so:
c.recorder.Event(component, corev1.EventTypeNormal, successReason, successMessage)

An audit event can be similarly raised as shown below:
        ev := &auditinternal.Event{
                RequestReceivedTimestamp: metav1.NewMicroTime(time.Now()),
                Verb:                     attribs.GetVerb(),
                RequestURI:               req.URL.RequestURI(),
                UserAgent:                maybeTruncateUserAgent(req),
                Level:                    level,
        }
With the additional data as:
        if attribs.IsResourceRequest() {
                ev.ObjectRef = &auditinternal.ObjectReference{
                        Namespace:   attribs.GetNamespace(),
                        Name:        attribs.GetName(),
                        Resource:    attribs.GetResource(),
                        Subresource: attribs.GetSubresource(),
                        APIGroup:    attribs.GetAPIGroup(),
                        APIVersion:  attribs.GetAPIVersion(),
                }
        }



Tuesday, September 24, 2019

We were discussing how to implement the Kubernetes sink destination as a service within the cluster. In this regard, the service merely needs to listen on a port.

import java.io.IOException;
import java.util.concurrent.TimeUnit;
import java.util.logging.Logger;

import io.grpc.Server;
import io.grpc.netty.NettyServerBuilder;

// CommonParams and PravegaServerImpl are assumed to be sample-local classes in the same package.
public class PravegaGateway {
    private static final Logger logger = Logger.getLogger(PravegaGateway.class.getName());

    private Server server;

    private void start() throws IOException {
        int port = CommonParams.getListenPort();
        server = NettyServerBuilder.forPort(port)
                .permitKeepAliveTime(1, TimeUnit.SECONDS)
                .permitKeepAliveWithoutCalls(true)
                .addService(new PravegaServerImpl())
                .build()
                .start();
        logger.info("Server started, listening on " + port);
        logger.info("Pravega controller is " + CommonParams.getControllerURI());
        Runtime.getRuntime().addShutdownHook(new Thread() {
            @Override
            public void run() {
                // Use stderr here since the logger may have been reset by its JVM shutdown hook.
                System.err.println("*** shutting down gRPC server since JVM is shutting down");
                PravegaGateway.this.stop();
                System.err.println("*** server shut down");
            }
        });
    }

    // Counterpart referenced by the shutdown hook: stops accepting new gRPC calls.
    private void stop() {
        if (server != null) {
            server.shutdown();
        }
    }
}

Monday, September 23, 2019

We continue with our discussion on logging in Kubernetes. In the example provided, a log service is assumed to be available. This service is accessible at a predefined host and port.
Implementing the log service within the cluster, listening on a TCP port, helps move the log destination inside the cluster as stream storage and as yet another application within the cluster. A service listening on a port is typical for a Java application. The application will read from the socket and write to a pre-determined stream defined by the following parameters; a minimal listener sketch follows the parameter list:
Stream name
Stream scope name
Stream reading timeout
Stream credential username
Stream credential password
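A minimal sketch of such a listener is shown here; it assumes a hypothetical wrapper class (called LogStreamWriter purely for illustration) exposing the write() method shown below, and the port, scope, stream name and controller endpoint are likewise assumptions:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.URI;

public class LogSocketService {
    public static void main(String[] args) throws Exception {
        final URI controllerURI = URI.create("tcp://pravega-controller:9090"); // assumed controller endpoint
        try (ServerSocket listener = new ServerSocket(5140)) {                 // assumed listen port
            while (true) {
                // accept one log-forwarding connection at a time and read it line by line
                try (Socket client = listener.accept();
                     BufferedReader in = new BufferedReader(
                             new InputStreamReader(client.getInputStream()))) {
                    String line;
                    while ((line = in.readLine()) != null) {
                        // delegate to a writer such as the write() method shown below;
                        // LogStreamWriter is a hypothetical name for the enclosing class
                        new LogStreamWriter().write("logScope", "logStream", controllerURI,
                                "logRoutingKey", line);
                    }
                }
            }
        }
    }
}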

public void write(String scope, String streamName, URI controllerURI, String routingKey, String message) {
        StreamManager streamManager = StreamManager.create(controllerURI);
        final boolean scopeIsNew = streamManager.createScope(scope);

        StreamConfiguration streamConfig = StreamConfiguration.builder()
                .scalingPolicy(ScalingPolicy.fixed(1))
                .build();
        final boolean streamIsNew = streamManager.createStream(scope, streamName, streamConfig);

        try (ClientFactory clientFactory = ClientFactory.withScope(scope, controllerURI);
             EventStreamWriter<String> writer = clientFactory.createEventWriter(streamName,
                                                                                 new JavaSerializer<String>(),
                                                                                 EventWriterConfig.builder().build())) {
           
            System.out.format("Writing message: '%s' with routing-key: '%s' to stream '%s / %s'%n",
                    message, routingKey, scope, streamName);
            final CompletableFuture writeFuture = writer.writeEvent(routingKey, message);
        }
}

public void readAndTransform(String scope, String streamName, URI controllerURI) {
        StreamManager streamManager = StreamManager.create(controllerURI);
       
        final boolean scopeIsNew = streamManager.createScope(scope);
        StreamConfiguration streamConfig = StreamConfiguration.builder()
                .scalingPolicy(ScalingPolicy.fixed(1))
                .build();
        final boolean streamIsNew = streamManager.createStream(scope, streamName, streamConfig);

        final String readerGroup = UUID.randomUUID().toString().replace("-", "");
        final ReaderGroupConfig readerGroupConfig = ReaderGroupConfig.builder()
                .stream(Stream.of(scope, streamName))
                .build();
        try (ReaderGroupManager readerGroupManager = ReaderGroupManager.withScope(scope, controllerURI)) {
            readerGroupManager.createReaderGroup(readerGroup, readerGroupConfig);
        }

        try (ClientFactory clientFactory = ClientFactory.withScope(scope, controllerURI);
             EventStreamReader<String> reader = clientFactory.createReader("reader",
                                                                           readerGroup,
                                                                           new JavaSerializer<String>(),
                                                                           ReaderConfig.builder().build())) {
            System.out.format("Reading all the events from %s/%s%n", scope, streamName);
            EventRead<String> event = null;
            do {
                try {
                    event = reader.readNextEvent(READER_TIMEOUT_MS);
                    if (event.getEvent() != null) {
                        System.out.format("Read event '%s'%n", event.getEvent());
                        String message = transform(event.getEvent().toString());
                        write("srsScope", "srsStream", URI.create("tcp://127.0.0.1:9090"), "srsRoutingKey", message);
                    }
                } catch (ReinitializationRequiredException e) {
                    //There are certain circumstances where the reader needs to be reinitialized
                    e.printStackTrace();
                }
            } while (event.getEvent() != null);
            System.out.format("No more events from %s/%s%n", scope, streamName);
        }
    }
}

Sunday, September 22, 2019

Logging in Kubernetes illustrated:
Kubernetes is a container orchestration framework that hosts applications. The logs from the applications are saved locally on the container. Hosts are nodes of a cluster and can be scaled out to as many as required for the application workload. A forwarding agent forwards the logs from the container to a sink. A sink is a destination where all the forwarded logs can be redirected to a log service that can better handle their storage and analysis. The sink is usually external to the cluster so that it can scale independently, and it may be configured to receive logs via the well-known syslog protocol. A continuously running process forwards the logs from the container to the node before they drain to the sink via the forwarding agent. This process is specified with a syntax referred to as a DaemonSet, which enables the same process to be started on a set of nodes, usually all the nodes in the cluster.

The log destination may be internal to Kubernetes if the application providing the log service is hosted within the cluster.
The following are the configurations to set up the above logging mechanism:
1. Registering a sink controller:
apiVersion: apps.pivotal.io/v1beta1
kind: Sink
metadata:
  name: YOUR-SINK
  namespace: YOUR-NAMESPACE
spec:
  type: syslog
  host: YOUR-LOG-DESTINATION
  port: YOUR-LOG-DESTINATION-PORT
  enable_tls: true

2. Deploying a forwarding agent
apiVersion: v1
kind: ServiceAccount
metadata:
  name: fluentd
  namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRole
metadata:
  name: fluentd
  namespace: kube-system
rules:
- apiGroups:
  - ""
  resources:
  - pods
  - namespaces
  verbs:
  - get
  - list
  - watch
---
kind: ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1beta1
metadata:
  name: fluentd
roleRef:
  kind: ClusterRole
  name: fluentd
  apiGroup: rbac.authorization.k8s.io
subjects:
- kind: ServiceAccount
  name: fluentd
  namespace: kube-system
3. Deploying the Daemonset
apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  name: fluentd
  namespace: kube-system
  labels:
    k8s-app: fluentd-logging
    version: v1
    kubernetes.io/cluster-service: "true"
spec:
  template:
    metadata:
      labels:
        k8s-app: fluentd-logging
        version: v1
        kubernetes.io/cluster-service: "true"
    spec:
      serviceAccount: fluentd
      serviceAccountName: fluentd
      tolerations:
      - key: node-role.kubernetes.io/master
        effect: NoSchedule
      containers:
      - name: fluentd
        image: fluent/fluentd-kubernetes-daemonset:elasticsearch
        env:
          - name:  FLUENT_ELASTICSEARCH_HOST
            value: "sample.elasticsearch.com"
          - name:  FLUENT_ELASTICSEARCH_PORT
            value: "30216"
          - name: FLUENT_ELASTICSEARCH_SCHEME
            value: "https"
          - name: FLUENT_UID
            value: "0"
          # X-Pack Authentication
          # =====================
          - name: FLUENT_ELASTICSEARCH_USER
            value: "someusername"
          - name: FLUENT_ELASTICSEARCH_PASSWORD
            value: "somepassword"
        resources:
          limits:
            memory: 200Mi
          requests:
            cpu: 100m
            memory: 200Mi
        volumeMounts:
        - name: varlog
          mountPath: /var/log
        - name: varlibdockercontainers
          mountPath: /var/lib/docker/containers
          readOnly: true
      terminationGracePeriodSeconds: 30
      volumes:
      - name: varlog
        hostPath:
          path: /var/log
      - name: varlibdockercontainers
        hostPath:
          path: /var/lib/docker/containers