Sunday, March 15, 2020

We were discussing Minikube applications and the steps taken for allowing connectivity to the applications.
The technique to allow external access to an application hosted on Minikube is port-forwarding. If the application serves both http and https, then a pair of ports can be opened on the host to send traffic to and from the application.
On Windows we take extra precautions in handling network traffic. The default firewall settings may prevent access to these ports, so a set of inbound and outbound rules must be written to allow access to each new port on the host.
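As an illustration, such rules could be added from an elevated prompt with netsh; the port numbers 9880 and 9881 are placeholders for whatever ports the host actually opens:

```shell
# Allow inbound traffic on the forwarded ports (9880/9881 are assumed placeholders)
netsh advfirewall firewall add rule name="Minikube app inbound" dir=in action=allow protocol=TCP localport=9880,9881

# Allow outbound traffic to the same ports
netsh advfirewall firewall add rule name="Minikube app outbound" dir=out action=allow protocol=TCP remoteport=9880,9881
```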
Redirects continue to operate as before because all available endpoints will have port-forwarding. The web address itself does not need to be rewritten to include localhost and the host port as long as the application remains the same origin.
The other option aside from port-forwarding is to ask Minikube to expose the service. This provides another Kubernetes service with an IP address and port as its own URL, which can then be accessed from the host. There is no direct external network IP connectivity over the NAT without static IP addressing. That said, Minikube does provide an option for tunneling.
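As a sketch, exposing and reaching a service this way might look like the following; the deployment and namespace names are placeholders:

```shell
# Expose a deployment as a NodePort service (the service takes the deployment's name by default)
kubectl expose deployment <deploymentName> -n <namespace> --type=NodePort --port=80

# Ask minikube for a URL that is reachable from the host
minikube service <deploymentName> -n <namespace> --url
```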
The tunnel command creates a route to services deployed with type LoadBalancer and sets their ingress to their cluster IP.
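A minimal sketch of the tunnel workflow, assuming a service of type LoadBalancer whose name is a placeholder here:

```shell
# In one terminal: create the route (may prompt for elevated privileges)
minikube tunnel

# In another terminal: the EXTERNAL-IP column is now populated with the cluster IP
kubectl get service <serviceName> -n <namespace>
```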
We can also create an ingress resource with the nginx-ingress-controller. The resource is assigned an IP address that can be reached from the host. The /etc/hosts file on the host must map this IP address to the host name specified in the ingress resource.
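For instance, if the ingress resource declares the host myapp.local (a placeholder name), the mapping could be added like this; the address 192.168.99.100 is assumed:

```shell
# Find the address assigned to the ingress resource
kubectl get ingress -n <namespace>

# Map the ingress host name to that address, then reach the application by name
echo "192.168.99.100 myapp.local" | sudo tee -a /etc/hosts
curl http://myapp.local/
```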

Saturday, March 14, 2020

Minikube applications can be accessed outside the host via port-forwarding. The applications hosted on Minikube have an external cluster IP address, but that IP address is NAT'ed, which means it is on a private network where the address is translated from the external IP address.
The external and cluster IP addresses are two different layers of abstraction. The external address in this case is visible only to the host, since Minikube is attached to a host-only network adapter. It has outbound external connectivity but no inbound access except what the host permits. The IP address does not route automatically to the pods within Minikube.
The cluster IP address refers to one that has been marked cluster-wide and is reachable from anywhere within the Kubernetes cluster. It does not mean it is accessible over the NAT. It is different from the internal IP addresses used for the pods.
The layering therefore looks like the following:
 - Outside world
     -   Host (IP connectivity)
          - Minikube (Network Address Translation)
              - Cluster IP address ( Kubernetes )
                  - Pod IP address  ( Kubernetes )

Minikube provides a feature that enables transmission of data between the pod and the outside world: port-forwarding.
To forward traffic to a web application serving at port 80 in the pod, we can run the following command on the host:
> kubectl port-forward pod/<podName> -n <namespace> 9880:80
Forwarding from 127.0.0.1:9880 -> 80
This binds only to localhost. To also accept traffic arriving from outside the host, the listen address must be widened:
> kubectl port-forward --address 0.0.0.0 pod/<podName> -n <namespace> 9880:80
Forwarding from 0.0.0.0:9880 -> 80

It is important to recognize that the inbound and outbound rules must be specified separately for the same application. If the traffic involves both http and https, this results in a set of two rules for each kind of traffic - plain and encrypted.
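As a sketch, a single port-forward invocation can carry both kinds of traffic with one mapping per port; the local ports 9880 and 9881 are placeholders:

```shell
# Forward http and https together: local 9880 -> pod 80, local 9881 -> pod 443
kubectl port-forward --address 0.0.0.0 pod/<podName> -n <namespace> 9880:80 9881:443
```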

Friday, March 13, 2020

Kubernetes application install on windows
Kubernetes is a portable, extensible, open-source platform for managing containerized workloads and services. It is often a strategic decision for any company because it decouples the application from the hosts so that the same application can work elsewhere with minimal disruption to its use.
Windows is the most common operating system on personal workstations, while most large-scale Kubernetes deployments run on Linux virtual machines. The developer workstation is considered a small deployment.
The most convenient way to install Kubernetes on Windows for hosting any application is with the help of a software product called Minikube. This software provisions a dedicated Kubernetes cluster ideal for use in a small resource environment.
It simplifies storage with the help of a storage class, which is an abstraction over how data is persisted. It uses a storage provisioner called k8s.io/minikube-hostpath which, unlike other storage provisioners, does not require static configuration beforehand for hosted applications to be able to persist files. All requests for persisting files are honored dynamically as and when they are required. It stores the data from the applications on the host itself, unlike the nfs-client-provisioner which provisions on remote storage.
It simplifies networking with the help of dual network adapters that let the cluster provide connectivity with the host and with the outside world, which lets the application appear as if it were on a traditional deployment reachable on the internet. The network adapter shared with the host provides the ability to seamlessly port-forward services deployed to pods with an external cluster IP address.
Together, the storage and networking conveniences make application portability easy. Minikube also comes with its own docker runtime and kubectl toolset that make it easy to provide the software to run on the cluster.
These help with the convenience of hosting any application on Kubernetes over Windows in a resource-constrained environment.
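An illustrative install-and-verify sequence on Windows might look like the following, assuming Chocolatey as the installer and the Hyper-V driver (VirtualBox works as well):

```shell
# Install and start a single-node cluster
choco install minikube
minikube start --vm-driver=hyperv

# Verify the dynamic storage provisioner backing the default storage class;
# the default class "standard" uses the k8s.io/minikube-hostpath provisioner
kubectl get storageclass
```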

Thursday, March 12, 2020

1) The Flink programming model helps a lot with writing queries for streams. Several examples of this are available on their documentation page and as a sample here. The ability to combine a global index for the stream store with their programming model boosts the analytics that can be performed on the stream. An example to create this index and use it with storage is shown here.
2) The utility of the index is in its ability to look up based on keywords. The query engine for using the index exposes additional semantics and syntax to the analytics user beyond the Flink queries, or to be used with them. The logic to use the queries and Flink can then be packaged in a maven-published jar. Credentials to access the streams can be injected into the jar with the help of a resolver which utilizes the application context.
3) Some streams may be generated by running map or flatMap operations on an existing stream, and they might become useful later. Unlike the stream for the index, there could be a stream for the transformation of events. Such a transformation happens once and persists in a new stream. Indexes are rebuilt; transformations are one-time. Indexes are used for a while; transformations are temporary and persisted only when used with several queries. Indexes might even be stored better as files since they are rewritten, while transformed streams will be append-only. The Extract-Transform-Load operation to generate this stream could be packaged in a maven artifact that is easy to write in Flink. If the indexing automatically includes all streams in a project, then this transformed stream becomes automatically available to the user. If there is a way for the user to blacklist or whitelist streams for inclusion in the index, it gives more power to the user and prevents unnecessary indexing. All project members can have the privilege to add a stream to the indexing. If a stream is earmarked to be indexed, indexing may be kicked off by a project member or require the administrator to do so.
4) Overall, there is a comparison made between indexing across streams and transforming one or more streams into another stream. The Flink programmability model works well with transformations. Utilization of index-based querying adds more power to this analytics. Finally, the data for the transformed streams and indexes can be stored with a tier 2 that brings storage engineering best practice, allowing the businesses to focus more on the querying, transformations and indexing.


Wednesday, March 11, 2020

Streams and Tier2 Storage (continued):
Object storage is limitless in terms of capacity. Stream storage is limitless in continuous storage. The two can send data to each other in a mutually beneficial manner: stream processing can help with analytics while object storage can help with storage engineering best practice. They have their own advantages and can transmit data between themselves for their respective benefits.
A cache can help improve the performance of data in transit. The technique for providing stream segments does not matter to the client, so the cache can use any algorithm. The cache also alleviates load from the stream store without any additional constraints. In fact, the cache uses the same stream reader as the client; the only difference is that there will be fewer stream readers on the stream store than before.
We have not compared this cache layer with a message queue server, but there are interesting problems common to both. For example, we have a multiple-consumer, single-producer pattern in the periodic reads from the stream storage. The message queue server or broker enables this kind of publisher-subscriber pattern with retries and a dead letter queue. In addition, it journals the messages for review later. This leaves the interaction between the caches and the storage to be handled elegantly with a well-known messaging framework. The message broker inherently comes with a scheduler to perform repeated tasks across publishers. Hence it is easy for the message queue server to act as an orchestrator between the cache and the storage, leaving the cache to focus exclusively on the cache strategy suitable to the workloads. Journaling of messages also helps with diagnosis and replay, and there is probably no better store for these messages than the object storage itself. Since the broker operates in a cluster mode, it can scale to as many caches as available. Moreover, journaling is not necessarily available with all the messaging protocols, which counts as one of the advantages of using a message broker. Aside from the queues dedicated to handling the backup of objects from cache to storage, the message broker is also uniquely positioned to provide differentiated treatment to the queues. This introduction of quality-of-service levels expands the ability of the solution to meet varying and extreme workloads. The message queue server is not only a nice-to-have feature but also a necessity and a convenience when we have a distributed cache working with the stream storage.
The distance between stream stores and object stores that send data to each other does not matter since the relay is usually asynchronous. This enables cloud storage to work with on-premise storage and tier 2 overlays to send and receive data from remote storage. It facilitates global distribution of data while allowing applications to work with each and every store. The web accessibility of data in object stores also lets it be used as intermediary storage. 
The stream stores are helpful to analytics with powerful programmability features. Stream stores can be used on virtual machines and workstations for programmability while using object storage in the cloud. This makes it easy for development of applications.

Tuesday, March 10, 2020

Streams and tier 2 storage.
Streams are overlaid on tier 2 storage, which includes S3. Each stream does not have to map one-to-one with a file or a blob for user data isolation; this is entirely handled by the stream storage. Let us take a look at the following instead.
A stream may have an analytics jar. These jars can be published to a maven repository on a blob store.  
For example: 
publishing { 
    publications { 
        mavenJava(MavenPublication) { 
            from components.java 
        } 
    } 
    repositories { 
        maven { 
            url "s3://${repoBucketName}/releases" 
            credentials(AwsCredentials) { 
                accessKey awsCredentials.AWSAccessKeyId 
                secretKey awsCredentials.AWSSecretKey 
            } 
        } 
    } 
} 
Artifactory is already familiar to maven publishers in the form of jcenter(). This approach just generalizes that to S3 storage, whether on-premises or in the cloud.
Taking this one step further, publishers that can package user data into a tarball or extract a stream to a blob are similarly useful.
When a publisher packages a stream, it can ask the stream store to send all the segments of the stream over http. A user script that makes this curl call can redirect the output to a file. The file is then included in another S3 call to upload it as a blob.
Such a publisher can combine more than one stream in an archive or package metadata with it. This publisher knows what is relevant to the user in order to pack her data into an archive. Another publisher could make this conversion of user data to blob without the user needing to know about any intermediary file.
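A sketch of such a user script follows; the stream store endpoint path, the scope and stream names, and the bucket are all placeholders rather than an actual stream store API:

```shell
# Ask the stream store to send all segments of the stream over http and save to a file
curl -s "http://<streamStoreHost>/v1/scopes/<scope>/streams/<stream>/segments" -o export.dat

# Package the export (optionally with metadata) and upload it as a blob to S3
tar -czf export.tar.gz export.dat
aws s3 cp export.tar.gz "s3://<bucketName>/exports/export.tar.gz"
```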

Sunday, March 8, 2020

We were discussing the http proxy. A gateway and http proxy service can be used with stream storage for site-specific stream stores. The gateway also acts as an http proxy. Any implementation of the gateway has to maintain a registry of destination addresses. As streams proliferate with their geo-replications, this registry becomes global while enabling rules to determine the site from which they need to be accessed. Finally, the gateway gathers statistics in terms of access and metrics, which are very useful for understanding the http accesses of specific sites for stream storage.
Both of the above functionalities can be elaborate, allowing the gateway service to provide immense benefit per deployment.
The advantages of an http proxy include aggregation of usages. There can be a detailed count of calls in terms of success and failure. Moreover, the proxy could include all the features of a conventional http service like Mashery, such as client-based caller information, destination-based statistics, per-object statistics, categorization by cause, and many other features, along with a RESTful api service for the features gathered.
Since a stream can be overlaid on object storage, and objects have their own web addresses, the gateway can also resolve site-specific web addresses at object-level granularity. This enhances the purpose of the gateway and relieves the administrator from searching and looking up the site for a specific object.