Monday, March 16, 2020

External key management for stream store analytics:


Stream stores are overlaid on Tier 2 storage, where the assumption is that the latter takes care of securing data at rest. Tier 2 storage such as object storage has always supported Data at Rest Encryption (D@RE) by maintaining a set of encryption keys in the system. These include Data Encryption Keys (DEKs) and Key Encryption Keys (KEKs). Certain object storage even supports external key management (EKM) by integrating with Gemalto SafeNet KeySecure servers, following industry best practice. With external keys, the risk from a compromise of any single instance of an application is reduced. Keys are rotated periodically, and this integration helps with re-encrypting the storage artifacts. Products that combine analytics with stream stores have at least two levels of data transfer: one between the analytical application and the stream store, and another between the stream store and Tier 2, which may be either an NFS file system or a blob store. These transfers can also occur side by side if the product allows storage independent of streams, or with a virtualizer that involves a storage class provisioner, or with an abstraction that syncs between hybrid stores. In all these cases, data is replicated, often without protection. When the product supports using the same key to secure all parts of the data and their copies, along with the ability to rotate the keys, an external key manager becomes useful to safeguard the keys, both old and new.

Data is organized in containers and a hierarchy specific to the store, and encryption can be applied at each hierarchical level. All the data sits at the lowest level, and each container has its own DEK, while the higher-level containers have their own KEKs. A master KEK is available for the overall store. When the stores are multiple and hybrid, their masters differ, but a master can be treated as just another intermediary level as long as the stores are registered at an abstraction layer.
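The DEK/KEK arrangement can be illustrated with a minimal sketch, assuming OpenSSL is available and using placeholder file names; this is envelope encryption in spirit, not a production key-wrapping scheme. The point to note is that rotating a KEK re-wraps only the small key file while the bulk data stays untouched:

> openssl rand -out dek.bin 32        # per-container data encryption key (DEK)
> openssl rand -out kek.bin 32        # key encryption key (KEK), ideally held in the external key manager
> openssl enc -aes-256-cbc -pbkdf2 -in segment.dat -out segment.enc -pass file:dek.bin        # encrypt the data with the DEK
> openssl enc -aes-256-cbc -pbkdf2 -in dek.bin -out dek.wrapped -pass file:kek.bin        # wrap the DEK with the KEK
> openssl rand -out kek-new.bin 32        # rotation begins with a new KEK
> openssl enc -d -aes-256-cbc -pbkdf2 -in dek.wrapped -pass file:kek.bin | openssl enc -aes-256-cbc -pbkdf2 -out dek.rewrapped -pass file:kek-new.bin        # re-wrap the DEK; segment.enc is never re-encrypted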


Sunday, March 15, 2020

We were discussing Minikube applications and the steps taken to allow connectivity to them.
The technique for allowing external access to an application hosted on Minikube is port-forwarding. If the application is served over both http and https, a set of ports can be opened on the host to send traffic to and from the application.
On Windows we take extra precautions in handling network traffic. The default firewall settings may prevent access to these ports, so a new set of inbound and outbound rules must be written to allow access to each new port on the host.
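For example, assuming host port 9880 was chosen for the forward, a hypothetical pair of rules can be added from an elevated command prompt:

> netsh advfirewall firewall add rule name="minikube-9880-in" dir=in action=allow protocol=TCP localport=9880
> netsh advfirewall firewall add rule name="minikube-9880-out" dir=out action=allow protocol=TCP localport=9880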
Redirects continue to operate as before because all available endpoints will have port-forwarding. The web address itself does not need to be rewritten to include localhost and the host port as long as the application remains the same point of origin.
The other option, aside from port-forwarding, is to ask Minikube to expose the service. This provides another Kubernetes service whose IP address and port form its own URL, which can then be accessed from the host. There is no direct external network IP connectivity over the NAT without using static IP addressing. That said, Minikube does provide an option for tunneling.
Tunnel creates a route to services deployed with type LoadBalancer and sets their ingress address to their cluster IP.
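As a sketch, assuming a hypothetical service named myapp deployed to the cluster:

> minikube service myapp --url        # prints a URL such as http://192.168.99.100:30080 that is reachable from the host
> minikube tunnel        # runs in the foreground and routes LoadBalancer services to the host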
We can also create an ingress resource with the nginx-ingress-controller. The resource mentions an IP address that can be reached from the host. The /etc/hosts file on the host must map this IP address to the host name specified in the ingress resource.
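A minimal sketch, assuming the ingress addon is enabled with minikube addons enable ingress and a hypothetical service myapp serving on port 80:

> kubectl apply -f - <<EOF
apiVersion: networking.k8s.io/v1beta1
kind: Ingress
metadata:
  name: myapp-ingress
spec:
  rules:
  - host: myapp.local
    http:
      paths:
      - path: /
        backend:
          serviceName: myapp
          servicePort: 80
EOF

An entry such as "192.168.99.100 myapp.local" (the address reported by minikube ip) then goes into the /etc/hosts file on the host.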

Saturday, March 14, 2020

Minikube applications can be accessed from outside the host via port-forwarding. The applications hosted on Minikube have an external cluster IP address, but the IP address is NAT'ed, which means it is on a private network where the address is translated from the external IP address.
The external and cluster IP addresses are two different layers of abstraction. The external address in this case is visible only to the host, since Minikube is hosted with a host-only network adapter. It has outbound external connectivity but no inbound access except what the host permits. The IP address does not route automatically to the pods within Minikube.
The cluster IP address refers to one that has been marked cluster-wide and is accessible beyond the individual pods in the Kubernetes cluster. It does not mean it is accessible over the NAT. It is different from the internal IP addresses used for the pods.
The layering therefore looks like the following:
 - Outside world
     -   Host (IP connectivity)
          - Minikube (Network Address Translation)
              - Cluster IP address ( Kubernetes )
                  - Pod IP address  ( Kubernetes )

Minikube provides two variants of a feature that enables transmission of data between a pod and the outside world.
This is called port-forwarding.
 To transmit the data to a web application serving at port 80, we can run the following commands on the host:
> kubectl port-forward pod/<podName> -n <namespace> 9880:80  to make the application reachable from the host only
Forwarding from 127.0.0.1:9880 -> 80
and
> kubectl port-forward --address 0.0.0.0 pod/<podName> -n <namespace> 9880:9000  to also accept traffic arriving from outside the host
Forwarding from 0.0.0.0:9880 -> 9000
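Once a forward is in place, it can be verified from the host, for example:
> curl http://localhost:9880/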

It is important to recognize that the inbound and outbound rules must be specified separately for the same application. If the traffic involves both http and https, this results in a set of two rules for each kind of traffic - plain and encrypted.

Friday, March 13, 2020

Kubernetes application install on Windows
Kubernetes is a portable, extensible, open-source platform for managing containerized workloads and services. It is often a strategic decision for any company because it decouples the application from the hosts so that the same application can work elsewhere with minimal disruption to its use.
Windows is the most common operating system on personal workstations, while most large-scale Kubernetes deployments are on Linux virtual machines. The developer workstation is considered a small deployment.
The most convenient way to install Kubernetes on Windows for hosting any application is with the help of a software product called Minikube. This software provisions a dedicated Kubernetes cluster ideal for use in a small resource environment.
It simplifies storage with the help of a storage class, which is an abstraction over how data is persisted. It uses a storage provisioner called k8s.io/minikube-hostpath which, unlike other storage provisioners, does not require static configuration beforehand for hosted applications to be able to persist files. All requests for persisting files are honored dynamically as and when they are made. It stores the data from the applications on the host itself, unlike the nfs-client-provisioner which provisions on remote storage.
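As a sketch, a claim against Minikube's default storage class (named standard and backed by this provisioner) is provisioned at the time it is made; the claim name here is hypothetical:

> kubectl apply -f - <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: demo-claim
spec:
  storageClassName: standard
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
EOF

The bound volume is backed by a directory on the Minikube host itself.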
It simplifies networking with the help of dual network adapters: one that lets the cluster connect with the host and another for the outside world, which lets the application appear as if it is on a traditional deployment reachable on the internet. The network adapter shared with the host provides the ability to seamlessly port-forward to services deployed on pods with an external cluster IP address.
Together, the storage and networking conveniences make application portability easy. Minikube also comes with its own Docker runtime and kubectl toolset that make it easy to provision the software to run on the cluster.
All of this makes it convenient to host any application on Kubernetes over Windows in a resource-constrained environment.

Thursday, March 12, 2020

1) The Flink programming model helps a lot with writing queries for streams. Several examples of this are available on their documentation page and as a sample here. The ability to combine a global index for the stream store with their programming model boosts the analytics that can be performed on the stream. An example to create this index and use it with storage is shown here.
2) The utility of the index is in its ability to look up based on keywords. The query engine for using the index exposes additional semantics and syntax to the analytics user, over and above the Flink queries or for use with them. The logic to use these queries and Flink can then be packaged in a maven-published jar (a packaging sketch follows this list). Credentials to access the streams can be injected into the jar with the help of a resolver that utilizes the application context.
3) Some streams may be generated by running map or flatMap operations on an existing stream, and they might come in useful later. Unlike the stream for an index, there could be a stream for a transformation of events. Such a transformation happens once and persists in a new stream. Indexes are rebuilt and used for a while; transformations are one-time and persisted only when used with several queries. Indexes might even be stored better as files since they are rewritten, whereas transformed streams will be append-only. The Extract-Transform-Load operation to generate such a stream could be packaged in a maven artifact that is easy to write in Flink. If the indexing automatically includes all streams in a project, then this transformed stream becomes automatically available to the user. If there is a way for the user to blacklist or whitelist streams for inclusion in the index, it gives more power to the user and prevents unnecessary indexing. All project members can have the privilege to add a stream to the indexing. If the stream is earmarked to be indexed, indexing may be kicked off by a project member or may require the administrator to do so.
4) Overall, a comparison is made between indexing across streams and transforming one or more streams into another stream. The Flink programmability model works well with transformations, and index-based querying adds more power to the analytics. Finally, the data for the transformed streams and indexes can be stored on a Tier 2 that brings storage engineering best practice, allowing businesses to focus more on the querying, transformations and indexing.
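As a sketch of the packaging mentioned in point 2, and assuming a hypothetical entry class and artifact name, the jar can be built and submitted to a Flink cluster as follows:

> mvn clean package
> flink run -c com.example.StreamIndexQuery target/stream-analytics-1.0.jar --scope myProject --stream myStream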


Wednesday, March 11, 2020

Streams and Tier2 Storage (continued):
Object storage is limitless in terms of capacity. Stream storage is limitless in terms of continuous data. The two can send data to each other in a mutually beneficial manner: stream processing helps with analytics while object storage brings storage engineering best practice. They have their own advantages and can transmit data between themselves for their respective benefits.
A cache can help improve the performance of data in transit. The techniques for providing stream segments do not matter to the client, and the cache can use any algorithm. The cache also alleviates load from the stream store without any additional constraints. In fact, the cache will use the same stream reader as the client, the only difference being that there will be fewer stream readers on the stream store than before.
We have not compared this cache layer with a message queue server, but there are interesting problems common to both. For example, we have a multiple-consumer, single-producer pattern in the periodic reads from the stream storage. A message queue server or broker enables this kind of publisher-subscriber pattern with retries and a dead letter queue. In addition, it journals the messages for review later. This leaves the interaction between the caches and the storage to be handled elegantly with a well-known messaging framework. The message broker inherently comes with a scheduler to perform repeated tasks across publishers. Hence it is easy for the message queue server to act as an orchestrator between the cache and the storage, leaving the cache to focus exclusively on the cache strategy suitable to the workloads. Journaling of messages also helps with diagnosis and replay, and there is probably no better store for these messages than the object storage itself. Since the broker operates in a cluster mode, it can scale to as many caches as available. Moreover, journaling is not necessarily available with all messaging protocols, which counts as one of the advantages of using a message broker. Aside from the queues dedicated to handling the backup of objects from cache to storage, the message broker is also uniquely positioned to provide differentiated treatment to the queues. This introduction of quality-of-service levels expands the ability of the solution to meet varying and extreme workloads. The message queue server is thus not only a nice-to-have feature but also a necessity and a convenience when we have a distributed cache working with the stream storage.
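As a sketch, assuming RabbitMQ as the broker and hypothetical queue names, the queue that backs up objects from cache to storage can be declared durable with a dead letter queue for failed transfers:

> rabbitmqadmin declare queue name=cache-backup-dlq durable=true
> rabbitmqadmin declare queue name=cache-backup durable=true arguments='{"x-dead-letter-exchange":"","x-dead-letter-routing-key":"cache-backup-dlq"}'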
The distance between stream stores and object stores that send data to each other does not matter, since the relay is usually asynchronous. This enables cloud storage to work with on-premises storage and allows Tier 2 overlays to send and receive data from remote storage. It facilitates the global distribution of data while allowing applications to work with each and every store. The web accessibility of data in object stores also lets them be used as intermediary storage.
Stream stores are helpful to analytics with their powerful programmability features. They can be used on virtual machines and workstations for programmability while object storage is used in the cloud. This makes it easy to develop applications.

Tuesday, March 10, 2020

Streams and Tier 2 storage:
Streams are overlaid on Tier 2 storage, which includes S3. Each stream does not have to map one-to-one with a file or a blob for user data isolation; that is entirely handled by the stream storage. Let us take a look at the following instead.
A stream may have an analytics jar. These jars can be published to a maven repository on a blob store.  
For example: 
publishing {
    publications {
        // publish the jar built from the java component as a maven artifact
        mavenJava(MavenPublication) {
            from components.java
        }
    }
    repositories {
        maven {
            // an S3 bucket serves as the maven repository for releases
            url "s3://${repoBucketName}/releases"
            // AWS credentials resolved from gradle properties or the environment
            credentials(AwsCredentials) {
                accessKey awsCredentials.AWSAccessKeyId
                secretKey awsCredentials.AWSSecretKey
            }
        }
    }
}
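With the maven-publish plugin applied, running the publish task then pushes the artifact to the bucket:
> gradle publish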
Hosted repositories such as Artifactory and jcenter() are already familiar to all maven publishers. This approach just generalizes that to S3 storage, whether it is on-premises or in the cloud.
Taking this one step further, publishers that can package user data into a tarball or extract a stream to a blob are all similarly useful.
When a publisher packages a stream, it can ask the stream store to send all the segments of the stream over HTTP. A user script that makes this curl call can redirect the output to a file. The file is then included in another S3 call to upload it as a blob.
Such a publisher can combine more than one stream in an archive or package metadata with it. This publisher knows what is relevant to the user so as to pack her data into an archive. Another publisher could convert user data to a blob without the user needing to know about any intermediary file.
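A minimal sketch of such a user script, assuming a hypothetical stream store HTTP endpoint and bucket name:

> curl -s "http://streamstore.example.com/scopes/myProject/streams/myStream/segments" -o myStream.dat        # ask the stream store for all the segments over http (hypothetical endpoint)
> tar -czf myStream.tar.gz myStream.dat metadata.json        # optionally combine metadata or more streams into the archive
> aws s3 cp myStream.tar.gz s3://myBucket/archives/myStream.tar.gz        # upload the archive as a blob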