Thursday, April 23, 2020

We discuss a data export tool for Kubernetes:
Data Export Tool:
When applications are hosted on Kubernetes, they often choose to persist their state on persistent volumes. The data stored on these volumes survives application restarts. The storage class that provisions these persistent volumes is external to the pods and containers on which the application runs. When the tier-2 storage is NFS, the persistent volumes appear as mounted file systems and are usable with all standard shell tools, including those for backup and export such as duplicity. The backups usually exist alongside the source, on another persistent volume, which can then be exposed to users via curl requests. Therefore, there is a two-part separation: one part performs an extract-transform-load between a source and a destination, and the other relays the prepared data to the customer.
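As a rough sketch of the first part, an export pod that mounts both volumes could simply shell out to duplicity, since an NFS-backed persistent volume looks like any other mounted file system. The mount paths below are hypothetical, and encryption is assumed to be handled elsewhere:

```go
// Minimal sketch: run duplicity from the source persistent volume to the
// backup persistent volume, both mounted inside the pod (paths assumed).
package main

import (
	"log"
	"os/exec"
)

func main() {
	// duplicity performs an incremental backup of the source directory into
	// a file:// target, which works on any mounted filesystem such as an
	// NFS-backed persistent volume.
	cmd := exec.Command("duplicity",
		"--no-encryption",           // assumption: encryption handled elsewhere
		"/mnt/source",               // source persistent volume mount (hypothetical)
		"file:///mnt/backup/export") // backup persistent volume mount (hypothetical)
	out, err := cmd.CombinedOutput()
	if err != nil {
		log.Fatalf("duplicity failed: %v\n%s", err, out)
	}
	log.Printf("backup complete:\n%s", out)
}
```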
Both parts can involve arbitrary amounts of data and prolonged processing. In the Kubernetes world, with the arbitrary lifetimes of pods and containers, this kind of processing becomes prone to failures. It is this special consideration that sets the application logic apart from traditional data export techniques. The ETL itself may be written in Java, but a Kubernetes Job will need to be specified in the operator code base so that jobs can be launched on user demand and survive all the interruptions and movements possible in the control plane of Kubernetes.
Kubernetes Jobs run to completion. A Job creates one or more pods, and as the pods complete, the Job tracks the completions. The Job has ownership of the pods, so the pods are cleaned up when the Job is deleted. The Job spec describes the job and usually requires the pod template, apiVersion, kind, and metadata fields; the selector field is optional. Jobs may be sequential, parallel with a fixed completion count, or parallel as in a work queue, all of which are suitable for a multi-part export of data, as the sketch below shows.
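As a sketch of what this could look like in the operator code base, assuming client-go and a hypothetical export image, the operator might create a parallel Job with a fixed completion count on user demand:

```go
// Minimal sketch: an in-cluster operator launching a multi-part export as a
// Kubernetes Job. Image name, namespace, and counts are hypothetical.
package main

import (
	"context"
	"log"

	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func int32Ptr(i int32) *int32 { return &i }

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	clientset := kubernetes.NewForConfigOrDie(cfg)

	job := &batchv1.Job{
		ObjectMeta: metav1.ObjectMeta{Name: "data-export", Namespace: "default"},
		Spec: batchv1.JobSpec{
			// Parallel job with a fixed completion count: four parts of the
			// export, at most two pods running at a time.
			Completions: int32Ptr(4),
			Parallelism: int32Ptr(2),
			Template: corev1.PodTemplateSpec{
				Spec: corev1.PodSpec{
					RestartPolicy: corev1.RestartPolicyNever,
					Containers: []corev1.Container{{
						Name:  "export",
						Image: "example.com/export-etl:latest", // hypothetical ETL image
					}},
				},
			},
		},
	}

	// The Job, not the operator pod, now owns the export pods; Kubernetes
	// replaces failed pods and tracks completions until all four finish.
	if _, err := clientset.BatchV1().Jobs("default").Create(
		context.TODO(), job, metav1.CreateOptions{}); err != nil {
		log.Fatal(err)
	}
}
```

Because the Job survives operator restarts and pod evictions, the export makes progress through exactly the interruptions and movements described above.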
Data export from the Kubernetes data plane can thus be made on-demand and associated with a corresponding Kubernetes resource, custom or standard, for visibility in the control plane.
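One way to picture that association, purely as a sketch: a hypothetical DataExport custom resource in the controller-runtime style, where every type and field name here is an assumption rather than anything defined in this post:

```go
// Sketch of a hypothetical DataExport custom resource type; the operator
// would reconcile each instance into a Job like the one above.
package v1alpha1

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// DataExportSpec captures the user's on-demand request in the control plane.
type DataExportSpec struct {
	SourcePVC   string `json:"sourcePVC"`   // persistent volume claim to export
	Destination string `json:"destination"` // backup volume path or S3 URL
}

// DataExportStatus makes the progress of the export visible via kubectl.
type DataExportStatus struct {
	Phase string `json:"phase,omitempty"` // e.g. Pending, Running, Completed
}

// DataExport ties each export to a first-class Kubernetes resource, so the
// request, its parameters, and its progress all live in the control plane.
type DataExport struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec   DataExportSpec   `json:"spec,omitempty"`
	Status DataExportStatus `json:"status,omitempty"`
}
```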
An alternative technique to this solution is to enable a multipart download REST API that exposes the filesystem or S3 storage directly. This pattern keeps the data transfer out of the Kubernetes control plane; the API is exposed only internally and is then used from the user interface.
The benefit of this technique is that the actions are tied to the user interface's authentication and everything happens on demand. The trade-off is that the user interface has to relay the API call to another pod, and it does not work well for long downloads without interruptions.
Regardless of how the data to be streamed behind an API call is prepared, it is better not to require relays in the data transfer. The API call is useful for making the request for the prepared data on demand, and the implementation can scale to as many requests as necessary.
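To make the direct-exposure variant concrete, here is a minimal sketch, not a definitive implementation, of a Go handler that serves a prepared export file from a mounted volume at a hypothetical path. http.ServeFile honors Range headers, so an interrupted long download can resume rather than restart, which softens the trade-off noted above:

```go
// Minimal sketch: stream a prepared export directly from a mounted volume,
// on demand and without relaying the bytes through another pod.
package main

import (
	"log"
	"net/http"
)

func main() {
	http.HandleFunc("/export", func(w http.ResponseWriter, r *http.Request) {
		// Authentication/authorization checks would go here, tying the
		// download to the same identity the user interface established.
		// ServeFile supports Range requests, so clients can resume.
		http.ServeFile(w, r, "/mnt/backup/export.tar.gz") // hypothetical path
	})
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```

A client can then resume a broken transfer with a ranged request, for example curl -C - against such an endpoint, so long downloads no longer depend on an uninterrupted connection.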
