Wednesday, May 13, 2020

Data Import tool

We talked about a data export tool as follows:


When applications are hosted on Kubernetes, they often choose to persist their state using persistent volumes. The data stored on these volumes remains available across application restarts. The storage class that provisions these persistent volumes is external to the pods and containers in which the application runs. When the tier 2 storage is NFS, the persistent volumes appear as a mounted file system and are usable with all standard shell tools, including those for backup and export such as duplicity. The backups usually exist together with the source, on another persistent volume, which can then be exposed to users via curl requests. Therefore, there is a two-part separation: one part involves an extract-transform-load between a source and a destination, and the other relays the prepared data to the customer.
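As a rough illustration of the first part, the backup can be driven from inside a container that has the persistent volumes mounted. The sketch below is only one possibility: it assumes the duplicity binary is available in the container image and that the source and backup volumes are mounted at the hypothetical paths /data and /backup, and it simply invokes duplicity as an external process.

import java.io.IOException;

public class VolumeBackup {
    public static void main(String[] args) throws IOException, InterruptedException {
        // Assumes the source PV is mounted at /data and the backup PV at /backup,
        // and that duplicity is installed in the container image.
        Process backup = new ProcessBuilder("duplicity", "/data", "file:///backup")
                .inheritIO()   // surface duplicity's progress in the pod logs
                .start();
        int exitCode = backup.waitFor();
        if (exitCode != 0) {
            throw new IOException("duplicity exited with code " + exitCode);
        }
    }
}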

Both parts can involve arbitrary amounts of data and prolonged processing. In the Kubernetes world, where pods and containers have arbitrary lifetimes, this kind of processing becomes prone to failures. It is this special consideration that sets the application logic apart from traditional data export techniques. The ETL may be written in Java, but a Kubernetes Job will need to be specified in the operator code base so that the jobs can be launched on user demand and survive all the interruptions and movements possible in the control plane of Kubernetes.
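As a sketch of how the operator code base might launch such a Job, the snippet below assumes a recent version of the fabric8 kubernetes-client (client-go would be the equivalent in a Go operator). The job name, namespace, image and arguments are placeholders.

import io.fabric8.kubernetes.api.model.batch.v1.Job;
import io.fabric8.kubernetes.api.model.batch.v1.JobBuilder;
import io.fabric8.kubernetes.client.KubernetesClient;
import io.fabric8.kubernetes.client.KubernetesClientBuilder;

public class ExportJobLauncher {
    public static void main(String[] args) {
        try (KubernetesClient client = new KubernetesClientBuilder().build()) {
            // A Job that runs the export ETL to completion; names and image are hypothetical.
            Job exportJob = new JobBuilder()
                .withNewMetadata()
                    .withName("data-export-job")
                    .withNamespace("default")
                .endMetadata()
                .withNewSpec()
                    .withBackoffLimit(4)                 // retry on pod or node failures
                    .withNewTemplate()
                        .withNewSpec()
                            .addNewContainer()
                                .withName("export-etl")
                                .withImage("example/export-etl:latest")
                                .withArgs("--source", "/data", "--target", "s3://example-bucket/export")
                            .endContainer()
                            .withRestartPolicy("Never")  // the Job controller handles retries
                        .endSpec()
                    .endTemplate()
                .endSpec()
                .build();

            client.batch().v1().jobs().inNamespace("default").resource(exportJob).create();
        }
    }
}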

Kubernetes Jobs run to completion. A Job creates one or more pods and, as the pods complete, the Job tracks the completions. The Job has ownership of its pods, so the pods are cleaned up when the Job is deleted. The Job spec describes the job and usually requires the pod template along with the apiVersion, kind and metadata fields; the selector field is optional. Jobs may be sequential, parallel with a fixed completion count, or parallel as in a work queue, all of which are suitable for a multi-part export of data.
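A parallel export with a fixed completion count maps onto the completions and parallelism fields of the Job spec (withCompletions and withParallelism in the builder above), and the operator can track progress by polling the Job status. The sketch below, again assuming the fabric8 client and the hypothetical job name used earlier, waits until the succeeded count reaches the requested completions.

import io.fabric8.kubernetes.api.model.batch.v1.Job;
import io.fabric8.kubernetes.client.KubernetesClient;
import io.fabric8.kubernetes.client.KubernetesClientBuilder;

public class ExportJobMonitor {
    public static void main(String[] args) throws InterruptedException {
        try (KubernetesClient client = new KubernetesClientBuilder().build()) {
            // Poll the Job status until the expected number of completions is reached.
            while (true) {
                Job job = client.batch().v1().jobs()
                        .inNamespace("default")
                        .withName("data-export-job")
                        .get();
                Integer succeeded = job.getStatus() == null ? null : job.getStatus().getSucceeded();
                Integer completions = job.getSpec().getCompletions();
                System.out.printf("completed %s of %s parts%n",
                        succeeded == null ? 0 : succeeded,
                        completions == null ? 1 : completions);
                if (succeeded != null && completions != null && succeeded >= completions) {
                    break;
                }
                Thread.sleep(5000);
            }
        }
    }
}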

Data export from the Kubernetes data plane can be made on demand and associated with a corresponding Kubernetes resource, custom or standard, for visibility in the control plane.

An alternative to this solution is to enable a multipart download REST API that exposes the filesystem or the S3 storage directly. This pattern keeps the data transfer out of the Kubernetes control plane and exposes it only internally, where it is then consumed from the user interface.
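A minimal sketch of such an internal endpoint, using only the JDK's built-in HTTP server, is shown below. The path of the prepared artifact is hypothetical, and a production version would also honor Range headers so that large downloads can be fetched and resumed in parts.

import com.sun.net.httpserver.HttpServer;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.file.Files;
import java.nio.file.Path;

public class InternalDownloadApi {
    public static void main(String[] args) throws Exception {
        // Internal-only HTTP endpoint that streams a prepared export file from the mounted volume.
        HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);
        server.createContext("/export", exchange -> {
            Path file = Path.of("/backup/export.tar.gz");   // hypothetical prepared artifact
            exchange.getResponseHeaders().add("Content-Type", "application/octet-stream");
            exchange.sendResponseHeaders(200, Files.size(file));
            try (OutputStream out = exchange.getResponseBody()) {
                Files.copy(file, out);                       // stream the file, no buffering in memory
            }
        });
        server.start();
    }
}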

The benefit of this technique is that the actions are tied to the user-interface-based authentication and all actions are on demand. The trade-off is that the user interface has to relay the API call to another pod, and it does not work well for long downloads without interruptions.

Regardless of how the data to be streamed to the client behind an API call is prepared, it is better not to require relays in the data transfer. The API call is useful for making the request for the prepared data on demand, and the implementation can scale to as many requests as necessary.
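One way to avoid the relay altogether, when the prepared data lands in S3-compatible storage, is to have the API call return a short-lived presigned URL so that the client downloads directly from the object store. The sketch below uses the AWS SDK for Java v1; the bucket and key names are placeholders, and this is only one possible approach.

import com.amazonaws.HttpMethod;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.GeneratePresignedUrlRequest;
import java.net.URL;
import java.util.Date;

public class PresignedExportLink {
    public static void main(String[] args) {
        AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
        // The API call only prepares a short-lived link; the client then downloads
        // directly from S3, so no pod has to relay the bytes.
        Date expiration = new Date(System.currentTimeMillis() + 15 * 60 * 1000);
        GeneratePresignedUrlRequest request =
                new GeneratePresignedUrlRequest("example-export-bucket", "exports/export.tar.gz")
                        .withMethod(HttpMethod.GET)
                        .withExpiration(expiration);
        URL url = s3.generatePresignedUrl(request);
        System.out.println("Download link: " + url);
    }
}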


This can also be a data import tool

The tool can work in both an export mode and an import mode. The export mode sends data from a source stream to the target S3 bucket; the import mode reverses the flow, copying data from the S3 bucket back into the local stream. The export is performed by readers that read from the stream, and the import by writers that write to the stream. Both readers and writers can be paused and resumed, which they do with the help of the stream store's functionality. The S3 store is already web-accessible, so the requests and responses are granular and the uploads can be multipart.

A writer may be created for each transfer operation, with the ability to perform its operation over a long time. This kind of action is independent and isolated for both readers and writers: there can be many readers on the same stream without affecting one another, and each writer writes to a stream reserved for it. Since each event is sequenced, the last position is always known, which helps report progress and time remaining. The size of an event is finite; when the data exceeds an event, it spills over into another event. The size of the object and the size of the event do not have to match, since both can accept spillover into another object or event.
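A minimal sketch of the export path follows. The StreamReader interface is hypothetical and stands in for the stream store client; the upload uses the AWS SDK for Java v1 multipart upload API. Note that S3 requires every part except the last to be at least 5 MB, so a real writer would buffer several events into each part rather than mapping one event to one part as this sketch does for brevity.

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.*;
import java.io.ByteArrayInputStream;
import java.util.ArrayList;
import java.util.List;

public class StreamToS3Exporter {

    /** Hypothetical view of the stream store: events are read in sequence from a known position. */
    interface StreamReader {
        byte[] readNextEvent();     // null when the stream is exhausted
        long position();            // last sequenced position, used for pause/resume and progress
    }

    static void export(StreamReader reader, String bucket, String key) {
        AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
        InitiateMultipartUploadResult init =
                s3.initiateMultipartUpload(new InitiateMultipartUploadRequest(bucket, key));
        List<PartETag> etags = new ArrayList<>();
        int partNumber = 1;
        byte[] event;
        while ((event = reader.readNextEvent()) != null) {
            // One event becomes one part in this sketch; see the buffering note above.
            UploadPartRequest part = new UploadPartRequest()
                    .withBucketName(bucket)
                    .withKey(key)
                    .withUploadId(init.getUploadId())
                    .withPartNumber(partNumber++)
                    .withInputStream(new ByteArrayInputStream(event))
                    .withPartSize(event.length);
            etags.add(s3.uploadPart(part).getPartETag());
            System.out.println("exported up to position " + reader.position());
        }
        s3.completeMultipartUpload(
                new CompleteMultipartUploadRequest(bucket, key, init.getUploadId(), etags));
    }
}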

 
