Friday, May 15, 2020

data export and import tool


Let us change the topic for today and resume another thread of discussion on the data export and import tool.


The tool can work in both export mode and import mode. The export mode sends data from a source stream to the target S3 bucket. The import mode brings the data back from the S3 bucket to the local stream. The export is done by readers that read from the stream; the import is done by writers that write to the stream. Both readers and writers can be paused and resumed, which they do with the help of the stream store's functionality. The S3 store is already web-accessible, so requests and responses are granular and uploads can be multipart.

A writer may be created one per transfer operation, with the ability to carry out its operation over a long period. This kind of activity is independent and isolated for both readers and writers: there can be many readers on the same stream without affecting each other, and each writer writes to a stream reserved for it. Since each event is sequenced, the last position is always known, which helps report progress and estimate time remaining. The size of an event is finite; when the data exceeds an event, it can be written into another event. The size of the object and the size of the event do not have to match, and both can accept spillover into another object or event.
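A minimal sketch of the export path follows, assuming a hypothetical EventReader interface for the stream store and the AWS SDK for Java v2 multipart-upload calls; the checkpointed last position is what makes pause and resume possible, and the part/event boundaries illustrate the spillover described above.

```java
import software.amazon.awssdk.core.sync.RequestBody;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.*;

import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stream-store reader: returns the next event's bytes, or null at the end.
interface EventReader {
    byte[] readNextEvent();
    long position();             // sequence position of the last event read
    void seek(long position);    // resume from a previously saved position
}

public class Exporter {
    private static final int PART_SIZE = 5 * 1024 * 1024; // S3 minimum part size

    public void export(EventReader reader, S3Client s3, String bucket, String key, long resumeFrom) {
        reader.seek(resumeFrom);

        String uploadId = s3.createMultipartUpload(
                CreateMultipartUploadRequest.builder().bucket(bucket).key(key).build()).uploadId();

        List<CompletedPart> parts = new ArrayList<>();
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        int partNumber = 1;
        byte[] event;

        while ((event = reader.readNextEvent()) != null) {
            buffer.write(event, 0, event.length);     // events may spill across parts
            if (buffer.size() >= PART_SIZE) {
                parts.add(uploadPart(s3, bucket, key, uploadId, partNumber++, buffer.toByteArray()));
                buffer.reset();
                checkpoint(reader.position());         // sequenced position => progress and resume point
            }
        }
        if (buffer.size() > 0) {
            parts.add(uploadPart(s3, bucket, key, uploadId, partNumber, buffer.toByteArray()));
        }
        s3.completeMultipartUpload(CompleteMultipartUploadRequest.builder()
                .bucket(bucket).key(key).uploadId(uploadId)
                .multipartUpload(CompletedMultipartUpload.builder().parts(parts).build()).build());
    }

    private CompletedPart uploadPart(S3Client s3, String bucket, String key,
                                     String uploadId, int partNumber, byte[] data) {
        UploadPartResponse resp = s3.uploadPart(UploadPartRequest.builder()
                        .bucket(bucket).key(key).uploadId(uploadId).partNumber(partNumber).build(),
                RequestBody.fromBytes(data));
        return CompletedPart.builder().partNumber(partNumber).eTag(resp.eTag()).build();
    }

    private void checkpoint(long position) { /* persist the position for pause/resume (placeholder) */ }
}
```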


The importer has one challenge that the exporter does not. The destination for the exporter can accept multipart uploads, but it has limitations in sending the same payload back other than as an output stream, which forces the importer to wait until the transfer completes. The exporter, on the other hand, can pause and resume independently of the destination. This allows the sender to be smart in how it orchestrates simultaneous transfers to multiple destinations without requiring more replicas.
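A corresponding sketch of the import path, again with a hypothetical EventWriter for the stream store: the object comes back from GetObject as a single sequential stream, so the import runs to completion once started, with each reserved writer handling exactly one transfer. The event size used here is an assumed constant for illustration.

```java
import software.amazon.awssdk.core.ResponseInputStream;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.GetObjectRequest;
import software.amazon.awssdk.services.s3.model.GetObjectResponse;

import java.io.IOException;
import java.io.InputStream;

// Hypothetical stream-store writer: appends one event's worth of bytes to the reserved stream.
interface EventWriter {
    void writeEvent(byte[] payload);
}

public class Importer {
    private static final int MAX_EVENT_SIZE = 1024 * 1024; // assumed finite event size

    public void importObject(S3Client s3, String bucket, String key, EventWriter writer) throws IOException {
        GetObjectRequest request = GetObjectRequest.builder().bucket(bucket).key(key).build();
        try (ResponseInputStream<GetObjectResponse> in = s3.getObject(request)) {
            byte[] chunk = new byte[MAX_EVENT_SIZE];
            int read;
            while ((read = readFully(in, chunk)) > 0) {
                byte[] event = new byte[read];
                System.arraycopy(chunk, 0, event, 0, read);
                writer.writeEvent(event);   // data larger than one event spills into the next
            }
        }
    }

    // Fill the buffer as much as possible before emitting an event.
    private int readFully(InputStream in, byte[] buf) throws IOException {
        int total = 0;
        while (total < buf.length) {
            int n = in.read(buf, total, buf.length - total);
            if (n < 0) break;
            total += n;
        }
        return total;
    }
}
```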


As with any application, the importers and exporters can each be a dedicated single thread of activity that can be tested, serviced, and monitored with best practices from testing, dev-ops, and call-home functionality. These capabilities can be added independently via their own stacks or applications, which sit well alongside those deployed by the tool. There is very little need for the tool to take on this onus itself, since specialized products continue to serve similar functionality across applications. For example, reporting stacks can work off the logs and read-only queries from the importer and exporter.


API functionality is a separate concern from the above and belongs exclusively to the tool. The tool may take in parameters over the API, requiring little or no redeployment by the end user. This kind of functionality alleviates the setup and teardown associated with ad hoc and changing requirements, as the sketch below suggests.
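A minimal sketch of how such parameters might arrive, using the JDK's built-in HttpServer; the endpoint path, query parameters, and wiring to the exporter/importer are illustrative assumptions, not the tool's actual API.

```java
import com.sun.net.httpserver.HttpServer;

import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

public class TransferApi {
    public static void main(String[] args) throws Exception {
        HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);
        // e.g. POST /transfer?mode=export&stream=orders&bucket=backups&key=orders-2020-05-15
        server.createContext("/transfer", exchange -> {
            Map<String, String> params = parseQuery(exchange.getRequestURI().getQuery());
            String mode = params.getOrDefault("mode", "export");
            // Hand the parameters to the exporter or importer here; no redeployment is needed
            // because the source stream, bucket, and key arrive with the request itself.
            byte[] body = ("accepted " + mode + " of " + params.get("stream")
                    + " <-> s3://" + params.get("bucket") + "/" + params.get("key"))
                    .getBytes(StandardCharsets.UTF_8);
            exchange.sendResponseHeaders(202, body.length);
            exchange.getResponseBody().write(body);
            exchange.close();
        });
        server.start();
    }

    private static Map<String, String> parseQuery(String query) {
        Map<String, String> params = new HashMap<>();
        if (query == null) return params;
        for (String pair : query.split("&")) {
            String[] kv = pair.split("=", 2);
            params.put(kv[0], kv.length > 1 ? kv[1] : "");
        }
        return params;
    }
}
```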

