Some essentials for an integration pipeline include:
Data ingestion: the raw data is passed to blob storage and its metadata is queued in a job database for downstream processing (a sketch of this step follows the list).
Central components:
Messaging queue: all components interact through it, which keeps them decoupled and allows each tier to scale independently.
Config store: enables a stable pipeline with configuration-driven variability.
Job store: keeps track of the various job executions.
Job poller/scheduler: picks up pending jobs and drops a message into the message queue, for example Kafka (see the poller sketch below the list).
Consumers: these must be horizontally scalable; they stage the data in tenant-specific databases (a consumer sketch follows the list).
Databases: these must also be horizontally scalable; here the data is transformed and stored in suitable multi-tenant databases.
Data warehouse/data lake: the data must be denormalized, with dimensions such as tenant, to support multi-tenant data sources (a warehouse sketch follows the list).
Analytics stack: the data lake/warehouse is the source for the analytics stack, preferably U-SQL based to leverage existing skills.
ML stack: also draws on the data lake/warehouse, with emphasis on keeping training and test datasets separate and on the execution and feedback loop for a model (an ML sketch follows the list).
Monitoring, performance, and security: these include telemetry, auditing, security and compliance, and aging and lifecycle management.
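A minimal sketch of the ingestion step, assuming Azure Blob Storage via the azure-storage-blob package and using SQLite as an illustrative stand-in for the job database; the container, table, and file names are assumptions, not prescriptions:

import sqlite3
import uuid
from datetime import datetime, timezone

from azure.storage.blob import BlobServiceClient  # assumes azure-storage-blob is installed


def ingest(payload: bytes, tenant_id: str, conn_str: str) -> str:
    """Stage the raw payload in blob storage and queue its metadata as a pending job."""
    job_id = str(uuid.uuid4())
    blob_name = f"{tenant_id}/{job_id}.json"

    # 1. Pass the raw data to blob storage (container name "raw-ingest" is illustrative).
    blob_service = BlobServiceClient.from_connection_string(conn_str)
    blob_client = blob_service.get_blob_client(container="raw-ingest", blob=blob_name)
    blob_client.upload_blob(payload)

    # 2. Queue the metadata in the job store for downstream processing.
    with sqlite3.connect("jobs.db") as db:
        db.execute(
            "CREATE TABLE IF NOT EXISTS jobs "
            "(job_id TEXT PRIMARY KEY, tenant_id TEXT, blob_name TEXT, status TEXT, created_at TEXT)"
        )
        db.execute(
            "INSERT INTO jobs VALUES (?, ?, ?, 'PENDING', ?)",
            (job_id, tenant_id, blob_name, datetime.now(timezone.utc).isoformat()),
        )
    return job_id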
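A poller/scheduler sketch, assuming the kafka-python client and the same illustrative SQLite job store; the topic name ingest-jobs and the broker address are assumptions:

import json
import sqlite3

from kafka import KafkaProducer  # assumes kafka-python is installed

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)


def poll_and_publish() -> None:
    """Scan the job store for pending jobs and drop one message per job onto Kafka."""
    with sqlite3.connect("jobs.db") as db:
        pending = db.execute(
            "SELECT job_id, tenant_id, blob_name FROM jobs WHERE status = 'PENDING'"
        ).fetchall()
        for job_id, tenant_id, blob_name in pending:
            # Publish the job; downstream consumers do the actual processing.
            producer.send(
                "ingest-jobs",
                {"job_id": job_id, "tenant_id": tenant_id, "blob_name": blob_name},
            )
            db.execute("UPDATE jobs SET status = 'QUEUED' WHERE job_id = ?", (job_id,))
    producer.flush()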
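A consumer sketch under the same assumptions. Running several copies of this process with the same group_id splits the topic's partitions between them, which is what makes the tier horizontally scalable; each record is then staged into a tenant-specific database (SQLite again stands in):

import json
import sqlite3

from kafka import KafkaConsumer  # assumes kafka-python is installed

consumer = KafkaConsumer(
    "ingest-jobs",
    bootstrap_servers="localhost:9092",
    group_id="staging-consumers",  # start more processes with this group_id to scale out
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

for message in consumer:
    job = message.value
    # Route the record to a tenant-specific staging database.
    with sqlite3.connect(f"staging_{job['tenant_id']}.db") as db:
        db.execute(
            "CREATE TABLE IF NOT EXISTS staged_jobs (job_id TEXT PRIMARY KEY, blob_name TEXT)"
        )
        db.execute(
            "INSERT OR REPLACE INTO staged_jobs VALUES (?, ?)",
            (job["job_id"], job["blob_name"]),
        )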
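A sketch of the denormalized warehouse layout, with the tenant carried as an explicit dimension on every row so one lake/warehouse can serve many tenants; SQLite stands in for the real warehouse and the table and column names are illustrative:

import sqlite3

DDL = """
CREATE TABLE IF NOT EXISTS fact_events (
    event_id    TEXT PRIMARY KEY,
    tenant_id   TEXT NOT NULL,      -- tenant dimension baked into every row
    event_type  TEXT,
    event_date  TEXT,
    amount      REAL
);
CREATE INDEX IF NOT EXISTS idx_fact_events_tenant ON fact_events (tenant_id, event_date);
"""

with sqlite3.connect("warehouse.db") as db:
    db.executescript(DDL)
    # Analytic queries filter or group by the tenant dimension.
    rows = db.execute(
        "SELECT tenant_id, event_type, SUM(amount) FROM fact_events "
        "GROUP BY tenant_id, event_type"
    ).fetchall()
    for tenant_id, event_type, total in rows:
        print(tenant_id, event_type, total)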
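A brief ML sketch, assuming scikit-learn and synthetic data, showing the separation of training and test datasets and a simple feedback check that flags the model for retraining; the features, labels, and threshold are assumptions:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

# Keep the training and test datasets strictly separate.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
score = model.score(X_test, y_test)

# Feedback loop: a poor held-out score triggers retraining (or an alert) in the pipeline.
if score < 0.9:
    print(f"score {score:.3f} below threshold, flag model for retraining")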