Some essentials for an integration pipeline include:
Data ingestion: the raw data is passed to blob storage and its metadata is queued in a job database for downstream processing (a sketch of this step follows the list).
Central components:
Messaging queue: all components interact through it, which keeps them decoupled and allows each tier to scale independently.
Config store: enables a stable pipeline with configuration-driven variability.
Job store: keeps track of the various job executions.
Job poller/scheduler: picks up pending jobs and drops a message into the message queue, for example Kafka (see the poller sketch below the list).
Consumers: these must be horizontally scalable; they stage the data in tenant-specific databases (a consumer sketch follows the list).
Databases: these must also be horizontally scalable; here the data is transformed and stored in suitable multi-tenant databases.
Data warehouse/data lake: the data must be denormalized, with dimensions such as tenant, to support multi-tenant data sources (a warehouse sketch follows the list).
Analytics stack: the data lake/warehouse is the source for the analytics stack, preferably U-SQL based to leverage existing skills.
ML stack: also draws on the data lake/warehouse, with emphasis on keeping training and test datasets separate and on the execution and feedback loop for a model (an ML sketch follows the list).
Monitoring, performance, and security: these include telemetry, auditing, security and compliance, and aging and lifecycle management.
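A minimal sketch of the ingestion step, assuming Azure Blob Storage via the azure-storage-blob package and using SQLite as an illustrative stand-in for the job database; the container, table, and file names are assumptions, not prescriptions:

import sqlite3
import uuid
from datetime import datetime, timezone

from azure.storage.blob import BlobServiceClient  # assumes azure-storage-blob is installed


def ingest(payload: bytes, tenant_id: str, conn_str: str) -> str:
    """Stage the raw payload in blob storage and queue its metadata as a pending job."""
    job_id = str(uuid.uuid4())
    blob_name = f"{tenant_id}/{job_id}.json"

    # 1. Pass the raw data to blob storage (container name "raw-ingest" is illustrative).
    blob_service = BlobServiceClient.from_connection_string(conn_str)
    blob_client = blob_service.get_blob_client(container="raw-ingest", blob=blob_name)
    blob_client.upload_blob(payload)

    # 2. Queue the metadata in the job store for downstream processing.
    with sqlite3.connect("jobs.db") as db:
        db.execute(
            "CREATE TABLE IF NOT EXISTS jobs "
            "(job_id TEXT PRIMARY KEY, tenant_id TEXT, blob_name TEXT, status TEXT, created_at TEXT)"
        )
        db.execute(
            "INSERT INTO jobs VALUES (?, ?, ?, 'PENDING', ?)",
            (job_id, tenant_id, blob_name, datetime.now(timezone.utc).isoformat()),
        )
    return job_id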
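A poller/scheduler sketch, assuming the kafka-python client and the same illustrative SQLite job store; the topic name ingest-jobs and the broker address are assumptions:

import json
import sqlite3

from kafka import KafkaProducer  # assumes kafka-python is installed

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)


def poll_and_publish() -> None:
    """Scan the job store for pending jobs and drop one message per job onto Kafka."""
    with sqlite3.connect("jobs.db") as db:
        pending = db.execute(
            "SELECT job_id, tenant_id, blob_name FROM jobs WHERE status = 'PENDING'"
        ).fetchall()
        for job_id, tenant_id, blob_name in pending:
            # Publish the job; downstream consumers do the actual processing.
            producer.send(
                "ingest-jobs",
                {"job_id": job_id, "tenant_id": tenant_id, "blob_name": blob_name},
            )
            db.execute("UPDATE jobs SET status = 'QUEUED' WHERE job_id = ?", (job_id,))
    producer.flush()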
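A consumer sketch under the same assumptions. Running several copies of this process with the same group_id splits the topic's partitions between them, which is what makes the tier horizontally scalable; each record is then staged into a tenant-specific database (SQLite again stands in):

import json
import sqlite3

from kafka import KafkaConsumer  # assumes kafka-python is installed

consumer = KafkaConsumer(
    "ingest-jobs",
    bootstrap_servers="localhost:9092",
    group_id="staging-consumers",  # start more processes with this group_id to scale out
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

for message in consumer:
    job = message.value
    # Route the record to a tenant-specific staging database.
    with sqlite3.connect(f"staging_{job['tenant_id']}.db") as db:
        db.execute(
            "CREATE TABLE IF NOT EXISTS staged_jobs (job_id TEXT PRIMARY KEY, blob_name TEXT)"
        )
        db.execute(
            "INSERT OR REPLACE INTO staged_jobs VALUES (?, ?)",
            (job["job_id"], job["blob_name"]),
        )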
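A sketch of the denormalized warehouse layout, with the tenant carried as an explicit dimension on every row so one lake/warehouse can serve many tenants; SQLite stands in for the real warehouse and the table and column names are illustrative:

import sqlite3

DDL = """
CREATE TABLE IF NOT EXISTS fact_events (
    event_id    TEXT PRIMARY KEY,
    tenant_id   TEXT NOT NULL,      -- tenant dimension baked into every row
    event_type  TEXT,
    event_date  TEXT,
    amount      REAL
);
CREATE INDEX IF NOT EXISTS idx_fact_events_tenant ON fact_events (tenant_id, event_date);
"""

with sqlite3.connect("warehouse.db") as db:
    db.executescript(DDL)
    # Analytic queries filter or group by the tenant dimension.
    rows = db.execute(
        "SELECT tenant_id, event_type, SUM(amount) FROM fact_events "
        "GROUP BY tenant_id, event_type"
    ).fetchall()
    for tenant_id, event_type, total in rows:
        print(tenant_id, event_type, total)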
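A brief ML sketch, assuming scikit-learn and synthetic data, showing the separation of training and test datasets and a simple feedback check that flags the model for retraining; the features, labels, and threshold are assumptions:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

# Keep the training and test datasets strictly separate.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
score = model.score(X_test, y_test)

# Feedback loop: a poor held-out score triggers retraining (or an alert) in the pipeline.
if score < 0.9:
    print(f"score {score:.3f} below threshold, flag model for retraining")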