Sunday, April 2, 2023

 Some essentials for an integration pipeline include: 

  1. Data ingestion: incoming data lands in blob storage and its metadata is queued in a job database for downstream processing. 
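
A minimal sketch of this ingestion step, assuming Azure Blob Storage via the azure-storage-blob SDK and SQLite standing in for the job database; the container name and jobs table are illustrative:

```python
import sqlite3
from datetime import datetime, timezone

from azure.storage.blob import BlobServiceClient

def ingest(conn_str: str, payload: bytes, blob_name: str) -> None:
    # Write the raw data to blob storage.
    service = BlobServiceClient.from_connection_string(conn_str)
    blob = service.get_blob_client(container="ingestion", blob=blob_name)
    blob.upload_blob(payload, overwrite=True)

    # Queue the metadata in the job database for downstream processing.
    # Assumes a jobs(blob_name, status, created_at) table already exists.
    db = sqlite3.connect("jobs.db")
    db.execute(
        "INSERT INTO jobs (blob_name, status, created_at) VALUES (?, ?, ?)",
        (blob_name, "pending", datetime.now(timezone.utc).isoformat()),
    )
    db.commit()
    db.close()
```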


  2. Central components:  

      1. Message queue: all components interact through it, which decouples them and lets each scale independently.  

      2. Config store: enables a stable pipeline with configuration-driven variability.  

      3. Job store: keeps track of the various job executions and their state. 
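
As a sketch of how the config store and job store work together, here is a configuration-driven job record; the per-tenant keys and table schema are hypothetical, chosen only for illustration:

```python
import sqlite3

# Hypothetical config store: per-tenant settings drive pipeline behavior
# without code changes (configuration-driven variability).
CONFIG = {
    "tenant-a": {"batch_size": 500, "target_db": "tenant_a_staging"},
    "tenant-b": {"batch_size": 100, "target_db": "tenant_b_staging"},
}

def create_job(db: sqlite3.Connection, tenant: str, blob_name: str) -> int:
    # The job store tracks each execution so pollers and monitors
    # can see what is pending, running, or done.
    cfg = CONFIG[tenant]
    cur = db.execute(
        "INSERT INTO jobs (tenant, blob_name, batch_size, status) "
        "VALUES (?, ?, ?, 'pending')",
        (tenant, blob_name, cfg["batch_size"]),
    )
    db.commit()
    return cur.lastrowid
```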


  3. Job poller/scheduler: picks up pending jobs and drops a corresponding message into the message queue, for example, Kafka. 
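
A sketch of the poller, assuming the kafka-python client; it drains pending jobs from the job store and publishes one message per job to a topic (the topic name is made up):

```python
import json
import sqlite3

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def poll_and_publish(db: sqlite3.Connection) -> None:
    # Pick up pending jobs and drop a message per job into Kafka.
    rows = db.execute(
        "SELECT id, tenant, blob_name FROM jobs WHERE status = 'pending'"
    ).fetchall()
    for job_id, tenant, blob_name in rows:
        producer.send("jobs", {"id": job_id, "tenant": tenant, "blob": blob_name})
        db.execute("UPDATE jobs SET status = 'queued' WHERE id = ?", (job_id,))
    db.commit()
    producer.flush()
```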


  4. Consumers: these must be horizontally scalable; they read messages off the queue and stage the data in tenant-specific databases. 
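
A consumer sketch, again with kafka-python. Running several instances with the same group_id is what gives horizontal scalability, since Kafka balances partitions across the group; the staging helper is a placeholder:

```python
import json

from kafka import KafkaConsumer

def stage_to_tenant_db(tenant: str, blob_name: str) -> None:
    # Placeholder: in practice this copies the blob's contents into the
    # staging database for this tenant.
    print(f"staging {blob_name} into {tenant}'s staging database")

consumer = KafkaConsumer(
    "jobs",
    bootstrap_servers="localhost:9092",
    group_id="staging-consumers",  # shared group id => partitions split across instances
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

for message in consumer:
    job = message.value
    stage_to_tenant_db(tenant=job["tenant"], blob_name=job["blob"])
```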


  5. Databases: these must also be horizontally scalable; the staged data is transformed and stored in suitable multi-tenant databases. 
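
One common multi-tenant layout is a shared schema with a tenant discriminator column, which also makes the table easy to shard by tenant when scaling out; a sketch with sqlite3 standing in for the real store, and an invented orders table:

```python
import sqlite3

db = sqlite3.connect("multitenant.db")
db.execute(
    """
    CREATE TABLE IF NOT EXISTS orders (
        tenant_id   TEXT NOT NULL,   -- discriminator keeps tenants apart
        order_id    TEXT NOT NULL,
        amount      REAL NOT NULL,
        created_at  TEXT NOT NULL,
        PRIMARY KEY (tenant_id, order_id)
    )
    """
)

# Every query is scoped by tenant_id, so one schema serves all tenants.
rows = db.execute(
    "SELECT order_id, amount FROM orders WHERE tenant_id = ?", ("tenant-a",)
).fetchall()
```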


  6. Data warehouse/data lake: the data must be denormalized, with dimensions such as tenant, to support multi-tenant data sources. 
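
To illustrate denormalization with a tenant dimension: each fact row carries the tenant and its attributes inline, so analytical rollups need no joins. The table and columns here are made up:

```python
import sqlite3

db = sqlite3.connect("lake.db")

# Denormalized fact table: the tenant dimension is flattened into each row.
db.execute(
    """
    CREATE TABLE IF NOT EXISTS sales_fact (
        tenant_id    TEXT,   -- tenant dimension, inline
        tenant_tier  TEXT,   -- attribute copied from the tenant dimension
        order_date   TEXT,
        product      TEXT,
        amount       REAL
    )
    """
)

# A typical multi-tenant rollup over the denormalized table: no joins needed.
totals = db.execute(
    "SELECT tenant_id, SUM(amount) FROM sales_fact GROUP BY tenant_id"
).fetchall()
```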


  7. Analytics stack: the data lake/warehouse is the source for the analytics stack, preferably U-SQL based to leverage existing skills. 
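
Since the post prefers U-SQL for analytics, here is an illustrative U-SQL script held as a Python string, as it might be submitted to a Data Lake Analytics job; the paths and schema are assumptions, not from the original post:

```python
# Illustrative U-SQL: read from the lake, aggregate per tenant, write back.
USQL_SCRIPT = r"""
@events =
    EXTRACT tenant string,
            amount double
    FROM "/lake/events/2023-04.csv"
    USING Extractors.Csv();

@totals =
    SELECT tenant, SUM(amount) AS total
    FROM @events
    GROUP BY tenant;

OUTPUT @totals
TO "/lake/reports/totals.csv"
USING Outputters.Csv();
"""
```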


  8. ML stack: also draws on the data lake/warehouse, with emphasis on separating training and test datasets and on the execution and feedback loop for a model. 
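
A minimal sketch of the train/test separation and feedback loop using scikit-learn; the synthetic dataset and model choice are placeholders for features drawn from the lake:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Placeholder features; in practice these come from the data lake/warehouse.
X, y = make_classification(n_samples=1000, random_state=42)

# Keep training and test data strictly separate.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# The held-out score feeds the feedback loop: retrain or tune when it drifts.
score = accuracy_score(y_test, model.predict(X_test))
print(f"held-out accuracy: {score:.3f}")
```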


  9. Monitoring, performance, and security: these include telemetry, auditing, security and compliance, and aging and lifecycle management. 
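
Finally, a small telemetry/auditing sketch with the standard logging module; the event name and fields are illustrative:

```python
import json
import logging
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("pipeline.telemetry")

def audit(event: str, **fields) -> None:
    # Structured audit record: what happened and when, usable for
    # security and compliance reviews and for lifecycle (aging) decisions.
    record = {"event": event, "ts": time.time(), **fields}
    log.info(json.dumps(record))

audit("job.completed", job_id=42, tenant="tenant-a", duration_ms=1830)
```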
