Tuesday, March 28, 2023

 

Data pipelines are specific to organizational needs, so it is hard to come up with a tried-and-tested methodology that suits everyone, but some standard practices apply across domains. One such principle is to virtualize the different sources of data, say into a data lake, so that there is only one, or at most a few, pipeline paths. Another is to be diligent about consistency and standardization to prevent unwieldy or numerous customizations. For example, if a patient risk score needs to be calculated, apply a general scoring logic that is not source specific first, and then apply an override for the specific source. Reuse can be boosted by managing configurations stored in a database, which avoids the pipeline-per-data-source antipattern.
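As a rough illustration of that idea, the sketch below applies a general score first and then a per-source override looked up from a registry; the field names, weights, and the "clinic_a" override are hypothetical, and a real pipeline would load the override map from a configuration database rather than a hard-coded dictionary.

```python
# Minimal sketch of config-driven scoring with per-source overrides.
# Fields, weights, and the "clinic_a" override are hypothetical; a real
# pipeline would load OVERRIDES from a configuration database.

from typing import Callable, Dict

def general_risk_score(patient: Dict) -> float:
    """Source-agnostic scoring logic applied to every record first."""
    score = 2.0 * patient.get("age", 0) / 100.0
    score += 1.5 if patient.get("smoker") else 0.0
    return score

# Source-specific adjustments are registered in configuration rather than
# implemented as separate pipelines.
OVERRIDES: Dict[str, Callable[[Dict, float], float]] = {
    # hypothetical: records from "clinic_a" are known to under-report risk
    "clinic_a": lambda patient, score: score * 1.1,
}

def score_record(source: str, patient: Dict) -> float:
    score = general_risk_score(patient)        # general logic first
    override = OVERRIDES.get(source)           # then the source override, if any
    return override(patient, score) if override else score

if __name__ == "__main__":
    print(score_record("clinic_b", {"age": 60, "smoker": True}))  # general path only
    print(score_record("clinic_a", {"age": 60, "smoker": True}))  # overridden path
```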

Pipelines also need to support scalability. One approach to scaling is an event-driven stack: each step picks up its task from a message queue, processes it event by event, and publishes its results to another queue. Apache Kafka is a good option for this type of setup and works well for stream processing as well as batch-style workloads.
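The sketch below shows one such step using the kafka-python client; the broker address, topic names, consumer group, and the trivial transform are all assumptions made for illustration.

```python
# One event-driven pipeline step, sketched with the kafka-python client.
# Broker address, topic names, and the transform are placeholders.

import json
from kafka import KafkaConsumer, KafkaProducer

BROKERS = "localhost:9092"                            # assumed broker address
IN_TOPIC, OUT_TOPIC = "raw-events", "scored-events"   # hypothetical topics

consumer = KafkaConsumer(
    IN_TOPIC,
    bootstrap_servers=BROKERS,
    group_id="scoring-step",        # add consumers to this group to scale out
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)
producer = KafkaProducer(
    bootstrap_servers=BROKERS,
    value_serializer=lambda d: json.dumps(d).encode("utf-8"),
)

# Each step consumes one event, applies its processing logic, and publishes
# the result to the queue that feeds the next step.
for message in consumer:
    event = message.value
    event["processed"] = True       # stand-in for the step's real logic
    producer.send(OUT_TOPIC, event)
```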

Another approach to scalability is to use a data warehouse. A warehouse formalizes extract-transform-load (ETL) operations from diverse data sources and supports many kinds of read-only analytical stacks.
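A minimal ETL sketch is shown below; the source query, the transformation, and the warehouse table are placeholders, and sqlite3 stands in for real source and warehouse connections purely for illustration.

```python
# Minimal extract-transform-load sketch; queries, fields, and the
# dim_patient table are hypothetical. sqlite3 stands in for real source
# and warehouse connections.

import sqlite3

def extract(source_conn):
    # Pull raw rows from an operational source (read-only)
    return source_conn.execute(
        "SELECT patient_id, age, smoker FROM patients").fetchall()

def transform(rows):
    # Conform every source to one standardized warehouse shape
    return [(pid, int(age), 1 if smoker else 0) for pid, age, smoker in rows]

def load(warehouse_conn, rows):
    # Append into the warehouse table that analytical stacks read from
    warehouse_conn.executemany(
        "INSERT INTO dim_patient (patient_id, age, smoker) VALUES (?, ?, ?)",
        rows)
    warehouse_conn.commit()

if __name__ == "__main__":
    source = sqlite3.connect(":memory:")
    source.execute("CREATE TABLE patients (patient_id TEXT, age INTEGER, smoker INTEGER)")
    source.execute("INSERT INTO patients VALUES ('p1', 60, 1)")
    warehouse = sqlite3.connect(":memory:")
    warehouse.execute("CREATE TABLE dim_patient (patient_id TEXT, age INTEGER, smoker INTEGER)")
    load(warehouse, transform(extract(source)))
    print(warehouse.execute("SELECT * FROM dim_patient").fetchall())
```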

Finally, on-premises solutions can be migrated to the cloud for scalability, thanks to elasticity and higher rate limits, while transparent pay-as-you-go pricing strengthens the return-on-investment case. Apprehension about data security still precedes many design decisions about cloud solutions, but security and compliance in the cloud are strong and provide good opportunities for hardening.

Monitoring and alerting increase transparency and visibility into the application and are crucial for health checks and troubleshooting. A centralized dashboard for metrics and alerts greatly improves pipeline operations and also simplifies notifications.
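One possible shape for this is sketched below, exposing a few pipeline metrics with the prometheus_client library so a central dashboard can scrape and alert on them; the metric names and port are arbitrary choices, not part of the original post.

```python
# Sketch of pipeline metrics exposed for a central dashboard to scrape,
# using prometheus_client; metric names and the port are arbitrary.

import time
from prometheus_client import Counter, Gauge, start_http_server

RECORDS_PROCESSED = Counter(
    "pipeline_records_processed_total", "Records successfully processed")
RECORDS_FAILED = Counter(
    "pipeline_records_failed_total", "Records that raised an error")
LAST_RUN = Gauge(
    "pipeline_last_run_timestamp_seconds", "Unix time of the last completed run")

def process(record):
    try:
        # ... the step's real logic goes here ...
        RECORDS_PROCESSED.inc()
    except Exception:
        RECORDS_FAILED.inc()
        raise

if __name__ == "__main__":
    start_http_server(8000)   # /metrics endpoint for the dashboard to scrape
    process({"patient_id": "p1"})
    LAST_RUN.set_to_current_time()
    time.sleep(60)            # keep the endpoint alive long enough to scrape
```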

The public cloud offers so many technology stacks and services that a team almost always has gaps in expertise somewhere, so development teams must invest in skills and internal cultural change. Sub-optimal practices tend to appear when leadership does not prioritize cloud cost optimization: developers ignore small administrative tasks that could significantly reduce operating costs; architects select designs that are easier and faster to implement but more expensive to run; algorithms and code are not streamlined and tightened to leverage cloud best practices; deployment automation, including right-sizing the resources deployed, is neglected or skipped altogether; and finance and procurement teams focus narrowly on the numbers in the cloud bill, creating tension with the IT and development teams. A non-committal mindset toward cloud technologies is also a missed opportunity for business leaders, because long-term engagements are more cost friendly.

 
