Data pipelines are specific to organizational needs, so it is hard to come up with a tried and tested methodology that suits everyone, but some standard practices apply across domains. One such principle is to virtualize the different sources of data, say into a data lake, so that there is one or at most a few pipeline paths. Another is to be diligent about consistency and standardization to prevent unwieldy or numerous customizations. For example, if a patient risk score needs to be calculated, a general, source-agnostic scoring logic should be applied first, followed by an override for the specific source. Reuse can be boosted by managing these configurations in a database, which avoids the pipeline-per-data-source antipattern.
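To make the override pattern concrete, here is a minimal sketch in Python. The score_overrides table, its columns, and the scoring fields are hypothetical placeholders, not a prescribed schema:

```python
# Configuration-driven scoring: compute a general, source-agnostic score
# first, then apply a per-source override read from a configuration table.
# Table, column, and field names below are hypothetical.
import sqlite3

def base_risk_score(patient: dict) -> float:
    """General scoring logic shared by every data source."""
    score = 2.0 if patient.get("age", 0) >= 65 else 0.0
    score += 1.5 * len(patient.get("chronic_conditions", []))
    return score

def load_override(conn: sqlite3.Connection, source: str):
    """Fetch an optional (multiplier, score_offset) configured for a source."""
    return conn.execute(
        "SELECT multiplier, score_offset FROM score_overrides WHERE source = ?",
        (source,),
    ).fetchone()  # None when no override is configured

def risk_score(conn: sqlite3.Connection, patient: dict, source: str) -> float:
    score = base_risk_score(patient)         # apply the general logic first
    override = load_override(conn, source)   # then the source-specific tweak
    if override:
        multiplier, score_offset = override
        score = score * multiplier + score_offset
    return score

if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE score_overrides (source TEXT, multiplier REAL, score_offset REAL)")
    conn.execute("INSERT INTO score_overrides VALUES ('registry_a', 1.2, 0.5)")
    patient = {"age": 70, "chronic_conditions": ["diabetes"]}
    print(risk_score(conn, patient, "registry_a"))   # overridden score
    print(risk_score(conn, patient, "registry_b"))   # general score only
```

The general logic stays in one place, and adding a new data source only means adding a configuration row rather than a new pipeline.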
Pipelines also need to support scalability. One approach to scale involves an event-driven stack: each step picks up its task from a messaging queue, sends its results to another queue, and the processing logic works on an event-by-event basis. Apache Kafka is a good option for this type of setup and works equally well for both stream processing and batch processing.
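A minimal sketch of one such step, assuming the kafka-python client, a local broker, and hypothetical topic names, could look like this; each iteration consumes one event, transforms it, and publishes the result downstream:

```python
# One event-driven pipeline step: consume from an upstream topic,
# process a single event, and publish the result to a downstream topic.
import json
from kafka import KafkaConsumer, KafkaProducer

consumer = KafkaConsumer(
    "raw-records",                               # upstream queue for this step
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda obj: json.dumps(obj).encode("utf-8"),
)

def transform(record: dict) -> dict:
    """Placeholder for this step's processing logic."""
    record["processed"] = True
    return record

# Event-by-event processing loop.
for message in consumer:
    result = transform(message.value)
    producer.send("processed-records", result)
```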
Another approach to scalability involves the use of a data warehouse. A warehouse helps formalize extract-transform-load operations from diverse data sources and supports many types of read-only analytical stacks.
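As a rough illustration of the extract-transform-load flow, the following sketch uses sqlite3 to stand in for both a source system and the warehouse; the table names and columns are assumptions made for the example:

```python
# Illustrative ETL step: extract from an operational source, normalize the
# rows, and load them into an analytical fact table in the warehouse.
# sqlite3 stands in for both systems; all names are hypothetical.
import sqlite3

def extract(source_conn):
    """Pull raw rows from the operational source."""
    return source_conn.execute("SELECT patient_id, visit_ts, charge FROM visits")

def transform(rows):
    """Normalize rows into the shape the warehouse expects."""
    for patient_id, visit_ts, charge in rows:
        yield patient_id, visit_ts[:10], round(float(charge), 2)

def load(warehouse_conn, rows):
    """Append into the read-only analytical table."""
    warehouse_conn.executemany(
        "INSERT INTO fact_visits (patient_id, visit_date, charge) VALUES (?, ?, ?)",
        rows,
    )
    warehouse_conn.commit()

if __name__ == "__main__":
    source = sqlite3.connect(":memory:")
    source.execute("CREATE TABLE visits (patient_id TEXT, visit_ts TEXT, charge TEXT)")
    source.execute("INSERT INTO visits VALUES ('p1', '2024-01-05T10:30:00', '120.456')")
    warehouse = sqlite3.connect(":memory:")
    warehouse.execute("CREATE TABLE fact_visits (patient_id TEXT, visit_date TEXT, charge REAL)")
    load(warehouse, transform(extract(source)))
    print(warehouse.execute("SELECT * FROM fact_visits").fetchall())
```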
Finally, on-premises solutions can be migrated to the cloud for scalability because of elasticity and higher rate limits, and the transparent, pay-as-you-go pricing appeals to the return on investment. Apprehension about data security precedes many design decisions about cloud solutions, but security and compliance in the cloud are unparalleled and provide better opportunities for hardening.
Monitoring and alerting increase transparency and visibility into the application and are crucial for health checks and troubleshooting. A centralized dashboard for metrics and alerts tremendously improves the operation of the pipeline and also helps with notifications.
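As a hedged example, the sketch below uses the prometheus_client library to expose counters and a last-success gauge that a centralized dashboard (for example, Grafana) can chart and alert on; the metric names and port are assumptions:

```python
# Instrument a pipeline step with basic health metrics and expose them
# over HTTP for a metrics dashboard to scrape.
import time
from prometheus_client import Counter, Gauge, start_http_server

RECORDS_PROCESSED = Counter("pipeline_records_processed_total",
                            "Records successfully processed")
RECORDS_FAILED = Counter("pipeline_records_failed_total",
                         "Records that raised an error")
LAST_SUCCESS = Gauge("pipeline_last_success_timestamp",
                     "Unix time of the last successful record")

def process(record: dict) -> None:
    """Placeholder processing step instrumented with metrics."""
    try:
        # ... actual transformation would go here ...
        RECORDS_PROCESSED.inc()
        LAST_SUCCESS.set_to_current_time()
    except Exception:
        RECORDS_FAILED.inc()
        raise

if __name__ == "__main__":
    start_http_server(8000)    # scrape endpoint for the dashboard
    while True:
        process({"example": True})
        time.sleep(1)
```

An alert on the failure counter or on a stale last-success timestamp is often enough to catch a stuck pipeline before users notice.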
There are so many technology stacks and services in the public cloud that some expertise is always missing on the team, so development teams must focus on skills and internal cultural change. Sub-optimal practices tend to appear when leadership does not prioritize cloud cost optimization. For example: developers ignore small administrative tasks that could significantly improve operating costs; architects select designs that are easier and faster to implement but more expensive to run; algorithms and code are not streamlined and tightened to leverage cloud best practices; deployment automation is neglected or skipped altogether when it could have right-sized the deployed resources; and finance and procurement teams misplace their focus on the numbers in the cloud bill, creating tension between them and the IT/development teams. A non-committal mindset towards cloud technologies is a missed opportunity for business leaders because long-term engagements are more cost friendly.