Wednesday, October 30, 2024

This section explores dependency management and pipeline orchestration, aptly dubbed “undercurrents” in the book “Fundamentals of Data Engineering” by Joe Reis and Matt Housley.

Data orchestration is the process of dependency management, facilitated through automation. It involves scheduling, triggering, monitoring, and resource allocation. Data orchestrators differ from simple cron-based schedulers: they can be triggered by events, webhooks, and schedules, and they can manage intra-workflow dependencies. Data orchestration provides a structured, automated, and efficient way to handle large-scale data from diverse sources.

Orchestration steers workflows toward efficiency and functionality, with the orchestrator serving as the tool that enables those workflows. An orchestrator typically triggers pipelines on a schedule or in response to a specific event. Event-driven pipelines are particularly useful for data that arrives unpredictably or for resource-intensive jobs.
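
To make the two trigger styles concrete, here is a minimal sketch using Apache Airflow's dataset-aware scheduling (assuming Airflow 2.4 or later); the dataset URI and task bodies are illustrative placeholders, not a prescription from the book.

```python
from datetime import datetime

from airflow.datasets import Dataset
from airflow.decorators import dag, task

# Hypothetical dataset URI; Airflow treats it as an opaque identifier.
orders = Dataset("s3://example-bucket/raw/orders/")

@dag(schedule="0 6 * * *", start_date=datetime(2024, 1, 1), catchup=False)
def ingest_orders():
    """Schedule-driven: runs every morning at 06:00."""

    @task(outlets=[orders])
    def land_orders():
        ...  # pull the source file and write it to object storage

    land_orders()

@dag(schedule=[orders], start_date=datetime(2024, 1, 1), catchup=False)
def transform_orders():
    """Event-driven: runs whenever the orders dataset is updated upstream."""

    @task
    def build_orders_table():
        ...  # transform the freshly landed data

    build_orders_table()

ingest_orders()
transform_orders()
```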

The perks of having an orchestrator in your data engineering toolkit include workflow management, automation, error handling and recovery, monitoring and alerting, and resource optimization. Directed Acyclic Graphs (DAGs) bring order, control, and repeatability to data workflows, managing dependencies and ensuring a structured, predictable flow of data. They are pivotal for orchestrating and visualizing pipelines, making them indispensable for managing complex workflows, particularly in team or large-scale setups. A DAG serves as a clear roadmap that defines the order of tasks, and through that lens it becomes possible to organize the creation, scheduling, and monitoring of data pipelines.
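
As a minimal sketch of that roadmap, the following Airflow-style DAG (using the TaskFlow API, assuming Airflow 2.x; the task names and logic are hypothetical) wires three tasks so the orchestrator knows extract must finish before transform, and transform before load:

```python
from datetime import datetime

from airflow.decorators import dag, task

@dag(schedule="@daily", start_date=datetime(2024, 1, 1), catchup=False)
def daily_sales_pipeline():
    @task
    def extract() -> list[dict]:
        # hypothetical source pull
        return [{"order_id": 1, "amount": 42.0}]

    @task
    def transform(rows: list[dict]) -> float:
        # aggregate the extracted rows
        return sum(row["amount"] for row in rows)

    @task
    def load(total: float) -> None:
        # write the result somewhere durable
        print(f"daily total: {total}")

    # Chaining the calls is what defines the DAG's edges:
    # extract -> transform -> load
    load(transform(extract()))

daily_sales_pipeline()
```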

Data orchestration tools have evolved significantly over the past few decades, with Apache Airflow and Luigi emerging as the most widely adopted. It is still crucial to choose the right tool for the job, as each has its strengths and weaknesses. Like the conductor of an orchestra, a data orchestrator balances declarative and imperative approaches to provide both flexibility and adherence to software engineering best practices.

When selecting an orchestrator, consider factors such as scalability, code and configuration reusability, and the ability to handle complex logic and dependencies. The orchestrator should be able to scale vertically or horizontally, and it should keep the process of orchestrating data separate from the process of transforming it.
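
One way to read that separation: an orchestrator's task should submit work to an external engine and watch it, not crunch the data in its own process. The sketch below is a hypothetical submit-and-poll loop; the my_platform_client module and its functions are invented for illustration.

```python
import time

# Hypothetical client for an external execution engine (Spark, Kubernetes,
# a warehouse, etc.); submit_job and get_status are illustrative names only.
from my_platform_client import submit_job, get_status

def orchestrate_transform(job_spec: dict, poll_seconds: int = 30) -> None:
    """Submit the heavy transformation elsewhere and only track its status.

    Because the orchestrator does no data crunching itself, the scheduling
    layer and the compute layer can scale independently.
    """
    run_id = submit_job(job_spec)
    while True:
        status = get_status(run_id)
        if status == "SUCCEEDED":
            return
        if status == "FAILED":
            raise RuntimeError(f"job {run_id} failed")
        time.sleep(poll_seconds)
```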

Orchestration is about connections, and platforms like Azure, Amazon, and Google offer hosted versions of popular tools such as Airflow. Platform-embedded alternatives like Databricks Workflows provide more visibility into the tasks they orchestrate. Popular orchestrators have strong community support and continuous development, ensuring they remain up to date with the latest technologies and best practices. Support is also crucial for closed-source and paid solutions, where solutions engineers can help resolve issues.

Observability is essential for understanding transformation flows, so make sure your orchestrator supports multiple ways of alerting your team. To implement an orchestration solution, you can build one yourself, buy an off-the-shelf tool, self-host an open-source tool, or use a tool included with your cloud provider or data platform. Apache Airflow, originally developed at Airbnb, is a popular choice due to its ease of adoption, simple deployment, and ubiquity in the data space. It has flaws, however: it was engineered to orchestrate, not to transform or ingest.
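
As one example of wiring alerting into an orchestrator, Airflow lets you attach a failure callback to every task; the webhook URL below is a placeholder and the callback body is only a sketch.

```python
from datetime import datetime

import requests  # any HTTP client works; requests is used here for brevity
from airflow.decorators import dag, task

def notify_on_failure(context):
    """Post a short failure summary to a (hypothetical) chat webhook."""
    ti = context["task_instance"]
    requests.post(
        "https://hooks.example.com/data-alerts",  # placeholder URL
        json={"dag": ti.dag_id, "task": ti.task_id, "try": ti.try_number},
        timeout=10,
    )

@dag(
    schedule="@daily",
    start_date=datetime(2024, 1, 1),
    catchup=False,
    default_args={"retries": 1, "on_failure_callback": notify_on_failure},
)
def monitored_pipeline():
    @task
    def flaky_step():
        raise RuntimeError("simulated failure to exercise the alert path")

    flaky_step()

monitored_pipeline()
```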

Open-source tools like Airflow and Prefect are popular orchestrators that also offer paid, hosted services and support. Newer tools like Mage, Keboola, and Kestra are innovating as well. Open-source tools offer community support and the ability to modify the source code, but they depend on continued maintenance and carry the risk of project abandonment or instability. A tool's history, support, and stability must all be weighed when choosing a solution.

Much of modern data transformation happens inside relational databases, and tools like dbt, Delta Live Tables, Dataform, and SQLMesh act as orchestrators there: they evaluate dependencies, optimize, and execute commands against the database to produce the desired results. A limitation is that without a mechanism to observe data across the different layers, sources and cleaned data become disconnected, which makes it harder to trace errors in downstream data.
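
Under the hood, these tools boil down to resolving a dependency graph of models and running them in order. A toy version of that idea, using only the Python standard library (the model names and run_model body are hypothetical):

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Each model maps to the upstream models it selects from (hypothetical names).
dependencies = {
    "raw_orders": set(),
    "raw_customers": set(),
    "stg_orders": {"raw_orders"},
    "stg_customers": {"raw_customers"},
    "fct_orders": {"stg_orders", "stg_customers"},
}

def run_model(name: str) -> None:
    # A real tool would render and execute SQL against the warehouse here.
    print(f"materializing {name}")

# static_order() yields each model only after all of its upstreams.
for model in TopologicalSorter(dependencies).static_order():
    run_model(model)
```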

Design patterns can significantly enhance the efficiency, reliability, and maintainability of data orchestration. Some orchestration solutions make these patterns easier to apply, such as building backfill logic into pipelines, ensuring idempotence, and orchestrating in an event-driven way. These patterns help avoid one-off thinking and ensure consistent results. Choosing a platform-specific orchestrator can also provide greater visibility between and within data workflows, which is valuable for ETL workloads.
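
A minimal sketch of an idempotent, backfill-friendly load, assuming a warehouse that supports deleting and rewriting a date partition (sqlite3 stands in for the warehouse connection, and the table names are hypothetical):

```python
import sqlite3
from datetime import date, timedelta

def load_daily_sales(run_date: date, conn: sqlite3.Connection) -> None:
    """Delete the target partition, then rewrite it in one transaction.

    Re-running the same run_date (a retry or a backfill) converges to the
    same result instead of duplicating rows.
    """
    day = run_date.isoformat()
    with conn:  # commit both statements together, or roll back both
        conn.execute("DELETE FROM daily_sales WHERE sale_date = ?", (day,))
        conn.execute(
            """
            INSERT INTO daily_sales (sale_date, total)
            SELECT sale_date, SUM(amount)
            FROM raw_sales
            WHERE sale_date = ?
            GROUP BY sale_date
            """,
            (day,),
        )

def backfill(start: date, end: date, conn: sqlite3.Connection) -> None:
    """Backfilling is just the same idempotent load over historical dates."""
    current = start
    while current <= end:
        load_daily_sales(current, conn)
        current += timedelta(days=1)
```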

Orchestrators are complex and can be difficult to develop against locally because of how their triggers fire. To keep productivity up, invest in tools that allow fast feedback loops, quick error identification, and a developer-friendly local environment.

Retry and fallback logic are essential for handling failures in a complex data stack, preserving data integrity and system reliability. Idempotent pipelines make it safe to retry operations, or to skip them and alert the proper parties. Parameterized execution makes orchestration more malleable, allowing pipelines to cover multiple cases and be reused.

Lineage refers to the path data travels through its lifecycle, and a robust lineage solution is crucial for debugging issues and extending pipelines. Column-level lineage is becoming an industry norm in SQL orchestration, and platform-integrated solutions like Databricks Unity Catalog and Delta Live Tables offer advanced lineage capabilities. Pipeline decomposition breaks pipelines into smaller tasks for better monitoring, error handling, and scalability, and building autonomous DAGs can mitigate dependencies and critical failures, making workflows easier to build and debug.
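
Two of those patterns, retries with backoff and parameterized execution, fit in a short sketch; the extract_orders body and its parameters are hypothetical.

```python
import random
import time
from datetime import date

def with_retries(fn, *, attempts: int = 3, base_delay: float = 2.0):
    """Retry a flaky step with exponential backoff before surfacing the error."""
    def wrapper(*args, **kwargs):
        for attempt in range(1, attempts + 1):
            try:
                return fn(*args, **kwargs)
            except Exception:
                if attempt == attempts:
                    raise  # let the orchestrator mark the task failed and alert
                time.sleep(base_delay * 2 ** (attempt - 1) + random.random())
    return wrapper

@with_retries
def extract_orders(run_date: date, region: str) -> list[dict]:
    # Hypothetical extraction; parameterizing by date and region lets the same
    # pipeline serve scheduled runs, backfills, and new regions without copies.
    return []

# The orchestrator supplies the parameters, e.g. from a schedule or trigger payload.
extract_orders(date(2024, 10, 30), region="emea")
```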

The evolution of orchestration and transformation tooling, from containerized infrastructure to hybrid tools like Prefect and Dagster, may change the landscape for data teams. These tools can save time and resources while enabling better observability and monitoring within data warehouses. Emerging tools like SQLMesh may challenge established players like dbt, and plug-and-play solutions like Databricks Workflows are becoming more appealing. Together, these developments should help data teams deliver quality data in a timely and robust manner.

