Thursday, June 29, 2023

 Workflows with Airflow:

Apache Airflow is a platform for building and running workflows. A workflow is represented as a Directed Acyclic Graph (DAG) in which the nodes are tasks and the edges are the dependencies between them; the graph determines the order in which tasks run and how retries are handled. Tasks are self-describing. An Airflow deployment consists of a scheduler that triggers scheduled workflows and submits tasks to the executor, an executor that runs those tasks, a web server that provides a management interface, a folder of DAG definition files, and a metadata database that stores state. Airflow places few restrictions on what a task can be: an Operator, a predefined building block such as a Bash or Python task; a Sensor, a special kind of operator that does nothing but wait for an external event to happen; or a custom task written as a Python function decorated with @task.
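
A minimal sketch of one such DAG, assuming a recent Airflow 2.x release; the task names, schedule, and file path here are hypothetical:

from datetime import datetime

from airflow.decorators import dag, task
from airflow.operators.bash import BashOperator
from airflow.sensors.filesystem import FileSensor

# Hypothetical daily workflow wiring together an Operator, a Sensor,
# and a @task-decorated function; retries are set via default_args.
@dag(
    schedule="@daily",
    start_date=datetime(2023, 6, 1),
    catchup=False,
    default_args={"retries": 2},
)
def example_workflow():
    # Predefined Operator task: runs a shell command.
    extract = BashOperator(task_id="extract", bash_command="echo extracting")

    # Sensor task: does nothing but wait for an external event (a file appearing).
    wait_for_file = FileSensor(task_id="wait_for_file", filepath="/tmp/input.csv")

    # Custom task: a plain Python function decorated with @task.
    @task
    def transform():
        print("transforming")

    # Edges: run extract first, then the sensor, then the transform.
    extract >> wait_for_file >> transform()

example_workflow()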


Runs of a workflow occur repeatedly by processing the DAG on a schedule, and independent tasks within a run can execute in parallel. Edges can be modified by setting a task's upstream and downstream dependencies. Data can be passed between tasks using XCom, a cross-communication mechanism for exchanging small pieces of state; by uploading to and downloading from external storage; or via implicit exchanges, as when the TaskFlow API passes one task's return value into another.
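
A hedged sketch of passing data with XCom via the TaskFlow API, again assuming a recent Airflow 2.x release; produce, consume, and the payload are made up for illustration:

from datetime import datetime

from airflow.decorators import dag, task
from airflow.operators.empty import EmptyOperator

@dag(schedule=None, start_date=datetime(2023, 6, 1), catchup=False)
def xcom_example():
    @task
    def produce():
        # The return value is pushed to XCom automatically.
        return {"rows": 42}

    @task
    def consume(payload):
        # The argument is pulled from XCom behind the scenes.
        print(f"received {payload['rows']} rows")

    # Passing produce()'s result to consume() creates both the XCom
    # exchange and the dependency edge between the two tasks.
    handoff = consume(produce())

    # Edges can also be declared explicitly; >> sets the downstream
    # dependency, just like handoff.set_downstream(done).
    done = EmptyOperator(task_id="done")
    handoff >> done

xcom_example()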


Airflow sends tasks out to run on workers as capacity becomes available, so individual tasks may fail and be retried, but the workflow as a whole eventually completes. SubDAGs and TaskGroups are introduced for better manageability of large workflows.
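
A sketch of grouping tasks with a TaskGroup, under the same Airflow 2.x assumption and with hypothetical group and task names; the retries argument shows how a failing task can be retried until the run completes:

from datetime import datetime

from airflow.decorators import dag, task
from airflow.utils.task_group import TaskGroup

@dag(schedule=None, start_date=datetime(2023, 6, 1), catchup=False)
def grouped_workflow():
    @task
    def start():
        print("kicking off")

    # A TaskGroup bundles related tasks so the UI and dependency
    # wiring can treat them as a single unit.
    with TaskGroup(group_id="etl") as etl:
        @task
        def extract():
            return "raw rows"

        @task(retries=2)  # retried on failure before being marked failed
        def load(data):
            print(f"loading {data}")

        load(extract())

    start() >> etl

grouped_workflow()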


One of Airflow's defining characteristics is that it prioritizes the flow itself: there is no need to describe data inputs or outputs, and every aspect of the flow can be visualized, including pipeline dependencies, progress, logs, code, tasks, and success status.

Airflow is in use at over ten thousand organizations. Popular use cases include orchestrating batch ETL jobs; organizing, executing, and monitoring data flows; building ETL pipelines that extract batch data from hybrid sources and run Spark jobs; training machine learning models; generating automated reports; and performing backups and other DevOps tasks. It might not be ideal for streaming events, because batch and stream workloads call for different kinds of scheduling. Airflow also does not offer versioning of pipelines, so an external source control system becomes necessary for that. Airflow epitomizes pipelines as code, with artifacts described in Python for creating jobs, stitching jobs together, programming any other necessary data pipelines, and debugging and troubleshooting.
