Workflows with Airflow:
Apache Airflow is a platform used to build
and run workflows. A workflow is represented as a Directed Acyclic Graph (DAG) in which the
nodes are tasks and the edges are the dependencies between them. The graph determines
the order in which to run the tasks and how to retry them, while the tasks themselves describe what to do.
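For illustration, a minimal DAG might look like the sketch below (assuming a recent Airflow 2.x release, where the schedule argument replaced schedule_interval; the DAG id, schedule, and commands are placeholders):

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# Illustrative sketch: dag_id, schedule, and commands are placeholders.
with DAG(
    dag_id="example_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    extract = BashOperator(task_id="extract", bash_command="echo extract")
    load = BashOperator(task_id="load", bash_command="echo load")

    # The edge makes "load" depend on "extract".
    extract >> load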
An Airflow deployment consists of a scheduler that triggers scheduled workflows
and submits tasks to the executor, an executor that runs the tasks, a web
server that provides a management interface, a folder of DAG files, and a
metadata database that stores state. Workflows do not restrict what can be specified
as a task. A task can be an Operator, a predefined task such as one that runs a Bash
command or a Python callable; a Sensor, which does nothing but wait for an external
event to happen; or a custom task, a Python function decorated with @task.
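A rough sketch of the three kinds of task, again assuming Airflow 2.x; the task ids and the file path watched by the sensor are made up for illustration:

from datetime import datetime

from airflow import DAG
from airflow.decorators import task
from airflow.operators.bash import BashOperator
from airflow.sensors.filesystem import FileSensor

with DAG(dag_id="task_kinds", start_date=datetime(2024, 1, 1), schedule=None) as dag:
    # Operator: a predefined task, here running a shell command.
    report = BashOperator(task_id="report", bash_command="echo done")

    # Sensor: waits for an external event, here a file appearing on disk
    # (the path is a placeholder; FileSensor uses a filesystem connection).
    wait_for_file = FileSensor(task_id="wait_for_file", filepath="/tmp/input.csv")

    # Custom task: a plain Python function decorated with @task.
    @task
    def summarize():
        return "summary"

    wait_for_file >> summarize() >> report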
The tasks in a workflow can be run
repeatedly, since the DAG is processed on every run, and independent tasks can run in parallel. Edges can be
modified by setting a task's upstream and downstream dependencies. Data
can be passed between tasks using XComs, a cross-communication system for
exchanging small pieces of state; by uploading to and downloading from external storage; or via
implicit exchanges, as when the TaskFlow API passes one task's return value to another.
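The sketch below shows both the explicit and the implicit style, with made-up task ids and values:

from datetime import datetime

from airflow import DAG
from airflow.decorators import task
from airflow.operators.python import PythonOperator

def _push(ti):
    # Explicit XCom: push a value under a key for downstream tasks.
    ti.xcom_push(key="row_count", value=42)

def _pull(ti):
    # Explicit XCom: pull the value pushed by the upstream task.
    print(ti.xcom_pull(task_ids="push", key="row_count"))

with DAG(dag_id="xcom_example", start_date=datetime(2024, 1, 1), schedule=None) as dag:
    push = PythonOperator(task_id="push", python_callable=_push)
    pull = PythonOperator(task_id="pull", python_callable=_pull)

    # Edges can be set explicitly ...
    pull.set_upstream(push)  # equivalent to push >> pull

    # ... or implicitly: with the TaskFlow API, passing one task's return
    # value to another creates both the XCom and the edge.
    @task
    def produce():
        return 42

    @task
    def consume(n):
        print(n)

    consume(produce())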
Airflow sends tasks out to run on workers
as space becomes available, so individual tasks may fail or be retried, but the workflow will eventually complete.
For better manageability, Airflow also introduces the notions of sub-DAGs and TaskGroups, as sketched below.
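A minimal TaskGroup sketch (assuming Airflow 2.3+ for EmptyOperator; the group and task ids are placeholders):

from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator
from airflow.utils.task_group import TaskGroup

with DAG(dag_id="grouped_pipeline", start_date=datetime(2024, 1, 1), schedule=None) as dag:
    start = EmptyOperator(task_id="start")

    with TaskGroup(group_id="transform") as transform:
        clean = EmptyOperator(task_id="clean")
        enrich = EmptyOperator(task_id="enrich")
        clean >> enrich  # dependencies inside the group

    end = EmptyOperator(task_id="end")
    start >> transform >> end  # the group behaves like a single node in the UI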
One of the characteristics of Airflow is that
it prioritizes flow: there is no need to describe a task's data inputs or outputs,
and every aspect of the flow can be visualized, including pipeline
dependencies, progress, logs, code, tasks, and success status.