How to Control the Parallelism or Concurrency of an Airflow Installation

In Apache Airflow, you can control parallelism and concurrency by setting various configuration options in the airflow.cfg file (or through the equivalent environment variables).

Here are some key settings you should be aware of:

Setting 1: parallelism

The maximum number of task instances that can run concurrently across all active DAGs. This setting controls the overall parallelism of your Airflow installation.

parallelism = 32
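Any airflow.cfg option can also be set through an environment variable of the form AIRFLOW__{SECTION}__{KEY}, which is convenient in containerized deployments. For example, parallelism lives in the [core] section by default:

```shell
# Override the [core] parallelism option without editing airflow.cfg.
# Airflow reads AIRFLOW__<SECTION>__<KEY> environment variables at startup.
export AIRFLOW__CORE__PARALLELISM=32
```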

Setting 2: dag_concurrency

The maximum number of task instances that can run concurrently within a single DAG. This setting controls parallelism at the DAG level. Note that in Airflow 2.2 and later this option was renamed to max_active_tasks_per_dag.

dag_concurrency = 16

Setting 3: max_active_runs_per_dag

The maximum number of active DAG runs per DAG. This setting controls how many instances of a DAG can run concurrently.

max_active_runs_per_dag = 2
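Both per-DAG limits above can also be overridden for an individual DAG when you define it, instead of globally in airflow.cfg. A minimal sketch (the DAG id and schedule are placeholders; max_active_tasks is the Airflow 2.2+ name for the per-DAG task limit):

```python
from datetime import datetime

from airflow import DAG

# Per-DAG overrides of the global concurrency settings:
# - max_active_tasks caps concurrently running task instances in this DAG
# - max_active_runs caps concurrently active runs of this DAG
with DAG(
    dag_id="example_concurrency",  # placeholder name
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    max_active_tasks=16,
    max_active_runs=2,
) as dag:
    ...
```

These arguments take precedence over the corresponding airflow.cfg defaults for this DAG only.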

Setting 4: worker_concurrency

The number of concurrent tasks each worker process can pick up. This setting applies when you are using the CeleryExecutor; with the KubernetesExecutor, each task runs in its own pod, so this setting does not apply.

worker_concurrency = 16
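As with the other options, this can be set via an environment variable, which is the usual approach when Celery workers run in containers (worker_concurrency lives in the [celery] section in Airflow 2.x):

```shell
# Override worker_concurrency from the [celery] section of airflow.cfg.
export AIRFLOW__CELERY__WORKER_CONCURRENCY=16
```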

Setting 5: pool

Task-level concurrency can be controlled using pools. You can define pools with a specific number of available slots and assign tasks to those pools. Tasks in a pool will only run as many instances concurrently as there are slots.

To create a pool, navigate to the Airflow UI, click “Admin” in the top menu, and then “Pools”. Click “Create” to define a new pool with a specific number of slots.
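Pools can also be created from the command line, which is handy for scripted deployments. A sketch using the Airflow 2.x CLI (the pool name, slot count, and description are examples):

```shell
# Create (or update) a pool named my_pool with 4 slots.
airflow pools set my_pool 4 "Limits concurrent access to a shared resource"
```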

To assign a task to a pool, set the pool parameter when defining the task in your DAG:

my_task = BashOperator(
    task_id='my_task',
    bash_command='echo "Hello, World!"',
    pool='my_pool',  # the name of the pool created above
)
Adjust these settings based on your specific requirements and the resources available in your environment.

Remember that increasing concurrency and parallelism may also require more resources, such as CPU, memory, and network capacity.

Balancing these factors according to your infrastructure constraints and workload requirements is essential.

That’s it.
