In Apache Airflow, you can control parallelism or concurrency by setting various configuration options in the airflow.cfg file.
Here are some key settings you should be aware of:
Setting 1: parallelism
The maximum number of task instances that can run concurrently across all active DAGs (in Airflow 2.x, the limit applies per scheduler). This setting caps the overall parallelism of your Airflow installation.
[core]
parallelism = 32
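Setting 2: dag_concurrency
The maximum number of task instances that can run concurrently per DAG. This setting controls the parallelism at the DAG level. In Airflow 2.2 and later the option is named max_active_tasks_per_dag, and dag_concurrency is the deprecated older name.
[core]
dag_concurrency = 16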
Setting 3: max_active_runs_per_dag
The maximum number of active DAG runs per DAG. This setting controls how many instances of a DAG can run concurrently.
[core]
max_active_runs_per_dag = 2
Setting 4: worker_concurrency
The number of task instances that each Celery worker can run concurrently. This setting applies only to the CeleryExecutor. The KubernetesExecutor launches a separate pod for each task and does not use it.
[celery]
worker_concurrency = 16
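To check which of these values a given configuration file actually sets, you can parse it with Python's standard configparser module. The following is a minimal sketch, not part of Airflow itself: SAMPLE_CFG is a hypothetical excerpt that mirrors the example values above, read_concurrency_settings is an illustrative helper name, and the per-DAG task limit uses its Airflow 2.2+ name, max_active_tasks_per_dag (dag_concurrency in older versions).

```python
import configparser

# Hypothetical airflow.cfg excerpt; the values mirror the examples above.
SAMPLE_CFG = """
[core]
parallelism = 32
max_active_tasks_per_dag = 16
max_active_runs_per_dag = 2

[celery]
worker_concurrency = 16
"""

# Options to look for, grouped by the section they live in.
WANTED = {
    "core": ["parallelism", "max_active_tasks_per_dag", "max_active_runs_per_dag"],
    "celery": ["worker_concurrency"],
}


def read_concurrency_settings(cfg_text):
    """Return the concurrency options found in an airflow.cfg-style string."""
    # interpolation=None keeps literal '%' characters in a real file from raising errors.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(cfg_text)
    settings = {}
    for section, keys in WANTED.items():
        for key in keys:
            # has_option returns False when the section or the key is missing.
            if parser.has_option(section, key):
                settings[f"{section}.{key}"] = parser.getint(section, key)
    return settings


settings = read_concurrency_settings(SAMPLE_CFG)
for name, value in settings.items():
    print(f"{name} = {value}")
```

For a real installation you would pass the contents of your own airflow.cfg instead of SAMPLE_CFG. Options that are absent from the file are simply skipped, which means Airflow falls back to its defaults for them.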
Setting 5: pool
Task-level concurrency can be controlled using pools. You can define pools with a specific number of available slots and assign tasks to those pools. At any moment, a pool runs at most as many task instances as it has slots, and the remaining tasks wait in the queue until a slot frees up.
To create a pool, navigate to the Airflow UI, click “Admin” in the top menu, and then “Pools”. Click “Create” to define a new pool with a specific number of slots.
To assign a task to a pool, set the pool parameter when defining the task in your DAG:
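from airflow.operators.bash import BashOperator  # Airflow 2.x import path

my_task = BashOperator(
    task_id='my_task',
    bash_command='echo "Hello, World!"',
    pool='my_custom_pool',
    dag=dag,  # assumes a DAG object named dag was defined earlier
)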
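To see what a slot limit means in practice, here is a small stand-alone simulation. It does not use Airflow at all: a semaphore stands in for the pool, threads stand in for task instances, and each task is assumed to occupy exactly one slot. The function name run_in_pool and its parameters are illustrative only.

```python
import threading
import time


def run_in_pool(num_tasks, slots):
    """Run num_tasks dummy tasks, letting at most `slots` of them run at once.

    Returns the highest number of tasks observed running at the same time.
    """
    pool = threading.BoundedSemaphore(slots)  # stands in for the pool's slots
    lock = threading.Lock()                   # protects the two counters below
    running = 0
    peak = 0

    def task():
        nonlocal running, peak
        with pool:  # a task holds one slot for as long as it runs
            with lock:
                running += 1
                peak = max(peak, running)
            time.sleep(0.01)  # simulated work
            with lock:
                running -= 1

    threads = [threading.Thread(target=task) for _ in range(num_tasks)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return peak


peak = run_in_pool(num_tasks=10, slots=3)
print(f"peak concurrency with 3 slots: {peak}")
```

All ten tasks finish, but the observed peak never exceeds the number of slots. This is the behavior a pool gives you for the tasks assigned to it, independent of the installation-wide and per-DAG limits.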
Adjust these settings based on your specific requirements and the resources available in your environment.
Remember that increasing concurrency and parallelism may also require more resources, such as CPU, memory, and network capacity.
Balancing these factors according to your infrastructure constraints and workload requirements is essential.
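When tuning, it helps to remember that these limits stack: a task instance only starts if every applicable limit still has room, so the tightest one wins. The sketch below is a deliberately simplified model of that idea, not how the Airflow scheduler is implemented. The function and parameter names are hypothetical, and it ignores other DAGs competing for the same capacity, task priorities, and tasks that take more than one pool slot.

```python
def max_running_tasks_for_dag(parallelism, max_active_tasks_per_dag,
                              pool_slots=None, total_worker_slots=None):
    """Upper bound on task instances of one DAG that can run at the same time.

    parallelism              -- installation-wide limit ([core] parallelism)
    max_active_tasks_per_dag -- per-DAG limit (dag_concurrency before Airflow 2.2)
    pool_slots               -- slots in the pool, if every task uses one pool
    total_worker_slots       -- number of workers times worker_concurrency (Celery)
    """
    limits = [parallelism, max_active_tasks_per_dag]
    if pool_slots is not None:
        limits.append(pool_slots)
    if total_worker_slots is not None:
        limits.append(total_worker_slots)
    # The smallest limit is the effective bottleneck.
    return min(limits)


# Values taken from the examples in this article.
print(max_running_tasks_for_dag(32, 16))                         # 16
print(max_running_tasks_for_dag(32, 16, pool_slots=4))           # 4
print(max_running_tasks_for_dag(32, 16, total_worker_slots=8))   # 8
```

If raising one setting has no visible effect, a calculation like this points to the limit that is actually holding tasks back.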
Amit Doshi is a Cloud Engineer with more than 5 years of experience in AWS, Azure, and Google Cloud. He is an IT professional responsible for designing, implementing, managing, and maintaining cloud computing infrastructure, applications, and services.