We have been experiencing severe dagrun starvation on our cluster: when there were many queued dagruns and a low max_active_runs limit (hundreds to thousands of runs with a limit in the tens), many dags got stuck in the queued state without moving to running, causing those dagruns to time out.
After investigating, we found the cause in the _start_queued_dagruns method: the query was returning dagruns that cannot be set to running due to the max_active_runs limit, so other dagruns were starved.
A similar issue occurs when new dagruns are created in large batches (due to the nulls-first ordering), but that is out of scope for this PR; I will submit an additional PR soon.
For significant user-facing changes create newsfragment: {pr_number}.significant.rst, in airflow-core/newsfragments. You can add this file in a follow-up commit after the PR is created so you know the PR number.
I'm struggling to understand how the problem has been solved from reading the code. Can you explain your solution for preventing the starvation?
Sure — instead of querying the first N runs and then filtering on max_active_runs (as we do now), we query the first N runs with the max_active_runs check applied in SQL (before the LIMIT), so we skip runs that cannot be scheduled anyway.
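To make the filter-before-limit vs. limit-before-filter difference concrete, here is a minimal Python sketch (not Airflow's actual SQLAlchemy query; `select_runs_old`/`select_runs_new` and the default of 16 are illustrative assumptions):

```python
# Sketch of the two strategies. Each queued run is represented by its dag_id.
from collections import Counter

def select_runs_old(queued, max_active, limit, default_max=16):
    """Old behavior: apply LIMIT first, then check max_active_runs."""
    batch = queued[:limit]                  # LIMIT applied before the check
    started, running = [], Counter()
    for dag_id in batch:
        if running[dag_id] < max_active.get(dag_id, default_max):
            started.append(dag_id)
            running[dag_id] += 1
    return started

def select_runs_new(queued, max_active, limit, default_max=16):
    """New behavior: check max_active_runs in the query, then apply LIMIT."""
    started, running = [], Counter()
    for dag_id in queued:
        if len(started) == limit:           # LIMIT applied after the check
            break
        if running[dag_id] < max_active.get(dag_id, default_max):
            started.append(dag_id)
            running[dag_id] += 1
    return started

queued = ["a"] * 5 + ["b"] * 3              # 5 queued runs of dag a, 3 of dag b
limits = {"a": 3}                           # dag a: max_active_runs = 3

print(select_runs_old(queued, limits, 5))   # ['a', 'a', 'a']  -- b starves
print(select_runs_new(queued, limits, 5))   # ['a', 'a', 'a', 'b', 'b']
```

With the old ordering, the two unstartable `a` runs consume batch slots and dag `b` never gets a turn; with the new ordering, those slots go to `b`.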
Assume two dags, a and b:
a - max_active_runs = 3
b - no limit (defaults to 16 from the config)
Now suppose the query result looked like this, where each row is a queued run: a lowercase letter means the run cannot be started (its dag has already hit max_active_runs), an uppercase letter means it can, and the `-` marks the per-loop limit of 5 (all runs above the `-` are selected, all others are ignored):
A
A
A
a
a
-
B
B
B
Here (as of now) the last 3 dagruns are omitted and ignored, even though the two lowercase a runs selected in their place can never start, starving runs from b.
After the change it will look like so:
A
A
A
B
B
-
B
Now everything we query does get scheduled: dagruns from a no longer limit us (the limit is now purely the max-dagruns-to-schedule-per-loop configuration), and every run the query returns is guaranteed to be able to run.
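Tying this back to doing the check "in SQL": ranking queued runs per dag, the way a `ROW_NUMBER() OVER (PARTITION BY dag_id)` window function would, and keeping only rows whose rank fits under max_active_runs reproduces the A/B example above. A rough Python analogue (hypothetical names, not the actual query):

```python
from collections import defaultdict

def rank_then_limit(queued, max_active, limit, default_max=16):
    """Keep a queued run only if its per-dag rank fits under
    max_active_runs (the window-function-style check), and only
    then apply the per-loop LIMIT: filter before limit."""
    rank = defaultdict(int)
    eligible = []
    for dag_id in queued:
        rank[dag_id] += 1   # ROW_NUMBER() OVER (PARTITION BY dag_id)
        if rank[dag_id] <= max_active.get(dag_id, default_max):
            eligible.append(dag_id)
    return eligible[:limit]                 # LIMIT comes last

# The example above: 5 queued runs of dag a (max_active_runs = 3),
# 3 queued runs of dag b, per-loop limit of 5.
print(rank_then_limit(["a"] * 5 + ["b"] * 3, {"a": 3}, limit=5))
# -> ['a', 'a', 'a', 'b', 'b']
```

Because the unstartable `a` rows are dropped before the limit is applied, every selected row is startable, which is the guarantee described above.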
Hope this explains it. If anything is unclear, feel free to let me know and I will write a better explanation.
closes #49508