We have been experiencing severe dagrun starvation on our cluster: when there were many queued dagruns and a low max_active_runs limit (hundreds to thousands of runs with a limit in the tens), many dags got stuck in the queued state without moving to running, causing those dagruns to time out.
After investigating, we found the cause in the _start_queued_dagruns method: the query was returning dagruns that cannot be set to running due to the max_active_runs limit, so other dagruns were starved.
A similar issue occurs when new dagruns are created in large batches (due to the nulls-first ordering), but that is out of scope for this PR; I will submit an additional PR soon.
For significant user-facing changes create newsfragment: {pr_number}.significant.rst, in airflow-core/newsfragments. You can add this file in a follow-up commit after the PR is created so you know the PR number.
I'm struggling to understand how the problem has been solved from reading the code. Can you explain your solution for preventing the starvation?
Sure — instead of querying the first N runs and then filtering on max_active_runs (as we do now), we query the first N runs with the max_active_runs check applied in SQL (before the LIMIT), so we skip runs that cannot be scheduled anyway.
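To make the filter-before-limit vs. limit-before-filter difference concrete, here is a minimal Python sketch (not Airflow's actual SQLAlchemy query; `select_runs_old`/`select_runs_new` and the default of 16 are illustrative assumptions):

```python
# Sketch of the two strategies. Each queued run is represented by its dag_id.
from collections import Counter

def select_runs_old(queued, max_active, limit, default_max=16):
    """Old behavior: apply LIMIT first, then check max_active_runs."""
    batch = queued[:limit]                  # LIMIT applied before the check
    started, running = [], Counter()
    for dag_id in batch:
        if running[dag_id] < max_active.get(dag_id, default_max):
            started.append(dag_id)
            running[dag_id] += 1
    return started

def select_runs_new(queued, max_active, limit, default_max=16):
    """New behavior: check max_active_runs in the query, then apply LIMIT."""
    started, running = [], Counter()
    for dag_id in queued:
        if len(started) == limit:           # LIMIT applied after the check
            break
        if running[dag_id] < max_active.get(dag_id, default_max):
            started.append(dag_id)
            running[dag_id] += 1
    return started

queued = ["a"] * 5 + ["b"] * 3              # 5 queued runs of dag a, 3 of dag b
limits = {"a": 3}                           # dag a: max_active_runs = 3

print(select_runs_old(queued, limits, 5))   # ['a', 'a', 'a']  -- b starves
print(select_runs_new(queued, limits, 5))   # ['a', 'a', 'a', 'b', 'b']
```

With the old ordering, the two unstartable `a` runs consume batch slots and dag `b` never gets a turn; with the new ordering, those slots go to `b`.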
Assume two dags, a and b:
a - max_active_runs = 3
b - no limit (defaults to 16 from the config)
Now suppose the query result looked like this, where each row is a queued run: a lowercase letter means the run cannot be started (its dag has already hit max_active_runs), an uppercase letter means it can, and the `-` marks the per-loop limit of 5 (all runs above the `-` are selected, all others are ignored):
A
A
A
a
a
-
B
B
B
Here (as of now) the last 3 dagruns are omitted and ignored, even though the two lowercase a runs selected in their place can never start, starving runs from b.
After the change it will look like so:
A
A
A
B
B
-
B
Now everything we query does get scheduled: dagruns from a no longer limit us (the limit is now purely the max-dagruns-to-schedule-per-loop configuration), and every run the query returns is guaranteed to be able to run.
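Tying this back to doing the check "in SQL": ranking queued runs per dag, the way a `ROW_NUMBER() OVER (PARTITION BY dag_id)` window function would, and keeping only rows whose rank fits under max_active_runs reproduces the A/B example above. A rough Python analogue (hypothetical names, not the actual query):

```python
from collections import defaultdict

def rank_then_limit(queued, max_active, limit, default_max=16):
    """Keep a queued run only if its per-dag rank fits under
    max_active_runs (the window-function-style check), and only
    then apply the per-loop LIMIT: filter before limit."""
    rank = defaultdict(int)
    eligible = []
    for dag_id in queued:
        rank[dag_id] += 1   # ROW_NUMBER() OVER (PARTITION BY dag_id)
        if rank[dag_id] <= max_active.get(dag_id, default_max):
            eligible.append(dag_id)
    return eligible[:limit]                 # LIMIT comes last

# The example above: 5 queued runs of dag a (max_active_runs = 3),
# 3 queued runs of dag b, per-loop limit of 5.
print(rank_then_limit(["a"] * 5 + ["b"] * 3, {"a": 3}, limit=5))
# -> ['a', 'a', 'a', 'b', 'b']
```

Because the unstartable `a` rows are dropped before the limit is applied, every selected row is startable, which is the guarantee described above.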
Hope this explains it. If anything is unclear, feel free to let me know and I will write a better explanation.
closes #49508