Conversation
In Jan-Feb 2026: NuttX CI hit a [record high usage of GitHub Runners](apache#17914), exceeding the limit enforced by the ASF Infrastructure Team. We analysed the PRs and discovered that most GitHub Runners were wasted on __(1) Failure to Download the Build Dependencies__ for DTC Device Tree, OpenAMP Messaging, MicroADB Debugger, MCUBoot Bootloader, NimBLE Bluetooth, etc, and __(2) Resubmitting PR Commits__:

- [Video: Analysing the Most Expensive PR](https://youtu.be/swFaxaTCEQg)
- [Video: Second Most Expensive PR](https://youtu.be/uSpQkzBogEw)
- [Video: Third Most Expensive PR](https://youtu.be/J7w1gyjwZ1w)
- [Video: Most Expensive Apps PR](https://youtu.be/182h8cRpfvI)
- [Spreadsheet: Most Expensive PRs](https://docs.google.com/spreadsheets/d/1HY7fIZzd_fs3QPyA0TX7vsYOjL86m1fNOf1Wls93luI/edit?gid=70515654#gid=70515654)

Why would __Download Failures__ waste GitHub Runners? That's because Download Failures will terminate the Entire CI Build (across All CI Jobs), requiring a restart of the CI Build. And the CI Build isn't terminated immediately upon failure: NuttX CI waits for the CI Job to complete (e.g. `arm-01`), before terminating the CI Build. Which means that CI Builds can get terminated 2.5 hours into the CI Build, wasting 2.5 elapsed hours x [7.4 parallel processes](https://lupyuen.org/articles/ci3#live-metric-for-full-time-runners) of GitHub Runners.

This PR proposes to __Retry the Build for Each CI Target__. NuttX CI shall rebuild each CI Target (e.g. `sim:nsh`), upon failure, up to 3 times (total 4 builds). Each rebuild will be attempted after a Randomised Delay with Exponential Backoff, initially set to 60 seconds, then 120 seconds, then 240 seconds. The rebuilds will mitigate the effects of Intermittent Download Failures that occur in GitHub Actions. (And eliminate developer frustration)

If the build fails after 3 retries: Subsequent CI Targets will __not be allowed to rebuild__ upon failure. This is to prevent cascading build failures from overloading GitHub Actions and consuming too many GitHub Runners.

Note that NuttX CI shall retry the build for __Any Kind of Build Failure__, including Download Failures, Compile Errors and Config Errors. We designed it simplistically due to our current constraints: (1) Lack of CI Expertise, (2) NuttX CI is Mission Critical, (3) Legacy CI Scripts are Highly Complex. To prevent Compile Errors and Config Errors: We expect NuttX Devs to [Build and Test PRs in Our Own Repos](apache#18568), before submitting to NuttX.

What about __Resubmitting PR Commits__ and its wastage of GitHub Runners? We also require NuttX Devs to [Build and Test PRs in Our Own Repos](apache#18568), before resubmitting to NuttX. GitHub Runners will then be charged to the developer's quota, without affecting the GitHub Runners quota for the Apache NuttX Project. We plan to [Kill All CI Jobs](https://youtu.be/182h8cRpfvI?si=MmAuwLISZPPMoqDq&t=1479) for PRs that have been switched to Draft Mode. We'll monitor this through the [NuttX Build Monitor](apache#18659).

Modified Files:

- `tools/testbuild.sh`: We introduce a New Wrapper Function `retrytest` that will call the Existing Function `dotest`, to build the CI Target and retry on error.
- `Documentation/components/tools/testbuild.rst`: Updated the `testbuild.sh` doc with the Retry Logic.

Signed-off-by: Lup Yuen Lee <[email protected]>
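The retry wrapper described above might look roughly like this. This is a hedged sketch, NOT the PR's actual code: the real `retrytest` lives in `tools/testbuild.sh` and calls the existing `dotest`; here `dotest` is assumed to exist, and the `retry_enabled` flag name is illustrative:

```sh
# Sketch of a retry wrapper in the spirit of retrytest (illustrative only).
retry_enabled=1   # cleared after any target exhausts its retries

retrytest() {
  local target="$1"
  local backoff=60 attempt
  for attempt in 1 2 3 4; do
    if dotest "$target"; then
      return 0   # build succeeded
    fi
    # Once one target has failed all retries, later targets fail fast
    if [ "$retry_enabled" -eq 0 ] || [ "$attempt" -eq 4 ]; then
      break
    fi
    # Randomised delay with exponential backoff: up to 60s, 120s, 240s
    sleep "$(( RANDOM % backoff + 1 ))"
    backoff=$(( backoff * 2 ))
  done
  retry_enabled=0
  return 1
}
```

The `retry_enabled` flag is what makes the first failing target expensive (4 attempts) while every subsequent failing target fails fast (1 attempt).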
@lupyuen Thank you so much!
hartmannathan left a comment:
In retrytest, backoff begins at 60 and is doubled for each retry, but actual delay can be any value from 1 up to backoff, every time. This means the second delay might be shorter than the first. In the PR description, this occurs in one of the runs: the second delay was 2 seconds, rather than something longer than 60 seconds. If the intention is to ensure increasing delay lengths with each retry, this could be accomplished with logic like:
```sh
local backoff=30  # half of initial minimum delay
...
delay=$(( (RANDOM % backoff) + backoff ))
backoff=$((backoff * 2))
```
First delay will be from 30 to 59 seconds.
Second delay will be from 60 to 119 seconds.
Third delay will be from 120 to 239 seconds.
Fourth delay will be from 240 to 479 seconds.
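The suggested scheme can be checked quickly with a self-contained loop (illustrative, not the PR's code): each delay is uniform in [backoff, 2*backoff - 1], so consecutive ranges never overlap and the delays are guaranteed to grow.

```sh
# Demonstrate the monotonically increasing randomised backoff
backoff=30
for attempt in 1 2 3 4; do
  delay=$(( (RANDOM % backoff) + backoff ))
  echo "attempt $attempt: delay ${delay}s (range ${backoff}-$(( backoff * 2 - 1 ))s)"
  backoff=$(( backoff * 2 ))
done
```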
I am still disappointed that GitHub doesn't give us a way (at least an obvious way) to avoid re-downloading these artifacts over and over every time we run a CI test. See, for example, this discussion on HackerNews about the GMP library blocking GitHub because repeated CI downloads were overloading their servers: https://news.ycombinator.com/item?id=36380325
Thanks @hartmannathan for the Minimum Backoff suggestion! I'll monitor the builds over the next few days to watch for any Download Errors and decide whether we need to set a Minimum Backoff. Our situation is a little delicate right now: imposing a Minimum Backoff might mean that Compile Errors will take longer to complete the retries, holding up our devs. I'll also watch out for CI Test Errors, which will have longer retries.
@lupyuen Sure, we can try it in its current form for now. Meanwhile I had an idea that might be worth exploring.

In this idea, we would create a special repository on GitHub; suggested name: nuttx-ci-deps, meaning NuttX CI Dependencies. In that repository, we would place a copy of every dependency we are currently downloading from third parties. This would be used for CI builds only, not by normal developers or users.

How would it work? Our GitHub CI scripts could pass a special command line argument to the NuttX build scripts (make or cmake). In the NuttX build scripts, the special argument would cause the download logic to get files from nuttx-ci-deps instead of downloading from third parties.

The rationale behind this idea is that downloading from third parties introduces a failure mode that is separate from GitHub. It could be, for example, that some third parties limit downloads originating from GitHub for the same reasons as GMP: too many downloads. That might cause some of our download failures. Downloading from third parties again and again for every build is also unfair to them. Furthermore, I would assume that GitHub probably has a more efficient and reliable path to get data from itself than from third parties.

One more idea: Putting the retry logic in the GitHub CI scripts has the (known) disadvantage of retrying builds that fail due to compiler errors. In those cases we want to stop the build. We might accomplish that in the following way: A script could be implemented at tools/download.sh. This script would encapsulate all download-related logic for the NuttX build system. It would know how to run curl, wget, fetch, and git, to retrieve and validate files when called for by the config. This script could also encapsulate the retry logic. Finally, this script could check the special argument and override the URL to get the files from nuttx-ci-deps instead of the normal download location. Thoughts?

Just to clarify, yes, we should try this PR as-is and see if it improves our GHA usage. If we aren't satisfied, the ideas above are additional avenues for exploration...
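The proposed tools/download.sh could be sketched as below. This is a hypothetical sketch under assumptions, NOT an implementation from this PR: the `download_dep` function name and the `NUTTX_CI_DEPS_URL` variable are invented for illustration, and only the curl path is shown:

```sh
# Hypothetical sketch of the proposed download helper (assumed names)
download_dep() {
  local url="$1" out="$2"
  if [ -n "${NUTTX_CI_DEPS_URL:-}" ]; then
    # In CI: rewrite the third-party URL to the nuttx-ci-deps mirror,
    # keeping only the file name
    url="$NUTTX_CI_DEPS_URL/$(basename "$url")"
  fi
  # curl's built-in retry provides exponential backoff per download
  curl --location --fail --silent --retry 3 --output "$out" "$url"
}
```

A CI workflow would set `NUTTX_CI_DEPS_URL` to point at the mirror; developers would leave it unset and fetch from the original third-party URLs, matching the "special argument for CI only" idea above.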
@hartmannathan Awesome ideas! I was pondering your ideas with @simbit18, and we hit a couple of roadblocks:
Thanks @lupyuen @simbit18! I will try to answer:
Interesting, I hadn't considered auto-refresh. My thoughts were that the nuttx-ci-deps repo would be populated with specific versions of the packages, which we would consider "blessed" versions for CI testing. When a NuttX release happens, we could document which versions of dependencies have been used in testing. Developers who wish to use more bleeding edge versions would of course have the option to do that. Perhaps after each release, we could evaluate the versions of dependencies that are available and decide whether to update the nuttx-ci-deps repo. This means there is a human in the loop who can verify that the package is legit and so on.
I'm not familiar with ESP32 so I'm unsure how to answer this. Could you tell me a bit more about how it gets downloaded?
In this case, maybe there should be a short (less than 10 seconds maybe) random delay before each download starts? That might cause the requests to be more serialized and perhaps avoid or reduce the failures.
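The random pre-download delay suggested above could be as small as this. An illustrative assumption only, not part of the PR; `jitter` is an invented helper name, and the bound is configurable so the default matches the "less than 10 seconds" suggestion:

```sh
# Sleep a random 0..(max-1) seconds before a download, so parallel CI
# jobs don't all hit the server at the same instant (hypothetical helper)
jitter() {
  local max="${1:-10}"     # default: sleep 0-9 seconds
  local d=$(( RANDOM % max ))
  echo "$d"                # report the chosen delay
  sleep "$d"
}
```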
This was exactly my thought, but for all downloaded deps. Our cached libyuv lib could move to a nuttx-ci-deps repo if we go this route.
Thanks @hartmannathan! Sorry for messing up this complex discussion; maybe we should move it to the Project TODOs. I think to roll out any meaningful updates to NuttX CI (like Build Dependency Caching) we need to...

Summary
In Jan-Feb 2026: NuttX CI hit a record high usage of GitHub Runners, exceeding the limit enforced by the ASF Infrastructure Team. We analysed the PRs and discovered that most GitHub Runners were wasted on (1) Failure to Download the Build Dependencies for DTC Device Tree, OpenAMP Messaging, MicroADB Debugger, MCUBoot Bootloader, NimBLE Bluetooth, etc, and (2) Resubmitting PR Commits.
Why would Download Failures waste GitHub Runners? That's because Download Failures will terminate the Entire CI Build (across All CI Jobs), requiring a restart of the CI Build. And the CI Build isn't terminated immediately upon failure: NuttX CI waits for the CI Job to complete (e.g. `arm-01`), before terminating the CI Build. Which means that CI Builds can get terminated 2.5 hours into the CI Build, wasting 2.5 elapsed hours x 7.4 parallel processes of GitHub Runners.

This PR proposes to Retry the Build for Each CI Target. NuttX CI shall rebuild each CI Target (e.g. `sim:nsh`), upon failure, up to 3 times (total 4 builds). Each rebuild will be attempted after a Randomised Delay with Exponential Backoff, initially set to 60 seconds, then 120 seconds, then 240 seconds. The rebuilds will mitigate the effects of Intermittent Download Failures that occur in GitHub Actions. (And eliminate developer frustration)
If the build fails after 3 retries: Subsequent CI Targets will not be allowed to rebuild upon failure. This is to prevent cascading build failures from overloading GitHub Actions, and consuming too many GitHub Runners.
Note that NuttX CI shall retry the build for Any Kind of Build Failure, including Download Failures, Compile Errors and Config Errors. We designed it simplistically due to our current constraints: (1) Lack of CI Expertise (2) NuttX CI is Mission Critical (3) Legacy CI Scripts are Highly Complex (explained below). To prevent Compile Errors and Config Errors: We expect NuttX Devs to Build and Test PRs in Our Own Repos, before submitting to NuttX.
What about Resubmitting PR Commits and its wastage of GitHub Runners? We also require NuttX Devs to Build and Test PRs in Our Own Repos, before resubmitting to NuttX. GitHub Runners will then be charged to the developer's quota, without affecting the GitHub Runners quota for Apache NuttX Project. We plan to Kill All CI Jobs for PRs that have been switched to Draft Mode. We'll monitor this through the NuttX Build Monitor.
Modified Files
`tools/testbuild.sh`: We introduce a New Wrapper Function `retrytest` that will call the Existing Function `dotest`, to build the CI Target and retry on error.

`Documentation/components/tools/testbuild.rst`: Updated the `testbuild.sh` doc with the Retry Logic.

Impact
NuttX CI shall retry the build for Any Kind of Build Failure, including Download Failures, Compile Errors and Config Errors. We designed it simplistically due to our current constraints:
Lack of CI Expertise: We have a tiny team of 2 part-time volunteers, managing everything in NuttX CI. We cannot afford to roll out and maintain Complex CI Solutions. That's why we are not able to detect the kind of Build Failure and handle it intelligently: Download Failure vs Compile Error vs Config Error
NuttX CI is Mission Critical: NuttX CI must run continuously 24 x 7, especially during Peak Periods (Jan-Feb) and NuttX Release Windows. NuttX CI must never fail e.g. due to the Retry Logic.
Legacy CI Scripts are Highly Complex: We have inherited many Legacy CI Scripts that we don't completely understand. Therefore we won't patch Individual CI Scripts; instead, we introduce a New Wrapper Function `retrytest` that will call the Existing Function `dotest`, to build the Legacy CI Target and retry on error.

To prevent Compile Errors and Config Errors: We expect NuttX Devs to Build and Test PRs in Our Own Repos, before submitting to NuttX.
To minimise the Retry Delay for Compile Errors and Config Errors: Subsequent CI Targets will not be allowed to rebuild upon failure. This is to prevent cascading build failures from consuming too many GitHub Runners. (Also avoid wasting our developer's time)
With this simplistic solution, we hope to minimise any Retry Delays while eliminating Developer Frustration for failed downloads:
Is new feature added? YES: CI Builds will now retry upon failure
Impact on build / user? YES. CI Builds should no longer be terminated due to Download Failures. But CI Builds will be slightly slower (approx 5.6 minutes, see below) if there are Compile Errors and Config Errors, due to the Retry Logic. We expect NuttX Devs to Build and Test PRs in Our Own Repos, before submitting to NuttX.
Impact on hardware / compatibility / security? NO. We are reusing all Legacy CI Scripts, without changes.
Impact on documentation? YES. We have updated the `testbuild.sh` doc with the Retry Logic.

Testing
Why do we Retry 3 Times? We tested the Retry Logic over the Past 4 Weeks, across 50 Builds in NuttX Production CI. Our Retry Logic successfully mitigated the following Download Failures, with a maximum of 3 retries (total 4 attempts). That's why we decided to Retry 3 Times:
- `qemu-armv8a:netnsh_smp` OK after 4 attempts: OpenAMP libmetal download retry
- `esp32s3-devkit:eth_lan9250` OK after 3 attempts: Xtensa codeload.github download retry
- `icicle:rpmsg-ch2` OK after 2 attempts: OpenAMP libmetal download retry
- `sim:matter` OK after 2 attempts: LLVM libcxx download retry
- `sim:matter` OK after 2 attempts: NestLabs nlunit-test download retry
- `nrf52832-dk:sdc` OK after 2 attempts: nRFConnect sdk-nrfxlib download retry

(We exclude `sim:nxcamera`, which has been fixed)
For Compile Errors: Subsequent CI Targets will not be allowed to rebuild upon failure. To test this, we simulate a Compile Error: https://github.com/lupyuen13/nuttx/actions/runs/24430763763/job/71374671998#step:10:273
Note that Max Attempts has been reduced to 1 (instead of 4). We see that Subsequent CI Targets will not be allowed to rebuild upon failure: https://github.com/lupyuen13/nuttx/actions/runs/24430763763/job/71374671998#step:10:516
From above: We see that the Retry Delay is 5.6 minutes. Which means that our developers shall wait roughly 5.6 minutes for the First Compile Error to complete all retries. Subsequent Compile Errors will not incur any Retry Delay:
For Config Errors: Subsequent CI Targets will not be allowed to rebuild upon failure. To test this, we simulate a Config Error: https://github.com/lupyuen13/nuttx/actions/runs/24430804675/job/71374813812#step:10:272
Note that Max Attempts will be reduced to 1 (instead of 4). We see that Subsequent CI Targets will not be allowed to rebuild upon failure: https://github.com/lupyuen13/nuttx/actions/runs/24430804675/job/71374813812#step:10:469
Could this Download Failure be a problem with GitHub Actions? Shouldn't we escalate to GitHub?
Outside GitHub Actions: We see the same Download Failures happening in our NuttX Build Farm (see below), which runs on a Home PC. Thus it's not a problem specific to GitHub Actions. We should not assume that Dependency Downloads are perfect, we should always retry.
NuttX Build Farm: `qemu-armv8a:netnsh_smp`: argtable3 download failed

NuttX Build Farm: `sim:lua`: lua download failed

NuttX CI uses a Docker Container, in GitHub Actions and in the NuttX Build Farm. Maybe our Docker Image isn't configured correctly for networking?

That's possible. However our team has no expertise to troubleshoot Docker Networking.
Isn't it easier to fix the curl commands in our Build Scripts to retry?
There are at least 73 curl commands in our Build Scripts. We will require significant effort to change and test all 73 curl commands. FYI curl also supports Exponential Backoff (not randomised though).
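For reference, curl's built-in retry works as sketched below. The wrapper name `fetch_with_retry` and the URL are placeholders, not anything from the Build Scripts; the flag behaviour (exponential backoff of 1s, 2s, 4s, ... when `--retry-delay` is not given, and no randomisation) is documented curl behaviour:

```sh
# Hypothetical wrapper showing curl's retry flags:
#   --retry 3        : up to 3 retries (4 attempts total), exponential backoff
#   --retry-max-time : give up after 600 seconds of total retrying
#   --fail           : treat HTTP errors (e.g. 404) as failures
fetch_with_retry() {
  curl --location --fail --silent \
       --retry 3 --retry-max-time 600 \
       --output "$2" "$1"
}
```

Usage would be `fetch_with_retry https://example.org/dep.tar.gz dep.tar.gz`, but as noted above, rolling this out means changing and testing all 73 curl commands.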
More videos on CI Build Retry: