
CI: Retry build upon failure #18576

Merged
simbit18 merged 1 commit into apache:master from lupyuen13:retry-build
Apr 15, 2026

Conversation

@lupyuen (Member) commented Mar 22, 2026

Summary

In Jan-Feb 2026: NuttX CI hit a [record high usage of GitHub Runners](apache#17914), exceeding the limit enforced by the ASF Infrastructure Team. We analysed the PRs and discovered that most GitHub Runners were wasted on (1) Failure to Download the Build Dependencies for DTC Device Tree, OpenAMP Messaging, MicroADB Debugger, MCUBoot Bootloader, NimBLE Bluetooth, etc. and (2) Resubmitting PR Commits:

- [Video: Analysing the Most Expensive PR](https://youtu.be/swFaxaTCEQg)
- [Video: Second Most Expensive PR](https://youtu.be/uSpQkzBogEw)
- [Video: Third Most Expensive PR](https://youtu.be/J7w1gyjwZ1w)
- [Video: Most Expensive Apps PR](https://youtu.be/182h8cRpfvI)
- [Spreadsheet: Most Expensive PRs](https://docs.google.com/spreadsheets/d/1HY7fIZzd_fs3QPyA0TX7vsYOjL86m1fNOf1Wls93luI/edit?gid=70515654#gid=70515654)

Why would Download Failures waste GitHub Runners? Because a Download Failure terminates the Entire CI Build (across All CI Jobs), requiring a restart of the CI Build. And the CI Build isn't terminated immediately upon failure: NuttX CI waits for the current CI Job (e.g. arm-01) to complete before terminating the CI Build. This means a CI Build can get terminated 2.5 hours in, wasting 2.5 elapsed hours x [7.4 parallel processes](https://lupyuen.org/articles/ci3#live-metric-for-full-time-runners) of GitHub Runners, roughly 18.5 runner-hours per terminated build.

This PR proposes to Retry the Build for Each CI Target. NuttX CI shall rebuild each CI Target (e.g. `sim:nsh`) upon failure, up to 3 times (4 builds in total). Each rebuild is attempted after a Randomised Delay with Exponential Backoff: the delay ceiling starts at 60 seconds, then doubles to 120 and 240 seconds. The rebuilds mitigate the Intermittent Download Failures that occur in GitHub Actions (and eliminate developer frustration).
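The retry loop described above can be sketched in shell as follows. This is an illustrative sketch only, not the exact code merged into `tools/testbuild.sh`; here `dotest` is stubbed as an always-failing build so every attempt is exercised, and the `sleep` is commented out:

```shell
#!/bin/bash
# Sketch of a randomised-exponential-backoff retry wrapper.
# Illustrative only: the real retrytest in tools/testbuild.sh may differ.

dotest() {
  # Stub standing in for the existing build function; always fails here.
  return 1
}

retrytest() {
  local max_attempts=4   # 1 initial build + up to 3 retries
  local backoff=60       # delay ceiling doubles each retry: 60, 120, 240
  local attempt delay
  for attempt in $(seq 1 "$max_attempts"); do
    echo "Build Attempt $attempt of $max_attempts"
    dotest "$@" && return 0
    if [ "$attempt" -lt "$max_attempts" ]; then
      delay=$(( (RANDOM % backoff) + 1 ))   # randomised: 1..backoff
      echo "Wait $delay seconds ($backoff backoff)"
      # sleep "$delay"   # omitted so the sketch runs instantly
      backoff=$(( backoff * 2 ))
    fi
  done
  return 1
}
```

Because the delay is uniform over 1..backoff, a later wait can be shorter than an earlier one (e.g. the 2-second wait at 120 backoff in the test logs); the review discussion covers a variant that enforces a minimum delay.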

If the build fails after 3 retries: Subsequent CI Targets will not be allowed to rebuild upon failure. This is to prevent cascading build failures from overloading GitHub Actions, and consuming too many GitHub Runners.

Note that NuttX CI shall retry the build for Any Kind of Build Failure, including Download Failures, Compile Errors and Config Errors. We kept the design deliberately simple due to our current constraints: (1) Lack of CI Expertise (2) NuttX CI is Mission Critical (3) Legacy CI Scripts are Highly Complex (explained below). To prevent Compile Errors and Config Errors: We expect NuttX Devs to [Build and Test PRs in Our Own Repos](apache#18568), before submitting to NuttX.

What about Resubmitting PR Commits and its wastage of GitHub Runners? We also require NuttX Devs to [Build and Test PRs in Our Own Repos](apache#18568), before resubmitting to NuttX. GitHub Runners will then be charged to the developer's quota, without affecting the GitHub Runners quota for the Apache NuttX Project. We plan to [Kill All CI Jobs](https://youtu.be/182h8cRpfvI?si=MmAuwLISZPPMoqDq&t=1479) for PRs that have been switched to Draft Mode. We'll monitor this through the [NuttX Build Monitor](apache#18659).

Modified Files

`tools/testbuild.sh`: We introduce a New Wrapper Function `retrytest` that will call the Existing Function `dotest`, to build the CI Target and retry on error.

`Documentation/components/tools/testbuild.rst`: Updated the `testbuild.sh` doc with the Retry Logic.


Impact

NuttX CI shall retry the build for Any Kind of Build Failure, including Download Failures, Compile Errors and Config Errors. We kept the design deliberately simple due to our current constraints:

  1. Lack of CI Expertise: We have a tiny team of 2 part-time volunteers, managing everything in NuttX CI. We cannot afford to roll out and maintain Complex CI Solutions. That's why we cannot detect the kind of Build Failure and handle it intelligently: Download Failure vs Compile Error vs Config Error.

  2. NuttX CI is Mission Critical: NuttX CI must run continuously 24 x 7, especially during Peak Periods (Jan-Feb) and NuttX Release Windows. NuttX CI must never fail, e.g. due to a bug in the Retry Logic.

  3. Legacy CI Scripts are Highly Complex: We have inherited many Legacy CI Scripts that we don't completely understand. Therefore we won't patch Individual CI Scripts; instead, we introduce a New Wrapper Function `retrytest` that will call the Existing Function `dotest`, to build the Legacy CI Target and retry on error.

  4. To prevent Compile Errors and Config Errors: We expect NuttX Devs to [Build and Test PRs in Our Own Repos](apache#18568), before submitting to NuttX.

  5. To minimise the Retry Delay for Compile Errors and Config Errors: Subsequent CI Targets will not be allowed to rebuild upon failure. This prevents cascading build failures from consuming too many GitHub Runners (and avoids wasting our developers' time).
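Points 3 and 5 can be combined in a small wrapper. A hypothetical sketch (not the exact merged code) of how exhausting the retries on one target disables retries for every later target, with `dotest` stubbed as always failing:

```shell
#!/bin/bash
# Illustrative sketch: after one target exhausts its retries,
# every subsequent target builds exactly once.

dotest() { return 1; }   # stub for the existing build function

max_attempts=4   # deliberately global, shared across CI targets

retrytest() {
  local target="$1" attempt
  for attempt in $(seq 1 "$max_attempts"); do
    echo "Build Attempt $attempt of $max_attempts: $target"
    dotest "$target" && return 0
    # (randomised backoff sleep elided)
  done
  max_attempts=1   # retries exhausted: later targets get one attempt
  return 1
}
```

With the failing stub, the first target logs attempts 1 of 4 through 4 of 4, and every later target logs only "Build Attempt 1 of 1", matching the simulated Compile Error logs below.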

With this simplistic solution, we hope to minimise any Retry Delays while eliminating Developer Frustration for failed downloads:

  • Is new feature added? YES: CI Builds will now retry upon failure

  • Impact on build / user? YES. CI Builds should no longer be terminated due to Download Failures. But CI Builds will be slightly slower (approx 5.6 minutes, see below) if there are Compile Errors and Config Errors, due to the Retry Logic. We expect NuttX Devs to Build and Test PRs in Our Own Repos, before submitting to NuttX.

  • Impact on hardware / compatibility / security? NO. We are reusing all Legacy CI Scripts, without changes.

  • Impact on documentation? YES. We have updated the testbuild.sh doc with the Retry Logic.

Testing

Why do we Retry 3 Times? We tested the Retry Logic over the Past 4 Weeks, across 50 Builds in NuttX Production CI. Our Retry Logic successfully mitigated the following Download Failures, with a maximum of 3 retries (total 4 attempts). That's why we decided to Retry 3 Times:

Build Attempt 1 of N
Configuration/Tool: qemu-armv8a/netnsh_smp
Error: cannot find zipfile directory in one of libmetal.zip
...
Wait 58 seconds (60 backoff)
Build Attempt 2 of N
Configuration/Tool: qemu-armv8a/netnsh_smp
(Same download error)
...
Wait 2 seconds (120 backoff)
Build Attempt 3 of N
Configuration/Tool: qemu-armv8a/netnsh_smp
(Same download error)
...
Wait 103 seconds (240 backoff)
Build Attempt 4 of N
Configuration/Tool: qemu-armv8a/netnsh_smp
(Successful at 4th attempt yay!)
  1. qemu-armv8a:netnsh_smp OK after 4 attempts: OpenAMP libmetal download retry

  2. esp32s3-devkit:eth_lan9250 OK after 3 attempts: Xtensa codeload.github download retry

  3. icicle:rpmsg-ch2 OK after 2 attempts: OpenAMP libmetal download retry

  4. sim:matter OK after 2 attempts: LLVM libcxx download retry

  5. sim:matter OK after 2 attempts: NestLabs nlunit-test download retry

  6. nrf52832-dk:sdc OK after 2 attempts: nRFConnect sdk-nrfxlib download retry

    (We exclude sim:nxcamera, which has been fixed)


For Compile Errors: Subsequent CI Targets will not be allowed to rebuild upon failure. To test this, we simulate a Compile Error: https://github.com/lupyuen13/nuttx/actions/runs/24430763763/job/71374671998#step:10:273

Build Attempt 1 of 4 @ 2026-04-15 01:09:04
Error: 'SIMULATED_COMPILE_ERROR' undeclared
...
Wait 38 seconds (60 backoff)
Build Attempt 2 of 4 @ 2026-04-15 01:10:35
(Same compile error)
...
Wait 75 seconds (120 backoff)
Build Attempt 3 of 4 @ 2026-04-15 01:12:18
(Same compile error)
...
Wait 183 seconds (240 backoff)
Build Attempt 4 of 4 @ 2026-04-15 01:15:49
(Same compile error)
...
(Next CI Target)
Build Attempt 1 of 1 @ 2026-04-15 01:16:16

Note that Max Attempts has been reduced to 1 (instead of 4). We see that Subsequent CI Targets will not be allowed to rebuild upon failure: https://github.com/lupyuen13/nuttx/actions/runs/24430763763/job/71374671998#step:10:516

Build Attempt 1 of 1: pinephone/lvgl
Error: 'SIMULATED_COMPILE_ERROR' undeclared
...
Build Attempt 1 of 1: pinephone/sensor
(Same compile error)

From the logs above, the Retry Delay is about 5.6 minutes, which means our developers will wait roughly 5.6 minutes for the First Compile Error to complete all retries. Subsequent Compile Errors will not incur any Retry Delay:

(First CI Target)
Build Attempt 1 of 4 @ 2026-04-15 01:09:04
Build Attempt 2 of 4 @ 2026-04-15 01:10:35
...
(Next CI Target)
Build Attempt 1 of 1 @ 2026-04-15 01:16:16

For Config Errors: Subsequent CI Targets will not be allowed to rebuild upon failure. To test this, we simulate a Config Error: https://github.com/lupyuen13/nuttx/actions/runs/24430804675/job/71374813812#step:10:272

Build Attempt 1 of 4: pinephone/nsh
  [1/1] Normalize pinephone/nsh
8d7
< CONFIG_SECOND_SIMULATED_CONFIG_ERROR=0
...
Wait 24 seconds (60 backoff)
Build Attempt 2 of 4
(Same config error)
...
Wait 65 seconds (120 backoff)
Build Attempt 3 of 4
(Same config error)
...
Wait 105 seconds (240 backoff)
Build Attempt 4 of 4
(Same config error)

Note that Max Attempts will be reduced to 1 (instead of 4). We see that Subsequent CI Targets will not be allowed to rebuild upon failure: https://github.com/lupyuen13/nuttx/actions/runs/24430804675/job/71374813812#step:10:469

Build Attempt 1 of 1: pinephone/lcd
...
Build Attempt 1 of 1: pinephone/lvgl

Could this Download Failure be a problem with GitHub Actions? Shouldn't we escalate to GitHub?

Outside GitHub Actions: We see the same Download Failures happening in our NuttX Build Farm (see below), which runs on a Home PC. Thus it's not a problem specific to GitHub Actions. We should not assume that Dependency Downloads are perfect; we should always retry.

NuttX CI uses a Docker Container, in GitHub Actions and in NuttX Build Farm. Maybe our Docker Image isn't configured correctly for networking?

That's possible. However, our team has no expertise to troubleshoot Docker Networking.

Isn't it easier to fix the curl commands in our Build Scripts to retry?

There are at least 73 curl commands in our Build Scripts. It would require significant effort to change and test all 73 curl commands. FYI curl also supports Exponential Backoff (though not randomised).
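For reference, per-command retry with curl would look roughly like this. curl's built-in `--retry` uses exponential backoff (1 s, 2 s, 4 s, ... by default) but does not randomise the delay. The `file://` URL below is only so the example runs offline:

```shell
# Create a local "dependency" so the example works without network access.
printf 'payload\n' > /tmp/fake-dep.bin

# --retry 3: up to 3 retries on transient errors;
# --retry-max-time 600: cap total time spent retrying at 10 minutes.
# (curl >= 7.71.0 also offers --retry-all-errors to retry on any error.)
curl --silent --show-error \
     --retry 3 --retry-max-time 600 \
     --output /tmp/fake-dep.copy file:///tmp/fake-dep.bin
```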

More videos on CI Build Retry:

@github-actions github-actions bot added labels "Area: Build system" and "Size: S" (the size of the change in this PR is small) on Mar 22, 2026
@lupyuen lupyuen force-pushed the retry-build branch 7 times, most recently from 104046f to 0495cb7 on March 23, 2026 10:05
@lupyuen lupyuen force-pushed the retry-build branch 2 times, most recently from 3dc30d7 to ff538e7 on March 28, 2026 06:32
@lupyuen lupyuen force-pushed the retry-build branch 15 times, most recently from bab08ba to 98e4e87 on April 5, 2026 14:22
@lupyuen lupyuen force-pushed the retry-build branch 2 times, most recently from 9a07ad3 to 8a22d33 on April 15, 2026 04:45
@lupyuen lupyuen changed the title from "[Do Not Merge] Testing of CI Build Retry" to "CI: Retry build upon failure" on Apr 15, 2026
@lupyuen lupyuen marked this pull request as ready for review April 15, 2026 07:43
@lupyuen lupyuen linked an issue on Apr 15, 2026 that may be closed by this pull request
@lupyuen lupyuen requested a review from simbit18 April 15, 2026 09:40
@simbit18 (Contributor) commented:

@lupyuen Thank you so much!

@hartmannathan (Contributor) left a comment:


In `retrytest`, `backoff` begins at 60 and is doubled for each retry, but the actual delay can be any value from 1 up to `backoff`, every time. This means the second delay might be shorter than the first. In the PR description, this occurs in one of the runs: the second delay was 2 seconds, rather than something longer than 60 seconds. If the intention is to ensure increasing delay lengths with each retry, this could be accomplished with logic like:

local backoff=30 # half of initial minimum delay
...
delay=$(( (RANDOM % backoff) + backoff ))
backoff=$(( backoff * 2 ))

First delay will be from 30 to 59 seconds.
Second delay will be from 60 to 119 seconds.
Third delay will be from 120 to 239 seconds.
Fourth delay will be from 240 to 479 seconds.
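Rendered as a runnable loop (the loop and echo are added here for illustration; this is the reviewer's suggestion, not the merged code), the delay ranges come out as listed above:

```shell
#!/bin/bash
backoff=30   # half of initial minimum delay
for attempt in 1 2 3 4; do
  # Delay is uniform over [backoff, 2*backoff - 1], so it never shrinks
  # between retries.
  delay=$(( (RANDOM % backoff) + backoff ))
  echo "attempt $attempt: delay $delay (range $backoff-$(( backoff * 2 - 1 )))"
  backoff=$(( backoff * 2 ))
done
```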

I am still disappointed that GitHub doesn't give us a way (at least an obvious way) to avoid re-downloading these artifacts over and over every time we run a CI test. See, for example, this discussion on HackerNews about the GMP library blocking GitHub because repeated CI downloads were overloading their servers: https://news.ycombinator.com/item?id=36380325

@lupyuen (Member, Author) commented Apr 15, 2026

Thanks @hartmannathan for the Minimum Backoff suggestion! I'll monitor the builds over the next few days to watch for any Download Errors, and decide whether we need to set a Minimum Backoff. Our situation is a little delicate right now: Imposing a Minimum Backoff might mean that Compile Errors will take longer to complete the retries, holding up our devs. I'll also watch out for CI Test Errors, which will have longer retries.

@hartmannathan (Contributor) commented:

@lupyuen Sure, we can try it in its current form for now.

Meanwhile I had an idea that might be worth exploring:

In this idea, we would create a special repository on GitHub; suggested name: nuttx-ci-deps, meaning NuttX CI Dependencies. In that repository, we would place a copy of every dependency we are currently downloading from third parties. This would be used for CI builds only, not by normal developers or users.

How would it work? Our GitHub CI scripts could pass a special command line argument to the NuttX build scripts (make or cmake).

In the NuttX build scripts, the special argument would cause the download logic to get files from nuttx-ci-deps instead of downloading from third parties.

The rationale behind this idea is that downloading from third parties introduces a failure mode that is separate from GitHub. It could be, for example, that some third parties limit downloads originating from GitHub for the same reasons as GMP: too many downloads. That might cause some of our download failures. Downloading from third parties again and again for every build is also unfair to them. Furthermore, I would assume that GitHub probably has a more efficient and reliable path to get data from itself than from third parties.

One more idea:

Putting the retry logic in the GitHub CI scripts has the (known) disadvantage of retrying builds that fail due to compiler errors. In those cases we want to stop the build. We might accomplish that in the following way:

A script could be implemented at tools/download.sh. This script would encapsulate all download-related logic for the NuttX build system. It would know how to run curl, wget, fetch, and git, to retrieve and validate files when called for by the config. This script could also encapsulate the retry logic. Finally, this script could check the special argument and override the URL to get the files from nuttx-ci-deps instead of the normal download location.
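A sketch of what such a `tools/download.sh` dispatcher could look like. Everything here is hypothetical: the `nuttx-ci-deps` repo is only the suggestion above, and the URL layout and the `USE_CI_DEPS` switch are invented for illustration:

```shell
#!/bin/bash
# Hypothetical sketch of the proposed tools/download.sh.
# Nothing here exists in NuttX; names and layout are invented.

CI_DEPS_BASE="https://github.com/apache/nuttx-ci-deps/raw/main"  # assumed layout

download() {
  local url="$1" dest="$2"
  if [ "$USE_CI_DEPS" = "1" ]; then
    # CI builds: redirect the third-party URL to the blessed cache,
    # keyed here by the file's basename.
    url="$CI_DEPS_BASE/$(basename "$url")"
  fi
  echo "fetch: $url -> $dest"
  # curl --retry 3 --location --output "$dest" "$url"   # real fetch elided
}
```

Retry and validation logic would then live inside `download`, so the many call sites stay simple.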

Thoughts?

Just to clarify, yes, we should try this PR as-is and see if it improves our GHA usage. If we aren't satisfied, the ideas above are additional avenues for exploration...

@lupyuen (Member, Author) commented Apr 15, 2026

@hartmannathan Awesome ideas! I was pondering your ideas with @simbit18, we hit a couple of roadblocks:

  1. How will we auto-refresh the Cached Dependencies in nuttx-ci-deps? To make sure they are always up-to-date? Who will watch over nuttx-ci-deps, and fix it if any downloads fail?

  2. ESP32 Downloads might be tricky, I don't see any obvious way to hook the ESP32 Builds to our Download Script. Also: I think ESP32 Downloads might need to be refreshed more often.

  3. Nearly all Download Failures are for URLs hosted at github.com. So we don't have many problems with downloads from External Third Parties. I suspect github.com is blocking our downloads because we run too many CI Jobs in parallel, all spamming github.com at the same time. (Hence I implemented Random Exponential Backoff)

  4. FYI chromium.googlesource.com was blocking our download of the libyuv library, so we cached it at the NuttX Mirror Repo. libyuv hasn't been updated since 2021, so we probably won't need to worry about refreshing it.

@hartmannathan (Contributor) commented:

> @hartmannathan Awesome ideas! I was pondering your ideas with @simbit18, we hit a couple of roadblocks:

Thanks @lupyuen @simbit18 ! I will try to answer:

>   1. How will we auto-refresh the Cached Dependencies in nuttx-ci-deps? To make sure they are always up-to-date? Who will watch over nuttx-ci-deps, and fix it if any downloads fail?

Interesting, I hadn't considered auto-refresh. My thoughts were that the nuttx-ci-deps repo would be populated with specific versions of the packages, that we would consider "blessed" versions for CI testing. When a NuttX release happens, we could document which versions of dependencies have been used in testing. Developers who wish to use more bleeding edge versions would of course have the option to do that. Perhaps after each release, we could evaluate the versions of dependencies that are available and decide whether to update the nuttx-ci-deps repo. This means there is a human in the loop who can verify that package is legit and so on.

>   2. ESP32 Downloads might be tricky, I don't see any obvious way to hook the ESP32 Builds to our Download Script. Also: I think ESP32 Downloads might need to be refreshed more often.

I'm not familiar with ESP32 so I'm unsure how to answer this. Could you tell me a bit more about how it gets downloaded?

>   3. Nearly all Download Failures are for URLs hosted at github.com. So we don't have much problems with downloads from External Third Parties. I suspect github.com is blocking our downloading because we run too many CI Jobs in parallel, all spamming github.com at the same time. (Hence I implemented Random Exponential Backoff)

In this case, maybe there should be a short random delay (less than 10 seconds, say) before each download starts? That might cause the requests to be more serialized and perhaps avoid or reduce the failures.
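Sketched, that suggestion is just a short random sleep ahead of each fetch (illustrative only; `do_download` is a placeholder, and the `sleep` is commented out):

```shell
# Hypothetical pre-download jitter of 0-9 seconds, to de-synchronise
# parallel CI jobs that would otherwise hit the same server at once.
jitter=$(( RANDOM % 10 ))
echo "jitter: $jitter seconds"
# sleep "$jitter"       # omitted so the sketch runs instantly
# do_download "$URL"    # placeholder for the actual fetch
```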

>   4. FYI chromium.googlesource.com was blocking our download of libyuv library, so we cached it at the NuttX Mirror Repo. libyuv hasn't been updated since 2021, so we probably won't need to worry about refreshing:

This was exactly my thought, but for all downloaded deps. Our cached libyuv lib could move to a nuttx-ci-deps repo if we go this route.

@lupyuen (Member, Author) commented Apr 16, 2026

Thanks @hartmannathan! Sorry for messing up this complex discussion, maybe we should move it to the Project TODOs. I think to roll out any meaningful updates to NuttX CI (like Build Dependency Caching) we need to...

  1. Identify the NuttX CI Lead who will implement, monitor and maintain these CI Enhancements

  2. Sorry I can't be CI Lead: I suffer from hypertension and I can't commit to long-term projects. I believe @simbit18 might be busy with commitments (we should ask). So we might need to hunt for a suitable CI Lead.

  3. To assist the CI Lead: We should nominate Arch Owners (i.e. Owner for everything Arm32 / Arm64 / RISC-V / Xtensa / ...) and Board Representatives (i.e. Representative for STM32 Boards / RP2040 Boards / IMX Boards / ...)

  4. So it's easier for NuttX CI Lead to consult the Arch Owner on any CI Design Decisions. And Arch Owner can talk to the Board Reps, on behalf of CI Lead, as explained here. Can't find the ESP32 Downloads? We ask the Xtensa Owner :-)

  5. As for Non-ESP32 Downloads: NuttX has 73 curl commands for downloading Build Dependencies. We could rename the 73 curl commands to nuttx-curl.sh, which will read from our Dependency Cache. And ask Arch Owners + Board Reps to help test all 73 nuttx-curl.sh commands.

  6. Remember: nuttx-curl.sh needs to support all kinds of Local PCs for development: macOS, WSL, Ubuntu, FreeBSD, etc. Maybe on Local PCs: We simply point nuttx-curl.sh to curl?

  7. Will we have an Operations Person (paid, preferably) to monitor NuttX CI daily? Like checking the Daily Builds? Making sure that Build Dependencies are cached correctly and updated often (e.g. due to security patches)? This person will affect our Upcoming CI Redesign. That's why I implemented CI Build Retry in the simplest way, that requires Zero Maintenance, Zero Monitoring.

  8. Are we planning to host the Build Dependency Cache at GitHub? This might fail though, because we already have problems downloading from github.com in our NuttX CI Scripts.
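Point 5's `nuttx-curl.sh` could be as thin as this hypothetical wrapper, which also covers point 6 (on Local PCs it simply delegates to plain curl); the cache lookup is left as a stub:

```shell
#!/bin/bash
# Hypothetical nuttx-curl.sh: plain curl on local PCs (macOS, WSL,
# Ubuntu, FreeBSD, ...); in CI it would consult the Dependency Cache.

nuttx_curl() {
  if [ "$NUTTX_CI" = "1" ]; then
    echo "cache fetch: $*"   # stub: would read from the Dependency Cache
  else
    curl "$@"                # local development: unchanged behaviour
  fi
}
```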



Labels

Area: Build system · Size: S (The size of the change in this PR is small)

Successfully merging this pull request may close these issues:

- [FEATURE] Retry for CI Builds
- [Warning] High Utilisation of GitHub Runners

4 participants