feat(recovery): [CON-1604] allow overriding which state height to download #9707
pierugo-dfinity wants to merge 40 commits into `master`.
Conversation
Pull request overview
Refactors the subnet recovery test harness and ic-recovery tooling to allow explicitly selecting which checkpoint height to download (targeting the latest CUP height), reducing flakes caused by manifest/CUP timing races during state download.
Changes:
- Add `--download-state-height` support to recovery flows (App subnet, NNS same-nodes, NNS failover-nodes) and thread it through download/copy-local steps.
- Extend node metrics parsing to include CUP height and last computed manifest height; update node selection logic in tests to use these heights.
- Update system tests to pass an explicit `download_state_height` derived from CUP height and refactor parameter selection into a helper.
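Putting the pieces above together, the extended metrics presumably look roughly like this. A sketch only: `cert_share` and `catch_up_package_height` are named later in this thread and the manifest height is described in the summary, but the exact field names and types in `rs/recovery/src/lib.rs` may differ.

```rust
// Hypothetical sketch of the extended node metrics; the real struct in
// ic-recovery may name or type these fields differently.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct NodeMetrics {
    pub finalization_height: u64,
    /// Certification share height (used when picking a download node).
    pub cert_share: u64,
    /// Height of the latest CUP the node holds (new in this PR).
    pub catch_up_package_height: u64,
    /// Height of the last state manifest the node computed (new in this PR).
    pub last_computed_manifest_height: u64,
}

/// A node is safe to download state from only once its manifest has
/// caught up with its CUP height — the timing race this PR deflakes.
pub fn manifest_caught_up(m: &NodeMetrics) -> bool {
    m.last_computed_manifest_height >= m.catch_up_package_height
}
```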
Reviewed changes
Copilot reviewed 15 out of 16 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| rs/tests/testnets/nns_recovery.rs | Switches to updated node-height selection helper and struct return type. |
| rs/tests/nested/nns_recovery/common.rs | Uses CUP height to set download_state_height for NNS recovery runs. |
| rs/tests/consensus/subnet_recovery/utils.rs | Introduces NodeHeights + selects node based on cert-share/CUP heights; threads download_state_height into local CLI args. |
| rs/tests/consensus/subnet_recovery/sr_nns_failover_nodes_test.rs | Passes download_state_height into NNS failover recovery args. |
| rs/tests/consensus/subnet_recovery/common.rs | Refactors app-subnet recovery parameter selection; ensures state download node has manifest >= CUP; passes download_state_height. |
| rs/recovery/subnet_splitting/src/subnet_splitting.rs | Updates get_ic_state_includes call signature. |
| rs/recovery/src/steps.rs | Makes local state copy step accept precomputed include paths; uses MaybeRemote for latest checkpoint lookup remotely. |
| rs/recovery/src/recovery_state.rs | Updates tests to account for new args field and simplifies expectations. |
| rs/recovery/src/nns_recovery_same_nodes.rs | Adds CLI arg + interactive prompt for download_state_height; threads it into download/copy-local steps. |
| rs/recovery/src/nns_recovery_failover_nodes.rs | Adds CLI arg + interactive prompt for download_state_height; threads it into download step. |
| rs/recovery/src/lib.rs | Implements download_height support in get_ic_state_includes; adds MaybeRemote; extends NodeMetrics; adds unit test. |
| rs/recovery/src/cli.rs | Updates height-selection guidance message to mention CUP height. |
| rs/recovery/src/app_subnet_recovery.rs | Adds CLI arg + interactive prompt for download_state_height; threads it into download/copy-local steps. |
| rs/recovery/Cargo.toml | Adds assert_matches dev-dependency for new unit test assertions. |
| rs/recovery/BUILD.bazel | Adds Bazel dev-dependency for assert_matches. |
| Cargo.lock | Locks assert_matches into ic-recovery dependency set. |
Pull request overview
Copilot reviewed 16 out of 17 changed files in this pull request and generated 1 comment.
kpop-dfinity left a comment:
This will make @basvandijk very happy
```rust
// We could pick a node with highest finalization and CUP height automatically,
// but we might have a preference between nodes of same heights.
```
IIRC we don't print CUP heights in `print_height_info`. Should we start doing so?
We print the debug representation of `NodeMetrics`, which has a `catch_up_package_height` field with this PR :)
```diff
-pub fn get_ic_state_includes(ssh_helper: Option<&SshHelper>) -> RecoveryResult<Vec<PathBuf>> {
+pub fn get_ic_state_includes(
+    maybe_remote: MaybeRemote<'_>,
+    download_height: Option<u64>,
```
I'm probably overengineering but maybe we could use an enum here?

```rust
enum CheckpointHeight {
    Latest,
    Specific(u64), // I don't know what a good name for this variant would be
}
```

Otherwise, without reading the documentation, it's not immediately clear what `None` means.
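To illustrate why the enum reads better at call sites, here is a minimal sketch. The `checkpoint_name` helper is hypothetical; only the `{:016x}` checkpoint-directory naming scheme comes from the code under review.

```rust
// Sketch of the suggested enum; `checkpoint_name` is a hypothetical
// helper, not the real ic-recovery API.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum CheckpointHeight {
    /// Use the most recent checkpoint found on the node.
    Latest,
    /// Use the checkpoint at exactly this height.
    Specific(u64),
}

pub fn checkpoint_name(which: CheckpointHeight, latest_on_disk: u64) -> String {
    let height = match which {
        CheckpointHeight::Latest => latest_on_disk,
        CheckpointHeight::Specific(h) => h,
    };
    // Checkpoint directories are named by height, zero-padded to 16 hex digits.
    format!("{height:016x}")
}
```

At a call site, `CheckpointHeight::Latest` is self-describing, whereas passing `None` forces the reader to look up what the default behaviour is.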
I don't dislike it. I was also using an `Option<&SshHelper>` before switching to `MaybeRemote`, which I think helps. WDYT of 8c9b674?
```rust
Self::get_maybe_latest_checkpoint_name_remotely(ssh_helper, &ic_checkpoints_path)?
// …
let checkpoint_name = if let Some(height) = download_height {
    let name = format!("{:016x}", height);
```
```diff
-    let name = format!("{:016x}", height);
+    let name = format!("{height:016x}");
```
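This suggestion is purely stylistic: Rust 2021's inlined format arguments produce byte-identical output to the positional form, as a quick check shows (function names here are illustrative):

```rust
/// Both spellings render a height as a 16-digit, zero-padded,
/// lowercase hex string; only the style differs.
pub fn name_positional(height: u64) -> String {
    format!("{:016x}", height)
}

pub fn name_inline(height: u64) -> String {
    format!("{height:016x}")
}
```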
```diff
-pub fn get_ic_state_includes(ssh_helper: Option<&SshHelper>) -> RecoveryResult<Vec<PathBuf>> {
+pub fn get_ic_state_includes(
+    maybe_remote: MaybeRemote<'_>,
+    download_height: Option<u64>,
```
```diff
-    download_height: Option<u64>,
+    checkpoint_height_to_download: Option<u64>,
```
```rust
// No checkpoints, return an empty list of includes. This is not an error, as the
// subnet could have stalled in its first DKG interval.
```
Should we at least log a warning?
```rust
    pub cert_share: u64,
}
// …
/// Select a node with highest certification share height in the given subnet snapshot
```
This does not necessarily hold any more: there could be a node with a cert share higher than everyone else's that hasn't yet produced the latest CUP.
Is that an issue?
Yeah I have thought about that and I think you're right :( I do not think that should lead to a lot of flakiness. The way I understand it:
- Except if the test driver is very fast*, all remaining healthy nodes after halting/breaking the subnet should have the same CUP height (i.e. P2P still runs in both cases).
- Out of these, it should be fine to select the node with highest certification share height.
*it could happen that:
- A node aggregates a CUP
- Subnet halts/breaks
- The test driver scrapes all metrics
- The node gossips the CUP
In that case, only 1 (or a few) nodes would have the latest CUP and, indeed, may not have the highest certification share available. The test would flake if that node's certification share height is lower than the subnet's highest certification height in the `ValidateReplayOutput` step. This is indeed theoretically possible. Let me think about whether we could also avoid that, but note that between halting/breaking the subnet and fetching the metrics, the test driver also asserts that the subnet is broken (1 query + 1 update). The racing window looks quite unrealistic.
(I have updated the comment in 5a8a090.)
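The selection rule this exchange converges on — healthy nodes share the highest CUP height, and among those the highest certification share wins — can be sketched as follows. The names are illustrative, not the real `utils.rs` code, though `NodeHeights` is the struct name this PR introduces.

```rust
// Illustrative sketch of the node-selection rule discussed above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct NodeHeights {
    pub catch_up_package_height: u64,
    pub cert_share: u64,
}

/// Pick a download node: lexicographic max on (CUP height, cert share),
/// so the CUP height dominates and cert share only breaks ties.
pub fn pick_download_node(nodes: &[NodeHeights]) -> Option<NodeHeights> {
    nodes
        .iter()
        .copied()
        .max_by_key(|n| (n.catch_up_package_height, n.cert_share))
}
```

Note that with this ordering, a lone node holding a newer CUP beats a node with a higher cert share — exactly the rare race described above.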
```rust
let mut parameters = if cfg.corrupt_cup.can_determine_subnet_id() {
    // If we can deploy read-only access to the subnet, then:
```
So `cfg.corrupt_cup.can_determine_subnet_id()` implies that we can deploy read-only access?
Can we instead add another argument to `get_recovery_parameters`, say `can_deploy_read_only_access: bool`, so we don't have to rely on implicit assumptions?
```rust
} else {
    // If we cannot deploy read-only access to the subnet, this would mean that the CUP is
    // corrupted on enough nodes to stall the subnet which, in practice, should happen only
    // during upgrades. In that case, all nodes stalled at the same height (the upgrade height)
```
All nodes stalled at the same height (the upgrade height) except the nodes lagging behind (like you mention below). If a node was very slow and doesn't have all the artifacts to reach the upgrade height, then it might never reach that height.
Indeed. I think there's an assumption that between the moment the subnet halts/breaks and the moment the test driver fetches metrics, there is enough time for a lagging node to catch up through P2P.
```rust
let upload_node = env
    .topology_snapshot()
    .unassigned_nodes()
    .next()
```
Do I understand correctly that if there are unassigned nodes then we are in failover-nodes recovery? (That's how I understand the comment above.)
I'd rather add another parameter to the `get_recovery_parameters` function, say `failover_mode: bool`, and do here:

```rust
let upload_node = if failover_mode {
    env.topology_snapshot()
        .unassigned_nodes()
        .next()
        .expect("To do a failover nodes recovery we must have unassigned nodes")
} else {
    download_state_node
};
```

instead of relying on implicit assumptions about the setup of the tests.
```rust
let upload_node = env
    .topology_snapshot()
    .unassigned_nodes()
    .next()
    .unwrap_or_else(|| download_state_node.clone());
```
Actually, maybe we can even move this outside the `if {} else {}`, to avoid code duplication.
Unfortunately, it depends on `download_state_node`, so it must be declared after it. Though `admin_nodes` (whose definition differs based on the if-else) depends on `upload_node` :(
I could introduce a closure that takes `download_state_node` as an argument, but I'm afraid that would badly affect readability.
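For concreteness, the extracted helper being discussed could look roughly like this — a hypothetical sketch over plain strings, whereas the real test code operates on node handles from the topology snapshot:

```rust
/// Hypothetical helper: failover recovery uploads to an unassigned node
/// if one exists; same-nodes recovery falls back to the download node.
pub fn choose_upload_node(
    mut unassigned: impl Iterator<Item = String>,
    download_state_node: &str,
) -> String {
    unassigned
        .next()
        .unwrap_or_else(|| download_state_node.to_string())
}
```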
One more meta thing: I don't think …

It does!
This PR aims at deflaking subnet recovery system tests that either:

- `download_state_node` has not computed the manifest yet.

It does so by introducing a new `download_state_height` argument to `ic-recovery`. If provided, it will download the checkpoint at that height and throw an error if it does not exist. If not provided (recommended in most production cases), it will default to the latest checkpoint.

Subnet recovery system tests are then adapted as such: