Skip to content

feat(recovery): [CON-1604] allow to override what state height to download#9707

Open
pierugo-dfinity wants to merge 40 commits intomasterfrom
pierugo/recovery/download-state-chosen-by-user
Open

feat(recovery): [CON-1604] allow to override what state height to download#9707
pierugo-dfinity wants to merge 40 commits intomasterfrom
pierugo/recovery/download-state-chosen-by-user

Conversation

@pierugo-dfinity
Copy link
Copy Markdown
Contributor

@pierugo-dfinity pierugo-dfinity commented Apr 1, 2026

This PR aims at deflaking subnet recovery system tests that either:

  • Download a state older than the latest CUP because the download_state_node has not computed the manifest yet.
  • Download a state higher than the latest CUP because the subnet stalled right after computing the manifest but before creating a CUP.
  • Download the correct state (equal to the latest CUP) but was considered invalid because it was received by state sync.

It does so by introducing a new download_state_height argument to ic-recovery. If provided, it will download the checkpoint at that height and throw and error if it does not exist.
If not provided (recommended in most production cases), it will default to the latest checkpoint.

Subnet recovery system tests are then adapted as such:

  • Always download the checkpoint that matches the highest CUP on the subnet (solves first two flakes).
  • Download it from a node that created a manifest for this CUP (solves third flake).
  • Download the consensus pool from a node that has this highest CUP, and choose the one with highest certification share height.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Refactors the subnet recovery test harness and ic-recovery tooling to allow explicitly selecting which checkpoint height to download (targeting the latest CUP height), reducing flakes caused by manifest/CUP timing races during state download.

Changes:

  • Add --download-state-height support to recovery flows (App subnet, NNS same-nodes, NNS failover-nodes) and thread it through download/copy-local steps.
  • Extend node metrics parsing to include CUP height and last computed manifest height; update node selection logic in tests to use these heights.
  • Update system tests to pass an explicit download_state_height derived from CUP height and refactor parameter selection into a helper.

Reviewed changes

Copilot reviewed 15 out of 16 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
rs/tests/testnets/nns_recovery.rs Switches to updated node-height selection helper and struct return type.
rs/tests/nested/nns_recovery/common.rs Uses CUP height to set download_state_height for NNS recovery runs.
rs/tests/consensus/subnet_recovery/utils.rs Introduces NodeHeights + selects node based on cert-share/CUP heights; threads download_state_height into local CLI args.
rs/tests/consensus/subnet_recovery/sr_nns_failover_nodes_test.rs Passes download_state_height into NNS failover recovery args.
rs/tests/consensus/subnet_recovery/common.rs Refactors app-subnet recovery parameter selection; ensures state download node has manifest >= CUP; passes download_state_height.
rs/recovery/subnet_splitting/src/subnet_splitting.rs Updates get_ic_state_includes call signature.
rs/recovery/src/steps.rs Makes local state copy step accept precomputed include paths; uses MaybeRemote for latest checkpoint lookup remotely.
rs/recovery/src/recovery_state.rs Updates tests to account for new args field and simplifies expectations.
rs/recovery/src/nns_recovery_same_nodes.rs Adds CLI arg + interactive prompt for download_state_height; threads it into download/copy-local steps.
rs/recovery/src/nns_recovery_failover_nodes.rs Adds CLI arg + interactive prompt for download_state_height; threads it into download step.
rs/recovery/src/lib.rs Implements download_height support in get_ic_state_includes; adds MaybeRemote; extends NodeMetrics; adds unit test.
rs/recovery/src/cli.rs Updates height-selection guidance message to mention CUP height.
rs/recovery/src/app_subnet_recovery.rs Adds CLI arg + interactive prompt for download_state_height; threads it into download/copy-local steps.
rs/recovery/Cargo.toml Adds assert_matches dev-dependency for new unit test assertions.
rs/recovery/BUILD.bazel Adds Bazel dev-dependency for assert_matches.
Cargo.lock Locks assert_matches into ic-recovery dependency set.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread rs/tests/consensus/subnet_recovery/utils.rs Outdated
Comment thread rs/tests/nested/nns_recovery/common.rs Outdated
Comment thread rs/tests/consensus/subnet_recovery/sr_nns_failover_nodes_test.rs
Comment thread rs/recovery/src/lib.rs
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 16 out of 17 changed files in this pull request and generated 1 comment.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread rs/recovery/src/util.rs Outdated
@pierugo-dfinity pierugo-dfinity marked this pull request as ready for review April 16, 2026 14:47
@pierugo-dfinity pierugo-dfinity requested review from a team as code owners April 16, 2026 14:47
Copy link
Copy Markdown
Contributor

@kpop-dfinity kpop-dfinity left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will make @basvandijk very happy

Comment on lines +313 to +314
// We could pick a node with highest finalization and CUP height automatically,
// but we might have a preference between nodes of same heights.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIRC we don't print CUP heights in print_height_info. Should we start doing so?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We print the debug representation of NodeMetrics which has a catch_up_package_height field with this PR :)

Comment thread rs/recovery/src/lib.rs Outdated
pub fn get_ic_state_includes(ssh_helper: Option<&SshHelper>) -> RecoveryResult<Vec<PathBuf>> {
pub fn get_ic_state_includes(
maybe_remote: MaybeRemote<'_>,
download_height: Option<u64>,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm probably overengineering but maybe we could use an enum here?

enum CheckpointHeight {
     Latest,
     Specific(u64),  // I don't know what a good name for this variant would be
}

Otherwise, without reading the documentation, it's not immediately clear what None means

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't dislike it. I was also using an Option<&SshHelper> before switching to MaybeRemote which I think helps. WDYT of 8c9b674 ?

Comment thread rs/recovery/src/lib.rs Outdated
Self::get_maybe_latest_checkpoint_name_remotely(ssh_helper, &ic_checkpoints_path)?

let checkpoint_name = if let Some(height) = download_height {
let name = format!("{:016x}", height);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
let name = format!("{:016x}", height);
let name = format!("{height:016x}");

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment thread rs/recovery/src/lib.rs Outdated
pub fn get_ic_state_includes(ssh_helper: Option<&SshHelper>) -> RecoveryResult<Vec<PathBuf>> {
pub fn get_ic_state_includes(
maybe_remote: MaybeRemote<'_>,
download_height: Option<u64>,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
download_height: Option<u64>,
checkpoint_height_to_download: Option<u64>,

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment thread rs/recovery/src/lib.rs Outdated
Comment on lines +408 to +409
// No checkpoints, return an empty list of includes. This is not an error, as the
// subnet could have stalled in its first DKG interval.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we at least log a warning?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea, b3a6a23

pub cert_share: u64,
}

/// Select a node with highest certification share height in the given subnet snapshot
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This not necessarily holds any more, it could happen that there is a node with a cert share higher than everyone else, yet who hasn't produced the lastest CUP.

Is that an issue?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I have thought about that and I think you're right :( I do not think that should lead to a lot of flakiness. The way I understand it:

  • Except if the test driver is very fast*, all remaining healthy nodes after halting/breaking the subnet should have the same CUP height (i.e. P2P still runs in both cases).
  • Out of these, it should be fine to select the node with highest certification share height.

*it could happen that:

  • A node aggregates a CUP
  • Subnet halts/breaks
  • The test driver scrapes all metrics
  • The node gossips the CUP

In that case, only 1 (or a few) nodes would have the latest CUP and, indeed, may not have the highest certification share available. The test would flake if its certification share height is lower than the subnet's highest certification height in ValidateReplayOutput step. This is theoretically possible indeed. Let me think if we could also avoid that, but note that between halting/breaking the subnet and fetching the metrics, the test driver also asserts that the subnet is broken (1 query + update). The racing window looks quite unrealistic.

(I have updated the comment 5a8a090)

Comment on lines +1107 to +1108
let mut parameters = if cfg.corrupt_cup.can_determine_subnet_id() {
// If we can deploy read-only access to the subnet, then:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so cfg.corrupt_cup.can_determine_subnet_id() implies that we can deploy read-only access?
Can we instead add another argument to get_recovery_parameters, say can_deploy_read_only_access: bool, so we don't have to rely on some implicit assumptions?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure 52f9c66

} else {
// If we cannot deploy read-only access to the subnet, this would mean that the CUP is
// corrupted on enough nodes to stall the subnet which, in practice, should happen only
// during upgrades. In that case, all nodes stalled at the same height (the upgrade height)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

all nodes stalled at the same height (the upgrade height) except the lagging behind nodes (like you mention below). If a node was very slow and doesn't have all the artifacts to reach the upgrade height, then it might never reach that height

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed. I think there's an assumption that between the moment the subnet halts/breaks and the moment the test driver fetches metrics, there is enough time for the node to catch-up in case it's falling behind (through P2P).

Comment on lines +1133 to +1136
let upload_node = env
.topology_snapshot()
.unassigned_nodes()
.next()
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do I understand correctly that if there are unassigned nodes then we are in failover nodes recovery? (that's how I understand of the comment above)

I'd rather add another parameter to the get_recovery_parameters function, sayfailover_mode: bool, and do here:

let upload node = if failover_mode {
     env
            .topology_snapshot()
            .unassigned_nodes()
            .next()
            .expect("To do a failover nodes recovery we must have unassigned nodes")
} else {
     download_state_node
};

Instead of relying on some implicit assumptions about the setup of the tests

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment on lines +1167 to +1171
let upload_node = env
.topology_snapshot()
.unassigned_nodes()
.next()
.unwrap_or_else(|| download_state_node.clone());
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually maybe we can even move this outside the if {} else {}, to avoid code duplciation

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately, it depends on download_state_node so it must be declared after it. Though admin_nodes (whose definition is different based on the if-else) depends on upload_nodes :(
I could introduce a closure that takes download_state_node as an argument, but I'm afraid that would badly affect readability.

@kpop-dfinity
Copy link
Copy Markdown
Contributor

One more meta thing: I don't think refactor tag is appropriate here. I think feat makes more sense

@basvandijk
Copy link
Copy Markdown
Collaborator

This will make @basvandijk very happy

It does!

@pierugo-dfinity pierugo-dfinity changed the title refactor(recovery): [CON-1604] allow to override what state height to download feat(recovery): [CON-1604] allow to override what state height to download Apr 17, 2026
@github-actions github-actions bot added feat and removed refactor labels Apr 17, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants