feat(recovery): [CON-1604] allow overriding which state height to download #9707
pierugo-dfinity wants to merge 40 commits into `master`.
Conversation
Pull request overview
Refactors the subnet recovery test harness and ic-recovery tooling to allow explicitly selecting which checkpoint height to download (targeting the latest CUP height), reducing flakes caused by manifest/CUP timing races during state download.
Changes:
- Add `--download-state-height` support to recovery flows (App subnet, NNS same-nodes, NNS failover-nodes) and thread it through download/copy-local steps.
- Extend node metrics parsing to include CUP height and last computed manifest height; update node selection logic in tests to use these heights.
- Update system tests to pass an explicit `download_state_height` derived from CUP height and refactor parameter selection into a helper.
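Putting the pieces above together, the extended metrics presumably look roughly like this. A sketch only: `cert_share` and `catch_up_package_height` are named later in this thread and the manifest height is described in the summary, but the exact field names and types in `rs/recovery/src/lib.rs` may differ.

```rust
// Hypothetical sketch of the extended node metrics; the real struct in
// ic-recovery may name or type these fields differently.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct NodeMetrics {
    pub finalization_height: u64,
    /// Certification share height (used when picking a download node).
    pub cert_share: u64,
    /// Height of the latest CUP the node holds (new in this PR).
    pub catch_up_package_height: u64,
    /// Height of the last state manifest the node computed (new in this PR).
    pub last_computed_manifest_height: u64,
}

/// A node is safe to download state from only once its manifest has
/// caught up with its CUP height — the timing race this PR deflakes.
pub fn manifest_caught_up(m: &NodeMetrics) -> bool {
    m.last_computed_manifest_height >= m.catch_up_package_height
}
```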
Reviewed changes
Copilot reviewed 15 out of 16 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| rs/tests/testnets/nns_recovery.rs | Switches to updated node-height selection helper and struct return type. |
| rs/tests/nested/nns_recovery/common.rs | Uses CUP height to set download_state_height for NNS recovery runs. |
| rs/tests/consensus/subnet_recovery/utils.rs | Introduces NodeHeights + selects node based on cert-share/CUP heights; threads download_state_height into local CLI args. |
| rs/tests/consensus/subnet_recovery/sr_nns_failover_nodes_test.rs | Passes download_state_height into NNS failover recovery args. |
| rs/tests/consensus/subnet_recovery/common.rs | Refactors app-subnet recovery parameter selection; ensures state download node has manifest >= CUP; passes download_state_height. |
| rs/recovery/subnet_splitting/src/subnet_splitting.rs | Updates get_ic_state_includes call signature. |
| rs/recovery/src/steps.rs | Makes local state copy step accept precomputed include paths; uses MaybeRemote for latest checkpoint lookup remotely. |
| rs/recovery/src/recovery_state.rs | Updates tests to account for new args field and simplifies expectations. |
| rs/recovery/src/nns_recovery_same_nodes.rs | Adds CLI arg + interactive prompt for download_state_height; threads it into download/copy-local steps. |
| rs/recovery/src/nns_recovery_failover_nodes.rs | Adds CLI arg + interactive prompt for download_state_height; threads it into download step. |
| rs/recovery/src/lib.rs | Implements download_height support in get_ic_state_includes; adds MaybeRemote; extends NodeMetrics; adds unit test. |
| rs/recovery/src/cli.rs | Updates height-selection guidance message to mention CUP height. |
| rs/recovery/src/app_subnet_recovery.rs | Adds CLI arg + interactive prompt for download_state_height; threads it into download/copy-local steps. |
| rs/recovery/Cargo.toml | Adds assert_matches dev-dependency for new unit test assertions. |
| rs/recovery/BUILD.bazel | Adds Bazel dev-dependency for assert_matches. |
| Cargo.lock | Locks assert_matches into ic-recovery dependency set. |
Pull request overview
Copilot reviewed 16 out of 17 changed files in this pull request and generated 1 comment.
kpop-dfinity left a comment:
This will make @basvandijk very happy
```rust
// We could pick a node with highest finalization and CUP height automatically,
// but we might have a preference between nodes of same heights.
```
IIRC we don't print CUP heights in `print_height_info`. Should we start doing so?
We print the debug representation of `NodeMetrics`, which has a `catch_up_package_height` field with this PR :)
```diff
-pub fn get_ic_state_includes(ssh_helper: Option<&SshHelper>) -> RecoveryResult<Vec<PathBuf>> {
+pub fn get_ic_state_includes(
+    maybe_remote: MaybeRemote<'_>,
+    download_height: Option<u64>,
```
I'm probably overengineering but maybe we could use an enum here?

```rust
enum CheckpointHeight {
    Latest,
    Specific(u64), // I don't know what a good name for this variant would be
}
```

Otherwise, without reading the documentation, it's not immediately clear what `None` means.
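To illustrate why the enum reads better at call sites, here is a minimal sketch. The `checkpoint_name` helper is hypothetical; only the `{:016x}` checkpoint-directory naming scheme comes from the code under review.

```rust
// Sketch of the suggested enum; `checkpoint_name` is a hypothetical
// helper, not the real ic-recovery API.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum CheckpointHeight {
    /// Use the most recent checkpoint found on the node.
    Latest,
    /// Use the checkpoint at exactly this height.
    Specific(u64),
}

pub fn checkpoint_name(which: CheckpointHeight, latest_on_disk: u64) -> String {
    let height = match which {
        CheckpointHeight::Latest => latest_on_disk,
        CheckpointHeight::Specific(h) => h,
    };
    // Checkpoint directories are named by height, zero-padded to 16 hex digits.
    format!("{height:016x}")
}
```

At a call site, `CheckpointHeight::Latest` is self-describing, whereas passing `None` forces the reader to look up what the default behaviour is.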
I don't dislike it. I was also using an `Option<&SshHelper>` before switching to `MaybeRemote`, which I think helps. WDYT of 8c9b674?
```rust
Self::get_maybe_latest_checkpoint_name_remotely(ssh_helper, &ic_checkpoints_path)?
// …
let checkpoint_name = if let Some(height) = download_height {
    let name = format!("{:016x}", height);
```
```diff
-    let name = format!("{:016x}", height);
+    let name = format!("{height:016x}");
```
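This suggestion is purely stylistic: Rust 2021's inlined format arguments produce byte-identical output to the positional form, as a quick check shows (function names here are illustrative):

```rust
/// Both spellings render a height as a 16-digit, zero-padded,
/// lowercase hex string; only the style differs.
pub fn name_positional(height: u64) -> String {
    format!("{:016x}", height)
}

pub fn name_inline(height: u64) -> String {
    format!("{height:016x}")
}
```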
```diff
-pub fn get_ic_state_includes(ssh_helper: Option<&SshHelper>) -> RecoveryResult<Vec<PathBuf>> {
+pub fn get_ic_state_includes(
+    maybe_remote: MaybeRemote<'_>,
+    download_height: Option<u64>,
```
```diff
-    download_height: Option<u64>,
+    checkpoint_height_to_download: Option<u64>,
```
```rust
// No checkpoints, return an empty list of includes. This is not an error, as the
// subnet could have stalled in its first DKG interval.
```
Should we at least log a warning?
```rust
    pub cert_share: u64,
}
// …
/// Select a node with highest certification share height in the given subnet snapshot
```
This does not necessarily hold any more: there could be a node with a cert share higher than everyone else's that hasn't yet produced the latest CUP.
Is that an issue?
Yeah I have thought about that and I think you're right :( I do not think that should lead to a lot of flakiness. The way I understand it:
- Except if the test driver is very fast*, all remaining healthy nodes after halting/breaking the subnet should have the same CUP height (i.e. P2P still runs in both cases).
- Out of these, it should be fine to select the node with highest certification share height.
*it could happen that:
- A node aggregates a CUP
- Subnet halts/breaks
- The test driver scrapes all metrics
- The node gossips the CUP
In that case, only 1 (or a few) nodes would have the latest CUP and, indeed, may not have the highest certification share available. The test would flake if that node's certification share height is lower than the subnet's highest certification height in the `ValidateReplayOutput` step. This is indeed theoretically possible. Let me think about whether we could also avoid that, but note that between halting/breaking the subnet and fetching the metrics, the test driver also asserts that the subnet is broken (1 query + 1 update). The racing window looks quite unrealistic.
(I have updated the comment in 5a8a090.)
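The selection rule this exchange converges on — healthy nodes share the highest CUP height, and among those the highest certification share wins — can be sketched as follows. The names are illustrative, not the real `utils.rs` code, though `NodeHeights` is the struct name this PR introduces.

```rust
// Illustrative sketch of the node-selection rule discussed above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct NodeHeights {
    pub catch_up_package_height: u64,
    pub cert_share: u64,
}

/// Pick a download node: lexicographic max on (CUP height, cert share),
/// so the CUP height dominates and cert share only breaks ties.
pub fn pick_download_node(nodes: &[NodeHeights]) -> Option<NodeHeights> {
    nodes
        .iter()
        .copied()
        .max_by_key(|n| (n.catch_up_package_height, n.cert_share))
}
```

Note that with this ordering, a lone node holding a newer CUP beats a node with a higher cert share — exactly the rare race described above.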
```rust
let mut parameters = if cfg.corrupt_cup.can_determine_subnet_id() {
    // If we can deploy read-only access to the subnet, then:
```
So `cfg.corrupt_cup.can_determine_subnet_id()` implies that we can deploy read-only access?
Can we instead add another argument to `get_recovery_parameters`, say `can_deploy_read_only_access: bool`, so we don't have to rely on implicit assumptions?
```rust
} else {
    // If we cannot deploy read-only access to the subnet, this would mean that the CUP is
    // corrupted on enough nodes to stall the subnet which, in practice, should happen only
    // during upgrades. In that case, all nodes stalled at the same height (the upgrade height)
```
All nodes stalled at the same height (the upgrade height) except the nodes lagging behind (like you mention below). If a node was very slow and doesn't have all the artifacts to reach the upgrade height, then it might never reach that height.
Indeed. I think there's an assumption that between the moment the subnet halts/breaks and the moment the test driver fetches metrics, there is enough time for a lagging node to catch up through P2P.
```rust
let upload_node = env
    .topology_snapshot()
    .unassigned_nodes()
    .next()
```
Do I understand correctly that if there are unassigned nodes then we are in failover-nodes recovery? (That's how I understand the comment above.)
I'd rather add another parameter to the `get_recovery_parameters` function, say `failover_mode: bool`, and do here:

```rust
let upload_node = if failover_mode {
    env.topology_snapshot()
        .unassigned_nodes()
        .next()
        .expect("To do a failover nodes recovery we must have unassigned nodes")
} else {
    download_state_node
};
```

instead of relying on implicit assumptions about the setup of the tests.
```rust
let upload_node = env
    .topology_snapshot()
    .unassigned_nodes()
    .next()
    .unwrap_or_else(|| download_state_node.clone());
```
Actually, maybe we can even move this outside the `if {} else {}`, to avoid code duplication.
Unfortunately, it depends on `download_state_node`, so it must be declared after it. Though `admin_nodes` (whose definition differs based on the if-else) depends on `upload_node` :(
I could introduce a closure that takes `download_state_node` as an argument, but I'm afraid that would badly affect readability.
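For concreteness, the extracted helper being discussed could look roughly like this — a hypothetical sketch over plain strings, whereas the real test code operates on node handles from the topology snapshot:

```rust
/// Hypothetical helper: failover recovery uploads to an unassigned node
/// if one exists; same-nodes recovery falls back to the download node.
pub fn choose_upload_node(
    mut unassigned: impl Iterator<Item = String>,
    download_state_node: &str,
) -> String {
    unassigned
        .next()
        .unwrap_or_else(|| download_state_node.to_string())
}
```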
One more meta thing: I don't think …

It does!
This PR aims at deflaking subnet recovery system tests that either:

- `download_state_node` has not computed the manifest yet.

It does so by introducing a new `download_state_height` argument to `ic-recovery`. If provided, it will download the checkpoint at that height and throw an error if it does not exist. If not provided (recommended in most production cases), it will default to the latest checkpoint.

Subnet recovery system tests are then adapted as such: