Skip to content

feat(l1): snap sync metrics and dashboard#6468

Closed
Arkenan wants to merge 4 commits intomainfrom
snapsync-metrics
Closed

feat(l1): snap sync metrics and dashboard#6468
Arkenan wants to merge 4 commits intomainfrom
snapsync-metrics

Conversation

@Arkenan
Copy link
Copy Markdown
Collaborator

@Arkenan Arkenan commented Apr 13, 2026

No description provided.

@github-actions github-actions bot added the L1 Ethereum client label Apr 13, 2026
@github-actions
Copy link
Copy Markdown

github-actions bot commented Apr 13, 2026

Lines of code report

Total lines added: 301
Total lines removed: 0
Total lines changed: 301

Detailed view
+----------------------------------------------+-------+------+
| File                                         | Lines | Diff |
+----------------------------------------------+-------+------+
| ethrex/crates/blockchain/metrics/api.rs      | 60    | +6   |
+----------------------------------------------+-------+------+
| ethrex/crates/blockchain/metrics/mod.rs      | 54    | +2   |
+----------------------------------------------+-------+------+
| ethrex/crates/blockchain/metrics/snapsync.rs | 196   | +196 |
+----------------------------------------------+-------+------+
| ethrex/crates/networking/p2p/network.rs      | 752   | +97  |
+----------------------------------------------+-------+------+

@github-actions
Copy link
Copy Markdown

🤖 Codex Code Review

  1. snapsync_storage_downloaded and snapsync_storage_inserted are effectively dead metrics. In snapsync.rs, both are defined as IntCounters and only expose inc_* mutators at snapsync.rs. But the only producer path in network.rs and network.rs updates storage_per_second only; it never updates the storage total series. Result: Prometheus will report snapsync_storage_downloaded / snapsync_storage_inserted as 0 for the whole run, which makes the new storage progress metrics incorrect. Suggested fix: either make those fields IntGauge and set absolute values like the other snap-sync totals, or keep them as counters and increment them by the interval delta in push_snapsync_prometheus_metrics.

Other than that, the PR looks mechanically safe: it doesn’t touch EVM execution, gas accounting, consensus validation, trie logic, or RLP paths.

I couldn’t run cargo check in this environment because rustup failed to create temp files under the sandboxed filesystem, so this review is based on static inspection only.


Automated review by OpenAI Codex · gpt-5.4 · custom prompt

@github-actions
Copy link
Copy Markdown

🤖 Claude Code Review

Now I have enough context for a thorough review. Let me write it up.


PR Review: feat(l1): Snap sync metrics and dashboard (#6468)

Overall: The structure is sound and follows established patterns in the codebase. There are a few correctness bugs and some gaps in metric coverage worth addressing before this lands.


Bugs

storage_downloaded / storage_inserted are always zero

crates/blockchain/metrics/snapsync.rs lines 27–28 declare these as IntCounter, and the setter methods (inc_storage_downloaded, inc_storage_inserted) use inc_by(). However, push_snapsync_prometheus_metrics in network.rs never calls these setters for the storage cases — only set_storage_per_second is called. As a result, both metrics will always report 0 in Prometheus.

Fix options:

  • Switch to IntGauge + set() (consistent with every other progress counter in this file), and call METRICS_SNAPSYNC.set_storage_downloaded(downloaded) / METRICS_SNAPSYNC.set_storage_inserted(inserted) in the RequestingStorageRanges / InsertingStorageRanges arms.
  • Or keep IntCounter and call inc_by(delta) with the interval delta — but this would permanently diverge from the gauge-based approach used for accounts, headers, and bytecodes, and would make the metric non-resettable across node restarts.

The first option is strongly preferred for consistency.


Storage absolute counts not pushed at all

Even beyond the type issue above, RequestingStorageRanges and InsertingStorageRanges (network.rs ~lines 734–745) only push the per-second rate — they never push the actual downloaded or inserted counts. Compare with headers, accounts, and bytecodes which push both the count and the rate. If the intent is observability of sync progress, the raw counts are the more useful signal for dashboards (e.g., "X of Y storage leaves downloaded").


Correctness Concerns

HealingStorage never sets a start timestamp

HealingState calls METRICS_SNAPSYNC.set_healing_stage_start_now() when phase_elapsed < 2, but HealingStorage does not (network.rs ~lines 754–760). If the sync process transitions from HealingStateHealingStorage, the healing_stage_start_timestamp metric retains the value from when HealingState began, not when HealingStorage began. Whether this is intentional (treating healing as one combined phase) should be clarified in a comment.


cast_signed() on u64 — potential panic in debug

snapsync.rs lines ~163–186 use .cast_signed() to convert u64i64. For current Ethereum block numbers this is safe, but for any counter that could realistically exceed i64::MAX (≈ 9.2 × 10¹⁸), this panics in debug builds and wraps silently in release. Consider using as i64 with a known-safe comment, or saturate at i64::MAX. This is low priority given current scale, but worth a one-line comment.


Design / Performance

gather_metrics() allocates a fresh Registry on every scrape

snapsync.rs lines 189–229 create a new Registry, box all 22 metrics, register them, and encode them on every call to get_metrics(). This is consistent with the existing p2p.rs pattern, but it is allocation-heavy and scales linearly with metric count. Since this is called periodically by the Prometheus scraper, the impact is minor, but if this pattern is ever revisited globally, a cached registry would be worthwhile.


set_stage takes i64, not u8

snapsync.rs:162 declares pub fn set_stage(&self, stage: i64), but the call site (network.rs:697) does let step_num: u8 = step.into(); METRICS_SNAPSYNC.set_stage(step_num as i64). The IntGauge::set() API requires i64, so this is unavoidable, but the function signature could accept u8 internally and convert, keeping the call site cleaner:

pub fn set_stage(&self, stage: u8) { self.stage.set(stage as i64); }

Minor / Nits

  • The stage encoding (snapsync_stage description in snapsync.rs:55) maps stage numbers to sync phases. These numbers are derived from CurrentStepValue::into::<u8>() — worth verifying this mapping is stable and not just the current enum discriminant order, which could silently shift if variants are reordered.

  • interval_secs = PROGRESS_INTERVAL_SECS.max(1) as f64 (network.rs:704): The .max(1) guard is correct. Since PROGRESS_INTERVAL_SECS is a const u64 = 30, this is fine at compile time but also just defensive dead code — a comment that this guards against a hypothetical zero interval would help future readers.


Summary

Severity Item
Bug storage_downloaded / storage_inserted always 0; inc_* methods unused
Bug Storage absolute counts not pushed (only rate)
Correctness HealingStorage missing stage-start timestamp (or needs comment)
Low cast_signed() panic in debug for very large counters
Nit Stage number–enum mapping stability

The rate calculation logic and phase detection heuristic (phase_elapsed < 2s) are correct given the 30-second polling interval. The feature flag gating is consistent with the rest of the codebase.


Automated review by Claude (Anthropic) · sonnet · custom prompt

@Arkenan Arkenan changed the title feat(l1): Snap sync metrics and dashboard feat(l1): snap sync metrics and dashboard Apr 13, 2026
…torage progress

Fixes three critical issues with snap sync metrics:

1. Metrics not reaching 100% on phase completion
   - Added metrics emission when phases complete, not just during periodic logging
   - Headers and accounts now accurately show 100% when finished

2. Storage metrics showing incorrect values
   - Changed storage_downloaded/inserted from IntCounter to IntGauge
   - Use set() with total values instead of inc_by() with intervals
   - Prevents accumulation errors where inserted > downloaded

3. Improved real-time accuracy
   - Stage number updates immediately on phase transitions
   - All counters show actual values matching internal state

Also enables metrics API by removing default-features=false from ethrex-metrics dependency.
Fixes two additional metrics issues:

1. Healing elapsed showing "no data"
   - Healing timestamp was only set within first 2 seconds of phase
   - Changed to set on first metrics call if currently 0
   - Added getter method to check current timestamp value
   - Applied to both HealingState and HealingStorage phases

2. Bytecodes ETA bouncing between values
   - bytecodes_total was updating as pivot block changed during sync
   - Now locks the total when bytecode phase first starts
   - Prevents target from moving, making ETA calculations stable
   - Added getter method to check if total already locked

Also removed unused IntCounter import.
ElFantasma added a commit that referenced this pull request Apr 14, 2026
Incorporate progress metrics from PR #6468 (Tomi/Esteve) into the
observability PR, with improvements:

- Add progress gauges: headers, accounts, storage, healing, bytecodes
  (downloaded/inserted/total) + stage + pivot_block
- Push from METRICS atomics via push_sync_prometheus_metrics() in
  network.rs, called each polling cycle and on phase completion
- Grafana dashboard with 7 rows: overview, peer health, headers,
  accounts, storage, healing, bytecodes — with progress gauges, rate
  panels (using Grafana rate() instead of app-computed rates), and ETA
- All metrics use default Prometheus registry (register at init)
- New peer-health row with eligible peers, pivot age, inflight requests,
  and pivot update outcomes — not present in the original PR

Supersedes #6468.
@ElFantasma
Copy link
Copy Markdown
Contributor

We've incorporated the progress metrics and Grafana dashboard from this PR into #6470 (snap sync observability), with a few adaptations:

  • Metrics use the default Prometheus registry (register once at init) instead of per-call Registry::new()
  • Throughput rates (*_per_second) are computed in Grafana via rate(metric[5m]) instead of in-app — this is more idiomatic for Prometheus and avoids maintaining rate state in Rust
  • Stage start timestamps are removed (Grafana derives timing from rate slopes)
  • Added a new Peer Health row to the dashboard with eligible peers, pivot age, inflight requests, and pivot update outcomes (diagnostics from the broader observability work)
  • All metric names use ethrex_sync_* prefix for consistency

The dashboard layout is preserved: overview → headers → accounts → storage → healing → bytecodes, with progress gauges, rate panels, and ETA.

This PR can be closed once #6470 merges. Thanks for the dashboard work!

@ElFantasma
Copy link
Copy Markdown
Contributor

Closing this PR as all it's functionality was moved to #6470.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

L1 Ethereum client

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

3 participants