Add scheduled instance snapshots with retention cleanup#139
Add scheduled instance snapshots with retention cleanup#139sjmiller609 wants to merge 30 commits intomainfrom
Conversation
✱ Stainless preview buildsThis PR will update the Edit this comment to update it. It will appear in the SDK's changelogs. ⏳ hypeman-go studio · code · diff
✅ hypeman-openapi studio · code · diff
⏳ These are partial results; builds are still running. This comment is auto-generated by GitHub Actions and is automatically kept up to date as you push. |
There was a problem hiding this comment.
Stale comment
Automated Risk Assessment
Medium-Highrisk. Review is required, so this PR is not approved by automation.Evidence from the diff:
- Adds a new background scheduler loop in
cmd/api/main.gothat periodically executes snapshot schedule runs.- Introduces substantial new scheduling and retention logic in
lib/instances/snapshot_schedule.go, including automatic snapshot deletion paths.- Expands public API surface with new snapshot-schedule endpoints in
cmd/api/api/snapshots.goandopenapi.yaml(plus regenerated bindings inlib/oapi/oapi.go).- Changes are cross-cutting across API, runtime behavior, persistence, and cleanup semantics, increasing regression and operational blast radius.
Validation notes:
- Attempted targeted tests, but they could not run in this environment due missing embedded binary assets under
lib/vmm,lib/system, andlib/ingress.Action taken:
- Requested reviewers:
rgarcia,hiroTamada.
There was a problem hiding this comment.
Automated Risk Assessment
Medium-High risk. Review is required, so this PR is not approved by automation.
Assessment based on the code diff:
- Adds a new minute-polling background scheduler in
cmd/api/main.gothat executes asynchronous snapshot work continuously. - Introduces substantial new scheduling/retention execution logic in
lib/instances/snapshot_schedule.go, including automatic snapshot deletion paths. - Expands production API surface via new schedule endpoints in
cmd/api/api/snapshots.goand schema/binding changes inopenapi.yamlandlib/oapi/oapi.go. - Persists and loads schedule state on disk (
lib/scheduledsnapshots/storage.go,lib/paths/paths.go), increasing operational/state-management complexity.
Reviewer handling:
- Code review is required for this risk level.
- No additional reviewer requests were made because 2 reviewers are already assigned (
hiroTamada,rgarcia).
…ers, remove dead code - Extract failScheduleRun helper to deduplicate the "record error + save schedule" pattern repeated 4 times in runSnapshotScheduleForInstanceLocked - Inline countScheduledSnapshots (single-use len() wrapper) at call site - Inline validateSetSnapshotScheduleRequest (single-use wrapper) at call site - Merge snapshot_schedule_types.go contents into snapshot_schedule.go - Remove dead truncation/fallback code in BuildSnapshotName that could never execute because ValidateSetRequest already rejects invalid prefixes upstream Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Remove redundant nil-check-before-assign for pointer fields in snapshotScheduleToOAPI (assigning nil to a nil field is a no-op) - Use var declaration instead of make([]error, 0) for error accumulator that is only appended to (not serialized to JSON) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…le-retention # Conflicts: # lib/oapi/oapi.go
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
The scopes package (added on main) requires all routes to have scope mappings. Map the three snapshot-schedule endpoints to existing snapshot scopes: read, write, and delete. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Sayan-
left a comment
There was a problem hiding this comment.
a few things to think about before this sees heavy use -- none of these need to block the PR but worth tracking:
scheduler robustness
the scheduler processes all due instances sequentially in a single goroutine with no per-snapshot timeout and no max-snapshots-per-tick limit. if a single createSnapshot hangs (e.g. VM pause never completes), it blocks the entire scheduler indefinitely. if many instances are due at once, the scheduler can fall behind the 1-minute tick -- and when it does, NextRun silently skips forward with no last_error, no metric, and no log. the user has no way to know a scheduled snapshot was missed.
consider: per-snapshot context timeout, a cap on snapshots per tick, and logging/recording when a schedule is skipped due to the scheduler falling behind.
thundering herd
SetSnapshotSchedule sets NextRunAt = now + interval with no jitter. instances created around the same time with the same interval will stay aligned forever, creating periodic load spikes. adding a small random offset on creation would spread the load.
ux nits
MinInterval = 1mis pretty aggressive for standby snapshots on running instances, which involve a pause/resume cycle. might be worth a higher floor (5m? 10m?) or at least documenting the guest-visible impact.SetSnapshotScheduleresetsNextRunAton every update, even config-only changes like adjusting retention. if i just want to changemax_countfrom 5 to 10, my next snapshot gets pushed out by a full interval. consider only resetting the timer when the interval itself changes.
|
Addressed the targeted scheduler concerns from Sayans review in the latest pass:
I left the broader scheduler hardening items as follow-up work for now: per-snapshot timeouts, per-tick caps, and missed-run/backlog visibility. |



Summary
max_count/max_age)API
GET /instances/{id}/snapshot-schedulePUT /instances/{id}/snapshot-scheduleDELETE /instances/{id}/snapshot-scheduleNotes
Testing
go test ./lib/instances -run SnapshotSchedule -count=1go test ./cmd/api/api -run Snapshot -count=1go test ./cmd/api -run TestOapiRuntimeBindStyledParameter_URLDecoding -count=1Note
Medium Risk
Adds a new background scheduler that automatically creates and deletes snapshots based on persisted schedules, which can impact storage usage and snapshot/instance lifecycle behavior if misconfigured or buggy.
Overview
Adds scheduled snapshots: new
GET/PUT/DELETE /instances/{id}/snapshot-scheduleendpoints, OpenAPI/SDK bindings, and API handlers to create/update/read/delete schedules.Implements schedule persistence and execution in
instances(newSnapshotScheduleManager): schedules are stored undersnapshot-schedules/, due schedules are polled every minute fromcmd/api/main.go, runs auto-select snapshot kind from instance state, and retention cleanup (max_count/max_age) deletes only scheduled snapshots.Includes new
scheduledsnapshotslibrary (validation, deterministic jitter cadence, marshal/unmarshal, listing) plus paths/errors/docs updates and focused tests (including preservingmax_count=0).Written by Cursor Bugbot for commit 52e78bf. This will update automatically on new commits. Configure here.