Skip to content

Add scheduled instance snapshots with retention cleanup#139

Open
sjmiller609 wants to merge 30 commits intomainfrom
codex/snapshot-schedule-retention
Open

Add scheduled instance snapshots with retention cleanup#139
sjmiller609 wants to merge 30 commits intomainfrom
codex/snapshot-schedule-retention

Conversation

@sjmiller609
Copy link
Collaborator

@sjmiller609 sjmiller609 commented Mar 9, 2026

Summary

  • add per-instance snapshot schedule configuration (get/set/delete)
  • add background scheduler that runs due schedules and creates snapshots automatically
  • add retention cleanup for scheduled snapshots (max_count / max_age)
  • update snapshot/instance README docs for scheduled snapshot behavior
  • extend OpenAPI and regenerate SDK/server bindings

API

  • GET /instances/{id}/snapshot-schedule
  • PUT /instances/{id}/snapshot-schedule
  • DELETE /instances/{id}/snapshot-schedule

Notes

  • schedule configs are stored centrally under the data dir (not in guest snapshot payloads), so schedule state is not forked/restored with VM payload snapshots.

Testing

  • go test ./lib/instances -run SnapshotSchedule -count=1
  • go test ./cmd/api/api -run Snapshot -count=1
  • go test ./cmd/api -run TestOapiRuntimeBindStyledParameter_URLDecoding -count=1

Note

Medium Risk
Adds a new background scheduler that automatically creates and deletes snapshots based on persisted schedules, which can impact storage usage and snapshot/instance lifecycle behavior if misconfigured or buggy.

Overview
Adds scheduled snapshots: new GET/PUT/DELETE /instances/{id}/snapshot-schedule endpoints, OpenAPI/SDK bindings, and API handlers to create/update/read/delete schedules.

Implements schedule persistence and execution in instances (new SnapshotScheduleManager): schedules are stored under snapshot-schedules/, due schedules are polled every minute from cmd/api/main.go, runs auto-select snapshot kind from instance state, and retention cleanup (max_count/max_age) deletes only scheduled snapshots.

Includes new scheduledsnapshots library (validation, deterministic jitter cadence, marshal/unmarshal, listing) plus paths/errors/docs updates and focused tests (including preserving max_count=0).

Written by Cursor Bugbot for commit 52e78bf. This will update automatically on new commits. Configure here.

@github-actions
Copy link

github-actions bot commented Mar 9, 2026

✱ Stainless preview builds

This PR will update the hypeman SDKs with the following commit message.

feat: Add scheduled instance snapshots with retention cleanup

Edit this comment to update it. It will appear in the SDK's changelogs.

hypeman-typescript studio · code · diff

generate ✅build ⏳lint ✅test ✅

hypeman-go studio · code · diff

generate ✅build ⏳lint ⏳test ⏳

go get github.com/stainless-sdks/hypeman-go@8bdcdd4ff8cff79d9d60cb3787b715fcf6f4e875
hypeman-openapi studio · code · diff

Your SDK build had at least one "note" diagnostic, but this did not represent a regression.
generate ✅

⏳ These are partial results; builds are still running.


This comment is auto-generated by GitHub Actions and is automatically kept up to date as you push.
If you push custom code to the preview branch, re-run this workflow to update the comment.
Last updated: 2026-03-24 16:58:05 UTC

@sjmiller609 sjmiller609 marked this pull request as ready for review March 9, 2026 05:21
Copy link

@cursor cursor bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Stale comment

Automated Risk Assessment

Medium-High risk. Review is required, so this PR is not approved by automation.

Evidence from the diff:

  • Adds a new background scheduler loop in cmd/api/main.go that periodically executes snapshot schedule runs.
  • Introduces substantial new scheduling and retention logic in lib/instances/snapshot_schedule.go, including automatic snapshot deletion paths.
  • Expands public API surface with new snapshot-schedule endpoints in cmd/api/api/snapshots.go and openapi.yaml (plus regenerated bindings in lib/oapi/oapi.go).
  • Changes are cross-cutting across API, runtime behavior, persistence, and cleanup semantics, increasing regression and operational blast radius.

Validation notes:

  • Attempted targeted tests, but they could not run in this environment due missing embedded binary assets under lib/vmm, lib/system, and lib/ingress.

Action taken:

  • Requested reviewers: rgarcia, hiroTamada.

Open in Web View Automation 

@cursor cursor bot requested review from hiroTamada and rgarcia March 9, 2026 05:30
@sjmiller609 sjmiller609 marked this pull request as draft March 9, 2026 15:57
@sjmiller609 sjmiller609 marked this pull request as ready for review March 9, 2026 16:00
Copy link

@cursor cursor bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Automated Risk Assessment

Medium-High risk. Review is required, so this PR is not approved by automation.

Assessment based on the code diff:

  • Adds a new minute-polling background scheduler in cmd/api/main.go that executes asynchronous snapshot work continuously.
  • Introduces substantial new scheduling/retention execution logic in lib/instances/snapshot_schedule.go, including automatic snapshot deletion paths.
  • Expands production API surface via new schedule endpoints in cmd/api/api/snapshots.go and schema/binding changes in openapi.yaml and lib/oapi/oapi.go.
  • Persists and loads schedule state on disk (lib/scheduledsnapshots/storage.go, lib/paths/paths.go), increasing operational/state-management complexity.

Reviewer handling:

  • Code review is required for this risk level.
  • No additional reviewer requests were made because 2 reviewers are already assigned (hiroTamada, rgarcia).

Open in Web View Automation 

@sjmiller609 sjmiller609 requested review from Sayan- and removed request for hiroTamada and rgarcia March 17, 2026 19:05
sjmiller609 and others added 3 commits March 19, 2026 17:53
…ers, remove dead code

- Extract failScheduleRun helper to deduplicate the "record error + save
  schedule" pattern repeated 4 times in runSnapshotScheduleForInstanceLocked
- Inline countScheduledSnapshots (single-use len() wrapper) at call site
- Inline validateSetSnapshotScheduleRequest (single-use wrapper) at call site
- Merge snapshot_schedule_types.go contents into snapshot_schedule.go
- Remove dead truncation/fallback code in BuildSnapshotName that could never
  execute because ValidateSetRequest already rejects invalid prefixes upstream

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Remove redundant nil-check-before-assign for pointer fields in
  snapshotScheduleToOAPI (assigning nil to a nil field is a no-op)
- Use var declaration instead of make([]error, 0) for error accumulator
  that is only appended to (not serialized to JSON)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…le-retention

# Conflicts:
#	lib/oapi/oapi.go
Copy link

@cursor cursor bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Fix All in Cursor

Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

sjmiller609 and others added 2 commits March 19, 2026 18:19
The scopes package (added on main) requires all routes to have scope
mappings. Map the three snapshot-schedule endpoints to existing snapshot
scopes: read, write, and delete.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copy link

@Sayan- Sayan- left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a few things to think about before this sees heavy use -- none of these need to block the PR but worth tracking:

scheduler robustness

the scheduler processes all due instances sequentially in a single goroutine with no per-snapshot timeout and no max-snapshots-per-tick limit. if a single createSnapshot hangs (e.g. VM pause never completes), it blocks the entire scheduler indefinitely. if many instances are due at once, the scheduler can fall behind the 1-minute tick -- and when it does, NextRun silently skips forward with no last_error, no metric, and no log. the user has no way to know a scheduled snapshot was missed.

consider: per-snapshot context timeout, a cap on snapshots per tick, and logging/recording when a schedule is skipped due to the scheduler falling behind.

thundering herd

SetSnapshotSchedule sets NextRunAt = now + interval with no jitter. instances created around the same time with the same interval will stay aligned forever, creating periodic load spikes. adding a small random offset on creation would spread the load.

ux nits

  • MinInterval = 1m is pretty aggressive for standby snapshots on running instances, which involve a pause/resume cycle. might be worth a higher floor (5m? 10m?) or at least documenting the guest-visible impact.
  • SetSnapshotSchedule resets NextRunAt on every update, even config-only changes like adjusting retention. if i just want to change max_count from 5 to 10, my next snapshot gets pushed out by a full interval. consider only resetting the timer when the interval itself changes.

@sjmiller609
Copy link
Collaborator Author

Addressed the targeted scheduler concerns from Sayans review in the latest pass:

  • fixed the schedule persistence bug so a run is recorded before snapshot side effects, which prevents duplicate scheduled snapshots after a late save failure
  • added regression coverage for post-snapshot save failure and cleanup-failure behavior
  • preserved next_run_at on config-only schedule updates instead of delaying the next capture
  • added deterministic per-instance cadence jitter when establishing a new schedule cadence to reduce thundering-herd alignment
  • documented the 1-minute minimum interval more clearly, including the pause/resume impact for running-instance scheduled captures

I left the broader scheduler hardening items as follow-up work for now: per-snapshot timeouts, per-tick caps, and missed-run/backlog visibility.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants