Fix Fleet API protection bugs and add two-phase secret deletion #1

Open
prashanthd409 wants to merge 5 commits into main from fleet-api-protection-pr

prashanthd409 commented Jan 30, 2026

Summary

Fixes critical bugs in the Fleet API transient issue protection and adds a two-phase deletion mechanism with a configurable grace period, preventing ArgoCD cluster secrets from being deleted when the Fleet API temporarily returns partial data.

Changes

Fixes the fallthrough bug where suspicious data was cached, adds mutex protection for concurrent map access, introduces annotation-based two-phase deletion with a 60s grace period, replaces time.Sleep with context-aware retries, reuses the Kubernetes clientset, and adds race-safe tests including an incident replay scenario.

Test plan

  • go test -v -race -count=1 ./... — all tests pass, race detector clean
  • go vet ./... — no issues
  • Incident replay test: 14→9→14 in 22s → zero deletions
  • Two-phase lifecycle: mark → wait → delete OR mark → recover
  • HTTP handler error path terminates correctly
  • Fallthrough: transient + no cache → error, not bad data
  • Concurrent Refresh + PluginResults under -race

Detailed changes

Bug fixes

  1. Fallthrough bug (fleetclient.go): When a transient issue was detected on the final retry and the cache had expired, execution fell through, cached the suspicious data, and returned it. It now returns an error.
  2. Race condition (fleetclient.go): MembershipTenancyMapCache and ScopeTenancyMapCache were written by Refresh() without a lock while HTTP handlers read them. Added a sync.RWMutex.
  3. Missing return (main.go): After a PluginResults error, the HTTP handler wrote a 500 but continued on to write a second response. Added the missing return.
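
Fixes 2 and 3 above can be illustrated with a minimal sketch. It is not the code from fleetclient.go or main.go: the `tenancyCache` type, `resultsHandler`, `statusFor`, and `errUnavailable` are hypothetical names chosen for illustration, and the real caches and handler may be shaped differently.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync"
)

// errUnavailable is a hypothetical error used only by this sketch.
var errUnavailable = errors.New("fleet unavailable")

// tenancyCache is a hypothetical stand-in for the PR's map caches.
type tenancyCache struct {
	mu   sync.RWMutex
	data map[string]string
}

// Replace swaps the whole map under the write lock (the Refresh side).
func (c *tenancyCache) Replace(m map[string]string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.data = m
}

// Get reads one entry under the read lock (the HTTP handler side).
func (c *tenancyCache) Get(key string) (string, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	v, ok := c.data[key]
	return v, ok
}

// resultsHandler writes exactly one response: on error it sends a 500 and returns.
func resultsHandler(results func() (string, error)) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		body, err := results()
		if err != nil {
			http.Error(w, err.Error(), http.StatusInternalServerError)
			return // without this return, the success body below would also be written
		}
		fmt.Fprint(w, body)
	}
}

// statusFor runs the handler against an in-memory recorder and reports what it wrote.
func statusFor(results func() (string, error)) (int, string) {
	rec := httptest.NewRecorder()
	req := httptest.NewRequest(http.MethodGet, "/", nil)
	resultsHandler(results)(rec, req)
	return rec.Code, rec.Body.String()
}

func main() {
	c := &tenancyCache{}
	c.Replace(map[string]string{"membership-a": "tenant-1"})
	v, _ := c.Get("membership-a")
	fmt.Println(v)

	code, body := statusFor(func() (string, error) { return "", errUnavailable })
	fmt.Println(code, body)
}
```

Replacing the whole map under the write lock keeps readers from ever observing a half-updated cache, and the early return keeps the error path from appending a success body to a response that has already been sent as a 500.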

Safety features

  1. Two-phase deletion (fleetclient.go): pruneSecrets no longer deletes immediately. Absent secrets get a fleet.gke.io/absent-since annotation and are deleted only after a configurable grace period (default 60s). Recovered memberships have the annotation removed.
  2. Context-aware retry (fleetclient.go): time.Sleep replaced with select on time.After/ctx.Done() so retries respect context cancellation.
  3. Goroutine leak fix (fleetclient.go): startReconcile now uses time.Ticker + ctx.Done() instead of infinite for/time.Sleep.
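
The two-phase lifecycle (mark, then wait, then delete or recover) can be sketched as a pure decision function. This is an illustration under assumptions, not the PR's pruneSecrets: the `decide` function, the `action` values, the `epoch` and `grace` variables, and the RFC 3339 timestamp format for the annotation value are all hypothetical. Only the annotation key and the 60s default come from the description above.

```go
package main

import (
	"fmt"
	"time"
)

// absentSinceAnnotation is the annotation key named in the PR description.
const absentSinceAnnotation = "fleet.gke.io/absent-since"

// grace mirrors the documented default; the real value is configurable.
const grace = 60 * time.Second

// epoch is an arbitrary fixed instant used by the demo below.
var epoch = time.Date(2026, time.January, 30, 22, 0, 0, 0, time.UTC)

type action int

const (
	actionNone   action = iota // leave the secret alone
	actionMark                 // phase 1: record when the membership went missing
	actionUnmark               // membership is back: clear the annotation
	actionDelete               // phase 2: grace period elapsed, delete
)

// decide returns what to do with one secret. It assumes the annotation value
// is an RFC 3339 timestamp, which the PR description does not specify.
func decide(annotations map[string]string, present bool, now time.Time, grace time.Duration) action {
	since, marked := annotations[absentSinceAnnotation]
	if present {
		if marked {
			return actionUnmark
		}
		return actionNone
	}
	if !marked {
		return actionMark
	}
	t, err := time.Parse(time.RFC3339, since)
	if err != nil {
		// Unreadable timestamp: restart the grace period instead of deleting.
		return actionMark
	}
	if now.Sub(t) >= grace {
		return actionDelete
	}
	return actionNone
}

func main() {
	marked := map[string]string{absentSinceAnnotation: epoch.Format(time.RFC3339)}
	// Membership still absent 22s after being marked: no deletion yet.
	fmt.Println(decide(marked, false, epoch.Add(22*time.Second), grace) == actionNone)
	// Membership reappears: the mark is cleared.
	fmt.Println(decide(marked, true, epoch.Add(22*time.Second), grace) == actionUnmark)
}
```

Under these assumptions, a dip that recovers within 22s, as in the incident replay test, never reaches the delete branch because the elapsed time stays below the grace period.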

Performance

  1. Reuse Kubernetes clientset (fleetclient.go): Created once in NewFleetSync() instead of every reconciliation cycle.
  2. Parse template once (fleetclient.go): Package-level template.Must instead of per-iteration parsing.
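
As a rough illustration of item 2, a package-level `template.Must` parses the template once at program start and panics early if the template text is invalid. The template text, the `membership` type, and `renderName` below are hypothetical. The PR description does not show the actual template or say whether it uses text/template or html/template.

```go
package main

import (
	"bytes"
	"fmt"
	"text/template"
)

// secretNameTmpl is parsed once at package initialization.
// The template text is a made-up example, not the one used in fleetclient.go.
var secretNameTmpl = template.Must(
	template.New("secretName").Parse("cluster-{{.Project}}-{{.Name}}"),
)

// membership is a hypothetical input type for the template.
type membership struct {
	Project string
	Name    string
}

// renderName executes the pre-parsed template; no parsing happens per call.
func renderName(m membership) (string, error) {
	var buf bytes.Buffer
	if err := secretNameTmpl.Execute(&buf, m); err != nil {
		return "", err
	}
	return buf.String(), nil
}

func main() {
	name, err := renderName(membership{Project: "demo", Name: "east"})
	if err != nil {
		panic(err)
	}
	fmt.Println(name)
}
```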

CI & config

  1. Race detector in CI (golangci-lint.yml): Added go test -v -race -count=1 ./... step.
  2. ApplicationSet demo (applicationset-demo.yaml): Updated to applicationsSync: create-update and preserveResourcesOnDeletion: true.
  3. New env var: DELETION_GRACE_PERIOD_SECONDS (default 60).
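
One plausible way to read the new variable is sketched below. The `gracePeriod` helper and its choice to fall back to the default on a missing, empty, negative, or non-numeric value are assumptions. The PR description states only the variable name and the default of 60.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"time"
)

// defaultGraceSeconds is the default stated in the PR description.
const defaultGraceSeconds = 60

// gracePeriod reads DELETION_GRACE_PERIOD_SECONDS through the supplied lookup
// function (os.LookupEnv in production) so it can be exercised without
// touching the process environment. The fallback behavior is an assumption.
func gracePeriod(lookup func(string) (string, bool)) time.Duration {
	def := defaultGraceSeconds * time.Second
	raw, ok := lookup("DELETION_GRACE_PERIOD_SECONDS")
	if !ok || raw == "" {
		return def
	}
	secs, err := strconv.Atoi(raw)
	if err != nil || secs < 0 {
		return def
	}
	return time.Duration(secs) * time.Second
}

func main() {
	fmt.Println(gracePeriod(os.LookupEnv))
}
```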

Backward compatibility

  • Zero breaking changes to Connect Gateway URLs, secret format, HTTP API, or PluginResults
  • Old code ignores fleet.gke.io/absent-since annotation — clean rollback
  • Legitimate removals wait out the grace period (60s by default) before deletion (intentional)

During Redis restarts or maintenance, Fleet API occasionally returns
incomplete responses, causing ArgoCD to delete applications.

Solution adds three-layer protection:
- Detection: Identifies transient issues by pattern analysis
- Retry: 3 attempts with exponential backoff
- Cache: Falls back to last known good response
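
The three layers, together with the later fix that returns an error when there is no cache to fall back on, can be sketched as follows. This is a simplified illustration, not the protection/ package: `fetchProtected`, `sleepCtx`, `replay`, and `errTransientNoCache` are hypothetical names, and the "fewer entries than expected" check stands in for the real pattern analysis, which is not described here.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// errTransientNoCache is a hypothetical error for "suspicious data, nothing cached".
var errTransientNoCache = errors.New("transient Fleet API response and no valid cache")

// sleepCtx waits for d, or returns early with ctx.Err() if ctx is cancelled.
func sleepCtx(ctx context.Context, d time.Duration) error {
	select {
	case <-time.After(d):
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

// fetchProtected retries fetch with exponential backoff while the response
// fails or looks suspicious, then falls back to cached. With no cache it
// returns an error instead of the suspicious data.
func fetchProtected(ctx context.Context, fetch func() ([]string, error),
	suspicious func([]string) bool, cached []string,
	attempts int, base time.Duration) ([]string, error) {
	delay := base
	for i := 0; i < attempts; i++ {
		got, err := fetch()
		if err == nil && !suspicious(got) {
			return got, nil
		}
		if i < attempts-1 {
			if err := sleepCtx(ctx, delay); err != nil {
				return nil, err
			}
			delay *= 2
		}
	}
	if cached != nil {
		return cached, nil
	}
	return nil, errTransientNoCache
}

// replay feeds a fixed, non-empty sequence of responses through fetchProtected,
// repeating the last one. A response is treated as suspicious when it has fewer
// than minExpected entries (an assumed stand-in for the real detection).
func replay(responses [][]string, cached []string, minExpected int) ([]string, error) {
	i := 0
	fetch := func() ([]string, error) {
		r := responses[i]
		if i < len(responses)-1 {
			i++
		}
		return r, nil
	}
	suspicious := func(got []string) bool { return len(got) < minExpected }
	return fetchProtected(context.Background(), fetch, suspicious, cached, 3, time.Millisecond)
}

func main() {
	// Partial response on every attempt and nothing cached: error, not bad data.
	_, err := replay([][]string{{"a"}}, nil, 2)
	fmt.Println(err)
}
```

Waiting in a `select` on `ctx.Done()` rather than calling `time.Sleep` lets a shutdown or request timeout interrupt the backoff immediately.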

Changes:
- Add protection/ package for detection and caching
- Update fleetclient.go with retry logic
- Add configuration via environment variables
- Add comprehensive test coverage (29 tests passing)

Fixes application deletions during Redis HA failover
prashanthd409 force-pushed the fleet-api-protection-pr branch from 9725aa5 to 4e02c1b on January 30, 2026 22:03
Fix fallthrough bug where suspicious data was cached when transient
detection fired with an expired cache, fix data race on map caches,
add two-phase deletion with grace period to prevent secret loss from
transient Fleet API partial responses, and add race-safe tests.
prashanthd409 changed the title from "Fleet api protection pr" to "Fix Fleet API protection bugs and add two-phase secret deletion" on Mar 16, 2026
Pin all action references to full-length commit SHAs to prevent
supply-chain attacks via mutable tags. Add dependabot.yml for
automated dependency updates on Go modules and GitHub Actions.

2 participants