upgradeDefaultProxyOnStartup races against reconciler, produces consistent ERROR logs on every operator restart

### Problem

Every time the operator pod starts, the logs reliably contain multiple ERROR lines like:

\`\`\`
ERROR unable to update AuthProxyWorkload during startup upgrade
  {"error": "Operation cannot be fulfilled on authproxyworkloads.cloudsql.cloud.google.com
   \"authproxyworkload-<name>\": the object has been modified; please apply your changes
   to the latest version and try again"}
github.com/GoogleCloudPlatform/cloud-sql-proxy-operator/internal/controller.(*upgradeDefaultProxyOnStartup).Start
    /work/internal/controller/proxy_image_upgrade.go:58
\`\`\`

In our cluster (~40 AuthProxyWorkload resources across ~25 namespaces in prod), ≥10 of these errors fire on every operator restart.

### Root cause

\`upgradeDefaultProxyOnStartup.Start\` ([proxy_image_upgrade.go:38–62](https://github.com/GoogleCloudPlatform/cloud-sql-proxy-operator/blob/main/internal/controller/proxy_image_upgrade.go#L38-L62)) does:

1. \`List\` all \`AuthProxyWorkload\` resources (snapshot at resourceVersion T).
2. For each one with a default proxy image, call \`c.c.Update(ctx, &p)\` with the (now potentially stale) object.

But the \`AuthProxyWorkload\` reconciler is **already running** by the time \`Start\` is called (both are registered as runnables on the manager, and reconcile is event-driven so it fires as soon as the cache is populated). The reconciler typically updates the resource's status/annotations during its reconcile pass.

So the timeline is:
- T0: \`Start\` calls \`c.c.List\` → snapshot at resourceVersion V₀.
- T1: reconciler fires for \`apw1\` → updates apw1 → resourceVersion now V₁.
- T2: \`Start\` iterates to \`apw1\`, calls \`c.c.Update(ctx, &p)\` with resourceVersion V₀ → apiserver rejects (409, "the object has been modified").

The conflict is inevitable for any workload that the reconciler manages to touch between \`List\` and \`Update\`.

\`NeedLeaderElection()\` returning \`true\` means this happens on the single leader pod; replicas count doesn't matter — we observed the same pattern with \`replicas: 1\` and \`replicas: 2\`.

### Functional impact

The startup-upgrade fails silently for the conflicted workloads, but the regular reconcile loop subsequently catches up — workloads with default proxy images still get the new image via normal reconciliation. So **no behavioral problem**, just persistent log noise that pollutes alerting on operator restarts.

### Reproduction

Any cluster with multiple \`AuthProxyWorkload\` resources will reproduce this on every operator restart. Restart the operator pod and check logs.

### Proposed fixes (in increasing scope)

1. **Retry-on-conflict in the startup-upgrade goroutine.** Wrap the \`c.c.Update(ctx, &p)\` call in \`retry.RetryOnConflict\` from k8s.io/client-go/util/retry. On 409, re-fetch and re-apply. Cheap fix; zero behavior change.

2. **Add a flag to disable.** \`--upgrade-on-startup=true|false\` (default \`true\` for back-compat). Lets operators opt out if image-version drift on operator restart isn't a concern for their use case.

3. **Replace startup-upgrade with annotation-based trigger.** Instead of forcing an Update on every restart, write the operator's default image version to a ConfigMap/Lease/CRD-status field. The reconciler can compare and trigger a reconcile only when the image version changes. Eliminates the race entirely; only does work when there's actually work to do.

Option (1) is the smallest patch and would resolve the noise. Option (2) gives operators a knob. Option (3) is the cleanest long-term shape but is a bigger refactor.

### Environment

- Operator version: \`v1.7.5\` (build \`3db049b\`)
- Manager runtime: \`go1.26.0 linux/amd64\`
- Kubernetes: \`v1.35.3-gke.1389000\` (GKE)
- Manifest source: \`https://storage.googleapis.com/cloud-sql-connectors/cloud-sql-proxy-operator/v1.7.5/cloud-sql-proxy-operator.yaml\`

Happy to send a PR for option (1) if it'd be welcome.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

upgradeDefaultProxyOnStartup races against reconciler, produces consistent ERROR logs on every operator restart #764

Problem

Root cause

Functional impact

Reproduction

Proposed fixes (in increasing scope)

Environment

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

upgradeDefaultProxyOnStartup races against reconciler, produces consistent ERROR logs on every operator restart #764

Description

Problem

Root cause

Functional impact

Reproduction

Proposed fixes (in increasing scope)

Environment

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions