Skip to content

upgradeDefaultProxyOnStartup races against reconciler, produces consistent ERROR logs on every operator restart #764

@chung-x-hr

Description

@chung-x-hr

Problem

Every time the operator pod starts, the logs reliably contain multiple ERROR lines like:

```
ERROR unable to update AuthProxyWorkload during startup upgrade
{"error": "Operation cannot be fulfilled on authproxyworkloads.cloudsql.cloud.google.com
"authproxyworkload-": the object has been modified; please apply your changes
to the latest version and try again"}
github.com/GoogleCloudPlatform/cloud-sql-proxy-operator/internal/controller.(*upgradeDefaultProxyOnStartup).Start
/work/internal/controller/proxy_image_upgrade.go:58
```

In our cluster (~40 AuthProxyWorkload resources across ~25 namespaces in prod), ≥10 of these errors fire on every operator restart.

Root cause

`upgradeDefaultProxyOnStartup.Start` (proxy_image_upgrade.go:38–62) does:

  1. `List` all `AuthProxyWorkload` resources (snapshot at resourceVersion T).
  2. For each one with a default proxy image, call `c.c.Update(ctx, &p)` with the (now potentially stale) object.

But the `AuthProxyWorkload` reconciler is already running by the time `Start` is called (both are registered as runnables on the manager, and reconcile is event-driven so it fires as soon as the cache is populated). The reconciler typically updates the resource's status/annotations during its reconcile pass.

So the timeline is:

  • T0: `Start` calls `c.c.List` → snapshot at resourceVersion V₀.
  • T1: reconciler fires for `apw1` → updates apw1 → resourceVersion now V₁.
  • T2: `Start` iterates to `apw1`, calls `c.c.Update(ctx, &p)` with resourceVersion V₀ → apiserver rejects (409, "the object has been modified").

The conflict is inevitable for any workload that the reconciler manages to touch between `List` and `Update`.

`NeedLeaderElection()` returning `true` means this happens on the single leader pod; replicas count doesn't matter — we observed the same pattern with `replicas: 1` and `replicas: 2`.

Functional impact

The startup-upgrade fails silently for the conflicted workloads, but the regular reconcile loop subsequently catches up — workloads with default proxy images still get the new image via normal reconciliation. So no behavioral problem, just persistent log noise that pollutes alerting on operator restarts.

Reproduction

Any cluster with multiple `AuthProxyWorkload` resources will reproduce this on every operator restart. Restart the operator pod and check logs.

Proposed fixes (in increasing scope)

  1. Retry-on-conflict in the startup-upgrade goroutine. Wrap the `c.c.Update(ctx, &p)` call in `retry.RetryOnConflict` from k8s.io/client-go/util/retry. On 409, re-fetch and re-apply. Cheap fix; zero behavior change.

  2. Add a flag to disable. `--upgrade-on-startup=true|false` (default `true` for back-compat). Lets operators opt out if image-version drift on operator restart isn't a concern for their use case.

  3. Replace startup-upgrade with annotation-based trigger. Instead of forcing an Update on every restart, write the operator's default image version to a ConfigMap/Lease/CRD-status field. The reconciler can compare and trigger a reconcile only when the image version changes. Eliminates the race entirely; only does work when there's actually work to do.

Option (1) is the smallest patch and would resolve the noise. Option (2) gives operators a knob. Option (3) is the cleanest long-term shape but is a bigger refactor.

Environment

Happy to send a PR for option (1) if it'd be welcome.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions