🐛 Make deployments HA-ready with configurable replica count and update upgrade-e2e test cases #2371
jianzhangbjz wants to merge 1 commit into operator-framework:main
Conversation
❌ Deploy Preview for olmv1 failed.
/retest
Encountered this error again:
=== RUN TestClusterExtensionInstallRegistry/no_registry_configuration_necessary
cluster_extension_install_test.go:57: When a cluster extension is installed from a catalog
cluster_extension_install_test.go:58: When the extension bundle format is registry+v1
helpers.go:263: Ensuring ClusterCatalog has Status.Condition of Progressing with a status == True and reason == Succeeded
helpers.go:273: Checking that catalog has the expected metadata label
helpers.go:278: Ensuring ClusterCatalog has Status.Condition of Type = Serving with status == True
cluster_extension_install_test.go:79: It resolves the specified package with correct bundle path
cluster_extension_install_test.go:80: By creating the ClusterExtension resource
cluster_extension_install_test.go:83: By eventually reporting a successful resolution and bundle path
cluster_extension_install_test.go:88: By eventually reporting progressing as True
cluster_extension_install_test.go:89:
Error Trace: /home/runner/work/operator-controller/operator-controller/test/e2e/cluster_extension_install_test.go:94
/opt/hostedtoolcache/go/1.24.6/x64/src/runtime/asm_amd64.s:1700
Error: Not equal:
expected: "Succeeded"
actual : "Retrying"
Diff:
--- Expected
+++ Actual
@@ -1 +1 @@
-Succeeded
+Retrying
cluster_extension_install_test.go:89:
Error Trace: /home/runner/work/operator-controller/operator-controller/test/e2e/cluster_extension_install_test.go:89
Error: Condition never satisfied
Test: TestClusterExtensionInstallRegistry/no_registry_configuration_necessary
Failed to list cluster extensions: no matches for kind "ClusterExtensionRevision" in version "olm.operatorframework.io/v1"
helpers.go:321: By deleting ClusterCatalog "test-catalog-jnwbqjlw"
helpers.go:330: By deleting ClusterExtension "clusterextension-fqz4fls8"
helpers.go:294: By waiting for CustomResourceDefinitions of "clusterextension-fqz4fls8" to be deleted
helpers.go:302: By waiting for ClusterRoleBindings of "clusterextension-fqz4fls8" to be deleted
helpers.go:310: By waiting for ClusterRoles of "clusterextension-fqz4fls8" to be deleted
helpers.go:340: By deleting ServiceAccount "clusterextension-fqz4fls8"
helpers.go:349: By deleting Namespace "clusterextension-fqz4fls8"
Changing the number of replicas directly impacts the expectations of our e2e tests. Perhaps default it to 1, or update the tests with some additional logic. Allowing us to configure the number of replicas and other values will give us better downstream control.
Codecov Report
✅ All modified and coverable lines are covered by tests.

@@ Coverage Diff @@
##             main    #2371      +/-   ##
==========================================
+ Coverage   65.84%   67.77%   +1.92%
==========================================
  Files         137      137
  Lines        9560     9560
==========================================
+ Hits         6295     6479     +184
+ Misses       2795     2580     -215
- Partials      470      501      +31
It failed again, see https://github.com/operator-framework/operator-controller/actions/runs/19926553979/job/57127694168?pr=2371 for more:
./testdata/build-test-registry.sh operator-controller-e2e docker-registry localhost/e2e-test-registry:devel
namespace/operator-controller-e2e created
certificate.cert-manager.io/operator-controller-e2e-registry created
deployment.apps/docker-registry created
service/docker-registry created
error: timed out waiting for the condition on deployments/docker-registry
make: *** [Makefile:267: image-registry] Error 1
Hi @tmshort, I have updated all the related e2e test cases. Could you take a look? Thanks!
pedjak
left a comment
Our upstream e2e tests do not need an HA setup to run; please keep them as lean as they were before this PR.
  type: RollingUpdate
  rollingUpdate:
-   maxSurge: 1 # Allow temporary 2 pods (1 + 1) for zero-downtime updates
+   maxSurge: 1 # Allow temporary extra pod (e.g., 3 pods with 2 replicas) for zero-downtime updates
From an operational point of view this is a no-op change; please revert it.
Updating the comment makes sense, especially since the number of pods is based on the deployment configuration, but I'd suggest something else.
The comment is necessary so that others can understand the reasoning.
  type: RollingUpdate
  rollingUpdate:
-   maxSurge: 1 # Allow temporary 2 pods (1 + 1) for zero-downtime updates
+   maxSurge: 1 # Allow temporary extra pod (e.g., 3 pods with 2 replicas) for zero-downtime updates
helm/olmv1/values.yaml
enabled: true
deployment:
  image: quay.io/operator-framework/operator-controller:devel
  replicas: 2
I think the default should remain 1, and podDisruptionBudget.enabled should default to false as well. >1 is required for HA setups. We should not implicitly enforce after an upgrade that OLM suddenly consumes more cluster resources (i.e., pods). Such things should not be hidden from users.
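A quick way to see the deadlock being discussed: a PodDisruptionBudget's allowed disruptions are roughly the number of healthy pods minus minAvailable, so with 1 replica and minAvailable: 1 no voluntary eviction (e.g., a node drain during upgrade) is ever permitted. A minimal sketch of that arithmetic (the helper name is ours, not from the PR):

```go
package main

import "fmt"

// allowedDisruptions mirrors how a PDB with minAvailable is evaluated:
// the number of pods that may be voluntarily evicted is the number of
// healthy pods minus minAvailable, floored at zero.
func allowedDisruptions(healthyPods, minAvailable int32) int32 {
	d := healthyPods - minAvailable
	if d < 0 {
		return 0
	}
	return d
}

func main() {
	// 1 replica with minAvailable=1: evictions are always blocked.
	fmt.Println(allowedDisruptions(1, 1)) // 0
	// 2 replicas with minAvailable=1: one pod may be evicted at a time.
	fmt.Println(allowedDisruptions(2, 1)) // 1
}
```

This is why the thread treats "replicas: 1 plus an enabled PDB" as a deadlock-prone default.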
I generally agree with @pedjak here; we should default to 1 upstream but allow the configuration to be updated downstream. However, because we are allowing changes here, we will need to ensure that the e2e tests can still be run upstream and downstream.
Thinking about it more, maybe we want to use 1 for the default and put 2 in experimental? That way, we can test both values.
Community users will also encounter a similar upgrade error (OCPBUGS-62517) if we disable the PDB by default.
In my opinion, we should maintain consistency between the upstream and downstream projects. This is also one of the core principles of open source: fostering alignment, collaboration, and transparency across the entire ecosystem. When upstream and downstream diverge without clear justification, it creates unnecessary complexity, increases maintenance overhead, and makes it harder for users and contributors to understand or adopt the project.
By keeping the two sides consistent, we ensure that new features, bug fixes, and improvements flow smoothly in both directions. It also strengthens long-term sustainability, because the community can rely on a unified architecture and shared expectations. Ultimately, this consistency reinforces the value of open source itself—working together toward a common, coherent, and maintainable solution.
"The community user will also encounter a similar upgrade error"

As usual, it depends. IMHO, this is a downstream bug, because the downstream has certain expectations about component availability during upgrade. If the OLM deployment has just a single pod, it is perfectly normal for the deployment to become briefly unavailable during upgrade, and many upstream users might be OK with that, especially because OLM is a controller that performs reconciliations asynchronously (i.e., eventually). All ClusterExtensions created or modified during an OLM upgrade are not lost and will be reconciled, so no service interruption happens at all.
Having said that, it is also perfectly fine to add knobs upstream so that the number of pods and PDBs can be configured for users with stronger requirements, but we should not enforce them, especially if that means that suddenly running OLM requires twice as many resources.
I would still want this to be 1 for the standard setup (and tests).
We can set this to 2 in experimental.yaml to also test the multi-replica e2e without forcing the default user to have 2 replicas.
helm/olmv1/values.yaml
enabled: true
deployment:
  image: quay.io/operator-framework/catalogd:devel
  replicas: 2
manifests/experimental-e2e.yaml
  type: RollingUpdate
  rollingUpdate:
-   maxSurge: 1 # Allow temporary 2 pods (1 + 1) for zero-downtime updates
+   maxSurge: 1 # Allow temporary extra pod (e.g., 3 pods with 2 replicas) for zero-downtime updates
 artifactName = "operator-controller-e2e"
-pollDuration = time.Minute
+// pollDuration is set to 3 minutes to account for leader election time in multi-replica deployments.
Our e2e tests do not need to run in an HA setup now; having just a single replica should be fine. If additional HA tests are needed, they should be provided downstream.
However, the upstream e2e's do run downstream as well, and we need to ensure they work there too.
Sure, but if I am not mistaken, OCPBUGS-62517 reports a downstream test failing, not an upstream one?
Yes, but it's all the same downstream (OCP) configuration. We can't change the OCP configuration just to run our upstream e2e's; our upstream e2e's are expected to run mostly unmodified on OCP.
The upstream e2e will encounter 2 replicas when run in OCP (because that's the OCP configuration), but will encounter 1 replica upstream (because that's the upstream configuration).
And because we'd also allow the user to support multiple replicas upstream, our upstream e2e should accommodate that.
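One way to make the e2e replica-agnostic, as argued here, is to assert against the deployment's own Spec.Replicas rather than a hardcoded count, so the same check passes with 1 replica upstream and 2 on OCP. An illustrative sketch with plain types standing in for the appsv1.DeploymentStatus fields (type and helper names are ours):

```go
package main

import "fmt"

// rolloutStatus holds the replica counters the tests compare, mirroring
// the relevant fields of a Kubernetes DeploymentStatus.
type rolloutStatus struct {
	Updated, Total, Available, Ready int32
}

// rolledOut reports whether a deployment has fully converged on its
// desired replica count, whatever that count is configured to be.
func rolledOut(desired int32, s rolloutStatus) bool {
	return s.Updated == desired && s.Total == desired &&
		s.Available == desired && s.Ready == desired
}

func main() {
	fmt.Println(rolledOut(1, rolloutStatus{1, 1, 1, 1})) // true
	fmt.Println(rolledOut(2, rolloutStatus{2, 2, 2, 1})) // false: one pod not ready yet
}
```

The post-upgrade tests in this PR follow the same pattern by comparing each status counter against *managerDeployment.Spec.Replicas.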
"our upstream e2e should accommodate that."

Agreed that our e2e tests should be agnostic to the underlying setup, but our upstream e2e CI jobs should run against an OLM deployment with a single replica (as they have until now).
Based on the latest update, the standard test-e2e will use a single replica. Running test-experimental-e2e will use two replicas. Installing either standard or experimental (outside of the e2e) will use a single replica.
// Now fetch pods and find the one matching the lease holder
var managerPods corev1.PodList
if err := c.List(ctx, &managerPods, client.MatchingLabelsSelector{Selector: deploymentLabelSelector}); err != nil {
No need to list pods: after finding out the holder identity, we can directly fetch the particular pod with the given name.
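The direct-fetch approach the reviewer describes hinges on extracting the pod name from the Lease's holder identity, which (per the code later in this thread) has the form "pod-name_hash". A minimal sketch of that parsing step (the helper name is ours):

```go
package main

import (
	"fmt"
	"strings"
)

// parseHolderPodName extracts the pod name from a leader-election Lease
// holder identity of the form "<pod-name>_<random-suffix>".
func parseHolderPodName(holderIdentity string) string {
	return strings.Split(holderIdentity, "_")[0]
}

func main() {
	// With the name in hand, the leader pod can be fetched directly
	// (e.g. via a client Get) instead of listing all matching pods.
	fmt.Println(parseHolderPodName("catalogd-controller-manager-7d4f9_1699"))
}
```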
You need to use the standard GitHub mechanisms to rerun tests: click on the failed test, and then there is a "Re-run jobs" button in the upper right. But... we shouldn't really fail that often.
This should be rebased given the basic changes to support replicas in the tests.
According to "Re-running workflows and jobs": people with write permissions to a repository can re-run workflows in the repository. I believe this is the reason why I cannot find the "Re-run jobs" button.
/retest experimental-e2e |
@jianzhangbjz: No presubmit jobs available for operator-framework/operator-controller@main

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
// CRDs may take an extra long time to be deleted, and may run into the following error:
// Use 2 minute timeout for cleanup operations to ensure they complete within the test timeout.
// This is shorter than pollDuration (3 min) to leave buffer for the overall test suite.
cleanupTimeout := 2 * time.Minute
This is actually a shorter timeout: it was originally 5*pollDuration, which equals 5 * time.Minute. Do we need this change?
@joelanford are you still looking to minimize the diffs for the polling timeouts?
PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
Pull request overview
This PR aims to make OLMv1 control-plane deployments more HA-friendly by introducing configurable replica counts (and an HA values overlay), regenerating manifests accordingly, and updating upgrade/e2e-related scripts and helpers to be more resilient to slower rollouts/leader elections.
Changes:
- Add Helm values for configurable replicas for operator-controller and catalogd, and update Deployment templates to use them.
- Introduce an HA-focused Helm values overlay (helm/high-availability.yaml) and wire it into experimental-e2e manifest generation (including pod anti-affinity).
- Adjust e2e/upgrade tooling and timeouts (new pre-upgrade setup script, longer waits, updated helper timeouts, new post-upgrade test file).
Reviewed changes
Copilot reviewed 13 out of 13 changed files in this pull request and generated 9 comments.
Show a summary per file
| File | Description |
|---|---|
| testdata/build-test-registry.sh | Increase kubectl wait timeouts for registry deploy/job completion. |
| test/upgrade-e2e/post_upgrade_test.go | Add post-upgrade validation helpers/tests (leader election + reconciliation verification). |
| test/helpers/helpers.go | Introduce more tailored polling durations; adjust cleanup timing and namespace deletion timeout. |
| manifests/standard.yaml | Comment-only tweak for RollingUpdate maxSurge description. |
| manifests/standard-e2e.yaml | Comment-only tweak for RollingUpdate maxSurge description. |
| manifests/experimental.yaml | Comment-only tweak for RollingUpdate maxSurge description. |
| manifests/experimental-e2e.yaml | Render experimental-e2e with 2 replicas and preferred pod anti-affinity for HA testing. |
| helm/olmv1/values.yaml | Add replicas values for operator-controller and catalogd. |
| helm/olmv1/templates/deployment-olmv1-system-operator-controller-controller-manager.yml | Use .Values.options.operatorController.deployment.replicas instead of hardcoded 1. |
| helm/olmv1/templates/deployment-olmv1-system-catalogd-controller-manager.yml | Use .Values.options.catalogd.deployment.replicas instead of hardcoded 1. |
| helm/high-availability.yaml | New HA values overlay (replicas=2, preferred pod anti-affinity). |
| hack/test/pre-upgrade-setup.sh | New script to seed ClusterCatalog/ClusterExtension + RBAC for upgrade testing. |
| Makefile | Include HA overlay in experimental-e2e manifests; extend wait timeouts for catalogd/cert readiness. |
package upgradee2e

import (
	"bufio"
	"context"
	"fmt"
	"strings"
	"testing"
	"time"

	"github.com/stretchr/testify/assert"
	"github.com/stretchr/testify/require"
	appsv1 "k8s.io/api/apps/v1"
	coordinationv1 "k8s.io/api/coordination/v1"
	corev1 "k8s.io/api/core/v1"
	apimeta "k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/labels"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/apimachinery/pkg/util/wait"
	"sigs.k8s.io/controller-runtime/pkg/client"

	ocv1 "github.com/operator-framework/operator-controller/api/v1"
	testutil "github.com/operator-framework/operator-controller/internal/shared/util/test"
)
This new test file references several identifiers that are not defined anywhere in the test/upgrade-e2e package (e.g., c, cfg, kclientset, testClusterCatalogName, testClusterExtensionName). As-is, go test ./test/upgrade-e2e will fail to compile. Consider initializing the controller-runtime client + clientset in this package (or importing them via exported helpers), and plumb the test resource names into these tests (e.g., via env vars written by the pre-upgrade setup script).
func TestClusterCatalogUnpacking(t *testing.T) {
	ctx := context.Background()

	t.Log("Checking that the controller-manager deployment is updated")
	managerLabelSelector := labels.Set{"app.kubernetes.io/name": "catalogd"}
	var managerDeployment appsv1.Deployment
	require.EventuallyWithT(t, func(ct *assert.CollectT) {
		var managerDeployments appsv1.DeploymentList
		err := c.List(ctx, &managerDeployments, client.MatchingLabels(managerLabelSelector), client.InNamespace("olmv1-system"))
		require.NoError(ct, err)
		require.Len(ct, managerDeployments.Items, 1)
		managerDeployment = managerDeployments.Items[0]
		require.Equal(ct, *managerDeployment.Spec.Replicas, managerDeployment.Status.UpdatedReplicas)
		require.Equal(ct, *managerDeployment.Spec.Replicas, managerDeployment.Status.Replicas)
		require.Equal(ct, *managerDeployment.Spec.Replicas, managerDeployment.Status.AvailableReplicas)
		require.Equal(ct, *managerDeployment.Spec.Replicas, managerDeployment.Status.ReadyReplicas)
	}, time.Minute, time.Second)

	t.Log("Waiting for controller-manager pods to match the desired replica count")
	require.EventuallyWithT(t, func(ct *assert.CollectT) {
		var managerPods corev1.PodList
		err := c.List(ctx, &managerPods, client.MatchingLabels(managerLabelSelector))
		require.NoError(ct, err)
		require.Len(ct, managerPods.Items, int(*managerDeployment.Spec.Replicas))
	}, time.Minute, time.Second)

	t.Log("Waiting for acquired leader election")
	leaderCtx, leaderCancel := context.WithTimeout(ctx, pollDuration)
	defer leaderCancel()

	// When there are multiple replicas, find the leader pod
	managerPod, err := findLeaderPod(leaderCtx, "catalogd")
	require.NoError(t, err)
	require.NotNil(t, managerPod)

	t.Log("Reading logs to make sure that ClusterCatalog was reconciled by catalogdv1")
	logCtx, cancel := context.WithTimeout(ctx, time.Minute)
	defer cancel()
	substrings := []string{
		"reconcile ending",
		fmt.Sprintf(`ClusterCatalog=%q`, testClusterCatalogName),
	}
	found, err := watchPodLogsForSubstring(logCtx, managerPod, substrings...)
	require.NoError(t, err)
	require.True(t, found)

	catalog := &ocv1.ClusterCatalog{}
	t.Log("Ensuring ClusterCatalog has Status.Condition of Progressing with a status == True, reason == Succeeded")
	require.EventuallyWithT(t, func(ct *assert.CollectT) {
		err := c.Get(ctx, types.NamespacedName{Name: testClusterCatalogName}, catalog)
		require.NoError(ct, err)
		cond := apimeta.FindStatusCondition(catalog.Status.Conditions, ocv1.TypeProgressing)
		require.NotNil(ct, cond)
		require.Equal(ct, metav1.ConditionTrue, cond.Status)
		require.Equal(ct, ocv1.ReasonSucceeded, cond.Reason)
	}, pollDuration, pollInterval)

	t.Log("Ensuring ClusterCatalog has Status.Condition of Serving with a status == True, reason == Available")
	require.EventuallyWithT(t, func(ct *assert.CollectT) {
		err := c.Get(ctx, types.NamespacedName{Name: testClusterCatalogName}, catalog)
		require.NoError(ct, err)
		cond := apimeta.FindStatusCondition(catalog.Status.Conditions, ocv1.TypeServing)
		require.NotNil(ct, cond)
		require.Equal(ct, metav1.ConditionTrue, cond.Status)
		require.Equal(ct, ocv1.ReasonAvailable, cond.Reason)
	}, pollDuration, pollInterval)
}
func TestClusterExtensionAfterOLMUpgrade(t *testing.T) {
	t.Log("Starting checks after OLM upgrade")
	ctx := context.Background()
	defer testutil.CollectTestArtifacts(t, artifactName, c, cfg)

	// wait for catalogd deployment to finish
	t.Log("Wait for catalogd deployment to be ready")
	waitForDeployment(t, ctx, "catalogd")

	// Find the catalogd leader pod
	catalogdLeaderCtx, catalogdLeaderCancel := context.WithTimeout(ctx, pollDuration)
	defer catalogdLeaderCancel()
	catalogdManagerPod, err := findLeaderPod(catalogdLeaderCtx, "catalogd")
	require.NoError(t, err)
	require.NotNil(t, catalogdManagerPod)
These Test* functions are standard Go tests, but the package’s existing TestMain (in upgrade_test.go) runs the Godog suite and does not call m.Run(). That means even after fixing compilation, these tests will never execute under go test. Either integrate these checks into the Godog scenarios/steps, or update TestMain to also run m.Run() and return a combined exit code.
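If TestMain is updated as suggested, the Godog suite's exit code and the one from m.Run() need to be merged into a single process exit code. One simple combining rule (a sketch; the actual wiring to the Godog runner is omitted):

```go
package main

import "fmt"

// combineExitCodes returns a single process exit code from the Godog
// suite run and the standard `go test` run: any non-zero code wins,
// with the Godog result taking precedence when both fail.
func combineExitCodes(godogCode, goTestCode int) int {
	if godogCode != 0 {
		return godogCode
	}
	return goTestCode
}

func main() {
	// In TestMain this would end with something like:
	//   os.Exit(combineExitCodes(suiteCode, m.Run()))
	fmt.Println(combineExitCodes(0, 0)) // 0: both passed
	fmt.Println(combineExitCodes(0, 1)) // 1: standard tests failed
	fmt.Println(combineExitCodes(2, 1)) // 2: Godog failure takes precedence
}
```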
t.Log("Waiting for controller-manager pods to match the desired replica count")
require.EventuallyWithT(t, func(ct *assert.CollectT) {
	var managerPods corev1.PodList
	err := c.List(ctx, &managerPods, client.MatchingLabels(managerLabelSelector))
The pod list here is not scoped to a namespace. Since the deployment check above explicitly targets olmv1-system, this c.List can accidentally count matching pods in other namespaces (or from other test runs), making the replica assertion flaky. Add client.InNamespace("olmv1-system") (or the detected install namespace) to the pod list call.
-err := c.List(ctx, &managerPods, client.MatchingLabels(managerLabelSelector))
+err := c.List(ctx, &managerPods, client.MatchingLabels(managerLabelSelector), client.InNamespace("olmv1-system"))
var managerDeployments appsv1.DeploymentList
require.NoError(ct, c.List(ctx, &managerDeployments, client.MatchingLabelsSelector{Selector: deploymentLabelSelector}))
require.Len(ct, managerDeployments.Items, 1)
managerDeployment := managerDeployments.Items[0]

require.True(ct,
	managerDeployment.Status.UpdatedReplicas == *managerDeployment.Spec.Replicas &&
		managerDeployment.Status.Replicas == *managerDeployment.Spec.Replicas &&
		managerDeployment.Status.AvailableReplicas == *managerDeployment.Spec.Replicas &&
		managerDeployment.Status.ReadyReplicas == *managerDeployment.Spec.Replicas,
)

// Check that the deployment has the Available condition set to True
var availableCond *appsv1.DeploymentCondition
for i := range managerDeployment.Status.Conditions {
	if managerDeployment.Status.Conditions[i].Type == appsv1.DeploymentAvailable {
		availableCond = &managerDeployment.Status.Conditions[i]
		break
	}
}
require.NotNil(ct, availableCond, "Available condition not found")
require.Equal(ct, corev1.ConditionTrue, availableCond.Status, "Deployment Available condition is not True")

desiredNumReplicas = *managerDeployment.Spec.Replicas
}, time.Minute, time.Second)

t.Logf("Ensure the number of remaining pods equal the desired number of replicas (%d)", desiredNumReplicas)
require.EventuallyWithT(t, func(ct *assert.CollectT) {
	var managerPods corev1.PodList
	require.NoError(ct, c.List(ctx, &managerPods, client.MatchingLabelsSelector{Selector: deploymentLabelSelector}))
	require.Len(ct, managerPods.Items, int(desiredNumReplicas))
}, time.Minute, time.Second)
waitForDeployment lists Deployments/Pods cluster-wide using only app.kubernetes.io/name. This can match multiple installs (or other components) and break the Len(..., 1) / pod count assertions. Scope both c.List calls to the intended namespace (e.g., client.InNamespace("olmv1-system") or a namespace discovered from the upgrade install).
// CRDs may take an extra long time to be deleted, and may run into the following error:
// Use 2 minute timeout for cleanup operations to ensure they complete within the test timeout.
// This is shorter than pollDuration (3 min) to leave buffer for the overall test suite.
The comment says this cleanup timeout is shorter than pollDuration (3 min), but in this file pollDuration is time.Minute. If you meant the 3-minute catalog timeout, reference catalogPollDuration instead; otherwise update the comment to match the actual constant values.
-// This is shorter than pollDuration (3 min) to leave buffer for the overall test suite.
+// This is shorter than catalogPollDuration (3 min) to leave buffer for the overall test suite.
hack/test/pre-upgrade-setup.sh
which will later be used in upgrade testing.

Usage:
  post-upgrade-checks.sh [TEST_CATALOG_IMG] [TEST_CATALOG_NAME] [TEST_CLUSTER_EXTENSION_NAME]
The usage text references post-upgrade-checks.sh, but this script is pre-upgrade-setup.sh. This is confusing when the wrong command is printed on argument errors; update the help string to match the actual script name and arguments.
-post-upgrade-checks.sh [TEST_CATALOG_IMG] [TEST_CATALOG_NAME] [TEST_CLUSTER_EXTENSION_NAME]
+pre-upgrade-setup.sh [TEST_CATALOG_IMG] [TEST_CLUSTER_CATALOG_NAME] [TEST_CLUSTER_EXTENSION_NAME]
kubectl wait --for=condition=Serving --timeout=5m ClusterCatalog $TEST_CLUSTER_CATALOG_NAME
kubectl wait --for=condition=Installed --timeout=5m ClusterExtension $TEST_CLUSTER_EXTENSION_NAME
For consistency with other scripts in this repo (e.g. hack/demo/catalogd-demo-script.sh uses kubectl wait ... clustercatalog/<name>), these waits should use the canonical resource form (clustercatalog/${TEST_CLUSTER_CATALOG_NAME} and clusterextension/${TEST_CLUSTER_EXTENSION_NAME}) rather than separate KIND + NAME args. This avoids reliance on kubectl’s kind-to-resource mapping and is more robust across API group/shortname differences.
-kubectl wait --for=condition=Serving --timeout=5m ClusterCatalog $TEST_CLUSTER_CATALOG_NAME
-kubectl wait --for=condition=Installed --timeout=5m ClusterExtension $TEST_CLUSTER_EXTENSION_NAME
+kubectl wait --for=condition=Serving --timeout=5m clustercatalog/${TEST_CLUSTER_CATALOG_NAME}
+kubectl wait --for=condition=Installed --timeout=5m clusterextension/${TEST_CLUSTER_EXTENSION_NAME}
var lease coordinationv1.Lease
if err := c.Get(ctx, types.NamespacedName{
	Name:      leaseName,
	Namespace: "olmv1-system",
}, &lease); err != nil {
	// Lease might not exist yet, retry
	return false, nil
}

if lease.Spec.HolderIdentity == nil {
	// No leader elected yet, retry
	return false, nil
}

// The HolderIdentity is in the format "pod-name_hash"
// Extract the pod name by splitting on "_"
holderIdentity := *lease.Spec.HolderIdentity
podName := strings.Split(holderIdentity, "_")[0]

// Directly fetch the pod by name instead of listing all pods
pod := &corev1.Pod{}
if err := c.Get(ctx, types.NamespacedName{
	Name:      podName,
	Namespace: "olmv1-system",
}, pod); err != nil {
	// Pod might not exist yet or is being terminated, retry
	return false, nil
}
findLeaderPod is hardcoded to read the Lease and Pod from namespace olmv1-system. The upgrade E2E harness already has logic to detect the OLM install namespace (see test/e2e/steps/detectOLMDeployment usage), so hardcoding here can break upgrades where the stable release installs into a different namespace. Pass the namespace into this helper (or discover it from the Lease/Deployment) rather than embedding olmv1-system.
@@ -19,6 +20,7 @@ options:
   enabled: true
   deployment:
     image: quay.io/operator-framework/catalogd:devel
+    replicas: 1
     extraArguments: []
PR description says the chart will default both control-plane Deployments to 2 replicas to resolve the PDB deadlock, but the new replicas values added here default to 1. The rendered manifests/standard*.yaml and manifests/experimental.yaml also remain at 1 replica, so the deadlock scenario still exists by default. Either change these defaults to 2 (and regenerate manifests), or update the PR description/docs to clarify that HA requires an explicit values override (e.g., helm/high-availability.yaml).
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 15 out of 15 changed files in this pull request and generated 7 comments.
enabled: true
deployment:
  image: quay.io/operator-framework/catalogd:devel
  replicas: 1
Defaulting options.catalogd.deployment.replicas to 1 keeps catalogd single-replica by default. With podDisruptionBudget.minAvailable: 1, this can still prevent node drains/evictions for that pod. Consider defaulting to 2 (or making the PDB conditional on the replica count) to actually resolve the HA/PDB deadlock for default installs.
-replicas: 1
+replicas: 2
 list := &apiextensionsv1.CustomResourceDefinitionList{}
 err := c.List(context.Background(), list, client.MatchingLabelsSelector{Selector: ls.AsSelector()})
 require.NoError(ct, err)
 require.Empty(ct, list.Items)
-}, 5*pollDuration, pollInterval)
+}, cleanupTimeout, pollInterval)
EnsureNoExtensionResources now uses a fixed 2-minute cleanupTimeout even for CRD deletion, but the comment immediately above notes CRDs may take extra time to delete. This is also a significant reduction from the previous longer waits and may reintroduce test flakes on slower clusters. Consider using a longer timeout for CRDs (or extendedPollDuration) and keeping shorter timeouts for the faster RBAC cleanup steps.
require.Equal(ct, *managerDeployment.Spec.Replicas, managerDeployment.Status.UpdatedReplicas)
require.Equal(ct, *managerDeployment.Spec.Replicas, managerDeployment.Status.Replicas)
require.Equal(ct, *managerDeployment.Spec.Replicas, managerDeployment.Status.AvailableReplicas)
require.Equal(ct, *managerDeployment.Spec.Replicas, managerDeployment.Status.ReadyReplicas)
}, time.Minute, time.Second)
These rollout assertions only wait up to time.Minute, but this file defines pollDuration = 3 * time.Minute specifically because upgrades can involve pod restarts + leader election delays (~163s worst case). To reduce upgrade-e2e flakiness, consider using pollDuration (or another longer, named timeout) for these Eventually checks instead of a hard-coded 1 minute.
require.EventuallyWithT(t, func(ct *assert.CollectT) {
	var managerPods corev1.PodList
	err := c.List(ctx, &managerPods, client.MatchingLabels(managerLabelSelector), client.InNamespace("olmv1-system"))
	require.NoError(ct, err)
	require.Len(ct, managerPods.Items, int(*managerDeployment.Spec.Replicas))
}, time.Minute, time.Second)
This check also uses a 1-minute Eventually timeout. Since upgrades can extend rollout times (image pulls, restarts, leader election), using the longer pollDuration (3 minutes) defined above would better match the expected worst-case behavior and avoid unnecessary flakes.
}, time.Minute, time.Second)
t.Logf("Ensure the number of remaining pods equal the desired number of replicas (%d)", desiredNumReplicas)
require.EventuallyWithT(t, func(ct *assert.CollectT) {
	var managerPods corev1.PodList
	require.NoError(ct, c.List(ctx, &managerPods, client.MatchingLabelsSelector{Selector: deploymentLabelSelector}, client.InNamespace("olmv1-system")))
	require.Len(ct, managerPods.Items, int(desiredNumReplicas))
}, time.Minute, time.Second)
waitForDeployment uses a fixed 1-minute timeout here as well. Since this helper is called after upgrades (where rollouts can exceed 60s), consider using pollDuration/extendedPollDuration (or passing a timeout into the helper) so readiness checks reflect the upgrade timing assumptions documented at the top of the file.
Suggested change:

-}, time.Minute, time.Second)
+}, pollDuration, time.Second)
 t.Logf("Ensure the number of remaining pods equal the desired number of replicas (%d)", desiredNumReplicas)
 require.EventuallyWithT(t, func(ct *assert.CollectT) {
 	var managerPods corev1.PodList
 	require.NoError(ct, c.List(ctx, &managerPods, client.MatchingLabelsSelector{Selector: deploymentLabelSelector}, client.InNamespace("olmv1-system")))
 	require.Len(ct, managerPods.Items, int(desiredNumReplicas))
-}, time.Minute, time.Second)
+}, pollDuration, time.Second)
// Run standard Go tests first (e.g., post_upgrade_test.go)
exitCode := m.Run()
if exitCode != 0 {
	os.Exit(exitCode)
}
m.Run() will only execute other Go tests in this package if the test runner invokes the package (e.g., go test ./test/upgrade-e2e). If CI/Make targets still run go test against only upgrade_test.go as a file target, post_upgrade_test.go/suite_test.go won’t be included in the build and these checks won’t run. Please ensure the upgrade-e2e invocation runs the package so the newly added Go tests actually execute.
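One way to make that explicit is a package-level invocation in the Make target. The target name below is an assumption, not the repository's actual Makefile:

```make
# Hypothetical target: run the whole upgrade-e2e package so that
# upgrade_test.go, post_upgrade_test.go, and suite_test.go are all
# compiled into one test binary and m.Run() sees every test.
test-upgrade-e2e:
	go test -count=1 -v ./test/upgrade-e2e/...
```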
image: quay.io/operator-framework/operator-controller:devel
replicas: 1
extraArguments: []
Defaulting options.operatorController.deployment.replicas to 1 here means the out-of-the-box installation is still single-replica. Given the PDB is configured with minAvailable: 1, this can still block voluntary evictions during node drains. If the goal is HA-ready defaults, consider defaulting this to 2 (or adjusting the PDB behavior when replicas=1).
Description
Make deployments HA-ready with configurable replica count
Problem
The current deployment configuration creates a deadlock during node maintenance operations:
- `operator-controller-controller-manager` and `catalogd-controller-manager` have 1 replica (hardcoded)
- Their PodDisruptionBudgets set `minAvailable: 1`

Impact: Node maintenance operations (upgrades, scaling, etc.) fail with:

Cannot evict pod as it would violate the pod's disruption budget.

Root cause: With only 1 replica and `minAvailable: 1`, there's no way to safely evict the pod while maintaining availability.
Solution
This PR makes the replica count configurable and sets a sensible default of 2 replicas:
- Added a `replicas` configuration to the Helm values: `options.operatorController.deployment.replicas: 2` and `options.catalogd.deployment.replicas: 2`
- Updated the Helm templates to use the configurable value instead of the hardcoded `1`
- Regenerated all manifests with the new defaults
Benefits
✅ Enables safe node maintenance: with 2 replicas, one pod can be evicted while `minAvailable: 1` is still satisfied
✅ Zero-downtime updates: a RollingUpdate with 2 replicas keeps the service available throughout
✅ Improved fault tolerance: the system keeps operating if one pod fails
✅ Production-ready defaults: aligns with HA best practices for control-plane components
✅ Flexible configuration: the replica count can be customized via Helm values
Compatibility
Existing deployments: When this change is applied, deployments will scale from 1 to 2 replicas. `maxSurge: 1` allows a temporary third pod during the transition, and `maxUnavailable: 0` prevents service interruption.

Custom deployments: Users can override the default via Helm values.
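For example, a values override might look like the following. The `options.operatorController.deployment.replicas` key path matches the diff above; the `catalogd` sibling key and the surrounding file layout are assumptions:

```yaml
# custom-values.yaml: override the default replica count per component
options:
  operatorController:
    deployment:
      replicas: 3
  catalogd:
    deployment:
      replicas: 3
```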
Reviewer Checklist
Assisted-By: Claude Code