Skip to content

Latest commit

 

History

History
265 lines (211 loc) · 9.14 KB

File metadata and controls

265 lines (211 loc) · 9.14 KB

Che Happy-Path Test

Script: .ci/oci-devworkspace-happy-path.sh Purpose: Integration test validating DevWorkspace Operator with Eclipse Che deployment

Overview

This script deploys and validates the full DevWorkspace Operator + Eclipse Che stack on OpenShift, ensuring the happy-path user workflow succeeds. It's used in the v14-che-happy-path Prow CI test.

Features

Retry Logic

  • Max retries: 2 (3 total attempts)
  • Exponential backoff: 60s base delay with 0-15s jitter
  • Cleanup: Deletes failed Che deployment before retry

Health Checks

  • OLM: Verifies catalog-operator and olm-operator are available before Che deployment (2-minute timeout each)
  • DWO: Waits for deployment condition=available (5-minute timeout)
  • Che: Waits for CheCluster condition=Available (10-minute timeout)
  • Pods: Verifies all Che pods are ready

Artifact Collection

On each failure, collects:

  • OLM diagnostics (Subscription, InstallPlan, CSV, CatalogSource)
  • CatalogSource pod logs
  • Che operator logs (last 1000 lines)
  • CheCluster CR status (full YAML)
  • All pod logs from Che namespace
  • Kubernetes events
  • chectl server logs

Error Handling

  • Graceful error handling with stage-specific messages
  • Progress indicators: "Attempt 1/2", "Retrying in 71s..."
  • No crash on failures

Configuration

Environment variables (all optional):

Variable Default Description
CHE_NAMESPACE eclipse-che Namespace for Che deployment
MAX_RETRIES 2 Maximum retry attempts
BASE_DELAY 60 Base delay in seconds for exponential backoff
MAX_JITTER 15 Maximum jitter in seconds
ARTIFACT_DIR /tmp/dwo-e2e-artifacts Directory for diagnostic artifacts
DEVWORKSPACE_OPERATOR (required) DWO image to deploy

Usage

In Prow CI

The script is called automatically by the v14-che-happy-path Prow job. Prow sets DEVWORKSPACE_OPERATOR based on the context:

For PR checks (testing PR code):

export DEVWORKSPACE_OPERATOR="quay.io/devfile/devworkspace-controller:pr-${PR_NUMBER}-${COMMIT_SHA}"
./.ci/oci-devworkspace-happy-path.sh

For periodic/nightly runs (testing main branch):

export DEVWORKSPACE_OPERATOR="quay.io/devfile/devworkspace-controller:next"
./.ci/oci-devworkspace-happy-path.sh

Local Testing

export DEVWORKSPACE_OPERATOR="quay.io/youruser/devworkspace-controller:your-tag"
export ARTIFACT_DIR="/tmp/my-test-artifacts"
./.ci/oci-devworkspace-happy-path.sh

Test Flow

  1. Deploy DWO

    • Runs make install
    • Waits for controller deployment to be available
    • Collects artifacts if deployment fails
  2. Deploy Che (with retry)

    • Runs chectl server:deploy with extended timeouts (24h)
    • Waits for CheCluster condition=Available
    • Verifies all pods are ready
    • Collects artifacts on failure
    • Cleans up and retries if needed
  3. Run Happy-Path Test

    • Downloads test script from Eclipse Che repository
    • Executes Che happy-path workflow
    • Collects artifacts on failure

Exit Codes

  • 0: Success - All stages completed
  • 1: Failure - Check $ARTIFACT_DIR for diagnostics

Timeouts

Component Timeout Purpose
DWO deployment 5 minutes Pod becomes available
CheCluster Available 10 minutes Che fully deployed
Che pods ready 5 minutes All pods running
chectl pod wait/ready 24 hours Generous for slow environments

Common Failures

OLM Infrastructure Not Ready

Symptoms: "ERROR: OLM infrastructure is not healthy, cannot proceed with Che deployment" Check: $ARTIFACT_DIR/olm-diagnostics-olm-check.yaml Common causes:

  • OLM operators not running (catalog-operator, olm-operator)
  • Cluster provisioning issues during bootstrap
  • Resource constraints preventing OLM operator scheduling Resolution: This indicates a fundamental cluster infrastructure issue. Check cluster health and OLM operator logs before retrying.

DWO Deployment Fails

Symptoms: "ERROR: DWO controller is not ready" Check: $ARTIFACT_DIR/devworkspace-controller-info/ Common causes: Image pull errors, resource constraints, webhook conflicts

Che Deployment Timeout

Symptoms: "ERROR: CheCluster did not become available within 10 minutes" Check: $ARTIFACT_DIR/che-operator-logs-attempt-*.log, $ARTIFACT_DIR/olm-diagnostics-attempt-*.yaml Common causes:

  • OLM subscription timeout (check olm-diagnostics for subscription state)
  • Database connection issues
  • Image pull failures
  • Operator reconciliation errors

Pod CrashLoopBackOff

Symptoms: "ERROR: chectl server:deploy failed" Check: $ARTIFACT_DIR/eclipse-che-info/ for pod logs Common causes: Configuration errors, resource limits, TLS certificate issues

OLM Subscription Stuck

Symptoms: Subscription timeout after 120 seconds with no resources created Check: $ARTIFACT_DIR/olm-diagnostics-attempt-*.yaml, $ARTIFACT_DIR/catalogsource-logs-attempt-*.log Common causes:

  • CatalogSource pod not pulling/running
  • InstallPlan not created (subscription cannot resolve dependencies)
  • Cluster resource exhaustion preventing operator pod scheduling Resolution: Check OLM operator logs and CatalogSource pod status. See "Advanced Troubleshooting" section for monitoring and alternative deployment options.

Artifact Locations

After a failed test run:

$ARTIFACT_DIR/
├── devworkspace-controller-info/
│   ├── <pod-name>-<container>.log
│   └── events.log
├── eclipse-che-info/
│   ├── <pod-name>-<container>.log
│   └── events.log
├── che-operator-logs-attempt-1.log
├── che-operator-logs-attempt-2.log
├── checluster-status-attempt-1.yaml
├── checluster-status-attempt-2.yaml
├── olm-diagnostics-attempt-1.yaml
├── olm-diagnostics-attempt-2.yaml
├── catalogsource-logs-attempt-1.log
├── catalogsource-logs-attempt-2.log
├── chectl-logs-attempt-1/
└── chectl-logs-attempt-2/

Dependencies

  • kubectl - Kubernetes CLI
  • oc - OpenShift CLI (for log collection)
  • chectl - Eclipse Che CLI (v7.114.0+)
  • jq - JSON processor (for chectl)

Advanced Troubleshooting

OLM Infrastructure Issues

If you experience persistent OLM subscription timeouts (see olm-diagnostics-*.yaml artifacts):

Option 1: OLM Health Check (Implemented)

The script now verifies OLM infrastructure health before deploying Che:

  • Checks catalog-operator is available
  • Checks olm-operator is available
  • Verifies openshift-marketplace is accessible

If OLM is unhealthy, the test fails fast with diagnostic artifacts instead of waiting through timeouts.

Option 2: Monitor Subscription Progress (Advanced)

For debugging stuck subscriptions, you can add active monitoring to detect zero-progress scenarios earlier:

# Example: Monitor subscription state every 10 seconds
while [ $elapsed -lt 300 ]; do
  state=$(kubectl get subscription eclipse-che -n eclipse-che \
    -o jsonpath='{.status.state}' 2>/dev/null)
  echo "[$elapsed/300s] Subscription state: ${state:-unknown}"
  if [ "$state" = "AtLatestKnown" ]; then
    break
  fi
  sleep 10
  elapsed=$((elapsed + 10))
done

This helps identify whether subscriptions are progressing slowly vs. completely stuck.

Option 3: Skip OLM Installation (Alternative Approach)

For CI environments with persistent OLM issues, consider deploying Che operator directly instead of via OLM:

chectl server:deploy \
  --installer=operator \  # Uses direct YAML deployment
  -p openshift \
  --batch \
  --telemetry=off \
  --skip-devworkspace-operator \
  --chenamespace="$CHE_NAMESPACE"

Trade-offs:

  • ✅ Bypasses OLM infrastructure entirely
  • ✅ More reliable in resource-constrained CI environments
  • ❌ Doesn't test OLM integration path (used by production OperatorHub)
  • ❌ May miss OLM-specific issues

When to use: Temporary workaround for CI infrastructure issues while OLM problems are being resolved.

Subscription Timeout Issues

If OLM subscriptions consistently timeout (visible in olm-diagnostics-*.yaml):

  1. Check OLM operator logs:

    kubectl logs -n openshift-operator-lifecycle-manager \
      deployment/catalog-operator --tail=100
    kubectl logs -n openshift-operator-lifecycle-manager \
      deployment/olm-operator --tail=100
  2. Verify CatalogSource pod is running:

    kubectl get pods -n openshift-marketplace \
      -l olm.catalogSource=eclipse-che
    kubectl logs -n openshift-marketplace \
      -l olm.catalogSource=eclipse-che
  3. Check InstallPlan creation:

    kubectl get installplan -n eclipse-che -o yaml
    • If no InstallPlan exists, OLM couldn't resolve the subscription
    • If InstallPlan exists but isn't complete, check its status conditions

Related Documentation