Commit 3134799

Enhancement: Add drift detection and automatic reconciliation
Proposal for drift detection feature.
1 parent eeb37af commit 3134799

1 file changed

Lines changed: 179 additions & 0 deletions

enhancements/drift-detection.md

@@ -0,0 +1,179 @@
1+
# Enhancement: Drift Detection and Automatic Reconciliation
2+
3+
| Field | Value |
4+
|-------|-------|
5+
| **Status** | implementable |
6+
| **Author(s)** | @eshulman |
7+
| **Created** | 2026-02-03 |
8+
| **Last Updated** | 2026-02-03 |
9+
| **Tracking Issue** | TBD |
10+
11+
## Summary
12+
13+
This enhancement introduces drift detection and automatic reconciliation for ORC-managed resources. The feature enables ORC to periodically check OpenStack resources for changes made outside of ORC (via CLI, dashboard, or other tools) and automatically restore them to match the desired state defined in the Kubernetes specification.
14+
15+
Additionally, managed resources that are deleted externally from OpenStack will be automatically recreated by ORC, ensuring the declared state is maintained.
16+
17+
## Motivation
18+
19+
In production environments, OpenStack resources may be modified outside of ORC through various means:
20+
21+
- Direct OpenStack CLI/SDK operations
22+
- OpenStack Horizon dashboard
23+
- Other automation tools or controllers
24+
- Manual emergency interventions
25+
- Third-party integrations
26+
27+
Without drift detection, these changes go unnoticed until they cause issues, leading to configuration drift between the declared Kubernetes state and the actual OpenStack state. This undermines the declarative model that ORC provides.
28+
29+
Similar Kubernetes controllers for cloud resources have implemented drift detection:
30+
31+
- **AWS Controllers for Kubernetes (ACK)**: Uses a 10-hour default resync period for drift recovery
32+
- **Azure Service Operator (ASO)**: Uses a 1-hour default resync period with configurable intervals
33+
34+
## Goals
35+
36+
- **Ensure state consistency**: Managed resources in OpenStack should match the desired state declared in Kubernetes
37+
- **Detect external modifications**: Identify when OpenStack resources are modified outside of ORC
38+
- **Automatic correction**: Restore drifted resources to their desired state without manual intervention
39+
- **Resource recreation**: Recreate managed resources that are deleted externally from OpenStack
40+
- **Configurable frequency**: Allow operators to tune the resync interval based on their requirements
41+
- **Hierarchical configuration**: Support configuration at ORC-wide, resource-type, and per-resource levels
42+
- **Minimal API impact**: Avoid excessive OpenStack API calls that could trigger rate limiting
43+
44+
## Non-Goals
45+
46+
- **Real-time drift detection**: Event-driven detection of changes (would require OpenStack webhooks or very short polling intervals)
47+
- **Drift reporting without correction**: Alerting on drift without taking corrective action (future enhancement)
48+
- **Selective field reconciliation**: Allowing some fields to drift while correcting others
49+
- **Conflict resolution with merge semantics**: Merging external changes with desired state
50+
- **Drift detection for unmanaged resources**: Unmanaged resources are explicitly not modified by ORC
51+
52+
## Proposal
53+
54+
### Periodic Resync Mechanism
55+
56+
The drift detection mechanism works by periodically triggering a full reconciliation of managed resources:
57+
58+
1. **Trigger**: After a resource reaches a stable state (`Progressing=False`), ORC schedules a resync to run once `resyncPeriod` has elapsed
59+
2. **Fetch**: On resync, ORC fetches the current state of the OpenStack resource
60+
3. **Compare**: The current state is compared against the desired state in the Kubernetes spec
61+
4. **Update**: If drift is detected, ORC updates the OpenStack resource to match the desired state
62+
5. **Reschedule**: After successful reconciliation, the next resync is scheduled
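
Steps 2 to 5 can be sketched as a single pass. This is a minimal Python sketch for illustration, not ORC's implementation: `FakeCloud` and `resync_once` are hypothetical names, the in-memory dictionary stands in for the OpenStack API, and step 1 (waiting for `Progressing=False`) is assumed to have happened before the call.

```python
from dataclasses import dataclass, field
from datetime import timedelta


@dataclass
class FakeCloud:
    """Hypothetical in-memory stand-in for an OpenStack API client."""
    resources: dict = field(default_factory=dict)

    def get(self, resource_id):
        return dict(self.resources[resource_id])

    def update(self, resource_id, changes):
        self.resources[resource_id].update(changes)


def resync_once(cloud, resource_id, desired, resync_period):
    """Run one resync pass and return (drift, requeue_after).

    drift maps each drifted field to its (observed, desired) pair.
    requeue_after is None when periodic resync is disabled.
    """
    observed = cloud.get(resource_id)                      # 2. Fetch
    drift = {                                              # 3. Compare
        key: (observed.get(key), want)
        for key, want in desired.items()
        if observed.get(key) != want
    }
    if drift:                                              # 4. Update
        cloud.update(resource_id, {k: want for k, (_, want) in drift.items()})
    # 5. Reschedule: a zero period means periodic resync is disabled.
    requeue_after = resync_period if resync_period > timedelta(0) else None
    return drift, requeue_after


cloud = FakeCloud({"net-1": {"description": "edited in Horizon", "mtu": 1500}})
desired = {"description": "Critical application network"}
drift, requeue_after = resync_once(cloud, "net-1", desired, timedelta(hours=1))
```

Only fields present in the desired state are compared, so `mtu` is left untouched in this example. A second pass over the same resource finds no drift and performs no update.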
63+
64+
### API Changes
65+
66+
The `ManagedOptions` struct is extended with a `resyncPeriod` field:
67+
68+
```yaml
apiVersion: openstack.k-orc.cloud/v1alpha1
kind: Network
metadata:
  name: critical-network
spec:
  cloudCredentialsRef:
    secretName: openstack-clouds
    cloudName: openstack
  managementPolicy: managed
  managedOptions:
    resyncPeriod: 1h # Check for drift every hour
  resource:
    description: Critical application network
```
83+
84+
**Default value**: 10 hours (`10h`), chosen to balance drift detection responsiveness against API load and to align with ACK's default.
85+
86+
**Disable resync**: Set `resyncPeriod: 0` to disable periodic drift detection for a resource.
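
The value semantics above (unset falls back to the default, `0` disables) can be sketched as a small parser. This is an illustrative Python sketch that assumes the field accepts Go-style duration strings such as `1h` or `1h30m`; the proposal does not specify the accepted syntax, and `parse_resync_period` is a hypothetical helper that handles only the `h`, `m`, and `s` units.

```python
import re
from datetime import timedelta

DEFAULT_RESYNC = timedelta(hours=10)  # default stated in this proposal
_UNITS = {"h": "hours", "m": "minutes", "s": "seconds"}
_TOKEN = re.compile(r"(\d+)([hms])")
_WHOLE = re.compile(r"(?:\d+[hms])+")


def parse_resync_period(value):
    """Parse a resyncPeriod value into a timedelta.

    None means the field is unset and yields the default.
    0 (or "0") yields a zero duration, which disables periodic resync.
    """
    if value is None:
        return DEFAULT_RESYNC
    text = str(value).strip()
    if text == "0":
        return timedelta(0)
    if not _WHOLE.fullmatch(text):
        raise ValueError(f"unsupported resyncPeriod: {value!r}")
    total = timedelta(0)
    for amount, unit in _TOKEN.findall(text):
        total += timedelta(**{_UNITS[unit]: int(amount)})
    return total
```

A YAML loader reads `resyncPeriod: 0` as the integer `0`, which is why the sketch normalizes the value to a string before inspecting it.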
87+
88+
### Configuration Hierarchy
89+
90+
Drift detection supports a three-level configuration hierarchy, with more specific configurations taking precedence:
91+
92+
| Level | Scope | Configuration Location | Precedence |
93+
|-------|-------|----------------------|------------|
94+
| ORC-wide | All resources across all types | Controller deployment configuration | Lowest |
95+
| Resource-type | All resources of a specific type (e.g., all Networks) | CRD-level configuration or controller flags | Medium |
96+
| Per-resource | Individual resource instance | `spec.managedOptions.resyncPeriod` on the CR | Highest |
97+
98+
**Resolution order**: Per-resource → Resource-type → ORC-wide → Default (10h)
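
The resolution order can be sketched as a first-set-value lookup. This is a hedged Python illustration; `resolve_resync_period` and its parameter names are hypothetical, and `None` is assumed to mean "not configured at this level".

```python
from datetime import timedelta

DEFAULT_RESYNC = timedelta(hours=10)  # default stated in this proposal


def resolve_resync_period(per_resource=None, resource_type=None, orc_wide=None):
    """Return the most specific configured value, or the default.

    timedelta(0) is a configured value (resync disabled) and is falsy in
    Python, so each level is tested against None rather than truthiness.
    """
    for candidate in (per_resource, resource_type, orc_wide):
        if candidate is not None:
            return candidate
    return DEFAULT_RESYNC
```

For example, `resolve_resync_period(per_resource=timedelta(0), resource_type=timedelta(hours=2))` returns a zero duration, so a single resource can opt out even when its resource type enables resync.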
99+
100+
### Resource Recreation on External Deletion
101+
102+
When a managed resource is deleted from OpenStack but the ORC object still exists:
103+
104+
1. On the next reconciliation, ORC attempts to fetch the resource by the ID stored in `status.id`
105+
2. If not found and the resource was originally created by ORC (not imported), ORC recreates it
106+
3. The new resource ID is stored in `status.id`
107+
108+
For **imported resources** that are deleted externally, this is a terminal error because the resource was not created by ORC and recreating it would not restore the original resource.
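
The decision described above can be sketched as follows. This is an illustrative Python sketch, not ORC code: `handle_external_deletion`, `Outcome`, and the `exists` and `create` callables are hypothetical stand-ins for the fetch-by-ID and create calls.

```python
from enum import Enum


class Outcome(Enum):
    UNCHANGED = "unchanged"
    RECREATED = "recreated"
    TERMINAL_ERROR = "terminal-error"


def handle_external_deletion(status, imported, exists, create):
    """Apply the recreation rule to one object and return the outcome.

    status   -- dict holding the OpenStack ID under "id"
    imported -- True if ORC adopted the resource instead of creating it
    exists   -- callable(resource_id) -> bool, a stand-in for a GET by ID
    create   -- callable() -> str, creates a replacement and returns its ID
    """
    if exists(status["id"]):
        return Outcome.UNCHANGED
    if imported:
        # ORC did not create it, so a new resource would not be the original.
        return Outcome.TERMINAL_ERROR
    status["id"] = create()
    return Outcome.RECREATED


live = {"net-1"}
status = {"id": "net-1"}
live.discard("net-1")  # the network is deleted outside ORC


def create_network():
    live.add("net-2")
    return "net-2"


outcome = handle_external_deletion(status, False, live.__contains__, create_network)
```

After the call, `status["id"]` holds the replacement ID, so later reconciliations fetch the new resource.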
109+
110+
### Field Coverage
111+
112+
Drift detection covers all **mutable fields** for which ORC actuators implement update operations. Before this feature is considered stable, all actuator implementations must be audited to ensure they cover every mutable field.
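
One way to frame that audit is as a set difference between the fields the API allows to change and the fields an actuator actually updates. The Python sketch below is only an illustration; `uncovered_fields` is a hypothetical helper and the field names are examples, not an authoritative list for any resource.

```python
def uncovered_fields(mutable_fields, handled_by_actuator):
    """Return mutable fields that the actuator never updates, sorted by name."""
    return sorted(set(mutable_fields) - set(handled_by_actuator))


# Illustrative field names only, not an authoritative list for any resource.
mutable = {"name", "description", "adminStateUp", "mtu"}
handled = {"name", "description"}
missing = uncovered_fields(mutable, handled)
```

Any field reported here could drift without being corrected, which is why the audit is a precondition for stability.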
113+
114+
## Risks and Edge Cases
115+
116+
### Split-Brain Scenarios
117+
118+
**Risk**: Multiple controllers or systems may be managing the same OpenStack resources, leading to conflicts where changes are repeatedly overwritten.
119+
120+
**Mitigation**:
121+
- Implement retry with exponential backoff when update conflicts are detected
122+
- Document that ORC should be the sole manager of resources it creates
123+
- Report conflicts in resource conditions for observability
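
The retry mitigation can be sketched as capped exponential backoff. This is a minimal Python sketch under stated assumptions: `ConflictError` is a hypothetical exception for an update that lost to a concurrent writer, and the base, factor, and cap values are placeholders rather than values chosen by this proposal.

```python
import time


class ConflictError(Exception):
    """Hypothetical error for an update that lost to a concurrent writer."""


def backoff_delays(base=1.0, factor=2.0, cap=300.0, attempts=5):
    """Return capped, exponentially growing delays in seconds."""
    return [min(base * factor ** n, cap) for n in range(attempts)]


def retry_on_conflict(operation, delays, sleep=time.sleep):
    """Call operation, sleeping through each delay after a conflict."""
    for delay in delays:
        try:
            return operation()
        except ConflictError:
            sleep(delay)
    return operation()  # last attempt; a conflict here reaches the caller
```

Passing `sleep` as a parameter keeps the helper testable without real waiting. Backoff only slows a conflict down; if two systems keep overwriting each other, the condition still needs to be surfaced to the operator.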
124+
125+
### API Rate Limiting
126+
127+
**Risk**: Frequent resync across many resources could trigger OpenStack API rate limiting.
128+
129+
**Mitigation**:
130+
- Conservative 10-hour default resync period
131+
- Add random jitter to resync times to avoid thundering herd
132+
- Allow operators to disable or lengthen resync for stable resources
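
The jitter mitigation can be sketched by stretching each period by a random fraction. This Python sketch is illustrative; `jittered` and the 10% default fraction are assumptions, not values defined by this proposal.

```python
import random
from datetime import timedelta


def jittered(period, max_fraction=0.1, rng=random):
    """Return period plus a random extra of up to max_fraction of it.

    A zero or negative period (resync disabled) is returned unchanged.
    """
    if period <= timedelta(0):
        return period
    return period + period * rng.uniform(0.0, max_fraction)
```

With the 10-hour default and a 10% fraction, resources that became stable at the same moment would resync somewhere between 10 and 11 hours later instead of all at once.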
133+
134+
### Controller Resource Consumption
135+
136+
**Risk**: Frequent reconciliation increases CPU and memory usage on the ORC controller.
137+
138+
**Mitigation**:
139+
- Implement hash-based comparison: compute a hash of the OpenStack resource state and store it in `status.observedStateHash`. Only proceed with update operations if the hash differs from the previous reconciliation.
140+
- Conservative default limits reconciliation frequency
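
The hash-based comparison can be sketched as follows. This is an illustrative Python sketch: the choice of SHA-256 over canonical JSON is an assumption, and `state_hash` and `needs_update` are hypothetical helpers operating on a plain dictionary that stands in for the object's status.

```python
import hashlib
import json


def state_hash(state):
    """Hash a JSON-serializable state independently of key order."""
    canonical = json.dumps(state, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def needs_update(status, observed_state):
    """Return True if the observed state changed since the last pass.

    The new hash is always recorded in status["observedStateHash"].
    """
    new_hash = state_hash(observed_state)
    changed = status.get("observedStateHash") != new_hash
    status["observedStateHash"] = new_hash
    return changed
```

Sorting keys before hashing keeps the result stable when the API returns the same fields in a different order. If an update follows, the stored hash describes the pre-update state, so the next pass would see one more change before settling.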
141+
142+
### Conflicts with External Systems
143+
144+
**Risk**: If resources are intentionally managed by external systems (e.g., autoscalers, other controllers), drift correction can cause unexpected behavior.
145+
146+
**Mitigation**:
147+
- Allow `resyncPeriod: 0` to disable drift detection
148+
- Use `managementPolicy: unmanaged` for externally managed resources
149+
- Document the implications clearly in the user guide
150+
151+
### Upgrade/Downgrade Considerations
152+
153+
**Risk**: Users upgrading to a version with drift detection may experience unexpected reconciliations.
154+
155+
**Mitigation**: The 10-hour default is conservative enough that most users won't notice immediate impact. Document the new behavior in release notes.
156+
157+
## Alternatives Considered
158+
159+
### Event-Driven Drift Detection
160+
161+
Use OpenStack notifications (Oslo messaging) to detect changes in real-time.
162+
163+
**Rejected because**: It requires OpenStack notification infrastructure, is complex to implement, and not all deployments have notifications enabled.
164+
165+
### Drift Detection Without Correction
166+
167+
Detect and report drift without automatically correcting it.
168+
169+
**Rejected because**: It adds operational burden by requiring human intervention. It could be added as a separate management policy option in the future.
170+
171+
### Watch-Based Detection
172+
173+
Implement a watcher that periodically lists all resources from OpenStack and compares them against the desired state.
174+
175+
**Rejected because**: List operations can be expensive and are harder to implement with proper filtering, whereas per-resource reconciliation integrates naturally with controller-runtime.
176+
177+
## Implementation History
178+
179+
- 2026-02-03: Enhancement proposed
