[ci_gen_kustomize_values] Co-locate provisionserver with metal3 to prevent DHCP failures #3738

Open
mnietoji wants to merge 1 commit into openstack-k8s-operators:main from mnietoji:dhcp_provisioning_with_fix

mnietoji (Contributor) commented Mar 4, 2026

…ures

When the metal3-dnsmasq pod restarts during a node's DHCP lease renewal on the provisioning network (172.23.0.0/24), NetworkManager fails to renew the lease and sets ipv4.method=disabled. The NMState operator then preserves this disabled state, causing permanent loss of provisioning network connectivity on that node.

The issue occurs when the OpenStackProvisionServer and metal3 pods run on different nodes. If metal3 restarts while a node is attempting DHCP renewal, the temporary unavailability of metal3-dnsmasq causes the renewal to fail.

Solution:
Automatically detect the node running the metal3 pod (via the k8s-app=metal3 label) and configure provisionServerNodeSelector in baremetalSetTemplate to schedule OpenStackProvisionServer on the same node. This keeps provisioning network connectivity available because metal3-static-ip-manager maintains a static IP (172.23.0.3) on the metal3 node regardless of dnsmasq restarts.
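
The detection-and-pinning step described above can be sketched roughly as follows. This is a minimal Python illustration, not the PR's Ansible implementation: the helper names `find_metal3_node` and `with_provision_server_selector`, the sample pod list, the `cloudUserName` value, and the use of the `kubernetes.io/hostname` node label as the selector key are all assumptions made for the example.

```python
import json
from typing import Optional

# Assumption: the selector matches on the well-known hostname node label.
HOSTNAME_LABEL = "kubernetes.io/hostname"


def find_metal3_node(pod_list: dict) -> Optional[str]:
    """Return the node name of the first scheduled pod in a PodList-shaped dict."""
    for pod in pod_list.get("items", []):
        node = pod.get("spec", {}).get("nodeName")
        if node:
            return node
    return None


def with_provision_server_selector(template: dict, node: Optional[str]) -> dict:
    """Return a copy of baremetalSetTemplate pinned to `node`.

    The input is never mutated; if no node was detected, the copy is unchanged.
    """
    if node is None:
        return dict(template)
    return {**template, "provisionServerNodeSelector": {HOSTNAME_LABEL: node}}


# Illustrative input, shaped like a trimmed pod list filtered by k8s-app=metal3.
pods = json.loads('{"items": [{"spec": {"nodeName": "master-0"}}]}')
template = {"cloudUserName": "cloud-admin"}  # illustrative value only

patched = with_provision_server_selector(template, find_metal3_node(pods))
print(json.dumps(patched, indent=2))
```

In this sketch, a missing metal3 pod leaves the template untouched, so scheduling falls back to whatever the nodeset already specified.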

openshift-ci bot (Contributor) commented Mar 4, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign tosky for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@softwarefactory-project-zuul

Merge Failed.

This change or one of its cross-repo dependencies was unable to be automatically merged with the current state of its repository. Please rebase the change and upload a new patchset.
Warning:
Error merging github.com/openstack-k8s-operators/ci-framework for 3738,d660efa12350eb88ab3c89b1d91a04abcbc82293

mnietoji force-pushed the dhcp_provisioning_with_fix branch from d660efa to 369ae18 on March 4, 2026 11:42

@softwarefactory-project-zuul

Merge Failed.

This change or one of its cross-repo dependencies was unable to be automatically merged with the current state of its repository. Please rebase the change and upload a new patchset.
Warning:
Error merging github.com/openstack-k8s-operators/ci-framework for 3738,369ae185bc2b7d5a266e63c93224f86f1d2723cd

mnietoji force-pushed the dhcp_provisioning_with_fix branch from 369ae18 to 3fa51c9 on March 4, 2026 11:49

@softwarefactory-project-zuul

Merge Failed.

This change or one of its cross-repo dependencies was unable to be automatically merged with the current state of its repository. Please rebase the change and upload a new patchset.
Warning:
Error merging github.com/openstack-k8s-operators/ci-framework for 3738,3fa51c9d28a6f3c53f0c99dbbdef1baf476724d5

@softwarefactory-project-zuul

Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/3217375613864e0d83f7f88f394dcfaa

openstack-k8s-operators-content-provider FAILURE in 7m 16s
⚠️ podified-multinode-edpm-deployment-crc SKIPPED Skipped due to failed job openstack-k8s-operators-content-provider
⚠️ cifmw-crc-podified-edpm-baremetal SKIPPED Skipped due to failed job openstack-k8s-operators-content-provider
⚠️ cifmw-crc-podified-edpm-baremetal-minor-update SKIPPED Skipped due to failed job openstack-k8s-operators-content-provider
✔️ cifmw-pod-zuul-files SUCCESS in 4m 25s
✔️ noop SUCCESS in 0s
✔️ cifmw-pod-ansible-test SUCCESS in 9m 21s
✔️ cifmw-pod-k8s-snippets-source SUCCESS in 4m 49s
✔️ cifmw-pod-pre-commit SUCCESS in 8m 44s
✔️ cifmw-architecture-validate-hci SUCCESS in 4m 54s
✔️ cifmw-molecule-ci_gen_kustomize_values SUCCESS in 5m 24s
✔️ cifmw-molecule-kustomize_deploy SUCCESS in 4m 14s

mnietoji force-pushed the dhcp_provisioning_with_fix branch from d0cf92f to 6b9c8b0 on March 4, 2026 14:44
mnietoji force-pushed the dhcp_provisioning_with_fix branch from 6b9c8b0 to 1339a1d on March 4, 2026 17:48
mnietoji force-pushed the dhcp_provisioning_with_fix branch from 1339a1d to cf58db9 on March 5, 2026 10:55
mnietoji force-pushed the dhcp_provisioning_with_fix branch 3 times, most recently from e29b915 to 08a2b2b on March 10, 2026 15:14

@softwarefactory-project-zuul

Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/0b04bcb1f4f54d518d017da862888f74

✔️ openstack-k8s-operators-content-provider SUCCESS in 2h 03m 30s
✔️ podified-multinode-edpm-deployment-crc SUCCESS in 1h 22m 08s
✔️ cifmw-crc-podified-edpm-baremetal SUCCESS in 1h 25m 42s
✔️ cifmw-crc-podified-edpm-baremetal-minor-update SUCCESS in 1h 49m 34s
✔️ cifmw-pod-zuul-files SUCCESS in 5m 28s
✔️ noop SUCCESS in 0s
✔️ cifmw-pod-ansible-test SUCCESS in 8m 35s
✔️ cifmw-pod-k8s-snippets-source SUCCESS in 4m 49s
cifmw-pod-pre-commit TIMED_OUT in 31m 04s
✔️ cifmw-architecture-validate-hci SUCCESS in 4m 29s
✔️ cifmw-molecule-ci_gen_kustomize_values SUCCESS in 5m 33s
✔️ cifmw-molecule-kustomize_deploy SUCCESS in 4m 07s

github-actions bot commented Apr 9, 2026

This PR is stale because it has been open for over 15 days with no activity.
Remove the stale label or comment, or this will be closed in 7 days.

michburk (Contributor) left a comment

Has this been tested? If so, could you link a Jira ticket, with any relevant downstream links hidden behind comments or descriptions that are marked as 'Red Hat Employee'?

Thanks!

Comment on lines +48 to +50
{% for key, value in _original_baremetal_template.items() %}
{{ key }}: {{ value }}
{% endfor %}

Contributor

Would it be possible to use something like | to_nice_yaml for this instead of manually deconstructing and reconstructing the yaml key: values?

Contributor Author

yes, improved the code
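
To illustrate the review point in this thread (this is not code from the PR): a `{{ key }}: {{ value }}` loop stringifies each value as-is, which is fine for plain scalars but not for nested mappings or empty values. The Python sketch below mimics that behaviour and contrasts it with serializing the whole mapping in one call, which is what a filter such as `to_nice_yaml` does on the Ansible side. The function names and sample keys are hypothetical, and JSON stands in for YAML here only because it needs no third-party library.

```python
import json


def render_manually(template: dict) -> str:
    """Mimic a per-key template loop: every value is stringified as-is."""
    return "\n".join(f"{key}: {value}" for key, value in template.items())


def render_serialized(template: dict) -> str:
    """Serialize the whole mapping in one call, so nesting and nulls survive."""
    return json.dumps(template, indent=2)


# Illustrative template with a nested mapping and an unset field.
template = {
    "cloudUserName": "cloud-admin",
    "provisionServerNodeSelector": {"kubernetes.io/hostname": "master-0"},
    "someOptionalField": None,  # hypothetical key, used only to show null handling
}

manual = render_manually(template)
serialized = render_serialized(template)

print(manual)      # nested dict appears as a Python repr, None as the text "None"
print(serialized)  # structure and null are preserved
```

The serialized form round-trips back to the original mapping, whereas the manual form turns the unset field into the literal text `None`.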

  ansible.builtin.include_role:
    name: run_hook

- name: Detect metal3 pod node for baremetal nodeset provisioning

Contributor

Would it make more sense to put these tasks somewhere other than execute_step.yml? Would these tasks be better-suited to living in some dedicated hook, rather than being part of this generic execute_step.yml file?

Contributor Author

yes, you are right. moved the code to a different place

mnietoji force-pushed the dhcp_provisioning_with_fix branch from 3c8b076 to dc09396 on April 13, 2026 09:17
mnietoji force-pushed the dhcp_provisioning_with_fix branch from dc09396 to 309d835 on April 14, 2026 15:09
mnietoji force-pushed the dhcp_provisioning_with_fix branch from 309d835 to 073a7c2 on April 14, 2026 15:11
mnietoji force-pushed the dhcp_provisioning_with_fix branch from 073a7c2 to 6ed9e2a on April 14, 2026 21:16
mnietoji force-pushed the dhcp_provisioning_with_fix branch 7 times, most recently from 7720fe4 to 2ffc089 on April 14, 2026 21:42
mnietoji changed the title from "[multiple] Co-locate provisionserver with metal3 to prevent DHCP fail…" to "[ci_gen_kustomize_values] Co-locate provisionserver with metal3 to prevent DHCP failures" on Apr 14, 2026
…event DHCP failures

When metal3-dnsmasq pod restarts during a node's DHCP lease renewal on
the provisioning network (172.23.0.0/24), NetworkManager fails to renew
and sets ipv4.method=disabled. NMState operator then preserves this
disabled state, causing permanent loss of provisioning network
connectivity on that node.

The issue occurs when OpenStackProvisionServer and metal3 pods run on
different nodes. If metal3 restarts while a node is attempting DHCP
renewal, the temporary unavailability of metal3-dnsmasq causes the
renewal to fail.

Solution:
Automatically detect the node running metal3 pod (via k8s-app=metal3
label) and configure provisionServerNodeSelector in baremetalSetTemplate
to schedule OpenStackProvisionServer on the same node. This ensures
provisioning network connectivity is maintained because
metal3-static-ip-manager maintains a static IP (172.23.0.3) on the
metal3 node regardless of dnsmasq restarts.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Miguel Angel Nieto Jimenez <mnietoji@redhat.com>