Skip to content

allow the status-section to be updated on a crm_shadow commit with an explicit switch#4106

Open
wenningerk wants to merge 2 commits into
ClusterLabs:mainfrom
wenningerk:modify_fence_scsi_devices
Open

allow the status-section to be updated on a crm_shadow commit with an explicit switch#4106
wenningerk wants to merge 2 commits into
ClusterLabs:mainfrom
wenningerk:modify_fence_scsi_devices

Conversation

@wenningerk

@wenningerk wenningerk commented May 7, 2026

Copy link
Copy Markdown
Contributor

Investigating why with Pacemaker 2.1.7 the high-level-tooling approach to prevent a resource restart as a result of a parameter change in the config by in parallel changing the digests in the status section wouldn't work anymore ...
Was barking at the wrong tree investigating the commit that changes how digests are calculated.
In parallel there was as well a commit in Pacemaker 2.1.7 ( ec65417) that removes the special handling of replacing the full cib and now triggers an election + join refresh whenever an “unsafe” client modifies the status section, overwriting the manually pushed digest before the scheduler can see it.
Added a ‘--commit-status’ option to crm_shadow that prevents this behavior (client will claim to be crm_shadow_status and this is added to the safe clients list).

@wenningerk wenningerk force-pushed the modify_fence_scsi_devices branch from 62b194c to a09fa0c Compare May 20, 2026 14:37
@wenningerk wenningerk changed the title [WIP] don't trigger restart of fence_scsi when devices are modified and start digest is adapted accordingly allow the status-section to be updated on a crm_shadow commit with an explicit switch May 20, 2026

@nrwahl2 nrwahl2 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here are a few comments. Two of them are requests to improve maintainability. The third comment questions this approach more broadly.

I'll try to give this issue some further thought tonight as well.

Comment thread cts/cts-cli.in
"crm_shadow --diff",
expected_rc=ExitStatus.ERROR),
], cib_gen=partial(copy_existing_cib, f"{cts_cli_data}/crm_mon.xml")),
TestGroup([

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add these tests before the fix commit, with expected_rc set to an error? Then update the expected_rc and the output in the fix commit

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Test first - Ok. Just wasn't sure we were pushing for that. Will do.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As this (at least technically) isn't a change that fixes a behavior but rather a feature added I chose to add a test prior to the commit that checks if the - in this case - undesired behavior (status not comited) is there. There is probably not much sense in checking that the new feature is not there by checking if trying to use it gives us a failure.
Together with the feature-commit I'm then checking if the new feature can be used to achieve the desired behavior. Was hesitant if I should have left the test checking the behavior without the new switch.

"crm_attribute",
"crm_node",
"crm_resource",
"crm_shadow_status",

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need the commit message to reference RHEL-70283, and to explain why we're doing this (since future contributors may not be able to access this RHEL Jira issue).

We've encountered numerous issues when commit messages don't explain the reasoning for the changes in the commit. We often end up, much later, removing or changing behavior that we depend on. At best, we spend a long time trying to figure out why things are the way they are, and we're often not confident in the conclusions we reach.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As it doesn't fix the issue right away I can add that being able to alter the status section is a pre-requisite for RHEL-70283 and that this commit gives us back this possibility.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Of course I would have loved to get some pcs feedback regarding the approach.
Which is why I put it here for now as something generically useful. But the reference to what triggered the implementation is probably still a good idea.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a bit of data that should help understand why the feature was added including a reference to the Jira Issue.

Comment thread tools/crm_shadow.c
return;
}

if (options.update_status) {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand how this would fix the issue. Based on RHEL-70283, it appears that the issue is that adding devices to a fence_scsi or fence_mpath resource using pcs stonith update causes resource restarts. However, pcs doesn't use crm_shadow at all. So updating the crm_shadow CLI tool doesn't seem like it would affect the behavior of pcs.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't fix the issue right away. There is already RHELHA-1011 to take care of this from the pcs side. Just after your commit there was no low-level-tool left you could use to update the status section. Both for getting an atomic update and for the interface crm_shadow seems like the reasonable answer.

As since ec65417 there is no tool
available anymore to modify the status section and crm_shadow would
be the way to do that in an atomic way adding the switch --update-status.
In favor to reverting to behavior before the commit above this
approach doesn't introduce potentially unintentional behavioral
changes.
This was triggered by pcs feature that allows altering fence_scsi
devices without resource restart altering the digests in already
recorded actions in the status section. (Jira Issue RHEL-70283)
@wenningerk wenningerk force-pushed the modify_fence_scsi_devices branch from a09fa0c to 73e0e5c Compare June 1, 2026 12:49
@mirecheck

Copy link
Copy Markdown

Would it be possible to implement similar option for the cibadmin tool? So that
the resource refresh does not occur after updating the status section.

Pcs uses diff-based push:

  1. crm_diff --original <old_cib_file> --new <new_cib_file> --no-version
  2. cibadmin --patch --verbose --xml-pipe (with diff XML on stdin)

It would be more easier for pcs to just add another option to the cibadmin call
than integrating crm_shadow.

@wenningerk

Copy link
Copy Markdown
Contributor Author

Would it be possible to implement similar option for the cibadmin tool? So that the resource refresh does not occur after updating the status section.

Pcs uses diff-based push:

  1. crm_diff --original <old_cib_file> --new <new_cib_file> --no-version
  2. cibadmin --patch --verbose --xml-pipe (with diff XML on stdin)

It would be more easier for pcs to just add another option to the cibadmin call than integrating crm_shadow.

Basically there were 2 reasons to go for crm_shadow:

  • it is an atomic update (which might be the case for your diff as well if applied with a single cibadmin call - have to check)
  • adding the switch to crm_shadow kind of gives you the possibility to use any tooling on the status-section (of a shadow-cib then) with a minimal change to the code-base in pacemaker

From an architectural pov it would be definitely cleaner to have a little bit more code for this very - anyway hacky -
feature in the high-level-tooling instead of bloating the pacemaker code-base.
If using cibadmin reduces the risk of a race with a concurrent status-update that gets reverted with a full status-push it might be worth thinking over once again though - given of course the cibadmin --patch thing works in an atomic manner.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants