
fix: harden reassociation barriers for fast-math#1276

Open
DiamonDinoia wants to merge 1 commit into xtensor-stack:master from DiamonDinoia:fix/nearbyint-fastmath

Conversation

@DiamonDinoia
Contributor

There is a bug in nearbyint: -fassociative-math breaks it because it does not define __FAST_MATH__.
Also, the barrier used there was causing a stack spill.

I centralized a barrier function that we can use everywhere in the code and used it in the places I know it helps.

Now we can just use reassociation_barrier to avoid compiler reordering of instructions.

Let me know if you like the internal API and whether you need changes to it. I find this is the solution that minimizes ifdef boilerplate and allows dispatching to all archs. (With C++17 this will be simpler.)

Cheers,
Marco

@DiamonDinoia DiamonDinoia force-pushed the fix/nearbyint-fastmath branch 5 times, most recently from f9a9992 to 2f9a431 Compare March 23, 2026 19:55
@DiamonDinoia DiamonDinoia changed the title fix: harden reassociation barriers for fast-math nearbyint fix: harden reassociation barriers for fast-math Mar 23, 2026
@DiamonDinoia
Contributor Author

@serge-sans-paille this is also ready for review.

@DiamonDinoia DiamonDinoia force-pushed the fix/nearbyint-fastmath branch 5 times, most recently from 3685ff0 to c4dec73 Compare March 27, 2026 19:57
@DiamonDinoia
Contributor Author

I think I can simplify this a bit more. A is not needed anymore if we go this route.

@DiamonDinoia DiamonDinoia force-pushed the fix/nearbyint-fastmath branch from c4dec73 to 6d6bc61 Compare March 28, 2026 22:41
Use arch-specific register constraints to prevent -ffast-math from
reassociating arithmetic without forcing a register spill to the stack.

Each platform's base arch header provides a reassociation_barrier
overload using the tightest register constraint for that target:
  - x86 (sse2.hpp):    "+x"  — XMM/YMM/ZMM
  - ARM (neon.hpp):    "+w"  — NEON vector
  - ARM (sve.hpp):     "+w"  — SVE Z-register
  - PPC (vsx.hpp):     "+wa" — VS register
  - fallback (common): "r"(&x) + "memory" clobber

The x86 overload uses template<T,A> to catch all x86 arches (sse2, avx,
avx512f and descendants) via overload resolution against the common
fallback's requires_arch<common>.

Also adds a mandatory const char* reason parameter to document why each
barrier exists at each call site, and removes the now-unused
memory_barrier_tag.
@DiamonDinoia DiamonDinoia force-pushed the fix/nearbyint-fastmath branch from 6d6bc61 to d571f2f Compare March 28, 2026 22:42
template <class T>
XSIMD_INLINE void reassociation_barrier(T& x, const char*) noexcept
{
#if XSIMD_WITH_INLINE_ASM
Contributor


Shouldn't we make this empty if we're not under fast-math?

@DiamonDinoia
Contributor Author

DiamonDinoia commented Mar 29, 2026 via email

@serge-sans-paille
Contributor

Well, I don't think we can say that

__asm__ volatile("" : : "r"(&x) : "memory");

is a no-op. Should we allow for a user-defined XSIMD_REASSOCIATIVE, which would be forced by __FAST_MATH__ but could also be set by the user?
I don't want fast-math to impact non-fast math code generation.

@DiamonDinoia
Contributor Author

DiamonDinoia commented Mar 29, 2026

Let me think about this for a second. MSVC does not allow inline asm. I should update the comment there. There might be better alternatives for scalars, WASM, and RISC-V that don't impact performance.

