fix: harden reassociation barriers for fast-math#1276
Open
DiamonDinoia wants to merge 1 commit intoxtensor-stack:masterfrom
Open
fix: harden reassociation barriers for fast-math#1276DiamonDinoia wants to merge 1 commit intoxtensor-stack:masterfrom
DiamonDinoia wants to merge 1 commit intoxtensor-stack:masterfrom
Conversation
f9a9992 to
2f9a431
Compare
Contributor
Author
|
@serge-sans-paille this is also ready for review. |
serge-sans-paille
requested changes
Mar 27, 2026
3685ff0 to
c4dec73
Compare
Contributor
Author
|
I think I can simplify this a bit more. A is not needed anymore if I we go this route. |
c4dec73 to
6d6bc61
Compare
Use arch-specific register constraints to prevent -ffast-math from reassociating arithmetic without forcing a register spill to the stack. Each platform's base arch header provides a reassociation_barrier overload using the tightest register constraint for that target: - x86 (sse2.hpp): "+x" — XMM/YMM/ZMM - ARM (neon.hpp): "+w" — NEON vector - ARM (sve.hpp): "+w" — SVE Z-register - PPC (vsx.hpp): "+wa" — VS register - fallback (common): "r"(&x) + "memory" clobber The x86 overload uses template<T,A> to catch all x86 arches (sse2, avx, avx512f and descendants) via overload resolution against the common fallback's requires_arch<common>. Also adds a mandatory const char* reason parameter to document why each barrier exists at each call site, and removes the now-unused memory_barrier_tag.
6d6bc61 to
d571f2f
Compare
| template <class T> | ||
| XSIMD_INLINE void reassociation_barrier(T& x, const char*) noexcept | ||
| { | ||
| #if XSIMD_WITH_INLINE_ASM |
Contributor
There was a problem hiding this comment.
Shouldn't we make this empty if we're not under fast-math?
Contributor
Author
|
Well, if we are not under fast math this should essentially be a no op.
Also, fast math does not detect if only associative math is enabled . Which
also breaks it. Now it becames checking all flags of the compiler that can
reorder floating points.
…On Sunday, March 29, 2026, serge-sans-paille ***@***.***> wrote:
***@***.**** commented on this pull request.
------------------------------
In include/xsimd/arch/common/xsimd_common_details.hpp
<https://urldefense.com/v3/__https://github.com/xtensor-stack/xsimd/pull/1276?email_source=notifications&email_token=ACGKNQMMCG3APXMEDVTJKC34TDCDPA5CNFSNUABKM5UWIORPF5TWS5BNNB2WEL2QOVWGYUTFOF2WK43UKJSXM2LFO4XTIMBSGY2DINRZGMZ2M4TFMFZW63VGMF2XI2DPOKSWK5TFNZ2K64DSL5ZGK5TJMV3V6Y3MNFRWW*discussion_r3005807424__;Iw!!DSb-azq1wVFtOg!WiXH_0OrBz6JCIv7x-DiGYBuUBKXVDKcuEENC8ic8K5UQ3HRuKK4aVV6FfZKWDQfRzIQGM5ZEcg7sPaZKJwzfq1tKY2y44AK$>
:
> + // exists at each call site; it is unused at runtime.
+ //
+ // Two overloads:
+ // reassociation_barrier(reg, reason) – raw register
+ // reassociation_barrier(batch, reason) – extracts .data
+ //
+ // Uses the tightest register-class constraint for the target so
+ // the value stays in its native SIMD register (no spill):
+ // x86 (SSE/AVX/AVX-512) : "+x" – XMM / YMM / ZMM
+ // ARM (NEON / SVE) : "+w" – vector / SVE Z-reg
+ // PPC (VSX) : "+wa" – VS register
+ // other / MSVC : address + memory clobber (fallback)
+ template <class T>
+ XSIMD_INLINE void reassociation_barrier(T& x, const char*) noexcept
+ {
+#if XSIMD_WITH_INLINE_ASM
Shouldn't we make this empty if we're not under fast-math?
—
Reply to this email directly, view it on GitHub
<https://urldefense.com/v3/__https://github.com/xtensor-stack/xsimd/pull/1276?email_source=notifications&email_token=ACGKNQPGDROZDUYFFC3B7XL4TDCDPA5CNFSNUABKM5UWIORPF5TWS5BNNB2WEL2QOVWGYUTFOF2WK43UKJSXM2LFO4XTIMBSGY2DINRZGMZ2M4TFMFZW63VGMF2XI2DPOKSWK5TFNZ2L24DSL5ZGK5TJMV3V63TPORUWM2LDMF2GS33OONPWG3DJMNVQ*pullrequestreview-4026446933__;Iw!!DSb-azq1wVFtOg!WiXH_0OrBz6JCIv7x-DiGYBuUBKXVDKcuEENC8ic8K5UQ3HRuKK4aVV6FfZKWDQfRzIQGM5ZEcg7sPaZKJwzfq1tKYc70ZNS$>,
or unsubscribe
<https://urldefense.com/v3/__https://github.com/notifications/unsubscribe-auth/ACGKNQN7OHPQ2TPQJDWPYTL4TDCDPAVCNFSM6AAAAACWYMMMSGVHI2DSMVQWIX3LMV43YUDVNRWFEZLROVSXG5CSMV3GSZLXHM2DAMRWGQ2DMOJTGM__;!!DSb-azq1wVFtOg!WiXH_0OrBz6JCIv7x-DiGYBuUBKXVDKcuEENC8ic8K5UQ3HRuKK4aVV6FfZKWDQfRzIQGM5ZEcg7sPaZKJwzfq1tKdMsdfC6$>
.
You are receiving this because you authored the thread.Message ID:
***@***.***>
|
Contributor
|
Well, I don't think we can say that is a no-op. Should we allow for user-defined |
Contributor
Author
|
Let me think about this for a second. MSVC does not allow inline asm. I should update the comment there. There might be better alternatives for scalars, WASM, and RISC-V that don't impact performance. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
There is a but in nearbyint
-fassociative-mathbreaks it as it does not define__FAST_MATH__.Also the barrier used was causing a stack spill.
I centralized a barrier function that we can use everywhere in the code and used it in the places I know it helps.
Now we can just use
reassociation_barrierto avoid compiler reordering of instructions.Let me know if you like the internal API and if you need changes to it. I find that this is the solution that minimizes ifdef boilerplate and allows to dispatch to all archs. (With c++17 this will be simpler).
Cheers,
Marco