[Quantization] MSE-calibrate every per-expert weight in fused-experts MoE #1421
cjluo-nv wants to merge 2 commits into
Conversation
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review.
📝 Walkthrough
Adds NVFP4-static grouped weight-quantizer support and synchronization, bootstraps missing per-weight amax, ensures per-expert fused-expert calibration coverage, adjusts fused-expert export amax slicing for static block quant, and sanitizes HF generation config before save.
Changes: NVFP4-static grouped quantizer support with MoE improvements
🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks: ✅ Passed checks (6 passed)
(force-pushed from 360b53e to 8e21516)
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@modelopt/torch/export/unified_export_hf.py`:
- Around line 1138-1160: The function _sanitize_generation_config_for_save
currently mutates model.generation_config.do_sample in-place and never restores
it; change the flow so the original value is preserved and restored after export
(e.g., capture original_do_sample = getattr(gc, "do_sample", None) before
setting gc.do_sample = True and restore gc.do_sample = original_do_sample after
the save operation), and apply the same pattern to the other affected block
around the code referenced at 1262-1270; locate usages of
_sanitize_generation_config_for_save (and the other block) surrounding the
save_pretrained/export call and ensure restoration occurs even on exceptions
(use try/finally or a context manager).
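For reference, a minimal sketch of the capture-and-restore pattern this asks for; `_temporarily_force_do_sample` is a hypothetical helper name and the call site below is illustrative, not the file's actual code:

```python
from contextlib import contextmanager


@contextmanager
def _temporarily_force_do_sample(model):
    """Force do_sample=True only for the duration of the save, then restore it."""
    gc = getattr(model, "generation_config", None)
    needs_patch = gc is not None and (
        getattr(gc, "top_k", None) is not None or getattr(gc, "top_p", None) is not None
    )
    original = getattr(gc, "do_sample", None) if needs_patch else None
    if needs_patch:
        gc.do_sample = True
    try:
        yield
    finally:
        # Restore even if save_pretrained raises, so nothing leaks to the caller.
        if needs_patch:
            gc.do_sample = original


# Illustrative call site:
# with _temporarily_force_do_sample(model):
#     model.save_pretrained(export_dir, ...)
```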
In `@modelopt/torch/quantization/model_calib.py`:
- Around line 121-162: The bootstrap loop in
_bootstrap_uncalibrated_weight_quantizers must run weight reads inside
enable_weight_access_and_writeback() so FSDP/HF-TP/offload sharded modules
perform proper local access/writeback instead of triggering an access failure
swallowed by the blanket except; wrap the per-module calibration work (the call
to module.iter_weights_for_calibration() and the q(weight) calibration call
inside the loop) with with enable_weight_access_and_writeback(module):, remove
or narrow the broad try/except that currently hides access errors so genuine
access failures surface, and keep the rest of the logic (q.disable_quant(),
q.enable_calib(), q(weight), q.load_calib_amax(), q.enable_quant(),
q.disable_calib(), q._calibrator.reset()) unchanged.
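A minimal sketch of the loop shape being asked for, using only the calls named above; the `enable_weight_access_and_writeback` signature, its import location, and the `_amax` guard are assumptions about the repo's internals:

```python
import torch


@torch.no_grad()
def _bootstrap_uncalibrated_weight_quantizers(model):
    """Sketch: calibrate weight quantizers that the forward pass never reached."""
    n_bootstrapped = 0
    for module in model.modules():
        if not hasattr(module, "iter_weights_for_calibration"):
            continue
        # Weight reads happen inside the context so FSDP / HF-TP / offloaded shards
        # are gathered and written back correctly instead of failing silently.
        with enable_weight_access_and_writeback(module):  # assumed utility, signature may differ
            for weight, q in module.iter_weights_for_calibration():
                if q is None or getattr(q, "_amax", None) is not None:
                    continue  # already calibrated by the forward pass
                q.disable_quant()
                q.enable_calib()
                q(weight)  # feed the weight slice to the existing calibrator
                q.load_calib_amax()
                q.enable_quant()
                q.disable_calib()
                q._calibrator.reset()
                n_bootstrapped += 1
    return n_bootstrapped
```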
📒 Files selected for processing (6)
- modelopt/torch/export/moe_utils.py
- modelopt/torch/export/unified_export_hf.py
- modelopt/torch/quantization/model_calib.py
- modelopt/torch/quantization/nn/modules/tensor_quantizer.py
- modelopt/torch/quantization/plugins/huggingface.py
- modelopt/torch/quantization/utils/core_utils.py
(force-pushed from 8e21516 to adee8b5)
Signed-off-by: Chenjie Luo <[email protected]>
(force-pushed from adee8b5 to 12e3c24)
♻️ Duplicate comments (1)
modelopt/torch/export/unified_export_hf.py (1)
1137-1148: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win: Scope generation-config mutation and patch cleanup to avoid state leakage.
`_sanitize_generation_config_for_save` mutates `model.generation_config.do_sample` permanently, and it currently runs outside the `try/finally` that unpatches transformers internals. If sanitize raises, `_unpatch_revert_weight_conversion` is skipped; if it succeeds, `do_sample` still leaks into later calls on the same model object.
Proposed fix
```diff
-def _sanitize_generation_config_for_save(model: torch.nn.Module) -> None:
+def _sanitize_generation_config_for_save(model: torch.nn.Module) -> Callable[[], None]:
@@
-    gc = getattr(model, "generation_config", None)
-    if gc is None:
-        return
+    gc = getattr(model, "generation_config", None)
+    if gc is None:
+        return lambda: None
     if getattr(gc, "top_k", None) is not None or getattr(gc, "top_p", None) is not None:
-        gc.do_sample = True
+        original_do_sample = getattr(gc, "do_sample", None)
+        gc.do_sample = True
+        return lambda: setattr(gc, "do_sample", original_do_sample)
+    return lambda: None
@@
-    _sanitize_generation_config_for_save(model)
-
-    try:
+    restore_generation_config = lambda: None
+    try:
+        restore_generation_config = _sanitize_generation_config_for_save(model)
         model.save_pretrained(
             export_dir,
             state_dict={**post_state_dict, **(extra_state_dict or {})},
             save_modelopt_state=save_modelopt_state,
             max_shard_size=max_shard_size,
         )
     finally:
+        restore_generation_config()
         _unpatch_revert_weight_conversion(_patches)
```
Also applies to: 1242-1254
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@modelopt/torch/export/unified_export_hf.py` around lines 1137 - 1148, The helper _sanitize_generation_config_for_save currently mutates model.generation_config.do_sample permanently and runs outside the unpatch/cleanup flow, causing state leakage and skipped cleanup on exceptions; change it to only temporarily set do_sample: inside the same try/finally that calls _unpatch_revert_weight_conversion, capture the original value (orig = getattr(gc, "do_sample", None)), set gc.do_sample = True only when sampling attrs exist, and always restore gc.do_sample = orig in the finally block so the model's generation_config is unchanged after save. Apply the identical temporary-mutation+restore pattern to the other occurrence referenced around lines 1242-1254.
📒 Files selected for processing (6)
- modelopt/torch/export/moe_utils.py
- modelopt/torch/export/unified_export_hf.py
- modelopt/torch/quantization/model_calib.py
- modelopt/torch/quantization/nn/modules/tensor_quantizer.py
- modelopt/torch/quantization/plugins/huggingface.py
- modelopt/torch/quantization/utils/core_utils.py
✅ Files skipped from review due to trivial changes (1)
- modelopt/torch/quantization/plugins/huggingface.py
🚧 Files skipped from review as they are similar to previous changes (4)
- modelopt/torch/quantization/nn/modules/tensor_quantizer.py
- modelopt/torch/quantization/utils/core_utils.py
- modelopt/torch/export/moe_utils.py
- modelopt/torch/quantization/model_calib.py
meenchen
left a comment
Bot review — DM the bot to share feedback.
Follow-up bug fix to #1407 for dead-expert MSE calibration in fused-experts MoE. The design is not a new subsystem — the new sync_grouped_weight_global_amax composes on the existing preprocess_linear_fusion from quant_utils.py, and iter_weights_for_calibration extends an already-existing base-class hook. The fix targets a real correctness bug (gate/up weight_scale_2 divergence on experts that received no calibration tokens), backed by end-to-end Qwen3.5-122B numbers (0/12288 vs 1/12288 mismatches).
Reasons I did not approve:
- No unit tests for non-trivial new behavior. The PR body explicitly opts out ("Did you write any new necessary tests?: ❌"). `tests/unit/torch/quantization/plugins/test_fused_experts.py` already has good infrastructure (`_SyntheticFusedExperts`, `TestFusedExpertsCalibration`) that could exercise:
  - `_QuantFusedExperts.iter_weights_for_calibration` yielding `num_experts * 2` pairs (a sketch follows this comment).
  - `_bootstrap_uncalibrated_weight_quantizers` populating `_amax` on never-routed experts (simulate by only routing to a subset in `forward_loop`).
  - The new `_export_fused_experts` per-block amax reshape path for NVFP4 (the existing `test_uncalibrated_expert_gate_up_share_amax` covers the per-tensor fallback but not the new `amax.numel() % fused_total == 0` reshape branch).
  - `sync_grouped_weight_global_amax` unifying `global_amax` across a Q/K/V sibling group.
- `_sanitize_generation_config_for_save` silently mutates user state. It flips `do_sample=False → True` on the model's live `generation_config` whenever `top_k`/`top_p` is set, and this mutation is persisted to the exported `generation_config.json`. In practice this is a semantic no-op for greedy decoding, but the change is invisible to the caller. Consider (a) emitting a warning naming the fields being rewritten, or (b) doing the mutation on a copy that's scoped to the `save_pretrained` call only.
- `_GROUPED_WEIGHT_QUANTIZER_PATTERNS` is an architecture-name heuristic. The hardcoded tuple covers Llama/Qwen/Mistral/Mixtral but will silently miss any model using different attribute names (`wqkv`, fused `qkv_proj`, DeepSeek naming variants, etc.) — grouped unification would just not run and export would fall back to per-module amax. Worth either documenting this as a known limitation in the docstring or logging when a model has NVFP4-static quantizers but produces zero groups.

Minor: `sync_grouped_weight_global_amax` is added to `__all__` as public API, but given the hardcoded sibling-name heuristic it's really an internal helper; consider dropping it from `__all__` or renaming to `_sync_...`.
No licensing changes. Size is fine (+195/-71 across 6 files). Happy for a human reviewer with MoE export context to make the final call on the testing gap and the generation_config mutation policy.
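For the first test idea above, a hedged sketch of what it could look like; the fixture name and the `num_experts` attribute are assumptions rather than the suite's actual fixtures:

```python
def test_iter_weights_yields_all_expert_pairs(quant_fused_experts_module):
    # One (weight_slice, quantizer) pair per projection (gate_up and down) per expert.
    pairs = list(quant_fused_experts_module.iter_weights_for_calibration())
    assert len(pairs) == quant_fused_experts_module.num_experts * 2
    for weight_slice, quantizer in pairs:
        assert weight_slice is not None
        assert quantizer is not None
```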
Codecov Report
❌ Patch coverage is …
Additional details and impacted files
```
@@            Coverage Diff             @@
##             main    #1421      +/-   ##
==========================================
- Coverage   77.30%   72.76%    -4.54%
==========================================
  Files         478      478
  Lines       51404    52274     +870
==========================================
- Hits        39737    38038    -1699
- Misses      11667    14236    +2569
```
Flags with carried forward coverage won't be shown. View full report in Codecov by Sentry.
(force-pushed from 12e3c24 to 0869a90)
Actionable comments posted: 1
🧹 Nitpick comments (1)
tests/unit/torch/quantization/plugins/test_fused_experts.py (1)
636-729: ⚡ Quick win: Exercise the public MSE path in this dead-expert regression.
This proves `_bootstrap_uncalibrated_weight_quantizers()` works, but it won't fail if `mse_calibrate()` stops calling the helper or calls it in the wrong order. Since the bug is specifically in the MSE calibration flow, I'd add one end-to-end assertion through `algorithm="mse"` or `mse_calibrate()` here as well.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/unit/torch/quantization/plugins/test_fused_experts.py` around lines 636-729: add an end-to-end assertion that runs the MSE calibration path (either by calling mtq.quantize with quant_cfg["algorithm"]="mse" or invoking mse_calibrate(...) after the existing max-path partial_forward) and then assert that dead experts' weight quantizers (experts.gate_up_proj_weight_quantizers[idx] and experts.down_proj_weight_quantizers[idx]) have populated non-zero _amax and that their _amax matches per-row max(|weight|) as done for the max path; this ensures mse_calibrate still calls the same bootstrap helper (_bootstrap_uncalibrated_weight_quantizers) or otherwise populates the quantizers correctly.
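A hedged sketch of such an end-to-end check; the fixtures are assumptions and `NVFP4_DEFAULT_CFG` is a stand-in for whatever experts-only MSE recipe the suite actually uses:

```python
import copy

import modelopt.torch.quantization as mtq


def test_dead_experts_calibrated_via_mse(tiny_fused_moe_model, calib_batches):
    cfg = copy.deepcopy(mtq.NVFP4_DEFAULT_CFG)  # placeholder recipe
    cfg["algorithm"] = "mse"

    def forward_loop(model):
        for batch in calib_batches:  # routes tokens to only a subset of experts
            model(batch)

    model = mtq.quantize(tiny_fused_moe_model, cfg, forward_loop)

    for experts in (m for m in model.modules() if hasattr(m, "gate_up_proj_weight_quantizers")):
        quantizers = list(experts.gate_up_proj_weight_quantizers) + list(
            experts.down_proj_weight_quantizers
        )
        for q in quantizers:
            # Dead experts must also end up with a populated amax after MSE calibration.
            assert getattr(q, "_amax", None) is not None
```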
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@modelopt/torch/quantization/model_calib.py`:
- Around line 167-185: The warning in _sync_grouped_weight_global_amax currently
fires whenever any NVFP4-static TensorQuantizer exists but no groups matched;
change this to first check whether the model contains any potential sibling
candidates before warning: scan model.modules() for modules whose names or
attributes match the grouping heuristics (i.e. the same criteria used by
_GROUPED_WEIGHT_QUANTIZER_PATTERNS or the sibling attribute tests used when
building groups) and only if such candidate siblings exist and n_groups == 0 and
has_nvfp4_static is true, emit the warnings.warn; update the logic around
n_groups, has_nvfp4_static, and the invocation of
_GROUPED_WEIGHT_QUANTIZER_PATTERNS in _sync_grouped_weight_global_amax to
reflect this guarded check.
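A minimal sketch of the guarded check described above; `_has_sibling_candidates` is a hypothetical helper, and the patterns argument would be the PR's `_GROUPED_WEIGHT_QUANTIZER_PATTERNS` tuple:

```python
def _has_sibling_candidates(model, patterns):
    """True if any module has a full set of sibling children from one pattern group."""
    for module in model.modules():
        child_names = {name for name, _ in module.named_children()}
        if any(all(sibling in child_names for sibling in group) for group in patterns):
            return True
    return False


# Inside _sync_grouped_weight_global_amax, only warn when a grouping was actually expected:
# if n_groups == 0 and has_nvfp4_static and _has_sibling_candidates(model, _GROUPED_WEIGHT_QUANTIZER_PATTERNS):
#     warnings.warn("NVFP4-static weight quantizers found but no sibling groups matched ...")
```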
📒 Files selected for processing (6)
- modelopt/torch/export/moe_utils.py
- modelopt/torch/export/unified_export_hf.py
- modelopt/torch/quantization/model_calib.py
- modelopt/torch/quantization/nn/modules/tensor_quantizer.py
- modelopt/torch/quantization/plugins/huggingface.py
- tests/unit/torch/quantization/plugins/test_fused_experts.py
🚧 Files skipped from review as they are similar to previous changes (4)
- modelopt/torch/quantization/nn/modules/tensor_quantizer.py
- modelopt/torch/quantization/plugins/huggingface.py
- modelopt/torch/export/unified_export_hf.py
- modelopt/torch/export/moe_utils.py
meenchen
left a comment
Bot review — DM the bot to share feedback.
Re-review of #1421 (dead-expert MSE calibration for fused-experts MoE).
Previous critical comments — addressed:
- ✅ `_bootstrap_uncalibrated_weight_quantizers` now runs inside `enable_weight_access_and_writeback(module, model, name_to_module)` and the blanket try/except was removed (addresses CodeRabbit's FSDP-safety concern).
- ✅ Unit tests added in `tests/unit/torch/quantization/plugins/test_fused_experts.py`: `test_bootstrap_populates_dead_expert_quantizers` exercises the dead-expert path end-to-end including the `max(|weight|)` invariant, and `test_per_block_amax_reshape_for_fused_export` guards the new NVFP4 per-block amax reshape branch in `_export_fused_experts`. Addresses meenchen's main nudge reason.
- ✅ The new helpers are underscore-prefixed and not exported in `__all__` (addresses the "public API for hardcoded heuristic" concern).
- ✅ The `_GROUPED_WEIGHT_QUANTIZER_PATTERNS` limitation is documented in the `_sync_grouped_weight_global_amax` docstring and a warning is emitted when matching fails despite NVFP4-static quantizers existing.
Still unresolved (minor, non-blocking individually):
- ❌ `_sanitize_generation_config_for_save` still mutates `model.generation_config.do_sample` in-place with no restore. If `save_pretrained` raises, `_unpatch_revert_weight_conversion` still runs (it's in `finally`) but `do_sample` is never reverted — callers reusing the model object silently get `do_sample=True`. A try/finally-scoped restore (captured-original-value pattern) was suggested twice by CodeRabbit and not applied. Consider wrapping in the existing `try/finally` around `model.save_pretrained`.
- ❌ The `_sync_grouped_weight_global_amax` warning is still gated only on `has_nvfp4_static`, so models with only standalone NVFP4 linears (no Q/K/V or gate/up siblings at all) will emit a misleading warning every time `mse_calibrate`/`local_hessian_calibrate` runs. A guard on "candidate siblings exist" would make it actionable.
- ❌ No MSE end-to-end test — the bootstrap path is only exercised via the direct helper, not via `mse_calibrate(..., algorithm="mse")`. If a future refactor stops calling the helper from `mse_calibrate`, the test won't catch it.
Design check: not a new subsystem. _sync_grouped_weight_global_amax composes on preprocess_linear_fusion from quant_utils.py; iter_weights_for_calibration extends an existing base-class hook; new helpers are small and localized.
No licensing changes. Size OK (+423/-71, 7 files). Core correctness fix looks right and is backed by end-to-end Qwen3.5-122B/35B validation. Flagging for human sign-off on the three minor items above — all three are stylistic/defensive rather than correctness bugs, but were called out explicitly by the previous reviewers and not acted on.
```python
@torch.no_grad()
def _bootstrap_uncalibrated_weight_quantizers(model: nn.Module) -> int:
    """Populate ``_amax`` from weights for quantizers the forward pass didn't reach.
```
@realAsma After PTQ, if there are uncalibrated amaxes I thought we intentionally want to throw an error/warning and tell the user to increase the number of samples/sequence length. This changes the behavior to automatically fill in the amaxes without any logging/warning, which seems problematic.
This docstring is also misleading -- it sounds like you are loading amax from the weights, but it should say that you are recalibrating weights with missing `_amax`.
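If the maintainers want to keep the old "tell the user" behavior visible, one option (a sketch of the call site, not the PR's code) is to warn whenever the bootstrap fills anything:

```python
import warnings

# Assuming the helper returns the number of quantizers it bootstrapped, as in this PR.
n_bootstrapped = _bootstrap_uncalibrated_weight_quantizers(model)
if n_bootstrapped:
    warnings.warn(
        f"{n_bootstrapped} weight quantizers received no calibration data from the forward "
        "loop; their amax was bootstrapped from the weights. Consider increasing the number "
        "of calibration samples or the sequence length."
    )
```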
```python
if n_groups == 0:
    # Surface architectures whose Q/K/V or gate/up siblings don't match the
    # pattern list — without this, sibling-sync is a silent no-op.
    has_nvfp4_static = any(
```
Why not reuse the `is_nvfp4_static` method from `modelopt/torch/quantization/nn/modules/tensor_quantizer.py`?
```python
)

# Identify weight quantizers by checking if they have corresponding weight parameters
# Step 3: calibrate weight quantizers via iter_weights_for_calibration.
```
This comment is too long and specific to fused experts. Remove the mention of fused experts to make it more generic, and in general reduce the length of AI comments.
```python
weight_quantizer.enable_quant()
weight_quantizer.disable_calib()

# Synchronize ALL CUDA devices before resetting to ensure all async operations complete
```
why is this comment removed?
The old comment described the previous weight_attr_names lookup; with the refactor to parent_module.iter_weights_for_calibration() the call site is self-describing. Trimmed the replacement block to a single line per your other comment.
(force-pushed from 0869a90 to 9203072)
… MoE
Two-part fix for transformers 5.x fused-experts containers (Qwen3-MoE /
Qwen3.5-MoE / Mixtral / DeepSeek / Kimi-K2.x ...) where weight quantizers
live in `nn.ModuleList`s (`gate_up_proj_weight_quantizers`,
`down_proj_weight_quantizers`):
1. Add `_QuantFusedExperts.iter_weights_for_calibration` that yields
per-expert (weight_slice, quantizer) pairs for both projections. The base
impl uses singular `*_weight_quantizer` and silently skips fused-experts
modules, so weight-only calibration paths never reach per-expert
quantizers.
2. Refactor `mse_calibrate`:
- Add `_bootstrap_uncalibrated_weight_quantizers` after `max_calibrate`
to populate `_amax` on quantizers the forward pass didn't reach (dead
MoE experts that received no calibration tokens). Runs the existing
calibrator on the weight slice surfaced by
`iter_weights_for_calibration`.
- Replace the singular-only `weight_attr_names` discovery + `getattr`-by-
name walk with an `iter_weights_for_calibration` walk done inside each
parent module's `enable_weight_access_and_writeback` context, so MSE
processes every per-expert quantizer (active and dead) and remains
FSDP-safe.
Without this, the export-time fallback in `_export_fused_experts` derived
separate gate/up amaxes from each half of the fused weight, breaking the
gate==up `weight_scale_2` invariant on dead experts. End-to-end check on
Qwen3.5-122B-A10B with `nvfp4_experts_only_mse-fp8_cast_kv`:
- Before: 1/12288 (layer 38 expert 69) gate != up; 0 weights MSE-calibrated
- After: 0/12288 mismatches; 24576 weights MSE-calibrated; ~4.2 min
Signed-off-by: Chenjie Luo <[email protected]>
(force-pushed from 9203072 to ebce8b2)
```python
@torch.no_grad()
def _sync_grouped_weight_global_amax(model: nn.Module) -> int:
```
This function is also useful for other algorithms like AWQ and GPTQ. We need to sync the amax for fused layers before some algorithm begins. cc @sychen52 on the design of fused modules.
| ("q_proj", "k_proj", "v_proj"), | ||
| ("gate_proj", "up_proj"), # Llama/Qwen/Mistral | ||
| ("w1", "w3"), # Mixtral | ||
| ) |
We have existing fusion metadata in `modelopt/torch/export/quant_utils.preprocess_linear_fusion`; could we avoid adding new hard-coded tuples for fusion?
```python
# Step 1: max calibrate, bootstrap dead-expert weight quantizers,
# unify grouped NVFP4 global_amax so MSE sees a consistent FP8 grid.
max_calibrate(model, forward_loop, distributed_sync)
_bootstrap_uncalibrated_weight_quantizers(model)
```
I wonder why the dead-expert weight quantizers do not break the MAX calibration export path with the NVFP4 dynamic quantizer, i.e., does `max_calibrate` need this fix as well?
Does it make more sense to run `_bootstrap_uncalibrated_weight_quantizers` in `max_calibrate` (after weight_only_quantize/forward, before `promote_nvfp4_static_quantizers`)? That way other recipes (AWQ, GPTQ, local-hessian) get the fix as well.
```python
    n_groups += 1

if n_groups == 0 and any(_is_calibrated_nvfp4_static(m) for m in model.modules()):
    warnings.warn(
```
If I'm understanding correctly, this "no group matched" warning will fire on every model that doesn't happen to use those three tuples — including `_QuantFusedExperts` itself, whose gate vs. up halves are not in any group. Maybe include `_QuantFusedExperts` in the grouping pass?
What does this PR do?
Type of change: Bug fix
Two-part fix for transformers 5.x fused-experts containers (Qwen3-MoE / Qwen3.5-MoE / Mixtral / DeepSeek / Kimi-K2.x ...) where weight quantizers live in `nn.ModuleList`s (`gate_up_proj_weight_quantizers`, `down_proj_weight_quantizers`):

1. Per-expert weight iteration for calibration. Add `_QuantFusedExperts.iter_weights_for_calibration` that yields per-expert `(weight_slice, quantizer)` pairs for both projections. The base impl uses singular `*_weight_quantizer` and silently skips fused-experts modules, so weight-only calibration paths never reached per-expert quantizers.
2. `mse_calibrate` refactor.
   - Add `_bootstrap_uncalibrated_weight_quantizers` after `max_calibrate` to populate `_amax` on quantizers the forward pass didn't reach (dead MoE experts that received no calibration tokens). Runs the existing calibrator on the weight slice surfaced by `iter_weights_for_calibration`.
   - Replace the singular-only `weight_attr_names` discovery + `getattr`-by-name walk with an `iter_weights_for_calibration` walk done inside each parent module's `enable_weight_access_and_writeback` context, so MSE processes every per-expert quantizer (active and dead) and remains FSDP-safe.

Without this, the export-time fallback in `_export_fused_experts` derived separate gate/up amaxes from each half of the fused weight, breaking the gate==up `weight_scale_2` invariant on dead experts.

Also includes:

- `_sanitize_generation_config_for_save` in `unified_export_hf`: coerces `do_sample=True` when an upstream `generation_config.json` has `top_k`/`top_p` set, so newer transformers' strict validate doesn't block `save_pretrained`.
- Changes in `moe_utils.py`, `tensor_quantizer.py`, and `core_utils.py` to support the per-expert iteration and bootstrap path.
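For orientation, a minimal sketch of the shape of such an override; the fused container's parameter names and `[num_experts, ...]` layout are assumptions, not the actual `_QuantFusedExperts` code:

```python
def iter_weights_for_calibration(self):
    """Yield per-expert (weight_slice, weight_quantizer) pairs for both projections."""
    for expert_idx in range(len(self.gate_up_proj_weight_quantizers)):
        # gate_up_proj / down_proj are assumed to be stacked [num_experts, ...] parameters.
        yield self.gate_up_proj[expert_idx], self.gate_up_proj_weight_quantizers[expert_idx]
        yield self.down_proj[expert_idx], self.down_proj_weight_quantizers[expert_idx]
```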
Usage

Testing

Original validation — Qwen3.5-122B-A10B with `nvfp4_experts_only_mse-fp8_cast_kv`:

- Before: 1/12288 (layer 38 expert 69) gate != up; 0 weights MSE-calibrated.
- After: 0/12288 mismatches; 24576 weights MSE-calibrated; ~4.2 min.

End-to-end pipeline validation — Qwen3.5-35B-A3B (40 layers × 256 experts × 2 projections = 20,480 per-expert weight quantizers), TRT-LLM 1.3.0rc13 + transformers 5.6 docker, single B200:

- Per-expert `_amax` comparison between the two quantization paths: n=20480 exact=20480 diff=0 max_rel=0. With 8/256 experts routed per token and 4 calib samples, almost all experts are "dead" in Path A. Bootstrap fills them from `max(|weight|)`, MSE searches deterministically from there → identical to Path B which bootstraps everything.
- Exported `generation_config.json` has `do_sample: true` (upstream had `top_k=20` + `top_p=0.95`, which would have failed strict validate).
- "Born in north-east France, Soyer trained as a" → " tailor. Demonstrating his craft at a young age, at 20 he moved to Paris at the requests of the noble people of Picardy." (coherent grammar; factually wrong as expected with 4-sample calib, but no NaN/Inf in logits, no scale-mismatch crash). 92 GB GPU memory used.

Before your PR is "Ready for review"
- Make sure you read and follow Contributor guidelines and your commits are signed (`git commit -s -S`).
- Make sure you read and follow the Security Best Practices (e.g. avoiding hardcoded `trust_remote_code=True`, `torch.load(..., weights_only=False)`, `pickle`, etc.).
- CONTRIBUTING.md: N/A

Additional Information
Follow-up to PR #1407 (MSE+FP8-cast-KV recipes). The recipe YAML files landed there; this PR fixes the calibration codepath so the MSE recipes actually exercise per-expert weight quantizers in fused-experts MoE containers.
Summary by CodeRabbit
Bug Fixes
New Features
Tests