[Quantization] MSE-calibrate every per-expert weight in fused-experts MoE #1421
cjluo-nv wants to merge 2 commits into
Conversation
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review.
📝 Walkthrough
Adds NVFP4-static grouped weight-quantizer support and synchronization, bootstraps missing per-weight amax, ensures per-expert fused-expert calibration coverage, adjusts fused-expert export amax slicing for static block quant, and sanitizes HF generation config before save.
Changes: NVFP4-static grouped quantizer support with MoE improvements
🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks: ✅ Passed checks (6 passed)
(force-pushed from 360b53e to 8e21516)
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@modelopt/torch/export/unified_export_hf.py`:
- Around line 1138-1160: The function _sanitize_generation_config_for_save
currently mutates model.generation_config.do_sample in-place and never restores
it; change the flow so the original value is preserved and restored after export
(e.g., capture original_do_sample = getattr(gc, "do_sample", None) before
setting gc.do_sample = True and restore gc.do_sample = original_do_sample after
the save operation), and apply the same pattern to the other affected block
around the code referenced at 1262-1270; locate usages of
_sanitize_generation_config_for_save (and the other block) surrounding the
save_pretrained/export call and ensure restoration occurs even on exceptions
(use try/finally or a context manager).
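For reference, a minimal sketch of the capture-and-restore pattern this asks for; `_temporarily_force_do_sample` is a hypothetical helper name and the call site below is illustrative, not the file's actual code:

```python
from contextlib import contextmanager


@contextmanager
def _temporarily_force_do_sample(model):
    """Force do_sample=True only for the duration of the save, then restore it."""
    gc = getattr(model, "generation_config", None)
    needs_patch = gc is not None and (
        getattr(gc, "top_k", None) is not None or getattr(gc, "top_p", None) is not None
    )
    original = getattr(gc, "do_sample", None) if needs_patch else None
    if needs_patch:
        gc.do_sample = True
    try:
        yield
    finally:
        # Restore even if save_pretrained raises, so nothing leaks to the caller.
        if needs_patch:
            gc.do_sample = original


# Illustrative call site:
# with _temporarily_force_do_sample(model):
#     model.save_pretrained(export_dir, ...)
```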
In `@modelopt/torch/quantization/model_calib.py`:
- Around line 121-162: The bootstrap loop in
_bootstrap_uncalibrated_weight_quantizers must run weight reads inside
enable_weight_access_and_writeback() so FSDP/HF-TP/offload sharded modules
perform proper local access/writeback instead of triggering an access failure
swallowed by the blanket except; wrap the per-module calibration work (the call
to module.iter_weights_for_calibration() and the q(weight) calibration call
inside the loop) with with enable_weight_access_and_writeback(module):, remove
or narrow the broad try/except that currently hides access errors so genuine
access failures surface, and keep the rest of the logic (q.disable_quant(),
q.enable_calib(), q(weight), q.load_calib_amax(), q.enable_quant(),
q.disable_calib(), q._calibrator.reset()) unchanged.
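A minimal sketch of the loop shape being asked for, using only the calls named above; the `enable_weight_access_and_writeback` signature, its import location, and the `_amax` guard are assumptions about the repo's internals:

```python
import torch


@torch.no_grad()
def _bootstrap_uncalibrated_weight_quantizers(model):
    """Sketch: calibrate weight quantizers that the forward pass never reached."""
    n_bootstrapped = 0
    for module in model.modules():
        if not hasattr(module, "iter_weights_for_calibration"):
            continue
        # Weight reads happen inside the context so FSDP / HF-TP / offloaded shards
        # are gathered and written back correctly instead of failing silently.
        with enable_weight_access_and_writeback(module):  # assumed utility, signature may differ
            for weight, q in module.iter_weights_for_calibration():
                if q is None or getattr(q, "_amax", None) is not None:
                    continue  # already calibrated by the forward pass
                q.disable_quant()
                q.enable_calib()
                q(weight)  # feed the weight slice to the existing calibrator
                q.load_calib_amax()
                q.enable_quant()
                q.disable_calib()
                q._calibrator.reset()
                n_bootstrapped += 1
    return n_bootstrapped
```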
📒 Files selected for processing (6)
- modelopt/torch/export/moe_utils.py
- modelopt/torch/export/unified_export_hf.py
- modelopt/torch/quantization/model_calib.py
- modelopt/torch/quantization/nn/modules/tensor_quantizer.py
- modelopt/torch/quantization/plugins/huggingface.py
- modelopt/torch/quantization/utils/core_utils.py
(force-pushed from 8e21516 to adee8b5)
Signed-off-by: Chenjie Luo <[email protected]>
(force-pushed from adee8b5 to 12e3c24)
♻️ Duplicate comments (1)
modelopt/torch/export/unified_export_hf.py (1)
1137-1148: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win: Scope generation-config mutation and patch cleanup to avoid state leakage.
`_sanitize_generation_config_for_save` mutates `model.generation_config.do_sample` permanently, and it currently runs outside the `try/finally` that unpatches transformers internals. If sanitize raises, `_unpatch_revert_weight_conversion` is skipped; if it succeeds, `do_sample` still leaks into later calls on the same model object.
Proposed fix
```diff
-def _sanitize_generation_config_for_save(model: torch.nn.Module) -> None:
+def _sanitize_generation_config_for_save(model: torch.nn.Module) -> Callable[[], None]:
@@
-    gc = getattr(model, "generation_config", None)
-    if gc is None:
-        return
+    gc = getattr(model, "generation_config", None)
+    if gc is None:
+        return lambda: None
     if getattr(gc, "top_k", None) is not None or getattr(gc, "top_p", None) is not None:
-        gc.do_sample = True
+        original_do_sample = getattr(gc, "do_sample", None)
+        gc.do_sample = True
+        return lambda: setattr(gc, "do_sample", original_do_sample)
+    return lambda: None
@@
-    _sanitize_generation_config_for_save(model)
-
-    try:
+    restore_generation_config = lambda: None
+    try:
+        restore_generation_config = _sanitize_generation_config_for_save(model)
         model.save_pretrained(
             export_dir,
             state_dict={**post_state_dict, **(extra_state_dict or {})},
             save_modelopt_state=save_modelopt_state,
             max_shard_size=max_shard_size,
         )
     finally:
+        restore_generation_config()
         _unpatch_revert_weight_conversion(_patches)
```
Also applies to: 1242-1254
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@modelopt/torch/export/unified_export_hf.py` around lines 1137 - 1148, The helper _sanitize_generation_config_for_save currently mutates model.generation_config.do_sample permanently and runs outside the unpatch/cleanup flow, causing state leakage and skipped cleanup on exceptions; change it to only temporarily set do_sample: inside the same try/finally that calls _unpatch_revert_weight_conversion, capture the original value (orig = getattr(gc, "do_sample", None)), set gc.do_sample = True only when sampling attrs exist, and always restore gc.do_sample = orig in the finally block so the model's generation_config is unchanged after save. Apply the identical temporary-mutation+restore pattern to the other occurrence referenced around lines 1242-1254.
📒 Files selected for processing (6)
- modelopt/torch/export/moe_utils.py
- modelopt/torch/export/unified_export_hf.py
- modelopt/torch/quantization/model_calib.py
- modelopt/torch/quantization/nn/modules/tensor_quantizer.py
- modelopt/torch/quantization/plugins/huggingface.py
- modelopt/torch/quantization/utils/core_utils.py
✅ Files skipped from review due to trivial changes (1)
- modelopt/torch/quantization/plugins/huggingface.py
🚧 Files skipped from review as they are similar to previous changes (4)
- modelopt/torch/quantization/nn/modules/tensor_quantizer.py
- modelopt/torch/quantization/utils/core_utils.py
- modelopt/torch/export/moe_utils.py
- modelopt/torch/quantization/model_calib.py
meenchen
left a comment
Bot review — DM the bot to share feedback.
Follow-up bug fix to #1407 for dead-expert MSE calibration in fused-experts MoE. The design is not a new subsystem — the new sync_grouped_weight_global_amax composes on the existing preprocess_linear_fusion from quant_utils.py, and iter_weights_for_calibration extends an already-existing base-class hook. The fix targets a real correctness bug (gate/up weight_scale_2 divergence on experts that received no calibration tokens), backed by end-to-end Qwen3.5-122B numbers (0/12288 vs 1/12288 mismatches).
Reasons I did not approve:
- No unit tests for non-trivial new behavior. The PR body explicitly opts out ("Did you write any new necessary tests?: ❌"). `tests/unit/torch/quantization/plugins/test_fused_experts.py` already has good infrastructure (`_SyntheticFusedExperts`, `TestFusedExpertsCalibration`) that could exercise:
  - `_QuantFusedExperts.iter_weights_for_calibration` yielding `num_experts * 2` pairs (a sketch follows this comment).
  - `_bootstrap_uncalibrated_weight_quantizers` populating `_amax` on never-routed experts (simulate by only routing to a subset in `forward_loop`).
  - The new `_export_fused_experts` per-block amax reshape path for NVFP4 (the existing `test_uncalibrated_expert_gate_up_share_amax` covers the per-tensor fallback but not the new `amax.numel() % fused_total == 0` reshape branch).
  - `sync_grouped_weight_global_amax` unifying `global_amax` across a Q/K/V sibling group.
- `_sanitize_generation_config_for_save` silently mutates user state. It flips `do_sample=False → True` on the model's live `generation_config` whenever `top_k`/`top_p` is set, and this mutation is persisted to the exported `generation_config.json`. In practice this is a semantic no-op for greedy decoding, but the change is invisible to the caller. Consider (a) emitting a warning naming the fields being rewritten, or (b) doing the mutation on a copy that's scoped to the `save_pretrained` call only.
- `_GROUPED_WEIGHT_QUANTIZER_PATTERNS` is an architecture-name heuristic. The hardcoded tuple covers Llama/Qwen/Mistral/Mixtral but will silently miss any model using different attribute names (`wqkv`, fused `qkv_proj`, DeepSeek naming variants, etc.) — grouped unification would just not run and export would fall back to per-module amax. Worth either documenting this as a known limitation in the docstring or logging when a model has NVFP4-static quantizers but produces zero groups.

Minor: `sync_grouped_weight_global_amax` is added to `__all__` as public API, but given the hardcoded sibling-name heuristic it's really an internal helper; consider dropping it from `__all__` or renaming to `_sync_...`.
No licensing changes. Size is fine (+195/-71 across 6 files). Happy for a human reviewer with MoE export context to make the final call on the testing gap and the generation_config mutation policy.
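For the first test idea above, a hedged sketch of what it could look like; the fixture name and the `num_experts` attribute are assumptions rather than the suite's actual fixtures:

```python
def test_iter_weights_yields_all_expert_pairs(quant_fused_experts_module):
    # One (weight_slice, quantizer) pair per projection (gate_up and down) per expert.
    pairs = list(quant_fused_experts_module.iter_weights_for_calibration())
    assert len(pairs) == quant_fused_experts_module.num_experts * 2
    for weight_slice, quantizer in pairs:
        assert weight_slice is not None
        assert quantizer is not None
```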
Codecov Report
❌ Patch coverage is …
Additional details and impacted files
```
@@            Coverage Diff             @@
##             main    #1421      +/-   ##
==========================================
- Coverage   77.30%   72.76%    -4.54%
==========================================
  Files         478      478
  Lines       51404    52274     +870
==========================================
- Hits        39737    38038    -1699
- Misses      11667    14236    +2569
```
Flags with carried forward coverage won't be shown. View full report in Codecov by Sentry.
(force-pushed from 12e3c24 to 0869a90)
Actionable comments posted: 1
🧹 Nitpick comments (1)
tests/unit/torch/quantization/plugins/test_fused_experts.py (1)
636-729: ⚡ Quick win: Exercise the public MSE path in this dead-expert regression.
This proves `_bootstrap_uncalibrated_weight_quantizers()` works, but it won't fail if `mse_calibrate()` stops calling the helper or calls it in the wrong order. Since the bug is specifically in the MSE calibration flow, I'd add one end-to-end assertion through `algorithm="mse"` or `mse_calibrate()` here as well.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/unit/torch/quantization/plugins/test_fused_experts.py` around lines 636-729: add an end-to-end assertion that runs the MSE calibration path (either by calling mtq.quantize with quant_cfg["algorithm"]="mse" or invoking mse_calibrate(...) after the existing max-path partial_forward) and then assert that dead experts' weight quantizers (experts.gate_up_proj_weight_quantizers[idx] and experts.down_proj_weight_quantizers[idx]) have populated non-zero _amax and that their _amax matches per-row max(|weight|) as done for the max path; this ensures mse_calibrate still calls the same bootstrap helper (_bootstrap_uncalibrated_weight_quantizers) or otherwise populates the quantizers correctly.
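A hedged sketch of such an end-to-end check; the fixtures are assumptions and `NVFP4_DEFAULT_CFG` is a stand-in for whatever experts-only MSE recipe the suite actually uses:

```python
import copy

import modelopt.torch.quantization as mtq


def test_dead_experts_calibrated_via_mse(tiny_fused_moe_model, calib_batches):
    cfg = copy.deepcopy(mtq.NVFP4_DEFAULT_CFG)  # placeholder recipe
    cfg["algorithm"] = "mse"

    def forward_loop(model):
        for batch in calib_batches:  # routes tokens to only a subset of experts
            model(batch)

    model = mtq.quantize(tiny_fused_moe_model, cfg, forward_loop)

    for experts in (m for m in model.modules() if hasattr(m, "gate_up_proj_weight_quantizers")):
        quantizers = list(experts.gate_up_proj_weight_quantizers) + list(
            experts.down_proj_weight_quantizers
        )
        for q in quantizers:
            # Dead experts must also end up with a populated amax after MSE calibration.
            assert getattr(q, "_amax", None) is not None
```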
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@modelopt/torch/quantization/model_calib.py`:
- Around line 167-185: The warning in _sync_grouped_weight_global_amax currently
fires whenever any NVFP4-static TensorQuantizer exists but no groups matched;
change this to first check whether the model contains any potential sibling
candidates before warning: scan model.modules() for modules whose names or
attributes match the grouping heuristics (i.e. the same criteria used by
_GROUPED_WEIGHT_QUANTIZER_PATTERNS or the sibling attribute tests used when
building groups) and only if such candidate siblings exist and n_groups == 0 and
has_nvfp4_static is true, emit the warnings.warn; update the logic around
n_groups, has_nvfp4_static, and the invocation of
_GROUPED_WEIGHT_QUANTIZER_PATTERNS in _sync_grouped_weight_global_amax to
reflect this guarded check.
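A minimal sketch of the guarded check described above; `_has_sibling_candidates` is a hypothetical helper, and the patterns argument would be the PR's `_GROUPED_WEIGHT_QUANTIZER_PATTERNS` tuple:

```python
def _has_sibling_candidates(model, patterns):
    """True if any module has a full set of sibling children from one pattern group."""
    for module in model.modules():
        child_names = {name for name, _ in module.named_children()}
        if any(all(sibling in child_names for sibling in group) for group in patterns):
            return True
    return False


# Inside _sync_grouped_weight_global_amax, only warn when a grouping was actually expected:
# if n_groups == 0 and has_nvfp4_static and _has_sibling_candidates(model, _GROUPED_WEIGHT_QUANTIZER_PATTERNS):
#     warnings.warn("NVFP4-static weight quantizers found but no sibling groups matched ...")
```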
📒 Files selected for processing (6)
- modelopt/torch/export/moe_utils.py
- modelopt/torch/export/unified_export_hf.py
- modelopt/torch/quantization/model_calib.py
- modelopt/torch/quantization/nn/modules/tensor_quantizer.py
- modelopt/torch/quantization/plugins/huggingface.py
- tests/unit/torch/quantization/plugins/test_fused_experts.py
🚧 Files skipped from review as they are similar to previous changes (4)
- modelopt/torch/quantization/nn/modules/tensor_quantizer.py
- modelopt/torch/quantization/plugins/huggingface.py
- modelopt/torch/export/unified_export_hf.py
- modelopt/torch/export/moe_utils.py
meenchen
left a comment
Bot review — DM the bot to share feedback.
Re-review of #1421 (dead-expert MSE calibration for fused-experts MoE).
Previous critical comments — addressed:
- ✅ `_bootstrap_uncalibrated_weight_quantizers` now runs inside `enable_weight_access_and_writeback(module, model, name_to_module)` and the blanket try/except was removed (addresses CodeRabbit's FSDP-safety concern).
- ✅ Unit tests added in `tests/unit/torch/quantization/plugins/test_fused_experts.py`: `test_bootstrap_populates_dead_expert_quantizers` exercises the dead-expert path end-to-end including the `max(|weight|)` invariant, and `test_per_block_amax_reshape_for_fused_export` guards the new NVFP4 per-block amax reshape branch in `_export_fused_experts`. Addresses meenchen's main nudge reason.
- ✅ The new helpers are underscore-prefixed and not exported in `__all__` (addresses the "public API for hardcoded heuristic" concern).
- ✅ The `_GROUPED_WEIGHT_QUANTIZER_PATTERNS` limitation is documented in the `_sync_grouped_weight_global_amax` docstring and a warning is emitted when matching fails despite NVFP4-static quantizers existing.
Still unresolved (minor, non-blocking individually):
- ❌ `_sanitize_generation_config_for_save` still mutates `model.generation_config.do_sample` in-place with no restore. If `save_pretrained` raises, `_unpatch_revert_weight_conversion` still runs (it's in `finally`) but `do_sample` is never reverted — callers reusing the model object silently get `do_sample=True`. A try/finally-scoped restore (captured-original-value pattern) was suggested twice by CodeRabbit and not applied. Consider wrapping in the existing `try/finally` around `model.save_pretrained`.
- ❌ The `_sync_grouped_weight_global_amax` warning is still gated only on `has_nvfp4_static`, so models with only standalone NVFP4 linears (no Q/K/V or gate/up siblings at all) will emit a misleading warning every time `mse_calibrate`/`local_hessian_calibrate` runs. A guard on "candidate siblings exist" would make it actionable.
- ❌ No MSE end-to-end test — the bootstrap path is only exercised via the direct helper, not via `mse_calibrate(..., algorithm="mse")`. If a future refactor stops calling the helper from `mse_calibrate`, the test won't catch it.
Design check: not a new subsystem. _sync_grouped_weight_global_amax composes on preprocess_linear_fusion from quant_utils.py; iter_weights_for_calibration extends an existing base-class hook; new helpers are small and localized.
No licensing changes. Size OK (+423/-71, 7 files). Core correctness fix looks right and is backed by end-to-end Qwen3.5-122B/35B validation. Flagging for human sign-off on the three minor items above — all three are stylistic/defensive rather than correctness bugs, but were called out explicitly by the previous reviewers and not acted on.
```python
@torch.no_grad()
def _bootstrap_uncalibrated_weight_quantizers(model: nn.Module) -> int:
    """Populate ``_amax`` from weights for quantizers the forward pass didn't reach.
```
@realAsma After PTQ, if there are uncalibrated amaxes I thought we intentionally want to throw an error/warning and tell the user to increase the number of samples/sequence length. This changes the behavior to automatically fill in the amaxes without any logging/warning, which seems problematic.
This docstring is also misleading -- it sounds like you are loading amax from the weights, but it should say that you are recalibrating weights with missing `_amax`.
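If the maintainers want to keep the old "tell the user" behavior visible, one option (a sketch of the call site, not the PR's code) is to warn whenever the bootstrap fills anything:

```python
import warnings

# Assuming the helper returns the number of quantizers it bootstrapped, as in this PR.
n_bootstrapped = _bootstrap_uncalibrated_weight_quantizers(model)
if n_bootstrapped:
    warnings.warn(
        f"{n_bootstrapped} weight quantizers received no calibration data from the forward "
        "loop; their amax was bootstrapped from the weights. Consider increasing the number "
        "of calibration samples or the sequence length."
    )
```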
```python
if n_groups == 0:
    # Surface architectures whose Q/K/V or gate/up siblings don't match the
    # pattern list — without this, sibling-sync is a silent no-op.
    has_nvfp4_static = any(
```
Why not reuse the `is_nvfp4_static` method from `modelopt/torch/quantization/nn/modules/tensor_quantizer.py`?
```python
)

# Identify weight quantizers by checking if they have corresponding weight parameters
# Step 3: calibrate weight quantizers via iter_weights_for_calibration.
```
This comment is too long and specific to fused experts. Remove the mention of fused experts to make it more generic, and in general reduce the length of AI comments.
```python
weight_quantizer.enable_quant()
weight_quantizer.disable_calib()

# Synchronize ALL CUDA devices before resetting to ensure all async operations complete
```
why is this comment removed?
The old comment described the previous weight_attr_names lookup; with the refactor to parent_module.iter_weights_for_calibration() the call site is self-describing. Trimmed the replacement block to a single line per your other comment.
(force-pushed from 0869a90 to 9203072)
… MoE
Two-part fix for transformers 5.x fused-experts containers (Qwen3-MoE /
Qwen3.5-MoE / Mixtral / DeepSeek / Kimi-K2.x ...) where weight quantizers
live in `nn.ModuleList`s (`gate_up_proj_weight_quantizers`,
`down_proj_weight_quantizers`):
1. Add `_QuantFusedExperts.iter_weights_for_calibration` that yields
per-expert (weight_slice, quantizer) pairs for both projections. The base
impl uses singular `*_weight_quantizer` and silently skips fused-experts
modules, so weight-only calibration paths never reach per-expert
quantizers.
2. Refactor `mse_calibrate`:
- Add `_bootstrap_uncalibrated_weight_quantizers` after `max_calibrate`
to populate `_amax` on quantizers the forward pass didn't reach (dead
MoE experts that received no calibration tokens). Runs the existing
calibrator on the weight slice surfaced by
`iter_weights_for_calibration`.
- Replace the singular-only `weight_attr_names` discovery + `getattr`-by-
name walk with an `iter_weights_for_calibration` walk done inside each
parent module's `enable_weight_access_and_writeback` context, so MSE
processes every per-expert quantizer (active and dead) and remains
FSDP-safe.
Without this, the export-time fallback in `_export_fused_experts` derived
separate gate/up amaxes from each half of the fused weight, breaking the
gate==up `weight_scale_2` invariant on dead experts. End-to-end check on
Qwen3.5-122B-A10B with `nvfp4_experts_only_mse-fp8_cast_kv`:
- Before: 1/12288 (layer 38 expert 69) gate != up; 0 weights MSE-calibrated
- After: 0/12288 mismatches; 24576 weights MSE-calibrated; ~4.2 min
Signed-off-by: Chenjie Luo <[email protected]>
(force-pushed from 9203072 to ebce8b2)
```python
@torch.no_grad()
def _sync_grouped_weight_global_amax(model: nn.Module) -> int:
```
This function is also useful for other algorithms like AWQ and GPTQ. We need to sync the amax for fused layers before some algorithm begins. cc @sychen52 on the design of fused modules.
| ("q_proj", "k_proj", "v_proj"), | ||
| ("gate_proj", "up_proj"), # Llama/Qwen/Mistral | ||
| ("w1", "w3"), # Mixtral | ||
| ) |
We have existing fusion metadata in `modelopt/torch/export/quant_utils.preprocess_linear_fusion`; could we avoid adding new hard-coded tuples for fusion?
```python
# Step 1: max calibrate, bootstrap dead-expert weight quantizers,
# unify grouped NVFP4 global_amax so MSE sees a consistent FP8 grid.
max_calibrate(model, forward_loop, distributed_sync)
_bootstrap_uncalibrated_weight_quantizers(model)
```
I wonder why the dead-expert weight quantizers do not break the MAX calibration export path with the NVFP4 dynamic quantizer, i.e., does `max_calibrate` need this fix as well?
Does it make more sense to run `_bootstrap_uncalibrated_weight_quantizers` in `max_calibrate` (after weight_only_quantize/forward, before `promote_nvfp4_static_quantizers`)? That way other recipes (AWQ, GPTQ, local-hessian) get the fix as well.
```python
    n_groups += 1

if n_groups == 0 and any(_is_calibrated_nvfp4_static(m) for m in model.modules()):
    warnings.warn(
```
If I'm understanding correctly, this "no group matched" warning will fire on every model that doesn't happen to use those three tuples — including `_QuantFusedExperts` itself, whose gate vs. up halves are not in any group. Maybe include `_QuantFusedExperts` in the grouping pass?
What does this PR do?
Type of change: Bug fix
Two-part fix for transformers 5.x fused-experts containers (Qwen3-MoE / Qwen3.5-MoE / Mixtral / DeepSeek / Kimi-K2.x ...) where weight quantizers live in `nn.ModuleList`s (`gate_up_proj_weight_quantizers`, `down_proj_weight_quantizers`):

1. Per-expert weight iteration for calibration. Add `_QuantFusedExperts.iter_weights_for_calibration` that yields per-expert `(weight_slice, quantizer)` pairs for both projections. The base impl uses singular `*_weight_quantizer` and silently skips fused-experts modules, so weight-only calibration paths never reached per-expert quantizers.
2. `mse_calibrate` refactor.
   - Add `_bootstrap_uncalibrated_weight_quantizers` after `max_calibrate` to populate `_amax` on quantizers the forward pass didn't reach (dead MoE experts that received no calibration tokens). Runs the existing calibrator on the weight slice surfaced by `iter_weights_for_calibration`.
   - Replace the singular-only `weight_attr_names` discovery + `getattr`-by-name walk with an `iter_weights_for_calibration` walk done inside each parent module's `enable_weight_access_and_writeback` context, so MSE processes every per-expert quantizer (active and dead) and remains FSDP-safe.

Without this, the export-time fallback in `_export_fused_experts` derived separate gate/up amaxes from each half of the fused weight, breaking the gate==up `weight_scale_2` invariant on dead experts.

Also includes:

- `_sanitize_generation_config_for_save` in `unified_export_hf`: coerces `do_sample=True` when an upstream `generation_config.json` has `top_k`/`top_p` set, so newer transformers' strict validate doesn't block `save_pretrained`.
- Changes in `moe_utils.py`, `tensor_quantizer.py`, and `core_utils.py` to support the per-expert iteration and bootstrap path.
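For orientation, a minimal sketch of the shape of such an override; the fused container's parameter names and `[num_experts, ...]` layout are assumptions, not the actual `_QuantFusedExperts` code:

```python
def iter_weights_for_calibration(self):
    """Yield per-expert (weight_slice, weight_quantizer) pairs for both projections."""
    for expert_idx in range(len(self.gate_up_proj_weight_quantizers)):
        # gate_up_proj / down_proj are assumed to be stacked [num_experts, ...] parameters.
        yield self.gate_up_proj[expert_idx], self.gate_up_proj_weight_quantizers[expert_idx]
        yield self.down_proj[expert_idx], self.down_proj_weight_quantizers[expert_idx]
```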
Usage

Testing

Original validation — Qwen3.5-122B-A10B with `nvfp4_experts_only_mse-fp8_cast_kv`:

- Before: 1/12288 (layer 38 expert 69) gate != up; 0 weights MSE-calibrated.
- After: 0/12288 mismatches; 24576 weights MSE-calibrated; ~4.2 min.

End-to-end pipeline validation — Qwen3.5-35B-A3B (40 layers × 256 experts × 2 projections = 20,480 per-expert weight quantizers), TRT-LLM 1.3.0rc13 + transformers 5.6 docker, single B200:

- Per-expert `_amax` comparison between the two quantization paths: n=20480 exact=20480 diff=0 max_rel=0. With 8/256 experts routed per token and 4 calib samples, almost all experts are "dead" in Path A. Bootstrap fills them from `max(|weight|)`, MSE searches deterministically from there → identical to Path B which bootstraps everything.
- Exported `generation_config.json` has `do_sample: true` (upstream had `top_k=20` + `top_p=0.95`, which would have failed strict validate).
- "Born in north-east France, Soyer trained as a" → " tailor. Demonstrating his craft at a young age, at 20 he moved to Paris at the requests of the noble people of Picardy." (coherent grammar; factually wrong as expected with 4-sample calib, but no NaN/Inf in logits, no scale-mismatch crash). 92 GB GPU memory used.

Before your PR is "Ready for review"
- Make sure you read and follow Contributor guidelines and your commits are signed (`git commit -s -S`).
- Make sure you read and follow the Security Best Practices (e.g. avoiding hardcoded `trust_remote_code=True`, `torch.load(..., weights_only=False)`, `pickle`, etc.).
- CONTRIBUTING.md: N/A

Additional Information
Follow-up to PR #1407 (MSE+FP8-cast-KV recipes). The recipe YAML files landed there; this PR fixes the calibration codepath so the MSE recipes actually exercise per-expert weight quantizers in fused-experts MoE containers.
Summary by CodeRabbit
Bug Fixes
New Features
Tests