Support Mixed precision & Static MSE PTQ in MCore export; Nemotron Super v3 NVFP4 recipe #1363
jenchen13 wants to merge 14 commits into
Conversation
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior in the settings and use the review commands to manage reviews.
📝 Walkthrough
This PR adds FP8 scale sweep stride control to calibration workflows, introduces three mixed-precision NVFP4 quantization recipes for Nemotron-3-Super-120B with different calibration methods (MSE, MSE with stride-4 sweep, max-based), refactors MoE calibration completeness checks to recursively traverse SequentialQuantizer leaves, and overhauls HuggingFace export to collect and apply per-layer quantization metadata.
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ 5 passed | ❌ 1 failed (1 warning)
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@modelopt_recipes/models/Nemotron-3-Super-120B-A12B/super-nvfp4.yaml`:
- Around line 29-32: The metadata claims attention o_proj is FP8 per-tensor but
quant_cfg lacks any override for attention.o_proj; either update the description
or add the missing quantizer mapping. Fix by adding an explicit quant_cfg
override for the attention o_proj parameter name (e.g., attention.o_proj /
attention.output_projection / whatever key is used in your model mapping) to use
the FP8 per-tensor quantizer used elsewhere, or remove the attention o_proj
mention from the metadata so it matches the existing quant_cfg; ensure you
reference the exact layer key used in quant_cfg to keep mapping consistent with
the model.
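For concreteness, an override in the recipe's quantizer_name/cfg shape might look like the following sketch, written here as a Python dict (the wildcard key and FP8 encoding are assumptions modeled on the recipe's other FP8 entries, not verified against the model mapping):

```python
# Hypothetical override for the first fix option above: FP8 per-tensor for
# attention o_proj. The exact layer key is model-dependent (an assumption).
o_proj_fp8_override = {
    "quantizer_name": "*attention.o_proj*weight_quantizer",
    "enable": True,
    "cfg": {"num_bits": "e4m3"},  # FP8 per-tensor: no block_sizes entry
}
```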
- Around line 94-115: The entries using broad quantizer_name patterns
('*mixer.fc1_latent_proj*weight_quantizer',
'*mixer.fc1_latent_proj*input_quantizer',
'*mixer.fc2_latent_proj*weight_quantizer',
'*mixer.fc2_latent_proj*input_quantizer') are enabling FP8 for every latent
projection instead of only layers 1, 3, and 5; update these quantizer_name
values to target only the specific layer instances (e.g. include the layer
index/identifier for layers 1, 3, and 5 in the wildcard or use a regex/explicit
list) so only those mixer.fc1_latent_proj and mixer.fc2_latent_proj quantizers
are set to num_bits: e4m3, and leave all other latent projection quantizers at
BF16 (or remove the generic entries).
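A sketch of one way to scope the patterns to specific layers, again in dict form (the `layers.{i}` path segment is an assumption based on the `backbone.layers.*` names used elsewhere in this recipe):

```python
# Hypothetical layer-scoped entries: enable FP8 only for layers 1, 3, and 5.
latent_proj_fp8_overrides = [
    {
        "quantizer_name": f"*layers.{i}.mixer.fc{k}_latent_proj*{q}",
        "enable": True,
        "cfg": {"num_bits": "e4m3"},
    }
    for i in (1, 3, 5)
    for k in (1, 2)
    for q in ("weight_quantizer", "input_quantizer")
]
```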
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 60ed9c0b-efef-4967-b321-7270a8853455
📒 Files selected for processing (1)
modelopt_recipes/models/Nemotron-3-Super-120B-A12B/super-nvfp4.yaml
Codecov Report
❌ Patch coverage is …
Additional details and impacted files:
@@            Coverage Diff             @@
## main #1363 +/- ##
===========================================
- Coverage 76.93% 64.10% -12.83%
===========================================
Files 471 477 +6
Lines 50404 55095 +4691
===========================================
- Hits 38776 35320 -3456
- Misses 11628 19775 +8147
Flags with carried forward coverage won't be shown. ☔ View full report in Codecov by Sentry.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
modelopt/torch/quantization/model_calib.py (1)
389-412: ⚠️ Potential issue | 🟡 Minor — Custom FP8-sweep backends still ignore the new stride setting.
When a registered `backend_factory` is used, this branch still calls the old 3-argument factory signature, so `fp8_scale_sweep_stride` only takes effect on the built-in `NVFP4MSECalibrator` path below. That makes the new config silently a no-op for registry-backed sweep calibrators. Please extend the factory contract or reject a non-default stride explicitly in the backend path.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/torch/quantization/model_calib.py` around lines 389 - 412, The registered backend factory path (lookup via _FP8_SWEEP_CALIBRATOR_REGISTRY and assigned to backend_factory) currently calls the factory with the old 3-argument signature and thus ignores fp8_scale_sweep_stride; update the branch that sets module._calibrator via backend_factory to either (A) call the factory with the new argument (pass fp8_scale_sweep_stride) and update the factory contract accordingly, or (B) explicitly detect a non-default fp8_scale_sweep_stride and raise/rollback with a clear error so users know registry-backed calibrators do not support stride; ensure the call still passes initial_amax, module._calibrator._axis, partial(_mse_quant_func, quantizer=module) and include fp8_scale_sweep_stride when choosing option A, mirroring how NVFP4MSECalibrator is constructed.
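A minimal sketch of option A, using the names from the comment above (the surrounding control flow is an assumption about model_calib.py, not verified source):

```python
# Hypothetical: extend the factory contract so registry-backed sweep
# calibrators receive the stride too, mirroring how NVFP4MSECalibrator
# is constructed in the built-in path.
module._calibrator = backend_factory(
    initial_amax,
    module._calibrator._axis,
    partial(_mse_quant_func, quantizer=module),
    fp8_scale_sweep_stride=fp8_scale_sweep_stride,  # new keyword (option A)
)
# Option B instead: keep the 3-argument contract but fail fast.
# if fp8_scale_sweep_stride != 1:
#     raise NotImplementedError("registry sweep backends ignore the stride")
```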
♻️ Duplicate comments (2)
modelopt_recipes/models/Nemotron-3-Super-120B-A12B/super-nvfp4.yaml (1)
41-42: ⚠️ Potential issue | 🟡 Minor — The metadata description still overstates what this recipe quantizes.
These lines say attention `o_proj`, `fc1_latent_proj`, and `fc2_latent_proj` are FP8 per-tensor, but there are no matching overrides in `quant_cfg`, and the header comments say the latent MoE projections stay BF16. Please update the description so it matches the actual recipe.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt_recipes/models/Nemotron-3-Super-120B-A12B/super-nvfp4.yaml` around lines 41 - 42, The metadata description in the YAML (the description field) incorrectly claims that "attention o_proj", "fc1_latent_proj", and "fc2_latent_proj" are FP8 per-tensor while the recipe does not contain matching overrides in quant_cfg and header comments state latent MoE projections remain BF16/FP16; update the description text to accurately reflect the recipe by removing or changing the FP8 claims (e.g., state that latent MoE projections and those specific projections remain BF16/FP16 and only list the layers that the quant_cfg actually overrides as FP8 per-tensor).
modelopt_recipes/models/Nemotron-3-Super-120B-A12B/super-nvfp4-fp8-sweep-stride4.yaml (1)
34-35: ⚠️ Potential issue | 🟡 Minor — The metadata description does not match the actual quantizer mapping.
These lines still claim attention `o_proj`, `fc1_latent_proj`, and `fc2_latent_proj` are FP8 per-tensor, but this recipe never enables those quantizers, and the header comments above say latent MoE stays BF16. Please align the description with the `quant_cfg` that follows.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt_recipes/models/Nemotron-3-Super-120B-A12B/super-nvfp4-fp8-sweep-stride4.yaml` around lines 34 - 35, The YAML description string claims that "attention o_proj, fc1_latent_proj, and fc2_latent_proj" are FP8 per-tensor, but the quant_cfg in this recipe does not enable quantizers for "attention.o_proj", "fc1_latent_proj", or "fc2_latent_proj" (and the header notes latent MoE stays BF16/FP16); fix this by either updating the description to state those projections remain BF16/FP16 (and remove the FP8 per-tensor claim) or modify quant_cfg to actually enable FP8 per-tensor quantizers for the keys "attention.o_proj", "fc1_latent_proj", and "fc2_latent_proj" so the comment matches the mapping.
🧹 Nitpick comments (1)
modelopt/torch/quantization/calib/mse.py (1)
202-206: Add a regression test for strided FP8 candidate generation.
The existing coverage only exercises the default 126-candidate path. This branch adds two new behaviors, subsampling and forced inclusion of the last candidate, so it should have a focused test for `fp8_scale_sweep_stride > 1` to lock down both the reduced candidate count and preservation of the max scale.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/torch/quantization/calib/mse.py` around lines 202 - 206, The strided FP8 candidate-generation branch guarded by fp8_scale_sweep_stride > 1 (in the block that subsamples fp8_values into candidates and appends the last value) is untested; add a focused regression test (e.g., test_fp8_scale_sweep_stride_preserves_last_candidate) that sets fp8_scale_sweep_stride > 1, calls the code path that produces fp8_values, and asserts that the resulting candidates length is reduced according to the stride and that the final element equals the original fp8_values[-1] (verifying forced inclusion of the max scale); ensure the test covers both subsampling and the append behavior so the branch is locked down.
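A focused test along those lines could look like this sketch. It is self-contained: the candidate construction below stands in for the 126 positive finite e4m3 values described above, whereas a real regression test should call the actual generation path in calib/mse.py.

```python
import torch

def test_fp8_scale_sweep_stride_preserves_last_candidate():
    # Stand-in for mse.py's candidate generation: the 126 positive finite
    # fp8_e4m3fn values, reinterpreted from their byte encodings 0x01..0x7E.
    fp8_values = torch.arange(1, 127, dtype=torch.uint8).view(torch.float8_e4m3fn).float()
    stride = 4
    candidates = fp8_values[::stride]
    if candidates[-1] != fp8_values[-1]:
        # Forced inclusion of the last (max) candidate, per the strided branch.
        candidates = torch.cat([candidates, fp8_values[-1:]])
    assert len(candidates) == 33  # 32 subsampled values plus the forced max
    assert candidates[-1].item() == 448.0  # e4m3 max scale survives subsampling
```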
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@modelopt_recipes/models/Nemotron-3-Super-120B-A12B/super-nvfp4.yaml`:
- Around line 16-28: Remove the unresolved Git merge conflict markers (<<<<<<<,
=======, >>>>>>>) and restore a single coherent comment block describing the
quant config; keep the more detailed version that lists both HF and
Megatron-Core names (the lines mentioning mixer.experts.<N>.{up,down}_proj,
mlp.experts.local_experts.<N>.linear_fc{1,2},
mixer.shared_experts.{up,down}_proj, and mlp.shared_experts.linear_fc{1,2}) or
merge its additional details into the shorter variant so the YAML comment is
valid and free of conflict markers.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 0d72ce13-180e-46ea-b4d8-8c6c140d22a7
📒 Files selected for processing (5)
modelopt/torch/quantization/calib/mse.py
modelopt/torch/quantization/config.py
modelopt/torch/quantization/model_calib.py
modelopt_recipes/models/Nemotron-3-Super-120B-A12B/super-nvfp4-fp8-sweep-stride4.yaml
modelopt_recipes/models/Nemotron-3-Super-120B-A12B/super-nvfp4.yaml
♻️ Duplicate comments (2)
modelopt_recipes/models/Nemotron-3-Super-120B-A12B/super-nvfp4.yaml (1)
35-36: ⚠️ Potential issue | 🟡 Minor — Metadata precision mapping is inconsistent with the actual recipe.
Line 35 says latent MoE `fc1_latent_proj`/`fc2_latent_proj` are FP8 per-tensor, but this file has no latent-MoE quantizer overrides and the header (Line 27) says they stay BF16. Please align the description with `quant_cfg` to avoid confusion.
Proposed patch
 metadata:
   recipe_type: ptq
-  description: Super NVFP4 mixed precision — sparse MoE experts NVFP4 (W4A4, group_size 16); shared experts, mamba in/out_proj, and Latent MOE fc1_latent_proj/fc2_latent_proj - FP8 per-tensor; FP8 KV cache; lm_head/MTP/SSM stay BF16/FP16. Weight-MSE calibration with FP8 scale sweep.
+  description: >-
+    Super NVFP4 mixed precision — sparse MoE experts NVFP4 (W4A4, group_size 16);
+    shared experts and mamba in/out_proj FP8 per-tensor; FP8 KV cache; latent MoE,
+    lm_head, MTP, output, and mamba conv1d stay BF16; SSM cache stays FP32
+    (optionally FP16 in vLLM). Weight-MSE calibration with FP8 scale sweep.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt_recipes/models/Nemotron-3-Super-120B-A12B/super-nvfp4.yaml` around lines 35 - 36, The description's claim that latent MoE "fc1_latent_proj/fc2_latent_proj" are FP8 per-tensor is inconsistent with the quant_cfg (and header) which leave those layers as BF16; either update the description string to state that fc1_latent_proj and fc2_latent_proj remain BF16/FP16 to match quant_cfg, or add explicit quantizer overrides in quant_cfg for the fc1_latent_proj and fc2_latent_proj modules to set them to FP8 per-tensor (matching the rest of the FP8 settings); ensure the description text and the quant_cfg entries (module names fc1_latent_proj, fc2_latent_proj) are aligned.
modelopt_recipes/models/Nemotron-3-Super-120B-A12B/super-nvfp4-fp8-sweep-stride4.yaml (1)
34-35: ⚠️ Potential issue | 🟡 Minor — Description/comments conflict on latent MoE and SSM/mamba precision.
Lines 34-35 state latent MoE is FP8 per-tensor, but the recipe does not enable latent-MoE quantizers. Also, Lines 134-135 conflict with earlier comments on SSM/mamba precision. Please make these statements internally consistent.
Proposed patch
 metadata:
   recipe_type: ptq
-  description: Super NVFP4 mixed precision — sparse MoE experts NVFP4 (W4A4, group_size 16); shared experts, mamba in/out_proj, and Latent MOE fc1_latent_proj/fc2_latent_proj - FP8 per-tensor; FP8 KV cache; lm_head/MTP/SSM stay BF16/FP16. Weight-MSE calibration with stride-4 FP8 scale sweep.
+  description: >-
+    Super NVFP4 mixed precision — sparse MoE experts NVFP4 (W4A4, group_size 16);
+    shared experts and mamba in/out_proj FP8 per-tensor; FP8 KV cache; latent MoE,
+    lm_head, MTP, output, and mamba conv1d stay BF16; SSM cache stays FP32
+    (optionally FP16 in vLLM). Weight-MSE calibration with stride-4 FP8 scale sweep.
@@
-# Stay BF16: lm_head, output projection, MoE routers/gates, MTP head.
-# SSM state / mamba conv1d stay FP16.
+# Stay BF16: lm_head, output projection, MoE routers/gates, MTP head, mamba conv1d.
+# SSM state stays FP32 (can be set to FP16 in vLLM).
Also applies to: 134-135
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt_recipes/models/Nemotron-3-Super-120B-A12B/super-nvfp4-fp8-sweep-stride4.yaml` around lines 34 - 35, The description text claims "latent MoE is FP8 per-tensor" and that "lm_head/MTP/SSM stay BF16/FP16" but the recipe doesn't enable latent-MoE quantizers and later lines conflict on SSM/mamba precision; fix by making the YAML fields match the prose: either add the latent-MoE quantizer key (e.g., include latent_moe in the quantizers list or set enable_latent_moe_quantizers: true) and ensure its precision is set to FP8 per-tensor, or change the description to remove the FP8 latent-MoE claim; likewise reconcile SSM/mamba entries by updating the lm_head/MTP/SSM and mamba precision fields to uniformly state BF16/FP16 (or change the description to reflect the actual configured precisions) so the "description" string and the quantizer/precision keys (latent_moe_quantizers, quantizers list, ssm_precision, mamba_precision, lm_head_precision / mtp_precision) are consistent.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 595b4531-3ced-4894-a5cc-c28f161c8f3e
📒 Files selected for processing (2)
modelopt_recipes/models/Nemotron-3-Super-120B-A12B/super-nvfp4-fp8-sweep-stride4.yaml
modelopt_recipes/models/Nemotron-3-Super-120B-A12B/super-nvfp4.yaml
# Megatron-Core names: mlp.shared_experts.linear_fc{1,2}
# - Mamba mixer linears (mixer.{in,out}_proj): FP8 per-tensor
# - KV cache: FP8
# - Attention linears ({q,k,v}_proj): BF16 (not quantized)
Can we double check attention out linear? IIRC, attention o_proj should be FP8.
responded in slack, only 2/9 attention layers had o_proj FP8 in final Super NVFP4 ckpt, but we can always add it later to test if accuracy degradation is minimal
/claude review
@coderabbitai full review
✅ Actions performed: full review triggered.
@claude review
Actionable comments posted: 5
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
modelopt/torch/export/unified_export_megatron.py (1)
1213-1229: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win — Record packed-expert quant metadata against the module prefix, not the weight key.
In both pack-remap helpers, `prefix` is the tensor key written into `state_dict` (for example `...weight`). Recording that verbatim produces `layer_config_dict` entries like `...weight.quantization`, and `process_layer_quant_config()` will then emit `quantized_layers` names ending in `.weight` instead of the HF module prefix. Serving-side layer matching will miss those packed experts.
Suggested fix
-        self._record_layer_quant_config(prefix, qformat, block_size)
+        module_prefix = prefix.rsplit(".", 1)[0] + "."
+        self._record_layer_quant_config(module_prefix, qformat, block_size)
Apply the same normalization in both `_pack_name_remapping()` and `_pack_name_remapping_gpt_oss()`.
Also applies to: 1280-1298
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/torch/export/unified_export_megatron.py` around lines 1213 - 1229, The code records quant metadata under the full tensor key (e.g., "...weight") causing layer names to end with ".weight"; update both _pack_name_remapping and _pack_name_remapping_gpt_oss to normalize the prefix before calling self._record_layer_quant_config by stripping the tensor suffix (e.g., remove trailing ".weight" or the last dot component) so the module-level HF prefix is recorded instead of the weight key; apply the same normalization logic where self._record_layer_quant_config(prefix, qformat, block_size) is invoked in both functions.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@modelopt_recipes/models/Nemotron-3-Super-120B-A12B/super-nvfp4-amax.yaml`:
- Around line 16-36: The recipe metadata/comments misstate Latent MOE behavior:
fc1_latent_proj/fc2_latent_proj are described as FP8 per-tensor but the
quant_cfg (and lines noting "stay BF16/FP16") never enable those quantizers;
either update the human-readable description to say Latent MOE projections
remain BF16 (or FP16 per existing comment) to match quant_cfg, or enable FP8
per-tensor quantizers for fc1_latent_proj and fc2_latent_proj in the quant_cfg
so the comment matches behavior—make the change by editing the
metadata/description block and/or the quant_cfg entries that reference
fc1_latent_proj and fc2_latent_proj accordingly.
In
`@modelopt_recipes/models/Nemotron-3-Super-120B-A12B/super-nvfp4-fp8-sweep-stride4.yaml`:
- Around line 34-35: The description erroneously claims the Latent MoE is "FP8
per-tensor" while the recipe does not enable quantizer patterns for
fc1_latent_proj or fc2_latent_proj; update the metadata description string (the
YAML "description" field) to remove or correct the FP8 statement for Latent MoE
(or explicitly state that fc1_latent_proj/fc2_latent_proj remain BF16/FP16) so
it matches the actual quantizer configuration for fc1_latent_proj and
fc2_latent_proj.
In
`@modelopt_recipes/models/Nemotron-3-Super-120B-A12B/super-nvfp4-max-calib.yaml`:
- Around line 35-36: The description claims latent MoE is FP8 and that static
scales are "chosen by MSE", but the YAML sets no quantizers for
fc1_latent_proj/fc2_latent_proj and uses method: max; update the human-readable
description to match the actual config by either (A) enabling the latent
projection quantizers (fc1_latent_proj/fc2_latent_proj) to make latent MoE FP8,
or (B) explicitly state latent projections remain unquantized (not FP8) and
change the phrase "static scales are chosen by MSE" to reflect the configured
method (e.g., "static scales computed by max" or "method: max"); adjust the
description lines referencing "FP8 per-tensor; ... Latent MOE
fc1_latent_proj/fc2_latent_proj" and the sentence about static scale selection
accordingly so the text and the fields (fc1_latent_proj, fc2_latent_proj, and
method: max) are consistent.
In `@modelopt_recipes/models/Nemotron-3-Super-120B-A12B/super-nvfp4.yaml`:
- Around line 35-36: The description string misreports the latent-MoE policy:
update the description (the YAML description field) to reflect that
fc1_latent_proj and fc2_latent_proj are left unquantized (per the quant_cfg)
rather than being "FP8 per-tensor"; edit the text around the FP8/KV/latent-MoE
sentence to explicitly state that fc1_latent_proj/fc2_latent_proj remain
unquantized (or their actual precision) and keep FP8 per-tensor/KV wording only
for the tensors that truly use FP8.
In `@modelopt/torch/export/unified_export_megatron.py`:
- Around line 815-818: The current check only treats qformat is None as
excluded, but QUANTIZATION_NONE should be treated the same so those layers are
recorded as excluded (and not dropped from the exported quant config); update
the conditional in unified_export_megatron.py where qformat is obtained from
_get_quantization_format(module) to also consider QUANTIZATION_NONE (e.g., if
qformat is None or qformat == QUANTIZATION_NONE) before calling
_record_excluded_module(prefix), and ensure any references to
_record_layer_quant_config still see these explicitly disabled quantizers as
excluded rather than omitted.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 3adef0ee-1161-4068-a796-fc857c2455fd
📒 Files selected for processing (11)
modelopt/torch/export/unified_export_megatron.py
modelopt/torch/quantization/calib/mse.py
modelopt/torch/quantization/config.py
modelopt/torch/quantization/model_calib.py
modelopt/torch/quantization/plugins/megatron.py
modelopt/torch/quantization/qtensor/nvfp4_tensor.py
modelopt_recipes/models/Nemotron-3-Super-120B-A12B/super-nvfp4-amax.yaml
modelopt_recipes/models/Nemotron-3-Super-120B-A12B/super-nvfp4-fp8-sweep-stride4.yaml
modelopt_recipes/models/Nemotron-3-Super-120B-A12B/super-nvfp4-max-calib.yaml
modelopt_recipes/models/Nemotron-3-Super-120B-A12B/super-nvfp4.yaml
tests/gpu_megatron/torch/export/test_unified_export_megatron.py
@CodeRabbit full review
✅ Actions performed: full review triggered.
Actionable comments posted: 1
🧹 Nitpick comments (2)
modelopt/torch/quantization/qtensor/nvfp4_tensor.py (1)
120-122: 💤 Low value — `_get_static_global_amax` is called twice for the static branch.
`_is_static_quantizer` internally calls `_get_static_global_amax` and discards the result; line 122 then calls it again. Consider hoisting to a single variable:
♻️ Proposed refactor
-        if cls._is_static_quantizer(weight_quantizer):
-            # Static path: use pre-computed per-block amax values from quantizer
-            global_amax = cls._get_static_global_amax(weight_quantizer).float()
+        global_amax = cls._get_static_global_amax(weight_quantizer)
+        if global_amax is not None:
+            # Static path: use pre-computed per-block amax values from quantizer
+            global_amax = global_amax.float()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/torch/quantization/qtensor/nvfp4_tensor.py` around lines 120 - 122, Hoist the call to _get_static_global_amax so it's executed only once: call amax = cls._get_static_global_amax(weight_quantizer) (and .float() as needed) and reuse that value instead of calling _get_static_global_amax again; update _is_static_quantizer to accept an optional precomputed amax parameter or refactor _is_static_quantizer to avoid calling _get_static_global_amax internally so the static branch uses the hoisted amax (refer to _is_static_quantizer, _get_static_global_amax and weight_quantizer).
tests/gpu_megatron/torch/export/test_unified_export_megatron.py (1)
151-164: ⚡ Quick win — Add one end-to-end mixed-precision export case.
The new export path here is the per-layer `layer_config_dict` / `process_layer_quant_config` flow, but this matrix still only exercises uniform configs (`NVFP4_DEFAULT_CFG` / `FP8_DEFAULT_CFG`). A single mixed recipe case would catch regressions in `quantized_layers`, excludes, and the `config.json` ↔ `hf_quant_config.json` parity you just tightened.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/gpu_megatron/torch/export/test_unified_export_megatron.py` around lines 151 - 164, Add a new parametrized test case to the existing pytest.mark.parametrize matrix (the tuple of ("model_type", "arch", "extra_module", "quant_config", "kv_cache_quant_cfg")) to exercise the per-layer mixed-precision export path: supply a quant_config that triggers the layer_config_dict / process_layer_quant_config flow (i.e., a mixed-recipe config rather than NVFP4_DEFAULT_CFG or FP8_DEFAULT_CFG) so the test exercises quantized_layers, excludes handling, and the config.json ↔ hf_quant_config.json parity; update the matrix alongside existing entries (referencing the parameters model_type/arch/extra_module/quant_config/kv_cache_quant_cfg) so one case uses a mixed per-layer config for either nemotron or llama.
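One way to extend the matrix, sketched with a hypothetical config name (`MIXED_NVFP4_FP8_CFG` is not a real modelopt constant; the real case should pull a per-layer mixed config from the recipes under test):

```python
# Hypothetical extra row for the parametrize matrix above.
pytest.param(
    "llama",
    "LlamaForCausalLM",
    None,
    MIXED_NVFP4_FP8_CFG,  # assumed: routes some layers to NVFP4, others to FP8
    None,
    id="llama_mixed_precision_export",
)
```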
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 94f3528f-2ee4-4757-b8de-3a323300ca74
📒 Files selected for processing (10)
modelopt/torch/export/unified_export_megatron.py
modelopt/torch/quantization/calib/mse.py
modelopt/torch/quantization/config.py
modelopt/torch/quantization/model_calib.py
modelopt/torch/quantization/plugins/megatron.py
modelopt/torch/quantization/qtensor/nvfp4_tensor.py
modelopt_recipes/models/Nemotron-3-Super-120B-A12B/super-nvfp4-fp8-sweep-stride4.yaml
modelopt_recipes/models/Nemotron-3-Super-120B-A12B/super-nvfp4-max-calib.yaml
modelopt_recipes/models/Nemotron-3-Super-120B-A12B/super-nvfp4.yaml
tests/gpu_megatron/torch/export/test_unified_export_megatron.py
# MoE routed experts -> NVFP4 W4A4, block_size 16, e4m3 scale.
# HF/export names: backbone.layers.*.mixer.experts.*.{up,down}_proj.
- quantizer_name: '*mixer.experts.*weight_quantizer'
  enable: true
  cfg:
    block_sizes:
      -1: 16
      type: dynamic
      scale_bits: e4m3
    num_bits: e2m1
- quantizer_name: '*mixer.experts.*input_quantizer'
  enable: true
  cfg:
    block_sizes:
      -1: 16
      type: dynamic
      scale_bits: e4m3
    num_bits: e2m1
# Megatron-Core/PTQ names: decoder.layers.*.mlp.experts.local_experts.*.linear_fc{1,2}.
- quantizer_name: '*mlp.experts*weight_quantizer'
  enable: true
  cfg:
    block_sizes:
      -1: 16
      type: dynamic
      scale_bits: e4m3
    num_bits: e2m1
- quantizer_name: '*mlp.experts*input_quantizer'
  enable: true
  cfg:
    block_sizes:
      -1: 16
      type: dynamic
      scale_bits: e4m3
    num_bits: e2m1
Keep routed-expert weights static in the max-calib variant.
These two *...weight_quantizer entries are type: dynamic, so this recipe does more than swap method: mse for method: max—it changes routed-expert weights to dynamic NVFP4 and bypasses the static-weight calibration path entirely. That makes this variant a different quantization recipe, not a max-calibrated version of super-nvfp4.yaml.
Suggested fix
 - quantizer_name: '*mixer.experts.*weight_quantizer'
   enable: true
   cfg:
     block_sizes:
-      type: dynamic
+      type: static
       scale_bits: e4m3
     num_bits: e2m1
@@
 - quantizer_name: '*mlp.experts*weight_quantizer'
   enable: true
   cfg:
     block_sizes:
-      type: dynamic
+      type: static
       scale_bits: e4m3
     num_bits: e2m1
num_bits: e2m1📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| # MoE routed experts -> NVFP4 W4A4, block_size 16, e4m3 scale. | |
| # HF/export names: backbone.layers.*.mixer.experts.*.{up,down}_proj. | |
| - quantizer_name: '*mixer.experts.*weight_quantizer' | |
| enable: true | |
| cfg: | |
| block_sizes: | |
| -1: 16 | |
| type: dynamic | |
| scale_bits: e4m3 | |
| num_bits: e2m1 | |
| - quantizer_name: '*mixer.experts.*input_quantizer' | |
| enable: true | |
| cfg: | |
| block_sizes: | |
| -1: 16 | |
| type: dynamic | |
| scale_bits: e4m3 | |
| num_bits: e2m1 | |
| # Megatron-Core/PTQ names: decoder.layers.*.mlp.experts.local_experts.*.linear_fc{1,2}. | |
| - quantizer_name: '*mlp.experts*weight_quantizer' | |
| enable: true | |
| cfg: | |
| block_sizes: | |
| -1: 16 | |
| type: dynamic | |
| scale_bits: e4m3 | |
| num_bits: e2m1 | |
| - quantizer_name: '*mlp.experts*input_quantizer' | |
| enable: true | |
| cfg: | |
| block_sizes: | |
| -1: 16 | |
| type: dynamic | |
| scale_bits: e4m3 | |
| num_bits: e2m1 | |
| # MoE routed experts -> NVFP4 W4A4, block_size 16, e4m3 scale. | |
| # HF/export names: backbone.layers.*.mixer.experts.*.{up,down}_proj. | |
| - quantizer_name: '*mixer.experts.*weight_quantizer' | |
| enable: true | |
| cfg: | |
| block_sizes: | |
| -1: 16 | |
| type: static | |
| scale_bits: e4m3 | |
| num_bits: e2m1 | |
| - quantizer_name: '*mixer.experts.*input_quantizer' | |
| enable: true | |
| cfg: | |
| block_sizes: | |
| -1: 16 | |
| type: dynamic | |
| scale_bits: e4m3 | |
| num_bits: e2m1 | |
| # Megatron-Core/PTQ names: decoder.layers.*.mlp.experts.local_experts.*.linear_fc{1,2}. | |
| - quantizer_name: '*mlp.experts*weight_quantizer' | |
| enable: true | |
| cfg: | |
| block_sizes: | |
| -1: 16 | |
| type: static | |
| scale_bits: e4m3 | |
| num_bits: e2m1 | |
| - quantizer_name: '*mlp.experts*input_quantizer' | |
| enable: true | |
| cfg: | |
| block_sizes: | |
| -1: 16 | |
| type: dynamic | |
| scale_bits: e4m3 | |
| num_bits: e2m1 |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In
`@modelopt_recipes/models/Nemotron-3-Super-120B-A12B/super-nvfp4-max-calib.yaml`
around lines 45 - 79, The two routed-expert weight quantizer entries
('*mixer.experts.*weight_quantizer' and '*mlp.experts*weight_quantizer')
currently use type: dynamic; change them to use the static-weight calibration
path by replacing type: dynamic with type: static (or remove the dynamic setting
and ensure static is explicitly set) so routed-expert weights remain static in
this max-calib variant while leaving the corresponding input_quantizer entries
unchanged.
if hasattr(self, "kv_cache_dtype"):
    self._hf_quant_config["quantization"]["kv_cache_quant_algo"] = self.kv_cache_dtype
# Use one serving-facing config for both hf_quant_config.json and config.json.
self._hf_quant_config = convert_hf_quant_config_format(raw_hf_quant_config)
do we change the format of hf_quant_config.json by this change?
yes to align it with the quant config inside config.json. previously hf_quant_config was still using the older json format whereas the quant config inside config.json was being converted to a newer format
return global_amax is not None

@classmethod
def _get_static_global_amax(cls, weight_quantizer):
Thanks for catching the bug! I'd like to understand why in mcore path we use _global_amax instead of global_amax, is there any general guidance for when to use private attribute name and when to use public ones? I think we have similar issue with amax vs _amax
during checkpoint resume in MCore it initializes the quantizers as generic TensorQuantizer and those have _global_amax parameter (instead of NVFP4QTensor which has global_amax)
@Fridah-nv I have changed MCore static quantizer restore to restore as NVFP4QTensor in mtq.plugins.custom
Fridah-nv left a comment:
Changes in MSE flow LGTM
/claude review
1a10eeb to 8fc569d Compare
axis=module._calibrator._axis,
global_amax=module.global_amax,
quant_func=partial(_mse_quant_func, quantizer=module),
fp8_scale_sweep_stride=fp8_scale_sweep_stride,
[CRITICAL Algorithm] NVFP4MSECalibrator.__init__ does not accept a fp8_scale_sweep_stride parameter (see calib/mse.py:177-184). Passing it here will raise TypeError: __init__() got an unexpected keyword argument 'fp8_scale_sweep_stride' at runtime whenever fp8_scale_sweep=True and the quantizer is an NVFP4 static quantizer.
Either:
- Add `fp8_scale_sweep_stride` to `NVFP4MSECalibrator.__init__` and use it in `_generate_candidates` to subsample the candidate list (e.g., `fp8_values[::stride]`; see the sketch after this list), or
- Remove this kwarg from the call site until the calibrator supports it.
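A sketch of the first option, assuming the candidate list is the 126 positive finite e4m3 scales (class and method names follow the comment above; the actual calib/mse.py implementation may differ):

```python
import torch

class NVFP4MSECalibratorSketch:  # hypothetical mirror of the real class
    def __init__(self, amax, axis=None, quant_func=None, *, fp8_scale_sweep_stride=1):
        self._amax = amax
        self._axis = axis
        self._quant_func = quant_func
        self._stride = fp8_scale_sweep_stride

    def _generate_candidates(self) -> torch.Tensor:
        # All 126 positive finite fp8_e4m3fn scale values (bytes 0x01..0x7E).
        fp8_values = torch.arange(1, 127, dtype=torch.uint8).view(torch.float8_e4m3fn).float()
        if self._stride <= 1:
            return fp8_values
        candidates = fp8_values[::self._stride]
        if candidates[-1] != fp8_values[-1]:
            # Always keep the max scale so the sweep never loses the
            # max-calibration candidate.
            candidates = torch.cat([candidates, fp8_values[-1:]])
        return candidates
```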
if hasattr(self, "kv_cache_dtype"):
    self._hf_quant_config["quantization"]["kv_cache_quant_algo"] = self.kv_cache_dtype
# Use one serving-facing config for both hf_quant_config.json and config.json.
self._hf_quant_config = convert_hf_quant_config_format(raw_hf_quant_config)
[IMPORTANT Compatibility] This changes the schema of hf_quant_config.json from the raw ModelOpt format ({"producer": {...}, "quantization": {...}}) to the compressed-tensors-compatible format ({"config_groups": {...}, "ignore": [...], "quant_algo": ...}).
The HF export path (unified_export_hf.py:1190-1194) still writes the raw format to hf_quant_config.json and the converted format to config.json. This PR makes the MCore export path diverge by writing converted format in both places.
Any downstream tools (TRT-LLM build scripts, vLLM loaders) that read hf_quant_config.json from MCore-exported checkpoints and expect the "quantization" top-level key will break. If this is intentional (which the test updates suggest), please mention the schema change in the PR description and/or changelog so consumers know to update.
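For reference, the two shapes the comment contrasts, with illustrative placeholder values (only the top-level keys are taken from the comment; everything nested is an assumption):

```python
# Raw ModelOpt format, previously written to hf_quant_config.json:
raw_format = {
    "producer": {"name": "modelopt", "version": "..."},
    "quantization": {"quant_algo": "NVFP4", "kv_cache_quant_algo": "FP8"},
}

# Compressed-tensors-compatible format now written by the MCore path
# to both hf_quant_config.json and config.json:
converted_format = {
    "quant_algo": "NVFP4",
    "config_groups": {"group_0": "..."},
    "ignore": ["lm_head"],  # illustrative entry
}
```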
if _has_state(self.weight_quantizer, "_amax") and not _has_complete_static_nvfp4_weight_state(
    self.weight_quantizer, self.weight
):
[IMPORTANT Compatibility] The condition _has_state(wq, "_amax") and not _has_complete_static_nvfp4_weight_state(wq, self.weight) means: if _amax exists but the quantizer is NOT a complete static NVFP4, recalibrate. Previously it always recalibrated when _amax existed.
However, _has_complete_static_nvfp4_weight_state checks global_amax = getattr(quantizer, "global_amax", None) which relies on the global_amax property of NVFP4StaticQuantizer (which in turn reads _global_amax buffer). If the quantizer class hasn't been upgraded to NVFP4StaticQuantizer yet at this point during restore (because modelopt_post_restore is called before the megatron plugin's set_extra_state → from_tensor_quantizer), then global_amax will be None (plain TensorQuantizer doesn't have this property).
In that case, the function returns False, falling through to recalibrate — which is the safe/correct fallback. So this works correctly, but the "preserve MSE scales" path only activates when the quantizer has already been upgraded. Is that the intended sequence? If the HF restore path always upgrades first (via the _check_unsupported_states / from_tensor_quantizer flow), this is fine. Just flagging for awareness.
for leaf_quantizer in _iter_leaf_quantizers(quantizer):
    if _is_dynamic_block_quantizer(leaf_quantizer):
        continue
    leaf_quantizer.sync_amax_across_distributed_group(parallel_state.data_parallel_group)
    leaf_quantizer.sync_amax_across_distributed_group(
        parallel_state.expert_model_parallel_group
    )
[SUGGESTION] The old code guarded sync_amax_across_distributed_group with if getattr(quantizer, "_amax", None) is not None. The new code removes this guard and relies on the internal check inside sync_amax_across_distributed_group.
While functionally correct (the method has if ... getattr(self, "_amax", None) is not None internally), this introduces unnecessary all_reduce call-site overhead for dynamic block quantizers that were already skipped by _is_dynamic_block_quantizer. However, non-dynamic quantizers that happen to not have _amax (e.g., disabled quantizers yielded by _iter_leaf_quantizers) will now call into sync_amax_across_distributed_group only to no-op internally. Minor perf consideration — no functional bug.
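If the call-site guard is worth restoring, a sketch against the hunk above (same names; the early `continue` skips the sync call entirely for amax-less leaves):

```python
for leaf_quantizer in _iter_leaf_quantizers(quantizer):
    if _is_dynamic_block_quantizer(leaf_quantizer):
        continue
    if getattr(leaf_quantizer, "_amax", None) is None:
        continue  # nothing to all-reduce for this leaf
    leaf_quantizer.sync_amax_across_distributed_group(parallel_state.data_parallel_group)
    leaf_quantizer.sync_amax_across_distributed_group(
        parallel_state.expert_model_parallel_group
    )
```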
self._record_layer_quant_config(q_proj_prefix, qformat, block_size)
self._record_layer_quant_config(k_proj_prefix, qformat, block_size)
self._record_layer_quant_config(v_proj_prefix, qformat, block_size)
if qformat in (None, QUANTIZATION_NONE):
    # MCore stores Q/K/V as one fused linear_qkv module, but HF exports them
    # as separate q_proj/k_proj/v_proj modules. Record the HF names so
    # runtime quant configs do not miss excluded fused-QKV projections.
    fused_prefix = prefix.removesuffix(".")
    self.exclude_modules = [m for m in self.exclude_modules if m != fused_prefix]
    self._record_excluded_module(q_proj_prefix)
    self._record_excluded_module(k_proj_prefix)
    self._record_excluded_module(v_proj_prefix)
[SUGGESTION] When qformat in (None, QUANTIZATION_NONE), _record_excluded_module is called with q_proj_prefix, k_proj_prefix, v_proj_prefix which all end with ".". The method correctly strips the trailing dot via removesuffix(".").
However, the list-comprehension self.exclude_modules = [m for m in self.exclude_modules if m != fused_prefix] removes the original fused module entry. If _qkv_slicing is called multiple times for the same prefix (e.g., from different pipeline stages in distributed), the second call would find fused_prefix already absent — which is fine. But the deduplication inside _record_excluded_module (the if layer_name not in self.exclude_modules check) prevents the individual proj names from being double-added too. Good.
One minor observation: if prefix is "" (empty), then fused_prefix = "".removesuffix(".") is also "", and the list-comp would try to remove "" from exclude_modules. You've guarded _record_excluded_module against empty strings, but the fused_prefix removal isn't similarly guarded. Consider adding if fused_prefix: before the list-comp to be consistent.
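The suggested guard, as a sketch against the hunk above:

```python
fused_prefix = prefix.removesuffix(".")
if fused_prefix:  # avoid stripping "" entries when prefix is empty
    self.exclude_modules = [m for m in self.exclude_modules if m != fused_prefix]
```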
start_multiplier: float = 0.25,
stop_multiplier: float = 4.0,
fp8_scale_sweep: bool = False,
fp8_scale_sweep_stride: int = 1,
[IMPORTANT Compatibility] fp8_scale_sweep_stride is added to mse_calibrate() but not to MseCalibConfig (see config.py:1287). Since wrapped_calib_func calls func(model, forward_loop=forward_loop, **config.model_dump()), this parameter can never flow through the standard ModelOpt mode/config pipeline (e.g., when using apply_mode with a recipe).
For recipes like super-nvfp4.yaml to actually use fp8_scale_sweep_stride, it needs to be added to MseCalibConfig (or a new config field) and parsed from the recipe YAML algorithm section. Without this, the parameter is only usable via direct mse_calibrate() calls, which limits the recipe-driven workflow.
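A sketch of threading the knob through the config layer (field names other than `fp8_scale_sweep_stride` are assumptions about config.py's `MseCalibConfig`):

```python
from pydantic import BaseModel, Field

class MseCalibConfig(BaseModel):  # hypothetical mirror of the real config
    method: str = "mse"
    fp8_scale_sweep: bool = False
    fp8_scale_sweep_stride: int = Field(
        default=1,
        ge=1,
        description="Subsample every Nth FP8 scale candidate during the sweep.",
    )

# wrapped_calib_func would then forward it automatically:
# func(model, forward_loop=forward_loop, **config.model_dump())
```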
Code Review Summary
Findings: CRITICAL: 1, IMPORTANT: 3, SUGGESTION: 2
Critical
`fp8_scale_sweep_stride` passed to `NVFP4MSECalibrator`, which doesn't accept it (model_calib.py:412). This will raise a `TypeError` at runtime whenever `fp8_scale_sweep=True` and the quantizer matches the NVFP4 static path. The constructor at calib/mse.py:177-184 has no such parameter.
Important
- `hf_quant_config.json` schema change (unified_export_megatron.py:354). MCore exports now write the converted (compressed-tensors) format to `hf_quant_config.json`, breaking downstream tools that expect the raw `{"producer": {...}, "quantization": {...}}` format. The HF export path still writes the old format to this file.
- `fp8_scale_sweep_stride` not in `MseCalibConfig` (model_calib.py:335). The parameter can only be used via direct `mse_calibrate()` calls, not through the standard recipe/mode pipeline, limiting its utility for the new YAML recipes.
- Static NVFP4 preserve-amax path timing (plugins/custom.py:164). The "preserve MSE scales" logic depends on the quantizer already being upgraded to `NVFP4StaticQuantizer`. If the upgrade hasn't happened yet (depends on restore ordering), it safely falls through to recalibration, but this means MSE scales may not be preserved in all restore paths.
Suggestions
- Guard fused_prefix removal against empty string (unified_export_megatron.py:1074).
- Minor perf: removed `_amax` guard (model_calib.py:196-201). Functionally correct but calls into the sync method for quantizers without `_amax`.
Overall Assessment
Risk: Medium-High. The critical TypeError bug (#1) will crash any user path that invokes mse_calibrate with fp8_scale_sweep=True on a model with NVFP4 static quantizers — which is the exact use case of the new super-nvfp4.yaml recipe. The hf_quant_config.json schema change (#2) is a compatibility concern for consumers of MCore-exported checkpoints.
The core logic changes (dynamic block quantizer detection fix, _global_amax fallback for restored static quantizers, per-layer quant config collection for mixed-precision export, QKV exclude expansion) are well-reasoned and the new unit tests cover the key behaviors.
Signed-off-by: Jennifer Chen <[email protected]>
Signed-off-by: Jennifer Chen <[email protected]>
Signed-off-by: Jennifer Chen <[email protected]>
Static MSE NVFP4 export had two latent bugs that surfaced as garbage outputs during vLLM serving:
1. nvfp4_tensor.get_weights_scaling_factor_from_quantizer (static branch) clamped per-block FP8 scales below at 2^-9 (smallest e4m3fn subnormal), so tiny-amax blocks landed on FP8 byte 0x01 instead of underflowing naturally to 0x00. The dynamic branch never had this clamp; removing it makes the static path bit-equivalent to dynamic when _amax matches the block max.
2. quant_utils._ensure_weight_quantizer_calibrated treated a present-and-non-None _amax buffer as already calibrated, but _reset_pytorch_state_from_metadata registers buffers via torch.empty(...) during MCore restore. If the saved distcp shard didn't fill the buffer (e.g., a routed expert that never received calibration tokens, or a buffer first registered on the meta device by MaxCalibrator's meta-device fast path), the empty memory persisted as values like -1e37, producing valid-but-bogus FP8 scale bytes during static export that silently NaN'd in vLLM's NVFP4 MoE compute.
Add a post-calibration validation pass in mse_calibrate that recomputes _amax / _global_amax from the actual weight tensor when they're missing, on the meta device, non-finite, or negative -- equivalent to the max-calibration fallback used at export time, applied earlier so the saved checkpoint is clean for both MCore and vLLM consumers. Per-quantizer warnings are emitted so the failure is visible in PTQ logs.
Add CPU pytest coverage in test_nvfp4_static_export_cpu.py for the no-NaN/Inf, bound, static-vs-dynamic equivalence, and amax-zero corner cases.
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
…ithm
Mirrors MAMBA_MOE_NVFP4_AGGRESSIVE_CFG verbatim except algorithm becomes {"method": "mse", "fp8_scale_sweep": False}. Lets the same quantizer coverage (weight + input on all routed/MoE matmuls, with the standard disabled lists) be exercised under MSE calibration via QUANT_CONFIG=MAMBA_MOE_NVFP4_AGGRESSIVE_MSE_CFG without needing a separate recipe yaml.
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Previous version mirrored MAMBA_MOE_NVFP4_AGGRESSIVE_CFG verbatim, which uses _nvfp4_cfg (block_sizes type='dynamic'). MSE search only operates on quantizers promoted to NVFP4StaticQuantizer (is_static_block_quant), so applying algorithm='mse' to dynamic weights was a no-op -- the resulting checkpoint was effectively max-calibrated dynamic NVFP4. Confirmed in practice: 6528 max_calibrate forward calls during MCore restore on the 160895 ckpt, no NVFP4StaticQuantizer instances created.
Match NVFP4_W4A4_WEIGHT_MSE_FP8_SWEEP_CFG's weight cfg (block_sizes type='static', scale_bits=(4,3)) so MSE actually engages. Inputs stay dynamic. Keeps the AGGRESSIVE_CFG quantizer coverage (mamba MoE disable lists) so this is the direct MSE counterpart to AGGRESSIVE_CFG for the Ultra V3 / Nemotron-H mixer topology.
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Per coworker guidance, MSE-calibrated NVFP4 export should bound the FP8 per-block scale to the fp8_e4m3fn representable finite range so tiny-amax blocks don't underflow to byte 0x00 and produce zero output even when the weights are nontrivial. Apply ``.clamp_(min=2**-9, max=448.0)`` on the static path and ``.clamp_(min=2**-9)`` on the dynamic path (the upper clamp landed in main as 097293b / Fridah's fix mirrors the lower clamp on both paths to keep them bit-equivalent).
Update test_static_matches_dynamic_when_amax_is_block_max accordingly: both paths now produce identical FP8 bytes for every block including underflow blocks.
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
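A small illustration of the bound this commit describes (torch only; the example scales are made up, not values from the model):

```python
import torch

# Per-block scales below 2**-9 would encode as fp8 byte 0x00 (zero scale),
# zeroing the whole block; scales above 448 are not representable in e4m3fn.
per_block_scale = torch.tensor([1e-5, 0.03, 600.0])
bounded = per_block_scale.clamp(min=2**-9, max=448.0)
fp8_scale = bounded.to(torch.float8_e4m3fn)  # all entries now finite, nonzero bytes
```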
Adds repair_sharded_modelopt_state(model, checkpoint_name, prefix, metadata) to the megatron-core dist-checkpointing plugin. After the main phase-2 load (dist_checkpointing.load + model.load_state_dict in Megatron's load_modelopt_checkpoint), this helper detects weight/input/output quantizer _amax / _global_amax buffers that contain NaN / Inf / negative content (torch.empty(...) leakage from a phase-2 read that silently dropped the key) and re-fills them by re-issuing dist_checkpointing.load directly with a synthetic ShardedTensor request whose descriptor matches what the saved .metadata records.
This works around an MCore distcp resharding issue observed on PP=1 EP=16 save resharded to PP=12 EP=1 export: each PP rank's first MoE local layer expert _amax / _global_amax buffers stay at uninitialized memory after phase-2. Direct dist_checkpointing.load on the same keys returns the saved MSE values (verified empirically) -- so the upstream phase-2 path is wrong, not the on-disk data. The helper bypasses whatever descriptor mismatch is upstream and copies the saved values into the runtime buffers, preserving the saved (e.g. MSE-calibrated) state instead of falling back to max-calibration.
Treats zero as a valid amax (legitimate dead-block amax for NVFP4 static). No-op if no invalid buffers are detected.
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
repair_sharded_modelopt_state v1 walked the dict keys returned by
model.sharded_state_dict() to filter quantizer buffers, but those keys
are the *runtime* names (e.g. ...local_experts.{local_idx}.linear_fc1
.weight_quantizer._amax). The saved sharded checkpoint keys (and
.metadata, and what dist_checkpointing.load matches against) use the
*sharded* names with the SequentialMLP rewrite already applied:
...experts.{global_idx}.linear_fc1.weight_quantizer._amax. The rewrite
is performed by replace_prefix_for_sharding which only modifies
ShardedTensor.key (Megatron-LM/megatron/core/dist_checkpointing/utils.py:
line 184-207), not the containing dict's keys.
v1 therefore looked up the runtime-style key in the saved metadata, got
None, marked the buffer as not in saved metadata, and skipped it. The
repair fired on zero buffers. Verified empirically against export job
162125: post-phase-2 dump still shows the same 354 NaN amax pattern as
the pre-repair behavior, with max-cal'd post-repair amax (different per
PP rank) coming from _ensure_weight_quantizer_calibrated, not from the
new repair function.
Switch the filter and metadata lookup to v.key (the rewritten
sharded_state_dict key) so we match what is actually saved.
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Job 162145's export log shows my probe (Option C) ran successfully but
the repair (Option B v2) emitted no output at all. The post-restore
quantizer summary still shows the same 354 NaN amax pattern as the
pre-repair behavior, suggesting the repair function never fired or
returned 0 candidates.
Add three unconditional prints to disambiguate:
- On entry: confirm the function is even being called.
- At each early-return: identify which guard tripped.
- After candidate scan: show how many ShardedTensor entries were
walked, how many had _amax/_global_amax suffixes, and how many
failed _quantizer_buffer_value_is_invalid.
If we see the entry print but `invalid=0`, the validation predicate is
wrong. If we don't see the entry print at all, the import or caller
gating is at fault.
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
…yer expert 0 fc1
Diagnostic export 162306 confirmed that repair_sharded_modelopt_state enters and walks the full runtime sharded_state_dict (rank 0: 12338 amax_keys among 16490 sharded tensors), but reports invalid=0 even though the post-restore quantizer summary shows 30 NaN amax buffers on that rank.
Two possibilities: (a) v.data is not the same memory as the model's runtime buffer (some view / clone happened along the way), or (b) _quantizer_buffer_value_is_invalid is mis-detecting bf16 NaN.
Add a sampling print: for the first 4 keys ending in .mlp.experts.experts.0.linear_fc1.weight_quantizer.<suffix> on each rank, dump tensor dtype/shape and counts of nan/inf/neg/zero plus finite min/max. The post-restore dump shows these specific keys with amax=[nan, nan]; if the diagnostic shows nan=0 for the same keys, v.data is not the live buffer and we need a different access path.
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
repair_sharded_modelopt_state and _quantizer_buffer_value_is_invalid
were added in commits b25d487d0 / 9b35905f1 / 1f483ef5c / a348b2f98
on the assumption that phase-2 dist_checkpointing.load was silently
dropping per-rank-first-MoE-local-layer routed-expert _amax /
_global_amax buffers (leaving them at torch.empty(...) garbage).
The diagnostic export 162306 + 162320 disproved that hypothesis:
the function entered, walked the full 12338-amax-key sharded_state_dict
on every rank, and reported invalid=0 every time. The original "354
NaN amax modules" finding turned out to be the model-builder pre-load
print (model_builder.py:355, before any checkpoint load) misread as
a post-load print. The post-load print shows 0 NaN modules.
Direct empirical confirmation:
- export 162527 (PTQ -> phase-2 -> export) had 0 _ensure_weight
_quantizer_calibrated warnings on stderr.
- validate_export_per_block_amax.py against the same export: 0 /
110_100_480 weight_scale bytes mismatched vs prediction from
saved per-block _amax + _global_amax.
- validate_repair_export_mse.py: 0/168 weight_scale_2 mismatches.
- vLLM smoke (162530) returned coherent text + HTTP 200 logprobs.
- MCore generate (162531) returned coherent text.
The repair function never had real work to do. Drop the dead code so
load_modelopt_checkpoint stays clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
After rebasing onto upstream main (commit 1d796f9 'Fused Triton kernel for NVFP4 FP8 scale sweep search'), TritonNVFP4MSECalibrator is the default for mse_calibrate(..., fp8_scale_sweep=True). Per the upstream PR, this is ~42x faster than the Python reference on B300 with identical global weight-MSE.
Flip MAMBA_MOE_NVFP4_AGGRESSIVE_MSE_CFG's algorithm option from fp8_scale_sweep=False to fp8_scale_sweep=True so PTQ runs through the full 126-candidate FP8 e4m3 scale sweep when generating per-block amax for routed-expert weight quantizers.
MODELOPT_NVFP4_TRITON_SWEEP=0 falls back to the Python reference for debugging or numerics comparison.
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Signed-off-by: Jennifer Chen <[email protected]>
6016020 to caf7274 Compare
What does this PR do?
Type of change: New recipe + Bug Fixes
hf_quant_config.json export bug fixes
MCore and MSE fixes
- `_global_amax` field in `NVFP4QTensor` static quantizer detection -> fixes bug during MCore export for MSE
- restore static quantizers as `NVFP4QTensor` (not TensorQuantizer, which can call max calibrate; we want to skip max calibrate for static quantizers during restore)
- `block_sizes` is dict-backed.
Super recipe
Mirrors the published nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 hf_quant_config.json:
rest: not quantized
Usage
# Add a code snippet demonstrating how to use this
Testing
TODO test in HF and MCore PTQ on Nemotron model
Before your PR is "Ready for review"
Make sure you read and follow Contributor guidelines and your commits are signed (`git commit -s -S`).
Make sure you read and follow the Security Best Practices (e.g. avoiding hardcoded `trust_remote_code=True`, `torch.load(..., weights_only=False)`, `pickle`, etc.).
CONTRIBUTING.md: ✅ / ❌ / N/A
Additional Information
Summary by CodeRabbit
New Features
Improvements
Tests