Commit 5920ef5

Add post-quantization checkpoint validation to PTQ skill
Generalize the MoE-specific validation into a comprehensive checkpoint validation reference that works for all models and recipes:

- New references/checkpoint-validation.md with:
  - Expected quantization patterns per recipe (nvfp4, nvfp4_mlp_only, nvfp4_experts_only, fp8, int4_awq, etc.)
  - Validation script checking every linear layer is either quantized (has scale params) or explicitly excluded
  - Common pattern gap table (Gemma4 experts, custom MoE, VLM projector)
  - Fix guidance for both cases (missing pattern vs missing exclude)
- SKILL.md Step 5: slimmed to a short pointer to the reference
- unsupported-models.md: updated debugging tip to reference the script

Learned from: Gemma4-26B MoE experts silently skipped by nvfp4_mlp_only config patterns, causing a vLLM deployment shape mismatch.

Signed-off-by: Zhiyu Cheng <[email protected]>
1 parent e8775a6 commit 5920ef5

3 files changed

Lines changed: 92 additions & 0 deletions

File tree

.claude/skills/ptq/SKILL.md

Lines changed: 5 additions & 0 deletions
@@ -113,6 +113,10 @@ ls -lh <output_path>/

Report the path and size to the user.

### Post-quantization validation

Validate that the exported checkpoint's quantization pattern matches the recipe. Quantization config patterns can silently miss layers if the model uses non-standard naming (e.g., Gemma4 `experts.*` missed by `*mlp*` patterns) — this only surfaces later as deployment failures. Read `references/checkpoint-validation.md` for the validation script, expected patterns per recipe, and common pattern gaps.

## Key API Rules

- `mtq.register()` classes **must** define `_setup()` and call it from `__init__`

@@ -137,6 +141,7 @@ Report the path and size to the user.

| `references/launcher-guide.md` | Step 4B only (launcher path) |
| `tools/launcher/CLAUDE.md` | Step 4B only, if you need more launcher detail |
| `references/unsupported-models.md` | Step 4C only (unlisted model) |
| `references/checkpoint-validation.md` | Step 5: validate quantization pattern matches recipe |
| `skills/common/remote-execution.md` | Step 4A/4C only, if target is remote |
| `skills/common/slurm-setup.md` | Step 4A/4C only, if using SLURM manually (not launcher) |
| `references/slurm-setup-ptq.md` | Step 4A/4C only, PTQ-specific SLURM (container, GPU sizing, FSDP2) |
.claude/skills/ptq/references/checkpoint-validation.md

Lines changed: 86 additions & 0 deletions

@@ -0,0 +1,86 @@

# Post-Quantization Checkpoint Validation

Verify that the exported checkpoint's quantization pattern matches the recipe used. Quantization config patterns may silently miss layers if the model uses non-standard naming — this only surfaces later as deployment failures, when the serving framework tries to load unquantized weights as quantized.
## Expected quantization patterns by recipe

| Recipe (`--qformat`) | What should be quantized | What should be excluded |
|----------------------|--------------------------|-------------------------|
| `nvfp4` | All linear layers | lm_head, routers, norms, embeddings |
| `nvfp4_mlp_only` | MLP layers (including MoE experts) | Attention layers, lm_head, routers |
| `nvfp4_experts_only` | MoE expert layers only | Dense MLP, attention, lm_head, routers |
| `nvfp4_omlp_only` | MLP + o_proj layers | Other attention layers, lm_head, routers |
| `fp8` | All linear layers | lm_head, norms, embeddings |
| `int4_awq` | All linear layers | lm_head, norms, embeddings |
## Validation script

Run against the exported checkpoint to check that every linear layer is either quantized (has scale params) or explicitly excluded:
```bash
python3 -c "
import json, fnmatch

output = '<output_path>'
idx = json.load(open(f'{output}/model.safetensors.index.json'))
cfg = json.load(open(f'{output}/hf_quant_config.json'))
excludes = cfg['quantization']['exclude_modules']

all_keys = set(idx['weight_map'].keys())
# Identify linear weight params (skip norms, embeddings, scalars, scales)
skip_suffixes = ('_scale', '_scale_2', 'layernorm', 'layer_norm', 'norm.weight', 'embed', 'scalar')
linear_weights = sorted(k for k in all_keys
                        if k.endswith('.weight') and not any(s in k.lower() for s in skip_suffixes))

# Check which have quantization scales
quantized, excluded, unexpected = [], [], []
for w in linear_weights:
    base = w.rsplit('.weight', 1)[0]
    has_scales = any(f'{base}.{s}' in all_keys for s in ['weight_scale', 'input_scale'])
    is_excluded = any(fnmatch.fnmatch(w, p) or fnmatch.fnmatch(base, p) for p in excludes)

    if has_scales:
        quantized.append(w)
    elif is_excluded:
        excluded.append(w)
    else:
        unexpected.append(w)

print(f'Quantized layers: {len(quantized)}')
print(f'Excluded layers (in exclude_modules): {len(excluded)}')
if unexpected:
    print(f'\nWARNING: {len(unexpected)} layers have NO scales and are NOT in exclude list:')
    # Group by module type for readability
    groups = {}
    for w in unexpected:
        parts = w.split('.')
        module_type = next((p for p in parts if p in
            ('self_attn', 'mlp', 'experts', 'router', 'lm_head', 'embed_tokens', 'vision_tower')), 'other')
        groups.setdefault(module_type, []).append(w)
    for mtype, weights in sorted(groups.items()):
        print(f'  {mtype}: {len(weights)} weights (e.g., {weights[0]})')
    print()
    print('These layers were silently skipped during quantization.')
    print('Likely cause: quantization config patterns did not match these module names.')
    print('This WILL cause deployment failures (framework loads them as quantized but they are BF16).')
    print('Fix: add missing patterns to the config, or add to exclude_modules if intentionally unquantized.')
else:
    print('\nAll layers are either quantized or explicitly excluded. Checkpoint is consistent.')
"
```
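To see the classification logic in isolation, here is a minimal sketch of the same scale-and-exclude check run against a hypothetical four-entry `weight_map` (the names and exclude pattern are illustrative, not from a real checkpoint):

```python
import fnmatch

# Hypothetical miniature weight_map and exclude list, just to illustrate
# the classification used by the validation script above.
weight_map = {
    'model.layers.0.mlp.gate_proj.weight',
    'model.layers.0.mlp.gate_proj.weight_scale',  # quantized: scale present
    'model.layers.0.experts.3.up_proj.weight',    # no scale, not excluded -> flagged
    'lm_head.weight',                             # no scale, but excluded
}
excludes = ['lm_head*']

unexpected = []
for w in sorted(weight_map):
    if not w.endswith('.weight'):
        continue  # skip the scale tensors themselves
    base = w.rsplit('.weight', 1)[0]
    has_scales = any(f'{base}.{s}' in weight_map for s in ('weight_scale', 'input_scale'))
    is_excluded = any(fnmatch.fnmatch(w, p) or fnmatch.fnmatch(base, p) for p in excludes)
    if not has_scales and not is_excluded:
        unexpected.append(w)

print(unexpected)  # ['model.layers.0.experts.3.up_proj.weight']
```

Only the experts weight is flagged: it has no companion scale tensor and matches no exclude pattern, which is exactly the "silently skipped" case the script warns about.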
## Common pattern gaps

Layers silently skipped because the quantization config patterns don't match the model's naming:

| Model | Module path | Missed by pattern | Fix |
|-------|-------------|-------------------|-----|
| Gemma4 MoE | `layers.N.experts.*` | `*mlp*`, `*block_sparse_moe*` | Add `*.experts.*` (PR #1219) |
| Custom MoE | `layers.N.moe_block.experts.*` | `*mlp*` | Add matching pattern |
| VLM projector | `multi_modal_projector.*` | | Usually excluded; verify |
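The Gemma4 gap can be reproduced directly with Python's `fnmatch`, the same matching the validation script uses (the module path below is illustrative):

```python
import fnmatch

# Module name as it would appear in a Gemma4-style MoE checkpoint (illustrative).
name = 'model.layers.0.experts.3.gate_proj'

# The stock MLP patterns do not match the experts path...
assert not fnmatch.fnmatch(name, '*mlp*')
assert not fnmatch.fnmatch(name, '*block_sparse_moe*')

# ...but the pattern suggested in the table does.
assert fnmatch.fnmatch(name, '*.experts.*')
print('experts path matched only by *.experts.*')
```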
## What to do when warnings appear

- **Layers should have been quantized** (e.g., MoE experts with `nvfp4_mlp_only`): the quantization config patterns missed them. Fix by adding the missing pattern to the config and re-running PTQ. Check whether ModelOpt already has a plugin for the model in `modelopt/torch/quantization/plugins/huggingface.py`.

- **Layers are intentionally unquantized** (e.g., attention layers with `nvfp4_mlp_only`): they should be in the `exclude_modules` list but the export didn't add them. Add them manually to both `hf_quant_config.json` and the `quantization_config.ignore` list in `config.json` in the checkpoint to prevent deployment failures.
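The manual patching in the second case can be sketched as below; the `add_excludes` helper is hypothetical, and the key layout (`quantization.exclude_modules`, `quantization_config.ignore`) is assumed from the files as read by the validation script above:

```python
import json
from pathlib import Path

def add_excludes(ckpt_dir, modules):
    """Append module patterns to the exclude lists in both config files.

    Assumes a ModelOpt-style export layout: hf_quant_config.json holds
    quantization.exclude_modules, and config.json holds
    quantization_config.ignore.
    """
    ckpt = Path(ckpt_dir)

    quant_cfg_path = ckpt / 'hf_quant_config.json'
    quant_cfg = json.loads(quant_cfg_path.read_text())
    excl = quant_cfg['quantization'].setdefault('exclude_modules', [])
    excl.extend(m for m in modules if m not in excl)
    quant_cfg_path.write_text(json.dumps(quant_cfg, indent=2))

    cfg_path = ckpt / 'config.json'
    cfg = json.loads(cfg_path.read_text())
    ignore = cfg.setdefault('quantization_config', {}).setdefault('ignore', [])
    ignore.extend(m for m in modules if m not in ignore)
    cfg_path.write_text(json.dumps(cfg, indent=2))
```

Re-run the validation script afterward to confirm the formerly flagged layers now land in the excluded bucket.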

.claude/skills/ptq/references/unsupported-models.md

Lines changed: 1 addition & 0 deletions

@@ -347,4 +347,5 @@ tokenizer.save_pretrained(output_path)

- **Check quantizer summary**: `mtq.print_quant_summary(model)` shows which quantizers are enabled/disabled
- **Inspect dtypes**: After loading, iterate `model.named_parameters()` and check for unexpected FP8 tensors
- **Watch for silent disabling**: A misconfigured wildcard pattern can silently disable quantizers — always verify the summary
- **Validate quantization pattern after export**: Run the validation script from SKILL.md Step 5 on the exported checkpoint. It checks that every linear layer is either quantized (has scale params) or explicitly excluded. Layers that are neither were silently skipped — common for models with non-standard naming (e.g., Gemma4 `experts.*` missed by `*mlp*` patterns) — and this causes deployment failures when the framework tries to load BF16 weights as quantized
- **Read pip errors carefully**: `ResolutionImpossible` means a dependency conflict (try `--no-deps`), NOT a network failure. Check for `Connection refused`/`Name resolution failed` before concluding the network is down
