# Post-Quantization Checkpoint Validation

Verify that the exported checkpoint's quantization pattern matches the recipe used. Quantization config patterns can silently miss layers when a model uses non-standard module naming; the problem only surfaces later as deployment failures, when the serving framework tries to load unquantized weights as if they were quantized.

## Expected quantization patterns by recipe

| Recipe (`--qformat`) | What should be quantized | What should be excluded |
|----------------------|--------------------------|-------------------------|
| `nvfp4` | All linear layers | lm_head, routers, norms, embeddings |
| `nvfp4_mlp_only` | MLP layers (including MoE experts) | Attention layers, lm_head, routers |
| `nvfp4_experts_only` | MoE expert layers only | Dense MLP, attention, lm_head, routers |
| `nvfp4_omlp_only` | MLP + o_proj layers | Other attention layers, lm_head, routers |
| `fp8` | All linear layers | lm_head, norms, embeddings |
| `int4_awq` | All linear layers | lm_head, norms, embeddings |

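As an illustration, the `nvfp4_mlp_only` row implies a split like the following. This is a sketch of the recipe's intent, not the actual quantizer logic; the module paths and glob patterns below are hypothetical examples, not taken from any specific model or config.

```python
import fnmatch

# Hypothetical module paths from a MoE transformer (illustrative only).
modules = [
    "model.layers.0.mlp.gate_proj",      # dense MLP      -> quantize
    "model.layers.0.experts.3.up_proj",  # MoE expert     -> quantize
    "model.layers.0.self_attn.q_proj",   # attention      -> exclude
    "model.layers.0.mlp.router",         # router         -> exclude
    "lm_head",                           # head           -> exclude
]

# Sketch of the nvfp4_mlp_only intent: MLP layers (including experts)
# are quantized, attention/routers/lm_head are excluded.
quantize_patterns = ["*mlp*", "*.experts.*"]
exclude_patterns = ["*self_attn*", "*router*", "lm_head"]

def should_quantize(name):
    # Exclusions win over quantize patterns (routers live under mlp.*).
    if any(fnmatch.fnmatch(name, p) for p in exclude_patterns):
        return False
    return any(fnmatch.fnmatch(name, p) for p in quantize_patterns)

for m in modules:
    print(m, "->", "quantize" if should_quantize(m) else "exclude")
```

Note that the router check must run before the MLP check: router modules often sit under an `mlp.*` path, so a bare `*mlp*` quantize pattern would otherwise capture them.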
## Validation script

Run this against the exported checkpoint to check that every linear weight is either quantized (has scale params) or explicitly excluded. Replace `<output_path>` with the checkpoint directory:

```bash
python3 -c "
import json, fnmatch

output = '<output_path>'
idx = json.load(open(f'{output}/model.safetensors.index.json'))
cfg = json.load(open(f'{output}/hf_quant_config.json'))
excludes = cfg['quantization']['exclude_modules']

all_keys = set(idx['weight_map'].keys())
# Identify linear weight params (skip norms, embeddings, scalars, scales)
skip_substrings = ('_scale', 'layernorm', 'layer_norm', 'norm.weight', 'embed', 'scalar')
linear_weights = sorted(k for k in all_keys
                        if k.endswith('.weight') and not any(s in k.lower() for s in skip_substrings))

# Check which weights have quantization scales
quantized, excluded, unexpected = [], [], []
for w in linear_weights:
    base = w.rsplit('.weight', 1)[0]
    has_scales = any(f'{base}.{s}' in all_keys for s in ['weight_scale', 'input_scale'])
    is_excluded = any(fnmatch.fnmatch(w, p) or fnmatch.fnmatch(base, p) for p in excludes)

    if has_scales:
        quantized.append(w)
    elif is_excluded:
        excluded.append(w)
    else:
        unexpected.append(w)

print(f'Quantized layers: {len(quantized)}')
print(f'Excluded layers (in exclude_modules): {len(excluded)}')
if unexpected:
    print(f'\nWARNING: {len(unexpected)} layers have NO scales and are NOT in exclude list:')
    # Group by module type for readability
    groups = {}
    for w in unexpected:
        parts = w.split('.')
        module_type = next((p for p in parts if p in
                            ('self_attn', 'mlp', 'experts', 'router', 'lm_head', 'embed_tokens', 'vision_tower')), 'other')
        groups.setdefault(module_type, []).append(w)
    for mtype, weights in sorted(groups.items()):
        print(f'  {mtype}: {len(weights)} weights (e.g., {weights[0]})')
    print()
    print('These layers were silently skipped during quantization.')
    print('Likely cause: quantization config patterns did not match these module names.')
    print('This WILL cause deployment failures (the framework loads them as quantized but they are BF16).')
    print('Fix: add missing patterns to the config, or add them to exclude_modules if intentionally unquantized.')
else:
    print('\nAll layers are either quantized or explicitly excluded. Checkpoint is consistent.')
"
```
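To sanity-check the classification logic itself, you can dry-run the same scale/exclude checks against a synthetic weight map. The key names below are made up for illustration; the check mirrors the script above:

```python
import fnmatch

# Synthetic checkpoint index: one quantized layer, one excluded, one missed.
all_keys = {
    "model.layers.0.mlp.up_proj.weight",
    "model.layers.0.mlp.up_proj.weight_scale",  # scale present -> quantized
    "lm_head.weight",                           # no scale, but excluded
    "model.layers.0.experts.0.up_proj.weight",  # no scale, NOT excluded
}
excludes = ["lm_head"]

def classify(weight_key):
    base = weight_key.rsplit(".weight", 1)[0]
    has_scales = any(f"{base}.{s}" in all_keys
                     for s in ("weight_scale", "input_scale"))
    is_excluded = any(fnmatch.fnmatch(base, p) for p in excludes)
    if has_scales:
        return "quantized"
    return "excluded" if is_excluded else "unexpected"

for k in sorted(k for k in all_keys if k.endswith(".weight")):
    print(k, "->", classify(k))
```

The expert weight lands in the `unexpected` bucket: it has no companion scale tensor and matches no exclude pattern, which is exactly the condition the script warns about.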

## Common pattern gaps

Layers get silently skipped when the quantization config patterns don't match the model's module naming:

| Model | Module path | Missed by pattern | Fix |
|-------|-------------|-------------------|-----|
| Gemma4 MoE | `layers.N.experts.*` | `*mlp*`, `*block_sparse_moe*` | Add `*.experts.*` (PR #1219) |
| Custom MoE | `layers.N.moe_block.experts.*` | `*mlp*` | Add matching pattern |
| VLM projector | `multi_modal_projector.*` | — | Usually excluded; verify |

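The first row comes down to glob semantics: an expert path with no `mlp` component can never match `*mlp*`, so those layers fall through. A quick check with `fnmatch` (the same matcher the validation script uses), on a hypothetical Gemma4-style path:

```python
from fnmatch import fnmatch

# Hypothetical expert module path: note there is no "mlp" in the name.
path = "model.layers.0.experts.3.up_proj"

print(fnmatch(path, "*mlp*"))        # False: the pattern never matches
print(fnmatch(path, "*.experts.*"))  # True: the added pattern catches it
```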
## What to do when warnings appear

- **Layers should have been quantized** (e.g., MoE experts with `nvfp4_mlp_only`): the quantization config patterns missed them. Fix by adding the missing pattern to the config and re-running PTQ. Check whether ModelOpt already ships a plugin for the model in `modelopt/torch/quantization/plugins/huggingface.py`.

- **Layers are intentionally unquantized** (e.g., attention layers with `nvfp4_mlp_only`): they should appear in the `exclude_modules` list, but the export didn't add them. Add them manually to both `hf_quant_config.json` and the `quantization_config.ignore` list in `config.json` so the serving framework loads them as plain unquantized weights.
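A minimal sketch of that manual patch, assuming the file layout described above (the helper name and merge logic are illustrative, not part of any tool):

```python
import json
from pathlib import Path

def add_excludes(ckpt_dir, modules):
    """Append module patterns to both exclusion lists, skipping duplicates."""
    ckpt = Path(ckpt_dir)

    # hf_quant_config.json: quantization.exclude_modules
    qcfg_path = ckpt / "hf_quant_config.json"
    qcfg = json.loads(qcfg_path.read_text())
    excl = qcfg["quantization"].setdefault("exclude_modules", [])
    excl.extend(m for m in modules if m not in excl)
    qcfg_path.write_text(json.dumps(qcfg, indent=2))

    # config.json: quantization_config.ignore
    cfg_path = ckpt / "config.json"
    cfg = json.loads(cfg_path.read_text())
    ignore = cfg.setdefault("quantization_config", {}).setdefault("ignore", [])
    ignore.extend(m for m in modules if m not in ignore)
    cfg_path.write_text(json.dumps(cfg, indent=2))
```

After patching, re-run the validation script: the previously warned layers should now land in the excluded bucket.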