Safetensor metadata mismatch fix in Mcore export #1422
jinhangchoi wants to merge 1 commit into NVIDIA:main
Conversation
Signed-off-by: Jinhang Choi <jinhangc@nvidia.com>
📝 Walkthrough
The PR reorders operations within the per-layer safetensors export function, moving the file write before the metadata JSON generation. This ensures the metadata is computed from the same state dict that is written to disk.
Changes: Safetensors Write Order
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~8 minutes
🚥 Pre-merge checks: ✅ 6 passed
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
modelopt/torch/export/plugins/mcore_custom.py (1)
308-320: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Freeze layer_state_dict once to fully eliminate metadata/file drift.

This reorder helps, but you still read a live mutable dict twice. If layer_state_dict changes after save_file(...) and before the metadata loop, the .json can diverge from the written .safetensors.

Proposed hardening:

```diff
 for layer_index, layer_state_dict in layer_state_dicts.items():
     filename = name_template.format(layer_index, total_layers)
     meta_filename = filename + ".json"
     ckpt_filename = filename + ".safetensors"

+    # Freeze key->tensor mapping used by both outputs.
+    frozen_layer_state_dict = dict(layer_state_dict)
+
     # Write safetensors first, then build the per-layer meta JSON from the same dict.
     # Order matters: any late mutations to layer_state_dict (e.g. MTP tensors added after
     # the dict was first constructed) must be captured by both files. Writing safetensors
     # first ensures the JSON is always consistent with what is physically on disk.
-    save_file(layer_state_dict, save_directory + "/" + ckpt_filename, metadata={"format": "pt"})
+    save_file(
+        frozen_layer_state_dict,
+        save_directory + "/" + ckpt_filename,
+        metadata={"format": "pt"},
+    )

     weight_map = {}
     layer_total_size = 0
-    for key, val in layer_state_dict.items():
+    for key, val in frozen_layer_state_dict.items():
         tensor_size = val.numel() * val.element_size()
         layer_total_size += tensor_size
         weight_map[key] = ckpt_filename
```

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@modelopt/torch/export/plugins/mcore_custom.py` around lines 308 - 320, layer_state_dict is mutated after being written which can cause metadata/file drift; snapshot it and use that immutable copy for both the safetensors write and the metadata loop. Specifically, create a frozen copy of layer_state_dict (e.g., snapshot = dict(layer_state_dict)) and pass snapshot to save_file(...) and iterate snapshot.items() when building weight_map/layer_total_size so save_file, weight_map, and layer_total_size are computed from the exact same data; reference save_file, layer_state_dict, weight_map, ckpt_filename in your change.
📒 Files selected for processing (1)
modelopt/torch/export/plugins/mcore_custom.py
What does this PR do?
Type of change: Bug fix
While Mcore export is in progress, a test revealed the following scenario:
To avoid this mismatch, the safetensors file should be written first, followed by the per-layer JSON metadata.
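The write-before-metadata ordering (together with the reviewer's suggested snapshot) can be illustrated with a minimal stdlib-only sketch. Here byte strings stand in for tensors and the injected write_blob callable stands in for safetensors' save_file; export_layer and write_blob are hypothetical names invented for this illustration, not part of the PR.

```python
import json


def export_layer(layer_state_dict, filename, write_blob):
    """Write the checkpoint blob first, then build the per-layer JSON
    metadata from the exact same snapshot, so the two outputs cannot drift.

    Sketch only: in the real export, values are torch tensors, write_blob
    is safetensors' save_file, and sizes come from numel() * element_size().
    """
    ckpt_filename = filename + ".safetensors"

    # Freeze the key->payload mapping so a late mutation of
    # layer_state_dict cannot make the JSON diverge from the file.
    snapshot = dict(layer_state_dict)

    # 1) Write the checkpoint file from the frozen snapshot.
    write_blob(snapshot, ckpt_filename)

    # 2) Build the metadata from the same frozen snapshot.
    weight_map = {}
    layer_total_size = 0
    for key, val in snapshot.items():
        layer_total_size += len(val)  # stand-in for the tensor byte size
        weight_map[key] = ckpt_filename

    return json.dumps(
        {"metadata": {"total_size": layer_total_size}, "weight_map": weight_map}
    )
```

Because both the file write and the metadata loop read the one frozen snapshot, any mutation of the original layer_state_dict after the write is simply ignored by both outputs.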
Usage
# Add a code snippet demonstrating how to use this

Testing
Tested with MCore PTQ on a Nemotron model.
Before your PR is "Ready for review"
Make sure you read and follow the Contributor guidelines and your commits are signed (git commit -s -S).
Make sure you read and follow the Security Best Practices (e.g. avoiding hardcoded trust_remote_code=True, torch.load(..., weights_only=False), pickle, etc.).
CONTRIBUTING.md: N/A

Additional Information