Add MoE load balancing loss to distillation #3679
Open
JamesDeng42 wants to merge 1 commit into main from
Description
This PR introduces support for Mixture of Experts (MoE) load balancing loss during the distillation workflow.
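The PR text does not spell out the loss formula. For context, a standard formulation of this auxiliary loss (in the style of the Switch Transformer, and presumably what `load_balance_loss_weight` scales here) over $N$ experts is:

$$\mathcal{L}_{\text{lb}} = \alpha \cdot N \sum_{i=1}^{N} f_i \, P_i$$

where $f_i$ is the fraction of tokens routed to expert $i$, $P_i$ is the mean router probability assigned to expert $i$, and $\alpha$ corresponds to `load_balance_loss_weight`. The sum is minimized when routing is uniform across experts.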
Key Changes
- Adds the MoE load balancing loss, scaled by load_balance_loss_weight, to the overall distillation loss so that training minimizes it.
- This path is enabled whenever load_balance_loss_weight > 0.0 to ensure NNX variables can successfully write to the state (see the sketch after this list).
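A minimal sketch of the mechanism, assuming hypothetical names (`MoEBlock`, `total_loss`) and the Flax NNX sow/pop pattern for auxiliary losses; the PR's actual implementation may differ:

```python
import jax
import jax.numpy as jnp
from flax import nnx


class MoEBlock(nnx.Module):
    """Hypothetical MoE layer that sows its load balancing loss."""

    def __init__(self, dim: int, num_experts: int, rngs: nnx.Rngs):
        self.router = nnx.Linear(dim, num_experts, rngs=rngs)

    def __call__(self, x):  # x: [batch, seq, dim]
        probs = jax.nn.softmax(self.router(x), axis=-1)
        # Fraction of tokens dispatched to each expert (top-1 routing).
        f = jnp.mean(
            jax.nn.one_hot(jnp.argmax(probs, axis=-1), probs.shape[-1]),
            axis=(0, 1))
        # Mean router probability per expert.
        p = jnp.mean(probs, axis=(0, 1))
        lb_loss = probs.shape[-1] * jnp.sum(f * p)
        # Write the auxiliary value into NNX state; this write is what
        # requires the state to be writable during the forward pass.
        self.sow(nnx.Intermediate, "lb_loss", lb_loss)
        return x  # expert dispatch omitted for brevity


def total_loss(model: nnx.Module, distill_loss: jax.Array,
               load_balance_loss_weight: float) -> jax.Array:
    """Folds the sown load balancing losses into the distillation loss."""
    if load_balance_loss_weight > 0.0:
        # Collect and clear every sown Intermediate across all MoE layers.
        sown = nnx.pop(model, nnx.Intermediate)
        moe_lb_loss = sum(jax.tree.leaves(sown))  # scalar sum over layers
        # This quantity is what gets logged as "distill/moe_lb_loss".
        distill_loss = distill_loss + load_balance_loss_weight * moe_lb_loss
    return distill_loss
```

Gating on the Python-level config value keeps the extra state traffic out of the training step entirely when the feature is disabled.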
Tests
Added "distill/moe_lb_loss" to the expected metrics keys in the test suite to prevent regressions in train_distill_test.py.
Checklist
Before submitting this PR, please make sure (put X in square brackets):
gemini-review label.