61 commits
6c038f9
Add modelopt/torch/_compress CODEOWNERS
kevalmorabia97 Oct 27, 2025
230cee1
Merge branch 'main' into feature/compress
kevalmorabia97 Oct 27, 2025
54c5f0f
Remove llm_ptq example tests from CICD
kevalmorabia97 Oct 27, 2025
9eeee25
E2E test for the experimental compress algorithm based on https://arx…
danielkorzekwa Oct 28, 2025
ad1d18e
Merge branch 'main' into feature/compress
kevalmorabia97 Oct 28, 2025
cef3655
Add convert_llama3_config_to_decilm_config + unit test (#465)
danielkorzekwa Oct 29, 2025
002b8b5
Implement nas.convert() api for the compress algorithm (#482)
danielkorzekwa Oct 31, 2025
1c12fd8
modelopt nas search() implementation for the compress algorithm (#490)
danielkorzekwa Nov 3, 2025
f7d547f
Add decilm modelling code (#505)
danielkorzekwa Nov 12, 2025
50a580c
Compress tutorial (PoC) (#492)
danielkorzekwa Nov 12, 2025
b121945
Add llama converter (no dependency on internal Nvidia code) - part 1/…
danielkorzekwa Nov 13, 2025
866e400
llama converter is self-contained now (no dependency on internal nvid…
danielkorzekwa Nov 14, 2025
0868f1c
Add integration test for attention pruning (#562)
danielkorzekwa Nov 14, 2025
69726cc
Merge branch 'main' into feature/compress
kevalmorabia97 Nov 15, 2025
07ca24d
Merge branch 'main' into feature/compress
kevalmorabia97 Nov 15, 2025
1dde209
Add score_pruning_activations (step 2/6) (#563)
danielkorzekwa Nov 18, 2025
2e559e7
Update README.md
kevalmorabia97 Nov 18, 2025
f10be0d
Add activation hooks used for pruning (#576)
danielkorzekwa Nov 20, 2025
194b532
Add sewing kit and utilities used for pruning scoring - pruning scori…
danielkorzekwa Nov 24, 2025
8c9cdd4
Add L2NormHook and use it in megatron.py (#599)
danielkorzekwa Nov 26, 2025
1f72466
Add pruning checkpoints for the compress algorithm (#607)
danielkorzekwa Nov 27, 2025
97fe7f0
Add build replacement library to the compress algorithm. (#616)
danielkorzekwa Dec 1, 2025
954103e
Add subblock stats to the compress algorithm (#623)
danielkorzekwa Dec 1, 2025
dcc425f
Add 1-block scoring to the compress algorithm (#625)
danielkorzekwa Dec 2, 2025
56d95de
Add checkpoint save/load to ForwardHook + add IterativeChannelContrib…
danielkorzekwa Dec 2, 2025
74aae83
Add MIP step to the compress algorithm (#627)
danielkorzekwa Dec 4, 2025
a1f63bc
Merge branch 'main' into feature/compress
kevalmorabia97 Dec 8, 2025
a99f503
Remove unused mip functions + fix multi-gpu test (#660)
kevalmorabia97 Dec 8, 2025
67489f4
Fix a bug in IterativeChannelContributionHook + tools for activation …
danielkorzekwa Dec 11, 2025
1d8bd20
Remove runtime.py and directly use torch dist utils + remove unused f…
kevalmorabia97 Dec 11, 2025
f7a0cb0
Use shared activation hooks component in the puzzle algorithm (#687)
danielkorzekwa Dec 17, 2025
db866d9
Clean up Puzzle Compress Tutorial (#711)
LianaMikael Dec 22, 2025
2e813bf
Two bug fixes: mix checkpointing and dtype (#718)
danielkorzekwa Dec 22, 2025
83ac3b1
Merge remote-tracking branch 'origin/main' into feature/compress
kevalmorabia97 Jan 13, 2026
0eecfc6
Fix test assertions for 2-gpu (#772)
kevalmorabia97 Jan 13, 2026
43b3cfa
Rename compress to puzzletron (#776)
kevalmorabia97 Jan 14, 2026
4c30bd5
Add NeMo Conversion Scripts to Puzzletron (#784)
LianaMikael Jan 15, 2026
96bb0ba
Merge branch 'main' into feature/compress
kevalmorabia97 Mar 3, 2026
8c84fee
[CI] Update to only run puzzletron tests
kevalmorabia97 Mar 3, 2026
5812777
Merge branch 'main' into feature/puzzletron
kevalmorabia97 Mar 3, 2026
5f77c81
Pin torchprofile==0.0.4 to fix CI
kevalmorabia97 Mar 10, 2026
82df595
Add anymodel-core to feature/puzzletron (#974)
danielkorzekwa Mar 11, 2026
4dc9932
Draft: anymodel activation scoring (#989)
danielkorzekwa Mar 12, 2026
d358eb3
Draft: Merge anymodel pruning (#990)
danielkorzekwa Mar 12, 2026
8e827f3
Draft: Merging anymodel:build_library_and_stats (#993)
danielkorzekwa Mar 12, 2026
eb4b210
Draft: merge any model calc one block scores (#994)
danielkorzekwa Mar 12, 2026
8fe318d
Draft: merge any_model: mip_and_realize_models (#995)
danielkorzekwa Mar 13, 2026
2fbdf0e
Update uv.lock for nspect puzzletron scanning
kevalmorabia97 Mar 13, 2026
1b42f0b
Dkorzekwa/any model other models (#1007)
danielkorzekwa Mar 17, 2026
67999eb
Dkorzekwa/anymodel gptoss (#1020)
danielkorzekwa Mar 17, 2026
660dc17
Merge any_model tutorial (#1035)
danielkorzekwa Mar 19, 2026
01cba6a
Merge mbridge distillation for any_model (#1036)
danielkorzekwa Mar 20, 2026
2b6572c
MR branch for the remaining difference between dkorzekwa/any_model an…
danielkorzekwa Mar 20, 2026
110316a
Dkorzekwa/decilm hf code cleanup (#1071)
danielkorzekwa Mar 23, 2026
4190275
Dkorzekwa/decilm hf code cleanup 2 (#1073)
danielkorzekwa Mar 23, 2026
0708ca2
Dkorzekwa/anymodel subblock stats (#1085)
danielkorzekwa Mar 24, 2026
e018ca0
Add bypass distillation (blockwise local KD) to puzzletron pipeline
Separius Mar 24, 2026
2b99327
Address review comments for bypass distillation MR
Separius Apr 2, 2026
351b44e
improve bypass' tutorial
Separius Apr 2, 2026
346408b
Clean up main.py and puzzletron_nas_plugin.py
Separius Apr 2, 2026
53f2a33
Refactor train() in training_loop.py: extract helper functions
Separius Apr 2, 2026
1 change: 1 addition & 0 deletions .github/CODEOWNERS
@@ -23,6 +23,7 @@ modelopt/torch/nas @NVIDIA/modelopt-torch-nas-prune-codeowners
modelopt/torch/opt @NVIDIA/modelopt-torch-opt-codeowners
modelopt/torch/peft @NVIDIA/modelopt-torch-peft-codeowners
modelopt/torch/prune @NVIDIA/modelopt-torch-nas-prune-codeowners
modelopt/torch/puzzletron @NVIDIA/modelopt-torch-puzzletron-codeowners
modelopt/torch/quantization @NVIDIA/modelopt-torch-quantization-codeowners
modelopt/torch/sparsity @NVIDIA/modelopt-torch-sparsity-codeowners
modelopt/torch/speculative @NVIDIA/modelopt-torch-speculative-codeowners
7 changes: 4 additions & 3 deletions .github/workflows/_example_tests_runner.yml
@@ -51,14 +51,15 @@ jobs:
apt-get update && apt-get install -y git-lfs
git lfs install --system

pip install ".${{ inputs.pip_install_extras }}"
# use `python -m pip` instead of `pip` to avoid conflicts with system pip for nemo containers
python -m pip install ".${{ inputs.pip_install_extras }}"

if [[ "${{ inputs.example }}" == *"diffusers"* ]]; then
echo "Uninstalling apex for diffusers: T5 Int8 (PixArt) + Apex is not supported as per https://github.com/huggingface/transformers/issues/21391"
pip uninstall -y apex || true
python -m pip uninstall -y apex || true
fi

find examples/${{ inputs.example }} -name "requirements.txt" | while read req_file; do pip install -r "$req_file" || exit 1; done
find examples/${{ inputs.example }} -name "requirements.txt" | while read req_file; do python -m pip install -r "$req_file" || exit 1; done
- name: Run tests
run: |
echo "Running tests for: ${{ inputs.example }}"
89 changes: 7 additions & 82 deletions .github/workflows/example_tests.yml
@@ -56,108 +56,33 @@ jobs:
match_pattern: "^DCO$|^linux$" # Wait for DCO and Unit tests / linux to pass
delay: 300s

##### PyTorch Example Tests (speculative_decoding requires 26.01 image) #####
torch-pr:
##### NeMo Example Tests #####
nemo-pr:
needs: [check-file-changes, wait-checks]
if: startsWith(github.ref, 'refs/heads/pull-request/') && needs.check-file-changes.outputs.any_changed == 'true'
strategy: &torch_strategy
fail-fast: false
matrix:
example: [llm_distill, llm_qat, llm_sparsity]
include:
- example: speculative_decoding
docker_image: "26.01"
uses: ./.github/workflows/_example_tests_runner.yml
secrets: inherit
with:
docker_image: "nvcr.io/nvidia/pytorch:${{ matrix.docker_image || '26.01' }}-py3"
example: ${{ matrix.example }}
timeout_minutes: 30
pip_install_extras: "[hf,dev-test]"
runner: linux-amd64-gpu-h100-latest-1

torch-non-pr:
if: ${{ !startsWith(github.ref, 'refs/heads/pull-request/') }}
strategy: *torch_strategy
uses: ./.github/workflows/_example_tests_runner.yml
secrets: inherit
with:
docker_image: "nvcr.io/nvidia/pytorch:${{ matrix.docker_image || '26.01' }}-py3"
example: ${{ matrix.example }}
timeout_minutes: 30
pip_install_extras: "[hf,dev-test]"
runner: linux-amd64-gpu-rtxpro6000-latest-2

##### TensorRT-LLM Example Tests #####
trtllm-pr:
needs: [check-file-changes, wait-checks]
if: startsWith(github.ref, 'refs/heads/pull-request/') && needs.check-file-changes.outputs.any_changed == 'true'
strategy:
fail-fast: false
matrix:
example: [llm_ptq, vlm_ptq]
uses: ./.github/workflows/_example_tests_runner.yml
secrets: inherit
with:
docker_image: "nvcr.io/nvidia/tensorrt-llm/release:1.3.0rc5"
example: ${{ matrix.example }}
pip_install_extras: "[hf,dev-test]"
runner: linux-amd64-gpu-rtxpro6000-latest-1

trtllm-non-pr:
if: ${{ !startsWith(github.ref, 'refs/heads/pull-request/') }}
strategy:
fail-fast: false
matrix:
example: [llm_autodeploy, llm_eval, llm_ptq, vlm_ptq]
uses: ./.github/workflows/_example_tests_runner.yml
secrets: inherit
with:
docker_image: "nvcr.io/nvidia/tensorrt-llm/release:1.3.0rc5"
example: ${{ matrix.example }}
pip_install_extras: "[hf,dev-test]"
runner: linux-amd64-gpu-rtxpro6000-latest-2

##### ONNX/TensorRT Example Tests #####
onnx-pr:
needs: [check-file-changes, wait-checks]
if: startsWith(github.ref, 'refs/heads/pull-request/') && needs.check-file-changes.outputs.any_changed == 'true'
strategy: &onnx_strategy
fail-fast: false
matrix:
example: [diffusers, torch_onnx]
uses: ./.github/workflows/_example_tests_runner.yml
secrets: inherit
with:
docker_image: "nvcr.io/nvidia/tensorrt:26.01-py3"
example: ${{ matrix.example }}
pip_install_extras: "[all,dev-test]"
runner: linux-amd64-gpu-l4-latest-1

onnx-non-pr:
if: ${{ !startsWith(github.ref, 'refs/heads/pull-request/') }}
strategy: *onnx_strategy
example: [puzzletron]
uses: ./.github/workflows/_example_tests_runner.yml
secrets: inherit
with:
docker_image: "nvcr.io/nvidia/tensorrt:26.01-py3"
docker_image: "nvcr.io/nvidia/nemo:26.02"
example: ${{ matrix.example }}
pip_install_extras: "[all,dev-test]"
pip_install_extras: "[hf,puzzletron,dev-test]"
runner: linux-amd64-gpu-rtxpro6000-latest-2

##### Required Check for PR #####
example-pr-required-check:
# Run even if example tests are skipped
if: ${{ startsWith(github.ref, 'refs/heads/pull-request/') && always() }}
needs: [check-file-changes, torch-pr, trtllm-pr, onnx-pr]
needs: [check-file-changes, nemo-pr]
runs-on: ubuntu-latest
steps:
- name: Required GPU tests did not succeed
if: |
needs.check-file-changes.result != 'success' ||
(needs.check-file-changes.outputs.any_changed == 'true' && (
needs.torch-pr.result != 'success' ||
needs.trtllm-pr.result != 'success' ||
needs.onnx-pr.result != 'success'
needs.nemo-pr.result != 'success'
))
run: exit 1
20 changes: 11 additions & 9 deletions .github/workflows/gpu_tests.yml
@@ -62,16 +62,16 @@ jobs:
fail-fast: false
matrix:
include:
- example: gpu
timeout: 45
container_image: pytorch:26.01-py3
- example: gpu-megatron
timeout: 45
container_image: pytorch:26.01-py3
- example: gpu-trtllm
- example: gpu-puzzletron
timeout: 30
container_image: tensorrt-llm/release:1.3.0rc5
runs-on: linux-amd64-gpu-rtxpro6000-latest-1
container_image: pytorch:26.01-py3
# - example: gpu-megatron
# timeout: 45
# container_image: pytorch:26.01-py3
# - example: gpu-trtllm
# timeout: 30
# container_image: tensorrt-llm/release:1.3.0rc5
runs-on: linux-amd64-gpu-rtxpro6000-latest-2
timeout-minutes: ${{ matrix.timeout }}
container: &gpu_container
image: nvcr.io/nvidia/${{ matrix.container_image }}
@@ -85,6 +85,8 @@
- name: Setup environment variables
run: |
echo "LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:/usr/include:/usr/lib/x86_64-linux-gnu" >> $GITHUB_ENV
- name: Install dependencies for mip
run: apt-get update && apt-get install -y libffi-dev
- name: Run gpu tests
run: pip install tox-current-env && tox -e cuda13-${{ matrix.example }} --current-env
gpu-tests-non-pr:
19 changes: 17 additions & 2 deletions .pre-commit-config.yaml
@@ -25,9 +25,20 @@ repos:
hooks:
- id: ruff-check
args: [--fix, --exit-non-zero-on-fix]
exclude: ^examples/specdec_bench/specdec_bench/datasets/speed\.py$
# See: commit hooks modifies block_config.py leading to test_puzzletron.py failing (#25) · Issues · omniml / modelopt · GitLab
exclude: >
(?x)^(
^examples/specdec_bench/specdec_bench/datasets/speed\.py$|
modelopt/torch/puzzletron/decilm/deci_lm_hf_code/block_config\.py|
modelopt/torch/puzzletron/decilm/deci_lm_hf_code/transformers_.*\.py
)$
- id: ruff-format
exclude: ^examples/specdec_bench/specdec_bench/datasets/speed\.py$
exclude: >
(?x)^(
^examples/specdec_bench/specdec_bench/datasets/speed\.py$|
modelopt/torch/puzzletron/decilm/deci_lm_hf_code/block_config\.py|
modelopt/torch/puzzletron/decilm/deci_lm_hf_code/transformers_.*\.py
)$

- repo: https://github.com/pre-commit/mirrors-mypy
rev: v1.17.1
@@ -84,6 +95,7 @@ repos:
modelopt/torch/speculative/eagle/utils.py|
modelopt/torch/speculative/plugins/transformers.py|
modelopt/torch/utils/plugins/megatron_mmlu.py|
modelopt/torch/puzzletron/decilm/deci_lm_hf_code/transformers_.*\.py|
examples/chained_optimizations/bert_prune_distill_quantize.py|
examples/deepseek/quantize_to_nvfp4.py|
examples/deepseek/ptq.py|
@@ -96,10 +108,13 @@
examples/llm_eval/modeling.py|
examples/llm_qat/main.py|
examples/llm_sparsity/weight_sparsity/finetune.py|
examples/puzzletron/evaluation/lm_eval_anymodel.py|
examples/specdec_bench/specdec_bench/models/specbench_medusa.py|
examples/speculative_decoding/main.py|
examples/speculative_decoding/medusa_utils.py|
examples/speculative_decoding/server_generate.py|
examples/puzzletron/evaluation/lm_eval_anymodel.py|
modelopt/torch/puzzletron/anymodel/models/gpt_oss/gpt_oss_pruned_to_mxfp4.py|
experimental/dms/models/qwen3/configuration_qwen3_dms.py|
experimental/dms/models/qwen3/modeling_qwen3_dms.py|
)$
1 change: 1 addition & 0 deletions examples/pruning/README.md
@@ -7,6 +7,7 @@ Pruning can involve removal (prune) of Linear and Conv layers; and Transformer a
This section focuses on applying Model Optimizer's state-of-the-art complementary pruning modes to enable you to search for the best subnet architecture from your provided base model:

1. [Minitron](https://arxiv.org/pdf/2408.11796): A pruning method developed by NVIDIA Research for pruning GPT (and later extended to Mamba, MoE, and Hybrid Transformer Mamba) models in NVIDIA Megatron-LM (M-LM) or Megatron-Bridge (M-Bridge) framework. It uses the activation magnitudes to prune the embedding hidden size; mlp ffn hidden size; transformer attention heads; mamba heads and head dimension; MoE number of experts, ffn hidden size, and shared expert intermediate size; and number of layers of the model.
1. [Puzzletron](../puzzletron/README.md): An advanced pruning method by NVIDIA that uses a Mixed Integer Programming (MIP)-based NAS search algorithm.
1. FastNAS: A pruning method recommended for Computer Vision models. Given a pretrained model, FastNAS finds the subnet which maximizes the score function while meeting the given constraints.
1. GradNAS: A light-weight pruning method recommended for language models like Hugging Face BERT, GPT-J. It uses the gradient information to prune the model's linear layers and attention heads to meet the given constraints.

159 changes: 159 additions & 0 deletions examples/puzzletron/BYPASS.md
@@ -0,0 +1,159 @@
# Bypass Distillation (Blockwise Local Distillation)

Bypass distillation (also called **Blockwise Local Distillation / BLD**) is an optional pipeline
stage that trains alternative transformer block configurations using per-block knowledge
distillation from the teacher model. It significantly improves the quality of aggressively
compressed models by producing better "puzzle pieces" for the MIP solver.

## When to use bypass

Bypass is most beneficial whenever the pruned block structure deviates significantly from the
teacher — either because the weight-initialisation heuristic is too coarse, or because one
sub-block must compensate for something the other no longer provides. Specifically, use bypass
when:

- **KV head reduction (any amount)**: the `AverageKV` initialisation is a naive starting point
that averages existing KV heads together. The resulting weights are a poor local minimum and
bypass distillation is needed to repair the quality loss. This applies even to moderate
reductions (e.g., 8 → 4 heads).
- **Attention removed (`no_op: true`)**: removing an entire attention block leaves the co-located
FFN doing all the work for that block. Bypass trains the FFN to compensate for the missing
attention and recover the representational capacity.
- **FFN removed (`no_op: true`)**: similarly, when an FFN block is removed, bypass trains the
remaining attention to compensate.
- **Extreme FFN / MoE compression**: when the target `intermediate_size` is reduced by more than
~3/4 of the teacher width, or the number of MoE experts is reduced by half or more, simple
weight truncation / expert selection leaves the block far from a good solution and bypass
significantly improves quality. For example, on Llama-3.1-8B (`intermediate_size=14336`),
bypass is strongly recommended for sizes ≤ 3584.
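
To see why `AverageKV` is only a rough starting point, here is a minimal pure-Python sketch of averaging groups of adjacent KV heads (the real implementation operates on the projection weight tensors of the model; the row layout and grouping below are assumptions for illustration):

```python
def average_kv_heads(rows, n_heads, n_new, head_dim):
    """AverageKV-style init sketch: collapse n_heads KV heads into n_new
    heads by averaging each group of adjacent heads, e.g. 8 -> 4.
    `rows` is a flat list of n_heads * head_dim weight rows."""
    assert n_heads % n_new == 0 and len(rows) == n_heads * head_dim
    group = n_heads // n_new
    out = []
    for g in range(n_new):          # each new head g ...
        for d in range(head_dim):   # ... is built row-by-row
            src = [rows[(g * group + j) * head_dim + d] for j in range(group)]
            out.append([sum(col) / group for col in zip(*src)])
    return out
```

Each new head is simply the mean of its source heads, which loses head-specific structure; bypass distillation then trains the block to repair that loss.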

## Time cost

Bypass distillation is a full training loop. Plan for several hours per configuration when using
~1B training tokens on H100 GPUs. Total time scales with
`len(bypass.configs) × training_tokens`. This is comparable to lightweight fine-tuning.
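
As a back-of-the-envelope check, the scaling rule above can be turned into a rough estimate (the default throughput constant is an illustrative assumption, not a measured H100 number):

```python
def bypass_gpu_hours(n_configs, training_tokens, tokens_per_gpu_hour=3.0e8):
    """Rough wall-clock estimate: total cost scales linearly with
    len(bypass.configs) * training_tokens."""
    return n_configs * training_tokens / tokens_per_gpu_hour
```

For example, two configs at ~1B tokens each lands in the "several hours per configuration" range quoted above.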

## Sequential execution

Each entry in `bypass.configs` trains **sequentially** (one config at a time). There is no
parallelism across configurations. Distribute jobs across different runs if time is a
constraint.

## Enabling bypass

In your concrete model YAML, uncomment the bypass line:

```yaml
defaults:
- Llama-3_1-8B
- bypass: defaults # remove the comment to enable bypass distillation
- _self_
```

A shared `bypass/defaults.yaml` is located at
[`configs/bypass/defaults.yaml`](configs/bypass/defaults.yaml). It is used by all models.
Adjust `training.training_tokens` (default is 10K tokens for sanity-check runs; set to `1e+9`
for production runs) and the `auto_configs` or `configs` settings to match your compression
targets.

## Decoupled vs. coupled BLD

**Decoupled BLD** trains only one sub-block type at a time while keeping the other frozen:

| `keys_to_learn` | What is trained |
|---|---|
| `subblock_ffn` | FFN weights only (attention frozen) |
| `subblock_attention` | Attention weights only (FFN frozen) |
| `subblock_mamba` | Mamba SSM weights (hybrid models, e.g. NemotronH) |
| `entire_block` | Full transformer block (coupled BLD) |

**Coupled BLD** (`keys_to_learn: entire_block`) trains the whole block end-to-end and captures
interactions between attention and FFN. The main cost is combinatorial: if you have N FFN sizes
and M attention sizes in your replacement library, coupled BLD requires N × M training runs
instead of N + M for decoupled. Decoupled BLD is therefore the default and usually sufficient.
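
The combinatorial trade-off is easy to make concrete (a sketch, not part of the pipeline):

```python
def bld_run_counts(n_ffn_sizes, n_attn_sizes):
    """Training runs needed to cover a replacement library:
    decoupled trains each sub-block variant once; coupled trains
    every (ffn, attention) pair as an entire block."""
    decoupled = n_ffn_sizes + n_attn_sizes
    coupled = n_ffn_sizes * n_attn_sizes
    return decoupled, coupled
```

With 4 FFN sizes and 3 attention sizes, decoupled BLD needs 7 runs while coupled BLD needs 12; the gap widens quickly as the library grows.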

## Training multiple configurations

Use `bypass.configs` to train multiple block configurations sequentially:

```yaml
bypass:
training:
training_tokens: 1e+9 # ~1B tokens per config
configs:
- model_config_overrides:
ffn:
- intermediate_size: 1792 # aggressive — bypass strongly recommended
attention:
- num_key_value_heads: null
keys_to_learn: subblock_ffn
- model_config_overrides:
ffn:
- intermediate_size: 3584
attention:
- num_key_value_heads: null
keys_to_learn: subblock_ffn
```

> **Note:** Always include `num_key_value_heads: null` under `attention:` even when not
> changing KV heads. Omitting it when `no_op: true` is set on another field can cause
> a config parsing issue.

Trained checkpoints are automatically symlinked into `$PUZZLE_DIR/ckpts/` where the replacement
library builder picks them up in the next pipeline stage.

## Auto-generating configs from the pruning search space

Instead of listing each config manually, use `bypass.auto_configs` to generate configs
automatically from the pruning search space. The default (`auto_configs.attn: true`) trains
one attention-only bypass per KV-head reduction specified in `pruning.n_heads_in_group_list`:

```yaml
bypass:
auto_configs:
attn: true # one subblock_attention config per pruned kv-head count
ffn: false # set true: one subblock_ffn config per size in pruning.intermediate_size_list
blk: false # set true: cartesian product (FFN size × kv-head count), entire_block BLD
training:
training_tokens: 1e+9
```

Teacher-size subblocks are automatically excluded (no redundant training). For `blk`, all
combinations where **both** FFN and attention are at teacher values are skipped.

All three flags can be combined. Order of generated configs: FFN → attn → blk.
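
A sketch of how this expansion could behave. The function, the dict shapes, and the heads-in-group to KV-head mapping (`num_attention_heads // n_heads_in_group`) are assumptions for illustration, not the real API:

```python
def auto_bypass_configs(pruning, teacher, attn=True, ffn=False, blk=False):
    """Expand the pruning search space into bypass configs.
    Teacher-size subblocks are skipped; order is FFN -> attn -> blk."""
    def kv_heads(g):  # assumed mapping from heads-in-group to kv-head count
        return teacher["num_attention_heads"] // g

    configs = []
    if ffn:
        for size in pruning["intermediate_size_list"]:
            if size != teacher["intermediate_size"]:  # no redundant teacher-size run
                configs.append({"ffn": size, "keys_to_learn": "subblock_ffn"})
    if attn:
        for g in pruning["n_heads_in_group_list"]:
            if kv_heads(g) != teacher["num_key_value_heads"]:
                configs.append({"kv": kv_heads(g),
                                "keys_to_learn": "subblock_attention"})
    if blk:
        for size in pruning["intermediate_size_list"]:
            for g in pruning["n_heads_in_group_list"]:
                both_teacher = (size == teacher["intermediate_size"]
                                and kv_heads(g) == teacher["num_key_value_heads"])
                if not both_teacher:  # skip only when BOTH match the teacher
                    configs.append({"ffn": size, "kv": kv_heads(g),
                                    "keys_to_learn": "entire_block"})
    return configs
```

Note how `blk` keeps mixed combinations where one sub-block is still at teacher size, since the other sub-block is pruned and the pair still needs training.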

## Attention no-op + FFN-only bypass

A common aggressive compression pattern removes entire attention blocks (`no_op: true`) and
trains only the FFN in those blocks. Example config:

```yaml
configs:
- model_config_overrides:
ffn:
- intermediate_size: 12288
attention:
- num_key_value_heads: null
no_op: true
keys_to_learn: subblock_ffn
```

When attention is removed, only the FFN parameters are trained. The bypass code automatically
skips attention-related weights (including model-specific ones such as Qwen3's `q_norm`/`k_norm`)
during student weight initialisation.
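
The key filtering can be sketched as a simple predicate over state-dict keys (the marker strings are assumptions; real models may name attention modules differently):

```python
ATTN_KEY_MARKERS = ("self_attn.", "q_norm", "k_norm")  # illustrative markers

def student_init_keys(teacher_keys, attention_is_noop):
    """Keys to copy from the teacher when initialising a bypass student.
    With attention removed (no_op), attention-related weights are skipped."""
    if not attention_is_noop:
        return list(teacher_keys)
    return [k for k in teacher_keys
            if not any(m in k for m in ATTN_KEY_MARKERS)]
```

On Qwen3-style models, `q_norm`/`k_norm` live inside the attention module, so a marker list like this drops them together with the projection weights.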

## Weights & Biases logging

Enable W&B to track per-block distillation loss and validation metrics:

```yaml
bypass:
wandb_log: true
wandb:
project: my-puzzletron-project
entity: my-org
```

W&B logs iteration number, token count, learning rate, and per-block loss at each log interval.
If `wandb` is not installed, logging is silently disabled.