Fix vLLM >= 0.17 compatibility: migrate to native WeightTransferConfig API #3556

Closed
vmoens wants to merge 2 commits into gh/vmoens/240/base from gh/vmoens/240/head

@vmoens (Collaborator) commented Mar 21, 2026

Stack from ghstack (oldest at bottom):


  • Replace manual stateless_init_process_group + collective_rpc("update_weight")
    with vLLM's native WeightTransferConfig/NCCLWeightTransferEngine API
  • Fix VLLM_USE_V1 env var removal (V1 always on in 0.17+)
  • Fix NCCL weight sync deadlock by dispatching worker RPCs before trainer joins
  • Fix LoRA weight extraction (merge_and_unload before state_dict)
  • Fix weight transfer KeyError by using HF model directly (not TransformersWrapper)
  • Fix prompt_logprobs length mismatch in _RequestOutput_tc for V1 engine
  • Auto-propagate WANDB_API_KEY, HF_TOKEN, HF_HOME to Ray workers
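
To illustrate the deadlock item above, here is a minimal sketch of why dispatch order matters when every rank must join a blocking rendezvous. It is not code from this PR: a `threading.Barrier` stands in for the NCCL group initialisation, threads stand in for vLLM workers, and `init_group` is a hypothetical name.

```python
import threading


def init_group(world_size, dispatch_first, timeout=1.0):
    """Return True if all ranks rendezvous, False if the caller would hang.

    The barrier is a stand-in for a collective init that only returns once
    every rank has joined. Rank 0 plays the trainer; ranks 1..N-1 are workers.
    """
    barrier = threading.Barrier(world_size)
    joined = []
    lock = threading.Lock()

    def join(rank):
        # Blocks until all ranks arrive (or the timeout breaks the barrier).
        barrier.wait(timeout=timeout)
        with lock:
            joined.append(rank)

    def dispatch_workers():
        # Non-blocking: ask every worker to join, then return immediately.
        threads = [
            threading.Thread(target=join, args=(rank,))
            for rank in range(1, world_size)
        ]
        for t in threads:
            t.start()
        return threads

    try:
        if dispatch_first:
            workers = dispatch_workers()
            join(0)  # trainer joins last; workers are already waiting
        else:
            join(0)  # trainer blocks: nobody has told the workers to join
            workers = dispatch_workers()
    except threading.BrokenBarrierError:
        return False  # the timeout stands in for an indefinite hang

    for t in workers:
        t.join()
    return sorted(joined) == list(range(world_size))
```

With `dispatch_first=True` the rendezvous completes; with `dispatch_first=False` the trainer times out because it waits on peers that were never started, which is the shape of hang the fix avoids.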

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
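
As a rough sketch of the environment-variable item in the list above (not the PR's actual implementation), one way to forward credentials to Ray workers is to copy the variables that are set locally into a Ray `runtime_env` dictionary. The helper name `build_runtime_env` and the constant `FORWARDED_VARS` are hypothetical; only the three variable names come from the description.

```python
import os

# Variables named in the PR description; the tuple itself is illustrative.
FORWARDED_VARS = ("WANDB_API_KEY", "HF_TOKEN", "HF_HOME")


def build_runtime_env(environ=None, names=FORWARDED_VARS):
    """Build a Ray-style runtime_env dict from the variables that are set.

    Unset or empty variables are skipped so that workers never receive a
    blank value that would override their own configuration.
    """
    environ = os.environ if environ is None else environ
    env_vars = {name: environ[name] for name in names if environ.get(name)}
    return {"env_vars": env_vars} if env_vars else {}


# Possible usage (assumes Ray is installed; not executed here):
#   import ray
#   ray.init(runtime_env=build_runtime_env())
```

Skipping empty values is a design choice of this sketch; whether the PR does the same is not stated in the description.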

[ghstack-poisoned]
vmoens added a commit that referenced this pull request Mar 21, 2026
…g API

- Replace manual stateless_init_process_group + collective_rpc("update_weight")
  with vLLM's native WeightTransferConfig/NCCLWeightTransferEngine API
- Fix VLLM_USE_V1 env var removal (V1 always on in 0.17+)
- Fix NCCL weight sync deadlock by dispatching worker RPCs before trainer joins
- Fix LoRA weight extraction (merge_and_unload before state_dict)
- Fix weight transfer KeyError by using HF model directly (not TransformersWrapper)
- Fix prompt_logprobs length mismatch in _RequestOutput_tc for V1 engine
- Auto-propagate WANDB_API_KEY, HF_TOKEN, HF_HOME to Ray workers

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
ghstack-source-id: 1a2d958
Pull-Request: #3556
pytorch-bot bot commented Mar 21, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/rl/3556

Note: Links to docs will display an error until the docs builds have been completed.

❌ 4 New Failures, 3 Unrelated Failures

As of commit d0fb2a9 with merge base 4e2e787 (image):

NEW FAILURES - The following jobs have failed:

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

BROKEN TRUNK - The following job failed but were present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@github-actions
Contributor

⚠️ PR Title Label Error

PR title must start with a label prefix in brackets (e.g., [BugFix]).

Current title: Fix vLLM >= 0.17 compatibility: migrate to native WeightTransferConfig API

Supported Prefixes (case-sensitive)

Your PR title must start with exactly one of these prefixes:

| Prefix | Label Applied | Example |
| --- | --- | --- |
| [BugFix] | BugFix | [BugFix] Fix memory leak in collector |
| [Feature] | Feature | [Feature] Add new optimizer |
| [Doc] or [Docs] | Documentation | [Doc] Update installation guide |
| [Refactor] | Refactoring | [Refactor] Clean up module imports |
| [CI] | CI | [CI] Fix workflow permissions |
| [Test] or [Tests] | Tests | [Tests] Add unit tests for buffer |
| [Environment] or [Environments] | Environments | [Environments] Add Gymnasium support |
| [Data] | Data | [Data] Fix replay buffer sampling |
| [Performance] or [Perf] | Performance | [Performance] Optimize tensor ops |
| [BC-Breaking] | bc breaking | [BC-Breaking] Remove deprecated API |
| [Deprecation] | Deprecation | [Deprecation] Mark old function |
| [Quality] | Quality | [Quality] Fix typos and add codespell |

Note: Common variations like singular/plural are supported (e.g., [Doc] or [Docs]).

github-actions bot added the llm/ (LLM-related PR, triggers LLM CI tests), sota-implementations/, Modules, and WeightUpdate labels Mar 21, 2026
meta-cla bot added the CLA Signed label (managed by the Facebook bot; authors need to sign the CLA before a PR can be reviewed) Mar 21, 2026

github-actions bot (Contributor) commented Mar 21, 2026

⚠️ Warning: Result of GPU Benchmark Tests

Total Benchmarks: 172. Improved: 11. Worsened: 6.

Detailed results:

| Name | Max | Mean | Ops | Ops on Repo HEAD | Change |
| --- | --- | --- | --- | --- | --- |
| test_tensor_to_bytestream_speed[pickle] | 81.8339μs | 80.5935μs | 12.4080 KOps/s | 12.4851 KOps/s | -0.62% |
| test_tensor_to_bytestream_speed[torch.save] | 0.1397ms | 0.1386ms | 7.2140 KOps/s | 7.1840 KOps/s | +0.42% |
| test_tensor_to_bytestream_speed[untyped_storage] | 0.1108s | 0.1105s | 9.0513 Ops/s | 9.0348 Ops/s | +0.18% |
| test_tensor_to_bytestream_speed[numpy] | 2.5482μs | 2.5435μs | 393.1643 KOps/s | 395.2466 KOps/s | -0.53% |
| test_tensor_to_bytestream_speed[safetensors] | 36.9283μs | 36.6334μs | 27.2975 KOps/s | 26.0877 KOps/s | +4.64% |
| test_simple | 0.7826s | 0.7817s | 1.2793 Ops/s | 1.2371 Ops/s | +3.41% |
| test_transformed | 1.3755s | 1.3741s | 0.7278 Ops/s | 0.7120 Ops/s | +2.22% |
| test_serial | 2.4327s | 2.3196s | 0.4311 Ops/s | 0.4200 Ops/s | +2.64% |
| test_parallel | 1.9245s | 1.8229s | 0.5486 Ops/s | 0.5519 Ops/s | -0.60% |
| test_step_mdp_speed[True-True-True-True-True] | 0.1843ms | 42.3945μs | 23.5880 KOps/s | 23.9642 KOps/s | -1.57% |
| test_step_mdp_speed[True-True-True-True-False] | 57.1610μs | 23.0989μs | 43.2921 KOps/s | 43.3240 KOps/s | -0.07% |
| test_step_mdp_speed[True-True-True-False-True] | 73.4420μs | 23.5847μs | 42.4003 KOps/s | 43.9052 KOps/s | -3.43% |
| test_step_mdp_speed[True-True-True-False-False] | 42.5800μs | 12.7772μs | 78.2643 KOps/s | 77.3807 KOps/s | +1.14% |
| test_step_mdp_speed[True-True-False-True-True] | 83.2520μs | 44.2352μs | 22.6064 KOps/s | 22.4330 KOps/s | +0.77% |
| test_step_mdp_speed[True-True-False-True-False] | 58.6510μs | 25.5230μs | 39.1804 KOps/s | 39.5892 KOps/s | -1.03% |
| test_step_mdp_speed[True-True-False-False-True] | 59.8010μs | 25.8073μs | 38.7487 KOps/s | 38.6982 KOps/s | +0.13% |
| test_step_mdp_speed[True-True-False-False-False] | 53.7410μs | 15.4572μs | 64.6949 KOps/s | 65.1556 KOps/s | -0.71% |
| test_step_mdp_speed[True-False-True-True-True] | 0.1179ms | 46.8276μs | 21.3549 KOps/s | 20.9731 KOps/s | +1.82% |
| test_step_mdp_speed[True-False-True-True-False] | 67.7110μs | 28.1462μs | 35.5288 KOps/s | 35.6843 KOps/s | -0.44% |
| test_step_mdp_speed[True-False-True-False-True] | 54.9710μs | 26.1861μs | 38.1882 KOps/s | 37.5558 KOps/s | +1.68% |
| test_step_mdp_speed[True-False-True-False-False] | 40.6210μs | 15.2478μs | 65.5832 KOps/s | 64.8506 KOps/s | +1.13% |
| test_step_mdp_speed[True-False-False-True-True] | 84.5620μs | 49.1386μs | 20.3506 KOps/s | 20.0432 KOps/s | +1.53% |
| test_step_mdp_speed[True-False-False-True-False] | 68.8620μs | 30.0952μs | 33.2279 KOps/s | 32.5721 KOps/s | +2.01% |
| test_step_mdp_speed[True-False-False-False-True] | 60.6010μs | 28.3195μs | 35.3114 KOps/s | 35.1444 KOps/s | +0.48% |
| test_step_mdp_speed[True-False-False-False-False] | 58.6310μs | 17.6406μs | 56.6873 KOps/s | 55.4042 KOps/s | +2.32% |
| test_step_mdp_speed[False-True-True-True-True] | 87.0910μs | 46.6662μs | 21.4288 KOps/s | 20.8881 KOps/s | +2.59% |
| test_step_mdp_speed[False-True-True-True-False] | 57.2010μs | 27.5472μs | 36.3014 KOps/s | 35.8694 KOps/s | +1.20% |
| test_step_mdp_speed[False-True-True-False-True] | 2.4869ms | 29.9273μs | 33.4143 KOps/s | 33.5495 KOps/s | -0.40% |
| test_step_mdp_speed[False-True-True-False-False] | 45.9200μs | 17.0627μs | 58.6072 KOps/s | 59.3039 KOps/s | -1.17% |
| test_step_mdp_speed[False-True-False-True-True] | 86.2610μs | 48.8095μs | 20.4878 KOps/s | 20.1449 KOps/s | +1.70% |
| test_step_mdp_speed[False-True-False-True-False] | 66.1710μs | 30.1817μs | 33.1327 KOps/s | 32.5591 KOps/s | +1.76% |
| test_step_mdp_speed[False-True-False-False-True] | 63.0810μs | 32.0491μs | 31.2021 KOps/s | 31.1537 KOps/s | +0.16% |
| test_step_mdp_speed[False-True-False-False-False] | 50.4610μs | 19.3231μs | 51.7515 KOps/s | 51.9095 KOps/s | -0.30% |
| test_step_mdp_speed[False-False-True-True-True] | 90.5710μs | 51.3817μs | 19.4622 KOps/s | 19.2463 KOps/s | +1.12% |
| test_step_mdp_speed[False-False-True-True-False] | 71.8610μs | 32.8089μs | 30.4796 KOps/s | 30.2437 KOps/s | +0.78% |
| test_step_mdp_speed[False-False-True-False-True] | 68.0510μs | 31.6025μs | 31.6431 KOps/s | 31.2318 KOps/s | +1.32% |
| test_step_mdp_speed[False-False-True-False-False] | 61.5710μs | 18.9732μs | 52.7059 KOps/s | 52.0746 KOps/s | +1.21% |
| test_step_mdp_speed[False-False-False-True-True] | 90.1110μs | 54.0174μs | 18.5125 KOps/s | 18.2681 KOps/s | +1.34% |
| test_step_mdp_speed[False-False-False-True-False] | 95.5120μs | 34.6283μs | 28.8781 KOps/s | 28.1220 KOps/s | +2.69% |
| test_step_mdp_speed[False-False-False-False-True] | 53.8610μs | 33.5677μs | 29.7905 KOps/s | 29.2696 KOps/s | +1.78% |
| test_step_mdp_speed[False-False-False-False-False] | 44.7010μs | 21.8044μs | 45.8623 KOps/s | 46.0767 KOps/s | -0.47% |
| test_non_tensor_env_rollout_speed[1000-single-True] | 0.8500s | 0.7414s | 1.3489 Ops/s | 1.3469 Ops/s | +0.14% |
| test_non_tensor_env_rollout_speed[1000-single-False] | 0.7058s | 0.6065s | 1.6488 Ops/s | 1.6337 Ops/s | +0.92% |
| test_non_tensor_env_rollout_speed[1000-serial-no-buffers-True] | 1.7328s | 1.6455s | 0.6077 Ops/s | 0.6024 Ops/s | +0.88% |
| test_non_tensor_env_rollout_speed[1000-serial-no-buffers-False] | 1.5090s | 1.4277s | 0.7004 Ops/s | 0.6997 Ops/s | +0.10% |
| test_non_tensor_env_rollout_speed[1000-serial-buffers-True] | 1.9827s | 1.8982s | 0.5268 Ops/s | 0.5270 Ops/s | -0.05% |
| test_non_tensor_env_rollout_speed[1000-serial-buffers-False] | 1.7699s | 1.6773s | 0.5962 Ops/s | 0.5974 Ops/s | -0.19% |
| test_non_tensor_env_rollout_speed[1000-parallel-no-buffers-True] | 4.7064s | 4.5849s | 0.2181 Ops/s | 0.2164 Ops/s | +0.79% |
| test_non_tensor_env_rollout_speed[1000-parallel-no-buffers-False] | 4.5414s | 4.4057s | 0.2270 Ops/s | 0.2244 Ops/s | +1.13% |
| test_non_tensor_env_rollout_speed[1000-parallel-buffers-True] | 1.9688s | 1.8767s | 0.5329 Ops/s | 0.5356 Ops/s | -0.51% |
| test_non_tensor_env_rollout_speed[1000-parallel-buffers-False] | 1.6743s | 1.6050s | 0.6230 Ops/s | 0.6331 Ops/s | -1.58% |
| test_values[generalized_advantage_estimate-True-True] | 21.3523ms | 20.6778ms | 48.3610 Ops/s | 47.1374 Ops/s | +2.60% |
| test_values[vec_generalized_advantage_estimate-True-True] | 0.1392s | 3.7061ms | 269.8230 Ops/s | 283.9737 Ops/s | -4.98% |
| test_values[td0_return_estimate-False-False] | 0.1025ms | 82.2431μs | 12.1591 KOps/s | 12.1810 KOps/s | -0.18% |
| test_values[td1_return_estimate-False-False] | 48.4830ms | 48.0827ms | 20.7975 Ops/s | 20.3764 Ops/s | +2.07% |
| test_values[vec_td1_return_estimate-False-False] | 1.3482ms | 1.0898ms | 917.5997 Ops/s | 913.3728 Ops/s | +0.46% |
| test_values[td_lambda_return_estimate-True-False] | 79.5740ms | 78.8964ms | 12.6749 Ops/s | 12.4395 Ops/s | +1.89% |
| test_values[vec_td_lambda_return_estimate-True-False] | 1.2950ms | 1.0869ms | 920.0286 Ops/s | 913.7384 Ops/s | +0.69% |
| test_gae_speed[generalized_advantage_estimate-False-1-512] | 20.9997ms | 20.7025ms | 48.3034 Ops/s | 48.1498 Ops/s | +0.32% |
| test_gae_speed[vec_generalized_advantage_estimate-True-1-512] | 1.0118ms | 0.7566ms | 1.3218 KOps/s | 1.3112 KOps/s | +0.81% |
| test_gae_speed[vec_generalized_advantage_estimate-False-1-512] | 0.8047ms | 0.6810ms | 1.4685 KOps/s | 1.4741 KOps/s | -0.38% |
| test_gae_speed[vec_generalized_advantage_estimate-True-32-512] | 1.5264ms | 1.4890ms | 671.5696 Ops/s | 672.4670 Ops/s | -0.13% |
| test_gae_speed[vec_generalized_advantage_estimate-False-32-512] | 0.7346ms | 0.6953ms | 1.4381 KOps/s | 1.4065 KOps/s | +2.25% |
| test_dqn_speed[False-None] | 1.6993ms | 1.6032ms | 623.7463 Ops/s | 627.4638 Ops/s | -0.59% |
| test_dqn_speed[False-backward] | 2.3183ms | 2.2495ms | 444.5479 Ops/s | 443.5572 Ops/s | +0.22% |
| test_dqn_speed[True-None] | 0.6703ms | 0.5761ms | 1.7359 KOps/s | 1.6993 KOps/s | +2.15% |
| test_dqn_speed[True-backward] | 1.2649ms | 1.2177ms | 821.2259 Ops/s | 819.4070 Ops/s | +0.22% |
| test_dqn_speed[reduce-overhead-None] | 0.7399ms | 0.6089ms | 1.6424 KOps/s | 1.6228 KOps/s | +1.21% |
| test_ddpg_speed[False-None] | 3.4689ms | 3.0429ms | 328.6364 Ops/s | 329.8233 Ops/s | -0.36% |
| test_ddpg_speed[False-backward] | 4.9782ms | 4.5039ms | 222.0294 Ops/s | 223.6785 Ops/s | -0.74% |
| test_ddpg_speed[True-None] | 1.4222ms | 1.3370ms | 747.9668 Ops/s | 742.9018 Ops/s | +0.68% |
| test_ddpg_speed[True-backward] | 2.3753ms | 2.3250ms | 430.1056 Ops/s | 400.9642 Ops/s | **+7.27%** |
| test_ddpg_speed[reduce-overhead-None] | 1.4560ms | 1.3657ms | 732.2456 Ops/s | 724.7545 Ops/s | +1.03% |
| test_sac_speed[False-None] | 8.9579ms | 8.5364ms | 117.1457 Ops/s | 117.4731 Ops/s | -0.28% |
| test_sac_speed[False-backward] | 12.1395ms | 11.6228ms | 86.0376 Ops/s | 84.5686 Ops/s | +1.74% |
| test_sac_speed[True-None] | 2.3281ms | 1.8458ms | 541.7841 Ops/s | 528.2851 Ops/s | +2.56% |
| test_sac_speed[True-backward] | 3.5299ms | 3.4248ms | 291.9855 Ops/s | 273.0737 Ops/s | **+6.93%** |
| test_sac_speed[reduce-overhead-None] | 16.9107ms | 10.1890ms | 98.1450 Ops/s | 98.0442 Ops/s | +0.10% |
| test_redq_deprec_speed[False-None] | 10.6393ms | 9.5352ms | 104.8747 Ops/s | 104.4890 Ops/s | +0.37% |
| test_redq_deprec_speed[False-backward] | 13.2051ms | 12.6858ms | 78.8285 Ops/s | 76.7313 Ops/s | +2.73% |
| test_redq_deprec_speed[True-None] | 2.6555ms | 2.5374ms | 394.1020 Ops/s | 383.9178 Ops/s | +2.65% |
| test_redq_deprec_speed[True-backward] | 4.0549ms | 3.9871ms | 250.8060 Ops/s | 233.8570 Ops/s | **+7.25%** |
| test_redq_deprec_speed[reduce-overhead-None] | 14.8212ms | 9.6522ms | 103.6029 Ops/s | 103.6334 Ops/s | -0.03% |
| test_td3_speed[False-None] | 8.5183ms | 8.3927ms | 119.1512 Ops/s | 117.7044 Ops/s | +1.23% |
| test_td3_speed[False-backward] | 11.3962ms | 10.8319ms | 92.3196 Ops/s | 91.7576 Ops/s | +0.61% |
| test_td3_speed[True-None] | 1.6499ms | 1.6171ms | 618.3918 Ops/s | 582.9998 Ops/s | **+6.07%** |
| test_td3_speed[True-backward] | 3.0798ms | 2.9478ms | 339.2308 Ops/s | 317.8524 Ops/s | **+6.73%** |
| test_td3_speed[reduce-overhead-None] | 98.2653ms | 25.7436ms | 38.8446 Ops/s | 38.6384 Ops/s | +0.53% |
| test_cql_speed[False-None] | 18.4699ms | 17.8094ms | 56.1501 Ops/s | 56.3456 Ops/s | -0.35% |
| test_cql_speed[False-backward] | 23.6310ms | 23.1680ms | 43.1629 Ops/s | 42.4847 Ops/s | +1.60% |
| test_cql_speed[True-None] | 3.5518ms | 3.4346ms | 291.1539 Ops/s | 303.6496 Ops/s | -4.12% |
| test_cql_speed[True-backward] | 5.9011ms | 5.4961ms | 181.9480 Ops/s | 177.9555 Ops/s | +2.24% |
| test_cql_speed[reduce-overhead-None] | 17.8105ms | 11.8815ms | 84.1642 Ops/s | 83.8894 Ops/s | +0.33% |
| test_a2c_speed[False-None] | 3.6040ms | 3.4720ms | 288.0186 Ops/s | 295.7474 Ops/s | -2.61% |
| test_a2c_speed[False-backward] | 6.9608ms | 6.4364ms | 155.3663 Ops/s | 150.9679 Ops/s | +2.91% |
| test_a2c_speed[True-None] | 1.4550ms | 1.3596ms | 735.5352 Ops/s | 725.7821 Ops/s | +1.34% |
| test_a2c_speed[True-backward] | 3.2742ms | 3.1327ms | 319.2105 Ops/s | 330.2158 Ops/s | -3.33% |
| test_a2c_speed[reduce-overhead-None] | 1.1751ms | 1.0420ms | 959.6825 Ops/s | 955.1133 Ops/s | +0.48% |
| test_ppo_speed[False-None] | 4.3216ms | 4.1711ms | 239.7454 Ops/s | 247.6136 Ops/s | -3.18% |
| test_ppo_speed[False-backward] | 7.8966ms | 7.5438ms | 132.5596 Ops/s | 136.9900 Ops/s | -3.23% |
| test_ppo_speed[True-None] | 1.7350ms | 1.5060ms | 664.0306 Ops/s | 664.9134 Ops/s | -0.13% |
| test_ppo_speed[True-backward] | 3.3857ms | 3.3203ms | 301.1769 Ops/s | 295.8596 Ops/s | +1.80% |
| test_ppo_speed[reduce-overhead-None] | 1.2418ms | 1.0951ms | 913.1178 Ops/s | 892.5180 Ops/s | +2.31% |
| test_reinforce_speed[False-None] | 2.7625ms | 2.4393ms | 409.9515 Ops/s | 398.9212 Ops/s | +2.77% |
| test_reinforce_speed[False-backward] | 3.9984ms | 3.5778ms | 279.5026 Ops/s | 287.1583 Ops/s | -2.67% |
| test_reinforce_speed[True-None] | 1.4860ms | 1.3659ms | 732.0985 Ops/s | 723.7706 Ops/s | +1.15% |
| test_reinforce_speed[True-backward] | 3.5772ms | 3.1502ms | 317.4410 Ops/s | 332.1191 Ops/s | -4.42% |
| test_reinforce_speed[reduce-overhead-None] | 16.0974ms | 9.0548ms | 110.4382 Ops/s | 111.9383 Ops/s | -1.34% |
| test_iql_speed[False-None] | 11.0086ms | 10.0238ms | 99.7627 Ops/s | 102.3746 Ops/s | -2.55% |
| test_iql_speed[False-backward] | 14.5950ms | 13.9208ms | 71.8348 Ops/s | 73.6738 Ops/s | -2.50% |
| test_iql_speed[True-None] | 2.4328ms | 2.2587ms | 442.7365 Ops/s | 441.1823 Ops/s | +0.35% |
| test_iql_speed[True-backward] | 4.9970ms | 4.9035ms | 203.9364 Ops/s | 208.9675 Ops/s | -2.41% |
| test_iql_speed[reduce-overhead-None] | 17.0540ms | 10.1268ms | 98.7475 Ops/s | 100.1302 Ops/s | -1.38% |
| test_rb_sample[TensorDictReplayBuffer-ListStorage-RandomSampler-4000] | 6.2192ms | 5.9772ms | 167.3015 Ops/s | 166.8570 Ops/s | +0.27% |
| test_rb_sample[TensorDictReplayBuffer-LazyMemmapStorage-RandomSampler-10000] | 0.9283ms | 0.3372ms | 2.9659 KOps/s | 2.9510 KOps/s | +0.50% |
| test_rb_sample[TensorDictReplayBuffer-LazyTensorStorage-RandomSampler-10000] | 0.5373ms | 0.2738ms | 3.6517 KOps/s | 2.9664 KOps/s | **+23.10%** |
| test_rb_sample[TensorDictReplayBuffer-ListStorage-SamplerWithoutReplacement-4000] | 5.9926ms | 5.7596ms | 173.6217 Ops/s | 171.8015 Ops/s | +1.06% |
| test_rb_sample[TensorDictReplayBuffer-LazyMemmapStorage-SamplerWithoutReplacement-10000] | 2.2314ms | 0.3269ms | 3.0590 KOps/s | 3.3435 KOps/s | **-8.51%** |
| test_rb_sample[TensorDictReplayBuffer-LazyTensorStorage-SamplerWithoutReplacement-10000] | 0.6055ms | 0.2815ms | 3.5522 KOps/s | 3.7199 KOps/s | -4.51% |
| test_rb_sample[TensorDictReplayBuffer-LazyMemmapStorage-sampler6-10000] | 1.6088ms | 1.2805ms | 780.9574 Ops/s | 786.7949 Ops/s | -0.74% |
| test_rb_sample[TensorDictReplayBuffer-LazyTensorStorage-sampler7-10000] | 1.4157ms | 1.2049ms | 829.9324 Ops/s | 833.6848 Ops/s | -0.45% |
| test_rb_sample[TensorDictPrioritizedReplayBuffer-ListStorage-None-4000] | 6.1337ms | 5.9747ms | 167.3726 Ops/s | 168.3295 Ops/s | -0.57% |
| test_rb_sample[TensorDictPrioritizedReplayBuffer-LazyMemmapStorage-None-10000] | 0.9468ms | 0.4416ms | 2.2646 KOps/s | 1.9300 KOps/s | **+17.34%** |
| test_rb_sample[TensorDictPrioritizedReplayBuffer-LazyTensorStorage-None-10000] | 0.8606ms | 0.4223ms | 2.3678 KOps/s | 2.0288 KOps/s | **+16.71%** |
| test_rb_iterate[TensorDictReplayBuffer-ListStorage-RandomSampler-4000] | 5.9516ms | 5.7820ms | 172.9500 Ops/s | 172.5008 Ops/s | +0.26% |
| test_rb_iterate[TensorDictReplayBuffer-LazyMemmapStorage-RandomSampler-10000] | 0.6606ms | 0.2871ms | 3.4828 KOps/s | 3.3460 KOps/s | +4.09% |
| test_rb_iterate[TensorDictReplayBuffer-LazyTensorStorage-RandomSampler-10000] | 0.5967ms | 0.3680ms | 2.7174 KOps/s | 3.1687 KOps/s | **-14.24%** |
| test_rb_iterate[TensorDictReplayBuffer-ListStorage-SamplerWithoutReplacement-4000] | 5.8965ms | 5.6971ms | 175.5283 Ops/s | 173.9889 Ops/s | +0.88% |
| test_rb_iterate[TensorDictReplayBuffer-LazyMemmapStorage-SamplerWithoutReplacement-10000] | 1.2080ms | 0.3515ms | 2.8447 KOps/s | 2.8574 KOps/s | -0.45% |
| test_rb_iterate[TensorDictReplayBuffer-LazyTensorStorage-SamplerWithoutReplacement-10000] | 0.5438ms | 0.3020ms | 3.3110 KOps/s | 3.3390 KOps/s | -0.84% |
| test_rb_iterate[TensorDictPrioritizedReplayBuffer-ListStorage-None-4000] | 6.0128ms | 5.8365ms | 171.3345 Ops/s | 165.7385 Ops/s | +3.38% |
| test_rb_iterate[TensorDictPrioritizedReplayBuffer-LazyMemmapStorage-None-10000] | 1.6737ms | 0.5021ms | 1.9918 KOps/s | 1.8615 KOps/s | **+7.00%** |
| test_rb_iterate[TensorDictPrioritizedReplayBuffer-LazyTensorStorage-None-10000] | 0.7089ms | 0.4856ms | 2.0595 KOps/s | 1.9266 KOps/s | **+6.90%** |
| test_rb_populate[TensorDictReplayBuffer-ListStorage-RandomSampler-400] | 0.9615s | 24.2078ms | 41.3090 Ops/s | 195.4822 Ops/s | **-78.87%** |
| test_rb_populate[TensorDictReplayBuffer-LazyMemmapStorage-RandomSampler-400] | 9.7995ms | 1.9977ms | 500.5750 Ops/s | 502.5160 Ops/s | -0.39% |
| test_rb_populate[TensorDictReplayBuffer-LazyTensorStorage-RandomSampler-400] | 9.9517ms | 1.3139ms | 761.1102 Ops/s | 1.0190 KOps/s | **-25.31%** |
| test_rb_populate[TensorDictReplayBuffer-ListStorage-SamplerWithoutReplacement-400] | 6.9212ms | 5.0648ms | 197.4416 Ops/s | 198.1174 Ops/s | -0.34% |
| test_rb_populate[TensorDictReplayBuffer-LazyMemmapStorage-SamplerWithoutReplacement-400] | 3.9904ms | 1.8347ms | 545.0508 Ops/s | 539.7199 Ops/s | +0.99% |
| test_rb_populate[TensorDictReplayBuffer-LazyTensorStorage-SamplerWithoutReplacement-400] | 1.3794ms | 1.0162ms | 984.0611 Ops/s | 1.0392 KOps/s | **-5.31%** |
| test_rb_populate[TensorDictPrioritizedReplayBuffer-ListStorage-None-400] | 7.8695ms | 5.2874ms | 189.1295 Ops/s | 186.0342 Ops/s | +1.66% |
| test_rb_populate[TensorDictPrioritizedReplayBuffer-LazyMemmapStorage-None-400] | 0.6609s | 15.3503ms | 65.1452 Ops/s | 494.3705 Ops/s | **-86.82%** |
| test_rb_populate[TensorDictPrioritizedReplayBuffer-LazyTensorStorage-None-400] | 2.5871ms | 1.1935ms | 837.8456 Ops/s | 851.0172 Ops/s | -1.55% |
| test_rb_extend_sample[ReplayBuffer-LazyTensorStorage-RandomSampler-10000-10000-100-True] | 39.6831ms | 37.9813ms | 26.3288 Ops/s | 25.8422 Ops/s | +1.88% |
| test_rb_extend_sample[ReplayBuffer-LazyTensorStorage-RandomSampler-10000-10000-100-False] | 19.4244ms | 18.0408ms | 55.4298 Ops/s | 54.8734 Ops/s | +1.01% |
| test_rb_extend_sample[ReplayBuffer-LazyTensorStorage-RandomSampler-100000-10000-100-True] | 43.2301ms | 39.2733ms | 25.4626 Ops/s | 24.9082 Ops/s | +2.23% |
| test_rb_extend_sample[ReplayBuffer-LazyTensorStorage-RandomSampler-100000-10000-100-False] | 20.4316ms | 18.4916ms | 54.0787 Ops/s | 53.8905 Ops/s | +0.35% |
| test_rb_extend_sample[ReplayBuffer-LazyTensorStorage-RandomSampler-1000000-10000-100-True] | 42.7877ms | 40.9320ms | 24.4307 Ops/s | 23.9275 Ops/s | +2.10% |
| test_rb_extend_sample[ReplayBuffer-LazyTensorStorage-RandomSampler-1000000-10000-100-False] | 21.9271ms | 20.3261ms | 49.1978 Ops/s | 49.2270 Ops/s | -0.06% |
| test_storage_write_lazystack[50-img_shape0-small] | 0.8561ms | 0.2184ms | 4.5792 KOps/s | 4.4237 KOps/s | +3.51% |
| test_storage_write_lazystack[100-img_shape1-atari] | 1.6601ms | 1.4093ms | 709.5776 Ops/s | 709.2011 Ops/s | +0.05% |
| test_storage_write_lazystack[100-img_shape2-large_img] | 2.7703ms | 2.3771ms | 420.6815 Ops/s | 436.3603 Ops/s | -3.59% |
| test_storage_write_lazystack[200-img_shape3-large_batch] | 3.1347ms | 2.9365ms | 340.5367 Ops/s | 336.6963 Ops/s | +1.14% |
| test_storage_write_contiguous[50-img_shape0-small] | 0.4919ms | 0.1632ms | 6.1288 KOps/s | 5.9565 KOps/s | +2.89% |
| test_storage_write_contiguous[100-img_shape1-atari] | 0.3813ms | 0.2190ms | 4.5670 KOps/s | 4.3415 KOps/s | **+5.20%** |
| test_storage_write_contiguous[100-img_shape2-large_img] | 2.0097ms | 1.7843ms | 560.4390 Ops/s | 550.1413 Ops/s | +1.87% |
| test_storage_write_contiguous[200-img_shape3-large_batch] | 1.5858ms | 1.3794ms | 724.9783 Ops/s | 744.5115 Ops/s | -2.62% |
| test_collector_stack_then_write[50-img_shape0-small] | 1.4658ms | 1.1493ms | 870.1236 Ops/s | 875.7571 Ops/s | -0.64% |
| test_collector_stack_then_write[100-img_shape1-atari] | 3.7252ms | 3.6101ms | 276.9975 Ops/s | 278.9396 Ops/s | -0.70% |
| test_collector_stack_then_write[100-img_shape2-large_img] | 6.0544ms | 5.8395ms | 171.2478 Ops/s | 169.2655 Ops/s | +1.17% |
| test_collector_stack_then_write[200-img_shape3-large_batch] | 7.5864ms | 7.3828ms | 135.4497 Ops/s | 134.6698 Ops/s | +0.58% |
| test_collector_lazystack_then_write[50-img_shape0-small] | 0.4142ms | 0.2772ms | 3.6073 KOps/s | 3.6240 KOps/s | -0.46% |
| test_collector_lazystack_then_write[100-img_shape1-atari] | 1.7055ms | 1.5289ms | 654.0714 Ops/s | 643.6705 Ops/s | +1.62% |
| test_collector_lazystack_then_write[100-img_shape2-large_img] | 2.7230ms | 2.4977ms | 400.3659 Ops/s | 399.4142 Ops/s | +0.24% |
| test_collector_lazystack_then_write[200-img_shape3-large_batch] | 3.3438ms | 3.1376ms | 318.7156 Ops/s | 316.0961 Ops/s | +0.83% |
| test_collector_without_rb[100-img_shape0-atari] | 33.0763ms | 32.5758ms | 30.6976 Ops/s | 30.3886 Ops/s | +1.02% |
| test_collector_without_rb[200-img_shape1-large_batch] | 64.3704ms | 64.1237ms | 15.5949 Ops/s | 15.4631 Ops/s | +0.85% |
| test_collector_with_rb[100-img_shape0-atari] | 38.0902ms | 37.4591ms | 26.6957 Ops/s | 26.5407 Ops/s | +0.58% |
| test_collector_with_rb[200-img_shape1-large_batch] | 74.6245ms | 73.7637ms | 13.5568 Ops/s | 13.5454 Ops/s | +0.08% |
| test_collector_without_rb_cuda[100-img_shape0-atari] | 55.4755ms | 55.0507ms | 18.1651 Ops/s | 17.6601 Ops/s | +2.86% |
| test_collector_without_rb_cuda[200-img_shape1-large_batch] | 0.1099s | 0.1096s | 9.1279 Ops/s | 8.9213 Ops/s | +2.32% |
| test_collector_with_rb_cuda[100-img_shape0-atari] | 57.4542ms | 57.1912ms | 17.4852 Ops/s | 17.3711 Ops/s | +0.66% |
| test_collector_with_rb_cuda[200-img_shape1-large_batch] | 0.1141s | 0.1133s | 8.8226 Ops/s | 8.6883 Ops/s | +1.55% |

vmoens added a commit that referenced this pull request Mar 23, 2026
ghstack-source-id: 1a2d958
Pull-Request: #3556
[ghstack-poisoned]
vmoens added a commit that referenced this pull request Mar 23, 2026
ghstack-source-id: f6fb817
Pull-Request: #3556
vmoens added a commit that referenced this pull request Mar 24, 2026
ghstack-source-id: f6fb817
Pull-Request: #3556
vmoens closed this Mar 26, 2026
vmoens deleted the gh/vmoens/240/head branch March 26, 2026 16:53
vmoens added a commit that referenced this pull request Mar 26, 2026
ghstack-source-id: f6fb817
Pull-Request: #3556

ghstack-source-id: 60cb4b4
Pull Request resolved: #3574

Labels

CLA Signed, llm/, Modules, sota-implementations/, WeightUpdate