Information
Tasks
Reproduction
Any official scripts.
Expected behavior
During distributed training with multiple GPUs, all ranks produce identical retain loss values while forget loss values differ as expected. This indicates that all ranks are sampling the same retain data samples, which breaks the randomness assumption in the unlearning process.
The issue is located in src/data/unlearn.py in the ForgetRetainDataset.__getitem__() method:
def __getitem__(self, idx):
    item = {}
    if self.anchor == "forget":
        item["forget"] = self.forget[idx]  # Sequential access - different across ranks ✓
        if self.retain:
            retain_idx = torch.randint(0, len(self.retain), (1,)).item()  # ❌ Problem here
            item["retain"] = self.retain[retain_idx]
The problem: torch.randint() draws from the global PyTorch random number generator, which is seeded identically on every rank. As a result, all ranks generate the same sequence of random indices and therefore select identical retain samples.
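One way to decorrelate the retain sampling is to give each rank its own torch.Generator, seeded from the process rank, and pass it to torch.randint via the generator argument. The sketch below is illustrative, not the repository's actual code: the class skeleton and the _rank() helper are assumptions, and reading the rank from the RANK environment variable assumes a torchrun-style launcher.

```python
import os
import torch


def _rank() -> int:
    # Process rank from a torchrun-style launcher; 0 if not set.
    return int(os.environ.get("RANK", 0))


class ForgetRetainDataset:  # minimal sketch of the relevant parts only
    def __init__(self, forget, retain, anchor="forget"):
        self.forget = forget
        self.retain = retain
        self.anchor = anchor
        # Per-rank generator: offsetting the seed by the rank makes the
        # random retain indices differ across ranks without touching the
        # global RNG state used elsewhere in training.
        self.generator = torch.Generator()
        self.generator.manual_seed(torch.initial_seed() + _rank())

    def __getitem__(self, idx):
        item = {}
        if self.anchor == "forget":
            item["forget"] = self.forget[idx]
            if self.retain:
                retain_idx = torch.randint(
                    0, len(self.retain), (1,), generator=self.generator
                ).item()
                item["retain"] = self.retain[retain_idx]
        return item
```

Because the draw goes through a dedicated generator, it also no longer advances the global RNG, so forget-side shuffling and dropout remain unaffected.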