Commit baf5a17

wishlist: Allow MOVE_MOUNT_BENEATH on the rootfs
Signed-off-by: Christian Brauner <[email protected]>
1 parent 4945c51 commit baf5a17

1 file changed

Lines changed: 40 additions & 0 deletions

README.md

@@ -133,6 +133,46 @@ work directly on the real rootfs. Removes the need for traditional
`switch_root` workarounds. In the future this also allows us to create
completely empty mount namespaces without the risk of leaking anything.

### Allow `MOVE_MOUNT_BENEATH` on the rootfs

Allow `MOVE_MOUNT_BENEATH` to target the caller's rootfs, enabling
root-switching without `pivot_root(2)`. The traditional approach to
switching the rootfs involves `pivot_root(2)` or a `chroot_fs_refs()`-based
mechanism that atomically updates `fs->root` for all tasks sharing the
same `fs_struct`. That cross-task update races with `fork()`,
`unshare(CLONE_FS)`, and `setns()`.

Instead, decompose root-switching into individually atomic, locally scoped
steps:

```c
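/* Clone /newroot into a detached mount. */
fd_tree = open_tree(-EBADF, "/newroot",
                    OPEN_TREE_CLONE | OPEN_TREE_CLOEXEC);
/* Make the detached mount the caller's working directory. */
fchdir(fd_tree);
/* Attach it beneath the current rootfs. */
move_mount(fd_tree, "", AT_FDCWD, "/",
           MOVE_MOUNT_BENEATH | MOVE_MOUNT_F_EMPTY_PATH);
/* Point the caller's root at the new mount. */
chroot(".");
/* Lazily detach the old rootfs that still sits on top. */
umount2(".", MNT_DETACH);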
```
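
The sequence above omits error handling. As a rough sketch only, it could be wrapped as shown below. The helper name `switch_root_beneath()` is hypothetical, the raw `syscall()` wrappers and fallback flag definitions are assumptions made so the example builds against older userspace headers, and success of the `move_mount()` step assumes a kernel that implements the wishlist item described here. Note that a failure partway through leaves earlier steps (such as the changed working directory) in effect.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/mount.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Fallback values from the kernel UAPI for older userspace headers. */
#ifndef OPEN_TREE_CLONE
#define OPEN_TREE_CLONE 1
#endif
#ifndef OPEN_TREE_CLOEXEC
#define OPEN_TREE_CLOEXEC O_CLOEXEC
#endif
#ifndef MOVE_MOUNT_F_EMPTY_PATH
#define MOVE_MOUNT_F_EMPTY_PATH 0x00000004
#endif
#ifndef MOVE_MOUNT_BENEATH
#define MOVE_MOUNT_BENEATH 0x00000200
#endif

/*
 * Hypothetical helper (not part of any library): switch the caller's
 * rootfs to newroot using the steps from the README.
 * Returns 0 on success, -1 with errno set on the first failing step.
 */
static int switch_root_beneath(const char *newroot)
{
	int fd_tree, saved;

	assert(newroot != NULL);

	/* Clone newroot into a detached mount. */
	fd_tree = (int)syscall(SYS_open_tree, -EBADF, newroot,
			       OPEN_TREE_CLONE | OPEN_TREE_CLOEXEC);
	if (fd_tree < 0)
		return -1;

	/* Run the remaining steps in order, stopping at the first error. */
	if (fchdir(fd_tree) < 0 ||
	    syscall(SYS_move_mount, fd_tree, "", AT_FDCWD, "/",
		    MOVE_MOUNT_BENEATH | MOVE_MOUNT_F_EMPTY_PATH) < 0 ||
	    chroot(".") < 0 ||
	    umount2(".", MNT_DETACH) < 0) {
		saved = errno;	/* close() may clobber errno */
		close(fd_tree);
		errno = saved;
		return -1;
	}

	close(fd_tree);
	return 0;
}
```

A caller would typically invoke `switch_root_beneath("/newroot")` from inside its own mount namespace and report `strerror(errno)` on a `-1` return.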

Because each step modifies only the caller's own state, the
`fork()`/`unshare()`/`setns()` races are eliminated by design.

To make this work, `MNT_LOCKED` is transferred from the top mount to the
mount beneath. The new mount takes over the job of protecting the parent
mount from being revealed. This also makes it possible to safely modify
an inherited mount table after `unshare(CLONE_NEWUSER | CLONE_NEWNS)`:

```sh
mount --beneath -t tmpfs tmpfs /proc
umount -l /proc
```
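
For programs that cannot shell out to `mount(8)`, the two commands above correspond roughly to the following system call sequence. This is a sketch under assumptions: the helper name `cover_from_beneath()` and the local `CFG_*` constant names are hypothetical, the raw `syscall()` wrappers and fallback flag values are there only so the example builds against older headers, and the exact calls issued by `mount --beneath` may differ.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/mount.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Fallback values from the kernel UAPI for older userspace headers. */
#ifndef FSOPEN_CLOEXEC
#define FSOPEN_CLOEXEC 0x00000001
#endif
#ifndef FSMOUNT_CLOEXEC
#define FSMOUNT_CLOEXEC 0x00000001
#endif
#ifndef MOVE_MOUNT_F_EMPTY_PATH
#define MOVE_MOUNT_F_EMPTY_PATH 0x00000004
#endif
#ifndef MOVE_MOUNT_BENEATH
#define MOVE_MOUNT_BENEATH 0x00000200
#endif

/* Values of the kernel's enum fsconfig_command, under local names. */
enum { CFG_SET_STRING = 1, CFG_CMD_CREATE = 6 };

/*
 * Hypothetical helper: mount a fresh tmpfs beneath the mount at target,
 * then lazily detach the mount on top so the tmpfs becomes visible.
 * Returns 0 on success, -1 with errno set on failure.
 */
static int cover_from_beneath(const char *target)
{
	int fd_fs, fd_mnt = -1, ret = -1, saved;

	assert(target != NULL);

	/* Create and configure a tmpfs filesystem context. */
	fd_fs = (int)syscall(SYS_fsopen, "tmpfs", FSOPEN_CLOEXEC);
	if (fd_fs < 0)
		return -1;
	if (syscall(SYS_fsconfig, fd_fs, CFG_SET_STRING, "source", "tmpfs", 0) < 0 ||
	    syscall(SYS_fsconfig, fd_fs, CFG_CMD_CREATE, NULL, NULL, 0) < 0)
		goto out;

	/* Turn the context into a detached mount. */
	fd_mnt = (int)syscall(SYS_fsmount, fd_fs, FSMOUNT_CLOEXEC, 0);
	if (fd_mnt < 0)
		goto out;

	/* Equivalent of `mount --beneath`: attach under the top mount. */
	if (syscall(SYS_move_mount, fd_mnt, "", AT_FDCWD, target,
		    MOVE_MOUNT_BENEATH | MOVE_MOUNT_F_EMPTY_PATH) < 0)
		goto out;

	/* Equivalent of `umount -l`: lazily detach the top mount. */
	ret = umount2(target, MNT_DETACH);
out:
	saved = errno;	/* close() may clobber errno */
	if (fd_mnt >= 0)
		close(fd_mnt);
	close(fd_fs);
	errno = saved;
	return ret;
}
```

Calling `cover_from_beneath("/proc")` after `unshare(CLONE_NEWUSER | CLONE_NEWNS)` mirrors the shell example; whether it succeeds on a locked mount depends on the `MNT_LOCKED` transfer described above being implemented.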

**Use-Case:** Containers created with `unshare(CLONE_NEWUSER | CLONE_NEWNS)`
can reshuffle an inherited mount table safely. `MOVE_MOUNT_BENEATH` on the
rootfs makes it possible to switch out the rootfs without the costly
`pivot_root(2)` and without cross-namespace vulnerabilities.

### Query mount information via file descriptor with `statmount()`

Extend `struct mnt_id_req` to accept a file descriptor and introduce
