@@ -133,6 +133,46 @@ work directly on the real rootfs. Removes the need for traditional
133133`switch_root` workarounds. In the future this also allows us to create
134134completely empty mount namespaces without the risk of leaking anything.
135135
136+ ### Allow `MOVE_MOUNT_BENEATH` on the rootfs
137+
138+ Allow `MOVE_MOUNT_BENEATH` to target the caller's rootfs, enabling
139+ root-switching without `pivot_root(2)`. The traditional approach to
140+ switching the rootfs involves `pivot_root(2)` or a `chroot_fs_refs()`-based
141+ mechanism that atomically updates `fs->root` for all tasks sharing the
142+ same `fs_struct`. Because that update reaches beyond the caller, it races
143+ with `fork()`, `unshare(CLONE_FS)`, and `setns()`.
144+
145+ Instead, decompose root-switching into individually atomic, locally-scoped
146+ steps:
147+
148+ ```c
149+ fd_tree = open_tree(-EBADF, "/newroot",
150+ OPEN_TREE_CLONE | OPEN_TREE_CLOEXEC);
151+ fchdir(fd_tree);
152+ move_mount(fd_tree, "", AT_FDCWD, "/",
153+ MOVE_MOUNT_BENEATH | MOVE_MOUNT_F_EMPTY_PATH);
154+ chroot(".");
155+ umount2(".", MNT_DETACH);
156+ ```
157+
158+ Since each step only modifies the caller's own state, the
159+ `fork()`/`unshare()`/`setns()` races are eliminated by design.
160+
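The steps above can be wrapped with error handling roughly as follows. This is a minimal sketch, not a reference implementation: `switch_root_beneath()` is a hypothetical helper name, the raw `syscall()` invocations assume kernel headers that define `SYS_open_tree` and `SYS_move_mount`, and the `#ifndef` fallbacks assume the flag values from the Linux UAPI headers. It also assumes a kernel that accepts `MOVE_MOUNT_BENEATH` on the rootfs and a caller with sufficient privileges in its mount namespace.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <sys/mount.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Fallbacks for older libc headers (assumed to match the Linux UAPI). */
#ifndef OPEN_TREE_CLONE
#define OPEN_TREE_CLONE 1
#endif
#ifndef OPEN_TREE_CLOEXEC
#define OPEN_TREE_CLOEXEC O_CLOEXEC
#endif
#ifndef MOVE_MOUNT_F_EMPTY_PATH
#define MOVE_MOUNT_F_EMPTY_PATH 0x00000004
#endif
#ifndef MOVE_MOUNT_BENEATH
#define MOVE_MOUNT_BENEATH 0x00000200
#endif

/*
 * Hypothetical helper: switch the caller's root to a clone of `newroot`.
 * Returns 0 on success, -1 with errno set on the first failing step.
 */
int switch_root_beneath(const char *newroot)
{
    int saved_errno;

    /* Step 1: create a detached clone of the new root tree. */
    int fd_tree = (int)syscall(SYS_open_tree, -EBADF, newroot,
                               OPEN_TREE_CLONE | OPEN_TREE_CLOEXEC);
    if (fd_tree < 0)
        return -1;

    if (fchdir(fd_tree) < 0 ||                      /* Step 2: cwd = new tree */
        syscall(SYS_move_mount, fd_tree, "", AT_FDCWD, "/",
                MOVE_MOUNT_BENEATH | MOVE_MOUNT_F_EMPTY_PATH) < 0 ||
                                                    /* Step 3: mount beneath "/" */
        chroot(".") < 0 ||                          /* Step 4: caller's root only */
        umount2(".", MNT_DETACH) < 0) {             /* Step 5: drop the old root */
        saved_errno = errno;
        close(fd_tree);
        errno = saved_errno;
        return -1;
    }

    close(fd_tree);
    return 0;
}
```

A caller would invoke `switch_root_beneath("/newroot")` and inspect `errno` on failure. The sketch stops at the first failing step and does not attempt to undo the steps that already succeeded.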
161+ To make this work, `MNT_LOCKED` is transferred from the top mount to the
162+ mount beneath. The new mount takes over the job of protecting the parent
163+ mount from being revealed. This also makes it possible to safely modify
164+ an inherited mount table after `unshare(CLONE_NEWUSER | CLONE_NEWNS)`:
165+
166+ ```sh
167+ mount --beneath -t tmpfs tmpfs /proc
168+ umount -l /proc
169+ ```
170+
171+ **Use-Case:** Containers created with `unshare(CLONE_NEWUSER | CLONE_NEWNS)`
172+ can reshuffle an inherited mount table safely. `MOVE_MOUNT_BENEATH` on the
173+ rootfs makes it possible to switch out the rootfs without the costly
174+ `pivot_root(2)` and without cross-namespace vulnerabilities.
175+
136176### Query mount information via file descriptor with `statmount()`
137177
138178Extend `struct mnt_id_req` to accept a file descriptor and introduce