feat(data-branch): Branch Protect Snapshot to guard LCA history from GC #24313
Conversation
Thanks for the thorough review @aunjgr. All four blocking findings are addressed in the latest fixes:
1. Concurrent-sibling race — `SELECT ... FOR UPDATE` added on both the compile and frontend reclaim paths.
2. Quota — branch snapshots excluded from the per-account snapshot quota.
3. SHOW RECOVERY WINDOW leak — branch rows filtered out.
4. Snapshot-name-lookup bypass — centralised reject in `getSnapshotByName`.

Nit: added a WARN log. Test updates included.
CI regression on unrelated drops, from PR matrixorigin#24313 commit 05ffa75: the blocking-issue-#1 fix added `FOR UPDATE` to every `reclaimBranchProtectSnapshots` call, which runs synchronously after every plain DROP TABLE. That meant even drops of tables that have nothing to do with data branch took a pessimistic lock on the whole `mo_branch_metadata` table.

Under the docker-compose PESSIMISTIC CI job, the `restore account` flow drops/recreates many tables back-to-back. The serialised lock created enough contention that an in-flight view restore observed its base table as `no such table` (restore-view race vs. concurrent drop cleanup), causing `restore account sys{snapshot=sp02}` to fail in sys_restore_view_to_sys_account.sql (324 total, 4 failed).

Fix: short-circuit the reclaim path. Before taking the `FOR UPDATE` lock, run a cheap SELECT probe asking whether any of the dead tids actually participate in `mo_branch_metadata` (as a child, or as a parent referenced by a child). 99%+ of drops hit the fast path and skip the full reclaim scan entirely. The frontend path (`dataBranchDeleteTable`/`Database`) is not affected because those entry points are branch-specific by construction — every dead tid is guaranteed to be in `mo_branch_metadata`, so a probe would be wasted work.

Correctness unchanged: when the probe confirms branch involvement, the original `FOR UPDATE`-backed reclaim still runs and still serialises concurrent sibling drops, so blocking issue #1 remains fixed.

Co-authored-by: Copilot <[email protected]>
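The shape of the fix is roughly the following minimal Go sketch. The helper names (`runSQLFunc`, `reclaimBranchProtectSnapshots` as written here), column names, and exact SQL are illustrative assumptions; only the probe-then-`FOR UPDATE` structure comes from the commit:

```go
package databranch

import (
	"fmt"
	"strconv"
	"strings"
)

// runSQLFunc stands in for the txn-scoped runner the compile layer already
// owns (runSqlWithResult over the statement's own txn operator).
type runSQLFunc func(sql string) (rows [][]string, err error)

// reclaimBranchProtectSnapshots sketches the probe-then-lock flow.
func reclaimBranchProtectSnapshots(runSQL runSQLFunc, deadTIDs []uint64) error {
	parts := make([]string, len(deadTIDs))
	for i, tid := range deadTIDs {
		parts[i] = strconv.FormatUint(tid, 10)
	}
	ids := strings.Join(parts, ",")

	// Fast path: a plain SELECT, no FOR UPDATE, so unrelated DROP TABLEs
	// never lock mo_branch_metadata.
	probe := fmt.Sprintf(
		"select 1 from mo_catalog.mo_branch_metadata"+
			" where table_id in (%s) or parent_table_id in (%s) limit 1", ids, ids)
	rows, err := runSQL(probe)
	if err != nil {
		return err
	}
	if len(rows) == 0 {
		return nil // no branch lineage involves these tids: nothing to reclaim
	}

	// Slow path: branch involvement confirmed. Take the row locks that
	// serialise concurrent sibling drops, then run the original reclaim.
	scan := fmt.Sprintf(
		"select parent_table_id, table_id, clone_ts from mo_catalog.mo_branch_metadata"+
			" where table_id in (%s) or parent_table_id in (%s) for update", ids, ids)
	if _, err = runSQL(scan); err != nil {
		return err
	}
	// ... original FOR UPDATE-backed reclaim continues here ...
	return nil
}
```

Under this shape the probe-vs-scan races stay benign, matching the review's note below: a redundant pass is an idempotent no-op, and a negative probe can only cover tables outside any branch lineage.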
Root cause of the PESSIMISTIC docker-compose CI failure: the blocking-issue-#1 fix in `(*Compile).reclaimBranchProtectSnapshots` takes a pessimistic lock on `mo_branch_metadata` for every plain DROP TABLE. In the PESSIMISTIC CI job this serialised the `restore account` drop/recreate flow. Fix in `pkg/sql/compile/ddl.go`: probe before locking, as described above. Verified locally; pushed; CI re-running.
aunjgr left a comment
Re-reviewed commits 24ee591b, 05ffa754, c390c13c. All four of my previously flagged blocking findings are fully resolved:
1. Concurrent-sibling race — fixed with SELECT ... FOR UPDATE on mo_branch_metadata inside the same reclaim txn, in both the compile path (pkg/sql/compile/ddl.go:2192-2195 via runSqlWithResult sharing c.proc.GetTxnOperator()) and the frontend path (pkg/frontend/data_branch_snapshot.go:887-890 in loadBranchDAGWithBH). The second sibling's table_deleted=true flip blocks on the first txn's row locks; after commit it sees fresh state and reclaims. Traced A → {B,C} and root → A → {B,C} cascades — no leak. Idempotent DELETEs make duplicate passes safe.
The fast-path probe at ddl.go:2163-2182 (no FOR UPDATE) is safe: it returns early only when no row matches any dead-tid, meaning the table isn't in any branch lineage — no snapshot exists to reclaim.
2. Snapshot quota — pkg/frontend/feature_limit.go:109-111 now includes `and kind != 'branch'`. Column is NOT NULL DEFAULT 'user' with a tenant-upgrade backfill for legacy rows, so no NULL-semantics hole.
3. SHOW RECOVERY WINDOW — pkg/frontend/show_recovery_window.go:192 filters `and kind != 'branch'`.
4. Central choke point for name lookups — getSnapshotByName rejects kind='branch' post-fetch (covers RESTORE, DROP, get_ddl, and every caller routed through it); a sketch of the check follows this list. GetSnapshotInfoByName and ResolveSnapshotWithSnapshotNameWithoutSession also add the filter. doDropSnapshot has a belt-and-suspenders pre-check via getSnapshotKindByName.
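For intuition, the choke-point check amounts to something like this sketch. `snapshotRecord`, the `fetch` callback, and the error text are assumptions; only the post-fetch `kind='branch'` rejection in `getSnapshotByName` comes from the review:

```go
package databranch

import "fmt"

// snapshotRecord is a placeholder for the row the lookup fetches; only the
// kind='branch' rejection rule comes from the PR.
type snapshotRecord struct {
	Name string
	Kind string // 'user' by default; 'branch' for branch-protect snapshots
}

// getSnapshotByName sketches the post-fetch choke-point check: every caller
// (RESTORE, DROP SNAPSHOT, get_ddl, ...) routed through the lookup is
// rejected when the row is branch-managed.
func getSnapshotByName(fetch func(name string) (*snapshotRecord, error), name string) (*snapshotRecord, error) {
	rec, err := fetch(name)
	if err != nil || rec == nil {
		return nil, err
	}
	if rec.Kind == "branch" {
		return nil, fmt.Errorf("snapshot %s is managed by data branch", rec.Name)
	}
	return rec, nil
}
```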
No new correctness bugs introduced. Minor note: dataBranchDeleteTable / dataBranchDeleteDatabase invoke reclaimBranchSnapshotsWithBH after the compile path has already reclaimed, making a second idempotent pass. Wasted work, not a bug — worth a follow-up cleanup if tracked.
LGTM.
XuPeng-SH left a comment
LGTM. Well-structured feature with strong test coverage (unit, engine, BVT) and a clear design doc.
Reviewed:
- Lifecycle correctness: snapshot create is atomic with CLONE + `mo_branch_metadata` insert (same BackgroundExec txn). Reclaim fires on both frontend (DATA BRANCH DELETE) and compile-layer (DROP TABLE) paths via the shared `ReclaimBranchSnapshotsCore` pipeline.
- Subtree-all-deleted algorithm: cycle-safe via `visited` set, O(N) amortised via `memo`. Dangling metadata (unknown nodes) treated as deleted — correct for the catalog-corruption case.
- Concurrency: `FOR UPDATE` on `mo_branch_metadata` serialises sibling reclaim paths. Fast-path probe (`select 1 ... limit 1` without FOR UPDATE) avoids the lock for the 99% of DROP TABLE ops that don't involve branches. The benign races between probe and FOR UPDATE scan are both safe (redundant no-op or already handled).
- Security surface: `BuildBranchSnapshotDeleteSQL` only accepts `__mo_branch_<uint64>` snames — no injection risk (see the sketch after this list). The INSERT in `createBranchProtectSnapshot` interpolates parser-validated identifiers and catalog-fetched account names — acceptable given upstream guarantees.
- User surface: SHOW SNAPSHOTS filters `kind != 'branch'`; DROP SNAPSHOT on a branch-managed row is explicitly rejected with a clear error; `doResolveSnapshotWithSnapshotName` also rejects branch-kind rows from restore/select paths.
- No schema migration needed: reuses the existing `mo_snapshots.kind` column (defaults to 'user'). Clean.
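To make the injection argument concrete, the sname gate can be pictured as a strict-format check like the sketch below. The regexp and function body are assumptions; only the `BuildBranchSnapshotDeleteSQL` name and the `__mo_branch_<uint64>` format come from the review:

```go
package databranch

import (
	"fmt"
	"regexp"
)

// branchSnamePattern matches exactly the system-generated snapshot names,
// __mo_branch_<uint64>; anything else is refused before any SQL is built.
var branchSnamePattern = regexp.MustCompile(`^__mo_branch_[0-9]{1,20}$`)

// BuildBranchSnapshotDeleteSQL sketches the guarded SQL construction: because
// the name is format-checked first, interpolating it cannot inject SQL.
func BuildBranchSnapshotDeleteSQL(sname string) (string, error) {
	if !branchSnamePattern.MatchString(sname) {
		return "", fmt.Errorf("invalid branch snapshot name: %q", sname)
	}
	return fmt.Sprintf(
		"delete from mo_catalog.mo_snapshots where sname = '%s' and kind = 'branch'",
		sname), nil
}
```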
CI all green. 4 prior approvals. Merge state is BEHIND but MERGEABLE — just needs a merge of main (no conflicts).
Merge Queue Status
This pull request spent 25 minutes 42 seconds in the queue, with no time running CI.
Reason: The merge conditions cannot be satisfied due to failing checks.
Hint: You may have to fix your CI before adding the pull request to the queue again.
Merge Queue Status
This pull request spent 47 seconds in the queue, including 4 seconds running CI.
What type of PR is this?
Feature
Which issue(s) this PR fixes:
issue #23751
What this PR does / why we need it:
Introduces Branch Protect Snapshot, a system-managed `kind='branch'` row in `mo_catalog.mo_snapshots` that pins parent-side history for the exact duration a branch subtree is alive.
Motivation: after a flush + global checkpoint + disk-cleaner GC cycle, the LCA probe in `pkg/frontend/data_branch_hashdiff.go` (which time-travels to `clone_ts(child)` on the parent) can silently downgrade an `UPDATE` into an `INSERT` once GC reclaims the pre-branch object of the probed row. The reproduction is captured in `test/distributed/cases/git4data/branch/diff/diff_9.sql` Case 3.
The new snapshot feeds the branch timestamp into the TAE GC retention engine (`pkg/vm/engine/tae/logtail/snapshot.go`), so the backing objects stay on disk as long as any branch descendant needs them.
Lifecycle
- `DATA BRANCH CREATE TABLE/DATABASE` → `updateBranchMetaTable` + `createBranchProtectSnapshot` (atomic with the CLONE DDL and `mo_branch_metadata` insert).
- `DATA BRANCH DELETE TABLE/DATABASE` → `reclaimBranchSnapshotsWithBH`.
- `DROP TABLE` / `DROP DATABASE` cascade → `(*Compile).reclaimBranchProtectSnapshots` using a `runSQL` closure over `mo_branch_metadata`; all reclaim paths converge on `databranchutils.ReclaimBranchSnapshotsCore`.

An edge `(parent, child, clone_ts)` is reclaimed iff the entire subtree rooted at `child` is deleted (sketched below) — sibling and grand-descendant branches keep the snapshot alive as long as any of them references it.
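Read as pseudocode over the branch DAG, the reclaim rule is a memoised DFS. In this sketch `branchNode` and the map layout are assumptions; the `visited`/`memo` roles and the dangling-node rule come from the review notes:

```go
package databranch

// branchNode is one mo_branch_metadata row viewed as a DAG node; field and
// type names are assumptions, only the algorithm shape comes from the PR.
type branchNode struct {
	deleted  bool
	children []uint64 // table ids of direct child branches
}

// subtreeAllDeleted reports whether every node in the subtree rooted at tid
// is deleted. visited makes the walk cycle-safe; memo keeps repeated edge
// checks O(N) amortised. Unknown (dangling) ids count as deleted, matching
// the catalog-corruption behaviour called out in review.
func subtreeAllDeleted(dag map[uint64]*branchNode, tid uint64, visited, memo map[uint64]bool) bool {
	if done, ok := memo[tid]; ok {
		return done
	}
	if visited[tid] {
		return true // back-edge on the current path: don't let a cycle veto
	}
	visited[tid] = true

	n, ok := dag[tid]
	if !ok {
		memo[tid] = true // dangling metadata: treated as deleted
		return true
	}
	all := n.deleted
	if all {
		for _, child := range n.children {
			if !subtreeAllDeleted(dag, child, visited, memo) {
				all = false
				break
			}
		}
	}
	memo[tid] = all
	return all
}
```

With this shape, the reclaim core would drop the `(parent, child, clone_ts)` edge only when `subtreeAllDeleted(dag, child, ...)` returns true, which is exactly why sibling and grand-descendant branches keep the snapshot alive.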
User surface
- `SHOW SNAPSHOTS` filters out `kind = 'branch'` rows.
- `DROP SNAPSHOT __mo_branch_<tid>` is rejected with "managed by data branch".
- `DATA BRANCH CREATE ... TO ACCOUNT <b>` anchors the snapshot on the parent's account, so GC retention applies where the parent's objects actually live; reclaim runs as sys to reach across accounts.
No schema migration
Reuses the existing `mo_snapshots.kind` column (defaults to `'user'`). No backfill for pre-existing branches — their LCA history has already been GC'd, and a snapshot at the original `clone_ts` would not resurrect it. The document advises `DROP` and recreate for affected branches.

Test coverage
Unit tests (9/9 pass)
`pkg/frontend/data_branch_snapshot_test.go` — sname format, DAG builder, `subtreeAllDeleted` on linear / branching DAGs, reclaim core drop-list, ancestor walk, dangling metadata handling, `doDropSnapshot` branch-kind rejection, `buildShowSnapShots` filter.

Engine tests (7/7 pass)
`pkg/vm/engine/test/branch_protect_snapshot_test.go` — create, reclaim on data branch delete, reclaim on plain `DROP TABLE`, cascaded subtree rule, cross-account create, cross-account drop-source-first, create-failed-rolls-back.

BVT (10 cases, 194 queries, 100% pass in 3 consecutive runs)
`test/distributed/cases/git4data/branch/protect/protect_1..10.{sql,result}` — `DATA BRANCH DELETE TABLE`, `DROP TABLE`, `DATA BRANCH CREATE/DELETE DATABASE`, batch create+reclaim, `DROP DATABASE` cascade reclaim, `TO ACCOUNT <b>` round-trip.

GC → diff regression (3/3 pass, 59 queries)
`test/distributed/cases/git4data/branch/diff/diff_9.sql` tightened with a new assertion that an update on a pre-branch PK (the exact shape of the bug) is still classified as `t1 UPDATE` after GC — this is the end-to-end verification that the feature actually fixes the reported bug.
Special notes for your reviewer:
- `cloneReceipt` grows three fields (`srcTableID`, `dstTableID`, `srcAccountName`) so the snapshot insert can reuse the IDs that `updateBranchMetaTable` already resolves; a sketch follows below.
- `(*Compile).reclaimBranchProtectSnapshots` is the single hook point added in `pkg/sql/compile/ddl.go`; it runs synchronously after the `UPDATE mo_branch_metadata SET table_deleted = true` statement.
- Design doc: `docs/design/data_branch_protect_snapshot.md`.