| 1 | +# Lightning primer for QA |
| 2 | + |
| 3 | +Background for [channel monitor desync repro](./repro-channel-monitor-desync.md) and any work that touches Lightning storage, migration, or startup. |
| 4 | + |
| 5 | +## What a Lightning channel is (operationally) |
| 6 | + |
| 7 | +Two parties lock funds in a 2-of-2 on-chain output. **Off-chain**, they exchange **commitment transactions** that encode “who gets what if we publish now.” Each new off-chain state is a **commitment update**. LDK tracks progress with an internal **`update_id`** (a monotonic counter per channel). |
| 8 | + |
| 9 | +- **ChannelManager** — current view of all channels, balances, and pending HTLCs. |
| 10 | +- **ChannelMonitor** — per-channel state used to watch the chain, enforce penalties, and react to force-closes. It must stay **consistent** with what the ChannelManager believes. |
| 11 | + |
| 12 | +The **chain::Watch** contract (simplified): durable storage must reflect the **latest** ChannelMonitor data **before** the app continues as if that state is live. If an old monitor is paired with an advanced manager, LDK reports **`DangerousValue`** and refuses to start, which protects funds. |
| 13 | + |
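The stale-monitor rule described above can be pictured with a toy model. This is a minimal sketch, not LDK code: `ChannelSnapshot`, `stale_channels`, and `can_start` are hypothetical names invented for illustration, and the only behavior modeled is "a monitor whose `update_id` is behind the manager's blocks startup."

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChannelSnapshot:
    """Toy stand-in for one channel's persisted state (not an LDK type)."""
    channel_id: str
    manager_update_id: int  # latest update_id the ChannelManager expects
    monitor_update_id: int  # latest update_id stored in the ChannelMonitor


def stale_channels(snapshots):
    """Return the IDs of channels whose monitor is behind the manager."""
    return [
        s.channel_id
        for s in snapshots
        if s.monitor_update_id < s.manager_update_id
    ]


def can_start(snapshots):
    """Model the refuse-to-start rule: any stale monitor blocks startup."""
    return not stale_channels(snapshots)
```

For example, `can_start([ChannelSnapshot("b", 42, 7)])` is `False` because the monitor (7) is behind the manager (42). The sketch deliberately ignores the opposite direction (a manager that is behind its monitor).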
| 14 | +## HTLCs |
| 15 | + |
| 16 | +**HTLC** means **Hash Time-Locked Contract**. It is a conditional payment: pay the peer if they reveal a preimage by a deadline; otherwise revert. HTLCs live **inside** commitment updates. Each hop of a multi-hop payment adds HTLCs; resolving them advances commitment state again. |
| 17 | + |
| 18 | +Testing with real payments matters because each payment usually causes **multiple** commitment updates; there is no 1:1 mapping of “one payment = one `update_id` step.” |
| 19 | + |
| 20 | +## “Gap” in the test matrix (e.g. 21 / 30 payments) |
| 21 | + |
| 22 | +The repro doc’s payment counts are a **proxy for many `update_id` advances**, not a magic number from BOLT math. |
| 23 | + |
| 24 | +- **Small** mismatch between an old backup and the live node may be **healed** via peer reconnection and commitment replay. |
| 25 | +- **Large** mismatch, or injecting a **stale monitor** on top of an **advanced** manager, triggers **stale ChannelMonitor** errors and a refused start until recovery. |
| 26 | + |
| 27 | +## What went wrong in the ChannelMonitor desync bug |
| 28 | + |
| 29 | +1. **ChannelManager** on device was **ahead** (normal usage after RN migration). |
| 30 | +2. **Old ChannelMonitor** data (e.g. from RN remote backup) was applied without matching the current manager. |
| 31 | +3. On load: monitor `update_id` ≪ manager → **stale monitor** → **`DangerousValue`** → node will not run. |
| 32 | + |
| 33 | +The **fix path** uses **`accept_stale_channel_monitors`** so ldk-node can align state and **self-heal** (commitment round-trips, chain sync). That is why recovery logs show retries, healing, and sometimes **over a minute** before balances and payments look normal — especially with **many blocks** to sync (e.g. T5) or **local LND** setups vs Blocktank-only flows. |
| 34 | + |
| 35 | +## What to test when Lightning / LDK storage changes |
| 36 | + |
| 37 | +| Area | Why | |
| 38 | +|------|-----| |
| 39 | +| **Cold start** | Any path that reads/writes ChannelManager, monitors, or VSS must not pair a **new** manager with an **old** monitor. | |
| 40 | +| **Backup / restore** | Restores must use **consistent snapshots**; restoring a partial or older monitor on its own is high risk. | |
| 41 | +| **Migration** | RN → native or schema changes: avoid overwriting live data with **stale** remote copies. | |
| 42 | +| **Recovery** | After `DangerousValue` / `accept_stale`: peers reconnect, chain sync completes, **inbound and outbound** payments work, **second launch** does not repeat recovery forever. | |
| 43 | +| **Infra noise** | On regtest, **stale RGS** / gossip can cause transient **“route not found”** — distinguish from persistence bugs (see logs for `DangerousValue` vs routing errors). | |
| 44 | + |
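To separate persistence failures from infra noise when reading logs, a small triage helper can be useful. The sketch below assumes plain-text log lines and matches only the two markers named in the table (`DangerousValue` and "route not found"); the function names and the sample log lines are hypothetical, and real log formats may differ.

```python
def classify_log_line(line):
    """Classify one log line as 'persistence', 'routing', or 'other'."""
    lowered = line.lower()
    if "dangerousvalue" in lowered:
        return "persistence"  # stale-monitor class of failure
    if "route not found" in lowered:
        return "routing"  # likely gossip / RGS noise
    return "other"


def triage(lines):
    """Count how many lines fall into each category."""
    counts = {"persistence": 0, "routing": 0, "other": 0}
    for line in lines:
        counts[classify_log_line(line)] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical sample lines, for illustration only.
    sample = [
        "ERROR node failed to start: DangerousValue",
        "WARN payment failed: route not found",
        "INFO chain sync complete",
    ]
    print(triage(sample))
```

A run with any `persistence` hits points at storage or restore logic, while `routing` hits alone suggest checking gossip freshness first.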
| 45 | +## Risks of incorrect “fixes” |
| 46 | + |
| 47 | +- Skipping or weakening persistence checks can lead to **wrong** enforcement keys or **missed** on-chain reactions. |
| 48 | +- Blindly merging backups can recreate the **stale monitor** class of bug. |
| 49 | +- Recovery paths should always be validated with **real sends/receives** and **restart**, not only “app opens.” |
| 50 | + |
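The recovery criteria from the test table can be written down as an explicit pass/fail check, so that "app opens" is never mistaken for a pass. This is a sketch under assumptions: `RecoveryRun` and `failed_checks` are hypothetical names, and the tester is expected to fill in the fields by hand or from their own tooling.

```python
from dataclasses import dataclass


@dataclass
class RecoveryRun:
    """Observations from one recovery test run (hypothetical structure)."""
    peers_reconnected: bool
    chain_synced: bool
    outbound_payment_ok: bool
    inbound_payment_ok: bool
    second_launch_recovered_again: bool  # True means recovery repeated


def failed_checks(run):
    """Return the names of the criteria that this run did not meet."""
    checks = [
        ("peers reconnected", run.peers_reconnected),
        ("chain sync complete", run.chain_synced),
        ("outbound payment", run.outbound_payment_ok),
        ("inbound payment", run.inbound_payment_ok),
        ("second launch repeated recovery",
         not run.second_launch_recovered_again),
    ]
    return [name for name, passed in checks if not passed]
```

A run passes only when `failed_checks(run)` returns an empty list; any returned name identifies which criterion to investigate.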
| 51 | +## Glossary |
| 52 | + |
| 53 | +| Term | Meaning | |
| 54 | +|------|--------| |
| 55 | +| **Commitment update** | New off-chain state (balances + HTLC set). | |
| 56 | +| **`update_id`** | Monotonic per-channel counter of state updates. LDK compares the value persisted in the ChannelMonitor with the value the ChannelManager expects for that channel. | |
| 57 | +| **HTLC** | **Hash Time-Locked Contract** — conditional payment inside a commitment (hash lock + time lock). | |
| 58 | +| **ChannelMonitor** | Per-channel persisted state for chain watching and dispute handling. | |
| 59 | +| **DangerousValue** | LDK/ldk-node refusing to load because continuing would violate safety assumptions (e.g. stale monitor). | |
| 60 | +| **accept_stale_channel_monitors** | Explicit recovery mode to load despite mismatch, then heal via protocol + sync (use only in controlled recovery). | |
| 61 | + |
| 62 | +## See also |
| 63 | + |
| 64 | +- [repro-channel-monitor-desync.md](./repro-channel-monitor-desync.md) — repro steps, matrix, recovery timing notes |