Commit b116c10

committed
update docs
1 parent 0cef6c1 commit b116c10

2 files changed

Lines changed: 72 additions & 1 deletion

docs/lightning-primer-for-qa.md

Lines changed: 64 additions & 0 deletions
@@ -0,0 +1,64 @@
1+
# Lightning primer for QA
2+
3+
Background for [channel monitor desync repro](./repro-channel-monitor-desync.md) and any work that touches Lightning storage, migration, or startup.
4+
5+
## What a Lightning channel is (operationally)
6+
7+
Two parties lock funds in a 2-of-2 on-chain output. **Off-chain**, they exchange **commitment transactions** that encode “who gets what if we publish now.” Each new off-chain state is a **commitment update**. LDK tracks progress with an internal **`update_id`** (a monotonic counter per channel).
8+
9+
- **ChannelManager** — current view of all channels, balances, and pending HTLCs.
10+
- **ChannelMonitor** — per-channel state used to watch the chain, enforce penalties, and react to force-closes. It must stay **consistent** with what the ChannelManager believes.
11+
12+
The **chain::Watch** contract (simplified): durable storage must reflect the **latest** ChannelMonitor data **before** the app continues as if that state is live. If an old monitor is paired with an advanced manager, LDK reports **`DangerousValue`** and refuses to start. That refusal protects funds.
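
The pairing rule can be pictured with a toy model. This is only a sketch and not LDK's API: `MonitorSnapshot`, `ManagerView`, and `checkPairing` are hypothetical names. The real check runs inside LDK when the ChannelManager is loaded against its monitors.

```typescript
// Toy model of the load-time pairing check. All names are hypothetical.
interface MonitorSnapshot {
  channelId: string;
  updateId: number; // latest update_id persisted for this monitor
}

interface ManagerView {
  // update_id the manager expects each channel's monitor to have reached
  expectedUpdateIds: Record<string, number>;
}

type PairingResult =
  | { ok: true }
  | { ok: false; error: "DangerousValue"; staleChannels: string[] };

function checkPairing(manager: ManagerView, monitors: MonitorSnapshot[]): PairingResult {
  const stale: string[] = [];
  for (const m of monitors) {
    const expected = manager.expectedUpdateIds[m.channelId];
    // A monitor that trails the manager is stale and must not be trusted.
    if (expected !== undefined && m.updateId < expected) {
      stale.push(m.channelId);
    }
  }
  return stale.length === 0
    ? { ok: true }
    : { ok: false, error: "DangerousValue", staleChannels: stale };
}
```

In this model a single stale monitor is enough to refuse the whole start, which mirrors the "refuses to start" behaviour described above.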
13+
14+
## HTLCs
15+
16+
**HTLC** means **Hash Time-Locked Contract**. It is a conditional payment: pay the peer if they reveal a preimage by a deadline; otherwise revert. HTLCs live **inside** commitment updates. Each hop of a multi-hop payment adds HTLCs; resolving them advances commitment state again.
17+
18+
Testing with real payments matters because each payment usually causes **multiple** commitment updates. There is no 1:1 mapping of “one payment = one `update_id` step.”
19+
20+
## “Gap” in the test matrix (e.g. 21 / 30 payments)
21+
22+
The payment counts in the repro doc are a **proxy for many `update_id` advances**, not a magic number from BOLT math.
23+
24+
- **Small** mismatch between an old backup and the live node may be **healed** via peer reconnection and commitment replay.
25+
- **Large** mismatch, or injecting a **stale monitor** on top of an **advanced** manager, triggers **stale ChannelMonitor** errors and a refused start until recovery.
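
As a rough illustration of why payment counts stand in for a gap, the arithmetic can be sketched as below. Both helpers are hypothetical. `updatesPerPayment` is an assumption to be measured from logs on the setup under test, not a protocol constant, and the document does not give a threshold that separates a "small" gap from a "large" one.

```typescript
// Hypothetical helper: turn a payment count into a rough update_id gap.
// updatesPerPayment is an assumed, measured value, not a BOLT constant.
function estimateUpdateIdGap(payments: number, updatesPerPayment: number): number {
  if (payments < 0 || updatesPerPayment < 1) {
    throw new RangeError("payments must be >= 0 and updatesPerPayment >= 1");
  }
  return payments * updatesPerPayment;
}

// Gap actually observed on a device: how far the monitor trails the manager.
// A monitor that is level with or ahead of the manager has a gap of 0.
function observedGap(managerUpdateId: number, monitorUpdateId: number): number {
  return Math.max(0, managerUpdateId - monitorUpdateId);
}
```

For example, with a made-up factor of 4 updates per payment, `estimateUpdateIdGap(21, 4)` yields 84, which is the kind of number to compare against `observedGap` read from logs.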
26+
27+
## What went wrong in the ChannelMonitor desync bug
28+
29+
1. **ChannelManager** on device was **ahead** (normal usage after the React Native (RN) migration).
30+
2. **Old ChannelMonitor** data (e.g. from RN remote backup) was applied without matching the current manager.
31+
3. On load: monitor `update_id` ≪ manager → **stale monitor** → **`DangerousValue`** → node will not run.
32+
33+
The **fix path** uses **`accept_stale_channel_monitors`** so ldk-node can align state and **self-heal** (commitment round-trips, chain sync). That is why recovery logs show retries, healing, and sometimes **over a minute** before balances and payments look normal — especially with **many blocks** to sync (e.g. T5) or **local LND** setups vs Blocktank-only flows.
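
The two-pass startup described here can be sketched as a toy model. The names `startWithRecovery`, `LoadOutcome`, and `StartReport` are hypothetical and do not come from ldk-node. The `load` callback stands in for whatever actually builds the node with or without `accept_stale_channel_monitors`.

```typescript
// Toy model of the strict-then-recover startup. Names are hypothetical.
type LoadOutcome = "ok" | "DangerousValue";

interface StartReport {
  started: boolean;
  usedAcceptStale: boolean;
  attempts: number;
}

function startWithRecovery(
  load: (acceptStaleChannelMonitors: boolean) => LoadOutcome,
): StartReport {
  // Pass 1: strict load. A healthy wallet never leaves this path.
  if (load(false) === "ok") {
    return { started: true, usedAcceptStale: false, attempts: 1 };
  }
  // Pass 2: explicit recovery mode, only after the strict load refused to start.
  const recovered = load(true) === "ok";
  return { started: recovered, usedAcceptStale: true, attempts: 2 };
}
```

Keeping the strict pass first means a healthy wallet is never loaded in recovery mode, which is the property the regression scenarios are meant to check.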
34+
35+
## What to test when Lightning / LDK storage changes
36+
37+
| Area | Why |
38+
|------|-----|
39+
| **Cold start** | Any path that reads/writes ChannelManager, monitors, or VSS must not pair **new** manager with **old** monitor. |
40+
| **Backup / restore** | Restores must use **consistent snapshots**; restoring a partial or older monitor on its own is high risk. |
41+
| **Migration** | RN → native or schema changes: avoid overwriting live data with **stale** remote copies. |
42+
| **Recovery** | After `DangerousValue` / `accept_stale`: peers reconnect, chain sync completes, **inbound and outbound** payments work, **second launch** does not repeat recovery forever. |
43+
| **Infra noise** | On regtest, **stale RGS** / gossip can cause transient **“route not found”** — distinguish from persistence bugs (see logs for `DangerousValue` vs routing errors). |
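
For the "Infra noise" row, a triage helper along these lines could separate the two failure classes when scanning logs. It is a minimal sketch: it assumes the log lines contain the substrings `DangerousValue` and `route not found` as quoted in this document, and the function and type names are hypothetical.

```typescript
// Hypothetical log triage: persistence failures vs routing noise.
type FailureKind = "persistence" | "routing" | "unknown";

function classifyLogLine(line: string): FailureKind {
  const lower = line.toLowerCase();
  // Stale-monitor refusals mention DangerousValue (assumed substring match).
  if (lower.indexOf("dangerousvalue") !== -1) return "persistence";
  // Stale gossip on regtest tends to surface as a routing failure.
  if (lower.indexOf("route not found") !== -1) return "routing";
  return "unknown";
}

function summarize(lines: string[]): Record<FailureKind, number> {
  const counts: Record<FailureKind, number> = { persistence: 0, routing: 0, unknown: 0 };
  for (const line of lines) {
    counts[classifyLogLine(line)] += 1;
  }
  return counts;
}
```

A run whose summary shows only `routing` hits points at gossip or RGS state rather than at storage, so it should not be filed as a persistence bug.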
44+
45+
## Risks of incorrect “fixes”
46+
47+
- Skipping or weakening persistence checks can lead to **wrong** enforcement keys or **missed** on-chain reactions.
48+
- Blindly merging backups can recreate the **stale monitor** class of bug.
49+
- Recovery paths should always be validated with **real sends/receives** and **restart**, not only “app opens.”
50+
51+
## Glossary
52+
53+
| Term | Meaning |
54+
|------|--------|
55+
| **Commitment update** | New off-chain state (balances + HTLC set). |
56+
| **`update_id`** | Monotonic per-channel counter that LDK persists with each ChannelMonitor update; comparing it with the ChannelManager's value for that channel shows whether the monitor is behind. |
57+
| **HTLC** | **Hash Time-Locked Contract** — conditional payment inside a commitment (hash lock + time lock). |
58+
| **ChannelMonitor** | Per-channel persisted state for chain watching and dispute handling. |
59+
| **DangerousValue** | LDK/ldk-node refusing to load because continuing would violate safety assumptions (e.g. stale monitor). |
60+
| **accept_stale_channel_monitors** | Explicit recovery mode to load despite mismatch, then heal via protocol + sync (use only in controlled recovery). |
61+
62+
## See also
63+
64+
- [repro-channel-monitor-desync.md](./repro-channel-monitor-desync.md) — repro steps, matrix, recovery timing notes

docs/repro-channel-monitor-desync.md

Lines changed: 8 additions & 1 deletion
@@ -1,5 +1,7 @@
11
# ChannelMonitor Desync: Repro, Recovery & Test Plan
22

3+
See also: [Lightning primer for QA](./lightning-primer-for-qa.md) (monitors, HTLCs, gaps, risks).
4+
35
Related issues:
46
- [#847 (bitkit-android)](https://github.com/synonymdev/bitkit-android/issues/847)
57
- iOS support ticket (user logs from 2026-03-18)
@@ -178,6 +180,10 @@ Fix branches:
178180

179181
Whether healed channels should be closed after recovery is under discussion. For testing: verify wallet is operational after recovery regardless of channel closure outcome. On-chain balance should be intact even if healed channels are subsequently closed.
180182

183+
### Timing (T2 in-place upgrade)
184+
185+
After installing v2.1.2 over a broken v2.1.0 wallet, Lightning can take **over one minute** to become ready: the first load hits `DangerousValue`, followed by an `accept_stale_channel_monitors` retry, chain sync, and monitor healing. This is **expected on both iOS and Android** and is not a hang. Wait for balances and channels to appear before running payments or automation.
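
For automation, the waiting rule can be expressed as a small decision function that a polling loop calls on each tick. This is a sketch only: `decidePoll` is a hypothetical name, and the budget is a parameter the caller chooses, since the document only states that readiness can take over one minute.

```typescript
// Hypothetical poll decision for an automation loop waiting on Lightning readiness.
type PollDecision = "proceed" | "keep-waiting" | "fail";

function decidePoll(ready: boolean, elapsedMs: number, budgetMs: number): PollDecision {
  // Balances and channels are visible: safe to run payments.
  if (ready) return "proceed";
  // Still healing: slow is expected, so only give up once the budget is spent.
  return elapsedMs < budgetMs ? "keep-waiting" : "fail";
}
```

With an assumed budget of 180000 ms, a wallet that is still not ready at 70000 ms yields `keep-waiting`, so the run is not reported as hung.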
186+
181187
---
182188

183189
## Test Plan
@@ -192,7 +198,7 @@ Matrix of upgrade/recovery scenarios to validate v2.1.2. Each scenario should be
192198
| B2 | Restore broken v2.1.0 wallet into v2.1.2 (clean install + restore) | ✅ Recovered |
193199
| B3 | Update broken v2.1.0 wallet to v2.1.2 (in-place upgrade) | ✅ Recovered |
194200
| B4 | v2.0.6 (wallet with gap) → v2.1.2 (skip v2.1.0) | ✅ No issues |
195-
| B5 | v2.0.6 (wallet with gap) → v2.1.1 → v2.1.2 |Recovered |
201+
| B5 | v2.0.6 (wallet with gap) → v2.1.1 → v2.1.2 | No issues |
196202
| B6 | v2.1.0 healthy wallet (no gap) → v2.1.2 (regression check) | ✅ No issues |
197203
| B7 | v2.1.0 broken wallet + 600 blocks mined → v2.1.2 (stale chain state) | ✅ Recovered |
198204

@@ -229,6 +235,7 @@ Matrix of upgrade/recovery scenarios to validate v2.1.2. Each scenario should be
229235

230236
| File | Purpose |
231237
|------|---------|
238+
| `docs/lightning-primer-for-qa.md` | Background: ChannelManager vs ChannelMonitor, HTLCs, gaps, test focus |
232239
| `test/specs/receive-ln-payments.e2e.ts` | Automated spec to receive N Lightning payments |
233240
| `wdio.no-install.conf.ts` | WDIO config that attaches to existing app (no reinstall) |
234241
| `docker/bitcoin-cli` | Local docker CLI with `openchannel`, `payinvoice`, `mine`, `send` commands |
