Skip to content

pampa: migrate FileId scheme from sequential to hash-based (bd-ky14a)#235

Open
cscheid wants to merge 1 commit into
mainfrom
feature/bd-ky14a-pampa-hash-fileids
Open

pampa: migrate FileId scheme from sequential to hash-based (bd-ky14a)#235
cscheid wants to merge 1 commit into
mainfrom
feature/bd-ky14a-pampa-hash-fileids

Conversation

@cscheid
Copy link
Copy Markdown
Member

@cscheid cscheid commented May 22, 2026

Summary

  • Pampa's ASTContext now derives FileIds via quarto_yaml::file_id_for_filename(name) (hashing the filename string) instead of using sequential FileId(0).
  • Eliminates the cross-parser FileId asymmetry the bd-1pwy8 work had to wrap around with (FileId, &Path) candidate pairs.
  • Drops the quarto_ast_reconcile::remap_file_ids(...) workarounds in include_expansion.rs and engine_execution.rs, and the parallel-SourceContext lockstep in include-expansion.

Why review carefully

There's a parallel source-location workstream in flight; this PR should be reviewed against it before merging. Specifically check:

  • Does the parallel work also touch ASTContext / SourceContext shape (struct fields)?
  • Does it introduce new FileId(0) literals that would need the same migration?
  • Does anything in the parallel work depend on FileId(0) as a stable serialization key (e.g. JSON wire format)?

The JSON wire format does change in a user-visible way: FileIds in sourceInfoPool.d go from small integers (0, 1, ...) to 64-bit hashes. The 62 pampa/snapshots/json/*.snap updates in this PR all reflect this. The hub-client SPA + q2-preview consumers were checked and don't depend on FileId values being small.

Contract tests (TDD anchor)

Five tests that fail on `main` with the expected `FileId(0)` vs `FileId(hash)` mismatch and turn green after the migration:

  1. `pampa::pandoc::ast_context::tests::bd_ky14a_with_filename_uses_quarto_yaml_file_id`
  2. `pampa::pandoc::ast_context::tests::bd_ky14a_source_context_indexed_by_hash_file_id`
  3. `quarto_core::bd_ky14a_file_id_contract::bd_ky14a_pampa_qmd_read_uses_quarto_yaml_file_id`
  4. `quarto_core::bd_ky14a_file_id_contract::bd_ky14a_sub_document_file_id_is_hash_based`
  5. `quarto_core::bd_ky14a_file_id_contract::bd_ky14a_fresh_source_context_renders_pampa_source_info`

Snapshot changes

62 `pampa/snapshots/json/*.snap` files updated. Diffs are exclusively `"d": 0` → `"d": ` substitutions in `sourceInfoPool` entries with `"t":0` (the Original-SourceInfo tag). Pool index values (other small integers in non-`d` positions) are unchanged. No structural/semantic changes.

Test plan

  • `cargo nextest run --workspace` — 9418 passed, 0 failed (5 net new contract tests)
  • `cargo xtask verify` — full pass (Rust + WASM + hub-client + tests)
  • Reviewer-side check that the JSON FileId format change is acceptable for parallel source-location work
  • Reviewer-side coordination on `ASTContext` / `SourceContext` struct shape if the parallel work also touches them

Plan document

`claude-notes/plans/2026-05-22-pampa-hash-fileids.md` — full motivation, design rationale, and work-items checklist (now all checked).

🤖 Generated with Claude Code

Pampa's ASTContext now derives FileIds via
`quarto_yaml::file_id_for_filename(name)` (hashing the filename
string) instead of using sequential `FileId(0)` for the document's
primary file. This makes pampa's `SourceInfo`s globally
interchangeable with `SourceInfo`s produced by quarto_yaml (and any
other parser using the same recipe), eliminating the
out-of-band-binding requirement that bd-1pwy8 had to work around
with `(FileId, &Path)` candidate pairs.

## What changed

### Core: ASTContext gains a primary_file_id field

`crates/pampa/src/pandoc/ast_context.rs`
- New private field `primary_file_id: FileId`, cached at
  construction so `current_file_id()` doesn't re-hash on every
  call.
- `new()` and `anonymous()` now delegate to `with_filename(name)`
  with the `<unknown>` / `<anonymous>` placeholder names.
- `with_filename` computes the FileId via
  `quarto_yaml::file_id_for_filename` and registers the
  SourceContext entry via `add_file_with_id`.
- New `from_parts(filenames, source_context, primary_file_id)`
  constructor for deserialization paths (the JSON reader uses
  this).

### Production sites updated to use hash-based FileIds

- `crates/pampa/src/readers/qmd.rs` — registers the input under
  `current_file_id()` (= hash) instead of the sequential 0.
- `crates/pampa/src/readers/json.rs` — when deserializing
  ASTContext, computes each file's hash FileId and uses the new
  `SourceContext::add_file_with_id_and_info` helper to preserve
  the line-break index.
- `crates/quarto-core/src/stage/stages/parse_document.rs` — the
  SourceContext built in the parse stage now uses
  `add_file_with_id(file_id_for_filename(source_name), …)` so
  it matches what pampa puts on the AST's SourceInfos.
- `crates/quarto-core/src/pipeline.rs` (q2-preview path) — same
  treatment.
- `crates/perf-harness/src/bin/parse_qmd_to_ast.rs` — replaces
  struct-literal ASTContext construction with `from_parts`.
- `crates/wasm-quarto-hub-client/src/lib.rs` — same.
- `crates/quarto-lsp-core/src/document.rs`
  (`Document::create_source_context*`) — registers the document
  under the hash FileId so analyze_document's SourceInfos resolve.
- `crates/quarto-core/src/transforms/attribution_render.rs` — the
  v1 "skip non-primary file" gate previously compared `file_id != 0`;
  now it compares against the AST-derived primary file_id.

### Simplifications enabled by the new scheme

- `crates/quarto-core/src/stage/stages/include_expansion.rs` —
  dropped the `quarto_ast_reconcile::remap_file_ids(FileId(0) →
  new_sequential_id)` workaround and the parallel-SourceContext
  lockstep with the `debug_assert_eq!` desync check. Included
  sub-documents now land at `FileId(hash(filename))` natively;
  registration in the parent's contexts is idempotent on FileId.
- `crates/quarto-core/src/stage/stages/engine_execution.rs` —
  same: dropped the `remap_file_ids(id.0 + 1)` shift; the
  intermediate `.rmarkdown` already has its hash FileId.

### Source-map helper

- `crates/quarto-source-map/src/context.rs` — new
  `add_file_with_id_and_info` method, combining the existing
  `add_file_with_id` (specific FileId, can panic if duplicate)
  with `add_file_with_info` (pre-computed FileInformation).
  Needed by the JSON deserializer's roundtrip path.

## Contract tests (the TDD anchor)

Five new tests across the workspace. Each one fails on `main` with
the expected `FileId(0)` vs `FileId(hash)` mismatch and turns
green after the migration:

1. `pampa::pandoc::ast_context::tests::bd_ky14a_with_filename_uses_quarto_yaml_file_id`
2. `pampa::pandoc::ast_context::tests::bd_ky14a_source_context_indexed_by_hash_file_id`
3. `quarto_core::bd_ky14a_file_id_contract::bd_ky14a_pampa_qmd_read_uses_quarto_yaml_file_id`
4. `quarto_core::bd_ky14a_file_id_contract::bd_ky14a_sub_document_file_id_is_hash_based`
5. `quarto_core::bd_ky14a_file_id_contract::bd_ky14a_fresh_source_context_renders_pampa_source_info`

## Pre-existing tests touched

- `pampa::tests::json_location_test::test_json_location_enabled`
  — assertion `location["f"] == 0` updated to `file_id_for_filename`.
- `pampa::tests::qmd_writer_source_info::round_trip_code_block_offset_accuracy`
  — test fixture's `SourceContext` now uses `add_file_with_id`
  with the hash FileId, matching the SourceInfos that come out
  of the parser.
- `quarto-core::stage::stages::engine_execution::tests::test_engine_execution_remaps_new_blocks_to_intermediate`
  — expected FileIds updated to the hashes of the two filenames
  involved.
- `quarto-core::stage::stages::include_expansion::tests::included_blocks_have_correct_file_id`
  — same.
- `pampa::pandoc::ast_context::tests::test_primary_file_id` /
  `test_current_file_id` — assert against the hash, not literal
  `FileId(0)`.

## Snapshot test changes

**62 snapshot files updated** in `crates/pampa/snapshots/json/`.
The diffs are exclusively `"d": 0` → `"d": <hash>` substitutions
in the `sourceInfoPool` entries for `Original` SourceInfos (entries
with `"t":0`). Pool *index* values (small integers in non-`d`
positions) are unchanged. No semantic changes — just the FileIds
that pampa now emits.

Affected snapshot files (sample, see git diff for the full set):

- `001.snap`, `002.snap`, `003.snap`
- `anchor-shorthand-{01..06}-*.snap` (6 files)
- `header-autoid-*.snap` (3 files)
- `horizontal-rules*.snap` (2 files)
- `html-comment-{01..48}-*.snap` (48 files)
- `math-with-attr.snap`, `table-alignment.snap`,
  `table-caption-attr.snap`, `yaml-tags.snap`

## Verification

- `cargo nextest run --workspace` → 9418 passed, 0 failed.
  (Net new: 5 contract tests; otherwise the count matches the
  bd-1pwy8 baseline.)
- `cargo xtask verify` (full pass — Rust + WASM + hub-client +
  tests) → green.

## Workflow

Per the plan, this PR should be **reviewed by the parallel
source-location workstream** before merging. Push as a feature
branch; do **not** merge to main locally.

Plan: claude-notes/plans/2026-05-22-pampa-hash-fileids.md

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant