
explorer: search as a real global filter (#234 Step 4) #251

Merged: rdhyee merged 15 commits into isamplesorg:main from rdhyee:feat/search-global-filter-a1 on Jun 1, 2026

@rdhyee rdhyee commented May 31, 2026

A1: search as a real global filter (#234 Step 4)

Makes a committed free-text search a global filter across every explorer surface, instead of an optional side-panel lookup. On search, buildSearchFilter() materializes a non-temp DuckDB table search_pids (one ILIKE scan over sample_facets_v2), and every surface then constrains via a cheap semi-join AND pid IN (SELECT pid FROM search_pids):

  • Samples table filters to the search ("N of M in this map view").
  • Globe enters point mode and renders only the matching dots (clusters can't be text-filtered).
  • Facet legend counts and stats scope to search ∩ viewport.
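
A minimal sketch of how one shared fragment can constrain every surface. The state shape {active, term, token, total} and the `AND pid IN (SELECT pid FROM search_pids)` clause come from this PR; the `searchFilter` object, the `viewportCountSQL` helper, and the `samples`/`latitude`/`longitude` names are hypothetical stand-ins, not the explorer's actual code.

```javascript
// Hypothetical stand-in for the published window.__searchFilter state.
const searchFilter = { active: false, term: "", token: 0, total: 0 };

// Returns an SQL fragment to append to an existing WHERE clause.
// `column` is the pid column (optionally alias-qualified) of the outer query.
function searchFilterSQL(column = "pid") {
  if (!searchFilter.active) return ""; // no committed search: no constraint
  return ` AND ${column} IN (SELECT pid FROM search_pids)`;
}

// Example surface query; table and column names are illustrative only.
function viewportCountSQL(bounds) {
  return (
    "SELECT COUNT(*) AS n FROM samples" +
    ` WHERE latitude BETWEEN ${bounds.south} AND ${bounds.north}` +
    ` AND longitude BETWEEN ${bounds.west} AND ${bounds.east}` +
    searchFilterSQL("pid")
  );
}
```

Because the fragment is empty when no search is committed, each surface can append it unconditionally and stays unchanged in the no-search case.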

Fixes the incoherence reported in #247 (table claimed unrelated viewport samples "match the current filters" during a search; interim honesty fix shipped as #250, this completes it).

What's verified

  • Globe coherence: bucchero → 2,693 pids, point mode, h3 cleared, samplePointsLen ≤ total. Confirmed headed + headless and on a live deploy (real HTTP/2 + 206 ranges, prod data).
  • Search perf — collapsed an A1 double-scan (pid-set build + side-panel results both scanned the 63 MB facets); the side panel now reads the materialized search_pids, so it's one facets scan per search, matching pre-A1. CI smoke gate (pottery) passes.
  • Filter coherence — facet legend counts now use the same padded viewport (VIEWPORT_PAD_FACTOR) as the table/heatmap/point-loader/stat, so legend == "N match" (was reading low: ~166 vs ~481 at a wide view).
  • Production-clean — all A1 debug instrumentation (a1dbg/__a1log/__a1globe + on-page panel) is gated behind ?debug=a1; default load has a clean global namespace and no overhead.

Included commits

data_base dev-override fix · debug gating + dev-probe removal · double-scan collapse · facet-padding coherence fix · dev verify infra (dev_server.py, tests/playwright/a1-verify.mjs + probes). The dev/test infra is optional — happy to drop dev_server.py / the probes if you'd prefer them out of upstream.

Known / deferred (not blockers, flagged for review)

  • Heatmap isn't search-aware yet (renderHeatmap omits searchFilterSQL), so the "filtered density" layer stays unfiltered under a committed search. Follow-up.
  • Selection revalidation on search change (clear a selection that's no longer in the filtered set).
  • Cold-search latency — A1 moves the un-indexed full-text ILIKE scan to the front of the common flow; the proper fix is the BM25 substrate (Explorer FTS Track 1b: Honesty fix for query-spec / live mismatch #168–172). The "Building search filter…" affordance masks it for now.

Relates to

#234 (umbrella), #247/#250, #248 (concept-URI search — a second producer of the same search_pids set), #249 (the "refactor explorer.qmd first?" question — this PR is a data point for it).

Staging

Deployed and verified on the rdhyee fork's GitHub Pages (same data/infra as isamples.org). Suggest squash-merge — the branch carries some WIP commits whose messages predate the fixes.

🤖 Generated with Claude Code

rdhyee and others added 15 commits May 29, 2026 16:12
…table surface

Strategy B: materialize search_pids (one ILIKE scan over facets_url) on a
committed search, then constrain surfaces with a cheap pid semi-join.

This increment (table surface, verified):
- buildSearchFilter/clearSearchFilter: non-temp search_pids table (DISTINCT,
  NOT NULL), token-versioned _next→swap, captures match total. Published on
  window.__searchFilter {active,term,token,total} + window.searchFilterSQL().
- doSearch builds the filter (shows "Building search filter…") then refreshes
  the table; clears it on empty/short submit.
- loadCount/loadPage semi-join on search_pids; summaryText → "N of M
  \"term\" matches in this map view" (replaces isamplesorg#250 interim copy).
- Dev probe cell (a1PersistenceProbe) — REMOVE before PR.

Verified on local build: bucchero → table shows only OpenContext Poggio
Civitate matches (2,693), no GEOME mollusks; non-temp table persists across
db.query() calls. Probe (isamplesorg#249 data): no coord-less matches, no dup pids,
broad-term max ~82k.

TODO (still in PR #1, NOT YET DONE): points loader, facet counts + cube
gating, stats, and C3 auto-point-mode so the globe isn't left unfiltered.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…e), C3 — globe path BUGGY

Adds to the working table surface:
- searchIsActive()/searchFilterSQL() cell-local helpers in the viewer cell.
- loadViewportSamples: semi-join on search_pids.
- updateCrossFilteredCounts: semi-join on both paths; gate off the cube
  fast-path AND the global baseline early-return when a search is active.
- applySearchFilterChange(): C3 orchestrator — force point mode on search,
  revert to altitude-appropriate mode on clear; refresh table+facets.
- camera-changed handler: latch point mode while a search is active.
- doSearch calls applySearchFilterChange after build / on clear.

KNOWN BUG (needs debugging): the GLOBE points render the UNFILTERED viewport
count (e.g. "5000 of 1,591,051") even though search is active and the table
correctly shows 2,693. C3 does not enter point mode at high altitude on boot
either (globe stays unfiltered clusters). Likely an async race between the
boot point-load / mode entry and the post-build applySearchFilterChange
(filter built ~40-90s into boot, after the camera has already settled). The
table surface (loadCount/loadPage) IS correctly filtered. Probe cell still
present (remove before PR).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ch-token staleness) + [A1dbg] logging; globe still not entering point mode — next: Codex rec #4 one-reconciler refactor

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…rver + deterministic A1 observability

- R2_BASE honors ?data_base= / localStorage ISAMPLES_DATA_BASE (default prod), so the explorer can read a local parquet mirror instead of 40-90s remote range-fetches.
- dev_server.py: range-capable (206) static server; stock python http.server returns 200 and breaks DuckDB-WASM partial reads.
- window.__a1log/__a1state + a1dbg() + on-page panel (?debug=a1) replace flaky console capture; window.__a1globe() exposes mode/point state for a Playwright harness.
- Converted [A1dbg] console.logs to a1dbg events at build/mode/point-load/discard points.
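
For readers unfamiliar with why a 200 breaks partial reads: a range-capable server has to answer a `Range: bytes=...` request with 206 and a matching `Content-Range`. The sketch below shows only that arithmetic for a single range, written in JavaScript for illustration; it is not dev_server.py's code, and `rangeResponse` is a hypothetical name.

```javascript
// Hypothetical helper: decide status and byte window for a single-range request.
// `size` is the file length in bytes.
function rangeResponse(rangeHeader, size) {
  const m = /^bytes=(\d*)-(\d*)$/.exec(rangeHeader || "");
  // No (or unparseable) Range header: serve the whole file with 200.
  if (!m || (m[1] === "" && m[2] === "")) {
    return { status: 200, start: 0, end: size - 1 };
  }
  let start;
  let end;
  if (m[1] === "") {
    // Suffix form "bytes=-N": the last N bytes of the file.
    start = Math.max(0, size - Number(m[2]));
    end = size - 1;
  } else {
    start = Number(m[1]);
    end = m[2] === "" ? size - 1 : Math.min(Number(m[2]), size - 1);
  }
  // Unsatisfiable window: 416 with the total size advertised.
  if (start > end || start >= size) {
    return { status: 416, contentRange: `bytes */${size}` };
  }
  return {
    status: 206,
    start,
    end,
    contentRange: `bytes ${start}-${end}/${size}`,
    contentLength: end - start + 1,
  };
}
```

A client that asks for a small window of a large parquet file and receives a 200 gets the full body instead, which is the failure mode the bullet above describes.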

NOTE: cold cost is init-dominated (DuckDB-WASM+Cesium+OJS ~40s) — mirror helps the DATA phase only; the real lever is load-once + in-page iteration. Mirror range verified (curl -r => 206) but a full end-to-end speedup run hung in init (shakedown tomorrow; check 0-byte current/wide.parquet).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…be coherence)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Working handoff docs for the search-as-global-filter (A1, isamplesorg#234 Step 4) work — branch state, the globe logjam + Codex's reconciler spec, the fast verify-loop, the performance model, and Eric isamplesorg#248 / isamplesorg#249. Strip before the A1 PR.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ckDB-WASM

The ?data_base=/data dev override produced root-relative parquet URLs
(/data/foo.parquet). DuckDB-WASM's httpfs reads those as a virtual-FS glob
("No files found that match the pattern") instead of fetching over HTTP, so
the local-mirror verify loop hung in init with zero /data fetches — the
"shakedown" symptom. Resolve a root-relative data_base against
location.origin so the ergonomic ?data_base=/data form works; the prod
default and absolute (http://...) overrides pass through unchanged.
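
The resolution rule described above can be sketched as a small pure function. `resolveDataBase` is a hypothetical name and the exact checks are assumptions; only the behavior (root-relative resolved against the origin, absolute URLs untouched) is taken from this commit message.

```javascript
// Hypothetical helper: turn a data_base override into an absolute URL prefix.
function resolveDataBase(dataBase, origin) {
  // Absolute http(s) URLs (the prod default and explicit overrides) pass through.
  if (/^https?:\/\//i.test(dataBase)) return dataBase;
  // Root-relative paths such as "/data" are resolved against the page origin,
  // so the reader sees a fetchable HTTP URL rather than a bare path.
  if (dataBase.startsWith("/")) return new URL(dataBase, origin).href;
  return dataBase;
}
```

In the page itself the second argument would be `location.origin`; for example, `resolveDataBase("/data", "http://localhost:8000")` yields `http://localhost:8000/data`.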

Verify-loop infra:
- dev_server.py: pin HTTP/1.1 (DuckDB's range reader expects keep-alive;
  curl-verified 206 + multi-request keep-alive). Local full-GET-vs-206 is a
  DuckDB-WASM heuristic and moot over localhost; validate ranges on deploy.
- tests/playwright/shakedown-206.mjs: headless boot+search probe (no popup).
  Confirms cold boot ~2.3s to live, bucchero search builds 2,693 pids ~9s.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Diagnoses whether a committed search renders sample points and whether the
result depends on camera altitude. Boots at a given alt/lat/lng, fires the
bucchero search, waits for the async point load to settle, and dumps the
__a1log event sequence + final __a1globe() state.

Finding: with a proper wait, the globe is A1-coherent at BOTH whole-globe
(9000 km → renders all 2693 pids; computeViewRectangle saturates, not null)
and zoomed-in (80 km → 2670 in-view) altitudes. The earlier "0 sample
points" was a measure-too-early artifact, not a bug. Suggests the C3 fixes
(4e79830) work in a foreground/headless context and the summary's "globe
won't enter point mode" was likely a backgrounded-tab rAF-freeze artifact.
Pending headed a1-verify.mjs verdict to rule out an animation-only race.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Default stays headed (real flyTo — what A1 is verified against). A headed
window that opens UNFOCUSED becomes a background tab → Chrome freezes its
rAF render loop → the page hangs mid-init (the same backgrounded-tab freeze
that corrupted the original logjam observations). HEADLESS=1 sidesteps that:
headless pages are always "active". Use it for CI / repeated runs; keep
headed for a real-animation spot check.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… dev probe

Pre-deploy cleanup (the summary's "don't ship" list):
- Remove the a1PersistenceProbe OJS cell — a one-time dev check that
  console-logged on every load and threw a Catalog Error (the design point
  it verified, non-temp tables persisting across DuckDBClient connections,
  is proven and load-bearing in production now).
- Gate the whole A1 observability block (a1dbg / __a1log / __a1state /
  __a1globe + on-page panel) behind ?debug=a1. Production users now get a
  clean global namespace and zero overhead; the Playwright harness opts in
  via ?debug=a1. All a1dbg?.() call sites already use optional chaining, so
  they are no-ops when the block doesn't run.

Verified: ?debug=a1 → a1-verify.mjs still ✅ COHERENT (2693 pts); no flag →
__a1globe/a1dbg/__a1log undefined, no panel, no probe console output.
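
A minimal sketch of the gating pattern, assuming the flag is read from the query string. The names `a1dbg` and `__a1log` and the optional-chaining call sites come from this commit; `installA1Debug`, `onSearchBuilt`, and the event shape are hypothetical.

```javascript
// Hypothetical installer: defines the debug globals only when ?debug=a1 is set.
function installA1Debug(globalObj, search) {
  const enabled = new URLSearchParams(search).get("debug") === "a1";
  if (!enabled) return false; // production: nothing is added to the namespace
  globalObj.__a1log = [];
  globalObj.a1dbg = (event, detail) => {
    globalObj.__a1log.push({ event, detail, t: Date.now() });
  };
  return true;
}

// Call sites use optional chaining, so they are no-ops when nothing is installed.
function onSearchBuilt(globalObj, total) {
  globalObj.a1dbg?.("search-built", { total });
}
```

In the browser the arguments would be `window` and `location.search`. With no flag, `a1dbg` stays undefined and the call site short-circuits without evaluating its arguments.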

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…rch_pids

doSearch scanned the 63 MB facets parquet TWICE per committed search: once in
buildSearchFilter (pid-set) and again for the side-panel results SELECT (+ a
third for the real-count COUNT when the 50-cap hit). On CI's smoke gate, the
broad "pottery" search blew the 90s budget (first A1 deploy failed there).

Fix: buildSearchFilter now materializes the side-panel columns (label, source,
place_name) and the relevance score IN THE SAME scan that builds the pid-set,
so the results SELECT and the COUNT read the small in-memory search_pids table
(aliased `s`) instead of re-scanning facets. One facets scan per search now,
matching pre-A1. sourceFilterSQL('s.source') + the bare-pid facetFilterSQL
compose unchanged; search_pids stays pid-keyed (dropped the weaker 5-col
DISTINCT — pid is unique, so the build is naturally one row per pid).

Verified locally (fast mirror): pottery 15.8s → 12.7s (build 6.9s + surface
updates); a1-verify still ✅ COHERENT; production-clean without ?debug=a1.
Note: the remaining time-to-results is buildSearchFilter + applySearchFilterChange
(globe/facet updates); if CI's smoke still exceeds budget, render the side
panel before applySearchFilterChange next.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…atch")

updateCrossFilteredCounts computed facet-legend counts over the EXACT viewport
(pad 0), while the samples-table COUNT, the point-mode loader, the "samples in
view" stat, and the heatmap all pad by VIEWPORT_PAD_FACTOR (0.3). Matching
samples in the 30% margin were counted by the table but not the legend, so the
legend read low: off-by-one at a Cyprus deep-zoom (13 vs 14), and ~166 vs ~481
for material=rock at a wide Red-Sea view (RY, live rdhyee deploy). Aligns the
last "in view" surface to the padded contract (isamplesorg#234 coherence).

Applies the parked facet_count_padding.patch (one line + the coherence
regression test) on the A1 branch, since the mismatch is live on the A1 deploy
and isamplesorg#234 is exactly "make filter semantics coherent across surfaces."

Verified at the reported view: facet Rock 167 → 496, now == table 496;
a1-verify still ✅ COHERENT.
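
To make the mismatch concrete, here is a hedged sketch of a padded viewport. `VIEWPORT_PAD_FACTOR` and its value 0.3 come from this commit message; the `padViewport` helper and the choice to grow each side by a fraction of the span are assumptions about how the padding is applied, and antimeridian wrapping is ignored.

```javascript
const VIEWPORT_PAD_FACTOR = 0.3; // value quoted in the commit message

// Hypothetical helper: grow a lat/lng rectangle by `factor` of its span per side.
function padViewport({ south, north, west, east }, factor = VIEWPORT_PAD_FACTOR) {
  const dLat = (north - south) * factor;
  const dLng = (east - west) * factor;
  return {
    south: Math.max(-90, south - dLat), // clamp latitude to valid range
    north: Math.min(90, north + dLat),
    west: west - dLng, // longitude wrap-around is not handled in this sketch
    east: east + dLng,
  };
}
```

A surface that counts with `factor = 0` while the others use the default sees a strictly smaller rectangle, which is why the legend could only read equal to or lower than the table.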

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Two issues from Codex's PR isamplesorg#251 review:

1. search_pids staging race — buildSearchFilter used a fixed `search_pids_next`
   name. Two overlapping searches could interleave so a later search swapped an
   earlier search's rows into `search_pids` under its own term (the token checks
   guard the publish, not the shared staging object). Use a token-scoped staging
   table `search_pids_next_${token}`, dropped in finally. Also stop DROPping the
   live `search_pids` on clear (an in-flight reader would throw) — flip
   active=false and leave it unreferenced until the next search replaces it.
   Verified: bucchero→soil back-to-back now publishes soil/2969 (its own count),
   not bucchero's under soil's term.

2. heatmap search-blind — renderHeatmap omitted searchFilterSQL and
   heatmapFilterHash omitted the search token, so the "filtered density" overlay
   stayed unfiltered under a committed search. Append window.searchFilterSQL('pid')
   to the heatmap aggregation and add the search token to the hash so it
   recomputes/re-keys on search commit/clear (isamplesorg#234 cross-surface coherence).

a1-verify still ✅ COHERENT.
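
The staging fix in item 1 can be illustrated with a synchronous, in-memory sketch. The token-scoped name `search_pids_next_${token}` and the {active, term, token, total} state come from this PR; the `SearchFilterState` class and its methods are hypothetical, and no database is involved.

```javascript
// In-memory stand-in for the published filter state (hypothetical, synchronous).
class SearchFilterState {
  constructor() {
    this.latestToken = 0;
    this.published = { active: false, term: "", token: 0, total: 0 };
  }

  // Start a build: bump the token and derive a staging name unique to it,
  // so two overlapping builds never write to the same staging table.
  begin(term) {
    const token = ++this.latestToken;
    return { token, term, stagingTable: `search_pids_next_${token}` };
  }

  // Publish only if no newer search started while this build was running.
  publish(build, total) {
    if (build.token !== this.latestToken) return false; // stale build: discard
    this.published = { active: true, term: build.term, token: build.token, total };
    return true;
  }
}
```

With two overlapping builds, the earlier one gets its own staging name and its publish is rejected, so the later term can never be paired with the earlier term's rows.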

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…le rows)

Round-2 of Codex's PR isamplesorg#251 review: the previous no-drop clear left the prior
search's rows in search_pids, and doSearch's side-panel SELECT reads
`FROM search_pids` directly (does NOT gate on __searchFilter.active) — so a
build failure could render the previous term's rows under the new term.

Chose Codex's empty-table alternative over the early-return built-guard: an
early return before the side-panel try would skip the isamplesorg#167 telemetry `finally`,
whereas CREATE OR REPLACE TABLE search_pids (...empty...) keeps both the
in-flight semi-join readers and the direct side-panel reader safely seeing zero
rows, and a build failure flows through the existing results.length===0 →
return-in-try → finally path with telemetry intact. Only clearSearchFilter
changes; no doSearch control-flow restructure.

Verified: a1-verify ✅ COHERENT; bucchero→clear→soil publishes soil/2969 (own
count, no stale rows); clearSearchFilter is only called on empty-submit +
build-failure, not per search.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Addresses Codex's remaining nit (PR isamplesorg#251, non-blocking): since
clearSearchFilter() now leaves search_pids EMPTY, a genuine build failure and a
true empty result set both reach the side-panel's results.length===0 branch. A
`searchFilterBuildFailed` flag (set in the build catch) makes the panel say
"Search error: couldn't build the filter…" on a real failure while still
flowing through the isamplesorg#167 telemetry finally — instead of the misleading
"No results for {term}".

a1-verify still ✅ COHERENT.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

rdhyee commented May 31, 2026

Multi-AI review cycle → dual approval

Ran an iterative Codex (gpt-5.4) review/revise loop on this PR alongside Claude's own review. Codex found genuine issues across three rounds; all fixed and empirically re-verified (the a1-verify.mjs coherence harness stayed green throughout, and overlap/clear cycles were checked directly):

| Round | Finding | Fix | Commit |
| --- | --- | --- | --- |
| 1 | Staging-table race: a fixed search_pids_next name let two overlapping searches cross-contaminate (a later build could swap an earlier build's pids into search_pids under its own term; token checks guarded the publish, not the shared staging object) | token-scoped search_pids_next_${token}, dropped in finally | a576dea |
| 1 | Heatmap search-blind: renderHeatmap omitted searchFilterSQL and heatmapFilterHash omitted the search token, so the density overlay stayed unfiltered under a committed search | append the semi-join + add the active-search token to the hash | a576dea |
| 2 | Stale direct-reader: leaving search_pids in place on clear let the side-panel SELECT FROM search_pids render the previous term's rows on a build failure | replace with an empty same-shape table on clear (chosen over an early-return guard so the #167 telemetry finally still runs) | 8a9a1d3 |
| 3 (nit) | a genuine build failure read as "No results" | searchFilterBuildFailed flag → "Search error" | 0a91361 |

Verified:

  • a1-verify.mjs: ✅ A1 COHERENT (headed + headless + live rdhyee deploy).
  • Overlap: bucchero→soil back-to-back publishes soil/2969 (its own count), not bucchero's under soil's term.
  • Clear cycle: bucchero→clear→soil clean, no stale rows.

Verdicts: Claude — approve; Codex — APPROVE (no remaining requested changes). Deployed and verified on the rdhyee fork's GitHub Pages (same data/infra as isamples.org).
