diff --git a/COLLECTION_FACET_CODEX_PROMPT.md b/COLLECTION_FACET_CODEX_PROMPT.md new file mode 100644 index 00000000..43fc9581 --- /dev/null +++ b/COLLECTION_FACET_CODEX_PROMPT.md @@ -0,0 +1,113 @@ +# Codex prompt — Option A: first-class `collection` facet in the iSamples explorer + +> Paste the block below into Codex (run from `~/C/src/iSamples/isamplesorg.github.io`, +> which has `AGENTS.md`; the repo root has `.codex/config.toml` with Playwright MCP). +> Tracks issue isamplesorg/isamplesorg.github.io#243. Plan-first with a sign-off gate. + +--- + +``` +GOAL +Add a first-class "collection" dimension to the iSamples interactive explorer +(explorer.qmd) so users can filter samples to a named collection — e.g. the +OpenContext project "PKAP Survey Area" — and layer the existing material / +context / object_type facets on top. Full background, data analysis, and the +two-phase plan are in issue #243. + +DO THIS IN TWO STAGES. Stage 1: produce a written implementation plan and STOP +for my sign-off. Stage 2 (only after I approve): implement. + +=== KEY DESIGN FACTS (already verified — do not re-derive) === +- A "collection" is the `label` of a SamplingSite entity. It is NOT on the + MaterialSampleRecord rows; it is reached by traversal: + MaterialSampleRecord.p__produced_by[1] -> SamplingEvent + SamplingEvent.p__sampling_site[1] -> SamplingSite.label + (All within the wide parquet; `otype` column distinguishes entity types.) +- Cardinality: ~60,268 distinct SamplingSite labels; only ~1.63M of 6.35M + samples have a site (sparse facet, mostly OpenContext). PKAP = 15,446 samples. +- Doing this traversal LIVE in DuckDB-WASM per interaction is NOT viable (it is + the array-join pattern profiled as the in-browser bottleneck). MUST precompute. +- Data is served from https://data.isamples.org/ (Cloudflare Worker -> R2). + NEVER reference raw pub-*.r2.dev URLs. 
+ +=== HOW FACETS WORK TODAY (anchors in explorer.qmd) === +- Parquet URL constants: R2_BASE (:683), wide_url=/current/wide.parquet (:690), + facets_url=…sample_facets_v2.parquet (:692), facet_summaries_url (:693), + cross_filter_url (:695), vocab_labels_url (:698), lite_url (:687), + h3_res{4,6,8}_url (:684-686). +- The facet filter predicate (:942): + AND pid IN (SELECT DISTINCT pid FROM read_parquet('${facets_url}') + WHERE ) + i.e. per-sample facet values live in sample_facets_v2.parquet, keyed by pid. +- Facet checkbox lists + counts are rendered by renderFilter(...) (:~1792) from + facet_summaries (value -> count); cross-filtered counts use facet_cross_filter. +- material/context/object_type values are vocabulary URIs labeled via + vocab_labels.parquet. NOTE: a collection's "value" is a SamplingSite identity + (site_id) labeled from the NEW collections dimension below, NOT a vocab URI. +- URL/state contract is normative in EXPLORER_STATE.md. The five query params + today are search, sources, material, context, object_type (+ search_scope). + A new `collection` param must follow the SAME lifecycle as `material`: + applyQueryToFacetFilters (hydrate), handleFacetFilterChange -> + writeQueryState() (write-back), cross-filter count recompute, param removed + when empty. Honor the Quarto `?q=` collision note (use `collection`, not `q`). +- Cluster-mode honesty: H3 summary parquets only carry dominant_source, so + material/context/object_type filters do NOT affect zoomed-out clusters (the + #facetNote). A `collection` facet inherits this unless collection is also + added to the H3 summaries; call this out and do not silently break the note. + +=== STEP 0 (do first, report findings) === +Locate the build pipeline that PRODUCES the supplementary parquets +(sample_facets_v2, samples_map_lite, h3_summary_res{4,6,8}, facet_summaries, +facet_cross_filter) and uploads them to R2. They are NOT in this repo's +scripts/. 
Search the sibling repos and data dirs: + ~/C/src/iSamples/{isamples-python,pqg,isamplesorg.github.io-duckdb-spike} + ~/Data/iSample/ (esp. pqg_refining/) +and any notebooks. Also read workers/data-isamples-org/README.md for the R2 +serving/versioning layer. Report exactly how each file is built and uploaded, +or state that a build path must be created from scratch. + +=== STAGE 1 DELIVERABLE: a written plan covering === +1. Build: a new script (e.g. scripts/build_collections.py) that, from + /current/wide.parquet, computes per-sample (pid -> site_id, site_label) via + the traversal, and emits: + a) collections.parquet — dimension, ~60K rows: + site_id, label, source, n_samples, centroid_lat, centroid_lng, + bbox(min/max lat/lng). Powers the "search the long tail" half of the UX + and the Featured-Collections presets (collections.qmd). + b) an added `site_id` (+ maybe site_label) column on sample_facets_v2 + (regenerate as v3 if v2's builder is unavailable; keep pid as the key so + the :942 predicate extends with one more AND condition). + c) collection rows in facet_summaries (site_id -> count) so the checkbox + list + counts render via the existing machinery. Decide whether to add + collection to facet_cross_filter now or defer (note the consequence). + Define a stable site_id (hash of label, or the SamplingSite pid). Specify + versioned filenames + the /current alias, consistent with existing files. +2. Explorer wiring (explorer.qmd), mirroring `material` exactly: + - new collection facet container + a `?collection=` URL param on the + EXPLORER_STATE.md lifecycle. + - DUAL UX (my decision): top-N collections (>= a sample-count threshold) as + checkboxes reusing renderFilter; PLUS a type-to-search input over + collections.parquet for the long tail (60K). Specify how a search-selected + collection becomes an active filter value alongside the checkboxes. 
+ - extend the :942 predicate (or facets subquery) with the collection + condition; ensure cross-filter counts and #facetNote stay correct. +3. data.qmd + collections.qmd updates: document collections.parquet; once the + facet exists, upgrade the Featured-Collections preset links from + geographic-only to a real &collection= filter. +4. Test plan: extend tests/ (pytest + Playwright). At minimum a Playwright check + that ?collection= yields the PKAP sample set and that layering + ?material=… narrows it; reproducible DuckDB snippets for the counts. +5. Risks / migration: snapshot-version coupling (site_id stability across + rebuilds), the sparse-facet UX for non-collection sources, cluster-mode + honesty, and file-size deltas. + +=== CONSTRAINTS === +- Read AGENTS.md, ../CLAUDE.md, EXPLORER_STATE.md before planning. +- explorer.qmd is ~3,500 lines of working OJS/JS — make INCREMENTAL, additive + changes mirroring existing facet code; do not refactor working paths. +- Quarto OJS gotcha: cells use `name = value`, NOT top-level const/let/var. +- Static site, no hot reload: note where `quarto preview` + browser refresh is + needed to verify. +- Verify against https://data.isamples.org/ only; never raw pub-*.r2.dev. +- STOP after the Stage 1 plan and wait for my approval before writing code. +``` diff --git a/EXPLORER_STATE.md b/EXPLORER_STATE.md index 253ee32e..a3cfc1f8 100644 --- a/EXPLORER_STATE.md +++ b/EXPLORER_STATE.md @@ -34,6 +34,7 @@ citations. 
| `material` | DOM `#materialFilterBody` checkboxes | omitted (= no filter) | CSV of full URIs | `applyQueryToFacetFilters()` at end of `facetFilters` (`:1061`) | `writeQueryState()` from `handleFacetFilterChange` (`:1642`) | none — checkbox `value` already constrained by render | empty checked set ⇒ param removed (`:459`) | | `context` | DOM `#contextFilterBody` checkboxes | omitted | CSV of full URIs | same as `material` | same as `material` | none | same | | `object_type` | DOM `#objectTypeFilterBody` checkboxes | omitted | CSV of full URIs | same as `material` | same as `material` | none | same | +| `collection` | DOM `#collectionFilterBody` checkboxes | omitted (= no filter) | CSV of `collection_id`s (16-hex) | `applyQueryToFacetFilters()` (after the `facetFilters` cell renders top-N ∪ URL ids) | `writeQueryState()` from `handleFacetFilterChange` | none | #243. Values are collection ids from `collections.parquet`, NOT vocab URIs. Filters via a 2nd subquery in `facetFilterSQL()` against `sample_collections.parquet`. NOT cross-filtered (no cross_filter cache); counts shown are the collection's static total. The `#collectionSearch` box adds long-tail rows beyond the top-N checkboxes | | ~~`view`~~ | _removed in mockup-v1 (#200)_ | — | — | — | — | — | The Globe/Table toggle is gone — the samples table is now permanent below the globe. `writeQueryState()` does `params.delete('view')` to canonicalize legacy bookmarks. 
See §6 "Mockup-v1 addendum" | | `search_scope` | local closure `_searchScope` in `zoomWatcher` | omitted (= `world`) | `area` only; absent ⇒ world | `_searchScope` hydrated at top of `zoomWatcher` from `params.get('search_scope')` | `persistSearchScope()` from `doSearch()` and button clicks | exact match `'area'` | sidebar `#sampleSearchSidebar` Enter always submits `world`, never `area` — see §6 mockup-v1 addendum | | `page` | inner closure `let page = 0` in `tableView` | not in URL | — | — | resets to 0 on `refreshTable()`; ±1 on prev/next | clamped to `[0, totalPages-1]` | **#163 item 6** — table page is intentionally not URL state today; if/when added, must coexist with the cross-filter contract below | diff --git a/_quarto.yml b/_quarto.yml index 685a07f9..140713d9 100644 --- a/_quarto.yml +++ b/_quarto.yml @@ -14,6 +14,8 @@ website: text: Home - href: explorer.qmd text: Interactive Explorer + - href: collections.qmd + text: Collections - text: How to Use menu: - text: Overview diff --git a/collections.qmd b/collections.qmd new file mode 100644 index 00000000..e2fa2797 --- /dev/null +++ b/collections.qmd @@ -0,0 +1,62 @@ +--- +title: "Featured Collections" +subtitle: "Jump straight to well-known sample collections on the interactive globe" +toc: true +categories: [explore, collections] +--- + +::: {.callout-note} +**Identity-based collection filtering** (issue +[#243](https://github.com/isamplesorg/isamplesorg.github.io/issues/243)). Each +link applies the explorer's `collection` facet (`&collection=`) so you see +*exactly* that collection's samples — not just whatever is near a location — and +flies the globe to the collection's centroid. From there, layer on the Material, +Sampled Feature, or Specimen Type facets to narrow further. +::: + +## How to use these + +1. Click **Open in Explorer** — the `collection` facet filters to exactly that + collection's samples and the globe flies to its centroid in point mode. +2. 
**Layer on facets**: open the *Material*, *Sampled Feature*, or *Specimen + Type* panels and check values to narrow within the collection. +3. **Find any collection** — in the explorer, open the **Collection** panel and + type in its search box; the top ~100 collections also appear as checkboxes. +4. **Share what you see** — the URL captures the full view (`collection` + + other facets + camera), so you can bookmark or send any state you reach. + +## Featured collections + +These are the largest OpenContext project areas in the current snapshot +(`202604`), by sample count. + +| Collection | Source | Samples | | +|---|---|---:|---| +| **PKAP — Pyla-Koutsopetria Survey Area** (Cyprus) | OpenContext | 15,446 | [Open in Explorer](explorer.html?collection=dd74c71982da0e21#v=1&lat=34.9836&lng=33.7071&alt=40000&mode=point) | +| Çatalhöyük (Turkey) | OpenContext | 145,884 | [Open in Explorer](explorer.html?collection=20365f0e3b27dc8e#v=1&lat=37.6682&lng=32.8272&alt=40000&mode=point) | +| Petra Great Temple (Jordan) | OpenContext | 108,846 | [Open in Explorer](explorer.html?collection=1ef8673aa89023c1#v=1&lat=30.3287&lng=35.4421&alt=40000&mode=point) | +| Polis Chrysochous (Cyprus) | OpenContext | 52,252 | [Open in Explorer](explorer.html?collection=756f324a7d902068#v=1&lat=35.0349&lng=32.4218&alt=40000&mode=point) | +| Kenan Tepe (Turkey) | OpenContext | 42,294 | [Open in Explorer](explorer.html?collection=732469b20b632815#v=1&lat=37.8307&lng=40.8137&alt=40000&mode=point) | +| Poggio Civitate (Italy) | OpenContext | 41,679 | [Open in Explorer](explorer.html?collection=a5e653d3b3704b95#v=1&lat=43.1529&lng=11.4016&alt=40000&mode=point) | +| Ilıpınar (Turkey) | OpenContext | 36,947 | [Open in Explorer](explorer.html?collection=2308de8c25a27090#v=1&lat=40.4683&lng=29.3091&alt=40000&mode=point) | +| Čḯxwicən (Washington, USA) | OpenContext | 29,793 | [Open in Explorer](explorer.html?collection=84eb590024898ba9#v=1&lat=48.1315&lng=-123.4628&alt=40000&mode=point) | +| Heit 
el-Ghurab / Giza (Egypt) | OpenContext | 28,940 | [Open in Explorer](explorer.html?collection=cb1775e663696ce6#v=1&lat=29.9711&lng=31.1413&alt=40000&mode=point) | +| Domuztepe (Turkey) | OpenContext | 22,394 | [Open in Explorer](explorer.html?collection=d452bbb04ea0d100#v=1&lat=37.3226&lng=37.0349&alt=40000&mode=point) | +| Forcello Bagnolo San Vito (Italy) | OpenContext | 18,573 | [Open in Explorer](explorer.html?collection=c59e2c8620cde574#v=1&lat=45.0897&lng=10.8754&alt=40000&mode=point) | +| Chogha Mish (Iran) | OpenContext | 16,827 | [Open in Explorer](explorer.html?collection=49e189be61689b3d#v=1&lat=32.2240&lng=48.5559&alt=40000&mode=point) | + +## What a preset URL is made of + +``` +explorer.html + ?collection=dd74c71982da0e21 # the collection facet (PKAP Survey Area) + #v=1 # hash schema version + &lat=34.9836&lng=33.7071 # camera target (collection centroid) + &alt=40000 # 40 km altitude → point mode + &mode=point # force individual sample dots +``` + +The `collection` value is a stable id (a hash of source + collection name) from +`collections.parquet`. To build your own view, apply any combination of facets +and camera in the explorer, then copy the browser's URL — every part of the +state is encoded there. diff --git a/data.qmd b/data.qmd index be0c0751..f1ceb6b3 100644 --- a/data.qmd +++ b/data.qmd @@ -57,6 +57,7 @@ cite `https://data.isamples.org/`. | Aggregate map clusters by zoom | [`h3_summary_res{4,6,8}.parquet`](https://data.isamples.org/isamples_202601_h3_summary_res4.parquet) | ≤ 2.4 MB each | | Filter by material / context / object-type | [`sample_facets_v2.parquet`](https://data.isamples.org/isamples_202601_sample_facets_v2.parquet) | 63 MB | | Walk relationships (graph queries) | [`isamples_202512_narrow.parquet`](https://data.isamples.org/isamples_202512_narrow.parquet) | 820 MB | +| Browse / filter by collection (e.g. 
an OpenContext project) | [`collections.parquet`](https://data.isamples.org/isamples_202604_collections.parquet) + [`sample_collections.parquet`](https://data.isamples.org/isamples_202604_sample_collections.parquet) | 3 MB + 13 MB | | Translate vocabulary URIs to human-readable labels | [`vocab_labels.parquet`](https://data.isamples.org/vocab_labels.parquet) | 58 KB | ## 3. Copy-pasteable DuckDB snippets diff --git a/explorer.qmd b/explorer.qmd index c0b5c9e4..60b7b88e 100644 --- a/explorer.qmd +++ b/explorer.qmd @@ -640,6 +640,16 @@ Specimen Type Loading... +
+
+Collection +
+ +
@@ -696,6 +706,18 @@ cross_filter_url = `${R2_BASE}/isamples_202601_facet_cross_filter.parquet` // SKOS prefLabels for Material / Sampled Feature / Specimen Type URIs. // ~60 KB lookup; falls back to URI tail if a URI isn't covered. vocab_labels_url = `${R2_BASE}/vocab_labels.parquet` +// Collection facet (#243). Additive files built by scripts/build_collections.py +// from the wide parquet's Sample→Event→Site traversal — they touch none of the +// existing facet files. `collections` is the dimension (collection_id, label, +// source, n_samples, centroid_lat/lng, bbox); `sample_collections` maps +// pid → collection_id. A "collection" is a SamplingSite *label* (e.g. the +// OpenContext project "PKAP Survey Area"), keyed by a stable hash of +// (source, label). +collections_url = `${R2_BASE}/isamples_202604_collections.parquet` +sample_collections_url = `${R2_BASE}/isamples_202604_sample_collections.parquet` +// How many top collections (by sample count) render as checkboxes; the long +// tail (~60K) is reachable via the search box. +COLLECTION_FACET_TOPN = 100 // Canonical palette — see issue #113. Path-relative so this works under // both isamples.org (custom domain at root) and project-pages fork @@ -805,6 +827,10 @@ function applyQueryToFacetFilters() { setCheckedValues('materialFilterBody', csvParamValues(params, 'material')); setCheckedValues('contextFilterBody', csvParamValues(params, 'context')); setCheckedValues('objectTypeFilterBody', csvParamValues(params, 'object_type')); + // Collection checkboxes are rendered as the union of top-N and the URL's + // collection ids (see facetFilters cell), so the values below already have + // matching rows by the time this runs. 
+ setCheckedValues('collectionFilterBody', csvParamValues(params, 'collection')); } @@ -823,6 +849,7 @@ function writeQueryState() { ['material', 'materialFilterBody'], ['context', 'contextFilterBody'], ['object_type', 'objectTypeFilterBody'], + ['collection', 'collectionFilterBody'], ].forEach(([key, containerId]) => { const values = getCheckedValues(containerId); if (values.length > 0) params.set(key, values.join(',')); @@ -881,7 +908,8 @@ function getCheckedValues(containerId) { function hasFacetFilters() { return getCheckedValues('materialFilterBody').length > 0 || getCheckedValues('contextFilterBody').length > 0 - || getCheckedValues('objectTypeFilterBody').length > 0; + || getCheckedValues('objectTypeFilterBody').length > 0 + || getCheckedValues('collectionFilterBody').length > 0; } // Single source of truth for #facetNote visibility. The note ("filter @@ -938,8 +966,20 @@ function facetFilterSQL() { const list = ot.map(s => `'${escSql(s)}'`).join(','); conds.push(`object_type IN (${list})`); } - if (conds.length === 0) return ''; - return ` AND pid IN (SELECT DISTINCT pid FROM read_parquet('${facets_url}') WHERE ${conds.join(' AND ')})`; + let sql = ''; + if (conds.length > 0) { + sql += ` AND pid IN (SELECT DISTINCT pid FROM read_parquet('${facets_url}') WHERE ${conds.join(' AND ')})`; + } + // Collection facet (#243) lives in its own membership file, so it appends a + // second independent subquery rather than a column in `facets_url`. Multiple + // checked collections are OR'd (IN list); they AND with the material/etc. + // predicate above. + const coll = getCheckedValues('collectionFilterBody'); + if (coll.length > 0) { + const list = coll.map(s => `'${escSql(s)}'`).join(','); + sql += ` AND pid IN (SELECT pid FROM read_parquet('${sample_collections_url}') WHERE collection_id IN (${list}))`; + } + return sql; } // Shared viewport-padding factor. 
The samples table (PR #219), the @@ -1792,6 +1832,102 @@ facetFilters = { renderFilter('materialFilterBody', 'material', grouped.material); renderFilter('contextFilterBody', 'context', grouped.context); renderFilter('objectTypeFilterBody', 'object_type', grouped.object_type); + + // --- Collection facet (#243): top-N checkboxes + search-the-tail --- + // Reads from the additive collections.parquet dimension. Counts here are + // the collection's total sample count (static); unlike material/context/ + // object_type they are NOT cross-filtered (no cross_filter cache for + // collections yet) — the dots and table still respect the filter via + // facetFilterSQL(). data-facet="collection" keeps applyFacetCounts() from + // touching these rows. + try { + const collBody = document.getElementById('collectionFilterBody'); + const collSearch = document.getElementById('collectionSearch'); + const collResults = document.getElementById('collectionSearchResults'); + const urlCollIds = csvParamValues(new URLSearchParams(location.search), 'collection') || []; + + const collRowHtml = (id, label, count) => + ``; + + // Top-N by sample count, plus any ids named in the URL so deep links + // restore long-tail selections that aren't in the top-N. + const topRows = await db.query(` + SELECT collection_id, label, n_samples + FROM read_parquet('${collections_url}') + ORDER BY n_samples DESC + LIMIT ${COLLECTION_FACET_TOPN} + `); + const seen = new Set(topRows.map(r => r.collection_id)); + let extraRows = []; + const missing = urlCollIds.filter(id => !seen.has(id)); + if (missing.length) { + const list = missing.map(s => `'${escSql(s)}'`).join(','); + extraRows = await db.query(` + SELECT collection_id, label, n_samples + FROM read_parquet('${collections_url}') + WHERE collection_id IN (${list}) + `); + } + const allRows = extraRows.concat(topRows); + if (collBody) { + collBody.innerHTML = allRows.length + ? 
allRows.map(r => collRowHtml(r.collection_id, r.label, r.n_samples)).join('') + : 'No collections'; + } + + // Search box → query the full dimension by label; clicking a result + // injects a checked row into the body (if absent) and fires the same + // change event the checkboxes do, so the existing handler reruns. + if (collSearch && collResults && collBody) { + let collSearchTimer = null; + const runCollSearch = async () => { + const term = collSearch.value.trim(); + if (term.length < 2) { collResults.style.display = 'none'; collResults.innerHTML = ''; return; } + const esc = escapeIlikePattern(term); + const rows = await db.query(` + SELECT collection_id, label, n_samples + FROM read_parquet('${collections_url}') + WHERE label ILIKE '%${esc}%' ESCAPE '\\' + ORDER BY n_samples DESC + LIMIT 25 + `); + collResults.innerHTML = rows.length + ? rows.map(r => + `
${escText(r.label)} (${Number(r.n_samples).toLocaleString()})
` + ).join('') + : 'No matches'; + collResults.style.display = 'block'; + }; + collSearch.addEventListener('input', () => { + clearTimeout(collSearchTimer); + collSearchTimer = setTimeout(runCollSearch, 250); + }); + collResults.addEventListener('click', (ev) => { + const row = ev.target.closest('.collection-search-row'); + if (!row) return; + const id = row.getAttribute('data-id'); + const selector = `input[type="checkbox"][value="${id}"]`; + let cb = collBody.querySelector(selector); + if (!cb) { + collBody.insertAdjacentHTML('afterbegin', + collRowHtml(id, row.getAttribute('data-label'), row.getAttribute('data-count'))); + cb = collBody.querySelector(selector); + } + if (cb && !cb.checked) { + cb.checked = true; + collBody.dispatchEvent(new Event('change', { bubbles: true })); + } + collSearch.value = ''; + collResults.style.display = 'none'; + collResults.innerHTML = ''; + }); + } + } catch (err) { + console.warn('collection facet setup failed:', err); + const collBody = document.getElementById('collectionFilterBody'); + if (collBody) collBody.innerHTML = 'Collections unavailable'; + } + applyFacetCounts('source', null); applyQueryToFacetFilters(); @@ -2960,6 +3096,7 @@ zoomWatcher = { material: getCheckedValues('materialFilterBody').slice().sort(), context: getCheckedValues('contextFilterBody').slice().sort(), object_type: getCheckedValues('objectTypeFilterBody').slice().sort(), + collection: getCheckedValues('collectionFilterBody').slice().sort(), }); } @@ -3427,6 +3564,7 @@ zoomWatcher = { document.getElementById('materialFilterBody').addEventListener('change', handleFacetFilterChange); document.getElementById('contextFilterBody').addEventListener('change', handleFacetFilterChange); document.getElementById('objectTypeFilterBody').addEventListener('change', handleFacetFilterChange); + document.getElementById('collectionFilterBody')?.addEventListener('change', handleFacetFilterChange); // --- Camera change handler --- let timer = null; diff --git 
a/scripts/build_collections.py b/scripts/build_collections.py new file mode 100644 index 00000000..bea7ef70 --- /dev/null +++ b/scripts/build_collections.py @@ -0,0 +1,201 @@ +#!/usr/bin/env python3 +""" +Build the supplementary parquet files that power the explorer's *collection* +facet (issue #243). + +A "collection" is the human-readable **label** of a SamplingSite (e.g. the +OpenContext project "PKAP Survey Area"). That identity does NOT live on the +MaterialSampleRecord rows the explorer renders; it is reached by traversal +through the wide parquet's relationship arrays: + + MaterialSampleRecord.p__produced_by[1] -> SamplingEvent.row_id + SamplingEvent.p__sampling_site[1] -> SamplingSite.row_id + SamplingSite.label -> the collection name + +Many SamplingSite rows share one label (e.g. ~1,336 rows are "PKAP Survey +Area"), so a collection aggregates over all of them. We therefore key a +collection on a stable hash of (source, label), NOT on a site pid. + +Doing this traversal live in DuckDB-WASM per facet interaction is the +documented in-browser bottleneck, so we precompute here. Two ADDITIVE outputs +(they touch none of the existing facet files): + + 1. collections.parquet -- dimension, one row per collection: + collection_id, label, source, n_samples, + centroid_lat, centroid_lng, min_lat, max_lat, min_lng, max_lng + Powers the top-N checkbox list, the long-tail search box, and the + Featured-Collections preset camera targets. + + 2. sample_collections.parquet -- membership, one row per sample that has a + collection: pid, collection_id + The explorer filters with: + AND pid IN (SELECT pid FROM read_parquet('') + WHERE collection_id IN (...)) + exactly parallel to the existing facet predicate at explorer.qmd:942. 
+ +Usage: + python build_collections.py \ + --wide https://data.isamples.org/current/wide.parquet \ + --out-dir /tmp/collections_build \ + --snapshot 202604 + +Verify against the live data without writing files: + python build_collections.py --dry-run +""" +from __future__ import annotations + +import argparse +import os +import sys +import time + +import duckdb + +DEFAULT_WIDE = "https://data.isamples.org/current/wide.parquet" + + +def build(wide_url: str, out_dir: str, snapshot: str, dry_run: bool) -> dict: + con = duckdb.connect() + con.sql("INSTALL httpfs; LOAD httpfs;") + + t0 = time.time() + # Pull only the columns the traversal needs, for the three entity types. + con.sql( + f""" + CREATE TEMP TABLE w AS + SELECT row_id, pid, otype, n AS source, label, latitude, longitude, + p__produced_by, p__sampling_site + FROM read_parquet('{wide_url}') + WHERE otype IN ('MaterialSampleRecord','SamplingEvent','SamplingSite') + """ + ) + print(f"[1/4] loaded traversal columns in {time.time() - t0:.1f}s") + + # Lookup tables for the two hops. + con.sql( + "CREATE TEMP TABLE site AS " + "SELECT row_id AS site_rid, label AS site_label " + "FROM w WHERE otype='SamplingSite' AND label IS NOT NULL" + ) + # Unnest the sampling_site array so an event with multiple sites maps to + # all of them (not just the first). + con.sql( + "CREATE TEMP TABLE evt AS " + "SELECT row_id AS evt_rid, UNNEST(p__sampling_site) AS site_rid " + "FROM w WHERE otype='SamplingEvent' AND p__sampling_site IS NOT NULL" + ) + + # Per-sample collection membership. Unnest BOTH relationship arrays + # (produced_by → events, sampling_site → sites) so a sample with multiple + # events / a site list joins through all of them — otherwise a member could + # be silently dropped from a non-first collection. DISTINCT collapses the + # fan-out to one row per (pid, collection). collection_id is a stable 16-hex + # digest of (source, label) so it survives rebuilds and is URL-safe. 
+ con.sql( + """ + CREATE TEMP TABLE memb AS + SELECT DISTINCT + s.pid AS pid, + substr(md5(coalesce(s.source,'') || '\x1f' || st.site_label), 1, 16) AS collection_id, + st.site_label AS label, + s.source AS source, + s.latitude AS lat, + s.longitude AS lng + FROM ( + SELECT pid, source, latitude, longitude, UNNEST(p__produced_by) AS evt_rid + FROM w + WHERE otype='MaterialSampleRecord' AND pid IS NOT NULL + AND p__produced_by IS NOT NULL + ) s + JOIN evt e ON e.evt_rid = s.evt_rid + JOIN site st ON st.site_rid = e.site_rid + """ + ) + print(f"[2/4] built membership in {time.time() - t0:.1f}s") + + # Collections dimension (one row per collection). + con.sql( + """ + CREATE TEMP TABLE collections AS + SELECT + collection_id, + any_value(label) AS label, + any_value(source) AS source, + COUNT(DISTINCT pid) AS n_samples, + round(median(lat), 5) AS centroid_lat, + round(median(lng), 5) AS centroid_lng, + round(min(lat), 5) AS min_lat, + round(max(lat), 5) AS max_lat, + round(min(lng), 5) AS min_lng, + round(max(lng), 5) AS max_lng + FROM memb + GROUP BY collection_id + """ + ) + + stats = { + "samples_with_collection": con.sql("SELECT COUNT(DISTINCT pid) FROM memb").fetchone()[0], + "n_collections": con.sql("SELECT COUNT(*) FROM collections").fetchone()[0], + "pkap_samples": con.sql( + "SELECT n_samples FROM collections WHERE label='PKAP Survey Area'" + ).fetchone(), + } + print(f"[3/4] aggregated {stats['n_collections']:,} collections; " + f"{stats['samples_with_collection']:,} samples carry one") + pkap = stats["pkap_samples"][0] if stats["pkap_samples"] else None + print(f" PKAP Survey Area -> {pkap} samples " + f"(expected ~15,446)") + + print("\n Top 10 collections by sample count:") + print(con.sql( + "SELECT label, source, n_samples, centroid_lat, centroid_lng " + "FROM collections ORDER BY n_samples DESC LIMIT 10" + ).df().to_string(index=False)) + + if dry_run: + print("\n[4/4] --dry-run: no files written") + return stats + + os.makedirs(out_dir, 
exist_ok=True) + dim_path = os.path.join(out_dir, f"isamples_{snapshot}_collections.parquet") + memb_path = os.path.join(out_dir, f"isamples_{snapshot}_sample_collections.parquet") + + con.sql( + f"COPY (SELECT * FROM collections ORDER BY n_samples DESC) " + f"TO '{dim_path}' (FORMAT PARQUET, COMPRESSION ZSTD)" + ) + con.sql( + # Order by collection_id so the explorer's `WHERE collection_id IN (...)` + # filter can prune row groups (and it compresses better). + f"COPY (SELECT DISTINCT pid, collection_id FROM memb ORDER BY collection_id, pid) " + f"TO '{memb_path}' (FORMAT PARQUET, COMPRESSION ZSTD)" + ) + print(f"\n[4/4] wrote:\n {dim_path} ({os.path.getsize(dim_path)/1e6:.1f} MB)" + f"\n {memb_path} ({os.path.getsize(memb_path)/1e6:.1f} MB)") + stats["dim_path"] = dim_path + stats["memb_path"] = memb_path + return stats + + +def main(argv=None) -> int: + ap = argparse.ArgumentParser(description="Build collection facet parquet files (#243)") + ap.add_argument("--wide", default=DEFAULT_WIDE, + help="wide parquet URL (default: %(default)s)") + ap.add_argument("--out-dir", default="/tmp/collections_build", + help="output directory (default: %(default)s)") + ap.add_argument("--snapshot", default="202604", + help="snapshot tag for filenames (default: %(default)s)") + ap.add_argument("--dry-run", action="store_true", + help="compute and report, but write no files") + args = ap.parse_args(argv) + + try: + build(args.wide, args.out_dir, args.snapshot, args.dry_run) + except Exception as exc: # noqa: BLE001 + print(f"ERROR: {exc}", file=sys.stderr) + return 1 + return 0 + + +if __name__ == "__main__": + raise SystemExit(main()) diff --git a/tests/test_collections.py b/tests/test_collections.py new file mode 100644 index 00000000..aafd83a1 --- /dev/null +++ b/tests/test_collections.py @@ -0,0 +1,61 @@ +""" +Feature: Collections (issue #243) + As someone exploring iSamples + I want to jump to a named collection (e.g. 
an OpenContext project) and + filter the explorer to exactly its samples + So that I can browse meaningful groupings, not just map locations. + +These tests validate the static markup the feature ships: the Collections +landing page and the explorer's `collection` facet DOM. They do NOT require the +collections.parquet / sample_collections.parquet files to be live on R2 — the +data-layer behavior is verified separately (see scripts/build_collections.py and +the data-contract checks). Run the live facet verification after those two files +are uploaded to data.isamples.org. +""" +from conftest import SITE_URL + +COLLECTIONS_URL = f"{SITE_URL}/collections.html" +EXPLORER_URL = f"{SITE_URL}/explorer.html" + +# Stable id for PKAP Survey Area = substr(md5('OPENCONTEXT\x1fPKAP Survey Area'), 1, 16) +PKAP_COLLECTION_ID = "dd74c71982da0e21" + + +class TestCollectionsPage: + """Scenario: the Collections landing page lists featured collections.""" + + def test_page_renders(self, page): + page.goto(COLLECTIONS_URL, wait_until="domcontentloaded") + assert page.get_by_text("Featured Collections").count() > 0 + + def test_lists_pkap(self, page): + page.goto(COLLECTIONS_URL, wait_until="domcontentloaded") + assert page.get_by_text("PKAP", exact=False).count() > 0 + + def test_presets_use_collection_param(self, page): + """Each preset links into the explorer with a ?collection= filter.""" + page.goto(COLLECTIONS_URL, wait_until="domcontentloaded") + links = page.locator("a[href*='explorer.html?collection=']") + assert links.count() >= 12 + + def test_pkap_preset_id(self, page): + page.goto(COLLECTIONS_URL, wait_until="domcontentloaded") + assert page.locator( + f"a[href*='collection={PKAP_COLLECTION_ID}']" + ).count() >= 1 + + +class TestExplorerCollectionFacet: + """Scenario: the explorer exposes a Collection facet (search + checkboxes).""" + + def test_collection_filter_section_present(self, page): + page.goto(EXPLORER_URL, wait_until="domcontentloaded") + assert 
page.locator("#collectionFilter").count() == 1 + + def test_collection_search_box_present(self, page): + page.goto(EXPLORER_URL, wait_until="domcontentloaded") + assert page.locator("#collectionSearch").count() == 1 + + def test_collection_body_present(self, page): + page.goto(EXPLORER_URL, wait_until="domcontentloaded") + assert page.locator("#collectionFilterBody").count() == 1
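
Reviewer note (not part of the patch): the `PKAP_COLLECTION_ID` constant in `tests/test_collections.py` is documented as `substr(md5('OPENCONTEXT\x1fPKAP Survey Area'), 1, 16)`. The sketch below shows one way that derivation could be reproduced offline in plain Python, so the hard-coded id can be cross-checked without DuckDB or network access. The function name `collection_id` and the constant `SEP` are hypothetical helpers introduced here for illustration; the sketch also assumes that DuckDB's `md5()` hashes the UTF-8 bytes of the string and returns lowercase hex, which should be confirmed against the build script's actual output before relying on it.

```python
import hashlib
from typing import Optional

# Unit separator placed between source and label, as in the build script's SQL.
SEP = "\x1f"


def collection_id(source: Optional[str], label: str) -> str:
    """Hypothetical offline mirror of the build script's collection_id expression."""
    # Mirrors coalesce(source, '') in the SQL: a missing source hashes as "".
    key = f"{source or ''}{SEP}{label}"
    # Assumption: DuckDB md5() yields the lowercase hex digest of the UTF-8 bytes;
    # the build script keeps the first 16 characters.
    return hashlib.md5(key.encode("utf-8")).hexdigest()[:16]


if __name__ == "__main__":
    # 'OPENCONTEXT' is the source value quoted in the test module's comment.
    print(collection_id("OPENCONTEXT", "PKAP Survey Area"))
```

If the printed value differs from the id hard-coded in `tests/test_collections.py` and `collections.qmd`, the likely causes are a different source string in the wide parquet or a different separator, and the `--dry-run` output of `scripts/build_collections.py` would be the authority.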