Skip to content

feat(waterdata): add waterdata.xarray module returning CF datasets#281

Draft
thodson-usgs wants to merge 14 commits into
DOI-USGS:mainfrom
thodson-usgs:worktree-waterdata-drop-hash-ids
Draft

feat(waterdata): add waterdata.xarray module returning CF datasets#281
thodson-usgs wants to merge 14 commits into
DOI-USGS:mainfrom
thodson-usgs:worktree-waterdata-drop-hash-ids

Conversation

@thodson-usgs
Copy link
Copy Markdown
Collaborator

@thodson-usgs thodson-usgs commented May 17, 2026

Summary

Adds dataretrieval.waterdata.xarray, a module that mirrors the Water Data time-series getters but returns CF-conventions xarray.Datasets with series metadata populated, instead of bare DataFrames.

from dataretrieval.waterdata import xarray as wdx
ds = wdx.get_daily(monitoring_location_id="USGS-05407000",
                   parameter_code="00060", time="2024-06-01/2024-06-05")
discharge (monitoring_location_id, time)
    long_name:     Discharge, cubic feet per second
    units:         ft3 s-1
    cell_methods:  time: mean
    standard_name: water_volume_transport_in_river_channel
coords: monitoring_location_id (cf_role=timeseries_id), time, longitude, latitude
attrs: Conventions=CF-1.11, institution, source, references(URL)

Coverage

get_daily, get_continuous, get_latest_continuous, get_latest_daily, get_nearest_continuous, get_peaks, get_field_measurements, and (preliminary) get_stats_por / get_stats_date_range.

How it works

  • Each wrapper calls the underlying getter's default, hash-free output, so the large per-record UUID column is never fetched or materialized.
  • CF attributes are derived from columns already present: unit_of_measureunits (UDUNITS where mapped), statistic_idcell_methods, parameter_codestandard_name/usgs_parameter_code. Only the human-readable parameter name comes from a small, cached parameter_code-keyed metadata lookup.
  • One data variable per parameter; time and monitoring_location_id are the coordinates (CF discrete-sampling-geometry, cf_role=timeseries_id), with longitude/latitude from point geometry, and qualifier/approval_status as ancillary variables.
  • xarray is an optional dependency (pip install dataretrieval[xarray]); it is not imported by dataretrieval.waterdata, so the core package stays xarray-free.

Supporting change: hash-free defaults

This builds on dropping hash-valued ID columns by default — what makes the lean fetch lean. Every public get_* now omits the opaque hash/UUID columns and adds include_hash: bool = False for opt-in; stable identifiers (monitoring_location_id, parameter_code, statistic_id, time) are always returned. The secondary join keys (time_series_id, parent_time_series_id, field_series_id) are dropped too — set include_hash=True, or name them in properties=, to join back to the metadata endpoints.

Status

Draft. Known gaps from self-review to harden before merge: the statistics conversion is a preliminary flat layout (not yet serializable for percentile/list columns); NaT-time rows are silently dropped; mixed units within a group and properties= subsets that omit value/time/parameter_code are not yet guarded.

thodson-usgs and others added 6 commits May 24, 2026 12:02
The waterdata OGC services previously returned per-record UUID columns
(``daily_id``, ``continuous_id``, ``peak_id``, …) plus secondary hash
columns (``time_series_id``, ``parent_time_series_id``,
``field_visit_id``, ``field_measurements_series_id``) in every
response. These IDs are unstable across record refreshes and not
human-meaningful — stable identifiers like ``monitoring_location_id``
(AGENCY-ID format), ``parameter_code``, ``statistic_id`` and ``time``
are sufficient to pin a row.

Drop the hash columns by default and add ``include_hash_ids: bool =
False`` to every OGC ``get_*`` function for opt-in. Implementation
trims server-side via the OGC ``properties=`` query parameter (cached
per service from one queryables fetch) so the payload itself is
smaller, with a client-side drop as a safety net. ``monitoring_location_id``
and other AGENCY-ID / code-style identifiers are unaffected.

Offline benchmark on a synthetic 30,000-row payload (mirrors the
on-wire shape and per-row size of a 1-year ``get_continuous`` query):

  - Server payload: 14,310,081 → 10,230,081 bytes (28.5% smaller)
  - DataFrame memory: 14.2 MB → 9.4 MB (33.5% smaller)
  - Peak traced memory: 94.1 MB → 73.9 MB (21.5% smaller)
  - Local parse + DataFrame construction: 1.05s → 0.94s (10.8% faster)

Network savings stack on top of the local speedup. For very small
queries (≲1k rows) the one-time queryables fetch overhead can dominate
the savings; large queries are the target.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…e_range

The OGC ``get_*`` functions in the prior commit drop hash columns
through ``get_ogc_data``. The statistics services (which return JSON
through ``get_stats_data`` rather than OGC features) bypassed that
path, so ``get_stats_por`` and ``get_stats_date_range`` were still
returning ``computation_id`` (UUID) and ``parent_time_series_id``
(hex hash) by default.

This commit:
- Adds ``computation_id`` to ``_HASH_ID_COLUMNS``
  (``parent_time_series_id`` was already there).
- Plumbs ``include_hash_ids: bool = False`` through ``get_stats_data``,
  ``get_stats_por``, and ``get_stats_date_range``.
- Drops the hash columns at the end of ``get_stats_data``, after
  ``_expand_percentiles`` (which still needs ``computation_id`` as a
  join key while it explodes the percentile lists into rows).
- Updates ``test_get_stats_por_expanded_false`` /
  ``test_get_stats_date_range`` to reflect the new column count and
  adds ``test_get_stats_por_include_hash_ids`` documenting the opt-in.

Discovered while running a live-API sweep across every public
waterdata ``get_*`` function — the OGC services now pass, the stats
ones used to leak, and this commit closes that gap.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two mocked-response tests for ``get_stats_data``:
- ``test_get_stats_data_drops_hash_ids_by_default`` asserts
  ``computation_id`` and ``parent_time_series_id`` are removed when
  ``include_hash_ids=False`` (the new default).
- ``test_get_stats_data_keeps_hash_ids_when_opted_in`` asserts the
  opt-in path preserves them, matching the legacy behavior.

Both use ``monkeypatch`` to stub ``_handle_stats_nesting`` so the
fake response only needs to carry the column shape, not the full
nested-percentile body.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
A live-API column-content scan found that the Samples CSV service
returns two UUID-valued columns by default:
  - ``Activity_ActivityIdentifier`` (per-activity UUID)
  - ``Result_MeasureIdentifier`` (per-measurement UUID)

These weren't covered by the prior OGC / stats commits because
``get_samples`` parses CSV directly without going through
``get_ogc_data`` or ``get_stats_data``.

This commit:
- Adds the two CamelCase column names to ``_HASH_ID_COLUMNS``.
- Plumbs ``include_hash_ids: bool = False`` through ``get_samples``
  and drops the named columns from the parsed CSV before returning.
- Updates ``test_mock_get_samples`` to reflect the new column count
  (187 → 185) and adds ``test_mock_get_samples_include_hash_ids`` for
  the opt-in path.
- Updates ``test_samples_results`` and ``test_samples_activity``
  similarly.

Stable identifiers (``Org_Identifier``, ``Location_Identifier``,
``Project_Identifier``, ``USGSpcode``, …) are kept unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Post-PR-281 cleanup from a code review pass. No behavior change for
users; the three code paths that drop hash columns (OGC, stats,
samples) now share one helper, and a few smaller wins.

- New ``_drop_hash_columns(df, include_hash_ids, keep=None)`` helper
  replaces three near-identical drop blocks across ``_arrange_cols``,
  ``get_stats_data``, and ``get_samples``. Uses
  ``set(df.columns) & _HASH_ID_COLUMNS - keep`` (one Index intersection)
  in place of the per-call list-comprehension + ``if drop_cols:``
  truthiness guard. ``df.drop`` accepts an empty Index, so the guard
  was unnecessary.

- ``_HASH_ID_COLUMNS`` no longer leaks into ``api.py`` — the samples
  branch now calls the helper instead of touching the constant.

- New ``_properties_unspecified(properties)`` extracts the
  ``properties is None or all(pd.isna(properties))`` predicate so the
  five call sites that need it stop drifting between ``not properties``
  and ``properties is None`` variants.

- ``_default_non_hash_properties`` memoizes its result by
  ``(service, output_id)`` via ``_default_props_cache``. The previous
  version rebuilt a ~30–100-item list on every OGC call after the
  queryables cache was warm; this saves the per-call list rebuild on
  the hot path.

- ``get_ogc_data`` flattens its nested ``try/except + if/else``
  branches into a single ``use_server_trim`` flag, removing the
  duplicated ``_switch_properties_id`` assignment.

- The benchmark no longer clears ``_queryables_cache`` between
  measured rounds. Clearing per round only penalizes the default path
  (the legacy path doesn't consult the cache), so the previous
  comparison was pessimistic against the default. Real-world callers
  pay the queryables fetch once per process.

- The benchmark's ``capture_fixtures`` now calls
  ``_default_non_hash_properties`` directly instead of reimplementing
  the filter, so the "trimmed" fixture matches what the runtime
  actually sends bit-for-bit.

All 157 unit and mock tests pass after the refactor.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The ``benchmarks/`` directory was useful during development for
measuring the hash-ID-drop impact, but it's not part of the runtime
or test surface and the numbers are captured in the PR description.
Removing to keep the diff focused on the library change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@thodson-usgs thodson-usgs force-pushed the worktree-waterdata-drop-hash-ids branch from 301cd49 to 5416fe5 Compare May 24, 2026 17:05
thodson-usgs and others added 4 commits May 24, 2026 16:47
The include_hash_ids docstrings claimed the dropped columns "are not
stable across record refreshes." That is true for the per-record version
UUID (daily_id/continuous_id/computation_id), which is regenerated on
every refresh, but wrong for the secondary hashes (time_series_id,
parent_time_series_id, field_series_id/field_measurements_series_id):
real-data checks confirm these are stable and are the documented join
keys back to the metadata endpoints (e.g. time_series_id links a values
row to get_time_series_metadata; the API docs recommend (time,
time_series_id) to identify an observation over time).

Rewrite the 11 OGC getter blocks, the _HASH_ID_COLUMNS comment, and the
get_stats_data block to distinguish the unstable per-record UUID from the
stable-but-opaque join keys, and tell callers to set include_hash_ids=True
(or name the column in properties) when they need to join, trace series
lineage, or disambiguate series sharing the same
(monitoring_location_id, parameter_code, statistic_id).

Docs only; no behavior change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… docs

Rename the opt-in flag include_hash_ids -> include_hash across the OGC
getters, get_samples, get_stats_*, the internal helpers, and the tests.
The flag is new in this PR (never on main), so this is not a breaking
change. Collapse the per-getter docstrings to two lines.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
New optional module dataretrieval.waterdata.xarray mirrors the time-series
getters (get_daily, get_continuous, get_latest_continuous/daily,
get_nearest_continuous, get_peaks, get_field_measurements, get_stats_por/
date_range) but returns a CF-conventions xarray.Dataset instead of a
DataFrame.

Each wrapper calls the underlying getter with include_hash=True, looks up the
series descriptors from the metadata endpoints (cached per site), and writes
them onto the dataset: one data variable per parameter with long_name, units
(UDUNITS where mapped), cell_methods (from the statistic/computation), and
standard_name where a confident USGS-pcode -> CF mapping exists. The
monitoring location is the CF discrete-sampling-geometry instance dimension
(cf_role=timeseries_id) with longitude/latitude coords from point geometry;
qualifier/approval_status ride along as ancillary variables; dataset attrs
carry Conventions/provenance/request URL. Parameters on different clocks are
outer-joined on a shared time axis; a (site, parameter) collision falls back
to time_series_id as the instance dimension.

xarray is an optional dependency (pip install dataretrieval[xarray]); the
module raises a clear ImportError if it is missing and is not imported by
dataretrieval.waterdata, so the core package stays xarray-free. Stats use a
preliminary flat conversion. Adds offline converter tests.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The wrappers previously forced include_hash=True purely to keep a join key,
which materialized the per-record UUID column (a unique ~36-char string per
row -- e.g. continuous_id measured at ~155 KB / 1825 unique values on a
20-day continuous pull) only to discard it.

Drop that. The wrappers now fetch the underlying getter's default, hash-free
output, so the heavy UUID column is never requested or built. Every CF
attribute is derived from columns already present: units from
unit_of_measure, cell_methods from statistic_id (new _STATISTIC_CELL_METHOD
map), standard_name/usgs_parameter_code from parameter_code. Only the
human-readable parameter name still needs the metadata endpoint, now looked
up (and cached) keyed by parameter_code rather than a series hash. time and
monitoring_location_id are the coordinates; the converter slims to just the
columns it pivots before copying.

Without the hash key the rare two-series-per-(site, parameter, statistic)
collision can no longer be split (the values are inherently ambiguous without
it), so it now keeps the first value and warns instead of switching the
instance dimension. Tests updated accordingly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@thodson-usgs thodson-usgs changed the title feat(waterdata): drop hash-valued ID columns by default feat(waterdata): add waterdata.xarray module returning CF datasets May 25, 2026
thodson-usgs and others added 4 commits May 24, 2026 22:48
The CF lookup tables (USGS units -> UDUNITS, statistic_id -> cell_methods
operator, parameter_code -> standard_name) are plain reference data, so move
them out of the converter module into types.py as public, extensible
constants (CF_UNIT_MAP, CF_CELL_METHODS, CF_STANDARD_NAMES) alongside the
existing PROFILE_LOOKUP. They carry no xarray dependency, so types.py stays
import-light and the tables can be extended without importing the
xarray-optional converter. xarray.py imports them at the top; behavior is
unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The xarray path never surfaces hash columns (neither the per-record OGC
feature id nor the per-series join key), so route every wrapper through a
_fetch helper that pops include_hash before calling the underlying getter.
This makes the flag inert in the xarray path -- passing include_hash=True no
longer triggers a fetch of a column we'd only discard -- and documents that
hash IDs are always omitted here. The pandas getters keep include_hash as
the opt-in flag; only the xarray layer drops them unconditionally.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…routing

Addresses two review gaps:

- The shared wrapper docstring promised "a CF-conventions Dataset with series
  metadata populated", which is false for the stats wrappers -- _build_stats
  emits a flat, preliminary Dataset with only dataset-level provenance.
  Parametrize _xr_doc(cf_metadata=...) so get_stats_por/get_stats_date_range
  describe their actual output.

- The include_hash strip was only tested on _fetch in isolation; nothing
  pinned the wrappers to route through it. Add a test that monkeypatches
  _fetch and asserts every public wrapper delegates to it (and forwards
  include_hash for _fetch to drop), guarding against a wrapper reverting to a
  direct getter call.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two correctness fixes in _build_timeseries from review:

- NaT time: a row whose time fails to parse (errors="coerce" -> NaT) could
  not index the array and was silently swallowed by to_xarray (and an
  all-NaT group crashed). Drop such rows explicitly with a warning, and
  return an empty Dataset if nothing parseable remains -- no more silent
  data loss.

- Mixed units: unit_of_measure was collapsed with _first over a
  (parameter_code, statistic_id) group that can span multiple units (e.g.
  the same parameter reported in different units across sites). A variable
  can carry only one units attr, so warn when a group spans multiple units
  instead of silently labeling every value with the first.

Adds tests for the bad-time drop, all-bad-time empty result, and the
mixed-unit warning.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@thodson-usgs thodson-usgs added the enhancement New feature or request label May 25, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant