feat(waterdata): add waterdata.xarray module returning CF datasets by thodson-usgs · Pull Request #281 · DOI-USGS/dataretrieval-python

thodson-usgs · 2026-05-17T21:53:47Z

Summary

Adds dataretrieval.waterdata.xarray, a module that mirrors the Water Data time-series getters but returns CF-conventions xarray.Datasets with series metadata populated, instead of bare DataFrames.

from dataretrieval.waterdata import xarray as wdx
ds = wdx.get_daily(monitoring_location_id="USGS-05407000",
                   parameter_code="00060", time="2024-06-01/2024-06-05")

discharge (monitoring_location_id, time)
    long_name:     Discharge, cubic feet per second
    units:         ft3 s-1
    cell_methods:  time: mean
    standard_name: water_volume_transport_in_river_channel
coords: monitoring_location_id (cf_role=timeseries_id), time, longitude, latitude
attrs: Conventions=CF-1.11, institution, source, references(URL)

Coverage

get_daily, get_continuous, get_latest_continuous, get_latest_daily, get_nearest_continuous, get_peaks, get_field_measurements, and (preliminary) get_stats_por / get_stats_date_range.

How it works

Each wrapper calls the underlying getter's default, hash-free output, so the large per-record UUID column is never fetched or materialized.
CF attributes are derived from columns already present: unit_of_measure → units (UDUNITS where mapped), statistic_id → cell_methods, parameter_code → standard_name/usgs_parameter_code. Only the human-readable parameter name comes from a small, cached parameter_code-keyed metadata lookup.
One data variable per parameter; time and monitoring_location_id are the coordinates (CF discrete-sampling-geometry, cf_role=timeseries_id), with longitude/latitude from point geometry, and qualifier/approval_status as ancillary variables.
xarray is an optional dependency (pip install dataretrieval[xarray]); it is not imported by dataretrieval.waterdata, so the core package stays xarray-free.

Supporting change: hash-free defaults

This builds on dropping hash-valued ID columns by default — what makes the lean fetch lean. Every public get_* now omits the opaque hash/UUID columns and adds include_hash: bool = False for opt-in; stable identifiers (monitoring_location_id, parameter_code, statistic_id, time) are always returned. The secondary join keys (time_series_id, parent_time_series_id, field_series_id) are dropped too — set include_hash=True, or name them in properties=, to join back to the metadata endpoints.

Status

Draft. Known gaps from self-review to harden before merge: the statistics conversion is a preliminary flat layout (not yet serializable for percentile/list columns); NaT-time rows are silently dropped; mixed units within a group and properties= subsets that omit value/time/parameter_code are not yet guarded.

The waterdata OGC services previously returned per-record UUID columns (``daily_id``, ``continuous_id``, ``peak_id``, …) plus secondary hash columns (``time_series_id``, ``parent_time_series_id``, ``field_visit_id``, ``field_measurements_series_id``) in every response. These IDs are unstable across record refreshes and not human-meaningful — stable identifiers like ``monitoring_location_id`` (AGENCY-ID format), ``parameter_code``, ``statistic_id`` and ``time`` are sufficient to pin a row. Drop the hash columns by default and add ``include_hash_ids: bool = False`` to every OGC ``get_*`` function for opt-in. Implementation trims server-side via the OGC ``properties=`` query parameter (cached per service from one queryables fetch) so the payload itself is smaller, with a client-side drop as a safety net. ``monitoring_location_id`` and other AGENCY-ID / code-style identifiers are unaffected. Offline benchmark on a synthetic 30,000-row payload (mirrors the on-wire shape and per-row size of a 1-year ``get_continuous`` query): - Server payload: 14,310,081 → 10,230,081 bytes (28.5% smaller) - DataFrame memory: 14.2 MB → 9.4 MB (33.5% smaller) - Peak traced memory: 94.1 MB → 73.9 MB (21.5% smaller) - Local parse + DataFrame construction: 1.05s → 0.94s (10.8% faster) Network savings stack on top of the local speedup. For very small queries (≲1k rows) the one-time queryables fetch overhead can dominate the savings; large queries are the target. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…e_range The OGC ``get_*`` functions in the prior commit drop hash columns through ``get_ogc_data``. The statistics services (which return JSON through ``get_stats_data`` rather than OGC features) bypassed that path, so ``get_stats_por`` and ``get_stats_date_range`` were still returning ``computation_id`` (UUID) and ``parent_time_series_id`` (hex hash) by default. This commit: - Adds ``computation_id`` to ``_HASH_ID_COLUMNS`` (``parent_time_series_id`` was already there). - Plumbs ``include_hash_ids: bool = False`` through ``get_stats_data``, ``get_stats_por``, and ``get_stats_date_range``. - Drops the hash columns at the end of ``get_stats_data``, after ``_expand_percentiles`` (which still needs ``computation_id`` as a join key while it explodes the percentile lists into rows). - Updates ``test_get_stats_por_expanded_false`` / ``test_get_stats_date_range`` to reflect the new column count and adds ``test_get_stats_por_include_hash_ids`` documenting the opt-in. Discovered while running a live-API sweep across every public waterdata ``get_*`` function — the OGC services now pass, the stats ones used to leak, and this commit closes that gap. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Two mocked-response tests for ``get_stats_data``: - ``test_get_stats_data_drops_hash_ids_by_default`` asserts ``computation_id`` and ``parent_time_series_id`` are removed when ``include_hash_ids=False`` (the new default). - ``test_get_stats_data_keeps_hash_ids_when_opted_in`` asserts the opt-in path preserves them, matching the legacy behavior. Both use ``monkeypatch`` to stub ``_handle_stats_nesting`` so the fake response only needs to carry the column shape, not the full nested-percentile body. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

A live-API column-content scan found that the Samples CSV service returns two UUID-valued columns by default: - ``Activity_ActivityIdentifier`` (per-activity UUID) - ``Result_MeasureIdentifier`` (per-measurement UUID) These weren't covered by the prior OGC / stats commits because ``get_samples`` parses CSV directly without going through ``get_ogc_data`` or ``get_stats_data``. This commit: - Adds the two CamelCase column names to ``_HASH_ID_COLUMNS``. - Plumbs ``include_hash_ids: bool = False`` through ``get_samples`` and drops the named columns from the parsed CSV before returning. - Updates ``test_mock_get_samples`` to reflect the new column count (187 → 185) and adds ``test_mock_get_samples_include_hash_ids`` for the opt-in path. - Updates ``test_samples_results`` and ``test_samples_activity`` similarly. Stable identifiers (``Org_Identifier``, ``Location_Identifier``, ``Project_Identifier``, ``USGSpcode``, …) are kept unchanged. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Post-PR-281 cleanup from a code review pass. No behavior change for users; the three code paths that drop hash columns (OGC, stats, samples) now share one helper, and a few smaller wins. - New ``_drop_hash_columns(df, include_hash_ids, keep=None)`` helper replaces three near-identical drop blocks across ``_arrange_cols``, ``get_stats_data``, and ``get_samples``. Uses ``set(df.columns) & _HASH_ID_COLUMNS - keep`` (one Index intersection) in place of the per-call list-comprehension + ``if drop_cols:`` truthiness guard. ``df.drop`` accepts an empty Index, so the guard was unnecessary. - ``_HASH_ID_COLUMNS`` no longer leaks into ``api.py`` — the samples branch now calls the helper instead of touching the constant. - New ``_properties_unspecified(properties)`` extracts the ``properties is None or all(pd.isna(properties))`` predicate so the five call sites that need it stop drifting between ``not properties`` and ``properties is None`` variants. - ``_default_non_hash_properties`` memoizes its result by ``(service, output_id)`` via ``_default_props_cache``. The previous version rebuilt a ~30–100-item list on every OGC call after the queryables cache was warm; this saves the per-call list rebuild on the hot path. - ``get_ogc_data`` flattens its nested ``try/except + if/else`` branches into a single ``use_server_trim`` flag, removing the duplicated ``_switch_properties_id`` assignment. - The benchmark no longer clears ``_queryables_cache`` between measured rounds. Clearing per round only penalizes the default path (the legacy path doesn't consult the cache), so the previous comparison was pessimistic against the default. Real-world callers pay the queryables fetch once per process. - The benchmark's ``capture_fixtures`` now calls ``_default_non_hash_properties`` directly instead of reimplementing the filter, so the "trimmed" fixture matches what the runtime actually sends bit-for-bit. All 157 unit and mock tests pass after the refactor. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The ``benchmarks/`` directory was useful during development for measuring the hash-ID-drop impact, but it's not part of the runtime or test surface and the numbers are captured in the PR description. Removing to keep the diff focused on the library change. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The include_hash_ids docstrings claimed the dropped columns "are not stable across record refreshes." That is true for the per-record version UUID (daily_id/continuous_id/computation_id), which is regenerated on every refresh, but wrong for the secondary hashes (time_series_id, parent_time_series_id, field_series_id/field_measurements_series_id): real-data checks confirm these are stable and are the documented join keys back to the metadata endpoints (e.g. time_series_id links a values row to get_time_series_metadata; the API docs recommend (time, time_series_id) to identify an observation over time). Rewrite the 11 OGC getter blocks, the _HASH_ID_COLUMNS comment, and the get_stats_data block to distinguish the unstable per-record UUID from the stable-but-opaque join keys, and tell callers to set include_hash_ids=True (or name the column in properties) when they need to join, trace series lineage, or disambiguate series sharing the same (monitoring_location_id, parameter_code, statistic_id). Docs only; no behavior change. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

… docs Rename the opt-in flag include_hash_ids -> include_hash across the OGC getters, get_samples, get_stats_*, the internal helpers, and the tests. The flag is new in this PR (never on main), so this is not a breaking change. Collapse the per-getter docstrings to two lines. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

New optional module dataretrieval.waterdata.xarray mirrors the time-series getters (get_daily, get_continuous, get_latest_continuous/daily, get_nearest_continuous, get_peaks, get_field_measurements, get_stats_por/ date_range) but returns a CF-conventions xarray.Dataset instead of a DataFrame. Each wrapper calls the underlying getter with include_hash=True, looks up the series descriptors from the metadata endpoints (cached per site), and writes them onto the dataset: one data variable per parameter with long_name, units (UDUNITS where mapped), cell_methods (from the statistic/computation), and standard_name where a confident USGS-pcode -> CF mapping exists. The monitoring location is the CF discrete-sampling-geometry instance dimension (cf_role=timeseries_id) with longitude/latitude coords from point geometry; qualifier/approval_status ride along as ancillary variables; dataset attrs carry Conventions/provenance/request URL. Parameters on different clocks are outer-joined on a shared time axis; a (site, parameter) collision falls back to time_series_id as the instance dimension. xarray is an optional dependency (pip install dataretrieval[xarray]); the module raises a clear ImportError if it is missing and is not imported by dataretrieval.waterdata, so the core package stays xarray-free. Stats use a preliminary flat conversion. Adds offline converter tests. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The wrappers previously forced include_hash=True purely to keep a join key, which materialized the per-record UUID column (a unique ~36-char string per row -- e.g. continuous_id measured at ~155 KB / 1825 unique values on a 20-day continuous pull) only to discard it. Drop that. The wrappers now fetch the underlying getter's default, hash-free output, so the heavy UUID column is never requested or built. Every CF attribute is derived from columns already present: units from unit_of_measure, cell_methods from statistic_id (new _STATISTIC_CELL_METHOD map), standard_name/usgs_parameter_code from parameter_code. Only the human-readable parameter name still needs the metadata endpoint, now looked up (and cached) keyed by parameter_code rather than a series hash. time and monitoring_location_id are the coordinates; the converter slims to just the columns it pivots before copying. Without the hash key the rare two-series-per-(site, parameter, statistic) collision can no longer be split (the values are inherently ambiguous without it), so it now keeps the first value and warns instead of switching the instance dimension. Tests updated accordingly. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The CF lookup tables (USGS units -> UDUNITS, statistic_id -> cell_methods operator, parameter_code -> standard_name) are plain reference data, so move them out of the converter module into types.py as public, extensible constants (CF_UNIT_MAP, CF_CELL_METHODS, CF_STANDARD_NAMES) alongside the existing PROFILE_LOOKUP. They carry no xarray dependency, so types.py stays import-light and the tables can be extended without importing the xarray-optional converter. xarray.py imports them at the top; behavior is unchanged. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The xarray path never surfaces hash columns (neither the per-record OGC feature id nor the per-series join key), so route every wrapper through a _fetch helper that pops include_hash before calling the underlying getter. This makes the flag inert in the xarray path -- passing include_hash=True no longer triggers a fetch of a column we'd only discard -- and documents that hash IDs are always omitted here. The pandas getters keep include_hash as the opt-in flag; only the xarray layer drops them unconditionally. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…routing Addresses two review gaps: - The shared wrapper docstring promised "a CF-conventions Dataset with series metadata populated", which is false for the stats wrappers -- _build_stats emits a flat, preliminary Dataset with only dataset-level provenance. Parametrize _xr_doc(cf_metadata=...) so get_stats_por/get_stats_date_range describe their actual output. - The include_hash strip was only tested on _fetch in isolation; nothing pinned the wrappers to route through it. Add a test that monkeypatches _fetch and asserts every public wrapper delegates to it (and forwards include_hash for _fetch to drop), guarding against a wrapper reverting to a direct getter call. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Two correctness fixes in _build_timeseries from review: - NaT time: a row whose time fails to parse (errors="coerce" -> NaT) could not index the array and was silently swallowed by to_xarray (and an all-NaT group crashed). Drop such rows explicitly with a warning, and return an empty Dataset if nothing parseable remains -- no more silent data loss. - Mixed units: unit_of_measure was collapsed with _first over a (parameter_code, statistic_id) group that can span multiple units (e.g. the same parameter reported in different units across sites). A variable can carry only one units attr, so warn when a group spans multiple units instead of silently labeling every value with the first. Adds tests for the bad-time drop, all-bad-time empty result, and the mixed-unit warning. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

thodson-usgs mentioned this pull request May 19, 2026

feat(waterdata): add get_waterdata for generalized CQL2 queries #284

Draft

5 tasks

thodson-usgs and others added 6 commits May 24, 2026 12:02

thodson-usgs force-pushed the worktree-waterdata-drop-hash-ids branch from 301cd49 to 5416fe5 Compare May 24, 2026 17:05

thodson-usgs and others added 4 commits May 24, 2026 16:47

thodson-usgs changed the title ~~feat(waterdata): drop hash-valued ID columns by default~~ feat(waterdata): add waterdata.xarray module returning CF datasets May 25, 2026

thodson-usgs and others added 4 commits May 24, 2026 22:48

thodson-usgs added the enhancement New feature or request label May 25, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(waterdata): add waterdata.xarray module returning CF datasets#281

feat(waterdata): add waterdata.xarray module returning CF datasets#281
thodson-usgs wants to merge 14 commits into
DOI-USGS:mainfrom
thodson-usgs:worktree-waterdata-drop-hash-ids

thodson-usgs commented May 17, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

thodson-usgs commented May 17, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Coverage

How it works

Supporting change: hash-free defaults

Status

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

thodson-usgs commented May 17, 2026 •

edited

Loading