feat(waterdata): add waterdata.xarray module returning CF datasets#281
Draft
thodson-usgs wants to merge 14 commits into
Draft
feat(waterdata): add waterdata.xarray module returning CF datasets#281thodson-usgs wants to merge 14 commits into
thodson-usgs wants to merge 14 commits into
Conversation
5 tasks
The waterdata OGC services previously returned per-record UUID columns (``daily_id``, ``continuous_id``, ``peak_id``, …) plus secondary hash columns (``time_series_id``, ``parent_time_series_id``, ``field_visit_id``, ``field_measurements_series_id``) in every response. These IDs are unstable across record refreshes and not human-meaningful — stable identifiers like ``monitoring_location_id`` (AGENCY-ID format), ``parameter_code``, ``statistic_id`` and ``time`` are sufficient to pin a row. Drop the hash columns by default and add ``include_hash_ids: bool = False`` to every OGC ``get_*`` function for opt-in. Implementation trims server-side via the OGC ``properties=`` query parameter (cached per service from one queryables fetch) so the payload itself is smaller, with a client-side drop as a safety net. ``monitoring_location_id`` and other AGENCY-ID / code-style identifiers are unaffected. Offline benchmark on a synthetic 30,000-row payload (mirrors the on-wire shape and per-row size of a 1-year ``get_continuous`` query): - Server payload: 14,310,081 → 10,230,081 bytes (28.5% smaller) - DataFrame memory: 14.2 MB → 9.4 MB (33.5% smaller) - Peak traced memory: 94.1 MB → 73.9 MB (21.5% smaller) - Local parse + DataFrame construction: 1.05s → 0.94s (10.8% faster) Network savings stack on top of the local speedup. For very small queries (≲1k rows) the one-time queryables fetch overhead can dominate the savings; large queries are the target. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…e_range The OGC ``get_*`` functions in the prior commit drop hash columns through ``get_ogc_data``. The statistics services (which return JSON through ``get_stats_data`` rather than OGC features) bypassed that path, so ``get_stats_por`` and ``get_stats_date_range`` were still returning ``computation_id`` (UUID) and ``parent_time_series_id`` (hex hash) by default. This commit: - Adds ``computation_id`` to ``_HASH_ID_COLUMNS`` (``parent_time_series_id`` was already there). - Plumbs ``include_hash_ids: bool = False`` through ``get_stats_data``, ``get_stats_por``, and ``get_stats_date_range``. - Drops the hash columns at the end of ``get_stats_data``, after ``_expand_percentiles`` (which still needs ``computation_id`` as a join key while it explodes the percentile lists into rows). - Updates ``test_get_stats_por_expanded_false`` / ``test_get_stats_date_range`` to reflect the new column count and adds ``test_get_stats_por_include_hash_ids`` documenting the opt-in. Discovered while running a live-API sweep across every public waterdata ``get_*`` function — the OGC services now pass, the stats ones used to leak, and this commit closes that gap. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two mocked-response tests for ``get_stats_data``: - ``test_get_stats_data_drops_hash_ids_by_default`` asserts ``computation_id`` and ``parent_time_series_id`` are removed when ``include_hash_ids=False`` (the new default). - ``test_get_stats_data_keeps_hash_ids_when_opted_in`` asserts the opt-in path preserves them, matching the legacy behavior. Both use ``monkeypatch`` to stub ``_handle_stats_nesting`` so the fake response only needs to carry the column shape, not the full nested-percentile body. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
A live-API column-content scan found that the Samples CSV service returns two UUID-valued columns by default: - ``Activity_ActivityIdentifier`` (per-activity UUID) - ``Result_MeasureIdentifier`` (per-measurement UUID) These weren't covered by the prior OGC / stats commits because ``get_samples`` parses CSV directly without going through ``get_ogc_data`` or ``get_stats_data``. This commit: - Adds the two CamelCase column names to ``_HASH_ID_COLUMNS``. - Plumbs ``include_hash_ids: bool = False`` through ``get_samples`` and drops the named columns from the parsed CSV before returning. - Updates ``test_mock_get_samples`` to reflect the new column count (187 → 185) and adds ``test_mock_get_samples_include_hash_ids`` for the opt-in path. - Updates ``test_samples_results`` and ``test_samples_activity`` similarly. Stable identifiers (``Org_Identifier``, ``Location_Identifier``, ``Project_Identifier``, ``USGSpcode``, …) are kept unchanged. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Post-PR-281 cleanup from a code review pass. No behavior change for users; the three code paths that drop hash columns (OGC, stats, samples) now share one helper, and a few smaller wins. - New ``_drop_hash_columns(df, include_hash_ids, keep=None)`` helper replaces three near-identical drop blocks across ``_arrange_cols``, ``get_stats_data``, and ``get_samples``. Uses ``set(df.columns) & _HASH_ID_COLUMNS - keep`` (one Index intersection) in place of the per-call list-comprehension + ``if drop_cols:`` truthiness guard. ``df.drop`` accepts an empty Index, so the guard was unnecessary. - ``_HASH_ID_COLUMNS`` no longer leaks into ``api.py`` — the samples branch now calls the helper instead of touching the constant. - New ``_properties_unspecified(properties)`` extracts the ``properties is None or all(pd.isna(properties))`` predicate so the five call sites that need it stop drifting between ``not properties`` and ``properties is None`` variants. - ``_default_non_hash_properties`` memoizes its result by ``(service, output_id)`` via ``_default_props_cache``. The previous version rebuilt a ~30–100-item list on every OGC call after the queryables cache was warm; this saves the per-call list rebuild on the hot path. - ``get_ogc_data`` flattens its nested ``try/except + if/else`` branches into a single ``use_server_trim`` flag, removing the duplicated ``_switch_properties_id`` assignment. - The benchmark no longer clears ``_queryables_cache`` between measured rounds. Clearing per round only penalizes the default path (the legacy path doesn't consult the cache), so the previous comparison was pessimistic against the default. Real-world callers pay the queryables fetch once per process. - The benchmark's ``capture_fixtures`` now calls ``_default_non_hash_properties`` directly instead of reimplementing the filter, so the "trimmed" fixture matches what the runtime actually sends bit-for-bit. All 157 unit and mock tests pass after the refactor. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The ``benchmarks/`` directory was useful during development for measuring the hash-ID-drop impact, but it's not part of the runtime or test surface and the numbers are captured in the PR description. Removing to keep the diff focused on the library change. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
301cd49 to
5416fe5
Compare
The include_hash_ids docstrings claimed the dropped columns "are not stable across record refreshes." That is true for the per-record version UUID (daily_id/continuous_id/computation_id), which is regenerated on every refresh, but wrong for the secondary hashes (time_series_id, parent_time_series_id, field_series_id/field_measurements_series_id): real-data checks confirm these are stable and are the documented join keys back to the metadata endpoints (e.g. time_series_id links a values row to get_time_series_metadata; the API docs recommend (time, time_series_id) to identify an observation over time). Rewrite the 11 OGC getter blocks, the _HASH_ID_COLUMNS comment, and the get_stats_data block to distinguish the unstable per-record UUID from the stable-but-opaque join keys, and tell callers to set include_hash_ids=True (or name the column in properties) when they need to join, trace series lineage, or disambiguate series sharing the same (monitoring_location_id, parameter_code, statistic_id). Docs only; no behavior change. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… docs Rename the opt-in flag include_hash_ids -> include_hash across the OGC getters, get_samples, get_stats_*, the internal helpers, and the tests. The flag is new in this PR (never on main), so this is not a breaking change. Collapse the per-getter docstrings to two lines. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
New optional module dataretrieval.waterdata.xarray mirrors the time-series getters (get_daily, get_continuous, get_latest_continuous/daily, get_nearest_continuous, get_peaks, get_field_measurements, get_stats_por/ date_range) but returns a CF-conventions xarray.Dataset instead of a DataFrame. Each wrapper calls the underlying getter with include_hash=True, looks up the series descriptors from the metadata endpoints (cached per site), and writes them onto the dataset: one data variable per parameter with long_name, units (UDUNITS where mapped), cell_methods (from the statistic/computation), and standard_name where a confident USGS-pcode -> CF mapping exists. The monitoring location is the CF discrete-sampling-geometry instance dimension (cf_role=timeseries_id) with longitude/latitude coords from point geometry; qualifier/approval_status ride along as ancillary variables; dataset attrs carry Conventions/provenance/request URL. Parameters on different clocks are outer-joined on a shared time axis; a (site, parameter) collision falls back to time_series_id as the instance dimension. xarray is an optional dependency (pip install dataretrieval[xarray]); the module raises a clear ImportError if it is missing and is not imported by dataretrieval.waterdata, so the core package stays xarray-free. Stats use a preliminary flat conversion. Adds offline converter tests. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The wrappers previously forced include_hash=True purely to keep a join key, which materialized the per-record UUID column (a unique ~36-char string per row -- e.g. continuous_id measured at ~155 KB / 1825 unique values on a 20-day continuous pull) only to discard it. Drop that. The wrappers now fetch the underlying getter's default, hash-free output, so the heavy UUID column is never requested or built. Every CF attribute is derived from columns already present: units from unit_of_measure, cell_methods from statistic_id (new _STATISTIC_CELL_METHOD map), standard_name/usgs_parameter_code from parameter_code. Only the human-readable parameter name still needs the metadata endpoint, now looked up (and cached) keyed by parameter_code rather than a series hash. time and monitoring_location_id are the coordinates; the converter slims to just the columns it pivots before copying. Without the hash key the rare two-series-per-(site, parameter, statistic) collision can no longer be split (the values are inherently ambiguous without it), so it now keeps the first value and warns instead of switching the instance dimension. Tests updated accordingly. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The CF lookup tables (USGS units -> UDUNITS, statistic_id -> cell_methods operator, parameter_code -> standard_name) are plain reference data, so move them out of the converter module into types.py as public, extensible constants (CF_UNIT_MAP, CF_CELL_METHODS, CF_STANDARD_NAMES) alongside the existing PROFILE_LOOKUP. They carry no xarray dependency, so types.py stays import-light and the tables can be extended without importing the xarray-optional converter. xarray.py imports them at the top; behavior is unchanged. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The xarray path never surfaces hash columns (neither the per-record OGC feature id nor the per-series join key), so route every wrapper through a _fetch helper that pops include_hash before calling the underlying getter. This makes the flag inert in the xarray path -- passing include_hash=True no longer triggers a fetch of a column we'd only discard -- and documents that hash IDs are always omitted here. The pandas getters keep include_hash as the opt-in flag; only the xarray layer drops them unconditionally. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…routing Addresses two review gaps: - The shared wrapper docstring promised "a CF-conventions Dataset with series metadata populated", which is false for the stats wrappers -- _build_stats emits a flat, preliminary Dataset with only dataset-level provenance. Parametrize _xr_doc(cf_metadata=...) so get_stats_por/get_stats_date_range describe their actual output. - The include_hash strip was only tested on _fetch in isolation; nothing pinned the wrappers to route through it. Add a test that monkeypatches _fetch and asserts every public wrapper delegates to it (and forwards include_hash for _fetch to drop), guarding against a wrapper reverting to a direct getter call. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two correctness fixes in _build_timeseries from review: - NaT time: a row whose time fails to parse (errors="coerce" -> NaT) could not index the array and was silently swallowed by to_xarray (and an all-NaT group crashed). Drop such rows explicitly with a warning, and return an empty Dataset if nothing parseable remains -- no more silent data loss. - Mixed units: unit_of_measure was collapsed with _first over a (parameter_code, statistic_id) group that can span multiple units (e.g. the same parameter reported in different units across sites). A variable can carry only one units attr, so warn when a group spans multiple units instead of silently labeling every value with the first. Adds tests for the bad-time drop, all-bad-time empty result, and the mixed-unit warning. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Adds
dataretrieval.waterdata.xarray, a module that mirrors the Water Data time-series getters but returns CF-conventionsxarray.Datasets with series metadata populated, instead of bare DataFrames.Coverage
get_daily,get_continuous,get_latest_continuous,get_latest_daily,get_nearest_continuous,get_peaks,get_field_measurements, and (preliminary)get_stats_por/get_stats_date_range.How it works
unit_of_measure→units(UDUNITS where mapped),statistic_id→cell_methods,parameter_code→standard_name/usgs_parameter_code. Only the human-readable parameter name comes from a small, cachedparameter_code-keyed metadata lookup.timeandmonitoring_location_idare the coordinates (CF discrete-sampling-geometry,cf_role=timeseries_id), withlongitude/latitudefrom point geometry, andqualifier/approval_statusas ancillary variables.xarrayis an optional dependency (pip install dataretrieval[xarray]); it is not imported bydataretrieval.waterdata, so the core package stays xarray-free.Supporting change: hash-free defaults
This builds on dropping hash-valued ID columns by default — what makes the lean fetch lean. Every public
get_*now omits the opaque hash/UUID columns and addsinclude_hash: bool = Falsefor opt-in; stable identifiers (monitoring_location_id,parameter_code,statistic_id,time) are always returned. The secondary join keys (time_series_id,parent_time_series_id,field_series_id) are dropped too — setinclude_hash=True, or name them inproperties=, to join back to the metadata endpoints.Status
Draft. Known gaps from self-review to harden before merge: the statistics conversion is a preliminary flat layout (not yet serializable for percentile/list columns); NaT-time rows are silently dropped; mixed units within a group and
properties=subsets that omitvalue/time/parameter_codeare not yet guarded.