Use correlation-based distance metric for hierarchical clustering#13305
Use correlation-based distance metric for hierarchical clustering#13305dafeda wants to merge 1 commit intoequinor:mainfrom
Conversation
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #13305 +/- ##
==========================================
+ Coverage 89.98% 90.02% +0.03%
==========================================
Files 457 457
Lines 31702 31742 +40
==========================================
+ Hits 28528 28575 +47
+ Misses 3174 3167 -7
Flags with carried forward coverage won't be shown. Click here to find out more.
|
4732e49 to
08ae6d7
Compare
There was a problem hiding this comment.
Pull request overview
This PR updates the autoscaling/misfit preprocessor’s hierarchical clustering to use a correlation-derived distance (d = 1 - |rho|) rather than Euclidean distance between correlation-profile rows, aiming to produce more intuitive merge ordering (most strongly correlated pairs merge first).
Changes:
- Switch clustering linkage input from correlation-row features to a condensed precomputed distance vector derived from
1 - |correlation|. - Update the unit test that checks “first merge” behavior to reflect the new distance interpretation/thresholds.
- Loosen an edge-case clustering assertion from an exact cluster count to a lower-bound threshold.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
src/ert/analysis/misfit_preprocessor.py |
Computes `1 - |
tests/ert/unit_tests/analysis/test_misfit_preprocessor.py |
Updates clustering-merge expectations/params for the new metric and relaxes an edge-case assertion on cluster count. |
| With distance = 1 - |rho|, the pair with the highest |rho| has the | ||
| smallest distance and merges first: | ||
| - When corr_de_target > 0.707: D-E distance < A-B distance, so D-E merges first. | ||
| - When corr_de_target < 0.707: A-B (or B-C) merges first. |
There was a problem hiding this comment.
The docstring states that when corr_de_target < 0.707, A-B (or B-C) should merge first, but the test (below) only asserts whether D-E merged. That means a regression where some other unintended pair merges (while D-E does not) would still satisfy the stated expectation. Consider extending the assertions to explicitly require that A-B or B-C is the merged pair in the expect_de_merged == False cases.
| # With standard scaling, the algorithm correctly identifies many clusters | ||
| # instead of collapsing everything into one. We get 91 rather than 100 | ||
| # instead of collapsing everything into one. We get at least 90 clusters | ||
| # because with finite samples (100 obs, 1000 realizations), sample | ||
| # eigenvalues spread even for independent data, so the top ~91 components | ||
| # already explain 95% of variance. | ||
| assert len(np.unique(clusters)) == 91 | ||
| assert len(np.unique(clusters)) >= 90 |
There was a problem hiding this comment.
This assertion was loosened from an exact expected count to >= 90, which makes the regression protection less precise and introduces a magic threshold. If the intent is to avoid platform-dependent flakiness, consider expressing the expectation relative to n_observations (e.g., a percentage) and/or asserting an invariant that better matches the underlying algorithm (for example, that the number of clusters stays close to the PCA-derived prim_components, or that it does not collapse to a small number).
| distance_matrix = 1.0 - np.abs(correlation) | ||
| np.fill_diagonal(distance_matrix, 0.0) | ||
| np.clip(distance_matrix, 0.0, None, out=distance_matrix) | ||
| condensed_distance = squareform(distance_matrix) |
There was a problem hiding this comment.
This one fails on drogon due to floating point differences making distance matrix non symmetrical. Could maybe just set checks=False in squareform
There was a problem hiding this comment.
Nice catch, thanks for testing.
I propose we symmetrize the distance matrix before calling squareform: distance_matrix = (distance_matrix + distance_matrix.T) / 2.0.
I have pushed a fix and added a test.
08ae6d7 to
ed8aefd
Compare
Replace Euclidean distance on correlation matrix rows with d = 1 - |rho| condensed via squareform. The old approach compared full correlation profiles, causing unintuitive merges where weakly correlated but globally consistent pairs were prioritized over strongly correlated pairs.
ed8aefd to
3eee203
Compare
Replace Euclidean distance on correlation matrix rows with d = 1 - |rho| condensed via squareform. The old approach compared full correlation profiles, causing unintuitive merges where weakly correlated but globally consistent pairs were prioritized over strongly correlated pairs.
git rebase -i main --exec 'just rapid-tests')When applicable