Skip to content

Use correlation-based distance metric for hierarchical clustering#13305

Open
dafeda wants to merge 1 commit intoequinor:mainfrom
dafeda:autoscale-clustering
Open

Use correlation-based distance metric for hierarchical clustering#13305
dafeda wants to merge 1 commit intoequinor:mainfrom
dafeda:autoscale-clustering

Conversation

@dafeda
Copy link
Copy Markdown
Collaborator

@dafeda dafeda commented Apr 15, 2026

Replace Euclidean distance on correlation matrix rows with d = 1 - |rho| condensed via squareform. The old approach compared full correlation profiles, causing unintuitive merges where weakly correlated but globally consistent pairs were prioritized over strongly correlated pairs.

  • PR title captures the intent of the changes, and is fitting for release notes.
  • Added appropriate release note label
  • Commit history is consistent and clean, in line with the contribution guidelines.
  • Make sure unit tests pass locally after every commit (git rebase -i main --exec 'just rapid-tests')

When applicable

  • When there are user facing changes: Updated documentation
  • New behavior or changes to existing untested code: Ensured that unit tests are added (See Ground Rules).
  • Large PR: Prepare changes in small commits for more convenient review
  • Bug fix: Add regression test for the bug
  • Bug fix: Add backport label to latest release (format: 'backport release-branch-name')

@dafeda dafeda self-assigned this Apr 15, 2026
@codecov-commenter
Copy link
Copy Markdown

codecov-commenter commented Apr 15, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 90.02%. Comparing base (d3cf41c) to head (3eee203).
⚠️ Report is 6 commits behind head on main.
✅ All tests successful. No failed tests found.

Additional details and impacted files
@@            Coverage Diff             @@
##             main   #13305      +/-   ##
==========================================
+ Coverage   89.98%   90.02%   +0.03%     
==========================================
  Files         457      457              
  Lines       31702    31742      +40     
==========================================
+ Hits        28528    28575      +47     
+ Misses       3174     3167       -7     
Flag Coverage Δ
cli-tests 37.51% <90.00%> (-0.03%) ⬇️
fuzz 44.39% <90.00%> (-0.03%) ⬇️
gui-tests 67.04% <90.00%> (+0.02%) ⬆️
performance-and-unit-tests 77.98% <100.00%> (+0.03%) ⬆️
test 46.04% <10.00%> (-0.04%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
src/ert/analysis/misfit_preprocessor.py 98.03% <100.00%> (+0.31%) ⬆️

... and 10 files with indirect coverage changes

@dafeda dafeda marked this pull request as ready for review April 17, 2026 10:16
@dafeda dafeda force-pushed the autoscale-clustering branch from 4732e49 to 08ae6d7 Compare April 17, 2026 10:17
@xjules xjules requested a review from Copilot April 17, 2026 11:32
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR updates the autoscaling/misfit preprocessor’s hierarchical clustering to use a correlation-derived distance (d = 1 - |rho|) rather than Euclidean distance between correlation-profile rows, aiming to produce more intuitive merge ordering (most strongly correlated pairs merge first).

Changes:

  • Switch clustering linkage input from correlation-row features to a condensed precomputed distance vector derived from 1 - |correlation|.
  • Update the unit test that checks “first merge” behavior to reflect the new distance interpretation/thresholds.
  • Loosen an edge-case clustering assertion from an exact cluster count to a lower-bound threshold.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File Description
src/ert/analysis/misfit_preprocessor.py Computes `1 -
tests/ert/unit_tests/analysis/test_misfit_preprocessor.py Updates clustering-merge expectations/params for the new metric and relaxes an edge-case assertion on cluster count.

Comment on lines +366 to +369
With distance = 1 - |rho|, the pair with the highest |rho| has the
smallest distance and merges first:
- When corr_de_target > 0.707: D-E distance < A-B distance, so D-E merges first.
- When corr_de_target < 0.707: A-B (or B-C) merges first.
Copy link

Copilot AI Apr 17, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The docstring states that when corr_de_target < 0.707, A-B (or B-C) should merge first, but the test (below) only asserts whether D-E merged. That means a regression where some other unintended pair merges (while D-E does not) would still satisfy the stated expectation. Consider extending the assertions to explicitly require that A-B or B-C is the merged pair in the expect_de_merged == False cases.

Copilot generated this review using guidance from repository custom instructions.
Comment on lines +534 to +539
# With standard scaling, the algorithm correctly identifies many clusters
# instead of collapsing everything into one. We get 91 rather than 100
# instead of collapsing everything into one. We get at least 90 clusters
# because with finite samples (100 obs, 1000 realizations), sample
# eigenvalues spread even for independent data, so the top ~91 components
# already explain 95% of variance.
assert len(np.unique(clusters)) == 91
assert len(np.unique(clusters)) >= 90
Copy link

Copilot AI Apr 17, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This assertion was loosened from an exact expected count to >= 90, which makes the regression protection less precise and introduces a magic threshold. If the intent is to avoid platform-dependent flakiness, consider expressing the expectation relative to n_observations (e.g., a percentage) and/or asserting an invariant that better matches the underlying algorithm (for example, that the number of clusters stays close to the PCA-derived prim_components, or that it does not collapse to a small number).

Copilot generated this review using guidance from repository custom instructions.
distance_matrix = 1.0 - np.abs(correlation)
np.fill_diagonal(distance_matrix, 0.0)
np.clip(distance_matrix, 0.0, None, out=distance_matrix)
condensed_distance = squareform(distance_matrix)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one fails on drogon due to floating point differences making distance matrix non symmetrical. Could maybe just set checks=False in squareform

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice catch, thanks for testing.
I propose we symmetrize the distance matrix before calling squareform: distance_matrix = (distance_matrix + distance_matrix.T) / 2.0.
I have pushed a fix and added a test.

@dafeda dafeda force-pushed the autoscale-clustering branch from 08ae6d7 to ed8aefd Compare April 20, 2026 07:07
Replace Euclidean distance on correlation matrix rows with
d = 1 - |rho| condensed via squareform. The old approach
compared full correlation profiles, causing unintuitive merges
where weakly correlated but globally consistent pairs were
prioritized over strongly correlated pairs.
@dafeda dafeda force-pushed the autoscale-clustering branch from ed8aefd to 3eee203 Compare April 20, 2026 07:14
@codspeed-hq
Copy link
Copy Markdown

codspeed-hq Bot commented Apr 20, 2026

Merging this PR will not alter performance

✅ 36 untouched benchmarks


Comparing dafeda:autoscale-clustering (3eee203) with main (8036c93)

Open in CodSpeed

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants