Use correlation-based distance metric for hierarchical clustering by dafeda · Pull Request #13305 · equinor/ert

dafeda · 2026-04-15T12:53:19Z

Replace Euclidean distance on correlation matrix rows with d = 1 - |rho| condensed via squareform. The old approach compared full correlation profiles, causing unintuitive merges where weakly correlated but globally consistent pairs were prioritized over strongly correlated pairs.

PR title captures the intent of the changes, and is fitting for release notes.
Added appropriate release note label
Commit history is consistent and clean, in line with the contribution guidelines.
Make sure unit tests pass locally after every commit (git rebase -i main --exec 'just rapid-tests')

When applicable

When there are user facing changes: Updated documentation
New behavior or changes to existing untested code: Ensured that unit tests are added (See Ground Rules).
Large PR: Prepare changes in small commits for more convenient review
Bug fix: Add regression test for the bug
Bug fix: Add backport label to latest release (format: 'backport release-branch-name')

codecov-commenter · 2026-04-15T13:06:39Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 90.02%. Comparing base (d3cf41c) to head (3eee203).
⚠️ Report is 6 commits behind head on main.
✅ All tests successful. No failed tests found.

Additional details and impacted files

@@            Coverage Diff             @@
##             main   #13305      +/-   ##
==========================================
+ Coverage   89.98%   90.02%   +0.03%     
==========================================
  Files         457      457              
  Lines       31702    31742      +40     
==========================================
+ Hits        28528    28575      +47     
+ Misses       3174     3167       -7

Flag	Coverage Δ
cli-tests	`37.51% <90.00%> (-0.03%)`	⬇️
fuzz	`44.39% <90.00%> (-0.03%)`	⬇️
gui-tests	`67.04% <90.00%> (+0.02%)`	⬆️
performance-and-unit-tests	`77.98% <100.00%> (+0.03%)`	⬆️
test	`46.04% <10.00%> (-0.04%)`	⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines	Coverage Δ
src/ert/analysis/misfit_preprocessor.py	`98.03% <100.00%> (+0.31%)`	⬆️

... and 10 files with indirect coverage changes

Copilot

Pull request overview

This PR updates the autoscaling/misfit preprocessor’s hierarchical clustering to use a correlation-derived distance (d = 1 - |rho|) rather than Euclidean distance between correlation-profile rows, aiming to produce more intuitive merge ordering (most strongly correlated pairs merge first).

Changes:

Switch clustering linkage input from correlation-row features to a condensed precomputed distance vector derived from 1 - |correlation|.
Update the unit test that checks “first merge” behavior to reflect the new distance interpretation/thresholds.
Loosen an edge-case clustering assertion from an exact cluster count to a lower-bound threshold.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File	Description
`src/ert/analysis/misfit_preprocessor.py`	Computes `1 -
`tests/ert/unit_tests/analysis/test_misfit_preprocessor.py`	Updates clustering-merge expectations/params for the new metric and relaxes an edge-case assertion on cluster count.

Copilot · 2026-04-17T11:36:08Z

+    With distance = 1 - |rho|, the pair with the highest |rho| has the
+    smallest distance and merges first:
+    - When corr_de_target > 0.707: D-E distance < A-B distance, so D-E merges first.
+    - When corr_de_target < 0.707: A-B (or B-C) merges first.


The docstring states that when corr_de_target < 0.707, A-B (or B-C) should merge first, but the test (below) only asserts whether D-E merged. That means a regression where some other unintended pair merges (while D-E does not) would still satisfy the stated expectation. Consider extending the assertions to explicitly require that A-B or B-C is the merged pair in the expect_de_merged == False cases.

Copilot · 2026-04-17T11:36:08Z

    # With standard scaling, the algorithm correctly identifies many clusters
-    # instead of collapsing everything into one. We get 91 rather than 100
+    # instead of collapsing everything into one. We get at least 90 clusters
    # because with finite samples (100 obs, 1000 realizations), sample
    # eigenvalues spread even for independent data, so the top ~91 components
    # already explain 95% of variance.
-    assert len(np.unique(clusters)) == 91
+    assert len(np.unique(clusters)) >= 90


This assertion was loosened from an exact expected count to >= 90, which makes the regression protection less precise and introduces a magic threshold. If the intent is to avoid platform-dependent flakiness, consider expressing the expectation relative to n_observations (e.g., a percentage) and/or asserting an invariant that better matches the underlying algorithm (for example, that the number of clusters stays close to the PCA-derived prim_components, or that it does not collapse to a small number).

larsevj · 2026-04-17T13:13:51Z

+    distance_matrix = 1.0 - np.abs(correlation)
+    np.fill_diagonal(distance_matrix, 0.0)
+    np.clip(distance_matrix, 0.0, None, out=distance_matrix)
+    condensed_distance = squareform(distance_matrix)


This one fails on drogon due to floating point differences making distance matrix non symmetrical. Could maybe just set checks=False in squareform

Nice catch, thanks for testing.
I propose we symmetrize the distance matrix before calling squareform: distance_matrix = (distance_matrix + distance_matrix.T) / 2.0.
I have pushed a fix and added a test.

Replace Euclidean distance on correlation matrix rows with d = 1 - |rho| condensed via squareform. The old approach compared full correlation profiles, causing unintuitive merges where weakly correlated but globally consistent pairs were prioritized over strongly correlated pairs.

codspeed-hq · 2026-04-20T08:03:33Z

Merging this PR will not alter performance

✅ 36 untouched benchmarks

_{Comparing dafeda:autoscale-clustering (3eee203) with main (8036c93)}

dafeda self-assigned this Apr 15, 2026

dafeda marked this pull request as ready for review April 17, 2026 10:16

dafeda force-pushed the autoscale-clustering branch from 4732e49 to 08ae6d7 Compare April 17, 2026 10:17

xjules requested a review from Copilot April 17, 2026 11:32

Copilot started reviewing on behalf of xjules April 17, 2026 11:32 View session

Copilot AI reviewed Apr 17, 2026

View reviewed changes

larsevj reviewed Apr 17, 2026

View reviewed changes

dafeda force-pushed the autoscale-clustering branch from 08ae6d7 to ed8aefd Compare April 20, 2026 07:07

dafeda force-pushed the autoscale-clustering branch from ed8aefd to 3eee203 Compare April 20, 2026 07:14

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Use correlation-based distance metric for hierarchical clustering#13305

Use correlation-based distance metric for hierarchical clustering#13305
dafeda wants to merge 1 commit intoequinor:mainfrom
dafeda:autoscale-clustering

dafeda commented Apr 15, 2026

Uh oh!

codecov-commenter commented Apr 15, 2026 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI Apr 17, 2026

Uh oh!

Copilot AI Apr 17, 2026

Uh oh!

larsevj Apr 17, 2026

Uh oh!

dafeda Apr 20, 2026

Uh oh!

codspeed-hq Bot commented Apr 20, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

dafeda commented Apr 15, 2026

When applicable

Uh oh!

codecov-commenter commented Apr 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI Apr 17, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 17, 2026

Choose a reason for hiding this comment

Uh oh!

larsevj Apr 17, 2026

Choose a reason for hiding this comment

Uh oh!

dafeda Apr 20, 2026

Choose a reason for hiding this comment

Uh oh!

codspeed-hq Bot commented Apr 20, 2026

Merging this PR will not alter performance

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

codecov-commenter commented Apr 15, 2026 •

edited

Loading