
CA: optimize simulator performance by embedding scheduler structures#9388

Merged
k8s-ci-robot merged 3 commits into kubernetes:master from Choraden:nodeinfo_wrapper_v4
Mar 23, 2026
Conversation

@Choraden
Contributor

What type of PR is this?

/kind cleanup
/kind regression

What this PR does / why we need it:

This PR redesigns simulator framework.NodeInfo and framework.PodInfo to embed their
upstream scheduler implementation counterparts. This eliminates the need for frequent WrapSchedulerNodeInfo calls, significantly reducing memory allocations and GC pressure during simulations.

This fixes the performance regression introduced in the initial DRA implementation.

Which issue(s) this PR fixes:

Fixes #8965

Special notes for your reviewer:

Does this PR introduce a user-facing change?

NONE

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


Update drasnapshot.findPodClaims to return nil instead of an empty slice
when no claims are present. This ensures consistency with
framework.NewPodInfo and prevents test failures due to mismatch between
nil and empty slices in deep comparisons, without needing to relax
strictness in tests.
@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. kind/regression Categorizes issue or PR as related to a regression from a prior release. do-not-merge/needs-area area/cluster-autoscaler labels Mar 19, 2026
@k8s-ci-robot k8s-ci-robot added area/provider/alicloud Issues or PRs related to the AliCloud cloud provider implementation area/provider/aws Issues or PRs related to aws provider area/provider/azure Issues or PRs related to azure provider area/provider/cluster-api Issues or PRs related to Cluster API provider area/provider/equinixmetal Issues or PRs related to the Equinix Metal cloud provider for Cluster Autoscaler area/provider/gce area/provider/hetzner Issues or PRs related to Hetzner provider size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. area/provider/huaweicloud area/provider/kwok Issues or PRs related to the kwok cloud provider for Cluster Autoscaler area/provider/magnum Issues or PRs related to the Magnum cloud provider for Cluster Autoscaler area/provider/oci Issues or PRs related to oci provider cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. area/provider/rancher area/provider/utho Issues or PRs related to Utho provider and removed do-not-merge/needs-area labels Mar 19, 2026
@Choraden Choraden force-pushed the nodeinfo_wrapper_v4 branch from dc2b30c to 4adbf46 Compare March 19, 2026 13:19
@Choraden
Contributor Author

I'm posting the results from the benchmark job https://github.com/kubernetes/autoscaler/actions/runs/23296851893?pr=9388

goos: linux
goarch: amd64
pkg: k8s.io/autoscaler/cluster-autoscaler/core/bench
cpu: Intel(R) Xeon(R) Platinum 8370C CPU @ 2.80GHz
                   │   base.txt   │               pr.txt                │
                   │    sec/op    │   sec/op     vs base                │
RunOnceScaleUp-4      734.7m ± 8%   663.0m ± 9%   -9.77% (p=0.002 n=10)
RunOnceScaleDown-4   1125.0m ± 2%   813.7m ± 1%  -27.68% (p=0.000 n=10)
geomean               909.2m        734.5m       -19.22%

                   │   base.txt   │                pr.txt                │
                   │     B/op     │     B/op      vs base                │
RunOnceScaleUp-4     534.3Mi ± 0%   425.6Mi ± 0%  -20.35% (p=0.000 n=10)
RunOnceScaleDown-4   999.6Mi ± 1%   537.6Mi ± 5%  -46.21% (p=0.000 n=10)
geomean              730.8Mi        478.4Mi       -34.54%

                   │   base.txt   │               pr.txt                │
                   │  allocs/op   │  allocs/op   vs base                │
RunOnceScaleUp-4      5.891M ± 0%   3.863M ± 0%  -34.42% (p=0.000 n=10)
RunOnceScaleDown-4   10.268M ± 0%   2.251M ± 6%  -78.08% (p=0.000 n=10)
geomean               7.778M        2.949M       -62.09%

@Choraden
Contributor Author

/assign @mtrqq @BigDarkClown

@mtrqq
Contributor

mtrqq commented Mar 19, 2026

I'm posting the results from the benchmark job https://github.com/kubernetes/autoscaler/actions/runs/23296851893?pr=9388 (benchmark tables quoted in full above)

It would also be ideal to compare against benchmarks from real-hardware scalability tests. Did you manage to obtain those? This is not a blocking comment, as this PR is a straightforward improvement; I just want to understand the improvements made in real clusters.

Contributor

@mtrqq mtrqq left a comment


First of all, thanks for the change. I left one major comment about data consistency and two nits so far.

One thing I want to call out about this change is that it shifts DRA and CSI computations from on-demand to on-startup when creating a snapshot. I haven't traced the autoscaler code paths specifically to prove this, but it may lead to more DRA & CSI computations than would be needed when lazy-loading those objects (an unconfirmed scenario illustrating the problem: binpacking doesn't need to consider all the nodes from the snapshot to make a scale-up decision).

I think this change is a step in the right direction, but we should probably also explore reconciling the cluster snapshot on each loop instead of completely throwing it away and reconstructing it every iteration; this may yield large performance improvements, especially in huge clusters.

Just to clarify: this discussion isn't blocking this PR by any means.

@x13n @BigDarkClown WDYT?

@x13n
Member

x13n commented Mar 20, 2026

(quoting @mtrqq's review comment above in full)

I agree constant snapshot recreation is costly, but it is orthogonal, and from what I've seen the problem addressed by this PR is much more visible in large clusters. NodeInfo wrapping was causing major CPU burn, so I'm happy to see this addressed. We can check whether there are other problems worth tackling once this one is out of the way. Let's launch and iterate.

@x13n
Member

x13n commented Mar 20, 2026

Is #9022 obsoleted by this one?

@Choraden
Contributor Author

(quoting @mtrqq's question above about comparing against real-hardware scalability tests)

Thanks to @tetianakh we managed to obtain a profile diff:
(screenshot: CPU profile diff, captured 2026-03-23)
cluster spec:

  • ~4k nodes
  • ~17k pods
  • idle state i.e. no operation running in the cluster

@Choraden
Contributor Author

Is #9022 obsoleted by this one?

Yes, #9022 is superseded by this PR and will be closed. I've narrowed the scope here to essential performance fixes to streamline the review while @towca is away. Now that the Cluster Autoscaler Benchmark job is available, I'll follow up with the remaining minor improvements as separate, data-justified PRs.

@Choraden Choraden force-pushed the nodeinfo_wrapper_v4 branch from 4adbf46 to e87265e Compare March 23, 2026 10:34
// Manual initialization may result in errors.
type PodInfo struct {
	// This type embeds *apiv1.Pod to make the accesses easier - most of the code just needs to access the Pod.
	*apiv1.Pod
Contributor


Do we still need to store pod if we already have PodInfo?

Contributor Author


Technically, we don't, but I decided to keep it to save this PR from all the boilerplate required to remove the embedded pod:
PodInfo.<embedded field> -> PodInfo.Pod().<embedded field>

I can follow up with that change once this is merged.

Contributor


Sounds good to me.

ni := framework.NewNodeInfo(node, slices)

if s.enableCSINodeAwareScheduling && csiSnapshot != nil {
	if csiNode, err := csiSnapshot.Get(node.Name); err == nil {
Contributor


Should we return the error if one is returned here?

Contributor Author


csiSnapshot.Get returns an error if a snapshot with the given name is not found, so if we push that error down there, SetClusterState will fail every time CSINodeAwareScheduling is enabled and a node does not have a csiSnapshot. Is that expected?

Contributor Author


I decided to push the error down to keep the logic as it was before the change. I'll open an issue to discuss that uncertainty.

Redesign simulator framework.NodeInfo and framework.PodInfo to embed their
upstream scheduler implementation counterparts. This eliminates the need for
frequent WrapSchedulerNodeInfo calls, significantly reducing memory
allocations and GC pressure during simulations.

This change shifts the ClusterSnapshot from a just-in-time wrapping model
to a persistent, pre-populated approach. Internal objects now eagerly
integrate DRA and CSI data, ensuring that all necessary state is readily
available without redundant lookups or repeated object creation.

Main implementation paths:
- Embed schedulerimpl.NodeInfo and schedulerimpl.PodInfo into CA's internal
  framework types to natively satisfy scheduler interfaces.
- Update ClusterSnapshotStore to manage internal NodeInfo/PodInfo directly,
  removing the need for ToScheduler() conversions.
- Enforce internal type consistency in NodeInfo operations (e.g., AddPod)
  to preserve critical DRA/CSI metadata throughout the simulation.
- Optimize snapshotting by implementing efficient Snapshot() methods on the
  refined NodeInfo structures.
@Choraden Choraden force-pushed the nodeinfo_wrapper_v4 branch from e87265e to a4d5e3c Compare March 23, 2026 16:00
@BigDarkClown
Contributor

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 23, 2026
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: BigDarkClown, Choraden

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 23, 2026
@k8s-ci-robot k8s-ci-robot merged commit eb987bf into kubernetes:master Mar 23, 2026
10 of 11 checks passed
@Choraden Choraden deleted the nodeinfo_wrapper_v4 branch March 23, 2026 18:22

Development

Successfully merging this pull request may close these issues.

CA DRA: remove NodeInfo wrapping
