Before Cluster Autoscaler was integrated with DRA, ClusterSnapshot would directly store the scheduler-framework NodeInfo objects, and CA code would operate on these scheduler-framework objects.
In order to support DRA, CA had to start tracking ResourceSlices and ResourceClaims in addition to just Nodes and Pods. The scheduler NodeInfo type doesn't contain them, and back then it was just a struct, not an interface, so we couldn't directly extend it. Instead, we introduced internal NodeInfo and PodInfo types that wrap the scheduler structs and add CA-specific fields on top (see autoscaler/cluster-autoscaler/simulator/framework/infos.go, line 47 at 91bae6b).
Scheduler NodeInfos are tracked in ClusterSnapshotStore (see autoscaler/cluster-autoscaler/simulator/clustersnapshot/store/delta.go, line 54 at 91bae6b). ClusterSnapshotStore also contains dynamicresources/snapshot.Snapshot, which tracks DRA objects (see autoscaler/cluster-autoscaler/simulator/dynamicresources/snapshot/snapshot.go, line 48 at 91bae6b).
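The wrapping can be pictured roughly like this. The types below are heavily simplified stand-ins, not the actual autoscaler code: field and type names are illustrative, and the real scheduler-framework and DRA API types are much richer.

```go
package main

import "fmt"

// Simplified stand-ins for the real types (illustrative only).
type SchedulerNodeInfo struct{ NodeName string }
type ResourceSlice struct{ Name string }
type ResourceClaim struct{ Name string }

// PodInfo wraps the scheduler pod representation and adds the pod's
// ResourceClaims, which the scheduler type doesn't carry.
type PodInfo struct {
	PodName      string
	NeededClaims []ResourceClaim
}

// NodeInfo wraps the scheduler struct and adds CA-specific DRA fields.
type NodeInfo struct {
	*SchedulerNodeInfo                  // embedded scheduler-framework NodeInfo
	LocalResourceSlices []ResourceSlice // ResourceSlices local to this node
	PodInfos            []*PodInfo      // internal PodInfos for pods on the node
}

func main() {
	ni := &NodeInfo{
		SchedulerNodeInfo:   &SchedulerNodeInfo{NodeName: "node-1"},
		LocalResourceSlices: []ResourceSlice{{Name: "gpu-slice"}},
	}
	// Fields of the embedded struct are promoted, so existing accesses keep working.
	fmt.Println(ni.NodeName, len(ni.LocalResourceSlices))
}
```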
Whenever scheduler plugins list (the scheduler version of) NodeInfos or DRA objects through the scheduler framework, they are listed directly from ClusterSnapshotStore and dynamicresources/snapshot.Snapshot with no memory allocations.
On the other hand, Cluster Autoscaler code always lists/gets the CA-specific version of NodeInfo. Since parts of the object live in different places (ClusterSnapshotStore and dynamicresources/snapshot.Snapshot), the internal NodeInfos are always computed on the fly, wrapping both parts (see autoscaler/cluster-autoscaler/simulator/clustersnapshot/predicate/predicate_snapshot.go, line 54 at 91bae6b).
This results in allocating memory for the internal NodeInfo every time it's accessed from ClusterSnapshot. After we added parallelism to binpacking in #8729, benchmarking showed that these memory allocations are now a performance bottleneck.
The scheduler-framework NodeInfo was changed to an interface in K8s 1.34, so we can now return the internal NodeInfos directly to scheduler code - as long as they implement the interface. Since the internal NodeInfo wraps the scheduler one, most of the interface methods can just be passed through to the wrapped scheduler struct.
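The pass-through idea can be sketched as follows. This is a hypothetical, minimal version: the real scheduler-framework NodeInfo interface has many more methods, and the names below are stand-ins.

```go
package main

import "fmt"

// Hypothetical, heavily simplified version of the scheduler-framework
// NodeInfo interface (the real one, an interface since K8s 1.34, is much larger).
type FrameworkNodeInfo interface {
	Node() string
}

// schedulerNodeInfo stands in for the concrete scheduler-framework struct.
type schedulerNodeInfo struct{ nodeName string }

func (s *schedulerNodeInfo) Node() string { return s.nodeName }

// NodeInfo is the CA-internal type. It wraps the scheduler struct, so most
// interface methods can simply be passed through to it; scheduler code can
// then use the internal NodeInfo directly, with no wrapping allocation.
type NodeInfo struct {
	sched *schedulerNodeInfo
	// CA-specific DRA fields would live here.
}

func (n *NodeInfo) Node() string { return n.sched.Node() }

func main() {
	// The internal NodeInfo satisfies the framework interface directly.
	var fni FrameworkNodeInfo = &NodeInfo{sched: &schedulerNodeInfo{nodeName: "node-1"}}
	fmt.Println(fni.Node())
}
```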
Proposed solution:
High-level overview:
Make internal NodeInfo implement the scheduler-framework NodeInfo interface, so that it can be directly returned and used by the scheduler framework.
Change the "storage version" in ClusterSnapshotStore from scheduler NodeInfo/PodInfo to CA-internal NodeInfo/PodInfo. Have ClusterSnapshotStore directly provide the CA-internal objects to both CA and scheduler-framework without any wrapping/memory allocations.
Modify ClusterSnapshotStore interface and implementations so that any time the internal NodeInfo/PodInfo objects are allocated during its methods, they have the CA-specific fields correctly set using DRA snapshot before storing.
Modify ClusterSnapshot interface and its PredicateSnapshot implementation to delegate NodeInfo getting/listing to the embedded ClusterSnapshotStore.
Detailed plan:
Implement the scheduler-framework NodeInfo interface for the CA-internal NodeInfo struct, passing most methods through to the embedded scheduler-framework NodeInfo implementation.
Handling PodInfo will be tricky: the internal NodeInfo.Pods() should return a list of internal PodInfo objects without allocating them on the fly, so we can't just pass this method through to the scheduler NodeInfo.Pods(). We have to actually embed the scheduler PodInfo in the internal PodInfo and implement the PodInfo interface, keep a list of internal PodInfos in the internal NodeInfo, and return that list from the internal NodeInfo.Pods().
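A minimal sketch of that shape, with stand-in types (the real scheduler PodInfo and the DRA fields are simplified away):

```go
package main

import "fmt"

// schedulerPodInfo stands in for the scheduler-framework PodInfo struct.
type schedulerPodInfo struct{ podName string }

// The internal PodInfo embeds the scheduler PodInfo and adds CA-specific state.
type PodInfo struct {
	*schedulerPodInfo
	// DRA fields (e.g. the pod's ResourceClaims) would live here.
}

// The internal NodeInfo keeps a ready-made []*PodInfo, kept up to date on
// every snapshot mutation, so Pods() can return the stored slice instead of
// allocating wrappers per call.
type NodeInfo struct {
	podInfos []*PodInfo
}

// Pods returns the stored internal PodInfos; no per-call allocation.
func (n *NodeInfo) Pods() []*PodInfo { return n.podInfos }

func main() {
	ni := &NodeInfo{podInfos: []*PodInfo{{schedulerPodInfo: &schedulerPodInfo{podName: "pod-a"}}}}
	fmt.Println(len(ni.Pods()), ni.Pods()[0].podName)
}
```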
Modify ClusterSnapshotStore interface and both implementations (BasicSnapshotStore and DeltaSnapshotStore):
Store the internal NodeInfos instead of the scheduler version in the implementations. The ClusterSnapshotStore.NodeInfos() method exposing the NodeInfos to scheduler code should keep working as-is, because the internal NodeInfo implements the correct interface.
AddSchedulerNodeInfo and RemoveSchedulerNodeInfo should be changed to operate on the internal NodeInfo, and renamed to something like ForceAddNodeInfo/ForceRemoveNodeInfo. The internal NodeInfo provided to ForceAddNodeInfo() should already have all the CA-specific parts filled in, so it can just be stored as-is.
Move the logic that allocates internal NodeInfo objects (including all the CA-specific parts) from on-the-fly computation in PredicateSnapshot.GetNodeInfo()/ListNodeInfos() to once-per-loop computation in ClusterSnapshotStore.SetClusterState().
Change ForceAddPod() to allocate an internal PodInfo and include all the DRA-related information (taken from the DRA snapshot) before storing it in a NodeInfo.
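Put together, the reworked store surface might look roughly like this. The method names follow the proposal, but the signatures and the toy in-memory implementation are illustrative assumptions, not the actual interface:

```go
package main

import "fmt"

// Minimal stand-in types so the sketch is self-contained.
type Pod struct{ Name string }
type NodeInfo struct {
	NodeName string
	PodInfos []*Pod
}

// Hypothetical shape of the reworked ClusterSnapshotStore surface.
type ClusterSnapshotStore interface {
	// Stores the fully-formed internal NodeInfo as-is; the caller must
	// already have filled in the CA-specific (DRA) parts.
	ForceAddNodeInfo(ni *NodeInfo) error
	ForceRemoveNodeInfo(nodeName string) error
	// Allocates an internal PodInfo, fills in DRA information from the
	// DRA snapshot, then stores it in the node's NodeInfo.
	ForceAddPod(pod *Pod, nodeName string) error
}

// basicStore is a toy in-memory implementation of the interface above.
type basicStore struct{ nodes map[string]*NodeInfo }

func (s *basicStore) ForceAddNodeInfo(ni *NodeInfo) error {
	s.nodes[ni.NodeName] = ni // stored as-is, no wrapping
	return nil
}

func (s *basicStore) ForceRemoveNodeInfo(name string) error {
	delete(s.nodes, name)
	return nil
}

func (s *basicStore) ForceAddPod(pod *Pod, nodeName string) error {
	ni, ok := s.nodes[nodeName]
	if !ok {
		return fmt.Errorf("node %q not found", nodeName)
	}
	// In the real store, DRA info from the DRA snapshot would be attached here.
	ni.PodInfos = append(ni.PodInfos, pod)
	return nil
}

func main() {
	s := &basicStore{nodes: map[string]*NodeInfo{}}
	s.ForceAddNodeInfo(&NodeInfo{NodeName: "node-1"})
	s.ForceAddPod(&Pod{Name: "pod-a"}, "node-1")
	fmt.Println(len(s.nodes["node-1"].PodInfos))
}
```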
Modify ClusterSnapshot interface and implementation (PredicateSnapshot):
Move GetNodeInfo() and ListNodeInfos() methods from ClusterSnapshot to ClusterSnapshotStore. Implement the new methods in both ClusterSnapshotStore implementations by just getting/listing the stored internal NodeInfos without any wrapping.
Change AddNodeInfo() to call the new ClusterSnapshotStore.ForceAddNodeInfo() and pass the whole internal NodeInfo to it.
Change RemoveNodeInfo to call the new ClusterSnapshotStore.ForceRemoveNodeInfo().
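After the move, PredicateSnapshot just delegates NodeInfo access to the embedded store. A self-contained sketch, with stand-in types and assumed method signatures:

```go
package main

import "fmt"

// Stand-in internal NodeInfo type for the sketch.
type NodeInfo struct{ NodeName string }

// Hypothetical store surface after moving GetNodeInfo/ListNodeInfos
// from ClusterSnapshot to ClusterSnapshotStore.
type ClusterSnapshotStore interface {
	GetNodeInfo(name string) (*NodeInfo, error)
	ListNodeInfos() ([]*NodeInfo, error)
}

// mapStore is a toy implementation returning stored NodeInfos directly.
type mapStore struct{ nodes map[string]*NodeInfo }

func (s *mapStore) GetNodeInfo(name string) (*NodeInfo, error) {
	ni, ok := s.nodes[name]
	if !ok {
		return nil, fmt.Errorf("node %q not found", name)
	}
	return ni, nil // the stored internal NodeInfo, no wrapping
}

func (s *mapStore) ListNodeInfos() ([]*NodeInfo, error) {
	out := make([]*NodeInfo, 0, len(s.nodes))
	for _, ni := range s.nodes {
		out = append(out, ni)
	}
	return out, nil
}

// PredicateSnapshot delegates NodeInfo getting/listing to the embedded store.
type PredicateSnapshot struct{ store ClusterSnapshotStore }

func (s *PredicateSnapshot) GetNodeInfo(name string) (*NodeInfo, error) {
	return s.store.GetNodeInfo(name)
}

func (s *PredicateSnapshot) ListNodeInfos() ([]*NodeInfo, error) {
	return s.store.ListNodeInfos()
}

func main() {
	snap := &PredicateSnapshot{store: &mapStore{nodes: map[string]*NodeInfo{"node-1": {NodeName: "node-1"}}}}
	ni, _ := snap.GetNodeInfo("node-1")
	fmt.Println(ni.NodeName)
}
```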
Mental model changes
Currently, a good mental model for Cluster Snapshot and DRA is something like this:
ClusterSnapshotStore is the single storage source-of-truth for Nodes and Pods.
ClusterSnapshotStore.DraSnapshot() is the single storage source-of-truth for DRA objects.
ClusterSnapshot.GetNodeInfo()/ListNodeInfos() correlates Nodes and Pods with the DRA objects, and returns the CA-internal NodeInfo objects combining both. The result is "ephemeral": read-only, only valid until the next ClusterSnapshot mutation, and shouldn't be stored.
Any time ClusterSnapshot is mutated, ClusterSnapshotStore and the embedded ClusterSnapshotStore.DraSnapshot() are mutated independently. The previous results of GetNodeInfo()/ListNodeInfos() are now invalid, so we don't care about "synchronizing" them with the mutation. For each tracked object, we only change the single source of truth it's stored in.
The proposed changes make the mental model more complex:
ClusterSnapshotStore.DraSnapshot() still stores the DRA objects and is still the source-of-truth conceptually, but it's not the only place where the DRA objects are stored.
ClusterSnapshotStore stores fully-formed CA-internal NodeInfo objects - so Nodes, Pods, and DRA objects. It's still the single storage source-of-truth for Nodes and Pods.
DRA objects are now stored in two places - ClusterSnapshotStore.DraSnapshot(), and the CA-internal NodeInfos tracked in ClusterSnapshotStore. We can't get away from having a separate DRA snapshot because there are objects to track that aren't bound to a particular Node or Pod. We also can't get away from having the DRA objects in the internal NodeInfos stored by ClusterSnapshotStore - either they're persisted there, or we're back to allocating new NodeInfo objects on the fly.
We still don't care about synchronizing mutations to the ephemeral results of GetNodeInfo()/ListNodeInfos(), but we now need to synchronize any mutations between ClusterSnapshotStore.DraSnapshot() and the CA-internal NodeInfos tracked in ClusterSnapshotStore. In return, any time we get a NodeInfo from ClusterSnapshotStore, we can just return it with no extra logic.
I couldn't come up with a solution that doesn't make the mental model more complex at all, and at least in the proposed solution the new complexity is limited to ClusterSnapshot and ClusterSnapshotStore implementations. The expectations for the rest of the code base shouldn't change.
Additional context:
Add csinode limit awareness in cluster-autoscaler #8721 adds more fields to the CA-internal NodeInfo, and a new CSISnapshot to ClusterSnapshotStore for filling them in. It'll probably land before this issue is fixed, so the parts that mention "DRA" in the implementation plan above will become "DRA and CSINode". Both work the same way, so the same solution should apply.
Which component are you using?
/area cluster-autoscaler
/area core-autoscaler