CASSANALYTICS-175: Exclude IP address from RingInstance equality so node replacement does not fail bulk write jobs #216

Open
rustyrazorblade wants to merge 1 commit into apache:trunk from rustyrazorblade:ring-instance-ip-equality

During S3 bulk writes, CassandraTopologyMonitor polls the topology every 5 seconds and cancels the job if it differs from the topology captured at job start. The comparison bottoms out in RingInstance.equals, which included the node's IP address.

A node that goes down and rejoins with a different IP is the same logical instance with the same token ownership. This is routine in Kubernetes, where a rescheduled pod keeps its hostname, host ID, and data but gets a new IP. The write remains correct, but the monitor reported "Topology changed during bulk write" and failed the job. On clusters with hundreds of nodes across several DCs, this makes long-running jobs fail routinely. The same equality is also used by RecordWriter.validateTaskTokenRangeMappings on every executor task (both transports) and by the per-instance consistency accounting in ReplicaAwareFailureHandler and ImportCompletionCoordinator, which counted the old-IP and new-IP instances as two distinct replicas.
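The double counting described above can be illustrated with a minimal sketch. This is not the project's code: `IpSensitiveInstance` and `ReplicaCountSketch` are hypothetical names, the fields are reduced to three, and the sketch assumes the accounting keys replicas through `equals`/`hashCode` (for example in a `HashSet`).

```java
import java.util.HashSet;
import java.util.Objects;
import java.util.Set;

// Hypothetical, reduced stand-in for the pre-patch RingInstance:
// the IP address takes part in equals/hashCode.
final class IpSensitiveInstance {
    final String fqdn;
    final String ipAddress;
    final String token;

    IpSensitiveInstance(String fqdn, String ipAddress, String token) {
        this.fqdn = fqdn;
        this.ipAddress = ipAddress;
        this.token = token;
    }

    @Override
    public boolean equals(Object other) {
        if (this == other) return true;
        if (!(other instanceof IpSensitiveInstance)) return false;
        IpSensitiveInstance that = (IpSensitiveInstance) other;
        return fqdn.equals(that.fqdn)
            && ipAddress.equals(that.ipAddress)
            && token.equals(that.token);
    }

    @Override
    public int hashCode() {
        return Objects.hash(fqdn, ipAddress, token);
    }
}

// Hypothetical helper: counts replicas the way a hash-based accounting would (assumption).
final class ReplicaCountSketch {
    static int distinctReplicas(IpSensitiveInstance... instances) {
        Set<IpSensitiveInstance> seen = new HashSet<>();
        for (IpSensitiveInstance instance : instances) {
            seen.add(instance);
        }
        return seen.size();
    }
}
```

With a made-up host `cassandra-0.example` seen once at `10.0.0.5` and once at `10.0.0.9`, `distinctReplicas` returns 2 even though both entries describe the same node and token.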

This patch removes the IP address from RingInstance.equals/hashCode. Instance identity is now clusterId, token, fqdn, rack, port, and datacenter — sufficient to distinguish nodes, since two live nodes cannot share fqdn + port + datacenter. Sidecar resolves fqdn via reverse DNS and falls back to the IP string when resolution fails, so deployments without DNS see no behavior change. Real topology changes (nodes added, removed, joining, leaving) are still detected through instance membership and pending-state comparison.
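As a rough sketch of the identity described above (not the actual diff), the class below keeps the IP address as a field but leaves it out of `equals` and `hashCode`. The class name `RingInstanceIdentitySketch`, the constructor shape, and the field types are assumptions made for illustration; only the list of identity fields comes from this description.

```java
import java.util.Objects;

// Hypothetical stand-in for RingInstance after the patch; names and types are assumed.
final class RingInstanceIdentitySketch {
    private final String clusterId;
    private final String token;
    private final String fqdn;
    private final String rack;
    private final int port;
    private final String datacenter;
    private final String ipAddress; // kept for connectivity, not part of identity

    RingInstanceIdentitySketch(String clusterId, String token, String fqdn, String rack,
                               int port, String datacenter, String ipAddress) {
        this.clusterId = clusterId;
        this.token = token;
        this.fqdn = fqdn;
        this.rack = rack;
        this.port = port;
        this.datacenter = datacenter;
        this.ipAddress = ipAddress;
    }

    String ipAddress() {
        return ipAddress;
    }

    // Identity: clusterId, token, fqdn, rack, port, datacenter. The IP is deliberately absent.
    @Override
    public boolean equals(Object other) {
        if (this == other) return true;
        if (!(other instanceof RingInstanceIdentitySketch)) return false;
        RingInstanceIdentitySketch that = (RingInstanceIdentitySketch) other;
        return port == that.port
            && Objects.equals(clusterId, that.clusterId)
            && Objects.equals(token, that.token)
            && Objects.equals(fqdn, that.fqdn)
            && Objects.equals(rack, that.rack)
            && Objects.equals(datacenter, that.datacenter);
    }

    // Must use exactly the fields compared in equals so equal instances hash alike.
    @Override
    public int hashCode() {
        return Objects.hash(clusterId, token, fqdn, rack, port, datacenter);
    }
}
```

Two instances that differ only in `ipAddress` compare equal and share a hash code, so a rejoined pod lands on the same key in hash-based collections. When fqdn falls back to the IP string, a changed IP still changes `fqdn`, which is why deployments without DNS keep the old behavior.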

The IP address was originally added to the equality in CASSANDRA-18852, the same change that introduced building RingInstance from ReplicaMetadata, which carries no token — leaving the IP as a stand-in discriminator.

JIRA: https://issues.apache.org/jira/browse/CASSANALYTICS-175

…ode replacement does not fail bulk write jobs

A node that is replaced and returns with a different IP address (e.g. a pod
replacement in Kubernetes) is the same logical instance, but RingInstance
equality included the IP address, so CassandraTopologyMonitor reported a
topology change and failed S3 bulk write jobs. The same comparison runs in
RecordWriter task validation and per-instance consistency accounting.

Instance identity is now clusterId, token, fqdn, rack, port and datacenter.

Patch by Jon Haddad for CASSANALYTICS-175