fix: connect via all DNS IPs when first contact point is non-responsive (DRIVER-201) #865

Open
nikagra wants to merge 1 commit into scylladb:scylla-4.x from nikagra:fix/DRIVER-201-non-responsive-first-contact-point

@nikagra nikagra commented Apr 10, 2026

Problem

When RESOLVE_CONTACT_POINTS=false (the default), a contact point hostname is stored as a single unresolved InetSocketAddress. At connection time, ChannelFactory passed this address directly to Netty's bootstrap.connect(), which internally called InetAddress.getByName() — returning only the first IP for the hostname. If that IP was non-responsive, the driver raised AllNodesFailedException with no fallback to other IPs the hostname might resolve to.

This is particularly impactful in dynamic DNS environments where a hostname can map to multiple nodes and the first one may be temporarily unavailable.
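
As a rough illustration of the difference described above, the standalone sketch below compares the two JDK lookups: InetAddress.getByName() returns a single address, while InetAddress.getAllByName() returns every address the resolver reports. The class ResolveAllDemo and the helper resolveAll are hypothetical names used only for this example; they are not part of the driver or of this PR.

```java
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.Arrays;
import java.util.List;

// Hypothetical demo class, not driver code.
final class ResolveAllDemo {

  /** Returns every address the resolver reports for the host, in resolver order. */
  static List<InetAddress> resolveAll(String host) {
    try {
      return Arrays.asList(InetAddress.getAllByName(host));
    } catch (UnknownHostException e) {
      // Rethrown unchecked so callers can use this helper inside lambdas.
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) throws UnknownHostException {
    String host = args.length > 0 ? args[0] : "localhost";
    // Single-address lookup: only one IP is visible to the caller.
    System.out.println("getByName:    " + InetAddress.getByName(host));
    // Multi-address lookup: all IPs are available as fallback candidates.
    System.out.println("getAllByName: " + resolveAll(host));
  }
}
```

Running it against a hostname with several A records should show one address on the first line and the full set on the second; the exact output depends on the local resolver.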

Fixes DRIVER-201.

Changes

ChannelFactory.java

Refactored the connection logic using the Parameter Object pattern and a single-responsibility (SRP) decomposition:

  • ConnectRequest inner class: bundles all per-connection-attempt state, eliminating the unwieldy 9-parameter private connect() method.
  • connect(ConnectRequest): resolves the endpoint address. For unresolved hostnames, calls InetAddress.getAllByName() to expand to all known IPs, building a candidate list. DNS resolution is still deferred to connection time — the EndPoint continues to hold the unresolved hostname — preserving the dynamic-DNS semantics of RESOLVE_CONTACT_POINTS=false.
  • tryNextAddress(request, candidates, index): iterates the candidate list. On per-address failure it tries the next candidate; only when all are exhausted does it fail the overall resultFuture.
  • connectToAddress(request, address): performs a single Netty bootstrap connect to one resolved IP. Protocol-version negotiation (downgrade retries) is scoped to the same IP, which is semantically correct. Returns its own CompletableFuture<DriverChannel> so tryNextAddress can distinguish a per-address failure from the final outcome.
  • tryNextAddressRaw: unchanged pass-through for non-InetSocketAddress types (e.g. Unix domain sockets).
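
The sequential fallback described in the list above can be sketched roughly as follows. This is a simplified illustration, not the code from ChannelFactory: the class SequentialFallback, the method connectAny, and the function-typed connectToAddress parameter are hypothetical stand-ins, there is no ConnectRequest or protocol negotiation, and the port 9042 and the addresses in main are assumed values chosen to mirror the test setup.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.function.Function;

// Hypothetical sketch of "try each resolved IP in order"; not the driver's code.
final class SequentialFallback {

  /**
   * Tries the candidates in order. The returned future completes with the first
   * successful result, or fails with the last error once all candidates are exhausted.
   */
  static <T> CompletableFuture<T> connectAny(
      List<InetSocketAddress> candidates,
      Function<InetSocketAddress, CompletableFuture<T>> connectToAddress) {
    CompletableFuture<T> result = new CompletableFuture<>();
    if (candidates.isEmpty()) {
      result.completeExceptionally(new IllegalArgumentException("no candidate addresses"));
    } else {
      tryNext(candidates, 0, connectToAddress, result);
    }
    return result;
  }

  private static <T> void tryNext(
      List<InetSocketAddress> candidates,
      int index,
      Function<InetSocketAddress, CompletableFuture<T>> connectToAddress,
      CompletableFuture<T> result) {
    // Each attempt has its own future, so a per-address failure is
    // distinguishable from the overall outcome.
    connectToAddress.apply(candidates.get(index)).whenComplete((channel, error) -> {
      if (error == null) {
        result.complete(channel);
      } else if (index + 1 < candidates.size()) {
        tryNext(candidates, index + 1, connectToAddress, result);
      } else {
        result.completeExceptionally(error);
      }
    });
  }

  public static void main(String[] args) {
    List<InetSocketAddress> candidates = Arrays.asList(
        new InetSocketAddress("127.0.1.11", 9042),  // assumed unreachable
        new InetSocketAddress("127.0.1.1", 9042));  // assumed reachable
    // Stand-in for a real network connect: fails for the first IP only.
    CompletableFuture<String> channel = connectAny(candidates, address -> {
      CompletableFuture<String> attempt = new CompletableFuture<>();
      String ip = address.getAddress().getHostAddress();
      if (ip.equals("127.0.1.11")) {
        attempt.completeExceptionally(new IOException("unreachable: " + ip));
      } else {
        attempt.complete("connected to " + ip);
      }
      return attempt;
    });
    System.out.println(channel.join());
  }
}
```

Giving each attempt its own future is what lets the caller fail the overall result only after the last candidate has failed, which matches the behavior the PR describes for tryNextAddress.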

MockResolverIT.java

New integration test should_connect_when_first_dns_entry_is_non_responsive:

  • 2-node CCM cluster on 127.0.1.x
  • test.cluster.fake resolves to 127.0.1.11 (non-existent) first, then nodes 1 and 2
  • RESOLVE_CONTACT_POINTS=false, RECONNECT_ON_INIT=false
  • Asserts the session opens successfully and both nodes come up despite the first DNS entry being unreachable

Copilot AI left a comment

Pull request overview

This PR addresses DRIVER-201 by improving initial contact point connection behavior when RESOLVE_CONTACT_POINTS=false, ensuring that if a hostname resolves to multiple IPs and the first is unreachable, the driver can fall back to subsequent IPs instead of failing immediately.

Changes:

  • Refactors ChannelFactory connection logic to expand unresolved hostnames to all DNS IPs at connection time and attempt them sequentially.
  • Introduces a ConnectRequest parameter object and decomposes connection steps into smaller methods.
  • Adds an integration test covering the “first DNS A record is unreachable” scenario.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| core/src/main/java/com/datastax/oss/driver/internal/core/channel/ChannelFactory.java | Expands unresolved hostnames to all resolved IPs at connect time and adds sequential fallback across addresses. |
| integration-tests/src/test/java/com/datastax/oss/driver/core/resolver/MockResolverIT.java | Adds an integration test to validate successful connection when the first DNS entry is non-responsive. |

…ve (DRIVER-201)

When RESOLVE_CONTACT_POINTS=false (the default), a hostname is stored as a
single unresolved InetSocketAddress. At connection time Netty's bootstrap
called InetAddress.getByName(), returning only the first IP. If that IP was
non-responsive the driver raised AllNodesFailedException with no fallback.

Refactor ChannelFactory to:
- Introduce ConnectRequest parameter object, eliminating the 9-parameter
  private connect() method.
- Decompose the connection logic into three single-responsibility methods:
  connect(ConnectRequest)      – resolves hostname via getAllByName() and
                                 builds a candidate list of all IPs;
  tryNextAddress(...)          – iterates candidates, falling back to the
                                 next IP on failure;
  connectToAddress(...)        – performs a single Netty connect and handles
                                 protocol-version negotiation on the same IP.
- DNS resolution is still deferred to connection time, preserving the
  dynamic-DNS semantics of RESOLVE_CONTACT_POINTS=false.

Add integration test should_connect_when_first_dns_entry_is_non_responsive
to MockResolverIT that maps the first DNS entry to a non-existent IP and
asserts the session opens successfully via the subsequent entries.

nikagra force-pushed the fix/DRIVER-201-non-responsive-first-contact-point branch from 89e53ea to 7da0919 on April 10, 2026 13:05
nikagra marked this pull request as ready for review on April 10, 2026 14:27
nikagra requested a review from dkropachev on April 10, 2026 14:57