Skip to content

asyncio reactor: SSL connections hang due to incorrect use of loop.sock_recv()/sock_sendall() with SSLSocket #821

@mykaul

Description

@mykaul

Summary

The asyncio reactor (AsyncioConnection) uses loop.sock_recv() and loop.sock_sendall() with an ssl.SSLSocket. CPython explicitly rejects SSLSocket in these APIs — raising TypeError("Socket cannot be of type SSLSocket") — on all Python versions this driver supports (≥3.9). The TypeError is silently swallowed because the run_coroutine_threadsafe futures are never inspected, causing the read and write coroutines to die immediately. The connection then hangs with no I/O until Connection.factory() times out with OperationTimedOut.

This was surfaced by PR #773 (updating the CI Scylla version from 2025.2 to 2026.1), which enables the test_client_routes.py SSL tests for the first time. All 5 asyncio CI jobs fail on TestSslThroughNlb::test_ssl_without_hostname_verification_through_nlb with OperationTimedOut, while all libev and asyncore jobs pass.

In short: asyncio + SSL has never worked on supported Python versions. It was never caught because no CI-exercised test path combined these two until now.

Root Cause

The failure chain

  1. AsyncioConnection.__init__() (asyncioreactor.py:88) calls _connect_socket(), which wraps the raw socket with ssl_context.wrap_socket() (connection.py:1073), producing an ssl.SSLSocket. The TCP connection and SSL handshake complete synchronously (blocking) during _connect_socket().

  2. The socket is set to non-blocking (asyncioreactor.py:93), then handle_read() and handle_write() coroutines are scheduled on the event loop thread via run_coroutine_threadsafe() (asyncioreactor.py:102-107).

  3. _send_options_message() (asyncioreactor.py:108) queues the CQL OPTIONS message via push()_push_msg()_write_queue.put_nowait().

  4. On the event loop thread, handle_write() dequeues the message and calls await self._loop.sock_sendall(self._socket, next_msg) (asyncioreactor.py:203). Since self._socket is an ssl.SSLSocket, CPython's _check_ssl_socket() guard immediately raises TypeError("Socket cannot be of type SSLSocket").

  5. The TypeError is not caught by except socket.error (TypeError is not a subclass of OSError). It propagates out of the coroutine and is stored in the concurrent.futures.Future returned by run_coroutine_threadsafe().

  6. Nobody ever calls .result() on _read_watcher or _write_watcher, so the TypeError is silently swallowed. Both coroutines are dead.

  7. With no read/write I/O, the CQL OPTIONS message is never sent, no SUPPORTED response is ever received, connected_event is never set, and Connection.factory() (connection.py:980) times out with OperationTimedOut.

Why libev/asyncore work

Both libevreactor.py and asyncorereactor.py call self._socket.recv() / self._socket.send() directly — the SSL socket's own methods — rather than passing the socket through asyncio's sock_recv()/sock_sendall() APIs. They also handle SSL_ERROR_WANT_READ/SSL_ERROR_WANT_WRITE on both read and write paths:

# libevreactor.py handle_write (line 330-341)
except socket.error as err:
    if (err.args[0] in NONBLOCKING or
            err.args[0] in (ssl.SSL_ERROR_WANT_READ, ssl.SSL_ERROR_WANT_WRITE)):
        ...

# libevreactor.py handle_read (line 367-374)
if isinstance(err, ssl.SSLError):
    if err.args[0] in (ssl.SSL_ERROR_WANT_READ, ssl.SSL_ERROR_WANT_WRITE):
        ...

Why the right fix is asyncio SSL Transport, not just catching TypeError

Even if you bypassed the _check_ssl_socket guard (e.g., by calling self._socket.recv() directly like libev does), sock_recv()/sock_sendall() have deeper architectural problems with SSL sockets:

  1. They only catch BlockingIOError/InterruptedError internally. SSL sockets raise SSLWantReadError/SSLWantWriteError instead, which are not caught by the retry logic.

  2. Direction mismatch. sock_recv() registers add_reader() on the fd; sock_sendall() registers add_writer(). But SSL operations can need the opposite direction — recv() may need to write (TLS key update response), and send() may need to read. These APIs have no mechanism for cross-direction I/O.

  3. TLS 1.3 makes cross-direction I/O common. TLS 1.3 NewSessionTicket messages arrive immediately after the handshake, and KeyUpdate messages can arrive at any time. Processing these during a recv() may require writing, and vice versa. With TLS 1.2 this was rare; with TLS 1.3 it is nearly guaranteed on the first read.

  4. handle_write has no SSL error handling. Even with direct recv()/send(), the current handle_write (asyncioreactor.py:198) only catches socket.errorSSLWantReadError/SSLWantWriteError during a write would defunct the connection.

Asyncio's built-in SSL transport (SSLProtocol + ssl.MemoryBIO) handles all of this correctly: it keeps the raw TCP socket for the selector, uses ssl.SSLObject over memory BIOs for the TLS state machine, catches SSLWantReadError as "try again later," and flushes outgoing TLS frames after every operation.

Why it wasn't caught before

  • The client_routes SSL tests are gated by @skip_scylla_version_lt("2026.1.0") because system.client_routes didn't exist in 2025.2. PR ci: update Scylla test version from 2025.2 to 2026.1 #773 bumps the CI version to 2026.1, enabling these tests for the first time.
  • The tests/integration/long/test_ssl.py suite (20+ SSL tests) is never run in CI — only tests/integration/standard/ is run.
  • No other standard integration test exercises SSL through the asyncio reactor.
  • The asyncio reactor unit tests (tests/unit/io/test_asyncioreactor.py) only test timers — they don't cover read/write/SSL at all. The shared ReactorTestMixin (which covers SSL error recovery, partial reads, etc.) is not used by the asyncio tests.

Proposed Fix: Use asyncio SSL Transport (SSLProtocol + MemoryBIO)

Replace the raw sock_recv()/sock_sendall() approach with asyncio's built-in SSL transport for SSL connections. Non-SSL connections continue to use the existing approach.

Implementation Plan

Step 1: Override _wrap_socket_from_context() to skip SSL wrapping

Override _wrap_socket_from_context() in AsyncioConnection to return the raw socket unchanged (no ssl.SSLSocket wrapping). Store self.ssl_context and extract server_hostname for later use by create_connection().

server_hostname extraction must follow the same logic as the current _wrap_socket_from_context() (connection.py:1019-1033):

  • If ssl_options provides server_hostname, use it.
  • If ssl_context.check_hostname is True and no explicit server_hostname, use self.endpoint.address.
  • If check_hostname is False and no explicit server_hostname, pass server_hostname="" (empty string satisfies the asyncio API requirement without enabling hostname verification).

Note: loop.create_connection(ssl=ctx, sock=sock) requires server_hostname when using a pre-connected socket — omitting it raises ValueError("You must set server_hostname when using ssl without a host").

Step 2: Create an asyncio Protocol bridge

class _CQLProtocol(asyncio.Protocol):
    """Bridge between asyncio's transport/protocol model and Connection."""

    def __init__(self, connection):
        self._connection = connection
        self.transport = None

    def connection_made(self, transport):
        self.transport = transport
        # SSL handshake is complete; now start the CQL handshake
        self._connection._send_options_message()

    def data_received(self, data):
        self._connection._iobuf.write(data)
        self._connection.process_io_buffer()

    def connection_lost(self, exc):
        if exc:
            self._connection.defunct(exc)
        else:
            self._connection.close()

    def pause_writing(self):
        self._connection._socket_writable = False

    def resume_writing(self):
        self._connection._socket_writable = True

Note on pause_writing/resume_writing: the current asyncio reactor never sets _socket_writable = False (pre-existing gap), so implementing these callbacks is an improvement. _socket_writable gates send_msg() at connection.py:1216 — when False, it raises ConnectionBusy. This provides proper write backpressure through the asyncio transport's high/low water marks.

Step 3: Refactor AsyncioConnection.__init__ for SSL vs non-SSL

For SSL connections, replace the handle_read() / handle_write() coroutines with a single _create_ssl_connection() coroutine that sets up the transport:

async def _create_ssl_connection(self):
    try:
        transport, protocol = await asyncio.wait_for(
            self._loop.create_connection(
                lambda: _CQLProtocol(self),
                sock=self._socket,
                ssl=self.ssl_context,
                server_hostname=self._server_hostname,
            ),
            timeout=self._ssl_handshake_timeout,
        )
        self._transport = transport
        self._protocol = protocol
        # connection_made() has already been called by this point,
        # which triggers _send_options_message()
    except Exception as exc:
        self.defunct(exc)

For non-SSL connections, the existing handle_read() / handle_write() / _send_options_message() flow remains unchanged.

The connect_timeout budget must be split:

  • TCP connect: happens synchronously in _connect_socket() (bounded by socket.settimeout(connect_timeout))
  • SSL handshake timeout: remaining budget after TCP connect, passed as _ssl_handshake_timeout and enforced via asyncio.wait_for()
  • CQL startup: remaining budget after handshake, bounded by Connection.factory()'s connected_event.wait(timeout - elapsed)

Step 4: Preserve connected_event semantics exactly

connected_event must NOT be set from connection_made(). In this driver, connected_event means "CQL-level startup is complete (READY/auth received) or has failed." It is set at:

  • connection.py:1584 — after ReadyMessage received
  • connection.py:1640 — after AuthSuccessMessage received
  • connection.py:1136 — on defunct() (error path)
  • asyncioreactor.py:168 — on _close() (cleanup path)

Connection.factory() at connection.py:980 waits on this event with connect_timeout. Setting it earlier would expose half-initialized connections that haven't completed the CQL handshake.

The flow is: connection_made()_send_options_message() → OPTIONS/SUPPORTED/STARTUP/READY exchange → connected_event.set().

Step 5: Refactor push() for SSL connections

For SSL connections, replace the write queue + sock_sendall() with transport.write():

def push(self, data):
    if self._transport:  # SSL path — transport handles buffering
        self._loop.call_soon_threadsafe(self._transport.write, data)
    else:  # non-SSL path — existing write queue behavior
        ...existing chunked queue logic...

No chunking is needed for the transport path — asyncio's SSLProtocol handles buffering internally.

Step 6: Update close() / _close() for SSL connections

For SSL connections:

  • Call self._transport.close() instead of self._socket.close() — the transport owns the socket and handles graceful TLS shutdown (close_notify).
  • Skip remove_reader()/remove_writer() calls — the transport manages its own fd registration.
  • Cancel the _create_ssl_connection future if still pending.

For non-SSL connections, keep the existing close logic.

Step 7: Error propagation

All errors from the transport/protocol must reach the driver's error handling:

  • _create_ssl_connection() catches exceptions from create_connection() (handshake failures, cert errors, timeouts) and calls self.defunct(exc), which sets last_error, calls close(), errors all requests, and sets connected_event.
  • connection_lost(exc) in _CQLProtocol calls self.defunct(exc) for error cases and self.close() for clean shutdown.
  • asyncio.wait_for() raises asyncio.TimeoutError if the handshake exceeds the timeout budget, which _create_ssl_connection() catches and passes to defunct().

Step 8: Tests

Unit tests (new, required):

  • Successful SSL startup via transport: mock create_connection()connection_made() → OPTIONS/SUPPORTED/STARTUP/READY → connected_event set
  • SSL handshake failure: create_connection() raises ssl.SSLCertVerificationErrordefunct() called → connected_event set with error
  • SSL handshake timeout: asyncio.wait_for() raises TimeoutErrordefunct() called
  • connection_lost(exc)defunct() called
  • connection_lost(None)close() called
  • Write via transport.write() — verify data reaches the transport mock
  • pause_writing() / resume_writing()_socket_writable toggled
  • server_hostname extraction: check_hostname=True with/without explicit hostname, check_hostname=False
  • Non-SSL connections unchanged — existing sock_recv/sock_sendall path still used

Integration tests (existing, need validation):

  • All test_client_routes.py SSL tests pass under EVENT_LOOP_MANAGER=asyncio (this is the primary acceptance criterion)
  • Run tests/integration/long/test_ssl.py under EVENT_LOOP_MANAGER=asyncio at least once manually (not in CI, but verified before merge)
  • Both check_hostname=True and check_hostname=False paths work

Regression tests:

  • Adapt ReactorTestMixin from tests/unit/io/utils.py for the asyncio reactor (currently not used by asyncio tests — only timers are tested)
  • Non-SSL asyncio connections must pass all existing tests unchanged

Scope and Risk

  • Files changed: Primarily cassandra/io/asyncioreactor.py, with minor changes to cassandra/connection.py (to allow subclass override of _wrap_socket_from_context()).
  • Backward compatibility: The public API (Cluster(ssl_context=...)) is unchanged. Only the internal transport mechanism changes for asyncio+SSL.
  • Risk areas:
    • Timeout budget splitting between TCP connect, SSL handshake, and CQL startup
    • _send_options_message() being called from connection_made() (async, on event loop thread) vs. from __init__() (sync, on caller thread) — verify thread safety of the CQL startup path
    • push() being called from both the caller thread and the event loop thread — the call_soon_threadsafe approach handles this, but verify no races with transport readiness
  • Non-SSL connections: Completely unchanged — sock_recv()/sock_sendall() work correctly for plain TCP sockets and the existing code path is not modified.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions