Skip to content

feat(backend/kernel): honour _use_arrow_native_complex_types via kernel's complex_types_as_json post-processor#795

Merged
vikrantpuppala merged 1 commit into
mainfrom
feat/kernel-complex-types-as-json
May 20, 2026
Merged

feat(backend/kernel): honour _use_arrow_native_complex_types via kernel's complex_types_as_json post-processor#795
vikrantpuppala merged 1 commit into
mainfrom
feat/kernel-complex-types-as-json

Conversation

@vikrantpuppala
Copy link
Copy Markdown
Contributor

Summary

The connector's _use_arrow_native_complex_types toggle is honoured by the Thrift backend (forwarded server-side as complexTypesAsArrow) but was silently ignored by the kernel backend — the kernel always returned native Arrow List / Map / Struct regardless of the flag. This was the root cause of the 5 THRIFT_VS_KERNEL_COMPLEX_DISABLED diffs in the comparator's COMPLEX_TYPES suite.

The kernel side gained an opt-in complex_types_as_json post-processor in databricks/databricks-sql-kernel#36 that rewrites complex columns to Utf8 columns of compact JSON text, matching the Thrift wire format byte-for-byte. This PR wires the connector's existing kwarg through to that flag.

Changes

  • session.py — pass _use_arrow_native_complex_types to KernelDatabricksClient (it was being dropped on the floor for the kernel branch in _create_backend).
  • backend/kernel/client.py — read the kwarg in __init__ (default True, matching the connector-wide default), invert at the boundary, and set complex_types_as_json=not _use_arrow_native_complex_types on the kernel Session() constructor.
  • backend/kernel/type_mapping.py — extend _databricks_type_for_field to honour databricks.type_name for ARRAY / MAP / STRUCT (it already did this for VARIANT). When the kernel JSON path is on, the columns arrive as Utf8 but the kernel preserves the original SQL type name in field metadata; description should report array / map / struct, matching what the Thrift backend reports under complexTypesAsArrow=False.

Why the dependency on kernel#36

This PR is a no-op on its own — without the kernel-side post-processor, passing complex_types_as_json=True to _kernel.Session() is just an unrecognised kwarg. Once kernel#36 lands and the kernel wheel is rebuilt, this PR completes the wiring and unblocks the comparator's COMPLEX_TYPES_DISABLED suite. The connector code is correct regardless of merge ordering — the kernel side rejects unknown kwargs with a clear error if the consumer somehow gets a stale wheel.

Test plan

Unit tests

  • New parametrised test test_open_session_passes_complex_types_as_json_to_kernel — verifies the boundary inversion: connector True / unset → kernel False; connector False → kernel True.
  • New parametrised test test_description_recovers_complex_type_name_from_metadata — verifies description_from_arrow_schema recovers array / map / struct (case-insensitively) from databricks.type_name metadata on Utf8 columns.
  • New negative test test_description_passes_through_unknown_databricks_type_name — confirms unknown server-reported names ("INT" on an int64 Arrow column) defer to the Arrow shape rather than corrupting the description.
  • 85 → 94 kernel-backend unit tests; full kernel suite green; black clean.

End-to-end against pecotesting + comparator harness

Ran the focused thrift-vs-kernel-complex config (COMPLEX_TYPES suite × THRIFT_VS_KERNEL_COMPLEX_DISABLED / THRIFT_VS_KERNEL_COMPLEX_ENABLED):

Run Before After
COMPLEX_ENABLED (native Arrow) match match
COMPLEX_DISABLED (JSON strings) 5 type-shape diffs + 1 row diff 0 type-shape diffs, 1 row diff

The remaining row diff is a server-side Thrift bug: Thrift emits invalid JSON for map values containing embedded " characters ({"k":"val with "quote""} — unescaped inner quote) while the kernel emits the correctly-escaped form ({"k":"val with \"quote\""}). The kernel is right here; matching Thrift would mean deliberately producing un-parseable JSON. Belongs in the comparator's result_set_filters as a known-divergent row, not in either driver.

This pull request and its description were written by Claude Code.

…'s complex_types_as_json

The connector's `_use_arrow_native_complex_types` toggle is honoured
by the Thrift backend (forwarded server-side as `complexTypesAsArrow`)
but was silently ignored by the kernel backend — the kernel always
returned native Arrow `List` / `Map` / `Struct` regardless. This was
the root cause of the 5 `THRIFT_VS_KERNEL_COMPLEX_DISABLED` diffs in
the comparator's COMPLEX_TYPES suite.

The kernel side gained an opt-in `complex_types_as_json` post-
processor (kernel PR #36) that rewrites complex columns to `Utf8`
columns of compact JSON text, matching the Thrift wire format
byte-for-byte. This change wires the connector's existing kwarg
through to that flag:

- `session.py`: pass `_use_arrow_native_complex_types` to the kernel
  client (it was being dropped on the floor for the kernel branch).
- `backend/kernel/client.py`: read it from kwargs (default `True`,
  matching the connector-wide default), invert at the boundary, and
  set `complex_types_as_json=not _use_arrow_native_complex_types`
  on the kernel `Session()` constructor.
- `backend/kernel/type_mapping.py`: extend `_databricks_type_for_field`
  to honour `databricks.type_name` for `ARRAY` / `MAP` / `STRUCT` (it
  already did this for `VARIANT`). When the kernel JSON path is on,
  the columns arrive as `Utf8` but the kernel preserves the original
  SQL type name in metadata; `description` should report `array` /
  `map` / `struct`, matching what the Thrift backend reports under
  `complexTypesAsArrow=False`.

Verified end-to-end against the pecotesting comparator workspace:
the `THRIFT_VS_KERNEL_COMPLEX_DISABLED` suite drops from 5 type-shape
diffs + 1 row diff to 1 row diff. The remaining row diff is a Thrift
server-side bug — Thrift emits invalid JSON for map values containing
embedded `"` characters (`{"k":"val with "quote""}` — unescaped
inner quote), while the kernel emits the correctly-escaped form
(`{"k":"val with \"quote\""}`). The kernel is right here; matching
Thrift would mean deliberately producing un-parseable output.

Unit tests:
- Parametrised test of `_use_arrow_native_complex_types` (default /
  True / False) → kernel `Session(complex_types_as_json=…)`.
- Parametrised test of `description_from_arrow_schema` recovering
  `array` / `map` / `struct` from metadata, case-insensitively.
- Negative test that an unknown `databricks.type_name` defers to the
  Arrow type rather than corrupting the description.

85 → 94 kernel unit tests; full suite green; black-formatted.

Co-authored-by: Isaac
Signed-off-by: Vikrant Puppala <vikrant.puppala@databricks.com>
@vikrantpuppala vikrantpuppala merged commit 0c10d7b into main May 20, 2026
34 checks passed
@jprakash-db jprakash-db mentioned this pull request May 21, 2026
3 tasks
vikrantpuppala added a commit that referenced this pull request May 27, 2026
The original pin (aed2efb) predates kernel PR #36 which added
`complex_types_as_json` to Session.__new__. Connector main already
passes that kwarg (added in PR #795), so every e2e test was failing
with:

  TypeError: Session.__new__() got an unexpected keyword argument
  'complex_types_as_json'

Bump to current kernel main (3aa25b21) which has the kwarg plus the
rest of the comparator-parity changes the connector code already
expects.

This is a good demonstration of why the bisectable KERNEL_REV pin
matters: the connector and kernel evolved in lockstep on `main`
before this CI existed, so the very first thing the workflow does
once it can actually build the wheel is catch that we'd been
shipping a stale pin.

Co-authored-by: Isaac
Signed-off-by: Vikrant Puppala <vikrant.puppala@databricks.com>
vikrantpuppala added a commit that referenced this pull request May 27, 2026
…rage (#808)

* ci: add kernel-e2e workflow + KERNEL_REV pin

Wires up CI coverage for use_kernel=True. The kernel is a private
repo with no published wheel, so we pin a kernel SHA in KERNEL_REV
and build the wheel inline via maturin develop using the existing
INTEGRATION_TEST_APP GitHub App (extended to include
databricks/databricks-sql-kernel in its repo allowlist).

Gate semantics mirror trigger-integration-tests.yml:
- Plain PR events post a synthetic-success Kernel E2E check so the
  required check doesn't block PRs that don't touch kernel code.
- The kernel-e2e label triggers a preview run on the PR and is
  auto-removed on synchronize for the same security reason as the
  integration-test label.
- merge_group is the real gate — runs when kernel-relevant files
  change (src/databricks/sql/backend/kernel/, test_kernel_backend.py,
  KERNEL_REV, etc.), auto-passes otherwise.

Unit tests are unchanged: tests/unit/test_kernel_*.py already runs
in every code-quality-checks.yml matrix combo against a fake
databricks_sql_kernel module injected at sys.modules import time.

Required follow-up before this merges:
1. Extend the INTEGRATION_TEST_APP allowlist to include
   databricks/databricks-sql-kernel.
2. Create the kernel-e2e label in this repo.
3. Add Kernel E2E as a required check on main once a green run lands.

Co-authored-by: Isaac
Signed-off-by: Vikrant Puppala <vikrant.puppala@databricks.com>

* ci(kernel-e2e): add id-token: write for JFrog OIDC exchange

setup-poetry runs setup-jfrog, which exchanges a GitHub OIDC token
for a JFrog access token to reach the internal PyPI mirror. That
needs id-token: write on the job, which was missing — the labelled
preview run failed at setup-poetry with
"ACTIONS_ID_TOKEN_REQUEST_TOKEN: unbound variable".

Declared at both workflow scope and on run-kernel-e2e directly:
a job-level permissions block fully overrides workflow scope, so
the redundancy is intentional.

Co-authored-by: Isaac
Signed-off-by: Vikrant Puppala <vikrant.puppala@databricks.com>

* ci(kernel-e2e): build kernel wheel into connector venv, not a new one

`poetry run maturin develop` from inside databricks-sql-kernel/pyo3/
makes poetry create a fresh, empty .venv next to the kernel source
(it discovers pyo3/pyproject.toml first and treats it as the project
root). That venv has no maturin → "Command not found: maturin".

Resolve the connector venv's python path explicitly before changing
working directory, then call maturin from that python via
`-m maturin`. `--interpreter <path>` pins the produced wheel to the
connector venv so the resulting extension is installed where pytest
will look for it.

Co-authored-by: Isaac
Signed-off-by: Vikrant Puppala <vikrant.puppala@databricks.com>

* ci(kernel-e2e): drop --interpreter from maturin develop (not a valid flag)

maturin develop installs into whichever python invoked it; the flag
exists on `maturin build` only. The previous commit's extra
`--interpreter $CONNECTOR_VENV_PY` was redundant — we're already
calling maturin via `$CONNECTOR_VENV_PY -m maturin`, so the venv
python is the one doing the build and install.

Co-authored-by: Isaac
Signed-off-by: Vikrant Puppala <vikrant.puppala@databricks.com>

* ci(kernel-e2e): route cargo through JFrog + audit cleanups

databricks-protected-runner-group blocks direct egress to
index.crates.io, so the maturin build was failing with SSL EOF on
the cargo metadata step. Extend setup-jfrog with an opt-in
`configure-cargo` input that writes ~/.cargo/config.toml +
credentials.toml against the JFrog db-cargo-remote proxy (recipe
borrowed verbatim from databricks-odbc's setup-jfrog action) and
forward it through setup-poetry so the kernel-e2e workflow can
enable it without bypassing the wrapper.

Bundled cleanups from a workflow audit:

- Drop the redundant `Set up Python 3.10` step — setup-poetry runs
  actions/setup-python internally at the matching version.
- Smoke-check now uses `$CONNECTOR_VENV_PY` (same interpreter we
  built the wheel with), so a wheel installed into the wrong venv
  would surface here rather than be masked by `poetry run python`
  re-resolving.
- Post `Kernel E2E` check on the labelled-PR path as well as the
  merge-queue path; previously the PR would still show the
  synthetic-success check forever even after a real labelled run
  failed.
- Add a comment to fetch-depth: 0 explaining why we keep it.

Co-authored-by: Isaac
Signed-off-by: Vikrant Puppala <vikrant.puppala@databricks.com>

* ci(kernel-e2e): bump KERNEL_REV to current kernel main

The original pin (aed2efb) predates kernel PR #36 which added
`complex_types_as_json` to Session.__new__. Connector main already
passes that kwarg (added in PR #795), so every e2e test was failing
with:

  TypeError: Session.__new__() got an unexpected keyword argument
  'complex_types_as_json'

Bump to current kernel main (3aa25b21) which has the kwarg plus the
rest of the comparator-parity changes the connector code already
expects.

This is a good demonstration of why the bisectable KERNEL_REV pin
matters: the connector and kernel evolved in lockstep on `main`
before this CI existed, so the very first thing the workflow does
once it can actually build the wheel is catch that we'd been
shipping a stale pin.

Co-authored-by: Isaac
Signed-off-by: Vikrant Puppala <vikrant.puppala@databricks.com>

* ci(kernel-e2e): disable bundled rust-cache in setup-rust-toolchain

actions-rust-lang/setup-rust-toolchain invokes Swatinem/rust-cache
internally, which runs `cargo metadata` from the workflow's working
directory. Our job's CWD is the connector repo root (no Cargo.toml
there — the kernel checkout is in a subdir), so the bundled cache
attempt fails with exit 101 and dumps a Node stack trace into the
log. It's cosmetic — the action handles its own errors — but reads
as a failure on first glance, and the bundled cache races with the
explicit rust-cache step we already configure with the correct
`workspaces: databricks-sql-kernel` path.

Disabling the bundled cache leaves a single, correctly-keyed
rust-cache invocation and cleans up the log.

Co-authored-by: Isaac
Signed-off-by: Vikrant Puppala <vikrant.puppala@databricks.com>

---------

Signed-off-by: Vikrant Puppala <vikrant.puppala@databricks.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants