feat(backend/kernel): honour _use_arrow_native_complex_types via kernel's complex_types_as_json post-processor by vikrantpuppala · Pull Request #795 · databricks/databricks-sql-python

vikrantpuppala · 2026-05-20T11:23:37Z

Summary

The connector's _use_arrow_native_complex_types toggle is honoured by the Thrift backend (forwarded server-side as complexTypesAsArrow) but was silently ignored by the kernel backend — the kernel always returned native Arrow List / Map / Struct regardless of the flag. This was the root cause of the 5 THRIFT_VS_KERNEL_COMPLEX_DISABLED diffs in the comparator's COMPLEX_TYPES suite.

The kernel side gained an opt-in complex_types_as_json post-processor in databricks/databricks-sql-kernel#36 that rewrites complex columns to Utf8 columns of compact JSON text, matching the Thrift wire format byte-for-byte. This PR wires the connector's existing kwarg through to that flag.

Changes

session.py — pass _use_arrow_native_complex_types to KernelDatabricksClient (it was being dropped on the floor for the kernel branch in _create_backend).
backend/kernel/client.py — read the kwarg in __init__ (default True, matching the connector-wide default), invert at the boundary, and set complex_types_as_json=not _use_arrow_native_complex_types on the kernel Session() constructor.
backend/kernel/type_mapping.py — extend _databricks_type_for_field to honour databricks.type_name for ARRAY / MAP / STRUCT (it already did this for VARIANT). When the kernel JSON path is on, the columns arrive as Utf8 but the kernel preserves the original SQL type name in field metadata; description should report array / map / struct, matching what the Thrift backend reports under complexTypesAsArrow=False.

Why the dependency on kernel#36

This PR is a no-op on its own — without the kernel-side post-processor, passing complex_types_as_json=True to _kernel.Session() is just an unrecognised kwarg. Once kernel#36 lands and the kernel wheel is rebuilt, this PR completes the wiring and unblocks the comparator's COMPLEX_TYPES_DISABLED suite. The connector code is correct regardless of merge ordering — the kernel side rejects unknown kwargs with a clear error if the consumer somehow gets a stale wheel.

Test plan

Unit tests

New parametrised test test_open_session_passes_complex_types_as_json_to_kernel — verifies the boundary inversion: connector True / unset → kernel False; connector False → kernel True.
New parametrised test test_description_recovers_complex_type_name_from_metadata — verifies description_from_arrow_schema recovers array / map / struct (case-insensitively) from databricks.type_name metadata on Utf8 columns.
New negative test test_description_passes_through_unknown_databricks_type_name — confirms unknown server-reported names ("INT" on an int64 Arrow column) defer to the Arrow shape rather than corrupting the description.
85 → 94 kernel-backend unit tests; full kernel suite green; black clean.

End-to-end against pecotesting + comparator harness

Ran the focused thrift-vs-kernel-complex config (COMPLEX_TYPES suite × THRIFT_VS_KERNEL_COMPLEX_DISABLED / THRIFT_VS_KERNEL_COMPLEX_ENABLED):

Run	Before	After
`COMPLEX_ENABLED` (native Arrow)	match	match
`COMPLEX_DISABLED` (JSON strings)	5 type-shape diffs + 1 row diff	0 type-shape diffs, 1 row diff

The remaining row diff is a server-side Thrift bug: Thrift emits invalid JSON for map values containing embedded " characters ({"k":"val with "quote""} — unescaped inner quote) while the kernel emits the correctly-escaped form ({"k":"val with \"quote\""}). The kernel is right here; matching Thrift would mean deliberately producing un-parseable JSON. Belongs in the comparator's result_set_filters as a known-divergent row, not in either driver.

This pull request and its description were written by Claude Code.

…'s complex_types_as_json The connector's `_use_arrow_native_complex_types` toggle is honoured by the Thrift backend (forwarded server-side as `complexTypesAsArrow`) but was silently ignored by the kernel backend — the kernel always returned native Arrow `List` / `Map` / `Struct` regardless. This was the root cause of the 5 `THRIFT_VS_KERNEL_COMPLEX_DISABLED` diffs in the comparator's COMPLEX_TYPES suite. The kernel side gained an opt-in `complex_types_as_json` post- processor (kernel PR #36) that rewrites complex columns to `Utf8` columns of compact JSON text, matching the Thrift wire format byte-for-byte. This change wires the connector's existing kwarg through to that flag: - `session.py`: pass `_use_arrow_native_complex_types` to the kernel client (it was being dropped on the floor for the kernel branch). - `backend/kernel/client.py`: read it from kwargs (default `True`, matching the connector-wide default), invert at the boundary, and set `complex_types_as_json=not _use_arrow_native_complex_types` on the kernel `Session()` constructor. - `backend/kernel/type_mapping.py`: extend `_databricks_type_for_field` to honour `databricks.type_name` for `ARRAY` / `MAP` / `STRUCT` (it already did this for `VARIANT`). When the kernel JSON path is on, the columns arrive as `Utf8` but the kernel preserves the original SQL type name in metadata; `description` should report `array` / `map` / `struct`, matching what the Thrift backend reports under `complexTypesAsArrow=False`. Verified end-to-end against the pecotesting comparator workspace: the `THRIFT_VS_KERNEL_COMPLEX_DISABLED` suite drops from 5 type-shape diffs + 1 row diff to 1 row diff. The remaining row diff is a Thrift server-side bug — Thrift emits invalid JSON for map values containing embedded `"` characters (`{"k":"val with "quote""}` — unescaped inner quote), while the kernel emits the correctly-escaped form (`{"k":"val with \"quote\""}`). The kernel is right here; matching Thrift would mean deliberately producing un-parseable output. Unit tests: - Parametrised test of `_use_arrow_native_complex_types` (default / True / False) → kernel `Session(complex_types_as_json=…)`. - Parametrised test of `description_from_arrow_schema` recovering `array` / `map` / `struct` from metadata, case-insensitively. - Negative test that an unknown `databricks.type_name` defers to the Arrow type rather than corrupting the description. 85 → 94 kernel unit tests; full suite green; black-formatted. Co-authored-by: Isaac Signed-off-by: Vikrant Puppala <vikrant.puppala@databricks.com>

The original pin (aed2efb) predates kernel PR #36 which added `complex_types_as_json` to Session.__new__. Connector main already passes that kwarg (added in PR #795), so every e2e test was failing with: TypeError: Session.__new__() got an unexpected keyword argument 'complex_types_as_json' Bump to current kernel main (3aa25b21) which has the kwarg plus the rest of the comparator-parity changes the connector code already expects. This is a good demonstration of why the bisectable KERNEL_REV pin matters: the connector and kernel evolved in lockstep on `main` before this CI existed, so the very first thing the workflow does once it can actually build the wheel is catch that we'd been shipping a stale pin. Co-authored-by: Isaac Signed-off-by: Vikrant Puppala <vikrant.puppala@databricks.com>

…rage (#808) * ci: add kernel-e2e workflow + KERNEL_REV pin Wires up CI coverage for use_kernel=True. The kernel is a private repo with no published wheel, so we pin a kernel SHA in KERNEL_REV and build the wheel inline via maturin develop using the existing INTEGRATION_TEST_APP GitHub App (extended to include databricks/databricks-sql-kernel in its repo allowlist). Gate semantics mirror trigger-integration-tests.yml: - Plain PR events post a synthetic-success Kernel E2E check so the required check doesn't block PRs that don't touch kernel code. - The kernel-e2e label triggers a preview run on the PR and is auto-removed on synchronize for the same security reason as the integration-test label. - merge_group is the real gate — runs when kernel-relevant files change (src/databricks/sql/backend/kernel/, test_kernel_backend.py, KERNEL_REV, etc.), auto-passes otherwise. Unit tests are unchanged: tests/unit/test_kernel_*.py already runs in every code-quality-checks.yml matrix combo against a fake databricks_sql_kernel module injected at sys.modules import time. Required follow-up before this merges: 1. Extend the INTEGRATION_TEST_APP allowlist to include databricks/databricks-sql-kernel. 2. Create the kernel-e2e label in this repo. 3. Add Kernel E2E as a required check on main once a green run lands. Co-authored-by: Isaac Signed-off-by: Vikrant Puppala <vikrant.puppala@databricks.com> * ci(kernel-e2e): add id-token: write for JFrog OIDC exchange setup-poetry runs setup-jfrog, which exchanges a GitHub OIDC token for a JFrog access token to reach the internal PyPI mirror. That needs id-token: write on the job, which was missing — the labelled preview run failed at setup-poetry with "ACTIONS_ID_TOKEN_REQUEST_TOKEN: unbound variable". Declared at both workflow scope and on run-kernel-e2e directly: a job-level permissions block fully overrides workflow scope, so the redundancy is intentional. Co-authored-by: Isaac Signed-off-by: Vikrant Puppala <vikrant.puppala@databricks.com> * ci(kernel-e2e): build kernel wheel into connector venv, not a new one `poetry run maturin develop` from inside databricks-sql-kernel/pyo3/ makes poetry create a fresh, empty .venv next to the kernel source (it discovers pyo3/pyproject.toml first and treats it as the project root). That venv has no maturin → "Command not found: maturin". Resolve the connector venv's python path explicitly before changing working directory, then call maturin from that python via `-m maturin`. `--interpreter <path>` pins the produced wheel to the connector venv so the resulting extension is installed where pytest will look for it. Co-authored-by: Isaac Signed-off-by: Vikrant Puppala <vikrant.puppala@databricks.com> * ci(kernel-e2e): drop --interpreter from maturin develop (not a valid flag) maturin develop installs into whichever python invoked it; the flag exists on `maturin build` only. The previous commit's extra `--interpreter $CONNECTOR_VENV_PY` was redundant — we're already calling maturin via `$CONNECTOR_VENV_PY -m maturin`, so the venv python is the one doing the build and install. Co-authored-by: Isaac Signed-off-by: Vikrant Puppala <vikrant.puppala@databricks.com> * ci(kernel-e2e): route cargo through JFrog + audit cleanups databricks-protected-runner-group blocks direct egress to index.crates.io, so the maturin build was failing with SSL EOF on the cargo metadata step. Extend setup-jfrog with an opt-in `configure-cargo` input that writes ~/.cargo/config.toml + credentials.toml against the JFrog db-cargo-remote proxy (recipe borrowed verbatim from databricks-odbc's setup-jfrog action) and forward it through setup-poetry so the kernel-e2e workflow can enable it without bypassing the wrapper. Bundled cleanups from a workflow audit: - Drop the redundant `Set up Python 3.10` step — setup-poetry runs actions/setup-python internally at the matching version. - Smoke-check now uses `$CONNECTOR_VENV_PY` (same interpreter we built the wheel with), so a wheel installed into the wrong venv would surface here rather than be masked by `poetry run python` re-resolving. - Post `Kernel E2E` check on the labelled-PR path as well as the merge-queue path; previously the PR would still show the synthetic-success check forever even after a real labelled run failed. - Add a comment to fetch-depth: 0 explaining why we keep it. Co-authored-by: Isaac Signed-off-by: Vikrant Puppala <vikrant.puppala@databricks.com> * ci(kernel-e2e): bump KERNEL_REV to current kernel main The original pin (aed2efb) predates kernel PR #36 which added `complex_types_as_json` to Session.__new__. Connector main already passes that kwarg (added in PR #795), so every e2e test was failing with: TypeError: Session.__new__() got an unexpected keyword argument 'complex_types_as_json' Bump to current kernel main (3aa25b21) which has the kwarg plus the rest of the comparator-parity changes the connector code already expects. This is a good demonstration of why the bisectable KERNEL_REV pin matters: the connector and kernel evolved in lockstep on `main` before this CI existed, so the very first thing the workflow does once it can actually build the wheel is catch that we'd been shipping a stale pin. Co-authored-by: Isaac Signed-off-by: Vikrant Puppala <vikrant.puppala@databricks.com> * ci(kernel-e2e): disable bundled rust-cache in setup-rust-toolchain actions-rust-lang/setup-rust-toolchain invokes Swatinem/rust-cache internally, which runs `cargo metadata` from the workflow's working directory. Our job's CWD is the connector repo root (no Cargo.toml there — the kernel checkout is in a subdir), so the bundled cache attempt fails with exit 101 and dumps a Node stack trace into the log. It's cosmetic — the action handles its own errors — but reads as a failure on first glance, and the bundled cache races with the explicit rust-cache step we already configure with the correct `workspaces: databricks-sql-kernel` path. Disabling the bundled cache leaves a single, correctly-keyed rust-cache invocation and cleans up the log. Co-authored-by: Isaac Signed-off-by: Vikrant Puppala <vikrant.puppala@databricks.com> --------- Signed-off-by: Vikrant Puppala <vikrant.puppala@databricks.com>

vikrantpuppala temporarily deployed to azure-prod May 20, 2026 11:23 — with GitHub Actions Inactive

gopalldb approved these changes May 20, 2026

View reviewed changes

vikrantpuppala merged commit 0c10d7b into main May 20, 2026
34 checks passed

jprakash-db mentioned this pull request May 21, 2026

Bump to version 4.2.7 #797

Open

3 tasks

gopalldb mentioned this pull request May 25, 2026

fix(backend/kernel): comparator parity — async retention + intervals_as_string + precision/scale + named params + utils.exc fix #803

Merged

8 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(backend/kernel): honour _use_arrow_native_complex_types via kernel's complex_types_as_json post-processor#795

feat(backend/kernel): honour _use_arrow_native_complex_types via kernel's complex_types_as_json post-processor#795
vikrantpuppala merged 1 commit into
mainfrom
feat/kernel-complex-types-as-json

vikrantpuppala commented May 20, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

vikrantpuppala commented May 20, 2026

Summary

Changes

Why the dependency on kernel#36

Test plan

Unit tests

End-to-end against pecotesting + comparator harness

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants