feat(backend/kernel): honour _use_arrow_native_complex_types via kernel's complex_types_as_json post-processor#795
Merged
Conversation
…'s complex_types_as_json The connector's `_use_arrow_native_complex_types` toggle is honoured by the Thrift backend (forwarded server-side as `complexTypesAsArrow`) but was silently ignored by the kernel backend — the kernel always returned native Arrow `List` / `Map` / `Struct` regardless. This was the root cause of the 5 `THRIFT_VS_KERNEL_COMPLEX_DISABLED` diffs in the comparator's COMPLEX_TYPES suite. The kernel side gained an opt-in `complex_types_as_json` post- processor (kernel PR #36) that rewrites complex columns to `Utf8` columns of compact JSON text, matching the Thrift wire format byte-for-byte. This change wires the connector's existing kwarg through to that flag: - `session.py`: pass `_use_arrow_native_complex_types` to the kernel client (it was being dropped on the floor for the kernel branch). - `backend/kernel/client.py`: read it from kwargs (default `True`, matching the connector-wide default), invert at the boundary, and set `complex_types_as_json=not _use_arrow_native_complex_types` on the kernel `Session()` constructor. - `backend/kernel/type_mapping.py`: extend `_databricks_type_for_field` to honour `databricks.type_name` for `ARRAY` / `MAP` / `STRUCT` (it already did this for `VARIANT`). When the kernel JSON path is on, the columns arrive as `Utf8` but the kernel preserves the original SQL type name in metadata; `description` should report `array` / `map` / `struct`, matching what the Thrift backend reports under `complexTypesAsArrow=False`. Verified end-to-end against the pecotesting comparator workspace: the `THRIFT_VS_KERNEL_COMPLEX_DISABLED` suite drops from 5 type-shape diffs + 1 row diff to 1 row diff. The remaining row diff is a Thrift server-side bug — Thrift emits invalid JSON for map values containing embedded `"` characters (`{"k":"val with "quote""}` — unescaped inner quote), while the kernel emits the correctly-escaped form (`{"k":"val with \"quote\""}`). The kernel is right here; matching Thrift would mean deliberately producing un-parseable output. Unit tests: - Parametrised test of `_use_arrow_native_complex_types` (default / True / False) → kernel `Session(complex_types_as_json=…)`. - Parametrised test of `description_from_arrow_schema` recovering `array` / `map` / `struct` from metadata, case-insensitively. - Negative test that an unknown `databricks.type_name` defers to the Arrow type rather than corrupting the description. 85 → 94 kernel unit tests; full suite green; black-formatted. Co-authored-by: Isaac Signed-off-by: Vikrant Puppala <vikrant.puppala@databricks.com>
gopalldb
approved these changes
May 20, 2026
8 tasks
vikrantpuppala
added a commit
that referenced
this pull request
May 27, 2026
The original pin (aed2efb) predates kernel PR #36 which added `complex_types_as_json` to Session.__new__. Connector main already passes that kwarg (added in PR #795), so every e2e test was failing with: TypeError: Session.__new__() got an unexpected keyword argument 'complex_types_as_json' Bump to current kernel main (3aa25b21) which has the kwarg plus the rest of the comparator-parity changes the connector code already expects. This is a good demonstration of why the bisectable KERNEL_REV pin matters: the connector and kernel evolved in lockstep on `main` before this CI existed, so the very first thing the workflow does once it can actually build the wheel is catch that we'd been shipping a stale pin. Co-authored-by: Isaac Signed-off-by: Vikrant Puppala <vikrant.puppala@databricks.com>
vikrantpuppala
added a commit
that referenced
this pull request
May 27, 2026
…rage (#808) * ci: add kernel-e2e workflow + KERNEL_REV pin Wires up CI coverage for use_kernel=True. The kernel is a private repo with no published wheel, so we pin a kernel SHA in KERNEL_REV and build the wheel inline via maturin develop using the existing INTEGRATION_TEST_APP GitHub App (extended to include databricks/databricks-sql-kernel in its repo allowlist). Gate semantics mirror trigger-integration-tests.yml: - Plain PR events post a synthetic-success Kernel E2E check so the required check doesn't block PRs that don't touch kernel code. - The kernel-e2e label triggers a preview run on the PR and is auto-removed on synchronize for the same security reason as the integration-test label. - merge_group is the real gate — runs when kernel-relevant files change (src/databricks/sql/backend/kernel/, test_kernel_backend.py, KERNEL_REV, etc.), auto-passes otherwise. Unit tests are unchanged: tests/unit/test_kernel_*.py already runs in every code-quality-checks.yml matrix combo against a fake databricks_sql_kernel module injected at sys.modules import time. Required follow-up before this merges: 1. Extend the INTEGRATION_TEST_APP allowlist to include databricks/databricks-sql-kernel. 2. Create the kernel-e2e label in this repo. 3. Add Kernel E2E as a required check on main once a green run lands. Co-authored-by: Isaac Signed-off-by: Vikrant Puppala <vikrant.puppala@databricks.com> * ci(kernel-e2e): add id-token: write for JFrog OIDC exchange setup-poetry runs setup-jfrog, which exchanges a GitHub OIDC token for a JFrog access token to reach the internal PyPI mirror. That needs id-token: write on the job, which was missing — the labelled preview run failed at setup-poetry with "ACTIONS_ID_TOKEN_REQUEST_TOKEN: unbound variable". Declared at both workflow scope and on run-kernel-e2e directly: a job-level permissions block fully overrides workflow scope, so the redundancy is intentional. Co-authored-by: Isaac Signed-off-by: Vikrant Puppala <vikrant.puppala@databricks.com> * ci(kernel-e2e): build kernel wheel into connector venv, not a new one `poetry run maturin develop` from inside databricks-sql-kernel/pyo3/ makes poetry create a fresh, empty .venv next to the kernel source (it discovers pyo3/pyproject.toml first and treats it as the project root). That venv has no maturin → "Command not found: maturin". Resolve the connector venv's python path explicitly before changing working directory, then call maturin from that python via `-m maturin`. `--interpreter <path>` pins the produced wheel to the connector venv so the resulting extension is installed where pytest will look for it. Co-authored-by: Isaac Signed-off-by: Vikrant Puppala <vikrant.puppala@databricks.com> * ci(kernel-e2e): drop --interpreter from maturin develop (not a valid flag) maturin develop installs into whichever python invoked it; the flag exists on `maturin build` only. The previous commit's extra `--interpreter $CONNECTOR_VENV_PY` was redundant — we're already calling maturin via `$CONNECTOR_VENV_PY -m maturin`, so the venv python is the one doing the build and install. Co-authored-by: Isaac Signed-off-by: Vikrant Puppala <vikrant.puppala@databricks.com> * ci(kernel-e2e): route cargo through JFrog + audit cleanups databricks-protected-runner-group blocks direct egress to index.crates.io, so the maturin build was failing with SSL EOF on the cargo metadata step. Extend setup-jfrog with an opt-in `configure-cargo` input that writes ~/.cargo/config.toml + credentials.toml against the JFrog db-cargo-remote proxy (recipe borrowed verbatim from databricks-odbc's setup-jfrog action) and forward it through setup-poetry so the kernel-e2e workflow can enable it without bypassing the wrapper. Bundled cleanups from a workflow audit: - Drop the redundant `Set up Python 3.10` step — setup-poetry runs actions/setup-python internally at the matching version. - Smoke-check now uses `$CONNECTOR_VENV_PY` (same interpreter we built the wheel with), so a wheel installed into the wrong venv would surface here rather than be masked by `poetry run python` re-resolving. - Post `Kernel E2E` check on the labelled-PR path as well as the merge-queue path; previously the PR would still show the synthetic-success check forever even after a real labelled run failed. - Add a comment to fetch-depth: 0 explaining why we keep it. Co-authored-by: Isaac Signed-off-by: Vikrant Puppala <vikrant.puppala@databricks.com> * ci(kernel-e2e): bump KERNEL_REV to current kernel main The original pin (aed2efb) predates kernel PR #36 which added `complex_types_as_json` to Session.__new__. Connector main already passes that kwarg (added in PR #795), so every e2e test was failing with: TypeError: Session.__new__() got an unexpected keyword argument 'complex_types_as_json' Bump to current kernel main (3aa25b21) which has the kwarg plus the rest of the comparator-parity changes the connector code already expects. This is a good demonstration of why the bisectable KERNEL_REV pin matters: the connector and kernel evolved in lockstep on `main` before this CI existed, so the very first thing the workflow does once it can actually build the wheel is catch that we'd been shipping a stale pin. Co-authored-by: Isaac Signed-off-by: Vikrant Puppala <vikrant.puppala@databricks.com> * ci(kernel-e2e): disable bundled rust-cache in setup-rust-toolchain actions-rust-lang/setup-rust-toolchain invokes Swatinem/rust-cache internally, which runs `cargo metadata` from the workflow's working directory. Our job's CWD is the connector repo root (no Cargo.toml there — the kernel checkout is in a subdir), so the bundled cache attempt fails with exit 101 and dumps a Node stack trace into the log. It's cosmetic — the action handles its own errors — but reads as a failure on first glance, and the bundled cache races with the explicit rust-cache step we already configure with the correct `workspaces: databricks-sql-kernel` path. Disabling the bundled cache leaves a single, correctly-keyed rust-cache invocation and cleans up the log. Co-authored-by: Isaac Signed-off-by: Vikrant Puppala <vikrant.puppala@databricks.com> --------- Signed-off-by: Vikrant Puppala <vikrant.puppala@databricks.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
The connector's
_use_arrow_native_complex_typestoggle is honoured by the Thrift backend (forwarded server-side ascomplexTypesAsArrow) but was silently ignored by the kernel backend — the kernel always returned native ArrowList/Map/Structregardless of the flag. This was the root cause of the 5THRIFT_VS_KERNEL_COMPLEX_DISABLEDdiffs in the comparator's COMPLEX_TYPES suite.The kernel side gained an opt-in
complex_types_as_jsonpost-processor in databricks/databricks-sql-kernel#36 that rewrites complex columns toUtf8columns of compact JSON text, matching the Thrift wire format byte-for-byte. This PR wires the connector's existing kwarg through to that flag.Changes
session.py— pass_use_arrow_native_complex_typestoKernelDatabricksClient(it was being dropped on the floor for the kernel branch in_create_backend).backend/kernel/client.py— read the kwarg in__init__(defaultTrue, matching the connector-wide default), invert at the boundary, and setcomplex_types_as_json=not _use_arrow_native_complex_typeson the kernelSession()constructor.backend/kernel/type_mapping.py— extend_databricks_type_for_fieldto honourdatabricks.type_nameforARRAY/MAP/STRUCT(it already did this forVARIANT). When the kernel JSON path is on, the columns arrive asUtf8but the kernel preserves the original SQL type name in field metadata;descriptionshould reportarray/map/struct, matching what the Thrift backend reports undercomplexTypesAsArrow=False.Why the dependency on kernel#36
This PR is a no-op on its own — without the kernel-side post-processor, passing
complex_types_as_json=Trueto_kernel.Session()is just an unrecognised kwarg. Once kernel#36 lands and the kernel wheel is rebuilt, this PR completes the wiring and unblocks the comparator's COMPLEX_TYPES_DISABLED suite. The connector code is correct regardless of merge ordering — the kernel side rejects unknown kwargs with a clear error if the consumer somehow gets a stale wheel.Test plan
Unit tests
test_open_session_passes_complex_types_as_json_to_kernel— verifies the boundary inversion: connectorTrue/ unset → kernelFalse; connectorFalse→ kernelTrue.test_description_recovers_complex_type_name_from_metadata— verifiesdescription_from_arrow_schemarecoversarray/map/struct(case-insensitively) fromdatabricks.type_namemetadata onUtf8columns.test_description_passes_through_unknown_databricks_type_name— confirms unknown server-reported names ("INT"on anint64Arrow column) defer to the Arrow shape rather than corrupting the description.blackclean.End-to-end against pecotesting + comparator harness
Ran the focused
thrift-vs-kernel-complexconfig (COMPLEX_TYPESsuite ×THRIFT_VS_KERNEL_COMPLEX_DISABLED/THRIFT_VS_KERNEL_COMPLEX_ENABLED):COMPLEX_ENABLED(native Arrow)COMPLEX_DISABLED(JSON strings)The remaining row diff is a server-side Thrift bug: Thrift emits invalid JSON for map values containing embedded
"characters ({"k":"val with "quote""}— unescaped inner quote) while the kernel emits the correctly-escaped form ({"k":"val with \"quote\""}). The kernel is right here; matching Thrift would mean deliberately producing un-parseable JSON. Belongs in the comparator'sresult_set_filtersas a known-divergent row, not in either driver.This pull request and its description were written by Claude Code.