table: add opt-in pyiceberg-core arrow reader#2

Draft
abnobdoss wants to merge 5 commits into aba-156-157-core-adapters from aba-158-opt-in-rust-arrow-scan

Conversation

@abnobdoss
Owner

Stack position: PyIceberg PR after #1 (ABA-158).

Adds an env-gated DataScan.to_arrow_batch_reader() path using pyiceberg_core ArrowReader when PYICEBERG_RUST_ARROW_SCAN is enabled. Existing PyArrow behavior remains the default. The native path is skipped for limits and for projections that would require reading filter-only fields; unsupported native shapes fall back with a warning.
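
The gating and fallback described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the accepted truthy spellings, the `UnsupportedNativeScan` exception, the `env` parameter, and the two reader callables are all hypothetical stand-ins, and only the `PYICEBERG_RUST_ARROW_SCAN` name comes from the description.

```python
import os
import warnings
from typing import Callable, List, Mapping, Optional

# Assumption: which spellings count as "enabled" is not stated in the PR.
_TRUTHY = {"1", "true", "yes", "on"}


class UnsupportedNativeScan(Exception):
    """Hypothetical error for scan shapes the native reader cannot serve."""


def rust_scan_enabled(env: Optional[Mapping[str, str]] = None) -> bool:
    """Return True when the opt-in variable is set to a truthy value."""
    env = os.environ if env is None else env
    return env.get("PYICEBERG_RUST_ARROW_SCAN", "").strip().lower() in _TRUTHY


def read_batches(
    native_reader: Callable[[], List],
    pyarrow_reader: Callable[[], List],
    env: Optional[Mapping[str, str]] = None,
) -> List:
    """Prefer the native reader when opted in; otherwise use PyArrow."""
    if rust_scan_enabled(env):
        try:
            return native_reader()
        except (ImportError, UnsupportedNativeScan) as exc:
            # Missing pyiceberg_core or an unsupported shape: warn and fall back.
            warnings.warn(f"native Arrow scan unavailable, using PyArrow: {exc}")
    return pyarrow_reader()
```

The sketch treats both readers as eager callables. With a lazy batch reader, an unsupported shape would have to be detected before the first batch is handed out, since a fallback is no longer possible once rows have been streamed.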

Validation:

  • python3 -m py_compile pyiceberg/io/pyiceberg_core.py pyiceberg/table/__init__.py tests/io/test_pyiceberg_core.py
  • uv run pytest tests/io/test_pyiceberg_core.py -q
  • commit hooks: ruff, format, mypy, pydocstyle, codespell

Challenger loop: red-team found limit handling and missing-core fallback issues; both were fixed. Final review green.

Abanoub Doss and others added 4 commits June 6, 2026 04:34
Fan scan tasks out over a thread pool of native ArrowReaders so decode uses
multiple cores instead of one, streaming batches as they complete with at most
one decoded batch per shard in flight. A default batch size amortizes the
per-batch GIL handoff that otherwise dominates the fan-in.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
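
For readers unfamiliar with the fan-in pattern in the commit above, here is a rough standard-library sketch. It assumes each shard is a plain Python iterable standing in for a native ArrowReader over a subset of scan tasks, and it uses one thread per shard rather than a pool; `stream_shards` and its internals are illustrative names, not the PR's API. A per-shard semaphore keeps at most one decoded batch per shard in flight.

```python
import queue
import threading
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")
_DONE = object()  # sentinel: a shard has finished


def stream_shards(shards: List[Iterable[T]]) -> Iterator[T]:
    """Yield items from all shards as they complete, one in flight per shard."""
    out: "queue.Queue" = queue.Queue()
    slots = [threading.Semaphore(1) for _ in shards]
    stop = threading.Event()

    def worker(idx: int, shard: Iterable[T]) -> None:
        it = iter(shard)
        try:
            while True:
                slots[idx].acquire()  # wait until the previous item was consumed
                if stop.is_set():
                    break
                try:
                    item = next(it)  # the "decode" step
                except StopIteration:
                    break
                out.put((idx, item))
        except BaseException as exc:  # surface shard failures to the consumer
            out.put((idx, exc))
        finally:
            out.put((idx, _DONE))

    threads = [
        threading.Thread(target=worker, args=(i, s), daemon=True)
        for i, s in enumerate(shards)
    ]
    for t in threads:
        t.start()

    remaining = len(threads)
    try:
        while remaining:
            idx, item = out.get()
            if item is _DONE:
                remaining -= 1
            elif isinstance(item, BaseException):
                raise item
            else:
                slots[idx].release()  # let this shard decode its next item
                yield item
    finally:
        # Closing the stream early stops the shards instead of draining them.
        stop.set()
        for slot in slots:
            slot.release()
```

Order is preserved within a shard but not across shards. In pure Python the threads still contend for the GIL; the multi-core benefit described in the commit depends on the decode work happening in native code that releases it.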
A limited scan previously always fell back to PyArrow. Push the limit through the
native reader instead: truncate the streamed result at the limit (slicing the
crossing batch and closing the shards so they stop decoding early) and cap the
batch size to the limit so a small limit does not decode a full batch per shard.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
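
The limit handling described above can be illustrated with a small sketch. It assumes batches are sliceable sequences (lists stand in for Arrow record batches) and that the upstream stream exposes an optional `close()`; `take_limit` and `effective_batch_size` are hypothetical names, not functions from this PR.

```python
from typing import Iterator, Optional, Sequence


def effective_batch_size(default: int, limit: Optional[int]) -> int:
    """Cap the batch size at the limit so a small limit decodes little data."""
    if limit is None:
        return default
    return max(1, min(default, limit))


def take_limit(batches: Iterator[Sequence], limit: int) -> Iterator[Sequence]:
    """Yield batches until `limit` rows were produced, slicing the last one."""
    close = getattr(batches, "close", None)
    remaining = limit
    try:
        if remaining <= 0:
            return
        for batch in batches:
            if len(batch) >= remaining:
                # Crossing batch: stop upstream work first, then emit the slice.
                if close is not None:
                    close()
                yield batch[:remaining]
                return
            remaining -= len(batch)
            yield batch
    finally:
        if close is not None:
            close()  # safe to call again on an already closed generator
```

Closing upstream before yielding the final slice matters when the consumer never asks for another batch: the shards stop decoding as soon as the limit is known to be satisfied rather than when the generator is eventually garbage collected.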
The native reader emits arrow-rs's own types (string rather than large_string,
and run-end-encoded identity-partition columns), so its output diverged from the
PyArrow scan path. Cast every batch to schema_to_pyarrow(projected_schema),
decoding run-end-encoded columns first since there is no direct cast kernel for
them, so the native path is a faithful drop-in.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
PYICEBERG_RUST_SCAN_PLANNING plans the scan in pyiceberg-core (Table.plan_files)
instead of PyIceberg's Python manifest planning, then streams the planned tasks
through the same sharded, casted reader as the read path. Falls back to PyArrow
on any scan pyiceberg-core cannot handle.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>