table: add opt-in pyiceberg-core arrow reader by abnobdoss · Pull Request #2 · abnobdoss/iceberg-python

abnobdoss · 2026-05-25T00:56:10Z

Stack position: PyIceberg PR after #1 (ABA-158).

Adds an env-gated DataScan.to_arrow_batch_reader() path using pyiceberg_core ArrowReader when PYICEBERG_RUST_ARROW_SCAN is enabled. Existing PyArrow behavior remains the default. The native path is skipped for limits and for projections that would require reading filter-only fields; unsupported native shapes fall back with a warning.
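A minimal sketch of the gating and fallback described above, not the PR's actual code. Only the `PYICEBERG_RUST_ARROW_SCAN` variable name comes from this PR. The accepted truthy spellings, `UnsupportedNativeScan`, `native_scan_enabled`, and `read_batches` are hypothetical names for illustration.

```python
import os
import warnings
from typing import Callable, List

ENV_FLAG = "PYICEBERG_RUST_ARROW_SCAN"  # flag name taken from the PR description


class UnsupportedNativeScan(Exception):
    """Hypothetical error for scan shapes the native reader cannot serve."""


def native_scan_enabled() -> bool:
    # Assumption: which spellings count as "enabled" is not stated in the PR.
    return os.environ.get(ENV_FLAG, "").strip().lower() in {"1", "true", "yes", "on"}


def read_batches(native: Callable[[], List], default: Callable[[], List]) -> List:
    """Use the native path only when opted in; otherwise keep the default."""
    if not native_scan_enabled():
        return default()
    try:
        # The sketch assumes `native` fails eagerly; a lazy iterator would
        # need its first batch pulled inside this try block.
        return native()
    except (ImportError, UnsupportedNativeScan) as exc:
        warnings.warn(f"native arrow scan unavailable, falling back to PyArrow: {exc}")
        return default()
```

Catching `ImportError` covers the missing-core case, so an environment without pyiceberg-core degrades to the default path with a warning rather than failing.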

Validation:

python3 -m py_compile pyiceberg/io/pyiceberg_core.py pyiceberg/table/__init__.py tests/io/test_pyiceberg_core.py
uv run pytest tests/io/test_pyiceberg_core.py -q
commit hooks: ruff, format, mypy, pydocstyle, codespell

Challenger loop: red-team found limit handling and missing-core fallback issues; both were fixed. Final review green.

Fan scan tasks out over a thread pool of native ArrowReaders so decode uses multiple cores instead of one, streaming batches as they complete with at most one decoded batch per shard in flight. A default batch size amortizes the per-batch GIL handoff that otherwise dominates the fan-in. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
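The fan-in pattern from this commit message can be sketched roughly as follows. This is an illustration under assumptions, not the committed code: plain iterables stand in for the native ArrowReaders, plain threads stand in for the thread pool, and `fan_in` is a hypothetical name. A per-shard semaphore keeps each shard from decoding its next batch until the previous one has been handed to the consumer.

```python
import queue
import threading
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")
_DONE = object()  # sentinel: one per shard, sent when that shard finishes


def fan_in(shards: List[Iterable[T]]) -> Iterator[T]:
    """Yield items from all shards as they complete, one in flight per shard."""
    out: "queue.Queue[object]" = queue.Queue()

    def worker(shard: Iterable[T]) -> None:
        slot = threading.Semaphore(1)
        it = iter(shard)
        try:
            while True:
                slot.acquire()  # wait until the previous batch was consumed
                try:
                    batch = next(it)
                except StopIteration:
                    return
                out.put((batch, slot))
        except Exception as exc:  # surface decode errors to the consumer
            out.put(exc)
        finally:
            out.put(_DONE)

    threads = [threading.Thread(target=worker, args=(s,), daemon=True) for s in shards]
    for t in threads:
        t.start()

    remaining = len(threads)
    while remaining:
        msg = out.get()
        if msg is _DONE:
            remaining -= 1
        elif isinstance(msg, Exception):
            raise msg
        else:
            batch, slot = msg
            slot.release()  # let that shard start decoding its next batch
            yield batch
```

Output order across shards is not deterministic in this sketch, which matches streaming "as they complete".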

A limited scan previously always fell back to PyArrow. Push the limit through the native reader instead: truncate the streamed result at the limit (slicing the crossing batch and closing the shards so they stop decoding early) and cap the batch size to the limit so a small limit does not decode a full batch per shard. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
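A rough sketch of the truncation logic described in this commit message, with assumptions labeled: Python lists stand in for Arrow record batches (with PyArrow, `batch.num_rows` and `batch.slice(0, remaining)` would take the place of `len` and list slicing), and `take_limit`, `capped_batch_size`, and the `close` callback are hypothetical names.

```python
from typing import Callable, Iterable, Iterator, List, Optional


def capped_batch_size(default: int, limit: Optional[int]) -> int:
    """Never ask a shard to decode more rows per batch than the limit needs."""
    return default if limit is None else max(1, min(default, limit))


def take_limit(
    batches: Iterable[List[int]],
    limit: int,
    close: Callable[[], None],
) -> Iterator[List[int]]:
    """Stream batches until `limit` rows were yielded, then close the shards."""
    remaining = limit
    try:
        if remaining <= 0:
            return
        for batch in batches:
            if len(batch) > remaining:
                batch = batch[:remaining]  # slice the batch that crosses the limit
            remaining -= len(batch)
            yield batch
            if remaining == 0:
                return
    finally:
        close()  # stop the shards so they do not keep decoding
```

Putting `close()` in a `finally` block means the shards are also stopped when the consumer abandons the generator early.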

The native reader emits arrow-rs's own types (string rather than large_string, and run-end-encoded identity-partition columns), so its output diverged from the PyArrow scan path. Cast every batch to schema_to_pyarrow(projected_schema), decoding run-end-encoded columns first since there is no direct cast kernel for them, so the native path is a faithful drop-in. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

PYICEBERG_RUST_SCAN_PLANNING plans the scan in pyiceberg-core (Table.plan_files) instead of PyIceberg's Python manifest planning, then streams the planned tasks through the same sharded, casted reader as the read path. Falls back to PyArrow on any scan pyiceberg-core cannot handle. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

feat(table): add opt-in pyiceberg-core arrow reader

8a30d87

abnobdoss mentioned this pull request May 25, 2026

io: enable native partition-aware arrow scans #3

Draft

Abanoub Doss and others added 4 commits June 6, 2026 04:34

abnobdoss wants to merge 5 commits into aba-156-157-core-adapters from aba-158-opt-in-rust-arrow-scan
