feat: HuggingFace Hub storage backend and CDC table properties #2375
Sure, should be released soon enough. I also need to set up some additional testing.
Pulled out the opendal 0.56 bump into a separate PR #2401
name: HuggingFace Hub integration tests
runs-on: ubuntu-latest
# Skip the job entirely when HF secrets are not available (e.g. PRs from forks).
if: ${{ secrets.HF_TOKEN != '' }}
HF doesn't have a MinIO-like local setup, so we should configure a free HuggingFace account for the CI.
@Xuanwo could you please take a look?
/// Only the fields required by this crate are stored; revision is consumed
/// during parsing but not retained.
#[derive(Debug, Clone, PartialEq, Eq)]
pub(crate) struct HfUri {
Almost identical to `HfUri` in opendal, which is just not exposed yet.
Xuanwo left a comment
Thank you for working on this!
The only thing left here is the
I haven't been able to reproduce it locally yet, but I'm looking into it.
Xuanwo left a comment
Thank you for this, only one question left.
Adds two opt-in capabilities for storing Iceberg tables on HuggingFace Hub with content-defined chunking for efficient deduplication.

New `opendal-hf` feature on `iceberg-storage-opendal` (off by default, included in `opendal-all`) that wires HuggingFace's OpenDAL service into `FileIO`. Paths use the form:

hf://<repo_type>/<owner>/<repo>[@<revision>]/<path_in_repo>

where `repo_type` must be one of `models`, `datasets`, `spaces`, or `buckets` (XET-backed object storage). The prefix is mandatory; there is no implicit default.

Configuration is passed via `FileIOBuilder` properties:
- `hf.token`: API token (required for private repos / writes)
- `hf.endpoint`: Hub endpoint, defaults to https://huggingface.co
- `hf.revision`: fallback revision when a path has no `@<revision>`

`OpenDalResolvingStorage` recognises the `hf` scheme and lazily constructs a per-scheme storage instance. `delete_stream` groups paths by `<repo_type>/<repo_id>` so that bucket and dataset paths to the same repo do not share an operator.

New table properties under the `write.parquet.content-defined-chunking.*` namespace (matching PyIceberg convention):
- `write.parquet.content-defined-chunking.enabled` (bool, default false)
- `write.parquet.content-defined-chunking.min-chunk-size` (bytes, default 256 KiB)
- `write.parquet.content-defined-chunking.max-chunk-size` (bytes, default 1 MiB)
- `write.parquet.content-defined-chunking.norm-level` (i32, default 0)

CDC is opt-in: it activates only when `enabled = "true"` is set explicitly. Size/level properties without the enabled flag are parsed and stored but have no effect. Defaults match parquet's own `CdcOptions` defaults so the Iceberg layer stays in sync. CDC options are applied directly in the DataFusion physical write plan.
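
To make the path grammar in the commit message above concrete, here is a minimal Rust sketch of how such a path could be split into its parts and how a `<repo_type>/<repo_id>` grouping key could be derived. It is not the PR's implementation: `ParsedHfPath`, `parse_hf_path`, and `operator_key` are hypothetical names, the real `HfUri` is crate-private and does not retain the revision, and the sketch assumes the revision contains no `/`.

```rust
/// Illustrative only: the PR's real `HfUri` is crate-private and may differ
/// (for instance, it does not retain the revision).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ParsedHfPath {
    pub repo_type: String,
    /// `<owner>/<repo>`
    pub repo_id: String,
    pub revision: Option<String>,
    pub path_in_repo: String,
}

/// Repo types listed in the commit message.
const REPO_TYPES: [&str; 4] = ["models", "datasets", "spaces", "buckets"];

impl ParsedHfPath {
    /// Grouping key `<repo_type>/<repo_id>`, so a bucket and a dataset with
    /// the same repo id map to different keys.
    pub fn operator_key(&self) -> String {
        format!("{}/{}", self.repo_type, self.repo_id)
    }
}

/// Parses `hf://<repo_type>/<owner>/<repo>[@<revision>]/<path_in_repo>`.
/// Assumption: the revision contains no '/' (any escaping is out of scope).
pub fn parse_hf_path(uri: &str) -> Result<ParsedHfPath, String> {
    let rest = uri
        .strip_prefix("hf://")
        .ok_or_else(|| format!("not an hf:// path: {uri}"))?;

    // At most four pieces: repo_type, owner, repo[@revision], path_in_repo.
    let mut parts = rest.splitn(4, '/');

    // The repo_type prefix is mandatory; there is no implicit default.
    let repo_type = parts.next().unwrap_or("");
    if !REPO_TYPES.contains(&repo_type) {
        return Err(format!("missing or unknown repo_type: {repo_type:?}"));
    }

    let owner = parts
        .next()
        .filter(|s| !s.is_empty())
        .ok_or_else(|| "missing owner".to_string())?;
    let repo_and_rev = parts
        .next()
        .filter(|s| !s.is_empty())
        .ok_or_else(|| "missing repo".to_string())?;

    // Split the optional `@<revision>` suffix off the repo segment.
    let (repo, revision) = match repo_and_rev.split_once('@') {
        Some((repo, rev)) if !repo.is_empty() && !rev.is_empty() => {
            (repo, Some(rev.to_string()))
        }
        Some(_) => return Err(format!("malformed repo segment: {repo_and_rev:?}")),
        None => (repo_and_rev, None),
    };

    Ok(ParsedHfPath {
        repo_type: repo_type.to_string(),
        repo_id: format!("{owner}/{repo}"),
        revision,
        // Everything after the repo segment; empty when the path is the repo root.
        path_in_repo: parts.next().unwrap_or("").to_string(),
    })
}
```

With this sketch, `hf://datasets/acme/demo@main/data/part-0.parquet` and `hf://buckets/acme/demo/x` (made-up example paths) produce the keys `datasets/acme/demo` and `buckets/acme/demo`, which is the property the `delete_stream` grouping relies on.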
Two jobs gated on HF_TOKEN: Rust opendal-hf integration tests and Python CDC + HF tests. The Python HF test writes a table via PyIceberg and reads it back via IcebergDataFusionTable using the opendal-hf backend. Env vars: HF_TOKEN, HF_BUCKET, HF_DATASET.
The FFI table provider test crashes on macOS with datafusion-python 53 due to a pre-existing FFI ABI fragility (see iceberg-rust issue apache#1647). Keep the pin at 52.* so the test stays skipped via the module-level version guard, matching main's behaviour. Regenerated uv.lock to include the new HF dependencies.
Xuanwo left a comment
Thank you for the great work, let's move!
The new opendal release is out. Dismiss this review to unblock the PR.
Which issue does this PR close?
What changes are included in this PR?
Adds two opt-in capabilities for storing Iceberg tables on HuggingFace Hub with content-defined chunking for efficient deduplication.
HuggingFace Hub storage

New `opendal-hf` feature on `iceberg-storage-opendal` (off by default, included in `opendal-all`) that wires HuggingFace's OpenDAL service into `FileIO`. Paths use the form:

hf://<repo_type>/<owner>/<repo>[@<revision>]/<path_in_repo>

where `repo_type` must be one of `models`, `datasets`, `spaces`, or `buckets`. The prefix is mandatory.

Configuration via `FileIOBuilder` properties:
- `hf.token`: API token (required for private repos / writes)
- `hf.endpoint`: Hub endpoint, defaults to https://huggingface.co
- `hf.revision`: fallback revision when a path has no `@<revision>`

`OpenDalResolvingStorage` recognises the `hf` scheme and lazily constructs a per-scheme storage instance. `delete_stream` groups paths by `<repo_type>/<repo_id>` so bucket and dataset paths to the same repo do not share an operator.

CDC (content-defined chunking) table properties
New table properties under `write.parquet.content-defined-chunking.*` (matching PyIceberg convention):
- `write.parquet.content-defined-chunking.enabled` (bool, default false)
- `write.parquet.content-defined-chunking.min-chunk-size` (bytes, default 256 KiB)
- `write.parquet.content-defined-chunking.max-chunk-size` (bytes, default 1 MiB)
- `write.parquet.content-defined-chunking.norm-level` (i32, default 0)

CDC activates only when `enabled = "true"` is set explicitly. Defaults match parquet's own `CdcOptions` defaults. CDC options are applied in the DataFusion physical write plan.

Are these changes tested?
- `HfUri` parsing and CDC property parsing.
- `file_io_hf_test.rs` guarded on `HF_OPENDAL_TOKEN`, `HF_OPENDAL_BUCKET`, `HF_OPENDAL_DATASET`; tests skip gracefully when env vars are unset.
- `test_huggingface_and_cdc.py` covering CDC property persistence, PyIceberg writes with CDC, DataFusion read-back, and HF credentials end-to-end (skipped without `HF_OPENDAL_TOKEN` / `HF_OPENDAL_TABLE_METADATA`).
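
As a rough illustration of the opt-in CDC semantics described in this PR, the sketch below resolves the four table properties from a plain string map. The property keys and defaults are taken from the description; `CdcSettings`, `resolve_cdc`, and `parse_or` are hypothetical names, the exact-match check on `"true"` is an assumption, and the code does not touch parquet's `CdcOptions` or the DataFusion write plan.

```rust
use std::collections::HashMap;
use std::str::FromStr;

pub const CDC_ENABLED: &str = "write.parquet.content-defined-chunking.enabled";
pub const CDC_MIN_CHUNK_SIZE: &str = "write.parquet.content-defined-chunking.min-chunk-size";
pub const CDC_MAX_CHUNK_SIZE: &str = "write.parquet.content-defined-chunking.max-chunk-size";
pub const CDC_NORM_LEVEL: &str = "write.parquet.content-defined-chunking.norm-level";

/// Hypothetical holder for the resolved values; not a type from the PR.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct CdcSettings {
    pub min_chunk_size: usize,
    pub max_chunk_size: usize,
    pub norm_level: i32,
}

impl Default for CdcSettings {
    /// Defaults as stated in the PR description.
    fn default() -> Self {
        Self {
            min_chunk_size: 256 * 1024, // 256 KiB
            max_chunk_size: 1024 * 1024, // 1 MiB
            norm_level: 0,
        }
    }
}

/// Returns the parsed value for `key`, or `default` when the key is absent.
fn parse_or<T: FromStr>(
    props: &HashMap<String, String>,
    key: &str,
    default: T,
) -> Result<T, String> {
    match props.get(key) {
        Some(raw) => raw
            .trim()
            .parse::<T>()
            .map_err(|_| format!("invalid value {raw:?} for {key}")),
        None => Ok(default),
    }
}

/// Resolves CDC settings. Size/level properties are always parsed (so bad
/// values are reported), but the result is `None` unless the enabled flag is
/// explicitly "true". Assumption: the flag is matched exactly.
pub fn resolve_cdc(props: &HashMap<String, String>) -> Result<Option<CdcSettings>, String> {
    let defaults = CdcSettings::default();
    let settings = CdcSettings {
        min_chunk_size: parse_or(props, CDC_MIN_CHUNK_SIZE, defaults.min_chunk_size)?,
        max_chunk_size: parse_or(props, CDC_MAX_CHUNK_SIZE, defaults.max_chunk_size)?,
        norm_level: parse_or(props, CDC_NORM_LEVEL, defaults.norm_level)?,
    };

    let enabled = props.get(CDC_ENABLED).map(|v| v.trim() == "true").unwrap_or(false);
    Ok(if enabled { Some(settings) } else { None })
}
```

Returning `Option<CdcSettings>` keeps "configured but disabled" distinct from "enabled with defaults", which mirrors the statement that size and level properties have no effect without the enabled flag.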