Skip to content

GH-48334: [C++][Parquet] Support reading encrypted bloom filters#49334

Merged
wgtmac merged 12 commits intoapache:mainfrom
fenfeng9:parquet/gh-48334-encrypted-bloom-filter
Mar 27, 2026
Merged

GH-48334: [C++][Parquet] Support reading encrypted bloom filters#49334
wgtmac merged 12 commits intoapache:mainfrom
fenfeng9:parquet/gh-48334-encrypted-bloom-filter

Conversation

@fenfeng9
Copy link
Copy Markdown
Contributor

@fenfeng9 fenfeng9 commented Feb 18, 2026

Rationale for this change

Reading bloom filters from encrypted Parquet files previously raised an exception. This change implements encrypted bloom filter deserialization by decrypting the Thrift header (module id 8) and bitset (module id 9) separately, and adds the necessary validation and tests.

What changes are included in this PR?

  • Wire metadata decryptor creation into the bloom filter reader
  • Add BlockSplitBloomFilter::DeserializeEncrypted(...) for encrypted bloom filters
  • Remove the fuzzer workaround that swallowed encrypted bloom filter exceptions

Are these changes tested?

Yes.

Are there any user-facing changes?

Yes.

@fenfeng9
Copy link
Copy Markdown
Contributor Author

fenfeng9 commented Feb 18, 2026

@pitrou Could you please take a look when you have a moment?

The C++ writer still rejects bloom filters when file encryption is enabled (ParquetException::NYI in file_writer.cc). Because of that, the tests here build an encrypted payload in memory to exercise the reader path.

void WriteBloomFilter() {
if (bloom_filter_builder_ != nullptr) {
if (properties_->file_encryption_properties()) {
ParquetException::NYI("Encryption is not currently supported with bloom filter");
}
// Serialize bloom filter after all row groups have been written and report

Do you think we should add writer-side support for encrypted bloom filters in this PR as well, or handle that in a follow-up?

@pitrou
Copy link
Copy Markdown
Member

pitrou commented Feb 19, 2026

The C++ writer still rejects bloom filters when file encryption is enabled (ParquetException::NYI in file_writer.cc). Because of that, the tests here build an encrypted payload in memory to exercise the reader path.

How did you generate the encrypted payload? Ideally we should add a test file in https://github.com/apache/parquet-testing/tree/master/data (perhaps generated with another Parquet implementation?)

Do you think we should add writer-side support for encrypted bloom filters in this PR as well, or handle that in a follow-up?

It depends if you feel comfortable doing it!

@fenfeng9
Copy link
Copy Markdown
Contributor Author

The C++ writer still rejects bloom filters when file encryption is enabled (ParquetException::NYI in file_writer.cc). Because of that, the tests here build an encrypted payload in memory to exercise the reader path.

How did you generate the encrypted payload? Ideally we should add a test file in https://github.com/apache/parquet-testing/tree/master/data (perhaps generated with another Parquet implementation?)

Thanks!
At the moment I didn’t run an end‑to‑end test. The test builds an encrypted payload in memory (serialize a BloomFilterHeader, encrypt header/bitset separately, then concatenate) to exercise the reader path without relying on the C++ writer.

I will update the tests with a real encrypted test file generated by another Parquet implementation and add it to parquet-testing.

@fenfeng9 fenfeng9 marked this pull request as draft February 19, 2026 08:40
@fenfeng9
Copy link
Copy Markdown
Contributor Author

Marked as draft. Will add a real test file to parquet-testing and ping you when ready for review.

@pitrou
Copy link
Copy Markdown
Member

pitrou commented Feb 19, 2026

Thanks for tackling this @fenfeng9 :)

@fenfeng9
Copy link
Copy Markdown
Contributor Author

fenfeng9 commented Mar 3, 2026

@pitrou Could you please review this when you have a moment? Thanks! There are some CI failures, but they seem unrelated to this PR

@fenfeng9 fenfeng9 marked this pull request as ready for review March 3, 2026 13:02
@fenfeng9
Copy link
Copy Markdown
Contributor Author

@pitrou Friendly ping — this is ready for review. Thanks!

Copy link
Copy Markdown
Member

@wgtmac wgtmac left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just took a quick pass on this and it looks pretty good. Thanks for adding this!

@github-actions github-actions bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels Mar 12, 2026
@fenfeng9
Copy link
Copy Markdown
Contributor Author

I just took a quick pass on this and it looks pretty good. Thanks for adding this!

Thanks for the review! I'll address these comments soon.

@fenfeng9 fenfeng9 requested a review from wgtmac March 14, 2026 03:32
@wgtmac wgtmac changed the title GH-48334: Support reading encrypted bloom filters GH-48334: [C++][Parquet] Support reading encrypted bloom filters Mar 20, 2026
@fenfeng9 fenfeng9 force-pushed the parquet/gh-48334-encrypted-bloom-filter branch from bfcc616 to 0da603e Compare March 20, 2026 11:46
@fenfeng9 fenfeng9 requested a review from wgtmac March 20, 2026 12:48
@fenfeng9 fenfeng9 force-pushed the parquet/gh-48334-encrypted-bloom-filter branch from 1a1fc6e to f95e363 Compare March 23, 2026 12:39
@fenfeng9 fenfeng9 requested a review from wgtmac March 23, 2026 13:17
Copy link
Copy Markdown
Member

@wgtmac wgtmac left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your quick update! This looks good to me. I've left a minor comment w.r.t. error handling. Let me know what you think.

@fenfeng9 fenfeng9 force-pushed the parquet/gh-48334-encrypted-bloom-filter branch from f95e363 to d59d1bb Compare March 26, 2026 09:10
Copy link
Copy Markdown
Member

@wgtmac wgtmac left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for bearing with me!

@wgtmac wgtmac force-pushed the parquet/gh-48334-encrypted-bloom-filter branch 2 times, most recently from 83a3449 to 5dc2e0c Compare March 27, 2026 14:50
@wgtmac wgtmac force-pushed the parquet/gh-48334-encrypted-bloom-filter branch from 5dc2e0c to 86542f3 Compare March 27, 2026 14:53
@wgtmac
Copy link
Copy Markdown
Member

wgtmac commented Mar 27, 2026

I've rebased this PR on top of the latest main branch to pass CIs and also updated parquet-testing to its latest commit. Will merge after CIs are complete. Thanks again @fenfeng9!

@wgtmac
Copy link
Copy Markdown
Member

wgtmac commented Mar 27, 2026

CI failures are unrelated, mostly due to failed with IOError: Bucket 'ursa-labs-r-test' not found. I'll merge it.

@wgtmac wgtmac merged commit 3ecf1ca into apache:main Mar 27, 2026
44 of 54 checks passed
@wgtmac wgtmac removed the awaiting committer review Awaiting committer review label Mar 27, 2026
@conbench-apache-arrow
Copy link
Copy Markdown

After merging your PR, Conbench analyzed the 3 benchmarking runs that have been run so far on merge-commit 3ecf1ca.

There was 1 benchmark result with an error:

There were no benchmark performance regressions. 🎉

The full Conbench report has more details.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants