GH-48334: [C++][Parquet] Support reading encrypted bloom filters #49334
wgtmac merged 12 commits into apache:main
@pitrou Could you please take a look when you have a moment? The C++ writer still rejects bloom filters when file encryption is enabled (ParquetException::NYI in file_writer.cc; see arrow/cpp/src/parquet/file_writer.cc, lines 484 to 489 in aea1ad3). Because of that, the tests here build an encrypted payload in memory to exercise the reader path. Do you think we should add writer-side support for encrypted bloom filters in this PR as well, or handle that in a follow-up?
How did you generate the encrypted payload? Ideally we should add a test file in https://github.com/apache/parquet-testing/tree/master/data (perhaps generated with another Parquet implementation?)
It depends if you feel comfortable doing it!
Thanks! I will update the tests with a real encrypted test file generated by another Parquet implementation and add it to parquet-testing.
Marked as draft. Will add a real test file to parquet-testing and ping you when ready for review.
Thanks for tackling this @fenfeng9 :)
@pitrou Could you please review this when you have a moment? Thanks! There are some CI failures, but they seem unrelated to this PR.
@pitrou Friendly ping — this is ready for review. Thanks! |
wgtmac left a comment
I just took a quick pass on this and it looks pretty good. Thanks for adding this!
Thanks for the review! I'll address these comments soon.
wgtmac left a comment
Thanks for your quick update! This looks good to me. I've left a minor comment w.r.t. error handling. Let me know what you think.
I've rebased this PR on top of the latest main branch so that CI passes, and also updated parquet-testing to its latest commit. Will merge once CI completes. Thanks again @fenfeng9!
CI failures are unrelated, mostly due to |
After merging your PR, Conbench analyzed the 3 benchmarking runs that have been run so far on merge-commit 3ecf1ca. There was 1 benchmark result with an error:
There were no benchmark performance regressions. 🎉 The full Conbench report has more details.
Rationale for this change
Reading bloom filters from encrypted Parquet files previously raised an exception. This change implements encrypted bloom filter deserialization by decrypting the Thrift header (module id 8) and bitset (module id 9) separately, and adds the necessary validation and tests.
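To illustrate why the header and bitset are decrypted separately: in Parquet modular encryption, each encrypted module is authenticated with its own AAD, built from the file AAD, a one-byte module type, and the row group and column ordinals as 2-byte little-endian values. The sketch below shows how the two AADs for one bloom filter could be assembled. It is a minimal, self-contained illustration, not the code from this PR: `MakeBloomFilterAad` and the constant names are hypothetical, and the file AAD is assumed to be already available as a byte string.

```cpp
#include <cstdint>
#include <string>

// Module type bytes for bloom filter modules in Parquet modular encryption.
// The constant names are illustrative, not taken from this PR.
constexpr int8_t kBloomFilterHeader = 8;
constexpr int8_t kBloomFilterBitset = 9;

// Hypothetical helper (assumption, not the Arrow API): builds the AAD for a
// bloom filter module as
//   file_aad | module_type (1 byte) | row_group_ordinal (2 bytes, LE)
//            | column_ordinal (2 bytes, LE)
std::string MakeBloomFilterAad(const std::string& file_aad, int8_t module_type,
                               int16_t row_group_ordinal,
                               int16_t column_ordinal) {
  std::string aad = file_aad;
  aad.push_back(static_cast<char>(module_type));

  // Append a 16-bit value in little-endian byte order.
  auto append_le16 = [&aad](int16_t value) {
    const auto u = static_cast<uint16_t>(value);
    aad.push_back(static_cast<char>(u & 0xFF));
    aad.push_back(static_cast<char>((u >> 8) & 0xFF));
  };
  append_le16(row_group_ordinal);
  append_le16(column_ordinal);
  return aad;
}
```

Because the two AADs differ only in the module type byte, a header ciphertext cannot be authenticated as a bitset (or the reverse), which is why a reader would need to run two decryption steps with distinct AADs rather than treating the bloom filter as a single encrypted blob.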
What changes are included in this PR?
Are these changes tested?
Yes.
Are there any user-facing changes?
Yes.