Validate column index for null_pages/null_counts contradiction#3458
Open
clee704 wants to merge 1 commit intoapache:masterfrom
Open
Validate column index for null_pages/null_counts contradiction#3458clee704 wants to merge 1 commit intoapache:masterfrom
clee704 wants to merge 1 commit intoapache:masterfrom
Conversation
A null_pages entry of true indicates the page consists entirely of null values; a corresponding null_counts entry of zero contradicts this, as it indicates the page has no null values at all. This inconsistency is a sign of invalid metadata that would cause incorrect page skipping during column index filtering — the reader treats such pages as all-null and silently excludes them from query results for predicates requiring non-null values. Add validation in ColumnIndexBuilder.build(PrimitiveType) to detect this contradiction and return null, consistent with the existing pattern for invalid min/max values. The null propagates through the read path, causing the filter to fall back to reading all pages for the affected column. Also fix a pre-existing NPE in the static build() method where build(type) could return null but was not null-checked before accessing boundaryOrder. Co-authored-by: Copilot <[email protected]>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Fixes #3457
What changes were proposed in this pull request?
Add validation in
ColumnIndexBuilder.build(PrimitiveType)to detect a contradictory column index wherenull_pages[i]istrue(page consists entirely of null values) butnull_counts[i]is0(page has zero null values). When detected,build()returnsnull, following the existing pattern for invalid min/max values.Also fix a pre-existing NPE in the static
build()method wherebuild(type)could returnnullbut was not null-checked before accessingboundaryOrder.Why are the changes needed?
Column index filtering silently excludes pages with this contradiction from query results for all predicates:
WHERE col = 50):BoundaryOrdercomparators iterate overpageIndexes, which omits pages wherenull_pages[i]istrue. These pages are never evaluated and their rows are excluded.WHERE col IS NULL):ColumnIndexBase.visit(Eq)checksnullCounts[pageIndex] > 0, which returnsfalsewhennull_countsis0. The page is excluded.The result is silent data loss with no error or warning. Only unfiltered reads return correct results.
ColumnIndexBuilder.build(PrimitiveType)already returnsnullfor other kinds of invalid metadata (empty pages, invalid min/max values). This adds one more validation of the same kind.How was this patch tested?
3 new test methods (15 assertions) in
TestColumnIndexBuilder:testBuildReturnsNullForNullPageCountContradiction: 5 rejection cases — contradiction at first/middle/last page, single page, all pagestestBuildPreservesValidColumnIndex: 6 preservation cases — legitimate null pages, all non-null pages, single pages, boundary null_counts=1testBuildWithoutNullCountsIsNotRejected: null_counts absent (optional field) is not rejected