Skip to content

Validate column index for null_pages/null_counts contradiction#3458

Open
clee704 wants to merge 1 commit intoapache:masterfrom
clee704:validate-column-index-null-pages
Open

Validate column index for null_pages/null_counts contradiction#3458
clee704 wants to merge 1 commit intoapache:masterfrom
clee704:validate-column-index-null-pages

Conversation

@clee704
Copy link
Copy Markdown

@clee704 clee704 commented Mar 28, 2026

Fixes #3457

What changes were proposed in this pull request?

Add validation in ColumnIndexBuilder.build(PrimitiveType) to detect a contradictory column index where null_pages[i] is true (page consists entirely of null values) but null_counts[i] is 0 (page has zero null values). When detected, build() returns null, following the existing pattern for invalid min/max values.

Also fix a pre-existing NPE in the static build() method where build(type) could return null but was not null-checked before accessing boundaryOrder.

Why are the changes needed?

Column index filtering silently excludes pages with this contradiction from query results for all predicates:

  • Non-null predicates (e.g., WHERE col = 50): BoundaryOrder comparators iterate over pageIndexes, which omits pages where null_pages[i] is true. These pages are never evaluated and their rows are excluded.
  • Null predicates (e.g., WHERE col IS NULL): ColumnIndexBase.visit(Eq) checks nullCounts[pageIndex] > 0, which returns false when null_counts is 0. The page is excluded.

The result is silent data loss with no error or warning. Only unfiltered reads return correct results.

ColumnIndexBuilder.build(PrimitiveType) already returns null for other kinds of invalid metadata (empty pages, invalid min/max values). This adds one more validation of the same kind.

How was this patch tested?

3 new test methods (15 assertions) in TestColumnIndexBuilder:

  • testBuildReturnsNullForNullPageCountContradiction: 5 rejection cases — contradiction at first/middle/last page, single page, all pages
  • testBuildPreservesValidColumnIndex: 6 preservation cases — legitimate null pages, all non-null pages, single pages, boundary null_counts=1
  • testBuildWithoutNullCountsIsNotRejected: null_counts absent (optional field) is not rejected

A null_pages entry of true indicates the page consists entirely of null values;
a corresponding null_counts entry of zero contradicts this, as it indicates the
page has no null values at all. This inconsistency is a sign of invalid metadata
that would cause incorrect page skipping during column index filtering —
the reader treats such pages as all-null and silently excludes them from query
results for predicates requiring non-null values.

Add validation in ColumnIndexBuilder.build(PrimitiveType) to detect this
contradiction and return null, consistent with the existing pattern for invalid
min/max values. The null propagates through the read path, causing the filter to
fall back to reading all pages for the affected column.

Also fix a pre-existing NPE in the static build() method where build(type)
could return null but was not null-checked before accessing boundaryOrder.

Co-authored-by: Copilot <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Column index with null_pages=true and null_counts=0 causes silent data loss during filter pushdown

1 participant