feat: [iceberg] allow native Iceberg scans with non-identity transform residuals by Shekharrajak · Pull Request #2948 · apache/datafusion-comet

Shekharrajak · 2025-12-20T15:39:56Z

Enable native execution for Iceberg queries whose residuals contain bucket/truncate/year/month/day/hour transforms. Row-group filtering skips these predicates, and the post-scan CometFilter applies them, so queries get native performance while remaining correct.
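To make the correctness argument concrete, here is a minimal, self-contained sketch of the idea in plain Scala. It is not Comet or Iceberg code: `RowGroup`, `Pred`, `mayContain`, and `scan` are hypothetical names, and `BucketEq` uses a simple modulo as a stand-in for Iceberg's hash-based bucket transform. The point is only that a predicate skipped during pruning cannot cause wrong results as long as it is re-applied to every surviving row.

```scala
// Hypothetical model; none of these names are Comet or Iceberg APIs.
object ResidualSketch {
  final case class RowGroup(rows: Vector[Int]) {
    def max: Int = rows.max
  }

  sealed trait Pred { def eval(v: Int): Boolean }

  // Plain column comparison: max statistics can prune on this.
  final case class GreaterThan(bound: Int) extends Pred {
    def eval(v: Int): Boolean = v > bound
  }

  // Stand-in for a non-identity transform predicate such as bucket(n, col) = k.
  // Assumption: modulo replaces the real hash-based bucketing for illustration.
  final case class BucketEq(numBuckets: Int, bucket: Int) extends Pred {
    def eval(v: Int): Boolean = Math.floorMod(v, numBuckets) == bucket
  }

  // Pruning may only drop a row group when statistics prove no row can match.
  def mayContain(g: RowGroup, p: Pred): Boolean = p match {
    case GreaterThan(b) => g.max > b
    case _: BucketEq    => true // skipped: raw-value stats say nothing about buckets
  }

  def scan(groups: Seq[RowGroup], preds: Seq[Pred]): Seq[Int] = {
    val surviving = groups.filter(g => preds.forall(mayContain(g, _)))
    // Post-scan filter: every predicate is re-applied row by row.
    surviving.flatMap(_.rows).filter(v => preds.forall(_.eval(v)))
  }
}
```

Skipping a predicate in `mayContain` can only make pruning less selective, never drop a matching row, so the result of `scan` matches a brute-force filter over all rows.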

parthchandra

Could you fix the scalastyle errors so the tests can run?

spark/src/main/scala/org/apache/comet/rules/CometScanRule.scala

codecov-commenter · 2026-01-06T17:34:20Z

Codecov Report

❌ Patch coverage is 46.15385% with 7 lines in your changes missing coverage. Please review.
✅ Project coverage is 59.98%. Comparing base (f09f8af) to head (d8fa8c9).
⚠️ Report is 913 commits behind head on main.

Files with missing lines	Patch %	Lines
...n/scala/org/apache/comet/rules/CometScanRule.scala	46.15%	6 Missing and 1 partial ⚠️

Additional details and impacted files

@@             Coverage Diff              @@
##               main    #2948      +/-   ##
============================================
+ Coverage     56.12%   59.98%   +3.85%     
- Complexity      976     1476     +500     
============================================
  Files           119      175      +56     
  Lines         11743    16169    +4426     
  Branches       2251     2682     +431     
============================================
+ Hits           6591     9699    +3108     
- Misses         4012     5118    +1106     
- Partials       1140     1352     +212


Shekharrajak · 2026-01-07T10:46:15Z

Ref #2921

mbutrovich · 2026-01-12T18:01:12Z

We need [iceberg] in the title to run Iceberg Spark tests.

mbutrovich · 2026-01-14T16:00:41Z

Can you push a dummy commit to trigger CI with the Iceberg tests? I can't seem to force that rule to re-run.

andygrove

Review Summary

This PR enables native Iceberg scans when residual expressions contain non-identity transforms (truncate, bucket, year, month, day, hour). The approach is sound - row-group filtering skips these predicates while post-scan CometFilter ensures correctness.

Verified:

CI tests all passing
Tests verify correctness via checkSparkAnswer (compares Spark vs Comet results)
Tests added for truncate, bucket, and year transforms per @parthchandra's feedback

Minor Suggestions

1. Consider adding tests for month/day/hour transforms

The PR supports month, day, and hour transforms but only explicitly tests truncate, bucket, and year. While year uses a similar code path, explicit tests would provide better coverage:

test("non-identity transform residual - month transform allows native scan") {
  // Similar to year test but with PARTITIONED BY (month(event_date))
}
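For context on why a month-partitioned table leaves a residual at all, here is a rough sketch of the temporal transforms as the Iceberg table spec describes them (years, months, days, and hours counted from 1970-01-01). The object and method names are hypothetical, and the hour helper assumes the timestamp is interpreted as UTC; this is not the Iceberg implementation.

```scala
import java.time.{LocalDate, LocalDateTime, ZoneOffset}

// Hypothetical helper names; values follow the "units since 1970-01-01" definition.
object TemporalTransformSketch {
  def year(d: LocalDate): Int = d.getYear - 1970

  def month(d: LocalDate): Int = (d.getYear - 1970) * 12 + (d.getMonthValue - 1)

  def day(d: LocalDate): Long = d.toEpochDay

  // Assumption: the timestamp is treated as UTC. floorDiv keeps pre-1970 values consistent.
  def hour(ts: LocalDateTime): Long =
    Math.floorDiv(ts.toEpochSecond(ZoneOffset.UTC), 3600L)

  def main(args: Array[String]): Unit = {
    val cutoff = LocalDate.of(2024, 3, 15)
    val early = LocalDate.of(2024, 3, 10)
    val late = LocalDate.of(2024, 3, 20)
    // Both dates land in the same month partition, yet only one satisfies
    // event_date >= cutoff, so the predicate must remain as a residual.
    println(s"same partition: ${month(early) == month(late)}")
    println(s"early matches: ${!early.isBefore(cutoff)}, late matches: ${!late.isBefore(cutoff)}")
  }
}
```

A test for the month transform would therefore want data on both sides of the filter boundary inside a single partition, so that the post-scan filter actually has rows to remove.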

2. Question about reflection failure behavior

The behavior when reflection fails changed from "fall back to Spark" to "continue with native scan" (lines 503-512):

} catch {
  case e: Exception =>
    // Reflection failure - log warning but allow native execution
    // The predicate conversion will handle unsupported cases gracefully
    logWarning(...)
    true  // Previously was false
}

Could you clarify what "handle unsupported cases gracefully" means here? If an unexpected predicate reaches the native layer, does it fail safely (causing fallback) or could it produce incorrect results?

Overall the PR looks good - the core correctness is verified by checkSparkAnswer and the optimization enables native execution for more query patterns.

This review was generated with AI assistance.

mbutrovich · 2026-01-28T17:48:39Z

I converted this to draft because I don't want it to get accidentally merged. We have not run the Iceberg suite on it yet. We're waiting on a commit after I added [iceberg] to the title.

Shekharrajak · 2026-01-29T05:20:46Z

Added the tests. Please let me know if we can make it ready for review.

Shekharrajak · 2026-01-29T05:31:57Z

The PR supports month, day, and hour transforms but only explicitly tests truncate, bucket, and year. While year uses a similar code path, explicit tests would provide better coverage:

Added tests for these transforms.

Shekharrajak · 2026-01-29T05:36:43Z

spark/src/main/scala/org/apache/comet/rules/CometScanRule.scala

+              logWarning(
+                s"Could not check for transform functions in residuals: ${e.getMessage}. " +
+                  "Continuing with native scan.")
+              true


If we cannot detect the transform using IcebergReflection.findNonIdentityTransformInResiduals, we proceed with the native scan and still get the correct result:

Native Scan -> CometFilter -> User Query

Reflection fails → "Try native anyway"
→ Native scan without row-group filter
→ Post-scan filter ensures correctness
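The flow above can be sketched as a small standalone model. `transformCheckPasses`, `check`, and `warn` are hypothetical stand-ins rather than the actual CometScanRule code; only the catch-and-return-true shape and the warning text come from the diff.

```scala
// Hypothetical model of the "reflection fails, try native anyway" decision.
object ReflectionFailureSketch {
  // `check` stands in for a reflective inspection that may throw.
  // `warn` stands in for a logger so the sketch needs no logging dependency.
  def transformCheckPasses(check: () => Boolean, warn: String => Unit): Boolean =
    try {
      check() // normal path: pass the check's own answer through
    } catch {
      case e: Exception =>
        // New behavior from the diff: warn and allow the native scan.
        // The earlier behavior returned false here, forcing a Spark fallback.
        warn(
          s"Could not check for transform functions in residuals: ${e.getMessage}. " +
            "Continuing with native scan.")
        true
    }
}
```

Returning true here is only safe under the assumption stated in the comment above: the native scan runs without the row-group filter and the post-scan filter still evaluates the full predicate.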

Shekharrajak · 2026-01-29T05:37:15Z

Could you clarify what "handle unsupported cases gracefully" means here? If an unexpected predicate reaches the native layer, does it fail safely (causing fallback) or could it produce incorrect results?

If we cannot detect the transform using IcebergReflection.findNonIdentityTransformInResiduals, we proceed with the native scan and still get the correct result:

Native Scan -> CometFilter -> User Query

https://github.com/apache/datafusion-comet/pull/2948/files#r2740040695

Shekharrajak · 2026-01-31T18:01:04Z

The delete operation tests were failing, so we now fall back to Spark for those cases. Please trigger the workflow to validate.

github-actions · 2026-04-02T02:17:36Z

Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days.

Resolve conflict in CometScanRule.scala: upstream introduced IcebergTaskValidationResult with a deleteFiles field in the single-pass validation in validateIcebergFileScanTasks. Adapt the branch logic to use taskValidation.deleteFiles instead of a separate reflection call to IcebergReflection.getDeleteFiles(). Semantics are preserved: allow native scan with post-scan CometFilter for read-only non-identity transform residuals; fall back to Spark only when delete files are present.
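As a rough illustration of the decision described in this commit message, the sketch below models it with plain Scala types. `TaskValidation`, `ScanChoice`, and `choose` are hypothetical names, not the real IcebergTaskValidationResult or CometScanRule logic. It also assumes the Spark fallback applies only when delete files and a non-identity transform occur together, which the message does not spell out.

```scala
// Hypothetical model; the real validation result type carries more than this.
object DeleteFileGuardSketch {
  final case class TaskValidation(
      deleteFiles: Seq[String],
      hasNonIdentityTransform: Boolean)

  sealed trait ScanChoice
  case object NativeScan extends ScanChoice
  case object NativeWithPostScanFilter extends ScanChoice
  case object SparkFallback extends ScanChoice

  def choose(v: TaskValidation): ScanChoice =
    if (!v.hasNonIdentityTransform) {
      NativeScan
    } else if (v.deleteFiles.nonEmpty) {
      // Assumption: delete files plus a non-identity transform is the unsafe case.
      SparkFallback
    } else {
      // Read-only case: the post-scan filter re-applies the residual.
      NativeWithPostScanFilter
    }
}
```

Reading `deleteFiles` from the already-computed validation result, as the message describes, avoids a second reflective pass over the scan tasks.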

…y-transform-residuals

…for delete file safety

Shekharrajak · 2026-04-04T16:10:38Z

The implementation now looks fine and the checks pass locally. I have also added integration tests.

feat: allow native Iceberg scans with non-identity transform residuals

db2fa02

parthchandra reviewed Jan 5, 2026

spark/src/main/scala/org/apache/comet/rules/CometScanRule.scala (outdated)

Add tests for non-identity transform residuals and fix scalastyle errors

f41059f

Shekharrajak force-pushed the feature/non-identity-transform-residuals branch from e9138cd to f41059f Compare January 5, 2026 17:59

mbutrovich changed the title ~~feat: allow native Iceberg scans with non-identity transform residuals~~ feat: [iceberg] allow native Iceberg scans with non-identity transform residuals Jan 12, 2026

andygrove reviewed Jan 28, 2026

mbutrovich marked this pull request as draft January 28, 2026 17:48

Add tests for month, day, and hour transform residuals

63f4056

Shekharrajak commented Jan 29, 2026

Fix delete operations with non-identity transform residuals

d8fa8c9

mbutrovich self-requested a review January 31, 2026 19:02

github-actions bot added the Stale label Apr 2, 2026

Shekharrajak force-pushed the feature/non-identity-transform-residuals branch from 3c5b481 to 16cf235 Compare April 3, 2026 17:47

assert CometFilterExec present in non-identity transform residual tests

3f9b89e

Shekharrajak force-pushed the feature/non-identity-transform-residuals branch from 1c9588a to 3f9b89e Compare April 3, 2026 17:58

github-actions bot removed the Stale label Apr 4, 2026

Shekharrajak added 2 commits April 4, 2026 12:46

Merge remote-tracking branch 'upstream/main' into feature/non-identit…

bebab89

…y-transform-residuals

test: add integration tests for non-identity transform residuals

21d6521

Shekharrajak force-pushed the feature/non-identity-transform-residuals branch from 375f38c to 21d6521 Compare April 4, 2026 08:53

Shekharrajak marked this pull request as ready for review April 4, 2026 10:04

Shekharrajak added 2 commits April 4, 2026 17:19

refactor: remove dead nonIdentityTransform detection from CometScanRule

c0d3839

feat: implement PartitionSpec-based non-identity transform detection …

60115fa

…for delete file safety

Shekharrajak force-pushed the feature/non-identity-transform-residuals branch from 34478c7 to 60115fa Compare April 4, 2026 16:10

5 participants
