feat: [iceberg] allow native Iceberg scans with non-identity transform residuals#2948
feat: [iceberg] allow native Iceberg scans with non-identity transform residuals#2948Shekharrajak wants to merge 10 commits intoapache:mainfrom
Conversation
parthchandra
left a comment
There was a problem hiding this comment.
Could you fix the scalastyle errors so the tests can run?
spark/src/main/scala/org/apache/comet/rules/CometScanRule.scala
Outdated
Show resolved
Hide resolved
e9138cd to
f41059f
Compare
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #2948 +/- ##
============================================
+ Coverage 56.12% 59.98% +3.85%
- Complexity 976 1476 +500
============================================
Files 119 175 +56
Lines 11743 16169 +4426
Branches 2251 2682 +431
============================================
+ Hits 6591 9699 +3108
- Misses 4012 5118 +1106
- Partials 1140 1352 +212 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
|
Ref #2921 |
|
We need |
|
Can you push a dummy commit to trigger CI with the Iceberg tests? I can't seem to force that rule to re-run. |
andygrove
left a comment
There was a problem hiding this comment.
Review Summary
This PR enables native Iceberg scans when residual expressions contain non-identity transforms (truncate, bucket, year, month, day, hour). The approach is sound - row-group filtering skips these predicates while post-scan CometFilter ensures correctness.
Verified:
- CI tests all passing
- Tests verify correctness via
checkSparkAnswer(compares Spark vs Comet results) - Tests added for truncate, bucket, and year transforms per @parthchandra's feedback
Minor Suggestions
1. Consider adding tests for month/day/hour transforms
The PR supports month, day, and hour transforms but only explicitly tests truncate, bucket, and year. While year uses a similar code path, explicit tests would provide better coverage:
test("non-identity transform residual - month transform allows native scan") {
// Similar to year test but with PARTITIONED BY (month(event_date))
}2. Question about reflection failure behavior
The behavior when reflection fails changed from "fall back to Spark" to "continue with native scan" (lines 503-512):
} catch {
case e: Exception =>
// Reflection failure - log warning but allow native execution
// The predicate conversion will handle unsupported cases gracefully
logWarning(...)
true // Previously was false
}Could you clarify what "handle unsupported cases gracefully" means here? If an unexpected predicate reaches the native layer, does it fail safely (causing fallback) or could it produce incorrect results?
Overall the PR looks good - the core correctness is verified by checkSparkAnswer and the optimization enables native execution for more query patterns.
This review was generated with AI assistance.
|
I converted this to draft because I don't want it to get accidentally merged. We have not run the Iceberg suite on it yet. We're waiting on a commit after I added |
|
Added the tests. Please let me know if we can make it ready for review. |
Added tests for these transform. |
| logWarning( | ||
| s"Could not check for transform functions in residuals: ${e.getMessage}. " + | ||
| "Continuing with native scan.") | ||
| true |
There was a problem hiding this comment.
if we could not find transform using IcebergReflection.findNonIdentityTransformInResiduals then we will do with native scan and get the correct result :
Native Scan -> CometFilter -> User Query
There was a problem hiding this comment.
Reflection fails → "Try native anyway"
→ Native scan without row-group filter
→ Post-scan filter ensures correctness
if we could not find transform using IcebergReflection.findNonIdentityTransformInResiduals then we will do with native scan and get the correct result : Native Scan -> CometFilter -> User Query https://github.com/apache/datafusion-comet/pull/2948/files#r2740040695 |
|
delete operation tests where failing so we are falling back to spark for that. Please trigger the workflow to validate now. |
|
Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days. |
Resolve conflict in CometScanRule.scala: upstream introduced IcebergTaskValidationResult with deleteFiles field in validateIcebergFileScanTasks single-pass validation. Adapt branch logic to use taskValidation.deleteFiles instead of separate reflection call to IcebergReflection.getDeleteFiles(). Semantic is preserved: allow native scan with post-scan CometFilter for read-only non-identity transform residuals; fall back to Spark only when delete files are present.
3c5b481 to
16cf235
Compare
1c9588a to
3f9b89e
Compare
375f38c to
21d6521
Compare
34478c7 to
60115fa
Compare
|
Now the implementation is looking fine and checks locally passes. I have added integration tests as well. |
Enable native execution for Iceberg queries with bucket/truncate/year/month/day/hour transforms in residuals. Row-group filtering skips these predicates but post-scan CometFilter applies them. Provides native performance while maintaining correctness.