Skip to content

feat: [iceberg] allow native Iceberg scans with non-identity transform residuals#2948

Open
Shekharrajak wants to merge 10 commits intoapache:mainfrom
Shekharrajak:feature/non-identity-transform-residuals
Open

feat: [iceberg] allow native Iceberg scans with non-identity transform residuals#2948
Shekharrajak wants to merge 10 commits intoapache:mainfrom
Shekharrajak:feature/non-identity-transform-residuals

Conversation

@Shekharrajak
Copy link
Copy Markdown
Contributor

Enable native execution for Iceberg queries with bucket/truncate/year/month/day/hour transforms in residuals. Row-group filtering skips these predicates but post-scan CometFilter applies them. Provides native performance while maintaining correctness.

Copy link
Copy Markdown
Contributor

@parthchandra parthchandra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you fix the scalastyle errors so the tests can run?

@Shekharrajak Shekharrajak force-pushed the feature/non-identity-transform-residuals branch from e9138cd to f41059f Compare January 5, 2026 17:59
@codecov-commenter
Copy link
Copy Markdown

codecov-commenter commented Jan 6, 2026

Codecov Report

❌ Patch coverage is 46.15385% with 7 lines in your changes missing coverage. Please review.
✅ Project coverage is 59.98%. Comparing base (f09f8af) to head (d8fa8c9).
⚠️ Report is 913 commits behind head on main.

Files with missing lines Patch % Lines
...n/scala/org/apache/comet/rules/CometScanRule.scala 46.15% 6 Missing and 1 partial ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main    #2948      +/-   ##
============================================
+ Coverage     56.12%   59.98%   +3.85%     
- Complexity      976     1476     +500     
============================================
  Files           119      175      +56     
  Lines         11743    16169    +4426     
  Branches       2251     2682     +431     
============================================
+ Hits           6591     9699    +3108     
- Misses         4012     5118    +1106     
- Partials       1140     1352     +212     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@Shekharrajak
Copy link
Copy Markdown
Contributor Author

Ref #2921

@mbutrovich mbutrovich changed the title feat: allow native Iceberg scans with non-identity transform residuals feat: [iceberg] allow native Iceberg scans with non-identity transform residuals Jan 12, 2026
@mbutrovich
Copy link
Copy Markdown
Contributor

We need [iceberg] in the title to run Iceberg Spark tests.

@mbutrovich
Copy link
Copy Markdown
Contributor

Can you push a dummy commit to trigger CI with the Iceberg tests? I can't seem to force that rule to re-run.

Copy link
Copy Markdown
Member

@andygrove andygrove left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review Summary

This PR enables native Iceberg scans when residual expressions contain non-identity transforms (truncate, bucket, year, month, day, hour). The approach is sound - row-group filtering skips these predicates while post-scan CometFilter ensures correctness.

Verified:

  • CI tests all passing
  • Tests verify correctness via checkSparkAnswer (compares Spark vs Comet results)
  • Tests added for truncate, bucket, and year transforms per @parthchandra's feedback

Minor Suggestions

1. Consider adding tests for month/day/hour transforms

The PR supports month, day, and hour transforms but only explicitly tests truncate, bucket, and year. While year uses a similar code path, explicit tests would provide better coverage:

test("non-identity transform residual - month transform allows native scan") {
  // Similar to year test but with PARTITIONED BY (month(event_date))
}

2. Question about reflection failure behavior

The behavior when reflection fails changed from "fall back to Spark" to "continue with native scan" (lines 503-512):

} catch {
  case e: Exception =>
    // Reflection failure - log warning but allow native execution
    // The predicate conversion will handle unsupported cases gracefully
    logWarning(...)
    true  // Previously was false
}

Could you clarify what "handle unsupported cases gracefully" means here? If an unexpected predicate reaches the native layer, does it fail safely (causing fallback) or could it produce incorrect results?


Overall the PR looks good - the core correctness is verified by checkSparkAnswer and the optimization enables native execution for more query patterns.


This review was generated with AI assistance.

@mbutrovich mbutrovich marked this pull request as draft January 28, 2026 17:48
@mbutrovich
Copy link
Copy Markdown
Contributor

I converted this to draft because I don't want it to get accidentally merged. We have not run the Iceberg suite on it yet. We're waiting on a commit after I added [iceberg] to the title.

@Shekharrajak
Copy link
Copy Markdown
Contributor Author

Added the tests. Please let me know if we can make it ready for review.

@Shekharrajak
Copy link
Copy Markdown
Contributor Author

The PR supports month, day, and hour transforms but only explicitly tests truncate, bucket, and year. While year uses a similar code path, explicit tests would provide better coverage:

Added tests for these transform.

logWarning(
s"Could not check for transform functions in residuals: ${e.getMessage}. " +
"Continuing with native scan.")
true
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if we could not find transform using IcebergReflection.findNonIdentityTransformInResiduals then we will do with native scan and get the correct result :

Native Scan -> CometFilter -> User Query

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reflection fails → "Try native anyway"
→ Native scan without row-group filter
→ Post-scan filter ensures correctness

@Shekharrajak
Copy link
Copy Markdown
Contributor Author

Could you clarify what "handle unsupported cases gracefully" means here? If an unexpected predicate reaches the native layer, does it fail safely (causing fallback) or could it produce incorrect results?

if we could not find transform using IcebergReflection.findNonIdentityTransformInResiduals then we will do with native scan and get the correct result :

Native Scan -> CometFilter -> User Query

https://github.com/apache/datafusion-comet/pull/2948/files#r2740040695

@Shekharrajak
Copy link
Copy Markdown
Contributor Author

delete operation tests where failing so we are falling back to spark for that. Please trigger the workflow to validate now.

@mbutrovich mbutrovich self-requested a review January 31, 2026 19:02
@github-actions
Copy link
Copy Markdown

github-actions bot commented Apr 2, 2026

Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days.

@github-actions github-actions bot added the Stale label Apr 2, 2026
Resolve conflict in CometScanRule.scala: upstream introduced IcebergTaskValidationResult
with deleteFiles field in validateIcebergFileScanTasks single-pass validation. Adapt
branch logic to use taskValidation.deleteFiles instead of separate reflection call to
IcebergReflection.getDeleteFiles(). Semantic is preserved: allow native scan with
post-scan CometFilter for read-only non-identity transform residuals; fall back to Spark
only when delete files are present.
@Shekharrajak Shekharrajak force-pushed the feature/non-identity-transform-residuals branch from 3c5b481 to 16cf235 Compare April 3, 2026 17:47
@Shekharrajak Shekharrajak force-pushed the feature/non-identity-transform-residuals branch from 1c9588a to 3f9b89e Compare April 3, 2026 17:58
@github-actions github-actions bot removed the Stale label Apr 4, 2026
@Shekharrajak Shekharrajak force-pushed the feature/non-identity-transform-residuals branch from 375f38c to 21d6521 Compare April 4, 2026 08:53
@Shekharrajak Shekharrajak marked this pull request as ready for review April 4, 2026 10:04
@Shekharrajak Shekharrajak force-pushed the feature/non-identity-transform-residuals branch from 34478c7 to 60115fa Compare April 4, 2026 16:10
@Shekharrajak
Copy link
Copy Markdown
Contributor Author

Now the implementation is looking fine and checks locally passes. I have added integration tests as well.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants