Commit 2106cef

docs: document iceberg spark tests in contributor guide (#3777)
1 parent 4d2c398 commit 2106cef

1 file changed

Lines changed: 96 additions & 0 deletions

@@ -0,0 +1,96 @@
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at

  http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->

# Running Iceberg Spark Tests

Running Apache Iceberg's Spark tests with Comet enabled is a good way to ensure that Comet produces the same
results as Spark when reading Iceberg tables. To enable this, we apply diff files to the Apache Iceberg source
code so that Comet is loaded when we run the tests.

Here is an overview of the changes that the diffs make to Iceberg:

- Configure Comet as a dependency and set the correct version in `libs.versions.toml` and `build.gradle`
- Delete upstream Comet reader classes that reference legacy Comet APIs removed in [#3739]. These classes were
  added upstream in [apache/iceberg#15674] and depend on Comet's old Iceberg Java integration. Since Comet now
  uses a native Iceberg scan, these classes fail to compile and must be removed.
- Configure test base classes (`TestBase`, `ExtensionsTestBase`, `ScanTestBase`, etc.) to load the Comet Spark
  plugin and shuffle manager

[#3739]: https://github.com/apache/datafusion-comet/pull/3739
[apache/iceberg#15674]: https://github.com/apache/iceberg/pull/15674

## 1. Install Comet

Run `make release` in the Comet repository to install the Comet JAR into the local Maven repository. Specify
the Spark version through `PROFILES`.

```shell
PROFILES="-Pspark-3.5" make release
```
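
Before moving on, it can help to confirm that the JAR actually landed in the local Maven repository. The helper below is a minimal sketch: the function name `list_comet_jars` is hypothetical, and it assumes Maven's default repository location (`~/.m2/repository`) and that Comet publishes under the `org.apache.datafusion` group ID. Adjust both if your setup differs.

```shell
# Hypothetical helper: list Comet Spark JARs in a local Maven repository.
# Assumptions: default repository path ~/.m2/repository and the
# org.apache.datafusion group ID (neither is stated in this guide).
list_comet_jars() {
  repo="${1:-$HOME/.m2/repository}"
  find "$repo/org/apache/datafusion" -name 'comet-spark-*.jar' 2>/dev/null
}

# Usage:
#   list_comet_jars                     # search the default repository
#   list_comet_jars /path/to/m2/repo    # search a custom repository
```

An empty result suggests that the install step did not complete or that the repository lives elsewhere.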

## 2. Clone Iceberg and Apply Diff

Clone Apache Iceberg locally, check out the tag that matches the diff, and apply the diff file from Comet.
The relative path below assumes that `apache-iceberg` and `datafusion-comet` are cloned side by side.

```shell
git clone git@github.com:apache/iceberg.git apache-iceberg
cd apache-iceberg
git checkout apache-iceberg-1.8.1
git apply ../datafusion-comet/dev/diffs/iceberg-rust/1.8.1.diff
```
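
If the diff fails to apply, a dry run makes the problem easier to see. The sketch below wraps two standard `git apply` options: `--stat` summarizes the files the diff touches, and `--check` tests whether it applies without modifying the working tree. The function name `check_diff` is hypothetical.

```shell
# Hypothetical helper: preview a diff and test whether it applies cleanly.
# Neither command modifies the working tree.
check_diff() {
  git apply --stat "$1" && git apply --check "$1"
}

# Usage, from the apache-iceberg checkout on the matching tag:
#   check_diff ../datafusion-comet/dev/diffs/iceberg-rust/1.8.1.diff
```

A non-zero exit status from `check_diff` usually means the checkout is not on the tag the diff was generated against, or that the working tree already contains changes.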

## 3. Run Iceberg Spark Tests

Run the tests with `ENABLE_COMET=true` set. The example below runs the core Spark test target for
Spark 3.5 and Scala 2.13:

```shell
ENABLE_COMET=true ./gradlew -DsparkVersions=3.5 -DscalaVersion=2.13 -DflinkVersions= -DkafkaVersions= \
  :iceberg-spark:iceberg-spark-3.5_2.13:test \
  -Pquick=true -x javadoc
```

The three Gradle targets tested in CI are:

- `:iceberg-spark:iceberg-spark-<sparkVersion>_<scalaVersion>:test`
- `:iceberg-spark:iceberg-spark-extensions-<sparkVersion>_<scalaVersion>:test`
- `:iceberg-spark:iceberg-spark-runtime-<sparkVersion>_<scalaVersion>:integrationTest`
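
To run all three targets locally, the target names can be generated from the Spark and Scala versions. This is a rough sketch rather than what CI does: the function name `iceberg_ci_targets` is hypothetical, and reusing the flags from the command above for every target is an assumption. The loop only prints each command so it can be reviewed first.

```shell
SPARK_VERSION=3.5
SCALA_VERSION=2.13

# Hypothetical helper: print the three Gradle targets for a Spark/Scala pair.
iceberg_ci_targets() {
  suffix="${1}_${2}"
  echo ":iceberg-spark:iceberg-spark-${suffix}:test"
  echo ":iceberg-spark:iceberg-spark-extensions-${suffix}:test"
  echo ":iceberg-spark:iceberg-spark-runtime-${suffix}:integrationTest"
}

# Print one Gradle invocation per target (assumes the same flags suit all three).
for target in $(iceberg_ci_targets "$SPARK_VERSION" "$SCALA_VERSION"); do
  echo "ENABLE_COMET=true ./gradlew -DsparkVersions=$SPARK_VERSION -DscalaVersion=$SCALA_VERSION" \
    "-DflinkVersions= -DkafkaVersions= $target -Pquick=true -x javadoc"
done
```

Copy a printed line into the shell to run that target from the `apache-iceberg` checkout.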

## Updating Diffs

To update a diff (e.g. after modifying test configuration), apply the existing diff, make changes, then
regenerate:

```shell
cd apache-iceberg
git reset --hard apache-iceberg-1.8.1 && git clean -fd
git apply ../datafusion-comet/dev/diffs/iceberg-rust/1.8.1.diff

# Make changes, then run spotless to fix formatting
./gradlew spotlessApply

# Stage any new or deleted files, then generate the diff
git add -A
git diff apache-iceberg-1.8.1 > ../datafusion-comet/dev/diffs/iceberg-rust/1.8.1.diff
```

Repeat for each Iceberg version (1.8.1, 1.9.1, 1.10.0). The file contents differ between versions, so each
diff must be generated against its own tag.
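
After regenerating, one way to confirm that every diff still applies to its own tag is a loop like the sketch below. The function names are hypothetical, and the sketch assumes that the other versions follow the same naming as 1.8.1: tags named `apache-iceberg-<version>` and diffs at `dev/diffs/iceberg-rust/<version>.diff`.

```shell
# Assumption: each version's diff is named like 1.8.1.diff.
diff_path_for() {
  echo "../datafusion-comet/dev/diffs/iceberg-rust/${1}.diff"
}

# Check out each tag and dry-run its diff; stop at the first failure.
# Assumption: tags are named apache-iceberg-<version>.
verify_diffs() {
  for version in "$@"; do
    git checkout --quiet "apache-iceberg-${version}" || return 1
    if git apply --check "$(diff_path_for "$version")"; then
      echo "${version}: diff applies cleanly"
    else
      echo "${version}: diff does NOT apply" >&2
      return 1
    fi
  done
}

# Usage, from a clean apache-iceberg checkout:
#   verify_diffs 1.8.1 1.9.1 1.10.0
```

Because `git apply --check` does not modify the working tree, the loop can move from tag to tag without resetting in between.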
91+
92+
## Running Tests in CI
93+
94+
The `iceberg_spark_test.yml` workflow applies these diffs and runs the three Gradle targets above against
95+
each Iceberg version. The test matrix covers Spark 3.4 and 3.5 across Iceberg 1.8.1, 1.9.1, and 1.10.0
96+
with Java 11 and 17. The workflow only runs when the PR title contains `[iceberg]`.
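
As a quick local check before opening a PR, the title rule can be approximated in shell. This is only a sketch: the function name `runs_iceberg_ci` is hypothetical, the match here is case-sensitive, and the workflow's actual condition may treat case or whitespace differently.

```shell
# Hypothetical helper: succeed if a PR title contains the literal text [iceberg].
# This approximates the rule described above; it is not the workflow's own logic.
runs_iceberg_ci() {
  case "$1" in
    *"[iceberg]"*) return 0 ;;
    *) return 1 ;;
  esac
}

# Usage:
#   runs_iceberg_ci "fix: handle nested types [iceberg]" && echo "Iceberg tests will run"
```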
