Commit ba6b3ce

docs: expand profiling guide with JVM and async-profiler coverage (#3628)

3 files changed (297 additions, 95 deletions)

docs/source/contributor-guide/index.md (1 addition, 1 deletion):

@@ -35,7 +35,7 @@ Benchmarking Guide <benchmarking>
 Adding a New Operator <adding_a_new_operator>
 Adding a New Expression <adding_a_new_expression>
 Tracing <tracing>
-Profiling Native Code <profiling_native_code>
+Profiling <profiling>
 Spark SQL Tests <spark-sql-tests.md>
 Iceberg Spark Tests <iceberg-spark-tests.md>
 SQL File Tests <sql-file-tests.md>

New file (296 additions, 0 deletions):

<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at

  http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->

# Profiling

This guide covers profiling tools and techniques for Comet development. Because Comet
spans JVM (Spark) and native (Rust) code, choosing the right profiler depends on what
you are investigating.

## Choosing a Profiling Tool

| Tool | JVM Frames | Native (Rust) Frames | Install Required | Best For |
| ---- | ---------- | -------------------- | ---------------- | -------- |
| [async-profiler](https://github.com/async-profiler/async-profiler) | Yes | Yes | Yes | End-to-end Comet profiling with unified JVM + native flame graphs |
| [Java Flight Recorder (JFR)](https://docs.oracle.com/en/java/javase/17/jfapi/) | Yes | No | No (JDK 11+) | GC pressure, allocations, thread contention, I/O — any JVM-level investigation |
| [cargo-flamegraph](https://github.com/flamegraph-rs/flamegraph) | No | Yes | Yes | Isolated Rust micro-benchmarks without a JVM |

**Recommendation:** Use **async-profiler** when profiling Spark queries with Comet enabled —
it is the only tool of the three that shows both JVM and native frames in a single flame graph.
Use **JFR** when you need rich JVM event data (GC, locks, I/O) without installing anything.
Use **cargo-flamegraph** when working on native code in isolation via `cargo bench`.

## Profiling with async-profiler

[async-profiler](https://github.com/async-profiler/async-profiler) captures JVM and
native code in the same flame graph by sampling with Linux `perf_events` (on macOS,
where `perf_events` is unavailable, it falls back to timer-based sampling). This makes
it ideal for profiling Comet, where hot paths cross the JNI boundary between Spark and
Rust.

### Installation

Download a release from the [async-profiler releases page](https://github.com/async-profiler/async-profiler/releases):

```shell
# Linux x64
wget https://github.com/async-profiler/async-profiler/releases/download/v3.0/async-profiler-3.0-linux-x64.tar.gz
mkdir -p $HOME/opt/async-profiler
tar xzf async-profiler-3.0-linux-x64.tar.gz -C $HOME/opt/async-profiler --strip-components=1
export ASYNC_PROFILER_HOME=$HOME/opt/async-profiler
```

On macOS, download the appropriate `macos` archive instead.

### Attaching to a running Spark application

Use the `asprof` launcher to attach to a running JVM by PID:

```shell
# Start CPU profiling for 30 seconds, output an HTML flame graph
$ASYNC_PROFILER_HOME/bin/asprof -d 30 -f flamegraph.html <pid>

# Wall-clock profiling
$ASYNC_PROFILER_HOME/bin/asprof -e wall -d 30 -f flamegraph.html <pid>

# Start profiling (no duration limit), then stop later
$ASYNC_PROFILER_HOME/bin/asprof start -e cpu <pid>
# ... run your query ...
$ASYNC_PROFILER_HOME/bin/asprof stop -f flamegraph.html <pid>
```

Find the Spark driver/executor PID with `jps` or `pgrep -f SparkSubmit`.

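For scripted profiling sessions, the PID lookup can be fed straight into `asprof`. The
`jps -l` output below is illustrative (the PIDs are made up), but the class name is not:
Spark executors run the `org.apache.spark.executor.CoarseGrainedExecutorBackend` main class.

```shell
# Illustrative `jps -l` output captured from a running cluster node
jps_output='12345 org.apache.spark.deploy.SparkSubmit
23456 org.apache.spark.executor.CoarseGrainedExecutorBackend
34567 jdk.jcmd/sun.tools.jps.Jps'

# Pick out the executor PID, then attach asprof to it
pid=$(printf '%s\n' "$jps_output" | awk '/CoarseGrainedExecutorBackend/ {print $1; exit}')
echo "$pid"
# → 23456
# $ASYNC_PROFILER_HOME/bin/asprof -d 30 -f executor.html "$pid"
```

On a real cluster, replace the canned `jps_output` with `$(jps -l)` and drop the final
comment marker to start profiling.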
### Passing profiler flags via spark-submit

You can also attach async-profiler as a Java agent at JVM startup:

```shell
spark-submit \
  --conf "spark.driver.extraJavaOptions=-agentpath:$ASYNC_PROFILER_HOME/lib/libasyncProfiler.so=start,event=cpu,file=driver.html,tree" \
  --conf "spark.executor.extraJavaOptions=-agentpath:$ASYNC_PROFILER_HOME/lib/libasyncProfiler.so=start,event=cpu,file=executor.html,tree" \
  ...
```

Note: if executors run on remote nodes, `executor.html` is written on those nodes, not
on the machine where `spark-submit` was run.

### Choosing an event type

| Event | When to use |
| ----- | ----------- |
| `cpu` | Default. Shows where CPU cycles are spent. Use for compute-bound queries. |
| `wall` | Wall-clock time including blocked/waiting threads. Use to find JNI boundary overhead and I/O stalls. |
| `alloc` | Heap allocation profiling. Use to find JVM allocation hotspots around Arrow FFI and columnar conversions. |
| `lock` | Lock contention. Use when threads appear to spend time waiting on synchronized blocks or locks. |

### Output formats

| Format | Flag | Description |
| ------ | ---- | ----------- |
| HTML flame graph | `-f out.html` | Interactive flame graph (default and most useful). |
| JFR | `-f out.jfr` | Viewable in JDK Mission Control or IntelliJ. |
| Collapsed stacks | `-f out.collapsed` | For use with Brendan Gregg's FlameGraph scripts. |
| Text summary | `-o text` | Flat list of hot methods. |

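The collapsed format is also convenient for quick triage with standard text tools,
without reaching for the FlameGraph scripts. A sketch with made-up frame names and
counts (real Comet profiles will differ, but the line shape is the same: semicolon-separated
frames followed by a sample count):

```shell
# Illustrative collapsed-stack data
cat > stacks.collapsed <<'EOF'
Thread.run;Executor$TaskRunner.run;Java_org_apache_comet_Native_executePlan;comet::execute_plan 42
Thread.run;Executor$TaskRunner.run;Java_org_apache_comet_Native_executePlan;arrow::compute::filter 17
Thread.run;Dispatcher.receive 5
EOF

# Sum samples by leaf frame and list the hottest leaves first
awk '{count=$NF; sub(/ [0-9]+$/, ""); n=split($0, f, ";"); hot[f[n]]+=count}
     END {for (frame in hot) print hot[frame], frame}' stacks.collapsed | sort -rn
# → 42 comet::execute_plan  (hottest leaf first)
```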
### Platform notes

**Linux:** Set `perf_event_paranoid` to allow profiling:

```shell
sudo sysctl kernel.perf_event_paranoid=1  # or 0 / -1 for full access
sudo sysctl kernel.kptr_restrict=0        # optional: enable kernel symbols
```

**macOS:** Profiling is more limited than on Linux because `perf_events` is not
available, and native Rust frames may not be fully resolved. Linux is recommended for
the most complete flame graphs.

### Integrated benchmark profiling

The TPC benchmark scripts in `benchmarks/tpc/` have built-in async-profiler support via
the `--async-profiler` flag. See [benchmarks/tpc/README.md](https://github.com/apache/datafusion-comet/blob/main/benchmarks/tpc/README.md)
for details.

## Profiling with Java Flight Recorder

[Java Flight Recorder (JFR)](https://docs.oracle.com/en/java/javase/17/jfapi/) is built
into JDK 11+ and collects detailed JVM runtime data with very low overhead. It does not
see native Rust frames, but is excellent for diagnosing GC pressure, thread contention,
I/O latency, and JVM-level allocation patterns.

### Adding JFR flags to spark-submit

```shell
spark-submit \
  --conf "spark.driver.extraJavaOptions=-XX:StartFlightRecording=duration=120s,filename=driver.jfr" \
  --conf "spark.executor.extraJavaOptions=-XX:StartFlightRecording=duration=120s,filename=executor.jfr" \
  ...
```

For continuous recording without a fixed duration:

```shell
--conf "spark.driver.extraJavaOptions=-XX:StartFlightRecording=disk=true,maxsize=500m,filename=driver.jfr"
```

You can also start and stop recording dynamically using `jcmd`:

```shell
jcmd <pid> JFR.start name=profile
# ... run your query ...
jcmd <pid> JFR.stop name=profile filename=recording.jfr
```

### Viewing recordings

- **[JDK Mission Control (JMC)](https://jdk.java.net/jmc/)** — the most comprehensive viewer.
  Shows flame graphs, GC timeline, thread activity, I/O, and allocation hot spots.
- **IntelliJ IDEA** — open `.jfr` files directly in the built-in profiler
  (Run → Open Profiler Snapshot).
- **`jfr` CLI** — quick summaries from the command line: `jfr summary driver.jfr`

### Useful JFR events for Comet debugging

| Event | What it shows |
| ----- | ------------- |
| `jdk.GCPhasePause` | GC pause durations — helps identify memory pressure from Arrow allocations. |
| `jdk.ObjectAllocationInNewTLAB` / `jdk.ObjectAllocationOutsideTLAB` | Allocation hot spots. |
| `jdk.JavaMonitorWait` / `jdk.ThreadPark` | Thread contention and lock waits. |
| `jdk.FileRead` / `jdk.FileWrite` / `jdk.SocketRead` | I/O latency. |
| `jdk.ExecutionSample` | CPU sampling (method profiling, similar to a flame graph). |

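Beyond `jfr summary`, the `jfr print` subcommand of the JDK's `jfr` tool can filter a
recording down to the events above. A sketch against a hypothetical `driver.jfr`:

```shell
# Show only GC pause events from the recording
jfr print --events jdk.GCPhasePause driver.jfr

# Show allocation samples, truncating stack traces for readability
jfr print --events jdk.ObjectAllocationInNewTLAB --stack-depth 8 driver.jfr
```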
### Integrated benchmark profiling

The TPC benchmark scripts support `--jfr` for automatic JFR recording during benchmark
runs. See [benchmarks/tpc/README.md](https://github.com/apache/datafusion-comet/blob/main/benchmarks/tpc/README.md) for details.

## Profiling Native Code with cargo-flamegraph

For profiling Rust code in isolation — without a JVM — use `cargo bench` with
[cargo-flamegraph](https://github.com/flamegraph-rs/flamegraph).

### Running micro benchmarks with cargo bench

When implementing a new operator or expression, it is good practice to add a new
microbenchmark under `core/benches`.

It is often easiest to copy an existing benchmark and modify it for the new operator or
expression. It is also necessary to add a new section to the `Cargo.toml` file, such as:

```toml
[[bench]]
name = "shuffle_writer"
harness = false
```

These benchmarks are useful for comparing performance between releases or between
feature branches and the main branch to help prevent regressions in performance when
adding new features or fixing bugs.

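One way to make such comparisons is with Criterion's baseline feature (`--save-baseline`
and `--baseline` are standard Criterion CLI options passed after `--`). The branch name
`my-feature` below is a hypothetical example:

```shell
cd native

# Record a baseline from the main branch
git checkout main
cargo bench --bench shuffle_writer -- --save-baseline main

# Re-run on the feature branch; Criterion reports the change vs. the saved baseline
git checkout my-feature
cargo bench --bench shuffle_writer -- --baseline main
```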
Individual benchmarks can be run by name with the following command:

```shell
cargo bench shuffle_writer
```

Here is some sample output from running this command:

```
Running benches/shuffle_writer.rs (target/release/deps/shuffle_writer-e37b59e37879cce7)
Gnuplot not found, using plotters backend
shuffle_writer/shuffle_writer
                        time:   [2.0880 ms 2.0989 ms 2.1118 ms]
Found 9 outliers among 100 measurements (9.00%)
  3 (3.00%) high mild
  6 (6.00%) high severe
```

### Profiling with cargo-flamegraph

Install cargo-flamegraph:

```shell
cargo install flamegraph
```

Follow the instructions in [cargo-flamegraph](https://github.com/flamegraph-rs/flamegraph)
for your platform for running flamegraph.

Here is a sample command for running `cargo-flamegraph` on macOS:

```shell
cargo flamegraph --root --bench shuffle_writer
```

This will produce output similar to the following:

```
dtrace: system integrity protection is on, some features will not be available

dtrace: description 'profile-997 ' matched 1 probe
Gnuplot not found, using plotters backend
Testing shuffle_writer/shuffle_writer
Success

dtrace: pid 66402 has exited
writing flamegraph to "flamegraph.svg"
```

The generated flame graph can now be opened in any browser that supports SVG.

Here is the flame graph for this example:

![flamegraph](../_static/images/flamegraph.png)

## Tips for Profiling Comet

### Use wall-clock profiling to spot JNI boundary overhead

When profiling Comet with async-profiler, `wall` mode is often more revealing than `cpu`
because it captures time spent crossing the JNI boundary, waiting for native results,
and blocked on I/O — none of which show up in CPU-only profiles.

```shell
$ASYNC_PROFILER_HOME/bin/asprof -e wall -d 60 -f wall-profile.html <pid>
```

### Use alloc profiling around Arrow FFI

JVM allocation profiling can identify hotspots in the Arrow FFI path where temporary
objects are created during data transfer between JVM and native code:

```shell
$ASYNC_PROFILER_HOME/bin/asprof -e alloc -d 60 -f alloc-profile.html <pid>
```

Look for allocations in `CometExecIterator`, `CometBatchIterator`, and Arrow vector
classes.

### Isolate Rust-only performance issues

If a flame graph shows the hot path is entirely within native code, switch to
`cargo-flamegraph` to get better symbol resolution and avoid JVM noise:

```shell
cd native
cargo flamegraph --root --bench <benchmark_name>
```

### Correlating JVM and native frames

In async-profiler flame graphs, native Rust frames appear below JNI entry points like
`Java_org_apache_comet_Native_*`. Look for these transition points to understand how
time is split between Spark's JVM code and Comet's native execution.
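
Collapsed-stack output (`-f out.collapsed`) makes this split easy to quantify. The
profile data below is illustrative (the counts and non-JNI frame names are made up),
but the `Java_org_apache_comet_Native_` prefix is what real Comet JNI entry points
look like:

```shell
# Illustrative collapsed-stack profile
cat > profile.collapsed <<'EOF'
Thread.run;TaskRunner.run;Java_org_apache_comet_Native_executePlan;comet::plan 40
Thread.run;TaskRunner.run;WholeStageCodegen.processNext 35
Thread.run;Java_org_apache_comet_Native_getNextBatch;comet::batch 25
EOF

# What fraction of samples crossed into Comet's native code?
total=$(awk '{s += $NF} END {print s}' profile.collapsed)
native=$(awk '/Java_org_apache_comet_Native_/ {s += $NF} END {print s + 0}' profile.collapsed)
echo "native: $native / $total samples"
# → native: 65 / 100 samples
```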
