<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at

  http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->

# Profiling

This guide covers profiling tools and techniques for Comet development. Because Comet
spans JVM (Spark) and native (Rust) code, choosing the right profiler depends on what
you are investigating.

## Choosing a Profiling Tool

| Tool                                                                           | JVM Frames | Native (Rust) Frames | Install Required | Best For                                                                       |
| ------------------------------------------------------------------------------ | ---------- | -------------------- | ---------------- | ------------------------------------------------------------------------------ |
| [async-profiler](https://github.com/async-profiler/async-profiler)             | Yes        | Yes                  | Yes              | End-to-end Comet profiling with unified JVM + native flame graphs              |
| [Java Flight Recorder (JFR)](https://docs.oracle.com/en/java/javase/17/jfapi/) | Yes        | No                   | No (JDK 11+)     | GC pressure, allocations, thread contention, I/O — any JVM-level investigation |
| [cargo-flamegraph](https://github.com/flamegraph-rs/flamegraph)                | No         | Yes                  | Yes              | Isolated Rust micro-benchmarks without a JVM                                   |

**Recommendation:** Use **async-profiler** when profiling Spark queries with Comet enabled —
it is the only tool that shows both JVM and native frames in a single flame graph.
Use **JFR** when you need rich JVM event data (GC, locks, I/O) without installing anything.
Use **cargo-flamegraph** when working on native code in isolation via `cargo bench`.
## Profiling with async-profiler

[async-profiler](https://github.com/async-profiler/async-profiler) captures JVM and
native code in the same flame graph by using Linux `perf_events` or macOS `dtrace`.
This makes it ideal for profiling Comet, where hot paths cross the JNI boundary between
Spark and Rust.

### Installation

Download a release from the [async-profiler releases page](https://github.com/async-profiler/async-profiler/releases):

```shell
# Linux x64
wget https://github.com/async-profiler/async-profiler/releases/download/v3.0/async-profiler-3.0-linux-x64.tar.gz
mkdir -p $HOME/opt/async-profiler
tar xzf async-profiler-3.0-linux-x64.tar.gz -C $HOME/opt/async-profiler --strip-components=1
export ASYNC_PROFILER_HOME=$HOME/opt/async-profiler
```

On macOS, download the appropriate `macos` archive instead.
| 59 | + |
| 60 | +### Attaching to a running Spark application |
| 61 | + |
| 62 | +Use the `asprof` launcher to attach to a running JVM by PID: |
| 63 | + |
| 64 | +```shell |
| 65 | +# Start CPU profiling for 30 seconds, output an HTML flame graph |
| 66 | +$ASYNC_PROFILER_HOME/bin/asprof -d 30 -f flamegraph.html <pid> |
| 67 | + |
| 68 | +# Wall-clock profiling |
| 69 | +$ASYNC_PROFILER_HOME/bin/asprof -e wall -d 30 -f flamegraph.html <pid> |
| 70 | + |
| 71 | +# Start profiling (no duration limit), then stop later |
| 72 | +$ASYNC_PROFILER_HOME/bin/asprof start -e cpu <pid> |
| 73 | +# ... run your query ... |
| 74 | +$ASYNC_PROFILER_HOME/bin/asprof stop -f flamegraph.html <pid> |
| 75 | +``` |
| 76 | + |
| 77 | +Find the Spark driver/executor PID with `jps` or `pgrep -f SparkSubmit`. |
| 78 | + |
| 79 | +### Passing profiler flags via spark-submit |
| 80 | + |
| 81 | +You can also attach async-profiler as a Java agent at JVM startup: |
| 82 | + |
| 83 | +```shell |
| 84 | +spark-submit \ |
| 85 | + --conf "spark.driver.extraJavaOptions=-agentpath:$ASYNC_PROFILER_HOME/lib/libasyncProfiler.so=start,event=cpu,file=driver.html,tree" \ |
| 86 | + --conf "spark.executor.extraJavaOptions=-agentpath:$ASYNC_PROFILER_HOME/lib/libasyncProfiler.so=start,event=cpu,file=executor.html,tree" \ |
| 87 | + ... |
| 88 | +``` |
| 89 | + |
| 90 | +Note: If the executor is distributed then `executor.html` will be written on the remote node. |

### Choosing an event type

| Event   | When to use                                                                                               |
| ------- | --------------------------------------------------------------------------------------------------------- |
| `cpu`   | Default. Shows where CPU cycles are spent. Use for compute-bound queries.                                 |
| `wall`  | Wall-clock time including blocked/waiting threads. Use to find JNI boundary overhead and I/O stalls.      |
| `alloc` | Heap allocation profiling. Use to find JVM allocation hotspots around Arrow FFI and columnar conversions. |
| `lock`  | Lock contention. Use when threads appear to spend time waiting on synchronized blocks or locks.           |
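
Selecting an event only changes the `-e` flag on the same `asprof` invocation shown earlier. For example, to investigate lock contention (`<pid>` is your Spark driver or executor process):

```shell
# Sample lock contention for 30 seconds and write an HTML flame graph
$ASYNC_PROFILER_HOME/bin/asprof -e lock -d 30 -f lock-profile.html <pid>
```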

### Output formats

| Format           | Flag               | Description                                        |
| ---------------- | ------------------ | -------------------------------------------------- |
| HTML flame graph | `-f out.html`      | Interactive flame graph (default and most useful). |
| JFR              | `-f out.jfr`       | Viewable in JDK Mission Control or IntelliJ.       |
| Collapsed stacks | `-f out.collapsed` | For use with Brendan Gregg's FlameGraph scripts.   |
| Text summary     | `-o text`          | Flat list of hot methods.                          |
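
As a sketch of the collapsed-stacks workflow, the following assumes you have cloned the [FlameGraph](https://github.com/brendangregg/FlameGraph) repository into your working directory:

```shell
# Capture 30 seconds of collapsed stacks, then render them as an SVG
$ASYNC_PROFILER_HOME/bin/asprof -d 30 -f profile.collapsed <pid>
./FlameGraph/flamegraph.pl profile.collapsed > profile.svg
```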

### Platform notes

**Linux:** Set `perf_event_paranoid` to allow profiling:

```shell
sudo sysctl kernel.perf_event_paranoid=1 # or 0 / -1 for full access
sudo sysctl kernel.kptr_restrict=0       # optional: enable kernel symbols
```

**macOS:** async-profiler uses `dtrace` on macOS, which requires running as root or
with SIP (System Integrity Protection) adjustments. Native Rust frames may not be fully
resolved on macOS; Linux is recommended for the most complete flame graphs.

### Integrated benchmark profiling

The TPC benchmark scripts in `benchmarks/tpc/` have built-in async-profiler support via
the `--async-profiler` flag. See [benchmarks/tpc/README.md](https://github.com/apache/datafusion-comet/blob/main/benchmarks/tpc/README.md)
for details.

## Profiling with Java Flight Recorder

[Java Flight Recorder (JFR)](https://docs.oracle.com/en/java/javase/17/jfapi/) is built
into JDK 11+ and collects detailed JVM runtime data with very low overhead. It does not
see native Rust frames, but is excellent for diagnosing GC pressure, thread contention,
I/O latency, and JVM-level allocation patterns.

### Adding JFR flags to spark-submit

```shell
spark-submit \
  --conf "spark.driver.extraJavaOptions=-XX:StartFlightRecording=duration=120s,filename=driver.jfr" \
  --conf "spark.executor.extraJavaOptions=-XX:StartFlightRecording=duration=120s,filename=executor.jfr" \
  ...
```

For continuous recording without a fixed duration:

```shell
--conf "spark.driver.extraJavaOptions=-XX:StartFlightRecording=disk=true,maxsize=500m,filename=driver.jfr"
```

You can also start and stop recording dynamically using `jcmd`:

```shell
jcmd <pid> JFR.start name=profile
# ... run your query ...
jcmd <pid> JFR.stop name=profile filename=recording.jfr
```
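
While a named recording is in progress, `jcmd` can also report its status:

```shell
# List in-progress recordings (name, duration, destination) on the target JVM
jcmd <pid> JFR.check
```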

### Viewing recordings

- **[JDK Mission Control (JMC)](https://jdk.java.net/jmc/)** — the most comprehensive viewer.
  Shows flame graphs, GC timeline, thread activity, I/O, and allocation hot spots.
- **IntelliJ IDEA** — open `.jfr` files directly in the built-in profiler
  (Run → Open Profiler Snapshot).
- **`jfr` CLI** — quick summaries from the command line: `jfr summary driver.jfr`

### Useful JFR events for Comet debugging

| Event                                                               | What it shows                                                               |
| ------------------------------------------------------------------- | --------------------------------------------------------------------------- |
| `jdk.GCPhasePause`                                                  | GC pause durations — helps identify memory pressure from Arrow allocations. |
| `jdk.ObjectAllocationInNewTLAB` / `jdk.ObjectAllocationOutsideTLAB` | Allocation hot spots.                                                       |
| `jdk.JavaMonitorWait` / `jdk.ThreadPark`                            | Thread contention and lock waits.                                           |
| `jdk.FileRead` / `jdk.FileWrite` / `jdk.SocketRead`                 | I/O latency.                                                                |
| `jdk.ExecutionSample`                                               | CPU sampling (method profiling, similar to a flame graph).                  |
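
The `jfr` CLI can filter a recording down to just these events. For example, assuming a `driver.jfr` recorded as shown above:

```shell
# Print only GC pauses from the recording
jfr print --events jdk.GCPhasePause driver.jfr

# Print CPU samples with deeper stacks (default stack depth is shallow)
jfr print --events jdk.ExecutionSample --stack-depth 16 driver.jfr
```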

### Integrated benchmark profiling

The TPC benchmark scripts support `--jfr` for automatic JFR recording during benchmark
runs. See [benchmarks/tpc/README.md](https://github.com/apache/datafusion-comet/blob/main/benchmarks/tpc/README.md) for details.

## Profiling Native Code with cargo-flamegraph

For profiling Rust code in isolation — without a JVM — use `cargo bench` with
[cargo-flamegraph](https://github.com/flamegraph-rs/flamegraph).

### Running micro benchmarks with cargo bench

When implementing a new operator or expression, it is good practice to add a new microbenchmark under `core/benches`.

It is often easiest to copy an existing benchmark and modify it for the new operator or expression. It is also
necessary to add a new section to the `Cargo.toml` file, such as:

```toml
[[bench]]
name = "shuffle_writer"
harness = false
```

These benchmarks are useful for comparing performance between releases or between feature branches and the
main branch to help prevent regressions in performance when adding new features or fixing bugs.
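
One way to make such comparisons is with Criterion's baseline flags; this is a sketch, with `my-feature-branch` standing in for your branch name:

```shell
# Record a baseline on main, then compare the feature branch against it
git checkout main
cargo bench --bench shuffle_writer -- --save-baseline main
git checkout my-feature-branch
cargo bench --bench shuffle_writer -- --baseline main
```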

Individual benchmarks can be run by name with the following command:

```shell
cargo bench shuffle_writer
```

Here is some sample output from running this command:

```
     Running benches/shuffle_writer.rs (target/release/deps/shuffle_writer-e37b59e37879cce7)
Gnuplot not found, using plotters backend
shuffle_writer/shuffle_writer
                        time:   [2.0880 ms 2.0989 ms 2.1118 ms]
Found 9 outliers among 100 measurements (9.00%)
  3 (3.00%) high mild
  6 (6.00%) high severe
```

### Profiling with cargo-flamegraph

Install cargo-flamegraph:

```shell
cargo install flamegraph
```

Follow the instructions in [cargo-flamegraph](https://github.com/flamegraph-rs/flamegraph) for your platform for
running flamegraph.

Here is a sample command for running `cargo-flamegraph` on macOS:

```shell
cargo flamegraph --root --bench shuffle_writer
```

This will produce output similar to the following:

```
dtrace: system integrity protection is on, some features will not be available

dtrace: description 'profile-997 ' matched 1 probe
Gnuplot not found, using plotters backend
Testing shuffle_writer/shuffle_writer
Success

dtrace: pid 66402 has exited
writing flamegraph to "flamegraph.svg"
```

The generated flame graph (`flamegraph.svg`) can then be opened in any browser that supports SVG.

Here is the flamegraph for this example:

![]()

## Tips for Profiling Comet

### Use wall-clock profiling to spot JNI boundary overhead

When profiling Comet with async-profiler, `wall` mode is often more revealing than `cpu`
because it captures time spent crossing the JNI boundary, waiting for native results,
and blocked on I/O — none of which show up in CPU-only profiles.

```shell
$ASYNC_PROFILER_HOME/bin/asprof -e wall -d 60 -f wall-profile.html <pid>
```

### Use alloc profiling around Arrow FFI

JVM allocation profiling can identify hotspots in the Arrow FFI path where temporary
objects are created during data transfer between JVM and native code:

```shell
$ASYNC_PROFILER_HOME/bin/asprof -e alloc -d 60 -f alloc-profile.html <pid>
```

Look for allocations in `CometExecIterator`, `CometBatchIterator`, and Arrow vector
classes.

### Isolate Rust-only performance issues

If a flame graph shows the hot path is entirely within native code, switch to
`cargo-flamegraph` to get better symbol resolution and avoid JVM noise:

```shell
cd native
cargo flamegraph --root --bench <benchmark_name>
```

### Correlating JVM and native frames

In async-profiler flame graphs, native Rust frames appear below JNI entry points like
`Java_org_apache_comet_Native_*`. Look for these transition points to understand how
time is split between Spark's JVM code and Comet's native execution.