Apache DataFusion Comet is a high-performance accelerator for Apache Spark, built on top of the powerful Apache DataFusion query engine. Comet is designed to significantly enhance the performance of Apache Spark workloads while leveraging commodity hardware and seamlessly integrating with the Spark ecosystem without requiring any code changes.
Comet provides a 2x speedup for TPC-H @ 1TB, resulting in 50% cost savings.
That 2x speedup gives you a choice: finish the same Spark workload in half the time on the cluster you already have, or match your current Spark performance on roughly half the resources. Either way, the gain translates directly into lower cloud bills, reduced on-prem capacity, and lower energy usage, with no changes to your existing Spark SQL, DataFrame, or PySpark code. Comet runs on commodity hardware: no GPUs, FPGAs, or other specialized accelerators are required, so the savings come from better utilization of the infrastructure you already run on.
See the Comet Benchmarking Guide for more details.
Comet replaces Spark operators and expressions with native Rust implementations that run on Apache DataFusion. It uses Apache Arrow for zero-copy data transfer between the JVM and native code.
- Parquet scans: native Parquet reader integrated with Spark's query planner
- Apache Iceberg: accelerated Parquet scans when reading Iceberg tables from Spark (see the Iceberg guide)
- Shuffle: native columnar shuffle with support for hash and range partitioning
- Expressions: hundreds of supported Spark expressions across math, string, datetime, array, map, JSON, hash, and predicate categories
- Aggregations: hash aggregate with support for `FILTER (WHERE ...)` clauses
- Joins: hash join, sort-merge join, and broadcast join
For the authoritative lists, see the supported expressions and supported operators pages.
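As an illustration, a single query can exercise several of these features at once: a native Parquet scan, a hash join, and a hash aggregate with a `FILTER` clause. The table and column names below are hypothetical; when Comet is enabled and every operator in the plan is supported, the whole query runs natively:

```scala
// Run inside a Comet-enabled spark-shell session.
// The orders/customers tables are hypothetical examples.
val df = spark.sql("""
  SELECT o.status,
         COUNT(*) AS order_count,
         SUM(o.amount) FILTER (WHERE o.amount > 100) AS large_order_total
  FROM   orders o
  JOIN   customers c ON o.customer_id = c.id
  WHERE  o.order_date >= DATE '2024-01-01'
  GROUP  BY o.status
""")
df.show()
```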
Comet is designed as a drop-in accelerator for Apache Spark: add it to an existing Spark deployment and, with no code changes, your applications benefit from native acceleration without any disruption to existing workflows.
Comet supports Apache Spark 3.4 and 3.5, and provides experimental support for Spark 4.0. See the installation guide for the detailed version, Java, and Scala compatibility matrix.
Install Comet by adding the jar for your Spark and Scala version to the Spark classpath and enabling the plugin. A typical configuration looks like:
```shell
export COMET_JAR=/path/to/comet-spark-spark3.5_2.12-<version>.jar

$SPARK_HOME/bin/spark-shell \
    --jars $COMET_JAR \
    --conf spark.driver.extraClassPath=$COMET_JAR \
    --conf spark.executor.extraClassPath=$COMET_JAR \
    --conf spark.plugins=org.apache.spark.CometPlugin \
    --conf spark.shuffle.manager=org.apache.spark.sql.comet.execution.shuffle.CometShuffleManager \
    --conf spark.comet.explainFallback.enabled=true \
    --conf spark.memory.offHeap.enabled=true \
    --conf spark.memory.offHeap.size=4g
```

For full installation instructions, published jar downloads, and the complete set of options, see the installation guide and the configuration reference.
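Once the shell is up, you can check that Comet has taken over a query plan: with `spark.comet.explainFallback.enabled=true`, operators and expressions that fall back to Spark are reported with a reason, and calling `explain()` on a DataFrame shows Comet operators in the physical plan. A minimal smoke test inside spark-shell (the exact operator names vary by Comet version):

```scala
// Run inside a Comet-enabled spark-shell session.
val df = spark.range(1000)
  .selectExpr("id", "id % 10 AS bucket")
  .groupBy("bucket")
  .count()

// When Comet handles the plan, the output lists Comet operators
// (e.g. CometHashAggregate, CometExchange) in place of the stock
// Spark operators; otherwise the fallback log explains why not.
df.explain()
```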
Join the DataFusion Slack and Discord channels to connect with other users, ask questions, and share your experiences with Comet.
We welcome contributions from the community to help improve and enhance Apache DataFusion Comet. Whether it's fixing bugs, adding new features, writing documentation, or optimizing performance, your contributions are invaluable in shaping the future of Comet. Check out our contributor guide to get started.
Apache DataFusion Comet is licensed under the Apache License 2.0. See the LICENSE.txt file for details.


