Apache DataFusion Comet is a high-performance accelerator for Apache Spark, built on top of the powerful Apache DataFusion query engine. Comet is designed to significantly enhance the performance of Apache Spark workloads while leveraging commodity hardware and seamlessly integrating with the Spark ecosystem without requiring any code changes.
Comet provides a 2x speedup for TPC-H @ 1TB, resulting in 50% cost savings.
That 2x speedup gives you a choice: finish the same Spark workload in half the time on the cluster you already have, or match your current Spark performance on roughly half the resources. Either way, the gain translates directly into lower cloud bills, reduced on-prem capacity, and lower energy usage, with no changes to your existing Spark SQL, DataFrame, or PySpark code. Comet runs on commodity hardware: no GPUs, FPGAs, or other specialized accelerators are required, so the savings come from better utilization of the infrastructure you already run on.
See the Comet Benchmarking Guide for more details.
Comet replaces Spark operators and expressions with native Rust implementations that run on Apache DataFusion. It uses Apache Arrow for zero-copy data transfer between the JVM and native code.
- Parquet scans: native Parquet reader integrated with Spark's query planner
- Apache Iceberg: accelerated Parquet scans when reading Iceberg tables from Spark (see the Iceberg guide)
- Shuffle: native columnar shuffle with support for hash and range partitioning
- Expressions: hundreds of supported Spark expressions across math, string, datetime, array, map, JSON, hash, and predicate categories
- Aggregations: hash aggregate with support for `FILTER (WHERE ...)` clauses
- Joins: hash join, sort-merge join, and broadcast join
For the authoritative lists, see the supported expressions and supported operators pages.
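As an illustration, a single query can exercise several of these features at once: a native Parquet scan, a hash join, and a hash aggregate with a `FILTER` clause. The table and column names below are hypothetical; when Comet is enabled and every operator in the plan is supported, the whole query runs natively:

```scala
// Run inside a Comet-enabled spark-shell session.
// The orders/customers tables are hypothetical examples.
val df = spark.sql("""
  SELECT o.status,
         COUNT(*) AS order_count,
         SUM(o.amount) FILTER (WHERE o.amount > 100) AS large_order_total
  FROM   orders o
  JOIN   customers c ON o.customer_id = c.id
  WHERE  o.order_date >= DATE '2024-01-01'
  GROUP  BY o.status
""")
df.show()
```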
Comet is designed as a drop-in accelerator for Apache Spark: add it to an existing Spark deployment and, with no code changes, your applications benefit from native acceleration without any disruption to existing workflows.
Comet supports Apache Spark 3.4 and 3.5, and provides experimental support for Spark 4.0. See the installation guide for the detailed version, Java, and Scala compatibility matrix.
Install Comet by adding the jar for your Spark and Scala version to the Spark classpath and enabling the plugin. A typical configuration looks like:
```shell
export COMET_JAR=/path/to/comet-spark-spark3.5_2.12-<version>.jar

$SPARK_HOME/bin/spark-shell \
    --jars $COMET_JAR \
    --conf spark.driver.extraClassPath=$COMET_JAR \
    --conf spark.executor.extraClassPath=$COMET_JAR \
    --conf spark.plugins=org.apache.spark.CometPlugin \
    --conf spark.shuffle.manager=org.apache.spark.sql.comet.execution.shuffle.CometShuffleManager \
    --conf spark.comet.explainFallback.enabled=true \
    --conf spark.memory.offHeap.enabled=true \
    --conf spark.memory.offHeap.size=4g
```

For full installation instructions, published jar downloads, and the complete set of options, see the installation guide and the configuration reference.
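Once the shell is up, you can check that Comet has taken over a query plan: with `spark.comet.explainFallback.enabled=true`, operators and expressions that fall back to Spark are reported with a reason, and calling `explain()` on a DataFrame shows Comet operators in the physical plan. A minimal smoke test inside spark-shell (the exact operator names vary by Comet version):

```scala
// Run inside a Comet-enabled spark-shell session.
val df = spark.range(1000)
  .selectExpr("id", "id % 10 AS bucket")
  .groupBy("bucket")
  .count()

// When Comet handles the plan, the output lists Comet operators
// (e.g. CometHashAggregate, CometExchange) in place of the stock
// Spark operators; otherwise the fallback log explains why not.
df.explain()
```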
Join the DataFusion Slack and Discord channels to connect with other users, ask questions, and share your experiences with Comet.
We welcome contributions from the community to help improve and enhance Apache DataFusion Comet. Whether it's fixing bugs, adding new features, writing documentation, or optimizing performance, your contributions are invaluable in shaping the future of Comet. Check out our contributor guide to get started.
Apache DataFusion Comet is licensed under the Apache License 2.0. See the LICENSE.txt file for details.


