@@ -26,6 +26,26 @@ For full instructions on running these benchmarks on an EC2 instance, see the [C
 
 [Comet Benchmarking on EC2 Guide]: https://datafusion.apache.org/comet/contributor-guide/benchmarking_aws_ec2.html
 
+## Usage
+
+All benchmarks are run via `run.py`:
+
+```
+python3 run.py --engine <engine> --benchmark <tpch|tpcds> [options]
+```
+
+| Option         | Description                                      |
+| -------------- | ------------------------------------------------ |
+| `--engine`     | Engine name (matches a TOML file in `engines/`)  |
+| `--benchmark`  | `tpch` or `tpcds`                                |
+| `--iterations` | Number of iterations (default: 1)                |
+| `--output`     | Output directory (default: `.`)                  |
+| `--query`      | Run a single query number                        |
+| `--no-restart` | Skip Spark master/worker restart                 |
+| `--dry-run`    | Print the spark-submit command without executing |
+
+Available engines: `spark`, `comet`, `comet-iceberg`, `gluten`
+
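The options in the table above could be declared with `argparse` roughly as follows. This is a sketch only: flag names are taken from the table, while types, defaults, and help strings are assumptions about how `run.py` might define them.

```python
import argparse

# Hypothetical sketch of run.py's CLI, mirroring the options table above.
# Flag names come from the table; types and help strings are assumptions.
parser = argparse.ArgumentParser(description="Run a TPC-H/TPC-DS benchmark")
parser.add_argument("--engine", required=True,
                    help="engine name (matches a TOML file in engines/)")
parser.add_argument("--benchmark", required=True, choices=["tpch", "tpcds"])
parser.add_argument("--iterations", type=int, default=1)
parser.add_argument("--output", default=".")
parser.add_argument("--query", type=int, help="run a single query number")
parser.add_argument("--no-restart", action="store_true",
                    help="skip Spark master/worker restart")
parser.add_argument("--dry-run", action="store_true",
                    help="print the spark-submit command without executing")

# Parse a sample command line instead of sys.argv so the sketch is self-contained.
args = parser.parse_args(["--engine", "comet", "--benchmark", "tpch", "--dry-run"])
print(args.engine, args.iterations, args.dry_run)  # comet 1 True
```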
 ## Example usage
 
 Set Spark environment variables:
@@ -47,7 +67,7 @@ Run Spark benchmark:
 ```shell
 export JAVA_HOME=/usr/lib/jvm/java-17-openjdk-amd64
 sudo ./drop-caches.sh
-./spark-tpch.sh
+python3 run.py --engine spark --benchmark tpch
 ```
 
 Run Comet benchmark:
@@ -56,7 +76,7 @@ Run Comet benchmark:
 export JAVA_HOME=/usr/lib/jvm/java-17-openjdk-amd64
 export COMET_JAR=/opt/comet/comet-spark-spark3.5_2.12-0.10.0.jar
 sudo ./drop-caches.sh
-./comet-tpch.sh
+python3 run.py --engine comet --benchmark tpch
 ```
 
 Run Gluten benchmark:
@@ -65,7 +85,13 @@ Run Gluten benchmark:
 export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
 export GLUTEN_JAR=/opt/gluten/gluten-velox-bundle-spark3.5_2.12-linux_amd64-1.4.0.jar
 sudo ./drop-caches.sh
-./gluten-tpch.sh
+python3 run.py --engine gluten --benchmark tpch
+```
+
+Preview a command without running it:
+
+```shell
+python3 run.py --engine comet --benchmark tpch --dry-run
 ```
 
 Generating charts:
@@ -74,6 +100,11 @@ Generating charts:
 python3 generate-comparison.py --benchmark tpch --labels "Spark 3.5.3" "Comet 0.9.0" "Gluten 1.4.0" --title "TPC-H @ 100 GB (single executor, 8 cores, local Parquet files)" spark-tpch-1752338506381.json comet-tpch-1752337818039.json gluten-tpch-1752337474344.json
 ```
 
+## Engine Configuration
+
+Each engine is defined by a TOML file in `engines/`. The config specifies JARs, Spark conf overrides,
+required environment variables, and optional defaults/exports. See existing files for examples.
+
 ## Iceberg Benchmarking
 
 Comet includes native Iceberg support via iceberg-rust integration. This enables benchmarking TPC-H queries
@@ -90,14 +121,16 @@ export ICEBERG_JAR=/path/to/iceberg-spark-runtime-3.5_2.12-1.8.1.jar
 
 Note: Table creation uses `--packages` which auto-downloads the dependency.
 
-### Create Iceberg TPC-H tables
+### Create Iceberg tables
 
-Convert existing Parquet TPC-H data to Iceberg format:
+Convert existing Parquet data to Iceberg format using `create-iceberg-tables.py`.
+The script configures the Iceberg catalog automatically -- no `--conf` flags needed.
 
 ```shell
 export ICEBERG_WAREHOUSE=/mnt/bigdata/iceberg-warehouse
-export ICEBERG_CATALOG=${ICEBERG_CATALOG:-local}
+mkdir -p $ICEBERG_WAREHOUSE
 
+# TPC-H
 $SPARK_HOME/bin/spark-submit \
   --master $SPARK_MASTER \
   --packages org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.8.1 \
@@ -106,13 +139,24 @@ $SPARK_HOME/bin/spark-submit \
   --conf spark.executor.cores=8 \
   --conf spark.cores.max=8 \
   --conf spark.executor.memory=16g \
-  --conf spark.sql.catalog.${ICEBERG_CATALOG}=org.apache.iceberg.spark.SparkCatalog \
-  --conf spark.sql.catalog.${ICEBERG_CATALOG}.type=hadoop \
-  --conf spark.sql.catalog.${ICEBERG_CATALOG}.warehouse=$ICEBERG_WAREHOUSE \
-  create-iceberg-tpch.py \
+  create-iceberg-tables.py \
+  --benchmark tpch \
   --parquet-path $TPCH_DATA \
-  --catalog $ICEBERG_CATALOG \
-  --database tpch
+  --warehouse $ICEBERG_WAREHOUSE
+
+# TPC-DS
+$SPARK_HOME/bin/spark-submit \
+  --master $SPARK_MASTER \
+  --packages org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.8.1 \
+  --conf spark.driver.memory=8G \
+  --conf spark.executor.instances=2 \
+  --conf spark.executor.cores=8 \
+  --conf spark.cores.max=16 \
+  --conf spark.executor.memory=16g \
+  create-iceberg-tables.py \
+  --benchmark tpcds \
+  --parquet-path $TPCDS_DATA \
+  --warehouse $ICEBERG_WAREHOUSE
 ```
 
 ### Run Iceberg benchmark
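Conceptually, the conversion issues one CREATE TABLE ... AS SELECT per source table. The sketch below shows the shape such statements might take; the table list and SQL template are assumptions for illustration, not the actual code in `create-iceberg-tables.py`.

```python
# Rough sketch of the CTAS statements a Parquet-to-Iceberg conversion
# might issue. The table list and statement shape are assumptions; the
# actual logic lives in create-iceberg-tables.py and may differ.
TPCH_TABLES = [
    "customer", "lineitem", "nation", "orders",
    "part", "partsupp", "region", "supplier",
]

def ctas_sql(catalog: str, database: str, table: str, parquet_path: str) -> str:
    """Build a CREATE TABLE ... USING iceberg statement for one table."""
    return (
        f"CREATE TABLE {catalog}.{database}.{table} USING iceberg "
        f"AS SELECT * FROM parquet.`{parquet_path}/{table}`"
    )

for t in TPCH_TABLES:
    print(ctas_sql("local", "tpch", t, "/mnt/bigdata/tpch/sf100"))
```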
@@ -124,20 +168,22 @@ export ICEBERG_JAR=/path/to/iceberg-spark-runtime-3.5_2.12-1.8.1.jar
 export ICEBERG_WAREHOUSE=/mnt/bigdata/iceberg-warehouse
 export TPCH_QUERIES=/mnt/bigdata/tpch/queries/
 sudo ./drop-caches.sh
-./comet-tpch-iceberg.sh
+python3 run.py --engine comet-iceberg --benchmark tpch
 ```
 
 The benchmark uses `spark.comet.scan.icebergNative.enabled=true` to enable Comet's native iceberg-rust
 integration. Verify native scanning is active by checking for `CometIcebergNativeScanExec` in the
 physical plan output.
 
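That verification step amounts to a string match on the explained plan text; a minimal sketch (the sample plan string below is made up for illustration):

```python
# Minimal helper: scan the text of an explained physical plan for Comet's
# native Iceberg scan operator. The sample plan string is fabricated.
def uses_native_iceberg_scan(plan_text: str) -> bool:
    return "CometIcebergNativeScanExec" in plan_text

sample = "== Physical Plan ==\nCometIcebergNativeScanExec [l_orderkey, l_quantity]"
print(uses_native_iceberg_scan(sample))              # True
print(uses_native_iceberg_scan("FileScan parquet"))  # False
```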
-### Iceberg-specific options
+### create-iceberg-tables.py options
 
-| Environment Variable | Default    | Description                         |
-| -------------------- | ---------- | ----------------------------------- |
-| `ICEBERG_CATALOG`    | `local`    | Iceberg catalog name                |
-| `ICEBERG_DATABASE`   | `tpch`     | Database containing TPC-H tables    |
-| `ICEBERG_WAREHOUSE`  | (required) | Path to Iceberg warehouse directory |
+| Option           | Required | Default        | Description                         |
+| ---------------- | -------- | -------------- | ----------------------------------- |
+| `--benchmark`    | Yes      |                | `tpch` or `tpcds`                   |
+| `--parquet-path` | Yes      |                | Path to source Parquet data         |
+| `--warehouse`    | Yes      |                | Path to Iceberg warehouse directory |
+| `--catalog`      | No       | `local`        | Iceberg catalog name                |
+| `--database`     | No       | benchmark name | Database name for the tables        |
 
 ### Comparing Parquet vs Iceberg performance
 