[SPARK-57420][INFRA] Only generate TPC-DS data when required and check CPU compatibility early in benchmark workflow by iemejia · Pull Request #56479 · apache/spark

iemejia · 2026-06-12T18:53:27Z

What changes were proposed in this pull request?

Two improvements to the benchmark workflow:

Skip TPC-DS data generation for non-TPCDS benchmarks. Change contains(inputs.class, '*') to inputs.class == '*' so wildcard patterns like *VectorizedDeltaReaderBenchmark no longer trigger the expensive TPC-DS generation job (~5-10 min saved per run).
Add early CPU model check step that runs immediately after checkout, before compilation. Prints the CPU as a ::notice:: annotation for live visibility in the Actions UI, and optionally fails fast if the runner CPU does not match the expected-cpu input parameter.

Why are the changes needed?

The benchmark workflow currently generates TPC-DS data (~5-10 min) for every benchmark run, even when the benchmark class does not use TPC-DS data. This is because contains(inputs.class, '*') matches any class with a wildcard (e.g., *VectorizedDeltaReaderBenchmark), not just the literal * (all benchmarks).

Additionally, when benchmark results need to match a specific CPU (e.g., AMD EPYC 7763 for consistent comparisons against upstream baselines), there is no way to detect a CPU mismatch until the full benchmark completes (~20-30 min). The early CPU check allows the job to fail within seconds of starting if the runner does not match, saving significant time and compute.

Does this PR introduce any user-facing change?

No. This only affects the GHA benchmark workflow. Existing behavior is preserved when expected-cpu is not set (default).

How was this patch tested?

The workflow changes are self-contained in .github/workflows/benchmark.yml. Tested by inspection. The expected-cpu parameter is optional and defaults to empty (no check), preserving backward compatibility.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: OpenCode (Claude claude-opus-4.6)

…k CPU compatibility early in benchmark workflow Two improvements to the benchmark workflow: 1. Skip TPC-DS data generation for non-TPCDS benchmarks. Change contains(inputs.class, '*') to inputs.class == '*' so wildcard patterns like '*VectorizedDeltaReaderBenchmark' no longer trigger the expensive TPC-DS generation job (~5-10 min saved per run). 2. Add early CPU model check step that runs immediately after checkout, before compilation. Prints the CPU as a ::notice:: annotation for live visibility, and optionally fails fast if the runner CPU does not match the expected-cpu input parameter. Assisted-by: OpenCode:claude-opus-4.6

iemejia · 2026-06-12T18:58:49Z

@LuciferYang Would you mind taking a look at this one when you get a chance? While working on the Parquet encoding benchmarks, I noticed the workflow was spending ~5-10 min generating TPC-DS data on every run even when the benchmark does not use it (because contains(inputs.class, '*') matches any wildcard pattern, not just the literal *).

I also kept having to wait for full 20-30 min runs to complete only to discover the runner landed on the wrong CPU. For that, I added an optional expected-cpu input parameter that detects the runner CPU immediately after checkout and fails the job within seconds if it does not match -- so you do not waste the entire compilation + benchmark time before finding out.

These two small fixes should save a lot of time for anyone using the benchmark workflow with specific class patterns and CPU-sensitive comparisons. Happy to adjust anything if needed.

shrirangmhalgi · 2026-06-15T06:31:33Z

+        CPU_MODEL=$(grep "model name" /proc/cpuinfo | head -1 | sed 's/model name\s*:\s*//')
+        echo "Runner CPU: $CPU_MODEL"
+        echo "::notice::Runner CPU: $CPU_MODEL"
+        if [ -n "${{ inputs.expected-cpu }}" ]; then


Nit: ${{ inputs.expected-cpu }} is template-expanded before the shell runs, so special characters in the input would be interpreted as shell syntax. For workflow_dispatch this is low-risk (only committers can trigger), but the standard hygiene fix is to route through env: so it becomes a properly-quoted shell variable:

env: EXPECTED_CPU: ${{ inputs.expected-cpu }} run: | ... if echo "$CPU_MODEL" | grep -qF "$EXPECTED_CPU"; then

Good catch, thanks! Updated to route through env: so the input is a properly-quoted shell variable. Fixed in the next push.

shrirangmhalgi · 2026-06-15T06:34:08Z

This is clean workflow improvement. The contains -> == fix for the wildcard case and the early CPU check are both well thought out.

Left one minor hygiene nit on the CPU check step.

Address review nit: use step-level env block to pass the inputs.expected-cpu value as a properly-quoted shell variable instead of template-expanding it directly into the script. Assisted-by: OpenCode:claude-opus-4.6

iemejia · 2026-06-15T07:18:55Z

Thanks for the review @shrirangmhalgi! Addressed the nit -- now routing expected-cpu through an env variable for proper shell quoting.

shrirangmhalgi reviewed Jun 15, 2026

View reviewed changes

Route expected-cpu input through env variable for shell safety

1b52da5

Address review nit: use step-level env block to pass the inputs.expected-cpu value as a properly-quoted shell variable instead of template-expanding it directly into the script. Assisted-by: OpenCode:claude-opus-4.6

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[SPARK-57420][INFRA] Only generate TPC-DS data when required and check CPU compatibility early in benchmark workflow#56479

[SPARK-57420][INFRA] Only generate TPC-DS data when required and check CPU compatibility early in benchmark workflow#56479
iemejia wants to merge 2 commits into
apache:masterfrom
iemejia:SPARK-57420-benchmark-workflow-improvements

iemejia commented Jun 12, 2026

Uh oh!

iemejia commented Jun 12, 2026

Uh oh!

shrirangmhalgi Jun 15, 2026

Uh oh!

iemejia Jun 15, 2026

Uh oh!

shrirangmhalgi commented Jun 15, 2026

Uh oh!

iemejia commented Jun 15, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

iemejia commented Jun 12, 2026

What changes were proposed in this pull request?

Why are the changes needed?

Does this PR introduce any user-facing change?

How was this patch tested?

Was this patch authored or co-authored using generative AI tooling?

Uh oh!

iemejia commented Jun 12, 2026

Uh oh!

shrirangmhalgi Jun 15, 2026

Choose a reason for hiding this comment

Uh oh!

iemejia Jun 15, 2026

Choose a reason for hiding this comment

Uh oh!

shrirangmhalgi commented Jun 15, 2026

Uh oh!

iemejia commented Jun 15, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants