Skip to content

[SPARK-57420][INFRA] Only generate TPC-DS data when required and check CPU compatibility early in benchmark workflow#56479

Open
iemejia wants to merge 2 commits into
apache:masterfrom
iemejia:SPARK-57420-benchmark-workflow-improvements
Open

[SPARK-57420][INFRA] Only generate TPC-DS data when required and check CPU compatibility early in benchmark workflow#56479
iemejia wants to merge 2 commits into
apache:masterfrom
iemejia:SPARK-57420-benchmark-workflow-improvements

Conversation

@iemejia

@iemejia iemejia commented Jun 12, 2026

Copy link
Copy Markdown
Member

What changes were proposed in this pull request?

Two improvements to the benchmark workflow:

  1. Skip TPC-DS data generation for non-TPCDS benchmarks. Change contains(inputs.class, '*') to inputs.class == '*' so wildcard patterns like *VectorizedDeltaReaderBenchmark no longer trigger the expensive TPC-DS generation job (~5-10 min saved per run).

  2. Add early CPU model check step that runs immediately after checkout, before compilation. Prints the CPU as a ::notice:: annotation for live visibility in the Actions UI, and optionally fails fast if the runner CPU does not match the expected-cpu input parameter.

Why are the changes needed?

The benchmark workflow currently generates TPC-DS data (~5-10 min) for every benchmark run, even when the benchmark class does not use TPC-DS data. This is because contains(inputs.class, '*') matches any class with a wildcard (e.g., *VectorizedDeltaReaderBenchmark), not just the literal * (all benchmarks).

Additionally, when benchmark results need to match a specific CPU (e.g., AMD EPYC 7763 for consistent comparisons against upstream baselines), there is no way to detect a CPU mismatch until the full benchmark completes (~20-30 min). The early CPU check allows the job to fail within seconds of starting if the runner does not match, saving significant time and compute.

Does this PR introduce any user-facing change?

No. This only affects the GHA benchmark workflow. Existing behavior is preserved when expected-cpu is not set (default).

How was this patch tested?

The workflow changes are self-contained in .github/workflows/benchmark.yml. Tested by inspection. The expected-cpu parameter is optional and defaults to empty (no check), preserving backward compatibility.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: OpenCode (Claude claude-opus-4.6)

…k CPU compatibility early in benchmark workflow

Two improvements to the benchmark workflow:

1. Skip TPC-DS data generation for non-TPCDS benchmarks. Change
   contains(inputs.class, '*') to inputs.class == '*' so wildcard
   patterns like '*VectorizedDeltaReaderBenchmark' no longer trigger
   the expensive TPC-DS generation job (~5-10 min saved per run).

2. Add early CPU model check step that runs immediately after checkout,
   before compilation. Prints the CPU as a ::notice:: annotation for
   live visibility, and optionally fails fast if the runner CPU does
   not match the expected-cpu input parameter.

Assisted-by: OpenCode:claude-opus-4.6
@iemejia

iemejia commented Jun 12, 2026

Copy link
Copy Markdown
Member Author

@LuciferYang Would you mind taking a look at this one when you get a chance? While working on the Parquet encoding benchmarks, I noticed the workflow was spending ~5-10 min generating TPC-DS data on every run even when the benchmark does not use it (because contains(inputs.class, '*') matches any wildcard pattern, not just the literal *).

I also kept having to wait for full 20-30 min runs to complete only to discover the runner landed on the wrong CPU. For that, I added an optional expected-cpu input parameter that detects the runner CPU immediately after checkout and fails the job within seconds if it does not match -- so you do not waste the entire compilation + benchmark time before finding out.

These two small fixes should save a lot of time for anyone using the benchmark workflow with specific class patterns and CPU-sensitive comparisons. Happy to adjust anything if needed.

Comment thread .github/workflows/benchmark.yml Outdated
CPU_MODEL=$(grep "model name" /proc/cpuinfo | head -1 | sed 's/model name\s*:\s*//')
echo "Runner CPU: $CPU_MODEL"
echo "::notice::Runner CPU: $CPU_MODEL"
if [ -n "${{ inputs.expected-cpu }}" ]; then

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: ${{ inputs.expected-cpu }} is template-expanded before the shell runs, so special characters in the input would be interpreted as shell syntax. For workflow_dispatch this is low-risk (only committers can trigger), but the standard hygiene fix is to route through env: so it becomes a properly-quoted shell variable:

env:
  EXPECTED_CPU: ${{ inputs.expected-cpu }}
run: |
  ...
  if echo "$CPU_MODEL" | grep -qF "$EXPECTED_CPU"; then

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, thanks! Updated to route through env: so the input is a properly-quoted shell variable. Fixed in the next push.

@shrirangmhalgi

Copy link
Copy Markdown
Contributor

This is clean workflow improvement. The contains -> == fix for the wildcard case and the early CPU check are both well thought out.

Left one minor hygiene nit on the CPU check step.

Address review nit: use step-level env block to pass the
inputs.expected-cpu value as a properly-quoted shell variable
instead of template-expanding it directly into the script.

Assisted-by: OpenCode:claude-opus-4.6
@iemejia

iemejia commented Jun 15, 2026

Copy link
Copy Markdown
Member Author

Thanks for the review @shrirangmhalgi! Addressed the nit -- now routing expected-cpu through an env variable for proper shell quoting.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants