Skip to content

CI: "Check job status" aggregator passes when build jobs fail (skipped-as-success regression) #2208

@leofang

Description

@leofang

The required status check Check job status reports success even when one or more Build * jobs fail. The merge gate on main is effectively bypassable today.

Symptom

Recent CI runs on main where build jobs failed but Check job status succeeded:

Both runs are on real PRs (not [no-ci] / [doc-only]). The repository's Prevent committing without PR ruleset on main lists only Check job status + pre-commit.ci - pr as required status checks. Nothing else gates the merge.

Root cause

Check job status is defined here (current main, c12838169e):

checks:
name: Check job status
if: always()
runs-on: ubuntu-latest
needs:
- should-skip
- detect-changes
- test-sdist-linux
- test-sdist-windows
- test-linux-64
- test-linux-aarch64
- test-windows
- doc
steps:
- name: Exit
run: |
# if any dependencies were cancelled or failed, that's a failure
#
# see https://docs.github.com/en/actions/reference/workflows-and-actions/expressions#always
# and https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/collaborating-on-repositories-with-code-quality-features/troubleshooting-required-status-checks#handling-skipped-but-required-checks
# for why this cannot be encoded in the job-level `if:` field
#
# TL; DR: `$REASONS`
#
# The intersection of skipped-as-success and required status checks
# creates a scenario where if you DON'T `always()` run this job, the
# status check UI will block merging and if you DO `always()` run and
# a dependency is _cancelled_ (due to a critical failure, which is
# somehow not considered a failure ¯\_(ツ)_/¯) then the critically
# failing job(s) will timeout causing a cancellation here and the
# build to succeed which we don't want (originally this was just
# 'exit 0')
#
# Note: When [doc-only] is in PR title, test jobs are intentionally
# skipped and should not cause failure.
#
# detect-changes gates whether heavy test matrices run at all; if it
# does not succeed, downstream test jobs are skipped rather than
# failed, which would otherwise go unnoticed here. Require its
# success explicitly so a broken gating step cannot masquerade as a
# green CI run.
doc_only=${{ needs.should-skip.outputs.doc-only }}
if ${{ needs.detect-changes.result != 'success' }}; then
exit 1
fi
if ${{ needs.doc.result == 'cancelled' || needs.doc.result == 'failure' }}; then
exit 1
fi
if ${{ needs.test-sdist-linux.result == 'cancelled' ||
needs.test-sdist-linux.result == 'failure' ||
needs.test-sdist-windows.result == 'cancelled' ||
needs.test-sdist-windows.result == 'failure' }}; then
exit 1
fi
if [[ "${doc_only}" != "true" ]]; then
if ${{ needs.test-linux-64.result == 'cancelled' ||
needs.test-linux-64.result == 'failure' ||
needs.test-linux-aarch64.result == 'cancelled' ||
needs.test-linux-aarch64.result == 'failure' ||
needs.test-windows.result == 'cancelled' ||
needs.test-windows.result == 'failure' }}; then
exit 1
fi
fi
exit 0

The Exit step body only catches 'cancelled' or 'failure' results on its test-*, doc, test-sdist-* dependencies — see the inline if chain:

doc_only=${{ needs.should-skip.outputs.doc-only }}
if ${{ needs.detect-changes.result != 'success' }}; then
exit 1
fi
if ${{ needs.doc.result == 'cancelled' || needs.doc.result == 'failure' }}; then
exit 1
fi
if ${{ needs.test-sdist-linux.result == 'cancelled' ||
needs.test-sdist-linux.result == 'failure' ||
needs.test-sdist-windows.result == 'cancelled' ||
needs.test-sdist-windows.result == 'failure' }}; then
exit 1
fi
if [[ "${doc_only}" != "true" ]]; then
if ${{ needs.test-linux-64.result == 'cancelled' ||
needs.test-linux-64.result == 'failure' ||
needs.test-linux-aarch64.result == 'cancelled' ||
needs.test-linux-aarch64.result == 'failure' ||
needs.test-windows.result == 'cancelled' ||
needs.test-windows.result == 'failure' }}; then
exit 1
fi
fi
exit 0

In particular, the test-job clause (lines 474–479) reads:

if ${{ needs.test-linux-64.result == 'cancelled' ||
needs.test-linux-64.result == 'failure' ||
needs.test-linux-aarch64.result == 'cancelled' ||
needs.test-linux-aarch64.result == 'failure' ||
needs.test-windows.result == 'cancelled' ||
needs.test-windows.result == 'failure' }}; then

When build-linux-64 (or build-windows) fails, GitHub Actions sets the dependent test-linux-64.result to 'skipped' (default needs-failure propagation). 'skipped' is not in the catch list, so the aggregator exits 0.

GitHub's "skipped checks count as success for required status checks" behavior — explicitly called out in CCCL's ci: gate — makes this a silent merge gate bypass:
https://github.com/NVIDIA/cccl/blob/8c0e6cb1b6412d7bd0070f9ac3f55fa80231961a/.github/workflows/ci-workflow-pull-request.yml#L463-L526

Regression history

PR #1008 (merged 2025-10-01) replaced poseidon/wait-for-status-checks with a needs:-only aggregator and an exit 0 step. The previous design was here, at the pre-merge base commit:

steps:
- name: GitHub Checks
uses: poseidon/wait-for-status-checks@899c768d191b56eef585c18f8558da19e1f3e707 # v0.6.0
with:
token: ${{ secrets.GITHUB_TOKEN }}
match_pattern: "CI*"
ignore_pattern: ".*Check job status.*" # ignore self

The original action would have failed on any non-success check (including skipped); the replacement does not. Subsequent additions of if: always() + the cancelled || failure chain partially restored intent but did not cover the skipped case.

Proposed fix

Adopt CCCL's check_result pattern: explicit expected="success" per dependency, with expected="skipped" for legitimate [doc-only] skips. Reference implementation (CCCL):
https://github.com/NVIDIA/cccl/blob/8c0e6cb1b6412d7bd0070f9ac3f55fa80231961a/.github/workflows/ci-workflow-pull-request.yml#L463-L526

Sketch adapted to cuda-python:

- name: Check results
  run: |
    if [[ "${{ needs.should-skip.outputs.skip }}" == "true" ]]; then
      echo "[no-ci] - skipping aggregator checks"
      exit 0
    fi
    doc_only="${{ needs.should-skip.outputs.doc-only }}"
    status="success"
    check_result() {
      name=$1; expected=$2; result=$3
      if [[ "$result" \!= "$expected" ]]; then
        echo "::error::$name did not match expected result"
        status="failed"
      fi
    }
    check_result "detect-changes" "success" "${{ needs.detect-changes.result }}"
    check_result "doc"            "success" "${{ needs.doc.result }}"
    if [[ "$doc_only" == "true" ]]; then expected="skipped"; else expected="success"; fi
    check_result "test-sdist-linux"   "$expected" "${{ needs.test-sdist-linux.result }}"
    # ...etc for every test-*
    [[ "$status" == "success" ]]

Demonstration

A draft PR will be opened against main that intentionally fails the Build stage. It should reproduce: Build jobs go red, Check job status goes green.

Related

NVIDIA/numba-cuda-mlir shares the same aggregator design and was bitten by this in NVIDIA/numba-cuda-mlir#125 (auto-merged with failing Windows builds; reverted in NVIDIA/numba-cuda-mlir#134). Fix should be ported there once shape is agreed.

Metadata

Metadata

Assignees

Labels

CI/CDCI/CD infrastructureP0High priority - Must do!bugSomething isn't working

Type

No type
No fields configured for issues without a type.

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions