[DO-NOT-MERGE][PYTHON][PS] Support numpy 2.5 PEP 695 NDArray type hints in pandas-on-Spark#56789

Closed
HyukjinKwon wants to merge 2 commits into apache:master from HyukjinKwon:ci-fix/agent6-pandas-ndarray-numpy25

HyukjinKwon commented Jun 25, 2026

Member

What changes were proposed in this pull request?

as_spark_type in python/pyspark/pandas/typedef/typehints.py only recognized the legacy form of numpy.typing.NDArray, where NDArray[T] is a GenericAlias with __origin__ is np.ndarray and __args__ == (shape, np.dtype[T]).

Starting with NumPy 2.5, NDArray is declared as a PEP 695 type alias (type NDArray[ScalarT: np.generic] = np.ndarray[Any, np.dtype[ScalarT]]). At runtime NDArray[T] is then a GenericAlias whose __origin__ is the NDArray TypeAliasType (not np.ndarray) and whose __args__ is the single scalar type (T,). The old check no longer matched, so the annotation fell through to issubclass(tpe.__origin__, list), which raised TypeError: issubclass() arg 1 must be a class.
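
To illustrate the failure mode described above, here is a minimal sketch that needs neither NumPy nor Python 3.12. `AliasStandIn`, `HintStandIn`, and `is_list_hint` are hypothetical names introduced for illustration (they are not NumPy or PySpark APIs); the only property they share with the real objects is that `__origin__` is an instance rather than a class.

```python
import typing


class AliasStandIn:
    """Hypothetical stand-in for the NDArray alias object: an instance, not a class."""


class HintStandIn:
    """Hypothetical stand-in for NDArray[int] as described for NumPy 2.5."""

    def __init__(self):
        self.__origin__ = AliasStandIn()
        self.__args__ = (int,)


def is_list_hint(tpe) -> bool:
    # Same shape as the unguarded check quoted above.
    return issubclass(tpe.__origin__, list)


# A hint whose __origin__ is a real class is handled without error.
list_result = is_list_hint(typing.List[int])

# A hint whose __origin__ is not a class makes issubclass raise TypeError.
try:
    is_list_hint(HintStandIn())
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(list_result, error_message)
```

The second call is the case the old check never anticipated: nothing in the legacy form produced a non-class `__origin__`.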

This PR:

  • adds _extract_ndarray_scalar_type, which detects the new PEP 695 NDArray form (origin is a TypeAliasType whose __value__ expands to np.ndarray) and returns its scalar element type;
  • routes that scalar through as_spark_type to build the ArrayType, mirroring the legacy branch;
  • guards the existing list branch with isinstance(tpe.__origin__, type) so a non-class __origin__ can never crash issubclass again.

The legacy NumPy (<2.5) path is unchanged.
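
The helper and the guarded `list` branch listed above might look roughly like the sketch below. This is not the code from the PR: `extract_scalar_type`, `classify`, and the three stand-in classes are hypothetical, and the alias is detected by duck typing on `__value__` (an assumption made so the sketch runs without NumPy or Python 3.12) rather than by checking for a real `TypeAliasType`.

```python
import typing
from typing import Any, Generic, Optional, TypeVar

T = TypeVar("T")


class ArrayStandIn(Generic[T]):
    """Hypothetical stand-in for np.ndarray so the sketch runs without NumPy."""


class AliasStandIn:
    """Hypothetical stand-in for a PEP 695 alias: exposes __value__, is not a class."""

    def __init__(self, value: Any) -> None:
        self.__value__ = value


class HintStandIn:
    """Hypothetical stand-in for a subscripted alias such as NDArray[int]."""

    def __init__(self, origin: Any, args: tuple) -> None:
        self.__origin__ = origin
        self.__args__ = args


def extract_scalar_type(tpe: Any, array_cls: type) -> Optional[Any]:
    """Return the scalar type if tpe looks like the new alias form, else None."""
    origin = getattr(tpe, "__origin__", None)
    if origin is None or isinstance(origin, type):
        return None  # plain types and class-origin hints are not handled here
    value = getattr(origin, "__value__", None)
    if getattr(value, "__origin__", None) is not array_cls:
        return None  # some other alias that does not expand to the array class
    args = getattr(tpe, "__args__", ())
    return args[0] if len(args) == 1 else None


def classify(tpe: Any) -> str:
    scalar = extract_scalar_type(tpe, ArrayStandIn)
    if scalar is not None:
        return f"array of {scalar.__name__}"
    origin = getattr(tpe, "__origin__", None)
    # Guarded list branch: issubclass only runs when origin is a class.
    if isinstance(origin, type) and issubclass(origin, list):
        return "list"
    return "other"


ndarray_hint = HintStandIn(AliasStandIn(ArrayStandIn[T]), (int,))
unrelated_alias_hint = HintStandIn(AliasStandIn(int), (int,))
results = [
    classify(ndarray_hint),
    classify(typing.List[int]),
    classify(unrelated_alias_hint),
    classify(int),
]
print(results)
```

Checking that the alias value expands to the array class keeps unrelated aliases out of the array branch, and the `isinstance` guard means they reach the final fallback instead of raising.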

Why are the changes needed?

On runners with NumPy 2.5 (e.g. Build / MacOS-26, Python 3.12), pyspark-pandas tests that use NDArray[...] type hints — test_apply_batch_with_type, test_apply_with_type, test_transform_batch_with_type, and their *-connect parities — fail with:

TypeError: Cannot interpret 'NDArray[int]' as a data type
TypeError: issubclass() arg 1 must be a class

Evidence: apache/spark run 28064134730 (Build / ... MacOS-26, Python 3.12), module pyspark-pandas / pyspark-pandas-connect.

Does this PR introduce any user-facing change?

No. It restores correct NDArray type-hint handling under NumPy 2.5.

How was this patch tested?

The existing pyspark.pandas.tests.test_typedef.TypeHintTests.test_as_spark_type_pandas_on_spark_dtype already exercises as_spark_type(ntp.NDArray[...]) / pandas_on_spark_type(ntp.NDArray[...]) across many element types; the fix makes it pass under NumPy 2.5 while keeping it green under NumPy 2.4.

Validated locally on Python 3.12 against:

  • the real NumPy 2.4 NDArray (legacy form) — no regression;
  • a faithful simulation of NumPy 2.5's PEP 695 NDArray (origin is a TypeAliasType) — previously reproduced the exact issubclass() arg 1 must be a class failure; now as_spark_type(NDArray[int]) == ArrayType(LongType()).
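
One way such a simulation could be built, assuming Python 3.12+ where `typing.TypeAliasType` exists, is sketched below. It is not the harness used for this PR: `ArrayStandIn` and `simulate_pep695_ndarray` are hypothetical names, and a stand-in class replaces `np.ndarray` so the sketch does not depend on NumPy. On older interpreters the function returns `None` and nothing is inspected.

```python
import typing
from typing import Generic, TypeVar

ScalarT = TypeVar("ScalarT")


class ArrayStandIn(Generic[ScalarT]):
    """Hypothetical stand-in for np.ndarray."""


def simulate_pep695_ndarray():
    """Return a simulated generic alias object, or None on Python < 3.12."""
    alias_type = getattr(typing, "TypeAliasType", None)
    if alias_type is None:
        return None
    # Equivalent in spirit to: type NDArray[ScalarT] = ArrayStandIn[ScalarT]
    return alias_type("NDArray", ArrayStandIn[ScalarT], type_params=(ScalarT,))


NDArraySim = simulate_pep695_ndarray()
if NDArraySim is not None:
    hint = NDArraySim[int]
    observed = {
        "origin_is_alias": hint.__origin__ is NDArraySim,
        "origin_is_class": isinstance(hint.__origin__, type),
        "args": hint.__args__,
        "value_origin_is_array": NDArraySim.__value__.__origin__ is ArrayStandIn,
    }
else:
    observed = None

print(observed)
```

The inspected properties are the ones the description relies on: the origin is the alias itself, the origin is not a class, and the only argument is the scalar type.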

NumPy 2.5 is not yet on PyPI for the OSS CI image, so this cannot be reproduced on the SBT fork lane; it reproduces on the NumPy-2.5 MacOS-26 scheduled lane.

Was this patch authored or co-authored using generative AI tooling?

Yes, Generated-by: Claude Code.

CI evidence

Note: the OSS CI image where this fails uses NumPy 2.5.0, which is not yet installable from PyPI for the fork's standard lanes, so the red MacOS-26 lane cannot be re-run green on the fork directly. The fix was validated locally on Python 3.12 against (a) the real NumPy 2.4 NDArray (legacy form, no regression) and (b) a faithful simulation of NumPy 2.5's PEP 695 NDArray (origin is a TypeAliasType), which reproduced the exact failure before the fix and produced as_spark_type(NDArray[int]) == ArrayType(LongType()) after. The existing test test_typedef.TypeHintTests.test_as_spark_type_pandas_on_spark_dtype covers NDArray[...] and will turn green on the NumPy-2.5 lane.

…ts in pandas-on-Spark

### What changes were proposed in this pull request?

`as_spark_type` in `pyspark/pandas/typedef/typehints.py` only recognized the
legacy form of `numpy.typing.NDArray`, where `NDArray[T]` is a `GenericAlias`
with `__origin__ is np.ndarray` and `__args__ == (shape, np.dtype[T])`.

Starting with NumPy 2.5, `NDArray` is declared as a PEP 695 type alias
(`type NDArray[ScalarT: np.generic] = np.ndarray[Any, np.dtype[ScalarT]]`).
At runtime `NDArray[T]` is then a `GenericAlias` whose `__origin__` is the
`NDArray` `TypeAliasType` (not `np.ndarray`) and whose `__args__` is the single
scalar type `(T,)`. The old check no longer matched, so the annotation fell
through to `issubclass(tpe.__origin__, list)`, raising
`TypeError: issubclass() arg 1 must be a class`.

This adds a helper that detects the new PEP 695 `NDArray` form (origin is a
`TypeAliasType` whose value expands to `np.ndarray`) and extracts the scalar
element type, and also guards the `list` branch with `isinstance(_, type)` so a
non-class `__origin__` can never crash `issubclass`. The legacy form is kept.

### Why are the changes needed?

On runners with NumPy 2.5 (e.g. `Build / MacOS-26`, Python 3.12), the
pyspark-pandas tests `test_apply_batch_with_type` (and other `*_with_type`
tests that use `NDArray[int]` hints) fail with
`TypeError: Cannot interpret 'NDArray[int]' as a data type` /
`TypeError: issubclass() arg 1 must be a class`. Evidence: apache/spark run
28064134730.

### Does this PR introduce any user-facing change?

No. It restores correct NDArray type-hint handling under NumPy 2.5.

### How was this patch tested?

Reproduced the failure and validated the fix against both the legacy NumPy 2.4
NDArray form and a simulated NumPy 2.5 PEP 695 `NDArray` (origin is a
`TypeAliasType`): `as_spark_type(NDArray[int]) == ArrayType(LongType())` in both
cases, with no regression to `List[T]` / scalar handling.

### Was this patch authored or co-authored using generative AI tooling?

Yes, Generated-by: Claude Code.

Co-authored-by: Isaac
…rigin__

The infer_return_type SeriesType branch used issubclass(tpe.__origin__, SeriesType)
unguarded; under NumPy 2.5 a bare NDArray return hint has a non-class __origin__
(a TypeAliasType), which would raise 'issubclass() arg 1 must be a class'. Guard
with isinstance(tpe.__origin__, type) so it falls through to as_spark_type instead.

Co-authored-by: Isaac
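
The fall-through behaviour described in this commit message can be sketched as follows. `SeriesTypeStandIn`, `AliasStandIn`, `HintStandIn`, and `return_kind` are hypothetical names for illustration only; the real `infer_return_type` does considerably more than this.

```python
from typing import Generic, TypeVar

T = TypeVar("T")


class SeriesTypeStandIn(Generic[T]):
    """Hypothetical stand-in for the SeriesType class named in the commit message."""


class AliasStandIn:
    """Hypothetical stand-in for a TypeAliasType instance (not a class)."""


class HintStandIn:
    """Hypothetical stand-in for an NDArray return hint under NumPy 2.5."""

    def __init__(self, origin, args):
        self.__origin__ = origin
        self.__args__ = args


def return_kind(tpe) -> str:
    origin = getattr(tpe, "__origin__", None)
    # Guard first, so a non-class origin skips the Series branch without raising.
    if isinstance(origin, type) and issubclass(origin, SeriesTypeStandIn):
        return "series"
    return "fall through to as_spark_type"


series_kind = return_kind(SeriesTypeStandIn[int])
ndarray_kind = return_kind(HintStandIn(AliasStandIn(), (int,)))
print(series_kind, "|", ndarray_kind)
```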
HyukjinKwon added a commit to HyukjinKwon/spark that referenced this pull request Jun 25, 2026
…o pandas modules

Validation harness for apache#56789: flip the fork guard on build_python_3.12_macos26.yml
and narrow python_hosted_runner_test matrix to pyspark-pandas + pyspark-pandas-connect
to verify the numpy>=2.5 PEP 695 NDArray fix on macOS-26 (numpy 2.5.0). Not for the PR branch.

Co-authored-by: Isaac
@gaogaotiantian

Contributor

I think #56757 should work

2 participants