[DO-NOT-MERGE][PYTHON][PS] Support numpy 2.5 PEP 695 NDArray type hints in pandas-on-Spark by HyukjinKwon · Pull Request #56789 · apache/spark

HyukjinKwon · 2026-06-25T22:36:42Z

What changes were proposed in this pull request?

as_spark_type in python/pyspark/pandas/typedef/typehints.py only recognized the legacy form of numpy.typing.NDArray, where NDArray[T] is a GenericAlias whose __origin__ is np.ndarray and whose __args__ is (shape, np.dtype[T]).

Starting with NumPy 2.5, NDArray is declared as a PEP 695 type alias (type NDArray[ScalarT: np.generic] = np.ndarray[Any, np.dtype[ScalarT]]). At runtime NDArray[T] is then a GenericAlias whose __origin__ is the NDArray TypeAliasType (not np.ndarray) and whose __args__ is the single scalar type (T,). The old check no longer matched, so the annotation fell through to issubclass(tpe.__origin__, list), which raised TypeError: issubclass() arg 1 must be a class.
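The runtime shape described above can be reproduced without NumPy. The following is a minimal sketch, assuming Python 3.12+ (where typing.TypeAliasType exists). Alias and issubclass_error are hypothetical names used only for illustration, and the alias wraps list[T] as a stand-in for np.ndarray[Any, np.dtype[ScalarT]].

```python
import typing
from typing import Optional, TypeVar

# typing.TypeAliasType exists on Python 3.12+; None on older interpreters.
TypeAliasType = getattr(typing, "TypeAliasType", None)

T = TypeVar("T")


def issubclass_error(origin: object) -> Optional[str]:
    """Return the TypeError message issubclass(origin, list) raises, or None."""
    try:
        issubclass(origin, list)  # mirrors the unguarded check described above
    except TypeError as exc:
        return str(exc)
    return None


if TypeAliasType is not None:
    # Hypothetical stand-in for NumPy 2.5's NDArray alias (wraps list[T] here).
    Alias = TypeAliasType("Alias", list[T], type_params=(T,))
    hint = Alias[int]

    print(type(hint.__origin__).__name__)     # the origin is the alias object, not a class
    print(hint.__args__)                      # the single scalar argument
    print(issubclass_error(hint.__origin__))  # message of the TypeError, if any
```

On interpreters older than 3.12 the demonstration block is skipped, since the alias object cannot be constructed there.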

This PR:

adds _extract_ndarray_scalar_type, which detects the new PEP 695 NDArray form (origin is a TypeAliasType whose __value__ expands to np.ndarray) and returns its scalar element type;
routes that scalar through as_spark_type to build the ArrayType, mirroring the legacy branch;
guards the existing list branch with isinstance(tpe.__origin__, type) so a non-class __origin__ can never crash issubclass again.
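The body of _extract_ndarray_scalar_type is not shown on this page. As a rough sketch of the detection step described above, assuming Python 3.12+ and using stand-ins instead of NumPy: extract_scalar_type, FakeArray, FakeDType, and NDArrayLike are all hypothetical names, not the PR's code.

```python
import typing
from typing import Any, Generic, Optional, TypeVar

# typing.TypeAliasType exists on Python 3.12+; None on older interpreters.
TypeAliasType = getattr(typing, "TypeAliasType", None)

S = TypeVar("S")
D = TypeVar("D")
T = TypeVar("T")


class FakeArray(Generic[S, D]):
    """Hypothetical stand-in for np.ndarray (shape and dtype parameters)."""


class FakeDType(Generic[T]):
    """Hypothetical stand-in for np.dtype."""


def extract_scalar_type(tpe: Any, array_cls: type) -> Optional[Any]:
    """Return T when tpe is Alias[T] and Alias is a PEP 695 alias over array_cls."""
    origin = getattr(tpe, "__origin__", None)
    if TypeAliasType is None or not isinstance(origin, TypeAliasType):
        return None
    # The alias value is e.g. FakeArray[Any, FakeDType[T]]; check what it expands to.
    if getattr(origin.__value__, "__origin__", None) is not array_cls:
        return None
    args = getattr(tpe, "__args__", ())
    return args[0] if len(args) == 1 else None


if TypeAliasType is not None:
    NDArrayLike = TypeAliasType(
        "NDArrayLike", FakeArray[Any, FakeDType[T]], type_params=(T,)
    )
    print(extract_scalar_type(NDArrayLike[int], FakeArray))  # the scalar type
    print(extract_scalar_type(list[int], FakeArray))         # not an alias: None
```

Per the description, the extracted scalar is then routed back through as_spark_type to build the ArrayType; that step is omitted here because it needs PySpark.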

The legacy NumPy (<2.5) path is unchanged.

Why are the changes needed?

On runners with NumPy 2.5 (e.g. Build / MacOS-26, Python 3.12), pyspark-pandas tests that use NDArray[...] type hints — test_apply_batch_with_type, test_apply_with_type, test_transform_batch_with_type, and their *-connect parities — fail with:

TypeError: Cannot interpret 'NDArray[int]' as a data type
TypeError: issubclass() arg 1 must be a class

Evidence: apache/spark run 28064134730 (Build / ... MacOS-26, Python 3.12), module pyspark-pandas / pyspark-pandas-connect.

Does this PR introduce any user-facing change?

No. It restores correct NDArray type-hint handling under NumPy 2.5.

How was this patch tested?

The existing pyspark.pandas.tests.test_typedef.TypeHintTests.test_as_spark_type_pandas_on_spark_dtype already exercises as_spark_type(ntp.NDArray[...]) / pandas_on_spark_type(ntp.NDArray[...]) across many element types; the fix makes it pass under NumPy 2.5 while keeping it green under NumPy 2.4.

Validated locally on Python 3.12 against:

the real NumPy 2.4 NDArray (legacy form) — no regression;
a faithful simulation of NumPy 2.5's PEP 695 NDArray (origin is a TypeAliasType) — previously reproduced the exact issubclass() arg 1 must be a class failure; now as_spark_type(NDArray[int]) == ArrayType(LongType()).

NumPy 2.5 is not yet on PyPI for the OSS CI image, so this cannot be reproduced on the SBT fork lane; it reproduces on the NumPy-2.5 MacOS-26 scheduled lane.

Was this patch authored or co-authored using generative AI tooling?

Yes, Generated-by: Claude Code.

CI evidence

Before (red): https://github.com/apache/spark/actions/runs/28064134730 — Build / ... MacOS-26, Python 3.12, NumPy 2.5.0; pyspark-pandas / pyspark-pandas-connect fail in test_apply_batch_with_type with TypeError: Cannot interpret 'NDArray[int]' as a data type / issubclass() arg 1 must be a class.
After (fork SBT Build, no-regression): https://github.com/HyukjinKwon/spark/actions/runs/28204821025 — confirms the change does not regress the legacy NumPy (<2.5) path.

Note: the OSS CI image where this fails uses NumPy 2.5.0, which is not yet installable from PyPI for the fork's standard lanes, so the red MacOS-26 lane cannot be re-run green on the fork directly. The fix was validated locally on Python 3.12 against (a) the real NumPy 2.4 NDArray (legacy form, no regression) and (b) a faithful simulation of NumPy 2.5's PEP 695 NDArray (origin is a TypeAliasType), which reproduced the exact failure before the fix and produced as_spark_type(NDArray[int]) == ArrayType(LongType()) after. The existing test test_typedef.TypeHintTests.test_as_spark_type_pandas_on_spark_dtype covers NDArray[...] and will turn green on the NumPy-2.5 lane.

CI evidence

Before (red): Build / Python-only (Python 3.12, MacOS26) on apache/spark — https://github.com/apache/spark/actions/runs/28064134730 (pyspark-pandas + pyspark-pandas-connect fail: test_apply_batch_with_type → TypeError: Cannot interpret 'NDArray[int]' as a data type, numpy 2.5.0).
After (green): same job re-run on the fork with this fix (macOS-26, numpy 2.5.0; matrix narrowed to the two affected modules on a validate-only branch) — https://github.com/HyukjinKwon/spark/actions/runs/28205257318 (pyspark-pandas ✅, pyspark-pandas-connect ✅).

…ts in pandas-on-Spark

### What changes were proposed in this pull request?

`as_spark_type` in `pyspark/pandas/typedef/typehints.py` only recognized the legacy form of `numpy.typing.NDArray`, where `NDArray[T]` is a `GenericAlias` with `__origin__ is np.ndarray` and `__args__ == (shape, np.dtype[T])`.

Starting with NumPy 2.5, `NDArray` is declared as a PEP 695 type alias (`type NDArray[ScalarT: np.generic] = np.ndarray[Any, np.dtype[ScalarT]]`). At runtime `NDArray[T]` is then a `GenericAlias` whose `__origin__` is the `NDArray` `TypeAliasType` (not `np.ndarray`) and whose `__args__` is the single scalar type `(T,)`. The old check no longer matched, so the annotation fell through to `issubclass(tpe.__origin__, list)`, raising `TypeError: issubclass() arg 1 must be a class`.

This adds a helper that detects the new PEP 695 `NDArray` form (origin is a `TypeAliasType` whose value expands to `np.ndarray`) and extracts the scalar element type, and also guards the `list` branch with `isinstance(_, type)` so a non-class `__origin__` can never crash `issubclass`. The legacy form is kept.

### Why are the changes needed?

On runners with NumPy 2.5 (e.g. `Build / MacOS-26`, Python 3.12), the pyspark-pandas tests `test_apply_batch_with_type` (and other `*_with_type` tests that use `NDArray[int]` hints) fail with `TypeError: Cannot interpret 'NDArray[int]' as a data type` / `TypeError: issubclass() arg 1 must be a class`. Evidence: apache/spark run 28064134730.

### Does this PR introduce any user-facing change?

No. It restores correct NDArray type-hint handling under NumPy 2.5.

### How was this patch tested?

Reproduced the failure and validated the fix against both the legacy NumPy 2.4 NDArray form and a simulated NumPy 2.5 PEP 695 `NDArray` (origin is a `TypeAliasType`): `as_spark_type(NDArray[int]) == ArrayType(LongType())` in both cases, with no regression to `List[T]` / scalar handling.

### Was this patch authored or co-authored using generative AI tooling?

Yes, Generated-by: Claude Code.

Co-authored-by: Isaac

…rigin__

The infer_return_type SeriesType branch used issubclass(tpe.__origin__, SeriesType) unguarded; under NumPy 2.5 a bare NDArray return hint has a non-class __origin__ (a TypeAliasType), which would raise 'issubclass() arg 1 must be a class'. Guard with isinstance(tpe.__origin__, type) so it falls through to as_spark_type instead.

Co-authored-by: Isaac
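The guard pattern named in this commit message can be illustrated in isolation. This is a minimal sketch rather than the PR's code: origin_is_subclass is a hypothetical helper, and typing.Literal is used only as a convenient annotation whose __origin__ is not a class.

```python
import typing
from typing import Any


def origin_is_subclass(tpe: Any, cls: type) -> bool:
    """True only when tpe.__origin__ is a real class that subclasses cls."""
    origin = getattr(tpe, "__origin__", None)
    # isinstance(origin, type) short-circuits before issubclass can raise TypeError.
    return isinstance(origin, type) and issubclass(origin, cls)


print(origin_is_subclass(typing.List[int], list))     # True
print(origin_is_subclass(typing.Literal["a"], list))  # False, and no TypeError
```

With the guard in place, a non-class origin simply fails the branch condition, so the caller can fall through to its next branch instead of crashing.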

…o pandas modules

Validation harness for apache#56789: flip the fork guard on build_python_3.12_macos26.yml and narrow python_hosted_runner_test matrix to pyspark-pandas + pyspark-pandas-connect to verify the numpy>=2.5 PEP 695 NDArray fix on macOS-26 (numpy 2.5.0). Not for the PR branch.

Co-authored-by: Isaac

gaogaotiantian · 2026-06-25T23:32:33Z

I think #56757 should work

HyukjinKwon mentioned this pull request Jun 25, 2026

[DO-NOT-MERGE][PS][PYTHON] Make as_spark_type handle numpy>=2.5 NDArray annotations #56788

Closed

HyukjinKwon closed this Jun 26, 2026

HyukjinKwon wants to merge 2 commits into apache:master from HyukjinKwon:ci-fix/agent6-pandas-ndarray-numpy25
