[DO-NOT-MERGE][PYTHON][PS] Support numpy 2.5 PEP 695 NDArray type hints in pandas-on-Spark#56789
Closed
HyukjinKwon wants to merge 2 commits into
…ts in pandas-on-Spark

### What changes were proposed in this pull request?

`as_spark_type` in `pyspark/pandas/typedef/typehints.py` only recognized the legacy form of `numpy.typing.NDArray`, where `NDArray[T]` is a `GenericAlias` with `__origin__ is np.ndarray` and `__args__ == (shape, np.dtype[T])`.

Starting with NumPy 2.5, `NDArray` is declared as a PEP 695 type alias (`type NDArray[ScalarT: np.generic] = np.ndarray[Any, np.dtype[ScalarT]]`). At runtime `NDArray[T]` is then a `GenericAlias` whose `__origin__` is the `NDArray` `TypeAliasType` (not `np.ndarray`) and whose `__args__` is the single scalar type `(T,)`. The old check no longer matched, so the annotation fell through to `issubclass(tpe.__origin__, list)`, raising `TypeError: issubclass() arg 1 must be a class`.

This adds a helper that detects the new PEP 695 `NDArray` form (origin is a `TypeAliasType` whose value expands to `np.ndarray`) and extracts the scalar element type, and also guards the `list` branch with `isinstance(_, type)` so a non-class `__origin__` can never crash `issubclass`. The legacy form is kept.

### Why are the changes needed?

On runners with NumPy 2.5 (e.g. `Build / MacOS-26`, Python 3.12), the pyspark-pandas tests `test_apply_batch_with_type` (and other `*_with_type` tests that use `NDArray[int]` hints) fail with `TypeError: Cannot interpret 'NDArray[int]' as a data type` / `TypeError: issubclass() arg 1 must be a class`. Evidence: apache/spark run 28064134730.

### Does this PR introduce any user-facing change?

No. It restores correct NDArray type-hint handling under NumPy 2.5.

### How was this patch tested?

Reproduced the failure and validated the fix against both the legacy NumPy 2.4 NDArray form and a simulated NumPy 2.5 PEP 695 `NDArray` (origin is a `TypeAliasType`): `as_spark_type(NDArray[int]) == ArrayType(LongType())` in both cases, with no regression to `List[T]` / scalar handling.

### Was this patch authored or co-authored using generative AI tooling?

Yes, Generated-by: Claude Code.

Co-authored-by: Isaac
…rigin__

The `infer_return_type` `SeriesType` branch used `issubclass(tpe.__origin__, SeriesType)` unguarded; under NumPy 2.5 a bare `NDArray` return hint has a non-class `__origin__` (a `TypeAliasType`), which would raise `issubclass() arg 1 must be a class`. Guard with `isinstance(tpe.__origin__, type)` so it falls through to `as_spark_type` instead.

Co-authored-by: Isaac
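The guard described in this commit can be illustrated with a minimal, dependency-free sketch. It is not the code from the PR: `SeriesType`, `AliasLike`, `Subscripted`, `origin_is_subclass`, and `unguarded_check_raises` are hypothetical stand-ins used only to show why an unguarded `issubclass` fails when `__origin__` is an object rather than a class.

```python
from typing import List


class SeriesType:
    """Hypothetical stand-in for the pandas-on-Spark SeriesType marker class."""


class AliasLike:
    """Hypothetical stand-in for a PEP 695 alias object: an instance, not a class."""


class Subscripted:
    """Hypothetical stand-in for a subscripted hint that exposes __origin__/__args__."""

    def __init__(self, origin, args):
        self.__origin__ = origin
        self.__args__ = args


def unguarded_check_raises(tpe, base) -> bool:
    """Return True if the unguarded issubclass() call raises TypeError."""
    try:
        issubclass(tpe.__origin__, base)
    except TypeError:
        return True
    return False


def origin_is_subclass(tpe, base) -> bool:
    """Guarded check: only call issubclass() when __origin__ is really a class."""
    origin = getattr(tpe, "__origin__", None)
    return isinstance(origin, type) and issubclass(origin, base)


# A hint whose origin is an alias object, as assumed for NDArray under NumPy 2.5.
alias_hint = Subscripted(AliasLike(), (int,))

print(unguarded_check_raises(alias_hint, SeriesType))  # unguarded call fails
print(origin_is_subclass(alias_hint, SeriesType))      # guarded call returns False
print(origin_is_subclass(List[int], list))             # ordinary class origins still match
```

With the guard in place, a non-class origin simply does not match the branch, so the hint can fall through to later handling instead of raising.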
HyukjinKwon added a commit to HyukjinKwon/spark that referenced this pull request Jun 25, 2026

…o pandas modules

Validation harness for apache#56789: flip the fork guard on build_python_3.12_macos26.yml and narrow python_hosted_runner_test matrix to pyspark-pandas + pyspark-pandas-connect to verify the numpy>=2.5 PEP 695 NDArray fix on macOS-26 (numpy 2.5.0). Not for the PR branch.

Co-authored-by: Isaac
Contributor
I think #56757 should work.
What changes were proposed in this pull request?
`as_spark_type` in `python/pyspark/pandas/typedef/typehints.py` only recognized the legacy form of `numpy.typing.NDArray`, where `NDArray[T]` is a `GenericAlias` with `__origin__ is np.ndarray` and `__args__ == (shape, np.dtype[T])`.

Starting with NumPy 2.5, `NDArray` is declared as a PEP 695 type alias (`type NDArray[ScalarT: np.generic] = np.ndarray[Any, np.dtype[ScalarT]]`). At runtime `NDArray[T]` is then a `GenericAlias` whose `__origin__` is the `NDArray` `TypeAliasType` (not `np.ndarray`) and whose `__args__` is the single scalar type `(T,)`. The old check no longer matched, so the annotation fell through to `issubclass(tpe.__origin__, list)`, which raised `TypeError: issubclass() arg 1 must be a class`.

This PR:
- adds `_extract_ndarray_scalar_type`, which detects the new PEP 695 `NDArray` form (origin is a `TypeAliasType` whose `__value__` expands to `np.ndarray`) and returns its scalar element type;
- uses it in `as_spark_type` to build the `ArrayType`, mirroring the legacy branch;
- guards the `list` branch with `isinstance(tpe.__origin__, type)` so a non-class `__origin__` can never crash `issubclass` again.

The legacy NumPy (<2.5) path is unchanged.
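The two runtime shapes described above can be sketched without NumPy installed. The block below is a simulation under stated assumptions, not the helper from this PR: `FakeNDArray`, `FakeDType`, `FakeTypeAlias`, `FakeSubscripted`, and `extract_ndarray_scalar_type` are hypothetical names, and the alias object is duck-typed (a non-class exposing `__value__`) rather than a real `TypeAliasType`.

```python
from typing import Any, Generic, List, Optional, TypeVar

ShapeT = TypeVar("ShapeT")
DTypeT = TypeVar("DTypeT")
ScalarT = TypeVar("ScalarT")


class FakeDType(Generic[ScalarT]):
    """Hypothetical stand-in for np.dtype."""


class FakeNDArray(Generic[ShapeT, DTypeT]):
    """Hypothetical stand-in for np.ndarray."""


class FakeTypeAlias:
    """Hypothetical stand-in for a PEP 695 alias: not a class, exposes __value__."""

    def __init__(self, name, value):
        self.__name__ = name
        self.__value__ = value


class FakeSubscripted:
    """Hypothetical stand-in for the object produced by subscripting the alias."""

    def __init__(self, origin, args):
        self.__origin__ = origin
        self.__args__ = args


def extract_ndarray_scalar_type(tpe, ndarray_cls=FakeNDArray) -> Optional[Any]:
    """Return the scalar element type for either NDArray form, else None."""
    origin = getattr(tpe, "__origin__", None)
    args = getattr(tpe, "__args__", ())

    # Legacy form: origin is the array class and args == (shape, dtype[T]).
    if origin is ndarray_cls and len(args) == 2:
        dtype_args = getattr(args[1], "__args__", ())
        return dtype_args[0] if dtype_args else None

    # Alias form: origin is a non-class alias whose __value__ expands to the
    # array class, and args is the single scalar type (T,).
    value = getattr(origin, "__value__", None)
    if (
        not isinstance(origin, type)
        and getattr(value, "__origin__", None) is ndarray_cls
        and len(args) == 1
    ):
        return args[0]

    return None


legacy_hint = FakeNDArray[Any, FakeDType[int]]
alias = FakeTypeAlias("NDArray", FakeNDArray[Any, FakeDType[ScalarT]])
alias_hint = FakeSubscripted(alias, (int,))

print(extract_ndarray_scalar_type(legacy_hint))  # scalar type from dtype[T]
print(extract_ndarray_scalar_type(alias_hint))   # scalar type from (T,)
print(extract_ndarray_scalar_type(List[int]))    # not an array hint
```

Both forms resolve to the same scalar type, which is what lets a caller build the same array type regardless of which NumPy version produced the hint.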
Why are the changes needed?
On runners with NumPy 2.5 (e.g. `Build / MacOS-26`, Python 3.12), pyspark-pandas tests that use `NDArray[...]` type hints (`test_apply_batch_with_type`, `test_apply_with_type`, `test_transform_batch_with_type`, and their `*-connect` parities) fail with `TypeError: Cannot interpret 'NDArray[int]' as a data type` / `TypeError: issubclass() arg 1 must be a class`.

Evidence: apache/spark run `28064134730` (`Build / ... MacOS-26`, Python 3.12), module `pyspark-pandas` / `pyspark-pandas-connect`.

Does this PR introduce any user-facing change?
No. It restores correct `NDArray` type-hint handling under NumPy 2.5.

How was this patch tested?
The existing `pyspark.pandas.tests.test_typedef.TypeHintTests.test_as_spark_type_pandas_on_spark_dtype` already exercises `as_spark_type(ntp.NDArray[...])` / `pandas_on_spark_type(ntp.NDArray[...])` across many element types; the fix makes it pass under NumPy 2.5 while keeping it green under NumPy 2.4.

Validated locally on Python 3.12 against:
- the real NumPy 2.4 `NDArray` (legacy form): no regression;
- a simulated NumPy 2.5 PEP 695 `NDArray` (origin is a `TypeAliasType`): previously reproduced the exact `issubclass() arg 1 must be a class` failure; now `as_spark_type(NDArray[int]) == ArrayType(LongType())`.

NumPy 2.5 is not yet on PyPI for the OSS CI image, so this cannot be reproduced on the SBT fork lane; it reproduces on the NumPy-2.5 `MacOS-26` scheduled lane.

Was this patch authored or co-authored using generative AI tooling?
Yes, Generated-by: Claude Code.
CI evidence
`Build / ... MacOS-26`, Python 3.12, NumPy 2.5.0; `pyspark-pandas` / `pyspark-pandas-connect` fail in `test_apply_batch_with_type` with `TypeError: Cannot interpret 'NDArray[int]' as a data type` / `issubclass() arg 1 must be a class`.

Note: the OSS CI image where this fails uses NumPy 2.5.0, which is not yet installable from PyPI for the fork's standard lanes, so the red MacOS-26 lane cannot be re-run green on the fork directly. The fix was validated locally on Python 3.12 against (a) the real NumPy 2.4 `NDArray` (legacy form, no regression) and (b) a faithful simulation of NumPy 2.5's PEP 695 `NDArray` (origin is a `TypeAliasType`), which reproduced the exact failure before the fix and produced `as_spark_type(NDArray[int]) == ArrayType(LongType())` after. The existing test `test_typedef.TypeHintTests.test_as_spark_type_pandas_on_spark_dtype` covers `NDArray[...]` and will turn green on the NumPy-2.5 lane.

CI evidence
`Build / Python-only (Python 3.12, MacOS26)` on apache/spark: https://github.com/apache/spark/actions/runs/28064134730 (`pyspark-pandas` + `pyspark-pandas-connect` fail: `test_apply_batch_with_type` → `TypeError: Cannot interpret 'NDArray[int]' as a data type`, numpy 2.5.0).
(`pyspark-pandas` ✅, `pyspark-pandas-connect` ✅).