fix: Add fallback for incompatible string concenation in pandas#3548
fix: Add fallback for incompatible string concenation in pandas#3548MarcoGorelli wants to merge 6 commits intonarwhals-dev:mainfrom
Conversation
|
Note Edited to increase the range Have you thought about handling the dtypes somewhere around here instead? narwhals/narwhals/_pandas_like/namespace.py Lines 320 to 328 in 2eaaa03 E.g. you could do 2 passes where:
This is reminding me a bit of #3398 (comment) 😄 |
|
not sure about doing it for all ops tbh, but for string concatenation at least it seems to be necessary so i'd suggest starting with that, we can make it more generally available later if necessary |
Sorry if I wasn't clear, I was talking about |
|
we'd then need to do the same casts for |
I was a bit confused by this, I had no idea import polars as pl
>>> pl.Series(["a", "b", "c"]) + pl.Series(["d", "e", "f"])
shape: (3,)
Series: '' [str]
[
"ad"
"be"
"cf"
] |
|
I've aligned this with the upstream pandas behaviour in pandas-dev/pandas#65220 (comment) any objections? |
FBruzzesi
left a comment
There was a problem hiding this comment.
Thanks @MarcoGorelli - I am mostly trusting the test suite on this!
Also added a tiny suggestion
FBruzzesi
left a comment
There was a problem hiding this comment.
Thanks @MarcoGorelli ! No further questions from my side
| return self.native.iloc[-1] if len(self.native) else None | ||
|
|
||
| def _with_binary(self, op: Callable[..., PandasLikeSeries], other: Any) -> Self: | ||
| def _with_binary(self, op: Callable[..., pd.Series], other: Any) -> Self: |
camriddell
left a comment
There was a problem hiding this comment.
Overall, this is a good compatibility bridge. With the current implementation I think we should:
- Discuss the costs of re-trying all failed
__add__operations. - Remove the written out test cases for a combinatoric solution and property derived dtype detection.
| pd.ArrowDtype(pa.large_string()), | ||
| pd.ArrowDtype(pa.large_string()), | ||
| pd.ArrowDtype(pa.large_string()), | ||
| ), |
There was a problem hiding this comment.
The cases are serve effectively as a handwritten truth table.
Can this be refactored to
- parameterize from
itertools.product - embed the logic for deriving the appropriate datatype into the test?
left_dtypeis apandas.StringDtype->left_dtypeleft_dtypeis apandas.ArrowDtypeandright_dtypeispandas.StringDtype->pd.ArrowDtype(pa.large_string()- both left/right are
ArrowDtype:pa.stringif all inputs arepa.stringotherwisepa.large_string
This reduces the length of the parameters spec and codifies the expected rules for returning a datatype.
DTYPES = [
pd.StringDtype("pyarrow", na_value=np.nan), # type: ignore[call-arg]
pd.StringDtype("pyarrow", na_value=pd.NA), # type: ignore[call-arg]
pd.ArrowDtype(pa.string()),
pd.ArrowDtype(pa.large_string()),
]
@pytest.mark.parametrize(
("left_dtype", "right_dtype"),
[*product(DTYPES, repeat=2)],
)
def test_pandas_str_types(...):
...
assert_equal_data(res, expected)
result_dtype = res.to_native()["concat_col"].dtype
if isinstance(left_dtype, pd.StringDtype):
expected_dtype = left_dtype
elif isinstance(left_dtype, pd.ArrowDtype) and isinstance(right_dtype, pd.StringDtype):
expected_dtype = pd.ArrowDtype(pa.large_string())
elif isinstance(left_dtype, pd.ArrowDtype) and isinstance(right_dtype, pd.ArrowDtype):
left_type = left_dtype.pyarrow_dtype
right_type = right_dtype.pyarrow_dtype
if pa.types.is_large_string(left_type) or pa.types.is_large_string(right_type):
expected_dtype = pd.ArrowDtype(pa.large_string())
else:
expected_dtype = pd.ArrowDtype(pa.string())
assert result_dtype == expected_dtype| ).alias(self.name) | ||
| try: | ||
| res = op(ser, other_native) | ||
| except Exception: |
There was a problem hiding this comment.
- Can this be more specific (e.g.
TypeErrorrather thanException)? Observing the test results without this snippet, I only observeTypeErrorsbut this could be due to a backport from pandas. - For cases when addition will fail no matter what (e.g. adding integers and strings) retrying
__add__in this fashion makes for very long tracebacks due to the nested exceptions (see below)- This may be avoided by checking for string datatypes on both arrays before retrying. I think we may even be able to do this check against the narwhals type?
very long tracebacks
import pandas as pd
import narwhals as nw
df = pd.DataFrame({"count": [1, 2, 3], "fruit": ["apple", "banana", "orange"]})
nw.from_native(df).with_columns(concat=nw.col("count") + nw.col("fruit"))produces
Traceback (most recent call last):
File "/home/cameron/repos/opensource/narwhals-dev/.venv/lib/python3.12/site-packages/pandas/core/arrays/arrow/array.py", line 988, in _evaluate_op_method
result = pc.binary_join_element_wise(other, self._pa_array, sep)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/cameron/repos/opensource/narwhals-dev/.venv/lib/python3.12/site-packages/pyarrow/compute.py", line 271, in wrapper
return func.call(args, options, memory_pool)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "pyarrow/_compute.pyx", line 399, in pyarrow._compute.Function.call
File "pyarrow/error.pxi", line 155, in pyarrow.lib.pyarrow_internal_check_status
File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
pyarrow.lib.ArrowNotImplementedError: Function 'binary_join_element_wise' has no kernel matching input types (int64, large_string, large_string)
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/cameron/repos/opensource/narwhals-dev/narwhals/_pandas_like/series.py", line 407, in _with_binary
res = op(ser, other_native)
^^^^^^^^^^^^^^^^^^^^^
File "/home/cameron/repos/opensource/narwhals-dev/.venv/lib/python3.12/site-packages/pandas/core/ops/common.py", line 85, in new_method
return method(self, other)
^^^^^^^^^^^^^^^^^^^
File "/home/cameron/repos/opensource/narwhals-dev/.venv/lib/python3.12/site-packages/pandas/core/arraylike.py", line 190, in __add__
return self._arith_method(other, operator.add)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/cameron/repos/opensource/narwhals-dev/.venv/lib/python3.12/site-packages/pandas/core/series.py", line 6751, in _arith_method
return base.IndexOpsMixin._arith_method(self, other, op)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/cameron/repos/opensource/narwhals-dev/.venv/lib/python3.12/site-packages/pandas/core/base.py", line 1644, in _arith_method
result = ops.arithmetic_op(lvalues, rvalues, op)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/cameron/repos/opensource/narwhals-dev/.venv/lib/python3.12/site-packages/pandas/core/ops/array_ops.py", line 279, in arithmetic_op
res_values = op(left, right)
^^^^^^^^^^^^^^^
File "/home/cameron/repos/opensource/narwhals-dev/.venv/lib/python3.12/site-packages/pandas/core/arrays/arrow/array.py", line 845, in __array_ufunc__
result = super().__array_ufunc__(ufunc, method, *inputs, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/cameron/repos/opensource/narwhals-dev/.venv/lib/python3.12/site-packages/pandas/core/arrays/base.py", line 2704, in __array_ufunc__
result = arraylike.maybe_dispatch_ufunc_to_dunder_op(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "pandas/_libs/ops_dispatch.pyx", line 113, in pandas._libs.ops_dispatch.maybe_dispatch_ufunc_to_dunder_op
File "/home/cameron/repos/opensource/narwhals-dev/.venv/lib/python3.12/site-packages/pandas/core/ops/common.py", line 85, in new_method
return method(self, other)
^^^^^^^^^^^^^^^^^^^
File "/home/cameron/repos/opensource/narwhals-dev/.venv/lib/python3.12/site-packages/pandas/core/arraylike.py", line 194, in __radd__
return self._arith_method(other, roperator.radd)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/cameron/repos/opensource/narwhals-dev/.venv/lib/python3.12/site-packages/pandas/core/arrays/arrow/array.py", line 1079, in _arith_method
result = self._evaluate_op_method(other, op, ARROW_ARITHMETIC_FUNCS)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/cameron/repos/opensource/narwhals-dev/.venv/lib/python3.12/site-packages/pandas/core/arrays/arrow/array.py", line 990, in _evaluate_op_method
raise TypeError(
TypeError: operation 'radd' not supported for dtype 'str' with dtype 'int64'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/cameron/repos/opensource/narwhals-dev/t.py", line 5, in <module>
nw.from_native(df).with_columns(concat=nw.col("count") + nw.col("fruit"))
File "/home/cameron/repos/opensource/narwhals-dev/narwhals/dataframe.py", line 1506, in with_columns
return super().with_columns(*exprs, **named_exprs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/cameron/repos/opensource/narwhals-dev/narwhals/dataframe.py", line 214, in with_columns
return self._with_compliant(self._compliant_frame.with_columns(*compliant_exprs))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/cameron/repos/opensource/narwhals-dev/narwhals/_pandas_like/dataframe.py", line 503, in with_columns
columns = self._evaluate_exprs(*exprs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/cameron/repos/opensource/narwhals-dev/narwhals/_compliant/dataframe.py", line 367, in _evaluate_exprs
return tuple(chain.from_iterable(self._evaluate_expr(expr) for expr in exprs)) # pyright: ignore[reportArgumentType]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/cameron/repos/opensource/narwhals-dev/narwhals/_compliant/dataframe.py", line 367, in <genexpr>
return tuple(chain.from_iterable(self._evaluate_expr(expr) for expr in exprs)) # pyright: ignore[reportArgumentType]
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/cameron/repos/opensource/narwhals-dev/narwhals/_compliant/dataframe.py", line 380, in _evaluate_expr
result = expr(self)
^^^^^^^^^^
File "/home/cameron/repos/opensource/narwhals-dev/narwhals/_compliant/expr.py", line 247, in __call__
return self._call(df)
^^^^^^^^^^^^^^
File "/home/cameron/repos/opensource/narwhals-dev/narwhals/_compliant/expr.py", line 651, in <lambda>
lambda df: [series.alias(name) for series in self(df)],
^^^^^^^^
File "/home/cameron/repos/opensource/narwhals-dev/narwhals/_compliant/expr.py", line 247, in __call__
return self._call(df)
^^^^^^^^^^^^^^
File "/home/cameron/repos/opensource/narwhals-dev/narwhals/_compliant/expr.py", line 372, in _reuse_series_inner
series._from_scalar(method(series)) if returns_scalar else method(series)
^^^^^^^^^^^^^^
File "/home/cameron/repos/opensource/narwhals-dev/narwhals/_pandas_like/series.py", line 452, in __add__
return self._with_binary(operator.add, other)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/cameron/repos/opensource/narwhals-dev/narwhals/_pandas_like/series.py", line 411, in _with_binary
res = binary_string_sum_fallback(ser, other_native, pdx)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/cameron/repos/opensource/narwhals-dev/narwhals/_pandas_like/utils.py", line 742, in binary_string_sum_fallback
return left + right.astype(left_dtype)
^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/cameron/repos/opensource/narwhals-dev/.venv/lib/python3.12/site-packages/pandas/core/generic.py", line 6541, in astype
new_data = self._mgr.astype(dtype=dtype, errors=errors)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/cameron/repos/opensource/narwhals-dev/.venv/lib/python3.12/site-packages/pandas/core/internals/managers.py", line 611, in astype
return self.apply("astype", dtype=dtype, errors=errors)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/cameron/repos/opensource/narwhals-dev/.venv/lib/python3.12/site-packages/pandas/core/internals/managers.py", line 442, in apply
applied = getattr(b, f)(**kwargs)
^^^^^^^^^^^^^^^^^^^^^^^
File "/home/cameron/repos/opensource/narwhals-dev/.venv/lib/python3.12/site-packages/pandas/core/internals/blocks.py", line 607, in astype
new_values = astype_array_safe(values, dtype, errors=errors)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/cameron/repos/opensource/narwhals-dev/.venv/lib/python3.12/site-packages/pandas/core/dtypes/astype.py", line 240, in astype_array_safe
new_values = astype_array(values, dtype, copy=copy)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/cameron/repos/opensource/narwhals-dev/.venv/lib/python3.12/site-packages/pandas/core/dtypes/astype.py", line 182, in astype_array
values = values.astype(dtype, copy=copy)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/cameron/repos/opensource/narwhals-dev/.venv/lib/python3.12/site-packages/pandas/core/arrays/string_arrow.py", line 336, in astype
return super().astype(dtype, copy=copy)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/cameron/repos/opensource/narwhals-dev/.venv/lib/python3.12/site-packages/pandas/core/arrays/base.py", line 881, in astype
return np.asarray(self, dtype=dtype)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/cameron/repos/opensource/narwhals-dev/.venv/lib/python3.12/site-packages/pandas/core/arrays/arrow/array.py", line 864, in __array__
return self.to_numpy(dtype=dtype, copy=copy)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/cameron/repos/opensource/narwhals-dev/.venv/lib/python3.12/site-packages/pandas/core/arrays/arrow/array.py", line 1717, in to_numpy
result = result.astype(dtype, copy=False)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ValueError: invalid literal for int() with base 10: 'apple'
| import pyarrow as pa | ||
|
|
||
| STRING_DTYPE_NAN = pd.StringDtype("pyarrow", na_value=np.nan) # type: ignore[call-arg] | ||
| STRING_DTYPE_NA = pd.StringDtype("pyarrow", na_value=pd.NA) # type: ignore[call-arg] |
There was a problem hiding this comment.
Would there be any purpose to add non-pyarrow backed StringDtypes into this test?

closes #3546
Description
What type of PR is this? (check all applicable)
Related issues
Checklist