Skip to content

fix: Add fallback for incompatible string concenation in pandas#3548

Open
MarcoGorelli wants to merge 6 commits intonarwhals-dev:mainfrom
MarcoGorelli:binary-pyarrow-comparisons
Open

fix: Add fallback for incompatible string concenation in pandas#3548
MarcoGorelli wants to merge 6 commits intonarwhals-dev:mainfrom
MarcoGorelli:binary-pyarrow-comparisons

Conversation

@MarcoGorelli
Copy link
Copy Markdown
Member

closes #3546

Description

What type of PR is this? (check all applicable)

  • 💾 Refactor
  • ✨ Feature
  • 🐛 Bug Fix
  • 🔧 Optimization
  • 📝 Documentation
  • ✅ Test
  • 🐳 Other

Related issues

  • Related issue #<issue number>
  • Closes #<issue number>

Checklist

  • Code follows style guide (ruff)
  • Tests added
  • Documented the changes

@dangotbanned
Copy link
Copy Markdown
Member

dangotbanned commented Apr 14, 2026

Note

Edited to increase the range

Have you thought about handling the dtypes somewhere around here instead?

def concat_str(
self, *exprs: PandasLikeExpr, separator: str, ignore_nulls: bool
) -> PandasLikeExpr:
string = self._version.dtypes.String()
def func(df: PandasLikeDataFrame) -> list[PandasLikeSeries]:
expr_results = [s for _expr in exprs for s in _expr(df)]
series = [s.cast(string) for s in expr_results]

E.g. you could do 2 passes where:

  1. Collect all the native dtypes
  2. Decide on what's needed to make them compatible & then do the casts at the native-level

This is reminding me a bit of #3398 (comment) 😄

@MarcoGorelli
Copy link
Copy Markdown
Member Author

not sure about doing it for all ops tbh, but for string concatenation at least it seems to be necessary so i'd suggest starting with that, we can make it more generally available later if necessary

@MarcoGorelli MarcoGorelli marked this pull request as ready for review April 14, 2026 15:19
@dangotbanned
Copy link
Copy Markdown
Member

not sure about doing it for all ops tbh, but for string concatenation at least it seems to be necessary so i'd suggest starting with that, we can make it more generally available later if necessary

Sorry if I wasn't clear, I was talking about concat_str - have edited (#3548 (comment))

@MarcoGorelli
Copy link
Copy Markdown
Member Author

we'd then need to do the same casts for __add__, so i think it's cleaner to do just it in one place?

@dangotbanned
Copy link
Copy Markdown
Member

we'd then need to do the same casts for __add__

I was a bit confused by this, I had no idea polars did it too 😂

import polars as pl

>>> pl.Series(["a", "b", "c"]) + pl.Series(["d", "e", "f"])
shape: (3,)
Series: '' [str]
[
	"ad"
	"be"
	"cf"
]

@MarcoGorelli
Copy link
Copy Markdown
Member Author

I've aligned this with the upstream pandas behaviour in pandas-dev/pandas#65220 (comment)

any objections?

Copy link
Copy Markdown
Member

@FBruzzesi FBruzzesi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @MarcoGorelli - I am mostly trusting the test suite on this!

Also added a tiny suggestion

Comment thread narwhals/_pandas_like/series.py
Copy link
Copy Markdown
Member

@FBruzzesi FBruzzesi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @MarcoGorelli ! No further questions from my side

return self.native.iloc[-1] if len(self.native) else None

def _with_binary(self, op: Callable[..., PandasLikeSeries], other: Any) -> Self:
def _with_binary(self, op: Callable[..., pd.Series], other: Any) -> Self:
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice catch here :)

Copy link
Copy Markdown
Member

@camriddell camriddell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall, this is a good compatibility bridge. With the current implementation I think we should:

  1. Discuss the costs of re-trying all failed __add__ operations.
  2. Remove the written out test cases for a combinatoric solution and property derived dtype detection.

pd.ArrowDtype(pa.large_string()),
pd.ArrowDtype(pa.large_string()),
pd.ArrowDtype(pa.large_string()),
),
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The cases are serve effectively as a handwritten truth table.

Can this be refactored to

  1. parameterize from itertools.product
  2. embed the logic for deriving the appropriate datatype into the test?
    • left_dtype is a pandas.StringDtype -> left_dtype
    • left_dtype is a pandas.ArrowDtype and right_dtype is pandas.StringDtype -> pd.ArrowDtype(pa.large_string()
    • both left/right are ArrowDtype: pa.string if all inputs are pa.string otherwise pa.large_string

This reduces the length of the parameters spec and codifies the expected rules for returning a datatype.

DTYPES = [
    pd.StringDtype("pyarrow", na_value=np.nan),  # type: ignore[call-arg]
    pd.StringDtype("pyarrow", na_value=pd.NA),  # type: ignore[call-arg]
    pd.ArrowDtype(pa.string()),
    pd.ArrowDtype(pa.large_string()),
]

@pytest.mark.parametrize(
    ("left_dtype", "right_dtype"),
    [*product(DTYPES, repeat=2)],
)
def test_pandas_str_types(...):
    ...

    assert_equal_data(res, expected)

    result_dtype = res.to_native()["concat_col"].dtype
    if isinstance(left_dtype, pd.StringDtype):
        expected_dtype = left_dtype

    elif isinstance(left_dtype, pd.ArrowDtype) and isinstance(right_dtype, pd.StringDtype):
        expected_dtype = pd.ArrowDtype(pa.large_string())

    elif isinstance(left_dtype, pd.ArrowDtype) and isinstance(right_dtype, pd.ArrowDtype):
        left_type = left_dtype.pyarrow_dtype
        right_type = right_dtype.pyarrow_dtype

        if pa.types.is_large_string(left_type) or pa.types.is_large_string(right_type):
            expected_dtype = pd.ArrowDtype(pa.large_string())
        else:
            expected_dtype = pd.ArrowDtype(pa.string())

    assert result_dtype == expected_dtype

).alias(self.name)
try:
res = op(ser, other_native)
except Exception:
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Can this be more specific (e.g. TypeError rather than Exception)? Observing the test results without this snippet, I only observe TypeErrors but this could be due to a backport from pandas.
  2. For cases when addition will fail no matter what (e.g. adding integers and strings) retrying __add__ in this fashion makes for very long tracebacks due to the nested exceptions (see below)
    • This may be avoided by checking for string datatypes on both arrays before retrying. I think we may even be able to do this check against the narwhals type?
very long tracebacks
import pandas as pd
import narwhals as nw

df = pd.DataFrame({"count": [1, 2, 3], "fruit": ["apple", "banana", "orange"]})
nw.from_native(df).with_columns(concat=nw.col("count") + nw.col("fruit"))

produces

Traceback (most recent call last):
  File "/home/cameron/repos/opensource/narwhals-dev/.venv/lib/python3.12/site-packages/pandas/core/arrays/arrow/array.py", line 988, in _evaluate_op_method
    result = pc.binary_join_element_wise(other, self._pa_array, sep)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/cameron/repos/opensource/narwhals-dev/.venv/lib/python3.12/site-packages/pyarrow/compute.py", line 271, in wrapper
    return func.call(args, options, memory_pool)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "pyarrow/_compute.pyx", line 399, in pyarrow._compute.Function.call
  File "pyarrow/error.pxi", line 155, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
pyarrow.lib.ArrowNotImplementedError: Function 'binary_join_element_wise' has no kernel matching input types (int64, large_string, large_string)

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/cameron/repos/opensource/narwhals-dev/narwhals/_pandas_like/series.py", line 407, in _with_binary
    res = op(ser, other_native)
          ^^^^^^^^^^^^^^^^^^^^^
  File "/home/cameron/repos/opensource/narwhals-dev/.venv/lib/python3.12/site-packages/pandas/core/ops/common.py", line 85, in new_method
    return method(self, other)
           ^^^^^^^^^^^^^^^^^^^
  File "/home/cameron/repos/opensource/narwhals-dev/.venv/lib/python3.12/site-packages/pandas/core/arraylike.py", line 190, in __add__
    return self._arith_method(other, operator.add)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/cameron/repos/opensource/narwhals-dev/.venv/lib/python3.12/site-packages/pandas/core/series.py", line 6751, in _arith_method
    return base.IndexOpsMixin._arith_method(self, other, op)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/cameron/repos/opensource/narwhals-dev/.venv/lib/python3.12/site-packages/pandas/core/base.py", line 1644, in _arith_method
    result = ops.arithmetic_op(lvalues, rvalues, op)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/cameron/repos/opensource/narwhals-dev/.venv/lib/python3.12/site-packages/pandas/core/ops/array_ops.py", line 279, in arithmetic_op
    res_values = op(left, right)
                 ^^^^^^^^^^^^^^^
  File "/home/cameron/repos/opensource/narwhals-dev/.venv/lib/python3.12/site-packages/pandas/core/arrays/arrow/array.py", line 845, in __array_ufunc__
    result = super().__array_ufunc__(ufunc, method, *inputs, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/cameron/repos/opensource/narwhals-dev/.venv/lib/python3.12/site-packages/pandas/core/arrays/base.py", line 2704, in __array_ufunc__
    result = arraylike.maybe_dispatch_ufunc_to_dunder_op(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "pandas/_libs/ops_dispatch.pyx", line 113, in pandas._libs.ops_dispatch.maybe_dispatch_ufunc_to_dunder_op
  File "/home/cameron/repos/opensource/narwhals-dev/.venv/lib/python3.12/site-packages/pandas/core/ops/common.py", line 85, in new_method
    return method(self, other)
           ^^^^^^^^^^^^^^^^^^^
  File "/home/cameron/repos/opensource/narwhals-dev/.venv/lib/python3.12/site-packages/pandas/core/arraylike.py", line 194, in __radd__
    return self._arith_method(other, roperator.radd)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/cameron/repos/opensource/narwhals-dev/.venv/lib/python3.12/site-packages/pandas/core/arrays/arrow/array.py", line 1079, in _arith_method
    result = self._evaluate_op_method(other, op, ARROW_ARITHMETIC_FUNCS)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/cameron/repos/opensource/narwhals-dev/.venv/lib/python3.12/site-packages/pandas/core/arrays/arrow/array.py", line 990, in _evaluate_op_method
    raise TypeError(
TypeError: operation 'radd' not supported for dtype 'str' with dtype 'int64'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/cameron/repos/opensource/narwhals-dev/t.py", line 5, in <module>
    nw.from_native(df).with_columns(concat=nw.col("count") + nw.col("fruit"))
  File "/home/cameron/repos/opensource/narwhals-dev/narwhals/dataframe.py", line 1506, in with_columns
    return super().with_columns(*exprs, **named_exprs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/cameron/repos/opensource/narwhals-dev/narwhals/dataframe.py", line 214, in with_columns
    return self._with_compliant(self._compliant_frame.with_columns(*compliant_exprs))
                                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/cameron/repos/opensource/narwhals-dev/narwhals/_pandas_like/dataframe.py", line 503, in with_columns
    columns = self._evaluate_exprs(*exprs)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/cameron/repos/opensource/narwhals-dev/narwhals/_compliant/dataframe.py", line 367, in _evaluate_exprs
    return tuple(chain.from_iterable(self._evaluate_expr(expr) for expr in exprs))  # pyright: ignore[reportArgumentType]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/cameron/repos/opensource/narwhals-dev/narwhals/_compliant/dataframe.py", line 367, in <genexpr>
    return tuple(chain.from_iterable(self._evaluate_expr(expr) for expr in exprs))  # pyright: ignore[reportArgumentType]
                                     ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/cameron/repos/opensource/narwhals-dev/narwhals/_compliant/dataframe.py", line 380, in _evaluate_expr
    result = expr(self)
             ^^^^^^^^^^
  File "/home/cameron/repos/opensource/narwhals-dev/narwhals/_compliant/expr.py", line 247, in __call__
    return self._call(df)
           ^^^^^^^^^^^^^^
  File "/home/cameron/repos/opensource/narwhals-dev/narwhals/_compliant/expr.py", line 651, in <lambda>
    lambda df: [series.alias(name) for series in self(df)],
                                                 ^^^^^^^^
  File "/home/cameron/repos/opensource/narwhals-dev/narwhals/_compliant/expr.py", line 247, in __call__
    return self._call(df)
           ^^^^^^^^^^^^^^
  File "/home/cameron/repos/opensource/narwhals-dev/narwhals/_compliant/expr.py", line 372, in _reuse_series_inner
    series._from_scalar(method(series)) if returns_scalar else method(series)
                                                               ^^^^^^^^^^^^^^
  File "/home/cameron/repos/opensource/narwhals-dev/narwhals/_pandas_like/series.py", line 452, in __add__
    return self._with_binary(operator.add, other)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/cameron/repos/opensource/narwhals-dev/narwhals/_pandas_like/series.py", line 411, in _with_binary
    res = binary_string_sum_fallback(ser, other_native, pdx)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/cameron/repos/opensource/narwhals-dev/narwhals/_pandas_like/utils.py", line 742, in binary_string_sum_fallback
    return left + right.astype(left_dtype)
                  ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/cameron/repos/opensource/narwhals-dev/.venv/lib/python3.12/site-packages/pandas/core/generic.py", line 6541, in astype
    new_data = self._mgr.astype(dtype=dtype, errors=errors)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/cameron/repos/opensource/narwhals-dev/.venv/lib/python3.12/site-packages/pandas/core/internals/managers.py", line 611, in astype
    return self.apply("astype", dtype=dtype, errors=errors)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/cameron/repos/opensource/narwhals-dev/.venv/lib/python3.12/site-packages/pandas/core/internals/managers.py", line 442, in apply
    applied = getattr(b, f)(**kwargs)
              ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/cameron/repos/opensource/narwhals-dev/.venv/lib/python3.12/site-packages/pandas/core/internals/blocks.py", line 607, in astype
    new_values = astype_array_safe(values, dtype, errors=errors)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/cameron/repos/opensource/narwhals-dev/.venv/lib/python3.12/site-packages/pandas/core/dtypes/astype.py", line 240, in astype_array_safe
    new_values = astype_array(values, dtype, copy=copy)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/cameron/repos/opensource/narwhals-dev/.venv/lib/python3.12/site-packages/pandas/core/dtypes/astype.py", line 182, in astype_array
    values = values.astype(dtype, copy=copy)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/cameron/repos/opensource/narwhals-dev/.venv/lib/python3.12/site-packages/pandas/core/arrays/string_arrow.py", line 336, in astype
    return super().astype(dtype, copy=copy)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/cameron/repos/opensource/narwhals-dev/.venv/lib/python3.12/site-packages/pandas/core/arrays/base.py", line 881, in astype
    return np.asarray(self, dtype=dtype)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/cameron/repos/opensource/narwhals-dev/.venv/lib/python3.12/site-packages/pandas/core/arrays/arrow/array.py", line 864, in __array__
    return self.to_numpy(dtype=dtype, copy=copy)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/cameron/repos/opensource/narwhals-dev/.venv/lib/python3.12/site-packages/pandas/core/arrays/arrow/array.py", line 1717, in to_numpy
    result = result.astype(dtype, copy=False)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ValueError: invalid literal for int() with base 10: 'apple'

import pyarrow as pa

STRING_DTYPE_NAN = pd.StringDtype("pyarrow", na_value=np.nan) # type: ignore[call-arg]
STRING_DTYPE_NA = pd.StringDtype("pyarrow", na_value=pd.NA) # type: ignore[call-arg]
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would there be any purpose to add non-pyarrow backed StringDtypes into this test?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug]: Concatenating nw.lit('string') column with PyArrow string column raises TypeError

4 participants