[SPARK-57423][PYTHON][TEST] Share base mixin across MapInBatch Arrow/Pandas siblings by viirya · Pull Request #56483 · apache/spark

viirya · 2026-06-13T00:25:33Z

What changes were proposed in this pull request?

Follow-up to SPARK-57137/57138/57414. Apply the sibling base-mixin pattern to the last remaining Arrow/Pandas pair in python/benchmarks/bench_eval_type.py:

_MapPandasIterBenchMixin now subclasses _MapArrowIterBenchMixin, dropping its duplicated _build_scenario (identical body) and _write_scenario (identical except the eval type constant and the ret_type handling).
The Arrow base gains the _eval_type class attribute and parameterizes _write_scenario on self._eval_type. The Pandas _udfs values are normalized from (func, arg_offsets) to the Arrow base's (func, ret_type, arg_offsets) shape so the inherited writer works unchanged.
Drive-by: convert the _build_scenario of the newly added _TransformWithStatePandasBenchMixin (SPARK-57020, developed in parallel with SPARK-57414) from @staticmethod with a hardcoded class-name lookup to @classmethod with cls._scenario_configs, matching the now-uniform convention.

Net diff: +16 / -40 lines.

Why are the changes needed?

The two MapInBatch mixins were the last undeclared sibling pair: near-identical _build_scenario and _write_scenario bodies that had to be kept in lock-step manually. After this change, every Arrow/Pandas sibling pair in the file shares a base mixin, every _build_scenario resolves configs via cls, and no redundant protocol-writing copies remain.

Does this PR introduce any user-facing change?

No. Test-only change in the benchmark module.

How was this patch tested?

Verified the generated worker input is byte-identical to the pre-refactor output for all 21 *TimeBench classes across every (scenario, udf) cell, with the cloudpickled UDF command excluded and the TWS stub state-server port pinned (the port is freshly allocated per process; the UDF pickle embeds co_firstlineno and a random cloudpickle class-tracker id -- all environment/location metadata unaffected by this change).
Compared the pickle opcode streams of all 63 pickled UDF/UDTF callables between old and new: only the metadata above differs; code, constants, and names are unchanged.
Ran setup + time_worker end-to-end for the affected *TimeBench classes (MapArrowIter, MapPandasIter, TransformWithStatePandas) across every UDF, and peakmem_worker for MapPandasIter and TransformWithStatePandas.

Was this patch authored or co-authored using generative AI tooling?

Yes. Generated-by: Claude Code (claude-opus-4-8)

…Pandas siblings ### What changes were proposed in this pull request? Follow-up to SPARK-57137/57138/57414. Apply the sibling base-mixin pattern to the last remaining Arrow/Pandas pair in `python/benchmarks/bench_eval_type.py`: - `_MapPandasIterBenchMixin` now subclasses `_MapArrowIterBenchMixin`, dropping its duplicated `_build_scenario` (identical body) and `_write_scenario` (identical except the eval type constant and the `ret_type` handling). - The Arrow base gains the `_eval_type` class attribute and parameterizes `_write_scenario` on `self._eval_type`. The Pandas `_udfs` values are normalized from `(func, arg_offsets)` to the Arrow base's `(func, ret_type, arg_offsets)` shape so the inherited writer works unchanged. - Drive-by: convert the `_build_scenario` of the newly added `_TransformWithStatePandasBenchMixin` (SPARK-57020, developed in parallel with SPARK-57414) from `@staticmethod` with a hardcoded class-name lookup to `@classmethod` with `cls._scenario_configs`, matching the now-uniform convention. Net diff: +16 / -40 lines. ### Why are the changes needed? The two MapInBatch mixins were the last undeclared sibling pair: near-identical `_build_scenario` and `_write_scenario` bodies that had to be kept in lock-step manually. After this change, every Arrow/Pandas sibling pair in the file shares a base mixin, every `_build_scenario` resolves configs via `cls`, and no redundant protocol-writing copies remain. ### Does this PR introduce _any_ user-facing change? No. Test-only change in the benchmark module. ### How was this patch tested? - Verified the generated worker input is byte-identical to the pre-refactor output for all 21 `*TimeBench` classes across every (scenario, udf) cell, with the cloudpickled UDF command excluded and the TWS stub state-server port pinned (the port is freshly allocated per process; the UDF pickle embeds `co_firstlineno` and a random cloudpickle class-tracker id -- all environment/location metadata). - Compared the pickle opcode streams of all 63 pickled UDF/UDTF callables between old and new: only the metadata above differs; code, constants, and names are unchanged. - Ran `setup` + `time_worker` end-to-end for the affected `*TimeBench` classes (MapArrowIter, MapPandasIter, TransformWithStatePandas) across every UDF, and `peakmem_worker` for MapPandasIter and TransformWithStatePandas. ### Was this patch authored or co-authored using generative AI tooling? Yes. Generated-by: Claude Code (claude-opus-4-8) Co-authored-by: Claude Code

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[SPARK-57423][PYTHON][TEST] Share base mixin across MapInBatch Arrow/Pandas siblings#56483

[SPARK-57423][PYTHON][TEST] Share base mixin across MapInBatch Arrow/Pandas siblings#56483
viirya wants to merge 1 commit into
apache:masterfrom
viirya:SPARK-57423-map-iter-dedup

viirya commented Jun 13, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

viirya commented Jun 13, 2026

What changes were proposed in this pull request?

Why are the changes needed?

Does this PR introduce any user-facing change?

How was this patch tested?

Was this patch authored or co-authored using generative AI tooling?

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant