Skip to content

[SPARK-57423][PYTHON][TEST] Share base mixin across MapInBatch Arrow/Pandas siblings#56483

Open
viirya wants to merge 1 commit into
apache:masterfrom
viirya:SPARK-57423-map-iter-dedup
Open

[SPARK-57423][PYTHON][TEST] Share base mixin across MapInBatch Arrow/Pandas siblings#56483
viirya wants to merge 1 commit into
apache:masterfrom
viirya:SPARK-57423-map-iter-dedup

Conversation

@viirya

@viirya viirya commented Jun 13, 2026

Copy link
Copy Markdown
Member

What changes were proposed in this pull request?

Follow-up to SPARK-57137/57138/57414. Apply the sibling base-mixin pattern to the last remaining Arrow/Pandas pair in python/benchmarks/bench_eval_type.py:

  • _MapPandasIterBenchMixin now subclasses _MapArrowIterBenchMixin, dropping its duplicated _build_scenario (identical body) and _write_scenario (identical except the eval type constant and the ret_type handling).
  • The Arrow base gains the _eval_type class attribute and parameterizes _write_scenario on self._eval_type. The Pandas _udfs values are normalized from (func, arg_offsets) to the Arrow base's (func, ret_type, arg_offsets) shape so the inherited writer works unchanged.
  • Drive-by: convert the _build_scenario of the newly added _TransformWithStatePandasBenchMixin (SPARK-57020, developed in parallel with SPARK-57414) from @staticmethod with a hardcoded class-name lookup to @classmethod with cls._scenario_configs, matching the now-uniform convention.

Net diff: +16 / -40 lines.

Why are the changes needed?

The two MapInBatch mixins were the last undeclared sibling pair: near-identical _build_scenario and _write_scenario bodies that had to be kept in lock-step manually. After this change, every Arrow/Pandas sibling pair in the file shares a base mixin, every _build_scenario resolves configs via cls, and no redundant protocol-writing copies remain.

Does this PR introduce any user-facing change?

No. Test-only change in the benchmark module.

How was this patch tested?

  • Verified the generated worker input is byte-identical to the pre-refactor output for all 21 *TimeBench classes across every (scenario, udf) cell, with the cloudpickled UDF command excluded and the TWS stub state-server port pinned (the port is freshly allocated per process; the UDF pickle embeds co_firstlineno and a random cloudpickle class-tracker id -- all environment/location metadata unaffected by this change).
  • Compared the pickle opcode streams of all 63 pickled UDF/UDTF callables between old and new: only the metadata above differs; code, constants, and names are unchanged.
  • Ran setup + time_worker end-to-end for the affected *TimeBench classes (MapArrowIter, MapPandasIter, TransformWithStatePandas) across every UDF, and peakmem_worker for MapPandasIter and TransformWithStatePandas.

Was this patch authored or co-authored using generative AI tooling?

Yes. Generated-by: Claude Code (claude-opus-4-8)

…Pandas siblings

### What changes were proposed in this pull request?

Follow-up to SPARK-57137/57138/57414. Apply the sibling base-mixin pattern to the
last remaining Arrow/Pandas pair in `python/benchmarks/bench_eval_type.py`:

- `_MapPandasIterBenchMixin` now subclasses `_MapArrowIterBenchMixin`, dropping its
  duplicated `_build_scenario` (identical body) and `_write_scenario` (identical
  except the eval type constant and the `ret_type` handling).
- The Arrow base gains the `_eval_type` class attribute and parameterizes
  `_write_scenario` on `self._eval_type`. The Pandas `_udfs` values are normalized
  from `(func, arg_offsets)` to the Arrow base's `(func, ret_type, arg_offsets)`
  shape so the inherited writer works unchanged.
- Drive-by: convert the `_build_scenario` of the newly added
  `_TransformWithStatePandasBenchMixin` (SPARK-57020, developed in parallel with
  SPARK-57414) from `@staticmethod` with a hardcoded class-name lookup to
  `@classmethod` with `cls._scenario_configs`, matching the now-uniform convention.

Net diff: +16 / -40 lines.

### Why are the changes needed?

The two MapInBatch mixins were the last undeclared sibling pair: near-identical
`_build_scenario` and `_write_scenario` bodies that had to be kept in lock-step
manually. After this change, every Arrow/Pandas sibling pair in the file shares a
base mixin, every `_build_scenario` resolves configs via `cls`, and no redundant
protocol-writing copies remain.

### Does this PR introduce _any_ user-facing change?

No. Test-only change in the benchmark module.

### How was this patch tested?

- Verified the generated worker input is byte-identical to the pre-refactor output
  for all 21 `*TimeBench` classes across every (scenario, udf) cell, with the
  cloudpickled UDF command excluded and the TWS stub state-server port pinned (the
  port is freshly allocated per process; the UDF pickle embeds `co_firstlineno`
  and a random cloudpickle class-tracker id -- all environment/location metadata).
- Compared the pickle opcode streams of all 63 pickled UDF/UDTF callables between
  old and new: only the metadata above differs; code, constants, and names are
  unchanged.
- Ran `setup` + `time_worker` end-to-end for the affected `*TimeBench` classes
  (MapArrowIter, MapPandasIter, TransformWithStatePandas) across every UDF, and
  `peakmem_worker` for MapPandasIter and TransformWithStatePandas.

### Was this patch authored or co-authored using generative AI tooling?

Yes. Generated-by: Claude Code (claude-opus-4-8)

Co-authored-by: Claude Code
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant