FIX-#7405: Internally sort indices for loc/iloc set by noloerino · Pull Request #7440 · modin-project/modin

noloerino · 2025-02-14T00:00:03Z

What do these changes do?

Refactors internal logic for the PandasQueryCompiler write_items function to sort the items array and row/column indexers before passing them to partitions. This fixes the linked bug, which occurs when the internal DataFrame representation sorts indices before performing the write operation.

I considered removing the sort in PandasDataFrame, but this created difficulties in determining which data went to which partitions. Furthermore, it was difficult to cleanly ensure that both the items and index respected the sort order at that layer.

This change will cause overhead in some indexing cases, though this should be minimal under the assumption that len(items) << len(df). The code also includes logic to avoid sorting the items array if the index is an item like a range/RangeIndex/slice with step > 0 that is known to already be sorted.

first commit message and PR title follow format outlined here

NOTE: If you edit the PR title to match this format, you need to add another commit (even if it's empty) or amend your last commit for the CI job that checks the PR title to pick up the new PR title.
passes flake8 modin/ asv_bench/benchmarks scripts/doc_checker.py
passes black --check modin/ asv_bench/benchmarks scripts/doc_checker.py
signed commit with git commit -s
Resolves BUG: incorrect iloc behavior in modin when assigning index values based on row indices #7405
tests added and passing
module layout described at docs/development/architecture.rst is up-to-date

…oc sets Signed-off-by: Jonathan Shi <jhshi07@gmail.com>

noloerino · 2025-02-25T20:16:24Z

@anmyachev would you be able to take a look, since you did most of the implemntation for modin.pandas.utils.broadcast_item?

anmyachev · 2025-02-26T12:09:30Z

This fixes the linked bug, which occurs when the internal DataFrame representation sorts indices before performing the write operation.

Hi @noloerino, you write that the problem is that sorting occurs internally, and at the same time you add another one internal sorting, as I understand it. Why does this solve the problem?

Could it be that the external and internal index (you need to manually look at the internal partitions for this) simply do not match? This was a common problem due to optimizations, as far as I remember.

noloerino · 2025-02-26T21:28:40Z

Hi @noloerino, you write that the problem is that sorting occurs internally, and at the same time you add another one internal sorting, as I understand it. Why does this solve the problem?

Could it be that the external and internal index (you need to manually look at the internal partitions for this) simply do not match? This was a common problem due to optimizations, as far as I remember.

In testing, the problem I identified is that the index is sorted before determining which indices belong to which partitions. For example, if you were to do df.iloc[100, 0, 1, 101] = ['a', 'b', 'c', 'd'], we internally sort the row indices to [0, 1, 100, 101] before determining which partition has which indices. If positions [0, 1] belong to partition 0 and [100, 101] belong to partition 1, we end up incorrectly distributing ['a', 'b'] to partition 0 and ['c', 'd'] to partition 1.

Removing the index sort would be difficult because it seems like the partition-finding logic in _get_dict_of_block_index requires the global index to be sorted. The fix in this PR is to sort the values list based on the index's sort order (obtained via np.argsort), so the values remain aligned with the corresponding index when passed to each partition.

anmyachev · 2025-02-27T19:13:54Z

Hi @noloerino, you write that the problem is that sorting occurs internally, and at the same time you add another one internal sorting, as I understand it. Why does this solve the problem?
Could it be that the external and internal index (you need to manually look at the internal partitions for this) simply do not match? This was a common problem due to optimizations, as far as I remember.

In testing, the problem I identified is that the index is sorted before determining which indices belong to which partitions. For example, if you were to do df.iloc[100, 0, 1, 101] = ['a', 'b', 'c', 'd'], we internally sort the row indices to [0, 1, 100, 101] before determining which partition has which indices. If positions [0, 1] belong to partition 0 and [100, 101] belong to partition 1, we end up incorrectly distributing ['a', 'b'] to partition 0 and ['c', 'd'] to partition 1.

Removing the index sort would be difficult because it seems like the partition-finding logic in _get_dict_of_block_index requires the global index to be sorted. The fix in this PR is to sort the values list based on the index's sort order (obtained via np.argsort), so the values remain aligned with the corresponding index when passed to each partition.

I see.

Removing the index sort would be difficult because it seems like the partition-finding logic in _get_dict_of_block_index requires the global index to be sorted.

It's true. It seems to me that it would be more performant to return the previous order in _get_dict_of_block_index function itself, for example after the dict is built (considering the amount of data reordering). The current solution is also ok.

noloerino requested review from a team, RehanSD, YarShev, anmyachev, dchigarev, devin-petersohn, mvashishtha and vnlitvinov as code owners February 14, 2025 00:00

noloerino added 4 commits February 13, 2025 16:01

FIX-modin-project#7405: Internally sort indices and values for loc/il…

0bcd4de

…oc sets Signed-off-by: Jonathan Shi <jhshi07@gmail.com>

remove prints

a3a6da5

move sort logic to broadcast_items

4d124d5

lint

ef752c2

noloerino force-pushed the joshi/7405-iloc-set-order branch from fcbcf7f to ef752c2 Compare February 14, 2025 00:02

noloerino added 2 commits February 14, 2025 13:21

fix baseonpython, rename variable

14ae713

lint

c459dba

anmyachev approved these changes Feb 27, 2025

View reviewed changes

sfc-gh-joshi merged commit bb29f89 into modin-project:main Feb 27, 2025
41 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

FIX-#7405: Internally sort indices for loc/iloc set#7440

FIX-#7405: Internally sort indices for loc/iloc set#7440
sfc-gh-joshi merged 6 commits intomodin-project:mainfrom
noloerino:joshi/7405-iloc-set-order

noloerino commented Feb 14, 2025

Uh oh!

noloerino commented Feb 25, 2025

Uh oh!

anmyachev commented Feb 26, 2025

Uh oh!

noloerino commented Feb 26, 2025

Uh oh!

anmyachev commented Feb 27, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

noloerino commented Feb 14, 2025

What do these changes do?

Uh oh!

noloerino commented Feb 25, 2025

Uh oh!

anmyachev commented Feb 26, 2025

Uh oh!

noloerino commented Feb 26, 2025

Uh oh!

anmyachev commented Feb 27, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants