Skip to content

FIX-#7405: Internally sort indices for loc/iloc set#7440

Merged
sfc-gh-joshi merged 6 commits intomodin-project:mainfrom
noloerino:joshi/7405-iloc-set-order
Feb 27, 2025
Merged

FIX-#7405: Internally sort indices for loc/iloc set#7440
sfc-gh-joshi merged 6 commits intomodin-project:mainfrom
noloerino:joshi/7405-iloc-set-order

Conversation

@noloerino
Copy link
Collaborator

What do these changes do?

Refactors internal logic for the PandasQueryCompiler write_items function to sort the items array and row/column indexers before passing them to partitions. This fixes the linked bug, which occurs when the internal DataFrame representation sorts indices before performing the write operation.

I considered removing the sort in PandasDataFrame, but this created difficulties in determining which data went to which partitions. Furthermore, it was difficult to cleanly ensure that both the items and index respected the sort order at that layer.

This change will cause overhead in some indexing cases, though this should be minimal under the assumption that len(items) << len(df). The code also includes logic to avoid sorting the items array if the index is an item like a range/RangeIndex/slice with step > 0 that is known to already be sorted.

  • first commit message and PR title follow format outlined here

    NOTE: If you edit the PR title to match this format, you need to add another commit (even if it's empty) or amend your last commit for the CI job that checks the PR title to pick up the new PR title.

  • passes flake8 modin/ asv_bench/benchmarks scripts/doc_checker.py
  • passes black --check modin/ asv_bench/benchmarks scripts/doc_checker.py
  • signed commit with git commit -s
  • Resolves BUG: incorrect iloc behavior in modin when assigning index values based on row indices #7405
  • tests added and passing
  • module layout described at docs/development/architecture.rst is up-to-date

@noloerino noloerino force-pushed the joshi/7405-iloc-set-order branch from fcbcf7f to ef752c2 Compare February 14, 2025 00:02
@noloerino
Copy link
Collaborator Author

@anmyachev would you be able to take a look, since you did most of the implemntation for modin.pandas.utils.broadcast_item?

@anmyachev
Copy link
Collaborator

This fixes the linked bug, which occurs when the internal DataFrame representation sorts indices before performing the write operation.

Hi @noloerino, you write that the problem is that sorting occurs internally, and at the same time you add another one internal sorting, as I understand it. Why does this solve the problem?

Could it be that the external and internal index (you need to manually look at the internal partitions for this) simply do not match? This was a common problem due to optimizations, as far as I remember.

@noloerino
Copy link
Collaborator Author

Hi @noloerino, you write that the problem is that sorting occurs internally, and at the same time you add another one internal sorting, as I understand it. Why does this solve the problem?

Could it be that the external and internal index (you need to manually look at the internal partitions for this) simply do not match? This was a common problem due to optimizations, as far as I remember.

In testing, the problem I identified is that the index is sorted before determining which indices belong to which partitions. For example, if you were to do df.iloc[100, 0, 1, 101] = ['a', 'b', 'c', 'd'], we internally sort the row indices to [0, 1, 100, 101] before determining which partition has which indices. If positions [0, 1] belong to partition 0 and [100, 101] belong to partition 1, we end up incorrectly distributing ['a', 'b'] to partition 0 and ['c', 'd'] to partition 1.

Removing the index sort would be difficult because it seems like the partition-finding logic in _get_dict_of_block_index requires the global index to be sorted. The fix in this PR is to sort the values list based on the index's sort order (obtained via np.argsort), so the values remain aligned with the corresponding index when passed to each partition.

@anmyachev
Copy link
Collaborator

Hi @noloerino, you write that the problem is that sorting occurs internally, and at the same time you add another one internal sorting, as I understand it. Why does this solve the problem?
Could it be that the external and internal index (you need to manually look at the internal partitions for this) simply do not match? This was a common problem due to optimizations, as far as I remember.

In testing, the problem I identified is that the index is sorted before determining which indices belong to which partitions. For example, if you were to do df.iloc[100, 0, 1, 101] = ['a', 'b', 'c', 'd'], we internally sort the row indices to [0, 1, 100, 101] before determining which partition has which indices. If positions [0, 1] belong to partition 0 and [100, 101] belong to partition 1, we end up incorrectly distributing ['a', 'b'] to partition 0 and ['c', 'd'] to partition 1.

Removing the index sort would be difficult because it seems like the partition-finding logic in _get_dict_of_block_index requires the global index to be sorted. The fix in this PR is to sort the values list based on the index's sort order (obtained via np.argsort), so the values remain aligned with the corresponding index when passed to each partition.

I see.

Removing the index sort would be difficult because it seems like the partition-finding logic in _get_dict_of_block_index requires the global index to be sorted.

It's true. It seems to me that it would be more performant to return the previous order in _get_dict_of_block_index function itself, for example after the dict is built (considering the amount of data reordering). The current solution is also ok.

@sfc-gh-joshi sfc-gh-joshi merged commit bb29f89 into modin-project:main Feb 27, 2025
41 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

BUG: incorrect iloc behavior in modin when assigning index values based on row indices

3 participants