DF.distinct options #1096

@skyqrose

I've been trying to use DataFrame.distinct (docs), and ran into some hiccups that could either become new options or just documentation improvements.

My biggest issue is: If there's a duplicate, and you use keep_all, which row is kept? The docs don't address this. Based on trying it out, it appears that the first row is kept (or maybe it's random). But I need the last row to be kept. I would like to have an option to control this, similar to polars.DataFrame.unique's keep option. If that feature is not added, the docs could be improved by describing what the current behavior is: Is it guaranteed that the first row is kept, or is it arbitrary?

Second issue: Is the order of the output guaranteed to be preserved? polars.DataFrame.unique has a maintain_order option. Based on reading the code, the polars backend does not set this option, so there's no way to control the output order. I'd like a way to set it (in my use case, it's okay to not use LazyFrames). Alternatively, if the docs explained that the output order is not defined, it could prevent people from incorrectly assuming the output is ordered.
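
To make the two requests concrete, here is a rough sketch in plain Python of the semantics I have in mind. It is not how Explorer or polars implement anything. The function name distinct_rows and its keep parameter are made up for illustration, and the rule that surviving rows keep their original relative order is my assumption about what an order-preserving mode would mean.

```python
def distinct_rows(rows, key_columns, keep="first"):
    """Drop duplicate rows (dicts), comparing only key_columns.

    keep="first" keeps the earliest row for each key, keep="last" keeps
    the latest. Surviving rows are returned in their original relative
    order (an assumption of this sketch, not documented behavior).
    """
    if keep not in ("first", "last"):
        raise ValueError("keep must be 'first' or 'last'")

    kept = {}  # key tuple -> (original position, row)
    for position, row in enumerate(rows):
        key = tuple(row[column] for column in key_columns)
        # For "last", later rows overwrite earlier ones;
        # for "first", only the first row seen for a key is stored.
        if keep == "last" or key not in kept:
            kept[key] = (position, row)

    # Restore the original relative order of the surviving rows.
    survivors = sorted(kept.values(), key=lambda pair: pair[0])
    return [row for _, row in survivors]


rows = [
    {"id": 1, "value": "a"},
    {"id": 2, "value": "b"},
    {"id": 1, "value": "c"},  # duplicate of id 1
]

first_kept = distinct_rows(rows, ["id"], keep="first")
last_kept = distinct_rows(rows, ["id"], keep="last")
```

With keep="first" the row with value "a" survives for id 1. With keep="last" the row with value "c" survives instead, which is the behavior I need.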

Lastly: It took me a while to find this function. I was looking for a function named "unique" by analogy to Enum.uniq and polars.DataFrame.unique. I suggest adding the word "unique" somewhere into the docs text so it's easier to find by searching.
