Description
I've been trying to use DataFrame.distinct (docs), and ran into some hiccups that could either become new options or just documentation improvements.
My biggest issue is: If there are duplicates and you use keep_all, which row is kept? The docs don't address this. From trying it out, it appears that the first row is kept (though it may be arbitrary). I need the last row to be kept, so I would like an option to control this, similar to the keep option of polars.DataFrame.unique. If that feature is not added, the docs could at least describe the current behavior: Is it guaranteed that the first row is kept, or is the choice arbitrary?
Second issue: Is the order of the output guaranteed to be preserved? polars.DataFrame.unique has a maintain_order option. Based on reading the code, this option is not set when using the polars backend, so there's no way to control the order. I'd like a way to set it (in my use case, it's okay to not use LazyFrames). Alternatively, if the docs explained that the output order is not defined, it would keep people from incorrectly assuming the output is ordered.
Lastly: It took me a while to find this function. I was looking for a function named "unique", by analogy to Enum.uniq and polars.DataFrame.unique. I suggest adding the word "unique" somewhere in the docs text so the function is easier to find by searching.