Potential data overlap and incomplete coverage due to random.shuffle in concept_generation.py #42

@hjhhsy120

Description

Hi, I would like to report a potential issue in atlas_rag/kg_construction/concept_generation.py at line 113. The load_data_with_shard function currently calls random.shuffle when processing multiple shards. When shards are run concurrently, this can lead to data overlap and incomplete coverage: each shard shuffles the dataset independently before selecting its subset, so the shards are slices of different orderings, and some items may be picked by more than one shard while others are picked by none. I am not sure whether this behavior is by design or whether I have misunderstood the sharding logic. I would appreciate your feedback, and if it is indeed an issue, I hope it can be addressed. Thank you!
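
To illustrate the concern, here is a minimal sketch. It does not reproduce the actual load_data_with_shard code: the function names, the contiguous slicing scheme, and the seeds below are all hypothetical and only model the "shuffle, then take this shard's slice" pattern described above. It compares shards that each shuffle with their own random state against shards that all shuffle with one shared seed.

```python
import random


def _slice_for_shard(items, shard_idx, num_shards):
    """Return one shard's contiguous slice; the last shard takes the remainder."""
    size = len(items) // num_shards
    start = shard_idx * size
    end = len(items) if shard_idx == num_shards - 1 else start + size
    return items[start:end]


def shard_with_independent_shuffle(items, shard_idx, num_shards, rng):
    """Hypothetical model of the reported pattern.

    `rng` stands in for the separate random state each worker process has,
    so every shard slices a *different* permutation of the data.
    """
    shuffled = list(items)
    rng.shuffle(shuffled)
    return _slice_for_shard(shuffled, shard_idx, num_shards)


def shard_with_shared_seed(items, shard_idx, num_shards, seed=0):
    """One possible fix (an assumption, not the maintainers' choice).

    Every shard shuffles with the same seed, so all shards slice the
    *same* permutation and together form a partition of the data.
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    return _slice_for_shard(shuffled, shard_idx, num_shards)


def overlap_and_gaps(shards, items):
    """Return (number of duplicated picks, number of items never picked)."""
    seen = [x for shard in shards for x in shard]
    unique = set(seen)
    return len(seen) - len(unique), len(set(items) - unique)
```

Calling `shard_with_independent_shuffle` once per shard, each with a differently seeded `random.Random`, and passing the results to `overlap_and_gaps` should report both duplicated and missing items, while the shared-seed variant should report `(0, 0)`. Dropping the shuffle entirely, or slicing before any shuffling, would avoid the problem as well; which option fits depends on why the shuffle was added in the first place.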
