fix: default workflow input extensions by filetype by nightcityblade · Pull Request #1982 · NVIDIA-NeMo/Curator

nightcityblade · 2026-05-14T15:10:02Z

Description

Workflows that expose input_file_extensions now default to the extensions for their selected input_filetype when no explicit override is provided, instead of falling back to the generic FilePartitioningStage defaults.

Usage

from nemo_curator.stages.deduplication.exact.workflow import ExactDeduplicationWorkflow

workflow = ExactDeduplicationWorkflow(
    input_path="/data/input",
    output_path="/data/output",
    input_filetype="jsonl",
)
# Defaults to [".jsonl", ".json"] unless input_file_extensions is set.

Checklist

I am familiar with the Contributing Guide.
New or Existing tests cover these changes.
The documentation is up to date with these changes.

Tests:

uv run ruff check nemo_curator/stages/deduplication/exact/workflow.py nemo_curator/stages/deduplication/fuzzy/workflow.py nemo_curator/stages/text/deduplication/removal_workflow.py nemo_curator/stages/text/deduplication/semantic.py tests/stages/deduplication/exact/test_workflow.py tests/stages/deduplication/fuzzy/test_fuzzy_workflow.py tests/stages/text/deduplication/test_removal_workflow.py
uv run pytest tests/stages/text/deduplication/test_removal_workflow.py::TestTextDuplicatesRemovalWorkflowGenerateStages -q (blocked locally: NeMo-Curator raises on non-Linux/darwin)

Signed-off-by: nightcityblade <[email protected]>

copy-pr-bot · 2026-05-14T15:10:07Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

greptile-apps · 2026-05-14T15:12:35Z

Greptile Summary

This PR fixes workflows that expose input_file_extensions so they default to the extensions appropriate for their input_filetype (e.g. [".jsonl", ".json"] for "jsonl") instead of falling back to FilePartitioningStage's catch-all list. It introduces a shared get_default_file_extensions helper in file_utils.py and applies it consistently across five workflows.

Core helper (file_utils.py): new get_default_file_extensions(input_filetype) raises a descriptive ValueError for unknown filetypes rather than silently returning [].
Five workflows updated: ExactDeduplicationWorkflow, FuzzyDeduplicationWorkflow, KMeansStage, TextDuplicatesRemovalWorkflow, and TextSemanticDeduplicationWorkflow all use self.input_file_extensions or get_default_file_extensions(self.input_filetype).
Test coverage: each changed workflow gains default-to-filetype and explicit-override tests; TextSemanticDeduplicationWorkflow is covered by the new test_semantic_file_extensions.py using monkeypatch to avoid GPU dependencies.

Confidence Score: 5/5

Safe to merge — the change is a targeted fix with no breaking surface and good test coverage across all five modified workflows.

All five workflows receive an equivalent, well-tested fix. The shared helper is simple and correct. The only finding is dead code left behind in KMeansStage.decompose from the refactoring, which has no runtime impact.

nemo_curator/stages/deduplication/semantic/kmeans.py — the if not file_extensions: guard is now unreachable and can be cleaned up.

Important Files Changed

Filename	Overview
nemo_curator/utils/file_utils.py	Adds `get_default_file_extensions` helper that wraps `FILETYPE_TO_DEFAULT_EXTENSIONS.get()` and raises a descriptive `ValueError` for unsupported filetypes; clean and correct.
nemo_curator/stages/deduplication/exact/workflow.py	Replaces bare `self.input_file_extensions` with `… or get_default_file_extensions(self.input_filetype)` so `FilePartitioningStage` defaults to filetype-appropriate extensions.
nemo_curator/stages/deduplication/fuzzy/workflow.py	Identical one-line fix to `_create_minhash_pipeline`; mirrors exact workflow change correctly.
nemo_curator/stages/deduplication/semantic/kmeans.py	Switches from `FILETYPE_TO_DEFAULT_EXTENSIONS.get(..., [])` to `get_default_file_extensions`; the subsequent `if not file_extensions:` guard is now dead code since the new helper never returns an empty list.
nemo_curator/stages/text/deduplication/removal_workflow.py	Same `or get_default_file_extensions` fix applied to `TextDuplicatesRemovalWorkflow`; correct.
nemo_curator/stages/text/deduplication/semantic.py	Applies the extension-defaulting fix inside both `if/elif` branches of `_run_embedding_generation` for `JsonlReader` and `ParquetReader`; the `else: raise NotImplementedError` branch means `get_default_file_extensions` is only ever called for filetypes it already knows about.
tests/stages/text/deduplication/test_semantic_file_extensions.py	New test file covering default-to-filetype, explicit-override, and unsupported-filetype error cases for `TextSemanticDeduplicationWorkflow._run_embedding_generation`; uses `pytest.importorskip` and `monkeypatch` cleanly.
tests/stages/deduplication/exact/test_workflow.py	Adds two tests: default-to-filetype and explicit-override for `ExactDeduplicationWorkflow`; accesses `_create_input_filegroups()` to inspect the constructed stages without running the full pipeline.
tests/stages/deduplication/fuzzy/test_fuzzy_workflow.py	Adds two tests mirroring the exact workflow tests for `FuzzyDeduplicationWorkflow`.
tests/stages/text/deduplication/test_removal_workflow.py	Extends the existing parametrized `test_reader_stage` to verify per-filetype extension defaults and adds a separate custom-override test; removes the stale hardcoded assertion against the old `FilePartitioningStage` catch-all list.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Workflow.__init__\ninput_filetype, input_file_extensions] --> B{input_file_extensions\nprovided?}
    B -- Yes --> C[Use input_file_extensions as-is]
    B -- No --> D[get_default_file_extensions\ninput_filetype]
    D --> E{filetype in\nFILETYPE_TO_DEFAULT_EXTENSIONS?}
    E -- Yes --> F[Return extensions list\ne.g. .jsonl .json]
    E -- No --> G[Raise ValueError\nUnsupported filetype]
    C --> H[Pass file_extensions to\nFilePartitioningStage /\nJsonlReader / ParquetReader]
    F --> H

_{Reviews (2): Last reviewed commit: "fix: validate default input file extensi..." | Re-trigger Greptile}

greptile-apps · 2026-05-14T15:12:39Z

+                    file_extensions=(
+                        self.input_file_extensions or FILETYPE_TO_DEFAULT_EXTENSIONS[self.input_filetype]
+                    ),


All four changed files use direct dictionary subscription (FILETYPE_TO_DEFAULT_EXTENSIONS[self.input_filetype]) rather than .get(). If input_filetype is anything other than "jsonl" or "parquet" at runtime (e.g. due to serialisation/deserialisation bypassing the Literal annotation), this raises an opaque KeyError instead of a readable error message. The existing KMeansStage.decompose uses .get(self.input_filetype, []) followed by a descriptive ValueError — the same pattern should be applied here for consistency and better diagnostics. The same change applies to the three other modified workflows.

Suggested change

file_extensions=(

self.input_file_extensions or FILETYPE_TO_DEFAULT_EXTENSIONS[self.input_filetype]

),

file_extensions=(

self.input_file_extensions or FILETYPE_TO_DEFAULT_EXTENSIONS.get(self.input_filetype) or _unsupported_filetype(self.input_filetype)

),

greptile-apps · 2026-05-14T15:12:40Z

@@ -259,7 +259,7 @@ def _run_embedding_generation(self, executor: BaseExecutor) -> list[Task]:
                    + [self.text_field]
                    + (self.metadata_fields or [])
                ),
-                file_extensions=self.input_file_extensions,
+                file_extensions=self.input_file_extensions or FILETYPE_TO_DEFAULT_EXTENSIONS[self.input_filetype],


Missing unit test for semantic.py changes — The three other modified workflows (ExactDeduplicationWorkflow, FuzzyDeduplicationWorkflow, TextDuplicatesRemovalWorkflow) each received new unit tests for both the default-to-filetype and explicit-override cases. TextSemanticDeduplicationWorkflow got neither, yet the same logic change was applied at both the _run_embedding_generation and _run_deduplication call sites. A lightweight test that constructs the workflow with input_filetype="jsonl" and asserts the resolved file_extensions would close this gap without needing a full GPU-backed run.

sarahyurick

Should this include https://github.com/NVIDIA-NeMo/Curator/blob/main/nemo_curator/stages/deduplication/semantic/workflow.py too?

nightcityblade · 2026-05-16T03:16:07Z

Good catch on the semantic workflow path — that should stay aligned with the other dedup workflows. I haven't pushed a follow-up yet, but the missing semantic workflow coverage and tests are the right next changes here.

nightcityblade · 2026-05-16T15:08:19Z

Thanks for the review — I pushed a follow-up in 00d85ce.

Addressed:

Replaced the direct FILETYPE_TO_DEFAULT_EXTENSIONS[...] lookups with a shared get_default_file_extensions() helper that raises a clear ValueError for unsupported filetypes.
Applied the same helper to KMeansStage, which is where nemo_curator/stages/deduplication/semantic/workflow.py delegates input partitioning for the semantic workflow.
Added lightweight unit coverage for TextSemanticDeduplicationWorkflow._run_embedding_generation to verify jsonl/parquet defaults and explicit extension overrides.

Validation:

uv run ruff check on the changed files: passed
uv run ruff format --check on the changed files: passed
python3 -m py_compile on the changed Python files: passed
Targeted pytest could not run on this local Darwin runner because NeMo-Curator raises at import time on non-Linux systems (NeMo-Curator currently only supports Linux systems).

sarahyurick · 2026-05-18T16:35:13Z

I don't think a new file needs to be created for this, it should just be added to an existing test file right?

fix: default workflow input extensions by filetype

31ba333

Signed-off-by: nightcityblade <[email protected]>

nightcityblade requested review from a team, ayushdg and praateekmahajan as code owners May 14, 2026 15:10

github-actions Bot added the community-request label May 14, 2026

greptile-apps Bot reviewed May 14, 2026

View reviewed changes

sarahyurick reviewed May 15, 2026

View reviewed changes

svcnvidia-nemo-ci added the waiting-on-customer Waiting on the original author to respond label May 15, 2026

fix: validate default input file extensions

00d85ce

svcnvidia-nemo-ci removed the waiting-on-customer Waiting on the original author to respond label May 16, 2026

sarahyurick reviewed May 18, 2026

View reviewed changes

svcnvidia-nemo-ci added the waiting-on-customer Waiting on the original author to respond label May 18, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: default workflow input extensions by filetype#1982

fix: default workflow input extensions by filetype#1982
nightcityblade wants to merge 2 commits into
NVIDIA-NeMo:mainfrom
nightcityblade:fix/issue-1330

nightcityblade commented May 14, 2026

Uh oh!

copy-pr-bot Bot commented May 14, 2026

Uh oh!

greptile-apps Bot commented May 14, 2026 •

edited

Loading

Uh oh!

greptile-apps Bot May 14, 2026

Uh oh!

greptile-apps Bot May 14, 2026

Uh oh!

sarahyurick left a comment

Uh oh!

nightcityblade commented May 16, 2026

Uh oh!

nightcityblade commented May 16, 2026

Uh oh!

sarahyurick May 18, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

nightcityblade commented May 14, 2026

Description

Usage

Checklist

Uh oh!

copy-pr-bot Bot commented May 14, 2026

Uh oh!

greptile-apps Bot commented May 14, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Summary

Confidence Score: 5/5

Important Files Changed

Flowchart

Uh oh!

greptile-apps Bot May 14, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot May 14, 2026

Choose a reason for hiding this comment

Uh oh!

sarahyurick left a comment

Choose a reason for hiding this comment

Uh oh!

nightcityblade commented May 16, 2026

Uh oh!

nightcityblade commented May 16, 2026

Uh oh!

sarahyurick May 18, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

greptile-apps Bot commented May 14, 2026 •

edited

Loading