Skip to content

fix: default workflow input extensions by filetype#1982

Open
nightcityblade wants to merge 2 commits into
NVIDIA-NeMo:mainfrom
nightcityblade:fix/issue-1330
Open

fix: default workflow input extensions by filetype#1982
nightcityblade wants to merge 2 commits into
NVIDIA-NeMo:mainfrom
nightcityblade:fix/issue-1330

Conversation

@nightcityblade
Copy link
Copy Markdown
Contributor

Description

Fixes #1330.

Workflows that expose input_file_extensions now default to the extensions for their selected input_filetype when no explicit override is provided, instead of falling back to the generic FilePartitioningStage defaults.

Usage

from nemo_curator.stages.deduplication.exact.workflow import ExactDeduplicationWorkflow

workflow = ExactDeduplicationWorkflow(
    input_path="/data/input",
    output_path="/data/output",
    input_filetype="jsonl",
)
# Defaults to [".jsonl", ".json"] unless input_file_extensions is set.

Checklist

  • I am familiar with the Contributing Guide.
  • New or Existing tests cover these changes.
  • The documentation is up to date with these changes.

Tests:

  • uv run ruff check nemo_curator/stages/deduplication/exact/workflow.py nemo_curator/stages/deduplication/fuzzy/workflow.py nemo_curator/stages/text/deduplication/removal_workflow.py nemo_curator/stages/text/deduplication/semantic.py tests/stages/deduplication/exact/test_workflow.py tests/stages/deduplication/fuzzy/test_fuzzy_workflow.py tests/stages/text/deduplication/test_removal_workflow.py
  • uv run pytest tests/stages/text/deduplication/test_removal_workflow.py::TestTextDuplicatesRemovalWorkflowGenerateStages -q (blocked locally: NeMo-Curator raises on non-Linux/darwin)

@copy-pr-bot
Copy link
Copy Markdown

copy-pr-bot Bot commented May 14, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented May 14, 2026

Greptile Summary

This PR fixes workflows that expose input_file_extensions so they default to the extensions appropriate for their input_filetype (e.g. [".jsonl", ".json"] for "jsonl") instead of falling back to FilePartitioningStage's catch-all list. It introduces a shared get_default_file_extensions helper in file_utils.py and applies it consistently across five workflows.

  • Core helper (file_utils.py): new get_default_file_extensions(input_filetype) raises a descriptive ValueError for unknown filetypes rather than silently returning [].
  • Five workflows updated: ExactDeduplicationWorkflow, FuzzyDeduplicationWorkflow, KMeansStage, TextDuplicatesRemovalWorkflow, and TextSemanticDeduplicationWorkflow all use self.input_file_extensions or get_default_file_extensions(self.input_filetype).
  • Test coverage: each changed workflow gains default-to-filetype and explicit-override tests; TextSemanticDeduplicationWorkflow is covered by the new test_semantic_file_extensions.py using monkeypatch to avoid GPU dependencies.

Confidence Score: 5/5

Safe to merge — the change is a targeted fix with no breaking surface and good test coverage across all five modified workflows.

All five workflows receive an equivalent, well-tested fix. The shared helper is simple and correct. The only finding is dead code left behind in KMeansStage.decompose from the refactoring, which has no runtime impact.

nemo_curator/stages/deduplication/semantic/kmeans.py — the if not file_extensions: guard is now unreachable and can be cleaned up.

Important Files Changed

Filename Overview
nemo_curator/utils/file_utils.py Adds get_default_file_extensions helper that wraps FILETYPE_TO_DEFAULT_EXTENSIONS.get() and raises a descriptive ValueError for unsupported filetypes; clean and correct.
nemo_curator/stages/deduplication/exact/workflow.py Replaces bare self.input_file_extensions with … or get_default_file_extensions(self.input_filetype) so FilePartitioningStage defaults to filetype-appropriate extensions.
nemo_curator/stages/deduplication/fuzzy/workflow.py Identical one-line fix to _create_minhash_pipeline; mirrors exact workflow change correctly.
nemo_curator/stages/deduplication/semantic/kmeans.py Switches from FILETYPE_TO_DEFAULT_EXTENSIONS.get(..., []) to get_default_file_extensions; the subsequent if not file_extensions: guard is now dead code since the new helper never returns an empty list.
nemo_curator/stages/text/deduplication/removal_workflow.py Same or get_default_file_extensions fix applied to TextDuplicatesRemovalWorkflow; correct.
nemo_curator/stages/text/deduplication/semantic.py Applies the extension-defaulting fix inside both if/elif branches of _run_embedding_generation for JsonlReader and ParquetReader; the else: raise NotImplementedError branch means get_default_file_extensions is only ever called for filetypes it already knows about.
tests/stages/text/deduplication/test_semantic_file_extensions.py New test file covering default-to-filetype, explicit-override, and unsupported-filetype error cases for TextSemanticDeduplicationWorkflow._run_embedding_generation; uses pytest.importorskip and monkeypatch cleanly.
tests/stages/deduplication/exact/test_workflow.py Adds two tests: default-to-filetype and explicit-override for ExactDeduplicationWorkflow; accesses _create_input_filegroups() to inspect the constructed stages without running the full pipeline.
tests/stages/deduplication/fuzzy/test_fuzzy_workflow.py Adds two tests mirroring the exact workflow tests for FuzzyDeduplicationWorkflow.
tests/stages/text/deduplication/test_removal_workflow.py Extends the existing parametrized test_reader_stage to verify per-filetype extension defaults and adds a separate custom-override test; removes the stale hardcoded assertion against the old FilePartitioningStage catch-all list.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Workflow.__init__\ninput_filetype, input_file_extensions] --> B{input_file_extensions\nprovided?}
    B -- Yes --> C[Use input_file_extensions as-is]
    B -- No --> D[get_default_file_extensions\ninput_filetype]
    D --> E{filetype in\nFILETYPE_TO_DEFAULT_EXTENSIONS?}
    E -- Yes --> F[Return extensions list\ne.g. .jsonl .json]
    E -- No --> G[Raise ValueError\nUnsupported filetype]
    C --> H[Pass file_extensions to\nFilePartitioningStage /\nJsonlReader / ParquetReader]
    F --> H
Loading

Reviews (2): Last reviewed commit: "fix: validate default input file extensi..." | Re-trigger Greptile

Comment on lines +155 to +157
file_extensions=(
self.input_file_extensions or FILETYPE_TO_DEFAULT_EXTENSIONS[self.input_filetype]
),
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 All four changed files use direct dictionary subscription (FILETYPE_TO_DEFAULT_EXTENSIONS[self.input_filetype]) rather than .get(). If input_filetype is anything other than "jsonl" or "parquet" at runtime (e.g. due to serialisation/deserialisation bypassing the Literal annotation), this raises an opaque KeyError instead of a readable error message. The existing KMeansStage.decompose uses .get(self.input_filetype, []) followed by a descriptive ValueError — the same pattern should be applied here for consistency and better diagnostics. The same change applies to the three other modified workflows.

Suggested change
file_extensions=(
self.input_file_extensions or FILETYPE_TO_DEFAULT_EXTENSIONS[self.input_filetype]
),
file_extensions=(
self.input_file_extensions or FILETYPE_TO_DEFAULT_EXTENSIONS.get(self.input_filetype) or _unsupported_filetype(self.input_filetype)
),

Comment on lines +248 to +262
@@ -259,7 +259,7 @@ def _run_embedding_generation(self, executor: BaseExecutor) -> list[Task]:
+ [self.text_field]
+ (self.metadata_fields or [])
),
file_extensions=self.input_file_extensions,
file_extensions=self.input_file_extensions or FILETYPE_TO_DEFAULT_EXTENSIONS[self.input_filetype],
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Missing unit test for semantic.py changes — The three other modified workflows (ExactDeduplicationWorkflow, FuzzyDeduplicationWorkflow, TextDuplicatesRemovalWorkflow) each received new unit tests for both the default-to-filetype and explicit-override cases. TextSemanticDeduplicationWorkflow got neither, yet the same logic change was applied at both the _run_embedding_generation and _run_deduplication call sites. A lightweight test that constructs the workflow with input_filetype="jsonl" and asserts the resolved file_extensions would close this gap without needing a full GPU-backed run.

Copy link
Copy Markdown
Contributor

@sarahyurick sarahyurick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@svcnvidia-nemo-ci svcnvidia-nemo-ci added the waiting-on-customer Waiting on the original author to respond label May 15, 2026
@nightcityblade
Copy link
Copy Markdown
Contributor Author

Good catch on the semantic workflow path — that should stay aligned with the other dedup workflows. I haven't pushed a follow-up yet, but the missing semantic workflow coverage and tests are the right next changes here.

@nightcityblade
Copy link
Copy Markdown
Contributor Author

Thanks for the review — I pushed a follow-up in 00d85ce.

Addressed:

  • Replaced the direct FILETYPE_TO_DEFAULT_EXTENSIONS[...] lookups with a shared get_default_file_extensions() helper that raises a clear ValueError for unsupported filetypes.
  • Applied the same helper to KMeansStage, which is where nemo_curator/stages/deduplication/semantic/workflow.py delegates input partitioning for the semantic workflow.
  • Added lightweight unit coverage for TextSemanticDeduplicationWorkflow._run_embedding_generation to verify jsonl/parquet defaults and explicit extension overrides.

Validation:

  • uv run ruff check on the changed files: passed
  • uv run ruff format --check on the changed files: passed
  • python3 -m py_compile on the changed Python files: passed
  • Targeted pytest could not run on this local Darwin runner because NeMo-Curator raises at import time on non-Linux systems (NeMo-Curator currently only supports Linux systems).

@svcnvidia-nemo-ci svcnvidia-nemo-ci removed the waiting-on-customer Waiting on the original author to respond label May 16, 2026
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think a new file needs to be created for this, it should just be added to an existing test file right?

@svcnvidia-nemo-ci svcnvidia-nemo-ci added the waiting-on-customer Waiting on the original author to respond label May 18, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

community-request waiting-on-customer Waiting on the original author to respond

Projects

None yet

Development

Successfully merging this pull request may close these issues.

See if all workflows expose file_extensions for reader i/o

3 participants