fix: default workflow input extensions by filetype#1982
Conversation
Signed-off-by: nightcityblade <[email protected]>
Greptile SummaryThis PR fixes workflows that expose
Confidence Score: 5/5Safe to merge — the change is a targeted fix with no breaking surface and good test coverage across all five modified workflows. All five workflows receive an equivalent, well-tested fix. The shared helper is simple and correct. The only finding is dead code left behind in nemo_curator/stages/deduplication/semantic/kmeans.py — the Important Files Changed
Flowchart%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[Workflow.__init__\ninput_filetype, input_file_extensions] --> B{input_file_extensions\nprovided?}
B -- Yes --> C[Use input_file_extensions as-is]
B -- No --> D[get_default_file_extensions\ninput_filetype]
D --> E{filetype in\nFILETYPE_TO_DEFAULT_EXTENSIONS?}
E -- Yes --> F[Return extensions list\ne.g. .jsonl .json]
E -- No --> G[Raise ValueError\nUnsupported filetype]
C --> H[Pass file_extensions to\nFilePartitioningStage /\nJsonlReader / ParquetReader]
F --> H
Reviews (2): Last reviewed commit: "fix: validate default input file extensi..." | Re-trigger Greptile |
| file_extensions=( | ||
| self.input_file_extensions or FILETYPE_TO_DEFAULT_EXTENSIONS[self.input_filetype] | ||
| ), |
There was a problem hiding this comment.
All four changed files use direct dictionary subscription (
FILETYPE_TO_DEFAULT_EXTENSIONS[self.input_filetype]) rather than .get(). If input_filetype is anything other than "jsonl" or "parquet" at runtime (e.g. due to serialisation/deserialisation bypassing the Literal annotation), this raises an opaque KeyError instead of a readable error message. The existing KMeansStage.decompose uses .get(self.input_filetype, []) followed by a descriptive ValueError — the same pattern should be applied here for consistency and better diagnostics. The same change applies to the three other modified workflows.
| file_extensions=( | |
| self.input_file_extensions or FILETYPE_TO_DEFAULT_EXTENSIONS[self.input_filetype] | |
| ), | |
| file_extensions=( | |
| self.input_file_extensions or FILETYPE_TO_DEFAULT_EXTENSIONS.get(self.input_filetype) or _unsupported_filetype(self.input_filetype) | |
| ), |
| @@ -259,7 +259,7 @@ def _run_embedding_generation(self, executor: BaseExecutor) -> list[Task]: | |||
| + [self.text_field] | |||
| + (self.metadata_fields or []) | |||
| ), | |||
| file_extensions=self.input_file_extensions, | |||
| file_extensions=self.input_file_extensions or FILETYPE_TO_DEFAULT_EXTENSIONS[self.input_filetype], | |||
There was a problem hiding this comment.
Missing unit test for
semantic.py changes — The three other modified workflows (ExactDeduplicationWorkflow, FuzzyDeduplicationWorkflow, TextDuplicatesRemovalWorkflow) each received new unit tests for both the default-to-filetype and explicit-override cases. TextSemanticDeduplicationWorkflow got neither, yet the same logic change was applied at both the _run_embedding_generation and _run_deduplication call sites. A lightweight test that constructs the workflow with input_filetype="jsonl" and asserts the resolved file_extensions would close this gap without needing a full GPU-backed run.
|
Good catch on the semantic workflow path — that should stay aligned with the other dedup workflows. I haven't pushed a follow-up yet, but the missing semantic workflow coverage and tests are the right next changes here. |
|
Thanks for the review — I pushed a follow-up in 00d85ce. Addressed:
Validation:
|
There was a problem hiding this comment.
I don't think a new file needs to be created for this, it should just be added to an existing test file right?
Description
Fixes #1330.
Workflows that expose
input_file_extensionsnow default to the extensions for their selectedinput_filetypewhen no explicit override is provided, instead of falling back to the genericFilePartitioningStagedefaults.Usage
Checklist
Tests:
uv run ruff check nemo_curator/stages/deduplication/exact/workflow.py nemo_curator/stages/deduplication/fuzzy/workflow.py nemo_curator/stages/text/deduplication/removal_workflow.py nemo_curator/stages/text/deduplication/semantic.py tests/stages/deduplication/exact/test_workflow.py tests/stages/deduplication/fuzzy/test_fuzzy_workflow.py tests/stages/text/deduplication/test_removal_workflow.pyuv run pytest tests/stages/text/deduplication/test_removal_workflow.py::TestTextDuplicatesRemovalWorkflowGenerateStages -q(blocked locally: NeMo-Curator raises on non-Linux/darwin)