minimum length of encoder output for CTC training #2051
KarelVesely84 wants to merge 1 commit into k2-fsa:master from
Conversation
📝 Walkthrough: Refactors ctc_align.py to accept a single dataset manifest.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant CLI as CLI (ctc_align)
    participant Loader as CutSet Loader
    participant Filter as Cut Filter
    participant Aligner as Aligner
    participant Saver as Output Saver
    CLI->>Loader: load dataset_manifest -> dataset_cuts / dataloader
    Loader->>Filter: provide cuts
    Filter->>Filter: apply max_utt_duration and min_T token/frame checks\nreturn kept_cuts and removed_cut_ids
    Filter->>Aligner: kept_cuts -> create dataloader
    Aligner->>Aligner: run alignment -> results_dict
    Aligner->>Saver: results_dict, dataset_label, removed_cut_ids
    Saver->>Saver: write alignment outputs and placeholder lines for removed_cut_ids
    Saver-->>CLI: completion / output paths
```
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
Actionable comments posted: 2
🧹 Nitpick comments (1)
egs/librispeech/ASR/zipformer/ctc_align.py (1)
657-675: The logic is correct, but multiple iterations over the CutSet may be inefficient. The code iterates over dataset_cuts three times:
1. list(dataset_cuts.ids)
2. filter(remove_long_transcripts)
3. list(dataset_cuts.ids) again after filtering

For lazy CutSets backed by files this works, but consider collecting IDs during the filter pass if performance becomes a concern with large manifests.
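The single-pass idea could be sketched like this. The helper name is hypothetical, and the namedtuple is only a tiny stand-in for lhotse cuts, which carry many more fields:

```python
from collections import namedtuple

def filter_and_track(cuts, keep_fn):
    """Split cuts in one pass: return (kept_cuts, removed_cut_ids)."""
    kept, removed_ids = [], []
    for cut in cuts:
        if keep_fn(cut):
            kept.append(cut)
        else:
            removed_ids.append(cut.id)
    return kept, removed_ids

# Minimal stand-in for lhotse Cut objects.
Cut = namedtuple("Cut", "id duration")
cuts = [Cut("cut-1", 5.0), Cut("cut-2", 75.0), Cut("cut-3", 12.0)]
kept, removed = filter_and_track(cuts, lambda c: c.duration <= 60.0)
# kept holds cut-1 and cut-3; removed holds ["cut-2"]
```

This avoids materializing the full ID list twice, at the cost of eagerly consuming a lazy CutSet.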
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- egs/librispeech/ASR/zipformer/ctc_align.py (6 hunks)
- egs/librispeech/ASR/zipformer/train.py (3 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
egs/librispeech/ASR/zipformer/ctc_align.py (1)
egs/librispeech/ASR/tdnn_lstm_ctc/asr_datamodule.py (2)
load_manifest (406-411), test_dataloaders (383-403)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (59)
- GitHub Check: py3.10 torch1.13.1 v20250630
- GitHub Check: py3.10 torch2.0.0 v20250630
- GitHub Check: py3.10 torch2.2.1 v20250630
- GitHub Check: py3.10 torch2.2.2 v20250630
🔇 Additional comments (5)
egs/librispeech/ASR/zipformer/train.py (3)
66-66: LGTM! The numpy import is needed for the new token repeat counting logic below.
388-392: LGTM! Formatting-only change with no semantic impact.
414-431: LGTM! The CTC-aware minimum frame logic is correct. The calculation properly accounts for the fact that valid CTC alignments require blank tokens between consecutive identical symbols, making the minimum encoder output length len(tokens) + num_repeats.
egs/librispeech/ASR/zipformer/ctc_align.py (2)
55-55: LGTM! Good addition of the Cut import for type usage, and the new CLI arguments cleanly support single-manifest processing with duration filtering. Also applies to: 169-179
468-475: LGTM! The placeholder entries with -2.0 sentinel values provide a clear marker for downstream processing to identify cuts that were filtered out, distinct from normal confidence values (0-1) and the "no scores" marker (-1.0).
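A sketch of how downstream code might branch on these markers; the helper name is illustrative and not part of the repository:

```python
def classify_confidence(value: float) -> str:
    """Interpret the confidence markers written by the alignment output:
    -2.0 -> cut was filtered out before alignment,
    -1.0 -> aligned but no scores available,
    [0, 1] -> ordinary per-token confidence."""
    if value == -2.0:
        return "filtered-out"
    if value == -1.0:
        return "no-scores"
    if 0.0 <= value <= 1.0:
        return "normal"
    raise ValueError(f"unexpected confidence marker: {value}")
```

Because the sentinels sit strictly outside the valid score range, a consumer can never confuse a removed cut with a low-confidence one.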
💡 Codex Review
Here are some automated review suggestions for this pull request.
```python
def save_alignment_output(
    params: AttributeDict,
    test_set_name: str,
    dataset_name: str,
    results_dict: Dict[str, List[Tuple[str, List[str], List[str]]]],
    removed_cut_ids: list[str],
```
Keep Python 3.6–3.8 compatibility in ctc_align signature
The new save_alignment_output signature uses a PEP 585 type hint (removed_cut_ids: list[str]), but this module doesn't enable from __future__ import annotations, and the project advertises python_requires=">=3.6.0" in setup.py. On Python 3.6–3.8 the expression list[str] raises TypeError: 'type' object is not subscriptable when the file is imported, so the forced-alignment script will fail to start on those versions. Consider reverting to typing.List[str] or deferring annotation evaluation to preserve the stated interpreter support.
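The 3.6-compatible spelling can be illustrated with a small sketch; the function name is a hypothetical stand-in for the real save_alignment_output:

```python
# typing.List works at import time on Python 3.6+, whereas the built-in
# generic list[str] needs Python 3.9+ (or `from __future__ import
# annotations` at the top of the module, which defers annotation
# evaluation per PEP 563).
from typing import List

def save_alignment_output_sketch(removed_cut_ids: List[str]) -> int:
    # Hypothetical body: just count the removed cuts.
    return len(removed_cut_ids)
```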
Force-pushed: 3431f6a to 17d1fe1
Actionable comments posted: 0
🧹 Nitpick comments (2)
egs/librispeech/ASR/zipformer/ctc_align.py (2)
633-655: Consider documenting the subsampling formula. The formula ((c.num_frames - 7) // 2 + 1) // 2 at line 641 is specific to the zipformer encoder's subsampling architecture (7-wide kernel with two stride-2 layers). A brief inline comment explaining this derivation would help maintainability if the encoder architecture changes.
The CTC minimum-frames logic (T < len(tokens) + num_repeats) correctly implements the requirement from the PR objectives: valid CTC alignments need blanks between consecutive identical tokens. Also, the warning message fix ("from alignment" instead of "from training") addresses the previous review feedback.
🔎 Suggested documentation

```diff
  T = ((c.num_frames - 7) // 2 + 1) // 2
+ # ^ Zipformer subsampling: Conv(kernel=7, stride=2) → stride-2 pooling
  tokens = np.array(sp.encode(c.supervisions[0].text, out_type=str))
```
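Under the assumption stated in the comment (a kernel-7, stride-2 convolution followed by a second stride-2 stage), the frame estimate can be sketched as a standalone helper; the function name is illustrative:

```python
def encoder_num_frames(num_feature_frames: int) -> int:
    """Estimate zipformer encoder output length from input feature frames,
    assuming Conv(kernel=7, stride=2) followed by a second stride-2 stage."""
    return ((num_feature_frames - 7) // 2 + 1) // 2

# e.g. a 10-second utterance at 100 feature frames per second:
T = encoder_num_frames(1000)  # -> 248 encoder frames
```

Keeping the formula in one named helper would also make it easier to update if the encoder front-end changes.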
657-659: Remove redundant list() wrappers. set() can directly consume any iterable, so the intermediate list() calls are unnecessary and add memory overhead.

🔎 Proposed fix

```diff
- cut_ids_orig = set(list(dataset_cuts.ids))
+ cut_ids_orig = set(dataset_cuts.ids)
  dataset_cuts = dataset_cuts.filter(remove_long_transcripts)
- cut_ids_removed = cut_ids_orig - set(list(dataset_cuts.ids))
+ cut_ids_removed = cut_ids_orig - set(dataset_cuts.ids)
```
📒 Files selected for processing (2)
- egs/librispeech/ASR/zipformer/ctc_align.py (6 hunks)
- egs/librispeech/ASR/zipformer/train.py (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- egs/librispeech/ASR/zipformer/train.py
🧰 Additional context used
🧬 Code graph analysis (1)
egs/librispeech/ASR/zipformer/ctc_align.py (1)
egs/librispeech/ASR/tdnn_lstm_ctc/asr_datamodule.py (2)
load_manifest (406-411), test_dataloaders (383-403)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (90)
- GitHub Check: py3.10 torch1.13.1 v20250630
- GitHub Check: py3.10 torch2.0.1 v20250630
- GitHub Check: py3.10 torch2.1.0 v20250630
- GitHub Check: py3.10 torch2.1.1 v20250630
- GitHub Check: py3.10 torch2.1.2 v20250630
- GitHub Check: py3.10 torch2.2.0 v20250630
- GitHub Check: py3.10 torch2.2.1 v20250630
- GitHub Check: py3.10 torch2.4.1 v20250630
- GitHub Check: py3.10 torch2.5.0 v20250630
- GitHub Check: py3.10 torch2.6.0 v20250630
- GitHub Check: py3.10 torch2.7.0 v20250630
- GitHub Check: py3.10 torch2.7.1 v20250630
🔇 Additional comments (6)
egs/librispeech/ASR/zipformer/ctc_align.py (6)
47-55: LGTM! The new imports are correctly added and used: Set is used in the function signature at line 404, and Cut is used for type hinting in the filter function at line 633.
169-180: LGTM! The new --max-utt-duration argument provides a sensible default of 60 seconds, and switching to a single positional dataset_manifest simplifies the interface for the typical single-dataset alignment workflow.
400-404: LGTM! The signature now correctly uses Set[str] from the typing module, which matches the actual type passed by the caller (cut_ids_removed is a set) and maintains compatibility with Python versions prior to 3.9. This addresses the previous review concern about the type mismatch.
468-476: LGTM! Using -2.0 as sentinel values for removed cuts is a good choice since valid confidence scores are in the [0, 1] range, making filtered entries easily identifiable in downstream analysis.
628-631: LGTM! The dataset label extraction from the manifest path and the single-manifest loading flow are clean and straightforward.
661-675: LGTM! The alignment workflow is well-structured: load manifest → filter cuts → create dataloader → align → save results with removed cut tracking.
There might be consecutive repetitions of a symbol in the reference, and for these the CTC alignment must put a blank in between, so that the reverse mapping of aligned symbols reproduces the original reference. I realised it recently while playing with the CTC aligner from torchaudio on the noisy yodas2 dataset. To illustrate, for "a a b c c d e f":
- len(tokens) is 8
- but, because of the duplications 'a a' and 'c c', the minimum length of the encoder output is 10
- the shortest valid CTC alignment is: "a ∅ a b c ∅ c d e f"
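The rule above can be checked in a few lines. The helper name is illustrative; train.py computes the repeat count with numpy, but plain Python shows the idea:

```python
def min_ctc_frames(tokens):
    """Minimum encoder-output length for a valid CTC alignment:
    len(tokens) plus one blank for each pair of consecutive
    identical tokens."""
    num_repeats = sum(1 for a, b in zip(tokens, tokens[1:]) if a == b)
    return len(tokens) + num_repeats

tokens = "a a b c c d e f".split()
# 8 tokens with 2 repeated pairs ('a a', 'c c') -> minimum length 10,
# matching the shortest valid alignment "a ∅ a b c ∅ c d e f".
```

Utterances whose encoder output falls below this bound cannot yield a valid CTC alignment, which is why the PR filters them out of training.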
Force-pushed: 17d1fe1 to 6d2280d
🧹 Nitpick comments (1)
egs/librispeech/ASR/zipformer/ctc_align.py (1)
657-659: Minor: redundant list() calls inside set() constructor. set() accepts any iterable directly, so list() is unnecessary.

♻️ Proposed simplification

```diff
- cut_ids_orig = set(list(dataset_cuts.ids))
+ cut_ids_orig = set(dataset_cuts.ids)
  dataset_cuts = dataset_cuts.filter(remove_long_transcripts)
- cut_ids_removed = cut_ids_orig - set(list(dataset_cuts.ids))
+ cut_ids_removed = cut_ids_orig - set(dataset_cuts.ids)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@egs/librispeech/ASR/zipformer/ctc_align.py` around lines 657 - 659, The code creates sets from dataset_cuts.ids but wraps the iterables in redundant list() calls; simplify by passing the iterable directly to set() for cut_ids_orig and when computing cut_ids_removed (i.e., replace set(list(dataset_cuts.ids)) with set(dataset_cuts.ids)), keeping the same variables and the filter(dataset_cuts.filter(remove_long_transcripts)) call unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@egs/librispeech/ASR/zipformer/ctc_align.py`:
- Around line 657-659: The code creates sets from dataset_cuts.ids but wraps the
iterables in redundant list() calls; simplify by passing the iterable directly
to set() for cut_ids_orig and when computing cut_ids_removed (i.e., replace
set(list(dataset_cuts.ids)) with set(dataset_cuts.ids)), keeping the same
variables and the filter(dataset_cuts.filter(remove_long_transcripts)) call
unchanged.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 7e6e4908-beca-42c8-8a90-f54cb3ca0744
📒 Files selected for processing (2)
- egs/librispeech/ASR/zipformer/ctc_align.py
- egs/librispeech/ASR/zipformer/train.py
Hi, I have rebased this PR on top of the current HEAD. Could someone look into this? It should be straightforward, solving an issue with CTC training properly... Best regards
Thanks! Looks good to me. @yaozengwei Can you also have a look?
Hello Fangyun,
sending a PR related to the CTC training.
First please have a look into egs/librispeech/ASR/zipformer/train.py: There might be consecutive repetitions of a symbol in the reference, and for these the CTC alignment must put a blank in between, so that the reverse mapping of aligned symbols reproduces the original reference.
I realised it recently while playing with the CTC aligner from torchaudio on the noisy yodas2 dataset.
To illustrate, for "a a b c c d e f":
- len(tokens) is 8
- but, because of the duplications 'a a' and 'c c', the minimum length of the encoder output is 10
- the shortest valid CTC alignment is: "a ∅ a b c ∅ c d e f"
Also updating egs/librispeech/ASR/zipformer/ctc_align.py, which I originally contributed...
Best regards,
Karel