
minimum length of encoder output for CTC training#2051

Open
KarelVesely84 wants to merge 1 commit into k2-fsa:master from KarelVesely84:ctc_min_frames

Conversation

@KarelVesely84
Contributor

@KarelVesely84 KarelVesely84 commented Dec 19, 2025

Hello Fangyun,
sending a PR related to CTC training.

First please have a look into egs/librispeech/ASR/zipformer/train.py:

There might be consecutive repetitions of a symbol in the reference; for these, the CTC alignment must insert a blank in between, so that the reverse mapping of the aligned symbols reproduces the original reference.

I realised it recently while playing with the CTC aligner from torchaudio on the noisy yodas2 dataset.

To illustrate:

"a a b c c d e f"

  • len(tokens) is 8
  • but, because of the duplications 'a a' and 'c c',
  • the minimum length of the encoder output is 10
  • the shortest valid CTC alignment is: "a ∅ a b c ∅ c d e f"
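The arithmetic above can be sketched directly (a minimal illustration; `min_ctc_frames` is a hypothetical helper, not code from the PR):

```python
import numpy as np


def min_ctc_frames(tokens):
    """Minimum encoder-output length for a valid CTC alignment:
    one frame per token, plus one blank frame between every pair
    of consecutive identical tokens."""
    t = np.array(tokens)
    num_repeats = int(np.sum(t[1:] == t[:-1]))
    return len(t) + num_repeats


tokens = "a a b c c d e f".split()
print(len(tokens), min_ctc_frames(tokens))  # 8 10
```

For the example reference, the two adjacent duplicates ('a a' and 'c c') each cost one extra blank frame, giving 8 + 2 = 10.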

I am also updating egs/librispeech/ASR/zipformer/ctc_align.py, which I originally contributed...

Best regards,
Karel

Summary by CodeRabbit

  • New Features

    • Added --max-utt-duration flag to filter utterances by duration.
  • Improvements

    • Alignment now accepts a single dataset manifest, runs once per dataset, and names outputs by dataset.
    • Alignment records IDs of removed cuts and appends placeholder confidence entries for them.
    • Training/filtering uses a token-aware minimum-frame threshold (accounts for repeated tokens); warnings now report this threshold.

@coderabbitai

coderabbitai bot commented Dec 19, 2025

📝 Walkthrough

Walkthrough

Refactors ctc_align.py to accept a single dataset_manifest, adds --max-utt-duration and cut-filtering that records removed cut IDs, changes save_alignment_output signature to include removed_cut_ids, and updates train.py filtering to compute a CTC-aware min_T using NumPy token encoding and repeated-token counts.

Changes

Cohort / File(s) Summary
CTC alignment
egs/librispeech/ASR/zipformer/ctc_align.py
Replaced multi-manifest loop with a single dataset_manifest positional arg; added --max-utt-duration; introduced remove_long_transcripts predicate (duration and subsampled-frame vs token checks) that produces kept and removed cut IDs; changed save_alignment_output(params, dataset_name, results_dict, removed_cut_ids) and updated output naming/logging to use dataset_label.
Training filtering
egs/librispeech/ASR/zipformer/train.py
Imported numpy as np; modified remove_short_and_long_utt to encode supervision tokens as a NumPy array, compute num_repeats of consecutive identical tokens when use_ctc is enabled, set min_T = len(tokens) + num_repeats for CTC (else min_T = len(tokens)), replaced previous T check with T < min_T, and updated warning messages to include min_T.

Sequence Diagram(s)

sequenceDiagram
    participant CLI as CLI (ctc_align)
    participant Loader as CutSet Loader
    participant Filter as Cut Filter
    participant Aligner as Aligner
    participant Saver as Output Saver

    CLI->>Loader: load dataset_manifest -> dataset_cuts / dataloader
    Loader->>Filter: provide cuts
    Filter->>Filter: apply max_utt_duration and min_T token/frame checks\nreturn kept_cuts and removed_cut_ids
    Filter->>Aligner: kept_cuts -> create dataloader
    Aligner->>Aligner: run alignment -> results_dict
    Aligner->>Saver: results_dict, dataset_label, removed_cut_ids
    Saver->>Saver: write alignment outputs and placeholder lines for removed_cut_ids
    Saver-->>CLI: completion / output paths

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

  • zipformer/ctc_align.py #2020 — touches the same egs/librispeech/ASR/zipformer/ctc_align.py area (CLI/parser and alignment output logic), likely directly related to the signature and single-manifest flow changes.

Poem

🐰
I hopped through cuts both short and long,
Kept the neat ones, nudged out the wrong.
Tokens repeat, frames count in tune,
One manifest aligned beneath the moon.
Carrots for logs, a soft bug-free tune. 🥕

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning — docstring coverage is 28.57%, below the required 80.00% threshold. Resolution: write docstrings for the functions missing them.
✅ Passed checks (2 passed)
  • Description Check ✅ Passed — check skipped; CodeRabbit's high-level summary is enabled.
  • Title Check ✅ Passed — the title directly matches the main objective: enforcing a minimum encoder-output length for CTC training when references contain consecutive repeated symbols.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (1)
egs/librispeech/ASR/zipformer/ctc_align.py (1)

657-675: The logic is correct, but multiple iterations over the CutSet may be inefficient.

The code iterates over dataset_cuts three times:

  1. list(dataset_cuts.ids)
  2. filter(remove_long_transcripts)
  3. list(dataset_cuts.ids) again after filtering

For lazy CutSets backed by files this works, but consider collecting IDs during the filter pass if performance becomes a concern with large manifests.

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0904e49 and 3431f6a.

📒 Files selected for processing (2)
  • egs/librispeech/ASR/zipformer/ctc_align.py (6 hunks)
  • egs/librispeech/ASR/zipformer/train.py (3 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
egs/librispeech/ASR/zipformer/ctc_align.py (1)
egs/librispeech/ASR/tdnn_lstm_ctc/asr_datamodule.py (2)
  • load_manifest (406-411)
  • test_dataloaders (383-403)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (59)
  • GitHub Check: py3.10 torch1.13.1 v20250630
  • GitHub Check: py3.10 torch2.0.0 v20250630
  • GitHub Check: py3.10 torch2.2.1 v20250630
  • GitHub Check: py3.10 torch2.2.2 v20250630
🔇 Additional comments (5)
egs/librispeech/ASR/zipformer/train.py (3)

66-66: LGTM!

The numpy import is needed for the new token repeat counting logic below.


388-392: LGTM!

Formatting-only change with no semantic impact.


414-431: LGTM! The CTC-aware minimum frame logic is correct.

The calculation properly accounts for the fact that valid CTC alignments require blank tokens between consecutive identical symbols, making the minimum encoder output length len(tokens) + num_repeats.

egs/librispeech/ASR/zipformer/ctc_align.py (2)

55-55: LGTM!

Good addition of the Cut import for type usage, and the new CLI arguments cleanly support single-manifest processing with duration filtering.

Also applies to: 169-179


468-475: LGTM!

The placeholder entries with -2.0 sentinel values provide a clear marker for downstream processing to identify cuts that were filtered out, distinct from normal confidence values (0-1) and the "no scores" marker (-1.0).


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.


Comment on lines +400 to +404
def save_alignment_output(
params: AttributeDict,
test_set_name: str,
dataset_name: str,
results_dict: Dict[str, List[Tuple[str, List[str], List[str]]]],
removed_cut_ids: list[str],


P1 Badge Keep Python 3.6–3.8 compatibility in ctc_align signature

The new save_alignment_output signature uses a PEP 585 type hint (removed_cut_ids: list[str]), but this module doesn't enable from __future__ import annotations, and the project advertises python_requires=">=3.6.0" in setup.py. On Python 3.6–3.8 the expression list[str] raises TypeError: 'type' object is not subscriptable when the file is imported, so the forced-alignment script will fail to start on those versions. Consider reverting to typing.List[str] or deferring annotation evaluation to preserve the stated interpreter support.
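One way to keep the stated interpreter support, sketched with the typing module (the function body is a stub for illustration, not the PR's code):

```python
from typing import Dict, List, Set, Tuple


# typing.Set[str] stays valid on Python 3.6-3.8, where the built-in
# generics set[str] / list[str] raise TypeError at import time
# unless `from __future__ import annotations` is in effect.
def save_alignment_output_stub(
    results_dict: Dict[str, List[Tuple[str, List[str], List[str]]]],
    removed_cut_ids: Set[str],
) -> int:
    # Stub: just report how many placeholder entries would be written.
    return len(removed_cut_ids)


print(save_alignment_output_stub({}, {"cut-1", "cut-2"}))  # 2
```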


Contributor Author


fixed...


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (2)
egs/librispeech/ASR/zipformer/ctc_align.py (2)

633-655: Consider documenting the subsampling formula.

The formula ((c.num_frames - 7) // 2 + 1) // 2 at line 641 is specific to the zipformer encoder's subsampling architecture (7-wide kernel with two stride-2 layers). A brief inline comment explaining this derivation would help maintainability if the encoder architecture changes.

The CTC minimum-frames logic (T < len(tokens) + num_repeats) correctly implements the requirement from the PR objectives—valid CTC alignments need blanks between consecutive identical tokens.

Also, the warning message fix ("from alignment" instead of "from training") addresses the previous review feedback.

🔎 Suggested documentation
         T = ((c.num_frames - 7) // 2 + 1) // 2
+        # ^ Zipformer subsampling: Conv(kernel=7, stride=2) → stride-2 pooling
         tokens = np.array(sp.encode(c.supervisions[0].text, out_type=str))
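The quoted formula can be sanity-checked in isolation (assuming the two stride-2 stages the comment describes):

```python
def subsampled_len(num_frames):
    # Stage 1: conv with kernel size 7, stride 2 (valid padding)
    # Stage 2: a further stride-2 reduction
    return ((num_frames - 7) // 2 + 1) // 2


for n in (100, 1000):
    print(n, "->", subsampled_len(n))  # 100 -> 23, 1000 -> 248
```

As expected, the output length is roughly a quarter of the input length, which is why cuts need about four input frames per required encoder-output frame.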

657-659: Remove redundant list() wrappers.

set() can directly consume any iterable, so the intermediate list() calls are unnecessary and add memory overhead.

🔎 Proposed fix
-    cut_ids_orig = set(list(dataset_cuts.ids))
+    cut_ids_orig = set(dataset_cuts.ids)
     dataset_cuts = dataset_cuts.filter(remove_long_transcripts)
-    cut_ids_removed = cut_ids_orig - set(list(dataset_cuts.ids))
+    cut_ids_removed = cut_ids_orig - set(dataset_cuts.ids)
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 3431f6a and 17d1fe1.

📒 Files selected for processing (2)
  • egs/librispeech/ASR/zipformer/ctc_align.py (6 hunks)
  • egs/librispeech/ASR/zipformer/train.py (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • egs/librispeech/ASR/zipformer/train.py
🧰 Additional context used
🧬 Code graph analysis (1)
egs/librispeech/ASR/zipformer/ctc_align.py (1)
egs/librispeech/ASR/tdnn_lstm_ctc/asr_datamodule.py (2)
  • load_manifest (406-411)
  • test_dataloaders (383-403)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (90)
  • GitHub Check: py3.10 torch1.13.1 v20250630
  • GitHub Check: py3.10 torch2.0.1 v20250630
  • GitHub Check: py3.10 torch2.1.0 v20250630
  • GitHub Check: py3.10 torch2.1.1 v20250630
  • GitHub Check: py3.10 torch2.1.2 v20250630
  • GitHub Check: py3.10 torch2.2.0 v20250630
  • GitHub Check: py3.10 torch2.2.1 v20250630
  • GitHub Check: py3.10 torch2.4.1 v20250630
  • GitHub Check: py3.10 torch2.5.0 v20250630
  • GitHub Check: py3.10 torch2.6.0 v20250630
  • GitHub Check: py3.10 torch2.7.0 v20250630
  • GitHub Check: py3.10 torch2.7.1 v20250630
🔇 Additional comments (6)
egs/librispeech/ASR/zipformer/ctc_align.py (6)

47-55: LGTM!

The new imports are correctly added and used: Set is used in the function signature at line 404, and Cut is used for type hinting in the filter function at line 633.


169-180: LGTM!

The new --max-utt-duration argument provides a sensible default of 60 seconds, and switching to a single positional dataset_manifest simplifies the interface for the typical single-dataset alignment workflow.


400-404: LGTM!

The signature now correctly uses Set[str] from the typing module, which matches the actual type passed by the caller (cut_ids_removed is a set) and maintains compatibility with Python versions prior to 3.9. This addresses the previous review concern about the type mismatch.


468-476: LGTM!

Using -2.0 as sentinel values for removed cuts is a good choice since valid confidence scores are in the [0, 1] range, making filtered entries easily identifiable in downstream analysis.


628-631: LGTM!

The dataset label extraction from the manifest path and the single-manifest loading flow are clean and straightforward.


661-675: LGTM!

The alignment workflow is well-structured: load manifest → filter cuts → create dataloader → align → save results with removed cut tracking.


@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (1)
egs/librispeech/ASR/zipformer/ctc_align.py (1)

657-659: Minor: redundant list() calls inside set() constructor.

set() accepts any iterable directly, so list() is unnecessary.

♻️ Proposed simplification
-    cut_ids_orig = set(list(dataset_cuts.ids))
+    cut_ids_orig = set(dataset_cuts.ids)
     dataset_cuts = dataset_cuts.filter(remove_long_transcripts)
-    cut_ids_removed = cut_ids_orig - set(list(dataset_cuts.ids))
+    cut_ids_removed = cut_ids_orig - set(dataset_cuts.ids)

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 7e6e4908-beca-42c8-8a90-f54cb3ca0744

📥 Commits

Reviewing files that changed from the base of the PR and between 17d1fe1 and 6d2280d.

📒 Files selected for processing (2)
  • egs/librispeech/ASR/zipformer/ctc_align.py
  • egs/librispeech/ASR/zipformer/train.py

@KarelVesely84
Contributor Author

Hi, I have rebased this PR on top of the current HEAD. Could someone look into this?
The important change is in remove_short_and_long_utt() in zipformer/train.py.

It should be straightforward, solving an issue with CTC training properly...

Best regards
Karel from BUT

@csukuangfj
Collaborator

Thanks! Looks good to me.

@yaozengwei Can you also have a look?
