Add Consistency-Regularized CTC #1766
Conversation
On the LibriSpeech dataset, a comparison of results with Zipformer, without using an external language model:
Could you update RESULTS.md to include the URLs for the checkpoints and training logs of your PR?
Sure. Will do it later.
@@ -950,7 +943,6 @@ def compute_loss(
            spec_augment=spec_augment,
            supervision_segments=supervision_segments,
            time_warp_factor=params.spec_aug_time_warp_factor,
Cannot find the definition of spec_aug_time_warp_factor.
It is defined in zipformer/asr_datamodule.py
An example training script using 4 × 32GB V100 GPUs:

export CUDA_VISIBLE_DEVICES="0,1,2,3"
./zipformer/train.py \
--world-size 4 \
--num-epochs 50 \
--start-epoch 1 \
--use-fp16 1 \
--exp-dir zipformer/exp-cr-loss-scale-0.2-time-mask-ratio-2.5 \
--use-cr-ctc 1 \
--use-ctc 1 \
--use-transducer 0 \
--use-attention-decoder 0 \
--enable-spec-aug 0 \
--cr-loss-scale 0.2 \
--time-mask-ratio 2.5 \
--full-libri 1 \
--max-duration 700 \
  --master-port 12345
I have uploaded the checkpoints and updated RESULTS.md.
I did some fine-tuning experiments.
Results on GigaSpeech:
Fine-tuned results on LibriSpeech:
The results show that CR-CTC could be a good choice for pretraining.
First of all, I would like to express my deepest gratitude for sharing your invaluable code and paper. They have been immensely helpful in my research. While reading your paper and exploring the code, I encountered a question about the batch-size setting, and I would appreciate your insights. In your paper, you mention that "As CR-CTC requires two forward pass during training, we train CR-CTC models with half the batch size and half the number of epochs compared to CTC models, ensuring a fair comparison in terms of training cost". However, in the model.py file, I noticed that the forward function scales the ctc_loss and transducer_loss by 0.5. Do I still need to adjust the batch-size (max-duration) setting as well? Once again, thank you for your hard work and generous sharing!
For example, if you use a max-duration of 1400 for standard CTC, you could use a max-duration of 700 for CR-CTC. It creates two augmented copies of the batch and concatenates them along the batch dimension. The reason we scale the loss values by 0.5 is to keep the logged loss values comparable to other setups (without CR-CTC), since we get info["frames"] in train.py (before the batch is duplicated) and normalize the loss values by it before printing. You could refer to the script examples in RESULTS.md.
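The batch duplication and the 0.5 scaling can be sketched as follows. This is a minimal illustration, not the actual icefall code: the feature shapes, the identity "augmentation", and the dummy per-frame loss are all placeholders.

```python
import torch

torch.manual_seed(0)

features = torch.randn(4, 100, 80)              # (N, T, F) dummy fbank features
frames = features.shape[0] * features.shape[1]  # info["frames"], counted BEFORE duplication

# Two "augmented" views of the same utterances (identity here; the real
# recipe applies time/frequency masking independently to each copy).
view_a, view_b = features.clone(), features.clone()
x = torch.cat([view_a, view_b], dim=0)          # effective batch size doubles: 8

# Dummy loss that contributes one unit per frame of the doubled batch.
loss = torch.ones(x.shape[0], x.shape[1]).sum()

# Without the 0.5 factor, loss / frames would be ~2x that of a non-CR run,
# because frames was computed before the batch was duplicated.
logged = 0.5 * loss / frames
print(float(logged))
```

This is also why halving max-duration (e.g. 1400 → 700) keeps the per-step compute of a CR-CTC run roughly the same as a standard CTC run.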
Are there any results on streaming ASR? My experiments on streaming ASR using CTC do not seem to work: the CTC loss gets worse while the CR loss gets better, and the WER gets worse.
I tested the performance on streaming Zipformer-CTC models, getting the following results with
Hello, may I ask how you perform inference? Do you fuse the two branches using softmax and addition and then decode, or something else? In your paper, I noticed you mentioned that you ensemble the two branches, but I'm not sure about the specific ensemble technique you used. Thank you!
The term "ensemble" is just an interpretation of dropout-based training techniques. For CR-CTC, "two branches" simply means that the model receives two different augmented views and produces two different outputs (even with identical inputs, the outputs still differ because of dropout during training). Physically there is only one model, and you do not need to perform any ensembling at inference time.
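In other words, inference is just ordinary CTC decoding on a single forward pass. As a minimal sketch of that (ctc_greedy_decode here is a hypothetical helper, not icefall's decoder):

```python
import torch

def ctc_greedy_decode(log_probs, blank=0):
    # Minimal CTC greedy decoding over one branch's output:
    # argmax per frame, collapse repeats, drop blanks.
    # log_probs: (T, C) log-probabilities from a single forward pass.
    ids = log_probs.argmax(dim=-1).tolist()
    hyp, prev = [], blank
    for i in ids:
        if i != prev and i != blank:
            hyp.append(i)
        prev = i
    return hyp
```

No second branch or fusion step appears anywhere: the model trained with CR-CTC is decoded exactly like a plain CTC model.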
In the revised manuscript (https://arxiv.org/pdf/2410.05101), we have added experiments using Conformer encoder in Appendix 7. |
Which is the recommended cr-loss-scale if I use Zipformer + pruned transducer with CR-CTC: 0.2 or 0.02? The paper sets it to 0.2, but the script provided in RESULTS.md sets it to 0.02.
@moadel2002 Yes, we use a cr-loss-scale of 0.02 in the pruned-transducer w/ CR-CTC system. For the system trained with pruned transducer and CTC, we set ctc-loss-scale to 0.1, and we want to maintain the relative scale between the CR loss and the CTC loss, keeping it at 0.2 (i.e., 0.02 / 0.1).
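The relative-scale reasoning can be made concrete with dummy numbers. The loss values below are illustrative only; the scales are the ones discussed above:

```python
# Illustrative loss values only; the scales follow the discussion above.
transducer_loss, ctc_loss, cr_loss = 10.0, 5.0, 2.0

ctc_scale = 0.1   # ctc-loss-scale in the pruned-transducer + CTC system
cr_scale = 0.02   # cr-loss-scale used alongside it

# The ratio cr_scale / ctc_scale stays at 0.2, matching the
# cr-loss-scale used in the pure CR-CTC system.
total = transducer_loss + ctc_scale * ctc_loss + cr_scale * cr_loss
print(total)
```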
@yaozengwei Hi, may I know why the loss is multiplied by 0.5? Thanks.
@1215thebqtic To ensure the printing loss values remain comparable in magnitude to those of the regular system (without CR). |
Hi, my task is tiny ASR for an MCU. My model is a TCN with about 250k parameters. When I use CR-CTC, it doesn't work: the CTC loss decreases from ~80 to ~60, but the CR loss does not converge and keeps fluctuating around 15. May I ask if it is normal for the cr_loss not to converge?
@Garry-sh Could you show how you compute the CR loss on the two CTC outputs? In addition, does your model use dropout or different input masking in the two branches? If not, there is no benefit in using the CR loss.
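For reference, a common way to implement such a consistency loss is a bidirectional KL divergence between the two branches' CTC log-probabilities, with the target branch detached (stop-gradient). This is a sketch of the idea only, not the exact icefall implementation (which, e.g., also has a cr_loss_masked_scale option):

```python
import torch
import torch.nn.functional as F

def cr_loss(log_probs_a, log_probs_b):
    # log_probs_*: (T, N, C) log-softmax outputs of the two branches.
    # Each branch is pulled toward the (detached) distribution of the other.
    kl_ab = F.kl_div(log_probs_a, log_probs_b.detach(),
                     reduction="sum", log_target=True)
    kl_ba = F.kl_div(log_probs_b, log_probs_a.detach(),
                     reduction="sum", log_target=True)
    return 0.5 * (kl_ab + kl_ba)
```

Note that if the two branches receive identical inputs and the model has no dropout or input masking, the two distributions coincide and this loss is identically zero, which is why the question above matters.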
For a smaller model (20M parameters), the improvement is not so big here either. Basically there is no model capacity to model multiple views, I think. I'll provide more graphs later.
1. I draw on the code from icefall. The code is:
2. I use time_mask and freq_mask. The code is:
3. I don't use dropout.

Could it be a problem with the model structure? Does this only work with Transformer-based models?
@Garry-sh I don't see clear issues in your pasted code. I am not sure whether the problem is due to the very small model size (250k parameters), as we haven't tested such a small model before. Have you tried running this on a larger model, like a Conformer?
OK, thanks. I want to know: did the cr_loss converge in your experiments?
@Garry-sh Yes. In my experiment, the cr-loss is decreasing. |
* support consistency-regularized CTC
* update arguments of cr-ctc
* set default value of cr_loss_masked_scale to 1.0
* minor fix
* refactor codes
* update RESULTS.md






This PR implements Consistency-Regularized CTC (CR-CTC) from https://arxiv.org/pdf/2410.05101,
which enforces consistency between two CTC distributions obtained from different augmented views of the input speech mel-spectrogram. It significantly improves CTC performance, and it can also serve as an auxiliary loss to boost the performance of transducer or CTC/AED systems. Please see the paper for more details.