Add Consistency-Regularized CTC #1766
Conversation
On the LibriSpeech dataset, a comparison of results with Zipformer, without using an external language model:
Could you update RESULTS.md to include the URLs for the checkpoints and training logs of your PR?
Sure. Will do it later.
@@ -950,7 +943,6 @@ def compute_loss(
            spec_augment=spec_augment,
            supervision_segments=supervision_segments,
            time_warp_factor=params.spec_aug_time_warp_factor,
Cannot find the definition of spec_aug_time_warp_factor.
It is defined in zipformer/asr_datamodule.py
An example training script using 4 × 32GB V100 GPUs:

export CUDA_VISIBLE_DEVICES="0,1,2,3"
./zipformer/train.py \
--world-size 4 \
--num-epochs 50 \
--start-epoch 1 \
--use-fp16 1 \
--exp-dir zipformer/exp-cr-loss-scale-0.2-time-mask-ratio-2.5 \
--use-cr-ctc 1 \
--use-ctc 1 \
--use-transducer 0 \
--use-attention-decoder 0 \
--enable-spec-aug 0 \
--cr-loss-scale 0.2 \
--time-mask-ratio 2.5 \
--full-libri 1 \
--max-duration 700 \
  --master-port 12345
I have uploaded the checkpoints and updated RESULTS.md.
I did some fine-tuning experiments.
Results on GigaSpeech:
Fine-tuned results on LibriSpeech:
The results show that CR-CTC could be a good choice for pretraining.
First of all, I would like to express my deepest gratitude for sharing your invaluable code and paper. They have been immensely helpful in my research. While reading your paper and exploring the code, I encountered a question about the batch-size setting, and I would appreciate your insights. In your paper, you mention that "As CR-CTC requires two forward pass during training, we train CR-CTC models with half the batch size and half the number of epochs compared to CTC models, ensuring a fair comparison in terms of training cost". However, in the model.py file, I noticed that the forward function scales the ctc_loss and transducer_loss by 0.5. Do I still need to adjust the batch-size (max-duration) setting as well? Once again, thank you for your hard work and generous sharing!
For example, if you use a max-duration of 1400 for standard CTC, you could use a max-duration of 700 for CR-CTC. It creates two augmented copies of the batch and concatenates them along the batch dimension. The reason we scale the loss values by 0.5 is to keep the logged loss values comparable to other setups (without CR-CTC), since we get info["frames"] in train.py (before the batch is duplicated) and normalize the loss values by it before printing. You could refer to the script examples in RESULTS.md.
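The batch duplication and the 0.5 scaling can be sketched as follows. This is a minimal illustration, not the actual icefall code: the feature shapes, the identity "augmentation", and the dummy per-frame loss are all placeholders.

```python
import torch

torch.manual_seed(0)

features = torch.randn(4, 100, 80)              # (N, T, F) dummy fbank features
frames = features.shape[0] * features.shape[1]  # info["frames"], counted BEFORE duplication

# Two "augmented" views of the same utterances (identity here; the real
# recipe applies time/frequency masking independently to each copy).
view_a, view_b = features.clone(), features.clone()
x = torch.cat([view_a, view_b], dim=0)          # effective batch size doubles: 8

# Dummy loss that contributes one unit per frame of the doubled batch.
loss = torch.ones(x.shape[0], x.shape[1]).sum()

# Without the 0.5 factor, loss / frames would be ~2x that of a non-CR run,
# because frames was computed before the batch was duplicated.
logged = 0.5 * loss / frames
print(float(logged))
```

This is also why halving max-duration (e.g. 1400 → 700) keeps the per-step compute of a CR-CTC run roughly the same as a standard CTC run.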
Are there any results on streaming ASR? My experiments on streaming ASR using CTC do not seem to work: the CTC loss gets worse while the CR loss gets better, and the WER gets worse.
I tested the performance on streaming Zipformer-CTC models, getting the following results with
Hello, may I ask how you perform inference? Do you fuse the two branches using softmax and addition and then decode, or something else? In your paper, I noticed you mentioned that you ensemble the two branches, but I'm not sure about the specific ensemble technique you used. Thank you!
The term "ensemble" is just an interpretation of dropout-based training techniques. For CR-CTC, "two branches" simply means that the model receives two different augmented views and produces two different outputs (even with identical inputs, the outputs still differ because of dropout during training). Physically there is only one model, and you do not need to perform any ensembling at inference time.
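In other words, inference is just ordinary CTC decoding on a single forward pass. As a minimal sketch of that (ctc_greedy_decode here is a hypothetical helper, not icefall's decoder):

```python
import torch

def ctc_greedy_decode(log_probs, blank=0):
    # Minimal CTC greedy decoding over one branch's output:
    # argmax per frame, collapse repeats, drop blanks.
    # log_probs: (T, C) log-probabilities from a single forward pass.
    ids = log_probs.argmax(dim=-1).tolist()
    hyp, prev = [], blank
    for i in ids:
        if i != prev and i != blank:
            hyp.append(i)
        prev = i
    return hyp
```

No second branch or fusion step appears anywhere: the model trained with CR-CTC is decoded exactly like a plain CTC model.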
In the revised manuscript (https://arxiv.org/pdf/2410.05101), we have added experiments using Conformer encoder in Appendix 7. |
Which is the recommended cr-loss-scale if I use Zipformer + pruned transducer with CR-CTC: 0.2 or 0.02? The paper sets it to 0.2, but the script provided in RESULTS.md sets it to 0.02.
@moadel2002 Yes, we use a cr-loss-scale of 0.02 in the pruned-transducer w/ CR-CTC system. For the system trained with pruned transducer and CTC, we set ctc-loss-scale to 0.1, and we want to maintain the relative scale between the CR loss and the CTC loss, keeping it at 0.2 (i.e., 0.02 / 0.1).
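The relative-scale reasoning can be made concrete with dummy numbers. The loss values below are illustrative only; the scales are the ones discussed above:

```python
# Illustrative loss values only; the scales follow the discussion above.
transducer_loss, ctc_loss, cr_loss = 10.0, 5.0, 2.0

ctc_scale = 0.1   # ctc-loss-scale in the pruned-transducer + CTC system
cr_scale = 0.02   # cr-loss-scale used alongside it

# The ratio cr_scale / ctc_scale stays at 0.2, matching the
# cr-loss-scale used in the pure CR-CTC system.
total = transducer_loss + ctc_scale * ctc_loss + cr_scale * cr_loss
print(total)
```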
@yaozengwei Hi, may I know why the loss is multiplied by 0.5? Thanks.
@1215thebqtic To ensure the printing loss values remain comparable in magnitude to those of the regular system (without CR). |
Hi, my task is tiny ASR for an MCU. My model is a TCN with about 250k parameters. When I use CR-CTC, it doesn't work: the CTC loss decreases from ~80 to ~60, but the CR loss does not converge and keeps fluctuating around 15. May I ask if it is normal for the cr_loss not to converge?
@Garry-sh Could you show how you compute the CR loss on the two CTC outputs? In addition, does your model use dropout or different input masking in the two branches? If not, there is no benefit in using the CR loss.
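For reference, a common way to implement such a consistency loss is a bidirectional KL divergence between the two branches' CTC log-probabilities, with the target branch detached (stop-gradient). This is a sketch of the idea only, not the exact icefall implementation (which, e.g., also has a cr_loss_masked_scale option):

```python
import torch
import torch.nn.functional as F

def cr_loss(log_probs_a, log_probs_b):
    # log_probs_*: (T, N, C) log-softmax outputs of the two branches.
    # Each branch is pulled toward the (detached) distribution of the other.
    kl_ab = F.kl_div(log_probs_a, log_probs_b.detach(),
                     reduction="sum", log_target=True)
    kl_ba = F.kl_div(log_probs_b, log_probs_a.detach(),
                     reduction="sum", log_target=True)
    return 0.5 * (kl_ab + kl_ba)
```

Note that if the two branches receive identical inputs and the model has no dropout or input masking, the two distributions coincide and this loss is identically zero, which is why the question above matters.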
For a smaller model (20M parameters), the improvement is not so big here either. Basically there is no model capacity to model multiple views, I think. I'll provide more graphs later.
1. I draw on the code from icefall. The code is:
2. I use time_mask and freq_mask. The code is:
3. I don't use dropout.

Could it be a problem with the model structure? Does this only work with Transformer-based models?
@Garry-sh I don't see clear issues in your pasted code. I am not sure whether the problem is due to the very small model size (250k parameters), as we haven't tested such a small model before. Have you tried running this on a larger model, like a Conformer?
OK, thanks. I want to know: did the cr_loss converge in your experiments?
@Garry-sh Yes. In my experiment, the cr-loss is decreasing. |
* support consistency-regularized CTC
* update arguments of cr-ctc
* set default value of cr_loss_masked_scale to 1.0
* minor fix
* refactor codes
* update RESULTS.md






This PR implements Consistency-Regularized CTC (CR-CTC) from https://arxiv.org/pdf/2410.05101,
which enforces consistency between two CTC distributions obtained from different augmented views of the input speech mel-spectrogram. It significantly improves CTC performance, and it can also serve as an auxiliary loss to boost the performance of transducer or CTC/AED systems. Please see the paper for more details.