German fine-tuning: lessons learned & pitfalls (PLBERT, discriminator saturation, IPA length mark) #359
dida-80b started this conversation in Show and tell
StyleTTS2 German Fine-Tuning — Lessons Learned & Pitfalls
This guide documents practical issues encountered when fine-tuning StyleTTS2 on a German voice dataset (Eva-K, CC0, ~5000 samples, 12.5h). Most of these are not covered in the official documentation and cost significant GPU time to discover.
Hardware used: RunPod H200 SXM 141GB VRAM
Pitfall 1: Wrong BERT Model (37 wasted epochs)
Problem:
StyleTTS2 ships with an English PLBERT by default. If you use this for non-English training, phoneme embeddings will be wrong and audio quality will suffer noticeably — especially prosody and pausing.
Fix:
Use the multilingual PLBERT (`PLBERT_all_languages`), which covers all IPA phonemes. This is briefly mentioned in Discussion #81 but is easy to miss. Always verify this before starting training.
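In the stock StyleTTS2 configs the PLBERT location is set through the `PLBERT_dir` key, so the switch is a one-line change. The directory name below is an assumption about where the multilingual checkpoint was unpacked; adjust it to your own layout.

```yaml
# Configs/config_ft.yml (excerpt); the directory name is a hypothetical example
PLBERT_dir: 'Utils/PLBERT_all_languages/'
```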
Pitfall 2: WavLM Discriminator silently dies after a few epochs
Problem:
This is the most critical and hardest to detect issue. The WavLM adversarial discriminator (`wd`) saturates early in training and stops updating. In our case it was dead after epoch 7, and we didn't notice for 50+ epochs.
Why it happens:
The discriminator uses hinge loss:
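As a plain-Python sketch of that loss (an illustration of the standard hinge formulation, not the StyleTTS2 source; `hinge_d_loss` and the example scores are made up for this example):

```python
def hinge_d_loss(real_scores, fake_scores):
    """Hinge discriminator loss averaged over a batch of scalar scores."""
    def relu(x):
        return max(0.0, x)

    # Real samples are penalised while their score is below +1.
    real_term = sum(relu(1.0 - s) for s in real_scores) / len(real_scores)
    # Fake samples are penalised while their score is above -1.
    fake_term = sum(relu(1.0 + s) for s in fake_scores) / len(fake_scores)
    return real_term + fake_term


# Unsure discriminator: the loss is positive, so there is something to learn.
print(hinge_d_loss([0.2, 0.5], [-0.1, 0.3]))
# Saturated discriminator: every term is clipped, so the loss is exactly 0.0.
print(hinge_d_loss([1.4, 2.0], [-1.2, -3.0]))
```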
Once the discriminator is confident enough (`D(real) > 1`, `D(fake) < -1`), both relu terms become 0. A loss of exactly 0 also has zero gradient, and the training code only updates the discriminator when `d_loss != 0`, so it freezes permanently.
The default learning rate `lr: 0.0001` for `wd` is too high: the discriminator learns to distinguish real from fake too quickly and saturates.
Why it's hard to detect:
`GenLM Loss` remains non-zero even when the discriminator is dead (the generator still trains against the frozen discriminator), so everything looks fine in the logs. You have to check `DiscLM Loss` specifically.
How to detect:
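One way to script the check is to scan the training log for the logged `DiscLM Loss` values. This is a minimal sketch: it assumes each step line contains the text `DiscLM Loss: <number>`, and the helper names and sample lines are hypothetical, so adapt the pattern to what your log actually prints.

```python
import re

# Assumed log format: a step line contains "DiscLM Loss: <float>".
DISC_LM = re.compile(r"DiscLM Loss:\s*([-+]?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?)")


def disc_lm_values(lines):
    """Extract every DiscLM Loss value found in the given log lines."""
    values = []
    for line in lines:
        match = DISC_LM.search(line)
        if match:
            values.append(float(match.group(1)))
    return values


def discriminator_looks_dead(lines, last_n=50):
    """True if the last `last_n` logged DiscLM Loss values are all zero."""
    values = disc_lm_values(lines)
    if len(values) < last_n:
        return False  # not enough evidence yet
    return all(v == 0.0 for v in values[-last_n:])


# Made-up sample lines, only to show the expected shape of the input.
sample = [
    "Step [10/800], DiscLM Loss: 0.00000, GenLM Loss: 1.93210",
    "Step [20/800], DiscLM Loss: 0.00000, GenLM Loss: 1.90455",
]
print(discriminator_looks_dead(sample, last_n=2))
```

Feeding it the lines of the real training log (for example via `open(path).readlines()`) gives a yes/no answer that a watchdog script can act on.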
Fix 1 — Reset discriminator weights from a checkpoint:
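A possible way to do this, sketched under two assumptions you should verify against your own setup: the checkpoint is a dict whose `net` entry maps module names to state dicts, and the loader only restores modules that are present there, so a missing `wd` keeps its fresh initialization. The function names are hypothetical, and any optimizer state stored for `wd` is left untouched here.

```python
def drop_module(checkpoint, module="wd"):
    """Return a copy of `checkpoint` without the weights of `module`.

    The input dict is not modified; only the top level and the "net"
    mapping are copied.
    """
    cleaned = dict(checkpoint)
    cleaned["net"] = {
        name: state for name, state in checkpoint["net"].items() if name != module
    }
    return cleaned


def save_reset_checkpoint(src_path, dst_path, module="wd"):
    """Load a checkpoint, drop one module's weights, and save the result."""
    import torch  # imported lazily so drop_module stays dependency-free

    checkpoint = torch.load(src_path, map_location="cpu")
    torch.save(drop_module(checkpoint, module), dst_path)
```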
Then update `pretrained_model` in your config to point to the reset checkpoint and restart training.
Fix 2 — Lower discriminator learning rate (preventive):
Reduce the `wd` optimizer learning rate to `1e-5` or lower so it saturates more slowly.
Fix 3 — Automated monitoring:
Add a check to your watchdog/monitoring script: if `DiscLM Loss` has been 0 for the last N steps, auto-reset and restart. This keeps adversarial training alive throughout.
Why this matters:
Without an active discriminator, the model only trains via reconstruction losses. The adversarial signal from WavLM is what pushes the generator toward natural-sounding speech. Losing it silently for 50 epochs is a significant quality regression.
Pitfall 3: IPA length mark `ː` treated as a separate phoneme token
Problem:
espeak-ng outputs long vowels with the IPA length mark: `øː`, `iː`, `aː`, etc. The StyleTTS2 TextCleaner treats `ː` as a separate token. The duration predictor then sees e.g. `ø` + `ː` as two tokens and adds their durations together, resulting in stretched vowels.
Symptoms:
Why more training doesn't fix it:
The model isn't making a mistake — it's correctly learning the duration representation in the training data. The data itself is wrong.
Fix — strip `ː` from phoneme strings before training.
Critical: also fix inference to match.
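Both steps can share one helper. The sketch below assumes list files with the `path|phonemes|speaker` layout and strips the mark only from the phoneme field; the function names and the sample line are hypothetical.

```python
LENGTH_MARK = "\u02d0"  # IPA length mark "ː"


def strip_length_mark(phonemes):
    """Remove every IPA length mark from a phoneme string."""
    return phonemes.replace(LENGTH_MARK, "")


def clean_list_line(line):
    """Strip the length mark from the phoneme field of one list line.

    Assumes 'path|phonemes|speaker'; the other fields are left untouched.
    """
    fields = line.rstrip("\n").split("|")
    if len(fields) >= 2:
        fields[1] = strip_length_mark(fields[1])
    return "|".join(fields)


# Made-up list line; prints: wavs/0001.wav|ʃøn|0
print(clean_list_line("wavs/0001.wav|ʃø\u02d0n|0"))
```

For training, run `clean_list_line` over every line of `train_list.txt` and `val_list.txt`. For inference, call `strip_length_mark` on the phonemizer output before it reaches the text cleaner.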
If you fix the training data but not inference, you get a train/inference mismatch.
Note: You can apply this fix mid-training and continue from a checkpoint. The duration predictor will need several epochs to unlearn the old behavior, but the accumulated voice/style knowledge is preserved.
Pitfall 4: `pretrained_model` config pointing to a deleted checkpoint
Problem:
If your watchdog/cleanup script deletes old checkpoints, but `pretrained_model` in your config still points to a deleted file, training will crash silently after loading. The process exits with a Python `FileNotFoundError`, the watchdog immediately restarts it, and it crashes again: an infinite crash loop with continuous failure notifications.
Fix:
Always keep `pretrained_model` in your config pointing to an existing checkpoint. Update it every time you delete old checkpoints, or exclude the referenced checkpoint from cleanup.
Pre-training Checklist
- `PLBERT_all_languages` for non-English training
- `ː` stripped from train_list.txt and val_list.txt
- `ː` stripped after `phonemize()`
- `wd` optimizer learning rate set to `1e-5` or lower
- `pretrained_model` in config: points to an existing checkpoint
- Check `DiscLM Loss` after the first 100 steps: it must be non-zero
- Check `DiscLM Loss` every ~10 epochs: if consistently 0, reset immediately
Monitoring Commands
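Alongside watching `DiscLM Loss`, the Pitfall 4 condition is cheap to check before every (re)start. The sketch below is an assumption-laden illustration: it reads the top-level `pretrained_model` key with a small regex instead of a YAML parser, and both helper names are hypothetical.

```python
import os
import re


def read_pretrained_model(config_text):
    """Return the value of the top-level `pretrained_model` key, or None.

    A deliberately small regex reader so the sketch needs no YAML library;
    it handles optional quotes and a trailing comment, nothing fancier.
    """
    match = re.search(
        r"^pretrained_model:[ \t]*(.*?)[ \t]*(?:#.*)?$", config_text, re.MULTILINE
    )
    if not match:
        return None
    return match.group(1).strip("'\"") or None


def pretrained_checkpoint_ok(config_text):
    """True only if the config names a checkpoint file that exists on disk."""
    path = read_pretrained_model(config_text)
    return path is not None and os.path.isfile(path)
```

A watchdog can call `pretrained_checkpoint_ok` on the config contents and refuse to restart training when it returns `False`, which breaks the crash loop at the first failure.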
What worked well
- `PLBERT_all_languages`: clear improvement for German phonemes once correctly configured
- `lambda_slm: 1.0`: keep the WavLM generator signal active, never set to 0
- `batch_size=6, max_len=600`: stable on H200, no OOM (`batch_size=8` caused OOM)
- `setsid`: survives SSH disconnects reliably
- `load_only_params: false`: correctly resumes the epoch counter from the checkpoint

Hope this saves someone the GPU hours it cost to learn these lessons.