German fine-tuning: lessons learned & pitfalls (PLBERT, discriminator saturation, IPA length mark) #359

dida-80b · 2026-03-24T11:36:25Z

StyleTTS2 German Fine-Tuning — Lessons Learned & Pitfalls

This guide documents practical issues encountered when fine-tuning StyleTTS2 on a German voice dataset (Eva-K, CC0, ~5000 samples, 12.5h). Most of these are not covered in the official documentation and cost significant GPU time to discover.

Hardware used: RunPod H200 SXM 141GB VRAM

Pitfall 1: Wrong BERT Model (37 wasted epochs)

Problem:
StyleTTS2 ships with an English PLBERT by default. If you use this for non-English training, phoneme embeddings will be wrong and audio quality will suffer noticeably — especially prosody and pausing.

Fix:
Use the multilingual PLBERT (PLBERT_all_languages) which covers all IPA phonemes:

# config.yml
PLBERT_dir: /path/to/StyleTTS2/Utils/PLBERT_all_languages

This is briefly mentioned in Discussion #81 but easy to miss. Always verify this before starting training.

Pitfall 2: WavLM Discriminator silently dies after a few epochs

Problem:
This is the most critical and hardest to detect issue. The WavLM adversarial discriminator (wd) saturates early in training and stops updating. In our case it was dead after epoch 7 — and we didn't notice for 50+ epochs.

Why it happens:
The discriminator uses hinge loss:

d_loss = relu(1 - D(real)) + relu(1 + D(fake))

Once the discriminator is confident enough (D(real) >= 1 and D(fake) <= -1 for every sample in the batch), both relu terms become exactly 0. The training code only updates the discriminator when d_loss != 0, so it stops updating. In our run it never recovered on its own.

The default learning rate lr: 0.0001 for wd is too high — the discriminator learns to distinguish real from fake too quickly and saturates.
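The saturation is easy to reproduce with a few numbers. Below is a minimal sketch in plain Python (no PyTorch) of the hinge loss above; the score values are made up for illustration, and the mean over the batch is an assumption about how the terms are reduced.

```python
def relu(x: float) -> float:
    return max(0.0, x)


def hinge_d_loss(real_scores, fake_scores):
    """Hinge loss relu(1 - D(real)) + relu(1 + D(fake)), averaged over the batch."""
    real_term = sum(relu(1.0 - s) for s in real_scores) / len(real_scores)
    fake_term = sum(relu(1.0 + s) for s in fake_scores) / len(fake_scores)
    return real_term + fake_term


# Early training: scores are still inside the margin, so the loss is positive.
early = hinge_d_loss(real_scores=[0.5, 0.5], fake_scores=[-0.5, -0.5])

# Saturated: every real score is >= 1 and every fake score is <= -1.
saturated = hinge_d_loss(real_scores=[1.5, 2.0], fake_scores=[-1.5, -3.0])

print(early, saturated)
```

Once every score sits outside the margin the loss is exactly 0, not just small, which is why a `d_loss != 0` guard skips the update entirely.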

Why it's hard to detect:
GenLM Loss remains non-zero even when the discriminator is dead (the generator still trains against the frozen discriminator). So everything looks fine in the logs. You have to specifically check DiscLM Loss.

How to detect:

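# If this prints little or nothing → discriminator is dead.
# The pattern matches only an exactly-zero value, so small non-zero losses (e.g. 0.00412) still count as alive.
grep 'DiscLM Loss' train.log | grep -Ev 'DiscLM Loss: 0\.0+([^0-9]|$)' | tail -20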

Fix 1 — Reset discriminator weights from a checkpoint:

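import torch
import torch.nn as nn

# Load on CPU so the reset does not need a GPU.
state = torch.load('checkpoint.pth', map_location='cpu')
wd = state['net']['wd']  # state dict of the WavLM discriminator

# Re-initialize every weight-normalized layer in place.
for key, tensor in wd.items():
    if 'weight_v' in key:      # direction component of weight norm
        nn.init.xavier_uniform_(tensor)
    elif 'weight_g' in key:    # magnitude component of weight norm
        nn.init.ones_(tensor)
    elif 'bias' in key:
        nn.init.zeros_(tensor)

torch.save(state, 'checkpoint_wd_reset.pth')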

Then update pretrained_model in your config to point to the reset checkpoint and restart training.

Fix 2 — Lower discriminator learning rate (preventive):
Reduce the wd optimizer learning rate to 1e-5 or lower so it saturates more slowly.
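One way to do this in code is to overwrite the rate on the optimizer's parameter groups after it has been built. The sketch below relies only on the `param_groups` list that PyTorch optimizers expose; the helper name `set_learning_rate` is made up, and how you reach the `wd` optimizer inside the training script is something to check in your own copy. A stand-in object is used so the example runs without PyTorch.

```python
from types import SimpleNamespace


def set_learning_rate(optimizer, lr: float) -> None:
    """Set lr on every parameter group of a torch-style optimizer."""
    for group in optimizer.param_groups:
        group["lr"] = lr
        # Schedulers store their starting rate as "initial_lr"; keep it in sync if present.
        if "initial_lr" in group:
            group["initial_lr"] = lr


# Stand-in with the same param_groups layout as torch.optim.Optimizer (hypothetical values).
wd_optimizer = SimpleNamespace(param_groups=[{"lr": 1e-4, "initial_lr": 1e-4}])
set_learning_rate(wd_optimizer, 1e-5)
print(wd_optimizer.param_groups[0]["lr"])
```

If a learning-rate scheduler is attached to the optimizer, it may overwrite this value on its next step, so print the parameter groups after a few iterations to confirm the lower rate actually sticks.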

Fix 3 — Automated monitoring:
Add a check to your watchdog/monitoring script: if DiscLM Loss has been 0 for the last N steps, auto-reset and restart. This keeps adversarial training alive throughout.
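A minimal sketch of such a check, assuming each log line contains a field like `DiscLM Loss: 0.00000` (the same text the grep above looks for). The sample lines and the function names are hypothetical; the reset and restart step is left to your watchdog.

```python
import re

# Matches the numeric value that follows "DiscLM Loss: ".
DISC_LM = re.compile(r"DiscLM Loss: (\d+(?:\.\d+)?)")


def trailing_zero_steps(lines):
    """Count how many of the most recent logged DiscLM Loss values are exactly 0."""
    values = [float(m.group(1)) for line in lines if (m := DISC_LM.search(line))]
    count = 0
    for value in reversed(values):
        if value != 0.0:
            break
        count += 1
    return count


def discriminator_is_dead(lines, n_steps=200):
    """True if DiscLM Loss has been 0 for at least the last n_steps logged steps."""
    return trailing_zero_steps(lines) >= n_steps


# Hypothetical log excerpt, for illustration only.
sample = [
    "Epoch [7/100], Step [10/800], DiscLM Loss: 0.41230, GenLM Loss: 1.10231",
    "Epoch [7/100], Step [20/800], DiscLM Loss: 0.00412, GenLM Loss: 1.09877",
    "Epoch [8/100], Step [10/800], DiscLM Loss: 0.00000, GenLM Loss: 1.12004",
    "Epoch [8/100], Step [20/800], DiscLM Loss: 0.00000, GenLM Loss: 1.11950",
    "Epoch [8/100], Step [30/800], DiscLM Loss: 0.00000, GenLM Loss: 1.11871",
]
print(trailing_zero_steps(sample), discriminator_is_dead(sample, n_steps=3))
```

Counting only the trailing run of zeros avoids false alarms from an isolated zero step earlier in training. Pick N with your logging interval in mind, since the count is in logged steps, not optimizer steps.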

Why this matters:
Without an active discriminator, the model only trains via reconstruction losses. The adversarial signal from WavLM is what pushes the generator toward natural-sounding speech. Losing it silently for 50 epochs is a significant quality regression.

Pitfall 3: IPA length mark `ː` treated as a separate phoneme token

Problem:
espeak-ng outputs long vowels with the IPA length mark: øː, iː, aː etc. The StyleTTS2 TextCleaner treats ː as a separate token. The duration predictor then sees e.g. ø + ː as two tokens and adds their durations together — resulting in stretched vowels.

Symptoms:

"plötzlich" → sounds like "plööötzlich"
"Vögel" → stretched ö
Problem does not improve with more epochs — it's baked into the training data representation

Why more training doesn't fix it:
The model isn't making a mistake — it's correctly learning the duration representation in the training data. The data itself is wrong.
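To see what the duration predictor is given, here is a small sketch assuming a per-character tokenizer, which is the behavior described above. The phoneme string is hand-written for illustration, not actual espeak-ng output, and `char_tokens` is a made-up helper.

```python
LENGTH_MARK = "\u02d0"  # IPA length mark "ː"


def char_tokens(phonemes: str):
    """Split a phoneme string into one token per character, skipping whitespace."""
    return [ch for ch in phonemes if not ch.isspace()]


with_mark = char_tokens("føːɡəl")
without_mark = char_tokens("føːɡəl".replace(LENGTH_MARK, ""))

print(with_mark)
print(len(with_mark), len(without_mark))
```

The long vowel occupies two token positions with the mark and one without it, so each position gets its own predicted duration.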

Fix — strip ː from phoneme strings before training:

# Apply once to your training and validation lists
sed -i 's/ː//g' /path/to/train_list.txt
sed -i 's/ː//g' /path/to/val_list.txt
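The sed commands rewrite every field of every line in place. If you would rather touch only the phoneme column and keep the original files, a sketch like the following may help. It assumes the list format is `wav_path|phonemes|speaker`; the function names and example line are hypothetical.

```python
LENGTH_MARK = "\u02d0"  # IPA length mark "ː"


def strip_length_marks(line: str) -> str:
    """Remove the length mark from the second '|'-separated field only."""
    parts = line.rstrip("\n").split("|")
    if len(parts) >= 2:
        parts[1] = parts[1].replace(LENGTH_MARK, "")
    return "|".join(parts)


def clean_list_file(src: str, dst: str) -> None:
    """Write a cleaned copy of src to dst, leaving src untouched."""
    with open(src, encoding="utf-8") as fin, open(dst, "w", encoding="utf-8") as fout:
        for line in fin:
            fout.write(strip_length_marks(line) + "\n")


print(strip_length_marks("wavs/0001.wav|føːɡəl|0"))
```

Writing to a new file keeps the unmodified lists around in case you want to compare runs later.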

Critical: also fix inference to match:

# In your inference script, after phonemize():
ps = global_phonemizer.phonemize([text.strip()])
ps = [p.replace("ː", "") for p in ps]  # add this line
ps = ' '.join(word_tokenize(ps[0]))

If you fix the training data but not inference, you get a train/inference mismatch.

Note: You can apply this fix mid-training and continue from a checkpoint. The duration predictor will need several epochs to unlearn the old behavior, but the accumulated voice/style knowledge is preserved.

Pitfall 4: `pretrained_model` config pointing to a deleted checkpoint

Problem:
If your watchdog/cleanup script deletes old checkpoints but pretrained_model in your config still points to one of them, training crashes at startup: the process exits with a Python FileNotFoundError while loading the checkpoint, the watchdog immediately restarts it, and it crashes again. The result is an infinite crash loop with continuous failure notifications.

Fix:
Always keep pretrained_model in your config pointing to an existing checkpoint. Update it every time you delete old checkpoints, or exclude the referenced checkpoint from cleanup.

# Always verify this file actually exists before starting
pretrained_model: /workspace/data/checkpoints/epoch_2nd_00063.pth
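For the "exclude the referenced checkpoint from cleanup" option, here is a sketch of the selection logic. It assumes checkpoint names carry zero-padded epoch numbers so that sorting by name is chronological; the function name and the other file names in the example are hypothetical.

```python
import os


def checkpoints_to_delete(checkpoints, keep_last, protected):
    """Return checkpoints that are safe to delete.

    Keeps the newest keep_last files and never returns a protected path.
    """
    protected_abs = {os.path.abspath(p) for p in protected}
    ordered = sorted(checkpoints)  # zero-padded epoch numbers sort chronologically
    old = ordered[:-keep_last] if keep_last > 0 else ordered
    return [p for p in old if os.path.abspath(p) not in protected_abs]


ckpt_dir = "/workspace/data/checkpoints"
existing = [f"{ckpt_dir}/epoch_2nd_{n:05d}.pth" for n in range(61, 66)]
referenced = f"{ckpt_dir}/epoch_2nd_00063.pth"  # value of pretrained_model in the config

for path in checkpoints_to_delete(existing, keep_last=2, protected=[referenced]):
    print("would delete", path)
```

The function only decides what to delete, so you can log or dry-run the result before removing anything.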

Pre-training Checklist

PLBERT: using PLBERT_all_languages for non-English training
Phoneme data: ː stripped from train_list.txt and val_list.txt
Inference script: ː stripped after phonemize()
Discriminator LR: wd optimizer set to 1e-5 or lower
pretrained_model in config: points to an existing checkpoint
Monitoring: check DiscLM Loss after first 100 steps — must be non-zero
Monitoring: check DiscLM Loss every ~10 epochs — if consistently 0, reset immediately
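The two config items on this checklist can be verified automatically before launching. The sketch below scans the config text for the top-level keys `PLBERT_dir` and `pretrained_model` without a YAML library; the helper names and the example config are hypothetical, and a real YAML parser is the better choice if your config uses anything beyond simple `key: value` lines.

```python
import os


def read_top_level_key(config_text: str, key: str):
    """Return the value of an unindented 'key: value' line, or None if absent."""
    prefix = key + ":"
    for line in config_text.splitlines():
        if line.startswith(prefix):
            value = line[len(prefix):].split(" #", 1)[0].strip()
            return value.strip("\"'")
    return None


def preflight(config_text: str):
    """Return a list of problems found; an empty list means both checks passed."""
    problems = []
    plbert = read_top_level_key(config_text, "PLBERT_dir")
    if not plbert or "PLBERT_all_languages" not in plbert:
        problems.append(f"PLBERT_dir does not point to PLBERT_all_languages: {plbert}")
    elif not os.path.isdir(plbert):
        problems.append(f"PLBERT_dir not found on disk: {plbert}")
    ckpt = read_top_level_key(config_text, "pretrained_model")
    if not ckpt or not os.path.isfile(ckpt):
        problems.append(f"pretrained_model not found on disk: {ckpt}")
    return problems


# Hypothetical config that fails both checks.
bad_config = "PLBERT_dir: Utils/PLBERT/\npretrained_model: /no/such/file.pth\n"
for problem in preflight(bad_config):
    print("PRE-FLIGHT:", problem)
```

Running this from the watchdog before every restart would also break the crash loop from Pitfall 4, since a missing checkpoint is reported once instead of triggering endless restarts.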

Monitoring Commands

# Is the discriminator alive? (matches only exactly-zero values, so 0.00412 still counts as alive)
grep 'DiscLM Loss' train.log | grep -Ev 'DiscLM Loss: 0\.0+([^0-9]|$)' | tail -10

# Average Dur Loss per epoch (watch for downward trend)
grep 'Epoch \[' train.log | grep -oP 'Epoch \[\K[0-9]+|Dur Loss: \K[0-9.]+' | \
  paste - - | python3 -c "
import sys
from collections import defaultdict
s, c = defaultdict(float), defaultdict(int)
for l in sys.stdin:
    p = l.strip().split()
    if len(p) == 2: s[int(p[0])] += float(p[1]); c[int(p[0])] += 1
for e in sorted(s): print(f'Epoch {e}: Dur Loss avg = {s[e]/c[e]:.5f}')
"

# Live training monitor
tail -f train.log | grep --line-buffered 'Step \['

What worked well

PLBERT_all_languages — clear improvement for German phonemes once correctly configured
lambda_slm: 1.0 — keep WavLM generator signal active, never set to 0
batch_size=6, max_len=600 — stable on H200, no OOM (batch_size=8 caused OOM)
Watchdog with setsid — survives SSH disconnects reliably
load_only_params: false — correctly resumes epoch counter from checkpoint
Network Volume — checkpoints survive pod restarts
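For the setsid point, this is roughly what a watchdog can do from Python on a POSIX system: `subprocess.Popen` with `start_new_session=True` makes the child call `setsid()`, so it no longer belongs to the SSH session's terminal. The helper name is made up, and the command below is a short stand-in for the real training command.

```python
import os
import subprocess
import sys


def launch_detached(cmd, log_path):
    """Start cmd in its own session with stdout and stderr appended to log_path."""
    with open(log_path, "ab") as log:
        return subprocess.Popen(
            cmd,
            stdin=subprocess.DEVNULL,
            stdout=log,
            stderr=subprocess.STDOUT,
            start_new_session=True,  # child calls setsid() before exec
        )


# Stand-in for the training command; replace with your own launch line.
proc = launch_detached([sys.executable, "-c", "import time; time.sleep(0.5)"], os.devnull)

# A session leader's session ID equals its own PID.
is_session_leader = os.getsid(proc.pid) == proc.pid
print(is_session_leader)
proc.wait()
```

Redirecting stdin and both output streams matters as much as the new session: a process that still holds the terminal's file descriptors can fail on write once the SSH connection goes away.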

Hope this saves someone the GPU hours it cost to learn these lessons.
