CorentinJ · CorentinJ · Sep 23, 2025 · Sep 16, 2025 · Sep 17, 2025 · Sep 17, 2025
diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml
@@ -0,0 +1,33 @@
+name: ci
+
+on:
+  pull_request:
+  push:
+    branches: [master, main]
+  workflow_dispatch:
+
+jobs:
+  test:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+
+      - name: Install system dependencies
+        run: |
+          sudo apt-get update
+          sudo apt-get install -y libsndfile1
+
+      - name: Install uv
+        uses: astral-sh/setup-uv@v6
+
+      - name: Set up Python 3.9
+        run: uv python install 3.9
+
+      - name: Pin Python 3.9
+        run: uv python pin 3.9
+
+      - name: Sync dependencies
+        run: uv sync --extra cpu --dev --locked || uv sync --extra cpu --dev
+
+      - name: Run tests
+        run: uv run pytest -q
diff --git a/.gitignore b/.gitignore
@@ -18,3 +18,6 @@
 encoder/saved_models/*
 synthesizer/saved_models/*
 vocoder/saved_models/*
+saved_models/*
+.venv/
+.pytest_cache/
diff --git a/README.md b/README.md
@@ -1,4 +1,5 @@
 # Real-Time Voice Cloning
+
 This repository is an implementation of [Transfer Learning from Speaker Verification to
 Multispeaker Text-To-Speech Synthesis](https://arxiv.org/pdf/1806.04558.pdf) (SV2TTS) with a vocoder that works in real-time. This was my [master's thesis](https://matheo.uliege.be/handle/2268.2/6801).
 
@@ -8,48 +9,56 @@ SV2TTS is a deep learning framework in three stages. In the first stage, one cre
 
 [![Toolbox demo](https://i.imgur.com/8lFUlgz.png)](https://www.youtube.com/watch?v=-O_hYhToKoA)
 
+### Papers implemented
 
-
-### Papers implemented  
-| URL | Designation | Title | Implementation source |
-| --- | ----------- | ----- | --------------------- |
-|[**1806.04558**](https://arxiv.org/pdf/1806.04558.pdf) | **SV2TTS** | **Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis** | This repo |
-|[1802.08435](https://arxiv.org/pdf/1802.08435.pdf) | WaveRNN (vocoder) | Efficient Neural Audio Synthesis | [fatchord/WaveRNN](https://github.com/fatchord/WaveRNN) |
-|[1703.10135](https://arxiv.org/pdf/1703.10135.pdf) | Tacotron (synthesizer) | Tacotron: Towards End-to-End Speech Synthesis | [fatchord/WaveRNN](https://github.com/fatchord/WaveRNN)
-|[1710.10467](https://arxiv.org/pdf/1710.10467.pdf) | GE2E (encoder)| Generalized End-To-End Loss for Speaker Verification | This repo |
+| URL                                                    | Designation            | Title                                                                                    | Implementation source                                   |
+| ------------------------------------------------------ | ---------------------- | ---------------------------------------------------------------------------------------- | ------------------------------------------------------- |
+| [**1806.04558**](https://arxiv.org/pdf/1806.04558.pdf) | **SV2TTS**             | **Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis** | This repo                                               |
+| [1802.08435](https://arxiv.org/pdf/1802.08435.pdf)     | WaveRNN (vocoder)      | Efficient Neural Audio Synthesis                                                         | [fatchord/WaveRNN](https://github.com/fatchord/WaveRNN) |
+| [1703.10135](https://arxiv.org/pdf/1703.10135.pdf)     | Tacotron (synthesizer) | Tacotron: Towards End-to-End Speech Synthesis                                            | [fatchord/WaveRNN](https://github.com/fatchord/WaveRNN) |
+| [1710.10467](https://arxiv.org/pdf/1710.10467.pdf)     | GE2E (encoder)         | Generalized End-To-End Loss for Speaker Verification                                     | This repo                                               |
 
 ## Heads up
+
 Like everything else in Deep Learning, this repo has quickly gotten old. Many SaaS apps (often paying) will give you a better audio quality than this repository will. If you wish for an open-source solution with a high voice quality:
+
 - Check out [paperswithcode](https://paperswithcode.com/task/speech-synthesis/) for other repositories and recent research in the field of speech synthesis.
 - Check out [Chatterbox](https://github.com/resemble-ai/chatterbox) for a similar project up to date with the 2025 SOTA in voice cloning
 
-## Setup
-
-### 1. Install Requirements
-1. Both Windows and Linux are supported. A GPU is recommended for training and for inference speed, but is not mandatory.
-2. Python 3.7 is recommended. Python 3.5 or greater should work, but you'll probably have to tweak the dependencies' versions. I recommend setting up a virtual environment using `venv`, but this is optional.
-3. Install [ffmpeg](https://ffmpeg.org/download.html#get-packages). This is necessary for reading audio files.
-4. Install [PyTorch](https://pytorch.org/get-started/locally/). Pick the latest stable version, your operating system, your package manager (pip by default) and finally pick any of the proposed CUDA versions if you have a GPU, otherwise pick CPU. Run the given command.
-5. Install the remaining requirements with `pip install -r requirements.txt`
+## Running the toolbox
+
+Both Windows and Linux are supported.
+1. Install [ffmpeg](https://ffmpeg.org/download.html#get-packages). This is necessary for reading audio files. To check that it is installed, run the following in a terminal:
+```
+ffmpeg
+```
+2. Install uv, which manages the Python environment and packages:
+```
+# On Windows:
+powershell -ExecutionPolicy ByPass -c "irm https://astral.sh/uv/install.ps1 | iex"
+# On Linux:
+curl -LsSf https://astral.sh/uv/install.sh | sh
+
+# Alternatively, on any platform where pip is installed:
+pip install -U uv
+```
+3. Run one of the following commands:
+```
+# Run the toolbox (GUI) if you have an NVIDIA GPU
+uv run --extra cuda demo_toolbox.py
+# Run the toolbox (GUI) on CPU otherwise
+uv run --extra cpu demo_toolbox.py
+
+# Run the command-line demo instead if you don't want the GUI (same choice of extra)
+uv run --extra cuda demo_cli.py
+uv run --extra cpu demo_cli.py
+```
+uv automatically creates a `.venv` directory containing a suitable Python environment. [Open an issue](https://github.com/CorentinJ/Real-Time-Voice-Cloning/issues) if this fails for you.
+
+### (Optional) Download Pretrained Models
 
-### 2. (Optional) Download Pretrained Models
 Pretrained models are now downloaded automatically. If this doesn't work for you, you can manually download them [here](https://github.com/CorentinJ/Real-Time-Voice-Cloning/wiki/Pretrained-models).
 
-### 3. (Optional) Test Configuration
-Before you download any dataset, you can begin by testing your configuration with:
+### (Optional) Download Datasets
 
-`python demo_cli.py`
-
-If all tests pass, you're good to go.
-
-### 4. (Optional) Download Datasets
 For playing with the toolbox alone, I only recommend downloading [`LibriSpeech/train-clean-100`](https://www.openslr.org/resources/12/train-clean-100.tar.gz). Extract the contents as `<datasets_root>/LibriSpeech/train-clean-100` where `<datasets_root>` is a directory of your choosing. Other datasets are supported in the toolbox, see [here](https://github.com/CorentinJ/Real-Time-Voice-Cloning/wiki/Training#datasets). You're free not to download any dataset, but then you will need your own data as audio files or you will have to record it with the toolbox.
-
-### 5. Launch the Toolbox
-You can then try the toolbox:
-
-`python demo_toolbox.py -d <datasets_root>`  
-or  
-`python demo_toolbox.py`  
-
-depending on whether you downloaded any datasets. If you are running an X-server or if you have the error `Aborted (core dumped)`, see [this issue](https://github.com/CorentinJ/Real-Time-Voice-Cloning/issues/11#issuecomment-504733590).
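
Note on the new README instructions: the reader has to choose between the `cuda` and `cpu` extras by hand. A minimal sketch of how a launcher script could make that choice, assuming that finding `nvidia-smi` on the PATH is a good enough signal for an NVIDIA GPU. That heuristic and the helper names `pick_extra` and `build_command` are hypothetical and not part of this PR.

```python
import shutil
from typing import List, Optional


def pick_extra(nvidia_smi_path: Optional[str]) -> str:
    """Return the uv extra to use: "cuda" if nvidia-smi was found, else "cpu"."""
    return "cuda" if nvidia_smi_path else "cpu"


def build_command(script: str, extra: str) -> List[str]:
    """Build the `uv run --extra <extra> <script>` command shown in the README."""
    if extra not in ("cpu", "cuda"):
        raise ValueError(f"unknown extra: {extra!r}")
    return ["uv", "run", "--extra", extra, script]


if __name__ == "__main__":
    # Assumption: nvidia-smi on the PATH implies a usable NVIDIA GPU.
    extra = pick_extra(shutil.which("nvidia-smi"))
    print(" ".join(build_command("demo_toolbox.py", extra)))
```

The sketch only prints the command. Passing the same list to `subprocess.run` would launch it.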
diff --git a/encoder/audio.py b/encoder/audio.py
@@ -99,7 +99,7 @@ def moving_average(array, width):
         return ret[width - 1:] / width
 
     audio_mask = moving_average(voice_flags, vad_moving_average_width)
-    audio_mask = np.round(audio_mask).astype(np.bool)
+    audio_mask = np.round(audio_mask).astype(bool)
 
     # Dilate the voiced regions
     audio_mask = binary_dilation(audio_mask, np.ones(vad_max_silence_length + 1))
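
Context for this change: the `np.bool` and `np.float` aliases were deprecated in NumPy 1.20 and removed in 1.24, so under the `numpy>=1.26,<2` pin in `pyproject.toml` the old spellings raise `AttributeError`. A minimal sketch of the smoothing step with the builtin `bool`, assuming a centered moving average. The body of `moving_average` is only partly visible in this hunk, so the padding below is an assumption, and the sample flags are made up.

```python
import numpy as np


def moving_average(array, width):
    # Assumed implementation: zero-pad so the output keeps the input's length,
    # then take a running mean over `width` samples via a cumulative sum.
    padded = np.concatenate((np.zeros((width - 1) // 2), array, np.zeros(width // 2)))
    ret = np.cumsum(padded, dtype=float)
    ret[width:] = ret[width:] - ret[:-width]
    return ret[width - 1:] / width


# Hypothetical per-window VAD decisions (1 = voiced, 0 = silence).
voice_flags = np.array([0, 1, 1, 1, 0, 0, 1, 0])

# The builtin `bool` is what the removed `np.bool` alias pointed to.
audio_mask = np.round(moving_average(voice_flags, 3)).astype(bool)
print(audio_mask)
```

With these flags the isolated voiced window at index 6 is smoothed away, while the run of three voiced windows survives.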

diff --git a/encoder/visualizations.py b/encoder/visualizations.py
@@ -22,7 +22,7 @@
     [33, 0, 127],
     [0, 0, 0],
     [183, 183, 183],
-], dtype=np.float) / 255
+], dtype=np.float64) / 255
 
 
 class Visualizations:

diff --git a/pyproject.toml b/pyproject.toml
@@ -0,0 +1,61 @@
+[project]
+name = "real-time-voice-cloning"
+version = "0.0.0"
+requires-python = ">=3.9,<3.10"
+
+# Dependencies populated from requirements.txt via `uv add -r requirements.txt`.
+dependencies = [
+  "inflect==5.3.0",
+  "librosa==0.9.2",
+  "matplotlib==3.5.1",
+  "numpy>=1.26,<2",
+  "Pillow==8.4.0",
+  "PyQt5==5.15.6",
+  "scikit-learn==1.0.2",
+  "scipy>=1.7",
+  "setuptools<=80.8.0",
+  "sounddevice==0.4.3",
+  "soundfile==0.10.3.post1",
+  "tqdm==4.62.3",
+  "umap-learn==0.5.2",
+  "Unidecode==1.3.2",
+  "urllib3==1.26.7",
+  "visdom==0.1.8.9",
+  "webrtcvad==2.0.10",
+]
+
+[project.optional-dependencies]
+# torch comes from one of two mutually exclusive extras: `cpu` (used in CI) or `cuda`
+cpu = ["torch==1.10.*"]
+cuda = ["torch==1.10.*"]
+
+[dependency-groups]
+# Dev/test dependencies
+dev = ["pytest"]
+
+# Map extras -> PyTorch wheel indexes so uv knows where to fetch torch
+[tool.uv.sources]
+torch = [
+  { index = "pytorch-cpu",   extra = "cpu" },
+  { index = "pytorch-cuda", extra = "cuda" },
+]
+
+[[tool.uv.index]]
+name = "pytorch-cuda"
+url = "https://download.pytorch.org/whl/cu113"
+explicit = true
+
+[[tool.uv.index]]
+name = "pytorch-cpu"
+url = "https://download.pytorch.org/whl/cpu"
+explicit = true
+
+# Constrain lock resolution to platforms we actually support to avoid
+# selecting wheels that don't exist on Windows during local sync.
+[tool.uv]
+# Use PEP 508 markers for environments we support; allows a lock usable on Windows and Linux.
+required-environments = [
+  "sys_platform == 'win32'",
+  "sys_platform == 'linux' and platform_machine == 'x86_64'",
+]
+conflicts = [[{ extra = "cpu" }, { extra = "cuda" }]]
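
Note on `requires-python = ">=3.9,<3.10"`: this accepts only Python 3.9.x. uv enforces it when it builds the environment, but a script started with another interpreter gets no such check. A small guard could reproduce it. This is a sketch, not part of the PR, and `python_supported` is a hypothetical helper.

```python
import sys


def python_supported(version_info=sys.version_info) -> bool:
    """Mirror `requires-python = ">=3.9,<3.10"`: accept any 3.9.x release."""
    return (3, 9) <= tuple(version_info[:2]) < (3, 10)


if __name__ == "__main__":
    if not python_supported():
        found = ".".join(map(str, sys.version_info[:3]))
        print(f"Python 3.9 is required, found {found}", file=sys.stderr)
```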
diff --git a/requirements.txt b/requirements.txt
diff --git a/tests/test_ci_smoke.py b/tests/test_ci_smoke.py
@@ -0,0 +1,20 @@
+import os
+import sys
+
+# Ensure the repository root is on sys.path for imports like `import encoder`
+ROOT = os.path.dirname(os.path.dirname(__file__))
+if ROOT not in sys.path:
+    sys.path.insert(0, ROOT)
+
+
+def test_third_party_imports():
+    import librosa  # noqa: F401
+    import numpy  # noqa: F401
+    import soundfile  # noqa: F401
+    import torch  # noqa: F401
+
+
+def test_project_imports():
+    import encoder  # noqa: F401
+    import synthesizer  # noqa: F401
+    import vocoder  # noqa: F401
diff --git a/toolbox/ui.py b/toolbox/ui.py
@@ -34,7 +34,7 @@
     [0, 0, 0],
     [183, 183, 183],
     [76, 255, 0],
-], dtype=np.float) / 255
+], dtype=np.float64) / 255
 
 default_text = \
     "Welcome to the toolbox! To begin, load an utterance from your datasets or record one " \

diff --git a/utils/logmmse.py b/utils/logmmse.py
@@ -176,7 +176,7 @@ def denoise(wav, noise_profile: NoiseProfile, eta=0.15):
 #     wav += np.finfo(np.float64).eps
 #     
 #     nframes = int(math.floor(len(wav) / len2) - math.floor(window_size / len2))
-#     vad = np.zeros(nframes * len2, dtype=np.bool)
+#     vad = np.zeros(nframes * len2, dtype=bool)
 # 
 #     aa = 0.98
 #     mu = 0.98