mgonzs13/whisper_ros

whisper_ros

This repository provides a set of ROS 2 packages that integrate whisper.cpp into ROS 2 using audio_common. It also uses silero-vad for voice activity detection (VAD).

License: MIT

| ROS 2 Distro | Branch | Build status | Docker Image |
| --- | --- | --- | --- |
| Humble | main | Humble Build | Docker Image |
| Iron | main | Iron Build | Docker Image |
| Jazzy | main | Jazzy Build | Docker Image |
| Kilted | main | Kilted Build | Docker Image |
| Rolling | main | Rolling Build | Docker Image |

Table of Contents

  1. Related Projects
  2. Installation
  3. Docker
  4. Usage
  5. Params
  6. Demos

Related Projects

  • chatbot_ros → A chatbot integrated into ROS 2 that uses whisper_ros to listen to people's speech and llama_ros to generate responses. The chatbot is controlled by a state machine created with YASMIN.

Installation

To run whisper_ros with CUDA, you must first install the CUDA Toolkit. To run SileroVAD with ONNX and CUDA, you must also install cuDNN.

cd ~/ros2_ws/src
git clone https://github.com/mgonzs13/whisper_ros.git
cd ~/ros2_ws
vcs import src < src/whisper_ros/dependencies.repos
rosdep install --from-paths src --ignore-src -r -y
colcon build --cmake-args -DGGML_CUDA=ON -DONNX_GPU=ON # To use CUDA on Whisper and on Silero, respectively

Docker

Build the whisper_ros Docker image. Optionally, enable CUDA with the USE_CUDA build argument and select the CUDA version with CUDA_VERSION. When building the image with CUDA, you must set DOCKER_BUILDKIT=0.

DOCKER_BUILDKIT=0 docker build -t whisper_ros --build-arg USE_CUDA=1 --build-arg CUDA_VERSION=12-6 .

Run the Docker container. To use CUDA, install the NVIDIA Container Toolkit and pass --gpus all.

docker run -it --rm --gpus all whisper_ros

Usage

Launch Silero for VAD and Whisper for speech-to-text (STT):

ros2 launch whisper_bringup whisper.launch.py

Append the launch argument silero_vad_use_cuda:=True to run Silero with CUDA.

Params

Model Parameters (model.*)

| Param | Type | Default | Description |
| --- | --- | --- | --- |
| model.repo | string | "" | HuggingFace repository for model download. |
| model.filename | string | "" | Filename of the model in the repository. |
| model.path | string | "" | Local path to the Whisper model file. If empty, the model is downloaded from model.repo. |
| model.openvino_encode_device | string | "CPU" | OpenVINO device for encoder inference (e.g., "CPU", "GPU"). |

Sampling Parameters (sampling.*)

| Param | Type | Default | Description |
| --- | --- | --- | --- |
| sampling.strategy | string | "beam_search" | Decoding strategy: "greedy" or "beam_search". |
| sampling.greedy_best_of | int32 | 5 | Number of best candidates to keep when using greedy sampling. |
| sampling.beam_search_beam_size | int32 | 5 | Beam size for beam search. |
| sampling.beam_search_patience | float | -1.0 | Beam search patience factor (-1.0 = use whisper.cpp default). |

Transcription Parameters (transcription.*)

| Param | Type | Default | Description |
| --- | --- | --- | --- |
| transcription.n_threads | int32 | 4 | Number of threads for processing. Use -1 to auto-detect from hardware concurrency. |
| transcription.n_max_text_ctx | int32 | 16384 | Maximum tokens to use from past text as prompt for the decoder. |
| transcription.n_processors | int32 | 1 | Number of processors for parallel transcription via whisper_full_parallel. |
| transcription.offset_ms | int32 | 0 | Start offset in milliseconds. |
| transcription.duration_ms | int32 | 0 | Audio duration to process in milliseconds (0 = process all). |
| transcription.audio_ctx | int32 | 0 | Overwrite the audio context size (0 = use model default). |
| transcription.language | string | "en" | Spoken language code (e.g., "en", "es", "de"). Use "auto" for auto-detection. |
| transcription.detect_language | bool | false | If true, auto-detect the spoken language and exit without transcription. |
| transcription.translate | bool | false | If true, translate the transcription to English. |
| transcription.no_context | bool | true | If true, do not use past transcription as initial prompt for the decoder. |
| transcription.no_timestamps | bool | false | If true, do not generate timestamps. |
| transcription.single_segment | bool | false | If true, force single segment output (useful for streaming). |
| transcription.initial_prompt | string | "" | Initial prompt text prepended to the decoder context to guide transcription. |
| transcription.carry_initial_prompt | bool | false | If true, always prepend the initial prompt to every decode window. |
| transcription.suppress_regex | string | "" | A regular expression that matches tokens to suppress (empty = disabled). |
| transcription.suppress_blank | bool | true | Suppress blank outputs at the beginning of the sampling. |
| transcription.suppress_nst | bool | false | Suppress non-speech tokens. |
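
As a rough illustration of how transcription.n_threads, transcription.offset_ms and transcription.duration_ms can be read, the sketch below resolves the -1 thread setting and converts the millisecond values into a range of sample indices. It assumes 16 kHz audio (the rate Whisper models operate on). The helper names resolve_n_threads and sample_window are hypothetical and are not part of whisper_ros or whisper.cpp.

```python
import os

WHISPER_SAMPLE_RATE = 16_000  # assumption: 16 kHz mono audio, as Whisper expects


def resolve_n_threads(n_threads: int) -> int:
    """Hypothetical helper: -1 means 'use the hardware concurrency'."""
    if n_threads == -1:
        return os.cpu_count() or 1  # cpu_count() may return None
    return n_threads


def sample_window(total_samples: int, offset_ms: int, duration_ms: int):
    """Hypothetical helper: map offset_ms/duration_ms to a [start, end) sample range."""
    start = min(total_samples, offset_ms * WHISPER_SAMPLE_RATE // 1000)
    if duration_ms == 0:  # 0 = process everything after the offset
        end = total_samples
    else:
        end = min(total_samples, start + duration_ms * WHISPER_SAMPLE_RATE // 1000)
    return start, end


if __name__ == "__main__":
    # 10 s of audio: skip the first 2 s, then process 3 s
    print(sample_window(160_000, 2000, 3000))  # (32000, 80000)
    print(resolve_n_threads(4))                # 4
```

With the defaults (offset_ms = 0, duration_ms = 0), the window covers the whole buffer.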

Token Timestamps Parameters (token_timestamps.*)

| Param | Type | Default | Description |
| --- | --- | --- | --- |
| token_timestamps.enabled | bool | false | Enable token-level timestamps. |
| token_timestamps.thold_pt | float | 0.01 | Timestamp token probability threshold. |
| token_timestamps.thold_ptsum | float | 0.01 | Timestamp token sum probability threshold. |
| token_timestamps.max_len | int32 | 0 | Maximum segment length in characters (0 = no limit). |
| token_timestamps.split_on_word | bool | false | If true, split on word rather than on token (when used with max_len). |
| token_timestamps.max_tokens | int32 | 0 | Maximum tokens per segment (0 = no limit). |
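
To give a feel for how token_timestamps.max_len and token_timestamps.split_on_word interact, here is a simplified sketch. It is not the whisper.cpp algorithm: the hypothetical function wrap_segments works on plain text, and fixed-size character slices stand in for token boundaries when split_on_word is false.

```python
def wrap_segments(text: str, max_len: int, split_on_word: bool) -> list:
    """Hypothetical illustration of limiting segment length to max_len characters."""
    if max_len <= 0:  # 0 = no limit
        return [text]
    if not split_on_word:
        # Stand-in for token-level splitting: cut wherever the limit is hit
        return [text[i:i + max_len] for i in range(0, len(text), max_len)]
    segments, current = [], ""
    for word in text.split():
        candidate = f"{current} {word}" if current else word
        if current and len(candidate) > max_len:
            segments.append(current)  # close the segment at a word boundary
            current = word
        else:
            current = candidate
    if current:
        segments.append(current)
    return segments


if __name__ == "__main__":
    print(wrap_segments("the quick brown fox", 10, split_on_word=True))
    print(wrap_segments("the quick brown fox", 10, split_on_word=False))
```

In the word-based mode, segments end at word boundaries. The hard-cut mode can end a segment in the middle of a word.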

Decoding / Fallback Parameters (decoding.*)

| Param | Type | Default | Description |
| --- | --- | --- | --- |
| decoding.temperature | float | 0.0 | Initial decoding temperature. Use 0.0 for deterministic output. |
| decoding.max_initial_ts | float | 1.0 | Maximum initial timestamp allowed. |
| decoding.length_penalty | float | -1.0 | Length penalty factor (-1.0 = use whisper.cpp default). |
| decoding.temperature_inc | float | 0.2 | Temperature increment on decoding fallback. |
| decoding.entropy_thold | float | 2.4 | Entropy threshold for decoding fallback (similar to compression ratio). |
| decoding.logprob_thold | float | -1.0 | Log probability threshold for decoding fallback. |
| decoding.no_speech_thold | float | 0.6 | No-speech probability threshold. If the no-speech probability exceeds this value and logprob_thold fails, consider the segment as silence. |
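
The fallback parameters can be pictured as a retry schedule: decoding starts at decoding.temperature and, on failure, retries at a temperature raised by decoding.temperature_inc. The sketch below lists that schedule and encodes the silence rule from the table. The upper bound of 1.0 on the temperature is an assumption based on how Whisper implementations commonly behave, and both function names are hypothetical.

```python
def fallback_temperatures(temperature=0.0, temperature_inc=0.2):
    """Hypothetical helper: temperatures tried in order during decoding fallback."""
    if temperature_inc <= 0.0:
        return [temperature]  # no increment means no fallback retries
    temps = []
    t = temperature
    while t < 1.0 + 1e-6:  # assumption: retries stop once the temperature passes 1.0
        temps.append(round(t, 6))
        t += temperature_inc
    return temps


def is_silence(no_speech_prob, avg_logprob, no_speech_thold=0.6, logprob_thold=-1.0):
    """Silence rule as described in the table: both conditions must hold."""
    return no_speech_prob > no_speech_thold and avg_logprob < logprob_thold


if __name__ == "__main__":
    print(fallback_temperatures())  # [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
    print(is_silence(0.8, -1.5))    # True
    print(is_silence(0.8, -0.3))    # False
```

With the default values, a segment gets up to six decoding attempts under this assumption. Setting temperature_inc to 0.0 in the sketch leaves a single attempt.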

Speaker Diarization Parameters (diarization.*)

| Param | Type | Default | Description |
| --- | --- | --- | --- |
| diarization.tdrz_enable | bool | false | Enable tinydiarize speaker turn detection. |

GPU / Backend Parameters (gpu.*)

| Param | Type | Default | Description |
| --- | --- | --- | --- |
| gpu.enabled | bool | false | Enable GPU inference. |
| gpu.flash_attn | bool | true | Enable flash attention. |
| gpu.device | int32 | 0 | CUDA device index to use. |

DTW Token Timestamps Parameters (dtw.*)

| Param | Type | Default | Description |
| --- | --- | --- | --- |
| dtw.token_timestamps | bool | false | Enable experimental token-level timestamps with DTW. |
| dtw.n_top | int32 | -1 | Number of top text layers for DTW alignment heads (-1 = disabled). |
| dtw.aheads | string | "none" | DTW alignment heads preset. Options: "none", "tiny", "tiny.en", "base", "base.en", "small", "small.en", "medium", "medium.en", "large.v1", "large.v2", "large.v3", "large.v3.turbo". |
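
The dotted names in the tables above are ordinary ROS 2 parameter names, so they can in principle be overridden with the standard --ros-args -p name:=value syntax. The sketch below flattens a nested dictionary of overrides into that form. The helper to_ros_args is hypothetical, and how these arguments reach the Whisper node (for example, which executable or launch file you pass them to) is not covered here and is left as an assumption.

```python
def to_ros_args(params: dict, prefix: str = "") -> list:
    """Hypothetical helper: flatten nested overrides into '-p name:=value' pairs."""
    args = []
    for key, value in params.items():
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            args.extend(to_ros_args(value, name))  # recurse into the namespace
        else:
            if isinstance(value, bool):
                value = "true" if value else "false"  # YAML-style booleans
            args.extend(["-p", f"{name}:={value}"])
    return args


if __name__ == "__main__":
    overrides = {
        "sampling": {"strategy": "greedy", "greedy_best_of": 5},
        "gpu": {"enabled": True},
    }
    print(" ".join(["--ros-args", *to_ros_args(overrides)]))
    # --ros-args -p sampling.strategy:=greedy -p sampling.greedy_best_of:=5 -p gpu.enabled:=true
```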

Demos

Send an action goal to start listening:

ros2 action send_goal /whisper/listen whisper_msgs/action/STT "{}"

Or run the example Whisper client:

ros2 run whisper_demos whisper_demo_node