whisper_ros

This repository provides a set of ROS 2 packages to integrate whisper.cpp into ROS 2 using audio_common. In addition, silero-vad is used for Voice Activity Detection (VAD).

| ROS 2 Distro | Branch |
| --- | --- |
| Humble | `main` |
| Iron | `main` |
| Jazzy | `main` |
| Kilted | `main` |
| Rolling | `main` |

Related Projects

chatbot_ros → This chatbot, integrated into ROS 2, uses whisper_ros to listen to people's speech and llama_ros to generate responses. The chatbot is controlled by a state machine created with YASMIN.

Installation

To run whisper_ros with CUDA, you must first install the CUDA Toolkit. To run SileroVAD with ONNX and CUDA, you must also install cuDNN.

cd ~/ros2_ws/src
git clone https://github.com/mgonzs13/whisper_ros.git
cd ~/ros2_ws
vcs import src < src/whisper_ros/dependencies.repos
rosdep install --from-paths src --ignore-src -r -y
colcon build --cmake-args -DGGML_CUDA=ON -DONNX_GPU=ON # To use CUDA on Whisper and on Silero, respectively

Docker

Build the whisper_ros Docker image. Optionally, you can build whisper_ros with CUDA (USE_CUDA) and choose the CUDA version (CUDA_VERSION). Note that DOCKER_BUILDKIT=0 is required to compile whisper_ros with CUDA when building the image.

DOCKER_BUILDKIT=0 docker build -t whisper_ros --build-arg USE_CUDA=1 --build-arg CUDA_VERSION=12-6 .

Run the Docker container. To use CUDA, install the NVIDIA Container Toolkit and add --gpus all.

docker run -it --rm --gpus all whisper_ros

Usage

Run Silero for VAD and Whisper for STT:

ros2 launch whisper_bringup whisper.launch.py

Add the parameter silero_vad_use_cuda:=True to use Silero with CUDA.
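
When the launch command is assembled from a script, the `name:=value` launch arguments can be appended programmatically. The following is a minimal sketch; `launch_command` is a hypothetical helper, and `silero_vad_use_cuda` is the only launch argument taken from this README.

```python
import shlex


def launch_command(**launch_args):
    """Build the argv list for the whisper_bringup launch file.

    Hypothetical helper: each keyword becomes a `name:=value` launch argument.
    """
    cmd = ["ros2", "launch", "whisper_bringup", "whisper.launch.py"]
    for name, value in launch_args.items():
        cmd.append(f"{name}:={value}")
    return cmd


# Print the command line instead of running it, so it can be inspected first.
print(shlex.join(launch_command(silero_vad_use_cuda=True)))
```

The returned list can be passed to `subprocess.run` as is.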

Params

Model Parameters (`model.*`)

| Param | Type | Default | Description |
| --- | --- | --- | --- |
| `model.repo` | `string` | `""` | HuggingFace repository for model download. |
| `model.filename` | `string` | `""` | Filename of the model in the repository. |
| `model.path` | `string` | `""` | Local path to the Whisper model file. If empty, the model is downloaded from `model.repo`. |
| `model.openvino_encode_device` | `string` | `"CPU"` | OpenVINO device for encoder inference (e.g., `"CPU"`, `"GPU"`). |
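
The precedence between `model.path` and `model.repo` / `model.filename` described above can be sketched as follows. This only illustrates the rule stated in the table and is not the node's actual code; `resolve_model_source` is a hypothetical name, and the repository and filename in the usage lines are example values, not defaults.

```python
def resolve_model_source(path="", repo="", filename=""):
    """Decide where the model comes from, following the table above.

    A non-empty local path wins; otherwise repo + filename are needed.
    """
    if path:
        return ("local", path)
    if repo and filename:
        return ("hub", repo, filename)
    raise ValueError("set model.path, or both model.repo and model.filename")


# Example values only (assumed for illustration).
print(resolve_model_source(repo="ggerganov/whisper.cpp", filename="ggml-base.bin"))
print(resolve_model_source(path="/models/ggml-base.bin"))
```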

Sampling Parameters (`sampling.*`)

| Param | Type | Default | Description |
| --- | --- | --- | --- |
| `sampling.strategy` | `string` | `"beam_search"` | Decoding strategy: `"greedy"` or `"beam_search"`. |
| `sampling.greedy_best_of` | `int32` | `5` | Number of best candidates to keep when using greedy sampling. |
| `sampling.beam_search_beam_size` | `int32` | `5` | Beam size for beam search. |
| `sampling.beam_search_patience` | `float` | `-1.0` | Beam search patience factor (`-1.0` = use whisper.cpp default). |

Transcription Parameters (`transcription.*`)

| Param | Type | Default | Description |
| --- | --- | --- | --- |
| `transcription.n_threads` | `int32` | `4` | Number of threads for processing. Use `-1` to auto-detect from hardware concurrency. |
| `transcription.n_max_text_ctx` | `int32` | `16384` | Maximum tokens to use from past text as prompt for the decoder. |
| `transcription.n_processors` | `int32` | `1` | Number of processors for parallel transcription via `whisper_full_parallel`. |
| `transcription.offset_ms` | `int32` | `0` | Start offset in milliseconds. |
| `transcription.duration_ms` | `int32` | `0` | Audio duration to process in milliseconds (`0` = process all). |
| `transcription.audio_ctx` | `int32` | `0` | Overwrite the audio context size (`0` = use model default). |
| `transcription.language` | `string` | `"en"` | Spoken language code (e.g., `"en"`, `"es"`, `"de"`). Use `"auto"` for auto-detection. |
| `transcription.detect_language` | `bool` | `false` | If `true`, auto-detect the spoken language and exit without transcription. |
| `transcription.translate` | `bool` | `false` | If `true`, translate the transcription to English. |
| `transcription.no_context` | `bool` | `true` | If `true`, do not use past transcription as initial prompt for the decoder. |
| `transcription.no_timestamps` | `bool` | `false` | If `true`, do not generate timestamps. |
| `transcription.single_segment` | `bool` | `false` | If `true`, force single segment output (useful for streaming). |
| `transcription.initial_prompt` | `string` | `""` | Initial prompt text prepended to the decoder context to guide transcription. |
| `transcription.carry_initial_prompt` | `bool` | `false` | If `true`, always prepend the initial prompt to every decode window. |
| `transcription.suppress_regex` | `string` | `""` | A regular expression that matches tokens to suppress (empty = disabled). |
| `transcription.suppress_blank` | `bool` | `true` | Suppress blank outputs at the beginning of the sampling. |
| `transcription.suppress_nst` | `bool` | `false` | Suppress non-speech tokens. |
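
To make `n_threads`, `offset_ms`, and `duration_ms` concrete, the sketch below shows how these values could translate into a thread count and a sample window. It assumes 16 kHz input audio, which is the rate Whisper models operate on; the function names are hypothetical and do not come from whisper_ros.

```python
import os

WHISPER_SAMPLE_RATE = 16000  # assumption: 16 kHz mono audio


def resolve_n_threads(n_threads):
    """Map -1 to the detected core count; pass other values through."""
    if n_threads == -1:
        return os.cpu_count() or 1
    return n_threads


def window_samples(n_samples, offset_ms=0, duration_ms=0):
    """Return the (start, end) sample indices selected by offset and duration.

    duration_ms == 0 means "process everything after the offset".
    """
    start = min(n_samples, offset_ms * WHISPER_SAMPLE_RATE // 1000)
    if duration_ms == 0:
        end = n_samples
    else:
        end = min(n_samples, start + duration_ms * WHISPER_SAMPLE_RATE // 1000)
    return start, end


# 10 s of audio, skipping the first second and processing the next two.
print(window_samples(10 * WHISPER_SAMPLE_RATE, offset_ms=1000, duration_ms=2000))
```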

Token Timestamps Parameters (`token_timestamps.*`)

| Param | Type | Default | Description |
| --- | --- | --- | --- |
| `token_timestamps.enabled` | `bool` | `false` | Enable token-level timestamps. |
| `token_timestamps.thold_pt` | `float` | `0.01` | Timestamp token probability threshold. |
| `token_timestamps.thold_ptsum` | `float` | `0.01` | Timestamp token sum probability threshold. |
| `token_timestamps.max_len` | `int32` | `0` | Maximum segment length in characters (`0` = no limit). |
| `token_timestamps.split_on_word` | `bool` | `false` | If `true`, split on word rather than on token (when used with `max_len`). |
| `token_timestamps.max_tokens` | `int32` | `0` | Maximum tokens per segment (`0` = no limit). |
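
The interaction between `max_len` and `split_on_word` is easiest to see on a toy example. The following is a simplified sketch of the idea, not whisper.cpp's actual wrapping code: it assumes tokens are plain strings and that a token starting with a space begins a new word. `wrap_tokens` is a hypothetical name.

```python
def wrap_tokens(tokens, max_len, split_on_word=False):
    """Group tokens into segments of at most max_len characters.

    With split_on_word, a break is only allowed before a token that starts
    a new word (assumed here to be a token with a leading space).
    """
    if max_len <= 0:  # 0 = no limit
        return ["".join(tokens)]
    segments, current = [], ""
    for tok in tokens:
        can_break = tok.startswith(" ") or not split_on_word
        if current and can_break and len(current) + len(tok) > max_len:
            segments.append(current)
            current = tok
        else:
            current += tok
    if current:
        segments.append(current)
    return segments


tokens = [" Hel", "lo", " wor", "ld"]
print(wrap_tokens(tokens, max_len=5))                      # breaks mid-word
print(wrap_tokens(tokens, max_len=5, split_on_word=True))  # keeps words whole
```

Note that with `split_on_word` a segment may exceed `max_len` in this sketch when a single word is longer than the limit.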

Decoding / Fallback Parameters (`decoding.*`)

| Param | Type | Default | Description |
| --- | --- | --- | --- |
| `decoding.temperature` | `float` | `0.0` | Initial decoding temperature. Use `0.0` for deterministic output. |
| `decoding.max_initial_ts` | `float` | `1.0` | Maximum initial timestamp allowed. |
| `decoding.length_penalty` | `float` | `-1.0` | Length penalty factor (`-1.0` = use whisper.cpp default). |
| `decoding.temperature_inc` | `float` | `0.2` | Temperature increment on decoding fallback. |
| `decoding.entropy_thold` | `float` | `2.4` | Entropy threshold for decoding fallback (similar to compression ratio). |
| `decoding.logprob_thold` | `float` | `-1.0` | Log probability threshold for decoding fallback. |
| `decoding.no_speech_thold` | `float` | `0.6` | No-speech probability threshold. If the no-speech probability exceeds this value and `logprob_thold` fails, consider the segment as silence. |
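
The fallback parameters interact as follows: decoding starts at `temperature` and, when a result fails the thresholds, is retried at a temperature raised by `temperature_inc`. The sketch below enumerates such a schedule and applies the silence rule from the table. It assumes the ladder stops at 1.0, as in the reference Whisper implementation; the exact loop in whisper.cpp may differ, and both function names are hypothetical.

```python
def temperature_schedule(temperature=0.0, temperature_inc=0.2, max_temperature=1.0):
    """List the temperatures tried in order (assumed upper bound: 1.0)."""
    if temperature_inc <= 0.0:
        return [temperature]  # no fallback: a single decoding pass
    steps = max(0, int((max_temperature - temperature) / temperature_inc + 1e-6))
    # Multiply instead of accumulating to avoid floating-point drift.
    return [round(temperature + i * temperature_inc, 6) for i in range(steps + 1)]


def is_silence(no_speech_prob, avg_logprob, no_speech_thold=0.6, logprob_thold=-1.0):
    """Silence rule from the table: high no-speech probability AND a failed
    log-probability check (average log probability below the threshold)."""
    return no_speech_prob > no_speech_thold and avg_logprob < logprob_thold


print(temperature_schedule())                 # defaults from the table
print(is_silence(0.9, avg_logprob=-1.5))
```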

Speaker Diarization Parameters (`diarization.*`)

| Param | Type | Default | Description |
| --- | --- | --- | --- |
| `diarization.tdrz_enable` | `bool` | `false` | Enable tinydiarize speaker turn detection. |

GPU / Backend Parameters (`gpu.*`)

| Param | Type | Default | Description |
| --- | --- | --- | --- |
| `gpu.enabled` | `bool` | `false` | Enable GPU inference. |
| `gpu.flash_attn` | `bool` | `true` | Enable flash attention. |
| `gpu.device` | `int32` | `0` | CUDA device index to use. |

DTW Token Timestamps Parameters (`dtw.*`)

| Param | Type | Default | Description |
| --- | --- | --- | --- |
| `dtw.token_timestamps` | `bool` | `false` | Enable experimental token-level timestamps with DTW. |
| `dtw.n_top` | `int32` | `-1` | Number of top text layers for DTW alignment heads (`-1` = disabled). |
| `dtw.aheads` | `string` | `"none"` | DTW alignment heads preset. Options: `"none"`, `"tiny"`, `"tiny.en"`, `"base"`, `"base.en"`, `"small"`, `"small.en"`, `"medium"`, `"medium.en"`, `"large.v1"`, `"large.v2"`, `"large.v3"`, `"large.v3.turbo"`. |
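
All parameter names above are dotted (`group.name`). In ROS 2 parameter YAML files, such names are usually written as nested mappings, so it can help to see how a nested structure flattens into the dotted names listed in the tables. This is a minimal sketch: `flatten_params` is a hypothetical helper, and the override values are examples rather than recommended settings.

```python
def flatten_params(nested, prefix=""):
    """Flatten nested parameter groups into dotted ROS 2 parameter names."""
    flat = {}
    for key, value in nested.items():
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten_params(value, name))
        else:
            flat[name] = value
    return flat


# Example overrides, grouped the same way as the tables above.
overrides = {
    "transcription": {"language": "es", "translate": True},
    "gpu": {"enabled": True, "device": 0},
    "dtw": {"aheads": "base"},
}

for name, value in flatten_params(overrides).items():
    print(f"{name} = {value}")
```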

Demos

Send an action goal to start listening:

ros2 action send_goal /whisper/listen whisper_msgs/action/STT "{}"

Or try the example whisper client:

ros2 run whisper_demos whisper_demo_node
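
For reference, the `send_goal` command above could be reproduced from Python with an `rclpy` action client. The following is a minimal sketch, not the code of `whisper_demo_node`: the node name is hypothetical, the goal is sent empty as in the command above, and the result message is returned as is because its fields are not described here. It needs a sourced ROS 2 workspace with `whisper_msgs` built.

```python
ACTION_NAME = "/whisper/listen"  # action name taken from the command above


def listen_once(timeout_sec=10.0):
    """Send an empty STT goal and return the raw result message (or None)."""
    # Imported lazily so this module can be loaded without a ROS 2 environment.
    import rclpy
    from rclpy.action import ActionClient
    from rclpy.node import Node
    from whisper_msgs.action import STT

    rclpy.init()
    node = Node("stt_client_sketch")  # hypothetical node name
    try:
        client = ActionClient(node, STT, ACTION_NAME)
        if not client.wait_for_server(timeout_sec=timeout_sec):
            return None  # action server not available

        goal_future = client.send_goal_async(STT.Goal())
        rclpy.spin_until_future_complete(node, goal_future)
        goal_handle = goal_future.result()
        if not goal_handle.accepted:
            return None  # goal rejected by the server

        result_future = goal_handle.get_result_async()
        rclpy.spin_until_future_complete(node, result_future)
        return result_future.result().result
    finally:
        node.destroy_node()
        rclpy.shutdown()
```

Calling `print(listen_once())` while the launch file from the Usage section is running would print the result message.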
