RADIO-ViPE: Online Tightly Coupled Multi-Modal Fusion for Open-Vocabulary Semantic SLAM in Dynamic Environments
BE2R Lab — Biomechatronics and Energy-Efficient Robotics Laboratory, ITMO University
🌐 Project Page: be2rlab.github.io/radio_vipe | 📄 Paper: https://arxiv.org/pdf/2604.26067
We present RADIO-ViPE (Reduce All Domains Into One — Video Pose Engine), an online semantic SLAM system that enables geometry-aware open-vocabulary grounding, associating arbitrary natural language queries with localized 3D regions and objects in dynamic environments.
Unlike existing approaches that require calibrated, posed RGB-D input, RADIO-ViPE operates directly on raw monocular RGB video streams, requiring no prior camera intrinsics, depth sensors, or pose initialization. The system tightly couples vision-language embeddings derived from agglomerative foundation models (e.g., RADIO) with geometric scene information. This vision-language-geometric fusion is optimized under adaptive robust kernels designed to handle both actively moving objects and agent-displaced scene elements (e.g., furniture rearranged during ego-centric sessions).
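As a rough illustration of the robust-kernel idea (the specific kernel and weighting scheme used by RADIO-ViPE are not detailed here, so treat this as a generic sketch): residuals produced by dynamic content are large, and an adaptive robust weight suppresses their influence on the optimization.

```python
import numpy as np

def adaptive_huber_weights(residuals: np.ndarray, k: float = 1.345) -> np.ndarray:
    """Generic IRLS weights for a Huber kernel with an adaptive (MAD) scale.

    Residuals inside the inlier band get weight 1; larger residuals, e.g.
    those caused by moving objects, are down-weighted as k / |r|.
    """
    scale = 1.4826 * np.median(np.abs(residuals))  # robust scale estimate (MAD)
    r = np.abs(residuals) / max(scale, 1e-12)
    return np.where(r <= k, 1.0, k / r)
```

In an IRLS-style optimizer these weights are recomputed every iteration from the current residuals, which is what makes the kernel adaptive.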
Experiments demonstrate that RADIO-ViPE achieves state-of-the-art results on the dynamic TUM-RGBD benchmark while maintaining competitive performance against offline open-vocabulary methods that rely on calibrated data and static scene assumptions. RADIO-ViPE bridges a critical gap in real-world deployment, enabling robust open-vocabulary semantic grounding for autonomous robotics, AR/VR applications, and unconstrained in-the-wild video streams.
```bash
# Build the Docker image
make build

# Run the Docker image
make DATA_DIR={YOUR_DATA_DIR} run

# Inside the container, install the package
pip install --no-build-isolation -e .
```

```bash
# Run the full pipeline
python run.py pipeline=default streams=raw_mp4_stream streams.base_path=YOUR_VIDEO_OR_DIR_PATH
```
```bash
# Run the pose-only pipeline (without depth estimation)
python run.py pipeline=default streams=raw_mp4_stream streams.base_path=YOUR_VIDEO_OR_DIR_PATH pipeline.post.depth_align_model=null
```
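The `key=value` arguments above look like Hydra/OmegaConf-style dotlist overrides (an assumption based on the CLI syntax; the config tree below is hypothetical, not the project's real defaults). A minimal sketch of how such an override composes with a base config:

```python
from omegaconf import OmegaConf

# Hypothetical base config; the real defaults live in the project's config files.
base = OmegaConf.create({"pipeline": {"post": {"depth_align_model": "some_model"}}})

# Same dotlist syntax as on the command line; "null" parses to None (YAML semantics).
override = OmegaConf.from_dotlist(["pipeline.post.depth_align_model=null"])

cfg = OmegaConf.merge(base, override)
print(cfg.pipeline.post.depth_align_model)  # None -> depth alignment disabled
```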
Semantic segmentation evaluation uses code borrowed from the RayFronts repository.

> For Replica, we use the NICE-SLAM version and get the GT semantic labels from HOV-SG (uploaded here for convenience), since NICE-SLAM does not provide semantic labels without the original dataset.
>
> — RayFronts
Run evaluation with one of the prepared configs:

```bash
python scripts/semseg_eval.py --config-name semseg_configs/replica_kmvipe
```

If you want to run evaluation for all the scenes:

```bash
python scripts/semseg_eval.py \
  --config-name semseg_configs/replica_kmvipe \
  --multirun \
  semseg_configs.dataset.scene_name=office0,office1,office2,office3,office4,room0,room1,room2
```

Expected outputs are saved under `eval_out/<experiment>/<DatasetName>/<scene>/`.
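The evaluation presumably reports standard segmentation metrics such as mean IoU (an assumption; the exact metrics come from the borrowed RayFronts code). For reference, a minimal confusion-matrix mIoU sketch:

```python
import numpy as np

def mean_iou(pred: np.ndarray, gt: np.ndarray, num_classes: int,
             ignore_label: int = -1) -> float:
    """Mean intersection-over-union from a confusion matrix.

    pred, gt: integer label arrays of identical shape, with labels
    in [0, num_classes); pixels whose GT equals ignore_label are skipped.
    """
    valid = gt != ignore_label
    cm = np.bincount(
        gt[valid] * num_classes + pred[valid],      # row = GT, column = prediction
        minlength=num_classes ** 2,
    ).reshape(num_classes, num_classes)
    inter = np.diag(cm).astype(np.float64)
    union = cm.sum(0) + cm.sum(1) - inter
    ious = inter[union > 0] / union[union > 0]      # skip classes absent from both
    return float(ious.mean())
```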
RMSE evaluation is performed using the shell scripts provided in `scripts/`, for example:

```bash
scripts/slam_evaluation_replica.sh
```

These scripts run the SLAM pipeline on the corresponding dataset and compute RMSE metrics for the generated trajectories.
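Trajectory RMSE is conventionally the absolute trajectory error (ATE) after rigid alignment, as in the standard TUM-RGBD tooling; whether these scripts compute exactly this variant is an assumption. A minimal standalone sketch:

```python
import numpy as np

def ate_rmse(est: np.ndarray, gt: np.ndarray) -> float:
    """RMSE of absolute trajectory error after rigid (Umeyama) alignment.

    est, gt: (N, 3) arrays of time-associated camera positions.
    """
    mu_e, mu_g = est.mean(axis=0), gt.mean(axis=0)
    E, G = est - mu_e, gt - mu_g
    U, _, Vt = np.linalg.svd(E.T @ G)       # SVD of the cross-covariance
    S = np.eye(3)
    if np.linalg.det(U @ Vt) < 0:           # fix a possible reflection
        S[2, 2] = -1.0
    R = Vt.T @ S @ U.T                      # rotation aligning est -> gt
    t = mu_g - R @ mu_e
    aligned = est @ R.T + t
    return float(np.sqrt(np.mean(np.sum((aligned - gt) ** 2, axis=1))))
```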
RADIO-ViPE builds upon many outstanding open-source research projects and codebases, including (non-exhaustive):
| Project | Reference |
|---|---|
| RAD-SEG | arXiv:2511.19704 |
| KM-ViPE | arXiv:2512.01889 |
| RayFronts | arXiv:2504.06994 |
| ViPE | GitHub |
| RADIO | arXiv:2601.17237 |
| DINOv3 | GitHub |
| Talk2DINO | GitHub |
| RVWO | GitHub |
| UniDepth | GitHub |
This project will download and install additional third-party models and software. Note that these are not distributed by NVIDIA — please review their respective license terms before use.
This source code is released under the Apache 2.0 License.

