This repository is the official implementation of **Seeing is Believing? Enhancing Vision-Language Navigation using Visual Perturbations** (accepted at IJCNN 2025).
Vision-and-language navigation (VLN) requires an agent to navigate to a remote location in a 3D environment by following natural language instructions. To represent the previously visited environment, most VLN approaches implement memory using recurrent states, topological maps, or top-down semantic maps. In contrast, we build a top-down, egocentric, and dynamically growing Grid Memory Map (i.e., GridMM) to structure the visited environment. From a global perspective, historical observations are projected into a unified top-down grid map, which better represents the spatial relations of the environment. From a local perspective, we further propose an instruction-relevance aggregation method to capture fine-grained visual clues in each grid region. Extensive experiments on the REVERIE, R2R, and SOON datasets in discrete environments, and on the R2R-CE dataset in continuous environments, demonstrate the superiority of our proposed method.
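The grid-map construction described above can be sketched as follows. This is a minimal, self-contained illustration, not the repository's implementation: the grid size, cell resolution, and mean-pooling aggregation (standing in for the paper's instruction-relevance aggregation) are all assumptions.

```python
import numpy as np

def project_to_grid(points_xyz, feats, grid_size=14, cell_m=0.5):
    """Project egocentric 3D points into a top-down grid memory map.

    points_xyz: (N, 3) points in the agent's frame (x right, y forward, z up).
    feats:      (N, D) visual features, one per point.
    Returns a (grid_size, grid_size, D) map where each cell averages the
    features of the points that fall inside it.
    """
    half = grid_size * cell_m / 2.0
    # Discretize x (right) and y (forward) into grid indices centered on the agent.
    ix = np.floor((points_xyz[:, 0] + half) / cell_m).astype(int)
    iy = np.floor((points_xyz[:, 1] + half) / cell_m).astype(int)
    # Drop points that land outside the map.
    keep = (ix >= 0) & (ix < grid_size) & (iy >= 0) & (iy < grid_size)
    ix, iy, feats = ix[keep], iy[keep], feats[keep]

    grid = np.zeros((grid_size, grid_size, feats.shape[1]))
    count = np.zeros((grid_size, grid_size, 1))
    np.add.at(grid, (iy, ix), feats)   # accumulate features per cell
    np.add.at(count, (iy, ix), 1.0)    # count points per cell
    return grid / np.maximum(count, 1.0)

# Two nearby points fall in the same cell; the far point is discarded.
pts = np.array([[0.1, 0.2, 1.5], [0.1, 0.2, 1.4], [10.0, 0.0, 1.0]])
ft = np.ones((3, 4))
gmap = project_to_grid(pts, ft)
print(gmap.shape)  # (14, 14, 4)
```

As the agent moves, newly observed points can be transformed into the current egocentric frame and re-projected, which is what lets the map grow dynamically.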
- Install the Matterport3D simulator for R2R, REVERIE, and SOON: follow the instructions here.

```shell
export PYTHONPATH=Matterport3DSimulator/build:$PYTHONPATH
```
- Install requirements:

```shell
conda create --name MBA python=3.8.5
conda activate MBA
pip install -r requirements.txt
```
- Download data from Dropbox, including processed annotations, features, and pretrained models for the REVERIE, SOON, R2R, and R4R datasets. Put the data in the `datasets/` directory.
- Download the pretrained LXMERT model:

```shell
mkdir -p datasets/pretrained
wget https://nlp.cs.unc.edu/data/model_LXRT.pth -P datasets/pretrained
```
- Download the CLIP-based RGB features and depth features (gibson and imagenet) from Baidu Netdisk (link: https://pan.baidu.com/s/1lKend8xnwuy1uxn-aIDBtw?pwd=n8gv, extraction code: n8gv). The ground-truth depth images (undistorted_depth_images) are obtained from the Matterport simulator, and depth view features are extracted with:

```shell
python get_depth.py
```
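Matterport3D's undistorted_depth_images are stored as 16-bit PNGs whose raw values are typically divided by a fixed shift to obtain meters (4000 below, a common convention for this dataset; verify against the extraction script). A minimal sketch, using a synthetic array in place of a loaded PNG:

```python
import numpy as np

DEPTH_SHIFT = 4000.0  # raw units per meter (assumed Matterport3D convention)

def raw_depth_to_meters(raw, max_depth_m=10.0):
    """Convert a raw 16-bit depth image to metric depth.

    Zeros mark missing measurements and stay 0; values beyond max_depth_m
    are clipped, mirroring typical preprocessing before feature extraction.
    """
    depth = raw.astype(np.float32) / DEPTH_SHIFT
    return np.clip(depth, 0.0, max_depth_m)

# Synthetic 2x2 "depth image": missing, 1 m, 2 m, and an out-of-range value.
raw = np.array([[0, 4000], [8000, 65535]], dtype=np.uint16)
depth_m = raw_depth_to_meters(raw)
print(depth_m)
```

In practice the converted depth map would be fed to a depth encoder to produce the per-view depth features used alongside the RGB features.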
The extraction code references HAMT and here.
The pretrained checkpoints for REVERIE, R2R, and SOON are available here. You can also pretrain the model yourself by changing DUET's pretraining RGB features from ViT-based to CLIP-based. Combine behavior cloning and auxiliary proxy tasks in pretraining:

```shell
cd pretrain_src
bash run_r2r.sh  # (run_reverie.sh, run_soon.sh)
```
Use the pseudo-interactive demonstrator to fine-tune the model:

```shell
cd map_nav_src
bash scripts/run_r2r.sh  # (run_reverie.sh, run_soon.sh)
```
The test-set results we report come from the official Eval.AI leaderboards:

- R2R: https://eval.ai/web/challenges/challenge-page/97/submission
- REVERIE: https://eval.ai/web/challenges/challenge-page/606/overview
- SOON: https://eval.ai/web/challenges/challenge-page/1275/overview

- Panoramic trajectory visualization is provided by Speaker-Follower.
- Top-down maps for Matterport3D are available in NRNS.
- Instructions for extracting image features from Matterport3D scenes can be found in VLN-HAMT.
We extend our gratitude to all the authors for their significant contributions and for sharing their resources.
```bibtex
@article{zhang2024seeing,
  title={Seeing is Believing? Enhancing Vision-Language Navigation using Visual Perturbations},
  author={Zhang, Xuesong and Li, Jia and Xu, Yunbo and Hu, Zhenzhen and Hong, Richang},
  journal={arXiv preprint arXiv:2409.05552},
  year={2024}
}
```

Our code is based on VLN-DUET and partially references HAMT for extracting view features. Thanks for their great work!
