Acoustically-Driven Hierarchical Alignment with Differential Attention for Weakly-Supervised Audio-Visual Video Parsing
This is the official code for Acoustically-Driven Hierarchical Alignment with Differential Attention for Weakly-Supervised Audio-Visual Video Parsing (ADDA).
- Ubuntu version: 20.04.6 LTS (Focal Fossa)
- CUDA version: 12.2
- PyTorch: 1.12.1
- Python: 3.10.12
- GPU: NVIDIA A100-SXM4-40GB
A conda environment named adda can be created and activated with:
conda env create -f environment.yaml
conda activate adda
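After activating the environment, a quick sanity check can confirm that the installed versions match the setup listed above. This is only a minimal sketch; the expected version strings in the comments mirror the environment list and are not enforced by the repo itself.

```python
import sys

import torch

# Versions this README lists for the reference setup.
print(f"Python:  {sys.version.split()[0]}")   # expected: 3.10.12
print(f"PyTorch: {torch.__version__}")        # expected: 1.12.1
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")  # e.g. NVIDIA A100-SXM4-40GB
```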
Please download the LLP dataset annotations (six CSV files) from AVVP-ECCV20 and place them in data/.
Please download the CLAP-extracted features (CLAP.7z) and CLIP-extracted features (CLIP.7z) from this link, unzip the two archives, and place the decompressed CLAP-related files in data/feats/CLAP/ and the CLIP-related files in data/feats/CLIP/.
Please make sure the file structure matches the following:
data/
├── AVVP_dataset_full.csv
├── AVVP_eval_audio.csv
├── AVVP_eval_visual.csv
├── AVVP_test_pd.csv
├── AVVP_train.csv
├── AVVP_val_pd.csv
└── feats/
    ├── CLIP/
    │   ├── -0A9suni5YA.npy
    │   ├── -0BKyt8iZ1I.npy
    │   └── ...
    ├── CLAP/
    │   ├── -0A9suni5YA.npy
    │   ├── -0BKyt8iZ1I.npy
    │   └── ...
    └── ...

Please download the trained models from this link and put them in their corresponding model directories.
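With the data in place, the sketch below shows one way to spot-check the layout before training. It assumes the annotation CSVs follow the AVVP-ECCV20 release format (commonly tab-separated) and that each .npy file holds one video's feature array; the feature shapes are not specified in this README, so the script only prints them. The video ID used is simply the first one shown in the tree above.

```python
from pathlib import Path

import numpy as np
import pandas as pd

DATA = Path("data")

# The six annotation CSVs listed in the tree above.
for name in ["AVVP_dataset_full", "AVVP_eval_audio", "AVVP_eval_visual",
             "AVVP_test_pd", "AVVP_train", "AVVP_val_pd"]:
    csv = DATA / f"{name}.csv"
    assert csv.exists(), f"missing annotation file: {csv}"

# The AVVP-ECCV20 annotation files are commonly tab-separated;
# adjust sep if your copy differs.
train = pd.read_csv(DATA / "AVVP_train.csv", sep="\t")
print(train.head())

# Spot-check one video's CLIP and CLAP features.
vid = "-0A9suni5YA"  # first ID shown in the tree above
for feat in ["CLIP", "CLAP"]:
    arr = np.load(DATA / "feats" / feat / f"{vid}.npy")
    print(f"{feat} features for {vid}: shape={arr.shape}, dtype={arr.dtype}")
```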
We provide bash scripts for a quick start:
bash train.sh
bash test.sh
We build the ADDA codebase heavily on the codebases of AVVP-ECCV20 and VALOR. We sincerely thank the authors for open-sourcing their code! We also thank CLIP and CLAP for open-sourcing their pre-trained models.
