What Changed and What Could Have Changed? State-Change Counterfactuals for Procedure-Aware Video Representation Learning
Official code for "What Changed and What Could Have Changed? State-Change Counterfactuals for Procedure-Aware Video Representation Learning", ICCV 2025.
To create a conda environment with the required dependencies, run the following commands:

```
conda env create -f environment.yml
source activate cfEgo4D/EgoClip
```
Please refer to the EgoVLP codebase for data preparation. We use the downsampled and chunked video outputs (produced by utils/video_chunk.py) as the input to our method. For summary sentences, we provide the processed summary and narration hierarchy here; the egosummary_full.csv file we use is available here.
GTEA: Please follow Bridge-Prompt to download the raw videos and then extract frames from them.
EgoPRE: .
EPIC-KITCHENS & Charades-Ego: Please refer to the EgoVLP codebase for data preparation.
AE2: Please pre-extract frame features for this task, following Align-Ego-Exo for the data split.
Please refer to Llama 3 for model weights and installation instructions. We use the following scripts to generate state change and counterfactual descriptions for the entire Ego4D dataset. Please note that you will need to modify the paths to Ego4D's annotation files in the scripts.
```
# clip-level state changes and their counterfactuals
cd llama_script
python clip_level_sc_cf.py
```

```
# video-level counterfactuals
cd llama_script
python video_level_cf.py
```
To extract clip-level narration features, please run

```
python language_extraction/feature_extractor.py
```

To extract video-level summary features, please run

```
python language_extraction/summary_feature_extractor.py
```

To run the pretraining on a distributed SLURM system, copy the contents of slurm_scripts to this directory level and run

```
bash mover_trainer.sh job_name
```
The parameters of the SLURM job can be changed in the trainer.sh script. We use 2 nodes, each with 4 32 GB GPUs. The submission script first copies the required scripts to a separate folder and then runs them from there; this copying ensures the code can be safely edited while a job is waiting in the SLURM queue.
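The copy-then-run pattern described above can be sketched roughly as follows. This is an illustrative sketch only, not the actual contents of mover_trainer.sh; the snapshot location and the copied folder names are hypothetical:

```shell
#!/usr/bin/env bash
# Illustrative copy-then-submit pattern: snapshot the code into a run-specific
# folder and launch from there, so later edits to the working tree cannot
# affect a job still waiting in the SLURM queue. Paths are hypothetical.
job_name=${1:-demo_job}
snapshot="runs/${job_name}_$(date +%s)"
mkdir -p "$snapshot"
# copy only the pieces the job needs, skipping anything that doesn't exist here
for item in slurm_scripts run configs; do
    if [ -e "$item" ]; then cp -r "$item" "$snapshot/"; fi
done
echo "code snapshot created at $snapshot"
# (cd "$snapshot" && sbatch trainer.sh)  # actual submission requires a SLURM cluster
```

Launching from the snapshot rather than the working tree is what makes queued jobs immune to subsequent edits.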
Please run

```
torchrun --nnodes 1 --nproc_per_node 8 --master_port 8081 run/train_egoaggregate.py --config configs/pt/egoaggregation.json
```

The pretraining checkpoint is available here.
Step 1: Generate features with the pre-trained video model.
Please note that you will need to specify the dataset, model name, and config path in the script, as well as the "save_dir" in ./as_configs/gtea/gtea_exfm.yaml.

```
python extract_frame_features.py
```
Please refer to Bridge-Prompt for more details.
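For reference, the save_dir entry in ./as_configs/gtea/gtea_exfm.yaml might look like the fragment below; the path is a placeholder to replace with your own, and only the save_dir key is confirmed by the instructions above:

```yaml
# as_configs/gtea/gtea_exfm.yaml (fragment)
# save_dir: where extracted frame features are written (placeholder path)
save_dir: /path/to/gtea_frame_features
```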
Step 2: Train/test ASFormer based on the features.

```
cd ASFormer
python main.py --feature cf --dataset gtea --split 1/2/3/4
python main.py --action eval --feature cf --dataset gtea --split 1/2/3/4
python eval.py --result_dir path_to_results --split 1/2/3/4/0
```
Please refer to ASFormer for more details.
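To sweep all four GTEA splits back-to-back, a small dry-run wrapper like the one below can help. It only prints the commands; drop the echo to execute them from inside the ASFormer directory, and path_to_results is the same placeholder used above:

```shell
# Dry-run sweep over the four GTEA splits; remove 'echo' to actually execute.
# 'path_to_results' is a placeholder for your results directory.
for split in 1 2 3 4; do
    echo "python main.py --feature cf --dataset gtea --split $split"
    echo "python main.py --action eval --feature cf --dataset gtea --split $split"
done
echo "python eval.py --result_dir path_to_results --split 0"
```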
Detailed instructions are in EgoPER, which contains modified versions of the original EgoPER scripts. To use these modified scripts, the submodules must be loaded:

- If you cloned this repo without submodules, run:

```
git submodule update --init --recursive
```

- To clone with submodules directly:

```
git clone --recurse-submodules git@github.com:HCIS-Lab/counterfactual-video-pretrain.git
```
```
python AE2/AE2_phase_cls.py
python downstream_script/test_epic.py
python downstream_script/test_charades.py
python AE2/AE2_frame_retrieval.py
```

If you use our code or method, please cite the following paper:
```
@InProceedings{counterfactual_ICCV_2025,
    author    = {Kung, Chi-Hsi and Ramirez, Frangil and Ha, Juhyung and Chen, Yi-Ting and Crandall, David and Tsai, Yi-Hsuan},
    title     = {What Changed and What Could Have Changed? State-Change Counterfactuals for Procedure-Aware Video Representation Learning},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
    month     = {October},
    year      = {2025},
}
```
The pretraining and the Charades-Ego / EPIC-KITCHENS test codebase are based on EgoVLP and HierVL.
The feature extraction code is adapted from Bridge-Prompt.
The temporal action segmentation code is adapted from ASFormer.
The action phase recognition and frame retrieval code is adapted from AE2.
