What Changed and What Could Have Changed? State-Change Counterfactuals for Procedure-Aware Video Representation Learning
Official code for "What Changed and What Could Have Changed? State-Change Counterfactuals for Procedure-Aware Video Representation Learning", ICCV 2025.
To create a conda environment with the required dependencies, run the following commands:

```
conda env create -f environment.yml
source activate cfEgo4D/EgoClip
```
Please refer to the EgoVLP codebase for data preparation. We use the downsampled and chunked video outputs (produced by utils/video_chunk.py) as the input to our method. For summary sentences, we provide the processed summary and narration hierarchy here; the egosummary_full.csv file we use is available here.
GTEA: Please follow Bridge-Prompt to download the raw videos and then extract frames from them.
EgoPRE: .
EPIC-KITCHENS & Charades-Ego: Please refer to the EgoVLP codebase for data preparation.
AE2: Please pre-extract frame features for this task, following Align-Ego-Exo for the data split.
Please refer to Llama 3 for model weights and installation instructions. We use the following scripts to generate state change and counterfactual descriptions for the entire Ego4D dataset. Please note that you will need to modify the paths to Ego4D's annotation files in the scripts.
```
# clip-level state changes and their counterfactuals
cd llama_script
python clip_level_sc_cf.py
```

```
# video-level counterfactuals
cd llama_script
python video_level_cf.py
```
To extract clip-level narration features, please run

```
python language_extraction/feature_extractor.py
```

To extract video-level summary features, please run

```
python language_extraction/summary_feature_extractor.py
```

To run the pretraining on a distributed SLURM system, copy the contents of slurm_scripts to this directory level and run

```
bash mover_trainer.sh job_name
```
The parameters of the SLURM job can be changed in the trainer.sh script. We use 2 nodes, each with 4 32 GB GPUs. The submission script first copies the required scripts to a separate folder and then runs them from there; this copying ensures the code can be safely edited while a job is waiting in the SLURM queue.
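The copy-then-run pattern described above can be sketched roughly as follows. This is an illustrative sketch only, not the actual contents of mover_trainer.sh; the snapshot location and the copied folder names are hypothetical:

```shell
#!/usr/bin/env bash
# Illustrative copy-then-submit pattern: snapshot the code into a run-specific
# folder and launch from there, so later edits to the working tree cannot
# affect a job still waiting in the SLURM queue. Paths are hypothetical.
job_name=${1:-demo_job}
snapshot="runs/${job_name}_$(date +%s)"
mkdir -p "$snapshot"
# copy only the pieces the job needs, skipping anything that doesn't exist here
for item in slurm_scripts run configs; do
    if [ -e "$item" ]; then cp -r "$item" "$snapshot/"; fi
done
echo "code snapshot created at $snapshot"
# (cd "$snapshot" && sbatch trainer.sh)  # actual submission requires a SLURM cluster
```

Launching from the snapshot rather than the working tree is what makes queued jobs immune to subsequent edits.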
Please run

```
torchrun --nnodes 1 --nproc_per_node 8 --master_port 8081 run/train_egoaggregate.py --config configs/pt/egoaggregation.json
```

The pretraining checkpoint is available here.
Step 1: Generate features with the pre-trained video model.
Please note that you will need to specify the dataset, model name, and config path in the script, as well as the "save_dir" in ./as_configs/gtea/gtea_exfm.yaml.

```
python extract_frame_features.py
```
Please refer to Bridge-Prompt for more details.
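For reference, the save_dir entry in ./as_configs/gtea/gtea_exfm.yaml might look like the fragment below; the path is a placeholder to replace with your own, and only the save_dir key is confirmed by the instructions above:

```yaml
# as_configs/gtea/gtea_exfm.yaml (fragment)
# save_dir: where extracted frame features are written (placeholder path)
save_dir: /path/to/gtea_frame_features
```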
Step 2: Train/test ASFormer based on the features.

```
cd ASFormer
python main.py --feature cf --dataset gtea --split 1/2/3/4
python main.py --action eval --feature cf --dataset gtea --split 1/2/3/4
python eval.py --result_dir path_to_results --split 1/2/3/4/0
```
Please refer to ASFormer for more details.
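To sweep all four GTEA splits back-to-back, a small dry-run wrapper like the one below can help. It only prints the commands; drop the echo to execute them from inside the ASFormer directory, and path_to_results is the same placeholder used above:

```shell
# Dry-run sweep over the four GTEA splits; remove 'echo' to actually execute.
# 'path_to_results' is a placeholder for your results directory.
for split in 1 2 3 4; do
    echo "python main.py --feature cf --dataset gtea --split $split"
    echo "python main.py --action eval --feature cf --dataset gtea --split $split"
done
echo "python eval.py --result_dir path_to_results --split 0"
```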
Detailed instructions are in EgoPER, which contains modified versions of the original EgoPER scripts. To use these modified scripts, the submodules must be loaded:

- If you cloned this repo without submodules, run:

```
git submodule update --init --recursive
```

- To clone with submodules directly:

```
git clone --recurse-submodules git@github.com:HCIS-Lab/counterfactual-video-pretrain.git
```
```
python AE2/AE2_phase_cls.py
python downstream_script/test_epic.py
python downstream_script/test_charades.py
python AE2/AE2_frame_retrieval.py
```

If you use our code or method, please cite the following paper:
```
@InProceedings{counterfactual_ICCV_2025,
    author    = {Kung, Chi-Hsi and Ramirez, Frangil and Ha, Juhyung and Chen, Yi-Ting and Crandall, David and Tsai, Yi-Hsuan},
    title     = {What Changed and What Could Have Changed? State-Change Counterfactuals for Procedure-Aware Video Representation Learning},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
    month     = {October},
    year      = {2025},
}
```
The pretraining and the Charades-Ego / EPIC-KITCHENS test codebase are based on EgoVLP and HierVL.
The feature extraction code is adapted from Bridge-Prompt.
The temporal action segmentation code is adapted from ASFormer.
The action phase recognition and frame retrieval code is adapted from AE2.
