Skip to content

cmusatyalab/grw-smoothing

Repository files navigation

Smooth Regularization for Efficient Video Recognition (Grw-smoothing)

Paper (arXiv 2511.20928)Poster (NeurIPS 2025)Project repo

GRW-smoothing is a plug-and-play regularization technique for video recognition models that enforces temporal smoothness in intermediate feature embeddings. Consecutive frame embeddings are modeled as a Gaussian Random Walk (GRW), and the training loss penalizes high “acceleration” in this embedding space while preserving the correct frame ordering. This inductive bias suppresses noisy, abrupt representation changes and better aligns the model with the natural temporal coherence of videos, which is especially beneficial for lightweight architectures under tight FLOP and memory budgets. When applied to MoViNet and MobileNetV3 backbones, GRW-smoothing yields improvements of 3.8–6.4 percentage points on Kinetics‑600 under comparable compute and memory constraints.


Installation

This project includes two packages:

  • grw-smoothing: Contains only the loss term. Use this package if you want to use the GRW-smoothing loss in your own models but don't need the trained video models.
  • grw-smoothing-models: Includes the trained video models and has grw-smoothing as a dependency. Use this package if you want to use the pre-trained models.

Prerequisites

  1. Install uv (Python package installer):

    curl -LsSf https://astral.sh/uv/install.sh | sh
  2. Install Python >=3.10 and create a virtual environment:

    uv venv --python 3.11
  3. Activate the virtual environment:

    source .venv/bin/activate
  4. Install PyTorch:

    uv pip install torch torchvision

Installing the packages

To install both grw-smoothing-models (which includes grw-smoothing as a dependency):

uv sync

This will install both packages and all their dependencies.


Illustration of GRW-smoothing

Warm-up airplane example: embeddings with and without GRW-smoothing

Figure: Warm‑up airplanes dataset: three classes (yaw, pitch, roll) that are indistinguishable from a single frame. Without GRW‑smoothing, frame embeddings are scattered; with GRW‑smoothing, they form smooth, low‑acceleration trajectories that align with the underlying rotation type.


How GRW-smoothing is integrated

At a high level, GRW-smoothing can be applied in two ways:

  1. Intermediate-layer smoothing

    • Globally pool intermediate feature maps over space.
    • Normalize them (e.g., using batch norm without learned affine parameters).
    • Slice the temporal sequence into short clips and apply the GRW loss to these sub‑sequences.
  2. Final-layer smoothing

    • Affinely normalize the final per‑frame embeddings.
    • Apply GRW regularization over short temporal windows.
    • Optionally feed the smoothed sequence into a lightweight temporal head (e.g., 1–2 layer Transformer) before the classifier.

In both cases, GRW acts purely on embeddings and introduces only a small overhead relative to the backbone, making it easy to drop into existing architectures.


Results on Kinetics‑600

GRW-smoothing substantially improves the accuracy–efficiency trade‑off of compact video models. On Kinetics‑600, MoViNet‑A0/1/2/3 and their streaming variants, as well as MobileNetV3, achieve 3.8–6.4 pp higher Top‑1 accuracy compared to previous state of the art at similar GFLOPs or memory usage. In particular, MoViNet‑A3‑GRW reaches 85.6% Top‑1 at just 56.4 GFLOPs, while a comparable transformer model (MViTv2‑B‑32×3) needs 18.3× more FLOPs to match this accuracy.

GFLOPs vs. Top‑1 Accuracy

GFLOPs vs Top-1 accuracy on Kinetics-600

Figure: Accuracy vs. FLOPs on Kinetics‑600. Red points trace MoViNet variants with GRW‑smoothing; blue points show prior published models (see Table 1). GRW‑smoothing consistently raises the Pareto frontier for efficient video recognition.

Table 1 — Kinetics‑600 by FLOPs

Top‑1 accuracy vs. total video evaluation cost (GFLOPs) on Kinetics‑600. Models enhanced with GRW‑smoothing are marked with GRW.

# Model Top‑1 (%) GFLOPs Res Frames (clips × frames)
1 MoViNet‑A0‑S‑GRW 78.4 2.7 172 1 × 50
2 MoViNet‑A0 72.3 2.7 172 1 × 50
3 MobileNetV3‑S 61.3 2.8 224 1 × 50
4 MobileNetV3‑S + TSM 65.5 2.8 224 1 × 50
5 X3D‑XS 70.2 3.9 182 1 × 20
6 MoViNet‑A1‑S‑GRW 81.9 6.0 172 1 × 50
7 MoViNet‑A1 76.7 6.0 172 1 × 50
8 X3D‑S 73.4 7.8 182 1 × 40
9 MoViNet‑A2 78.6 10.3 224 1 × 50
10 MobileNetV3‑L 68.1 11.0 224 1 × 50
11 MobileNetV3‑L + TSM 71.4 11.0 224 1 × 50
12 MoViNet‑A2‑S‑GRW 83.3 11.3 224 1 × 50
13 X3D‑M 76.9 19.4 256 1 × 50
14 X3D‑XS 72.3 23.3 182 30 × 4
15 MoViNet‑A3‑GRW 85.6 56.4 256 1 × 120
16 MoViNet‑A3 81.8 56.9 256 1 × 120
17 X3D‑S 76.4 76.1 182 30 × 13
18 X3D‑L 79.1 77.5 356 1 × 50
19 MoViNet‑A4 83.5 105 290 1 × 80
20 UniFormer‑S 82.8 167 224 4 × 16
21 X3D‑M 78.8 186 256 30 × 16
22 X3D‑L 80.7 187 356 1 × 120
23 I3D 71.6 216 224 1 × 250
24 MoViNet‑A5 84.3 281 320 1 × 120
25 MViT‑B‑16×4 82.1 353 224 5 × 16
26 MoViNet‑A6 84.8 386 320 1 × 120
27 UniFormer‑B 84.0 389 224 4 × 16
28 XViT (8×) 82.5 425 224 3 × 8
29 XViT (16×) 84.5 850 224 3 × 16
30 MViT‑B‑32×3 83.4 850 224 5 × 32
31 MViTv2‑B‑32×3 85.5 1030 224 5 × 32

Summary: Across all FLOP regimes, adding GRW‑smoothing on top of existing backbones consistently yields state‑of‑the‑art accuracy for efficient video recognition.


Trained weights

Trained weights for the GRW‑smoothing MoViNet models used in the paper (A0S, A1S, A2S, A3B) are hosted on Hugging Face: https://huggingface.co/DrGil/grw-smoothing-movinet/tree/main


Configure paths

Before running inference, create packages/grw-smoothing-models/config.ini by copying packages/grw-smoothing-models/config.ini.template, then set:

  • models_home to the directory where checkpoints should be stored.
  • data_home to the root of your Kinetics‑600 dataset.

Data preparation (Kinetics‑600 test set)

The paper reports results on the Kinetics‑600 public (labeled) test split. Because Kinetics videos are hosted on YouTube and availability changes over time, we evaluated on the subset of videos from the official test split that were still accessible at the time of our experiments. To enable exact reproducibility, we provide here the frozen set of extracted clips used in the paper: https://huggingface.co/datasets/DrGil/k600_test_ds

  1. Download the dataset tarball to your data_home directory (as configured in packages/grw-smoothing-models/config.ini).
  2. Extract the archive:
    tar -xvf k600_test_ds.tar.gz
    Your data_home directory should now include a single test folder containing the test video clips, organized by class name.
  3. Under data_home, create the directory video_clips_cache:
    mkdir video_clips_cache
  4. Build video clip metadata (this can take a while). From packages/grw-smoothing-models/src/grw_smoothing_models, run:
    python cli.py kinetics-create-clips
  5. Validate the cache: confirm video_clips_cache/test.pkl was created.

Inference examples

Before running the inference commands, download the checkpoints into models_home (as configured in packages/grw-smoothing-models/config.ini).

To run evaluation on Kinetics-600 using the published weights on Hugging Face, run these commands from packages/grw-smoothing-models/src/grw_smoothing_models:

# A0S
torchrun --standalone --nproc_per_node [num_gpus] cli.py kinetics-test \
  --model_name a0s_grw.pt \
  --batch_size 20 \
  --fps 5 \
  --N 50 \
  --backbone movineta0s

# A1S
torchrun --standalone --nproc_per_node [num_gpus] cli.py kinetics-test \
  --model_name a1s_grw.pt \
  --batch_size 20 \
  --fps 5 \
  --N 50 \
  --backbone movineta1s

# A2S
torchrun --standalone --nproc_per_node [num_gpus] cli.py kinetics-test \
  --model_name a2s_grw.pt \
  --batch_size 20 \
  --fps 5 \
  --N 50 \
  --backbone movineta2s

# A3B
torchrun --standalone --nproc_per_node [num_gpus] cli.py kinetics-test \
  --model_name a3b_grw.pt \
  --batch_size 20 \
  --fps 12 \
  --N 120 \
  --backbone movineta3b

The checkpoint is downloaded into the models_home directory from config.ini.


BibTeX

If you use this code or ideas from the paper, please cite:

@inproceedings{goldman2025grwsmoothing,
  title     = {Smooth Regularization for Efficient Video Recognition},
  author    = {Gil Goldman and Raja Giryes and Mahadev Satyanarayanan},
  booktitle = {Advances in Neural Information Processing Systems (NeurIPS)},
  year      = {2025},
  url       = {https://arxiv.org/abs/2511.20928}
}

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages