Paper ([arXiv:2511.20928](https://arxiv.org/abs/2511.20928)) • Poster (NeurIPS 2025) • Project repo
GRW-smoothing is a plug-and-play regularization technique for video recognition models that enforces temporal smoothness in intermediate feature embeddings. Consecutive frame embeddings are modeled as a Gaussian Random Walk (GRW), and the training loss penalizes high “acceleration” in this embedding space while preserving the correct frame ordering. This inductive bias suppresses noisy, abrupt representation changes and better aligns the model with the natural temporal coherence of videos, which is especially beneficial for lightweight architectures under tight FLOP and memory budgets. When applied to MoViNet and MobileNetV3 backbones, GRW-smoothing yields improvements of 3.8–6.4 percentage points on Kinetics‑600 under comparable compute and memory constraints.
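Concretely, if $e_t$ denotes the (normalized) embedding of frame $t$, one natural way to write such an acceleration penalty (a sketch of the idea, not necessarily the paper's exact formulation) is

$$
\mathcal{L}_{\mathrm{GRW}} \;=\; \frac{1}{T-2} \sum_{t=2}^{T-1} \bigl\lVert e_{t+1} - 2\,e_t + e_{t-1} \bigr\rVert_2^2,
$$

where $e_{t+1} - 2e_t + e_{t-1}$ is the discrete second difference (the "acceleration") of the embedding trajectory, added to the usual classification loss with some weight $\lambda$.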
This project includes two packages:

- `grw-smoothing`: contains only the loss term. Use this package if you want to use the GRW-smoothing loss in your own models but don't need the trained video models.
- `grw-smoothing-models`: includes the trained video models and has `grw-smoothing` as a dependency. Use this package if you want to use the pre-trained models.
- Install `uv` (Python package installer):

  ```bash
  curl -LsSf https://astral.sh/uv/install.sh | sh
  ```

- Install Python >= 3.10 and create a virtual environment:

  ```bash
  uv venv --python 3.11
  ```

- Activate the virtual environment:

  ```bash
  source .venv/bin/activate
  ```

- Install PyTorch:

  ```bash
  uv pip install torch torchvision
  ```

- Install both packages (`grw-smoothing-models` pulls in `grw-smoothing` as a dependency):

  ```bash
  uv sync
  ```

  This will install both packages and all their dependencies.
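To sanity-check the environment, a quick import test should work (this assumes the `grw-smoothing` distribution exposes a `grw_smoothing` module; adjust the name if the package layout differs):

```bash
python -c "import torch, grw_smoothing; print(torch.__version__)"
```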
Figure: Warm‑up airplanes dataset: three classes (yaw, pitch, roll) that are indistinguishable from a single frame. Without GRW‑smoothing, frame embeddings are scattered; with GRW‑smoothing, they form smooth, low‑acceleration trajectories that align with the underlying rotation type.
At a high level, GRW-smoothing can be applied in two ways:
1. **Intermediate-layer smoothing**
   - Globally pool intermediate feature maps over space.
   - Normalize them (e.g., using batch norm without learned affine parameters).
   - Slice the temporal sequence into short clips and apply the GRW loss to these sub-sequences.

2. **Final-layer smoothing**
   - Affinely normalize the final per-frame embeddings.
   - Apply GRW regularization over short temporal windows.
   - Optionally feed the smoothed sequence into a lightweight temporal head (e.g., a 1–2 layer Transformer) before the classifier.
In both cases, GRW acts purely on embeddings and introduces only a small overhead relative to the backbone, making it easy to drop into existing architectures.
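As a concrete illustration of the recipe above, here is a minimal PyTorch sketch (our reading of the steps, not the authors' released implementation; names such as `grw_smoothing_loss` and `clip_len` are illustrative):

```python
import torch

def grw_smoothing_loss(feats: torch.Tensor, clip_len: int = 8) -> torch.Tensor:
    """Penalize the second temporal difference ("acceleration") of per-frame
    embeddings, following the high-level recipe described above.

    feats: (B, T, C) per-frame embeddings, already spatially pooled and
           normalized (e.g., batch norm without learned affine parameters).
    clip_len: length of the short sub-sequences the penalty is applied to
              (must be >= 3 so an acceleration can be formed).
    """
    B, T, C = feats.shape
    n_clips = T // clip_len
    # Slice the temporal axis into short, non-overlapping clips.
    x = feats[:, : n_clips * clip_len].reshape(B * n_clips, clip_len, C)
    # Discrete acceleration: x[t+1] - 2*x[t] + x[t-1].
    accel = x[:, 2:] - 2.0 * x[:, 1:-1] + x[:, :-2]
    return accel.pow(2).mean()

# Usage inside a training step (illustrative weighting):
# loss = cross_entropy + lambda_grw * grw_smoothing_loss(pooled_feats)
```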
GRW-smoothing substantially improves the accuracy–efficiency trade‑off of compact video models. On Kinetics‑600, MoViNet‑A0/1/2/3 and their streaming variants, as well as MobileNetV3, achieve 3.8–6.4 pp higher Top‑1 accuracy compared to previous state of the art at similar GFLOPs or memory usage. In particular, MoViNet‑A3‑GRW reaches 85.6% Top‑1 at just 56.4 GFLOPs, while a comparable transformer model (MViTv2‑B‑32×3) needs 18.3× more FLOPs to match this accuracy.
Figure: Accuracy vs. FLOPs on Kinetics‑600. Red points trace MoViNet variants with GRW‑smoothing; blue points show prior published models (see Table 1). GRW‑smoothing consistently raises the Pareto frontier for efficient video recognition.
Top‑1 accuracy vs. total video evaluation cost (GFLOPs) on Kinetics‑600. Models enhanced with GRW‑smoothing are marked with GRW.
| # | Model | Top‑1 (%) | GFLOPs | Res (px) | Frames (clips × frames) |
|---|---|---|---|---|---|
| 1 | MoViNet‑A0‑S‑GRW | 78.4 | 2.7 | 172 | 1 × 50 |
| 2 | MoViNet‑A0 | 72.3 | 2.7 | 172 | 1 × 50 |
| 3 | MobileNetV3‑S | 61.3 | 2.8 | 224 | 1 × 50 |
| 4 | MobileNetV3‑S + TSM | 65.5 | 2.8 | 224 | 1 × 50 |
| 5 | X3D‑XS | 70.2 | 3.9 | 182 | 1 × 20 |
| 6 | MoViNet‑A1‑S‑GRW | 81.9 | 6.0 | 172 | 1 × 50 |
| 7 | MoViNet‑A1 | 76.7 | 6.0 | 172 | 1 × 50 |
| 8 | X3D‑S | 73.4 | 7.8 | 182 | 1 × 40 |
| 9 | MoViNet‑A2 | 78.6 | 10.3 | 224 | 1 × 50 |
| 10 | MobileNetV3‑L | 68.1 | 11.0 | 224 | 1 × 50 |
| 11 | MobileNetV3‑L + TSM | 71.4 | 11.0 | 224 | 1 × 50 |
| 12 | MoViNet‑A2‑S‑GRW | 83.3 | 11.3 | 224 | 1 × 50 |
| 13 | X3D‑M | 76.9 | 19.4 | 256 | 1 × 50 |
| 14 | X3D‑XS | 72.3 | 23.3 | 182 | 30 × 4 |
| 15 | MoViNet‑A3‑GRW | 85.6 | 56.4 | 256 | 1 × 120 |
| 16 | MoViNet‑A3 | 81.8 | 56.9 | 256 | 1 × 120 |
| 17 | X3D‑S | 76.4 | 76.1 | 182 | 30 × 13 |
| 18 | X3D‑L | 79.1 | 77.5 | 356 | 1 × 50 |
| 19 | MoViNet‑A4 | 83.5 | 105 | 290 | 1 × 80 |
| 20 | UniFormer‑S | 82.8 | 167 | 224 | 4 × 16 |
| 21 | X3D‑M | 78.8 | 186 | 256 | 30 × 16 |
| 22 | X3D‑L | 80.7 | 187 | 356 | 1 × 120 |
| 23 | I3D | 71.6 | 216 | 224 | 1 × 250 |
| 24 | MoViNet‑A5 | 84.3 | 281 | 320 | 1 × 120 |
| 25 | MViT‑B‑16×4 | 82.1 | 353 | 224 | 5 × 16 |
| 26 | MoViNet‑A6 | 84.8 | 386 | 320 | 1 × 120 |
| 27 | UniFormer‑B | 84.0 | 389 | 224 | 4 × 16 |
| 28 | XViT (8×) | 82.5 | 425 | 224 | 3 × 8 |
| 29 | XViT (16×) | 84.5 | 850 | 224 | 3 × 16 |
| 30 | MViT‑B‑32×3 | 83.4 | 850 | 224 | 5 × 32 |
| 31 | MViTv2‑B‑32×3 | 85.5 | 1030 | 224 | 5 × 32 |
Summary: Across all FLOP regimes, adding GRW‑smoothing on top of existing backbones consistently yields state‑of‑the‑art accuracy for efficient video recognition.
Trained weights for the GRW‑smoothing MoViNet models used in the paper (A0S, A1S, A2S, A3B) are hosted on Hugging Face: https://huggingface.co/DrGil/grw-smoothing-movinet/tree/main
Before running inference, create `packages/grw-smoothing-models/config.ini` by copying `packages/grw-smoothing-models/config.ini.template`, then set:

- `models_home`: the directory where checkpoints should be stored.
- `data_home`: the root of your Kinetics‑600 dataset.
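A filled-in example might look like this (paths are placeholders; mirror the exact layout of `config.ini.template`, whose section header, if any, may differ):

```ini
; packages/grw-smoothing-models/config.ini
; Section name below is illustrative; follow config.ini.template.
[paths]
models_home = /data/checkpoints/grw-smoothing
data_home = /data/kinetics600
```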
The paper reports results on the Kinetics‑600 public (labeled) test split. Because Kinetics videos are hosted on YouTube and availability changes over time, we evaluated on the subset of videos from the official test split that were still accessible at the time of our experiments. To enable exact reproducibility, we provide here the frozen set of extracted clips used in the paper: https://huggingface.co/datasets/DrGil/k600_test_ds
1. Download the dataset tarball to your `data_home` directory (as configured in `packages/grw-smoothing-models/config.ini`).
2. Extract the archive:

   ```bash
   tar -xvf k600_test_ds.tar.gz
   ```

   Your `data_home` directory should now include a single `test` folder containing the test video clips, organized by class name.
3. Under `data_home`, create the directory `video_clips_cache`:

   ```bash
   mkdir video_clips_cache
   ```

4. Build video clip metadata (this can take a while). From `packages/grw-smoothing-models/src/grw_smoothing_models`, run:

   ```bash
   python cli.py kinetics-create-clips
   ```

5. Validate the cache: confirm `video_clips_cache/test.pkl` was created.
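After these steps, the layout under `data_home` should look roughly like this (class and file names are illustrative):

```
data_home/
├── test/
│   ├── abseiling/
│   │   ├── clip_0001.mp4
│   │   └── ...
│   └── ...
└── video_clips_cache/
    └── test.pkl
```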
Before running the inference commands, download the checkpoints into `models_home` (as configured in `packages/grw-smoothing-models/config.ini`).
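One convenient way to fetch them is with the standard `huggingface-cli` tool (a sketch; the file names match the `--model_name` flags used below, and the target path stands in for your configured `models_home`):

```bash
uv pip install huggingface_hub
huggingface-cli download DrGil/grw-smoothing-movinet \
    a0s_grw.pt a1s_grw.pt a2s_grw.pt a3b_grw.pt \
    --local-dir /path/to/models_home  # replace with your models_home
```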
To run evaluation on Kinetics-600 using the published weights on Hugging Face, run these commands from `packages/grw-smoothing-models/src/grw_smoothing_models`:
```bash
# A0S
torchrun --standalone --nproc_per_node [num_gpus] cli.py kinetics-test \
    --model_name a0s_grw.pt \
    --batch_size 20 \
    --fps 5 \
    --N 50 \
    --backbone movineta0s

# A1S
torchrun --standalone --nproc_per_node [num_gpus] cli.py kinetics-test \
    --model_name a1s_grw.pt \
    --batch_size 20 \
    --fps 5 \
    --N 50 \
    --backbone movineta1s

# A2S
torchrun --standalone --nproc_per_node [num_gpus] cli.py kinetics-test \
    --model_name a2s_grw.pt \
    --batch_size 20 \
    --fps 5 \
    --N 50 \
    --backbone movineta2s

# A3B
torchrun --standalone --nproc_per_node [num_gpus] cli.py kinetics-test \
    --model_name a3b_grw.pt \
    --batch_size 20 \
    --fps 12 \
    --N 120 \
    --backbone movineta3b
```

Each command loads its checkpoint from the `models_home` directory set in `config.ini`.
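To run all four evaluations back to back, a small shell loop over the same flags can help (a convenience sketch, not part of the repo; set `NUM_GPUS` for your machine):

```bash
NUM_GPUS=4  # adjust to your GPU count
for cfg in "a0s_grw.pt 5 50 movineta0s" \
           "a1s_grw.pt 5 50 movineta1s" \
           "a2s_grw.pt 5 50 movineta2s" \
           "a3b_grw.pt 12 120 movineta3b"; do
  set -- $cfg  # split into: model_name, fps, N, backbone
  torchrun --standalone --nproc_per_node "$NUM_GPUS" cli.py kinetics-test \
    --model_name "$1" --batch_size 20 --fps "$2" --N "$3" --backbone "$4"
done
```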
If you use this code or ideas from the paper, please cite:
```bibtex
@inproceedings{goldman2025grwsmoothing,
  title     = {Smooth Regularization for Efficient Video Recognition},
  author    = {Gil Goldman and Raja Giryes and Mahadev Satyanarayanan},
  booktitle = {Advances in Neural Information Processing Systems (NeurIPS)},
  year      = {2025},
  url       = {https://arxiv.org/abs/2511.20928}
}
```
