Lightweight Spatio-Temporal Modeling via Temporally Shifted Distillation for Real-Time Accident Anticipation
Abstract
Anticipating traffic accidents in real time is critical for intelligent transportation systems, yet remains challenging under edge-device constraints. We propose a lightweight spatio-temporal framework that introduces a temporally shifted distillation strategy, enabling a student model to acquire predictive temporal dynamics from a frozen image-based teacher without requiring a video pre-trained teacher. The student combines RepMixer-based spatial encoding with an RWKV-inspired recurrent module for efficient long-range temporal reasoning. To enhance robustness under partial observability, we design a masked memory strategy that leverages memory retention to reconstruct missing visual tokens, effectively simulating occlusions and future events. In addition, multi-modal vision-language supervision enriches semantic grounding. Our framework achieves state-of-the-art performance on multiple real-world dashcam benchmarks while sustaining real-time inference on resource-limited platforms such as the NVIDIA Jetson Orin Nano. Remarkably, it is 3–7× smaller than leading approaches yet delivers superior accuracy and earlier anticipation, underscoring its practicality for deployment in intelligent vehicles.
Overview
Overview of our teacher–student framework. The teacher is a frozen MobileCLIP model with four RepMixer stages (Stage 4 uses spatial-only MHSA). The student shares the same backbone but replaces Stage 4 with a spatio-temporal RWKV block for efficient temporal reasoning. Spatial distillation is applied at Stages 1–3, while temporal distillation aligns the student’s current output at frame $T$ with the teacher’s future features at frame $T{+}1$. Masked recurrence within the spatio-temporal RWKV block (right) simulates occlusion and strengthens memory retention.
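The temporal shift described above can be illustrated with a minimal sketch: the student's feature at frame $T$ is regressed toward the frozen teacher's feature at frame $T{+}1$, so the student learns to predict upcoming content rather than merely copy the teacher. This is an illustrative simplification in plain Python; the function and variable names (`mse`, `shifted_distill_loss`) are hypothetical, not the paper's implementation, and real features would be tensors rather than lists.

```python
# Hedged sketch of temporally shifted distillation (illustrative only).
# Each per-frame feature is modeled as a flat list of floats.

def mse(a, b):
    """Mean squared error between two equal-length feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def shifted_distill_loss(student_feats, teacher_feats):
    """Align student output at frame t with teacher output at frame t+1.

    student_feats, teacher_feats: lists of per-frame feature vectors of
    equal length T. Returns the average MSE over the T-1 shifted pairs.
    """
    # Pair student frames 0..T-2 with teacher frames 1..T-1.
    pairs = list(zip(student_feats[:-1], teacher_feats[1:]))
    return sum(mse(s, t_next) for s, t_next in pairs) / len(pairs)
```

For instance, a student whose features at frame $t$ exactly match the teacher's at $t{+}1$ incurs zero loss, regardless of how the final frame (which has no successor) behaves.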
Key Contributions & Methodology
- Temporally Shifted Distillation Framework: Enables spatio-temporal learning from a frozen image-based teacher, eliminating the need for a video pre-trained, temporally aware teacher and making the approach well suited to small datasets and low-resource settings.
- Lightweight Hybrid Student Architecture: Integrates RepMixer spatial encoding with an RWKV-inspired recurrent temporal module, providing efficient long-range video understanding with linear-time complexity.
- Mask-Aware Spatio-Temporal Module: Adapts the RWKV block into a window-based module that combines localized recurrence with a masked memory strategy, achieving robust temporal modeling under occlusion and partial observability.
- State-of-the-Art Real-Time Anticipation: Delivers top performance on real-world benchmarks while running efficiently on edge devices such as the NVIDIA Jetson Orin Nano. The model is 3–7× smaller than recent leading approaches yet achieves superior accuracy and earlier anticipation on resource-constrained platforms.
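The linear-time recurrence and masked memory strategy from the contributions above can be sketched for a single channel. This is a hedged, simplified analogue of an RWKV-style weighted running average, not the paper's exact block: the name `masked_rwkv_scalar` and the scalar `decay` parameter are assumptions for illustration. When a frame is masked (simulating occlusion), the running state is only decayed, so the output at that step is reconstructed purely from retained memory.

```python
# Illustrative sketch (not the paper's implementation) of an RWKV-style
# linear-time recurrence with a masked-memory step on one channel.
import math

def masked_rwkv_scalar(keys, values, masks, decay=0.9):
    """Process one channel over T frames in O(T).

    keys, values: per-frame scalars; masks[t] is True when frame t is
    observed, False when occluded. Returns one output per frame.
    """
    num, den = 0.0, 0.0  # running weighted value sum and its normalizer
    outputs = []
    for k, v, observed in zip(keys, values, masks):
        if observed:
            w = math.exp(k)            # attention-like weight for this token
            num = decay * num + w * v  # accumulate new evidence
            den = decay * den + w
        else:
            # Occluded frame: no new evidence; memory is retained but decays.
            num, den = decay * num, decay * den
        outputs.append(num / den if den > 0 else 0.0)
    return outputs
```

Because both the numerator and the normalizer decay by the same factor on a masked step, the output at an occluded frame equals the memory's current estimate, which is what lets the module bridge occlusions without attending over the full history.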
Poster
Coming soon...
BibTeX
@inproceedings{patera2026lightweight,
title={Lightweight Spatio-Temporal Modeling via Temporally Shifted Distillation for Real-Time Accident Anticipation},
author={Patrik Patera and Yie-Tarng Chen and Wen-Hsien Fang},
booktitle={The Fourteenth International Conference on Learning Representations},
year={2026},
url={https://openreview.net/forum?id=8zzfTSVds2}
}