Published in IEEE Transactions on Circuits and Systems for Video Technology, 2025
MASTTA is a parameter-efficient, multi-modal framework that improves traffic accident anticipation by fine-tuning CLIP-based adapters for visual and textual data. By combining novel Temporal and Spatial Adapters with a Text Adapter, the model captures complex spatio-temporal interactions and aligns the visual and textual representations in a joint embedding space. This combination supports more accurate, long-range context modeling, and the model outperforms state-of-the-art methods in both earliness and correctness on the DAD and CCD benchmarks.
Recommended citation: P. Patera, Y.-T. Chen and W.-H. Fang, "A Multi-Modal Architecture With Spatio-Temporal-Text Adaptation for Video-Based Traffic Accident Anticipation," in IEEE Transactions on Circuits and Systems for Video Technology, vol. 35, no. 9, pp. 8989-9002, Sept. 2025, doi: 10.1109/TCSVT.2025.3552895.
Download Paper