Publications

You can also find my articles on my Google Scholar profile.

Conference Papers


Lightweight Spatio-Temporal Modeling via Temporally Shifted Distillation for Real-Time Accident Anticipation

Published in The Fourteenth International Conference on Learning Representations (ICLR), 2026

A lightweight, real-time accident predictor that runs on edge devices. It is trained via a novel temporally shifted distillation scheme and combines efficient spatial encoding with recurrent temporal modeling.

Recommended citation: P. Patera, Y.-T. Chen and W.-H. Fang, "Lightweight Spatio-Temporal Modeling via Temporally Shifted Distillation for Real-Time Accident Anticipation," The Fourteenth International Conference on Learning Representations (ICLR), Rio de Janeiro, Brazil, 2026, pp. 1-20, url: https://openreview.net/forum?id=8zzfTSVds2
Download Paper

Spatio-Temporal Adaptation with Dilated Neighbourhood Attention for Accident Anticipation

Published in IEEE International Conference on Image Processing (ICIP), 2024

This study adapts a pretrained CLIP-ViT to traffic accident anticipation using Parameter-Efficient Transfer Learning (PEFTL) and Dilated Neighborhood Attention (DNA). Novel Spatial and Temporal Adapters with cross-attention let the model capture long-range dependencies more effectively, achieving state-of-the-art earliness and accuracy on the DAD and CCD datasets.

Recommended citation: P. Patera, Y.-T. Chen and W.-H. Fang, "Spatio-Temporal Adaptation With Dilated Neighbourhood Attention For Accident Anticipation," 2024 IEEE International Conference on Image Processing (ICIP), Abu Dhabi, United Arab Emirates, 2024, pp. 2452-2458, doi: 10.1109/ICIP51287.2024.10647316.
Download Paper

Journal Articles


A Multi-modal Architecture with Spatio-Temporal-Text Adaptation for Video-based Traffic Accident Anticipation

Published in IEEE Transactions on Circuits and Systems for Video Technology, 2025

MASTTA is a parameter-efficient, multi-modal framework that improves traffic accident anticipation by fine-tuning CLIP-based adapters for visual and textual data. By utilizing novel Temporal and Spatial Adapters alongside a Text Adapter, the model captures complex spatio-temporal interactions and aligns them in a joint embedding space. This synergy allows for more accurate, long-range context modeling, outperforming state-of-the-art methods in both earliness and correctness on the DAD and CCD benchmarks.

Recommended citation: P. Patera, Y.-T. Chen and W.-H. Fang, "A Multi-Modal Architecture With Spatio-Temporal-Text Adaptation for Video-Based Traffic Accident Anticipation," in IEEE Transactions on Circuits and Systems for Video Technology, vol. 35, no. 9, pp. 8989-9002, Sept. 2025, doi: 10.1109/TCSVT.2025.3552895.
Download Paper

Patents


System and method for identification, authentication, and verification of a person based upon a short audio-visual recording of the person

Published as a US patent, 2025

This method computes a unique hash by combining facial and voice representations into distinct fingerprints. It extracts these features from multimedia recordings and uses a similarity measure to compare them for identification or verification.

Recommended citation: K. Ekštein, M. Konopík, F. Pártl and P. Patera, "System and method for identification, authentication, and verification of a person based upon a short audio-visual recording of the person," US patent 2025/0184148.
Download Paper