Traffic monitoring is crucial for urban mobility, road safety, and intelligent transportation systems (ITS). Deep learning has advanced video-based traffic monitoring through video question answering (VideoQA) models, enabling the extraction of structured insights from traffic videos. However, existing VideoQA models struggle with the complexity of real-world traffic scenes, where multiple concurrent events unfold across spatiotemporal dimensions. To address these challenges, this paper introduces InterAct VideoQA, a curated dataset designed to benchmark and enhance VideoQA models for traffic monitoring tasks. The InterAct VideoQA dataset comprises 8 hours of real-world traffic footage collected from diverse intersections, segmented into 10-second video clips and paired with 28,800 question-answer (QA) pairs covering spatiotemporal dynamics, vehicle interactions, incident detection, and other critical traffic attributes. Evaluating state-of-the-art VideoQA models on InterAct VideoQA exposes their difficulty in reasoning over fine-grained spatiotemporal dependencies within complex traffic scenarios. Fine-tuning these models on InterAct VideoQA yields notable performance improvements, underscoring the necessity of domain-specific datasets for VideoQA. InterAct VideoQA is publicly available as a benchmark to facilitate future research on VideoQA models deployable in real-world intelligent transportation systems.
Overview of the InterAct VideoQA data collection framework, which integrates traffic video recording and processing with a hybrid annotation approach combining manual labeling and GPT-based automation. The pipeline segments eight hours of footage into 10-second clips, extracts key metadata (e.g., vehicle attributes, movement patterns, pedestrian data), and generates structured question-answer pairs covering attribute recognition, counting, reverse reasoning, event reasoning, and counterfactual inference.
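The clip segmentation and QA structuring steps can be sketched in a few lines of Python. The example below uses ffmpeg's segment muxer to cut a recording into 10-second clips and shows one possible JSON layout for a generated QA pair; the file names, field names, and category labels are illustrative assumptions, not the released annotation schema.

```python
# Minimal sketch: cut a traffic recording into 10-second clips with ffmpeg
# and store one structured QA pair per clip as JSON.
# Paths, field names, and category labels are illustrative assumptions,
# not the released InterAct VideoQA schema.
import json
import subprocess
from pathlib import Path


def segment_video(source: Path, out_dir: Path, clip_seconds: int = 10) -> None:
    """Split `source` into fixed-length clips using ffmpeg's segment muxer."""
    out_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(
        [
            "ffmpeg", "-i", str(source),
            "-c", "copy",                      # stream copy: no re-encoding
            "-f", "segment",
            "-segment_time", str(clip_seconds),
            "-reset_timestamps", "1",
            str(out_dir / "clip_%04d.mp4"),
        ],
        check=True,
    )


def make_qa_record(clip_id: str, question: str, answer: str, category: str) -> dict:
    """One structured QA pair tied to a clip (hypothetical layout)."""
    return {
        "clip_id": clip_id,
        "question": question,
        "answer": answer,
        "category": category,  # e.g. counting, attribute recognition, event reasoning
    }


if __name__ == "__main__":
    segment_video(Path("intersection_cam01.mp4"), Path("clips/"))
    record = make_qa_record(
        "clip_0001", "How many vehicles turn left during the clip?", "3", "counting"
    )
    Path("qa_pairs.jsonl").write_text(json.dumps(record) + "\n")
```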
Overview of the InterAct VideoQA dataset, which comprises 28,800 question-answer pairs across multiple reasoning categories. Questions are concentrated in counting, attribute recognition, and event reasoning, followed by counterfactual inference and reverse reasoning (3a). Figures 3(b)-(d) illustrate the dataset's emphasis on vehicle-related questions, the dominance of the attribute recognition and event reasoning categories, and the distribution of question types (“what,” “where,” and “how”). This structure supports the analysis of complex, multi-event traffic scenarios that require robust spatiotemporal reasoning. A rigorous human and GPT-assisted validation process ensures the consistency, accuracy, and reliability of all annotations.
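Assuming a flat JSONL annotation file like the one sketched above, the per-category and per-question-type distributions summarized in this figure can be tallied with standard-library Python; the file name and field names ("category", "question") are assumptions rather than the released schema.

```python
# Minimal sketch: tally QA pairs per reasoning category and per question type
# ("what" / "where" / "how") from a JSONL annotation file.
# The file name and field names are assumptions, not the released schema.
import json
from collections import Counter
from pathlib import Path


def distribution(annotation_file: Path) -> tuple[Counter, Counter]:
    categories, question_types = Counter(), Counter()
    with annotation_file.open() as f:
        for line in f:
            record = json.loads(line)
            categories[record["category"]] += 1
            # The first word of the question gives a rough "what"/"where"/"how" split.
            question_types[record["question"].split()[0].lower()] += 1
    return categories, question_types


if __name__ == "__main__":
    cats, qtypes = distribution(Path("qa_pairs.jsonl"))
    for name, count in cats.most_common():
        print(f"{name:25s} {count}")
    for name, count in qtypes.most_common():
        print(f"{name:25s} {count}")
```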
Performance analysis of VideoLLaMA2, LLaVA-NeXT-Video, and Qwen2-VL-7B-hf on the InterAct VideoQA dataset, showing metric distributions (a), performance before versus after fine-tuning (b), and multi-metric improvements (c). Qwen2-VL-7B-hf shows the most substantial gains on complex reasoning tasks, underscoring the effectiveness of fine-tuning for robust traffic video analysis.
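The before/after comparison reduces to scoring each model's predicted answers against the ground truth per reasoning category. The sketch below computes a simple per-category exact-match accuracy from two prediction files (zero-shot and fine-tuned); the file layout is an assumption, and a full evaluation would report additional text-similarity metrics alongside it.

```python
# Minimal sketch: per-category exact-match accuracy for a zero-shot and a
# fine-tuned run, given JSONL files with "category", "prediction", "answer".
# File names and fields are assumptions; they do not describe the paper's
# exact evaluation protocol.
import json
from collections import defaultdict
from pathlib import Path


def per_category_accuracy(pred_file: Path) -> dict[str, float]:
    correct, total = defaultdict(int), defaultdict(int)
    with pred_file.open() as f:
        for line in f:
            r = json.loads(line)
            total[r["category"]] += 1
            if r["prediction"].strip().lower() == r["answer"].strip().lower():
                correct[r["category"]] += 1
    return {c: correct[c] / total[c] for c in total}


if __name__ == "__main__":
    zero_shot = per_category_accuracy(Path("preds_zero_shot.jsonl"))
    fine_tuned = per_category_accuracy(Path("preds_fine_tuned.jsonl"))
    for category in sorted(zero_shot):
        delta = fine_tuned.get(category, 0.0) - zero_shot[category]
        print(f"{category:25s} {zero_shot[category]:.3f} -> "
              f"{fine_tuned.get(category, 0.0):.3f} (delta {delta:+.3f})")
```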