InterAct VideoQA: A Benchmark Dataset for Video Question Answering in Traffic Intersection Monitoring

Paper: arXiv:2507.14743


Joseph Raj Vishal, Divesh Basina, Katha Naik, Rutuja Patil, Manas Srinivas Gowda, Yezhou Yang, Bharatesh Chakravarthi
Arizona State University

InterAct VideoQA is a curated, publicly available traffic monitoring dataset gathered using ARGOS cameras and mobile devices under diverse weather and lighting conditions. It comprises 8 hours of real-world footage from multiple intersections, segmented into 10-second clips, and features over 25,000 question-answer (QA) pairs covering spatiotemporal dynamics, vehicle interactions, and incident detection. The dataset enables benchmarking and enhancement of VideoQA models for intelligent transportation systems, and it includes five QA types: (1) attribution, (2) counting, (3) event reasoning, (4) reverse reasoning, and (5) counterfactual inference.

Abstract

Traffic monitoring is crucial for urban mobility, road safety, and intelligent transportation systems (ITS). Deep learning has advanced video-based traffic monitoring through video question answering (VideoQA) models, enabling structured insight extraction from traffic videos. However, existing VideoQA models struggle with the complexity of real-world traffic scenes, where multiple concurrent events unfold across spatiotemporal dimensions. To address these challenges, this paper introduces InterAct VideoQA, a curated dataset designed to benchmark and enhance VideoQA models for traffic monitoring tasks. The InterAct VideoQA dataset comprises 8 hours of real-world traffic footage collected from diverse intersections, segmented into 10-second video clips, with over 25,000 question-answer (QA) pairs covering spatiotemporal dynamics, vehicle interactions, incident detection, and other critical traffic attributes. State-of-the-art VideoQA models are evaluated on InterAct VideoQA, exposing challenges in reasoning over fine-grained spatiotemporal dependencies within complex traffic scenarios. Additionally, fine-tuning these models on InterAct VideoQA yields notable performance improvements, demonstrating the necessity of domain-specific datasets for VideoQA. InterAct VideoQA is publicly available as a benchmark dataset to facilitate future research in real-world-deployable VideoQA models for intelligent transportation systems.


InterAct - Data Collection Setup and Diversity


Overview of the InterAct VideoQA data collection framework, which integrates traffic video recording and processing with a hybrid annotation approach that combines manual labeling and GPT-based automation. The pipeline segments eight hours of footage into 10-second clips, extracts key metadata (e.g., vehicle attributes, movement patterns, pedestrian data), and generates structured question-answer pairs covering attribution, counting, reverse reasoning, event reasoning, and counterfactual inference.
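
The page does not specify the tooling used for clip segmentation; below is a minimal sketch of one way to split a long intersection recording into fixed-length 10-second clips using ffmpeg's segment muxer (the file names are hypothetical, and ffmpeg must be installed separately).

import subprocess
from pathlib import Path

def segment_video(src: str, out_dir: str, clip_seconds: int = 10) -> None:
    """Split a long recording into fixed-length clips via ffmpeg's segment muxer."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(
        [
            "ffmpeg", "-i", src,
            "-c", "copy",               # stream copy: split without re-encoding
            "-map", "0",                # keep all streams from the input
            "-f", "segment",
            "-segment_time", str(clip_seconds),
            "-reset_timestamps", "1",   # each clip starts at t=0
            f"{out_dir}/clip_%05d.mp4",
        ],
        check=True,
    )

# Hypothetical input file for illustration.
segment_video("intersection_cam01.mp4", "clips/cam01")

Stream copying keeps segmentation fast for hours of footage, at the cost of clip boundaries snapping to the nearest keyframe rather than exact 10-second marks.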


InterAct - Statistics


Overview of the InterAct VideoQA dataset, which comprises 28,800 question-answer pairs across various reasoning categories. Questions concentrate in counting, attribute recognition, and event reasoning, followed by counterfactual inference and reverse reasoning (3a). Figures 3(b)-(d) illustrate the dataset's emphasis on vehicular-related questions, the dominance of attribution and event reasoning categories, and the distribution of question types (“what,” “where,” and “how”). This structured approach supports the analysis of complex, multi-event traffic scenarios, requiring robust spatiotemporal reasoning. A rigorous human and GPT-assisted validation process ensures the consistency, accuracy, and reliability of all annotations.
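
As a quick way to reproduce such category statistics, the sketch below tallies QA pairs per reasoning category. The annotation file name and the schema (a JSON list of records with "question", "answer", and "category" keys) are assumptions for illustration, not the dataset's published format.

import json
from collections import Counter

# Hypothetical schema: [{"video": ..., "question": ..., "answer": ..., "category": ...}, ...]
with open("interact_videoqa_annotations.json") as f:
    qa_pairs = json.load(f)

by_category = Counter(item["category"] for item in qa_pairs)
for category, count in by_category.most_common():
    print(f"{category:>25}: {count} ({100 * count / len(qa_pairs):.1f}%)")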


InterAct - Experimentation Statistics


Performance analysis of VideoLlama2, Llava-NeXT-Video, and Qwen2-VL-7B-hf on the InterAct VideoQA dataset, highlighting metric distributions (a), before vs. after fine-tuning (b), and multi-metric improvements (c). Notably, Qwen2-VL-7B-hf demonstrates the most substantial gains across complex reasoning tasks, emphasizing the effectiveness of fine-tuning for robust traffic video analysis.
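
To make the before/after fine-tuning comparison concrete, here is a minimal sketch of per-category exact-match scoring; exact match is only one possible metric, and the record schema ("category", "answer" keys) is a hypothetical stand-in for the benchmark's actual evaluation format.

from collections import defaultdict

def accuracy_by_category(predictions, references):
    """Exact-match accuracy per reasoning category (case- and whitespace-insensitive)."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for pred, ref in zip(predictions, references):
        cat = ref["category"]
        total[cat] += 1
        correct[cat] += int(pred["answer"].strip().lower() == ref["answer"].strip().lower())
    return {cat: correct[cat] / total[cat] for cat in total}

# Toy illustration of a before/after fine-tuning comparison.
refs = [{"category": "counting", "answer": "3"}]
preds_before = [{"answer": "2"}]
preds_after = [{"answer": "3"}]
print(accuracy_by_category(preds_before, refs))  # {'counting': 0.0}
print(accuracy_by_category(preds_after, refs))   # {'counting': 1.0}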


Sample Data Recordings


BibTeX

@misc{vishal2025interactvideoreasoningrichvideoqa,
  title={InterAct-Video: Reasoning-Rich Video QA for Urban Traffic},
  author={Joseph Raj Vishal and Divesh Basina and Rutuja Patil and Manas Srinivas Gowda and Katha Naik and Yezhou Yang and Bharatesh Chakravarthi},
  year={2025},
  eprint={2507.14743},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2507.14743},
}