Traffic monitoring is crucial for urban mobility, road safety, and intelligent transportation systems (ITS). Deep learning has advanced video-based traffic monitoring through video question answering (VideoQA) models, enabling the extraction of structured insights from traffic videos. However, existing VideoQA models struggle with the complexity of real-world traffic scenes, where multiple concurrent events unfold across spatiotemporal dimensions. To address these challenges, this paper introduces InterAct VideoQA, a curated dataset designed to benchmark and enhance VideoQA models for traffic monitoring tasks. The InterAct VideoQA dataset comprises 8 hours of real-world traffic footage collected from diverse intersections, segmented into 10-second video clips and paired with 28,800 question-answer (QA) pairs covering spatiotemporal dynamics, vehicle interactions, incident detection, and other critical traffic attributes. Evaluating state-of-the-art VideoQA models on InterAct VideoQA exposes their difficulty in reasoning over fine-grained spatiotemporal dependencies within complex traffic scenarios. Fine-tuning these models on InterAct VideoQA yields notable performance improvements, underscoring the necessity of domain-specific datasets for VideoQA. InterAct VideoQA is publicly available as a benchmark to facilitate future research on VideoQA models deployable in real-world intelligent transportation systems.
Overview of the InterAct VideoQA data collection framework, which integrates traffic video recording and processing with a hybrid annotation approach combining manual labeling and GPT-based automation. The pipeline segments eight hours of footage into 10-second clips, extracts key metadata (e.g., vehicle attributes, movement patterns, pedestrian data), and generates structured question-answer pairs covering attribute recognition, counting, reverse reasoning, event reasoning, and counterfactual inference.
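The clip segmentation and QA structuring steps can be sketched in a few lines of Python. The example below uses ffmpeg's segment muxer to cut a recording into 10-second clips and shows one possible JSON layout for a generated QA pair; the file names, field names, and category labels are illustrative assumptions, not the released annotation schema.

```python
# Minimal sketch: cut a traffic recording into 10-second clips with ffmpeg
# and store one structured QA pair per clip as JSON.
# Paths, field names, and category labels are illustrative assumptions,
# not the released InterAct VideoQA schema.
import json
import subprocess
from pathlib import Path


def segment_video(source: Path, out_dir: Path, clip_seconds: int = 10) -> None:
    """Split `source` into fixed-length clips using ffmpeg's segment muxer."""
    out_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(
        [
            "ffmpeg", "-i", str(source),
            "-c", "copy",                      # stream copy: no re-encoding
            "-f", "segment",
            "-segment_time", str(clip_seconds),
            "-reset_timestamps", "1",
            str(out_dir / "clip_%04d.mp4"),
        ],
        check=True,
    )


def make_qa_record(clip_id: str, question: str, answer: str, category: str) -> dict:
    """One structured QA pair tied to a clip (hypothetical layout)."""
    return {
        "clip_id": clip_id,
        "question": question,
        "answer": answer,
        "category": category,  # e.g. counting, attribute recognition, event reasoning
    }


if __name__ == "__main__":
    segment_video(Path("intersection_cam01.mp4"), Path("clips/"))
    record = make_qa_record(
        "clip_0001", "How many vehicles turn left during the clip?", "3", "counting"
    )
    Path("qa_pairs.jsonl").write_text(json.dumps(record) + "\n")
```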
Overview of the InterAct VideoQA dataset, which comprises 28,800 question-answer pairs across multiple reasoning categories. Questions are concentrated in counting, attribute recognition, and event reasoning, followed by counterfactual inference and reverse reasoning (3a). Figures 3(b)-(d) illustrate the dataset's emphasis on vehicle-related questions, the dominance of the attribute recognition and event reasoning categories, and the distribution of question types (“what,” “where,” and “how”). This structure supports the analysis of complex, multi-event traffic scenarios that require robust spatiotemporal reasoning. A rigorous human and GPT-assisted validation process ensures the consistency, accuracy, and reliability of all annotations.
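Assuming a flat JSONL annotation file like the one sketched above, the per-category and per-question-type distributions summarized in this figure can be tallied with standard-library Python; the file name and field names ("category", "question") are assumptions rather than the released schema.

```python
# Minimal sketch: tally QA pairs per reasoning category and per question type
# ("what" / "where" / "how") from a JSONL annotation file.
# The file name and field names are assumptions, not the released schema.
import json
from collections import Counter
from pathlib import Path


def distribution(annotation_file: Path) -> tuple[Counter, Counter]:
    categories, question_types = Counter(), Counter()
    with annotation_file.open() as f:
        for line in f:
            record = json.loads(line)
            categories[record["category"]] += 1
            # The first word of the question gives a rough "what"/"where"/"how" split.
            question_types[record["question"].split()[0].lower()] += 1
    return categories, question_types


if __name__ == "__main__":
    cats, qtypes = distribution(Path("qa_pairs.jsonl"))
    for name, count in cats.most_common():
        print(f"{name:25s} {count}")
    for name, count in qtypes.most_common():
        print(f"{name:25s} {count}")
```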
Performance analysis of VideoLLaMA2, LLaVA-NeXT-Video, and Qwen2-VL-7B-hf on the InterAct VideoQA dataset, showing metric distributions (a), performance before versus after fine-tuning (b), and multi-metric improvements (c). Qwen2-VL-7B-hf shows the most substantial gains on complex reasoning tasks, underscoring the effectiveness of fine-tuning for robust traffic video analysis.
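The before/after comparison reduces to scoring each model's predicted answers against the ground truth per reasoning category. The sketch below computes a simple per-category exact-match accuracy from two prediction files (zero-shot and fine-tuned); the file layout is an assumption, and a full evaluation would report additional text-similarity metrics alongside it.

```python
# Minimal sketch: per-category exact-match accuracy for a zero-shot and a
# fine-tuned run, given JSONL files with "category", "prediction", "answer".
# File names and fields are assumptions; they do not describe the paper's
# exact evaluation protocol.
import json
from collections import defaultdict
from pathlib import Path


def per_category_accuracy(pred_file: Path) -> dict[str, float]:
    correct, total = defaultdict(int), defaultdict(int)
    with pred_file.open() as f:
        for line in f:
            r = json.loads(line)
            total[r["category"]] += 1
            if r["prediction"].strip().lower() == r["answer"].strip().lower():
                correct[r["category"]] += 1
    return {c: correct[c] / total[c] for c in total}


if __name__ == "__main__":
    zero_shot = per_category_accuracy(Path("preds_zero_shot.jsonl"))
    fine_tuned = per_category_accuracy(Path("preds_fine_tuned.jsonl"))
    for category in sorted(zero_shot):
        delta = fine_tuned.get(category, 0.0) - zero_shot[category]
        print(f"{category:25s} {zero_shot[category]:.3f} -> "
              f"{fine_tuned.get(category, 0.0):.3f} (delta {delta:+.3f})")
```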