ICQ: Localizing Events in Videos with Multimodal Queries

1LMU Munich, 2TU Munich, 3Tsinghua University,
4Munich Center for Machine Learning, 5University of Oxford
2024

*Indicates Equal Contribution

Abstract

Video understanding is a pivotal task in the digital era, yet the dynamic and multi-event nature of videos makes them labor-intensive and computationally demanding to process. Thus, localizing a specific event given a semantic query has gained importance in both user-oriented applications like video search and academic research into video foundation models. A significant limitation in current research is that semantic queries are typically expressed in natural language that depicts the semantics of the target event. This setting overlooks the potential for multimodal semantic queries composed of images and texts. To address this gap, we introduce a new benchmark, ICQ, for localizing events in videos with multimodal queries, along with a new evaluation dataset, ICQ-Highlight. Our benchmark evaluates how well models can localize an event given a multimodal semantic query that consists of a reference image, which depicts the event, and a refinement text that adjusts the image's semantics. To systematically benchmark model performance, we include 4 styles of reference images and 5 types of refinement texts, allowing us to explore model performance across different domains. We propose 3 adaptation methods that tailor existing models to our new setting and evaluate 10 SOTA models, ranging from specialized models to large-scale foundation models. We believe this benchmark is an initial step toward investigating multimodal queries in video event localization.
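To make the query format concrete, the sketch below shows one way a single ICQ-Highlight query could be represented in code. The field names, the example style and refinement-type labels, and the localize_event interface are illustrative assumptions, not the dataset's actual schema or API.

```python
# Hypothetical sketch of an ICQ multimodal query and the localization interface.
# Field names, style/type labels, and `localize_event` are illustrative assumptions,
# not the benchmark's actual schema or API.

from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class MultimodalQuery:
    video_id: str          # ID of the video to search
    ref_image: str         # path to the reference image depicting the event
    ref_image_style: str   # one of the 4 reference-image styles
    refinement_text: str   # text that adjusts the reference image's semantics
    refinement_type: str   # one of the 5 refinement-text types


def localize_event(query: MultimodalQuery) -> List[Tuple[float, float]]:
    """Return predicted (start, end) timestamps, in seconds, of the matching event."""
    raise NotImplementedError  # plug in one of the adapted models here


query = MultimodalQuery(
    video_id="example_video",
    ref_image="images/example_ref.png",
    ref_image_style="stylized",                    # assumed style name
    refinement_text="the person wears a red jacket",
    refinement_type="attribute modification",      # assumed type name
)
predicted_spans = localize_event(query)
```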

Localizing events in videos with semantic queries: so far, the community has focused only on natural language query-based video event localization. Our benchmark ICQ targets a more general scenario: localizing events in videos with multimodal queries.





Dataset

Examples of ICQ-Highlight.




Distribution of ICQ-Highlight.




Benchmark

Disclaimer

Please read: This dataset contains synthetic images, and despite manual filtering it may still include sensitive, unpleasant, or hallucinated images. If you find any such images, please report them to us and we will remove them from the dataset. The images in this dataset are synthetic and do not represent real imagery. The dataset is intended for research purposes only and should not be used for any other purpose.

License

The annotation file is licensed under the CC BY-NC-SA License - see the LICENSE file for details.

Acknowledgements

We would like to thank the following projects and their authors for their contributions to this work.

BibTeX


        @article{zhang2024localizing,
          title={Localizing Events in Videos with Multimodal Queries},
          author={Zhang, Gengyuan and Fok, Mang Ling Ada and Xia, Yan and Tang, Yansong and Cremers, Daniel and Torr, Philip and Tresp, Volker and Gu, Jindong},
          journal={arXiv preprint arXiv:2406.10079},
          year={2024}
        }