Video data is growing explosively as a result of ubiquitous acquisition capabilities. Videos captured by UAVs (drones), ground surveillance, and body-worn cameras can easily reach the scale of gigabytes per day, and about 300 hours of video are uploaded to YouTube every minute. While this “big video data” is a great source for information discovery, the computational challenges it poses are unparalleled. In this context, intelligent algorithms for automatic video summarization (as well as retrieval, recognition, etc.) (re-)emerge as a pressing need.
In this paper we focus on extractive video summarization, which generates a concise summary of a video by selecting key frames or shots from it. The key frames/shots are expected to be 1) individually important—otherwise they should not be selected—and 2) collectively diverse—otherwise some of them could be removed without losing much information.
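The two criteria above can be illustrated with a simple greedy selection that trades a frame's individual importance against its redundancy with frames already chosen. This is only a minimal sketch of the importance/diversity trade-off; the importance scores, feature vectors, and the `lam` trade-off weight are hypothetical inputs, and the selection model used in the paper is more sophisticated.

```python
import numpy as np

def greedy_summary(importance, features, k, lam=1.0):
    """Greedily pick k frames: reward individual importance,
    penalize similarity to already-selected frames (diversity).

    importance: (n,) per-frame importance scores (hypothetical)
    features:   (n, d) per-frame feature vectors (hypothetical)
    lam:        weight of the redundancy penalty
    """
    n = len(importance)
    # Normalize features so dot products are cosine similarities.
    feats = features / np.linalg.norm(features, axis=1, keepdims=True)
    selected = []
    for _ in range(k):
        best, best_score = None, -np.inf
        for i in range(n):
            if i in selected:
                continue
            # Redundancy: highest similarity to any already-selected frame.
            red = max((float(feats[i] @ feats[j]) for j in selected),
                      default=0.0)
            score = importance[i] - lam * red
            if score > best_score:
                best, best_score = i, score
        selected.append(best)
    return selected
```

With `lam` large enough, a near-duplicate of an already-selected frame is skipped in favor of a less important but novel one, which is exactly the "collectively diverse" requirement.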
Building on this recent progress, the goal of this paper is to further advance user-oriented video summarization by modeling user input, or more precisely user intentions, in the summarization process. We name this task query-focused (extractive) video summarization, in accordance with query-focused document summarization in NLP. A query refers to one or more concepts (e.g., car, flowers) that are both user-nameable and machine-detectable. More generic queries are left for future work.
Data and evaluation
For the data and evaluation protocol, please refer to our CVPR 2017 project page.