Query-Focused Video Summarization: Dataset, Evaluation, and a Memory Network Based Approach

Authors: Aidean Sharghi, Jacob Scott Laurel, Boqing Gong

A YouTube search for the query "Disneyland and food" matches videos by their tags and captions; ideally, however, the search would examine the video content itself and show the user the relevant segments of each video.

[Paper-PDF] [Supp-Mat-PDF] [Data] [Code]

Video Summarization

Due to the ubiquity of video acquisition devices, video data has exploded; to take just one source, more than 300 hours of video are uploaded to YouTube every minute. Far more video is available than anyone can watch, which compels researchers to devise automatic approaches to video summarization: producing a much shorter version of a video by selecting only its most important segments. The summary should convey the same information in as little time as possible.


Video summarization is subjective: given the same video, different users deem different segments important, so their summaries will most likely differ greatly. We argue this is due to differing interests and granularities. To investigate this claim, we asked three undergraduate students to summarize the UT Egocentric videos given different queries. The experiments support the claim: 1) given a video and a query, the summaries returned by different users share characteristics and common elements but are not identical; 2) given a video, the same annotator returned different summaries depending on the query.

The above figure illustrates the summaries returned by one annotator for two different queries; the orange margin marks the query-dependent segments of the video.


In query-focused video summarization, a good summary must meet two criteria:

  1. Query-dependent summary
  2. Contextual summary

To satisfy both criteria, one must convert the static features of the video into dynamic features that change with respect to the query. To this end, we take advantage of memory networks, which have been used effectively for question answering. The following figure illustrates one memory network cell used in our framework:


The features of one shot (f_i, i = 1, ..., k) are fed into the memory network cell, the similarity between each individual feature and the query is measured, and a weighted combination of the transformed features is computed as the output O of the network.
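The attention step above can be sketched in a few lines. This is a minimal illustration in the style of end-to-end memory networks, not the paper's actual implementation: the embedding matrices `A` and `C` and all variable names are assumptions for the sketch.

```python
import math

def memory_cell(query, feats, A, C):
    """One memory-network cell (sketch): score each shot feature against
    the query, softmax the scores into attention weights, and return the
    weighted sum of transformed features.  A and C stand in for the
    learned input/output embedding matrices (hypothetical names)."""
    def matvec(M, v):
        return [sum(m * x for m, x in zip(row, v)) for row in M]

    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    # Similarity between the query and each (embedded) shot feature.
    scores = [dot(query, matvec(A, f)) for f in feats]
    # Numerically stable softmax turns scores into attention weights.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]
    # Output O: attention-weighted combination of transformed features.
    transformed = [matvec(C, f) for f in feats]
    out = [sum(w * t[d] for w, t in zip(weights, transformed))
           for d in range(len(transformed[0]))]
    return out, weights
```

With identity embeddings and a query aligned with the first feature, the first shot receives the larger attention weight, so the output leans toward that shot's transformed feature.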

After computing the memory network's output for all the shots, partitions of 10 shot features are fed to a DPP (determinantal point process) summarizer for the subset selection process. The overall pipeline is illustrated below.
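To give a feel for how a DPP selects a diverse subset, here is a greedy MAP-inference sketch: at each step it adds the item that most increases the determinant of the selected submatrix of the kernel `L`. The paper learns the DPP end to end; this toy version only illustrates the selection behavior, and the kernel values below are made up.

```python
def greedy_dpp(L, k):
    """Greedy MAP inference for a DPP with kernel L (sketch only).
    Items with high quality (large diagonal) and low similarity to
    already-selected items (small off-diagonal) yield larger
    determinants, so the greedy picks are both important and diverse."""
    def det(M):
        # Laplace expansion; fine for the tiny matrices in this sketch.
        n = len(M)
        if n == 0:
            return 1.0
        if n == 1:
            return M[0][0]
        total = 0.0
        for j in range(n):
            minor = [row[:j] + row[j + 1:] for row in M[1:]]
            total += ((-1) ** j) * M[0][j] * det(minor)
        return total

    selected = []
    for _ in range(k):
        best, best_gain = None, 0.0
        for i in range(len(L)):
            if i in selected:
                continue
            S = selected + [i]
            sub = [[L[a][b] for b in S] for a in S]
            gain = det(sub)
            if gain > best_gain:
                best, best_gain = i, gain
        if best is None:
            break
        selected.append(best)
    return selected
```

With two near-duplicate shots (kernel similarity 0.9) and one distinct shot, the greedy DPP skips the duplicate and picks the diverse pair.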


Since query-focused video summarization is relatively new and under-explored, no benchmark exists. We therefore release the first Query-Focused Video Summarization Dataset to facilitate future research on this topic. We downloaded the videos of the UT Egocentric dataset and defined 46 queries per video, each a pair of concepts. The log-frequency graph of the 48 concepts in total is provided below:

Each (video, query) pair is summarized by three undergraduate students, for a total of 552 (4 videos × 46 queries × 3 annotators) summaries.


Given the subjectivity of video summarization, we argue that current approaches to evaluating the quality of video summaries fall short. In defining a better measure that closely tracks what humans perceive from video summaries, we share the opinion of Yeung et al.: the key is to evaluate how well a system summary retains the semantic information, as opposed to the visual quantities, of the user-supplied video summaries. Arguably, this semantic information is best expressed by concepts (e.g., objects, places, people) that capture the fundamental characteristics of what we see in the video, whereas captions can clearly fall short.

Thanks to the dense concept annotations per video shot, we can conveniently compare a system-generated video summary with user summaries according to the semantic information they entail. We first define a similarity function between any two video shots as the intersection-over-union (IOU) of their corresponding concepts. For instance, if one shot is tagged {CAR, STREET} and another {STREET, TREE, SIGN}, then the IOU similarity between them is 1/4 = 0.25.
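The IOU similarity is simple enough to state directly in code (the function name is ours, not from the released toolkit):

```python
def iou_similarity(shot_a, shot_b):
    """Intersection-over-union of two shots' concept sets."""
    a, b = set(shot_a), set(shot_b)
    return len(a & b) / len(a | b) if (a | b) else 0.0

# The example from the text: one shared concept out of four total.
iou_similarity({"CAR", "STREET"}, {"STREET", "TREE", "SIGN"})  # 1/4 = 0.25
```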

To compare two summaries, we find the best correspondence between their shots via maximum-weight matching in a bipartite graph, with the two summaries on opposite sides of the graph and the IOU similarities as edge weights.
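A brute-force sketch of that matching step, assuming the IOU similarity above as the edge weight (for summaries of realistic length a Hungarian-algorithm solver such as `scipy.optimize.linear_sum_assignment` scales far better; permutation search is only workable for the tiny example here):

```python
from itertools import permutations

def iou_similarity(shot_a, shot_b):
    """Intersection-over-union of two shots' concept sets."""
    a, b = set(shot_a), set(shot_b)
    return len(a & b) / len(a | b) if (a | b) else 0.0

def max_weight_matching(summary_a, summary_b, sim=iou_similarity):
    """Maximum-weight bipartite matching between two summaries (sketch).
    Enumerates all assignments of the shorter side onto the longer side
    and keeps the one with the highest total similarity.  Returns the
    best total weight and the matched (index_a, index_b) pairs."""
    short, long_, swapped = summary_a, summary_b, False
    if len(short) > len(long_):
        short, long_, swapped = long_, short, True
    best_score, best_pairs = -1.0, []
    for perm in permutations(range(len(long_)), len(short)):
        score = sum(sim(short[i], long_[j]) for i, j in enumerate(perm))
        if score > best_score:
            best_score = score
            best_pairs = [(j, i) if swapped else (i, j)
                          for i, j in enumerate(perm)]
    return best_score, best_pairs
```

For example, matching a two-shot system summary against a two-shot user summary pairs each system shot with its semantically closest counterpart, regardless of temporal order.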

Quantitative Results

Comparison results for query-focused video summarization

Comparison results for generic video summarization, i.e., when no video shots are relevant to the query

Qualitative Results

Here we show two sample summaries generated by our model.

Query: Chocolate, Street

Query: Food, Drink

You can find the original videos here.