Fully-Coupled Two-Stream Spatiotemporal Networks for Extremely Low Resolution Action Recognition

Authors: Mingze Xu, Aidean Sharghi, Xin Chen, David J Crandall

Visualization of our fully-coupled two-stream spatiotemporal networks. We feed RGB frames into the spatial stream (green) and corresponding stacked optical flow fields to the temporal stream (yellow). The GRU networks (blue) compute spatiotemporal information for the entire video using the extracted C3D features as inputs. 

[Paper-PDF]

LOW RESOLUTION ACTION RECOGNITION

 
 

Cameras are seemingly everywhere, from traffic cameras on city streets and highways to surveillance systems in businesses and public places. Increasingly, we allow cameras even into the most private spaces of our lives. While these cameras promise to make our lives safer and simpler, including by enabling more natural, context-aware interactions with technology, they also record highly sensitive information about people and their private environments. To make matters worse, processing for many of today’s devices is performed by remote servers “in the cloud.” This means that even if a user trusts that a device uses recorded video solely for legitimate purposes, the video itself must still be transmitted to and stored by third parties, outside the user’s control. Perhaps the most effective approach to addressing this privacy challenge is simply to avoid collecting high-fidelity imagery in the first place.


FRAMEWORK

Spatial Feature Extractor. We use the C3D network, which has proven to be well-suited for modeling sequential inputs such as videos. Since C3D uses 3D convolution and pooling operations that operate over both spatial and temporal dimensions, it is able to capture motion information within each input video unit.
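To make the unit-level extraction concrete, below is a minimal PyTorch sketch of a C3D-style encoder. It is only a sketch of the idea: the layer widths, the 16-frame unit length, and the 12x16 input size are illustrative assumptions, not the exact configuration from the paper, which uses the standard C3D architecture.

    # Minimal sketch of a C3D-style unit encoder (assumed sizes, not the paper's exact model).
    import torch
    import torch.nn as nn

    class C3DUnitEncoder(nn.Module):
        """Encodes one video unit (a short stack of frames) into a feature vector."""
        def __init__(self, in_channels=3, feat_dim=512):
            super().__init__()
            self.features = nn.Sequential(
                # 3D convolutions and pooling operate over (time, height, width) jointly,
                # so short-range motion is captured inside each unit.
                nn.Conv3d(in_channels, 64, kernel_size=3, padding=1), nn.ReLU(),
                nn.MaxPool3d(kernel_size=(1, 2, 2)),
                nn.Conv3d(64, 128, kernel_size=3, padding=1), nn.ReLU(),
                nn.MaxPool3d(kernel_size=2),
                nn.Conv3d(128, 256, kernel_size=3, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool3d(1),          # collapse remaining time/space
            )
            self.fc = nn.Linear(256, feat_dim)

        def forward(self, clip):                   # clip: (B, C, T, H, W)
            x = self.features(clip).flatten(1)     # (B, 256)
            return self.fc(x)                      # (B, feat_dim)

    # Example: a batch of 16-frame RGB units at extremely low resolution (12x16 pixels).
    units = torch.randn(4, 3, 16, 12, 16)
    feats = C3DUnitEncoder()(units)                # (4, 512) per-unit features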

Temporal Feature Extractor. While the C3D network is able to encode local temporal features within each video unit, it cannot model across the multiple units of a video sequence. We thus introduce a Recurrent Neural Network (RNN) to capture global sequence dependencies of the input video and cue on motion information.

 
Visualization of our spatiotemporal feature extractor, which uses a C3D network to capture spatial and temporal features for each video unit and an RNN to encode motion information across the entire video stream.
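As a companion sketch, the recurrent part can be written as a GRU that consumes the sequence of per-unit C3D features and pools its outputs into a video-level prediction. The bidirectional setting, hidden size, and mean pooling below are assumptions chosen for illustration.

    # Sketch of the global temporal model: a GRU over per-unit C3D features (assumed sizes).
    import torch
    import torch.nn as nn

    class VideoGRUClassifier(nn.Module):
        def __init__(self, feat_dim=512, hidden=256, num_classes=51):
            super().__init__()
            self.gru = nn.GRU(feat_dim, hidden, batch_first=True, bidirectional=True)
            self.classifier = nn.Linear(2 * hidden, num_classes)

        def forward(self, unit_feats):               # (B, num_units, feat_dim)
            outputs, _ = self.gru(unit_feats)        # (B, num_units, 2*hidden)
            video_repr = outputs.mean(dim=1)         # pool over all units
            return self.classifier(video_repr)       # (B, num_classes)

    # Example: a video split into 8 units, each already encoded by the C3D extractor.
    unit_feats = torch.randn(4, 8, 512)
    logits = VideoGRUClassifier()(unit_feats)        # (4, 51) class scores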

 

Fully Coupled Networks. Low resolution recognition approaches in both the image and video domains have achieved better performance by learning features that transfer from high to low resolutions. This can be done either through unsupervised pre-training of super-resolution sub-networks or with partially-coupled networks, which are more flexible for knowledge transfer. Inspired by these works, we propose a fully-coupled network architecture in which all parameters of both the C3D and GRU networks are shared between the high and low resolutions during the (single) training stage. The key idea is that by viewing high and low resolution video frames as two different domains, the fully-coupled architecture is able to extract features that work across both. Since high resolution video contains much more visual information, training on both resolutions improves the learned spatial features; using high resolution video in training can also be thought of as data augmentation, since different sub-sampling techniques produce different low resolution exemplars from the same original high resolution image.
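A rough sketch of what one fully-coupled training step might look like is shown below: the same set of weights scores both the high resolution clip and a downsampled copy, and both predictions are supervised with the same label. The downsampling method, the equally-weighted loss, and the toy stand-in model are assumptions made purely for illustration.

    # Sketch of a fully-coupled training step: shared parameters see both resolutions.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    def coupled_training_step(model, optimizer, hi_res_clip, labels, lo_size=(12, 16)):
        """hi_res_clip: (B, C, T, H, W) high resolution units; labels: (B,) class indices."""
        T = hi_res_clip.shape[2]
        # Create the extremely-low-resolution view of the SAME clip (assumed sub-sampling).
        lo_res_clip = F.interpolate(hi_res_clip, size=(T, *lo_size),
                                    mode='trilinear', align_corners=False)

        # One model, one set of weights, supervised on both domains.
        loss = F.cross_entropy(model(hi_res_clip), labels) + \
               F.cross_entropy(model(lo_res_clip), labels)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()

    # Toy usage with a tiny stand-in model (any clip -> logits network works here).
    toy_model = nn.Sequential(nn.AdaptiveAvgPool3d(1), nn.Flatten(), nn.Linear(3, 51))
    opt = torch.optim.SGD(toy_model.parameters(), lr=0.01)
    clips = torch.randn(2, 3, 16, 112, 112)
    labels = torch.randint(0, 51, (2,))
    coupled_training_step(toy_model, opt, clips, labels)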


EVALUATION

Our full model, featuring pre-trained C3D networks, a bi-directional GRU, and the fully-coupled two-stream architecture with sum fusion, achieves 44.96% accuracy on the low resolution HMDB51 dataset and 73.19% on the low resolution DogCentric dataset. Of course, these results are significantly worse than the best results on the high resolution versions of these datasets (e.g., around 80.7% for HMDB51). We also tested our best model on action recognition in high resolution videos and achieved over 68% accuracy without any explicit tuning of the network architecture or hyper-parameters.
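For reference, the "sum fusion" of the two streams can be as simple as the following sketch; whether class scores are summed before or after a softmax is an implementation detail we assume here.

    # Sketch of two-stream sum fusion at prediction time (softmax placement assumed).
    import torch

    def sum_fusion(rgb_logits: torch.Tensor, flow_logits: torch.Tensor) -> torch.Tensor:
        """Both inputs: (B, num_classes). Returns fused class predictions (B,)."""
        fused = rgb_logits.softmax(dim=1) + flow_logits.softmax(dim=1)
        return fused.argmax(dim=1)

    preds = sum_fusion(torch.randn(4, 51), torch.randn(4, 51))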


Evaluation results of each component of our network architecture on the HMDB51 dataset