Hanoi University of Science and Technology
Master Thesis
Human Action Recognition using Deep
Learning and Multi-view Discriminant
Analysis
TRAN HOANG NHAT
[email protected]
Control Engineering and Automation
Advisor: Assoc. Prof. Dr. Tran Thi Thanh Hai
Faculty: School of Electrical Engineering
Hanoi, 10/2020
Abstract
Human action recognition (HAR) has many applications in robotics and medicine.
Invariance under different viewpoints is one of the most critical re-
quirements for practical deployment as it affects many aspects of the information
captured such as occlusion, posture, color, shading, motion and background. In this
thesis, a novel framework is proposed that leverages successful deep features for action represen-
tation and multi-view analysis to accomplish robust HAR under viewpoint changes.
Specifically, various deep learning techniques, from 2D CNNs to 3D CNNs, are inves-
tigated to capture spatial and temporal characteristics of actions at each individual
view. A common feature space is then constructed to keep view-invariant features
among extracted streams. This is carried out by learning a set of linear transforma-
tions that projects separated view features into a common dimension. To this end,
Multi-view Discriminant Analysis (MvDA) is adopted. However, the original MvDA
suffers from situations in which the most class-discriminative common space can-
not be found, because its objective concentrates on scattering classes away from
the global mean while being unaware of the distance between specific pairs of classes. There-
fore, we introduce a pairwise-covariance maximizing extension of MvDA, namely pc-MvDA,
that takes inter-class discriminability into account. The new model also differs
in a formulation that is more favorable for training on high-dimensional multi-view
data. Experimental results on three datasets (IXMAS, MuHAVi, MICAGes) show
the effectiveness of the proposed method.
Acknowledgements
This thesis would not have been possible without the help of many people. First
of all, I would like to express my gratitude to my primary advisor, Prof. Tran Thi
Thanh Hai, who guided me throughout this project. I would like to thank Prof. Le
Thi Lan and Prof. Vu Hai for giving me deep insight, valuable recommendations
and brilliant ideas.
I am grateful for my time spent at MICA International Research Institute, where
I learnt a lot about research and enjoyed a very warm and friendly working atmo-
sphere. In particular, I wish to extend my special thanks to PhD candidate Nguyen
Hong Quan and Dr. Doan Huong Giang, who directly supported me.
Finally, I wish to show my appreciation to all my friends and family members who
helped me finalize the project.
Table of Contents
List of Figures 1
List of Tables 3
List of Abbreviations 4
1 Introduction 5
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.2 Objective . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.3 Thesis Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2 Technical Background and Related Works 8
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.2 Technical Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.2.1 Deep Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.2.1.1 Artificial Neural Networks . . . . . . . . . . . . . . . . . . . 8
2.2.1.2 Convolutional Neural Networks . . . . . . . . . . . . . . . . . 9
2.2.1.3 Recurrent Neural Networks . . . . . . . . . . . . . . . . . . . 11
2.2.2 Dimensionality Reduction Algorithms . . . . . . . . . . . . . . . . . . 13
2.2.2.1 Linear discriminant analysis . . . . . . . . . . . . . . . . . . 14
2.2.2.2 Pairwise-covariance linear discriminant analysis . . . . . . . . . 15
2.2.3 Multi-view Analysis Algorithms . . . . . . . . . . . . . . . . . . . . . 16
2.2.3.1 Multi-view discriminant analysis . . . . . . . . . . . . . . . . 16
2.2.3.2 Multi-view discriminant analysis with view-consistency . . . . . 18
2.3 Related Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.3.1 Human action and gesture recognition . . . . . . . . . . . . . . . . . . 19
2.3.2 Multi-view analysis and learning techniques . . . . . . . . . . . . . . . 20
2.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3 Proposed Method 22
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.2 General Framework . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.3 Feature Extraction at Individual View Using Deep Learning Techniques . . . . . 23
3.3.1 2D CNN based clip-level feature extraction . . . . . . . . . . . . . . . . 23
3.3.2 3D CNN based clip-level feature extraction . . . . . . . . . . . . . . . . 26
3.4 Construction of Common Feature Space . . . . . . . . . . . . . . . . . . . . 27
3.4.1 Brief summary of Multi-view Discriminant Analysis . . . . . . . . . . . 27
3.4.2 Pairwise-covariance Multi-view Discriminant Analysis . . . . . . . . . . . 28
3.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
4 Experiments 34
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
4.2 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
4.2.1 IXMAS dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
4.2.2 MuHAVi dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
4.2.3 MICAGes dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
4.3 Evaluation Protocol . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
4.4 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
4.4.1 Programming Environment and Libraries . . . . . . . . . . . . . . . . 39
4.4.2 Configurations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
4.5 Experimental Results and Discussions . . . . . . . . . . . . . . . . . . . . . 40
4.5.1 Experimental results on IXMAS dataset . . . . . . . . . . . . . . . . . 40
4.5.2 Experimental results on MuHAVi dataset . . . . . . . . . . . . . . . . 42
4.5.3 Experimental results on MICAGes dataset . . . . . . . . . . . . . . . . 44
4.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
5 Conclusion 48
5.1 Accomplishments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
5.2 Drawbacks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
5.3 Future Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
A Appendix 50
A.1 Derivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
A.1.1 Derivation of S yW and S yB scatter matrices in MvDA . . . . . . . . . . . 50
A.1.2 Derivation of O view−consistency in MvDA-vc . . . . . . . . . . . . . . . 54
A.1.3 Derivation of S xW ab and S xB ab scatter matrices in pc-MvDA . . . . . . . . 54
Bibliography 58
List of Figures
2.1 A single LSTM cell. From [1]. . . . . . . . . . . . . . . . . . . . . . . 12
2.2 A single GRU variation cell. From [1]. . . . . . . . . . . . . . . . . . 13
2.3 Analytical solution of LDA. . . . . . . . . . . . . . . . . . . . . . . . 15
2.4 Analytical solution of MvDA. . . . . . . . . . . . . . . . . . . . . . . 18
3.1 Proposed framework for building common feature space with pairwise-
covariance multi-view discriminant analysis (pc-MvDA). . . . . . . . 24
3.2 Architecture of ResNet-50 utilized in this work for feature extraction
at each separated view. . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.3 Three pooling techniques: Average Pooling (AP), Recurrent Neural
Network (RNN) and Temporal Attention Pooling (TA). . . . . . . . . 25
3.4 Architecture of ResNet-50 3D utilized in this work for feature extraction. 27
3.5 Architecture of C3D utilized in this work for feature extraction. . . . 27
3.6 a) MvDA does not optimize the distance between paired classes in
common space. b) pc-MvDA takes pairwise distances into account to
better distinguish the classes. . . . . . . . . . . . . . . . . . . . . . . 30
3.7 A synthetic dataset of 180 data points, evenly distributed to 3 classes
among 3 different views; a) 2-D original distribution; b) 1-D projec-
tion of MvDA; c) 1-D projection of pc-MvDA. . . . . . . . . . . . . . 31
3.8 A synthetic dataset of 300 data points, evenly distributed to 5 classes
among 3 different views; a) 3-D original distribution; b) 2-D projec-
tion of MvDA; c) 2-D projection of pc-MvDA. . . . . . . . . . . . . . 31
4.1 Illustration of frames extracted from action check watch observed
from five camera viewpoints. . . . . . . . . . . . . . . . . . . . . . . . 34
4.2 Environment setup to collect action sequences from 8 views [2]. . . . 35
4.3 Illustration of frames extracted from an action punch observed from
Camera 1 to Camera 7. . . . . . . . . . . . . . . . . . . . . . . . . . . 35
4.4 Environment setup to capture MICAGes dataset. . . . . . . . . . . . 36
4.5 Illustration of a gesture belonging to the 6th class observed from 5
different views. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
4.6 Two evaluation protocols used in experiments. . . . . . . . . . . . . . 37
4.7 Comparison of accuracy on each action class using different deep fea-
tures combined with pc-MvDA on IXMAS dataset. . . . . . . . . . . 42
4.8 Comparison of accuracy on each action class using different deep fea-
tures combined with pc-MvDA. . . . . . . . . . . . . . . . . . . . . . 43
4.9 Comparison of accuracy on each action class using different deep fea-
tures combined with pc-MvDA on MICAGes dataset. . . . . . . . . . 46
4.10 First column: private feature spaces stacked and embedded together
in a same coordinate system; Second column: MvDA common space;
Third column: pc-MvDA common space. . . . . . . . . . . . . . . . . 47
List of Tables
3.1 Comparison of computational complexity of different notations of
Fisher criteria described in [3]. . . . . . . . . . . . . . . . . . . . . . . 29
4.1 Cross-view recognition comparison on IXMAS dataset. . . . . . . . . 40
4.2 Cross-view recognition results of different features on IXMAS dataset
with pc-MvDA method. The result in the bracket are accuracies of
using features C3D, ResNet-50 3D, ResNet-50 RNN, ResNet-50 TA,
Restnet-50 AP respectively. Each row corresponds to training view
(from view C0 to view C3). Each column corresponds to testing view
(from view C0 to view C3). . . . . . . . . . . . . . . . . . . . . . . . 41
4.3 Multi-view recognition comparison on IXMAS dataset. . . . . . . . . 41
4.4 Comparison of proposed methods with SOTA methods on IXMAS
dataset according to the second evaluation protocol. . . . . . . . . . . 42
4.5 Cross-view recognition comparison on MuHAVi dataset. . . . . . . . . 42
4.6 Cross-view recognition results of different features on MuHAVi dataset
with pc-MvDA method. The result in the bracket are accuracies of
using features C3D, ResNet-50 3D, ResNet-50 RNN, ResNet-50 TA,
ResNet-50 AP respectively. Each row corresponds to training view
(from view C1 to view C7). Each column corresponds to testing view
(from view C1 to view C7). . . . . . . . . . . . . . . . . . . . . . . . 43
4.7 Multi-view recognition comparison on MuHAVi dataset. . . . . . . . . 43
4.8 Comparison of the proposed methods with SOTA methods on MuHAVi
dataset according to the second evaluation protocol. . . . . . . . . . . 44
4.9 Cross-view recognition comparison on MICAGes dataset. . . . . . . . 44
4.10 Cross-view recognition results of different features on MICAGes dataset
with pc-MvDA method. The result in the bracket are accuracies of
using features C3D, ResNet-50 3D, ResNet-50 RNN, ResNet-50 TA,
RestNet-50 AP respectively. Each row corresponds to training view
(from view K1 to view K5). Each column corresponds to testing view
(from view K1 to view K5). . . . . . . . . . . . . . . . . . . . . . . . 45
4.11 Multi-view recognition comparison on MICAGes dataset. . . . . . . . 45
List of Abbreviations
ANN Artificial Neural Network
AP Average Pooling
CCA Canonical Correlation Analysis
CNN Convolutional Neural Network
DNN Deep Neural Network
GRU Gated Recurrent Unit
HAR Human Action Recognition
HoG Histogram of Oriented Gradients
iDT improved Dense Trajectories
KCCA Kernel Canonical Correlation Analysis
kNN k-Nearest Neighbor
LDA Linear Discriminant Analysis
LSTM Long Short-Term Memory
MICA Multimedia, Information, Communication & Applications Interna-
tional Research Institute
MLP Multilayer Perceptron
MST-AOG Multi-view Spatio-Temporal AND-OR Graph
MvA Multi-view Analysis
MvCCA Multi-view Canonical Correlation Analysis
MvCCDA Multi-view Common Component Discriminant Analysis
MvDA Multi-view Discriminant Analysis
MvDA-vc Multi-view Discriminant Analysis with View-Consistency
MvFDA Multi-view Fisher Discriminant Analysis
MvMDA Multi-view Modular Discriminant Analysis
MvML-LA Multi-view Manifold Learning with Locality Alignment
MvPLS Multi-view Partial Least Square
pc-LDA Pairwise-Covariance Linear Discriminant Analysis
pc-MvDA Pairwise-Covariance Multi-view Discriminant Analysis
RNN Recurrent Neural Network
SOTA State Of The Art
SSM Self Similarity Matrix
TA Temporal Attention
1 Introduction
1.1 Motivation
Human action and gesture recognition aims at recognizing an action from a given
video clip. This is an attractive research topic, which has been extensively stud-
ied over the years due to its broad range of applications from video surveillance
to human machine interaction [4, 5]. Within this scope, a very important requirement
is independence from viewpoint. However, different viewpoints result in variations in hu-
man pose, background, camera motion, lighting conditions and occlusion. Conse-
quently, recognition performance could be dramatically degraded under viewpoint
changes.
To overcome this problem, a number of methods have been proposed. View-independent
recognition methods such as [6, 7, 8, 9] generally require a careful multi-view camera
setup for robust joint estimation. The view-invariance approach [10, 11] is usually limited
by the inherent structure of the view-invariant features. Recently, knowledge transfer
techniques have been widely deployed for cross-view action recognition, for instance a bipartite
graph that bridges the semantic gap across view-dependent vocabularies [12], or a
multi-view spatio-temporal AND-OR graph (MST-AOG) for cross-view action recognition [13].
To obtain more discriminative and informative features, view-private features and shared
features are both incorporated in such frameworks to learn a common latent space [14, 15].

Existing works on human action and gesture recognition from common viewpoints have
explored different deep learning techniques and achieved impressive accuracy. However,
in most of the aforementioned multi-view action recognition techniques, the features
extracted from each view are usually hand-crafted (e.g. improved dense trajectories) [16, 15, 14].
Deep learning techniques, if used, only handle knowledge transfer among viewpoints.
The deployment of deep features in such frameworks for the cross-view
scenario is still under active investigation.
In parallel with knowledge transfer techniques, building a common space from
different views has been addressed in many other works using multi-view discrimi-
nant analysis techniques. This line of work was initiated by Canonical Correlation
Analysis (CCA), which finds a pair of linear transformations, one for each of two
views [17]. Various improvements of CCA have been made to take non-linear trans-
formation into account (kernel CCA) [18]. Henceforth, different improvements have
been introduced such as MULDA [19], MvDA [20], MvCCA, MvPLS and MvMDA
[21], MvCCDA [22]. All of these techniques try to build a common space from
different views by maximizing the cross-covariance between views. However, most
of these works have only been experimented with static images; none of them has been
explored with videos. In particular, their application to human action
recognition is still being actively investigated.
1.2 Objective
Motivated by the two aforementioned open problems, a unified framework for cross-view
action recognition is proposed in the research project at MICA institute,
which consists of two main components: individual view feature extrac-
tion and latent common space construction. The work in this thesis is part of the
work carried out by the research team.
For feature extraction from each individual view, a range of deep neural networks are
investigated, from 2D CNNs with different pooling strategies (average pooling, tem-
poral attention pooling or using LSTM) to 3D CNNs with two most recent variations
(C3D [23] and ResNet-50 3D [24]). These networks have been successfully deployed
for human action and gesture recognition in general, but not investigated yet for
cross-view recognition scenarios.
The objective of this thesis focuses on the second stage of the proposed general
framework. For building a latent common space, we are inspired by the idea of multi-
view discriminant analysis (MvDA). This technique has been shown to be effective
for image-based tasks, but has not yet been deployed for video-based tasks, particularly
with deep features extracted from videos as input. In addition, MvDA's objective has no
explicit constraint to push class centers away from each other. To this end, based on
the idea of the pairwise-covariance linear discriminant analysis (pc-LDA) dimensionality
reduction algorithm, we extend the original MvDA by introducing a pairwise-covariance
constraint that makes the classes more separated, while modifying the optimization model
so that, in theory, the whole framework could be trained end-to-end. The new
optimization objective is also more efficient than the original conception of pc-LDA
in terms of computational complexity.
The main contributions of this thesis are summarized as follows:
• Firstly, investigating various recent deep neural networks for feature extrac-
tion.
• Secondly, proposing an extension of MvDA (so-called pc-MvDA) which aims
to improve the recognition results.
• Finally, incorporating DNN and MvA in a unified framework and evaluating
it on three datasets (IXMAS, MuHAVi, MICAGes).
Where the thesis is based on work done jointly with others, my own contribution
focuses on the second and third objectives as primary contributor,
while the training of the DNNs for feature extraction was largely carried out with the help
of a co-researcher.
1.3 Thesis Outline
The thesis is structured into 5 chapters:
1 Introduction This chapter. Motivates the work and describes the research goals.
2 Background and Related Works Describes the deep learning based approach
for feature extraction, dimensionality reduction and multi-view learning al-
gorithms. Also briefly reviews the existing approaches on human action recog-
nition in single and cross-view scenarios and multi-view analysis techniques.
3 Proposed Method Introduces the general architecture and proposes the technical
contribution for solving the mentioned research objective.
4 Experiments Reports information regarding experiments: datasets, evaluation
protocol, technical setup, results and discussions.
5 Conclusion Summarizes the work, points out the contributions, drawbacks and
suggests future research directions.
2 Technical Background and Related
Works
2.1 Introduction
This chapter provides the basic knowledge as well as related works regarding the
research topic of this thesis. Section 2.2 briefly introduces the general architecture
of deep neural networks for deep feature extraction; then describes in detail some
dimensionality reduction algorithms and multi-view analysis algorithms. Section 2.3
summarizes approaches introduced in existing works to tackle problems in human
action and gesture recognition and in multi-view analysis.
2.2 Technical Background
2.2.1 Deep Neural Networks
2.2.1.1 Artificial Neural Networks
Artificial neural networks (ANNs), also known as multi-layer perceptrons (MLPs)
or feed-forward neural networks, are inspired by “real” neural networks, i.e. the
human brain and nervous system. They consist of neurons grouped in multiple
connected layers, each of which is subsequently transformed by an activation function.
Linear Layer The linear layer, also known as the fully-connected layer, basically
consists of several perceptron units. Mathematically, it is simply a linear
function of the input features with two learnable parameters: the weight W = {ω1 , ω2 , ..., ωd } as
multiplicative coefficients and the bias b as an additive term, simulating a biological
group of d perceptrons:
y =W ·x+b (2.1)
Activation Layer The non-linear activation layers are responsible for creating com-
plex smooth mappings between the input and the output. They are element-wise
operators that squash the value of each element within the boundaries of the
specified function. Some common activation functions are:
• Sigmoid: sigmoid(x) = \frac{1}{1 + e^{-x}}
• Hyperbolic tangent: tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}
• Rectifier: relu(x) = \max(0, x)
• Swish: swish(x) = x \cdot sigmoid(x)
For the classification layer (the last linear layer), the number of neurons is equal to
the number of classes to be recognized, and a softmax operator (Equation (2.2)) is
usually applied as the activation function to obtain the probability of each class.
\sigma_i = \frac{e^{y_i}}{\sum_{n=1}^{N} e^{y_n}}    (2.2)
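To make the layer stack concrete, the following is a minimal sketch in PyTorch (an assumed library choice for illustration; the layer sizes and number of classes are arbitrary) of a small MLP whose last linear layer feeds a softmax as in Equation (2.2).

```python
import torch
import torch.nn as nn

# A minimal MLP sketch: two linear layers with a non-linear activation in
# between, ending in a classification layer whose outputs are turned into
# class probabilities by softmax. Dimensions are illustrative only.
mlp = nn.Sequential(
    nn.Linear(2048, 512),   # y = W x + b  (Equation 2.1)
    nn.ReLU(),              # element-wise rectifier activation
    nn.Linear(512, 12),     # one output neuron per class
)

x = torch.randn(4, 2048)              # batch of 4 feature vectors
logits = mlp(x)                       # raw scores y_i
probs = torch.softmax(logits, dim=1)  # Equation (2.2): per-class probabilities
print(probs.sum(dim=1))               # each row sums to 1
```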
Training Neural network training is usually performed via the backpropagation algo-
rithm. This algorithm is based on the calculation of a loss function L which repre-
sents the difference between the network output and the expected output. Partial
derivatives of the loss \frac{\partial L}{\partial p_i} are calculated with respect to each trainable parameter
p_i using the chain rule. Then, each parameter is adjusted accordingly:

\Delta p_i = -\eta \frac{\partial L}{\partial p_i}    (2.3)

where \eta is called the learning rate, which must be chosen carefully to ensure conver-
gence.
Loss functions are usually applied on the last layer. The most common criterion
for classification tasks is Cross Entropy Loss:
L_{CE} = -\frac{1}{N} \sum_{k=1}^{N} \log \frac{e^{W_{class(x_k)} x_k + b_{class(x_k)}}}{\sum_{i=1}^{c} e^{W_i x_k + b_i}}    (2.4)
where N is the number of samples and c is the number of classes; W and b are the parameters
of the classification layer, with W_i and b_i denoting the i-th column of the weight W and bias b
respectively; x_k denotes the deep feature of the k-th sample, which belongs to class class(x_k).
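As a hedged illustration of Equations (2.3) and (2.4), the sketch below runs one gradient-descent step with the cross-entropy criterion in PyTorch; the model, feature dimension and learning rate are placeholders, not the configuration used in this thesis.

```python
import torch
import torch.nn as nn

# One illustrative training step: cross-entropy loss (Equation 2.4) followed
# by a gradient-descent parameter update (Equation 2.3). Sizes are arbitrary.
model = nn.Linear(128, 12)                    # classification layer (W, b)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)  # eta = 0.01

features = torch.randn(8, 128)                # deep features x_k
labels = torch.randint(0, 12, (8,))           # class(x_k) for each sample

loss = criterion(model(features), labels)     # L_CE over the mini-batch
optimizer.zero_grad()
loss.backward()                               # dL/dp_i via backpropagation
optimizer.step()                              # p_i <- p_i - eta * dL/dp_i
```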
2.2.1.2 Convolutional Neural Networks
Convolutional neural networks (CNNs), inspired by biological processes in the
visual cortex of animals, have emerged as the most effective approach for image
recognition and classification tasks. They are able to extract and aggregate highly
abstract information from images and videos. As a result of huge research and
engineering efforts, the effectiveness and performance of such algorithms have con-
siderably improved, outperforming handcrafted methods for visual information em-
bedding and becoming the state-of-the-art in image and video recognition.
There are five main building blocks in the architecture of a modern CNN:
Convolution Layer The convolutional layer slides a kernel over the input
tensor and, at every position, computes the sum of element-wise multiplications
between the sliced input and learnable weight matrices to produce the output. It can
have multiple kernels so that more features can be extracted from the input tensor.
The mathematical operation performed by each 2D convolutional kernel is:
o^{l}_{i,j} = \sum_{c=0}^{C_h} \sum_{h=0}^{K_H} \sum_{w=0}^{K_W} \omega^{l}_{c,h,w} \cdot x_{c,i+h,j+w} + b^{l}    (2.5)
2D convolution is limited to spatial data and requires extra steps to handle
temporally continuous sequences of images. On the other hand, 3D convolution
can intrinsically capture and establish abstract spatio-temporal relationships
in a 3D input tensor. The mathematical operation performed by each 3D convolutional
kernel is:
o^{l}_{i,j,k} = \sum_{c=0}^{C_h} \sum_{d=0}^{K_D} \sum_{h=0}^{K_H} \sum_{w=0}^{K_W} \omega^{l}_{c,d,h,w} \cdot x_{c,i+d,j+h,k+w} + b^{l}    (2.6)
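The difference between Equations (2.5) and (2.6) can be illustrated with the following sketch (PyTorch assumed; channel counts, kernel sizes and input resolutions are arbitrary): a 2D convolution operates on a single image, while a 3D convolution consumes a whole clip and mixes information across frames.

```python
import torch
import torch.nn as nn

# Illustrative 2D vs. 3D convolutions (Equations 2.5 and 2.6).
conv2d = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)
conv3d = nn.Conv3d(in_channels=3, out_channels=16, kernel_size=3, padding=1)

frame = torch.randn(1, 3, 112, 112)       # one RGB image: N x C x H x W
clip = torch.randn(1, 3, 16, 112, 112)    # a 16-frame clip: N x C x D x H x W

print(conv2d(frame).shape)  # torch.Size([1, 16, 112, 112])     - spatial only
print(conv3d(clip).shape)   # torch.Size([1, 16, 16, 112, 112]) - spatio-temporal
```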
Batch Normalization Layer The batch normalization layer introduced in [25] is a
pervasive component in modern CNN architectures. It generally follows every
convolution layer and precedes an activation layer, and is responsible for bringing all the
pre-activated features to the same scale. The mathematical equation is as follows:

y = \frac{x - E[x]}{\sqrt{Var[x] + \epsilon}} \cdot \gamma + \beta    (2.7)

where E[x] and Var[x] stand for the mean and variance calculated per-
dimension over the input mini-batch x; \gamma and \beta are learnable parameters, and \epsilon is
a small number added to the denominator to ensure numerical stability.
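A small sketch of Equation (2.7) in PyTorch (assumed; the number of channels and the batch size are arbitrary) shows that, in training mode, the normalized features have roughly zero mean and unit variance per channel before the learnable scale and shift take effect.

```python
import torch
import torch.nn as nn

# Batch normalization sketch (Equation 2.7): per-channel standardization of a
# mini-batch followed by the learnable scale gamma and shift beta.
bn = nn.BatchNorm2d(num_features=16)   # gamma and beta are learnable
x = torch.randn(8, 16, 32, 32)         # mini-batch of feature maps

y = bn(x)                              # (x - E[x]) / sqrt(Var[x] + eps) * gamma + beta
print(y.mean(dim=(0, 2, 3)))           # per-channel means close to 0
print(y.std(dim=(0, 2, 3)))            # per-channel stds close to 1
```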
Activation Layer Similar to activation layers in ANNs, activation layers in CNNs are
element-wise operators applied to each element of the input tensor.
Pooling Layer The pooling layer is usually inserted after one or a group of convo-
lutional layers. The purpose of pooling is to progressively decrease the size of the
processed data and make sure that only the most relevant features are forwarded
to the next layers. It follows the sliding-kernel principle of convolution, but uses a
much simpler operator without learnable parameters, such as:
• Max Pooling: Select the pixel with maximum value.
• Min Pooling: Select the pixel with minimum value.
• Average Pooling: Compute the mean of the sliced input pixels.
A special class of pooling layer is called global pooling, which has a filter size
that matches the spatial shape of the input tensor, squeezing each channel to a
single scalar value. This type of pooling is generally used at the very end of a large-
scale CNN, transforming high-level features of possibly varying shapes into a
single vector of fixed length. The output feature vector can then be forwarded
to further linear layers without flattening, or used to perform classification directly.
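The following sketch (PyTorch assumed; the channel count and feature-map sizes are arbitrary) illustrates how global average pooling turns feature maps of different spatial sizes into fixed-length vectors.

```python
import torch
import torch.nn as nn

# Global average pooling sketch: whatever the spatial size of the feature map,
# each channel is squeezed to one scalar, giving a fixed-length vector.
gap = nn.AdaptiveAvgPool2d(1)

features_a = torch.randn(1, 512, 7, 7)     # feature maps of different sizes
features_b = torch.randn(1, 512, 14, 10)

vec_a = gap(features_a).flatten(1)         # shape (1, 512)
vec_b = gap(features_b).flatten(1)         # shape (1, 512)
print(vec_a.shape, vec_b.shape)
```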
Linear Layer Finally, the linear layer is the essential building block of classical
ANNs, but may be optional in CNNs, as in the case of fully convolutional neural net-
works.
2.2.1.3 Recurrent Neural Networks
A recurrent neural network (RNN) is a neural network that takes previ-
ous time steps into account. The input of RNNs is a sequentially ordered collection
of samples. Therefore, they excel in tasks in which order is important, e.g. time
series forecasting and natural language processing. Relating to the research topic of this
thesis, they can be used to model the temporal relationships of high-level representations
of frames extracted from videos.
In practice, either Long Short-Term Memory (LSTM) or Gated Recurrent Unit
(GRU) cells are used instead of the basic RNN. The main difference is that infor-
mation that is deemed important is allowed to pass on to later time steps without
too much interference from hidden dot products and activation functions.
Long Short-Term Memory The LSTM architecture, contrary to regular RNNs,
has an additional hidden state that is never directly outputted (see Figure 2.1). This
additional hidden state can then be used by the network solely for remembering
previous relevant information. Instead of having to share its “memory” with its
output, these values are now separate. During the training process, an LSTM learns
what should be remembered for the future and what should be forgotten, which is
achieved by using its internal weights.
Figure 2.1: A single LSTM cell. From [1].

As can be seen in Figure 2.1, there are quite a few more parameters in this
cell than in a normal RNN cell. The calculation of the output vector and the hidden
vector involves several operations. First of all, the network determines how much
of the hidden state to forget, via the so-called forget gate. This is done by combining
the previous iteration's cell state (ct−1 ) and the forget gate vector (ft ) through
element-wise multiplication, allowing the network to forget values at specific indices in the
previous iteration's cell state vector. ft can be obtained by using the formula in Equation
(2.8), where W contains the weights for the input and U contains the weights for
the previous iteration’s output vector, xt refers to the input, ht−1 to the previous
iteration’s output vector and b is bias:
ft = σ(Wf xt + Uf ht−1 + bf ) (2.8)
The network then determines what to remember from the input vector. This,
commonly referred to as the input gate, is done by pushing the previous forget
gate’s output as well as the input gate through a matrix addition. The output of
the input gate (it ) can be found by using the following formula:
it = σ(Wi xt + Ui ht−1 + bi ) (2.9)
The final hidden state vector (ct ) can then be found by using the previous two
results as follows:
ct = ft ◦ ct−1 + it ◦ σ(Wc xt + Uc ht−1 + bc ) (2.10)
where ◦ denotes the Hadamard product (where each value at index ij is the
product of the values at the indices ij in the two input matrices). This vector is
then passed on to the next iteration. Now the output gate vector ot and the output
state ht can be obtained:
ot = σ(Wo xt + Uo ht−1 + bo ) (2.11)
ht = ot ◦ σ(ct ) (2.12)
This results in a version of an RNN that is able to remember more and is more
liberal in choosing what information it wants to keep in the hidden state and what
it wants to discard. This makes LSTM networks better suited for tasks involving
sequential data and has made them the predominant RNN architecture.
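As a sketch of how an LSTM can aggregate per-frame deep features into a single clip-level representation (the general idea behind the RNN-based pooling used later in this thesis; the feature and hidden sizes here are assumptions, not the thesis configuration):

```python
import torch
import torch.nn as nn

# Sketch: an LSTM consumes a sequence of per-frame features and its final
# hidden state serves as the clip-level feature. Sizes are illustrative.
lstm = nn.LSTM(input_size=2048, hidden_size=512, batch_first=True)

frame_features = torch.randn(4, 16, 2048)   # batch of 4 clips, 16 frames each
outputs, (h_n, c_n) = lstm(frame_features)  # h_n: last hidden state, c_n: last cell state

clip_feature = h_n[-1]                      # shape (4, 512), one vector per clip
print(clip_feature.shape)
```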
Gated Recurrent Units Another RNN architecture is the GRU, introduced in
[26]. This architecture combines the input and forget gates into a single so-called
“update gate” and also merges the cell state and hidden state (see Figure 2.2). The
calculation of the merged output vector once again consists of several operations.
Figure 2.2: A single GRU variation cell. From [1].

The network first computes the “reset gate” rt using the following function, where
Wr are the weights for the reset gate and [ht−1 , xt ] signifies the concatenation of ht−1
and xt :

rt = σ (Wr [ht−1 , xt ]) (2.13)
After this, the “update gate” zt is computed as follows, where Wz holds the weights
of the update gate:
zt = σ (Wz [ht−1 , xt ]) (2.14)
The output vector ht (representing both the cell’s output and its state) can then
be computed by the following formula:
ht = (1 − zt ) ∗ ht−1 + zt ∗ h̃t (2.15)
where h̃t = tanh(W ∗ [rt ∗ ht−1 , xt ]).
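A corresponding sketch of a single GRU step (using PyTorch's GRUCell, with illustrative sizes) mirrors Equations (2.13)-(2.15): the reset and update gates are computed internally and only one hidden state is carried over.

```python
import torch
import torch.nn as nn

# A GRUCell step: it keeps a single hidden state h_t that serves as both
# output and memory. Sizes are illustrative.
gru_cell = nn.GRUCell(input_size=2048, hidden_size=512)

x_t = torch.randn(4, 2048)        # input at the current time step
h_prev = torch.zeros(4, 512)      # previous hidden state h_{t-1}

h_t = gru_cell(x_t, h_prev)       # reset/update gates applied internally
print(h_t.shape)                  # torch.Size([4, 512])
```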
2.2.2 Dimensionality Reduction Algorithms
Dimensionality reduction techniques are important in many applications related to
machine learning. They aim to find a low-dimensional embedding that preserves sufficient
information from the original dimension. Let us define X = \{x_{ik} \mid i = 1, ..., c;\ k = 1, ..., n_i\}
as the training samples, where x_{ik} \in R^{d_x} is the k-th sample of the
i-th class, d_x is the original dimension of the data, and d_y is the desired dimension of the data
after transformation, such that d_y < d_x.
2.2.2.1 Linear discriminant analysis
The linear discriminant analysis (LDA) technique is developed to linearly transform
the features into a lower dimensional space where the ratio of the between-class vari-
ance to the within-class variance is maximized, thereby guaranteeing optimal
class separability. The projection of X onto the lower dimensional space is de-
noted by Y = \{y_{ik} = \omega^T x_{ik} \mid i = 1, ..., c;\ k = 1, ..., n_i\}. S^y_B and S^y_W are computed
as follows:
S^y_W = \sum_{i=1}^{c} \sum_{k=1}^{n_i} (y_{ik} - \mu_i)(y_{ik} - \mu_i)^T    (2.16)

S^y_B = \sum_{i=1}^{c} n_i (\mu_i - \mu)(\mu_i - \mu)^T    (2.17)

where \mu_i = \frac{1}{n_i} \sum_{k=1}^{n_i} y_{ik} is the mean of all samples of the i-th class in the lower
dimensional space; \mu = \frac{1}{n} \sum_{i=1}^{c} \sum_{k=1}^{n_i} y_{ik} is the mean of all samples of all classes;
n = \sum_{i=1}^{c} n_i is the total number of data samples. The between-class S^y_B and within-class
S^y_W covariance matrices in the new dimension are not known yet, but can be formulated
as the linearly transformed versions of their counterparts S^x_B and S^x_W in the original
dimension.
S^y_W = \omega^T S^x_W \omega    (2.18)

S^y_B = \omega^T S^x_B \omega    (2.19)
S^x_B and S^x_W are computed directly from the original data as follows:

S^x_W = \sum_{i=1}^{c} \sum_{k=1}^{n_i} (x_{ik} - \mu^{(x)}_i)(x_{ik} - \mu^{(x)}_i)^T    (2.20)

S^x_B = \sum_{i=1}^{c} n_i (\mu^{(x)}_i - \mu^{(x)})(\mu^{(x)}_i - \mu^{(x)})^T    (2.21)
Then the objective function is formulated as a Rayleigh quotient:

\omega^* = \underset{\omega}{\mathrm{argmax}} \frac{trace(S^y_B)}{trace(S^y_W)} = \underset{\omega}{\mathrm{argmax}} \frac{trace(\omega^T S^x_B \omega)}{trace(\omega^T S^x_W \omega)}    (2.22)

The Fisher criterion in Equation (2.22) can be reformulated as the generalized eigenvalue problem:

S^x_B \omega = \lambda S^x_W \omega    (2.23)

where \lambda represents the eigenvalues of the transformation matrix \omega. The analytical
solution of \omega^* is a (d_x \times d_y) matrix obtained by computing the eigenvectors
V = \{v_1, v_2, ..., v_{d_x}\} of the matrix S = (S^x_W)^{-1} S^x_B, sorting them by their corresponding
eigenvalues \lambda = \{\lambda_1, \lambda_2, ..., \lambda_{d_x}\}, and keeping the d_y eigenvectors with the largest
eigenvalues, as illustrated in Figure 2.3.
Figure 2.3: Analytical solution of LDA.
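The analytical solution can be sketched in a few lines of NumPy on synthetic data (the class means, dimensions and random seed are arbitrary choices for illustration), following Equations (2.20)-(2.23):

```python
import numpy as np

# Minimal LDA sketch: build the within- and between-class scatter matrices,
# then take the top eigenvectors of S_W^{-1} S_B as the projection.
rng = np.random.default_rng(0)
d_x, d_y, n_per_class = 5, 2, 40
X = {0: rng.normal(0.0, 1.0, (n_per_class, d_x)),
     1: rng.normal(3.0, 1.0, (n_per_class, d_x)),
     2: rng.normal(-3.0, 1.0, (n_per_class, d_x))}

mu = np.mean(np.vstack(list(X.values())), axis=0)           # global mean
S_W = np.zeros((d_x, d_x))
S_B = np.zeros((d_x, d_x))
for Xi in X.values():
    mu_i = Xi.mean(axis=0)
    S_W += (Xi - mu_i).T @ (Xi - mu_i)                      # Equation (2.20)
    S_B += len(Xi) * np.outer(mu_i - mu, mu_i - mu)         # Equation (2.21)

eigvals, eigvecs = np.linalg.eig(np.linalg.inv(S_W) @ S_B)  # S = S_W^{-1} S_B
order = np.argsort(eigvals.real)[::-1]
w = eigvecs.real[:, order[:d_y]]                            # d_x x d_y projection
Y = np.vstack(list(X.values())) @ w                         # projected samples
print(w.shape, Y.shape)
```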
2.2.2.2 Pairwise-covariance linear discriminant analysis
Pairwise-covariance linear discriminant analysis (pc-LDA) is an extension of LDA,
introduced in [27], that overcomes its drawbacks by formulating pairwise distances
between pairs of classes. A pair of classes a and b is regarded as two Gaus-
sian distributions N_a(\mu_a, S^y_{W_a}) and N_b(\mu_b, S^y_{W_b}), and the objective distance between the two
classes is defined as their Kullback-Leibler divergence [28]:
D_{KL}(N_a \parallel N_b) = \frac{1}{2} (\mu_a - \mu_b)^T (S^y_{W_{ab}})^{-1} (\mu_a - \mu_b)    (2.24)
where S^y_{W_{ab}} is the pairwise covariance matrix (Equation (2.26)), calculated as the
\beta-parameterized convex combination of the global within-class scatter matrix S^y_W used in LDA
and the within-class scatter matrices of the two classes S^y_{W_i} (Equation (2.25)). The authors
theorize that it better represents the data distribution within the two classes.
S^y_{W_i} = \sum_{j=1}^{v} \sum_{k=1}^{n_{ij}} (y_{ik} - \mu_i)(y_{ik} - \mu_i)^T    (2.25)
S^y_{W_{ab}} = \beta \frac{n_a S^y_{W_a} + n_b S^y_{W_b}}{n_a + n_b} + (1 - \beta) S^y_W    (2.26)
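The pairwise-covariance idea can be sketched numerically as follows (NumPy, synthetic two-class data; with only two classes the global within-class scatter reduces to the sum of the two per-class scatters, which is an assumption of this toy example), combining Equations (2.26) and (2.24):

```python
import numpy as np

# Sketch of the pairwise-covariance distance used by pc-LDA: blend the two
# class scatters with the global within-class scatter (Equation 2.26), then
# evaluate the divergence of Equation (2.24). Data is synthetic.
rng = np.random.default_rng(1)
Ya = rng.normal(0.0, 1.0, (30, 2))          # projected samples of class a
Yb = rng.normal(2.0, 1.5, (40, 2))          # projected samples of class b
beta = 0.5                                   # convex-combination weight

mu_a, mu_b = Ya.mean(axis=0), Yb.mean(axis=0)
S_a = (Ya - mu_a).T @ (Ya - mu_a)            # per-class scatter S_W_a
S_b = (Yb - mu_b).T @ (Yb - mu_b)            # per-class scatter S_W_b
S_W = S_a + S_b                              # global within-class scatter (two classes here)

S_ab = beta * (len(Ya) * S_a + len(Yb) * S_b) / (len(Ya) + len(Yb)) \
       + (1.0 - beta) * S_W                  # Equation (2.26)

diff = mu_a - mu_b
d_kl = 0.5 * diff @ np.linalg.inv(S_ab) @ diff   # Equation (2.24)
print(d_kl)
```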
Master Thesis
Human Action Recognition using Deep
Learning and Multi-view Discriminant
Analysis
TRAN HOANG NHAT
[email protected]
Control Engineering and Automation
Advisor: Assoc. Prof. Dr. Tran Thi Thanh Hai
Faculty: School of Electrical Engineering
Hanoi, 10/2020
Abstract
Human action recognition (HAR) has many implications in robotic and medical
applications. Invariance under different viewpoints is one of the most critical re-
quirements for practical deployment as it affects many aspects of the information
captured such as occlusion, posture, color, shading, motion and background. In this
thesis, a novel framework that leverages successful deep features for action represen-
tation and multi-view analysis to accomplish robust HAR under viewpoint changes.
Specifically, various deep learning techniques, from 2D CNNs to 3D CNNs are inves-
tigated to capture spatial and temporal characteristics of actions at each individual
view. A common feature space is then constructed to keep view invariant features
among extracted streams. This is carried out by learning a set of linear transforma-
tions that projects separated view features into a common dimension. To this end,
Multi-view Discriminant Analysis (MvDA) is adopted. However, the original MvDA
suffers from odd situations in which the most class-discrepant common space could
not be found because its objective is overly concentrated on scattering classes from
the global mean but unaware of distance between specific pairs of classes. There-
fore, we introduce a pairwise-covariance maximizing extension of MvDA that takes
extra-class discriminance into account, namely pc-MvDA. The novel model also dif-
fers in the way that is more favorable for training of high-dimensional multi-view
data. Experimental results on three datasets (IXMAS, MuHAVI, MICAGes) show
the effectiveness of the proposed method.
Acknowledgements Master Thesis
Acknowledgements
This thesis would not have been possible without the help of many people. First
of all, I would like to express my gratitude to my primary advisor, Prof. Tran Thi
Thanh Hai, who guided me throughout this project. I would like to thank Prof. Le
Thi Lan and Prof. Vu Hai for giving me deep insight, valuable recommendations
and brilliant idea.
I am grateful for my time spent at MICA International Research Institute, where
I learnt a lot about research and enjoyed a very warm and friendly working atmo-
sphere. In particular, I wish to extend my special thanks to PhD candidate. Nguyen
Hong Quan and Dr. Doan Huong Giang who directly supported me.
Finally, I wish to show my appreciation to all my friends and family members who
helped me finalizing the project.
Tran Hoang Nhat - CBC19005 ii
Table of Contents Master Thesis
Table of Contents
List of Figures 1
List of Tables 3
List of Abbreviations 4
1 Introduction 5
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.2 Objective . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.3 Thesis Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2 Technical Background and Related Works 8
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.2 Technical Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.2.1 Deep Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.2.1.1 Artificial Neural Networks . . . . . . . . . . . . . . . . . . . 8
2.2.1.2 Convolutional Neural Networks . . . . . . . . . . . . . . . . . 9
2.2.1.3 Recurrent Neural Networks . . . . . . . . . . . . . . . . . . . 11
2.2.2 Dimensionality Reduction Algorithms . . . . . . . . . . . . . . . . . . 13
2.2.2.1 Linear discriminant analysis . . . . . . . . . . . . . . . . . . 14
2.2.2.2 Pairwise-covariance linear discriminant analysis . . . . . . . . . 15
2.2.3 Multi-view Analysis Algorithms . . . . . . . . . . . . . . . . . . . . . 16
2.2.3.1 Multi-view discriminant analysis . . . . . . . . . . . . . . . . 16
2.2.3.2 Multi-view discriminant analysis with view-consistency . . . . . 18
2.3 Related Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.3.1 Human action and gesture recognition . . . . . . . . . . . . . . . . . . 19
2.3.2 Multi-view analysis and learning techniques . . . . . . . . . . . . . . . 20
2.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3 Proposed Method 22
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.2 General Framework . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.3 Feature Extraction at Individual View Using Deep Learning Techniques . . . . . 23
3.3.1 2D CNN based clip-level feature extraction . . . . . . . . . . . . . . . . 23
3.3.2 3D CNN based clip-level feature extraction . . . . . . . . . . . . . . . . 26
3.4 Construction of Common Feature Space . . . . . . . . . . . . . . . . . . . . 27
3.4.1 Brief summary of Multi-view Discriminant Analysis . . . . . . . . . . . 27
3.4.2 Pairwise-covariance Multi-view Discriminant Analysis . . . . . . . . . . . 28
3.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
4 Experiments 34
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
Tran Hoang Nhat - CBC19005 iii
Table of Contents Master Thesis
4.2 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
4.2.1 IXMAS dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
4.2.2 MuHAVi dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
4.2.3 MICAGes dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
4.3 Evaluation Protocol . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
4.4 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
4.4.1 Programming Environment and Libraries . . . . . . . . . . . . . . . . 39
4.4.2 Configurations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
4.5 Experimental Results and Discussions . . . . . . . . . . . . . . . . . . . . . 40
4.5.1 Experimental results on IXMAS dataset . . . . . . . . . . . . . . . . . 40
4.5.2 Experimental results on MuHAVi dataset . . . . . . . . . . . . . . . . 42
4.5.3 Experimental results on MICAGes dataset . . . . . . . . . . . . . . . . 44
4.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
5 Conclusion 48
5.1 Accomplishments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
5.2 Drawbacks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
5.3 Future Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
A Appendix 50
A.1 Derivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
A.1.1 Derivation of S yW and S yB scatter matrices in MvDA . . . . . . . . . . . 50
A.1.2 Derivation of O view−consistency in MvDA-vc . . . . . . . . . . . . . . . 54
A.1.3 Derivation of S xW ab and S xB ab scatter matrices in pc-MvDA . . . . . . . . 54
Bibliography 58
Tran Hoang Nhat - CBC19005 iv
List of Figures Master Thesis
List of Figures
2.1 A single LSTM cell. From [1]. . . . . . . . . . . . . . . . . . . . . . . 12
2.2 A single GRU variation cell. From [1]. . . . . . . . . . . . . . . . . . 13
2.3 Analytical solution of LDA. . . . . . . . . . . . . . . . . . . . . . . . 15
2.4 Analytical solution of MvDA. . . . . . . . . . . . . . . . . . . . . . . 18
3.1 Proposed framework for building common feature space with pairwise-
covariance multi-view discriminant analysis (pc-MvDA). . . . . . . . 24
3.2 Architecture of ResNet-50 utilized in this work for feature extraction
at each separated view. . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.3 Three pooling techniques: Average Pooling (AP), Recurrent Neural
Network (RNN) and Temporal Attention Pooling (TA). . . . . . . . . 25
3.4 Architecture of ResNet-50 3D utilized in this work for feature extraction. 27
3.5 Architecture of C3D utilized in this work for feature extraction. . . . 27
3.6 a) MvDA does not optimize the distance between paired classes in
common space. b) pc-MvDA takes pairwise distances into account to
better distinguish the classes. . . . . . . . . . . . . . . . . . . . . . . 30
3.7 A synthetic dataset of 180 data points, evenly distributed to 3 classes
among 3 different views; a) 2-D original distribution; b) 1-D projec-
tion of MvDA; c) 1-D projection of pc-MvDA. . . . . . . . . . . . . . 31
3.8 A synthetic dataset of 300 data points, evenly distributed to 5 classes
among 3 different views; a) 3-D original distribution; b) 2-D projec-
tion of MvDA; c) 2-D projection of pc-MvDA. . . . . . . . . . . . . . 31
4.1 Illustration of frames extracted from action check watch observed
from five camera viewpoints. . . . . . . . . . . . . . . . . . . . . . . . 34
4.2 Environment setup to collect action sequences from 8 views [2]. . . . 35
4.3 Illustration of frames extracted from an action punch observed from
Camera 1 to Camera 7. . . . . . . . . . . . . . . . . . . . . . . . . . . 35
4.4 Environment setup to capture MICAGes dataset. . . . . . . . . . . . 36
4.5 Illustration of a gesture belonging to the 6th class observed from 5
different views. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
4.6 Two evaluation protocols used in experiments. . . . . . . . . . . . . . 37
Tran Hoang Nhat - CBC19005 1
List of Figures Master Thesis
4.7 Comparison of accuracy on each action class using different deep fea-
tures combined with pc-MvDA on IXMAS dataset. . . . . . . . . . . 42
4.8 Comparison of accuracy on each action class using different deep fea-
tures combined with pc-MvDA. . . . . . . . . . . . . . . . . . . . . . 43
4.9 Comparison of accuracy on each action class using different deep fea-
tures combined with pc-MvDA on MICAGes dataset. . . . . . . . . . 46
4.10 First column: private feature spaces stacked and embedded together
in a same coordinate system; Second column: MvDA common space;
Third column: pc-MvDA common space. . . . . . . . . . . . . . . . . 47
Tran Hoang Nhat - CBC19005 2
List of Tables Master Thesis
List of Tables
3.1 Comparison of computational complexity of different notations of
Fisher criteria described in [3]. . . . . . . . . . . . . . . . . . . . . . . 29
4.1 Cross-view recognition comparison on IXMAS dataset. . . . . . . . . 40
4.2 Cross-view recognition results of different features on IXMAS dataset
with pc-MvDA method. The result in the bracket are accuracies of
using features C3D, ResNet-50 3D, ResNet-50 RNN, ResNet-50 TA,
Restnet-50 AP respectively. Each row corresponds to training view
(from view C0 to view C3). Each column corresponds to testing view
(from view C0 to view C3). . . . . . . . . . . . . . . . . . . . . . . . 41
4.3 Multi-view recognition comparison on IXMAS dataset. . . . . . . . . 41
4.4 Comparison of proposed methods with SOTA methods on IXMAS
dataset according to the second evaluation protocol. . . . . . . . . . . 42
4.5 Cross-view recognition comparison on MuHAVi dataset. . . . . . . . . 42
4.6 Cross-view recognition results of different features on MuHAVi dataset
with pc-MvDA method. The result in the bracket are accuracies of
using features C3D, ResNet-50 3D, ResNet-50 RNN, ResNet-50 TA,
ResNet-50 AP respectively. Each row corresponds to training view
(from view C1 to view C7). Each column corresponds to testing view
(from view C1 to view C7). . . . . . . . . . . . . . . . . . . . . . . . 43
4.7 Multi-view recognition comparison on MuHAVi dataset. . . . . . . . . 43
4.8 Comparison of the proposed methods with SOTA methods on MuHAVi
dataset according to the second evaluation protocol. . . . . . . . . . . 44
4.9 Cross-view recognition comparison on MICAGes dataset. . . . . . . . 44
4.10 Cross-view recognition results of different features on MICAGes dataset
with pc-MvDA method. The result in the bracket are accuracies of
using features C3D, ResNet-50 3D, ResNet-50 RNN, ResNet-50 TA,
RestNet-50 AP respectively. Each row corresponds to training view
(from view K1 to view K5). Each column corresponds to testing view
(from view K1 to view K5). . . . . . . . . . . . . . . . . . . . . . . . 45
4.11 Multi-view recognition comparison on MICAGes dataset. . . . . . . . 45
Tran Hoang Nhat - CBC19005 3
List of Abbreviations Master Thesis
List of Abbreviations
ANN Artificial Neural Network
AP Average Pooling
CCA Canonical Correlation Analysis
CNN Convolutional Neural Network
DNN Deep Neural Network
GRU Gated Recurrent Unit
HAR Human Action Recognition
HoG Histogram of oriented Gradient
iDT improved Dense Trajectories
KCCA Kernel Canonical Correlation Analysis
kNN k-Neareast Neighbor
LDA Linear Discriminant Analysis
LSTM Long Short-Term Memory
MICA Multimedia, Information, Communication & Applications Interna-
tional Research Institute
MLP Multilayer Perceptron
MST-AOG Multi-view Spatio-Temporal AND-OR Graph
MvA Multi-view Analysis
MvCCA Multi-view Canonical Correlation Analysis
MvCCDA Multi-view Common Component Discriminant Analysis
MvDA Multi-view Discriminant Analysis
MvDA-vc Multi-view Discriminant Analysis with View-Consistency
MvFDA Multi-view Fisher Discriminant Analysis
MvMDA Multi-view Modular Discriminant Analysis
MvML-LA Multi-view Manifold Learning with Locality Alignment
MvPLS Multi-view Partial Least Square
pc-LDA Pairwise-Covariance Linear Discriminant Analysis
pc-MvDA Pairwise-Covariance Multi-view Discriminant Analysis
RNN Recurrent Neural Network
SOTA State Of The Art
SSM Self Similarity Matrix
TA Temporal Attention
Tran Hoang Nhat - CBC19005 4
1 Introduction Master Thesis
1 Introduction
1.1 Motivation
Human action and gesture recognition aims at recognizing an action from a given
video clip. This is an attractive research topic, which has been extensively stud-
ied over the years due to its broad range of applications from video surveillance
to human machine interaction [4, 5]. Within this scope, a very important demand
is independence to viewpoint. However, different viewpoints result in various hu-
man pose, background, camera motions, lighting conditions and occlusions. Conse-
quently, recognition performance could be dramatically degraded under viewpoint
changes.
To overcome this problem, a number of methods have been proposed. View inde-
pendence recognition such as [6, 7, 8, 9] generally require a careful multi-view camera
setting for robust joint estimation. View invariance approach [10, 11] is usually lim-
ited by inherent structure of view-invariant features. Recently, knowledge transfer
technique is widely deployed for cross-view action recognition, for instance bipar-
tite graph that bridge the semantic gap across view dependent vocabularies [12],
AND-OR graph (MST-AOG) for cross-view action recognition [13]. To increase dis-
criminant and informative features, view private features and shared features are
both incorporated in such frameworks to learn the common latent space [14, 15].
While existing works for human action and gesture recognition from common view-
points explored different deep learning techniques and achieved impressive accuracy.
In most of aforementioned multi-view action recognition techniques, the features
extracted from each view are usually hand-crafted features (i.e improved dense tra-
jectories) [16, 15, 14]. Deep learning techniques, if used, handle knowledge transfer
among viewpoints. Deployment of deep features in such frameworks for cross-view
scenario is under active investigation.
In parallel with knowledge transfer techniques, building a common space from
different views has been addressed in many other works using multi-view discrimi-
nant analysis techniques. The first work of this approach was initiated by Canonical
Component Analysis (CCA) that tried to find two linear transformations for each
view [17]. Various improvements of CCA have been made to take non-linear trans-
formation into account (kernel CCA) [18]. Henceforth, different improvements have
been introduced such as MULDA [19], MvDA [20], MvCCA, MvPLS and MvMDA
Tran Hoang Nhat - CBC19005 5
1 Introduction Master Thesis
[21], MvCCDA [22]. All of these techniques try to build a common space from
different views by maximizing the cross-covariance between views. However, most
of these works are still experimented with static images, none of them have been
explored with videos. Particularly, their investigation for the case of human action
recognition is still actively under taken.
1.2 Objective
Motivated by the two aforementioned under investigation problems, in the research
project at MICA institute, an unified framework is proposed for cross-view action
recognition which consists of two main components: individual view feature extrac-
tion and latent common space construction. The work in this thesis is part of the
work carried out by the research team.
For feature extraction from individual view, a range of deep neural networks are
investigated, from 2D CNNs with different pooling strategies (average pooling, tem-
poral attention pooling or using LSTM) to 3D CNNs with two most recent variations
(C3D [23] and ResNet-50 3D [24]). These networks have been successfully deployed
for human action and gesture recognition in general, but not investigated yet for
cross-view recognition scenarios.
The objective of this thesis focuses on the second stage of the proposed general
framework. For building a latent common space, we are inspired by idea of multi-
view discriminant analysis (MvDA). This technique has been shown to be efficient
for images based tasks, but not deployed for video based tasks and mostly with deep
features extracted from videos as input. In addition, the MvDA’s objective has no
explicit constraint to push class centers away from each other. To this end, with
idea based on the proposal of a dimensionality reduction algorithm namely pc-LDA,
we extend the original MvDA by introducing the pairwise-covariance constraint that
helps to make classes to be more separated, while modifying the optimization model
that could theoretically directed to train the whole framework end-to-end. The new
optimization objective is also more efficient than the original conception of pc-LDA
in terms of computational complexity.
The main contributions of this thesis are summarized as follows:
• Firstly, investigating various recent deep neural networks for feature extrac-
tion.
• Secondly, proposing an extension of MvDA (so-called pc-MvDA) which aims
to improve the recognition results.
• Finally, incorporating DNN and MvA in an unified framework and evaluating
it on three datasets (IXMAS, MuHAVi, MICAGes).
Tran Hoang Nhat - CBC19005 6
1 Introduction Master Thesis
Specifically, where the thesis is based on work done by myself jointly with others,
my own contribution focuses on second and third objectives as primary contributor
while training process of DNN for feature extraction is largely done with help of
co-researcher.
1.3 Thesis Outline
The thesis is structured into 5 chapters:
1 Introduction This chapter. Motivates the work and describes the research goals.
2 Background and Related Works Describes the deep learning based approach
for feature extraction, dimensionality reduction and multi-view learning al-
gorithms. Also briefly reviews the existing approaches on human action recog-
nition in single and cross-view scenarios and multi-view analysis techniques.
3 Proposed Method Introduces the general architecture and proposes the technical
contribution for solving the mentioned research objective.
4 Experiments Reports information regarding experiments: datasets, evaluation
protocol, technical setup, results and discussions.
5 Conclusion Summarizes the work, points out the contributions, drawbacks and
suggests future research directions.
Tran Hoang Nhat - CBC19005 7
2 Technical Background and Related Works Master Thesis
2 Technical Background and Related
Works
2.1 Introduction
This chapter provides the basic knowledge as well as related works regarding to the
research topic of this thesis. Section 2.2 introduces briefly the general architecture
of deep neural networks for deep feature extraction; then describes in detail some
dimensionality reduction algorithms and multi-view analysis algorithms. Section 2.3
summarizes approaches introduced in existing works to tackle problems in human
action and gesture recognition and multi-view strategy.
2.2 Technical Background
2.2.1 Deep Neural Networks
2.2.1.1 Artificial Neural Networks
Artificial neural networks (ANNs), sometimes a.k.a. multi-layer perceptron (MLP)
or feed-forward neural network, are inspired by “real” neural networks - i.e. the
human brain and nervous system. They consist of neurons grouped in multiple
connected layers, each of which subsequently transformed by an activation function.
Linear Layer The linear layer, or generally known as fully-connected layer, ba-
sically composes of several perceptron units. Mathematically, it is simply a linear
function of input features with 2 learnable parameters weight W = {ω1 , ω2 , ..., ωd } as
coefficient of multiplication and bias b as additional term, simulating the biological
group of d perceptrons:
y =W ·x+b (2.1)
Activation Layer The non-linear activation layers are responsible to create com-
plex smooth mappings between the input and the output. They are element-wise
operators responsible to squash the value of each element within the boundaries of
specified function. Some common activation functions are:
Tran Hoang Nhat - CBC19005 8
2 Technical Background and Related Works Master Thesis
• Sigmoid: sigmoid (x) = 1+e1−x
x −x
• Hyperbolic tangent: tanh (x) = eex +e
−e
−x
• Rectifier: relu (x) = max (0, x)
• Swish: swish (x) = x · sigmoid (x)
For classification layer (the last linear layer), the number of neurons is equal to
the number of classes to be recognized and a softmax operator (Equation (2.2)) is
usually applied as activation function to get the probabilities of each classes.
$\sigma_i = \frac{e^{y_i}}{\sum_{n=1}^{N} e^{y_n}}$ (2.2)
Training Neural network training is usually performed via the backpropagation algo-
rithm. This algorithm is based on the calculation of a loss function L which repre-
sents the difference between the network output and the expected output. Partial
derivatives of the cost $\frac{\partial L}{\partial p_i}$ are calculated with respect to each trainable parameter
$p_i$ using the chain rule. Then, each parameter is adjusted accordingly:
$\Delta p_i = -\eta \frac{\partial L}{\partial p_i}$ (2.3)
where η is called the learning rate, which must be chosen carefully to ensure conver-
gence.
Loss functions are usually applied to the last layer. The most common criterion
for classification tasks is the Cross Entropy Loss:
$L_{CE} = -\frac{1}{N} \sum_{k=1}^{N} \log \frac{e^{W_{class(x_k)} x_k + b_{class(x_k)}}}{\sum_{i=1}^{c} e^{W_i x_k + b_i}}$ (2.4)
where N is the number of samples and c is the number of classes; W and b are the parameters
of the classification layer, with $W_i$ and $b_i$ denoting the i-th column of the weight W and bias b
respectively; $x_k$ denotes the deep feature of the k-th sample, which belongs to class $class(x_k)$.
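Purely as an illustrative sketch (the classifier, feature dimension and class count below are placeholder assumptions), a single training step combining the cross entropy criterion of Equation (2.4) with the update rule of Equation (2.3) could be written in PyTorch as:

import torch
import torch.nn as nn

model = nn.Linear(512, 11)                    # placeholder classifier: 512-d deep features, 11 classes
criterion = nn.CrossEntropyLoss()             # cross entropy on raw logits, as in Equation (2.4)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)   # learning rate eta

features = torch.randn(8, 512)                # dummy mini-batch of deep features
labels = torch.randint(0, 11, (8,))           # dummy class indices

logits = model(features)                      # W x + b for each class
loss = criterion(logits, labels)              # L_CE

optimizer.zero_grad()
loss.backward()                               # backpropagation: compute dL/dp_i by the chain rule
optimizer.step()                              # p_i <- p_i - eta * dL/dp_i (Equation (2.3))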
2.2.1.2 Convolutional Neural Networks
Convolutional neural networks (CNNs), inspired by the biological processes in the
visual cortex of animals, have emerged as the most efficient approach for image
recognition and classification tasks. They are able to extract and aggregate highly
abstract information from images and videos. As a result of huge research and
engineering efforts, the effectiveness and performance of such algorithms have con-
siderably improved, outperforming handcrafted methods for visual information em-
bedding and becoming the state-of-the-art in image and video recognition.
There are five main building blocks in the architecture of a modern CNN:
Convolution Layer The convolutional layer slides a kernel over the input tensor
and, for every position, sums the element-wise multiplication between the sliced input
and the learnable weight matrices to compute the output. It can have multiple
kernels so that more features can be extracted from the input tensor.
The mathematical operation performed by each 2D convolutional kernel is:
$o^{l}_{i,j} = \sum_{c=0}^{C_h} \sum_{h=0}^{K_H} \sum_{w=0}^{K_W} \omega^{l}_{c,h,w} \cdot x_{c,i+h,j+w} + b^{l}$ (2.5)
2D convolution is limited to spatial data and requires extra steps to handle
temporally continuous sequences of images. On the other hand, 3D convolution
can intrinsically comprehend and establish abstract spatio-temporal relationships
in a 3D input tensor. The mathematical operation performed by each 3D convolutional
kernel is:
$o^{l}_{i,j,k} = \sum_{c=0}^{C_h} \sum_{d=0}^{K_D} \sum_{h=0}^{K_H} \sum_{w=0}^{K_W} \omega^{l}_{c,d,h,w} \cdot x_{c,i+d,j+h,k+w} + b^{l}$ (2.6)
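The following short PyTorch sketch, with arbitrary channel counts and input sizes, illustrates the difference between the 2D and 3D convolution operators of Equations (2.5) and (2.6):

import torch
import torch.nn as nn

# 2D convolution: 3 input channels, 16 kernels of size 3x3 (Equation (2.5))
conv2d = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)
image = torch.randn(1, 3, 224, 224)        # (batch, channels, height, width)
feat2d = conv2d(image)                     # shape: (1, 16, 224, 224)

# 3D convolution: an extra temporal dimension over the frames (Equation (2.6))
conv3d = nn.Conv3d(in_channels=3, out_channels=16, kernel_size=(3, 3, 3), padding=1)
clip = torch.randn(1, 3, 16, 112, 112)     # (batch, channels, frames, height, width)
feat3d = conv3d(clip)                      # shape: (1, 16, 16, 112, 112)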
Batch Normalization Layer The batch normalization layer introduced in [25] is a
pervasive component in modern CNN architectures. It is generally placed after every
convolution layer and before an activation layer, and is responsible for bringing all the
pre-activated features to the same scale. The mathematical equation is as follows:
$y = \frac{x - \mathrm{E}[x]}{\sqrt{\mathrm{Var}[x] + \epsilon}} \cdot \gamma + \beta$ (2.7)
where E[x] and Var[x] stand for the mean and variance calculated per dimension
over the input mini-batch x; γ and β are learnable parameters and ε is
a small number added to the denominator to ensure numerical stability.
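For illustration, a batch normalization layer as in Equation (2.7) corresponds to the following sketch (the channel count and tensor sizes are arbitrary assumptions):

import torch
import torch.nn as nn

# BatchNorm2d normalizes each of the 16 channels over the mini-batch;
# gamma and beta are the learnable weight and bias, eps is the stability constant.
bn = nn.BatchNorm2d(num_features=16, eps=1e-5)

x = torch.randn(8, 16, 56, 56)
y = bn(x)   # per-channel zero mean and unit variance, then scaled by gamma and shifted by beta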
Activation Layer Similar to the activations in ANNs, activation layers in CNNs are
element-wise operators applied to each pixel of the input tensor.
Pooling Layer The pooling layer is usually inserted after one convolutional layer or
a group of them. The purpose of pooling is to progressively decrease the size of the
processed data and make sure that only the most relevant features are forwarded
to the next layers. It follows the sliding kernel principle of convolution, but uses a
much simpler operator without learnable parameters, such as:
• Max Pooling: Select the pixel with maximum value.
• Min Pooling: Select the pixel with minimum value.
• Average Pooling: Compute the mean of the sliced input pixels.
A special class of pooling layer is called global pooling, whose filter size and stride
adapt exactly to the shape of the input tensor, squeezing each channel to a
single scalar value. This type of pooling is generally used at the very end of a large-
scale CNN, transforming high-level features of possibly undetermined shapes into a
single vector of fixed length. The output feature vector can then be forwarded
to further linear layers without being flattened, or used to perform classification directly.
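A minimal illustration of these pooling variants in PyTorch, with arbitrary tensor sizes, might look as follows:

import torch
import torch.nn as nn

x = torch.randn(1, 64, 56, 56)

max_pool = nn.MaxPool2d(kernel_size=2, stride=2)   # keeps the maximum of each 2x2 window
avg_pool = nn.AvgPool2d(kernel_size=2, stride=2)   # keeps the mean of each 2x2 window
print(max_pool(x).shape, avg_pool(x).shape)        # both (1, 64, 28, 28)

# Global average pooling squeezes each channel to a single scalar,
# regardless of the spatial size of the input tensor.
gap = nn.AdaptiveAvgPool2d(output_size=1)
vector = gap(x).flatten(1)                         # (1, 64): fixed-length feature vector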
Linear Layer The linear layer is the essential building block of classical
ANNs, but it may be optional in CNNs, as in the case of fully convolutional neural net-
works.
2.2.1.3 Recurrent Neural Networks
A recurrent neural network (RNN) is a neural network that takes previous
time steps into account. The input of an RNN is a sequentially ordered collection
of samples. Therefore, RNNs excel in tasks in which order is important, e.g. time
series forecasting and natural language processing. Relating to the research topic of this
thesis, they can be used to handle the temporal relationships among high-level representations
of frames extracted from videos.
In practice, either Long Short-Term Memory (LSTM) or Gated Recurrent Unit
(GRU) cells are used instead of the basic RNN. The main difference is that infor-
mation that is deemed important is allowed to pass on to later time steps without
too much interference from hidden dot products and activation functions.
Long Short-Term Memory The LSTM architecture, contrary to regular RNNs,
has an additional hidden state that is never directly outputted (see Figure 2.1). This
additional hidden state can then be used by the network solely for remembering
previous relevant information. Instead of having to share its “memory” with its
output, these values are now separate. During the training process, an LSTM learns
what should be remembered for the future and what should be forgotten, which is
achieved by using its internal weights.
As can be seen in Figure 2.1, there are quite a few more parameters in this
cell than in a normal RNN cell.
Figure 2.1: A single LSTM cell. From [1].
The calculation of the output vector and the hidden
vector involves several operations. First of all, the network determines how much
of the hidden state to forget, through what is called the forget gate. This is done by pushing
both the previous iteration's hidden state (ct−1) and the forget gate vector (ft) through an
element-wise multiplication, allowing the network to forget values at specific indices of the
previous iteration's hidden state vector. ft can be obtained by using the formula in Equation
(2.8), where W contains the weights for the input, U contains the weights for
the previous iteration's output vector, xt refers to the input, ht−1 to the previous
iteration's output vector and b is the bias:
ft = σ(Wf xt + Uf ht−1 + bf ) (2.8)
The network then determines what to remember from the input vector. This,
commonly referred to as the input gate, is done by pushing the previous forget
gate’s output as well as the input gate through a matrix addition. The output of
the input gate (it ) can be found by using the following formula:
it = σ(Wi xt + Ui ht−1 + bi ) (2.9)
The final hidden state vector (ct ) can then be found by using the previous two
results as follows:
ct = ft ◦ ct−1 + it ◦ σ(Wc xt + Uc ht−1 + bc ) (2.10)
where ◦ denotes the Hadamard product (where each value at index ij is the
product of the values at the indices ij in the two input matrices). This vector is
then passed on to the next iteration. Now the output gate vector ot and the output
state ht can be obtained:
ot = σ(Wo xt + Uo ht−1 + bo ) (2.11)
ht = ot ◦ σ(ct ) (2.12)
This results in a version of an RNN that is able to remember more and is more
liberal in choosing what information it wants to keep in the hidden state and what
it wants to discard. This makes LSTM networks better suited for tasks involving
series of data, and they have become the predominant RNN architecture.
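As an illustrative sketch only (the feature size and sequence length below are arbitrary assumptions, not values from this thesis), an LSTM layer can be applied to a sequence of frame-level deep features as follows:

import torch
import torch.nn as nn

# One-layer LSTM over sequences of 16 frame-level feature vectors of size 512.
lstm = nn.LSTM(input_size=512, hidden_size=256, batch_first=True)

frames = torch.randn(4, 16, 512)      # (batch, time steps, feature dimension)
outputs, (h_n, c_n) = lstm(frames)    # outputs: h_t for every step, shape (4, 16, 256)
video_feature = h_n[-1]               # last hidden state h_T as a clip-level descriptor, shape (4, 256)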
Gated Recurrent Units Another RNN architecture is the GRU, introduced in
[26]. This architecture combines the input and forget gates into a single so-called
“update gate” and also merges the cell state and hidden state (see Figure 2.2). The
calculation of the merged output vector once again consists of several operations.
The network first computes the “reset gate” rt using the following function, where
Wr are the weights for the reset gate and [ht−1 , xt ] signifies the concatenation of ht−1
and xt :
Figure 2.2: A single GRU variation cell. From [1].
rt = σ (Wr [ht−1 , xt ]) (2.13)
After this, the “update gate” zt is computed as follows, where Wz holds the weights
of the update gate:
zt = σ (Wz [ht−1 , xt ]) (2.14)
The output vector ht (representing both the cell’s output and its state) can then
be computed by the following formula:
ht = (1 − zt ) ∗ ht−1 + zt ∗ h̃t (2.15)
where h̃t = tanh(W ∗ [rt ∗ ht−1 , xt ]).
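Again purely as an illustrative sketch with arbitrary dimensions, a GRU layer exposes a single hidden state that serves as both output and memory:

import torch
import torch.nn as nn

# GRU: no separate cell state, unlike the LSTM above.
gru = nn.GRU(input_size=512, hidden_size=256, batch_first=True)

frames = torch.randn(4, 16, 512)      # (batch, time steps, feature dimension)
outputs, h_n = gru(frames)            # outputs: (4, 16, 256), h_n: final hidden state (1, 4, 256)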
2.2.2 Dimensionality Reduction Algorithms
Dimensionality reduction techniques are important in many applications related to
machine learning. They aim to find a low-dimensional embedding that preserves
sufficient information from the original dimension. Let us define X = {xik | i =
1, .., c; k = 1, .., ni} as the training samples, where xik ∈ R^{dx} is the k-th sample of the
i-th class, dx is the original dimension of the data and dy is the desired dimension of the data
after transformation, such that dy < dx.
2.2.2.1 Linear discriminant analysis
The linear discriminant analysis (LDA) technique is developed to linearly transform
the features into a lower dimensional space where the ratio of the between-class vari-
ance to the within-class variance is maximized, thereby guaranteeing optimal
class separability. The projections of X onto the lower dimensional space are de-
noted by Y = {yik = ω^T xik | i = 1, .., c; k = 1, ..., ni}. $S_B^y$ and $S_W^y$ are computed
as follows:
$S_W^y = \sum_{i=1}^{c} \sum_{k=1}^{n_i} (y_{ik} - \mu_i)(y_{ik} - \mu_i)^T$ (2.16)
$S_B^y = \sum_{i=1}^{c} n_i (\mu_i - \mu)(\mu_i - \mu)^T$ (2.17)
where $\mu_i = \frac{1}{n_i} \sum_{k=1}^{n_i} y_{ik}$ is the mean of all samples of the i-th class in the lower
dimensional space; $\mu = \frac{1}{n} \sum_{i=1}^{c} \sum_{k=1}^{n_i} y_{ik}$ is the mean of all samples of all classes;
$n = \sum_{i=1}^{c} n_i$ is the total number of data samples. The between-class $S_B^y$ and within-class
$S_W^y$ covariance matrices in the new dimension are not known yet, but can be formulated
as the linearly transformed versions of their counterparts $S_B^x$ and $S_W^x$ in the original
dimension.
$S_W^y = \omega^T S_W^x \omega$ (2.18)
$S_B^y = \omega^T S_B^x \omega$ (2.19)
$S_B^x$ and $S_W^x$ are simply calculable as follows:
$S_W^x = \sum_{i=1}^{c} \sum_{k=1}^{n_i} (x_{ik} - \mu_i^{(x)})(x_{ik} - \mu_i^{(x)})^T$ (2.20)
$S_B^x = \sum_{i=1}^{c} n_i (\mu_i^{(x)} - \mu^{(x)})(\mu_i^{(x)} - \mu^{(x)})^T$ (2.21)
Then the objective function is formulated as a Rayleigh quotient:
$\omega^* = \underset{\omega}{\operatorname{argmax}} \frac{\operatorname{trace}(S_B^y)}{\operatorname{trace}(S_W^y)} = \underset{\omega}{\operatorname{argmax}} \frac{\operatorname{trace}(\omega^T S_B^x \omega)}{\operatorname{trace}(\omega^T S_W^x \omega)}$ (2.22)
Fisher's criterion in Equation (2.22) can be reformulated as the generalized eigenvalue problem:
$S_B^x \omega = \lambda S_W^x \omega$ (2.23)
where λ represents the generalized eigenvalues associated with the transformation matrix ω. The ana-
lytical solution for ω* is a (dx × dy) matrix obtained by computing the eigenvec-
tors V = {v1, v2, ..., vdx} of the matrix $S = (S_W^x)^{-1} S_B^x$, sorted by their corresponding
eigenvalues λ = {λ1, λ2, ..., λdx}, and keeping the dy leading eigenvectors, as illustrated
in Figure 2.3.
Figure 2.3: Analytical solution of LDA.
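As a rough, illustrative sketch only (not the implementation used in this thesis), the analytical solution above can be reproduced in a few lines of NumPy on synthetic data; the function name lda_projection and the use of a pseudo-inverse of $S_W^x$ for numerical stability are assumptions of this sketch:

import numpy as np

def lda_projection(X, labels, dy):
    """Return the (dx x dy) LDA transformation matrix omega for data X of shape (n, dx)."""
    dx = X.shape[1]
    mu = X.mean(axis=0)
    S_W = np.zeros((dx, dx))
    S_B = np.zeros((dx, dx))
    for c in np.unique(labels):
        Xc = X[labels == c]
        mu_c = Xc.mean(axis=0)
        S_W += (Xc - mu_c).T @ (Xc - mu_c)              # within-class scatter (Equation (2.20))
        diff = (mu_c - mu).reshape(-1, 1)
        S_B += len(Xc) * diff @ diff.T                  # between-class scatter (Equation (2.21))
    # Eigen-decomposition of S_W^{-1} S_B; keep the dy leading eigenvectors.
    eigvals, eigvecs = np.linalg.eig(np.linalg.pinv(S_W) @ S_B)
    order = np.argsort(eigvals.real)[::-1]
    return eigvecs[:, order[:dy]].real                  # omega: (dx, dy)

# toy usage: 100 samples of dimension 10, 3 classes, projected to 2 dimensions
X = np.random.randn(100, 10)
labels = np.random.randint(0, 3, size=100)
omega = lda_projection(X, labels, dy=2)
Y = X @ omega                                           # projected samples in the lower dimensional space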
2.2.2.2 Pairwise-covariance linear discriminant analysis
Pairwise-covariance linear discriminant analysis (pc-LDA) is an extension of LDA
introduced in [27] that overcomes its drawbacks by formulating pairwise distances
between pairs of classes. A pair of classes a and b is regarded as two Gaussian
distributions $N_a(\mu_a, S_{Wa}^y)$ and $N_b(\mu_b, S_{Wb}^y)$, and the objective distance between the two
classes is defined as their Kullback-Leibler divergence [28]:
$D_{KL}(N_a \parallel N_b) = \frac{1}{2} (\mu_a - \mu_b)^T (S_{Wab}^y)^{-1} (\mu_a - \mu_b)$, (2.24)
where $S_{Wab}^y$ is the pairwise covariance matrix (Equation (2.26)), calculated as the
β-parameterized convex combination of the global within-class scatter matrix $S_W^y$ used in LDA
with the within-class scatter matrices of the individual classes $S_{Wi}^y$ (Equation (2.25)). The authors
theorize that it better represents the data distribution within the two classes.
$S_{Wi}^y = \sum_{j=1}^{v} \sum_{k=1}^{n_{ij}} (y_{ik} - \mu_i)(y_{ik} - \mu_i)^T$ (2.25)
$S_{Wab}^y = \beta \frac{n_a S_{Wa}^y + n_b S_{Wb}^y}{n_a + n_b} + (1 - \beta) S_W^y$ (2.26)
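As a small illustration of Equation (2.26) only (the rest of the pc-LDA objective is not reproduced here), the pairwise within-class covariance of a class pair could be assembled as follows, assuming the per-class scatter matrices and sample counts are already available; the function name is hypothetical:

import numpy as np

def pairwise_within_class_cov(S_Wa, S_Wb, S_W, n_a, n_b, beta=0.5):
    """Convex combination of the pair's within-class scatter and the global one (Equation (2.26))."""
    pair_term = (n_a * S_Wa + n_b * S_Wb) / (n_a + n_b)
    return beta * pair_term + (1.0 - beta) * S_W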