MINISTRY OF EDUCATION AND TRAINING
HANOI UNIVERSITY OF SCIENCE AND TECHNOLOGY
DUONG THI HIEN THANH
AUDIO SOURCE SEPARATION EXPLOITING NMF-
BASED GENERIC SOURCE SPECTRAL MODEL
DOCTORAL DISSERTATION OF COMPUTER SCIENCE
Hanoi - 2019
MINISTRY OF EDUCATION AND TRAINING
HANOI UNIVERSITY OF SCIENCE AND TECHNOLOGY
DUONG THI HIEN THANH
AUDIO SOURCE SEPARATION EXPLOITING NMF-
BASED GENERIC SOURCE SPECTRAL MODEL
Major: Computer Science
Code: 9480101
DOCTORAL DISSERTATION OF COMPUTER SCIENCE
SUPERVISORS:
1. ASSOC. PROF. DR. NGUYEN QUOC CUONG
2. DR. NGUYEN CONG PHUONG
Hanoi - 2019
DECLARATION OF AUTHORSHIP
I, Duong Thi Hien Thanh, hereby declare that this thesis is my original
work and it has been written by me in its entirety. I confirm that:
• This work was done wholly during candidature for a Ph.D. research
degree at Hanoi University of Science and Technology.
• Where any part of this thesis has previously been submitted for a
degree or any other qualification at Hanoi University of Science and
Technology or any other institution, this has been clearly stated.
• Where I have consulted the published work of others, this is always
clearly attributed.
• Where I have quoted from the work of others, the source is always given.
With the exception of such quotations, this thesis is entirely my own work.
• I have acknowledged all main sources of help.
• Where the thesis is based on work done by myself jointly with others, I have
made clear exactly what was done by others and what I have contributed myself.
Hanoi, February 2019
Ph.D. Student
Duong Thi Hien Thanh
SUPERVISORS
Assoc.Prof. Dr. Nguyen Quoc Cuong Dr. Nguyen Cong Phuong
ACKNOWLEDGEMENT
This thesis has been written during my doctoral study at the International Research
Institute Multimedia, Information, Communication and Applications (MICA), Hanoi
University of Science and Technology (HUST). It is my great pleasure to thank the
numerous people who have contributed towards shaping this thesis.
First and foremost, I would like to express my most sincere gratitude to my
supervisors, Assoc. Prof. Nguyen Quoc Cuong and Dr. Nguyen Cong Phuong,
for their great guidance and support throughout my Ph.D. study. I am grateful to
them for devoting their precious time to discussing research ideas, proofreading,
and explaining how to write good research papers. I would like to thank them for
encouraging my research and empowering me to grow as a research scientist. I
could not have imagined having better advisors and mentors for my Ph.D. study.
I would like to express my appreciation to Prof. Nguyen Thanh Thuy, School of
Information and Communication Technology, HUST, my supervisor during my Master's
course, and to Dr. Nguyen Vu Quoc Hung, my supervisor during my Bachelor's course at
Hanoi National University of Education. They shaped my knowledge and helped me excel
in my studies.
In the process of implementing and completing my research, I have received
much support from the board of MICA directors and my colleagues at the
Speech Communication Department. In particular, I am very thankful to
Prof. Pham Thi Ngoc Yen, Prof. Eric Castelli, Dr. Nguyen Viet Son, and Dr. Dao
Trung Kien, who provided me with an opportunity to join research activities at the
MICA institute and to have access to the laboratory and research facilities. Without
their precious support, it would have been impossible to conduct this
research. My warm thanks go to my colleagues at the Speech Communication
Department of the MICA institute for their useful comments on my study and their
unconditional support over four years, both at work and outside of work.
I am very grateful to my internship supervisor Prof. Nobutaka Ono and the members
of Ono's Lab at the National Institute of Informatics, Japan, for warmly welcoming me
into their lab and for the helpful research collaboration they offered. I much appreciate his
help in funding my conference trip and introducing me to the signal processing research
community. I would also like to thank Dr. Toshiya Ohshima, MSc. Yasutaka Nakajima,
MSc. Chiho Haruta, and other researchers at Rion Co., Ltd., Japan, for
welcoming me to their company and providing me with data for experiments.
I would also like to sincerely thank Dr. Nguyen Quang Khanh, dean of the
Information Technology Faculty, and Assoc. Prof. Le Thanh Hue, dean of the
Economic Informatics Department, at Hanoi University of Mining and Geology
(HUMG), where I am working. I have received financial support and time
from my office and leaders to complete my doctoral thesis. Grateful thanks
also go to my wonderful colleagues and friends Nguyen Thu Hang, Pham Thi
Nguyet, Vu Thi Kim Lien, Vo Thi Thu Trang, Pham Quang Hien, Nguyen The
Binh, Nguyen Thuy Duong, Nong Thi Oanh, and Nguyen Thi Hai Yen, who have
given me unconditional support and help over a long period. Special thanks go to
Dr. Le Hong Anh for his encouragement and precious advice.
Last but not least, I would like to express my deepest gratitude to my
family. I am very grateful to my mother-in-law and father-in-law for their support
in times of need, and for always allowing me to focus on my work. I dedicate this
thesis to my mother and father with special love; they have been great
mentors in my life and have constantly encouraged me to be a better person. The
struggle and sacrifice of my parents always motivate me to work hard in my
studies. I would also like to express my love to my younger sisters and younger
brother for their encouragement and help. This work has become more
wonderful because of the love and affection that they have provided.
A special love goes to my beloved husband Tran Thanh Huan for his patience and
understanding, and for always being there for me to share the good and bad times. I also
appreciate my sons Tran Tuan Quang and Tran Tuan Linh for always cheering me up
with their smiles. Without their love, this thesis would not have been completed.
Thank you all!
Hanoi, February 2019
Ph.D. Student
Duong Thi Hien Thanh
CONTENTS
DECLARATION OF AUTHORSHIP . . . . . . . . . . . . . . . . . . . . . i
ACKNOWLEDGEMENT . . . . . . . . . . . . . . . . . . . . . . . . . . . ii
CONTENTS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iv
NOTATIONS AND GLOSSARY . . . . . . . . . . . . . . . . . . . . . . . . viii
LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xi
LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xii
INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
Chapter 1. AUDIO SOURCE SEPARATION: FORMULATION AND STATE OF
THE ART 10
1.1 Audio source separation: a solution for the cocktail party problem . . . 10
1.1.1 General framework for source separation . . . . . . . . . . . 10
1.1.2 Problem formulation . . . . . . . . . . . . . . . . . . . . . . 11
1.2 State of the art . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.2.1 Spectral models . . . . . . . . . . . . . . . . . . . . . . . . 13
1.2.1.1 Gaussian Mixture Model . . . . . . . . . . . . . . 14
1.2.1.2 Nonnegative Matrix Factorization . . . . . . . . . . 15
1.2.1.3 Deep Neural Networks . . . . . . . . . . . . . . . 16
1.2.2 Spatial models . . . . . . . . . . . . . . . . . . . . . . . . . 18
1.2.2.1 Interchannel Intensity/Time Difference (IID/ITD) . 18
1.2.2.2 Rank-1 covariance matrix . . . . . . . . . . . . . . 19
1.2.2.3 Full-rank spatial covariance model . . . . . . . . . 20
1.3 Source separation performance evaluation . . . . . . . . . . . . . . . 21
1.3.1 Energy-based criteria . . . . . . . . . . . . . . . . . . . . . . 22
1.3.2 Perceptually-based criteria . . . . . . . . . . . . . . . . . . . 23
1.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
Chapter 2. NONNEGATIVE MATRIX FACTORIZATION 24
2.1 NMF introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.1.1 NMF in a nutshell . . . . . . . . . . . . . . . . . . . . . . . 24
2.1.2 Cost function for parameter estimation . . . . . . . . . . . . . 26
2.1.3 Multiplicative update rules . . . . . . . . . . . . . . . . . . . 27
2.2 Application of NMF to audio source separation . . . . . . . . . . . . 29
2.2.1 Audio spectra decomposition . . . . . . . . . . . . . . . . . . 29
2.2.2 NMF-based audio source separation . . . . . . . . . . . . . . 30
2.3 Proposed application of NMF to unusual sound detection . . . . . . . 32
2.3.1 Problem formulation . . . . . . . . . . . . . . . . . . . . . . 33
2.3.2 Proposed methods for non-stationary frame detection . . . . . 34
2.3.2.1 Signal energy based method . . . . . . . . . . . . . 34
2.3.2.2 Global NMF-based method . . . . . . . . . . . . . 35
2.3.2.3 Local NMF-based method . . . . . . . . . . . . . . 35
2.3.3 Experiment . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
2.3.3.1 Dataset . . . . . . . . . . . . . . . . . . . . . . . . 37
2.3.3.2 Algorithm settings and evaluation metrics . . . . . 37
2.3.3.3 Results and discussion . . . . . . . . . . . . . . . . 38
2.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
Chapter 3. SINGLE-CHANNEL AUDIO SOURCE SEPARATION EXPLOITING
NMF-BASED GENERIC SOURCE SPECTRAL MODEL WITH MIXED GROUP
SPARSITY CONSTRAINT 44
3.1 General workflow of the proposed approach . . . . . . . . . . . . . . 44
3.2 GSSM formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.3 Model fitting with sparsity-inducing penalties . . . . . . . . . . . . . 46
3.3.1 Block sparsity-inducing penalty . . . . . . . . . . . . . . . . 47
3.3.2 Component sparsity-inducing penalty . . . . . . . . . . . . . 48
3.3.3 Proposed mixed sparsity-inducing penalty . . . . . . . . . . . 49
3.4 Derived algorithm in unsupervised case . . . . . . . . . . . . . . . . 49
3.5 Derived algorithm in semi-supervised case . . . . . . . . . . . . . . . 52
3.5.1 Semi-GSSM formulation . . . . . . . . . . . . . . . . . . . . 52
3.5.2 Model fitting with mixed sparsity and algorithm . . . . . . . . 54
3.6 Experiment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
3.6.1 Experiment data . . . . . . . . . . . . . . . . . . . . . . . . 54
3.6.1.1 Synthetic dataset . . . . . . . . . . . . . . . . . . . 55
3.6.1.2 SiSEC-MUS dataset . . . . . . . . . . . . . . . . . 55
3.6.1.3 SiSEC-BGN dataset . . . . . . . . . . . . . . . 56
3.6.2 Single-channel source separation performance with unsuper-
vised setting . . . . . . . . . . . . . . . . . . . . . . . . . . 57
3.6.2.1 Experiment settings . . . . . . . . . . . . . . . . . 57
3.6.2.2 Evaluation method . . . . . . . . . . . . . . . . . . 57
3.6.2.3 Results and discussion . . . . . . . . . . . . . . . . 61
3.6.3 Single-channel source separation performance with semi-supervised
setting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
3.6.3.1 Experiment settings . . . . . . . . . . . . . . . . . 65
3.6.3.2 Evaluation method . . . . . . . . . . . . . . . . . . 65
3.6.3.3 Results and discussion . . . . . . . . . . . . . . . . 65
3.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
Chapter 4. MULTICHANNEL AUDIO SOURCE SEPARATION EXPLOITING
NMF-BASED GSSM IN GAUSSIAN MODELING FRAMEWORK 68
4.1 Formulation and modeling . . . . . . . . . . . . . . . . . . . . . . . 68
4.1.1 Local Gaussian model . . . . . . . . . . . . . . . . . . . . . 68
4.1.2 NMF-based source variance model . . . . . . . . . . . . . . . 70
4.1.3 Estimation of the model parameters . . . . . . . . . . . . . . 71
4.2 Proposed GSSM-based multichannel approach . . . . . . . . . . . . . 72
4.2.1 GSSM construction . . . . . . . . . . . . . . . . . . . . . . . 72
4.2.2 Proposed source variance fitting criteria . . . . . . . . . . . . 73
4.2.2.1 Source variance denoising . . . . . . . . . . . . . . 73
4.2.2.2 Source variance separation . . . . . . . . . . . . . 74
4.2.3 Derivation of MU rule for updating the activation matrix . . . 75
4.2.4 Derived algorithm . . . . . . . . . . . . . . . . . . . . . . . 77
4.3 Experiment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
4.3.1 Dataset and parameter settings . . . . . . . . . . . . . . . . . 79
4.3.2 Algorithm analysis . . . . . . . . . . . . . . . . . . . . . . . 80
4.3.2.1 Algorithm convergence: separation results as func-
tions of EM and MU iterations . . . . . . . . . . . 80
4.3.2.2 Separation results with different choices of the trade-off parameters 81
4.3.3 Comparison with the state of the art . . . . . . . . . . . . . . 82
4.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
CONCLUSIONS AND PERSPECTIVES . . . . . . . . . . . . . . . . . . . 93
BIBLIOGRAPHY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
LIST OF PUBLICATIONS . . . . . . . . . . . . . . . . . . . . . . . . . . 113
NOTATIONS AND GLOSSARY
Standard mathematical symbols
C Set of complex numbers
R Set of real numbers
Z Set of integers
E Expectation of a random variable
Nc Complex Gaussian distribution
Vectors and matrices
a Scalar
a Vector
A Matrix
A^T Matrix transpose
A^H Matrix conjugate transpose (Hermitian conjugation)
diag(a) Diagonal matrix with a as its diagonal
det(A) Determinant of matrix A
tr(A) Matrix trace
A ⊙ B Element-wise (Hadamard) product of two matrices of the same dimension, with elements [A ⊙ B]_ij = A_ij B_ij
A^(·n) The matrix with entries [A]^n_ij (element-wise power)
‖a‖_1 ℓ1-norm of a vector
‖A‖_1 ℓ1-norm of a matrix
Indices
f Frequency index
i Channel index
j Source index
n Time frame index
t Time sample index
Sizes
I Number of channels
J Number of sources
L STFT filter length
F Number of frequency bins
N Number of time frames
K Number of spectral basis vectors
Mixing filters
A ∈ R^(I×J×L) Matrix of mixing filters
a_j(τ) ∈ R^I Mixing filter from the j-th source to all microphones, where τ is the time delay
a_ij(t) ∈ R Filter coefficient at the t-th time index
a_ij ∈ R^L Time-domain filter vector
â_ij ∈ C^L Frequency-domain filter vector
â_ij(f) ∈ C Filter coefficient at the f-th frequency index
General parameters
x(t) ∈ R^I Time-domain mixture signal
s(t) ∈ R^J Time-domain source signals
c_j(t) ∈ R^I Time-domain j-th source image
s_j(t) ∈ R Time-domain j-th original source signal
x(n, f) ∈ C^I Time-frequency domain mixture signal
s(n, f) ∈ C^J Time-frequency domain source signals
c_j(n, f) ∈ C^I Time-frequency domain j-th source image
v_j(n, f) ∈ R Time-dependent variance of the j-th source
R_j(f) ∈ C^(I×I) Time-independent covariance matrix of the j-th source
Σ_j(n, f) ∈ C^(I×I) Covariance matrix of the j-th source image
Σ̂_x(n, f) ∈ C^(I×I) Empirical mixture covariance matrix
V ∈ R_+^(F×N) Power spectrogram matrix
W ∈ R_+^(F×K) Spectral basis matrix
H ∈ R_+^(K×N) Time activation matrix
U ∈ R_+^(F×K) Generic source spectral model
Abbreviations
APS Artifacts-related Perceptual Score
BSS Blind Source Separation
DoA Direction of Arrival
DNN Deep Neural Network
EM Expectation Maximization
ICA Independent Component Analysis
IPS Interference-related Perceptual Score
IS Itakura-Saito
ISR source Image to Spatial distortion Ratio
ISTFT Inverse Short-Time Fourier Transform
IID Interchannel Intensity Difference
ITD Interchannel Time Difference
GCC-PHAT Generalized Cross-Correlation with Phase Transform
GMM Gaussian Mixture Model
GSSM Generic Source Spectral Model
KL Kullback-Leibler
LGM Local Gaussian Model
MAP Maximum A Posteriori
ML Maximum Likelihood
MU Multiplicative Update
NMF Non-negative Matrix Factorization
OPS Overall Perceptual Score
PLCA Probabilistic Latent Component Analysis
SAR Signal to Artifacts Ratio
SDR Signal to Distortion Ratio
SIR Signal to Interference Ratio
SiSEC Signal Separation Evaluation Campaign
SNMF Spectral Non-negative Matrix Factorization
SNR Signal to Noise Ratio
STFT Short-Time Fourier Transform
TDOA Time Difference of Arrival
T-F Time-Frequency
TPS Target-related Perceptual Score
LIST OF TABLES
2.1 Total number of different events detected from three recordings in spring 40
2.2 Total number of different events detected from three recordings in sum-
mer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
2.3 Total number of different events detected from three recordings in winter 42
3.1 List of song snippets in the SiSEC-MUS dataset. . . . . . . . . . . . 56
3.2 Source separation performance obtained on the Synthetic and SiSEC-
MUS dataset with unsupervised setting. . . . . . . . . . . . . . . . . 59
3.3 Speech separation performance obtained on the SiSEC-BGN. Marked entries indicate submissions by the authors and “-” indicates missing information [81, 98, 100]. . . . 60
3.4 Speech separation performance obtained on the Synthetic dataset with
semi-supervised setting. . . . . . . . . . . . . . . . . . . . . . . . . . 66
4.1 Speech separation performance obtained on the SiSEC-BGN-devset - Comparison with the closest baseline methods. . . . 85
4.2 Speech separation performance obtained on the SiSEC-BGN-devset - Comparison with state-of-the-art methods in SiSEC. Marked entries indicate submissions by the authors and “-” indicates missing information. . . . 86
4.3 Speech separation performance obtained on the test set of the SiSEC-BGN. Marked entries indicate submissions by the authors [81]. . . . 91
LIST OF FIGURES
1 A cocktail party effect. . . . . . . . . . . . . . . . . . . . . . . . . . 2
2 Audio source separation. . . . . . . . . . . . . . . . . . . . . . . . . 3
3 Live recording environments. . . . . . . . . . . . . . . . . . . . . . . 4
1.1 Source separation general framework. . . . . . . . . . . . . . . . . . 11
1.2 Audio source separation: a solution for the cocktail party problem. . . . 13
1.3 IID corresponding to two sources in an anechoic environment. . . . . 19
2.1 Decomposition model of NMF [36]. . . . . . . . . . . . . . . . . . . 25
2.2 Spectral decomposition model based on NMF (K = 2) [66]. . . . . . 29
2.3 General workflow of supervised NMF-based audio source separation. 30
2.4 Image of overlapping blocks. . . . . . . . . . . . . . . . . . . . . . . 34
2.5 General workflow of the NMF-based nonstationary segment extraction. 35
2.6 Number of different events detected by the methods from (a) the
recordings in Spring, (b) the recordings in Summer, and (c) the record-
ings in Winter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.1 Proposed weakly-informed single-channel source separation approach. 45
3.2 Generic source spectral model (GSSM) construction. . . . . . . . . . 47
3.3 Estimated activation matrix H: (a) without a sparsity constraint, (b)
with a block sparsity-inducing penalty (3.5), (c) with a component
sparsity-inducing penalty (3.6), and (d) with the proposed mixed sparsity-
inducing penalty (3.7). . . . . . . . . . . . . . . . . . . . . . . . . . 48
3.4 Average separation performance obtained by the proposed method with
unsupervised setting over the Synthetic dataset as a function of MU it-
erations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
3.5 Average separation performance obtained by the proposed method with
unsupervised setting over the Synthetic dataset as a function of the trade-off parameters. 62
3.6 Average speech separation performance obtained by the proposed meth-
ods and the state-of-the-art methods over the dev set in SiSEC-BGN. . 63
3.7 Average speech separation performance obtained by the proposed meth-
ods and the state-of-the-art methods over the test set in SiSEC-BGN. . 63
4.1 General workflow of the proposed source separation approach. The top
green dashed box describes the training phase for the GSSM construc-
tion. Bottom blue boxes indicate processing steps for source separa-
tion. Green dashed boxes indicate the novelty compared to the existing
works [6, 38, 107]. . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
4.2 Average separation performance obtained by the proposed method over
stereo mixtures of speech and noise as functions of EM and MU itera-
tions. (a): speech SDR, (b): speech SIR, (c): speech SAR, (d): speech
ISR, (e): noise SDR, (f): noise SIR, (g): noise SAR, (h): noise ISR . . 81
4.3 Average separation performance obtained by the proposed method over
stereo mixtures of speech and noise as functions of the trade-off parameters. (a): speech
SDR, (b): speech SIR, (c): speech SAR, (d): speech ISR, (e): noise
SDR, (f): noise SIR, (g): noise SAR, (h): noise ISR . . . . . . . . . . 82
4.4 Average speech separation performance obtained by the proposed meth-
ods and the closest existing algorithms in terms of the energy-based
criteria. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
4.5 Average speech separation performance obtained by the proposed meth-
ods and the closest existing algorithms in terms of the perceptually-
based criteria. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
4.6 Average speech separation performance obtained by the proposed meth-
ods and the state-of-the-art methods in terms of the energy-based criteria. 89
4.7 Average speech separation performance obtained by the proposed meth-
ods and the state-of-the-art methods in terms of the perceptually-based
criteria. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
4.8 Boxplot for the speech separation performance obtained by the pro-
posed “GSSM + SV denoising” (P1) and “GSSM + SV separation”
(P2) methods. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
INTRODUCTION
In this part, we will introduce the motivation and the problem that we focus
on throughout this thesis. Then, we emphasize the objectives as well as the
scope of our work. In addition, our contributions in this thesis will be
summarized in order to give a clear view of the achievements. Finally, the
structure of the thesis is presented chapter by chapter.
1. Background and Motivation
1.1. Cocktail party problem
Real-world sound scenarios are usually very complicated as they are mixtures of
many different sound sources. Fig. 1 depicts the scenario of a typical cocktail party,
where many people are attending, many conversations are going on simultaneously,
and there are various disturbances such as loud music, people shouting, and a lot of
hustle and bustle. Other similar situations also happen in daily life, for example in
outdoor recordings, where there is interference from a variety of environmental sounds,
or in a music concert, where a number of musical instruments are played and the
audience listens to the collective sound. In such settings, what is actually heard by
the ears is a mixture of various sounds generated by various audio sources. The
mixing process can contain many sound reflections from walls and ceilings, which is
known as reverberation. Humans with normal hearing ability are generally able to
locate, identify, and differentiate sound sources that are heard simultaneously so as to
understand the conveyed information. However, this task remains extremely
challenging for machines, especially in highly noisy and reverberant environments. The
cocktail party effect described above prevents both humans and machines from perceiving
the target sound sources [2, 12, 145], and the creation of machine listening algorithms
that can automatically separate sound sources in difficult mixing conditions remains an
open problem.
Audio source separation aims at providing machine listeners with a function
similar to that of the human ears by separating and extracting the signals of individual
sources from a given mixture. This technique is formally termed blind source separation
(BSS) when no prior information about either the sources or the mixing condition is
available, and is depicted in Fig. 2. Audio source separation is also known as an
effective solution for the cocktail party problem in the audio signal processing community
[85, 90, 138, 143, 152]. Depending on the specific application, some source separation
approaches focus on speech separation, in which the speech signal is extracted from
a mixture containing background noise and other unwanted sounds. Other
methods deal with music separation, in which the singing voice and certain instruments
are recovered from a mixture or song containing multiple musical instruments. The
separated source signals may be either listened to or further processed, giving rise to
many potential applications. Speech separation is mainly used for speech enhancement
in hearing aids, hands-free phones, or automatic speech recognition (ASR) in
adverse conditions [11, 47, 64, 116, 129]. Music separation has many interesting
applications, including music editing/remixing in post-production, up-mixing, music
information retrieval, rendering of stereo recordings, and karaoke [37, 51, 106, 110].
Figure 1: A cocktail party effect.
Over the last couple of decades, efforts have been undertaken by the scientific
community, from various backgrounds such as signal processing, mathematics, statistics,
neural networks, machine learning, etc., to build audio source separation systems, as
described in [14, 15, 22, 43, 85, 105, 125].
Some icons of Fig. 1 are from: http://clipartix.com/.
Figure 2: Audio source separation.
The audio source separation problem has been studied at various levels of complexity,
and different approaches and systems have emerged. Despite these numerous efforts,
the problem is not completely solved yet, as the obtained separation results are still
far from perfect, especially in challenging conditions such as moving sound sources
and high reverberation.
1.2. Basic notations and target challenges
• Overdetermined, determined, and underdetermined mixture
There are three different settings in audio source separation, depending on the
relationship between the number of sources J and the number of microphones I.
When the number of microphones is larger than the number of sources, J < I,
there are more observable variables than unknown variables, and this is referred
to as the overdetermined case. If J = I, we have as many observable variables
as unknowns, and this is the determined case. The more difficult source
separation case is when the number of unknowns is larger than the number of
observable variables, J > I, which is called the underdetermined case.
Furthermore, if I = 1 then it is a single-channel case, and if I > 1 then it is a
multi-channel case.
• Instantaneous, anechoic, and reverberant mixing environment
Apart from the mixture settings based on the relationship between the number of
sources and the number of microphones, audio source separation algorithms can
also be distinguished by the target mixing condition they deal with.
The simplest case deals with instantaneous mixtures, such as certain music
mixtures generated by amplitude panning. In this case there is no time delay,
and a mixture at a given time is essentially a weighted sum of the source signals at
the same time instant. There are two other typical types of live recording
environments, anechoic and reverberant, as shown in Fig. 3. In anechoic
environments such as a studio or the outdoors, the microphones capture only the
direct sound propagation from a source. In reverberant environments such
as real meeting rooms or chambers, the microphones capture not only the
direct sound but also many sound reflections from walls, ceilings, and floors.
Modeling the reverberant environment is much more difficult than the
instantaneous and anechoic cases; the corresponding mixing models are sketched below.
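To make these settings concrete, the time-domain mixing models can be written with the notation defined earlier (x(t) the I-channel mixture, s_j(t) the J source signals, a_ij the mixing filters of length L). These are the standard textbook formulations, given here only as an illustration:

```latex
% Instantaneous mixing: each channel is a weighted sum of the sources
% taken at the same time instant (no delay, no filtering).
x_i(t) = \sum_{j=1}^{J} a_{ij}\, s_j(t), \qquad i = 1, \dots, I
% Convolutive mixing (anechoic or reverberant): each channel is a sum of
% filtered sources; the filter a_{ij}(\tau) reduces to a single gain and
% delay in the anechoic case and contains many reflections in the
% reverberant case.
x_i(t) = \sum_{j=1}^{J} \sum_{\tau=0}^{L-1} a_{ij}(\tau)\, s_j(t - \tau), \qquad i = 1, \dots, I
```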
Figure 3: Live recording environments.
State-of-the-art audio source separation algorithms perform quite well in
instantaneous or noiseless anechoic conditions, but they are still far from perfect when
reverberation is present, and their performance degrades with the amount of
reverberation. These numerical performance results are clearly shown in the recent
community-based Signal Separation Evaluation Campaigns (SiSEC) [5, 99, 101, 133,
134] and others [65, 135]. This shows that addressing the separation of reverberant
mixtures, a common case in real-world recording applications, remains one of the
key scientific challenges in the source separation community. Moreover, when the
desired sound is corrupted by high-level background noise, i.e., the Signal-to-Noise Ratio
(SNR) is 0 dB or lower, the separation performance is even worse.
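For reference, the SNR mentioned here follows the standard definition (stated only for illustration, not a quantity specific to this thesis), so 0 dB corresponds to the target signal and the background noise having equal power:

```latex
% Standard Signal-to-Noise Ratio in decibels, where P_signal and P_noise
% denote the power of the target signal and of the background noise.
\mathrm{SNR} = 10 \log_{10} \frac{P_{\mathrm{signal}}}{P_{\mathrm{noise}}} \ \mathrm{dB}
```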
Some icons of Fig. 3 are from: http://clipartix.com/.
To improve the separation performance, informed approaches have been proposed
and have emerged over the last decade in the literature [78, 136]. Such approaches exploit
side information about one or all of the sources themselves, or about the mixing condition,
in order to guide the separation process. Examples of the investigated side information
include deformed or hummed references of one (or more) source(s) in a given mixture
[123, 126], text associated with spoken speech [83], scores associated with musical
sources [37, 51], and motion associated with audio-visual objects in a video [110].
Following this trend, our research focuses on a weakly-informed strategy to
target the determined/underdetermined and highly reverberant audio source
separation challenge. We use very abstract semantic information, namely just the
types of audio sources present in the mixture, to guide the separation process.
2. Objective and scope
2.1. Objective
The main objective of the thesis is to investigate and develop efficient audio
source separation algorithms that can deal with determined/underdetermined mixtures
and high reverberation in real-world recording conditions.
In order to do that, we start by studying state-of-the-art approaches so as to
select one of the most well-known frameworks that can deal with the targeted
challenges. We then develop novel algorithms grounded on the considered
modeling framework, i.e., the Local Gaussian Model (LGM), with Nonnegative
Matrix Factorization (NMF) as the spectral model, for both the single-channel and
multi-channel cases. In our proposed approach, we exploit information just about
the types of audio sources in the mixture to guide the separation process. For
instance, in a speech enhancement application, we know that one source in a
noisy recording should be speech and another is background noise. We further
want to investigate the algorithms' convergence as well as their sensitivity to the
parameter settings, in order to provide guidance for parameter settings where applicable.
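As a minimal illustration of the NMF spectral model used throughout this work, the sketch below factorizes a nonnegative power spectrogram V ≈ WH with the classical multiplicative updates for the Kullback-Leibler divergence. The names V, W, H follow the notation table; the random initialization, iteration count, and toy input sizes are illustrative assumptions, not the settings used in this thesis.

```python
import numpy as np

def nmf_kl(V, K, n_iter=100, eps=1e-12, seed=0):
    """Minimal NMF sketch: factorize a nonnegative matrix V (F x N) as W @ H
    using the classical multiplicative updates for the KL divergence."""
    rng = np.random.default_rng(seed)
    F, N = V.shape
    W = rng.random((F, K)) + eps  # spectral basis matrix (F x K)
    H = rng.random((K, N)) + eps  # time activation matrix (K x N)
    ones = np.ones_like(V)
    for _ in range(n_iter):
        V_hat = W @ H + eps
        H *= (W.T @ (V / V_hat)) / (W.T @ ones + eps)   # update activations
        V_hat = W @ H + eps
        W *= ((V / V_hat) @ H.T) / (ones @ H.T + eps)   # update basis
    return W, H

# Toy usage on a random "power spectrogram" (hypothetical sizes).
V = np.abs(np.random.randn(257, 200)) ** 2
W, H = nmf_kl(V, K=10)
print(W.shape, H.shape)  # (257, 10) (10, 200)
```

In the proposed GSSM approach, the basis matrix is pre-trained on examples of each source type and kept fixed, while only the activations are estimated on the mixture; the sketch above only shows the generic unsupervised factorization.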
For evaluation, both speech and music separation are considered. We consider
speech separation for the speech enhancement task, and both singing voice and
musical instrument separation for the music task. In order to fairly compare the obtained
separation results with other existing methods, we use the benchmark dataset in addi-