Literature review notes on automatic speech recognition


Deep Neural Networks

Supervised Sequence Labelling with Recurrent Neural Networks (Graves, 2012)

Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation (2014)

$$ \max_{\boldsymbol{\theta}} \frac{1}{N} \displaystyle\sum_{n=1}^{N} \log {p_{\boldsymbol{\theta}} (\boldsymbol{y}_n|\boldsymbol{x}_n)}, $$ where $\boldsymbol{\theta}$ is the set of the model parameters and each $(\boldsymbol{x}_n, \boldsymbol{y}_n)$ is an (input sequence, output sequence) pair from the training set.
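The objective above can be illustrated with a tiny numeric sketch. The per-pair probabilities below are made-up placeholders for what an encoder-decoder's softmax outputs would produce, not values from the paper:

```python
import math

# Hypothetical conditional probabilities p_theta(y_n | x_n) for N = 3
# training pairs; in a real model these come from the decoder's softmax.
pair_probs = [0.9, 0.6, 0.75]

# F(theta) = (1/N) * sum_n log p_theta(y_n | x_n)
# Maximizing this pushes each p_theta(y_n | x_n) toward 1.
objective = sum(math.log(p) for p in pair_probs) / len(pair_probs)
print(objective)
```

Since each probability is at most 1, the objective is at most 0, and training drives it upward toward 0.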

Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling (2014)

A time delay neural network architecture for efficient modeling of long temporal contexts (2015)

Token Passing Model

I found it necessary to understand how the Viterbi algorithm works before this token passing model made sense. A good explanation in Chinese is available here.

Token Passing: a Simple Conceptual Model for Connected Speech Recognition Systems (1989)
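Since the notes above treat the Viterbi algorithm as a prerequisite, here is a minimal sketch of Viterbi decoding on a toy HMM. All states, transition, and emission probabilities below are illustrative, not from the paper; token passing generalizes this single-model trellis search to connected-word networks.

```python
# Minimal Viterbi decoding on a toy HMM; all probabilities are made up.

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most likely state sequence for obs and its probability."""
    # V[t][s] = (prob of best path ending in state s at time t, predecessor)
    V = [{s: (start_p[s] * emit_p[s][obs[0]], None) for s in states}]
    for t in range(1, len(obs)):
        V.append({})
        for s in states:
            prev = max(states, key=lambda r: V[t - 1][r][0] * trans_p[r][s])
            V[t][s] = (V[t - 1][prev][0] * trans_p[prev][s] * emit_p[s][obs[t]],
                       prev)
    # Backtrack from the best final state.
    best = max(states, key=lambda s: V[-1][s][0])
    path = [best]
    for t in range(len(obs) - 1, 0, -1):
        path.append(V[t][path[-1]][1])
    return list(reversed(path)), V[-1][best][0]

# Toy two-state HMM: hidden weather observed through daily activities.
states = ("Rainy", "Sunny")
start_p = {"Rainy": 0.6, "Sunny": 0.4}
trans_p = {"Rainy": {"Rainy": 0.7, "Sunny": 0.3},
           "Sunny": {"Rainy": 0.4, "Sunny": 0.6}}
emit_p = {"Rainy": {"walk": 0.1, "shop": 0.4, "clean": 0.5},
          "Sunny": {"walk": 0.6, "shop": 0.3, "clean": 0.1}}
path, prob = viterbi(("walk", "shop", "clean"), states, start_p, trans_p, emit_p)
print(path, prob)  # ['Sunny', 'Rainy', 'Rainy'] 0.01344
```

In token passing terms, each entry of `V[t]` is a token holding a path score, and the `max` over predecessors is the rule that only the best token entering a state survives.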

Discriminative Training

Discriminative Training for Large Vocabulary Speech Recognition (2003)

Objective functions for training HMMs

$$ \mathcal{F}_{ \text{MLE} }(\lambda) = \displaystyle\sum_{r=1}^{R} \log p_\lambda \left( \mathcal{O}_r | s_r \right), $$ where $s_r$ is the correct transcription of the r-th speech file $\mathcal{O}_r$. $\lambda$ denotes all the parameters of a set of HMMs.

$$ \mathcal{F}_{ \text{MMI} }(\lambda) = \displaystyle\sum_{r=1}^{R} \log \frac{p_\lambda \left( \mathcal{O}_r | s_r \right)^\kappa P\left( s_r \right)^\kappa}{\sum_s p_\lambda \left( \mathcal{O}_r | s \right)^\kappa P\left( s \right)^\kappa}, $$ where $P(s)$ is the language probability for sentence $s$.
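As a rough illustration of how the MMI criterion scores the correct transcription against competing sentences, here is a toy numeric sketch for a single utterance (R = 1). All sentences, likelihoods, language probabilities, and the value of kappa are invented; real systems sum the denominator over lattices rather than a short hypothesis list:

```python
import math

# Acoustic scale kappa, as in the MMI formula above (value is arbitrary here).
kappa = 0.1
correct = "the cat"
# Hypothetical competing sentences: (acoustic likelihood p_lambda(O|s), LM prob P(s)).
hyps = {
    "the cat": (1e-40, 0.5),
    "the hat": (3e-41, 0.3),
    "a cat":   (1e-41, 0.2),
}

def weighted(s):
    """Scaled joint score p_lambda(O|s)^kappa * P(s)^kappa for sentence s."""
    p_ac, p_lm = hyps[s]
    return (p_ac ** kappa) * (p_lm ** kappa)

# F_MMI for this utterance: log posterior of the correct transcription.
f_mmi = math.log(weighted(correct) / sum(weighted(s) for s in hyps))
print(f_mmi)
```

Because the numerator is one term of the denominator sum, F_MMI is always at most 0; maximizing it raises the correct transcription's score *relative to* the competitors, whereas MLE only raises the numerator.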

Why do we need discriminative objective functions?

End-to-end speech recognition using lattice-free MMI (2018)

Weighted Finite State Transducer

Speech Recognition Algorithms Using Weighted Finite-State Transducers

Keyword Spotting

Small-footprint keyword spotting using deep neural networks (2014)

Query-by-example keyword spotting using long short-term memory networks (2015)

Streaming small-footprint keyword spotting using sequence-to-sequence models (2017)

Region Proposal Network Based Small-Footprint Keyword Spotting (2019)

Multi-Task Learning for Voice Trigger Detection (2020)

Far-field Speech Recognition

JHU ASpIRE system: Robust LVCSR with TDNNs, iVector adaptation and RNN-LMs (2015)

Reverberation robust acoustic modeling using i-vectors with time delay neural networks (2015)