www.pudn.com > HMM_HTK-3.0.rar > fundaments.tex


%/* ----------------------------------------------------------- */
%/*                                                             */
%/*                          ___                                */
%/*                       |_| | |_/   SPEECH                    */
%/*                       | | | | \   RECOGNITION               */
%/*                       =========   SOFTWARE                  */ 
%/*                                                             */
%/*                                                             */
%/* ----------------------------------------------------------- */
%/*         Copyright: Microsoft Corporation                    */
%/*          1995-2000 Redmond, Washington USA                  */
%/*                    http://www.microsoft.com                */
%/*                                                             */
%/*   Use of this software is governed by a License Agreement   */
%/*    ** See the file License for the Conditions of Use  **    */
%/*    **     This banner notice must not be removed      **    */
%/*                                                             */
%/* ----------------------------------------------------------- */
%
% HTKBook - Steve Young 1/12/97
%

\mychap{The Fundamentals of \HTK}{fundaments}

\sidepic{overview}{40}{}
\HTK\ is a toolkit for building Hidden Markov Models (HMMs). 
HMMs can be used to model any time series and the core of 
\HTK\ is similarly general-purpose. However, \HTK\ is primarily
designed for building HMM-based speech processing tools,  in
particular recognisers.  Thus, much of the infrastructure
support in \HTK\ is dedicated to this task.  As shown in the picture
alongside, there are two major processing stages involved.  Firstly,
the \HTK\ training tools are used to estimate the parameters of a set
of HMMs using 
training utterances and their associated transcriptions. Secondly,
unknown utterances are transcribed using the \HTK\ recognition tools.


The main body of this book is mostly concerned with the mechanics of these
two processes. However, before launching into detail it is necessary to
understand some of the basic principles of HMMs.  It is also helpful to have
an overview of the toolkit and to have some appreciation of 
how training and recognition in \HTK\ is organised. 

This first part of the book attempts to provide this information. In this
chapter, the basic ideas of HMMs and their use in speech recognition are
introduced.  The following chapter then presents a brief overview
of \HTK\ and, for users of older versions, it highlights the main
differences in version 2.0 and later.  Finally in this tutorial part of the book, chapter 3
describes how a HMM-based speech
recogniser can be built using
\HTK.  It does this by describing the construction of a simple small vocabulary
continuous speech recogniser.

The second part of the book then revisits the topics skimmed over here
and discusses each in detail. This can be 
read in conjunction with the third and final part
of the book which provides a reference manual for \HTK.  This includes
a description of each tool,  summaries of the various
parameters used to configure \HTK\ and a list of the error messages that
it generates when things go wrong.

Finally, note that this book is concerned only with \HTK\ as a tool-kit.
It does not provide  information for using the \HTK\ libraries as a programming
environment.

\mysect{General Principles of HMMs}{genpHMM}

\sidefig{messencode}{50}{Message Encoding/Decoding}{-4}{}
Speech recognition systems generally assume that the speech signal is
a  realisation of some message encoded as a sequence of one or more
symbols (see Fig.~\href{f:messencode}). To effect the reverse
operation of recognising the underlying symbol sequence given a spoken
utterance, the continuous speech waveform is first converted to a
sequence of equally spaced discrete parameter vectors. This sequence
of parameter vectors is assumed to form an exact representation of
the speech waveform on the basis that for the duration covered by a
single vector (typically 10ms or so), the speech waveform can be
regarded as being stationary.  Although this is not strictly true, it
is  a reasonable approximation.  Typical parametric representations in
common use are smoothed spectra or linear prediction coefficients plus
various other representations derived from these.

The r\^{o}le of the recogniser is to effect a mapping between  sequences
of speech vectors and the wanted underlying symbol sequences.  Two
problems make this very difficult.  Firstly, the mapping from symbols
to speech is not one-to-one since different underlying symbols can
give rise to similar speech sounds.  Furthermore, there are large
variations in the realised speech waveform due to speaker variability,
mood, environment, etc.  Secondly, the boundaries between symbols
cannot be identified explicitly from the speech waveform.  Hence, it
is not possible to treat the speech waveform as a sequence of
concatenated static patterns. 

The second problem of not knowing the word boundary locations
can be avoided by restricting the task to
isolated word recognition.  As shown in Fig.~\href{f:isoprob},
this implies that the speech waveform corresponds to a single
underlying symbol (e.g. word) chosen from a fixed vocabulary.
Despite the fact that this
simpler problem is somewhat artificial, it nevertheless has 
a wide range of practical applications.
Furthermore, it serves as a good basis for 
introducing the basic ideas of HMM-based
recognition before dealing with the more complex continuous speech
case.  Hence, isolated word recognition using HMMs will be dealt
with first.

\mysect{Isolated Word Recognition}{isowrdrec}

Let each spoken word be represented by a sequence of speech vectors or {\it
observations} $\bm{O}$,  defined as
\begin{equation}
\bm{O} = \bm{o}_1, \bm{o}_2, \ldots, \bm{o}_T
\end{equation}
where $\bm{o}_t$ is the speech vector observed at time $t$.  The
isolated word recognition problem can then be regarded as that of
computing
\begin{equation}  \label{e:2}
 \arg\max_i \left\{ P(w_i | \bm{O}) \right\}
\end{equation}
where $w_i$ is the $i$'th vocabulary word.  This probability
is not computable directly but using Bayes' Rule\index{Bayes' Rule} gives
\begin{equation} \label{e:3}
  P(w_i | \bm{O}) = \frac{P(\bm{O}|w_i) P(w_i)}{P(\bm{O})}
\end{equation}
Thus, for a given set of prior probabilities $P(w_i)$, the most
probable spoken word depends only on the likelihood $P(\bm{O}|w_i) $.
Given the dimensionality of the observation sequence $\bm{O}$, the
direct estimation of the joint conditional probability 
$P(\bm{o}_1,\bm{o}_2,\ldots | w_i)$ from examples of spoken words
is not practicable. However, if a parametric model of word production
such as a Markov model
is assumed, then estimation from data is possible since the problem
of estimating the class conditional observation densities $P(\bm{O}|w_i)$
is replaced by the much simpler problem of estimating the Markov
model parameters.

\sidefig{isoprob}{50}{Isolated Word Problem}{-4}
In HMM based speech recognition, it is assumed that the sequence of
observed speech vectors corresponding to each word is generated
by a Markov model\index{HMM!definitions} as shown in Fig.~\href{f:markovgen}.
A Markov model is a finite state machine which changes state
once every time unit and each time $t$ that a state $j$ is entered, a
speech vector $\bm{o}_t$ is generated from the probability density
$b_j(\bm{o}_t)$.  Furthermore, the transition from state $i$ to state $j$
is also probabilistic and is governed by the discrete probability $a_{ij}$.
Fig.~\href{f:markovgen} shows an example of this process where the six state
model moves through the state sequence $X=1,2,2,3,4,4,5,6$ in
order to generate the sequence $\bm{o}_1$ to $\bm{o}_6$. Notice that
in \HTK, the entry and exit states of a HMM are non-emitting.  This
is to facilitate the construction of composite models as explained in
more detail later.


The joint probability that $\bm{O}$ is generated by the model $M$ moving
through the state sequence
$X$ is calculated simply as the product of the transition
probabilities and the output probabilities.  So for the state sequence $X$ in
Fig.~\href{f:markovgen}
\begin{equation} \label{e:4}
P(\bm{O},X|M) = a_{12} b_2(\bm{o}_1) a_{22} b_2(\bm{o}_2) a_{23} 
           b_3(\bm{o}_3) \ldots
\end{equation}
However, in practice, only the observation sequence 
$\bm{O}$ is known and the
underlying state sequence $X$ is hidden.  This is why it is
called a {\it Hidden Markov Model}.  

\centrefig{markovgen}{85}{The Markov Generation Model}

Given that $X$ is unknown, the
required likelihood\index{likelihood computation} is computed 
by summing over all possible state
sequences $X = x(1), x(2), x(3), \ldots, x(T)$, that is
\begin{equation}  \label{e:5}
P(\bm{O}|M) = \sum_X a_{x(0)x(1)} \prod_{t=1}^T b_{x(t)}(\bm{o}_t)
a_{x(t)x(t+1)} \end{equation}
where $x(0)$ is constrained to be the model entry state and $x(T+1)$
is constrained to be the model exit state.

As an alternative to equation~\ref{e:5}, the likelihood can be
approximated by only considering the most likely state
sequence, that is
\begin{equation} \label{e:6}
\hat{P}(\bm{O}|M) = \max_X \left\{ 
        a_{x(0)x(1)} \prod_{t=1}^T b_{x(t)}(\bm{o}_t) a_{x(t)x(t+1)}
        \right\}
\end{equation}

Although the direct computation of equations \ref{e:5} and \ref{e:6}
is not tractable, simple recursive procedures exist which allow
both quantities to be calculated very efficiently.
Before going any further, however, notice that if equation~\ref{e:2} is
computable then the recognition problem is solved.  Given a set of models
$M_i$ corresponding to words $w_i$, equation~\ref{e:2} is
solved by using \ref{e:3} and assuming that
\begin{equation} \label{e:7}
P(\bm{O}|w_i) = P(\bm{O}|M_i).
\end{equation}

All this, of course, assumes that the parameters $\{a_{ij}\}$ and
$\{b_{j}(\bm{o}_t)\}$ are known for each model $M_i$.  Herein lies the
elegance and power of the HMM framework.  Given a set of training examples
corresponding to a particular model, the parameters of that model can be
determined automatically by a robust and efficient re-estimation 
procedure.  Thus, provided that a sufficient number of representative
examples of each word can be collected then a HMM can be constructed
which implicitly models all of the many sources of variability inherent
in real speech.  Fig.~\href{f:useforiso} summarises the use of HMMs
for isolated word recognition.  Firstly, a
HMM is trained for each vocabulary word using a number of examples
of that word.  In this case, the vocabulary consists of
just three words: ``one'', ``two'' and ``three''.
Secondly, to recognise some unknown word, the likelihood of 
each model generating that word is calculated and the most likely
model identifies the word.

\centrefig{useforiso}{84}{Using HMMs for Isolated Word Recognition}

\mysect{Output Probability Specification}{outprobspec}

Before\index{output probability!continuous case} the problem of parameter estimation can be discussed in more
detail, the form of the output distributions $\{b_{j}(\bm{o}_t)\}$
needs to be made explicit. \HTK\ is designed primarily for modelling
continuous parameters using continuous density multivariate output
distributions.  It can also handle observation sequences
consisting of discrete symbols in which case, the output distributions
are discrete probabilities.  For simplicity, however, the presentation 
in this chapter will assume that continuous density 
distributions are being used. The minor differences that the use
of discrete probabilities entail are noted in chapter~\ref{c:HMMDefs}
and discussed in more detail in chapter~\ref{c:discmods}.

In common with most other\index{Gaussian mixture}\index{streams}\index{codebooks}
continuous density HMM systems, \HTK\ represents output distributions
by Gaussian Mixture Densities.  
In \HTK, however, a further
generalisation is made.  \HTK\ allows each observation vector at time $t$
to be split into a number of $S$ independent data streams $o_{st}$.  The
formula for computing $b_{j}(\bm{o}_t)$ is then
\begin{equation} \label{e:8}
  b_{j}(\bm{o}_t) = \prod_{s=1}^S \left[
     \sum_{m=1}^{M_s} c_{jsm} {\cal N}(\bm{o}_{st};
                    \bm{\mu}_{jsm}, \bm{\Sigma}_{jsm})
  \right]^{\gamma_s}
\end{equation}
where $M_s$ is the number of mixture components in stream $s$, $c_{jsm}$
is the weight of the $m$'th component and 
${\cal N}(\cdot; \bm{\mu}, \bm{\Sigma})$ is a multivariate Gaussian with
mean vector $\bm{\mu}$ and covariance matrix $\bm{\Sigma}$,
that is\index{stream weight}\index{codebook exponent}
\begin{equation}
{\cal N}(\bm{o}; \bm{\mu}, \bm{\Sigma}) =
       \frac{1}{\sqrt{(2 \pi)^n | \bm{\Sigma} |}} 
       e^{- \frac{1}{2}(\bm{o}-\bm{\mu})' \bm{\Sigma}^{-1}(\bm{o}-\bm{\mu})}
\end{equation}
where $n$ is the dimensionality of $\bm{o}$.

The exponent $\gamma_s$ is a stream weight\footnote{often 
referred to as a codebook exponent.}.  It 
can be used to give a particular stream more emphasis, however,
it can only be set manually.  No current \HTK\ training tools 
can estimate values for it. 

Multiple data streams are used to
enable separate modelling of multiple information sources.  In
\HTK, the processing of streams is completely general.  However,
the speech input modules assume that the 
source data is split into at most 4 streams.  Chapter~\ref{c:speechio}
discusses this in more detail but for now it is sufficient to
remark that the default streams are the
basic parameter vector, first (delta) and second (acceleration)
difference coefficients and log energy.

\mysect{Baum-Welch Re-Estimation}{bwrest}

To determine the parameters of a HMM it is first necessary to make
a rough guess at what they might be.  Once this is done, more
accurate (in the maximum likelihood sense) parameters
can be found by applying the so-called 
Baum-Welch re-estimation\index{Baum-Welch re-estimation}
formulae.  

\sidefig{subsmixrep}{60}{Representing a Mixture}{-4}{
Chapter~\ref{c:Training} gives the formulae used 
in \HTK\ in full detail.  
Here the basis
of the formulae will be presented in a very informal way.
Firstly, it should be noted that the inclusion of multiple data
streams does not alter matters significantly since each stream
is considered to be statistically independent.  Furthermore,
mixture components can be considered to be a special form of 
sub-state in which the transition probabilities are the mixture
weights (see Fig.~\href{f:subsmixrep}).
}

Thus, the essential problem is to estimate the means and 
variances of a HMM in which each state output distribution is a single
component Gaussian, that is
\begin{equation} \label{e:10}
b_j(\bm{o}_t) = \frac{1}{\sqrt{(2 \pi)^n | \bm{\Sigma_j} |}} 
      e^{- \frac{1}{2}(\bm{o}_t - \bm{\mu}_j)'\bm{\Sigma}_j^{-1}(\bm{o}_t - \bm{\mu}_j)}
\end{equation}
If there was just one state $j$ in the HMM, this parameter
estimation would be easy.  The maximum likelihood estimates of 
$\bm{\mu}_j$ and $\bm{\Sigma}_j$ would be just the simple averages, 
that is
\begin{equation} \label{e:11}
   \hat{\bm{\mu}}_j = \frac{1}{T} \sum_{t=1}^{T} \bm{o}_t
\end{equation}
and
\begin{equation} \label{e:12}
   \hat{\bm{\Sigma}}_j = \frac{1}{T} \sum_{t=1}^{T} 
        (\bm{o}_t - \bm{\mu}_j) (\bm{o}_t - \bm{\mu}_j)'
\end{equation}
In practice, of course, there are multiple states and there is no
direct assignment of observation vectors to individual states 
because the underlying state sequence is unknown.  Note, however,
that if some approximate assignment of vectors to states could be made then
equations \ref{e:11} and \ref{e:12} could be used to give the
required initial values for the parameters.  Indeed, this is exactly
what is done in the \HTK\ tool called \htool{HInit}\index{hinit@\htool{HInit}}.  
\htool{HInit} first divides the
training observation vectors equally amongst the model states and then
uses equations \ref{e:11} and \ref{e:12} to give initial values for
the mean and variance of each state.  It then finds the maximum
likelihood state sequence using the Viterbi\index{Viterbi training} 
algorithm described  below,
reassigns the observation vectors to states and then uses
equations \ref{e:11} and \ref{e:12} again to get better initial 
values.  This process is repeated until the estimates
do not change.

Since the full likelihood of each observation sequence
is based on the summation of all possible state sequences,
each observation vector $\bm{o}_t$ contributes to the computation
of the maximum likelihood parameter values for each state $j$.
In other words, instead of assigning each observation vector
to a specific state as in the above approximation, each
observation is assigned to every state in proportion to
the probability of the model being in that state when the
vector was observed.  Thus, if $L_j(t)$ denotes the probability
of being in state $j$ at time $t$ then the 
equations \ref{e:11} and \ref{e:12} given above become the
following weighted averages
\begin{equation} \label{e:13}
   \hat{\bm{\mu}}_j = \frac{ \sum_{t=1}^{T} L_j(t) \bm{o}_t}
                          {\sum_{t=1}^{T} L_j(t)}
\end{equation}
and
\begin{equation} \label{e:14}
   \hat{\bm{\Sigma}}_j = \frac{ \sum_{t=1}^{T} L_j(t)
        (\bm{o}_t - \bm{\mu}_j) (\bm{o}_t - \bm{\mu}_j)' }
                      {\sum_{t=1}^{T} L_j(t)}
\end{equation}
where the summations in the denominators are included to give
the required normalisation.

Equations \ref{e:13} and \ref{e:14} are the 
Baum-Welch re-estimation\index{Baum-Welch re-estimation}
formulae for the means and covariances of a HMM.  A similar but
slightly more complex formula can be derived for the transition
probabilities (see chapter~\ref{c:Training}).

Of course, to apply equations \ref{e:13} and \ref{e:14}, the
probability of state occupation $L_j(t)$ must be calculated.
This is done efficiently using the so-called {\it Forward-Backward}
\index{Forward-Backward algorithm}
algorithm.  Let the forward probability\footnote{
Since the output distributions are densities, these are not
really probabilities but it is a convenient fiction.
}
\index{forward probability} $\alpha_j(t)$ for some model
$M$ with $N$ states be defined as 
\begin{equation} \label{e:15}
   \alpha_j(t) = P(\bm{o}_1,\ldots,\bm{o}_t, x(t)=j | M).
\end{equation}
That is, $\alpha_j(t)$ is the joint probability of observing the
first $t$ speech vectors and being in state $j$ at time $t$.  This
forward probability can be efficiently calculated by the following
recursion
\begin{equation} \label{e:16}
    \alpha_j(t) = \left[ \sum_{i=2}^{N-1} \alpha_i(t-1) a_{ij} \right]
                     b_j(\bm{o}_t).
\end{equation}
This recursion depends on the fact that the probability
of being in state $j$ at time $t$ and seeing observation $\bm{o}_t$
can be deduced by summing the forward probabilities for all
possible predecessor states $i$ weighted by the transition
probability $a_{ij}$.  The slightly odd limits are caused by
the fact that states $1$ and $N$ are non-emitting\footnote{
To understand equations involving a non-emitting state at time $t$, the time
should be thought of as being $t-\delta t$ if it is an entry state, and $t+\delta t$
if it is an exit state.  This becomes important when HMMs are connected together
in sequence so that transitions across non-emitting states take place
\textit{between frames}.
}.   The
initial conditions for the above recursion are
\begin{equation}
    \alpha_1(1) = 1
\end{equation}
\begin{equation}
    \alpha_j(1) = a_{1j} b_j(\bm{o}_1)
\end{equation}
for $1