

%/* ----------------------------------------------------------- */
%/*                                                             */
%/*                          ___                                */
%/*                       |_| | |_/   SPEECH                    */
%/*                       | | | | \   RECOGNITION               */
%/*                       =========   SOFTWARE                  */ 
%/*                                                             */
%/*                                                             */
%/* ----------------------------------------------------------- */
%/*         Copyright: Microsoft Corporation                    */
%/*          1995-2000 Redmond, Washington USA                  */
%/*                    http://www.microsoft.com                */
%/*                                                             */
%/*   Use of this software is governed by a License Agreement   */
%/*    ** See the file License for the Conditions of Use  **    */
%/*    **     This banner notice must not be removed      **    */
%/*                                                             */
%/* ----------------------------------------------------------- */
%
% HTKBook - Steve Young 24/11/97
%
% revised by JBA and VV

\mychap{A Tutorial Example of Using HTK}{exampsys}

\sidepic{recipe}{80}{
This final chapter of the tutorial part of the book will describe the
construction of a recogniser for simple voice dialling applications.  This
recogniser will be designed to recognise continuously spoken digit strings and
a limited set of names.  It is sub-word based so that adding a new name to the
vocabulary involves only modification to the pronouncing dictionary and task
grammar.  The HMMs will be continuous density mixture Gaussian tied-state
triphones with clustering performed using phonetic decision trees.  Although
the voice dialling task itself is quite simple, the system design is
general-purpose and would be useful for a range of applications.  
}

The system will be built from scratch even to the extent of recording training
and test data using the \HTK\ tool \htool{HSLab}.  To make this tractable, the
system will be speaker dependent\footnote{The final stage of the tutorial deals
with adapting the speaker dependent models for new speakers.}, but the same design
would be followed to build a speaker independent system.  The only differences
are that data would be required from a large number of speakers and that there
would be a consequential increase in model complexity.

Building a speech recogniser from scratch involves a number of inter-related
subtasks and pedagogically it is not obvious what the best order is in which to
present them. In the presentation here, the ordering is chronological so that in
effect the text provides a recipe that could be followed to construct a similar
system.  The entire process is described in considerable detail in order to give a
clear view of the range of functions that \HTK\ addresses and thereby to
motivate the rest of the book.

The \HTK\ software distribution also contains an example of constructing a
recognition system for the 1000 word ARPA Naval Resource Management Task. This
is contained in the directory \texttt{RMHTK} of the \HTK\ distribution.
Further demonstration of \HTK's capabilities can be found in the directory 
\texttt{HTKDemo}. Some example scripts that may be of assistance during the 
tutorial are available in the \texttt{HTKTutorial} directory.

At each step of the tutorial presented in this chapter, the user is advised to
thoroughly read the entire section before executing the commands, and also to
consult the reference section for each \HTK\ tool being introduced
(chapter~\ref{c:toolref}), so that all command line options and arguments are
clearly understood.

\mysect{Data Preparation}{egdataprep}

The first stage of any recogniser development project is data preparation.
\index{data preparation}  Speech data is needed both for training and for
testing.  In the system to be built here, all of this speech will be recorded
from scratch and to do this scripts are needed to prompt for each sentence.  In
the case of the test data, these prompt scripts will also provide the reference
transcriptions against which the recogniser's performance can be measured and a
convenient way to create them is to use the task grammar as a random generator.
In the case of the training data, the prompt scripts will be used in
conjunction with a pronunciation dictionary to provide the initial phone level
transcriptions needed to start the HMM training process.  Since the application
requires that arbitrary names can be added to the recogniser, training data
with good phonetic balance and coverage is needed.  Here for convenience the
prompt scripts needed for training are taken from the TIMIT acoustic-phonetic
database.

It follows from the above that before the data can be recorded, a phone set
must be defined, a dictionary must be constructed to cover both training and
testing and a task grammar must be defined.

\subsection{Step 1 - the Task Grammar}

The goal of the system to be built here is to provide a voice-operated
interface for phone dialling. Thus, the recogniser must handle digit strings
and also personal name lists. Examples of typical inputs might be
\begin{quote}
Dial three three two six five four

Dial nine zero four one oh nine

Phone Woodland

Call Steve Young
\end{quote}

\HTK\ provides a grammar definition language for
specifying simple task grammars\index{task grammar} such as this.  It consists
of a set of variable definitions followed by a regular 
expression describing the words to recognise.  For the
voice dialling application, a suitable grammar might be
\begin{verbatim}
    $digit = ONE | TWO | THREE | FOUR | FIVE |
             SIX | SEVEN | EIGHT | NINE | OH | ZERO;
    $name  = [ JOOP ] JANSEN |
             [ JULIAN ] ODELL |
             [ DAVE ] OLLASON |
             [ PHIL ] WOODLAND | 
             [ STEVE ] YOUNG;
    ( SENT-START ( DIAL <$digit> | (PHONE|CALL) $name) SENT-END )
\end{verbatim}
where the vertical bars denote alternatives, the square brackets denote
optional items and the angle brackets denote one or more repetitions.  The
complete grammar can be depicted as a network as shown in
Fig.~\href{f:dialnet}.

\centrefig{dialnet}{110}{Grammar for Voice Dialling}

\sidefig{step1}{25}{Step 1}{-4}{
The above high level representation of a task grammar
is provided for user convenience.  The \HTK\ recogniser actually 
requires a
word network to be defined  using a low level notation
called \HTK\ Standard Lattice Format\index{standard lattice format} (SLF)
\index{SLF}
in which each word instance and each word-to-word transition
is listed explicitly.  This word network can be created 
automatically from the grammar above using 
the \htool{HParse}
tool, thus assuming that the file \texttt{gram} contains the
above grammar, executing
}\index{hparse@\htool{HParse}}
\begin{verbatim}
    HParse gram wdnet
\end{verbatim}
will create an equivalent word network in 
the file \texttt{wdnet} (see Fig.~\href{f:step1}).


\subsection{Step 2 - the Dictionary}

The first step in building a dictionary is to create a sorted list of the
required words. 
In the telephone dialling task pursued here, it is quite easy to create a list
of required words by hand. However, if the task were more complex, it would be
necessary to build a word list from the sample sentences present in the training
data. Furthermore, to build robust acoustic models, it is necessary to train
them on a large set of sentences that contains many words and is preferably
phonetically balanced. For these reasons, the training data will consist of
English sentences unrelated to the telephone dialling task. Below, a short
example of creating a word list from sentence prompts will be given. As noted
above, the training sentences given here are extracted from some prompts used
with the TIMIT database\index{TIMIT database} and for convenience they
have been renumbered. For example, the first few items might be as follows
\vspace{1cm}
\begin{verbatim}
    S0001 ONE VALIDATED ACTS OF SCHOOL DISTRICTS
    S0002 TWO OTHER CASES ALSO WERE UNDER ADVISEMENT
    S0003 BOTH FIGURES WOULD GO HIGHER IN LATER YEARS
    S0004 THIS IS NOT A PROGRAM OF SOCIALIZED MEDICINE
    etc
\end{verbatim}
The desired training word list\index{word list} (\texttt{wlist}) could then be
extracted automatically from these.  Before using HTK, one would need to edit
the text into a suitable format.  For example, it would be necessary to change
all white space to newlines and then to use the UNIX utilities \texttt{sort}
and \texttt{uniq} to sort the words into a unique alphabetically ordered set,
with one word per line.  The script \texttt{prompts2wlist} from the
\texttt{HTKTutorial} directory can be used for this purpose.
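
As a rough illustration of the manual route, the following shell function
sketches one way to do this.  It is not the \texttt{prompts2wlist} script
itself: the function name \texttt{prompts\_to\_wlist} is hypothetical, and it
assumes that each prompt line begins with an utterance label (such as
\texttt{S0001}) which should not appear in the word list.
\begin{verbatim}
    # Sketch: build a sorted, unique word list from a prompt file.
    # Assumes lines of the form "LABEL WORD WORD ...".
    # awk drops the label and prints one word per line;
    # sort -u then orders the words and removes duplicates.
    prompts_to_wlist() {
        awk '{ for (i = 2; i <= NF; i++) print $i }' |
        LC_ALL=C sort -u
    }

    # Example use: prompts_to_wlist < prompts > wlist
\end{verbatim}
The \texttt{C} locale is chosen here only to make the sort order predictable
across systems.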

The dictionary\index{dictionary!construction}\index{dictionary!format}  
itself can be built from a standard source 
using \htool{HDMan}\index{hdman@\htool{HDMan}}.
For this example, the British English BEEP pronouncing dictionary will be
used\footnote{Available by anonymous ftp from 
\texttt{svr-ftp.eng.cam.ac.uk/comp.speech/dictionaries/beep.tar.gz}.
Note that items beginning with unmatched quotes, found at the start
of the dictionary, should be removed.}.  
Its phone set will be adopted without modification except that 
the stress marks will be removed and a short-pause (\texttt{sp}) will
be added to the end of every pronunciation. If the dictionary contains any
silence markers then the \texttt{MP} command will merge the \texttt{sil} and 
\texttt{sp} phones into a single \texttt{sil}. These changes can be applied 
using \htool{HDMan} and an edit script (stored in \texttt{global.ded})
containing the three commands
\begin{verbatim}
   AS sp
   RS cmu
   MP sil sp sil
\end{verbatim}
where \texttt{cmu} refers to a style of stress marking\index{stress marking} in which 
the lexical stress level is
marked by a single digit appended to the phone name (e.g.\ \texttt{eh2} means
the phone \texttt{eh} with level 2 stress). 

\centrefig{step2}{100}{Step 2}

\noindent
The command
\begin{verbatim}
    HDMan -m -w wlist -n monophones1 -l dlog dict beep names
\end{verbatim}
will create a new dictionary called \texttt{dict} by searching the source
dictionaries \texttt{beep} and \texttt{names} to find pronunciations for each
word in \texttt{wlist} (see Fig.~\href{f:step2}). Here, the \texttt{wlist} in
question needs only to be a sorted list of the words appearing in the task
grammar given above.

Note that \texttt{names} is a manually constructed file containing
pronunciations for the proper names used in the task grammar. The option
\texttt{-l} instructs \htool{HDMan} to output a log file \texttt{dlog} which 
contains various statistics about the constructed dictionary. In particular,
it indicates if there are words missing. \htool{HDMan} can also output a list
of the phones used, here called \texttt{monophones1}. Once training and test
data has been recorded, an HMM will be estimated for each of these phones.

The general format of each dictionary entry\index{dictionary!entry} is
\begin{verbatim}
    WORD [outsym] p1 p2 p3 ....
\end{verbatim}
which means that the word \texttt{WORD} is pronounced as the sequence of phones
\texttt{p1 p2 p3 ...}.  The string in square brackets specifies the string to
output when that word is recognised.  If it is omitted then the word itself is
output.  If it is included but empty, then nothing is output.

To see what the dictionary is like, here are a few entries.
\begin{verbatim}
    A               ah sp
    A               ax sp
    A               ey sp
    CALL            k ao l sp
    DIAL            d ay ax l sp
    EIGHT           ey t sp
    PHONE           f ow n sp
    SENT-END    []  sil
    SENT-START  []  sil
    SEVEN           s eh v n sp
    TO              t ax sp
    TO              t uw sp
    ZERO            z ia r ow sp
\end{verbatim}
Notice that function words such as \texttt{A} and \texttt{TO}
have multiple pronunciations.
The entries for \texttt{SENT-START} and \texttt{SENT-END} have a silence
model \texttt{sil} as their pronunciations and null output symbols.  

\subsection{Step 3 - Recording the Data}

The\index{recording speech} training and test data will be recorded using the
\HTK\ tool \htool{HSLab}\index{hslab@\htool{HSLab}}. This is a combined 
waveform recording and labelling tool. In this example \htool{HSLab} will be
used just for recording, as labels already exist. However, if you do not have
pre-existing training sentences (such as those from the TIMIT database) you can
create them either from pre-existing text (as described above) or by labelling
your training utterances using \htool{HSLab}. \htool{HSLab} is invoked by typing
\begin{verbatim}
    HSLab noname
\end{verbatim}
This will cause a window to appear with a waveform display area in the upper
half and a row of buttons, including a record button, in the lower half.  When
the name of a normal file is given as argument, \htool{HSLab} displays its
contents.  Here, the special file name \texttt{noname} indicates that new data
is to be recorded. \htool{HSLab} makes no special provision for prompting the
user.  However, each time the record button is pressed, it writes the
subsequent recording alternately to a file called \texttt{noname0.wav} and to a
file called \texttt{noname1.wav}.  Thus, it is simple to write a shell script
which, for each successive line of a prompt file, outputs the prompt, waits for
either \texttt{noname0.wav} or \texttt{noname1.wav} to appear, and then renames
the file to the name which prefixes the prompt (see Fig.~\href{f:step3}).
\index{extensions!wav@\texttt{wav}}
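
A minimal sketch of such a prompting script is given below.  The function name
\texttt{record\_prompts} is hypothetical, and the sketch assumes that
\htool{HSLab} is running in the current directory and that each prompt line has
the form of a label followed by the words to be spoken.
\begin{verbatim}
    # Sketch: prompt for each line of a prompt file while HSLab records.
    # $1 = prompt file with lines of the form "LABEL WORD WORD ..."
    record_prompts() {
        while read -r label words; do
            echo "Please say: $words"
            # wait until HSLab has written one of its two output files
            until [ -e noname0.wav ] || [ -e noname1.wav ]; do
                sleep 1
            done
            # rename whichever file appeared to LABEL.wav
            for f in noname0.wav noname1.wav; do
                if [ -e "$f" ]; then
                    mv "$f" "$label.wav"
                    break
                fi
            done
        done < "$1"
    }

    # Example use: record_prompts prompts
\end{verbatim}
Note that this sketch renames a file as soon as it exists.  A more careful
script would probably also check that \htool{HSLab} has finished writing it.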

While the prompts for the training sentences were provided above, the
prompts for the test sentences need to be generated before recording them.
The tool\index{prompt script!generationof}\index{hsgen@\htool{HSGen}}
\htool{HSGen} can be used to do this by randomly traversing a word network and 
outputting each word encountered. For example, typing
\begin{verbatim}
    HSGen -l -n 200 wdnet dict
\end{verbatim}
would generate 200 numbered test utterances, the first few of which would
look something like
\begin{verbatim}
    1.  PHONE YOUNG  
    2.  DIAL OH SIX SEVEN SEVEN OH ZERO
    3.  DIAL SEVEN NINE OH OH EIGHT SEVEN NINE NINE
    4.  DIAL SIX NINE SIX TWO NINE FOUR ZERO NINE EIGHT  
    5.  CALL JULIAN ODELL
    ... etc
\end{verbatim}
These can be used to construct a prompt script
for the required test data.
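
One possible way of doing this is sketched below.  The function name
\texttt{hsgen\_to\_prompts} and the \texttt{T0001} style labels are
arbitrary choices made for this example, and the usage line assumes that
\htool{HSGen} writes its utterances to standard output.
\begin{verbatim}
    # Sketch: turn numbered HSGen output into labelled prompt lines.
    hsgen_to_prompts() {
        awk '{
            sub(/^[ \t]*[0-9]+\.[ \t]*/, "")   # strip the "1.  " prefix
            sub(/[ \t]+$/, "")                 # strip trailing blanks
            if ($0 != "") printf "T%04d %s\n", ++n, $0
        }'
    }

    # Example use:
    #   HSGen -l -n 200 wdnet dict | hsgen_to_prompts > testprompts
\end{verbatim}
The resulting lines have the same label-plus-words layout as the training
prompts, so the same recording procedure can be reused.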

\subsection{Step 4 - Creating the Transcription Files}

\sidefig{step3}{50}{Step 3}{-4}{}
To train a set of HMMs, every file of training data must have an associated
phone level transcription.  Since there is no hand labelled data to bootstrap a
set of models, a flat-start scheme will be used instead.  To do this, two sets
of phone transcriptions will be needed.  The set used initially will have no
short-pause (\texttt{sp}) models between words.  Then once reasonable phone
models have been generated, an \texttt{sp} model will be inserted between words
to take care of any pauses introduced by the speaker.\index{flat start}

The starting point for both sets of phone transcription is an
orthographic\index{transcription!orthographic} transcription in \HTK\ label
format.  This can be created fairly easily using a text editor or a scripting
language.
An example of this is found in the RM Demo at point 0.4. Alternatively, the
script \texttt{prompts2mlf} has been provided in the \texttt{HTKTutorial}
directory.
The effect should be to convert the prompt utterances shown above into the
following form:
\begin{verbatim}
    #!MLF!#
    "*/S0001.lab"
    ONE 
    VALIDATED 
    ACTS 
    OF 
    SCHOOL 
    DISTRICTS
    .
    "*/S0002.lab"
    TWO 
    OTHER 
    CASES 
    ALSO 
    WERE 
    UNDER 
    ADVISEMENT
    .
    "*/S0003.lab" 
    BOTH 
    FIGURES 
    (etc.)
\end{verbatim}
As can be seen, the prompt labels need to be converted into path names, each
word should be written on a single line and each utterance should be terminated
by a single period on its own.  The first line of the file just identifies the
file as a \textit{Master Label File} (MLF).  This is a single file containing a
complete set of transcriptions.  \HTK\ allows each individual transcription to
be stored in its own file but it is more efficient to use an MLF.
\index{master label files}\index{MLF}
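
The conversion just described can be sketched in a few lines of shell.  This
is an illustration only and not the \texttt{prompts2mlf} script itself.  The
function name \texttt{prompts\_to\_mlf} is hypothetical and the input is
assumed to consist of lines holding a label followed by the words.
\begin{verbatim}
    # Sketch: convert "LABEL WORD WORD ..." lines into a word level MLF.
    prompts_to_mlf() {
        awk 'BEGIN { print "#!MLF!#" }
             NF > 0 {
                 printf "\"*/%s.lab\"\n", $1          # label -> path pattern
                 for (i = 2; i <= NF; i++) print $i   # one word per line
                 print "."                            # end of utterance
             }'
    }

    # Example use: prompts_to_mlf < prompts > words.mlf
\end{verbatim}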

The form of the path name used in the MLF deserves some explanation since it is
really a \textit{pattern} and not a name.\index{master label files!patterns}
When \HTK\ processes speech files, it expects to find a transcription (or 
{\it label file}) with the same name but a different extension.  Thus, if the file
\texttt{/root/sjy/data/S0001.wav} was being processed, \HTK\ would look for a
label file called \texttt{/root/sjy/data/S0001.lab}.  When MLF files are used,
\HTK\ scans the file for a pattern which matches the required label file name.
However, an asterisk will match any character string and hence the pattern used
in the example is in effect path independent.  It therefore allows the same
transcriptions to be used with different versions of the speech data
stored in different locations.

Once the word level MLF has been created, phone level MLFs can be generated
using the label editor \htool{HLEd}\index{hled@\htool{HLEd}}. For example,
assuming that the above word level MLF is stored in the file
\texttt{words.mlf}, the command
\begin{verbatim}
    HLEd -l '*' -d dict -i phones0.mlf mkphones0.led words.mlf
\end{verbatim}
will generate a phone level transcription of the following form
where the \texttt{-l} option is needed to generate the path '\verb+*+' in the 
output patterns.
\begin{verbatim}
    #!MLF!#
    "*/S0001.lab"
    sil
    w
    ah
    n
    v
    ae
    l
    ih
    d
    .. etc
\end{verbatim}
This process is illustrated in Fig.~\href{f:step4}.

The \htool{HLEd} edit script \texttt{mkphones0.led} 
contains the following commands
\begin{verbatim}
   EX
   IS sil sil
   DE sp
\end{verbatim}
The expand \texttt{EX} command replaces each word in \texttt{words.mlf} 
by the corresponding pronunciation in the dictionary file \texttt{dict}.  
The \texttt{IS}
command inserts a silence model \texttt{sil} at the start and end of
every utterance.  Finally, the delete \texttt{DE} command deletes all
short-pause \texttt{sp} labels, which are not wanted in the transcription
labels at this point.  

\centrefig{step4}{60}{Step 4}

\subsection{Step 5 - Coding the Data}

The final stage of data preparation is to parameterise the raw speech
waveforms into sequences of feature vectors.  \HTK\ supports both 
FFT-based\index{analysis!FFT-based}
and LPC-based\index{analysis!LPC-based} analysis.  
Here Mel Frequency Cepstral Coefficients (MFCCs)\index{MFCC coefficients},
which are derived from FFT-based log spectra, will be used.

Coding can be performed using the tool \htool{HCopy}\index{hcopy@\htool{HCopy}} 
configured to\index{coding}
automatically convert its input into MFCC vectors.  To do this, a configuration
file (\texttt{config}) is needed which specifies all of the conversion 
parameters\index{parameterisation}. 
Reasonable settings for these are as follows
\begin{verbatim}
    # Coding parameters
    TARGETKIND = MFCC_0
    TARGETRATE = 100000.0
    SAVECOMPRESSED = T
    SAVEWITHCRC = T
    WINDOWSIZE = 250000.0
    USEHAMMING = T
    PREEMCOEF = 0.97
    NUMCHANS = 26
    CEPLIFTER = 22
    NUMCEPS = 12
    ENORMALISE = F
\end{verbatim}
Some of these settings are in fact the defaults, but they
are given explicitly here for completeness.  In brief, they specify
that the target parameters are to be MFCC using $C_0$ as the energy
component, the frame period is 10msec (\HTK\ uses units of 100ns),
the output should be saved in compressed format, and a CRC checksum should
be added.  The FFT should use a 25msec Hamming window and the signal should
have first order preemphasis applied using a coefficient of 0.97.
The filterbank should have 26 channels and 12 MFCC coefficients should
be output. 
The variable \texttt{ENORMALISE} is by default true and performs energy
normalisation on recorded audio files. It cannot be used with live audio and
since the target system is for live audio, this variable should be set to
false.
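
Since \HTK\ expresses times in units of 100ns, the two time-valued settings in
the configuration file above convert to more familiar units as follows.  This
is simple arithmetic on the values given, shown here only as a worked check.
\[
\mathtt{TARGETRATE}: \; 100000 \times 100\,\mathrm{ns} = 10\,\mathrm{ms},
\qquad
\mathtt{WINDOWSIZE}: \; 250000 \times 100\,\mathrm{ns} = 25\,\mathrm{ms}
\]
Hence successive analysis windows overlap by $25 - 10 = 15$msec.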

Note that explicitly creating coded data files is not necessary, as coding can
be done ``on-the-fly'' from the original waveform files by specifying the
appropriate configuration file (as above) with the relevant HTK tools. However,
creating these files reduces the amount of preprocessing required during
training, which itself can be a time-consuming process.

To run \htool{HCopy},  a list of
each source file and its corresponding output file is needed.  For example,
the first few lines might look like\index{extensions!mfc@\texttt{mfc}}
\begin{verbatim}
    /root/sjy/waves/S0001.wav /root/sjy/train/S0001.mfc
    /root/sjy/waves/S0002.wav /root/sjy/train/S0002.mfc
    /root/sjy/waves/S0003.wav /root/sjy/train/S0003.mfc
    /root/sjy/waves/S0004.wav /root/sjy/train/S0004.mfc
    (etc.)
\end{verbatim}
Files containing lists of files are referred to as script files\footnote{
Not to be confused with files containing \textit{edit} scripts
}
and\index{extensions!scp@\texttt{scp}}
by convention are given the extension \texttt{scp} (although 
\HTK\ does not demand this).  Script files are specified using the standard
\texttt{-S} option and their contents are read simply as extensions
to the command line.  Thus, they avoid the need for command lines with
several thousand arguments\footnote{
Most UNIX shells, especially the C shell, only allow a limited and
quite small number of arguments.}.
\index{command line!arguments}\index{command line!script files}
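
A script file of this kind need not be typed by hand.  The following sketch
generates one from a directory of waveform files.  The function name
\texttt{make\_code\_scp} is hypothetical and the directory names in the usage
line are simply those of the example above.
\begin{verbatim}
    # Sketch: list each waveform file alongside its coded output file.
    # $1 = directory of .wav files, $2 = directory for .mfc files
    make_code_scp() {
        for wav in "$1"/*.wav; do
            base=$(basename "$wav" .wav)
            echo "$wav $2/$base.mfc"
        done
    }

    # Example use:
    #   make_code_scp /root/sjy/waves /root/sjy/train > codetr.scp
\end{verbatim}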

\centrefig{step5}{100}{Step 5}

\noindent
Assuming that the above script is stored in the file \texttt{codetr.scp},
the training data would be coded by executing
\begin{verbatim}
    HCopy -T 1 -C config -S codetr.scp
\end{verbatim}
This is illustrated in Fig.~\href{f:step5}.
A similar procedure is used to code the test data after which
all of the pieces are in place to start training 
the HMMs.
 

\mysect{Creating Monophone HMMs}{egcreatmono}

In this section, the creation of a well-trained set of single-Gaussian
monophone HMMs will be described.  The starting point will be
a set of identical monophone HMMs in which every mean and variance is
identical.  These are then retrained, short-pause models are
added and the silence model is extended slightly.  The monophones
are then retrained.

Some of the dictionary entries have multiple pronunciations.  However,
when \htool{HLEd} was used to expand the word level MLF to create the
phone level MLFs, it arbitrarily selected the first pronunciation it found.
Once reasonable monophone HMMs have been created, the recogniser tool
\htool{HVite} can be used to perform a \textit{forced alignment} 
of\index{forced alignment}
the training data.  By this means, a new phone level MLF is created in which
the choice of pronunciations depends on the acoustic evidence.  This new
MLF can be used to perform a final re-estimation of the monophone HMMs.
\index{monophone HMM!construction of}

\subsection{Step 6 - Creating Flat Start Monophones}

The first step in HMM training is to define a prototype model.  The
parameters of this model are not important; its purpose is to
define the model topology.  For phone-based systems, a good
topology to use is 3-state left-right with no skips such as the following
\begin{verbatim}
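    ~o <VecSize> 39 <MFCC_0_D_A>
    ~h "proto"
    <BeginHMM>
      <NumStates> 5
      <State> 2
         <Mean> 39
           0.0 0.0 0.0 ...
         <Variance> 39
           1.0 1.0 1.0 ...
      <State> 3
         <Mean> 39
           0.0 0.0 0.0 ...
         <Variance> 39
           1.0 1.0 1.0 ...
      <State> 4
         <Mean> 39
           0.0 0.0 0.0 ...
         <Variance> 39
           1.0 1.0 1.0 ...
      <TransP> 5
       0.0 1.0 0.0 0.0 0.0
       0.0 0.6 0.4 0.0 0.0
       0.0 0.0 0.6 0.4 0.0
       0.0 0.0 0.0 0.7 0.3
       0.0 0.0 0.0 0.0 0.0
    <EndHMM>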
\end{verbatim}
where each ellipsed vector is of length 39.  This number, 39, is computed from
the length of the parameterised static vector (\texttt{MFCC\_0} = 13) plus
the delta coefficients (+13) plus the acceleration coefficients (+13).
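Spelled out, with the 12 cepstral coefficients requested by \texttt{NUMCEPS}
and the additional $C_0$ term, the count is
\[
\underbrace{(12 + 1)}_{\mbox{static}} +
\underbrace{13}_{\mbox{delta}} +
\underbrace{13}_{\mbox{acceleration}} = 39 .
\]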

The \HTK\ tool \htool{HCompV}\index{hcompv@\htool{HCompV}} will scan a set of data files, compute
the global mean and variance and set all of the Gaussians in a given HMM
to have the same mean and variance.\index{flat start}
Hence, assuming that a list of all the training files is stored in
\texttt{train.scp}, the command
\begin{verbatim}
    HCompV -C config -f 0.01 -m -S train.scp -M hmm0 proto
\end{verbatim}
will create a new version of \texttt{proto} in the directory \texttt{hmm0}
in which the zero means and unit variances above have been replaced
by the global speech means and variances.
Note that the prototype HMM defines the parameter kind as \texttt{MFCC\_0\_D\_A}.
This means that delta and acceleration coefficients are to be computed and
appended to the static MFCC coefficients computed and stored during the
coding process described above.  To ensure that these are computed during loading,
the configuration file \texttt{config} should be modified
to change the target kind, i.e.\ the configuration file entry for
\texttt{TARGETKIND} should be changed to
\begin{verbatim}
   TARGETKIND = MFCC_0_D_A
\end{verbatim}
\htool{HCompV} has a number of options specified for it.  The 
\texttt{-f} option causes a variance floor 
macro\index{variance floor macros} (called \texttt{vFloors}) to be generated which
is equal to 0.01 times the global variance.  This is a vector
of values which will be used to set a floor on the variances estimated
in the subsequent steps.  The \texttt{-m} option asks for means to be computed
as well as variances.  Given this
new prototype model stored in the directory
\texttt{hmm0}, a \textit{Master Macro File}\index{master macro files} 
(MMF) called \texttt{hmmdefs} \index{MMF}
containing a copy for each of the required monophone HMMs is constructed 
by manually copying the prototype and relabeling it for each required 
monophone.  
The format of an MMF is similar to that
of an MLF and it serves a similar purpose in that it avoids having
a large number of individual HMM definition files\index{HMM!definition files} 
(see Fig.~\href{f:MMFeg}).

\centrefig{MMFeg}{85}{Form of Master Macro Files}
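
The copying and relabelling can also be scripted.  The sketch below is one
possible approach rather than a prescribed method.  The function name
\texttt{make\_hmmdefs} is hypothetical, and it assumes that the prototype
file contains a line beginning with \verb+~h "proto"+ and that everything from
that line onwards is the model definition.
\begin{verbatim}
    # Sketch: clone the prototype once per phone to build an MMF.
    # $1 = prototype file (e.g. hmm0/proto)
    # $2 = phone list, one name per line (e.g. monophones0)
    make_hmmdefs() {
        while read -r phone; do
            # copy from the ~h line onwards, renaming the model
            sed -n '/^~h/,$p' "$1" |
            sed "s/^~h \"proto\"/~h \"$phone\"/"
        done < "$2"
    }

    # Example use: make_hmmdefs hmm0/proto monophones0 > hmm0/hmmdefs
\end{verbatim}
Lines before the \verb+~h+ line, such as the global options, are deliberately
skipped so that they are not repeated for every model.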

The flat start monophones stored in the directory \texttt{hmm0} are
re-estimated using the embedded re-estimation\index{embedded re-estimation} 
tool \htool{HERest}\index{herest@\htool{HERest}}
invoked as follows
\begin{verbatim}
   HERest -C config -I phones0.mlf -t 250.0 150.0 1000.0 \
    -S train.scp -H hmm0/macros -H hmm0/hmmdefs -M hmm1 monophones0
\end{verbatim}
The effect of this is to load all the models in \texttt{hmm0} which are
listed in
the model list \texttt{monophones0} (\texttt{monophones1} less the short 
pause (\texttt{sp}) model). These are then re-estimated using the data
listed in \texttt{train.scp} and the new model set is stored in the
directory \texttt{hmm1}.
Most of the files used in this invocation of \htool{HERest} have 
already been described.  The exception is the file \texttt{macros}.
This should contain a so-called \textit{global options} macro and
the variance floor macro \texttt{vFloors} generated earlier.  The global options macro
simply defines the HMM parameter kind and the vector size i.e.
\begin{verbatim}
   ~o <VecSize> 39 <MFCC_0_D_A>
\end{verbatim}
See Fig.~\href{f:MMFeg}. This can be combined with \texttt{vFloors} into a text file
called \texttt{macros}.
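
As a sketch, the two parts could be combined as follows.  The function name
\texttt{make\_macros} is hypothetical, the global options line is written in
\HTK's HMM definition syntax for a 39 element \texttt{MFCC\_0\_D\_A} vector,
and the usage line assumes that \htool{HCompV} wrote \texttt{vFloors} into
the directory \texttt{hmm0}.
\begin{verbatim}
    # Sketch: build the macros file from a global options line
    # followed by the variance floor macro.
    # $1 = vFloors file generated by HCompV
    make_macros() {
        echo '~o <VecSize> 39 <MFCC_0_D_A>'
        cat "$1"
    }

    # Example use: make_macros hmm0/vFloors > hmm0/macros
\end{verbatim}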

\centrefig{step6}{85}{Step 6}

The \texttt{-t} option sets the pruning\index{pruning} thresholds to be used during
training.  Pruning limits the range of state alignments that the
forward-backward algorithm includes in its summation and it
can reduce the amount of computation required by an
order of magnitude.  For most training files, a very tight pruning threshold
can be set.  However, some training files will provide poorer acoustic
matching and in consequence a wider pruning beam is needed.  \htool{HERest}
deals with this by having an auto-incrementing pruning threshold.  In the
above example, pruning is normally 250.0.  If re-estimation fails on any
particular file, the threshold is increased by 150.0 and the file is
reprocessed.  This is repeated until either the file is successfully
processed or the pruning limit of 1000.0 is exceeded.  At this point it 
is safe to assume that there
is a serious problem with the training file and hence the fault should be fixed
(typically it will be an incorrect transcription) or the training file should be discarded.
The process leading to the initial set of monophones in the directory
\texttt{hmm0} is illustrated in Fig.~\href{f:step6}.

Each time \htool{HERest} is run it performs a single re-estimation.  Each new
HMM set is stored in a new directory.  Execution of \htool{HERest} should be
repeated twice more, changing the name of the input and output directories (set
with the options \texttt{-H} and \texttt{-M}) each time, until the directory
\texttt{hmm3} contains the final set of initialised monophone HMMs.
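
The two further passes can be written as a short loop.  This is only a sketch
using the same options as the command above.  The function name
\texttt{reestimate\_twice} is hypothetical and the output directories are
created here on the assumption that they do not yet exist.
\begin{verbatim}
    # Sketch: second and third passes of HERest (hmm1 -> hmm2 -> hmm3).
    reestimate_twice() {
        for i in 1 2; do
            next=$((i + 1))
            mkdir -p "hmm$next"
            HERest -C config -I phones0.mlf -t 250.0 150.0 1000.0 \
                -S train.scp -H "hmm$i/macros" -H "hmm$i/hmmdefs" \
                -M "hmm$next" monophones0 || return 1
        done
    }
\end{verbatim}
The loop stops at the first failing pass so that a later pass is never run on
a missing model set.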

\subsection{Step 7 - Fixing the Silence Models}

\sidefig{egsils}{55}{Silence Models}{-4}{
The previous step has generated a 3 state left-to-right HMM for each
phone and also an HMM for the silence model\index{silence model} \texttt{sil}.  The 
next step is to add extra transitions from states 2 to 4 and from
states 4 to 2\index{transitions!adding them}
in the silence model.  The idea here is to make the model more robust
by allowing individual states to absorb the various
impulsive noises in the training data.  The backward skip allows this to happen
without committing the model to transit to the following word.

Also, at this point, a 1 state
short pause\index{short pause} \texttt{sp} model should be created.  
This should be a so-called \textit{tee-model}\index{tee-models}
which has a direct transition from entry to exit node.
This \texttt{sp} has its emitting state tied to the centre state of the silence model.
The required topology of the two silence models is shown in Fig.~\href{f:egsils}.
}

These silence models can be created in two stages
\begin{itemize}
\item 
Use a text editor on the file \texttt{hmm3/hmmdefs} to copy the centre state of
the \texttt{sil} model to
make a new \texttt{sp} model and store the resulting MMF \texttt{hmmdefs}, which 
includes the new \texttt{sp} model, in the new directory \texttt{hmm4}. 

\item Run the HMM editor \htool{HHEd}\index{hhed@\htool{HHEd}} to add 
the extra transitions required
and tie the \texttt{sp} state to the centre \texttt{sil} state
\end{itemize}
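
After the first stage, the new \texttt{sp} entry in \texttt{hmm4/hmmdefs}
might look something like the following sketch.  The mean and variance vectors
(elided here) are those copied from state 3 of \texttt{sil}.  The transition
probabilities shown are illustrative assumptions only, since they will be
modified by \htool{HHEd} and re-estimated in the passes that follow.
\begin{verbatim}
    ~h "sp"
    <BeginHMM>
      <NumStates> 3
      <State> 2
         <Mean> 39
           ... (copied from state 3 of sil)
         <Variance> 39
           ... (copied from state 3 of sil)
      <TransP> 3
       0.0 1.0 0.0
       0.0 0.9 0.1
       0.0 0.0 0.0
    <EndHMM>
\end{verbatim}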

\htool{HHEd} works in a similar way to \htool{HLEd}.  It applies a set of commands in
a script to modify a set of HMMs.  In this case, it is executed as follows
\begin{verbatim}
    HHEd -H hmm4/macros -H hmm4/hmmdefs -M hmm5 sil.hed monophones1
\end{verbatim}
where \texttt{sil.hed} contains the following commands
\begin{verbatim}
    AT 2 4 0.2 {sil.transP}
    AT 4 2 0.2 {sil.transP}
    AT 1 3 0.3 {sp.transP}
    TI silst {sil.state[3],sp.state[2]}
\end{verbatim}
The \texttt{AT}\index{at@\texttt{AT} command} commands add transitions to the
given transition matrices and the final \texttt{TI}\index{ti@\texttt{TI}
command} command creates a tied-state called \texttt{silst}.  The parameters of
this tied-state are stored in the \texttt{hmmdefs} file and within each silence
model, the original state parameters are replaced by the name of this
macro\index{macros}.  Macros are described in more detail below. For now it is
sufficient to regard them simply as the mechanism by which
\HTK\ implements parameter sharing. 
Note that the phone list used here has been changed, because the original list
\texttt{monophones0} has been extended by the new \texttt{sp} model. The new 
file is called \texttt{monophones1} and has been used in the above \htool{HHEd}
command.

\centrefig{step7}{110}{Step 7}

Finally, another two passes of \htool{HERest} are applied using the phone
transcriptions with \texttt{sp} models between words.  This leaves the
set of monophone HMMs created so far in the directory \texttt{hmm7}.
This step is illustrated in Fig.~\href{f:step7}.

\subsection{Step 8 - Realigning the Training Data}

As noted earlier, the dictionary contains multiple pronunciations for 
some words, particularly function words.  The phone models created so
far can be used to \textit{realign} the training data and create new
transcriptions.  This can be done with a single invocation of the\index{realignment}
\HTK\ recognition tool \htool{HVite}\index{hvite@\htool{HVite}}, viz
\begin{verbatim}
    HVite -l '*' -o SWT -b silence -C config -a -H hmm7/macros \
          -H hmm7/hmmdefs -i aligned.mlf -m -t 250.0 -y lab \
          -I words.mlf -S train.scp  dict monophones1 
\end{verbatim}
This command uses the HMMs stored in \texttt{hmm7} to transform the input
word level transcription \texttt{words.mlf} to the new phone level transcription
\texttt{aligned.mlf} using the pronunciations stored in the dictionary
\texttt{dict} (see Fig.~\href{f:step8}).   The key difference between this
operation and the original word-to-phone mapping performed by \htool{HLEd}
in step 4 is that the recogniser considers all pronunciations for each
word and outputs the pronunciation that best matches the acoustic data.
\index{phone alignment}\index{phone mapping}

In the above, the \texttt{-b} option is used to insert a silence model\index{silence model}
at the start and end of each utterance.  The name \texttt{silence} is used
on the assumption that the dictionary contains an entry
\begin{verbatim}
    silence sil
\end{verbatim}
The \texttt{-t} option sets a pruning level of 250.0 and the \texttt{-o} 
option is used to suppress the printing of scores, word names and time
boundaries in the output MLF.

\centrefig{step8}{85}{Step 8}

Once the new phone alignments have been created, another  2 passes
of \htool{HERest} can be applied to reestimate the HMM set parameters
again.  Assuming that this is done, the final monophone HMM set will
be stored in directory \texttt{hmm9}.
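
For illustration, the two passes might be run as sketched below.  The options
mirror those used for \htool{HERest} elsewhere in this chapter.  The
intermediate directory name \texttt{hmm8} is an assumption implied by the
directory numbering rather than stated in the text.
\begin{verbatim}
    HERest -C config -I aligned.mlf -t 250.0 150.0 1000.0 \
           -S train.scp -H hmm7/macros -H hmm7/hmmdefs -M hmm8 monophones1
    HERest -C config -I aligned.mlf -t 250.0 150.0 1000.0 \
           -S train.scp -H hmm8/macros -H hmm8/hmmdefs -M hmm9 monophones1
\end{verbatim}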

\mysect{Creating Tied-State Triphones}{egcreattri}

Given a set of monophone HMMs, the final stage of model building is to create
context-dependent triphone\index{HMM!triphones} HMMs.  This is done in 
two steps.  Firstly, the
monophone transcriptions are converted to triphone transcriptions and a set
of triphone models are created by copying the monophones and re-estimating.
Secondly, similar acoustic states of these triphones are tied to ensure that
all state distributions can be robustly estimated.

\subsection{Step 9 - Making Triphones from Monophones}

Context-dependent triphones can be made by simply 
cloning\index{HMM!cloning}\index{cloning} monophones and then
re-estimating using triphone transcriptions.  The latter should be created
first using \htool{HLEd}\index{hled@\htool{HLEd}} because 
a side-effect is to generate a list of all
the triphones for which there is at least one example in the training data.
That is, executing
\begin{verbatim}
    HLEd -n triphones1 -l '*' -i wintri.mlf mktri.led aligned.mlf
\end{verbatim}
will convert the monophone transcriptions in \texttt{aligned.mlf} to
an equivalent set of triphone transcriptions in \texttt{wintri.mlf}.
At the same time, a list of triphones is written to the file \texttt{triphones1}.
The edit script \texttt{mktri.led}  contains the commands
\begin{verbatim}
    WB sp
    WB sil
    TC 
\end{verbatim}
The two \texttt{WB}\index{wb@\texttt{WB} command} commands define \texttt{sp} and \texttt{sil}
as \textit{word boundary symbols}.  These then block the addition of
context in the \texttt{TC} command, seen in the script above, which converts all phones
(except word boundary symbols) to triphones
\index{triphones!word internal}\index{triphones!from monophones}\index{triphones!by cloning}.  
For example,
\begin{verbatim}
    sil th ih s sp m ae n sp ...
\end{verbatim}
becomes
\begin{verbatim}
    sil th+ih th-ih+s ih-s sp m+ae m-ae+n ae-n sp ...
\end{verbatim}
This style of triphone transcription is referred to as \textit{word internal}.
\index{word internal}
Note that some biphones will also be generated, since contexts at word boundaries
will sometimes include only two phones.

The cloning of models can be done efficiently using the HMM editor \htool{HHEd}:
\begin{verbatim}
    HHEd -B -H hmm9/macros -H hmm9/hmmdefs -M hmm10 \
         mktri.hed monophones1
\end{verbatim}
where the edit script \texttt{mktri.hed}
contains a clone command \texttt{CL} followed by \texttt{TI} commands to tie all of
the transition matrices in each triphone\index{triphones!notation} set, that is:
\begin{verbatim}
    CL triphones1
    TI T_ah {(*-ah+*,ah+*,*-ah).transP}
    TI T_ax {(*-ax+*,ax+*,*-ax).transP}
    TI T_ey {(*-ey+*,ey+*,*-ey).transP}
    TI T_b {(*-b+*,b+*,*-b).transP}
    TI T_ay {(*-ay+*,ay+*,*-ay).transP}
    ...
\end{verbatim}  
The file \texttt{mktri.hed} can be generated using the {\em Perl} script
\texttt{maketrihed} included in the \texttt{HTKTutorial} directory.

The clone command \texttt{CL}\index{cl@\texttt{CL} command} takes as its
argument the name of the file containing the list of triphones (and
biphones)\index{cloning}\index{parameter tying}\index{item lists} generated
above.  For each model of the form \texttt{a-b+c} in this list, it looks for
the monophone \texttt{b} and makes a copy of it.\index{tying!transition
matrices} Each \texttt{TI} command takes as its argument the name of a macro
and a list of HMM components.  The latter uses a notation which attempts to
mimic the hierarchical structure of the HMM parameter set in which the
transition matrix \texttt{transP} can be regarded as a sub-component of each
HMM.  The list of items within brackets are patterns designed to match the set
of triphones, right biphones and left biphones for each phone.

\centrefig{egtranstie}{80}{Tying Transition Matrices}

Up to now macros and tying have only been mentioned in passing.  Although a
full explanation must wait until chapter~\ref{c:HMMDefs}, a brief explanation
is warranted here.  Tying means that one or more HMMs share the same set of
parameters.  On the left side of Fig.~\href{f:egtranstie}, two HMM definitions
are shown.  Each HMM has its own individual transition matrix.  On the right
side, the effect of the first \texttt{TI} command in the edit script
\texttt{mktri.hed} is shown.  The individual transition matrices have been
replaced by a reference to a \textit{macro} called \texttt{T\_ah} which
contains a matrix shared by both models.  When reestimating tied parameters,
the data which would have been used for each of the original untied parameters
is pooled so that a much more reliable estimate can be obtained.

Of course, tying could affect performance if performed indiscriminately.
Hence, it is important to only tie parameters which have little effect on
discrimination.  This is the case here where the transition parameters do not
vary significantly with acoustic context but nevertheless need to be estimated
accurately.  Some triphones will occur only once or twice and so very poor
estimates would be obtained if tying was not done.  These problems of data
insufficiency will affect the output distributions too, but this will be dealt
with in the next step.

Hitherto, all HMMs have been stored in text format and could be inspected like
any text file.  Now however, the model files will be getting larger and space
and load/store times become an issue.  For increased efficiency,
\HTK\ can store and load MMFs in binary\index{HMM!binary storage}
format.  Setting the standard \texttt{-B} option causes this to happen.

\sidefig{step9}{55}{Step 9}{-4}{
Once the context-dependent models have been cloned, the new triphone set can be
re-estimated using \htool{HERest}.  This is done as previously except that the
monophone model list is replaced by a triphone list and the triphone
transcriptions are used in place of the monophone transcriptions.  

For the final pass of \htool{HERest}, the \texttt{-s} option should be used to
generate a file of state occupation statistics called \texttt{stats}.  In
combination with the means and variances, these enable likelihoods to be
calculated for clusters of states and are needed during the state-clustering
process \index{statistics!state occupation} described below.
Fig.~\href{f:step9} illustrates this step of the HMM construction
procedure. Re-estimation should again be done twice, so that the resultant
model sets will ultimately be saved in \texttt{hmm12}.  
}
\begin{verbatim}
   HERest -C config -I wintri.mlf -t 250.0 150.0 1000.0 -s stats \
    -S train.scp -H hmm10/macros -H hmm10/hmmdefs -M hmm11 triphones1
\end{verbatim}


\subsection{Step 10 - Making Tied-State Triphones}

The outcome of the previous stage is a set of triphone HMMs with all triphones
in a phone set sharing the same transition matrix.  When estimating these
models, many of the variances in the output distributions
will have been floored since there will be\index{variance!flooring problems}\index{state tying}
\index{tying!states}\index{data insufficiency}
insufficient data associated with many of the states.  The last step in
the model building process is to tie states within triphone sets
in order to share data and thus be able to make robust parameter estimates.

In the previous step, the \texttt{TI} command was used to
explicitly tie all members of a set of transition matrices together. 
However,
the choice of which states to tie requires a bit more subtlety since
the performance of the recogniser depends crucially on how accurately
the state output distributions capture the statistics of the speech data.

\htool{HHEd} provides two mechanisms which allow states to be clustered 
and\index{state clustering}
then each cluster tied.  The first is data-driven and uses a similarity
measure between states.  The second uses decision trees\index{decision trees}
and is based on asking questions about the left and right contexts of each
triphone.  The decision tree attempts to find those contexts which make the largest
difference to the acoustics and which should therefore distinguish clusters.

Decision tree state tying is performed by running \htool{HHEd} 
in the normal way, i.e.
\begin{verbatim}
   HHEd -B -H hmm12/macros -H hmm12/hmmdefs -M hmm13 \
        tree.hed triphones1 > log
\end{verbatim}
Notice that the output is saved in a log file.  This is important since
some tuning of thresholds is usually needed.

The edit script \texttt{tree.hed}, which contains the instructions regarding
which contexts to examine for possible clustering, can be rather long and
complex. A script for automatically generating this file, \texttt{mkclscript},
is found in the RM Demo. A version of the \texttt{tree.hed} script, which can
be used with this tutorial, is included in the \texttt{HTKTutorial} directory.
Note that this script is only capable of creating the \texttt{TB} commands
(decision tree clustering of states).  The questions (\texttt{QS}) still need
to be defined by the user.  There is, however, an example list of questions
which may be suitable for some tasks (or at least useful as an example)
supplied with the RM demo (\texttt{lib/quests.hed}).  The entire script
appropriate for clustering English phone models is too long to show here in
full.  However, its main components are given by the following fragments:

\begin{verbatim}

    RO 100.0 stats
    TR 0
    QS "L_Class-Stop" {p-*,b-*,t-*,d-*,k-*,g-*} 
    QS "R_Class-Stop" {*+p,*+b,*+t,*+d,*+k,*+g} 
    QS "L_Nasal" {m-*,n-*,ng-*} 
    QS "R_Nasal" {*+m,*+n,*+ng}
    QS "L_Glide" {y-*,w-*} 
    QS "R_Glide" {*+y,*+w}
    ....
    QS "L_w" {w-*} 
    QS "R_w" {*+w} 
    QS "L_y" {y-*} 
    QS "R_y" {*+y} 
    QS "L_z" {z-*} 
    QS "R_z" {*+z} 
 
    TR 2

    TB 350.0 "aa_s2" {(aa, *-aa, *-aa+*, aa+*).state[2]}
    TB 350.0 "ae_s2" {(ae, *-ae, *-ae+*, ae+*).state[2]}
    TB 350.0 "ah_s2" {(ah, *-ah, *-ah+*, ah+*).state[2]}
    TB 350.0 "uh_s2" {(uh, *-uh, *-uh+*, uh+*).state[2]}
    ....
    TB 350.0 "y_s4" {(y, *-y, *-y+*, y+*).state[4]}
    TB 350.0 "z_s4" {(z, *-z, *-z+*, z+*).state[4]}
    TB 350.0 "zh_s4" {(zh, *-zh, *-zh+*, zh+*).state[4]}

    TR 1
    
    AU "fulllist"
    CO "tiedlist"

    ST "trees"
\end{verbatim}
Firstly, the \texttt{RO}\index{ro@\texttt{RO} command} command is used to set
the outlier threshold\index{outlier threshold} to 100.0 and load the statistics
file\index{statistics file} generated at the end of the previous step.  The
outlier threshold determines the minimum occupancy\index{minimum occupancy} of
any cluster and prevents a single outlier state forming a singleton cluster
just because it is acoustically very different to all the other states.  The
\texttt{TR}\index{tr@\texttt{TR} command} command sets the trace level to zero
in preparation for loading in the questions.  Each
\texttt{QS}\index{qs@\texttt{QS} command} command loads a single question and
each question is defined by a set of contexts.  For example, the first
\texttt{QS} command defines a question called \texttt{L\_Class-Stop} which is
true if the left context is any of the stops \texttt{p},
\texttt{b}, \texttt{t}, \texttt{d}, \texttt{k} or \texttt{g}.

\sidefig{step10}{50}{Step 10}{-4}{}
Notice that for a triphone system, it is necessary to include questions
referring to both the right and left contexts of a phone. The questions should
progress from wide, general classifications (such as consonant, vowel, nasal,
diphthong, etc.) to specific instances of each phone.
Ideally, the full set of questions loaded using the \texttt{QS} command would
include every possible context which can influence the acoustic realisation of
a phone, and can include any linguistic or phonetic classification which may be
relevant. There is no harm in creating extra unnecessary questions, because
those which are determined to be irrelevant to the data will be ignored.

The second \texttt{TR} command enables intermediate level progress reporting so
that each of the following \texttt{TB} commands\index{tb@\texttt{TB} command}
can\index{tree building} be monitored.  Each of these \texttt{TB} commands
clusters one specific set of states.  For example, the first \texttt{TB}
command applies to the first emitting state of all context-dependent models for
the phone \texttt{aa}.

Each \texttt{TB} command works as follows.  Firstly, each set of states defined
by the final argument is pooled to form a single cluster.  Each question in the
question set loaded by the \texttt{QS} commands is used to split the pool into
two sets.  The use of two sets rather than one allows the log likelihood of
the training data to be increased and the question which maximises this
increase is selected for the first branch of the tree. The process is then
repeated until the increase in log likelihood achievable by any question at any
node is less than the threshold specified by the first argument (350.0 in this
case).
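
The splitting rule can be written compactly.  As a sketch, using notation
introduced here only for illustration, let $L(S)$ denote the log likelihood of
the training data when all states in a pool $S$ share a single output
distribution, and let a question $q$ split $S$ into the subsets $S_y(q)$ and
$S_n(q)$ for which the answer is yes and no respectively.  The gain obtained
by asking $q$ is then
\[
   \Delta L_q = L(S_y(q)) + L(S_n(q)) - L(S)
\]
At each node the question $q^{*} = \arg\max_q \Delta L_q$ is selected, and the
split is made only while $\Delta L_{q^{*}}$ is not less than the \texttt{TB}
threshold (350.0 here) and the resulting clusters respect the minimum
occupancy set by \texttt{RO}.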

Note that the values given in the \texttt{RO} and \texttt{TB} commands affect
the degree of tying and therefore the number of states output in the clustered
system.  The values should be varied according to the amount of training data
available.
As a final step to the clustering, any pair of clusters which can be merged
\index{cluster merging} such that the decrease in log likelihood is below
the threshold is merged.  On completion, the states in each cluster $i$ are
tied to form a single shared state with macro name \texttt{xxx\_i} where
\texttt{xxx} is the name given by the second argument of the \texttt{TB}
command.

The set of triphones used so far only includes those needed to cover the
training data. The \texttt{AU} command takes as its argument a new list of
triphones expanded to include all those needed for recognition.  This list can
be generated, for example, by using \htool{HDMan} on the entire dictionary (not
just the training dictionary), converting it to triphones using the command
\texttt{TC} and outputting a list of the distinct triphones to a file using the
option \texttt{-n} 

\begin{verbatim}
    HDMan -n fulllist -g global.ded -l flog beep
\end{verbatim}
\noindent
The effect of the \texttt{AU} command is to use the decision trees to
synthesise all of the new previously unseen triphones in the new list.
\index{au@\texttt{AU} command}

Once all state-tying has been completed and new models synthesised, 
some models may  share exactly
the same 3 states and transition matrices and are thus identical.
The \texttt{CO} command\index{co@\texttt{CO} command}\index{model compaction} is used
to compact the model set by finding all identical models and tying them
together\footnote{
Note that if the transition matrices had not been tied, the \texttt{CO}
command would be ineffective since all models would be different by
virtue of their unique transition matrices.}, producing a new list of models
called \texttt{tiedlist}.

One of the advantages of using decision tree clustering is that it allows
previously\index{unseen triphones}
unseen triphones to be synthesised.  To do this, the trees must
be saved and this is done by the \texttt{ST} command\index{st@\texttt{ST} command}.
Later if new previously unseen triphones are required, for example in the
pronunciation of a new vocabulary item, the existing model set can be
reloaded into \htool{HHEd}, the trees reloaded using 
the \texttt{LT} command\index{lt@\texttt{LT} command}
and then a new extended list of triphones created using 
the \texttt{AU} command.\index{au@\texttt{AU} command}

After \htool{HHEd} has completed,  the effect of tying can be studied and
the thresholds adjusted if necessary.  The log file will
include summary statistics which give the total number of physical
states remaining and the number of models after compacting.

Finally, and for the last time, the models are re-estimated twice using
\htool{HERest}.  Fig.~\href{f:step10} illustrates this last step in the HMM
build process.  The trained models are then contained in the file
\texttt{hmm15/hmmdefs}.
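
A sketch of these two passes is given below.  The intermediate directory
\texttt{hmm14}, the reuse of \texttt{wintri.mlf} and of the earlier pruning
thresholds, and the use of \texttt{-s} on the last pass (so that an up-to-date
\texttt{stats} file is available for later use) are all assumptions made for
illustration.
\begin{verbatim}
    HERest -C config -I wintri.mlf -t 250.0 150.0 1000.0 \
           -S train.scp -H hmm13/macros -H hmm13/hmmdefs -M hmm14 tiedlist
    HERest -C config -I wintri.mlf -t 250.0 150.0 1000.0 -s stats \
           -S train.scp -H hmm14/macros -H hmm14/hmmdefs -M hmm15 tiedlist
\end{verbatim}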

\mysect{Recogniser Evaluation}{egrectest}

The recogniser is now complete and its performance can be evaluated.  
The recognition network and dictionary have already been constructed, 
and test data has been recorded.  
Thus, all that is necessary is to run the recogniser and 
then evaluate the results using the \HTK\ analysis tool \htool{HResults}.\index{recogniser evaluation}

\subsection{Step 11 - Recognising the Test Data}

Assuming that \texttt{test.scp} holds a list of the coded test files,
then each test file will be recognised and its transcription output to
an MLF called \texttt{recout.mlf} by executing the following
\begin{verbatim}
    HVite -H hmm15/macros -H hmm15/hmmdefs -S test.scp \
          -l '*' -i recout.mlf -w wdnet \
          -p 0.0 -s 5.0 dict tiedlist
\end{verbatim}
The options \texttt{-p} and \texttt{-s} set the \textit{word insertion penalty}
\index{word insertion penalty}
and the \textit{grammar scale factor}, \index{grammar scale factor}
respectively.  The word insertion penalty
is a fixed value added to each token when it transits from the end of one word
to the start of the next.  The grammar scale factor is the amount by which
the language model probability is scaled before being 
added to each token  as it transits from the end of one word
to the start of the next.  These parameters can have a significant effect
on recognition performance and hence, some tuning on development test data
is well worthwhile.
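
In symbols, and as a sketch only (the notation is introduced here and is not
taken from the \HTK\ sources), let $\psi$ be the log score of a token leaving
a word, $P(w)$ the language model probability of the next word $w$, $s$ the
grammar scale factor and $p$ the word insertion penalty.  The token then
enters $w$ with score
\[
   \psi' = \psi + s \log P(w) + p
\]
With the settings used above, $s = 5.0$ and $p = 0.0$, so only the scaled
language model term is added at each word boundary.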

The dictionary contains monophone transcriptions whereas the supplied HMM list
contains word internal triphones.  \htool{HVite}\index{hvite@\htool{HVite}} 
will make the necessary 
conversions when loading the word network \texttt{wdnet}.  However, 
if the HMM list contained both monophones and context-dependent phones
then \htool{HVite} would become confused.  The required form of 
word-internal network\index{networks!word-internal} 
expansion can be forced by setting the configuration variable
\texttt{FORCECXTEXP}\index{forcecxtexp@\texttt{FORCECXTEXP}} to true and 
\texttt{ALLOWXWRDEXP}\index{allowxwrdexp@\texttt{ALLOWXWRDEXP}} to 
false (see chapter~\ref{c:netdict} for details).\index{accuracy figure}
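
A minimal sketch of the corresponding configuration lines is shown below.
Placing them in a configuration file passed to \htool{HVite} with the
\texttt{-C} option is an assumption, since the \htool{HVite} command above
does not itself name a configuration file.
\begin{verbatim}
    FORCECXTEXP=T
    ALLOWXWRDEXP=F
\end{verbatim}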

Assuming that the MLF \texttt{testref.mlf} contains word level transcriptions
for each test file\footnote{The \htool{HLEd} tool may have to be used to insert silences 
at the start and end of each transcription or alternatively
\htool{HResults} can be used to ignore silences (or any other symbols) using
the \texttt{-e} option}, the actual
performance can be determined by running 
\htool{HResults} as follows
\begin{verbatim}
    HResults -I testref.mlf tiedlist recout.mlf
\end{verbatim}
The result would be a print-out of the form
\begin{verbatim}
    ====================== HTK Results Analysis ==============
      Date: Sun Oct 22 16:14:45 1995
      Ref : testrefs.mlf
      Rec : recout.mlf
    ------------------------ Overall Results -----------------
    SENT: %Correct=98.50 [H=197, S=3, N=200]
    WORD: %Corr=99.77, Acc=99.65 [H=853, D=1, S=1, I=1, N=855]
    ==========================================================
\end{verbatim}
The line starting with \texttt{SENT:} indicates that of the 200 test utterances,
197  (98.50\%) were correctly recognised.  The following line starting with \texttt{WORD:} 
gives the word level statistics and indicates that of the 855 words in total,
853 (99.77\%) were recognised correctly.  There was 1 deletion error (\texttt{D}), 
1 substitution\index{recognition!results analysis}
error (\texttt{S}) and 1 insertion error (\texttt{I}).  The accuracy figure (\texttt{Acc})
of 99.65\% is lower than the percentage correct (\texttt{Corr}) because it takes
account of the insertion errors which the latter ignores.
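
The figures in the print-out can be checked by hand.  With $N$ reference
words, $D$ deletions, $S$ substitutions and $I$ insertions, the number of
correct words is $H = N - D - S = 855 - 1 - 1 = 853$, and
\[
   \%\mathrm{Corr} = \frac{H}{N} \times 100 = \frac{853}{855} \times 100 \approx 99.77,
   \qquad
   \mathrm{Acc} = \frac{H - I}{N} \times 100 = \frac{852}{855} \times 100 \approx 99.65
\]
Similarly, the sentence level figure is $197/200 \times 100 = 98.50$.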

\centrefig{step11}{120}{Step 11}

\mysect{Running the Recogniser Live}{egreclive}

The recogniser can also be run with live input\index{live input}.  
\index{recognition!direct audio input}
To do this it is only
necessary to set the configuration variables needed to convert the input
audio to the correct form of  parameterisation.  Specifically, the following
need to be set
\begin{verbatim}
    # Waveform capture
    SOURCERATE=625.0
    SOURCEKIND=HAUDIO
    SOURCEFORMAT=HTK
    ENORMALISE=F
    USESILDET=T
    MEASURESIL=T
    OUTSILWARN=T
\end{verbatim}
These indicate that the source is direct audio with sample period 62.5
$\mu$secs.  The silence detector is enabled and a measurement of the background
speech/silence levels should be made at start-up.  The final line makes sure
that a warning is printed when this silence measurement is being made.
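
The value of \texttt{SOURCERATE} follows from the fact that \HTK\ expresses
times in units of 100\,ns:
\[
   625.0 \times 100\,\mathrm{ns} = 62.5\,\mu\mathrm{s},
   \qquad
   \frac{1}{62.5\,\mu\mathrm{s}} = 16\,\mathrm{kHz}
\]
so the settings above correspond to a sampling rate of 16\,kHz.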

Once the configuration file has been set-up for direct audio input,
\htool{HVite} can be run as in the previous step except that no files need be
given as arguments. On start-up, \htool{HVite} will prompt the user to speak an
arbitrary sentence (approx. 4 secs) in order to measure the speech and
background silence levels. It will then repeatedly recognise and, if trace
level bit 1 is set, it will output each utterance to the terminal. A typical
session is as follows\index{recognition!output}

\begin{verbatim}
   Read 1648 physical / 4131 logical HMMs
   Read lattice with 26 nodes / 52 arcs
   Created network with 123 nodes / 151 links

   READY[1]>
   Please speak sentence - measuring levels
   Level measurement completed
   DIAL FOUR SIX FOUR TWO FOUR OH  
        == [303 frames] -95.5773 [Ac=-28630.2 LM=-329.8] (Act=21.8)
   
   READY[2]>
    DIAL ZERO EIGHT SIX TWO 
        == [228 frames] -99.3758 [Ac=-22402.2 LM=-255.5] (Act=21.8)
   
   READY[3]>
    etc
\end{verbatim}
During loading, information will be printed out regarding the different
recogniser components. The physical models are the distinct HMMs used by 
the system, while the logical models include all model names. The number 
of logical models is higher than the number of physical models because many 
logically distinct models have been determined to be physically identical 
and have been merged during the previous model building steps. The lattice
information refers to the number of links and nodes in the recognition syntax.
The network information refers to the actual recognition network built by
expanding the lattice using the current HMM set, dictionary and any context
expansion rules specified.
After each utterance, the numerical information gives the total number
of frames, the average log likelihood per frame, the total acoustic score,
the total language model score and the average number of models active.
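
As a consistency check on the first utterance in the session above, dividing
the sum of the acoustic and language model scores by the number of frames
gives
\[
   \frac{-28630.2 + (-329.8)}{303} = \frac{-28960.0}{303} \approx -95.58
\]
which agrees with the printed per-frame figure of $-95.5773$ to within the
rounding of the displayed scores.  That the average is formed from exactly
this sum is inferred from the numbers shown rather than stated explicitly.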

Note that if it was required to recognise a new name, then the
following two changes would be needed
\begin{enumerate}
\item the grammar would be altered to include the new name
\item a pronunciation for the new name would be added to the dictionary
\end{enumerate}
If the new name required triphones which did not exist, then they could be
created by loading the existing triphone set into
\htool{HHEd}\index{hhed@\htool{HHEd}}, loading the decision trees using the
\texttt{LT} command\index{lt@\texttt{LT} command} and then using the
\texttt{AU} command\index{au@\texttt{AU} command} to generate a new complete
triphone set.\index{triphones!synthesising unseen}
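
A hypothetical \htool{HHEd} edit script for this might look as follows.  The
file names \texttt{newlist} (the extended list of required triphones) and
\texttt{newtiedlist} are placeholders chosen for illustration, \texttt{trees}
is the file written by the \texttt{ST} command in step 10, and the final
\texttt{CO} command is an optional assumption which compacts the extended
model set as before.
\begin{verbatim}
    LT "trees"
    AU "newlist"
    CO "newtiedlist"
\end{verbatim}
The script would be applied by running \htool{HHEd} in the usual way on the
models in \texttt{hmm15} with the model list \texttt{tiedlist}.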

\mysect{Adapting the HMMs}{exsysadapt}

The previous sections have described the stages required to build a simple 
voice dialling system. To simplify this process, speaker dependent models were 
developed using training data from a single user. Consequently, recognition 
accuracy for any other users would be poor.
To overcome this limitation, a set of speaker independent models could be 
constructed, but this would require large amounts of training data from a 
variety of speakers. An alternative is to adapt the current speaker dependent 
models to the characteristics of a new speaker using a small amount of 
training or adaptation data\index{adaptation}. In general, adaptation 
techniques are applied to well trained speaker independent model sets to 
enable them to better model the characteristics of particular speakers.

\HTK\ supports both supervised adaptation\index{adaptation!supervised adaptation}, 
where the true transcription of the data is known, and unsupervised 
adaptation\index{adaptation!unsupervised adaptation}, where the
transcription is hypothesised.
In \HTK, supervised adaptation is performed offline by
\htool{HEAdapt} using maximum likelihood linear regression
(MLLR)\index{adaptation!MLLR} 
and/or maximum a-posteriori (MAP)\index{adaptation!MAP} techniques to 
estimate
a series of transforms or a transformed model set that reduces the mismatch 
between the current model set and the adaptation data. Unsupervised 
adaptation is provided by \htool{HVite} (see section~\ref{s:unsup_adapt}), 
using just MLLR.

The following sections describe offline supervised adaptation (using
MLLR) with the use of \htool{HEAdapt}.

\subsection{Step 12 - Preparation of the Adaptation Data}

As in normal recogniser development, the first stage in adaptation involves 
data preparation. Speech data from the new user is required for both adapting 
the models and testing the adapted system. The data can be obtained in a 
similar fashion to that taken to prepare the original test data.
Initially, prompt lists for the adaptation and test data will be generated using 
\htool{HSGen}. For example, typing

\begin{verbatim}
    HSGen -l -n 20 wdnet dict > promptsAdapt
    HSGen -l -n 20 wdnet dict > promptsTest
\end{verbatim}

\noindent
would produce two prompt files for the adaptation and test data. The amount of 
adaptation data required will normally be found empirically, but a performance 
improvement should be observable after just 30 seconds of speech.
In this case, around 20 utterances should be sufficient.
\htool{HSLab} can be used to record the associated speech.

Assuming that the script files \texttt{codeAdapt.scp} and \texttt{codeTest.scp} 
list the source and output files for the adaptation and test data respectively, 
both sets of speech can then be coded using the \htool{HCopy} commands given 
below.

\begin{verbatim}
    HCopy -C config -S codeAdapt.scp
    HCopy -C config -S codeTest.scp
\end{verbatim}

\noindent
The final stage of preparation involves generating context dependent phone 
transcriptions of the adaptation data and word level transcriptions of the test 
data for use in adapting the models and evaluating their performance.
The transcriptions of the test data can be obtained using \texttt{prompts2mlf}.
To minimise the problem of multiple pronunciations, the phone level 
transcriptions of the adaptation data can be obtained by using \htool{HVite}
to perform a \textit{forced alignment} of the adaptation data. Assuming 
that word level transcriptions are listed in \texttt{adaptWords.mlf}, then the
following command will place the phone transcriptions in 
\texttt{adaptPhones.mlf}.

\begin{verbatim}
    HVite -l '*' -o SWT -b silence -C config -a -H hmm15/macros \
          -H hmm15/hmmdefs -i adaptPhones.mlf -m -t 250.0 \
          -I adaptWords.mlf -y lab -S adapt.scp dict tiedlist
\end{verbatim}

\subsection{Step 13 - Generating the Transforms}
\index{adaptation!generating transforms}
\htool{HEAdapt} provides two forms of MLLR adaptation depending on the
amount of  adaptation data available. If only small amounts are
available a global transform\index{adaptation!global transforms} can
be generated for every output distribution of every model. As 
more adaptation data becomes available more specific transforms can be 
generated for specific groups of Gaussians.
To identify the number of transforms that can be estimated using the current 
adaptation data, \htool{HEAdapt}\index{headapt@\htool{HEAdapt}} 
uses a regression class tree\index{adaptation!regression tree} to 
cluster together groups of output distributions that are to undergo the same 
transformation. The \HTK\ tool \htool{HHEd} can be used to build a 
regression class tree and store it as part of the HMM set. For example,

\begin{verbatim}
    HHEd -H hmm15/macros -H hmm15/hmmdefs -M hmm16 regtree.hed tiedlist
\end{verbatim}

\noindent
creates a regression class tree using the models stored in \texttt{hmm15}. 
The models are written out to the \texttt{hmm16} directory together with the 
regression class tree information. The \htool{HHEd} edit script 
\texttt{regtree.hed} contains the following commands

\begin{verbatim}
    LS "stats"
    RC 32 "rtree"
\end{verbatim}

\noindent
The \texttt{LS}\index{ls@\texttt{LS} command} command loads the state 
occupation statistics file 
\texttt{stats} generated by the last application of \htool{HERest} which 
created the models in \texttt{hmm15}. 
The \texttt{RC}\index{rc@\texttt{RC} command} command then attempts to build 
a regression class tree with 32 terminal or leaf nodes using these statistics.

\htool{HEAdapt} can be used to perform either static adaptation, where all the
adaptation data is processed in a single block, or incremental
adaptation, where adaptation is performed after a specified number of 
utterances.  The latter is controlled by the \texttt{-i} option. In this tutorial 
the default setting of static adaptation will be used.

A typical use of \htool{HEAdapt} involves two passes. On the first pass a 
global adaptation is performed. The second pass then uses the global 
transformation to transform the model set, producing better frame/state 
alignments which are then used to estimate a set of more specific transforms, 
using a regression class tree.
After estimating the transforms, \htool{HEAdapt} can output either the newly 
adapted model set or the transformations themselves in a transform model 
file (TMF)\index{adaptation!transform model file}.
The latter can be advantageous if storage is an issue since the TMFs are 
significantly smaller than MMFs and the computational overhead incurred 
when transforming a model set using a TMF is negligible.
 
The two applications of \htool{HEAdapt} below demonstrate a static two-pass 
adaptation approach where the global and regression class transformations are 
stored in the \texttt{global.tmf} and \texttt{rc.tmf} files respectively.
The standard \texttt{-J} and \texttt{-K} options are used to load and save the
TMFs respectively.
\begin{verbatim}
    HEAdapt -C config -g -S adapt.scp -I adaptPhones.mlf -H hmm16/macros \
            -H hmm16/hmmdefs -K global.tmf tiedlist

    HEAdapt -C config -S adapt.scp -I adaptPhones.mlf -H hmm16/macros \
            -H hmm16/hmmdefs -J global.tmf -K rc.tmf tiedlist
\end{verbatim}

\subsection{Step 14 - Evaluation of the Adapted System}

To evaluate the performance of the adaptation, the test data previously recorded 
is recognised using \htool{HVite}. Assuming that \texttt{testAdapt.scp} contains a list 
of all of the coded test files, then \htool{HVite} can be invoked in much the same way 
as before but with the additional \texttt{-J} argument used to load the model 
transformation file \texttt{rc.tmf}.

\begin{verbatim}

    HVite -H hmm16/macros -H hmm16/hmmdefs -S testAdapt.scp -l '*' \
          -J rc.tmf -i recoutAdapt.mlf -w wdnet \
          -p 0.0 -s 5.0 dict tiedlist

\end{verbatim}

\noindent
The recognition results for the adapted model set can then be analysed using 
\htool{HResults} in the usual manner.
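
As a minimal sketch of this step, assume that the reference transcriptions
for the adaptation test data are held in a file called
\texttt{adaptTestRef.mlf}. This is a hypothetical name used only for
illustration and is not defined elsewhere in this tutorial. The output MLF
written by \htool{HVite} above could then be scored with

\begin{verbatim}
    HResults -I adaptTestRef.mlf tiedlist recoutAdapt.mlf
\end{verbatim}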
 
The RM Demo contains a section on speaker adaptation (point 5.6) and the
recognition results obtained using an adapted model set are 
given below.

\begin{verbatim}
====================== HTK Results Analysis =======================
  Date: Wed Jan 06 21:09:23 1999
  Ref : usr/local/htk/RMHTK_V2.1/RMLib/wlabs/dms0_tst.mlf
  Rec : adapt/dms0_tst.mlf
------------------------ Overall Results --------------------------
SENT: %Correct=66.33 [H=65, S=33, N=98]
WORD: %Corr=94.25, Acc=93.10 [H=738, D=11, S=34, I=9, N=783]
===================================================================
\end{verbatim}

\noindent     
The performance improvement gained by the adapted models can
be evaluated by recognising the test data using the unadapted model
set and comparing the two results. For the RM Demo task the following
results were obtained with an unadapted model set.

\begin{verbatim}
====================== HTK Results Analysis =======================
  Date: Mon Dec 14 10:59:28 1998
  Ref : usr/local/htk/RMHTK_V2.1/RMLib/wlabs/dms0_tst.mlf
  Rec : unadapt/dms0_tst.mlf
------------------------ Overall Results --------------------------
SENT: %Correct=46.00 [H=46, S=54, N=100]
WORD: %Corr=89.04, Acc=86.43 [H=715, D=26, S=62, I=21, N=803]
===================================================================
\end{verbatim}
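
\noindent
As a worked check of the figures in the two result listings above, assume 
the usual \htool{HResults} definitions, in which $H$ is the number of correct 
labels, $I$ the number of insertions and $N$ the total number of labels in 
the reference transcriptions. The word-level scores are then
\[
  \mbox{\%Corr} = \frac{H}{N} \times 100\%, \qquad
  \mbox{Acc} = \frac{H - I}{N} \times 100\%
\]
For the adapted model set this gives
\[
  \mbox{\%Corr} = \frac{738}{783} \times 100\% = 94.25\%, \qquad
  \mbox{Acc} = \frac{738 - 9}{783} \times 100\% = 93.10\%
\]
and for the unadapted model set
\[
  \mbox{\%Corr} = \frac{715}{803} \times 100\% = 89.04\%, \qquad
  \mbox{Acc} = \frac{715 - 21}{803} \times 100\% = 86.43\%
\]
Note that the two listings report different totals ($N=783$ against $N=803$ 
words, and $N=98$ against $N=100$ sentences), so the comparison is best read 
as indicative.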

\mysect{Summary}{exsyssum}
This chapter has described the construction of a tied-state phone-based
continuous speech recogniser and, in so doing, has touched on most of the
main areas addressed by \HTK: recording, data preparation, HMM definitions,
training tools, adaptation tools, networks, decoding and evaluation.  The rest of this book
discusses each of these topics in detail.



%%% Local Variables: 
%%% mode: latex
%%% TeX-master: "htkbook"
%%% End: