#!/usr/bin/perl
# Given a list of CMU-format fileids, a list of transcriptions in WSJ DOT format,
# and a dictionary, produce a word-level transcript and script file that
# only contain utterances where all the words are in the dictionary.
#
# Can also be set to ignore [] elements in the transcripts and to
# output a file of all the unknown vocab. But it turns out to be
# better to include these in the training (0.67 abs on Nov92).
#
# Removes any \ escaped symbols.
#
# Can optionally be set to output even utterances with unknown vocab;
# this is useful when constructing a test set MLF with OOVs.
#
# Can also optionally try to fix up things like:
# *WORLD*
# STOCKDEALS
#
# A find-and-replace file can be passed in to manually enter corrections
# for some transcriptions.
#
# Copyright 2006 by Keith Vertanen
#
use strict;
if ( @ARGV < 5 )
{
print "$0