In the microblog below I recorded my efforts to implement the first tutorial from the HTK book.
A word of caution:
In no way is this a complete account of the whole process,
nor should this entry be regarded as a tutorial (as you will see soon enough).
My general impression is that it is very hard to go through the HTK tutorial without some external help.
A good reference tutorial can be found here
but maybe someone out there will find this useful too.
Also note that my timestamps below shouldn't be taken seriously.
This was done over the weekend, along with other things.
-- Tomer
General configuration:
I installed HTK 3.4.1 on Windows with Cygwin.
Cygwin should be installed with Perl and with the sox application
(used to concatenate wav files).
Speech recognition settings:
I wanted to avoid recording.
Therefore, I used only a pre-recorded set of digits
available here
I used only the non-raw version.
For training and test data I therefore have 2x5x11 samples:
that is, for each of the two sets, five recordings of each of 11 different utterances (11 because zero is encoded by both "zero" and "oh").
I set up a very simple grammar:
Initially, a user can say either "0" or "1".
Subsequently,
"0" can be followed by a sequence of even numbers (e.g., 02824)
"1" can be followed by a sequence of odd numbers (e.g., 1775)
-- Start of microblog.
4:04pm:
Step 1 - the task grammar.
$even = TWO | FOUR | SIX | EIGHT | OH | ZERO;
$odd = ONE | THREE | FIVE | SEVEN | NINE;
( SENT-START ( (ZERO|OH) <$even> | ONE <$odd>) SENT-END )
4:16pm: HParse passes on first run.
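To make the grammar concrete, here is a minimal Python sketch that checks whether a word sequence is accepted by it. It is only an illustration and not part of the HTK flow; the function name is made up, and it relies on the HParse convention that angle brackets mean one or more repetitions.

```python
# Word sets copied from the task grammar above.
EVEN = {"TWO", "FOUR", "SIX", "EIGHT", "OH", "ZERO"}
ODD = {"ONE", "THREE", "FIVE", "SEVEN", "NINE"}


def is_valid_sentence(words):
    """Return True if words match (ZERO|OH) <$even> | ONE <$odd>.

    SENT-START and SENT-END are assumed to be stripped already.
    """
    if len(words) < 2:
        # <...> means one or more repetitions, so a lone first digit is rejected.
        return False
    first, rest = words[0], words[1:]
    if first in ("ZERO", "OH"):
        return all(w in EVEN for w in rest)
    if first == "ONE":
        return all(w in ODD for w in rest)
    return False
```

For example, "OH EIGHT SIX" is accepted while "ONE TWO" is not.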
4:24pm: Step 2 - the dictionary
Created the file 'wlist' using the 'prompts2wlist' perl script.
4:56pm:
Downloaded the 'beep' dictionary here.
Unfortunately, it isn't well sorted.
Moreover, sorting it with the unix/perl/python sort does not seem to help (maybe it does on a non-cygwin system?)
5:14pm:
To overcome this problem, I wrote a script 'reduce.pl' that reduces a dictionary only to the given wlist.
The script can be found here, among other scripts I wrote in this session.
perl -w reduce.pl wlist beep-1.0 > beep.reduced
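The idea behind the reduction can be sketched in a few lines of Python. This is not the actual reduce.pl; it assumes that every dictionary entry sits on one line with the word as the first whitespace-separated field, and the function name is made up.

```python
def reduce_dictionary(dict_lines, wanted_words):
    """Keep only the dictionary entries whose headword is in wanted_words.

    Assumes one entry per line in the form 'WORD phone phone ...'.
    The original order of the entries is preserved.
    """
    wanted = {w.strip().upper() for w in wanted_words if w.strip()}
    kept = []
    for line in dict_lines:
        fields = line.split()
        if fields and fields[0].upper() in wanted:
            kept.append(line.rstrip("\n"))
    return kept
```

Words with several pronunciations (like FOUR) keep all of their entries.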
I then added the recommended 'global.ded' in the same folder:
AS sp
RS cmu
MP sil sil sp
We obtain the file 'dict' (where the "SENT-XXXX [] .." entries were *manually* added):
EIGHT ey t sp
FIVE f ay v sp
FOUR f ao sp
FOUR f ao r sp
NINE n ay n sp
OH ow sp
ONE w ah n sp
SENT-END [] sil
SENT-START [] sil
SEVEN s eh v n sp
SIX s ih k s sp
THREE th r iy sp
TWO t uw sp
ZERO z ia r ow sp
5:27pm Step 3:
Previous experience shows that HSLab is not a very stable tool.
For example, it kept getting stuck after manual saving of marked and labeled files.
Fortunately, no labeling with HSLab is required here.
Two sets of sentences were created using HSGen.exe:
HSGen.exe -l -n 100 wdnet dict > trainprompts
HSGen.exe -l -n 100 wdnet dict > testprompts
Note that the digits prefixing every line should be changed to something like
"*/S00\d+" so that the output looks like:
*/S00099 ONE ONE NINE THREE THREE FIVE ONE FIVE ONE FIVE
*/S00100 OH EIGHT SIX FOUR EIGHT SIX SIX
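The renaming can be done with any regex-capable tool. Below is a small Python sketch of it, under the assumption that every HSGen line starts with a sentence number, optionally followed by a dot; the helper name is hypothetical.

```python
import re

# A leading integer, optionally followed by '.', then whitespace (assumed format).
_LEADING_NUMBER = re.compile(r"^\s*(\d+)\.?\s+")


def relabel_prompt(line):
    """Replace a leading sentence number with a '*/S<5 digits>' label."""
    match = _LEADING_NUMBER.match(line)
    if match is None:
        raise ValueError(f"no leading number in prompt line: {line!r}")
    number = int(match.group(1))
    return f"*/S{number:05d} {line[match.end():].rstrip()}"
```

Applying it to every line of 'trainprompts' and 'testprompts' gives labels in the S00001 ... S00100 style shown above.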
5:37pm Step 4:
Generating the HTK label format file (mlf) for the training and test data:
perl -w ../../samples/HTKTutorial/prompts2mlf trainmlf trainprompts
perl -w ../../samples/HTKTutorial/prompts2mlf testmlf testprompts
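For reference, the word-level MLF is a plain text file: a '#!MLF!#' header, then for each utterance a quoted label-file name, one word per line, and a closing '.'. The Python sketch below builds such a file from prompt lines. It mirrors my understanding of the format and is not the prompts2mlf script itself; the function name is made up.

```python
def prompts_to_mlf(prompt_lines):
    """Build a word-level MLF from lines such as '*/S00099 ONE ONE NINE'."""
    out = ["#!MLF!#"]
    for line in prompt_lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        label, words = fields[0], fields[1:]
        out.append(f'"{label}.lab"')  # quoted label-file pattern
        out.extend(words)             # one word per line
        out.append(".")               # end of this utterance
    return "\n".join(out) + "\n"
```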
5:45pm
The file 'mkphones0.led' was created (note that a newline is required after the last line)
EX
IS sil sil
DE sp
as well as 'mkphones1.led':
EX
IS sil sil
Notice that there's no delete "sp" in the second .led file - this results in words being separated by 'sp':
for example: ('086')
sil ow sp ey t sp s ih k s sp
The following commands were run:
HLEd -l '*' -d dict -i phones0.mlf mkphones0.led trainmlf
HLEd -l '*' -d dict -i phones1.mlf mkphones1.led trainmlf
5:46: Validating that 'sil' was added at the start and end of every utterance.
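This check is easy to automate. Here is a rough Python sketch that lists the utterances in a phone-level MLF which do not start and end with 'sil'. It assumes one label per line without time stamps, and the function name is made up.

```python
def utterances_missing_sil(mlf_text):
    """Return the names of utterances that do not start and end with 'sil'."""
    bad = []
    name, labels = None, []
    for raw in mlf_text.splitlines():
        line = raw.strip()
        if not line or line == "#!MLF!#":
            continue
        if line.startswith('"'):
            # A quoted line opens a new utterance.
            name, labels = line.strip('"'), []
        elif line == ".":
            # A lone dot closes the current utterance.
            if name is not None:
                if not labels or labels[0] != "sil" or labels[-1] != "sil":
                    bad.append(name)
            name = None
        elif name is not None:
            labels.append(line)
    return bad
```

An empty result means every utterance is wrapped in 'sil'.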
5:47: step 5
...
I don't really plan on recording 100 sentences (corresponding to the HSGen-generated prompts).
Instead, I'm going to concatenate the single-digit files according to the generated prompts.
For each digit, I have 5 training recordings and 5 testing recordings.
The train prompts will be composed by concatenating the training recordings,
and separately the test prompts from the testing recordings.
6:51 Done writing the digit-wav concatenation script! Available here.
The command
perl -w concatprompt.pl ../../def/trainprompts > codetrain.scp
generates two outputs:
1. the 'codetrain.scp' file with content
S00001.wav S00001.mfc
S00002.wav S00002.mfc
...
2. The 'SXXXX.wav' files, made by concatenating the single-digit wav files according to the prompt.
Concatenation was done using the 'sox' command (already installed on my cygwin).
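A sketch of what such a script has to do for one prompt, in Python rather than Perl. This is not concatprompt.pl: the per-digit file naming ('one_3.wav' and so on), the random choice among recordings and the function name are all assumptions made for illustration. The sketch only builds the sox command and the scp line; it does not run anything.

```python
import random


def plan_concatenation(prompt_line, recordings_per_digit=5, rng=None):
    """Return (sox_command, scp_line) for a prompt such as '*/S00001 ONE THREE'.

    The '<word>_<k>.wav' naming of the single-digit files is hypothetical.
    """
    if rng is None:
        rng = random.Random(0)  # fixed seed so the plan is reproducible
    fields = prompt_line.split()
    utterance = fields[0].split("/")[-1]  # '*/S00001' -> 'S00001'
    inputs = [
        f"{word.lower()}_{rng.randrange(1, recordings_per_digit + 1)}.wav"
        for word in fields[1:]
    ]
    # Given several input files and one output file, sox concatenates the inputs.
    sox_command = ["sox", *inputs, f"{utterance}.wav"]
    scp_line = f"{utterance}.wav {utterance}.mfc"
    return sox_command, scp_line
```

The returned command list could be handed to subprocess.run, and the scp lines written to 'codetrain.scp'.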
Now I'm ready to run:
HCopy -T 1 -C configtrain -S codetrain.scp
which obviously fails, spewing:
ERROR [+6310] OpenParmChannel: cannot open Parm File reading
ERROR [+6313] OpenAsChannel: OpenParmChannel failed
ERROR [+6316] OpenBuffer: OpenAsChannel failed
ERROR [+1050] OpenParmFile: Config parameters invalid
FATAL ERROR - Terminating program HCopy
The problem seemed to be a missing 'SOURCEFORMAT' setting in the 'configtrain' suggested by the HTKBook (thank you, internet forums).
So, my 'configtrain' now looks like this:
# Coding parameters
TARGETKIND = MFCC_0
TARGETRATE = 100000.0
SOURCEFORMAT = WAV
SAVECOMPRESSED = T
SAVEWITHCRC = T
WINDOWSIZE = 250000.0
USEHAMMING = T
PREEMCOEF = 0.97
NUMCHANS = 26
CEPLIFTER = 22
NUMCEPS = 12
ENORMALISE = F
and my 'configtest' now looks like this (the only difference is in TARGETKIND)
# Coding parameters
TARGETKIND = MFCC_0_D_A
TARGETRATE = 100000.0
SOURCEFORMAT = WAV
SAVECOMPRESSED = T
SAVEWITHCRC = T
WINDOWSIZE = 250000.0
USEHAMMING = T
PREEMCOEF = 0.97
NUMCHANS = 26
CEPLIFTER = 22
NUMCEPS = 12
ENORMALISE = F
7:22pm stop.
12:46am - step 6:
Running:
HCompV -C configtrain2 -f 0.01 -m -S trainmfc.scp -M hmm0 proto
Here the files:
1. 'trainmfc.scp' contains the .mfc files (one per line)
2. 'proto' is
~o <VecSize> 39 <MFCC_0_D_A>
~h "proto" <BeginHMM> <NumStates> 5 <State> 2 <Mean> 39 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0. 0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0. 0 <Variance> 39 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1. 0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1. 0 <State> 3 <Mean> 39 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0. 0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0. 0 <Variance> 39 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1. 0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1. 0 <State> 4 <Mean> 39 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0. 0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0. 0 <Variance> 39 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1. 0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1. 0 <TransP> 5 0.0 1.0 0.0 0.0 0.0 0.0 0.6 0.4 0.0 0.0 0.0 0.0 0.6 0.4 0.0 0.0 0.0 0.0 0.7 0.3 0.0 0.0 0.0 0.0 0.0 <EndHMM> |
3. 'configtrain2' is a modified version of 'configtrain' with
(a) the 'SOURCEFORMAT = WAV' removed (since we're using mfc files)
and (b) TARGETKIND = MFCC_0_D_A - matching the 'proto' first line, otherwise you get an error like
[+2050] CheckData: Parameterisation in S00001.mfc is incompatible with
1:29am - successful run.
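The 39 in the 'proto' header follows from the parameter kind: 12 cepstra plus C0 gives 13 static coefficients, and the _D and _A qualifiers each add another block of 13. A small Python sketch of that arithmetic, simplified to the qualifiers that change the size in the way I understand them (the function name is made up):

```python
def feature_vector_size(target_kind, num_ceps):
    """Estimate the vector size for a TARGETKIND such as 'MFCC_0_D_A'.

    Simplified: only the _0, _E, _D, _A and _T qualifiers are considered.
    """
    _base, *qualifiers = target_kind.split("_")
    static = num_ceps
    if "0" in qualifiers:
        static += 1      # zeroth cepstral coefficient
    if "E" in qualifiers:
        static += 1      # log energy
    blocks = 1           # the static coefficients themselves
    for qualifier in ("D", "A", "T"):
        if qualifier in qualifiers:
            blocks += 1  # one extra block per derivative order
    return static * blocks
```

With NUMCEPS = 12, 'MFCC_0' gives 13 and 'MFCC_0_D_A' gives 39, which is the VecSize the prototype declares.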
Instead of manually copying and pasting for each monophone, I use the script
'monophone2hmmdef.pl', which takes a list of monophones and a proto file and outputs an hmmdefs file,
as instructed here
(for the script, see here).
The command line I use is:
perl -w monophone2hmmdef.pl hmm0/proto ../../def/monophones0 > hmmdefs
where 'monophones0' is 'monophones1' without the phone 'sp'.
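In essence the script clones the prototype body once per phone and renames it. A minimal Python sketch of that idea follows; it is not monophone2hmmdef.pl itself, and it assumes the prototype definition starts at a line of the form ~h "proto" and runs to the end of the file.

```python
def clone_prototype(proto_text, phones, proto_name="proto"):
    """Return hmmdefs text with one copy of the prototype HMM per phone."""
    marker = f'~h "{proto_name}"'
    start = proto_text.find(marker)
    if start < 0:
        raise ValueError(f"no {marker} definition found")
    # Everything after the ~h line is the model body (<BeginHMM> ... <EndHMM>).
    body = proto_text[start + len(marker):].strip("\n")
    models = [f'~h "{phone}"\n{body}\n' for phone in phones]
    return "".join(models)
```

The global options line (~o ...) before the marker is left out on purpose, since it goes into the 'macros' file instead.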
1:48am: creating the macros file (as specified by the same voxforge link above).
The resulting file is:
~o
<STREAMINFO> 1 39
<VECSIZE> 39<NULLD><MFCC_D_A_0><DIAGC>
~v varFloor1
<Variance> 39
9.718211e-01 9.532348e-01 8.965245e-01 1.231103e+00 1.152173e+00 7.287015e-01 6.180173e-01 3.474550e-01 5.032213e-01 5.217451e-01 3.275607e-01 4.168463e-01 9.035664e-01 2.967923e-02 2.530965e-02 2.388922e-02 4.786262e-02 3.260848e-02 3.308474e-02 3.244526e-02 2.310554e-02 2.778173e-02 2.565148e-02 2.570237e-02 2.567602e-02 3.337039e-02 3.829589e-03 3.403592e-03 3.239712e-03 6.049931e-03 4.571099e-03 5.151636e-03 5.374793e-03 3.918502e-03 4.568941e-03 4.490597e-03 4.577133e-03 4.227266e-03 4.098070e-03
To re-estimate, create the folder 'hmm1' and run
HERest -C configtrain2 -I ../../def/phones0.mlf -t 250.0 150.0 1000.0 -S trainmfc.scp -H hmm0/macros -H hmm0/hmmdefs -M hmm1 ../../def/monophones0
which outputs:
ERROR [+7321] CreateInsts: Unknown label sil
So, I manually added the 'sil' monophone to 'hmmdefs' and to 'monophones0'
(I could have just added it to 'monophones0' and regenerated 'hmmdefs').
2:28am: that's it for today. messy!
=======
4:50pm
finishing step 6 - need to repeat 2 runs of HERest.
HERest -C configtrain2 -I ../../def/phones0.mlf -t 250.0 150.0 1000.0 -S trainmfc.scp -H hmm1/macros -H hmm1/hmmdefs -M hmm2 ../../def/monophones0
HERest -C configtrain2 -I ../../def/phones0.mlf -t 250.0 150.0 1000.0 -S trainmfc.scp -H hmm2/macros -H hmm2/hmmdefs -M hmm3 ../../def/monophones0
4:57pm done.
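Since the HERest calls differ only in the source and target folders, they are easy to generate. A small Python sketch that builds the argument lists, using only the flags from the commands above (the function names are made up):

```python
def herest_command(src_dir, dst_dir, mlf, hmm_list,
                   config="configtrain2", script="trainmfc.scp"):
    """Build one HERest call that reads models from src_dir and writes to dst_dir."""
    return [
        "HERest", "-C", config, "-I", mlf,
        "-t", "250.0", "150.0", "1000.0",
        "-S", script,
        "-H", f"{src_dir}/macros", "-H", f"{src_dir}/hmmdefs",
        "-M", dst_dir, hmm_list,
    ]


def herest_passes(first, last, mlf, hmm_list):
    """Commands for hmm<first> -> hmm<first+1> -> ... -> hmm<last>."""
    return [
        herest_command(f"hmm{n}", f"hmm{n + 1}", mlf, hmm_list)
        for n in range(first, last)
    ]
```

herest_passes(1, 3, "../../def/phones0.mlf", "../../def/monophones0") yields the two calls above. Each list can be passed to subprocess.run once the target folder exists.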
Step 7: Fixing the Silence Models
The description is indeed a bit cryptic here.
Here, 'fixing the silence model' means that we will copy the content of hmm3 to hmm4 and modify 'hmm4/hmmdefs' a bit.
The modifications will tie 'sp' to the center state of the 'sil' model,
that is, the two will share the same HMM parameters.
The following link describes exactly what needs to be done:
http://www.voxforge.org/home/dev/acousticmodels/linux/create/htkjulius/tutorial/monophones/step-7
Note: At this point I started following the above tutorial as much as I could.
Below I record mainly the commands that I ran and some tricks.
Many problems were solved by looking over the user comments in each step, or by just searching for the error number that HTK outputs.
...
After using a text editor as required, I run:
HHEd -A -D -T 1 -H hmm4/macros -H hmm4/hmmdefs -M hmm5 sil.hed ../../def/monophones1
Note that 'monophones1' contains 'sp'.
I got this error message:
WARNING [-2631] EditTransMat: No trans mats to edit! in HHEd
After which I noticed that my 'monophones1' doesn't have a 'sil' monophone, so I manually added it.
This seemed to resolve the problem.
5:35pm
Now two more re-estimations are done, using 'monophones1'
Running:
HERest -C configtrain2 -I ../../def/phones0.mlf -t 250.0 150.0 1000.0 -S trainmfc.scp -H hmm5/macros -H hmm5/hmmdefs -M hmm6 ../../def/monophones1
fails with:
WARNING [-2331] UpdateModels: sp[19] copied: only 0 egs
The reason is that we should use 'phones1.mlf'.
After correcting, the following two commands were executed:
HERest -C configtrain2 -I ../../def/phones1.mlf -t 250.0 150.0 1000.0 -S trainmfc.scp -H hmm5/macros -H hmm5/hmmdefs -M hmm6 ../../def/monophones1
HERest -C configtrain2 -I ../../def/phones1.mlf -t 250.0 150.0 1000.0 -S trainmfc.scp -H hmm6/macros -H hmm6/hmmdefs -M hmm7 ../../def/monophones1
5:49pm: Step 8 - Realigning the Training Data.
Note: How anyone can successfully follow the HTK tutorial without external help is beyond me at this stage.
Sorry.
Running:
HVite -A -D -T 1 -l '*' -o SWT -b SENT-END -C configtrain2 -H hmm7/macros -H hmm7/hmmdefs -i aligned.mlf -m -t 250.0 150.0 1000.0 -y lab -a -I ../../def/trainmlf -S trainmfc.scp ../../def/dict ../../def/monophones1 > HVite_log
Which outputs the file 'aligned.mlf'.
This command is somewhat different from the one suggested by the HTK book and was taken from here.
This step is actually redundant considering that my dictionary has a single pronunciation per word.
Nevertheless, for the sake of completeness (and who knows what else will follow)...
HERest -C configtrain2 -I ../../def/phones1.mlf -t 250.0 150.0 1000.0 -S trainmfc.scp -H hmm7/macros -H hmm7/hmmdefs -M hmm8 ../../def/monophones1
HERest -C configtrain2 -I ../../def/phones1.mlf -t 250.0 150.0 1000.0 -S trainmfc.scp -H hmm8/macros -H hmm8/hmmdefs -M hmm9 ../../def/monophones1
6:31pm: Step 9 - Making Triphones from Monophones
The file 'mktri.led' was created:
WB sp
WB sil
TC
and the following command was executed:
HLEd -A -D -T 1 -n triphones1 -l '*' -i wintri.mlf ../../def/mktri.led aligned.mlf
'triphones1' contains a list of triphones
(each word's monophones were grouped into triphones; sil/sp were skipped because of the 'WB' directives above).
'wintri.mlf' contains the new transcription of each file, in triphone format.
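To illustrate what the conversion does, here is a Python sketch of word-internal triphone expansion with 'sp' and 'sil' acting as word boundaries. It imitates the effect of the WB and TC directives as I understand them and is not HLEd's actual code; the function name is made up.

```python
def word_internal_triphones(phones, boundaries=("sp", "sil")):
    """Convert a monophone sequence to word-internal triphone labels."""
    out = []
    for i, phone in enumerate(phones):
        if phone in boundaries:
            out.append(phone)  # boundary symbols are kept as they are
            continue
        left = None
        if i > 0 and phones[i - 1] not in boundaries:
            left = phones[i - 1]
        right = None
        if i + 1 < len(phones) and phones[i + 1] not in boundaries:
            right = phones[i + 1]
        label = phone
        if left is not None:
            label = f"{left}-{label}"   # left context
        if right is not None:
            label = f"{label}+{right}"  # right context
        out.append(label)
    return out
```

For 'sil ow sp ey t sp sil' this gives 'sil ow sp ey+t ey-t sp sil': contexts never cross a word boundary, so a one-phone word stays a monophone.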
Next the following command was run:
perl -w ../../../samples/HTKTutorial/maketrihed monophones1 triphones1
Which creates 'mktri.hed' with content
CL triphones1
TI T_ey {(*-ey+*,ey+*,*-ey).transP}
TI T_t {(*-t+*,t+*,*-t).transP}
TI T_sp {(*-sp+*,sp+*,*-sp).transP}
TI T_f {(*-f+*,f+*,*-f).transP}
TI T_ay {(*-ay+*,ay+*,*-ay).transP}
TI T_v {(*-v+*,v+*,*-v).transP}
TI T_ao {(*-ao+*,ao+*,*-ao).transP}
TI T_r {(*-r+*,r+*,*-r).transP}
TI T_n {(*-n+*,n+*,*-n).transP}
TI T_ow {(*-ow+*,ow+*,*-ow).transP}
TI T_w {(*-w+*,w+*,*-w).transP}
TI T_ah {(*-ah+*,ah+*,*-ah).transP}
TI T_s {(*-s+*,s+*,*-s).transP}
TI T_eh {(*-eh+*,eh+*,*-eh).transP}
TI T_ih {(*-ih+*,ih+*,*-ih).transP}
TI T_k {(*-k+*,k+*,*-k).transP}
TI T_th {(*-th+*,th+*,*-th).transP}
TI T_iy {(*-iy+*,iy+*,*-iy).transP}
TI T_uw {(*-uw+*,uw+*,*-uw).transP}
TI T_z {(*-z+*,z+*,*-z).transP}
TI T_ia {(*-ia+*,ia+*,*-ia).transP}
TI T_sil {(*-sil+*,sil+*,*-sil).transP}
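The file is regular enough to be generated by a few lines of code. A Python sketch that produces the same CL/TI layout from a monophone list (an illustration only, not the maketrihed script; the function name is made up):

```python
def make_tri_hed(monophones, triphone_list="triphones1"):
    """Return HHEd script text that clones monophones and ties their transP."""
    lines = [f"CL {triphone_list}"]
    for phone in monophones:
        # One tie command per phone, covering all of its context variants.
        lines.append(
            f"TI T_{phone} {{(*-{phone}+*,{phone}+*,*-{phone}).transP}}"
        )
    return "\n".join(lines) + "\n"
```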
and thereafter,
HHEd -A -D -T 1 -H hmm9/macros -H hmm9/hmmdefs -M hmm10 mktri.hed monophones1
According to the HTK book, the 'T_sil' related warning can be disregarded:
WARNING [-2631] ApplyTie: Macro T_sil has nothing to tie of type t in HHEd
Running HERest two more times:
HERest -A -D -T 1 -C configtrain2 -I wintri.mlf -t 250.0 150.0 3000.0 -S trainmfc.scp -H hmm10/macros -H hmm10/hmmdefs -M hmm11 triphones1
HERest -A -D -T 1 -C configtrain2 -I wintri.mlf -t 250.0 150.0 3000.0 -s stats -S trainmfc.scp -H hmm11/macros -H hmm11/hmmdefs -M hmm12 triphones1
7:28pm: Step 10 - Making Tied-State Triphones
HDMan -A -D -T 1 -b sp -n fulllist -g ../../def/global.ded -l flog dict-tri ../../def/bigdict.txt
Where 'bigdict.txt' is taken from
http://www.voxforge.org/uploads/-A/h1/-Ah18p_AY2DzEs-9h-K-4g/voxforge_lexicon
next
cat triphones1 fulllist > fulllist1
and
perl -w fixfulllist.pl fulllist1 fulllist
where fixfulllist.pl is taken from here
I continue to follow the steps in the voxforge tutorial,
but saved a copy of 'tree.hed'
cp tree.hed tree.hed.old
and also created 'tree.hed.suffix'
TR 1
AU "fulllist"
CO "tiedlist"
ST "trees"
After creating the required folders, I ran:
perl -w ~/samples/RMHTK/perl_scripts/mkclscript.prl TB 350 ../../def/monophones0 >> tree.hed
and
HHEd -A -D -T 1 -H hmm12/macros -H hmm12/hmmdefs -M hmm13 tree.hed triphones1
which fails with:
AU fulllist
Creating HMMset using trees to add unseen triphones
ERROR [+2662] FindProtoModel: no proto for b in hSet
FATAL ERROR - Terminating program HHEd
Now, true, there's no 'b' sound in any of the digits, as well as many other sounds like m or d.
My plan is to remove any monophones I don't need from 'bigdict.txt'
8:55pm:
Finally the 'HHEd' passed after filtering out all of these phones:
"[bmljdkpqgy]\|aa\|zh\|ch\|ae\|aw\|ax\|en\|er\|hh\|sh\|uh".
The following shell script helped in the process (I just changed the grep -v pattern until only the required monophones were present):
grep -v "[bmljdkpqgy]\|aa\|zh\|ch\|ae\|aw\|ax\|en\|er\|hh\|sh\|uh" ../../def/bigdict.txt > ../../def/bigdict2.txt;
HDMan -A -D -T 1 -b sp -n fulllist -g ../../def/global.ded -l flog dict-tri ../../def/bigdict2.txt
cat triphones1 fulllist > fulllist1
perl -w fixfulllist.pl fulllist1 fulllist
rm tree.hed; cp tree.hed.old tree.hed
perl -w ~/samples/RMHTK/perl_scripts/mkclscript.prl TB 350 ../../def/monophones0 >> tree.hed
cat tree.hed.suffix >> tree.hed
HHEd -A -D -T 1 -H hmm12/macros -H hmm12/hmmdefs -M hmm13 tree.hed triphones1
There's probably a simpler way to work around this.
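One candidate is to filter by an explicit list of allowed phones instead of a grep pattern of unwanted ones. Here is a Python sketch of that idea, which I did not run as part of this session. It assumes lexicon lines of the form 'WORD [OUTPUT] phone phone ...' with the bracketed field optional and free of spaces; the function name is made up.

```python
def filter_by_phones(dict_lines, allowed_phones):
    """Keep entries whose pronunciation uses only phones in allowed_phones.

    Assumes 'WORD [OUTPUT] phone phone ...' with the bracketed field optional.
    """
    allowed = set(allowed_phones)
    kept = []
    for line in dict_lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # blank line or entry without a pronunciation
        phones = fields[1:]
        if phones[0].startswith("["):
            phones = phones[1:]  # drop the optional output symbol
        if phones and all(p in allowed for p in phones):
            kept.append(line.rstrip("\n"))
    return kept
```

The allowed set could be read straight from 'monophones1', so the filter follows the phone list automatically.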
and my flog looks like this:
1. w : 65
2. ah : 71
3. n : 305
4. sp : 561
5. ow : 172
6. t : 375
7. uw : 69
8. s : 384
9. eh : 156
10. v : 46
11. ih : 280
12. f : 117
13. r : 204
14. ao : 49
15. z : 140
16. th : 43
which seems okay (all phones > 10).
now running
HERest -A -D -T 1 -C configtrain2 -I wintri.mlf -s stats -t 250.0 150.0 3000.0 -S trainmfc.scp -H hmm13/macros -H hmm13/hmmdefs -M hmm14 tiedlist
HERest -A -D -T 1 -C configtrain2 -I wintri.mlf -s stats -t 250.0 150.0 3000.0 -S trainmfc.scp -H hmm14/macros -H hmm14/hmmdefs -M hmm15 tiedlist
9:03pm. Done for now.
12:14am. Step 11 - Recognising the Test Data (finally!)
HVite -H ../train/hmm15/macros -H ../train/hmm15/hmmdefs -S testmfc.scp -l '*' -i recout.mlf -w ../../def/wdnet -p 0.0 -s 5.0 ../../def/dict ../train/tiedlist
HResults -I ../../def/testmlf ../train/tiedlist recout.mlf
Gives perfect results:
====================== HTK Results Analysis =======================
Date: Tue May 29 00:44:39 2012
Ref : ../../def/testmlf
Rec : recout.mlf
------------------------ Overall Results --------------------------
SENT: %Correct=100.00 [H=100, S=0, N=100]
WORD: %Corr=100.00, Acc=100.00 [H=641, D=0, S=0, I=0, N=641]
===================================================================
i.e. 100% recognition success!
(This is expected due to the very low variance in the .wav files.)
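For reading the WORD line: H, D, S and I are the hit, deletion, substitution and insertion counts, and N = H + D + S is the number of reference words. A short Python sketch of how the two percentages follow from those counts, per the definitions in the HTK book as I understand them (the function name is made up):

```python
def word_scores(hits, deletions, substitutions, insertions):
    """Return (%Corr, Acc) computed from HResults-style counts."""
    n = hits + deletions + substitutions  # number of reference words
    percent_correct = 100.0 * hits / n
    accuracy = 100.0 * (hits - insertions) / n  # insertions are penalized here
    return percent_correct, accuracy
```

With H=641 and no errors, both values come out as 100.0, matching the table above.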
12:46am.
References
0. HTKBook here
1. voxforge tutorial: here
2. htk problems and how to fix'em: here
3. scripts and files I wrote/used: here