Tomer Levinboim's Blog: 2012

In the microblog below I recorded my efforts to implement the first tutorial from the HTK book.
A word of caution:
In no way is this a complete account of the whole process,
nor should this entry be regarded as a tutorial (as you will see soon enough).
My general impression is that it is very hard to get through the HTK tutorial without some external help.
A good reference tutorial can be found here,
but maybe someone out there will find this useful too.

Also note that my timestamps below shouldn't be taken seriously.
This was done over the weekend, along with other things.
-- Tomer

General configuration:
I installed HTK 3.4.1 on Windows with Cygwin.
Cygwin should be installed with Perl and with the sox application
(used to concatenate wav files).

Speech recognition settings:
I wanted to avoid recording.
Therefore, I used only a pre-recorded set of digits
available here
I used only the non-raw version.

For training and test data I therefore have 2x5x11 samples:
that is, two sets (training and test), five recordings per word in each set, and 11 different words, since zero is encoded by both "zero" and "oh".

I set up a very simple grammar:
Initially, a user can say either "0" or "1".
Subsequently,
"0" can be followed by a sequence of even digits (e.g., 02824), and
"1" can be followed by a sequence of odd digits (e.g., 1775).

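To make the grammar concrete, here is a minimal Python sketch that checks whether a digit string is accepted. The function name `accepts` is hypothetical, and the sketch assumes that "0" counts as an even digit and that the follow-up sequence may be empty; neither detail is spelled out above.

```python
EVEN = set("02468")  # assumption: "0" itself counts as an even digit
ODD = set("13579")

def accepts(digits):
    """Return True if `digits` follows the toy grammar described above."""
    if not digits:
        return False
    first, rest = digits[0], digits[1:]
    if first == "0":
        return all(d in EVEN for d in rest)
    if first == "1":
        return all(d in ODD for d in rest)
    return False  # an utterance must start with "0" or "1"

print(accepts("02824"))  # example from the text
print(accepts("1775"))   # example from the text
print(accepts("0135"))   # odd digits after "0"
```
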
-- Start of microblog.

4:04pm:
Step 1 - the task grammar.

4:16pm: HParse passes on first run.

4:24pm: Step 2 - the dictionary
Created the file 'wlist' using the 'prompts2wlist' perl script.

4:56pm:
Downloaded the 'beep' dictionary here.
Unfortunately, it isn't well sorted.
Moreover, sorting it with the unix/perl/python sort does not seem to help (maybe it does on a non-Cygwin system?).

5:14pm:
To overcome this problem, I wrote a script, 'reduce.pl', that reduces a dictionary to only the words in the given wlist.
The script can be found here, among other scripts I wrote in this session.

perl -w reduce.pl wlist beep-1.0 > beep.reduced
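The actual 'reduce.pl' is not reproduced in this post. As a rough idea of what such a filter does, here is a Python sketch; the function name `reduce_dict` and the sample lines are made up for illustration, and it assumes each dictionary line starts with the headword followed by whitespace-separated phones.

```python
from collections import defaultdict

def reduce_dict(dict_lines, wlist):
    """Keep only dictionary entries whose headword appears in `wlist`.

    Entries are emitted in wlist order, so the result is as well ordered
    as the word list itself (an assumption about what is wanted here).
    """
    entries = defaultdict(list)
    for line in dict_lines:
        fields = line.split()
        if fields:
            entries[fields[0].upper()].append(" ".join(fields))
    reduced = []
    for word in wlist:
        word = word.strip().upper()
        if word:
            reduced.extend(entries.get(word, []))
    return reduced

# Made-up sample lines in a "WORD phones..." layout.
beep = ["EIGHT  ey t", "EIGHTEEN  ey t iy n", "FOUR  f ao", "FOUR  f ao r"]
wlist = ["EIGHT", "FOUR", "SENT-END"]
for entry in reduce_dict(beep, wlist):
    print(entry)
```

Words in the list that have no dictionary entry (such as SENT-END here) are simply skipped, which matches the need to add those entries by hand later.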

I then added the recommended 'global.ded' in the same folder:

AS sp
RS cmu
MP sil sil sp

We obtain the file 'dict' (where the "SENT-XXXX [] .." entries were added *manually*):

EIGHT ey t sp
FIVE f ay v sp
FOUR f ao sp
FOUR f ao r sp
NINE n ay n sp
OH ow sp
ONE w ah n sp
SENT-END [] sil
SENT-START [] sil
SEVEN s eh v n sp
SIX s ih k s sp
THREE th r iy sp
TWO t uw sp
ZERO z ia r ow sp

5:27pm Step 3:
Previous experience shows that HSLab is not a very stable tool.
For example, it kept getting stuck after manual saving of marked and labeled files.
Fortunately, no labeling with HSLab is required here.

Two sets of sentences were created using HSGen.exe:

HSGen.exe -l -n 100 wdnet dict > trainprompts

HSGen.exe -l -n 100 wdnet dict > testprompts

Note that the digits prefixing every line should be changed to something like
"*/S00\d+" so that the output looks like:

*/S00099 ONE ONE NINE THREE THREE FIVE ONE FIVE ONE FIVE
*/S00100 OH EIGHT SIX FOUR EIGHT SIX SIX
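The relabeling can be scripted rather than done by hand. The sketch below is one possible way, assuming that HSGen's -l option prefixes each sentence with a counter such as "99." (check your own output); `relabel_prompts` is a hypothetical helper name.

```python
import re

def relabel_prompts(lines):
    """Replace the leading counter on each prompt with a '*/S000NN' label."""
    out = []
    for line in lines:
        # Assumption: each line looks like "<number>. WORD WORD ..."
        # (the period after the number is treated as optional).
        m = re.match(r"\s*(\d+)\.?\s+(.*)", line)
        if m:
            out.append("*/S%05d %s" % (int(m.group(1)), m.group(2).rstrip()))
    return out

sample = ["99. ONE ONE NINE THREE", "100. OH EIGHT SIX"]
print("\n".join(relabel_prompts(sample)))
```
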

5:37pm Step 4:
Generating the HTK label format file (mlf) for the training and test data:

perl -w ../../samples/HTKTutorial/prompts2mlf trainmlf trainprompts

perl -w ../../samples/HTKTutorial/prompts2mlf testmlf testprompts

5:45pm
The file 'mkphones0.led' was created (note that a newline is required after the last line):

EX
IS sil sil
DE sp

as well as 'mkphones1.led':

EX
IS sil sil

Notice that there is no "DE sp" (delete 'sp') command in the second .led file, so words remain separated by 'sp'.
For example, for '086':
sil ow sp ey t sp s ih k s sp

The following commands were run:

HLEd -l '*' -d dict -i phones0.mlf mkphones0.led trainmlf

HLEd -l '*' -d dict -i phones1.mlf mkphones1.led trainmlf

5:46: Validating that 'sil' was added at the start and end of every utterance.

5:47: step 5
...
I don't really plan on recording 100 sentences (corresponding to the HSGen-generated prompts).
Instead, I'm going to concatenate the single-digit files according to the generated prompts.
For each digit, I have 5 training recordings and 5 testing recordings.
The train prompts will be composed by concatenating the training recordings,
and the test prompts, separately, from the testing recordings.

6:51 Done writing the digit-wav-concatenation script! It is available here.
The command

perl -w concatprompt.pl ../../def/trainprompts > codetrain.scp

generates two outputs:
1. the 'codetrain.scp' file with content

S00001.wav S00001.mfc
S00002.wav S00002.mfc
...

2. The 'SXXXX.wav' files, made by concatenating the single-digit wav files according to the prompt.
Concatenation was done using the 'sox' command (already installed on my Cygwin).
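The real 'concatprompt.pl' is linked rather than shown, so the following Python sketch only illustrates the idea: for one prompt line it builds the sox argument list and the matching 'codetrain.scp' line. The file naming scheme, the random choice among the five recordings, and the name `plan_utterance` are all assumptions made for illustration. The sketch relies on sox concatenating its inputs when given several input files and one output file.

```python
import random

def plan_utterance(prompt_line, wav_dir, rng):
    """Return (sox_argv, scp_line) for one '*/S000NN WORD WORD ...' prompt."""
    label, *words = prompt_line.split()
    utt = label.split("/")[-1]  # '*/S00001' -> 'S00001'
    # Hypothetical naming scheme: <wav_dir>/<WORD>_<take>.wav with takes 1..5.
    # Picking the take at random is also an assumption.
    inputs = ["%s/%s_%d.wav" % (wav_dir, w, rng.randint(1, 5)) for w in words]
    sox_argv = ["sox"] + inputs + [utt + ".wav"]  # several inputs, one output
    scp_line = "%s.wav %s.mfc" % (utt, utt)
    return sox_argv, scp_line

argv, scp = plan_utterance("*/S00001 OH EIGHT SIX", "train_wavs", random.Random(0))
print(" ".join(argv))
print(scp)
```

The argument list could be handed to `subprocess.run(argv, check=True)` to actually create the wav file.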

Now I'm ready to run:

HCopy -T 1 -C configtrain -S codetrain.scp

which obviously fails, spewing:

ERROR [+6310] OpenParmChannel: cannot open Parm File reading
ERROR [+6313] OpenAsChannel: OpenParmChannel failed
ERROR [+6316] OpenBuffer: OpenAsChannel failed
ERROR [+1050] OpenParmFile: Config parameters invalid
FATAL ERROR - Terminating program HCopy

The problem turned out to be a missing 'SOURCEFORMAT' setting in the 'configtrain' suggested by the HTK book (thank you, internet forums).
So, my 'configtrain' now looks like this:

# Coding parameters
TARGETKIND = MFCC_0
TARGETRATE = 100000.0
SOURCEFORMAT = WAV
SAVECOMPRESSED = T
SAVEWITHCRC = T
WINDOWSIZE = 250000.0
USEHAMMING = T
PREEMCOEF = 0.97
NUMCHANS = 26
CEPLIFTER = 22
NUMCEPS = 12
ENORMALISE = F

and my 'configtest' now looks like this (the only difference is in TARGETKIND)

# Coding parameters
TARGETKIND = MFCC_0_D_A
TARGETRATE = 100000.0
SOURCEFORMAT = WAV
SAVECOMPRESSED = T
SAVEWITHCRC = T
WINDOWSIZE = 250000.0
USEHAMMING = T
PREEMCOEF = 0.97
NUMCHANS = 26
CEPLIFTER = 22
NUMCEPS = 12
ENORMALISE = F

7:22pm stop.
12:46am - step 6:

Running:

HCompV -C configtrain2 -f 0.01 -m -S trainmfc.scp -M hmm0 proto

Here, the files are:
1. 'trainmfc.scp' contains the .mfc files (one per line)
2. 'proto' is

~o <VecSize> 39 <MFCC_0_D_A>
~h "proto"
<BeginHMM>
<NumStates> 5
<State> 2
<Mean> 39
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0. 0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0. 0
<Variance> 39
1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1. 0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1. 0
<State> 3
<Mean> 39
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0. 0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0. 0
<Variance> 39
1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1. 0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1. 0
<State> 4
<Mean> 39
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0. 0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0. 0
<Variance> 39
1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1. 0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1. 0
<TransP> 5
0.0 1.0 0.0 0.0 0.0
0.0 0.6 0.4 0.0 0.0
0.0 0.0 0.6 0.4 0.0
0.0 0.0 0.0 0.7 0.3
0.0 0.0 0.0 0.0 0.0
<EndHMM>

3. 'configtrain2' is a modified version of 'configtrain' with
(a) the 'SOURCEFORMAT = WAV' removed (since we're using mfc files)
and (b) TARGETKIND = MFCC_0_D_A - matching the 'proto' first line, otherwise you get an error like

[+2050] CheckData: Parameterisation in S00001.mfc is incompatible with

1:29am - successful run.
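The post does not say how 'trainmfc.scp' was produced. One simple option, sketched here in Python under the assumption that it is just the second column of 'codetrain.scp', is the following (`mfc_list` is a hypothetical helper name):

```python
def mfc_list(codetrain_lines):
    """Extract the target (.mfc) column from 'source target' script lines."""
    targets = []
    for line in codetrain_lines:
        fields = line.split()
        if len(fields) == 2:  # skip blank or malformed lines
            targets.append(fields[1])
    return targets

codetrain = ["S00001.wav S00001.mfc", "S00002.wav S00002.mfc"]
print("\n".join(mfc_list(codetrain)))
```
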

Instead of manually copying and pasting for each monophone, I use a script,
'monophone2hmmdef.pl', which takes a proto file and a list of monophones and outputs an hmmdefs file,
as instructed here
(for the script, see here).

The command line I use is:

perl -w monophone2hmmdef.pl hmm0/proto ../../def/monophones0 > hmmdefs

where 'monophones0' is 'monophones1' without the phone 'sp'.
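For readers who do not want to open the Perl script, the idea can be sketched in a few lines of Python: clone the prototype model once per phone and rename it. This is an illustration of the approach, not a port of 'monophone2hmmdef.pl'; `make_hmmdefs` is a hypothetical name, and dropping everything before the ~h line is an assumption (those global options end up in 'macros' instead).

```python
def make_hmmdefs(proto_text, phones):
    """Clone the prototype HMM once per phone, renaming ~h "proto"."""
    lines = proto_text.splitlines()
    # Everything after the ~h line is the model body shared by all clones.
    start = next(i for i, l in enumerate(lines) if l.startswith("~h"))
    body = lines[start + 1:]
    out = []
    for phone in phones:
        out.append('~h "%s"' % phone)
        out.extend(body)
    return "\n".join(out) + "\n"

# Heavily shortened stand-in for hmm0/proto, for demonstration only.
proto = '~o <VecSize> 39 <MFCC_0_D_A>\n~h "proto"\n<BeginHMM>\n<NumStates> 5\n<EndHMM>\n'
print(make_hmmdefs(proto, ["ey", "t", "sil"]))
```
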

1:48am: Creating the macros file (as specified by the same voxforge link above).
The resulting file is:

~o
<STREAMINFO> 1 39
<VECSIZE> 39<NULLD><MFCC_D_A_0><DIAGC>
~v varFloor1
<Variance> 39
9.718211e-01 9.532348e-01 8.965245e-01 1.231103e+00 1.152173e+00 7.287015e-01 6.180173e-01 3.474550e-01 5.032213e-01 5.217451e-01 3.275607e-01 4.168463e-01 9.035664e-01 2.967923e-02 2.530965e-02 2.388922e-02 4.786262e-02 3.260848e-02 3.308474e-02 3.244526e-02 2.310554e-02 2.778173e-02 2.565148e-02 2.570237e-02 2.567602e-02 3.337039e-02 3.829589e-03 3.403592e-03 3.239712e-03 6.049931e-03 4.571099e-03 5.151636e-03 5.374793e-03 3.918502e-03 4.568941e-03 4.490597e-03 4.577133e-03 4.227266e-03 4.098070e-03

To re-estimate, create the folder 'hmm1' and run

HERest -C configtrain2 -I ../../def/phones0.mlf -t 250.0 150.0 1000.0 -S trainmfc.scp -H hmm0/macros -H hmm0/hmmdefs -M hmm1 ../../def/monophones0

which outputs:

ERROR [+7321] CreateInsts: Unknown label sil

So I manually added the 'sil' monophone to 'hmmdefs' and to 'monophones0'
(I could have just added it to monophones0 and regenerated hmmdefs).

2:28am: that's it for today. messy!
=======
4:50pm
finishing step 6 - need to repeat 2 runs of HERest.

HERest -C configtrain2 -I ../../def/phones0.mlf -t 250.0 150.0 1000.0 -S trainmfc.scp -H hmm1/macros -H hmm1/hmmdefs -M hmm2 ../../def/monophones0

HERest -C configtrain2 -I ../../def/phones0.mlf -t 250.0 150.0 1000.0 -S trainmfc.scp -H hmm2/macros -H hmm2/hmmdefs -M hmm3 ../../def/monophones0

4:57pm done.
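Since every re-estimation pass differs only in the source and target folder numbers, the command lines can be generated instead of retyped. Here is a small Python sketch; the flags are copied from the commands above, while the helper name `herest_command` and its default arguments are my own.

```python
def herest_command(src, dst, mlf, hmmlist, config="configtrain2", scp="trainmfc.scp"):
    """Build one HERest re-estimation command as an argument list."""
    return [
        "HERest", "-C", config, "-I", mlf,
        "-t", "250.0", "150.0", "1000.0",
        "-S", scp,
        "-H", "hmm%d/macros" % src, "-H", "hmm%d/hmmdefs" % src,
        "-M", "hmm%d" % dst, hmmlist,
    ]

# Three passes: hmm0 -> hmm1 -> hmm2 -> hmm3
for n in range(3):
    cmd = herest_command(n, n + 1, "../../def/phones0.mlf", "../../def/monophones0")
    print(" ".join(cmd))
```

Each list could be passed to `subprocess.run(cmd, check=True)` after creating the target folder, since HERest expects the folder given to -M to exist.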

Step 7: Fixing the Silence Models
The HTK book's description is indeed a bit cryptic here.
'Fixing the silence model' means that we copy the content of hmm3 to hmm4 and modify 'hmm4/hmmdefs' a bit.
The modifications tie 'sp' to the center state of the 'sil' model,
that is, the two will share the same HMM state parameters.

The following page describes exactly what needs to be done:
http://www.voxforge.org/home/dev/acousticmodels/linux/create/htkjulius/tutorial/monophones/step-7

Note: At this point I started following the above tutorial as closely as I could.
Below I record mainly the commands that I ran and some tricks.
Many problems were solved by looking over the user comments in each step, or just by searching for the error number HTK outputs.
...

After using a text editor as required, I run:

HHEd -A -D -T 1 -H hmm4/macros -H hmm4/hmmdefs -M hmm5 sil.hed ../../def/monophones1

Note that 'monophones1' contains 'sp'.

I got this error message:

WARNING [-2631] EditTransMat: No trans mats to edit! in HHEd

After which I noticed that my 'monophones1' didn't have a 'sil' monophone, so I manually added it.
This seemed to resolve the problem.

5:35pm
Now two more re-estimations are done, using 'monophones1'

Running:

HERest -C configtrain2 -I ../../def/phones0.mlf -t 250.0 150.0 1000.0 -S trainmfc.scp -H hmm5/macros -H hmm5/hmmdefs -M hmm6 ../../def/monophones1

fails with:

WARNING [-2331] UpdateModels: sp[19] copied: only 0 egs

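The reason is that 'phones0.mlf' has every 'sp' deleted (see 'mkphones0.led' above), so there are no examples of 'sp' to train on; we should use 'phones1.mlf' instead.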
After correcting, the following two commands were executed:

HERest -C configtrain2 -I ../../def/phones1.mlf -t 250.0 150.0 1000.0 -S trainmfc.scp -H hmm5/macros -H hmm5/hmmdefs -M hmm6 ../../def/monophones1

HERest -C configtrain2 -I ../../def/phones1.mlf -t 250.0 150.0 1000.0 -S trainmfc.scp -H hmm6/macros -H hmm6/hmmdefs -M hmm7 ../../def/monophones1

5:49pm: Step 8 - Realigning the Training Data.
Note: How anyone can successfully follow the HTK tutorial without external help is beyond me at this stage.
Sorry.

Running:

HVite -A -D -T 1 -l '*' -o SWT -b SENT-END -C configtrain2 -H hmm7/macros -H hmm7/hmmdefs -i aligned.mlf -m -t 250.0 150.0 1000.0 -y lab -a -I ../../def/trainmlf -S trainmfc.scp ../../def/dict ../../def/monophones1 > HVite_log

This outputs the file 'aligned.mlf'.
The command is somewhat different from the one suggested by the HTK book and was taken from here.
This step is actually nearly redundant, since my dictionary has a single pronunciation for every word except FOUR.
Nevertheless, for the sake of completeness (and who knows what else will follow)...

HERest -C configtrain2 -I ../../def/phones1.mlf -t 250.0 150.0 1000.0 -S trainmfc.scp -H hmm7/macros -H hmm7/hmmdefs -M hmm8 ../../def/monophones1

HERest -C configtrain2 -I ../../def/phones1.mlf -t 250.0 150.0 1000.0 -S trainmfc.scp -H hmm8/macros -H hmm8/hmmdefs -M hmm9 ../../def/monophones1

6:31pm: Step 9 - Making Triphones from Monophones

The file 'mktri.led' was created:

WB sp
WB sil
TC

and the following command was executed:

HLEd -A -D -T 1 -n triphones1 -l '*' -i wintri.mlf ../../def/mktri.led aligned.mlf

'triphones1' contains a list of triphones
(each word's monophones were grouped into triphones; sil/sp were left as-is and treated as word boundaries by the 'WB' directives above).
'wintri.mlf' contains the new transcription of each file, in triphone format.
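To see what the conversion does to a transcription, here is a Python sketch that mimics word-internal triphone expansion with 'sp' and 'sil' as word boundaries. It approximates the behaviour described above and is not HLEd itself; `to_triphones` is a hypothetical name.

```python
BOUNDARIES = {"sp", "sil"}  # mirrors the two WB lines in mktri.led

def to_triphones(phones):
    """Word-internal triphone labels for a flat monophone sequence."""
    # Pad with boundary sentinels so every phone has a left and right neighbour.
    padded = ["sil"] + list(phones) + ["sil"]
    out = []
    for i in range(1, len(padded) - 1):
        left, phone, right = padded[i - 1], padded[i], padded[i + 1]
        if phone in BOUNDARIES:
            out.append(phone)  # boundary symbols are never expanded
            continue
        label = phone
        if left not in BOUNDARIES:
            label = left + "-" + label   # add left context
        if right not in BOUNDARIES:
            label = label + "+" + right  # add right context
        out.append(label)
    return out

print(" ".join(to_triphones("sil ow sp ey t sp s ih k s sp sil".split())))
```

A one-phone word such as "ow" has no word-internal context, so it stays a monophone; two-phone words yield biphones such as ey+t and ey-t.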

Next the following command was run:

perl -w ../../../samples/HTKTutorial/maketrihed monophones1 triphones1

which creates 'mktri.hed' with content:

CL triphones1
TI T_ey {(*-ey+*,ey+*,*-ey).transP}
TI T_t {(*-t+*,t+*,*-t).transP}
TI T_sp {(*-sp+*,sp+*,*-sp).transP}
TI T_f {(*-f+*,f+*,*-f).transP}
TI T_ay {(*-ay+*,ay+*,*-ay).transP}
TI T_v {(*-v+*,v+*,*-v).transP}
TI T_ao {(*-ao+*,ao+*,*-ao).transP}
TI T_r {(*-r+*,r+*,*-r).transP}
TI T_n {(*-n+*,n+*,*-n).transP}
TI T_ow {(*-ow+*,ow+*,*-ow).transP}
TI T_w {(*-w+*,w+*,*-w).transP}
TI T_ah {(*-ah+*,ah+*,*-ah).transP}
TI T_s {(*-s+*,s+*,*-s).transP}
TI T_eh {(*-eh+*,eh+*,*-eh).transP}
TI T_ih {(*-ih+*,ih+*,*-ih).transP}
TI T_k {(*-k+*,k+*,*-k).transP}
TI T_th {(*-th+*,th+*,*-th).transP}
TI T_iy {(*-iy+*,iy+*,*-iy).transP}
TI T_uw {(*-uw+*,uw+*,*-uw).transP}
TI T_z {(*-z+*,z+*,*-z).transP}
TI T_ia {(*-ia+*,ia+*,*-ia).transP}
TI T_sil {(*-sil+*,sil+*,*-sil).transP}

and thereafter,

HHEd -A -D -T 1 -H hmm9/macros -H hmm9/hmmdefs -M hmm10 mktri.hed monophones1

The HTK book says you can disregard the 'T_sil'-related warning:

WARNING [-2631] ApplyTie: Macro T_sil has nothing to tie of type t in HHEd

Running HERest two more times:

HERest -A -D -T 1 -C configtrain2 -I wintri.mlf -t 250.0 150.0 3000.0 -S trainmfc.scp -H hmm10/macros -H hmm10/hmmdefs -M hmm11 triphones1

HERest -A -D -T 1 -C configtrain2 -I wintri.mlf -t 250.0 150.0 3000.0 -s stats -S trainmfc.scp -H hmm11/macros -H hmm11/hmmdefs -M hmm12 triphones1

7:28pm: Step 10 - Making Tied-State Triphones

HDMan -A -D -T 1 -b sp -n fulllist -g ../../def/global.ded -l flog dict-tri ../../def/bigdict.txt

Where 'bigdict.txt' is taken from
http://www.voxforge.org/uploads/-A/h1/-Ah18p_AY2DzEs-9h-K-4g/voxforge_lexicon

next

cat triphones1 fulllist > fulllist1

and

perl -w fixfulllist.pl fulllist1 fulllist

where fixfulllist.pl is taken from here

I continue to follow the steps in the voxforge tutorial,

but saved a copy of 'tree.hed'

cp tree.hed tree.hed.old

and also created 'tree.hed.suffix'

TR 1

AU "fulllist"
CO "tiedlist"

ST "trees"

After creating the required folders, I ran:

perl -w ~/samples/RMHTK/perl_scripts/mkclscript.prl TB 350 ../../def/monophones0 >> tree.hed

and

HHEd -A -D -T 1 -H hmm12/macros -H hmm12/hmmdefs -M hmm13 tree.hed triphones1

which fails with:

AU fulllist
Creating HMMset using trees to add unseen triphones
ERROR [+2662] FindProtoModel: no proto for b in hSet
FATAL ERROR - Terminating program HHEd

Now, true, there's no 'b' sound in any of the digits, nor many other sounds such as 'm' or 'd'.
My plan is to remove from 'bigdict.txt' every entry that uses monophones I don't need.

8:55pm:
Finally, 'HHEd' passed after filtering out all lines matching these phone patterns:
"[bmljdkpqgy]\|aa\|zh\|ch\|ae\|aw\|ax\|en\|er\|hh\|sh\|uh".

The following shell script helped in the process; I just kept changing the grep -v pattern until only the required monophones were present:

grep -v "[bmljdkpqgy]\|aa\|zh\|ch\|ae\|aw\|ax\|en\|er\|hh\|sh\|uh" ../../def/bigdict.txt > ../../def/bigdict2.txt;

HDMan -A -D -T 1 -b sp -n fulllist -g ../../def/global.ded -l flog dict-tri ../../def/bigdict2.txt

cat triphones1 fulllist > fulllist1

perl -w fixfulllist.pl fulllist1 fulllist

rm tree.hed; cp tree.hed.old tree.hed

perl -w ~/samples/RMHTK/perl_scripts/mkclscript.prl TB 350 ../../def/monophones0 >> tree.hed

cat tree.hed.suffix >> tree.hed

HHEd -A -D -T 1 -H hmm12/macros -H hmm12/hmmdefs -M hmm13 tree.hed triphones1

There's probably a simpler way to work around this.
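One possibly simpler approach is to filter the lexicon by whole phone tokens instead of by characters. A character class like "[bmljdkpqgy]" rejects any line containing one of those letters, so (assuming the lexicon writes phones in lowercase) it also drops entries whose phones merely contain such a letter, for example 'ay' or 'k'. The Python sketch below keeps an entry only if every phone is in an allowed set. The "WORD [OUTPUT] phones" layout is an assumption about the lexicon format, and `filter_lexicon` is a hypothetical name.

```python
def filter_lexicon(lines, allowed_phones):
    """Keep entries whose pronunciation uses only phones in `allowed_phones`."""
    allowed = set(allowed_phones)
    kept = []
    for line in lines:
        fields = line.split()
        if len(fields) >= 2:
            # Assumed layout: WORD [OUTPUT] phone phone ...
            # The bracketed output symbol is treated as optional.
            phones = fields[2:] if fields[1].startswith("[") else fields[1:]
            if phones and all(p in allowed for p in phones):
                kept.append(line.rstrip("\n"))
    return kept

# Phone set taken from the mktri.hed listing above (without sp and sil).
monophones = "ey t f ay v ao r n ow w ah s eh ih k th iy uw z ia".split()
lexicon = ["EIGHT [EIGHT] ey t", "BEE [BEE] b iy", "SIX [SIX] s ih k s"]
print("\n".join(filter_lexicon(lexicon, monophones)))
```
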

My flog looks like this:

1. w : 65
2. ah : 71
3. n : 305
4. sp : 561
5. ow : 172
6. t : 375
7. uw : 69
8. s : 384
9. eh : 156
10. v : 46
11. ih : 280
12. f : 117
13. r : 204
14. ao : 49
15. z : 140
16. th : 43

which seems okay (every phone count is above 10).

now running

HERest -A -D -T 1 -C configtrain2 -I wintri.mlf -s stats -t 250.0 150.0 3000.0 -S trainmfc.scp -H hmm13/macros -H hmm13/hmmdefs -M hmm14 tiedlist

HERest -A -D -T 1 -C configtrain2 -I wintri.mlf -s stats -t 250.0 150.0 3000.0 -S trainmfc.scp -H hmm14/macros -H hmm14/hmmdefs -M hmm15 tiedlist

9:03pm. Done for now.

12:14am. Step 11 - Recognising the Test Data (finally!)

HVite -H ../train/hmm15/macros -H ../train/hmm15/hmmdefs -S testmfc.scp -l '*' -i recout.mlf -w ../../def/wdnet -p 0.0 -s 5.0 ../../def/dict ../train/tiedlist

HResults -I ../../def/testmlf ../train/tiedlist recout.mlf

This gives perfect results:
====================== HTK Results Analysis =======================
Date: Tue May 29 00:44:39 2012
Ref : ../../def/testmlf
Rec : recout.mlf
------------------------ Overall Results --------------------------
SENT: %Correct=100.00 [H=100, S=0, N=100]
WORD: %Corr=100.00, Acc=100.00 [H=641, D=0, S=0, I=0, N=641]
===================================================================

i.e., 100% recognition success!
(This is expected given the very low variance in the .wav files: every test utterance is concatenated from a small pool of single-digit recordings.)
12:46am.
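For reference, the two word-level figures in the HResults table follow from the counts in brackets: %Corr is H/N and Acc is (H-I)/N, both as percentages. A short Python sketch of that arithmetic (`word_scores` is a hypothetical helper name):

```python
def word_scores(h, d, s, i, n):
    """Return (%Corr, Acc) from hit/deletion/substitution/insertion counts."""
    # Every reference word is either a hit, a deletion or a substitution.
    assert h + d + s == n, "H + D + S should equal N"
    corr = 100.0 * h / n
    acc = 100.0 * (h - i) / n  # insertions are penalised only in Acc
    return corr, acc

print(word_scores(h=641, d=0, s=0, i=0, n=641))
```

With the counts from the run above (H=641, N=641 and no errors) both values come out as 100.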

References
0. HTKBook here
1. voxforge tutorial: here
2. htk problems and how to fix'em: here
3. scripts and files I wrote/used: here

May 30, 2012

HTK Tutorial microblogging May 27.