The Acoustic Model

The acoustic model relates the words to the underlying signal: P(signal|words).
This refines the process of speech recognition into the following steps:

  word --> sequence of phones --> acoustic signal --> vector quantization

Phonetic variation is due to:

pronunciation differences between dialects
coarticulation (slurring of adjacent phones)

The pronunciations are modeled as Markov models:



The states represent phones, and the transitions represent succession with
some probability. If a state has only one successor, that transition has
probability 1.

There is one Markov model per word. This model is used to predict the
probability that each particular sequence of phones occurs for that word.


P(phones|word) = product of the corresponding transition probabilities
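
As a minimal sketch of this product, the Python below stores a word model as a dictionary of transition probabilities and multiplies them along one phone sequence. The word, its phones, and every probability are hypothetical values chosen for illustration; they are not taken from the model described above.

```python
# Hypothetical pronunciation model for "tomato"; all probabilities are made up.
# Each state is a phone (numbered where a phone occurs twice in the word).
TOMATO = {
    "t1":  {"ow1": 0.5, "ah1": 0.5},  # coarticulation: [ow] may reduce to [ah]
    "ow1": {"m": 1.0},                # only one successor, so probability 1
    "ah1": {"m": 1.0},
    "m":   {"ey": 0.5, "aa": 0.5},    # dialect variation in the stressed vowel
    "ey":  {"t2": 1.0},
    "aa":  {"t2": 1.0},
    "t2":  {"ow2": 1.0},
    "ow2": {},                        # final phone, no successors
}

def path_probability(model, phones):
    """Multiply the transition probabilities along one phone sequence."""
    prob = 1.0
    for current, nxt in zip(phones, phones[1:]):
        prob *= model[current].get(nxt, 0.0)  # 0.0 if the transition is absent
    return prob

p = path_probability(TOMATO, ["t1", "ow1", "m", "ey", "t2", "ow2"])
print(p)  # 0.25
```

Only the two branching points contribute factors below 1, so the result is 0.5 * 0.5.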

To complete the recognition process, we still need P(signal|phone). For
this, we use a Hidden Markov Model (HMM). Here is an example for the
[m] phone.



Each state has multiple outputs with associated probabilities.
The outputs represent vector quantization values. The transitions can be 
loops, which permits iteration of vector quantization values for slow speakers.
The model is called hidden because we don't know which state (Onset, Mid, End)
produced which output (C1-C7). As a result, the pronunciation Markov model 
is actually a Hidden Markov model. Other phones have similar HMMs.

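Given vector quantization values, we compute P(VQ values|phone) along a
path through the states:

P(VQ values|phone) = P(state transition path) * product of P(VQ value|state)

If more than one path can produce the values, the path probabilities are summed.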

For example, consider VQ values [C3,C5,C6].

P([C3,C5,C6]|[m]) = P(Onset->Mid) * P(Mid->End) * P(End->Final)
                    * P(C3|Onset) * P(C5|Mid) * P(C6|End)
                  = (0.7)(0.1)(0.6) * (0.3)(0.1)(0.5) = 0.00063
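
The calculation above can be reproduced with a short Python sketch. The forward transition and output probabilities are the ones used in the worked example. The self-loop probabilities are not given in the text and are assumed here to be the complement of each forward transition. The table and function names are illustrative only.

```python
# Transition probabilities for the [m] HMM. Forward transitions come from the
# worked example; the self-loops are ASSUMED to be their complements.
TRANS = {
    ("Onset", "Mid"): 0.7, ("Onset", "Onset"): 0.3,  # self-loop assumed
    ("Mid", "End"): 0.1,   ("Mid", "Mid"): 0.9,      # self-loop assumed
    ("End", "Final"): 0.6, ("End", "End"): 0.4,      # self-loop assumed
}

# Output probabilities P(VQ value | state); only the entries used in the
# worked example are listed, anything else is treated as 0.
EMIT = {("Onset", "C3"): 0.3, ("Mid", "C5"): 0.1, ("End", "C6"): 0.5}

def path_likelihood(states, vq_values):
    """P(path) * P(VQ values | path) for one state path that ends in Final."""
    assert len(states) == len(vq_values)
    prob = 1.0
    for i, (state, vq) in enumerate(zip(states, vq_values), start=1):
        nxt = states[i] if i < len(states) else "Final"
        prob *= TRANS.get((state, nxt), 0.0) * EMIT.get((state, vq), 0.0)
    return prob

example = path_likelihood(["Onset", "Mid", "End"], ["C3", "C5", "C6"])
print(round(example, 7))  # 0.00063

# A slower [m]: the Mid state loops once, so C5 is emitted twice.
slow = path_likelihood(["Onset", "Mid", "Mid", "End"], ["C3", "C5", "C5", "C6"])
print(slow)
```

The second call shows how a loop absorbs an extra frame: the path gains one more transition factor and one more output factor, so its probability is smaller under these assumed values.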


Most phones last 5-10 frames, where one frame is 10 ms, so a typical phone
lasts 50-100 ms. The HMM probabilities are estimated from language data.
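
The notes do not say how these probabilities are obtained from the data. One simple possibility, sketched below in Python, is relative-frequency counting, under the assumption that every training frame has already been labeled with its state. The labeled sequences and the function name are hypothetical. Because the states are hidden in real recordings, such labels are usually not available, and an iterative procedure such as Baum-Welch re-estimation is typically used instead.

```python
from collections import Counter, defaultdict

def estimate_transitions(state_sequences):
    """Relative-frequency estimate of P(next state | state) from labeled paths."""
    counts = defaultdict(Counter)
    for seq in state_sequences:
        for current, nxt in zip(seq, seq[1:]):
            counts[current][nxt] += 1
    return {
        state: {nxt: n / sum(successors.values()) for nxt, n in successors.items()}
        for state, successors in counts.items()
    }

# Hypothetical frame-level labelings of three utterances of one phone.
labeled = [
    ["Onset", "Mid", "Mid", "End", "Final"],
    ["Onset", "Onset", "Mid", "End", "Final"],
    ["Onset", "Mid", "Mid", "Mid", "End", "End", "Final"],
]

probs = estimate_transitions(labeled)
for state, successors in probs.items():
    print(state, successors)
```

The same counting idea applies to output probabilities by tallying which VQ value was observed in each labeled state.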