The Acoustic Model

The acoustic model relates the words to the underlying signal: P(signal|words).
This breaks the speech recognition process down into the following chain:

word --> sequence of phones --> acoustic signal --> vector quantization

The phonetic variations are due to:

The pronunciations are modeled as Markov models:

The states represent phones, and each transition gives the probability that one phone is followed by another. If a state has only one successor, that transition has probability 1.

There is one Markov model per word. The model gives the probability of each particular sequence of phones being used to pronounce that word.

P(phones|word) = product of the corresponding transition probabilities
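
The product above can be sketched in a few lines of Python. This is only an illustration: the word model, the state names (t1, ow2, and so on) and every probability in it are made up for the example and are not taken from these notes.

```python
# Hypothetical pronunciation model for a single word.
# transitions[state][next_state] = P(next_state | state)
# All phones and probabilities here are illustrative assumptions.
transitions = {
    "START": {"t1": 1.0},
    "t1": {"ow": 0.2, "ah": 0.8},
    "ow": {"m": 1.0},
    "ah": {"m": 1.0},
    "m": {"ey": 0.5, "aa": 0.5},
    "ey": {"t2": 1.0},
    "aa": {"t2": 1.0},
    "t2": {"ow2": 1.0},
    "ow2": {"END": 1.0},
}


def phone_sequence_probability(model, path):
    """Multiply the transition probabilities along one path of states."""
    prob = 1.0
    for current, nxt in zip(path, path[1:]):
        # A transition that is not in the model has probability 0.
        prob *= model.get(current, {}).get(nxt, 0.0)
    return prob


path = ["START", "t1", "ah", "m", "ey", "t2", "ow2", "END"]
print(phone_sequence_probability(transitions, path))
```

Only the two branching states contribute factors other than 1, so this path has probability 0.8 * 0.5.
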
To complete the recognition process, we still need P(signal|phone). For this, we use a Hidden Markov Model (HMM). Here is an example for the [m] phone.

Each state has multiple outputs with associated probabilities. The outputs represent vector quantization values. The transitions can be loops, which permits iteration of vector quantization values for slow speakers. The model is called hidden because we don't know which state (Onset, Mid, End) produced which output (C1-C7). As a result, the pronunciation Markov model is actually a Hidden Markov model. Other phones have similar HMMs.
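
To see how a loop stretches a phone over several frames, here is a small Python sketch. It assumes a state has exactly two ways out: a self-loop with probability p and a single exit with probability 1 - p. Under that assumption the number of frames spent in the state follows a geometric distribution. The value p = 0.9 is an assumption, chosen to match an exit probability of 0.1; the function names are hypothetical.

```python
def stay_distribution(loop_prob, max_frames):
    """P(state is occupied for exactly n frames), for n = 1..max_frames.

    Assumes the only alternatives are looping (loop_prob) or leaving.
    """
    exit_prob = 1.0 - loop_prob
    return [loop_prob ** (n - 1) * exit_prob for n in range(1, max_frames + 1)]


def expected_frames(loop_prob):
    """Mean number of frames spent in the state (geometric distribution)."""
    return 1.0 / (1.0 - loop_prob)


# Assumed self-loop probability; not read from the notes.
p = 0.9
print(stay_distribution(p, 5))
print(expected_frames(p))
```

A larger loop probability makes long stays more likely, which is how the same HMM can account for both fast and slow speakers.
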

Given vector quantization values, we compute P(VQ values|phone):

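P(VQ values|phone) = P(state transition path) * product of P(VQ value|state) along that path
If more than one path through the states can produce the values, the probabilities of those paths are added together.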
For example, consider the VQ values [C3,C5,C6]:

P([C3,C5,C6]|[m]) = P(Onset->Mid) * P(Mid->End) * P(End->Final)
                    * P(C3|Onset) * P(C5|Mid) * P(C6|End)
                  = (0.7)(0.1)(0.6) * (0.3)(0.1)(0.5)
                  = 0.00063
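
The same arithmetic can be checked with a short Python sketch. It uses only the six probabilities that appear in the worked example; the dictionary layout and the function name path_probability are assumptions made for illustration, and the rest of the [m] HMM (loops, other outputs) is left out.

```python
import math

# Transition and output probabilities taken from the worked example above.
transition_probs = {
    ("Onset", "Mid"): 0.7,
    ("Mid", "End"): 0.1,
    ("End", "Final"): 0.6,
}
output_probs = {
    ("Onset", "C3"): 0.3,
    ("Mid", "C5"): 0.1,
    ("End", "C6"): 0.5,
}


def path_probability(states, outputs, final_state="Final"):
    """Probability of one state path emitting the given VQ values.

    states[i] is the state that emits outputs[i]; the path must then
    leave the last state for final_state.
    """
    hops = list(zip(states, states[1:] + [final_state]))
    p_path = math.prod(transition_probs.get(hop, 0.0) for hop in hops)
    p_outputs = math.prod(
        output_probs.get(pair, 0.0) for pair in zip(states, outputs)
    )
    return p_path * p_outputs


result = path_probability(["Onset", "Mid", "End"], ["C3", "C5", "C6"])
print(round(result, 5))  # 0.00063
```

Any transition or output that is not listed is treated as having probability 0, so a path the example does not cover evaluates to 0 rather than raising an error.
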
Most phones last 5-10 frames, where one frame is 10 ms, so a typical phone spans 50-100 ms. The HMM probabilities are estimated from language data.