Large vocabulary continuous speech recognition (LVCSR) systems are mostly based on the Hidden Markov Model (HMM). An HMM is specified by the following components:
| Q = q1, q2, q3, ... qN | a set of N states |
| A = a11, a12, ... a1n, ... ann | a transition probability matrix A, each aij representing the probability of moving from state i to state j, such that ∑j aij = 1 for every i |
| O = o1, o2, o3, ... oT | a sequence of T observations, each one drawn from a vocabulary V = v1, v2, ... vV |
| B = bi(ot) | a sequence of observation likelihoods (emission probabilities), each expressing the probability of an observation ot being generated from a state i |
| q0, qF | a start state and end state that are not associated with observations |
| πi | the probability that the HMM will start in state i. π is the initial probability distribution over states. |
| λ = (A, B, π) | a common notation for an HMM. A Hidden Markov Model is described by the three probability distributions A, B and π |
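As a minimal sketch, the components listed in the table can be held in a small Python container. The class name `DiscreteHMM`, the `validate` helper, and the two-state toy numbers are illustrative assumptions, not part of the source; the sketch also leaves out the non-emitting start and end states q0 and qF and relies on π instead.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class DiscreteHMM:
    """Container for lambda = (A, B, pi) with discrete observations."""
    states: List[str]               # Q: the set of states
    A: Dict[str, Dict[str, float]]  # A[i][j] = P(next state is j | current state is i)
    B: Dict[str, Dict[str, float]]  # B[i][v] = P(observation v | state i)
    pi: Dict[str, float]            # pi[i]  = P(first state is i)

    def validate(self, tol: float = 1e-9) -> None:
        """Raise ValueError unless pi and every row of A and B sum to 1."""
        if abs(sum(self.pi.values()) - 1.0) > tol:
            raise ValueError("pi must sum to 1")
        for i in self.states:
            if abs(sum(self.A[i].values()) - 1.0) > tol:
                raise ValueError(f"transition row for state {i} must sum to 1")
            if abs(sum(self.B[i].values()) - 1.0) > tol:
                raise ValueError(f"emission row for state {i} must sum to 1")


# Hypothetical two-state model; the numbers are for illustration only.
toy = DiscreteHMM(
    states=["s1", "s2"],
    A={"s1": {"s1": 0.7, "s2": 0.3}, "s2": {"s1": 0.4, "s2": 0.6}},
    B={"s1": {"x": 0.9, "y": 0.1}, "s2": {"x": 0.2, "y": 0.8}},
    pi={"s1": 0.6, "s2": 0.4},
)
toy.validate()
```

The row-sum check mirrors the constraint on A in the table: the probabilities of leaving any state must sum to 1.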
According to Jurafsky and Martin (2000), the goal of the probabilistic noisy channel architecture for speech recognition is to find "what is the most likely sentence out of all sentences in the language L given some acoustic input O".
Figure 1. The noisy channel model applied to entire sentences. Adopted from Jurafsky & Martin, 2000.
Most commonly, a speech recognition system consists of an acoustic processor and a linguistic decoder.
Figure 2. Architecture of an HMM-based Recogniser. Adopted from Gales & Young, 2007.
The acoustic processor takes a speech waveform as input and, in a process called feature extraction, outputs a sequence of fixed-size acoustic vectors, the observations O1:T.
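The goal of the decoder is to find the sequence of words Ŵ which is most likely to have generated O:

$$\hat{W} = \arg\max_{W \in L} P(W \mid O)$$

Applying Bayes' rule, and noting that P(O) does not depend on W:

$$\hat{W} = \arg\max_{W \in L} \frac{P(O \mid W)\,P(W)}{P(O)} = \arg\max_{W \in L} P(O \mid W)\,P(W),$$

where W ∈ L, P(O|W) is the acoustic model (for example, HMMs) and P(W) is the language model (for example, n-grams).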
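The way the acoustic model P(O|W) and the language model P(W) combine can be sketched in Python. This is only an illustration under strong assumptions: the candidate sentences are listed explicitly, and the scores in `acoustic` and `lm` are invented numbers, not outputs of real models. Log probabilities are added rather than multiplying raw probabilities, which avoids numerical underflow on long utterances.

```python
import math


def decode(candidates, acoustic_logprob, lm_logprob):
    """Return the W that maximizes log P(O|W) + log P(W) over a candidate list."""
    return max(candidates, key=lambda w: acoustic_logprob[w] + lm_logprob[w])


# Hypothetical scores for a single acoustic input O.
acoustic = {
    "recognize speech": math.log(0.0020),
    "wreck a nice beach": math.log(0.0025),
}
lm = {
    "recognize speech": math.log(0.0100),
    "wreck a nice beach": math.log(0.0001),
}

best = decode(list(acoustic), acoustic, lm)
print(best)
```

In this toy case the second sentence fits the acoustics slightly better, but the language model makes the first one far more probable overall, so the first one wins.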
The feature extraction stage is based on the Fourier transform, which represents a complex waveform as a sum of many simple waves of different frequencies. A spectrum is a representation of these different frequency components of a wave. A spectrogram is a way of visualizing how the different frequencies which make up a waveform change over time.
During the feature extraction process, the original waveform is sampled at a fixed rate and digitized. The digitized signal is divided into short overlapping windows, and a spectral analysis method is applied to each windowed signal to produce a parsimonious representation of the spectral properties of the waveform within the window. Standard methods include the discrete (fast) Fourier transform (FFT), all-pole minimum phase linear prediction (LPC) and autoregressive moving average methods. As a result, the original waveform is converted into a sequence of spectral feature vectors, each representing one time-slice of the input signal. Feature vectors describe how much energy the signal has at different frequencies and are typically computed every 10 ms using an overlapping analysis window of around 25 ms.
Figure 3. A segment of speech waveform corresponding to the word "judge" and the resulting short-time spectral analysis of the first four frames. Adopted from Juang and Rabiner, 1991.
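The framing and short-time spectral analysis described above can be sketched with the Python standard library alone. Several choices here are assumptions made for illustration: the Hamming window, the naive O(N²) discrete Fourier transform (a real system would use an FFT routine), the function names, and the synthetic 1 kHz test tone sampled at 8 kHz.

```python
import cmath
import math


def frame_signal(samples, sample_rate, window_ms=25.0, shift_ms=10.0):
    """Split samples into overlapping frames (25 ms window, 10 ms shift by default)."""
    win = int(round(sample_rate * window_ms / 1000.0))
    hop = int(round(sample_rate * shift_ms / 1000.0))
    return [samples[s:s + win] for s in range(0, len(samples) - win + 1, hop)]


def hamming(n):
    """Hamming window coefficients for a frame of n samples."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * i / (n - 1)) for i in range(n)]


def magnitude_spectrum(frame):
    """Naive DFT magnitudes for bins 0..N/2 of a Hamming-windowed frame."""
    n = len(frame)
    x = [f * w for f, w in zip(frame, hamming(n))]
    return [
        abs(sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)))
        for k in range(n // 2 + 1)
    ]


# Synthetic input: 100 ms of a 1 kHz sine wave sampled at 8 kHz.
sample_rate = 8000
signal = [math.sin(2.0 * math.pi * 1000.0 * t / sample_rate) for t in range(800)]

frames = frame_signal(signal, sample_rate)
spectrum = magnitude_spectrum(frames[0])
peak_bin = max(range(len(spectrum)), key=spectrum.__getitem__)
peak_hz = peak_bin * sample_rate / len(frames[0])
print(len(frames), len(frames[0]), peak_hz)
```

Each element of `frames` corresponds to one time-slice, and `spectrum` is the kind of per-frame energy-by-frequency vector from which practical feature vectors are derived.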
Each spoken word w is decomposed into a sequence of basic sounds called base phones. The acoustic model describes the probability of a specific observation given a base phone. Each base phone is represented by a continuous density HMM of the form illustrated below with transition probability parameters {aij} and output observation distributions {bj}.

In such an HMM, the observation sequences are vectors of spectral features. A self-loop on each state allows that state to generate several consecutive observations. These loops allow HMMs to model the variable duration of phones: longer phones require more loops through the HMM. There are also two special, so-called non-emitting states, which serve as the start and end states.
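The link between self-loops and phone duration can be made concrete. If a state has self-loop probability a, the number of frames spent in it follows a geometric distribution with mean 1 / (1 - a). The sketch below assumes a hypothetical phone model with three emitting states and a 10 ms frame shift; the function names and the loop probabilities are illustrative only.

```python
def expected_frames(self_loop_prob):
    """Expected number of frames spent in a state with the given self-loop probability."""
    if not 0.0 <= self_loop_prob < 1.0:
        raise ValueError("self-loop probability must be in [0, 1)")
    return 1.0 / (1.0 - self_loop_prob)


def expected_phone_duration_ms(self_loop_probs, frame_shift_ms=10.0):
    """Expected duration of a left-to-right phone HMM, one entry per emitting state."""
    return frame_shift_ms * sum(expected_frames(a) for a in self_loop_probs)


# Hypothetical phone with three emitting states.
short_phone = expected_phone_duration_ms([0.5, 0.5, 0.5])
long_phone = expected_phone_duration_ms([0.5, 0.8, 0.5])
print(short_phone, long_phone)
```

Raising the self-loop probability of the middle state from 0.5 to 0.8 lengthens the expected duration from about 60 ms to about 90 ms.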
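In this form the model obeys the standard first-order HMM assumptions of conditional independence. Each state depends only on the previous state, and each observation depends only on the state that generated it:

$$P(q_t \mid q_1, \ldots, q_{t-1}) = P(q_t \mid q_{t-1})$$

$$P(o_t \mid q_1, \ldots, q_T, o_1, \ldots, o_T) = P(o_t \mid q_t)$$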
Given a sequence of observed phones O = (o1 o2 o3 ... ot), we can use the Viterbi algorithm to find the best state sequence Q = (q1 q2 q3 ... qt).
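A minimal sketch of the Viterbi algorithm for a discrete-observation HMM follows. It works in log space, uses the initial distribution π in place of explicit non-emitting start and end states, and runs on a hypothetical two-state model; the function names and all numbers are assumptions made for illustration.

```python
import math


def log_table(table):
    """Convert a nested dict of probabilities to log probabilities."""
    return {i: {j: math.log(p) for j, p in row.items()} for i, row in table.items()}


def viterbi(observations, states, log_pi, log_A, log_B):
    """Return (best log probability, best state sequence) for the observations."""
    # delta[t][s]: best log probability of any state path that ends in s at time t.
    delta = [{s: log_pi[s] + log_B[s][observations[0]] for s in states}]
    backptr = [{}]
    for t in range(1, len(observations)):
        delta.append({})
        backptr.append({})
        for s in states:
            # Best predecessor of s, found by dynamic programming.
            prev = max(states, key=lambda r: delta[t - 1][r] + log_A[r][s])
            delta[t][s] = delta[t - 1][prev] + log_A[prev][s] + log_B[s][observations[t]]
            backptr[t][s] = prev
    # Trace the back-pointers from the best final state.
    last = max(states, key=lambda s: delta[-1][s])
    path = [last]
    for t in range(len(observations) - 1, 0, -1):
        path.append(backptr[t][path[-1]])
    path.reverse()
    return delta[-1][last], path


# Hypothetical two-state model.
states = ["s1", "s2"]
log_pi = {s: math.log(p) for s, p in {"s1": 0.6, "s2": 0.4}.items()}
log_A = log_table({"s1": {"s1": 0.7, "s2": 0.3}, "s2": {"s1": 0.4, "s2": 0.6}})
log_B = log_table({"s1": {"x": 0.9, "y": 0.1}, "s2": {"x": 0.2, "y": 0.8}})

best_logprob, best_path = viterbi(["x", "x", "y"], states, log_pi, log_A, log_B)
print(best_path, math.exp(best_logprob))
```

For the observation sequence x, x, y the best path stays in s1 for two frames and then moves to s2, with probability 0.6 · 0.9 · 0.7 · 0.9 · 0.3 · 0.8 ≈ 0.0816.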
In practice, the observation likelihoods are fine-grained probability estimates, computed via mixtures of Gaussian probability estimators or neural networks.
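As a hedged illustration of the Gaussian mixture case, the log likelihood of one feature vector under a mixture can be computed as below. The sketch assumes diagonal covariance matrices, which the text does not specify, and the function names and parameter values are hypothetical.

```python
import math


def log_gaussian_diag(x, mean, var):
    """Log density of x under a Gaussian with diagonal covariance."""
    return -0.5 * sum(
        math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v
        for xi, m, v in zip(x, mean, var)
    )


def log_gmm(x, weights, means, variances):
    """Log of sum_m c_m * N(x; mean_m, var_m), computed with log-sum-exp."""
    terms = [
        math.log(c) + log_gaussian_diag(x, m, v)
        for c, m, v in zip(weights, means, variances)
    ]
    top = max(terms)
    return top + math.log(sum(math.exp(t - top) for t in terms))


# Hypothetical two-component mixture over two-dimensional feature vectors.
weights = [0.4, 0.6]
means = [[0.0, 0.0], [3.0, 3.0]]
variances = [[1.0, 1.0], [1.0, 1.0]]

near_first = log_gmm([0.1, -0.1], weights, means, variances)
far_away = log_gmm([10.0, 10.0], weights, means, variances)
print(near_first, far_away)
```

A vector close to one of the component means receives a much higher log likelihood than one far from both, which is the quantity the decoder uses as the emission score of a state.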
In continuous speech, the input consists of a sequence of words whose boundaries are not marked. The task of finding word boundaries in connected speech is called segmentation, and it can also be solved with the Viterbi algorithm, combined with a more sophisticated N-gram language model.
There are two main limitations of the Viterbi decoder. First, the Viterbi decoder does not actually compute the sequence of words which is most probable given the input acoustics. Instead, it computes an approximation of this: the sequence of states (i.e. phones) which is most probable given the input. Usually, the most probable sequence of phones corresponds to the most probable sequence of words, but sometimes it does not. This happens when the most probable sequence of words contains a word with many pronunciations. The probability of that word is split across its pronunciation paths, so each individual path scores lower, and the algorithm may prefer a path through a word with only a single pronunciation, which may be incorrect.
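A small numeric illustration of this first limitation, using invented probabilities: suppose `word_a` has two pronunciations whose paths each have probability 0.3, while `word_b` has a single pronunciation whose path has probability 0.4. Picking the single best path and summing over all paths of a word then give different answers.

```python
# Hypothetical path probabilities, one entry per pronunciation of each word.
path_probs = {
    "word_a": [0.3, 0.3],  # two pronunciations, total probability 0.6
    "word_b": [0.4],       # one pronunciation, total probability 0.4
}

# Viterbi-style choice: compare only the single best path through each word.
viterbi_choice = max(path_probs, key=lambda w: max(path_probs[w]))

# Full-probability choice: sum over all pronunciation paths of each word.
full_choice = max(path_probs, key=lambda w: sum(path_probs[w]))

print(viterbi_choice, full_choice)
```

The best single path belongs to `word_b`, although `word_a` is the more probable word once its pronunciations are summed.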
A second limitation of the Viterbi decoder is that it cannot take advantage of language models that are more complicated than a bigram grammar. This is because of the dynamic programming assumption that if the best path for the entire observation sequence goes through state qi, it must include the best path up to and including state qi. With a trigram grammar this assumption can fail: the best trigram-probability path for a complete sentence may go through a word without including the best path to that word.
To address these limitations, more complex decoding algorithms are used, such as the N-best Viterbi algorithm (which returns a list of the N best sentences) and the A* decoder (which uses the A* search algorithm).