ch5 (recognition principles).ppt

7-Speech Recognition (Cont’d)
HMM Calculating Approaches
Neural Components
Three Basic HMM Problems
Viterbi Algorithm
State Duration Modeling
Training In HMM
Speech Recognition Concepts
Speech recognition is the inverse of speech synthesis.
[Figure: two mirrored pipelines]
Speech Synthesis: Text -> NLP -> Phone Sequence -> Speech Processing -> Speech
Speech Recognition: Speech -> Speech Processing -> NLP -> Text / Speech Understanding
Speech Recognition Approaches
Bottom-Up Approach
Top-Down Approach
Blackboard Approach
Bottom-Up Approach
[Figure: bottom-up processing chain]
Signal Processing -> Feature Extraction -> Segmentation -> Recognized Utterance
Knowledge Sources: Voiced/Unvoiced/Silence, Sound Classification Rules, Phonotactic Rules, Lexical Access, Language Model
Top-Down Approach
[Figure: top-down processing chain]
Feature Analysis -> Unit Matching System -> Lexical Hypothesis -> Syntactic Hypothesis -> Semantic Hypothesis -> Utterance Verifier/Matcher -> Recognized Utterance
Knowledge Sources: Inventory of speech recognition units, Word Dictionary, Grammar, Task Model
Blackboard Approach
[Figure: a central Blackboard shared by Acoustic Processes, Environmental Processes, Lexical Processes, Syntactic Processes, and Semantic Processes]
[Figure: An overall view of a speech recognition system, with top-down and bottom-up paths. From Ladefoged 2001]
Recognition Theories
Articulatory Based Recognition
– Uses the articulatory system for recognition
– The most successful theory to date
Auditory Based Recognition
– Uses the auditory system for recognition
Hybrid Based Recognition
– Combines the two theories above
Motor Theory
– Models the speaker's intended gestures
Recognition Problem
Given a sequence of acoustic symbols, we want to find the words
spoken by the speaker.
Solution : Find the most probable word sequence given the acoustic symbols.
Recognition Problem
A : Acoustic Symbols
W : Word Sequence
We should find $\hat{w}$ such that
$$P(\hat{w} \mid A) = \max_{w} P(w \mid A)$$
Bayes' Rule
$$P(x \mid y)\,P(y) = P(x, y)$$
$$P(x \mid y) = \frac{P(y \mid x)\,P(x)}{P(y)}$$
$$\Rightarrow\; P(w \mid A) = \frac{P(A \mid w)\,P(w)}{P(A)}$$
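Bayes' Rule (Cont’d)
$$P(\hat{w} \mid A) = \max_{w} P(w \mid A) = \max_{w} \frac{P(A \mid w)\,P(w)}{P(A)}$$
Since P(A) does not depend on w, it can be dropped from the maximization:
$$\hat{w} = \arg\max_{w} P(w \mid A) = \arg\max_{w} P(A \mid w)\,P(w)$$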
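The decision rule above can be illustrated with a small sketch. The candidate word sequences and all probability values below are hypothetical, chosen only to show how the acoustic score P(A|w) and the language model score P(w) combine. Log probabilities are used so that the product becomes a sum.

```python
import math

def decode(acoustic_loglik, prior_logprob):
    """Return the candidate w that maximizes log P(A|w) + log P(w)."""
    return max(acoustic_loglik,
               key=lambda w: acoustic_loglik[w] + prior_logprob[w])

# Hypothetical scores for three candidate word sequences (illustration only).
acoustic = {
    "recognize speech": math.log(0.020),    # log P(A | w)
    "wreck a nice beach": math.log(0.025),
    "recognize peach": math.log(0.010),
}
prior = {
    "recognize speech": math.log(0.30),     # log P(w)
    "wreck a nice beach": math.log(0.01),
    "recognize peach": math.log(0.05),
}

best = decode(acoustic, prior)
print(best)
```

In this toy setting the second candidate has the highest acoustic score, but the language model prior makes the first candidate the overall winner.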
Simple Language Model
$$w = w_1 w_2 w_3 \cdots w_n$$
$$P(w) = \prod_{i=1}^{n} P(w_i \mid w_{i-1} w_{i-2} \cdots w_1)$$
Computing this probability directly is very difficult and requires a very
large database, so we use trigram and bigram models instead.
Simple Language Model (Cont’d)
Trigram :
$$P(w) = \prod_{i=1}^{n} P(w_i \mid w_{i-1} w_{i-2})$$
Bigram :
$$P(w) = \prod_{i=1}^{n} P(w_i \mid w_{i-1})$$
Monogram :
$$P(w) = \prod_{i=1}^{n} P(w_i)$$
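As a rough illustration of the bigram product, the sketch below scores a short sentence. The probability table and the sentence-start marker `<s>` are assumptions made for the example and are not taken from the slides.

```python
import math

# Hypothetical bigram probabilities P(w_i | w_{i-1}); "<s>" marks the sentence start.
BIGRAM = {
    ("<s>", "i"): 0.5,
    ("i", "like"): 0.4,
    ("like", "speech"): 0.25,
}

def bigram_logprob(words, table):
    """Sum log P(w_i | w_{i-1}) over the sentence, starting from <s>.

    Raises KeyError for a bigram missing from the table; a real model
    would need some form of smoothing for unseen pairs.
    """
    total = 0.0
    prev = "<s>"
    for w in words:
        total += math.log(table[(prev, w)])
        prev = w
    return total

logp = bigram_logprob(["i", "like", "speech"], BIGRAM)
prob = math.exp(logp)  # product of the three conditional probabilities
```

Working in the log domain avoids numerical underflow when the sentence is long.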
Simple Language Model (Cont’d)
Computing Method :
$$P(w_3 \mid w_2 w_1) = \frac{\text{Number of occurrences of } w_3 \text{ after } w_1 w_2}{\text{Total number of occurrences of } w_1 w_2}$$
Ad Hoc Method :
$$P(w_3 \mid w_2 w_1) = \lambda_1 f(w_3 \mid w_2 w_1) + \lambda_2 f(w_3 \mid w_2) + \lambda_3 f(w_3)$$
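Both estimation methods can be sketched in a few lines. The toy corpus and the interpolation weights below are assumptions for illustration; the slides do not specify how the weights are chosen. Here f denotes a relative frequency computed from counts.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rel_freq(num, den):
    """Relative frequency, defined as 0 when the context was never seen."""
    return num / den if den else 0.0

def interpolated_trigram(tokens, w1, w2, w3, weights=(0.6, 0.3, 0.1)):
    """Weighted sum of trigram, bigram and unigram relative frequencies.

    The weights are hypothetical and should sum to 1.
    """
    uni, bi, tri = (ngram_counts(tokens, n) for n in (1, 2, 3))
    f3 = rel_freq(tri[(w1, w2, w3)], bi[(w1, w2)])  # count(w1 w2 w3) / count(w1 w2)
    f2 = rel_freq(bi[(w2, w3)], uni[(w2,)])         # count(w2 w3) / count(w2)
    f1 = rel_freq(uni[(w3,)], len(tokens))          # count(w3) / corpus size
    a1, a2, a3 = weights
    return a1 * f3 + a2 * f2 + a3 * f1

corpus = "the cat sat on the mat the cat ran".split()
p = interpolated_trigram(corpus, "the", "cat", "sat")
```

The interpolated estimate stays nonzero for a trigram that never occurred, as long as its bigram or unigram was seen, which is the practical reason for mixing the three frequencies.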
7-Speech Recognition
Speech Recognition Concepts
Speech Recognition Approaches
Recognition Theories
Bayes' Rule
Simple Language Model
P(A|W) Network Types
[Figure from Ladefoged 2001]
P(A|W) Computing Approaches
Dynamic Time Warping (DTW)
Hidden Markov Model (HMM)
Artificial Neural Network (ANN)
Hybrid Systems
Dynamic Time Warping Method (DTW)
To obtain a global distance between two speech patterns, a time alignment must be performed.
Ex : [Figure: a time alignment path between a template pattern “SPEECH” and a noisy input “SsPEEhH”]
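The alignment idea can be sketched with the standard dynamic programming recurrence. This is a minimal version under simplifying assumptions: patterns are sequences of symbols, the local cost is 0 for a match and 1 otherwise, and no slope or window constraints are applied. Real systems compare feature vectors with a numeric distance instead.

```python
def dtw_distance(template, observed, cost=lambda a, b: 0.0 if a == b else 1.0):
    """Global DTW distance between two sequences.

    D[i][j] is the cheapest cost of aligning the first i template
    elements with the first j observed elements.
    """
    n, m = len(template), len(observed)
    inf = float("inf")
    D = [[inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = cost(template[i - 1], observed[j - 1])
            # Extend the best of: advance template only, advance input only, advance both.
            D[i][j] = local + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]

d = dtw_distance("SPEECH", "SsPEEhH")
print(d)  # 2.0: "s" is absorbed by the template "S", and "h" is aligned with "C"
```

The comparison is case sensitive, so the lowercase letters in the noisy input count as mismatches.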
Recognition Tasks
Isolated Word Recognition (IWR) and Continuous Speech Recognition (CSR)
Speaker Dependent and Speaker Independent
Vocabulary Size
– Small : <20
– Medium : >100, <1000
– Large : >1000, <10000
– Very Large : >10000
Error-Producing Factors
Prosody (recognition should be prosody independent)
Noise (noise should be prevented)
Spontaneous Speech
Artificial Neural Network
[Figure: simple computation element of a neural network, with inputs x_0 … x_{N-1}, weights w_0 … w_{N-1}, and output y]
$$y = \varphi\left(\sum_{i=0}^{N-1} w_i x_i - \theta\right)$$
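A single computation element can be sketched as a weighted sum of the inputs, offset by a threshold and passed through a nonlinearity. The hard limiter and the numeric values below are assumptions for illustration; the slide does not say which nonlinearity is used.

```python
def hard_limiter(s):
    """Step nonlinearity: 1 when the net input is non-negative, else 0."""
    return 1.0 if s >= 0 else 0.0

def neuron(x, w, theta, phi=hard_limiter):
    """Output of one unit: phi(sum_i w_i * x_i - theta)."""
    net = sum(wi * xi for wi, xi in zip(w, x)) - theta
    return phi(net)

# Hypothetical inputs, weights and threshold.
x = [1.0, 0.5, -1.0]
w = [0.2, 0.4, 0.1]
y = neuron(x, w, theta=0.1)
```

Passing a different function as `phi` (for example a sigmoid) changes the unit from a threshold element to a smooth one without altering the weighted sum.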
Artificial Neural Network (Cont’d)
Neural Network Types
– Perceptron
– Time Delay
– Time Delay Neural Network Computational Element (TDNN)
Artificial Neural Network (Cont’d)
Single Layer Perceptron
[Figure: single-layer perceptron mapping inputs x_0 … x_{N-1} to outputs y_0 … y_{M-1}]
Artificial Neural Network (Cont’d)
Three Layer Perceptron
[Figure: three-layer perceptron]
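The forward pass of a layered perceptron can be sketched by applying the same computation element layer by layer. The sigmoid nonlinearity, the layer sizes and the weights below are all assumptions for illustration. A single-layer perceptron is the special case with one entry in `layers`.

```python
import math

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

def layer(x, weights, thresholds):
    """One layer: unit j outputs sigmoid(sum_i weights[j][i] * x[i] - thresholds[j])."""
    return [sigmoid(sum(wi * xi for wi, xi in zip(row, x)) - th)
            for row, th in zip(weights, thresholds)]

def forward(x, layers):
    """Feed x through each (weights, thresholds) pair in order."""
    for weights, thresholds in layers:
        x = layer(x, weights, thresholds)
    return x

# Hypothetical three-layer network: 2 inputs -> 3 units -> 2 units -> 1 output.
layers = [
    ([[0.5, -0.2], [0.1, 0.3], [-0.4, 0.8]], [0.0, 0.1, -0.1]),
    ([[0.3, -0.6, 0.2], [0.7, 0.1, -0.5]], [0.05, 0.0]),
    ([[1.0, -1.0]], [0.0]),
]
out = forward([1.0, 2.0], layers)
```

Training (choosing the weights and thresholds) is not shown here; the sketch covers only how an input propagates to the outputs.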
Hybrid Methods
Hybrid Neural Network and Matched Filter for Recognition
[Figure: Speech -> Acoustic Features -> Delays -> Pattern Classifier -> Output Units]
Neural Network Properties
The system is simple, but highly iterative
Does not impose a specific structure
Despite its simplicity, the results are good
The training set is large, so training should be done offline
Accuracy is relatively good
Hidden Markov Model
[Figure: two states S_i and S_j connected by transitions a_ij and a_ji]
Observation : $O_1, O_2, O_3, \ldots, O_t$
States in time : $q_1, q_2, q_3, \ldots, q_t$
All states : $s_1, s_2, \ldots$
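To make the notation concrete, the sketch below computes the probability of one state sequence q_1, ..., q_t from the transition probabilities a_ij. The two-state model, its numbers and the initial distribution are assumptions for illustration; the initial distribution in particular is not defined on this slide, and observation probabilities are left out.

```python
def state_sequence_prob(seq, initial, trans):
    """P(q_1, ..., q_t) = initial[q_1] * product of trans[q_{k-1}][q_k]."""
    p = initial[seq[0]]
    for prev, cur in zip(seq, seq[1:]):
        p *= trans[prev][cur]
    return p

# Hypothetical two-state model; trans[i][j] plays the role of a_ij.
trans = {
    "s1": {"s1": 0.7, "s2": 0.3},
    "s2": {"s1": 0.4, "s2": 0.6},
}
initial = {"s1": 0.5, "s2": 0.5}  # assumed starting distribution

p = state_sequence_prob(["s1", "s1", "s2"], initial, trans)  # 0.5 * 0.7 * 0.3
```

Each row of `trans` sums to 1, since from any state the model must move to some state at the next time step.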