Slides - Yisong Yue

Machine Learning & Data Mining
CS/CNS/EE 155
Lecture 5:
Sequence Prediction & HMMs
1
Announcements
• Homework 1 Due Today @5pm
– Via Moodle
• Homework 2 to be Released Soon.
– Due Feb 3rd via Moodle (2 weeks)
– Mostly Coding
• Don’t start late!
• Use Moodle Forums for Q&A
2
Sequence Prediction
(POS Tagging)
• x = “Fish Sleep”
• y = (N, V)
• x = “The Dog Ate My Homework”
• y = (D, N, V, D, N)
• x = “The Fox Jumped Over The Fence”
• y = (D, N, V, P, D, N)
3
Challenges
• Multivariable Output
– Makes multiple predictions simultaneously
• Variable Length Input/Output
– Sentence lengths not fixed
4
Multivariate Outputs
• x = “Fish Sleep”
• y = (N, V)
POS Tags:
Det, Noun, Verb, Adj, Adv, Prep
• Multiclass prediction:
  – Replicate weights (one per class)
  – Score all classes
  – Predict via largest score
• How many classes?
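The scoring equations on this slide were figures; as a rough sketch of the idea (hypothetical names and shapes, not the lecture's notation), multiclass prediction with replicated weights scores every class and takes the argmax:

```python
import numpy as np

# Illustrative sketch: one (w_k, b_k) per class, score all classes,
# predict via the largest score. Shapes and names are assumptions.
def predict_multiclass(x, W, b):
    """x: (d,) feature vector; W: (L, d) one weight row per class; b: (L,)."""
    scores = W @ x - b             # score every class
    return int(np.argmax(scores))  # predict via largest score
```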
5
Multiclass Prediction
• x = “Fish Sleep”
• y = (N, V)
• Multiclass prediction:
POS Tags:
Det, Noun, Verb, Adj, Adv, Prep
L=6
– Treat all possible length-M sequences as different classes:
  • (D, D), (D, N), (D, V), (D, Adj), (D, Adv), (D, Pr)
  • (N, D), (N, N), (N, V), (N, Adj), (N, Adv), …
• L^M classes!
  – Length 2: 6^2 = 36!
Exponential Explosion in #Classes!
(Not Tractable for Sequence Prediction)
6
Why is Naïve Multiclass Intractable?
x=“I fish often”
POS Tags:
Det, Noun, Verb, Adj, Adv, Prep
– (D, D, D), (D, D, N), (D, D, V), (D, D, Adj), (D, D, Adv), (D, D, Pr)
– (D, N, D), (D, N, N), (D, N, V), (D, N, Adj), (D, N, Adv), (D, N, Pr)
– (D, V, D), (D, V, N), (D, V, V), (D, V, Adj), (D, V, Adv), (D, V, Pr)
– …
– (N, D, D), (N, D, N), (N, D, V), (N, D, Adj), (N, D, Adv), (N, D, Pr)
– (N, N, D), (N, N, N), (N, N, V), (N, N, Adj), (N, N, Adv), (N, N, Pr)
– …
Treats every combination as a different class
(Learn (w,b) for each combination)
Exponentially Large Representation!
(Exponential Time to Consider Every Class)
(Exponential Storage)
Assume pronouns are nouns for simplicity.
7
Independent Classification
x=“I fish often”
POS Tags:
Det, Noun, Verb, Adj, Adv, Prep
• Treat each word independently (assumption)
– Independent multiclass prediction per word
– Predict for x=“I” independently
– Predict for x=“fish” independently
– Predict for x=“often” independently
– Concatenate predictions
#Classes = #POS Tags (6 in our example)
Solvable using standard multiclass prediction.
Assume pronouns are nouns for simplicity.
8
Independent Classification
x=“I fish often”
POS Tags:
Det, Noun, Verb, Adj, Adv, Prep
• Treat each word independently
– Independent multiclass prediction per word
P(y|x)        x=“I”   x=“fish”   x=“often”
y=“Det”       0.0     0.0        0.0
y=“Noun”      1.0     0.75       0.0
y=“Verb”      0.0     0.25       0.0
y=“Adj”       0.0     0.0        0.4
y=“Adv”       0.0     0.0        0.6
y=“Prep”      0.0     0.0        0.0
Assume pronouns are nouns for simplicity.
Prediction: (N, N, Adv)
Correct: (N, V, Adv)
Why the mistake?
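As a small illustrative sketch (the probabilities are the table values above; the code itself is not part of the lecture), independent classification just takes the per-word argmax:

```python
# Per-word argmax over the table above; predicts (Noun, Noun, Adv),
# missing that "fish" should be a Verb in this sentence.
probs = {
    "I":     {"Det": 0.0, "Noun": 1.0,  "Verb": 0.0,  "Adj": 0.0, "Adv": 0.0, "Prep": 0.0},
    "fish":  {"Det": 0.0, "Noun": 0.75, "Verb": 0.25, "Adj": 0.0, "Adv": 0.0, "Prep": 0.0},
    "often": {"Det": 0.0, "Noun": 0.0,  "Verb": 0.0,  "Adj": 0.4, "Adv": 0.6, "Prep": 0.0},
}
prediction = [max(probs[w], key=probs[w].get) for w in ["I", "fish", "often"]]
print(prediction)  # ['Noun', 'Noun', 'Adv']
```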
9
Context Between Words
x=“I fish often”
POS Tags:
Det, Noun, Verb, Adj, Adv, Prep
• Independent Predictions Ignore Word Pairs
– In Isolation:
• “Fish” is more likely to be a Noun
– But Conditioned on Following a (pro)Noun…
• “Fish” is more likely to be a Verb!
– “1st Order” Dependence (Model All Pairs)
• 2nd Order Considers All Triplets
• Arbitrary Order = Exponential Size (Naïve Multiclass)
Assume pronouns are nouns for simplicity.
10
1st Order Hidden Markov Model
• x = (x1, x2, x3, x4, …, xM)   (sequence of words)
• y = (y1, y2, y3, y4, …, yM)   (sequence of POS tags)
• P(xi|yi)      Probability of state yi generating xi
• P(yi+1|yi)    Probability of state yi transitioning to yi+1
• P(y1|y0)      y0 is defined to be the Start state
• P(End|yM)     Prior probability of yM being the final state
  – Not always used
11
1st Order Hidden Markov Model
• P(xi|yi)      Probability of state yi generating xi
  – Models all state-observation pairs (all Tag-Word pairs)
  – Same complexity as independent multiclass
• P(yi+1|yi)    Probability of state yi transitioning to yi+1
  – Additional complexity of (#POS Tags)^2
  – Models all state-state pairs (all POS Tag-Tag pairs)
• P(y1|y0)      y0 is defined to be the Start state
• P(End|yM)     Prior probability of yM being the final state
  – Not always used
12
1st Order Hidden Markov Model
$$P(x, y) = P(\mathrm{End} \mid y_M)\prod_{i=1}^{M} P(y_i \mid y_{i-1})\prod_{i=1}^{M} P(x_i \mid y_i)$$

“Joint Distribution”

• P(xi|yi)      Probability of state yi generating xi
• P(yi+1|yi)    Probability of state yi transitioning to yi+1
• P(y1|y0)      y0 is defined to be the Start state
• P(End|yM)     Prior probability of yM being the final state (Optional)
  – Not always used
13
1st Order Hidden Markov Model
$$P(x \mid y) = \prod_{i=1}^{M} P(x_i \mid y_i)$$

Given a POS Tag Sequence y:
Can compute each P(xi|y) independently!
(xi conditionally independent given yi)

“Conditional Distribution on x given y”

• P(xi|yi)      Probability of state yi generating xi
• P(yi+1|yi)    Probability of state yi transitioning to yi+1
• P(y1|y0)      y0 is defined to be the Start state
• P(End|yM)     Prior probability of yM being the final state
  – Not always used
14
P ( word | state/tag )
• Two-word language: “fish” and “sleep”
• Two-tag language: “Noun” and “Verb”
P(x|y)        y=“Noun”   y=“Verb”
x=“fish”      0.8        0.5
x=“sleep”     0.2        0.5

Slides borrowed from Ralph Grishman

Given Tag Sequence y:
P(“fish sleep”  | (N,V)) = 0.8*0.5
P(“fish fish”   | (N,V)) = 0.8*0.5
P(“sleep fish”  | (V,V)) = 0.5*0.5
P(“sleep sleep” | (N,N)) = 0.2*0.2
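A tiny sketch of these products (illustrative, not lecture code): P(x|y) is just the emission probabilities from the table multiplied position by position:

```python
# Emission table P(x|y) from the slide.
emit = {"Noun": {"fish": 0.8, "sleep": 0.2},
        "Verb": {"fish": 0.5, "sleep": 0.5}}

def cond_prob(words, tags):
    """P(x | y) = product of P(x_i | y_i)."""
    p = 1.0
    for w, t in zip(words, tags):
        p *= emit[t][w]
    return p

print(cond_prob(["fish", "sleep"], ["Noun", "Verb"]))  # 0.4 = 0.8*0.5
```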
15
Sampling
• HMMs are “generative” models
– Models joint distribution P(x,y)
– Can generate samples from this distribution
– First consider conditional distribution P(x|y)
P(x|y)        y=“Noun”   y=“Verb”
x=“fish”      0.8        0.5
x=“sleep”     0.2        0.5
Given Tag Sequence y = (N,V):
Sample each word independently:
Sample P(x1| N) (0.8 Fish, 0.2 Sleep)
Sample P(x2| V) (0.5 Fish, 0.5 Sleep)
– What about sampling from P(x,y)?
16
Forward Sampling of P(y,x)
$$P(x, y) = P(\mathrm{End} \mid y_M)\prod_{i=1}^{M} P(y_i \mid y_{i-1})\prod_{i=1}^{M} P(x_i \mid y_i)$$

P(x|y)        y=“Noun”   y=“Verb”
x=“fish”      0.8        0.5
x=“sleep”     0.2        0.5

[Figure: “A Simple POS HMM” with states start, noun, verb, end. Transitions: start→noun 0.8, start→verb 0.2; noun→noun 0.1, noun→verb 0.8, noun→end 0.1; verb→noun 0.2, verb→verb 0.1, verb→end 0.7]

Initialize y0 = Start
Initialize i = 0
1. i = i + 1
2. Sample yi from P(yi|yi-1)
3. If yi == End: Quit
4. Sample xi from P(xi|yi)
5. Goto Step 1

Exploits Conditional Ind.
Requires P(End|yi)

Slides borrowed from Ralph Grishman
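A minimal sketch of this forward-sampling loop (illustrative code; the transition numbers are the edge labels from the reconstructed figure above):

```python
import random

# Transition and emission tables for the simple POS HMM above
# (edge labels assumed from the reconstructed figure).
trans = {"start": {"noun": 0.8, "verb": 0.2},
         "noun":  {"noun": 0.1, "verb": 0.8, "end": 0.1},
         "verb":  {"noun": 0.2, "verb": 0.1, "end": 0.7}}
emit  = {"noun": {"fish": 0.8, "sleep": 0.2},
         "verb": {"fish": 0.5, "sleep": 0.5}}

def sample_choice(dist):
    states, probs = zip(*dist.items())
    return random.choices(states, weights=probs)[0]

def forward_sample():
    """Sample (y, x) jointly: walk the chain until End, emitting a word per state."""
    y, x, state = [], [], "start"
    while True:
        state = sample_choice(trans[state])   # sample y_i ~ P(y_i | y_{i-1})
        if state == "end":
            return y, x
        y.append(state)
        x.append(sample_choice(emit[state]))  # sample x_i ~ P(x_i | y_i)

print(forward_sample())  # e.g. (['noun', 'verb'], ['fish', 'sleep'])
```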
17
Forward Sampling of P(y,x|L)
$$P(x, y \mid M) = \prod_{i=1}^{M} P(y_i \mid y_{i-1})\prod_{i=1}^{M} P(x_i \mid y_i)$$

P(x|y)        y=“Noun”   y=“Verb”
x=“fish”      0.8        0.5
x=“sleep”     0.2        0.5

[Figure: “A Simple POS HMM” with states start, noun, verb and no End state; start→noun 0.8, start→verb 0.2, with remaining noun/verb transition labels 0.333, 0.667, 0.91, 0.09]

Initialize y0 = Start
Initialize i = 0
1. i = i + 1
2. If i == M: Quit
3. Sample yi from P(yi|yi-1)
4. Sample xi from P(xi|yi)
5. Goto Step 1

Exploits Conditional Ind.
Assumes no P(End|yi)

Slides borrowed from Ralph Grishman
18
1st Order Hidden Markov Model
$$P(x_{k+1:M}, y_{k+1:M} \mid x_{1:k}, y_{1:k}) = P(x_{k+1:M}, y_{k+1:M} \mid y_k)$$

“Memory-less Model” – only needs yk to model the rest of the sequence

• P(xi|yi)      Probability of state yi generating xi
• P(yi+1|yi)    Probability of state yi transitioning to yi+1
• P(y1|y0)      y0 is defined to be the Start state
• P(End|yM)     Prior probability of yM being the final state
  – Not always used
19
Viterbi Algorithm
20
Most Common Prediction Problem
• Given input sentence, predict POS Tag seq.
argmax P ( y | x )
y
• Naïve approach:
– Try all possible y’s
– Choose one with highest probability
– Exponential time: L^M possible y’s
21
Bayes’s Rule
$$\operatorname*{argmax}_{y} P(y \mid x) = \operatorname*{argmax}_{y} \frac{P(y, x)}{P(x)} = \operatorname*{argmax}_{y} P(y, x) = \operatorname*{argmax}_{y} P(x \mid y)\, P(y)$$

$$P(x \mid y) = \prod_{i=1}^{M} P(x_i \mid y_i) \qquad\qquad P(y) = P(\mathrm{End} \mid y_M)\prod_{i=1}^{M} P(y_i \mid y_{i-1})$$
22
$$\operatorname*{argmax}_{y} P(y, x) = \operatorname*{argmax}_{y} \prod_{i=1}^{M} P(y_i \mid y_{i-1})\prod_{i=1}^{M} P(x_i \mid y_i)$$
$$= \operatorname*{argmax}_{y_{1:M-1}} \operatorname*{argmax}_{y_M} \prod_{i=1}^{M} P(y_i \mid y_{i-1})\prod_{i=1}^{M} P(x_i \mid y_i)$$
$$= \operatorname*{argmax}_{y_{1:M-1}} \operatorname*{argmax}_{y_M} P(y_M \mid y_{M-1})\, P(x_M \mid y_M)\, P(y_{1:M-1} \mid x_{1:M-1})$$

$$P(y_{1:k} \mid x_{1:k}) \propto P(x_{1:k} \mid y_{1:k})\, P(y_{1:k}) \qquad P(x_{1:k} \mid y_{1:k}) = \prod_{i=1}^{k} P(x_i \mid y_i) \qquad P(y_{1:k}) = \prod_{i=1}^{k} P(y_i \mid y_{i-1})$$

Exploit Memory-less Property:
The choice of yM only depends on y1:M-1 via P(yM|yM-1)!
23
Dynamic Programming
• Input: x = (x1,x2,x3,…,xM)
• Computed: best length-k prefix ending in each Tag:
– Examples:
$$\hat{Y}_k(V) = \left(\operatorname*{argmax}_{y_{1:k-1}} P(y_{1:k-1} \oplus V,\; x_{1:k})\right) \oplus V$$
$$\hat{Y}_k(N) = \left(\operatorname*{argmax}_{y_{1:k-1}} P(y_{1:k-1} \oplus N,\; x_{1:k})\right) \oplus N$$

(⊕ denotes sequence concatenation)

• Claim:

$$\hat{Y}_{k+1}(V) = \left(\operatorname*{argmax}_{y_{1:k} \in \{\hat{Y}_k(T)\}_T} P(y_{1:k} \oplus V,\; x_{1:k+1})\right) \oplus V$$
$$= \left(\operatorname*{argmax}_{y_{1:k} \in \{\hat{Y}_k(T)\}_T} P(y_{1:k}, x_{1:k})\, P(y_{k+1} = V \mid y_k)\, P(x_{k+1} \mid y_{k+1} = V)\right) \oplus V$$

(P(y1:k, x1:k) is pre-computed)
Recursive Definition!
24
Solve:
$$\hat{Y}_2(V) = \left(\operatorname*{argmax}_{y_1 \in \{\hat{Y}_1(T)\}_T} P(y_1, x_1)\, P(y_2 = V \mid y_1)\, P(x_2 \mid y_2 = V)\right) \oplus V$$

[Trellis: columns Ŷ1(Z) and Ŷ2(Z) for Z = V, D, N; arrows from y1 = V, D, N into Ŷ2(V). Store each Ŷ1(Z) & P(Ŷ1(Z), x1). Ŷ1(Z) is just Z.]
25
Solve:
$$\hat{Y}_2(V) = \left(\operatorname*{argmax}_{y_1 \in \{\hat{Y}_1(T)\}_T} P(y_1, x_1)\, P(y_2 = V \mid y_1)\, P(x_2 \mid y_2 = V)\right) \oplus V$$

[Trellis: columns Ŷ1(Z) and Ŷ2(Z) for Z = V, D, N. Store each Ŷ1(Z) & P(Ŷ1(Z), x1). Ŷ1(Z) is just Z.]

Ex: Ŷ2(V) = (N, V)
26
Solve:
$$\hat{Y}_3(V) = \left(\operatorname*{argmax}_{y_{1:2} \in \{\hat{Y}_2(T)\}_T} P(y_{1:2}, x_{1:2})\, P(y_3 = V \mid y_2)\, P(x_3 \mid y_3 = V)\right) \oplus V$$

[Trellis: columns Ŷ1(Z), Ŷ2(Z), Ŷ3(Z) for Z = V, D, N; arrows from y2 = V, D, N into Ŷ3(V). Store each Ŷ1(Z) & P(Ŷ1(Z), x1) and each Ŷ2(Z) & P(Ŷ2(Z), x1:2).]

Ex: Ŷ2(V) = (N, V)
27
Solve:
$$\hat{Y}_3(V) = \left(\operatorname*{argmax}_{y_{1:2} \in \{\hat{Y}_2(T)\}_T} P(y_{1:2}, x_{1:2})\, P(y_3 = V \mid y_2)\, P(x_3 \mid y_3 = V)\right) \oplus V$$

Claim: Only need to check solutions of Ŷ2(Z), Z = V, D, N

[Trellis: columns Ŷ1(Z), Ŷ2(Z), Ŷ3(Z) for Z = V, D, N; arrows from y2 = V, D, N into Ŷ3(V). Store each Ŷ1(Z) & P(Ŷ1(Z), x1) and each Ŷ2(Z) & P(Ŷ2(Z), x1:2).]

Ex: Ŷ2(V) = (N, V)
28
Solve:
$$\hat{Y}_3(V) = \left(\operatorname*{argmax}_{y_{1:2} \in \{\hat{Y}_2(T)\}_T} P(y_{1:2}, x_{1:2})\, P(y_3 = V \mid y_2)\, P(x_3 \mid y_3 = V)\right) \oplus V$$

Claim: Only need to check solutions of Ŷ2(Z), Z = V, D, N

Suppose Ŷ3(V) = (V,V,V)…
…prove that Ŷ3(V) = (N,V,V) has higher prob.
Proof depends on 1st order property:
• Prob. of (V,V,V) & (N,V,V) differ in 3 terms
  • P(y1|y0), P(x1|y1), P(y2|y1)
  • None of these depend on y3

[Trellis: columns Ŷ1(Z), Ŷ2(Z), Ŷ3(Z) for Z = V, D, N; arrows from y2 = V, D, N into Ŷ3(V).]

Ex: Ŷ2(V) = (N, V)
29
$$\hat{Y}_M(V) = \left(\operatorname*{argmax}_{y_{1:M-1} \in \{\hat{Y}_{M-1}(T)\}_T} P(y_{1:M-1}, x_{1:M-1})\, P(y_M = V \mid y_{M-1})\, P(x_M \mid y_M = V)\, P(\mathrm{End} \mid y_M = V)\right) \oplus V$$

(the P(End | yM = V) term is optional)

[Trellis: columns Ŷ1(Z), Ŷ2(Z), Ŷ3(Z), …, ŶM(Z) for Z = V, D, N. Store each Ŷk(Z) & P(Ŷk(Z), x1:k).]

Ex: Ŷ2(V) = (N, V)
Ex: Ŷ3(V) = (D, N, V)
30
Viterbi Algorithm
• Solve:
$$\operatorname*{argmax}_{y} P(y \mid x) = \operatorname*{argmax}_{y} \frac{P(y, x)}{P(x)} = \operatorname*{argmax}_{y} P(y, x) = \operatorname*{argmax}_{y} P(x \mid y)\, P(y)$$
• For k = 1..M
  – Iteratively solve for each Ŷk(Z)
    • Z looping over every POS tag
• Predict best ŶM(Z)
• Also known as Maximum A Posteriori (MAP) inference
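A compact sketch of this dynamic program (illustrative code, not from the lecture; the dictionary-based tables, the viterbi_decode name, and the transition numbers taken from the reconstructed figure are all assumptions):

```python
def viterbi_decode(words, tags, trans, emit, use_end=False):
    """Return (probability, tag sequence) of the most likely y under P(x, y).

    trans[a][b]: P(y_next=b | y=a), with trans["start"][b] for the first tag
    and trans[a]["end"] for the optional End transition.
    emit[t][w]:  P(word=w | tag=t).
    """
    # best[t] = (probability, tag sequence) of the best prefix ending in tag t
    best = {t: (trans["start"][t] * emit[t][words[0]], [t]) for t in tags}
    for w in words[1:]:
        new_best = {}
        for t in tags:  # extend every stored prefix by tag t, keep the argmax
            p, seq = max(((best[s][0] * trans[s][t] * emit[t][w], best[s][1])
                          for s in tags), key=lambda v: v[0])
            new_best[t] = (p, seq + [t])
        best = new_best
    if use_end:
        best = {t: (p * trans[t]["end"], seq) for t, (p, seq) in best.items()}
    return max(best.values(), key=lambda v: v[0])

# Toy example from the slides (transition labels assumed from the figure):
trans = {"start": {"noun": 0.8, "verb": 0.2},
         "noun":  {"noun": 0.1, "verb": 0.8, "end": 0.1},
         "verb":  {"noun": 0.2, "verb": 0.1, "end": 0.7}}
emit  = {"noun": {"fish": 0.8, "sleep": 0.2},
         "verb": {"fish": 0.5, "sleep": 0.5}}
print(viterbi_decode(["fish", "sleep"], ["noun", "verb"], trans, emit, use_end=True))
# (0.1792, ['noun', 'verb'])
```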
31
Numerical Example
x= (Fish Sleep)
[Figure: the simple POS HMM from before: start→noun 0.8, start→verb 0.2; noun→noun 0.1, noun→verb 0.8, noun→end 0.1; verb→noun 0.2, verb→verb 0.1, verb→end 0.7]

Slides borrowed from Ralph Grishman
32
[HMM diagram and P(x|y) table as on the previous slides]

Viterbi trellis (initialization):
         0    1    2    3
start    1
verb     0
noun     0
end      0

Slides borrowed from Ralph Grishman
33
[HMM diagram and P(x|y) table as on the previous slides]

Token 1: fish
         0    1        2    3
start    1    0
verb     0    .2*.5
noun     0    .8*.8
end      0    0

Slides borrowed from Ralph Grishman
34
[HMM diagram and P(x|y) table as on the previous slides]

Token 1: fish
         0    1     2    3
start    1    0
verb     0    .1
noun     0    .64
end      0    0

Slides borrowed from Ralph Grishman
35
[HMM diagram and P(x|y) table as on the previous slides]

Token 2: sleep (if ‘fish’ is verb)
         0    1     2           3
start    1    0     0
verb     0    .1    .1*.1*.5
noun     0    .64   .1*.2*.2
end      0    0     -

Slides borrowed from Ralph Grishman
36
[HMM diagram and P(x|y) table as on the previous slides]

Token 2: sleep (if ‘fish’ is verb)
         0    1     2       3
start    1    0     0
verb     0    .1    .005
noun     0    .64   .004
end      0    0     -

Slides borrowed from Ralph Grishman
37
[HMM diagram and P(x|y) table as on the previous slides]

Token 2: sleep (if ‘fish’ is a noun)
         0    1     2                     3
start    1    0     0
verb     0    .1    .005 vs .64*.8*.5
noun     0    .64   .004 vs .64*.1*.2
end      0    0

Slides borrowed from Ralph Grishman
38
[HMM diagram and P(x|y) table as on the previous slides]

Token 2: sleep (if ‘fish’ is a noun)
         0    1     2                3
start    1    0     0
verb     0    .1    .005 vs .256
noun     0    .64   .004 vs .0128
end      0    0

Slides borrowed from Ralph Grishman
39
[HMM diagram and P(x|y) table as on the previous slides]

Token 2: sleep (take maximum, set back pointers)
         0    1     2                3
start    1    0     0
verb     0    .1    .005 vs .256
noun     0    .64   .004 vs .0128
end      0    0

Slides borrowed from Ralph Grishman
40
[HMM diagram and P(x|y) table as on the previous slides]

Token 2: sleep (take maximum, set back pointers)
         0    1     2       3
start    1    0     0
verb     0    .1    .256
noun     0    .64   .0128
end      0    0     -

Slides borrowed from Ralph Grishman
41
[HMM diagram and P(x|y) table as on the previous slides]

Token 3: end
         0    1     2       3
start    1    0     0       0
verb     0    .1    .256    -
noun     0    .64   .0128   -
end      0    0     -       .256*.7 vs .0128*.1

Slides borrowed from Ralph Grishman
42
[HMM diagram and P(x|y) table as on the previous slides]

Token 3: end (take maximum, set back pointers)
         0    1     2       3
start    1    0     0       0
verb     0    .1    .256    -
noun     0    .64   .0128   -
end      0    0     -       .256*.7 vs .0128*.1

Slides borrowed from Ralph Grishman
43
[HMM diagram and P(x|y) table as on the previous slides]

Decode:
fish = noun
sleep = verb
         0    1     2       3
start    1    0     0       0
verb     0    .1    .256    -
noun     0    .64   .0128   -
end      0    0     -       .256*.7

Slides borrowed from Ralph Grishman
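As a quick arithmetic check (added here, not on the slide), the winning entry in the End column equals the joint probability of the decoded sequence:

$$P(x = \text{“fish sleep”},\, y = (N, V)) = \underbrace{0.8}_{P(N \mid \mathrm{Start})} \cdot \underbrace{0.8}_{P(\mathrm{fish} \mid N)} \cdot \underbrace{0.8}_{P(V \mid N)} \cdot \underbrace{0.5}_{P(\mathrm{sleep} \mid V)} \cdot \underbrace{0.7}_{P(\mathrm{End} \mid V)} = 0.256 \times 0.7 = 0.1792$$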
44
Recap: Viterbi
• Predict most likely y given x:
– E.g., predict POS Tags given sentences
$$\operatorname*{argmax}_{y} P(y \mid x) = \operatorname*{argmax}_{y} \frac{P(y, x)}{P(x)} = \operatorname*{argmax}_{y} P(y, x) = \operatorname*{argmax}_{y} P(x \mid y)\, P(y)$$
• Solve using Dynamic Programming
45
Recap: Independent Classification
x=“I fish often”
POS Tags:
Det, Noun, Verb, Adj, Adv, Prep
• Treat each word independently
– Independent multiclass prediction per word
P(y|x)        x=“I”   x=“fish”   x=“often”
y=“Det”       0.0     0.0        0.0
y=“Noun”      1.0     0.75       0.0
y=“Verb”      0.0     0.25       0.0
y=“Adj”       0.0     0.0        0.4
y=“Adv”       0.0     0.0        0.6
y=“Prep”      0.0     0.0        0.0
Assume pronouns are nouns for simplicity.
Prediction: (N, N, Adv)
Correct: (N, V, Adv)
Mistake due to not
modeling multiple words.
46
Recap: Viterbi
• Models pairwise transitions between states
– Pairwise transitions between POS Tags
– “1st order” model
$$P(x, y) = P(\mathrm{End} \mid y_M)\prod_{i=1}^{M} P(y_i \mid y_{i-1})\prod_{i=1}^{M} P(x_i \mid y_i)$$

x=“I fish often”
Independent: (N, N, Adv)
HMM Viterbi: (N, V, Adv)
*Assuming we defined P(x,y) properly
47
Training HMMs
48
Supervised Training
• Given: $S = \{(x_i, y_i)\}_{i=1}^{N}$   (x: Word Sequence / Sentence; y: POS Tag Sequence)
• Goal: Estimate P(x,y) using S

$$P(x, y) = P(\mathrm{End} \mid y_M)\prod_{i=1}^{M} P(y_i \mid y_{i-1})\prod_{i=1}^{M} P(x_i \mid y_i)$$
• Maximum Likelihood!
49
Aside: Matrix Formulation
• Define Transition Matrix: A
– Aab = P(yi+1=a|yi=b) or –Log( P(yi+1=a|yi=b) )

P(ynext|y)        y=“Noun”   y=“Verb”
ynext=“Noun”      0.09       0.667
ynext=“Verb”      0.333      0.91

• Observation Matrix: O
– Owz = P(xi=w|yi=z) or –Log( P(xi=w|yi=z) )

P(x|y)        y=“Noun”   y=“Verb”
x=“fish”      0.8        0.5
x=“sleep”     0.2        0.5
50
Aside: Matrix Formulation
$$P(x, y) = P(\mathrm{End} \mid y_M)\prod_{i=1}^{M} P(y_i \mid y_{i-1})\prod_{i=1}^{M} P(x_i \mid y_i) = A_{\mathrm{End},\, y_M} \prod_{i=1}^{M} A_{y_i,\, y_{i-1}} \prod_{i=1}^{M} O_{x_i,\, y_i}$$

Log prob. formulation: each entry of Ã is defined as –log(A)
51
Maximum Likelihood
$$\operatorname*{argmax}_{A,O} \prod_{(x,y)\in S} P(x, y) = \operatorname*{argmax}_{A,O} \prod_{(x,y)\in S} P(\mathrm{End} \mid y_M)\prod_{i=1}^{M} P(y_i \mid y_{i-1})\prod_{i=1}^{M} P(x_i \mid y_i)$$

• Estimate each component separately:

$$A_{ab} = \frac{\sum_{j=1}^{N}\sum_{i=0}^{M_j} \mathbf{1}\!\left[y_j^{i+1} = a \wedge y_j^{i} = b\right]}{\sum_{j=1}^{N}\sum_{i=0}^{M_j} \mathbf{1}\!\left[y_j^{i} = b\right]} \qquad\qquad O_{wz} = \frac{\sum_{j=1}^{N}\sum_{i=1}^{M_j} \mathbf{1}\!\left[x_j^{i} = w \wedge y_j^{i} = z\right]}{\sum_{j=1}^{N}\sum_{i=1}^{M_j} \mathbf{1}\!\left[y_j^{i} = z\right]}$$
• Can also minimize neg. log likelihood
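A rough sketch of this counting estimator (illustrative, not lecture code; sentences are assumed to be lists of (word, tag) pairs, with hypothetical "start"/"end" pseudo-tags standing in for y0 and End):

```python
from collections import defaultdict

def estimate_hmm(tagged_sentences):
    """Maximum-likelihood A and O by counting, as in the formulas above.

    Note: indexed A[prev][next]; the slides' A_ab is P(y_next=a | y=b)."""
    trans_counts = defaultdict(lambda: defaultdict(float))
    emit_counts = defaultdict(lambda: defaultdict(float))
    for sent in tagged_sentences:
        prev = "start"
        for word, tag in sent:
            trans_counts[prev][tag] += 1    # count y_{i-1} -> y_i
            emit_counts[tag][word] += 1     # count y_i -> x_i
            prev = tag
        trans_counts[prev]["end"] += 1      # count y_M -> End
    # Normalize counts into conditional probabilities.
    A = {a: {b: c / sum(nxt.values()) for b, c in nxt.items()}
         for a, nxt in trans_counts.items()}
    O = {t: {w: c / sum(ws.values()) for w, c in ws.items()}
         for t, ws in emit_counts.items()}
    return A, O

A, O = estimate_hmm([[("fish", "noun"), ("sleep", "verb")]])
print(A["start"], O["noun"])  # {'noun': 1.0} {'fish': 1.0}
```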
52
Recap: Supervised Training
$$\operatorname*{argmax}_{A,O} \prod_{(x,y)\in S} P(x, y) = \operatorname*{argmax}_{A,O} \prod_{(x,y)\in S} P(\mathrm{End} \mid y_M)\prod_{i=1}^{M} P(y_i \mid y_{i-1})\prod_{i=1}^{M} P(x_i \mid y_i)$$
• Maximum Likelihood Training
– Counting statistics
– Super easy!
– Why?
• What about unsupervised case?
53
Conditional Independence Assumptions
$$\operatorname*{argmax}_{A,O} \prod_{(x,y)\in S} P(x, y) = \operatorname*{argmax}_{A,O} \prod_{(x,y)\in S} P(\mathrm{End} \mid y_M)\prod_{i=1}^{M} P(y_i \mid y_{i-1})\prod_{i=1}^{M} P(x_i \mid y_i)$$

• Everything decomposes to products of pairs
  – I.e., P(yi+1=a|yi=b) doesn't depend on anything else
• Can just estimate frequencies:
  – How often yi+1=a when yi=b over the training set
  – Note that P(yi+1=a|yi=b) is a common model across all locations of all sequences.
• Avoids directly modeling word/word pairings

# Parameters:
  Transitions A: #Tags^2
  Observations O: #Words x #Tags
  (#Tags = 10s, #Words = 10000s)
55
Unsupervised Training
• What about if no y’s?
– Just a training set of sentences
$S = \{x_i\}_{i=1}^{N}$   (Word Sequence / Sentence)
• Still want to estimate P(x,y)
– How?
– Why?
56
Why Unsupervised Training?
• Supervised Data hard to acquire
– Require annotating POS tags
• Unsupervised Data plentiful
– Just grab some text!
• Might just work for POS Tagging!
– Learn y’s that correspond to POS Tags
• Can be used for other tasks
– Detect outlier sentences (sentences with low prob.)
– Sampling new sentences.
58
EM Algorithm (Baum-Welch)
• If we had y’s → max likelihood.
• If we had (A,O) → predict y’s
  Chicken vs Egg!
1. Initialize A and O arbitrarily
2. Predict prob. of y’s for each training x   (Expectation Step)
3. Use y’s to estimate new (A,O)   (Maximization Step)
4. Repeat back to Step 2 until convergence
http://en.wikipedia.org/wiki/Baum%E2%80%93Welch_algorithm
59
Expectation Step
• Given (A,O)
• For training x=(x1,…,xM)
– Predict P(yi) for each y=(y1,…yM)
              x1    x2    …    xL
P(yi=Noun)    0.5   0.4   …    0.05
P(yi=Det)     0.4   0.6   …    0.25
P(yi=Verb)    0.1   0.0   …    0.7
– Encodes current model’s beliefs about y
– “Marginal Distribution” of each yi
60
Recall: Matrix Formulation
• Define Transition Matrix: A
– Aab = P(yi+1=a|yi=b) or –Log( P(yi+1=a|yi=b) )
P(ynext|y)        y=“Noun”   y=“Verb”
ynext=“Noun”      0.09       0.667
ynext=“Verb”      0.333      0.91

• Observation Matrix: O
– Owz = P(xi=w|yi=z) or –Log( P(xi=w|yi=z) )

P(x|y)        y=“Noun”   y=“Verb”
x=“fish”      0.8        0.5
x=“sleep”     0.2        0.5
61
Maximization Step
• Max. Likelihood over Marginal Distribution
Supervised:
$$A_{ab} = \frac{\sum_{j=1}^{N}\sum_{i=0}^{M_j} \mathbf{1}\!\left[y_j^{i+1} = a \wedge y_j^{i} = b\right]}{\sum_{j=1}^{N}\sum_{i=0}^{M_j} \mathbf{1}\!\left[y_j^{i} = b\right]} \qquad O_{wz} = \frac{\sum_{j=1}^{N}\sum_{i=1}^{M_j} \mathbf{1}\!\left[x_j^{i} = w \wedge y_j^{i} = z\right]}{\sum_{j=1}^{N}\sum_{i=1}^{M_j} \mathbf{1}\!\left[y_j^{i} = z\right]}$$

Unsupervised (using the marginals):
$$A_{ab} = \frac{\sum_{j=1}^{N}\sum_{i=0}^{M_j} P(y_j^{i} = b)\, P(y_j^{i+1} = a)}{\sum_{j=1}^{N}\sum_{i=0}^{M_j} P(y_j^{i} = b)} \qquad O_{wz} = \frac{\sum_{j=1}^{N}\sum_{i=1}^{M_j} \mathbf{1}\!\left[x_j^{i} = w\right] P(y_j^{i} = z)}{\sum_{j=1}^{N}\sum_{i=1}^{M_j} P(y_j^{i} = z)}$$
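A sketch of the simplified transition update above (illustrative; it uses products of per-position marginals exactly as the slide writes them, not the pairwise posteriors of the full Baum-Welch update, and the marginals data structure is an assumption):

```python
def m_step_transitions(marginals, tags):
    """marginals[j][i][t] ~ P(y_j^i = t) for sentence j, position i, tag t.

    Returns A[a][b] ~ P(y_next = a | y = b) using the simplified update above.
    (Denominator restricted to positions that have a successor, so each
    column of A normalizes to 1; a sketch-level simplification.)
    """
    num = {a: {b: 0.0 for b in tags} for a in tags}
    den = {b: 0.0 for b in tags}
    for sent in marginals:
        for i in range(len(sent) - 1):
            for b in tags:
                den[b] += sent[i][b]
                for a in tags:
                    num[a][b] += sent[i][b] * sent[i + 1][a]
    return {a: {b: (num[a][b] / den[b] if den[b] > 0 else 0.0) for b in tags}
            for a in tags}
```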
62
Computing Marginals
(Forward-Backward Algorithm)
• Solving E-Step, requires compute marginals
              x1    x2    …    xL
P(yi=Noun)    0.5   0.4   …    0.05
P(yi=Det)     0.4   0.6   …    0.25
P(yi=Verb)    0.1   0.0   …    0.7
• Can solve using Dynamic Programming!
– Similar to Viterbi
63
Notation
Probability of observing prefix x1:i and having the i-th state be yi=Z:
$$\alpha_z(i) = P(x_{1:i},\, y_i = Z \mid A, O)$$

Probability of observing suffix xi+1:M given the i-th state being yi=Z:
$$\beta_z(i) = P(x_{i+1:M} \mid y_i = Z,\, A, O)$$

Computing marginals = combining the two terms:
$$P(y_i = z \mid x) = \frac{\alpha_z(i)\, \beta_z(i)}{\sum_{z'} \alpha_{z'}(i)\, \beta_{z'}(i)}$$
http://en.wikipedia.org/wiki/Baum%E2%80%93Welch_algorithm
64
Forward (sub-)Algorithm
• Solve for every:
$$\alpha_z(i) = P(x_{1:i},\, y_i = Z \mid A, O)$$
• Naively: Exponential Time!
$$\alpha_z(i) = P(x_{1:i},\, y_i = Z \mid A, O) = \sum_{y_{1:i-1}} P(x_{1:i},\, y_i = Z,\, y_{1:i-1} \mid A, O)$$
• Can be computed recursively (like Viterbi)
$$\alpha_z(1) = P(y_1 = z \mid y_0)\, P(x_1 \mid y_1 = z) = O_{x_1, z}\, A_{z, \mathrm{start}}$$
$$\alpha_z(i+1) = O_{x_{i+1}, z} \sum_{j=1}^{L} \alpha_j(i)\, A_{z, j}$$

Viterbi effectively replaces the sum with a max.
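A minimal sketch of this forward recursion (illustrative; it assumes x is a sequence of integer word indices and that A, O, and a start vector follow the matrix conventions from the earlier matrix-formulation slide):

```python
import numpy as np

def forward(x, A, O, start):
    """alpha[i, z] = P(x_1..x_{i+1}, y_{i+1} = z); 0-indexed positions.

    A[a, b] = P(y_next = a | y = b), O[w, z] = P(x = w | y = z),
    start[z] = P(y_1 = z | Start).  (Index conventions are assumptions.)
    """
    M, L = len(x), len(start)
    alpha = np.zeros((M, L))
    alpha[0] = O[x[0]] * start                 # alpha_z(1) = O_{x1,z} * A_{z,start}
    for i in range(1, M):
        alpha[i] = O[x[i]] * (A @ alpha[i-1])  # alpha_z(i+1) = O_{x_{i+1},z} * sum_j alpha_j(i) A_{z,j}
    return alpha
```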
65
Backward (sub-)Algorithm
• Solve for every:
$$\beta_z(i) = P(x_{i+1:M} \mid y_i = Z,\, A, O)$$
• Naively: Exponential Time!
$$\beta_z(i) = P(x_{i+1:M} \mid y_i = Z,\, A, O) = \sum_{y_{i+1:M}} P(x_{i+1:M},\, y_{i+1:M} \mid y_i = Z,\, A, O)$$
• Can be computed recursively (like Viterbi)
$$\beta_z(M) = 1$$
$$\beta_z(i) = \sum_{j=1}^{L} \beta_j(i+1)\, A_{j, z}\, O_{x_{i+1}, j}$$
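A matching sketch of the backward recursion, under the same assumed conventions as the forward sketch:

```python
import numpy as np

def backward(x, A, O):
    """beta[i, z] = P(x_{i+2}..x_M | y_{i+1} = z); 0-indexed positions."""
    M, L = len(x), A.shape[0]
    beta = np.zeros((M, L))
    beta[M-1] = 1.0                              # beta_z(M) = 1
    for i in range(M-2, -1, -1):
        beta[i] = A.T @ (beta[i+1] * O[x[i+1]])  # beta_z(i) = sum_j beta_j(i+1) A_{j,z} O_{x_{i+1},j}
    return beta
```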
66
Forward-Backward Algorithm
• Runs Forward:
$$\alpha_z(i) = P(x_{1:i},\, y_i = Z \mid A, O)$$
• Runs Backward:
$$\beta_z(i) = P(x_{i+1:M} \mid y_i = Z,\, A, O)$$
• For each training x = (x1,…,xM)
  – Computes each P(yi) for y = (y1,…,yM):
$$P(y_i = z \mid x) = \frac{\alpha_z(i)\, \beta_z(i)}{\sum_{z'} \alpha_{z'}(i)\, \beta_{z'}(i)}$$
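Combining the two sketches gives the per-position marginals used in the E-step (again illustrative, reusing the hypothetical forward and backward functions above):

```python
def marginals(x, A, O, start):
    """P(y_i = z | x) for every position i and tag z, via forward-backward."""
    alpha, beta = forward(x, A, O, start), backward(x, A, O)
    unnorm = alpha * beta                        # alpha_z(i) * beta_z(i)
    return unnorm / unnorm.sum(axis=1, keepdims=True)
```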
67
Recap: Unsupervised Training
• Train using only word sequences: $S = \{x_i\}_{i=1}^{N}$   (Word Sequence / Sentence)
• y’s are “hidden states”
– All pairwise transitions are through y’s
– Hence hidden Markov Model
• Train using EM algorithm
– Converge to local optimum
68
Initialization
• How to choose #hidden states?
– By hand
– Cross Validation
• P(x) on validation data
• Can compute P(x) via forward algorithm:
$$P(x) = \sum_{y} P(x, y) = \sum_{z} \alpha_z(M)\, P(\mathrm{End} \mid y_M = z)$$
69
Recap: Sequence Prediction & HMMs
• Models pairwise dependences in sequences
x=“I fish often”
POS Tags:
Det, Noun, Verb, Adj, Adv, Prep
Independent: (N, N, Adv)
HMM Viterbi: (N, V, Adv)
• Compact: only model pairwise between y’s
• Main Limitation: Lots of independence assumptions
– Poor predictive accuracy
70
Next Lecture
• Conditional Random Fields
– Removes many independence assumptions
– More accurate in practice
– Can only be trained in supervised setting
• Recitation tomorrow:
– Recap of Viterbi and Forward/Backward
71