Machine Learning & Data Mining CS/CNS/EE 155
Lecture 5: Sequence Prediction & HMMs

Announcements
• Homework 1 due today @5pm
  – Via Moodle
• Homework 2 to be released soon
  – Due Feb 3rd via Moodle (2 weeks)
  – Mostly coding
  – Don't start late!
• Use Moodle Forums for Q&A

Sequence Prediction (POS Tagging)
• x = "Fish Sleep"                      y = (N, V)
• x = "The Dog Ate My Homework"         y = (D, N, V, D, N)
• x = "The Fox Jumped Over The Fence"   y = (D, N, V, P, D, N)

Challenges
• Multivariate output
  – Makes multiple predictions simultaneously
• Variable-length input/output
  – Sentence lengths are not fixed

Multivariate Outputs
• x = "Fish Sleep", y = (N, V)
• POS tags: Det, Noun, Verb, Adj, Adv, Prep
• Multiclass prediction: replicate weights per class, score all classes, predict via the largest score
• How many classes?

Multiclass Prediction
• x = "Fish Sleep", y = (N, V)
• POS tags: Det, Noun, Verb, Adj, Adv, Prep   (L = 6)
• Treat every possible length-M sequence as a different class:
  – (D, D), (D, N), (D, V), (D, Adj), (D, Adv), (D, Pr)
  – (N, D), (N, N), (N, V), (N, Adj), (N, Adv), …
• L^M classes — exponential explosion in #classes!
  – Length 2: 6^2 = 36
• Not tractable for sequence prediction.

Why is Naïve Multiclass Intractable?
x = "I fish often"   POS tags: Det, Noun, Verb, Adj, Adv, Prep
(Assume pronouns are nouns for simplicity.)
• Treats every combination as a different class:
  – (D, D, D), (D, D, N), (D, D, V), (D, D, Adj), (D, D, Adv), (D, D, Pr)
  – (D, N, D), (D, N, N), (D, N, V), (D, N, Adj), (D, N, Adv), (D, N, Pr)
  – (D, V, D), (D, V, N), (D, V, V), (D, V, Adj), (D, V, Adv), (D, V, Pr)
  – …
  – (N, D, D), (N, D, N), (N, D, V), (N, D, Adj), (N, D, Adv), (N, D, Pr)
  – (N, N, D), (N, N, N), (N, N, V), (N, N, Adj), (N, N, Adv), (N, N, Pr)
  – …
• Exponentially large representation!
  – Learn (w, b) for each combination
  – Exponential time to consider every class
  – Exponential storage

Independent Classification
x = "I fish often"   POS tags: Det, Noun, Verb, Adj, Adv, Prep
• Treat each word independently (assumption)
  – Independent multiclass prediction per word   (#classes = #POS tags, 6 in our example)
  – Predict for x = "I" independently
  – Predict for x = "fish" independently
  – Predict for x = "often" independently
  – Concatenate predictions
• Solvable using standard multiclass prediction.

Independent Classification
x = "I fish often"
• Treat each word independently — independent multiclass prediction per word:

  P(y|x)      x="I"   x="fish"   x="often"
  y="Det"      0.0      0.0        0.0
  y="Noun"     1.0      0.75       0.0
  y="Verb"     0.0      0.25       0.0
  y="Adj"      0.0      0.0        0.4
  y="Adv"      0.0      0.0        0.6
  y="Prep"     0.0      0.0        0.0

• Prediction: (N, N, Adv)    Correct: (N, V, Adv)
• Why the mistake?

Context Between Words
x = "I fish often"
• Independent predictions ignore word pairs
  – In isolation, "fish" is more likely to be a Noun
  – But conditioned on following a (pro)noun, "fish" is more likely to be a Verb!
• "1st order" dependence: model all pairs (see the sketch below)
  – 2nd order considers all triplets
  – Arbitrary order = exponential size (naïve multiclass)
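To make the size argument concrete, here is a minimal sketch (not from the lecture; the per-word scores, tag set, and function names are illustrative assumptions) contrasting per-word independent classification with naïve enumeration of all L^M tag sequences:

```python
from itertools import product

TAGS = ["Det", "Noun", "Verb", "Adj", "Adv", "Prep"]

# Hypothetical per-word class scores, standing in for a trained multiclass model.
SCORES = {
    "I":     {"Noun": 1.0},
    "fish":  {"Noun": 0.75, "Verb": 0.25},
    "often": {"Adv": 0.6, "Adj": 0.4},
}

def independent_predict(words):
    """One multiclass prediction per word: only M * L scores in total."""
    return [max(TAGS, key=lambda t: SCORES[w].get(t, 0.0)) for w in words]

def naive_joint_predict(words):
    """Treat every length-M tag sequence as its own class: L**M candidates."""
    best_seq, best_score = None, float("-inf")
    for seq in product(TAGS, repeat=len(words)):          # 6**3 = 216 sequences here
        score = sum(SCORES[w].get(t, 0.0) for w, t in zip(words, seq))
        if score > best_score:
            best_seq, best_score = list(seq), score
    return best_seq

words = ["I", "fish", "often"]
print(independent_predict(words))   # ['Noun', 'Noun', 'Adv'] -- (N, N, Adv), as on the slide
print(len(TAGS) ** len(words))      # 216 classes for a 3-word sentence; grows as L**M
```

With purely per-word scores the two predictors agree; the point of the next slides is that capturing context requires scoring tag pairs, which the HMM does without enumerating all L^M sequences.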
1st Order Hidden Markov Model
• x = (x_1, x_2, x_3, x_4, …, x_M)   (sequence of words)
• y = (y_1, y_2, y_3, y_4, …, y_M)   (sequence of POS tags)
• P(x_i | y_i)      Probability of state y_i generating x_i
• P(y_{i+1} | y_i)  Probability of state y_i transitioning to y_{i+1}
• P(y_1 | y_0)      y_0 is defined to be the Start state
• P(End | y_M)      Prior probability of y_M being the final state — not always used

1st Order Hidden Markov Model
• P(y_{i+1} | y_i) models all state–state pairs (all POS tag–tag pairs)
  – Additional complexity of (#POS tags)^2
• P(x_i | y_i) models all state–observation pairs (all tag–word pairs)
  – Same complexity as independent multiclass

1st Order Hidden Markov Model: "Joint Distribution"
  P(x, y) = P(End | y_M) ∏_{i=1}^{M} P(y_i | y_{i-1}) ∏_{i=1}^{M} P(x_i | y_i)
  (The P(End | y_M) term is optional.)

1st Order Hidden Markov Model: "Conditional Distribution of x given y"
  P(x | y) = ∏_{i=1}^{M} P(x_i | y_i)
• Given a POS tag sequence y, can compute each P(x_i | y_i) independently!
  (x_i is conditionally independent given y_i.)

P( word | state/tag )
• Two-word language: "fish" and "sleep"
• Two-tag language: "Noun" and "Verb"
(Slides borrowed from Ralph Grishman.)

  P(x|y)      y="Noun"   y="Verb"
  x="fish"      0.8        0.5
  x="sleep"     0.2        0.5

• Given tag sequence y:
  P("fish sleep"  | (N, V)) = 0.8 * 0.5
  P("fish fish"   | (N, V)) = 0.8 * 0.5
  P("sleep fish"  | (V, V)) = 0.5 * 0.5
  P("sleep sleep" | (N, N)) = 0.2 * 0.2

Sampling
• HMMs are "generative" models
  – Model the joint distribution P(x, y)
  – Can generate samples from this distribution
  – First consider the conditional distribution P(x | y)
• Given tag sequence y = (N, V), sample each word independently:
  – Sample P(x_1 | N)   (0.8 fish, 0.2 sleep)
  – Sample P(x_2 | V)   (0.5 fish, 0.5 sleep)
• What about sampling from P(x, y)?

Forward Sampling of P(y, x)
  P(x, y) = P(End | y_M) ∏_{i=1}^{M} P(y_i | y_{i-1}) ∏_{i=1}^{M} P(x_i | y_i)
A simple POS HMM [diagram: start→noun 0.8, start→verb 0.2; noun→noun 0.1, noun→verb 0.8, noun→end 0.1; verb→noun 0.2, verb→verb 0.1, verb→end 0.7], emissions as in the P(x|y) table above.
• Initialize y_0 = Start, i = 0.
  1. i = i + 1
  2. Sample y_i from P(y_i | y_{i-1})
  3. If y_i == End: quit
  4. Sample x_i from P(x_i | y_i)
  5. Go to Step 1
• Exploits conditional independence; requires P(End | y_i).

Forward Sampling of P(y, x | M)   (fixed length M)
  P(x, y | M) = ∏_{i=1}^{M} P(y_i | y_{i-1}) ∏_{i=1}^{M} P(x_i | y_i)
A simple POS HMM without an End state [diagram: start→noun 0.8, start→verb 0.2; transition probabilities 0.333, 0.667, 0.91, 0.09 between noun and verb], emissions as above.
• Initialize y_0 = Start, i = 0.
  1. i = i + 1
  2. If i > M: quit
  3. Sample y_i from P(y_i | y_{i-1})
  4. Sample x_i from P(x_i | y_i)
  5. Go to Step 1
• Exploits conditional independence; assumes no P(End | y_i). A sampling sketch follows below.
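A minimal sketch of the forward-sampling procedure above, using the toy transition/emission numbers from the diagram (the dictionaries and function names are illustrative, not a prescribed interface):

```python
import random

# Toy parameters from the slides: P(next state | state) and P(word | state).
TRANS = {
    "start": {"noun": 0.8, "verb": 0.2},
    "noun":  {"noun": 0.1, "verb": 0.8, "end": 0.1},
    "verb":  {"noun": 0.2, "verb": 0.1, "end": 0.7},
}
EMIT = {
    "noun": {"fish": 0.8, "sleep": 0.2},
    "verb": {"fish": 0.5, "sleep": 0.5},
}

def sample_from(dist):
    """Draw one key from a {value: probability} dictionary."""
    r, total = random.random(), 0.0
    for value, p in dist.items():
        total += p
        if r < total:
            return value
    return value  # guard against floating-point round-off

def forward_sample():
    """Forward sampling of P(y, x): walk the chain until the End state is drawn."""
    y, x, state = [], [], "start"
    while True:
        state = sample_from(TRANS[state])      # sample y_i ~ P(y_i | y_{i-1})
        if state == "end":                     # quit once End is reached
            return y, x
        y.append(state)
        x.append(sample_from(EMIT[state]))     # sample x_i ~ P(x_i | y_i)

random.seed(0)
tags, words = forward_sample()
print(tags, words)   # one sampled tag sequence and word sequence
```

For the fixed-length variant, replace the End check with a counter that stops after M tokens have been emitted.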
1st Order Hidden Markov Model: "Memory-less Model"
  P(x_{k+1:M}, y_{k+1:M} | x_{1:k}, y_{1:k}) = P(x_{k+1:M}, y_{k+1:M} | y_k)
• Only needs y_k to model the rest of the sequence.
(Component distributions as before: P(x_i | y_i), P(y_{i+1} | y_i), P(y_1 | y_0) with y_0 = Start, and the optional P(End | y_M).)

Viterbi Algorithm

Most Common Prediction Problem
• Given an input sentence, predict the POS tag sequence:
  argmax_y P(y | x)
• Naïve approach:
  – Try all possible y's
  – Choose the one with highest probability
  – Exponential time: L^M possible y's (a brute-force sketch follows below)

Bayes's Rule
  argmax_y P(y | x) = argmax_y P(y, x) / P(x)
                    = argmax_y P(y, x)
                    = argmax_y P(x | y) P(y)
  where P(x | y) = ∏_{i=1}^{M} P(x_i | y_i)   and   P(y) = P(End | y_M) ∏_{i=1}^{M} P(y_i | y_{i-1}).

Exploiting the Memory-less Property
  argmax_y P(y, x) = argmax_y ∏_{i=1}^{M} P(y_i | y_{i-1}) ∏_{i=1}^{M} P(x_i | y_i)
                   = argmax_{y_{1:M-1}} argmax_{y_M} ∏_{i=1}^{M} P(y_i | y_{i-1}) ∏_{i=1}^{M} P(x_i | y_i)
                   = argmax_{y_{1:M-1}} argmax_{y_M} P(y_M | y_{M-1}) P(x_M | y_M) P(y_{1:M-1}, x_{1:M-1})
  using P(x_{1:k} | y_{1:k}) = ∏_{i=1}^{k} P(x_i | y_i)   and   P(y_{1:k}) = ∏_{i=1}^{k} P(y_i | y_{i-1}).
• Exploit the memory-less property: the choice of y_M only depends on y_{1:M-1} via P(y_M | y_{M-1})!

Dynamic Programming
• Input: x = (x_1, x_2, x_3, …, x_M)
• Computed: the best length-k prefix ending in each tag, e.g.
  Ŷ^k(V) = ( argmax_{y_{1:k-1}} P(y_{1:k-1} ⊕ V, x_{1:k}) ) ⊕ V
  Ŷ^k(N) = ( argmax_{y_{1:k-1}} P(y_{1:k-1} ⊕ N, x_{1:k}) ) ⊕ N
  (⊕ denotes sequence concatenation)
• Claim:
  Ŷ^{k+1}(V) = ( argmax_{y_{1:k} ∈ {Ŷ^k(T)}_T} P(y_{1:k} ⊕ V, x_{1:k+1}) ) ⊕ V
             = ( argmax_{y_{1:k} ∈ {Ŷ^k(T)}_T} P(y_{1:k}, x_{1:k}) P(y_{k+1}=V | y_k) P(x_{k+1} | y_{k+1}=V) ) ⊕ V
  – P(y_{1:k}, x_{1:k}) is pre-computed — a recursive definition!

Solve:
  Ŷ^2(V) = ( argmax_{y_1 ∈ {Ŷ^1(T)}_T} P(y_1, x_1) P(y_2=V | y_1) P(x_2 | y_2=V) ) ⊕ V
• Store each Ŷ^1(Z) and P(Ŷ^1(Z), x_1); Ŷ^1(Z) is just Z.
• For each y_1 ∈ {V, D, N}, extend the stored prefixes to obtain Ŷ^2(V), Ŷ^2(D), Ŷ^2(N).
  Ex: Ŷ^2(V) = (N, V)

Solve:
  Ŷ^3(V) = ( argmax_{y_{1:2} ∈ {Ŷ^2(T)}_T} P(y_{1:2}, x_{1:2}) P(y_3=V | y_2) P(x_3 | y_3=V) ) ⊕ V
• Store each Ŷ^2(Z) and P(Ŷ^2(Z), x_{1:2}).
• Claim: only need to check the stored solutions Ŷ^2(Z), Z = V, D, N.
• Proof sketch (depends on the 1st-order property):
  – Suppose Ŷ^3(V) = (V, V, V); given Ŷ^2(V) = (N, V), prove that (N, V, V) has higher probability.
  – The probabilities of (V, V, V) and (N, V, V) differ in only 3 terms: P(y_1 | y_0), P(x_1 | y_1), P(y_2 | y_1).
  – None of these depend on y_3, so whichever prefix was better when computing Ŷ^2(V) stays better after extending to y_3 = V.
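Before the recursion pays off, here is a tiny brute-force reference for the naïve argmax above (exponential in M, but handy for checking a dynamic-programming implementation on short inputs); the toy dictionaries reuse the numbers from the sampling sketch and are illustrative only:

```python
from itertools import product

# Toy HMM from the slides (transition and emission probabilities).
TAGS  = ["noun", "verb"]
TRANS = {"start": {"noun": 0.8, "verb": 0.2},
         "noun":  {"noun": 0.1, "verb": 0.8, "end": 0.1},
         "verb":  {"noun": 0.2, "verb": 0.1, "end": 0.7}}
EMIT  = {"noun": {"fish": 0.8, "sleep": 0.2},
         "verb": {"fish": 0.5, "sleep": 0.5}}

def joint_prob(x, y, use_end=True):
    """P(x, y) = P(End|y_M) * prod_i P(y_i|y_{i-1}) * prod_i P(x_i|y_i)."""
    p, prev = 1.0, "start"
    for word, tag in zip(x, y):
        p *= TRANS[prev][tag] * EMIT[tag][word]
        prev = tag
    return p * (TRANS[prev]["end"] if use_end else 1.0)

def brute_force_argmax(x):
    """Try all L**M tag sequences -- exponential, but fine for tiny M."""
    return max(product(TAGS, repeat=len(x)), key=lambda y: joint_prob(x, y))

print(brute_force_argmax(("fish", "sleep")))   # ('noun', 'verb')
```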
Final step (k = M), including the optional End term:
  Ŷ^M(V) = ( argmax_{y_{1:M-1} ∈ {Ŷ^{M-1}(T)}_T} P(y_{1:M-1}, x_{1:M-1}) P(y_M=V | y_{M-1}) P(x_M | y_M=V) P(End | y_M=V) ) ⊕ V
• Store each Ŷ^k(Z) and P(Ŷ^k(Z), x_{1:k}) along the way.
  Ex: Ŷ^2(V) = (N, V), Ŷ^3(V) = (D, N, V), …, up to Ŷ^M(V), Ŷ^M(D), Ŷ^M(N).

Viterbi Algorithm
• Solve:
  argmax_y P(y | x) = argmax_y P(y, x) / P(x) = argmax_y P(y, x) = argmax_y P(x | y) P(y)
• For k = 1..M:
  – Iteratively solve for each Ŷ^k(Z), with Z looping over every POS tag.
• Predict the best Ŷ^M(Z) (maximizing over Z).
• Also known as Maximum A Posteriori (MAP) inference. A code sketch follows below.
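A minimal Viterbi sketch over the same toy dictionaries as the brute-force check above (the data structures are illustrative, not the homework's interface); it tracks, for each tag, the best prefix probability and a back pointer:

```python
# Toy HMM parameters as before.
TAGS  = ["noun", "verb"]
TRANS = {"start": {"noun": 0.8, "verb": 0.2},
         "noun":  {"noun": 0.1, "verb": 0.8, "end": 0.1},
         "verb":  {"noun": 0.2, "verb": 0.1, "end": 0.7}}
EMIT  = {"noun": {"fish": 0.8, "sleep": 0.2},
         "verb": {"fish": 0.5, "sleep": 0.5}}

def viterbi(x):
    """Return the most likely tag sequence argmax_y P(y, x) for word sequence x."""
    # probs[k][z] = probability of the best length-(k+1) prefix ending in tag z;
    # back[k][z]  = previous tag on that best prefix.
    probs = [{z: TRANS["start"][z] * EMIT[z][x[0]] for z in TAGS}]
    back  = [{z: "start" for z in TAGS}]
    for k in range(1, len(x)):
        probs.append({})
        back.append({})
        for z in TAGS:
            best_prev = max(TAGS, key=lambda zp: probs[k - 1][zp] * TRANS[zp][z])
            probs[k][z] = probs[k - 1][best_prev] * TRANS[best_prev][z] * EMIT[z][x[k]]
            back[k][z] = best_prev
    # Termination: fold in the optional P(End | y_M) term, then follow back pointers.
    last = max(TAGS, key=lambda z: probs[-1][z] * TRANS[z]["end"])
    y = [last]
    for k in range(len(x) - 1, 0, -1):
        y.append(back[k][y[-1]])
    return list(reversed(y))

print(viterbi(("fish", "sleep")))   # ['noun', 'verb'] -- matches the brute-force result
```

Running it on x = (fish, sleep) reproduces the trellis values in the numerical example that follows: 0.64 / 0.1 after "fish", 0.256 / 0.0128 after "sleep", and 0.256 * 0.7 = 0.1792 at the End step.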
Numerical Example
x = (fish, sleep)
[HMM diagram: start→noun 0.8, start→verb 0.2; noun→noun 0.1, noun→verb 0.8, noun→end 0.1; verb→noun 0.2, verb→verb 0.1, verb→end 0.7]

  P(x|y)      y="Noun"   y="Verb"
  x="fish"      0.8        0.5
  x="sleep"     0.2        0.5

(Numerical example slides borrowed from Ralph Grishman.)

Trellis columns: 0 = Start, 1 = "fish", 2 = "sleep", 3 = End.
• Initialization (column 0): start = 1; verb = noun = end = 0.
• Token 1: "fish"
  – verb: .2 * .5 = .1
  – noun: .8 * .8 = .64
• Token 2: "sleep"
  – If "fish" is a verb:  verb: .1 * .1 * .5 = .005    noun: .1 * .2 * .2 = .004
  – If "fish" is a noun:  verb: .64 * .8 * .5 = .256   noun: .64 * .1 * .2 = .0128
  – Take the maximum and set back pointers: verb = .256, noun = .0128 (both point back to noun).
• Token 3: End
  – end: max(.256 * .7, .0128 * .1) = .1792
  – Take the maximum and set the back pointer (to verb).

Final trellis:
           0      1      2       3
  start    1      0      0       -
  verb     0      .1     .256    -
  noun     0      .64    .0128   -
  end      0      0      -       .256 * .7 = .1792

Decode (follow back pointers): fish = noun, sleep = verb.

Recap: Viterbi
• Predict the most likely y given x — e.g., predict POS tags given a sentence:
  argmax_y P(y | x) = argmax_y P(y, x) / P(x) = argmax_y P(y, x) = argmax_y P(x | y) P(y)
• Solve using dynamic programming.

Recap: Independent Classification
x = "I fish often"   POS tags: Det, Noun, Verb, Adj, Adv, Prep
(Assume pronouns are nouns for simplicity.)
• Treat each word independently — independent multiclass prediction per word:

  P(y|x)      x="I"   x="fish"   x="often"
  y="Det"      0.0      0.0        0.0
  y="Noun"     1.0      0.75       0.0
  y="Verb"     0.0      0.25       0.0
  y="Adj"      0.0      0.0        0.4
  y="Adv"      0.0      0.0        0.6
  y="Prep"     0.0      0.0        0.0

• Prediction: (N, N, Adv)    Correct: (N, V, Adv)
• Mistake due to not modeling multiple words.

Recap: Viterbi
• Models pairwise transitions between states
  – Pairwise transitions between POS tags
  – "1st order" model
  P(x, y) = P(End | y_M) ∏_{i=1}^{M} P(y_i | y_{i-1}) ∏_{i=1}^{M} P(x_i | y_i)
• x = "I fish often":   Independent: (N, N, Adv)    HMM Viterbi: (N, V, Adv)
  (*Assuming we defined P(x, y) properly.)

Training HMMs

Supervised Training
• Given: S = {(x_i, y_i)}_{i=1}^{N}   (word sequence / sentence, POS tag sequence)
• Goal: estimate P(x, y) using S, where
  P(x, y) = P(End | y_M) ∏_{i=1}^{M} P(y_i | y_{i-1}) ∏_{i=1}^{M} P(x_i | y_i)
• Maximum likelihood!

Aside: Matrix Formulation
• Define the transition matrix A:
  A_ab = P(y_{i+1}=a | y_i=b), or –log P(y_{i+1}=a | y_i=b)

  P(ynext|y)       y="Noun"   y="Verb"
  ynext="Noun"       0.09       0.667
  ynext="Verb"       0.333      0.91

• Define the observation matrix O:
  O_wz = P(x_i=w | y_i=z), or –log P(x_i=w | y_i=z)

  P(x|y)      y="Noun"   y="Verb"
  x="fish"      0.8        0.5
  x="sleep"     0.2        0.5

Aside: Matrix Formulation
  P(x, y) = P(End | y_M) ∏_{i=1}^{M} P(y_i | y_{i-1}) ∏_{i=1}^{M} P(x_i | y_i)
          = A_{End, y_M} ∏_{i=1}^{M} A_{y_i, y_{i-1}} ∏_{i=1}^{M} O_{x_i, y_i}
• Log-probability formulation: each entry of Ã is defined as –log(A).

Maximum Likelihood
  argmax_{A,O} ∏_{(x,y)∈S} P(x, y) = argmax_{A,O} ∏_{(x,y)∈S} P(End | y_M) ∏_{i=1}^{M} P(y_i | y_{i-1}) ∏_{i=1}^{M} P(x_i | y_i)
• Estimate each component separately:
  A_ab = ( Σ_{j=1}^{N} Σ_{i=0}^{M_j} 1[ y_j^{i+1}=a ∧ y_j^i=b ] ) / ( Σ_{j=1}^{N} Σ_{i=0}^{M_j} 1[ y_j^i=b ] )
  O_wz = ( Σ_{j=1}^{N} Σ_{i=1}^{M_j} 1[ x_j^i=w ∧ y_j^i=z ] ) / ( Σ_{j=1}^{N} Σ_{i=1}^{M_j} 1[ y_j^i=z ] )
• Can also minimize the negative log likelihood. (A counting sketch follows below.)

Recap: Supervised Training
  argmax_{A,O} ∏_{(x,y)∈S} P(x, y) = argmax_{A,O} ∏_{(x,y)∈S} P(End | y_M) ∏_{i=1}^{M} P(y_i | y_{i-1}) ∏_{i=1}^{M} P(x_i | y_i)
• Maximum likelihood training
  – Counting statistics
  – Super easy!
  – Why?
• What about the unsupervised case?
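A minimal sketch of the counting estimator above on a tiny labeled corpus (the corpus and the smoothing-free counting are illustrative; practical code would typically add smoothing for unseen words):

```python
from collections import defaultdict

# Tiny labeled corpus S = {(x_j, y_j)}: word sequences with their tag sequences.
S = [
    (("fish", "sleep"), ("noun", "verb")),
    (("fish", "fish"),  ("noun", "verb")),
    (("sleep",),        ("verb",)),
]

trans_count = defaultdict(lambda: defaultdict(float))   # counts for P(y_i | y_{i-1})
emit_count  = defaultdict(lambda: defaultdict(float))   # counts for P(x_i | y_i)

for x, y in S:
    prev = "start"                        # y_0 is defined to be the Start state
    for word, tag in zip(x, y):
        trans_count[prev][tag] += 1.0
        emit_count[tag][word]  += 1.0
        prev = tag
    trans_count[prev]["end"] += 1.0       # optional End transition

def normalize(counts):
    """Turn counts into conditional probabilities (maximum-likelihood estimates)."""
    return {given: {k: v / sum(row.values()) for k, v in row.items()}
            for given, row in counts.items()}

A = normalize(trans_count)    # A[b][a] estimates P(y_{i+1}=a | y_i=b)
O = normalize(emit_count)     # O[z][w] estimates P(x_i=w | y_i=z)
print(A["noun"])              # {'verb': 1.0} on this toy corpus
print(O["verb"])              # {'sleep': ~0.667, 'fish': ~0.333}
```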
Conditional Independence Assumptions
  argmax_{A,O} ∏_{(x,y)∈S} P(x, y) = argmax_{A,O} ∏_{(x,y)∈S} P(End | y_M) ∏_{i=1}^{M} P(y_i | y_{i-1}) ∏_{i=1}^{M} P(x_i | y_i)
• Everything decomposes into products of pairs
  – I.e., P(y_{i+1}=a | y_i=b) doesn't depend on anything else
• Can just estimate frequencies:
  – How often y_{i+1}=a when y_i=b over the training set
  – Note that P(y_{i+1}=a | y_i=b) is a common model across all locations of all sequences.
• #Parameters:
  – Transitions A: #Tags^2
  – Observations O: #Words × #Tags
  – Avoids directly modeling word/word pairings
  – (#Tags = 10s; #Words = 10,000s)

Unsupervised Training
• What if we have no y's?
  – Just a training set of sentences: S = {x_i}_{i=1}^{N}   (word sequences)
• Still want to estimate P(x, y)
  – How?
  – Why?

Why Unsupervised Training?
• Supervised data is hard to acquire
  – Requires annotating POS tags
• Unsupervised data is plentiful
  – Just grab some text!
• Might just work for POS tagging!
  – Learn y's that correspond to POS tags
• Can be used for other tasks
  – Detect outlier sentences (sentences with low probability)
  – Sample new sentences

EM Algorithm (Baum-Welch)
• If we had y's → maximum likelihood.
• If we had (A, O) → predict y's.
• Chicken vs. egg!
  1. Initialize A and O arbitrarily
  2. Predict prob. of y's for each training x     (Expectation Step)
  3. Use the y's to estimate new (A, O)           (Maximization Step)
  4. Repeat from Step 2 until convergence
http://en.wikipedia.org/wiki/Baum%E2%80%93Welch_algorithm

Expectation Step
• Given (A, O)
• For each training x = (x_1, …, x_M), predict P(y_i) at every position:

               x_1    x_2    …    x_M
  P(y_i=Noun)   0.5    0.4   …    0.05
  P(y_i=Det)    0.4    0.6   …    0.25
  P(y_i=Verb)   0.1    0.0   …    0.7

  – Encodes the current model's beliefs about y
  – The "marginal distribution" of each y_i

Recall: Matrix Formulation
• Transition matrix A: A_ab = P(y_{i+1}=a | y_i=b), or –log P(y_{i+1}=a | y_i=b)

  P(ynext|y)       y="Noun"   y="Verb"
  ynext="Noun"       0.09       0.667
  ynext="Verb"       0.333      0.91

• Observation matrix O: O_wz = P(x_i=w | y_i=z), or –log P(x_i=w | y_i=z)

  P(x|y)      y="Noun"   y="Verb"
  x="fish"      0.8        0.5
  x="sleep"     0.2        0.5

Maximization Step
• Maximize likelihood over the marginal distributions (replace the indicator counts with expected counts):
  Supervised:
    A_ab = Σ_{j=1}^{N} Σ_{i=0}^{M_j} 1[ y_j^{i+1}=a ∧ y_j^i=b ]  /  Σ_{j=1}^{N} Σ_{i=0}^{M_j} 1[ y_j^i=b ]
    O_wz = Σ_{j=1}^{N} Σ_{i=1}^{M_j} 1[ x_j^i=w ∧ y_j^i=z ]  /  Σ_{j=1}^{N} Σ_{i=1}^{M_j} 1[ y_j^i=z ]
  Unsupervised (using the marginals from the E-step):
    A_ab = Σ_{j=1}^{N} Σ_{i=0}^{M_j} P(y_j^i=b) P(y_j^{i+1}=a)  /  Σ_{j=1}^{N} Σ_{i=0}^{M_j} P(y_j^i=b)
    O_wz = Σ_{j=1}^{N} Σ_{i=1}^{M_j} 1[ x_j^i=w ] P(y_j^i=z)  /  Σ_{j=1}^{N} Σ_{i=1}^{M_j} P(y_j^i=z)

Computing Marginals (Forward-Backward Algorithm)
• Solving the E-step requires computing the marginals P(y_i = z | x) at every position.
• Can solve using dynamic programming! — similar to Viterbi.

Notation
• α_z(i) = P(x_{1:i}, y_i=Z | A, O):    probability of observing the prefix x_{1:i} and having the i-th state be y_i = Z.
• β_z(i) = P(x_{i+1:M} | y_i=Z, A, O):  probability of observing the suffix x_{i+1:M} given that the i-th state is y_i = Z.
• Computing marginals = combining the two terms (a sketch follows below):
  P(y_i = z | x) = α_z(i) β_z(i) / Σ_{z'} α_{z'}(i) β_{z'}(i)
http://en.wikipedia.org/wiki/Baum%E2%80%93Welch_algorithm
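A minimal sketch of that combination step, assuming `alpha` and `beta` are lists of per-position dictionaries produced by the forward and backward recursions on the next slides (the names and shapes are my own choices, not a prescribed interface):

```python
def marginals(alpha, beta):
    """P(y_i = z | x) = alpha_z(i) * beta_z(i) / sum_z' alpha_z'(i) * beta_z'(i).

    alpha[i][z] and beta[i][z] are the alpha and beta values for position i (0-indexed).
    """
    out = []
    for a_i, b_i in zip(alpha, beta):
        unnormalized = {z: a_i[z] * b_i[z] for z in a_i}
        total = sum(unnormalized.values())        # same normalizer at every position
        out.append({z: v / total for z, v in unnormalized.items()})
    return out
```

Per-position marginals like these feed directly into the M-step updates above.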
Forward (sub-)Algorithm
• Solve for every α_z(i) = P(x_{1:i}, y_i=Z | A, O).
• Naïvely: exponential time!
  α_z(i) = P(x_{1:i}, y_i=Z | A, O) = Σ_{y_{1:i-1}} P(x_{1:i}, y_i=Z, y_{1:i-1} | A, O)
• Can be computed recursively (like Viterbi):
  α_z(1) = P(y_1=z | y_0) P(x_1 | y_1=z) = O_{x_1, z} A_{z, Start}
  α_z(i+1) = O_{x_{i+1}, z} Σ_{j=1}^{L} α_j(i) A_{z, j}
• Viterbi effectively replaces the sum with a max.

Backward (sub-)Algorithm
• Solve for every β_z(i) = P(x_{i+1:M} | y_i=Z, A, O).
• Naïvely: exponential time!
  β_z(i) = P(x_{i+1:M} | y_i=Z, A, O) = Σ_{y_{i+1:M}} P(x_{i+1:M}, y_{i+1:M} | y_i=Z, A, O)
• Can be computed recursively (like Viterbi):
  β_z(M) = 1
  β_z(i) = Σ_{j=1}^{L} β_j(i+1) A_{j, z} O_{x_{i+1}, j}

Forward-Backward Algorithm
• Run Forward:  α_z(i) = P(x_{1:i}, y_i=Z | A, O)
• Run Backward: β_z(i) = P(x_{i+1:M} | y_i=Z, A, O)
• For each training x = (x_1, …, x_M), compute each marginal P(y_i):
  P(y_i = z | x) = α_z(i) β_z(i) / Σ_{z'} α_{z'}(i) β_{z'}(i)

Recap: Unsupervised Training
• Train using only word sequences: S = {x_i}_{i=1}^{N}   (word sequences / sentences)
• The y's are "hidden states"
  – All pairwise transitions are through the y's
  – Hence hidden Markov model
• Train using the EM algorithm
  – Converges to a local optimum

Initialization
• How to choose the number of hidden states?
  – By hand
  – Cross-validation: evaluate P(x) on validation data
• Can compute P(x) via the forward algorithm:
  P(x) = Σ_y P(x, y) = Σ_z α_z(M) P(End | y_M=z)

Recap: Sequence Prediction & HMMs
• Models pairwise dependencies in sequences
  – x = "I fish often"; POS tags: Det, Noun, Verb, Adj, Adv, Prep
  – Independent: (N, N, Adv)    HMM Viterbi: (N, V, Adv)
• Compact: only models pairwise dependencies between y's
• Main limitation: lots of independence assumptions
  – Poor predictive accuracy

Next Lecture
• Conditional Random Fields
  – Removes many independence assumptions
  – More accurate in practice
  – Can only be trained in the supervised setting
• Recitation tomorrow: recap of Viterbi and Forward/Backward (a reference sketch follows below).
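For reference ahead of that recap, a minimal sketch of the forward and backward passes and the P(x) computation above, reusing the toy dictionaries from the earlier sketches (all names are illustrative; a practical version would work in log space or rescale α and β to avoid underflow):

```python
TAGS  = ["noun", "verb"]
TRANS = {"start": {"noun": 0.8, "verb": 0.2},
         "noun":  {"noun": 0.1, "verb": 0.8, "end": 0.1},
         "verb":  {"noun": 0.2, "verb": 0.1, "end": 0.7}}
EMIT  = {"noun": {"fish": 0.8, "sleep": 0.2},
         "verb": {"fish": 0.5, "sleep": 0.5}}

def forward(x):
    """alpha[i][z] = P(x_{1:i+1}, y_{i+1} = z)   (0-indexed positions)."""
    alpha = [{z: TRANS["start"][z] * EMIT[z][x[0]] for z in TAGS}]
    for i in range(1, len(x)):
        alpha.append({z: EMIT[z][x[i]] * sum(alpha[i - 1][j] * TRANS[j][z] for j in TAGS)
                      for z in TAGS})
    return alpha

def backward(x):
    """beta[i][z] = P(x_{i+2:M} | y_{i+1} = z); beta = 1 at the last position."""
    beta = [dict() for _ in x]
    beta[-1] = {z: 1.0 for z in TAGS}
    for i in range(len(x) - 2, -1, -1):
        beta[i] = {z: sum(beta[i + 1][j] * TRANS[z][j] * EMIT[j][x[i + 1]] for j in TAGS)
                   for z in TAGS}
    return beta

def marginal(alpha, beta, i):
    """P(y_{i+1} = z | x) for every tag z, via alpha_z(i) * beta_z(i) / normalizer."""
    unnorm = {z: alpha[i][z] * beta[i][z] for z in TAGS}
    total = sum(unnorm.values())
    return {z: v / total for z, v in unnorm.items()}

x = ("fish", "sleep")
alpha, beta = forward(x), backward(x)
print(sum(alpha[-1][z] * TRANS[z]["end"] for z in TAGS))   # P(x) = sum_z alpha_z(M) P(End|y_M=z)
print(marginal(alpha, beta, 1))                            # marginal over tags for "sleep"
```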