Machine Learning
Partially Directed Graphs and Conditional Random Fields
Sargur Srihari
srihari@cedar.buffalo.edu

Topics
• Conditional Random Fields
• Gibbs distribution and CRF
• Directed and Undirected Independencies
– View as combination of BN and MN
• CRF for Image Segmentation
• CRF for Text Analytics
• Naïve Bayes and Naïve Markov
– Learning the models

Conditional Distribution Representation
• Nodes correspond to $Y \cup X$
– Y are target variables and X are observed variables
• Parameterized as an ordinary Markov network
– Set of factors $\phi_1(D_1), \ldots, \phi_m(D_m)$
• Can be encoded as a log-linear model
• Viewed as encoding a set of factors
• Model represents $P(Y \mid X)$ rather than $P(Y, X)$
– To naturally represent a conditional distribution
• Avoid representing a probabilistic model over X
– Disallow potentials involving only variables in X

Conditional Random Fields
• An MN encodes a joint distribution over X
• An MN can also be used to represent a conditional distribution $P(Y \mid X)$
– Y is a set of target variables
– X is a set of observed variables
• Representation is called a CRF
• Has an analog in directed graphical models
– Conditional Bayesian Networks

CRF Definition
• An undirected graph H with nodes $X \cup Y$
– Network is annotated with a set of factors $\phi_1(D_1), \ldots, \phi_m(D_m)$ such that $D_i \not\subseteq X$
– Network encodes a conditional distribution as
$$P(Y \mid X) = \frac{1}{Z(X)} \tilde{P}(Y, X), \qquad \tilde{P}(Y, X) = \prod_{i=1}^{m} \phi_i(D_i), \qquad Z(X) = \sum_{Y} \tilde{P}(Y, X)$$
– $\tilde{P}(Y, X)$ is the unnormalized joint distribution, a product of factors
– $Z(X)$ is the unnormalized marginal of X; the partition function is now a function of X
– Two variables in H are connected by an edge whenever they appear in the scope of a factor
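The definition above can be checked on a very small model. The following is a minimal sketch, assuming a hypothetical CRF with two binary targets (Y1, Y2), one binary observation X1, and made-up weights 1.5 and 0.8; none of these values come from the slides. It enumerates Y to compute Z(x) and shows that Z takes a different value for each assignment x.

```python
from itertools import product
from math import exp

# Hypothetical toy CRF: targets Y = (Y1, Y2), observed X = (X1,), all binary.
# Every factor scope contains at least one Y, so no factor lies wholly inside X.

def phi_1(y1, x1):
    # Illustrative factor over (Y1, X1); the weight 1.5 is an assumption.
    return exp(1.5 * y1 * x1)

def phi_2(y1, y2):
    # Illustrative factor over (Y1, Y2); the weight 0.8 is an assumption.
    return exp(0.8 if y1 == y2 else 0.0)

def unnormalized(y, x):
    # P~(Y, X): product of all factors.
    y1, y2 = y
    (x1,) = x
    return phi_1(y1, x1) * phi_2(y1, y2)

def partition(x):
    # Z(x): sum over assignments to Y only, with x held fixed.
    return sum(unnormalized(y, x) for y in product((0, 1), repeat=2))

def conditional(y, x):
    # P(Y | X) = P~(Y, X) / Z(X)
    return unnormalized(y, x) / partition(x)

for x in [(0,), (1,)]:
    total = sum(conditional(y, x) for y in product((0, 1), repeat=2))
    print(x, "Z(x) =", round(partition(x), 4), "sum_y P(y|x) =", round(total, 4))
```

Because the sum in `partition` runs over Y only, nothing in the sketch ever assigns a probability to x itself, which is the point of the definition.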

Deriving the CRF definition
• Conditional distribution from Bayes' rule:
$$P(Y \mid X) = \frac{P(Y, X)}{P(X)}$$
• Definition of Markov network (Gibbs distribution):
$$P_\Phi(X_1, \ldots, X_n) = \frac{1}{Z} \tilde{P}(X_1, \ldots, X_n), \qquad \tilde{P}(X_1, \ldots, X_n) = \prod_{i=1}^{m} \phi_i(D_i), \qquad Z = \sum_{X_1, \ldots, X_n} \tilde{P}(X_1, \ldots, X_n)$$
• Numerator of conditional distribution is:
$$P(Y, X) = \frac{1}{Z(Y, X)} \tilde{P}(Y, X), \qquad \tilde{P}(Y, X) = \prod_{i=1}^{m} \phi_i(D_i), \qquad Z(Y, X) = \sum_{Y, X} \tilde{P}(Y, X)$$
• Denominator of conditional distribution (from sum rule):
$$P(X) = \sum_{Y} P(Y, X) = \frac{1}{Z(Y, X)} \sum_{Y} \tilde{P}(Y, X)$$
• Combining Bayes and Gibbs gives CRF (the constant $Z(Y, X)$ cancels):
$$P(Y \mid X) = \frac{\tilde{P}(Y, X)}{\sum_{Y} \tilde{P}(Y, X)} = \frac{1}{Z(X)} \tilde{P}(Y, X), \qquad Z(X) = \sum_{Y} \tilde{P}(Y, X)$$

Difference between CRF & Gibbs
• Different normalization in partition function Z(X)
– A Gibbs distribution factorizes into a set of factors and partition function Z:
$$P_\Phi(X_1, \ldots, X_n) = \frac{1}{Z} \tilde{P}(X_1, \ldots, X_n), \qquad \tilde{P}(X_1, \ldots, X_n) = \prod_{i=1}^{m} \phi_i(D_i), \qquad Z = \sum_{X_1, \ldots, X_n} \tilde{P}(X_1, \ldots, X_n)$$
– A CRF:
$$P(Y \mid X) = \frac{1}{Z(X)} \tilde{P}(Y, X), \qquad \tilde{P}(Y, X) = \prod_{i=1}^{m} \phi_i(D_i), \qquad Z(X) = \sum_{Y} \tilde{P}(Y, X)$$
• Induces a different value of Z for every assignment x to X
• Summation only over Y
– Difference denoted by feature variables greyed-out
• Known X (shown dark grey)
• Y has a distribution dependent on X
[Figure: (a) chain of observed nodes X1..X5 (greyed out) above target nodes Y1..Y5]

Example of CRF
• CRF over $Y = \{Y_1, \ldots, Y_k\}$ and $X = \{X_1, \ldots, X_k\}$
• Edges are $Y_i - Y_{i+1}$ and $Y_i - X_i$
• Observed feature variables X are assumed known when the model is used (hence greyed-out)
$$P(Y \mid X) = \frac{1}{Z(X)} \tilde{P}(Y, X), \qquad \tilde{P}(Y, X) = \prod_{i=1}^{k-1} \phi_i(Y_i, Y_{i+1}) \prod_{i=1}^{k} \phi_i(Y_i, X_i), \qquad Z(X) = \sum_{Y} \tilde{P}(Y, X)$$
[Figure: linear chain-structured CRF for sequence labeling, observed X1..X5 above targets Y1..Y5]
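The linear-chain example can be made concrete with a short sketch. It assumes binary labels and two hypothetical potentials shared across positions (the slide allows a different φ_i at each position); the weights 1.0 and 2.0 are made up. Z(X) is computed twice: by enumerating all 2^k labelings, and by a forward recursion along the chain. The recursion is a standard technique for chains, not something stated on the slide.

```python
from itertools import product
from math import exp, isclose

LABELS = (0, 1)

def phi_trans(y, y_next):
    # Hypothetical potential phi(Y_i, Y_{i+1}): favours equal neighbouring labels.
    return exp(1.0 if y == y_next else 0.0)

def phi_obs(y, x):
    # Hypothetical potential phi(Y_i, X_i): favours a label equal to the observation.
    return exp(2.0 if y == x else 0.0)

def unnormalized(ys, xs):
    # P~(Y, X) = prod_{i<k} phi(Y_i, Y_{i+1}) * prod_{i<=k} phi(Y_i, X_i)
    score = 1.0
    for i in range(len(xs)):
        score *= phi_obs(ys[i], xs[i])
    for i in range(len(xs) - 1):
        score *= phi_trans(ys[i], ys[i + 1])
    return score

def partition_brute(xs):
    # Z(X) by summing over every labeling Y (exponential in k).
    return sum(unnormalized(ys, xs) for ys in product(LABELS, repeat=len(xs)))

def partition_forward(xs):
    # alpha[y] = total unnormalized weight of all label prefixes ending in y.
    alpha = {y: phi_obs(y, xs[0]) for y in LABELS}
    for x in xs[1:]:
        alpha = {
            y: phi_obs(y, x) * sum(alpha[p] * phi_trans(p, y) for p in LABELS)
            for y in LABELS
        }
    return sum(alpha.values())

def conditional(ys, xs):
    # P(Y | X) = P~(Y, X) / Z(X)
    return unnormalized(ys, xs) / partition_forward(xs)

xs = (0, 1, 1, 0, 1)
print(isclose(partition_brute(xs), partition_forward(xs)))
print(conditional(xs, xs))
```

The forward version costs time linear in k, which is what makes the chain structure practical for sequence labeling.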

Main Strength of CRF
• Avoids encoding a distribution over the variables in X
• Allows incorporating into model
– A rich set of observed variables
• Whose dependencies are complex or poorly understood
• Allows including continuous variables
– Distributions may not have simple parametric forms
• Can incorporate domain knowledge
– Rich features without modeling joint distribution

CRF Image segmentation
[Figure: (a) Original image; (b) Each superpixel is a random variable; (c) Classification using node potentials alone; (d) Segmentation using pairwise Markov network encoding. Labels shown: building, car, road, cow, grass]
• Each image defines a probability distribution over the variables representing super-pixel labels
• Rather than define a joint distribution over pixel values, we define a conditional distribution over segment labels given the pixel values
– Avoids making a parametric assumption over (continuous) pixel values
– Can define image processing routines to define rich features, e.g., presence or direction of an image gradient at a pixel
» Such features usually rely on multiple pixels
» So defining a correct joint distribution or independence properties over the features is non-trivial

Directed and Undirected Dependencies
• A CRF defines a conditional distribution of Y on X
• Thus it can be viewed as a partially directed graph
• Where we have an undirected component over Y
• Which has variables in X as parents

CRFs for Text Analysis
• Important use for CRF framework
• Part-of-speech labeling
• Named Entity Tagging
– People, places, organizations, etc.
• Extracting structured information from text
– From a reference list
• Publications, titles, authors, journals, year
• Models share a similar structure

Named Entity (NE) Tagging
• Entities often span multiple words
• Type of entity may not be apparent from individual words
• New York is a location, New York Times is an organization
• For each word $X_i$ introduce target variable $Y_i$ which is its entity type
– Outcomes for $Y_i$ are (in BIO notation)
• B-PERSON, I-PERSON, B-LOCATION, I-LOCATION, B-ORGANIZATION, I-ORGANIZATION, OTHER
• B: beginning, I: inside entity
• B allows segmenting adjacent entities of same type

CRF for NE Tagging
[Figure (a): labeled sentence, shown as word/label pairs]
Mrs./B-PER Green/I-PER spoke/OTH today/OTH in/OTH New/B-LOC York/I-LOC Green/B-PER chairs/OTH the/OTH finance/OTH committee/OTH
KEY: B-PER Begin person name; I-PER Within person name; B-LOC Begin location name; I-LOC Within location name; OTH Not an entity
• Set of known variables (the words): X
• Two factors for each word: $\phi_t^1(Y_t, Y_{t+1})$ and $\phi_t^2(Y_t, X_1, \ldots, X_T)$
– Factor to represent dependency between neighboring target variables: $\phi_t^1(Y_t, Y_{t+1})$
– Factor to represent dependency between target $Y_t$ and its context in word sequence: $\phi_t^2(Y_t, X_1, \ldots, X_T)$
– Can depend on arbitrary features of entire input word sequence $X_1, \ldots, X_T$ (three here)

Linear Chain CRF for NE
• Factor to represent dependency between target $Y_t$ and its context in word sequence: $\phi_t^2(Y_t, X_1, \ldots, X_T)$
– Can depend on arbitrary features of entire input word sequence $X_1, \ldots, X_T$
– Not encoded using table factors but use log-linear models
• Factors derived from feature functions such as
$$f_t(Y_t, X_t) = I\{Y_t = \text{B-ORGANIZATION},\ X_t = \text{“Times”}\}$$
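A log-linear factor built from indicator feature functions might look like the sketch below. Only the first feature comes from the slide; the other two features, all weights, and the helper name `phi_context` are hypothetical and chosen for illustration.

```python
from math import exp

def f_org_times(y_t, words, t):
    # Feature from the slide: I{Y_t = B-ORGANIZATION, X_t = "Times"}
    return 1.0 if y_t == "B-ORGANIZATION" and words[t] == "Times" else 0.0

def f_cap_person(y_t, words, t):
    # Hypothetical feature: a capitalized word that begins a person name.
    return 1.0 if y_t == "B-PERSON" and words[t][:1].isupper() else 0.0

def f_after_mrs(y_t, words, t):
    # Hypothetical feature that looks at the previous word in the sequence.
    return 1.0 if y_t == "I-PERSON" and t > 0 and words[t - 1] == "Mrs." else 0.0

# (feature function, weight) pairs; the weights are made up, not learned.
FEATURES = [(f_org_times, 2.0), (f_cap_person, 1.2), (f_after_mrs, 1.5)]

def phi_context(y_t, words, t):
    # Log-linear factor: phi(Y_t, X_1..X_T) = exp(sum_j w_j * f_j(Y_t, X, t))
    return exp(sum(w * f(y_t, words, t) for f, w in FEATURES))

words = ["Mrs.", "Green", "spoke", "today"]
for label in ("I-PERSON", "B-PERSON", "OTHER"):
    print(label, phi_context(label, words, 1))
```

Each feature is zero for most (label, word) pairs, so the factor is never stored as a table; only the weights of features that fire contribute to the exponent.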

Features for NE Tagging
• For word $X_i$
– Capitalized, in list of common person names
– In atlas of location names, ends with “ton”
– Exactly “York”, following “Times”
• For word sequence
– More than two sports-related terms (suggesting New York is a sports organization)
• Hundreds or thousands of features
• Sparse (zero for most words)
• Same feature variable can be connected to multiple target variables
– $Y_i$ dependent on identity of several words in window

Performance of CRF
• Linear chain CRFs provide high per-token accuracies
– High 90% range on many natural data sets
• High per-field precision and recall
– Where entire phrase categories and boundaries must be correct
• 80-95% depending on data set

Including additional information in NE
• Linear chain graphical model is augmented
• When a word occurs multiple times in a document it has the same label
• Include factors that connect identical words
• Results in skip-chain CRF shown next

Skip Chain CRF for NE Recognition
[Figure (a): skip-chain CRF over the sentence "Mrs. Green spoke today in New York Green chairs the finance committee", with a long-range factor connecting the two occurrences of "Green"]
KEY: B-PER Begin person name; I-PER Within person name; B-LOC Begin location name; I-LOC Within location name; OTH Not an entity
• First occurrence of “Green” has neighboring words that provide strong evidence that it is a Person
• Second occurrence is more ambiguous
• Augmenting with a long-range factor allows it to be predicted correctly
• Graphical structure over Y can easily depend on the Xs

Joint inference: Part-of-Speech Labeling/Noun-phrase Segmentation
• Pair of coupled linear chain CRFs
• A noun phrase is composed of several words; its labeling depends on the POS and the word
[Figure (b): coupled chains, shown as word/POS/NP triples]
British/ADJ/B Airways/N/I rose/V/O after/IN/O announcing/V/O its/PRP/B withdrawal/N/I from/IN/O the/DT/B UAL/N/I deal/N/I
KEY: B Begin noun phrase; I Within noun phrase; O Not a noun phrase; N Noun; ADJ Adjective; V Verb; IN Preposition; PRP Possessive pronoun; DT Determiner (e.g., a, an, the)

Partially Directed Models
• Probabilistic graphical models are useful for complex systems:
– Directed (Bayesian networks)
– Undirected (Markov networks)
• Can unify both representations
– Incorporate both directed/undirected dependencies
– CRFs can be viewed as partially directed graphs
• CRFs can be generalized to chain graphs
– Which have subgraphs with chains
– Network in which undirected components depend upon each other in a directed fashion

Directed and Undirected Dependencies
• A CRF defines a conditional distribution of Y on X
• Thus it can be viewed as a partially directed graph

CRF as Partially Directed Graph
• CRF defines a conditional distribution of Y on X
[Figure: (a) linear chain CRF over X1..X5 and Y1..Y5; (b) equivalent partially directed variant (CRF)]
• Can be viewed as one with an undirected component over Y which has X as parents
• Factors are defined over Ys only, each of which has an X value
• Equivalent models

CRF vs HMM
• Logistic CPDs (logistic regression) are the conditional analog of Naïve Bayes
• CRF is the conditional analog of HMM

Models for Sequence Labeling
• HMM (generative; conditioning on unknown)
$$P(X, Y) = \prod_{i=1}^{k} P(X_i \mid Y_i)\, P(Y_i \mid Y_{i-1})$$
– Here $P(Y_1 \mid Y_0)$ is read as the initial distribution $P(Y_1)$
– Since the $Y_i$ are unknown, the joint distribution has to be estimated from data
– Determining $P(Y \mid X)$ depends on first determining $P(X, Y)$
• CRF (discriminative; conditioning on known)
$$P(Y \mid X) = \frac{1}{Z(X)} \tilde{P}(Y, X), \qquad \tilde{P}(Y, X) = \prod_{i=1}^{k-1} \phi_i(Y_i, Y_{i+1}) \prod_{i=1}^{k} \phi_i(Y_i, X_i), \qquad Z(X) = \sum_{Y} \tilde{P}(Y, X)$$
– $P(Y \mid X)$ is obtained directly
• MEMM
$$P(Y \mid X) = \prod_{i=1}^{k} P(Y_i \mid X_i)\, P(Y_i \mid Y_{i-1})$$
[Figure: (a) HMM and linear chain CRF; (b) partially directed CRF; (c) MEMM, each over X1..X5 and Y1..Y5]

CRF (Partially Directed) and MEMM
• Linear chain structured CRF with $Y = \{Y_1, \ldots, Y_k\}$, $X = \{X_1, \ldots, X_k\}$
$$P(Y \mid X) = \frac{1}{Z(X)} \tilde{P}(Y, X), \qquad \tilde{P}(Y, X) = \prod_{i=1}^{k-1} \phi_i(Y_i, Y_{i+1}) \prod_{i=1}^{k} \phi_i(Y_i, X_i), \qquad Z(X) = \sum_{Y} \tilde{P}(Y, X)$$
[Figure: (a) linear chain CRF; (b) equivalent partially directed variant (CRF); (c) fully directed version, non-equivalent]
• Fully-directed version (a Bayesian network)
– Called Max Entropy Markov Model (MEMM)
– Is also a conditional model but is non-equivalent
$$P(Y \mid X) = \prod_{i=1}^{k} P(Y_i \mid X_i)\, P(Y_i \mid Y_{i-1})$$
– $Y_1$ is independent of $X_2$ if $Y_2$ is unknown. If $Y_2$ is known we have a dependency due to the v-structure
• A sound conditional BN requires edges from all variables in X to each $Y_i$
• In a CRF, the probability of Y depends on the values of all variables $X = \{X_1, \ldots, X_k\}$
• MEMM is more efficient, with fewer parameters to be learned

Models for Sequence Labeling
• Sequence of observations $X = \{X_1, \ldots, X_k\}$. Need a joint label $Y = \{Y_1, \ldots, Y_k\}$
• Both CRF and MEMM are discriminative models that directly obtain the conditional probability $P(Y \mid X)$
• HMM is a generative model that needs the joint probability $P(X, Y)$
• CRF
$$P(Y \mid X) = \frac{1}{Z(X)} \tilde{P}(Y, X), \qquad \tilde{P}(Y, X) = \prod_{i=1}^{k-1} \phi_i(Y_i, Y_{i+1}) \prod_{i=1}^{k} \phi_i(Y_i, X_i), \qquad Z(X) = \sum_{Y} \tilde{P}(Y, X)$$
• MEMM
$$P(Y \mid X) = \prod_{i=1}^{k} P(Y_i \mid X_i)\, P(Y_i \mid Y_{i-1})$$
– $Y_1$ is independent of $X_2$ if we are not given $Y_2$
– More generally, $Y_i \perp X_j \mid X_{-j}$: a later observation has no effect on the posterior probability of the current state
– In activity recognition in a video sequence, frames are labelled as running/walking. Earlier frames may be blurry but later ones clearer
• HMM needs the joint distribution
$$P(X, Y) = \prod_{i=1}^{k} P(X_i \mid Y_i)\, P(Y_i \mid Y_{i-1}), \qquad P(Y \mid X) = \frac{P(X, Y)}{P(X)}$$
• Models have trade-offs in expressive power and learnability
– MEMM and HMM are more easily learned: as purely directed models, their parameters can be computed in closed form using maximum likelihood
– CRF requires an iterative gradient-based approach, which is more expensive

CRF Example: Naïve Markov model
• Binary-valued variables $X = \{X_1, \ldots, X_k\}$ and $Y = \{Y\}$
– Variables independent of each other and only dependent on class Y
• Pairwise potential between Y and each $X_i$:
$$\phi_i(X_i, Y) = \exp\{w_i\, I\{X_i = 1, Y = 1\}\}$$
– I is the indicator function, which takes value 1 when its argument is true and 0 otherwise
• Single node potential:
$$\phi_0(Y) = \exp\{w_0\, I\{Y = 1\}\}$$
• From the CRF definition, the unnormalized measure is
$$\tilde{P}(Y = 1, x_1, \ldots, x_k) = \exp\Big\{w_0 + \sum_{i=1}^{k} w_i x_i\Big\}, \qquad \tilde{P}(Y = 0, x_1, \ldots, x_k) = \exp\{0\} = 1$$
which is equivalent to
$$P(Y = 1 \mid x_1, \ldots, x_k) = \mathrm{sigmoid}\Big\{w_0 + \sum_{i=1}^{k} w_i x_i\Big\}, \qquad \mathrm{sigmoid}(z) = \frac{e^z}{1 + e^z}$$
• Logistic CPD (regression): not defined by a table but induced by parameters
• Efficient: linear (not exponential as in a full BN) in the number of parents
[Figure: node Y connected to X1, X2, ..., Xk]
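The step from the potentials to the sigmoid can be checked numerically. The sketch below normalizes the unnormalized measure over Y in {0, 1} and compares the result with the sigmoid form; the weights and the input vector are arbitrary illustrative values.

```python
from math import exp, isclose

def sigmoid(z):
    return exp(z) / (1.0 + exp(z))

def naive_markov_posterior(w0, w, x):
    """P(Y=1 | x) obtained by normalizing the product of potentials over Y."""
    def unnormalized(y):
        # Node potential phi_0(Y) = exp{w0 * I{Y=1}}
        score = exp(w0 if y == 1 else 0.0)
        # Pairwise potentials phi_i(X_i, Y) = exp{w_i * I{X_i=1, Y=1}}
        for wi, xi in zip(w, x):
            score *= exp(wi if (xi == 1 and y == 1) else 0.0)
        return score
    z_x = unnormalized(0) + unnormalized(1)  # Z(x): sum over Y only
    return unnormalized(1) / z_x

w0, w, x = -0.5, (1.0, -2.0, 0.7), (1, 0, 1)  # illustrative values
lhs = naive_markov_posterior(w0, w, x)
rhs = sigmoid(w0 + sum(wi * xi for wi, xi in zip(w, x)))
print(lhs, rhs, isclose(lhs, rhs))
```

Since the unnormalized value for Y = 0 is always 1, Z(x) is one plus the unnormalized value for Y = 1, and the ratio has the form e^z / (1 + e^z).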

Naïve Markov and Naïve Bayes
• Binary variables $X = \{X_1, \ldots, X_k\}$ and $Y = \{Y\}$
1. Logistic regression is the conditional analog of the Naïve Bayes classifier
$$P(Y = 1 \mid x_1, \ldots, x_k) = \mathrm{sigmoid}\Big\{w_0 + \sum_{i=1}^{k} w_i x_i\Big\}$$
– Discriminative model ($k + 1$ parameters: $w_0, \ldots, w_k$)
2. Naïve Bayes
$$P(Y = 1 \mid X_1, \ldots, X_k) = \frac{P(Y = 1, X_1, \ldots, X_k)}{\sum_{Y} P(Y, X_1, \ldots, X_k)}, \qquad P(Y, X_1, \ldots, X_k) = P(Y) \prod_{i=1}^{k} P(X_i \mid Y)$$
– Generative model ($2k + 1$ parameters for binary variables): we first have to obtain k CPDs conditioned on the unknown, from which we can get the distribution conditioned on the known
[Figure: node Y connected to X1, X2, ..., Xk, undirected for Naïve Markov and directed from Y for Naïve Bayes]

Logistic Regression Revisited
• Input X, target classes Y=0 and Y=1
• A posteriori probability of Y=1 is
$$P(Y = 1 \mid X) = y(X) = \sigma(w^T X)$$
where X is an M-dimensional feature vector and $\sigma(\cdot)$ is the logistic sigmoid function
• Goal: determine the M parameters
• Known as logistic regression in statistics
– Although a model for classification rather than for regression

Logistic Sigmoid
[Figure: plot of $\sigma(a)$ against a]
Properties:
A. Symmetry: $\sigma(-a) = 1 - \sigma(a)$
B. Inverse: $a = \ln\big(\sigma / (1 - \sigma)\big)$, known as the logit. Also known as log odds since it is the log of the ratio, $\ln[p(Y=1 \mid x) / p(Y=0 \mid x)]$
C. Derivative: $d\sigma/da = \sigma(1 - \sigma)$

Determining Logistic Regression parameters
• Maximum likelihood approach for two classes
– Data set consists of (input, target) pairs $(X_n, t_n)$, where $t_n \in \{0, 1\}$, $n = 1, \ldots, N$
– Since t is binary we can use the Bernoulli distribution for it:
$$p(t \mid w) = y^{t} (1 - y)^{1 - t}, \qquad y = \sigma(w^T X)$$
• Likelihood function associated with N observations
$$p(\mathbf{t} \mid w) = \prod_{n=1}^{N} y_n^{t_n} (1 - y_n)^{1 - t_n}$$
where $\mathbf{t} = (t_1, \ldots, t_N)^T$ and $y_n = p(Y = 1 \mid X_n)$

Error Fn for Logistic Regression
• Likelihood function
$$p(\mathbf{t} \mid w) = \prod_{n=1}^{N} y_n^{t_n} (1 - y_n)^{1 - t_n}$$
• Error function is the negative of the log-likelihood
$$E(w) = -\ln p(\mathbf{t} \mid w) = -\sum_{n=1}^{N} \{ t_n \ln y_n + (1 - t_n) \ln(1 - y_n) \}$$
• Known as the cross-entropy error function
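The relation between the Bernoulli likelihood and the cross-entropy error can be sketched directly from the two formulas. The data set and weight vector below are made up for illustration; each input carries a leading 1 so that the first weight acts as a bias (an assumption, since the slides fold any bias into X).

```python
from math import exp, log, isclose

def sigmoid(a):
    return 1.0 / (1.0 + exp(-a))

def predict(w, x):
    # y = sigma(w^T x)
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)))

def likelihood(w, data):
    # p(t | w) = prod_n y_n^{t_n} (1 - y_n)^{1 - t_n}
    p = 1.0
    for x, t in data:
        y = predict(w, x)
        p *= (y ** t) * ((1.0 - y) ** (1 - t))
    return p

def cross_entropy(w, data):
    # E(w) = -sum_n { t_n ln y_n + (1 - t_n) ln(1 - y_n) }
    total = 0.0
    for x, t in data:
        y = predict(w, x)
        total += t * log(y) + (1 - t) * log(1.0 - y)
    return -total

# Illustrative data: (feature vector with leading 1, binary target).
data = [((1.0, 0.5), 0), ((1.0, 1.5), 1), ((1.0, 2.0), 0), ((1.0, 3.0), 1)]
w = (-1.0, 0.8)
print(cross_entropy(w, data), -log(likelihood(w, data)))
```

With all weights set to zero every prediction is 0.5, so the error equals N ln 2, which is a convenient sanity check on any implementation.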

Gradient of Error Function
• Error function
$$E(w) = -\ln p(\mathbf{t} \mid w) = -\sum_{n=1}^{N} \{ t_n \ln y_n + (1 - t_n) \ln(1 - y_n) \}, \qquad y_n = \sigma(w^T X_n)$$
• Using the derivative of the logistic sigmoid, the gradient of the error function is
$$\nabla E(w) = \sum_{n=1}^{N} (y_n - t_n) X_n$$
– Error × feature vector
– Contribution to the gradient by data point n is the error between prediction $y_n = \sigma(w^T X_n)$ and target $t_n$, times the input $X_n$
• Analytical derivative: for one data point let $z = z_1 + z_2$, where $z_1 = t \ln \sigma(w^T X)$ and $z_2 = (1 - t) \ln[1 - \sigma(w^T X)]$, so that the point contributes $-z$ to $E(w)$
– Using $\frac{d\sigma}{da} = \sigma(1 - \sigma)$ and $\frac{d}{dx}(a \ln x) = \frac{a}{x}$:
$$\frac{dz_1}{dw} = \frac{t\, \sigma(w^T X)[1 - \sigma(w^T X)]\, X}{\sigma(w^T X)}, \qquad \frac{dz_2}{dw} = \frac{(1 - t)\, \sigma(w^T X)[1 - \sigma(w^T X)](-X)}{1 - \sigma(w^T X)}$$
– Therefore $\frac{dz}{dw} = (t - \sigma(w^T X)) X$, and the contribution to the gradient of the error is $-\frac{dz}{dw} = (\sigma(w^T X) - t) X = (y - t) X$

Simple Sequential Algorithm
• No closed-form maximum likelihood solution for determining w
• Given gradient of error function
$$\nabla E(w) = \sum_{n=1}^{N} (y_n - t_n) X_n$$
• Solve using an iterative approach
$$w^{(\tau + 1)} = w^{(\tau)} - \eta \nabla E_n, \qquad \nabla E_n = (y_n - t_n) X_n$$
– Error × feature vector
• Solution has severe over-fitting problems for linearly separable data
• So use IRLS algorithm

Multi-class Logistic Regression
• Work with the soft-max function instead of the logistic sigmoid
$$p(Y = k \mid X) = y_k(X) = \frac{\exp(a_k)}{\sum_j \exp(a_j)}, \qquad a_k = w_k^T X$$

Multi-class Likelihood Function
• 1-of-K coding scheme
– For feature vector $X_n$, target vector $t_n$ belonging to class Y=k is a binary vector with all elements zero except for element k
$$p(T \mid w_1, \ldots, w_K) = \prod_{n=1}^{N} \prod_{k=1}^{K} p(Y = k \mid X_n)^{t_{nk}} = \prod_{n=1}^{N} \prod_{k=1}^{K} y_{nk}^{t_{nk}}$$
– where $y_{nk} = y_k(X_n)$
– T is an N × K matrix with elements $t_{nk}$

Multi-class Error Function
1. Error function: negative log-likelihood
$$E(w_1, \ldots, w_K) = -\ln p(T \mid w_1, \ldots, w_K) = -\sum_{n=1}^{N} \sum_{k=1}^{K} t_{nk} \ln y_{nk}$$
– Known as the cross-entropy error function for multi-class
2. Gradient of error function with respect to one parameter vector $w_j$
$$\nabla_{w_j} E(w_1, \ldots, w_K) = \sum_{n=1}^{N} (y_{nj} - t_{nj}) X_n$$
– Error × feature vector
– Derivatives of soft-max: with $y_k(X) = \frac{\exp(a_k)}{\sum_j \exp(a_j)}$ and $a_k = w_k^T X$,
$$\frac{\partial y_k}{\partial a_j} = y_k (I_{kj} - y_j)$$
where $I_{kj}$ are elements of the identity matrix

IRLS Algorithm for Multi-class
3. Hessian matrix comprises blocks of size M × M
– Block j,k is given by
$$\nabla_{w_k} \nabla_{w_j} E(w_1, \ldots, w_K) = \sum_{n=1}^{N} y_{nk} (I_{kj} - y_{nj}) X_n X_n^T$$
– Hessian matrix is positive-definite, therefore error function has a unique minimum
4. Batch algorithm based on Newton-Raphson

Iterative Reweighted Least Squares (IRLS)
Newton's Method
• Efficient approximation using Newton-Raphson iterative optimization
$$w^{(new)} = w^{(old)} - H^{-1} \nabla E(w)$$
• where H is the Hessian matrix whose elements are the second derivatives of E(w) with respect to the components of w
• Since we are solving for a zero of the derivative of E(w), we need the second derivative

IRLS Steps
• IRLS is applicable to both Linear Regression and Logistic Regression
• We discuss Logistic Regression, for which we need
1. Error function E(w)
• Logistic Regression: Bernoulli likelihood function
2. Gradient $\nabla E(w)$
3. Hessian $H = \nabla \nabla E(w)$
4. Newton-Raphson update
$$w^{(new)} = w^{(old)} - H^{-1} \nabla E(w)$$

IRLS for Logistic Regression
• Posterior probability of class Y=1 is
$$p(Y = 1 \mid X) = y(X) = \sigma(w^T X)$$
• Likelihood function for data set $\{X_n, t_n\}$, $t_n \in \{0, 1\}$
$$p(\mathbf{t} \mid w) = \prod_{n=1}^{N} y_n^{t_n} (1 - y_n)^{1 - t_n}$$
1. Error function: the negative log-likelihood yields the cross-entropy
$$E(w) = -\sum_{n=1}^{N} \{ t_n \ln y_n + (1 - t_n) \ln(1 - y_n) \}$$
2. Gradient of error function:
$$\nabla E(w) = \sum_{n=1}^{N} (y_n - t_n) X_n = X^T (y - t)$$
3. Hessian:
$$H = \nabla \nabla E(w) = \sum_{n=1}^{N} y_n (1 - y_n) X_n X_n^T = X^T R X$$
– R is an N × N diagonal matrix with elements $R_{nn} = y_n (1 - y_n) = \sigma(w^T X_n)\,(1 - \sigma(w^T X_n))$
– Hessian is not constant and depends on w through R
– Since H is positive-definite (i.e., for arbitrary $u \neq 0$, $u^T H u > 0$), the error function is a convex function of w and so has a unique minimum
4. Newton-Raphson update:
$$w^{(new)} = w^{(old)} - H^{-1} \nabla E(w)$$
– Substituting $H = X^T R X$ and $\nabla E(w) = X^T (y - t)$:
$$w^{(new)} = w^{(old)} - (X^T R X)^{-1} X^T (y - t) = (X^T R X)^{-1} \{ X^T R X w^{(old)} - X^T (y - t) \} = (X^T R X)^{-1} X^T R z$$
– where z is an N-dimensional vector, $z = X w^{(old)} - R^{-1} (y - t)$
– Update formula is a set of (weighted least-squares) normal equations
– Since the Hessian depends on w, apply them iteratively, each time using the new weight vector
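The four IRLS steps for two-class logistic regression can be sketched in plain Python. This is a minimal illustration under stated assumptions, not a tuned implementation: the data set is made up and not linearly separable, each input carries a leading 1 as a bias term, the start point is w = 0, and the helper `solve` is a small Gaussian elimination written for this example. The update solves H s = ∇E(w) for the step s instead of forming the inverse of H.

```python
from math import exp

def sigmoid(a):
    return 1.0 / (1.0 + exp(-a))

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting (helper for this sketch)."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def gradient(w, xs, ts):
    # grad E(w) = sum_n (y_n - t_n) X_n
    g = [0.0] * len(w)
    for x, t in zip(xs, ts):
        err = sigmoid(sum(wi * xi for wi, xi in zip(w, x))) - t
        for j, xj in enumerate(x):
            g[j] += err * xj
    return g

def hessian(w, xs):
    # H = sum_n y_n (1 - y_n) X_n X_n^T  (= X^T R X)
    m = len(w)
    h = [[0.0] * m for _ in range(m)]
    for x in xs:
        y = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
        r = y * (1.0 - y)
        for j in range(m):
            for k in range(m):
                h[j][k] += r * x[j] * x[k]
    return h

def irls(xs, ts, iterations=15):
    w = [0.0] * len(xs[0])
    for _ in range(iterations):
        # Newton-Raphson: w_new = w_old - H^{-1} grad E(w_old)
        step = solve(hessian(w, xs), gradient(w, xs, ts))
        w = [wi - si for wi, si in zip(w, step)]
    return w

# Illustrative, non-separable data: (1, x) inputs with binary targets.
xs = [(1.0, 0.5), (1.0, 1.0), (1.0, 1.5), (1.0, 2.0), (1.0, 2.5), (1.0, 3.0)]
ts = [0, 0, 1, 0, 1, 1]
w = irls(xs, ts)
print("w =", w, "gradient at w =", gradient(w, xs, ts))
```

R is recomputed from the current w on every pass, which is the "reweighting" in the name. On linearly separable data the weights would grow without bound, so the sketch deliberately uses overlapping classes.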