Partially Directed Graphs and Conditional Random Fields

Machine Learning
Srihari
Sargur Srihari
srihari@cedar.buffalo.edu
Topics
•  Conditional Random Fields
•  Gibbs distribution and CRF
•  Directed and Undirected Independencies
–  View as combination of BN and MN
•  CRF for Image Segmentation
•  CRF for Text Analytics
•  Naïve Bayes and Naïve Markov
–  Learning the models
Conditional Distribution Representation
•  Nodes correspond to Y ∪ X
–  Y are target variables and X are observed variables
•  Parameterized as ordinary Markov Network
–  Set of factors Φ1(D1),..Φm(Dm)
•  Can be encoded as a log-linear model
•  Viewed as encoding a set of factors
•  Model represents P(Y|X) rather than P(Y,X)
–  To naturally represent a conditional distribution
•  Avoid representing a probabilistic model over X
–  Disallow potentials involving only variables in X
Conditional Random Fields
•  MN encodes a joint distribution over X
•  An MN can also be used to represent a
conditional distribution P(Y|X)
–  Y is a set of target variables
–  X is a set of observed variables
•  Representation is called a CRF
•  Has an analog in directed graphical models
–  Conditional Bayesian Networks
CRF Definition
•  An undirected graph H with nodes X ∪ Y
–  Network is annotated with a set of factors φ1(D1),..,φm(Dm) such that Di ⊄ X
–  Network encodes a conditional distribution as
$$P(Y \mid X) = \frac{1}{Z(X)} \tilde{P}(Y, X)$$
$$\tilde{P}(Y, X) = \prod_{i=1}^{m} \phi_i(D_i)$$
$$Z(X) = \sum_{Y} \tilde{P}(Y, X)$$
where $\tilde{P}(Y, X)$ is the unnormalized joint distribution, a product of factors, and Z(X) is the unnormalized marginal of X
Partition function is now a function of X
–  Two variables in H are connected by an edge whenever they appear together in the scope of a factor
Deriving the CRF definition
•  Conditional distribution from Bayes' rule:
$$P(Y \mid X) = \frac{P(Y, X)}{P(X)}$$
•  Definition of Markov network (Gibbs distribution):
$$P_\Phi(X_1,..,X_n) = \frac{1}{Z} \tilde{P}(X_1,..,X_n)$$
$$\text{where } \tilde{P}(X_1,..,X_n) = \prod_{i=1}^{m} \phi_i(D_i), \quad Z = \sum_{X_1,..,X_n} \tilde{P}(X_1,..,X_n)$$
•  Numerator of conditional distribution is:
$$P(Y, X) = \frac{1}{Z(Y, X)} \tilde{P}(Y, X)$$
$$\text{where } \tilde{P}(Y, X) = \prod_{i=1}^{m} \phi_i(D_i) \text{ and } Z(Y, X) = \sum_{Y, X} \tilde{P}(Y, X)$$
•  Denominator of conditional distribution (from sum rule):
$$P(X) = \sum_{Y} P(Y, X) = \frac{1}{Z(Y, X)} \sum_{Y} \tilde{P}(Y, X)$$
•  Combining Bayes and Gibbs gives CRF (the constant Z(Y, X) cancels):
$$P(Y \mid X) = \frac{\tilde{P}(Y, X)}{\sum_{Y} \tilde{P}(Y, X)} = \frac{1}{Z(X)} \tilde{P}(Y, X)$$
$$\text{where } Z(X) = \sum_{Y} \tilde{P}(Y, X)$$
Difference between CRF & Gibbs
•  Different normalization in partition function Z(X)
–  A Gibbs distribution
$$P_\Phi(X_1,..,X_n) = \frac{1}{Z} \tilde{P}(X_1,..,X_n)$$
$$\text{where } \tilde{P}(X_1,..,X_n) = \prod_{i=1}^{m} \phi_i(D_i), \quad Z = \sum_{X_1,..,X_n} \tilde{P}(X_1,..,X_n)$$
•  factorizes into a set of factors and partition function Z
–  CRF
$$P(Y \mid X) = \frac{1}{Z(X)} \tilde{P}(Y, X)$$
$$\text{where } \tilde{P}(Y, X) = \prod_{i=1}^{m} \phi_i(D_i), \quad Z(X) = \sum_{Y} \tilde{P}(Y, X)$$
•  Induces a different value of Z for every assignment x to X
•  Summation only over Y
–  Difference denoted by feature variables greyed-out
•  Known X (shown dark grey)
•  Y has a distribution dependent on X
[Figure (a): chain over targets Y1,..,Y5 with observed variables X1,..,X5 shown greyed out]
Example of CRF
•  CRF over Y={Y1,..Yk} and X={X1,..Xk}
•  Edges are Yi - Yi+1 and Yi - Xi
$$P(Y \mid X) = \frac{1}{Z(X)} \tilde{P}(Y, X)$$
$$\tilde{P}(Y, X) = \prod_{i=1}^{k-1} \phi_i(Y_i, Y_{i+1}) \prod_{i=1}^{k} \phi_i(Y_i, X_i)$$
$$Z(X) = \sum_{Y} \tilde{P}(Y, X)$$
•  Observed feature variables X: assumed known when model is used (hence greyed-out)
[Figure (a): linear chain-structured CRF for sequence labeling, with observed X1,..,X5 (greyed out) and targets Y1,..,Y5]
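The linear-chain definition above can be checked by brute force on a tiny example. The following is a minimal sketch, assuming binary labels and observations and a single shared pair of hypothetical factor tables (`trans`, `emit`) that are not from the slides; it enumerates every labeling, so it is only practical for very short chains.

```python
import itertools

def unnormalized(y, x, trans, emit):
    """P~(Y, X): product of the chain factors and the label-observation factors."""
    score = 1.0
    for i in range(len(y) - 1):
        score *= trans[y[i]][y[i + 1]]      # phi(Y_i, Y_{i+1})
    for yi, xi in zip(y, x):
        score *= emit[yi][xi]               # phi(Y_i, X_i)
    return score

def crf_conditional(x, trans, emit, labels=(0, 1)):
    """Return (Z(x), {y: P(y | x)}) by enumerating every labeling y."""
    scores = {y: unnormalized(y, x, trans, emit)
              for y in itertools.product(labels, repeat=len(x))}
    z = sum(scores.values())                # summation only over Y
    return z, {y: s / z for y, s in scores.items()}

# Hypothetical factor tables (assumptions for illustration only).
trans = [[2.0, 1.0], [1.0, 2.0]]   # favors equal neighboring labels
emit = [[3.0, 1.0], [1.0, 3.0]]    # favors Y_i == X_i

z_a, p_a = crf_conditional((0, 0, 1), trans, emit)
z_b, p_b = crf_conditional((1, 1, 1), trans, emit)
```

Each observed sequence x induces its own partition function, which is the point of the Z(X) notation: here `z_a` and `z_b` differ even though the factors are the same.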
Main Strength of CRF
•  Avoids encoding a distribution over the variables in X
•  Allows incorporating into model
–  A rich set of observed variables
•  Whose dependencies are complex or poorly understood
•  Allows including continuous variables
–  Distributions may not have simple parametric forms
•  Can incorporate domain knowledge
–  Rich features without modeling joint distribution
CRF Image segmentation
[Figure: (a) original image; (b) each superpixel is a random variable; (c) classification using node potentials alone; (d) segmentation using pairwise Markov network encoding. Segment labels shown: building, car, road, cow, grass]
•  Each image defines a probability distribution over the variables representing super-pixel labels
•  Rather than define a joint distribution over pixel values, we define a conditional distribution over segment labels given the pixel values
–  Avoids making a parametric assumption over (continuous) pixel values
–  Can define image processing routines to define rich features, e.g., presence or direction of an image gradient at a pixel
»  Such features usually rely on multiple pixels
»  So defining a correct joint distribution or independence properties over the features is non-trivial
Directed and Undirected
Dependencies
•  A CRF defines a conditional distribution of
Y on X
•  Thus it can be viewed as a partially
directed graph
•  Where we have an undirected component
over Y
•  Which has variables in X as parents
CRFs for Text Analysis
•  Important use for CRF framework
•  Part-of-speech labeling
•  Named Entity Tagging
–  People, places, organizations, etc
•  Extracting structured information from text
–  From a reference list
•  Publications, titles, authors, journals, year
•  Models share a similar structure
Named Entity (NE) Tagging
•  Entities often span multiple words
•  Type of entity may not be apparent from
individual words
•  New York is location, New York Times is organization
•  For each word Xi introduce target variable Yi
which is its entity type
–  Outcomes for Yi are (in BIO notation)
•  B-PERSON, I-PERSON, B-LOCATION, I-LOCATION, B-ORGANIZATION,
I-ORGANIZATION, OTHER
•  B: beginning, I: inside entity
•  B allows segmenting adjacent entities of same type
CRF for NE Tagging
[Figure (a): linear-chain CRF for named-entity tagging
X: Mrs. Green spoke today in New York Green chairs the finance committee
Y: B-PER I-PER OTH OTH OTH B-LOC I-LOC B-PER OTH OTH OTH OTH
KEY: B-PER Begin person name, I-PER Within person name, B-LOC Begin location name, I-LOC Within location name, OTH Not an entity]
•  Set of known variables (are words): X
•  Two factors for each word: $\phi_t^1(Y_t, Y_{t+1})$ and $\phi_t^2(Y_t, X_1,..,X_T)$
–  Factor to represent dependency between neighboring target variables $\phi_t^1(Y_t, Y_{t+1})$
–  Factor to represent dependency between target Yt and its context in word sequence $\phi_t^2(Y_t, X_1,..,X_T)$
•  Can depend on arbitrary features of entire input word sequence X1,..,XT (three here)
Linear Chain CRF for NE
•  Factor to represent dependency between target Yt and its context in word sequence $\phi_t^2(Y_t, X_1,..,X_T)$
–  Can depend on arbitrary features of entire input word sequence X1,..XT
–  Not encoded using table factors but using log-linear models
•  Factors derived from feature functions such as
ft(Yt,Xt)=I{Yt=B-ORGANIZATION,Xt=“Times”}
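A feature function such as the one above turns into a factor through the log-linear form, where the factor is the exponential of a weighted sum of features. The sketch below is illustrative only: the second feature and both weights are hypothetical, and in practice the weights would be learned.

```python
import math

def f_times_org(y_t, x_t):
    """Indicator feature I{Y_t = B-ORGANIZATION, X_t = "Times"}."""
    return 1.0 if (y_t == "B-ORGANIZATION" and x_t == "Times") else 0.0

def f_capitalized_other(y_t, x_t):
    """Hypothetical second feature: a capitalized word labelled OTHER."""
    return 1.0 if (y_t == "OTHER" and x_t[:1].isupper()) else 0.0

# Hypothetical (feature, weight) pairs; the weights are assumptions.
FEATURES = [(f_times_org, 1.5), (f_capitalized_other, -0.7)]

def factor(y_t, x_t):
    """Log-linear factor: phi(y_t, x_t) = exp(sum_j w_j * f_j(y_t, x_t))."""
    return math.exp(sum(w * f(y_t, x_t) for f, w in FEATURES))
```

Because most indicator features are zero for most words, the sum inside the exponential is sparse, which is what keeps models with thousands of features manageable.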
Features for NE Tagging
•  For word Xi
–  Capitalized, in list of common person names,
–  In atlas of location names, ends with “ton”,
–  Exactly “York”, following “Times”
•  For word sequence
–  More than two sports-related terms, suggesting that “New York” is a sports organization
•  Hundreds or thousands of features
•  Sparse (zero for most words)
•  Same feature variable can be connected to
multiple target variables
–  Yi dependent on identity of several words in window
Performance of CRF
•  Linear Chain CRFs provide high per-token
accuracies
–  High 90% range on many natural data sets
•  High per field Precision and Recall
–  Where entire phrase categories and boundaries
must be correct
•  80-95% depending on data set
Including additional information in NE
•  Linear chain graphical model is augmented
•  When a word occurs multiple times in a
document it usually has the same label
•  Include factors that connect identical words
•  Results in skip-chain CRF shown next
Skip Chain CRF for NE Recognition
[Figure (a): skip-chain CRF for named-entity recognition
X: Mrs. Green spoke today in New York Green chairs the finance committee
Y: B-PER I-PER OTH OTH OTH B-LOC I-LOC B-PER OTH OTH OTH OTH
KEY: B-PER Begin person name, I-PER Within person name, B-LOC Begin location name, I-LOC Within location name, OTH Not an entity]
First occurrence of “Green” has neighboring words that provide strong evidence that it is a Person.
Second occurrence is more ambiguous.
Augmenting with a long-range factor allows it to be predicted correctly.
Graphical structure over Y can easily depend on the Xs
Joint inference: Part-of-Speech Labeling/Noun-phrase Segmentation
Pair of coupled linear chain CRFs
A noun phrase is composed of several words, and its segmentation depends on the POS and the word
[Figure (b): coupled linear-chain CRFs for joint POS labeling and noun-phrase segmentation
X: British Airways rose after announcing its withdrawal from the UAL deal
POS: ADJ N V IN V PRP N IN DT N N
NP: B I O O O B I O B I I
KEY: B Begin noun phrase, I Within noun phrase, O Not a noun phrase, N Noun, ADJ Adjective, V Verb, IN Preposition, PRP Possessive pronoun, DT Determiner (e.g., a, an, the)]
Partially Directed Models
•  Probabilistic Graphical models are useful for
complex systems:
–  Directed (Bayesian networks)
–  Undirected (Markov networks)
•  Can unify both representations
–  Incorporate both directed/undirected dependencies
–  CRFs can be viewed as partially directed graphs
•  CRFs can be generalized to chain graphs
–  Which have subgraphs with chains
–  Network in which undirected components depend
upon each other in a directed fashion
CRF as Partially Directed Graph
•  CRF defines a conditional distribution of Y on X
[Figure (a): linear chain CRF over X1,..,X5 and Y1,..,Y5]
•  Can be viewed as one with undirected component over Y which has X as parents
[Figure (b): equivalent partially directed variant (CRF)]
•  Factors are defined over Ys only, each of which has an X value
•  Equivalent Models
CRF vs HMM
•  Logistic CPDs (logistic regression) is
conditional analog of Naïve Bayes
•  CRF is conditional analog of HMM
Models for Sequence Labeling
HMM: generative, conditioning on unknown
$$P(X, Y) = \prod_{i=1}^{k} P(X_i \mid Y_i) P(Y_i \mid Y_{i-1})$$
with P(Y1 | Y0) taken as P(Y1)
Since the Yi are unknown, the joint distribution has to be estimated from data
Determining P(Y|X) depends on first determining P(X,Y)
CRF: discriminative, conditioning on known
$$P(Y \mid X) = \frac{1}{Z(X)} \tilde{P}(Y, X)$$
$$\tilde{P}(Y, X) = \prod_{i=1}^{k-1} \phi_i(Y_i, Y_{i+1}) \prod_{i=1}^{k} \phi_i(Y_i, X_i)$$
$$Z(X) = \sum_{Y} \tilde{P}(Y, X)$$
P(Y|X) is obtained directly
MEMM
$$P(Y \mid X) = \prod_{i=1}^{k} P(Y_i \mid X_i) P(Y_i \mid Y_{i-1})$$
[Figures: graph structure over X1,..,X5 and Y1,..,Y5 for each model]
CRF (Partially Directed) and MEMM
•  Linear chain structured CRF
Y = {Y1,..Yk}, X = {X1,..Xk}
$$P(Y \mid X) = \frac{1}{Z(X)} \tilde{P}(Y, X)$$
$$\tilde{P}(Y, X) = \prod_{i=1}^{k-1} \phi_i(Y_i, Y_{i+1}) \prod_{i=1}^{k} \phi_i(Y_i, X_i)$$
$$Z(X) = \sum_{Y} \tilde{P}(Y, X)$$
[Figure (a): linear chain CRF; (b): equivalent partially directed variant (CRF); (c): fully directed version]
•  Fully-directed version (a Bayesian network), called Max Entropy Markov Model (MEMM), is also a conditional model but is non-equivalent
$$P(Y \mid X) = \prod_{i=1}^{k} P(Y_i \mid X_i) P(Y_i \mid Y_{i-1})$$
•  Y1 is independent of X2 if Y2 is unknown. If Y2 is known we have a dependency due to the v-structure
•  A sound conditional BN requires edges from all variables in X to each Yi
•  In CRF the probability of Y depends on the values of all variables X={X1,..Xk}
•  MEMM is more efficient, with fewer parameters to be learned.
Models for Sequence Labeling
Sequence of observations X={X1,..Xk}. Need a joint label Y={Y1,..Yk}.
Both CRF and MEMM are discriminative models that directly obtain the conditional probability P(Y|X)
HMM is a generative model that needs the joint probability P(X,Y)
CRF
$$P(Y \mid X) = \frac{1}{Z(X)} \tilde{P}(Y, X)$$
$$\tilde{P}(Y, X) = \prod_{i=1}^{k-1} \phi_i(Y_i, Y_{i+1}) \prod_{i=1}^{k} \phi_i(Y_i, X_i)$$
$$Z(X) = \sum_{Y} \tilde{P}(Y, X)$$
MEMM
$$P(Y \mid X) = \prod_{i=1}^{k} P(Y_i \mid X_i) P(Y_i \mid Y_{i-1})$$
Y1 is independent of X2 if we are not given Y2
More generally, $Y_i \perp X_j \mid X_{-j}$
Later observation has no effect on posterior probability of current state.
In activity recognition in a video sequence, frames are labelled as running/walking.
Earlier frames may be blurry but later ones clearer.
Models have trade-offs in expressive power and learnability
MEMM and HMM are more easily learned
As purely directed models their parameters can be computed in closed form using maximum likelihood
CRF requires an iterative gradient-based approach which is more expensive
HMM
Needs joint distribution
$$P(X, Y) = \prod_{i=1}^{k} P(X_i \mid Y_i) P(Y_i \mid Y_{i-1})$$
$$P(Y \mid X) = \frac{P(X, Y)}{P(X)}$$
[Figures: graph structure over X1,..,X5 and Y1,..,Y5 for each model]
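The generative route for the HMM, first the joint P(X,Y) and then P(Y|X) = P(X,Y)/P(X), can be sketched by enumeration. This is a minimal illustration under stated assumptions: two hidden states, binary observations, P(Y1 | Y0) taken as an initial distribution `init`, and made-up probability tables. A practical implementation would use the forward-backward recursions rather than enumeration.

```python
import itertools

def hmm_joint(x, y, init, trans, emit):
    """P(X, Y) = prod_i P(X_i | Y_i) P(Y_i | Y_{i-1}), with P(Y_1 | Y_0) = init."""
    p = init[y[0]] * emit[y[0]][x[0]]
    for i in range(1, len(x)):
        p *= trans[y[i - 1]][y[i]] * emit[y[i]][x[i]]
    return p

def hmm_posterior(x, init, trans, emit):
    """P(Y | X) = P(X, Y) / P(X), where P(X) sums the joint over all Y."""
    states = range(len(init))
    joint = {y: hmm_joint(x, y, init, trans, emit)
             for y in itertools.product(states, repeat=len(x))}
    p_x = sum(joint.values())
    return {y: p / p_x for y, p in joint.items()}

# Hypothetical parameters (assumptions for illustration only).
init = [0.6, 0.4]
trans = [[0.7, 0.3], [0.2, 0.8]]   # trans[a][b] = P(Y_i = b | Y_{i-1} = a)
emit = [[0.9, 0.1], [0.3, 0.7]]    # emit[a][o] = P(X_i = o | Y_i = a)

posterior = hmm_posterior((0, 1, 1), init, trans, emit)
```

Unlike the CRF, the joint here is already normalized over both X and Y, so the only normalization needed for the posterior is the division by P(X).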
CRF Example: Naïve Markov model
•  Binary-valued variables X={X1,..Xk} and Y={Y}
–  Variables independent of each other and only dependent on class Y
•  Pairwise potential between Y and each Xi
•  Φi(Xi ,Y)=exp{wi I{Xi=1,Y=1}}
–  I is the indicator function, which takes value 1 when its argument is true and 0 otherwise
•  Single node potential
•  Φ0(Y)=exp{w0 I{Y=1}}
–  From CRF definition, the unnormalized measure is
$$\tilde{P}(Y = 1 \mid x_1,..,x_k) = \exp\left\{ w_0 + \sum_{i=1}^{k} w_i x_i \right\}$$
$$\tilde{P}(Y = 0 \mid x_1,..,x_k) = \exp\{0\} = 1$$
which after normalizing over Y is equivalent to
$$P(Y = 1 \mid x_1,..,x_k) = \text{sigmoid}\left\{ w_0 + \sum_{i=1}^{k} w_i x_i \right\} \text{ where } \text{sigmoid}(z) = \frac{e^z}{1 + e^z}$$
Logistic CPD (regression): not defined by a table but induced by parameters.
Efficient: linear (not exponential as in full BN) in the number of parents
[Figure: Y connected to each of X1, X2,.., Xk]
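The equivalence between the Naïve Markov potentials and the sigmoid can be confirmed numerically. The sketch below builds the potentials exactly as defined above, normalizes over Y in {0, 1}, and compares against the sigmoid form; the weights and the input vector are hypothetical values chosen for illustration.

```python
import math

def naive_markov_posterior(x, w0, w):
    """P(Y=1 | x) from the potentials, normalizing over Y in {0, 1}."""
    def unnormalized(y):
        score = math.exp(w0 * (y == 1))                    # Phi_0(Y)
        for wi, xi in zip(w, x):
            score *= math.exp(wi * (xi == 1 and y == 1))   # Phi_i(X_i, Y)
        return score
    return unnormalized(1) / (unnormalized(0) + unnormalized(1))

def sigmoid(z):
    return math.exp(z) / (1.0 + math.exp(z))

w0, w = -0.5, [1.2, -0.8, 0.3]   # hypothetical weights
x = [1, 0, 1]                    # hypothetical binary input
p_crf = naive_markov_posterior(x, w0, w)
p_sig = sigmoid(w0 + sum(wi * xi for wi, xi in zip(w, x)))
```

The two values agree up to floating-point rounding, since the unnormalized measure for Y=0 is always 1.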
Naïve Markov and Naïve Bayes
•  Binary variables X={X1,..Xk} and Y={Y}
1.  Logistic regression is conditional analog of Naïve Bayes Classifier
$$P(Y = 1 \mid x_1,..,x_k) = \text{sigmoid}\left\{ w_0 + \sum_{i=1}^{k} w_i x_i \right\}$$
Discriminative Model (k parameters)
2.  Naïve Bayes
$$P(Y = 1 \mid X_1,..,X_k) = \frac{P(Y = 1, X_1,..,X_k)}{\sum_{Y} P(Y, X_1,..,X_k)}$$
$$P(Y, X_1,..,X_k) = P(Y) \prod_{i=1}^{k} P(X_i \mid Y)$$
Generative Model (k parameters):
We have to first obtain k CPDs conditioned on the unknown Y,
from which we can get the distribution conditioned on the known X
[Figures: Y connected to each of X1, X2,.., Xk, for each model]
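For comparison with the discriminative form, the Naïve Bayes posterior can be sketched as follows: build the joint from the prior and the k CPDs, then normalize over Y. The function name, argument layout, and numbers are assumptions made for this illustration.

```python
def naive_bayes_posterior(x, prior1, p_x1_given_y):
    """P(Y=1 | x) = P(Y=1, x) / sum_y P(y, x), with P(y, x) = P(y) prod_i P(x_i | y).

    p_x1_given_y[y][i] holds P(X_i = 1 | Y = y) for binary X_i.
    """
    def joint(y):
        p = prior1 if y == 1 else 1.0 - prior1
        for i, xi in enumerate(x):
            p1 = p_x1_given_y[y][i]
            p *= p1 if xi == 1 else 1.0 - p1
        return p
    return joint(1) / (joint(0) + joint(1))

# Hypothetical CPDs for k = 2 binary features.
cpds = {0: [0.2, 0.5], 1: [0.8, 0.5]}
p = naive_bayes_posterior([1, 0], 0.5, cpds)
```

The second feature has the same CPD under both classes, so it carries no evidence and the posterior is driven by the first feature alone.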
Logistic Regression Revisited
•  Input X, target classes Y=0 and Y=1
•  A posteriori probability of Y=1 is
$$P(Y = 1 \mid X) = y(X) = \sigma(w^T X)$$
where
X is an M-dimensional feature vector
σ(.) is the logistic sigmoid function
•  Goal: determine the M parameters
•  Known as logistic regression in statistics
–  Although a model for classification rather than for regression
Logistic Sigmoid
[Figure: plot of σ(a) against a]
Properties:
A. Symmetry
$$\sigma(-a) = 1 - \sigma(a)$$
B. Inverse
$$a = \ln\left( \frac{\sigma}{1 - \sigma} \right)$$
known as logit. Also known as log odds since it is the log of the ratio
$$\ln\left[ \frac{p(Y = 1 \mid x)}{p(Y = 0 \mid x)} \right]$$
C. Derivative
$$\frac{d\sigma}{da} = \sigma(1 - \sigma)$$
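The three sigmoid properties can be verified numerically. This is a small sketch; the helper names `logit` and `numeric_derivative` are chosen here for illustration, and the derivative is approximated by a central difference.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def logit(s):
    """Inverse of the sigmoid: a = ln(s / (1 - s))."""
    return math.log(s / (1.0 - s))

def numeric_derivative(f, a, h=1e-6):
    """Central-difference approximation of df/da."""
    return (f(a + h) - f(a - h)) / (2.0 * h)
```

For any a, `sigmoid(-a)` should match `1 - sigmoid(a)`, `logit(sigmoid(a))` should return a, and `numeric_derivative(sigmoid, a)` should match `sigmoid(a) * (1 - sigmoid(a))` up to the approximation error.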
Determining Logistic Regression parameters
•  Maximum Likelihood Approach for Two classes
Data set consists of (input,target) pairs: (Xn , tn)
where tn ∈ {0,1}, n =1,..,N
Since t is binary we can use the Bernoulli distribution for it
$$p(t \mid w) = y^{t} (1 - y)^{1-t}, \text{ where } y = \sigma(w^T X)$$
•  Likelihood function associated with N observations
$$p(\mathbf{t} \mid w) = \prod_{n=1}^{N} y_n^{t_n} \{1 - y_n\}^{1 - t_n}$$
where t = (t1,..,tN)^T and yn = p(Y=1|Xn)
Error Fn for Logistic Regression
Likelihood function
$$p(\mathbf{t} \mid w) = \prod_{n=1}^{N} y_n^{t_n} \{1 - y_n\}^{1 - t_n}$$
Error function is the negative of the log-likelihood
$$E(w) = -\ln p(\mathbf{t} \mid w) = -\sum_{n=1}^{N} \{ t_n \ln y_n + (1 - t_n) \ln(1 - y_n) \}$$
Known as Cross-entropy error function
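The cross-entropy error translates directly into code. A minimal sketch, assuming each input is a plain list of M features (with any bias term included as a constant feature) and targets are 0 or 1:

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def cross_entropy(w, X, t):
    """E(w) = -sum_n [t_n ln y_n + (1 - t_n) ln(1 - y_n)], y_n = sigmoid(w^T x_n)."""
    e = 0.0
    for x_n, t_n in zip(X, t):
        y_n = sigmoid(sum(wi * xi for wi, xi in zip(w, x_n)))
        e -= t_n * math.log(y_n) + (1 - t_n) * math.log(1.0 - y_n)
    return e
```

With all weights at zero every prediction is 0.5, so the error is N ln 2 regardless of the targets, which is a convenient sanity check.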
Gradient of Error Function
Error function
$$E(w) = -\ln p(\mathbf{t} \mid w) = -\sum_{n=1}^{N} \{ t_n \ln y_n + (1 - t_n) \ln(1 - y_n) \}$$
where yn = σ(w^T Xn)
Using derivative of logistic sigmoid, gradient of the error function is
$$\nabla E(w) = \sum_{n=1}^{N} (y_n - t_n) X_n$$
Error x Feature Vector
Contribution to gradient by data point n is error between target tn
and prediction yn = σ(w^T Xn) times input Xn
Analytical Derivative:
Let z = z1 + z2 be the log-likelihood of one data point, so that its error contribution is −z,
$$\text{where } z_1 = t \ln \sigma(w^T X) \text{ and } z_2 = (1 - t) \ln[1 - \sigma(w^T X)]$$
Using
$$\frac{d\sigma}{da} = \sigma(1 - \sigma) \text{ and } \frac{d}{dx}(a \ln x) = \frac{a}{x}$$
$$\frac{dz_1}{dw} = \frac{t \, \sigma(w^T X)[1 - \sigma(w^T X)] X}{\sigma(w^T X)}$$
$$\frac{dz_2}{dw} = \frac{(1 - t) \, \sigma(w^T X)[1 - \sigma(w^T X)](-X)}{[1 - \sigma(w^T X)]}$$
Therefore
$$\frac{dz}{dw} = (t - \sigma(w^T X)) X, \text{ so } -\frac{dz}{dw} = (\sigma(w^T X) - t) X = (y - t) X$$
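One way to gain confidence in the gradient formula is to compare it with a finite-difference estimate of the error function. The sketch below does that on a tiny hypothetical data set; the helper names and the step size `h` are assumptions for this illustration.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def error(w, X, t):
    """Cross-entropy error E(w)."""
    return -sum(tn * math.log(sigmoid(dot(w, x))) +
                (1 - tn) * math.log(1.0 - sigmoid(dot(w, x)))
                for x, tn in zip(X, t))

def gradient(w, X, t):
    """grad E(w) = sum_n (y_n - t_n) x_n."""
    g = [0.0] * len(w)
    for x, tn in zip(X, t):
        err = sigmoid(dot(w, x)) - tn       # error y_n - t_n
        for d, xd in enumerate(x):
            g[d] += err * xd                # error times feature vector
    return g

def numeric_gradient(w, X, t, h=1e-6):
    """Central-difference estimate of grad E(w), one coordinate at a time."""
    g = []
    for d in range(len(w)):
        wp = list(w)
        wp[d] += h
        wm = list(w)
        wm[d] -= h
        g.append((error(wp, X, t) - error(wm, X, t)) / (2.0 * h))
    return g
```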
Simple Sequential Algorithm
•  No closed-form maximum likelihood solution for determining w
•  Given gradient of error function
$$\nabla E(w) = \sum_{n=1}^{N} (y_n - t_n) X_n$$
•  Solve using an iterative approach
$$w^{(\tau+1)} = w^{(\tau)} - \eta \nabla E_n$$
•  where
$$\nabla E_n = (y_n - t_n) X_n$$
Error x Feature Vector
Solution has severe over-fitting problems for linearly separable data
So use IRLS algorithm
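The sequential update can be sketched as a plain loop over data points. The learning rate `eta`, the number of passes `epochs`, and the small data set (deliberately not linearly separable, with a constant first feature acting as a bias) are all assumptions for this illustration.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def error(w, X, t):
    """Cross-entropy error E(w)."""
    return -sum(tn * math.log(sigmoid(dot(w, x))) +
                (1 - tn) * math.log(1.0 - sigmoid(dot(w, x)))
                for x, tn in zip(X, t))

def sequential_fit(X, t, eta=0.1, epochs=200):
    """Apply w <- w - eta * (y_n - t_n) x_n for one data point at a time."""
    w = [0.0] * len(X[0])
    for _ in range(epochs):
        for x, tn in zip(X, t):
            err = sigmoid(dot(w, x)) - tn
            w = [wd - eta * err * xd for wd, xd in zip(w, x)]
    return w

# Hypothetical, non-separable data: feature 0 is a constant bias input.
X = [[1, -2.0], [1, -1.0], [1, 0.5], [1, 1.0], [1, 2.0], [1, -0.5]]
t = [0, 0, 0, 1, 1, 1]
w = sequential_fit(X, t)
```

On linearly separable data the same loop would keep increasing the magnitude of w, which is the over-fitting problem the slide refers to.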
Multi-class Logistic Regression
•  Work with soft-max function instead of logistic sigmoid
$$p(Y = k \mid X) = y_k(X) = \frac{\exp(a_k)}{\sum_j \exp(a_j)}$$
where $a_k = w_k^T X$
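A small sketch of the soft-max function follows. Subtracting the maximum activation before exponentiating is a common numerical-stability trick and is an addition here, not something stated on the slide; it leaves the result unchanged because the common factor cancels in the ratio.

```python
import math

def softmax(a):
    """y_k = exp(a_k) / sum_j exp(a_j), computed stably."""
    m = max(a)                               # shift to avoid overflow
    e = [math.exp(v - m) for v in a]
    s = sum(e)
    return [v / s for v in e]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))
```

For two classes the soft-max reduces to the logistic sigmoid of the difference of activations, so `softmax([a1, a2])[0]` matches `sigmoid(a1 - a2)`.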
Multi-class Likelihood Function
•  1-of-K Coding scheme
–  For feature vector Xn, target vector tn belonging to class Y=k is a binary vector with all elements zero except for element k
$$p(T \mid w_1,..,w_K) = \prod_{n=1}^{N} \prod_{k=1}^{K} p(Y = k \mid X_n)^{t_{nk}} = \prod_{n=1}^{N} \prod_{k=1}^{K} y_{nk}^{t_{nk}}$$
–  where ynk = yk(Xn)
–  T is an N x K matrix with elements tnk
Multi-class Error Function
1.  Error Function: negative log-likelihood
$$E(w_1,..,w_K) = -\ln p(T \mid w_1,..,w_K) = -\sum_{n=1}^{N} \sum_{k=1}^{K} t_{nk} \ln y_{nk}$$
–  Known as cross-entropy error function for multi-class
2.  Gradient of error function wrt one parameter vector wj
$$\nabla_{w_j} E(w_1,..,w_K) = \sum_{n=1}^{N} (y_{nj} - t_{nj}) X_n$$
Error x Feature Vector
Derivatives of Soft-max
$$y_k(X) = \frac{\exp(a_k)}{\sum_j \exp(a_j)}, \quad a_k = w_k^T X$$
$$\frac{\partial y_k}{\partial a_j} = y_k (I_{kj} - y_j)$$
where Ikj are elements of the identity matrix
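The multi-class gradient, including its sign, can be checked against a finite-difference estimate of the error. The data, weights, and helper names below are hypothetical and chosen only for illustration.

```python
import math

def softmax(a):
    m = max(a)
    e = [math.exp(v - m) for v in a]
    s = sum(e)
    return [v / s for v in e]

def predict(W, x):
    """y_k(x) for all k, with activations a_k = w_k^T x."""
    return softmax([sum(wi * xi for wi, xi in zip(w_k, x)) for w_k in W])

def error(W, X, T):
    """E = -sum_n sum_k t_nk ln y_nk."""
    return -sum(t * math.log(y)
                for x_n, t_n in zip(X, T)
                for t, y in zip(t_n, predict(W, x_n)))

def gradient(W, X, T, j):
    """grad_{w_j} E = sum_n (y_nj - t_nj) x_n."""
    g = [0.0] * len(X[0])
    for x_n, t_n in zip(X, T):
        err = predict(W, x_n)[j] - t_n[j]
        for d, xd in enumerate(x_n):
            g[d] += err * xd
    return g

# Hypothetical data: 3 points, 2 features, K = 3 classes (1-of-K targets).
X = [[1.0, 0.5], [1.0, -1.0], [1.0, 2.0]]
T = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
W = [[0.1, -0.2], [0.0, 0.3], [-0.1, 0.1]]
g1 = gradient(W, X, T, 1)
```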
IRLS Algorithm for Multi-class
3. Hessian matrix comprises blocks of size M x M
–  Block j,k is given by
$$\nabla_{w_k} \nabla_{w_j} E(w_1,..,w_K) = \sum_{n=1}^{N} y_{nk} (I_{kj} - y_{nj}) X_n X_n^T$$
–  Hessian matrix is positive-definite, therefore error function has a unique minimum
4. Batch Algorithm based on Newton-Raphson
Iterative Reweighted Least Squares (IRLS)
Newton's Method
•  Efficient approximation using Newton-Raphson iterative optimization
$$w^{(new)} = w^{(old)} - H^{-1} \nabla E(w)$$
•  where H is the Hessian matrix whose elements are the second derivatives of E(w) with respect to the components of w
Since we are solving for a zero of the derivative of E(w), we need the second derivative
IRLS Steps
•  IRLS is applicable to both Linear Regression and Logistic Regression
•  We discuss Logistic Regression, for which we need
1.  Error function E(w)
•  Logistic Regression: Bernoulli Likelihood Function
2.  Gradient ∇E(w)
3.  Hessian H = ∇∇E(w)
4.  Newton-Raphson update
$$w^{(new)} = w^{(old)} - H^{-1} \nabla E(w)$$
IRLS for Logistic Regression
•  Posterior probability of class Y=1 is
$$p(Y = 1 \mid X) = y(X) = \sigma(w^T X)$$
•  Likelihood Function for data set {Xn,tn}, tn ∈ {0,1}
$$p(\mathbf{t} \mid w) = \prod_{n=1}^{N} y_n^{t_n} \{1 - y_n\}^{1 - t_n}$$
1.  Error Function
Negative log-likelihood yields Cross-entropy
$$E(w) = -\sum_{n=1}^{N} \{ t_n \ln y_n + (1 - t_n) \ln(1 - y_n) \}$$
IRLS for Logistic Regression
2. Gradient of Error Function:
$$\nabla E(w) = \sum_{n=1}^{N} (y_n - t_n) X_n = X^T (y - t)$$
where, in the matrix form, X is the N x M matrix whose n-th row is Xn^T
3. Hessian:
$$H = \nabla\nabla E(w) = \sum_{n=1}^{N} y_n (1 - y_n) X_n X_n^T = X^T R X$$
R is an N x N diagonal matrix with elements
$$R_{nn} = y_n (1 - y_n) = \sigma(w^T X_n)(1 - \sigma(w^T X_n))$$
Hessian is not constant and depends on w through R
Since H is positive-definite (i.e., for arbitrary u ≠ 0, u^T H u > 0)
error function is a convex function of w and so has a unique minimum
IRLS for Logistic Regression
4. Newton-Raphson update:
$$w^{(new)} = w^{(old)} - H^{-1} \nabla E(w)$$
Substituting H = X^T R X and ∇E(w) = X^T (y − t)
$$w^{(new)} = w^{(old)} - (X^T R X)^{-1} X^T (y - t)$$
$$= (X^T R X)^{-1} \{ X^T R X w^{(old)} - X^T (y - t) \}$$
$$= (X^T R X)^{-1} X^T R z$$
where z is an N-dimensional vector with elements
$$z = X w^{(old)} - R^{-1} (y - t)$$
Update formula is a set of normal equations
Since Hessian depends on w, apply them iteratively, each time using the new weight vector
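The Newton-Raphson update can be sketched for the special case of two weights, where the 2 x 2 Hessian can be inverted in closed form without a linear-algebra library. The data set, the function name `irls_step`, and the fixed count of 10 iterations are assumptions for this illustration; a general implementation would solve the M x M linear system instead.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def irls_step(w, X, t):
    """One update w_new = w_old - H^{-1} grad E for exactly two weights.

    Returns the new weights and the gradient evaluated at the old weights.
    """
    y = [sigmoid(w[0] * x[0] + w[1] * x[1]) for x in X]
    # Gradient X^T (y - t)
    g = [sum((yn - tn) * x[d] for yn, tn, x in zip(y, t, X)) for d in range(2)]
    # Hessian X^T R X with R_nn = y_n (1 - y_n)
    h = [[sum(yn * (1.0 - yn) * x[a] * x[b] for yn, x in zip(y, X))
          for b in range(2)] for a in range(2)]
    det = h[0][0] * h[1][1] - h[0][1] * h[1][0]
    # Solve H d = g using the closed-form 2 x 2 inverse
    d0 = (h[1][1] * g[0] - h[0][1] * g[1]) / det
    d1 = (h[0][0] * g[1] - h[1][0] * g[0]) / det
    return [w[0] - d0, w[1] - d1], g

# Hypothetical, non-separable data: feature 0 is a constant bias input.
X = [[1, -2.0], [1, -1.0], [1, 0.5], [1, 1.0], [1, 2.0], [1, -0.5]]
t = [0, 0, 0, 1, 1, 1]

w = [0.0, 0.0]
for _ in range(10):
    w, _ = irls_step(w, X, t)   # R is recomputed from the new w each time
```

Because R depends on w, the Hessian is rebuilt at every iteration, which is the "reweighting" in the name. On this small example the gradient at the final weights is essentially zero after a handful of steps.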