
A Large-Deviation Analysis for the
Maximum-Likelihood Learning of Tree Structures
Vincent Tan †
Animashree Anandkumar †∗, Lang Tong ∗, and Alan Willsky †
† Stochastic Systems Group, LIDS,
Massachusetts Institute of Technology
∗ School of ECE,
Cornell University
ISIT (Jun 30, 2009)
Motivation

Given a set of n i.i.d. samples drawn from P, a tree distribution.

Maximum-Likelihood (ML) learning of the structure.

Large-deviation analysis of the error in learning the set of edges, i.e., the structure.

Questions:
1. When does the error probability decay exponentially with n?
2. What is the exact rate of decay of the probability of error?
3. How does the rate (the error exponent) depend on the parameters of the distribution?
Main Results

Almost every (true) tree distribution results in an exponentially decaying error probability.

Quantify the exact rate of decay for a given P.

Rate of decay ≈ SNR for learning (intuitively appealing).
Notation and Background

Assume we have a set of samples x^n = {x_1, x_2, ..., x_n}, each drawn i.i.d. from P ∈ P(X^d), where X is a finite set.

x = (x_1, ..., x_d) ∈ X^d.

Vertex set: V := {1, ..., d}. Edge set: E_P ⊂ \binom{V}{2}.

P(x) is Markov on T_P = (V, E_P), a tree; that is, P(x) factorizes according to T_P.

Example for P with d = 4 (a star rooted at node 1):
\[
P(\mathbf{x}) = P_1(x_1) \times \frac{P_{1,2}(x_1, x_2)}{P_1(x_1)} \times \frac{P_{1,3}(x_1, x_3)}{P_1(x_1)} \times \frac{P_{1,4}(x_1, x_4)}{P_1(x_1)}.
\]
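To make the factorization concrete, here is a minimal Python sketch of this d = 4 star model. The numerical tables P1 and P1j are hypothetical values chosen for illustration (they are not from the talk), and prob() is likewise a hypothetical helper name.

```python
import numpy as np
from itertools import product

# Hypothetical star tree on d = 4 binary variables, rooted at node 1:
# P(x) = P_1(x_1) * prod_{j=2..4} P_{1,j}(x_1, x_j) / P_1(x_1).
P1 = np.array([0.6, 0.4])                               # marginal P_1(x_1)
P1j = {j: np.array([[0.50, 0.10],                       # joint P_{1,j}(x_1, x_j);
                    [0.10, 0.30]]) for j in (2, 3, 4)}  # each row sums to P_1

def prob(x):
    """Evaluate the tree-factorized P(x) for x = (x_1, ..., x_4), 0-indexed values."""
    p = P1[x[0]]
    for j in (2, 3, 4):
        p *= P1j[j][x[0], x[j - 1]] / P1[x[0]]          # one factor per edge (1, j)
    return p

# Sanity check: the factorization sums to one over X^4.
print(sum(prob(x) for x in product((0, 1), repeat=4)))  # ~1.0
```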
ML Learning of Tree Distribution (Chow-Liu) I

Solve the ML problem given x^n = {x_1, x_2, ..., x_n}:
\[
P_{ML} = \operatorname*{argmax}_{Q \in \mathrm{Trees}} \log Q^n(x^n).
\]

\hat{P}(x) = \hat{P}_{x^n}(x): the empirical distribution of x^n.

\hat{P}_e: the pairwise marginal of \hat{P} on edge e = (i, j).

Reduces to a max-weight spanning tree problem (Chow-Liu 1968):
\[
E_{ML} = \operatorname*{argmax}_{E_Q :\, Q \in \mathrm{Trees}} \sum_{e \in E_Q} I(\hat{P}_e),
\qquad
I(\hat{P}_e) := \sum_{x_i, x_j} \hat{P}_{i,j}(x_i, x_j) \log \frac{\hat{P}_{i,j}(x_i, x_j)}{\hat{P}_i(x_i)\,\hat{P}_j(x_j)}.
\]

Samples ⇒ empirical MIs {I(\hat{P}_e) : e ∈ \binom{V}{2}} ⇒ max-weight spanning tree.
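A hedged Python sketch of this pipeline, assuming samples are encoded as tuples of integers in {0, ..., k-1}. The names empirical_mi and chow_liu are my own, and the spanning tree is found with a plain Kruskal pass rather than any particular library routine.

```python
import numpy as np
from itertools import combinations

def empirical_mi(samples, i, j, k):
    """Plug-in mutual information I(P̂_{i,j}) over alphabet {0, ..., k-1}."""
    n = len(samples)
    Pij = np.zeros((k, k))
    for x in samples:
        Pij[x[i], x[j]] += 1.0 / n                    # empirical pairwise joint
    Pi, Pj = Pij.sum(1), Pij.sum(0)
    mask = Pij > 0
    return float((Pij[mask] * np.log(Pij[mask] / np.outer(Pi, Pj)[mask])).sum())

def chow_liu(samples, d, k=2):
    """Max-weight spanning tree over empirical MIs (Kruskal with union-find)."""
    w = {e: empirical_mi(samples, e[0], e[1], k) for e in combinations(range(d), 2)}
    parent = list(range(d))
    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]             # path halving
            u = parent[u]
        return u
    tree = []
    for (i, j) in sorted(w, key=w.get, reverse=True): # heaviest edges first
        ri, rj = find(i), find(j)
        if ri != rj:                                  # accept iff it joins components
            parent[ri] = rj
            tree.append((i, j))
    return tree                                       # d - 1 edges: the ML structure
```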
ML Learning of Tree Distribution (Chow-Liu) II

True MIs {I(P_e)} ⇒ max-weight spanning tree E_P.

Empirical MIs {I(\hat{P}_e)} from x^n ⇒ max-weight spanning tree E_ML, which may differ from E_P.
Problem Statement

Define E_ML to be the ML edge set and the error event to be {E_ML ≠ E_P}.

Find the error exponent K_P:
\[
K_P := \lim_{n \to \infty} -\frac{1}{n} \log P\big(\{E_{ML} \neq E_P\}\big).
\]

Alternatively, P(\{E_{ML} \neq E_P\}) \doteq \exp(-n K_P).

Easier to consider crossover events first.
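As a sanity check on these definitions, here is a Monte Carlo sketch that estimates the error probability, reusing the hypothetical prob() and chow_liu() helpers from the earlier sketches; the sample sizes and trial counts are arbitrary choices.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(1)
support = list(product((0, 1), repeat=4))
pmf = np.array([prob(x) for x in support])
pmf /= pmf.sum()                                   # guard against float drift
true_edges = {frozenset(e) for e in [(0, 1), (0, 2), (0, 3)]}  # star at node 1

def error_rate(n, trials=500):
    """Fraction of trials in which Chow-Liu recovers the wrong edge set."""
    errors = 0
    for _ in range(trials):
        idx = rng.choice(len(support), size=n, p=pmf)
        samples = [support[i] for i in idx]
        learned = {frozenset(e) for e in chow_liu(samples, d=4)}
        errors += (learned != true_edges)
    return errors / trials

for n in (100, 300, 1000):
    print(n, error_rate(n))   # -(1/n) log of this should approach K_P
```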
The Crossover Rate I

Given two node pairs e, e' ∈ \binom{V}{2} with distribution P_{e,e'} ∈ P(X^4), s.t. I(P_e) > I(P_{e'}).

Consider the crossover event of the empirical mutual informations, {I(\hat{P}_e) ≤ I(\hat{P}_{e'})}.

Def: Crossover Rate
\[
J_{e,e'} := \lim_{n \to \infty} -\frac{1}{n} \log P\big(\{ I(\hat{P}_e) \le I(\hat{P}_{e'}) \}\big).
\]

This event may potentially lead to an error in structure learning. Why? If e is a true edge and e' is not, a crossover can make the max-weight spanning tree swap e for e'.
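A Monte Carlo sketch of this event, under a hypothetical product model P_{e,e'} = P_e × P_{e'} chosen so that I(P_e) > I(P_{e'}); the particular tables and trial counts are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def mi_from_joint(Pij):
    """Mutual information of a 2x2 joint distribution (in nats)."""
    Pi, Pj = Pij.sum(1), Pij.sum(0)
    mask = Pij > 0
    return float((Pij[mask] * np.log(Pij[mask] / np.outer(Pi, Pj)[mask])).sum())

# Pair e is more strongly correlated than pair e', so I(P_e) > I(P_e').
Pe  = np.array([[0.40, 0.10], [0.10, 0.40]])   # I(P_e)  ~ 0.19 nats
Pep = np.array([[0.30, 0.20], [0.20, 0.30]])   # I(P_e') ~ 0.02 nats
joint = np.einsum('ab,cd->abcd', Pe, Pep).ravel()

def crossover_prob(n, trials=2000):
    """Monte Carlo estimate of P{ I(P̂_e) <= I(P̂_e') } with n samples."""
    hits = 0
    for _ in range(trials):
        counts = rng.multinomial(n, joint).reshape(2, 2, 2, 2) / n
        Phat_e, Phat_ep = counts.sum((2, 3)), counts.sum((0, 1))
        hits += (mi_from_joint(Phat_e) <= mi_from_joint(Phat_ep))
    return hits / trials

for n in (50, 100, 200):
    print(n, crossover_prob(n))   # should decay roughly exponentially in n
```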
The Crossover Rate II

Theorem
The crossover rate for empirical mutual informations is
\[
J_{e,e'} = \min_{Q \in \mathcal{P}(\mathcal{X}^4)} \big\{ D(Q \,\|\, P_{e,e'}) : I(Q_{e'}) = I(Q_e) \big\}.
\]

[Figure: P_{e,e'} is projected onto the constraint set {I(Q_e) = I(Q_{e'})}; the optimizer Q*_{e,e'} attains the divergence D(Q*_{e,e'} || P_{e,e'}).]

Via Sanov's Theorem and the Contraction Principle.

Exact, but not very intuitive, and the optimization is non-convex.
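A hedged reconstruction of the argument (standard large-deviations steps; the slide only names the tools):

```latex
% Sanov's theorem for the empirical type \hat{Q} of the n samples:
\[
  P\big(\hat{Q} \in \Gamma\big) \doteq
  \exp\Big(-n \inf_{Q \in \Gamma} D(Q \,\|\, P_{e,e'})\Big),
  \qquad
  \Gamma := \big\{ Q \in \mathcal{P}(\mathcal{X}^4) : I(Q_{e'}) \ge I(Q_e) \big\}.
\]
% Here \Gamma is the crossover set. Because Q \mapsto I(Q_e) - I(Q_{e'}) is
% continuous and I(P_e) > I(P_{e'}) places P_{e,e'} outside \Gamma, the
% infimum is attained on the boundary, which gives the equality constraint
% I(Q_{e'}) = I(Q_e) in the theorem.
```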
The Crossover Rate III

Euclidean Information Theory [Borade & Zheng '08]:
\[
Q \approx P \;\Rightarrow\; D(Q \,\|\, P) \approx \tfrac{1}{2}\, \| Q - P \|_P^2.
\]

Def: Very noisy learning condition on P_{e,e'}:
\[
P_e \approx P_{e'} \;\Rightarrow\; I(P_e) \approx I(P_{e'}).
\]

Def: Given P_e = P_{i,j}, the information density function is
\[
s_e(x_i, x_j) := \log \frac{P_{i,j}(x_i, x_j)}{P_i(x_i)\, P_j(x_j)}, \qquad \forall\, (x_i, x_j) \in \mathcal{X}^2.
\]
The Crossover Rate IV

Recall the crossover rate
\[
J_{e,e'} := \lim_{n \to \infty} -\frac{1}{n} \log P\big(\{ I(\hat{P}_e) \le I(\hat{P}_{e'}) \}\big).
\]

Theorem
The approximate crossover rate is
\[
\tilde{J}_{e,e'} = \frac{\big( I(P_{e'}) - I(P_e) \big)^2}{2\, \operatorname{Var}(s_{e'} - s_e)}.
\]

A signal-to-noise ratio for structure learning:
\[
\mathrm{SNR} = \Big( \frac{\text{mean}}{\text{stddev}} \Big)^2.
\]
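A short sketch that evaluates this approximate rate for the same hypothetical product model used in the Monte Carlo sketch above; info_density mirrors the definition of s_e two slides back.

```python
import numpy as np

def info_density(Pij):
    """s_e(x_i, x_j) = log[ P_{i,j} / (P_i P_j) ], elementwise over X^2."""
    return np.log(Pij / np.outer(Pij.sum(1), Pij.sum(0)))

Pe   = np.array([[0.40, 0.10], [0.10, 0.40]])
Pep  = np.array([[0.30, 0.20], [0.20, 0.30]])
Peep = np.einsum('ab,cd->abcd', Pe, Pep)            # joint P_{e,e'} on X^4

# Lift s_e and s_e' to X^4 and take mean/variance under P_{e,e'}.
diff = info_density(Pep)[None, None, :, :] - info_density(Pe)[:, :, None, None]
mean = float((Peep * diff).sum())                   # = I(P_e') - I(P_e), < 0 here
var  = float((Peep * diff**2).sum()) - mean**2      # = Var(s_e' - s_e)
J_approx = mean**2 / (2 * var)
print(J_approx)                                     # the SNR-style rate above
```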
The Crossover Rate V

Convexify the optimization problem by linearizing the constraints.

[Figure: on the left, the exact problem: P_{e,e'} projected onto the non-convex set {I(Q_e) = I(Q_{e'})}, attaining D(Q*_{e,e'} || P_{e,e'}). On the right, the linearized problem: P_{e,e'} projected onto the affine set Q(P_{e,e'}), attaining (1/2) ||Q*_{e,e'} - P_{e,e'}||²_{P_{e,e'}}.]

The non-convex problem becomes a least-squares problem.
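A hedged sketch of why the linearization yields the SNR expression (my reconstruction; the slide shows only the picture). Write Q = P_{e,e'} + ε:

```latex
% Linearizing Q \mapsto I(Q_e) at P_{e,e'}: the gradient is the information
% density s_e, up to an additive constant that vanishes against perturbations
% with \sum_x \epsilon(x) = 0. The constraint I(Q_{e'}) = I(Q_e) becomes affine:
\[
  \sum_{x \in \mathcal{X}^4} \epsilon(x)\,\big(s_{e'}(x) - s_e(x)\big)
  = I(P_e) - I(P_{e'}).
\]
% Minimizing \tfrac{1}{2}\|\epsilon\|_{P_{e,e'}}^2 subject to this single
% affine equality is weighted least squares; its optimal value is exactly
% \tilde{J}_{e,e'} = (I(P_{e'}) - I(P_e))^2 / (2 \operatorname{Var}(s_{e'} - s_e)).
```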
The Crossover Rate V

How good is the approximation? We consider a binary model.

[Figure: the true rate and the approximate rate plotted against I(P_e) - I(P_{e'}) ∈ [0, 0.07]; both rates J_{e,e'} lie between 0 and 0.025.]
Error Exponent for Structure Learning I

We have characterized the rate for the crossover event {I(\hat{P}_e) ≤ I(\hat{P}_{e'})}; we called this rate J_{e,e'}.

Crossovers: {I(\hat{P}_e) ≤ I(\hat{P}_{e'})}
Error event: {E_ML ≠ E_P}

How do we relate the crossover rates to K_P, the overall error exponent?
\[
P\big(\{E_{ML} \neq E_P\}\big) \doteq \exp(-n K_P).
\]
Error Exponent for Structure Learning II

Theorem
\[
K_P = \min_{e' \notin E_P} \; \min_{e \in \mathrm{Path}(e'; E_P)} J_{e,e'},
\]
where Path(e'; E_P) is the set of edges on the unique path in T_P joining the endpoints of the non-edge e'. The dominant error replaces some true edge e on that path with e'.
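Given the crossover rates, the theorem reduces to a double minimization. A sketch, assuming a dict J mapping pairs (frozenset(e), frozenset(e')) to rates J_{e,e'} (computed, e.g., with the approximation above); tree_path and error_exponent are hypothetical helper names.

```python
from itertools import combinations

def tree_path(u, v, edges):
    """Edges on the unique u-v path in a tree (DFS over the adjacency lists)."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, []).append(b)
        adj.setdefault(b, []).append(a)
    stack, prev = [u], {u: None}
    while stack:
        x = stack.pop()
        for y in adj[x]:
            if y not in prev:
                prev[y] = x
                stack.append(y)
    path, x = [], v
    while prev[x] is not None:          # walk predecessor pointers back to u
        path.append((prev[x], x))
        x = prev[x]
    return path

def error_exponent(d, tree_edges, J):
    """K_P = min over non-edges e' of min over e on Path(e'; E_P) of J[e, e']."""
    E = {frozenset(e) for e in tree_edges}
    K = float('inf')
    for ep in combinations(range(d), 2):
        if frozenset(ep) in E:
            continue                    # only a non-edge e' can displace a true edge
        for e in tree_path(ep[0], ep[1], tree_edges):
            K = min(K, J[frozenset(e), frozenset(ep)])
    return K
```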
Positivity of the Error Exponent

Theorem
The following statements are equivalent:
(a) The error probability decays exponentially, i.e., K_P > 0.
(b) T_P is a spanning tree, i.e., not a proper forest.
Conclusions

Goal: find the rate of decay of the probability of error:
\[
K_P := \lim_{n \to \infty} -\frac{1}{n} \log P\big(\{E_{ML} \neq E_P\}\big).
\]

Employed tools from large-deviation theory and basic properties of trees.

Found the dominant error tree, which relates the crossover events to the overall error event.

Used Euclidean Information Theory to obtain an intuitive signal-to-noise ratio expression for the crossover rate.

More details available on arXiv (http://arxiv.org/abs/0905.0940).