Neural networks
Sparse coding - online dictionary learning algorithm

Hugo Larochelle
Département d'informatique
Université de Sherbrooke
hugo.larochelle@usherbrooke.ca

November 1, 2012

• Sparse coding minimizes, over the dictionary D and the codes h^{(t)}:

  \min_D \frac{1}{T} \sum_{t=1}^{T} l(x^{(t)})
    = \min_D \frac{1}{T} \sum_{t=1}^{T} \min_{h^{(t)}} \frac{1}{2} \| x^{(t)} - D h^{(t)} \|_2^2 + \lambda \| h^{(t)} \|_1
Topics: learning algorithm (putting it all together)

• Learning alternates between inference and dictionary learning
• While D has not converged:
  ‣ find the sparse codes h(x^{(t)}) for the whole training set with ISTA:

    h(x^{(t)}) = \arg\min_{h^{(t)}} \frac{1}{2} \| x^{(t)} - D h^{(t)} \|_2^2 + \lambda \| h^{(t)} \|_1
  ‣ update the dictionary:

    D \Leftarrow \left( \sum_{t=1}^{T} x^{(t)} h(x^{(t)})^\top \right)
                 \left( \sum_{t=1}^{T} h(x^{(t)}) h(x^{(t)})^\top \right)^{-1}

    i.e. with B \Leftarrow \sum_{t=1}^{T} x^{(t)} h(x^{(t)})^\top and
    A \Leftarrow \sum_{t=1}^{T} h(x^{(t)}) h(x^{(t)})^\top, set D \Leftarrow B A^{-1}
  ‣ alternatively, run the block-coordinate descent algorithm to update D
• Similar to the EM algorithm

• When a new example x^{(T+1)} arrives, these sums can be maintained incrementally:

  \sum_t x^{(t)} h(x^{(t)})^\top \Leftarrow \beta \sum_{t=1}^{T} x^{(t)} h(x^{(t)})^\top + (1-\beta)\, x^{(T+1)} h(x^{(T+1)})^\top

  \sum_t h(x^{(t)}) h(x^{(t)})^\top \Leftarrow \beta \sum_{t=1}^{T} h(x^{(t)}) h(x^{(t)})^\top + (1-\beta)\, h(x^{(T+1)}) h(x^{(T+1)})^\top
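The batch procedure sketched above (alternate ISTA inference with the closed-form dictionary update D ⇐ B A⁻¹) can be written in a few lines of numpy. This is a minimal sketch under my own choices: function names, the step size α = 1/L for ISTA, the small ridge added to A, and all hyperparameter values are mine, not from the slides.

```python
import numpy as np

def shrink(a, b):
    # shrink(a, b)_i = sign(a_i) * max(|a_i| - b, 0)
    return np.sign(a) * np.maximum(np.abs(a) - b, 0.0)

def infer_codes(X, D, lam, n_steps=50):
    # ISTA on all examples at once: h <- shrink(h - alpha D^T (D h - x), alpha*lam)
    alpha = 1.0 / np.linalg.norm(D.T @ D, 2)      # step size 1/L, L = ||D^T D||_2
    H = np.zeros((D.shape[1], X.shape[1]))
    for _ in range(n_steps):
        H = shrink(H - alpha * (D.T @ (D @ H - X)), alpha * lam)
    return H

def batch_dictionary_learning(X, K, lam=0.1, n_iters=10, seed=0):
    rng = np.random.default_rng(seed)
    D = rng.normal(size=(X.shape[0], K))
    D /= np.linalg.norm(D, axis=0)                # unit-norm columns (not 0!)
    for _ in range(n_iters):
        H = infer_codes(X, D, lam)                # inference step
        B = X @ H.T                               # sum_t x^(t) h(x^(t))^T
        A = H @ H.T + 1e-8 * np.eye(K)            # sum_t h h^T (+ tiny ridge in
                                                  #  case a code dim is unused)
        D = np.linalg.solve(A, B.T).T             # D <- B A^{-1}
        D /= np.maximum(np.linalg.norm(D, axis=0), 1e-12)
    return D, H

# toy run: 40 examples in R^8, dictionary of 4 atoms
X = np.random.default_rng(1).normal(size=(8, 40))
D, H = batch_dictionary_learning(X, K=4, lam=0.05, n_iters=5)
```

Renormalizing the columns of D after each update keeps the atoms on the unit sphere, which prevents the degenerate solution of growing D while shrinking the codes.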
Topics: online learning algorithm

• This algorithm is ``batch'' (i.e. not online):
  ‣ a single update of the dictionary per pass on the training set
  ‣ for large datasets, we'd like to update D after visiting each x^{(t)}
• Solution: for each x^{(t)}
  ‣ perform inference of h(x^{(t)}) for the current D
  ‣ update running averages of the quantities required to update D:

    B \Leftarrow \beta B + (1-\beta)\, x^{(t)} h(x^{(t)})^\top
    A \Leftarrow \beta A + (1-\beta)\, h(x^{(t)}) h(x^{(t)})^\top

  ‣ use the current value of D as ``warm start'' for block-coordinate descent
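The running-average step above is just two rank-one updates. A minimal sketch (the function name and the default decay β = 0.95 are my own assumptions):

```python
import numpy as np

def update_running_averages(A, B, x, h, beta=0.95):
    # A <- beta A + (1 - beta) h h^T   (code second-moment statistics)
    # B <- beta B + (1 - beta) x h^T   (input/code cross statistics)
    A = beta * A + (1.0 - beta) * np.outer(h, h)
    B = beta * B + (1.0 - beta) * np.outer(x, h)
    return A, B
```

With β = 0 the averages reduce to the current example's statistics; β close to 1 gives a long memory over past examples.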
• Reminder - inference is done with ISTA, iterating to convergence:

  h^{(t)} \Leftarrow \mathrm{shrink}\!\left( h^{(t)} - \alpha\, D^\top ( D h^{(t)} - x^{(t)} ),\ \alpha \lambda \right)

  \mathrm{shrink}(a, b) = [\ldots,\ \mathrm{sign}(a_i) \max(|a_i| - b_i, 0),\ \ldots]
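The shrink operator and one ISTA step translate directly to numpy. A small sketch (function names are mine; `b` is taken as a scalar threshold, the common case αλ):

```python
import numpy as np

def shrink(a, b):
    # [..., sign(a_i) * max(|a_i| - b, 0), ...] : soft-thresholding
    return np.sign(a) * np.maximum(np.abs(a) - b, 0.0)

def ista_step(h, x, D, alpha, lam):
    # h <- shrink(h - alpha * D^T (D h - x), alpha * lam)
    return shrink(h - alpha * (D.T @ (D @ h - x)), alpha * lam)

shrink(np.array([3.0, -0.5, 1.0]), 1.0)   # -> array([ 2., -0.,  0.])
```

With a step size α ≤ 1/L, where L is the largest eigenvalue of DᵀD, each ISTA step does not increase the objective ½‖x − Dh‖² + λ‖h‖₁.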
• Where the dictionary update comes from - minimize the reconstruction term with respect to D:

  \min_D \frac{1}{T} \sum_{t=1}^{T} \frac{1}{2} \| x^{(t)} - D h(x^{(t)}) \|_2^2     (1)

  = \min_D \frac{1}{T} \sum_{t=1}^{T} \frac{1}{2} \left( x^{(t)\top} x^{(t)} - 2\, x^{(t)\top} D h(x^{(t)}) + h(x^{(t)})^\top D^\top D h(x^{(t)}) \right)     (2)

  Setting the gradient with respect to D to zero,

  \frac{1}{T} \sum_{t=1}^{T} \left( D h(x^{(t)}) - x^{(t)} \right) h(x^{(t)})^\top = 0     (3)

  \Rightarrow\ D = \left( \sum_{t=1}^{T} x^{(t)} h(x^{(t)})^\top \right) \left( \sum_{t=1}^{T} h(x^{(t)}) h(x^{(t)})^\top \right)^{-1}     (4)
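The derivation can be checked numerically: with D = B A⁻¹, the gradient (3) equals (D A − B)/T, which should vanish. A quick sketch (shapes and names are my own):

```python
import numpy as np

rng = np.random.default_rng(0)
n, K, T = 6, 4, 40
X = rng.normal(size=(n, T))        # columns are the inputs x^(t)
H = rng.normal(size=(K, T))        # columns are the codes h(x^(t))

B = X @ H.T                        # sum_t x^(t) h(x^(t))^T      (n x K)
A = H @ H.T                        # sum_t h(x^(t)) h(x^(t))^T   (K x K)
D = np.linalg.solve(A, B.T).T      # closed form (4): D = B A^{-1} (A symmetric)

# Gradient (3) of (1/T) sum_t (1/2) ||x - D h||^2 w.r.t. D is (D A - B) / T
grad = (D @ A - B) / T
```

Using `np.linalg.solve` instead of explicitly inverting A is the standard, numerically safer way to compute B A⁻¹.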
• The full online algorithm (Online Dictionary Learning for Sparse Coding. Mairal, Bach, Ponce and Sapiro, 2009):
• Initialize D (not to 0!)
• While D hasn't converged
  ‣ for each x^{(t)}
    - infer the code h(x^{(t)}) for the current D with ISTA
    - update the running averages:

      B \Leftarrow \beta B + (1-\beta)\, x^{(t)} h(x^{(t)})^\top
      A \Leftarrow \beta A + (1-\beta)\, h(x^{(t)}) h(x^{(t)})^\top

      (a simpler alternative is a single gradient step
       D \Leftarrow D + \alpha \left( x^{(t)} - D h(x^{(t)}) \right) h(x^{(t)})^\top)
    - run block-coordinate descent to update D, warm-started at the current D:
      ★ for each column j, perform the gradient update

        D_{\cdot,j} \Leftarrow \frac{1}{A_{j,j}} \left( B_{\cdot,j} - D A_{\cdot,j} + D_{\cdot,j} A_{j,j} \right)

        D_{\cdot,j} \Leftarrow \frac{D_{\cdot,j}}{\| D_{\cdot,j} \|_2}
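The column-wise block-coordinate descent can be sketched as follows. A minimal version under my own assumptions: the function name, the number of sweeps, and the guard for unused columns are mine; note that Mairal et al. (2009) project each column onto the unit ball (divide by max(‖d_j‖, 1)) rather than always normalizing, as the slides do.

```python
import numpy as np

def update_dictionary(D, A, B, n_sweeps=10):
    """Block-coordinate descent on the columns of D, warm-started at the
    current D, using A = sum_t h h^T and B = sum_t x h^T."""
    D = D.copy()
    K = D.shape[1]
    for _ in range(n_sweeps):
        for j in range(K):
            if A[j, j] < 1e-12:        # atom never used: leave it unchanged
                continue
            # D_{.,j} <- (1/A_jj) (B_{.,j} - D A_{.,j} + D_{.,j} A_jj)
            D[:, j] = (B[:, j] - D @ A[:, j] + D[:, j] * A[j, j]) / A[j, j]
            # renormalize: D_{.,j} <- D_{.,j} / ||D_{.,j}||_2
            D[:, j] /= max(np.linalg.norm(D[:, j]), 1e-12)
    return D

# toy statistics from random data and codes
rng = np.random.default_rng(2)
X = rng.normal(size=(5, 40)); H = rng.normal(size=(3, 40))
A = H @ H.T                            # sum_t h h^T
B = X @ H.T                            # sum_t x h^T
D = update_dictionary(rng.normal(size=(5, 3)), A, B)
```

Each inner step exactly minimizes the quadratic surrogate with respect to one column while the others are held fixed, which is why warm-starting from the current D makes the online updates cheap.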