In the Name of God

NN Learning Based on Information Maximization

Outline
• Introduction to Information Theory
• Information Maximization in NNs
  ▫ Single Input Single Output
  ▫ Single Layer Networks
  ▫ Causal Filters
• Applications
  ▫ Blind Separation
  ▫ Blind Deconvolution
  ▫ Experimental Results
• Discussion

Introduction to Information Theory
• The uncertainty (entropy) of a random variable is defined as
  H(X) = -\sum_{i=1}^{M} p(x_i) \log p(x_i)   (discrete RVs)
  H(X) = -\int_X f_x(x) \log f_x(x) \, dx   (continuous RVs)
• Consider the RV describing a die:

  x        1      2      3      4      5      6      H(X)
  p1(x)    0      0      0      0      0      1      0
  p2(x)    1/24   1/24   1/3    1/24   1/24   1/2    1.8
  p3(x)    1/6    1/6    1/6    1/6    1/6    1/6    2.6

Introduction to Information Theory
• Joint entropy:
  H(X,Y) = -\iint_{XY} f_{xy}(x,y) \log f_{xy}(x,y) \, dx \, dy
• Conditional entropy:
  H(Y|X) = H(X,Y) - H(X)
• Mutual information, for a deterministic neural network mapping input X to output Y:
  I(X;Y) = H(Y) - H(Y|X)
  i.e. the uncertainty about the output Y minus the uncertainty about the output given the input.

Information Maximization in NNs
• Maximize I(X;Y) = H(Y) - H(Y|X) with respect to the network parameters W.
• For a deterministic neural network, H(Y|X) is independent of W, so
  \partial I(X;Y) / \partial W = \partial H(Y) / \partial W
  and maximizing I(X;Y) is equivalent to maximizing H(Y).

Single Input Single Output
x → g(·) → y
• g(·) is a monotonic function of x:
  f_y(y) = f_x(x) / |\partial y / \partial x|,   x = g^{-1}(y)
  H(Y) = -E[\log f_y(y)] = E[\log |\partial y / \partial x|] - E[\log f_x(x)]
  The second term is independent of the network parameters, so
  \Delta w \propto \frac{\partial H}{\partial w} = \frac{\partial}{\partial w} \log \left| \frac{\partial y}{\partial x} \right| = \left( \frac{\partial y}{\partial x} \right)^{-1} \frac{\partial}{\partial w} \left( \frac{\partial y}{\partial x} \right)

Single Input Single Output
• Let y = 1 / (1 + e^{-u}), u = w x + w_0.
• Matching a neuron's input-output function to the distribution of its input signals:
  \partial y / \partial x = w \, y (1 - y)
  \frac{\partial}{\partial w} \left( \frac{\partial y}{\partial x} \right) = y(1-y)\,\bigl(1 + w x (1 - 2y)\bigr)
  ⇓
  \Delta w \propto \frac{1}{w} + x(1 - 2y),   \Delta w_0 \propto 1 - 2y
  The 1/w term is the anti-decay term; x(1 - 2y) is the anti-Hebbian term.
  (A code sketch of this update appears after the Discussion section.)

Single Layer Networks
x_1, ..., x_N → W → y_1, ..., y_N
  Y = g(WX + W_0),   g(u) = 1 / (1 + e^{-u})
  f_y(y) = f_x(x) / |J|, where
  J = \det \left[ \frac{\partial y_i}{\partial x_j} \right] = \det(W) \prod_{i=1}^{N} y_i (1 - y_i),   since \frac{\partial y_i}{\partial x_j} = w_{ij} \, y_i (1 - y_i)
  H(Y) = -E[\log f_y(y)] = E[\log |J|] - E[\log f_x(x)]
  \Delta w_{ij} \propto \frac{\partial H}{\partial w_{ij}} = \frac{1}{J} \frac{\partial J}{\partial w_{ij}} = \frac{\mathrm{cof}\, w_{ij}}{\det W} + x_j (1 - 2 y_i)
  \Delta W \propto [W^T]^{-1} + (1 - 2Y) X^T,   \Delta W_0 \propto 1 - 2Y
  The first term is the anti-redundancy term; the second is the anti-Hebbian term.
  (See the blind-separation code sketch after the Discussion section.)

Causal Filters
x(t) → w(t) (causal filter) → u(t) → g(·) (nonlinearity) → y(t)
  y(t) = g(u(t)) = g(w(t) * x(t))
• In matrix form, Y = g(U) = g(WX), where W is the lower-triangular banded Toeplitz matrix of filter taps:
  W = \begin{bmatrix} w_L & 0 & \cdots & 0 \\ w_{L-1} & w_L & \cdots & 0 \\ \vdots & \ddots & \ddots & \vdots \\ 0 & \cdots & w_{L-1} & w_L \end{bmatrix}
  i.e. u_t = \sum_{j \ge 0} w_{L-j} \, x_{t-j}, with w_L the leading weight.
• The resulting learning rules for the leading weight w_L and the delayed weights w_{L-j} are
  \Delta w_L \propto \sum_{t=1}^{M} \left( \frac{1}{w_L} - 2 x_t y_t \right)
  \Delta w_{L-j} \propto \sum_{t=1}^{M} \left( -2 \, x_{t-j} \, y_t \right)
  (A toy deconvolution sketch appears after the Discussion section.)

Applications
• Blind Separation (e.g., the cocktail-party problem)
  Unknown sources s_1, ..., s_N are mixed by an unknown mixing matrix A into the observations x_1, ..., x_N. Find the W that reverses the effect of A, producing outputs u_1, ..., u_N.
• Blind Deconvolution (e.g., echo cancellation)
  s(t) → a(t) (unknown causal filter) → x(t) → w(t) → u(t). Find the causal filter w(t) that reverses the effect of a(t).

Experimental Results
• A 5×5 information-maximization network for blind separation of 5 speakers from 7-second speech segments.

Experimental Results
• [Figure: filters used to convolve the speech signals in the blind-deconvolution experiments.]

Discussion
• The algorithms are limited:
  ▫ Only single-layer networks were used.
  ▫ N-to-N mappings (not useful for dimensionality reduction or compression).
  ▫ Time delays are not considered in blind separation.
• The learning rule is not local.
• The presented work brings new applications to the NN domain.
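A minimal NumPy sketch of the single-unit update from the Single Input Single Output slides (Δw ∝ 1/w + x(1−2y), Δw0 ∝ 1−2y). The Gaussian input distribution, learning rate, and sample count are illustrative assumptions, not taken from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

# Single input, single output: y = 1 / (1 + exp(-(w*x + w0)))
# Infomax update from the slides: dw ∝ 1/w + x*(1 - 2y),  dw0 ∝ 1 - 2y
w, w0 = 1.0, 0.0                                       # initial weight and bias
lr = 0.01                                              # illustrative learning rate
x_data = rng.normal(loc=2.0, scale=1.5, size=20000)    # illustrative input distribution

for x in x_data:
    y = 1.0 / (1.0 + np.exp(-(w * x + w0)))
    w += lr * (1.0 / w + x * (1.0 - 2.0 * y))          # anti-decay + anti-Hebbian terms
    w0 += lr * (1.0 - 2.0 * y)                         # centers the sigmoid on the input

print(f"learned w = {w:.3f}, w0 = {w0:.3f}")
```

At convergence E[1−2y] ≈ 0, so the sigmoid's midpoint sits at the median of the input and its slope is matched to the spread of the input density, which is what maximizing H(Y) means for a single unit.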
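A toy sketch of the single-layer rule ΔW ∝ [W^T]^{−1} + (1−2Y)X^T, ΔW0 ∝ 1−2Y, applied to blind separation of two sources. The Laplacian sources, the mixing matrix A, the batch size, and the learning rate are illustrative choices; the slides' experiments used speech signals and a 5×5 network.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy blind separation: two super-Gaussian sources, an unknown mixing matrix A,
# and the infomax rule  dW ∝ inv(W.T) + (1 - 2Y) @ X.T,  dW0 ∝ 1 - 2Y.
n, T = 2, 20000
S = rng.laplace(size=(n, T))           # illustrative super-Gaussian sources
A = np.array([[1.0, 0.6],
              [0.4, 1.0]])             # unknown mixing matrix (ground truth for the demo)
X = A @ S                              # observed mixtures

W = np.eye(n)                          # unmixing matrix to be learned
W0 = np.zeros((n, 1))                  # bias
lr, batch = 0.01, 100

for epoch in range(30):
    for t in range(0, T, batch):
        Xb = X[:, t:t + batch]
        Y = 1.0 / (1.0 + np.exp(-(W @ Xb + W0)))               # sigmoid outputs
        dW = np.linalg.inv(W.T) + (1.0 - 2.0 * Y) @ Xb.T / batch
        dW0 = np.mean(1.0 - 2.0 * Y, axis=1, keepdims=True)
        W += lr * dW
        W0 += lr * dW0

print("W @ A (ideally close to a scaled, possibly sign-flipped permutation matrix):")
print(W @ A)
```

The inv(W.T) term is what the slides call the anti-redundancy term. A common refinement is the natural-gradient form, which avoids the matrix inverse by multiplying the update by W^T W, but the sketch keeps the rule exactly as stated on the slides.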
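A rough sketch of the causal-filter (blind deconvolution) rules Δw_L ∝ Σ_t (1/w_L − 2 x_t y_t) and Δw_{L−j} ∝ Σ_t (−2 x_{t−j} y_t). The tanh output (the form these −2xy terms correspond to), the echo filter a(t), the tap count, and the learning rate are assumptions made for this toy example; the sums over t are replaced by per-sample means, with the scale folded into the learning rate.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy blind deconvolution with the causal-filter rules from the slides:
#   dw_L     ∝ sum_t ( 1/w_L - 2 x_t y_t )
#   dw_{L-j} ∝ sum_t ( -2 x_{t-j} y_t ),   u = w * x (causal convolution), y = tanh(u)
T, L = 20000, 15
s = rng.laplace(size=T)                     # super-Gaussian source (stand-in for speech)
a = np.zeros(8); a[0], a[5] = 1.0, 0.6      # unknown corrupting filter: direct path + echo
x = np.convolve(s, a)[:T]                   # observed corrupted signal

w = np.zeros(L); w[0] = 1.0                 # w[0] is the leading weight w_L
lr = 0.005                                  # illustrative learning rate

for it in range(300):
    u = np.convolve(x, w)[:T]               # causal filtering: u_t = sum_j w_j x_{t-j}
    y = np.tanh(u)
    w[0] += lr * np.mean(1.0 / w[0] - 2.0 * x * y)       # anti-decay + anti-Hebbian
    for j in range(1, L):
        w[j] += lr * np.mean(-2.0 * x[:-j] * y[j:])      # delayed anti-Hebbian terms

print("learned taps:", np.round(w, 2))
# If learning succeeds, w approximates (a scaled version of) the inverse of a(t),
# i.e. taps concentrated near lags 0, 5, 10 with alternating signs.
```

Note that the corrupting filter a(t) is used only to synthesize the observed signal for the demo; the learning rule itself sees nothing but x and y, which is what makes the deconvolution blind.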
References
• A. J. Bell and T. J. Sejnowski, "An information-maximization approach to blind separation and blind deconvolution," Neural Computation, vol. 7, no. 6, pp. 1129–1159, 1995.
• R. B. Ash, "Information Theory," New York: Interscience, 1965.