Simplicity and Inference
or: Some Case Studies From Statistics and Machine Learning
or: Why We Need a Theory of Oversmoothing
CFE Workshop, Carnegie Mellon, June 2012
Larry Wasserman
Department of Statistics and Machine Learning Department
Carnegie Mellon University

Conclusion

Statistics needs a rigorous theory of oversmoothing (underfitting). There are hints:
- G. Terrell (the oversmoothing principle)
- D. Donoho (one-sided inference)
- L. Davies (the simplest model consistent with the data)
But, as I'll show, we need a more general way to do this.

Plan

1. Regression
2. Graphical models
3. Density estimation (simplicity versus L2)
4. Topological data analysis

Preview of the Examples

1. (High-dimensional) regression. Observe (X1, Y1), ..., (Xn, Yn). Observe a new X; predict the new Y. Here Y ∈ R and X ∈ R^d with d > n.

2. (High-dimensional) undirected graphs. X ∼ P. G = G(P) = (V, E), V = {1, ..., d}, E = edges. No edge between j and k means Xj ⊥ Xk | rest.

3. Density estimation. Y1, ..., Yn ∼ P and P has density p. Estimator:

    p̂_h(x) = (1/n) ∑_{i=1}^n (1/h^d) K( ||x − Yi|| / h ).

   Need to choose h.

4. Topological data analysis. X1, ..., Xn ∼ G, where G is supported on a manifold M. Observe Yi = Xi + εi. Want to recover the homology of M.

Regression

The best predictor is m(x) = E(Y | X = x). Assume only that the data are i.i.d. and the random variables are bounded. There is no uniformly consistent (distribution-free) estimator of m.

How about the best linear predictor? The excess risk is

    E(β̂) = E(Y − β̂^T X)² − inf_β E(Y − β^T X)².

But as n → ∞ and d = d(n) → ∞,

    inf_{β̂} sup_P E(β̂) → ∞.

Simplicity: Best Sparse Linear Predictor

Let Bk = {β : ||β||_0 ≤ k}, where ||β||_0 = #{j : βj ≠ 0}. Small ||β||_0 means simplicity.

Good news: if β̂ is the best subset estimator, then

    E(Y − β̂^T X)² − inf_{β ∈ Bk} E(Y − β^T X)² → 0.

Bad news: computing β̂ is NP-hard.

Convex Relaxation: The Lasso

Let B_L = {β : ||β||_1 ≤ L}, where ||β||_1 = ∑_j |βj|. Note that L controls sparsity (simplicity).

Oracle: β∗ minimizes R(β) over B_L.
Lasso: β̂ minimizes (1/n) ∑_{i=1}^n (Yi − β^T Xi)² over B_L.

In this case:

    sup_P P( R(β̂) > R(β∗) + ε ) ≤ exp(−c n ε²).

But how do we choose L?

Choosing L (Estimating the Simplicity)

Usually we minimize a risk estimator R̂(L) (such as cross-validation). It is known that this overfits.

Theorem (Meinshausen and Bühlmann; Wasserman and Roeder):

    P( support(β∗) ⊂ support(β̂) ) → 1,

where β∗ minimizes the true prediction loss and β̂ is the lasso fit with the estimated-risk-minimizing choice of L. But if we try to correct by moving to a simpler model, we risk huge losses, since the risk function is asymmetric. Here is a simulation: the true model size is 5 (d = 80, n = 40).

[Figure: estimated risk versus the number of variables in the selected model.]

Corrected Risk Estimation

In other words: simplicity ≠ accurate prediction. High predictive accuracy requires that we overfit. What if we want to force more simplicity? Can we correct the overfitting without incurring a disaster? Safe simplicity:

    Z(Λ) = sup_{ℓ ≥ Λ} |R̂(Λ) − R̂(ℓ)| / s(Λ, ℓ).

(This is Lepski's nonparametric method, adapted to the lasso.)

[Figure: the cross-validation curve, the statistic z, and the number of selected variables (nv), each plotted against the index.]

Screen and Clean

Even better: screen and clean (Wasserman and Roeder, Annals of Statistics 2009). Split the data into three parts:

Part 1: fit the lasso.
Part 2: variable selection by cross-validation.
Part 3: least squares on the surviving variables, followed by ordinary hypothesis testing.

[Figure: the test statistic T plotted against the variable index 1:d.]

Screen and Clean

However, this is getting complicated (and inefficient). What happens when linearity is false, when there are high correlations, and so on? Is there anything simpler?
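Before moving on, here is a minimal sketch of the three-way splitting idea behind screen and clean, on simulated data with d = 80 predictors and a true model of size 5. It is an illustrative simplification rather than the procedure of Wasserman and Roeder (2009): the penalty grid, the demo sample size, and the Bonferroni-style cut in the cleaning step are assumptions made here.

```python
# A minimal sketch (not the authors' code) of the screen-and-clean idea:
# split the data in three, screen variables with the lasso, pick the penalty
# by prediction error on the second split, then run least squares and ordinary
# t-tests on the surviving variables in the third split.
import numpy as np
from scipy import stats
from sklearn.linear_model import Lasso


def screen_and_clean(X, y, alphas=None, level=0.05, seed=0):
    n, d = X.shape
    alphas = np.logspace(-2, 1, 30) if alphas is None else alphas
    idx = np.random.default_rng(seed).permutation(n)
    i1, i2, i3 = np.array_split(idx, 3)            # three roughly equal parts

    # Parts 1 and 2: fit the lasso on the first split for each penalty and
    # choose the penalty by prediction error on the second split (screening).
    errs = []
    for a in alphas:
        coef = Lasso(alpha=a, fit_intercept=False, max_iter=10_000).fit(X[i1], y[i1]).coef_
        errs.append(np.mean((y[i2] - X[i2] @ coef) ** 2))
    a_best = alphas[int(np.argmin(errs))]
    screened = np.flatnonzero(
        Lasso(alpha=a_best, fit_intercept=False, max_iter=10_000).fit(X[i1], y[i1]).coef_)
    if screened.size == 0 or screened.size >= len(i3):
        return screened                            # nothing to clean / too many to test

    # Part 3: least squares on the surviving variables, then t-tests with a
    # Bonferroni-style cut over the screened set (the exact correction used
    # here is an assumption of this sketch).
    Xs, ys = X[i3][:, screened], y[i3]
    beta_hat, *_ = np.linalg.lstsq(Xs, ys, rcond=None)
    resid = ys - Xs @ beta_hat
    df = len(i3) - screened.size
    se = np.sqrt(np.diag((resid @ resid / df) * np.linalg.inv(Xs.T @ Xs)))
    pvals = 2 * stats.t.sf(np.abs(beta_hat / se), df)
    return screened[pvals < level / screened.size]


# Demo: d = 80 predictors, true support of size 5 (the sample size and signal
# strength are illustrative, chosen so that each split is informative).
rng = np.random.default_rng(1)
n, d, k = 120, 80, 5
X = rng.standard_normal((n, d))
beta = np.zeros(d)
beta[:k] = 3.0
y = X @ beta + rng.standard_normal(n)
print("selected variables:", screen_and_clean(X, y))
```

Even in this toy form, the logic is visible: the screening step is allowed to overfit, and the testing step on held-out data is what removes the spurious variables.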
Graphs

X = (X1, ..., Xd). G = (V, E), V = {1, ..., d}, E = edges. (j, k) ∉ E means that Xj ⊥ Xk | rest. The graph X — Z — Y means X ⊥ Y | Z.

Observe X^(1), ..., X^(n) ∼ P. Infer G.

Graphs

Common approach: assume X^(1), ..., X^(n) ∼ N(µ, Σ). Find µ̂, Σ̂ to maximize

    loglikelihood(µ, Σ) − λ ∑_{j≠k} |Ω_{jk}|,

where Ω = Σ⁻¹. Omit an edge if Ω̂_{jk} = 0.

Same problems as the lasso: there is no good way to choose λ. In addition, non-Gaussianity seems to lead to overfitting. The latter can be alleviated using forests.

Forests

A forest is a graph with no cycles. In this case

    p(x) = ∏_{j=1}^d pj(xj)  ∏_{(j,k)∈E} p_{jk}(xj, xk) / ( pj(xj) pk(xk) ).

The densities can be estimated nonparametrically. The edge set can be estimated by the Chow–Liu algorithm based on nonparametric estimates of the mutual information I(Xj, Xk). Han Liu, Min Xu, Haijie Gu, Anupam Gupta, John Lafferty, Larry Wasserman (JMLR 2010).

[Figure: gene microarray example — graphs estimated by the glasso and by the nonparametric forest method.]

[Figure: synthetic example — the true graph, the nonparametric estimate, and the best-fit glasso.]

What We Really Want

We cannot estimate the truth. There is no universally consistent, distribution-free test of H0 : X ⊥ Y | Z. We are better off asking: what is the simplest graphical model consistent with the data?

What We Really Want

Precedents for this are Davies, Terrell, Donoho, etc. Here is Davies's idea (simplified). Observe (X1, Y1), ..., (Xn, Yn). For any function m we can write

    Yi = m(Xi) + εi = signal + noise,

where εi = Yi − m(Xi). He finds the "simplest" function m such that ε1, ..., εn look like "noise." See Davies, Kovac and Meise (2009) and Davies and Kovac (2001). How do we do this for graphical models?

Density Estimation

Seemingly an old, solved problem. Y1, ..., Yn ∼ P, where Yi ∈ R^d and P might have a density p. The kernel estimator is

    p̂_h(y) = (1/n) ∑_{i=1}^n (1/h^d) K( ||y − Yi|| / h ).

Here K is a kernel and h > 0 is a bandwidth. How do we choose h? Usually we minimize an estimate of

    R(h) = E ∫ (p̂_h(x) − p(x))² dx.

But this is the wrong loss function ...

[Figure: three densities p0, p1 and p2 with L2(p0, p1) = L2(p0, p2).]

[Figure: a true density and kernel estimates with h = .01, h = .4 and h = 4.]

Density Estimation

More generally, p might have smooth parts, singularities, near-singularities (mass concentrated near manifolds), and so on. In principle we can use Lepski's method: choose a local bandwidth

    ĥ(x) = sup{ h : |p̂_h(x) − p̂_t(x)| ≤ ψ(t, h) for all t < h }.

(Lepski and Spokoiny 1997; Lepski, Mammen and Spokoiny 1997.) It leads to this ...

[Figure: the resulting local bandwidth h(x) and the corresponding estimate p(x).]

Oversmoothing

We really just want a principled way to oversmooth. Terrell and Scott (1985) and Terrell (1990) suggest the following: choose the largest amount of smoothing compatible with the scale of the density. The asymptotically optimal bandwidth (with d = 1) is

    h = ( ∫ K²(x) dx / ( n σ_K⁴ I(p) ) )^{1/5},

where I(p) = ∫ (p''(x))² dx. Now: minimize I(p) subject to T(P) = T(P̂n), where T(·) is the variance.

Oversmoothing

Solution:

    h = 1.47 s ( ∫ K²(x) dx / n )^{1/5},

where s is the sample standard deviation.

Good idea, but:
- it is still based on L2 loss;
- it is based on an asymptotic expression for the optimal h.

We need a finite-sample version with a more appropriate loss function. Here it is on our example:

[Figure: the oversmoothed kernel estimate on the same example.]
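To make the oversmoothing rule concrete, here is a minimal numerical sketch comparing the oversmoothed bandwidth h = 1.47 s (∫K²/n)^{1/5} with a least-squares cross-validation bandwidth on simulated one-dimensional data. The two-component normal mixture and the bandwidth grid are illustrative assumptions; this is not the example from the slides.

```python
# A minimal sketch of Terrell's oversmoothed bandwidth for a one-dimensional
# Gaussian-kernel density estimate, compared with a cross-validated bandwidth.
# The simulated mixture and the grid of candidate bandwidths are assumptions.
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
n = 200
y = np.concatenate([rng.normal(-2, 1.0, n // 2), rng.normal(2, 0.5, n - n // 2)])
s = y.std(ddof=1)

# Oversmoothed bandwidth: h = 1.47 * s * (int K^2 / n)^(1/5).
# For the Gaussian kernel, int K^2 dx = 1 / (2 * sqrt(pi)).
RK = 1.0 / (2.0 * np.sqrt(np.pi))
h_os = 1.47 * s * (RK / n) ** 0.2

def lscv(h, y):
    """Least-squares (leave-one-out) cross-validation score for bandwidth h."""
    m = len(y)
    d = (y[:, None] - y[None, :]) / h
    # int phat^2 dx = (1 / (m^2 h)) * sum_ij (K * K)(d_ij), with K * K = N(0, 2).
    term1 = (np.exp(-0.25 * d**2) / np.sqrt(4 * np.pi)).sum() / (m**2 * h)
    # leave-one-out term: (2 / (m (m-1) h)) * sum_{i != j} K(d_ij).
    Kd = np.exp(-0.5 * d**2) / np.sqrt(2 * np.pi)
    term2 = 2 * (Kd.sum() - np.trace(Kd)) / (m * (m - 1) * h)
    return term1 - term2

grid = np.linspace(0.05, 2.0, 100)
h_cv = grid[np.argmin([lscv(h, y) for h in grid])]

print(f"oversmoothed bandwidth:    {h_os:.3f}")
print(f"cross-validated bandwidth: {h_cv:.3f}")   # typically smaller (wigglier fit)

# scipy's gaussian_kde takes the bandwidth as a factor multiplying the sample
# standard deviation, so bandwidth h corresponds to bw_method = h / s.
kde_os = gaussian_kde(y, bw_method=h_os / s)
kde_cv = gaussian_kde(y, bw_method=h_cv / s)
```

The oversmoothed bandwidth is, in the spirit of the slide, the largest amount of smoothing compatible with the estimated scale; the cross-validated choice is typically smaller.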
Summary so far ...

Our current methods select models that are too complex. Are there simple methods for choosing simple models?

Now, a more exotic application ...

Topological Data Analysis

X1, ..., Xn ∼ G, where G is supported on a manifold M. Here Xi ∈ R^D but dimension(M) = d < D. Observe Yi = Xi + εi. Goal: infer the homology of M. Homology: clusters, holes, tunnels, etc.

[Figure: a point cloud forming one cluster.]

[Figure: a point cloud forming one cluster with one hole.]

Inference

The Niyogi, Smale, Weinberger (2008) estimator:

1. Estimate the density (bandwidth h).
2. Throw away low-density points (threshold t).
3. Form a Čech complex (radius ε).
4. Apply an algorithm from computational geometry.

Usually the results are summarized, as a function of ε, in a barcode plot (or a persistence diagram). (A minimal code sketch of a simplified version of this pipeline appears after the summary below.)

Example

[Figure: from Horak, Maletic and Rajkovic (2008).]

The Usual Heuristic

We usually assume that small barcodes are topological noise. This is really a statistical problem with many tuning parameters, and currently there are no methods for choosing them. What we want: the simplest topology consistent with the data. Working on this with Sivaraman Balakrishnan, Aarti Singh, Alessandro Rinaldo and Don Sheehy.

Summary

We still don't know how to choose simple models.

THE END
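As promised above, here is a minimal sketch of a simplified version of the Niyogi–Smale–Weinberger pipeline, tracking only H0 (clusters) rather than the full homology. The circle-with-noise data, the default bandwidth, the 10% density threshold, and the grid of radii are all illustrative assumptions; a real analysis would build a Čech (or Rips) complex and compute persistent homology with a dedicated library.

```python
# A minimal sketch of a simplified pipeline: estimate the density, discard
# low-density points, then count clusters of the epsilon-neighborhood graph
# over a range of epsilon (a crude "barcode" for clusters only).
import numpy as np
from scipy.stats import gaussian_kde
from scipy.spatial.distance import cdist
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

rng = np.random.default_rng(0)
n = 300
theta = rng.uniform(0, 2 * np.pi, n)
X = np.c_[np.cos(theta), np.sin(theta)]          # manifold M: the unit circle
Y = X + 0.05 * rng.standard_normal((n, 2))       # observed points Y_i = X_i + eps_i

# Step 1: estimate the density (tuning parameter: the bandwidth).
kde = gaussian_kde(Y.T)                          # scipy's default bandwidth
dens = kde(Y.T)

# Step 2: throw away low-density points (tuning parameter: the threshold t).
t = np.quantile(dens, 0.10)                      # assumed: drop the lowest 10%
Z = Y[dens >= t]

# Steps 3-4 (simplified): connect points within distance eps and count the
# resulting clusters, for a range of eps.
D = cdist(Z, Z)
for eps in [0.05, 0.1, 0.2, 0.5, 1.0]:
    A = csr_matrix(D <= eps)
    k, _ = connected_components(A, directed=False)
    print(f"eps = {eps:4.2f}: {k} cluster(s)")
```

Cluster counts that persist over a wide range of ε play the role of long bars in a barcode; the statistical question raised in the talk is how to choose h, t, and ε, and how to decide which bars are noise, in a principled way.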