2014 Technical Notes
jongman@gmail.com
January 21, 2015

1 Statistical Inference Writeup
jongman@gmail.com
January 19, 2015

This is a personal writeup of Statistical Inference (Casella and Berger, 2nd ed.). The purpose of this note is to keep a log of my impressions during the reading process, so I cannot guarantee the correctness of the contents. :-)

Contents

1 Probability Theory
2 Transformations and Expectations
  2.1 Transformations of random variables
    2.1.1 Monotonic functions
    2.1.2 Piecewise monotonic functions
    2.1.3 Probability integral transformation
  2.2 Expected Values
  2.3 Moments and Moment Generating Functions
3 Common Families of Distribution
  3.1 Discrete Distributions
  3.2 Continuous Distributions
  3.3 Exponential Families of Distribution
  3.4 Scaling and Location Families
  3.5 Inequalities and Identities
4 Multiple Random Variables
  4.1 Joint and Marginal Distributions
  4.2 Conditional Distributions and Independence
  4.3 Bivariate Transformations
  4.4 Hierarchical Models and Mixture Distributions
  4.5 Covariance and Correlation
  4.6 Multivariate Distributions
  4.7 Inequalities and Identities
5 Properties of a Random Sample
  5.1 Basic Concepts of Random Samples
  5.2 Sums of Random Variables from a Random Sample
    5.2.1 Basic Statistics
    5.2.2 Bessel's Correction
    5.2.3 Sampling Distributions of Sample Mean and Sample Variance
    5.2.4 Using mgfs to Find Sampling Distributions
  5.3 Sampling From the Normal Distribution
    5.3.1 Student's t-distribution
    5.3.2 Fisher's F-distribution
  5.4 Order Statistics
  5.5 Convergence Concepts
    5.5.1 Definitions
    5.5.2 Law of Large Numbers
    5.5.3 Central Limit Theorem
    5.5.4 Delta Method
6 Principles of Data Reduction
  6.1 The Sufficiency Principle
    6.1.1 Sufficient Statistic
    6.1.2 Factorization Theorem
    6.1.3 Minimal Sufficient Statistics
    6.1.4 Ancillary Statistics
    6.1.5 Complete Statistics
  6.2 The Likelihood Principle
7 Point Estimation
  7.1 Methods of Finding Estimators
    7.1.1 Method of Moments for Finding Estimator
    7.1.2 Maximum Likelihood Estimation
    7.1.3 Bayes Estimators
    7.1.4 The EM Algorithm
  7.2 Methods of Evaluating Estimators
    7.2.1 Mean Squared Error
    7.2.2 Best Unbiased Estimator
    7.2.3 General Loss Functions
8 Hypothesis Testing
  8.1 Terminology
  8.2 Methods of Finding Tests
    8.2.1 Likelihood Ratio Tests
    8.2.2 Bayesian Tests
    8.2.3 Union-Intersection and Intersection-Union Tests
  8.3 Evaluating Tests
    8.3.1 Types of Errors and Power Function
    8.3.2 Most Powerful Tests: Uniformly Most Powerful Tests
    8.3.3 Size of UIT and IUT
    8.3.4 p-Values
    8.3.5 Loss Function Optimality
9 Interval Estimation
  9.1 Introduction
    9.1.1 Coverage Probability Example: Uniform Scale Distribution
  9.2 Methods of Finding Interval Estimators
    9.2.1 Equivalence of Hypothesis Test and Interval Estimations: Inverting a Test Statistic
    9.2.2 Using Pivotal Quantities
    9.2.3 Pivoting CDFs Using Probability Integral Transformation
    9.2.4 Bayesian Inference
  9.3 Methods of Evaluating Interval Estimators
    9.3.1 Size and Coverage Probability
    9.3.2 Test-Related Optimality
    9.3.3 Bayesian Optimality
    9.3.4 Loss Function Optimality
10 Asymptotic Evaluations
  10.1 Point Estimation
    10.1.1 Criteria
    10.1.2 Comparing Consistent Estimators
    10.1.3 Asymptotic Behavior of Bootstrapping
  10.2 Robustness
    10.2.1 Robustness of Mean and Median
    10.2.2 M-estimators
  10.3 Hypothesis Testing
    10.3.1 Asymptotic Distribution of LRT
    10.3.2 Wald's Test and Score Test
  10.4 Interval Estimation
11 Analysis of Variance and Regression
  11.1 One-way ANOVA
    11.1.1 Different ANOVA Hypothesis
    11.1.2 Inference Regarding Linear Combination of Means
    11.1.3 The ANOVA F Test
    11.1.4 Simultaneous Estimation of Contrasts
    11.1.5 Partitioning Sum of Squares
  11.2 Simple Linear Regression
    11.2.1 General Model
    11.2.2 Least Square Solution
    11.2.3 Best Linear Unbiased Estimators: BLUE
    11.2.4 Normal Assumptions
12 Regression Models
  12.1 Errors in Variables (EIV) Models
    12.1.1 Functional And Structural Relationship
    12.1.2 Mathematical Solution: Orthogonal Least Squares
    12.1.3 Maximum Likelihood Estimation
    12.1.4 Confidence Sets
  12.2 Logistic Regression And GLM
    12.2.1 Generalized Linear Model
    12.2.2 Logistic Regression
  12.3 Robust Regression
    12.3.1 Huber Loss
1 Probability Theory

We will not discuss elementary set theory and probability theory.

• Kolmogorov's axiom set defines a probability function for a given σ-algebra.
• Bonferroni's inequality is useful for getting a gut estimate of a lower bound for events whose probabilities are hard or impossible to calculate. It is equivalent to Boole's inequality.
  – It is a generalization of: P(A ∩ B) ≥ P(A) + P(B) − 1
    * Proof: P(A ∪ B) = P(A) + P(B) − P(A ∩ B). Rearranging, we get P(A ∩ B) = P(A) + P(B) − P(A ∪ B) ≥ P(A) + P(B) − 1.
• Random variables: functions mapping from the sample space to the real numbers.

2 Transformations and Expectations

2.1 Transformations of random variables

This section covers functions of random variables and how to derive their cdf from the cdf of the original RV. Say we have a random variable X with pdf f_X or cdf F_X. What is the distribution of Y = g(X)?

2.1.1 Monotonic functions

We can only do this for monotonic functions, or at least piecewise monotonic functions. So what do we do when the function Y = g(X) is monotonic?

• If monotonically increasing: F_Y(y) = P(Y ≤ y) = P(g(X) ≤ y) = P(X ≤ g^{-1}(y)) = F_X(g^{-1}(y))
• If decreasing: F_Y(y) = P(Y ≤ y) = P(g(X) ≤ y) = P(X ≥ g^{-1}(y)) = 1 − F_X(g^{-1}(y))

What do you do when you have the pdf, not the cdf, of the original RV? You don't have to go through the route above; you can differentiate the above using the chain rule. The formula below takes care of both increasing and decreasing functions:

    f_Y(y) = f_X\left(g^{-1}(y)\right) \left| \frac{d}{dy} g^{-1}(y) \right|

In fact, this is used far more often than the cdf case.

2.1.2 Piecewise monotonic functions

You can also do this when g(·) is piecewise monotonic. This means you should be able to partition χ, the domain of the original RV, into contiguous sets such that the function is monotonic on each partition. For example, if Y = X², we should split the real line into (−∞, 0] and (0, ∞). Let us partition χ into subsets A_1, A_2, ... and let g_1^{-1}, g_2^{-1}, ... be the inverses of g(·) on each of the intervals. Then, stated without proof,

    f_Y(y) = \sum_i f_X\left(g_i^{-1}(y)\right) \left| \frac{d}{dy} g_i^{-1}(y) \right|

which basically sums the previous formula over each interval.

2.1.3 Probability integral transformation

If X has a cdf F_X and we set Y = F_X(X), then Y is uniformly distributed on (0, 1). This can be understood intuitively: let F_X(x) = y. Then P(Y ≤ y) = P(X ≤ x) = F_X(x) = y. This of course assumes monotonicity on F_X's part, which is not always true, but that can be handled technically.

2.2 Expected Values

This section discusses the definition of expected values and their properties. The linearity of expectation arises from integration, which is how EVs are defined:

    E(A + B) = EA + EB

regardless of whether A and B are independent or not. When you want to transform a random variable and take its expected value, you can calculate the EV directly from the definition. Otherwise, you can transform the pdf using the strategy above and go from there.

2.3 Moments and Moment Generating Functions

Definitions:

• The nth moment of X, µ_n', is defined by E(X^n).
• The nth central moment of X, µ_n, is defined by E((X − µ)^n).
• The moment generating function of X is defined as M_X(t) = E e^{tX} = \int e^{tx} f_X(x)\, dx.
• The variance Var X is defined as the second central moment.

Of course, we also have the identity Var X = EX² − (EX)². A moment generating function can be used to generate moments:

    \frac{d^n}{dt^n} M_X(t) \Big|_{t=0} = E X^n
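As a quick sanity check of these identities, the sketch below (assuming sympy is available) differentiates the mgf of an exponential distribution with scale β, M_X(t) = 1/(1 − βt), and recovers its first two moments and variance.

```python
# Sanity check: moments of an Exponential(beta) distribution via its mgf.
# Assumes sympy is installed; M_X(t) = 1/(1 - beta*t) for t < 1/beta.
import sympy as sp

t, beta = sp.symbols("t beta", positive=True)
mgf = 1 / (1 - beta * t)

first_moment = sp.diff(mgf, t, 1).subs(t, 0)    # E[X]
second_moment = sp.diff(mgf, t, 2).subs(t, 0)   # E[X^2]
variance = sp.simplify(second_moment - first_moment**2)

print(first_moment)    # beta
print(second_moment)   # 2*beta**2
print(variance)        # beta**2, matching Var X = E X^2 - (E X)^2
```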
Moment generating functions can also be used to identify distributions: if two distributions have the same mgf and all moments exist, they are the same distribution. The book discusses some theorems about convergence of mgfs to another known mgf, used to prove convergence of the corresponding distributions. This seems to be of more theoretical than practical importance.

3 Common Families of Distribution

3.1 Discrete Distributions

Discrete Uniform Distribution The simplest of sorts.

• P(X = x|N) = 1/N, x ∈ {1, 2, 3, ..., N}
• EX = (N+1)/2, Var X = (N+1)(N−1)/12

Hypergeometric Distribution Say we have a population of size N, of which M have a desired property. We take a sample of size K (typically K ≪ M). What is the probability that x of those have the property?

• The pmf is derived from counting principles; the EV and Var are derived in a manner similar to the binomial distribution: rewrite the sum as a sum of hypergeometric pmfs with a smaller parameter set, which equals 1.
• P(X = x|N, M, K) = \binom{M}{x}\binom{N−M}{K−x} / \binom{N}{K}
• EX = KM/N, Var X = \frac{KM}{N} \cdot \frac{(N−M)(N−K)}{N(N−1)}

Binomial Distribution

• P(X = x|n, p) = \binom{n}{x} p^x (1−p)^{n−x}
• EX = np, Var X = np(1−p)

Poisson Distribution

• P(X = x|λ) = e^{−λ} λ^x / x!
• EX = Var X = λ
• When a binomial distribution's p is very small, the distribution can be reasonably approximated by a Poisson distribution with λ = np. (The proof uses mgfs.)

Negative Binomial Distribution What if we want to know the number of Bernoulli trials required to get r successes? Put another way, we are interested in the number of failures before the rth success.

• P(Y = y) = (−1)^y \binom{−r}{y} p^r (1−p)^y = \binom{r+y−1}{y} p^r (1−p)^y
• EY = r(1−p)/p (a simple proof: you can find the expected number of failures before each success and sum them, because linearity. Woohoo!)
• Var Y = r(1−p)/p² = µ + µ²/r
• The negative binomial family includes the Poisson as a limiting case (the Poisson is also related to the binomial, so this seems natural), but this doesn't seem to have much practical significance.

Geometric Distribution A special case of the negative binomial distribution with r = 1.

• P(X = x) = p(1−p)^{x−1}
• EX = 1/p, Var X = (1−p)/p²
• Memoryless property: the history so far has no influence on what happens from now on, so P(X > s|X > t) = P(X > s−t).

3.2 Continuous Distributions

Uniform

Gamma A highly generic and versatile family of distributions.

• A primer on the gamma function Γ:
  – Γ(α) = \int_0^∞ t^{α−1} e^{−t}\, dt, which also has a closed form when α ∈ N.
  – The gamma function serves as a generalization of the factorial, because Γ(α+1) = αΓ(α) if α > 0. (This makes the gamma function easy to evaluate when we know its values between 0 and 1.)
  – We have Γ(α) = (α−1)! for integers.
• The full gamma family has two parameters: α (the shape parameter, which defines peakedness) and β (the scale parameter, which determines spread).
• The pdf is

    f(x|α, β) = \frac{1}{Γ(α)\, β^α} x^{α−1} e^{−x/β}

• EX = αβ, Var X = αβ²
• Being a generic distribution, it is related to many other distributions. Specifically, many distributions are special cases of the gamma.
  – P(X ≤ x) = P(Y ≥ α) where Y ∼ Poisson(x/β), for integer α (checked numerically below).
  – When α = p/2 (p an integer) and β = 2, it becomes the chi-squared pdf with p degrees of freedom.
  – When we set α = 1, it becomes the exponential pdf with scale parameter β. (The exponential distribution is a continuous cousin of the geometric distribution.)
  – It is also related to the Weibull distribution, which is useful for analyzing failure time data and modeling hazard functions.
• Applications: mostly related to lifetime testing, etc.
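A quick numerical check of the gamma-Poisson relationship listed above (assuming scipy is available; the particular numbers are my own): for integer shape α, P(X ≤ x) for X ∼ gamma(α, β) should equal P(Y ≥ α) for Y ∼ Poisson(x/β).

```python
# Verify P(X <= x) = P(Y >= alpha) for X ~ Gamma(alpha, beta), Y ~ Poisson(x/beta),
# when alpha is a positive integer.
from scipy import stats

alpha, beta, x = 3, 2.0, 5.0

gamma_cdf = stats.gamma.cdf(x, a=alpha, scale=beta)
poisson_tail = 1.0 - stats.poisson.cdf(alpha - 1, mu=x / beta)   # P(Y >= alpha)

print(gamma_cdf, poisson_tail)   # both roughly 0.456, agreeing to floating point error
```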
Normal Without doubt, the most important distribution.

• pdf:

    f(x|µ, σ²) = \frac{1}{\sqrt{2π}\, σ} e^{−(x−µ)²/(2σ²)}

• Proving that the pdf integrates to 1 is kind of clunky.
• EX = µ, Var X = σ².
• The 68%-95%-99.7% rule.
• Thanks to the CLT, the normal distribution pops up everywhere. One example is using the normal to approximate binomials. If p is not extreme and n is large, the normal gives a good approximation.
  – Continuity correction: say Y ∼ binomial(n, p) and X ∼ n(np, np(1−p)). We can say P(a ≤ Y ≤ b) ≈ P(a ≤ X ≤ b). However, P(a − 1/2 ≤ X ≤ b + 1/2) is a much better approximation, which is clear from a graphical representation.

Beta One of the rare distributions with domain [0, 1]. Can be used to model proportions.

• Given that B(α, β) = \int_0^1 x^{α−1} (1−x)^{β−1}\, dx,

    f(x|α, β) = \frac{1}{B(α, β)} x^{α−1} (1−x)^{β−1}

• Since the beta function is related to the gamma function, B(α, β) = Γ(α)Γ(β)/Γ(α+β), the distribution is related to the gamma as well.
• It can take varying shapes depending on the parameters: unimodal, monotonic, u-shaped, uniform, etc.
• EX = α/(α+β), Var X = αβ / ((α+β)²(α+β+1))

Cauchy A symmetric, bell-shaped curve with an undefined expected value. Therefore, it is mostly used as an extreme case against which we test conjectures. However, it pops up in unexpected circumstances; for example, the ratio of two normal variables follows the Cauchy distribution.

    f(x|θ) = \frac{1}{π} \cdot \frac{1}{1 + (x−θ)²}

Lognormal The log of which is normally distributed. Its shape looks similar to the gamma distribution's.

• f(x|µ, σ²) = \frac{1}{\sqrt{2π}\, σ x} e^{−(\log x − µ)²/(2σ²)}
• EX = e^{µ + σ²/2}, Var X = e^{2(µ+σ²)} − e^{2µ+σ²}

Double Exponential Formed by reflecting the exponential distribution around its mean.

3.3 Exponential Families of Distribution

A family of pdfs or pmfs is an exponential family if it can be expressed as

    f(x|θ) = h(x)\, c(θ) \exp\left( \sum_i w_i(θ)\, t_i(x) \right)

Many common families are exponential: normal, gamma, beta, binomial, Poisson, etc. This form has some algebraic advantages: there are identities that provide a shortcut for calculating the first two moments (EV and Var) with differentiation instead of summation/integration! I will not reproduce the actual formulas because they are kind of messy. The section also discusses the natural parameter set of an exponential family, obtained by treating the w_i(θ) in the above definition as free parameters. The definitions of full and curved exponential families are also introduced; I am not sure about their practical significance.

3.4 Scaling and Location Families

If f(x) is a pdf, then the following function is a pdf as well:

    g(x|µ, σ) = \frac{1}{σ} f\left( \frac{x−µ}{σ} \right)

where µ is the location parameter and σ is the scale parameter. The expected value gets translated accordingly, and the variance grows by a factor of σ².

3.5 Inequalities and Identities

Chebychev Inequality If g(x) is a nonnegative function,

    P(g(X) ≥ r) ≤ \frac{E\, g(X)}{r}

which looks kind of arbitrary (the units do not match) but is useful sometimes: it can provide useful bounds. For example, let g(x) = ((x−µ)/σ)², the number of standard deviations squared (the square is there to make g nonnegative). Letting r = 2² yields

    P\left( \left( \frac{X−µ}{σ} \right)^2 ≥ 4 \right) ≤ \frac{1}{4} E\left[ \frac{(X−µ)^2}{σ^2} \right] = \frac{1}{4}

an upper bound on the probability of landing two or more standard deviations from the mean! This did not make any assumption about the distribution of X.

Stein's Lemma With X ∼ n(θ, σ²) and g differentiable and satisfying E|g′(X)| < ∞,

    E[g(X)(X − θ)] = σ² E\, g′(X).

Seems obscure, but it is useful for calculating higher-order moments. Also, Wikipedia notes that it is useful in MPT.
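A Monte Carlo sketch of Stein's lemma (assuming numpy; g(x) = x³ is my own choice): both sides of the identity should agree up to simulation noise.

```python
# Monte Carlo check of Stein's lemma for g(x) = x^3, so g'(x) = 3x^2.
import numpy as np

rng = np.random.default_rng(0)
theta, sigma, n = 1.0, 2.0, 2_000_000
x = rng.normal(theta, sigma, size=n)

lhs = np.mean(x**3 * (x - theta))      # E[g(X)(X - theta)]
rhs = sigma**2 * np.mean(3 * x**2)     # sigma^2 * E[g'(X)]

print(lhs, rhs)   # both close to 60 for these parameter values
```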
4 Multiple Random Variables

4.1 Joint and Marginal Distributions

Mostly trivial stuff. We extend the notion of pmfs/pdfs by adding more variables, which gives us joint pdfs/pmfs. Marginal distributions (the distribution of a subset of the variables, without reference to the others) are introduced. Intuitively, they are compressed versions of the joint pdf/pmf, obtained by integrating it over a subset of the variables.

4.2 Conditional Distributions and Independence

Intuitively, conditional distributions are sliced versions of joint pdfs/pmfs. Some of the random variables are observed; what is the distribution of the remaining variables given the observation? The derivation of the conditional pmf is straightforward for discrete RVs: it is the ratio of the joint pmf and the marginal pmf. This relationship, somewhat surprisingly, holds true for continuous RVs as well.

Two variables X and Y are said to be independent when f(x, y) = f_X(x) f_Y(y). Actually, the converse is true as well: if the joint pdf can be decomposed into a product of two functions, one of x and one of y, they are independent. Consequences of independence:

• E(g(X) h(Y)) = E g(X) · E h(Y)
• The covariance is 0.
• You can get the mgf of their sum by multiplying the individual mgfs. This can be used to derive the formula for adding two normal variables.

4.3 Bivariate Transformations

This section mainly discusses strategies for taking transformations (sums, products, quotients) of two random variables. It is analogous to Section 2.1, where we discussed transformations of a single variable.

Problem: you have a random vector (X, Y) and want to know about U = f(X, Y).

Strategy: we have a recipe for transforming a bivariate vector into another, so we transform (X, Y) into (U, V) and take the marginal pdf to get the distribution of U. V is chosen so that it is easy to back out X and Y from U and V, which is essential in the recipe below.

Recipe: basically similar to the transformation recipe in 2.1, except that the derivative of the inverse function is replaced by the Jacobian of the transformation, the determinant of the matrix of partial derivatives:

    J = \frac{∂x}{∂u}\frac{∂y}{∂v} − \frac{∂x}{∂v}\frac{∂y}{∂u}

Given this, we have

    f_{U,V}(u, v) = f_{X,Y}(h_1(u, v), h_2(u, v))\, |J|

where g_1(x, y) = u, g_2(x, y) = v, h_1(u, v) = x, h_2(u, v) = y. Like the formula in 2.1, this assumes the transformation is one-to-one, so the inverse exists. When this assumption breaks, we can use the same trick as in 2.1 and break the domain into sets such that the transformation is one-to-one on each of them:

    f_{U,V}(u, v) = \sum_{i=1}^{k} f_{X,Y}(h_{1i}(u, v), h_{2i}(u, v))\, |J_i|

which is the formula in 4.3.6.

4.4 Hierarchical Models and Mixture Distributions

Hierarchical models arise when we model a distribution whose parameter has its own distribution. This is sometimes useful for gaining a deeper understanding of how things work. As an example, say an insect lays a large number of eggs following a Poisson distribution, and each individual egg's survival is a Bernoulli trial. Then the number of surviving insects is modeled by

    X|Y ∼ binomial(Y, p), \qquad Y ∼ Poisson(λ).

This section also introduces a trivial but useful equality:

    EX = E(E(X|Y))

This is very intuitive if you think about it, but realizing it makes calculations very easy sometimes. A noncentral chi-squared distribution is given as an example. A formula for the variance in a hierarchical model is also given:

    Var X = E(Var(X|Y)) + Var(E(X|Y))
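A simulation sketch of the egg example (assuming numpy; the parameter values are my own): the empirical mean should match E(E(X|Y)) = pλ and the empirical variance should match the two-term formula above.

```python
# Hierarchical model: Y ~ Poisson(lambda), X | Y ~ binomial(Y, p).
import numpy as np

rng = np.random.default_rng(1)
lam, p, n = 10.0, 0.3, 1_000_000

y = rng.poisson(lam, size=n)    # number of eggs laid
x = rng.binomial(y, p)          # number of surviving eggs

print(x.mean(), p * lam)                               # E X = E(E(X|Y)) = 3.0
print(x.var(), lam * p * (1 - p) + p**2 * lam)         # E Var(X|Y) + Var E(X|Y) = 3.0
```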
4.5 Covariance and Correlation

This section introduces covariances and correlations. Important stuff, but it only gives a rather basic treatment. Definitions and identities:

• Cov(X, Y) = E[(X − µ_X)(Y − µ_Y)] = EXY − µ_X µ_Y
• ρ_{XY} = Cov(X, Y) / (σ_X σ_Y)
• Var(aX + bY) = a² Var X + b² Var Y + 2ab Cov(X, Y)

This section also introduces bivariate normal distributions.

4.6 Multivariate Distributions

A strategy for transforming a vector of random variables is introduced. Since the Jacobian is well defined for larger matrices, the recipe is more or less the same as in the bivariate case.

4.7 Inequalities and Identities

Analogous to chapter 3, we have a section devoted to inequalities and identities. Apparently, all of these inequalities have many forms and are applied in different contexts. Many of them have popped up in the CvxOpt course as well. (It looks like most of the primitive forms come from mathematical analysis.)

Holder's Inequality Let p, q be positive real numbers such that 1/p + 1/q = 1, and let X, Y be RVs. Then

    |EXY| ≤ E|XY| ≤ (E|X|^p)^{1/p} (E|Y|^q)^{1/q}

Cauchy-Schwarz Inequality The special case of Holder's inequality with p = q = 2:

    |EXY| ≤ E|XY| ≤ \sqrt{E|X|^2\, E|Y|^2}

In vector terms, it means x · y ≤ |x||y|. This is intuitive, as taking an inner product gives things a chance to cancel each other out, much like the triangle inequality. Also notable: this can be used to bound the range of the correlation; just take |E(X − µ_X)(Y − µ_Y)| and apply the CS inequality. Squaring each side gives

    (Cov(X, Y))² ≤ σ_X² σ_Y²

Minkowski's Inequality This feels like an additive version of Holder's inequality (for p ≥ 1):

    [E|X + Y|^p]^{1/p} ≤ [E|X|^p]^{1/p} + [E|Y|^p]^{1/p}

Jensen's Inequality On convex functions: given a convex function g,

    E g(X) ≥ g(EX)

5 Properties of a Random Sample

This chapter deals with several things:

• the definition of a random sample,
• the distributions of functions of a random sample (statistics) and how they converge as we increase the sample size,
• and of course, the LLN and the CLT,
• generating random samples.

5.1 Basic Concepts of Random Samples

A random sample X_1, ..., X_n is a set of independent and identically distributed (iid) RVs. This means we are sampling with replacement from an infinite population. This assumption doesn't always hold, but it is a good approximation in a lot of cases.

5.2 Sums of Random Variables from a Random Sample

Sums of random variables can of course be calculated using the transformation strategies from Chapter 4. However, since the random variables are iid, the calculation can be simplified greatly. Also, a definition: given a vector- or scalar-valued statistic Y = T(X_1, X_2, ...), the distribution of Y is called the sampling distribution of Y.

5.2.1 Basic Statistics

The two most basic statistics are introduced. The sample mean X̄ is defined as

    \bar{X} = \frac{1}{n} \sum_i X_i

The sample variance S² is defined as

    S^2 = \frac{1}{n−1} \sum_i (X_i − \bar{X})^2

The formula for the sample variance raises some questions: why n − 1, not n? This was chosen so that ES² = σ². If we were to use n, S² would be biased towards 0.
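A quick simulation (assuming numpy) of the point just made: dividing by n − 1 makes the sample variance unbiased, while dividing by n underestimates σ² on average.

```python
# Compare the n-1 (unbiased) and n (biased) versions of the sample variance.
import numpy as np

rng = np.random.default_rng(2)
sigma2, n, reps = 4.0, 5, 200_000

samples = rng.normal(0.0, np.sqrt(sigma2), size=(reps, n))
s2_unbiased = samples.var(axis=1, ddof=1)   # divide by n - 1
s2_biased = samples.var(axis=1, ddof=0)     # divide by n

print(s2_unbiased.mean())   # ~4.0
print(s2_biased.mean())     # ~3.2, i.e. (n-1)/n * sigma^2
```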
5.2.2 Bessel's Correction

Using n − 1 instead of n as the denominator in the sample variance is called Bessel's correction. Its legitimacy can be proven by computing ES² and seeing that it equals σ², but the Wikipedia page offers some intuition. By taking the sample mean, we are minimizing the squared error of the samples. So unless the sample mean equals the population mean, the sample variance must be smaller than the variance measured around the population mean. Put more intuitively, the sample mean skews towards whatever sample was observed, so the measured variance will be smaller than the real one.

Another subtle but astounding point mentioned is that, by Jensen's inequality, S is a biased estimator of the standard deviation: it underestimates! Since the square root is a concave function, Jensen's inequality gives

    E S = E\sqrt{S^2} ≤ \sqrt{E S^2} = \sqrt{σ^2} = σ

Also, it is noted that there is no general formula for an unbiased estimator of the standard deviation!

5.2.3 Sampling Distributions of Sample Mean and Sample Variance

• E X̄ = µ: the expected value of the sample mean is the population mean. The LLN will state that it converges to the population mean almost surely as the sample size grows.
• Var X̄ = σ²/n: this follows from the independence of the random variables:

    Var\left( \frac{1}{n} \sum_i X_i \right) = \frac{1}{n^2} \sum_i Var\, X_i = \frac{1}{n^2} \cdot n σ^2 = \frac{σ^2}{n}

  So the variance decreases linearly with the sample size.
• ES² = σ². This can be shown algebraically.

5.2.4 Using mgfs to Find Sampling Distributions

We can plug the statistic's definition into the mgf and simplify, praying that the mgf takes a recognizable form. For example, this is how we show that the mean of iid normal variables X_i ∼ n(µ, σ²) follows n(µ, σ²/n).

5.3 Sampling From the Normal Distribution

More notable facts when we sample from the normal distribution:

• X̄ and S² are independent random variables.
• X̄ has an n(µ, σ²/n) distribution (proven by mgfs, as noted above).
• (n−1)S²/σ² has a chi-squared distribution with n − 1 degrees of freedom.

5.3.1 Student's t-distribution

If X_i ∼ n(µ, σ²), we know that

    \frac{\bar{X} − µ}{σ/\sqrt{n}} ∼ n(0, 1)

However, in most cases µ and σ are unknown parameters, which makes it hard to make inferences about either of them. We can approximate σ by S: this gives us another distribution, but it makes it easier to make inferences about µ. The statistic

    \frac{\bar{X} − µ}{S/\sqrt{n}}

is known to follow Student's t-distribution with n − 1 degrees of freedom. The concrete distribution can be found using the independence of X̄ and S and their respective distributions (normal and chi-squared).

5.3.2 Fisher's F-distribution

Now we are comparing variances between two samples. One way to look at this is the variance ratio. There are two variance ratios: one for the samples, one for the populations. Of course, we don't know the population variance ratio. However, the ratio between the two ratios (I smell recursion)

    \frac{S_X^2 / S_Y^2}{σ_X^2 / σ_Y^2} = \frac{S_X^2 / σ_X^2}{S_Y^2 / σ_Y^2}

is known to follow the F-distribution. The distribution is found by noting that, in the second representation above, both the numerator and the denominator follow (scaled) chi-squared distributions.

5.4 Order Statistics

The order statistics of a random sample are the sorted sample values. This seems obscure, but it is indeed useful when point estimation depends on the minimum/maximum value observed. Their pmf/pdf can be derived by noticing that:

• Say we want to find P(X_{(j)} ≤ x_i), where X_{(j)} is the jth smallest value.
• Each random variable is now a Bernoulli trial, with probability F_X(x_i) of being ≤ x_i.
• So the cumulative functions can be derived from a binomial distribution over the variables (for the discrete case) or something similar (continuous).

    f_{X_{(j)}}(x) = \frac{n!}{(j−1)!\,(n−j)!} f_X(x)\, [F_X(x)]^{j−1}\, [1 − F_X(x)]^{n−j}
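A Monte Carlo sketch of the order-statistic density (assuming numpy; the example is my own): for a Uniform(0, 1) sample, the formula above reduces to a beta(j, n − j + 1) density, so the empirical mean and variance of the jth order statistic should match that beta distribution.

```python
# The j-th order statistic of n Uniform(0,1) draws follows Beta(j, n-j+1),
# which is what the general order-statistic density reduces to in this case.
import numpy as np

rng = np.random.default_rng(3)
n, j, reps = 10, 3, 100_000

samples = np.sort(rng.uniform(size=(reps, n)), axis=1)
x_j = samples[:, j - 1]                      # 3rd smallest of each sample

a, b = j, n - j + 1                          # Beta(3, 8) parameters
print(x_j.mean(), a / (a + b))               # ~0.2727 both
print(x_j.var(), a * b / ((a + b) ** 2 * (a + b + 1)))   # ~0.0165 both
```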
5.5 Convergence Concepts

The main section of the chapter. It deals with how the sampling distributions of statistics change when we send the sample size to infinity.

5.5.1 Definitions

Say we have a sequence of random variables {X_n}, X_n being the statistic's value when we have n variables. We study the behavior of lim_{n→∞} X_n. There are three types of convergence:

Convergence in probability: for any ε > 0,

    \lim_{n→∞} P(|X_n − X| ≥ ε) = 0

Almost sure convergence:

    P\left( \lim_{n→∞} |X_n − X| ≥ ε \right) = 0

which is a much stronger guarantee and implies convergence in probability.

Convergence in distribution: when the mgfs converge, so the distributions converge to the target distribution. This is equivalent to convergence in probability when the target distribution is a constant.

5.5.2 Law of Large Numbers

Given a random sample {X_i} where each RV has a finite mean, the sample mean converges almost surely to µ, the population mean. There is a weak variant of the LLN, which states that it converges in probability.

5.5.3 Central Limit Theorem

The magical theorem! When we have a sequence of iid RVs {X_i} with E X_i = µ and Var X_i = σ² > 0, then

    \frac{\bar{X}_n − µ}{σ/\sqrt{n}} \to n(0, 1) \quad \text{in distribution as } n → ∞

which also means that X̄_n is approximately n(µ, σ²/n) for large n.

Another useful and intuitive theorem is mentioned, Slutsky's theorem: if Y_n converges to a constant a in probability, and X_n converges to X in distribution, then

• X_n Y_n → aX in distribution
• X_n + Y_n → a + X in distribution

Slutsky's theorem is used to prove that the normal approximation with estimated variance also goes to the standard normal. We know

    \frac{\bar{X}_n − µ}{σ/\sqrt{n}} \to n(0, 1) \quad \text{in distribution}

and we can prove that σ/S → 1 in probability. Multiplying the two yields

    \frac{\bar{X}_n − µ}{S/\sqrt{n}} \to n(0, 1) \quad \text{in distribution}

Marvelous!

5.5.4 Delta Method

The delta method is a generalized version of the CLT. Multiple versions of it are discussed; here I will only state the most obvious univariate case:

    \text{if } Y_n \text{ is approximately } n\left(µ, \frac{σ^2}{n}\right)\text{, then } g(Y_n) \text{ is approximately } n\left(g(µ), \frac{g′(µ)^2 σ^2}{n}\right)

When Y_n converges to µ with a normal distribution as we increase n with virtual certainty, a function of Y_n converges to a normal distribution as well. In the limiting case both are virtually constants, so this doesn't surprise me much.

6 Principles of Data Reduction

This chapter seems to be highly theoretical. Without much background, it is hard to say why this material is needed. However, it looks like this chapter is closely related to the next chapter, on point estimation. That makes sense, because estimating point parameters is essentially data reduction.

6.1 The Sufficiency Principle

6.1.1 Sufficient Statistic

Colloquially, a sufficient statistic captures all the information regarding a parameter θ in a sample x; there is no remaining information to be obtained by consulting the actual sample. Formally, a statistic T(X) is a sufficient statistic if the conditional distribution P_θ(X|T(X)) does not depend on θ.

The practical upshot is that this justifies reporting only the means and standard deviations of a given sample; if we assume the population is normal, they are sufficient statistics containing all the information we can infer about the population. However, remember that these are model dependent; the population might be coming from a different family with different parameters, which might not be entirely captured by the statistic.

6.1.2 Factorization Theorem

The definition could be used directly to verify whether a given statistic is sufficient, but in practice the following theorem makes it easier to identify sufficient statistics. If the joint pdf f(x|θ) can be factored into two parts,

    f(x|θ) = g(T(x)|θ) \cdot h(x)

where h(x) does not depend on θ, then T(X) is a sufficient statistic. This makes intuitive sense: if T is sufficient, the probability of seeing x is related to only two things, a function of the statistic (given θ) and a function unrelated to θ. If there were a part involving another input, we wouldn't be able to make inferences about θ with only T.
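A tiny numerical illustration of sufficiency (my own example, not the book's): for iid Bernoulli(p) data, T = ΣX_i is sufficient, so the conditional probability of any particular 0/1 sequence given its total does not depend on p.

```python
# P(X = x | T = t) = p^t (1-p)^(n-t) / (C(n,t) p^t (1-p)^(n-t)) = 1 / C(n,t):
# the dependence on p cancels, illustrating that T = sum(x) is sufficient.
from math import comb

def conditional_prob(x, p):
    n, t = len(x), sum(x)
    joint = p**t * (1 - p) ** (n - t)                   # P(X = x)
    marginal = comb(n, t) * p**t * (1 - p) ** (n - t)   # P(T = t)
    return joint / marginal

x = [1, 0, 1, 1, 0]
print(conditional_prob(x, 0.2), conditional_prob(x, 0.9), 1 / comb(5, 3))
# all three are equal: the conditional distribution given T is free of p
```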
6.1.3 Minimal Sufficient Statistics

There are lots of sufficient statistics, some more compact than others. So what is a minimal sufficient statistic? A sufficient statistic T(X) is minimal when, for any other sufficient statistic T′(X), T(X) is a function of T′(X). In other words, take any sufficient statistic and we can use it to derive a minimal sufficient statistic. For example, the sample mean x̄ is a sufficient statistic for the population mean µ. On the other hand, the sample itself is a sufficient statistic but not minimal: you can derive x̄ from the sample, but not vice versa.

6.1.4 Ancillary Statistics

An ancillary statistic contains no information about θ. Paradoxically, an ancillary statistic, when used in conjunction with other statistics, does contain valuable information about θ. The following is a good example. Let X have the discrete distribution

    P_θ(X = θ) = P_θ(X = θ + 1) = P_θ(X = θ + 2) = \frac{1}{3}

Now, say we observe a sample and take the range of the sample, R = X_{(N)} − X_{(1)}. It is obvious that the range itself has no information regarding θ. However, say we know the mid-range statistic as well, M = (X_{(1)} + X_{(N)})/2. The mid-range itself can be used to guess θ, but if you combine it with the range, suddenly you can nail the exact θ when R = 2.

Of course, in the above case the statistic M was not sufficient. What happens when we have a minimal sufficient statistic? Intuition says they should be independent, but that need not be the case. In fact, the pair (X_{(1)}, X_{(N)}) is minimal sufficient and R is closely related to it! So they are not independent at all.

6.1.5 Complete Statistics

For many important situations, however, the minimal sufficient statistic is indeed independent of ancillary statistics. The notion of a complete statistic is introduced, but the definition is very far from intuitive. It goes like this: let T(X) ∼ f(t|θ). If

    E_θ\, g(T) = 0 \text{ for all } θ \implies P_θ(g(T) = 0) = 1 \text{ for all } θ,

then T(X) is called a complete statistic. Colloquially, T(X) has to be uncorrelated with all unbiased estimators of 0. So WTF does this mean at all? It looks like we will get more context in the next chapter. The most intuitive explanation I could get was from here. It goes like: intuitively, if a nontrivial function of T has a mean value that does not depend on θ, that mean value is not informative about θ, and we could get rid of it to obtain a simpler sufficient statistic. Hopefully reading chapter 7 will give me more background and intuition so I can revisit this.

6.2 The Likelihood Principle

The likelihood function is defined to be

    L(θ|x) = f(x|θ)

We can use likelihood functions to compare the plausibility of various parameter values. Say we don't know the parameter and we observed x. If for two parameter values θ_1 and θ_2 we have L(θ_1|x) > L(θ_2|x), we can say θ_1 is more plausible than θ_2. Note we used "plausible" rather than "probable".

The likelihood principle is an important principle used in later chapters on inference: if x and y are two sample points such that L(θ|x) is proportional to L(θ|y), that is, there is a constant C(x, y) such that L(θ|x) = C(x, y) L(θ|y), then the conclusions drawn from x and y should be identical.
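A tiny sketch of comparing plausibility with the likelihood function (the numbers are my own): given x = 7 successes in 10 binomial trials, compare L(0.5|x) with L(0.7|x).

```python
# Likelihood of theta given x successes out of n binomial trials.
from math import comb

def likelihood(theta, x, n):
    return comb(n, x) * theta**x * (1 - theta) ** (n - x)

n, x = 10, 7
print(likelihood(0.5, x, n))   # ~0.117
print(likelihood(0.7, x, n))   # ~0.267: theta = 0.7 is the more plausible of the two
```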
7 Point Estimation

This chapter deals with estimating parameters of a population by looking at random samples. The first section discusses strategies for finding estimators; the second section deals with ways to evaluate the estimators.

7.1 Methods of Finding Estimators

7.1.1 Method of Moments for Finding Estimator

Basically, you can equate expected values of arbitrary functions of the random sample with their realized values to estimate parameters. The method of moments uses the first k sample moments to achieve this. Suppose we have X_i ∼ n(θ, σ²). The first moment is θ, and the second (uncentered) moment is θ² + σ². So we have the system of equations

    \bar{X} = θ, \qquad \frac{1}{n} \sum_i X_i^2 = θ^2 + σ^2

On the left-hand side are concrete values from the random sample, so solving this is trivial. Also note that solving the above for σ gives

    \tilde{σ}^2 = \frac{1}{n} \sum_i X_i^2 − \bar{X}^2 = \frac{1}{n} \sum_i (X_i − \bar{X})^2

which does not have Bessel's correction. This is an example where the method of moments comes up short; it is a highly general method you can fall back on, but in general it produces relatively inferior estimators. (Footnote: however, this is still a maximum-likelihood estimate.)

Interesting Example: Estimating Both Parameters

Say we have X_i ∼ binomial(k, p) where both k and p are unknown. This might be an unusual setting, but here's an application: modeling crime rates, where you know neither the actual number of crimes committed k nor the reporting rate p. In this setting it is hard to come up with any intuitive formulas, so we could use the method of moments as a baseline. Equating the first two moments, we get

    \bar{X} = kp, \qquad \frac{1}{n} \sum_i X_i^2 = kp(1−p) + k^2 p^2

Solving this gives us an approximation for k:

    \tilde{k} = \frac{\bar{X}^2}{\bar{X} − \frac{1}{n} \sum_i (X_i − \bar{X})^2}

which is a feat in itself!

7.1.2 Maximum Likelihood Estimation

MLE is by far the most popular technique for deriving estimators. There are multiple ways to find MLEs: classic calculus differentiation is one, exploiting model properties is two, and maximizing numerically with computer software is three. A couple of things worth noting: using the log-likelihood sometimes makes calculations much easier, so it's a good idea to try taking logs. Another thing is that the MLE can be unstable; if the inputs change slightly, the MLE can change dramatically. (I wonder if there's any popular regularization scheme for MLEs. Maybe you can still use ridge-style regularization mixed with cross validation.)

There's a very useful property of maximum likelihood estimators called the invariance property: if θ̂ is an MLE for θ, then r(θ̂) is an MLE for r(θ). This is very useful to have, to say the least.

7.1.3 Bayes Estimators

The Bayesian approach to statistics is fundamentally different from the classical frequentist approach, because it treats the parameters as having distributions themselves. In this approach, we have a prior distribution for θ, which is our belief before seeing the random sample. After we see some samples from the population, our belief is updated to the posterior distribution following this formula:

    π(θ|x) = \frac{f(x|θ)\, π(θ)}{\int f(x|θ)\, π(θ)\, dθ}

which is just the Bayes formula.

Conjugate Families In general, for any sampling distribution there is a natural family of prior distributions, called the conjugate family. For example, the binomial family's conjugate family is the beta family. Say we have a binomial(n, p) distribution where only p is unknown. If our prior distribution for p is beta(α, β), then after observing y successes our posterior distribution updates to beta(y + α, n − y + β). So classy and elegant! However, it is debatable whether a conjugate family should always be used. Also, normals are their own conjugates.
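A numerical check of the beta-binomial update (assuming scipy; the prior and data are my own choices): compute the posterior by brute force on a grid and compare it with the closed-form beta(y + α, n − y + β) density.

```python
# Grid-based posterior for binomial data with a Beta prior, compared against
# the closed-form conjugate update Beta(y + alpha, n - y + beta).
import numpy as np
from scipy import stats

alpha, beta, n, y = 2.0, 3.0, 20, 13

p = np.linspace(1e-6, 1 - 1e-6, 2001)
prior = stats.beta.pdf(p, alpha, beta)
like = stats.binom.pmf(y, n, p)
posterior = prior * like
posterior /= posterior.sum() * (p[1] - p[0])     # normalize numerically

closed_form = stats.beta.pdf(p, y + alpha, n - y + beta)
print(np.max(np.abs(posterior - closed_form)))   # small: the two answers agree
```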
7.1.4 The EM Algorithm

The EM algorithm is useful when the likelihood function cannot be maximized directly, due to missing observations or latent variables that are not observed. The setup goes like this: given a parameter set θ, there are two random variables x and y, of which only y is observed. The marginal distribution of y, g(y|θ), is unknown, but we do know f(y, x|θ). Since x is not observed (the data is incomplete), it is hard to estimate θ from the given y. The EM algorithm solves this with an iterative approach. We start with a guess of the parameter, θ^(0), and the following two steps are repeated:

• Expectation step: given the latest guess for the parameter θ^(i) and the observed y, find the expected x.
• Maximization step: given x and y, find the parameter θ^(i+1) that is most plausible for this pair of observations.

It can be proven that the likelihoods from successive iterations are nondecreasing and will eventually converge.

EM Algorithm, Formally

The above two steps are just colloquial descriptions; the expected x is not actually calculated directly. In the expectation step, the following function is created:

    Q(θ \mid θ^{(t)}) = E_{X|Y, θ^{(t)}} \left[ \log L(θ; X, Y) \right]

Let's see what this function is trying to do. Since we have a guess for the parameter, θ^(t), we can plug it into the joint pdf and find the distribution of the pair (x, y) given the observed y. For each concrete pair of observations, we can calculate the log-likelihood of θ using that observation; Q is simply a weighted average of these log-likelihoods. Q is a function of θ, and it answers this question: suppose (x, y) ∼ f(x, y|θ^(t)); what is the expected log-likelihood for a given θ? The maximization step then puts this function through the argmax operator.
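A minimal sketch of EM in action (assuming numpy; this is my own toy example, not one from the book): fitting the means and mixing weight of a two-component normal mixture with known unit variances. The E-step computes the expected latent labels (responsibilities), and the M-step re-estimates the parameters from them.

```python
# Toy EM for a mixture w*N(mu1, 1) + (1-w)*N(mu2, 1) with latent component labels.
import numpy as np

rng = np.random.default_rng(4)
true_w, true_mu = 0.4, (-2.0, 3.0)
z = rng.uniform(size=5000) < true_w
data = np.where(z, rng.normal(true_mu[0], 1, 5000), rng.normal(true_mu[1], 1, 5000))

def normal_pdf(x, mu):
    return np.exp(-0.5 * (x - mu) ** 2) / np.sqrt(2 * np.pi)

w, mu1, mu2 = 0.5, -1.0, 1.0                 # initial guess theta^(0)
for _ in range(100):
    # E-step: responsibility of component 1 for each observation
    p1 = w * normal_pdf(data, mu1)
    p2 = (1 - w) * normal_pdf(data, mu2)
    r = p1 / (p1 + p2)
    # M-step: weighted estimates maximize Q(theta | theta^(t))
    w = r.mean()
    mu1 = np.sum(r * data) / np.sum(r)
    mu2 = np.sum((1 - r) * data) / np.sum(1 - r)

print(w, mu1, mu2)   # close to 0.4, -2.0, 3.0
```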
7.2 Methods of Evaluating Estimators

This section is much more involved than the previous one. It discusses multiple strategies for evaluating estimators, and also some strategies for improving existing estimators.

7.2.1 Mean Squared Error

A simple, intuitive way of measuring the quality of an estimator W is the MSE,

    E_θ (W − θ)^2

Note that this is a function of the parameter θ. Also, the MSE has a very intuitive decomposition:

    E_θ (W − θ)^2 = Var_θ W + (E_θ W − θ)^2

The first term is the variance of the estimator, while the second term is the square of the bias. Variance represents the precision of the estimator: if the variance is high, even if the estimator works on average, it cannot be trusted. If the bias is high, it will point in the wrong direction. This tradeoff is a topic discussed throughout the chapter.

Quantifying Bessel's Correction We know S² = Σ_i (X_i − X̄)²/(n−1) is an unbiased estimator of σ², thus its bias is 0. However, we also know that \frac{n−1}{n} S² is the maximum likelihood estimator. Which of these is better in terms of MSE? Carrying out the calculation tells you the MLE gives a smaller MSE; so by accepting some bias, we can reduce the variance and the overall MSE. Which estimator we should ultimately use is still debatable.

No Universal Winner Since MSEs are functions of the parameter, it is often the case that the MSEs of two estimators are incomparable: one can outperform the other on part of the parameter space and underperform on another part. In the text, this is shown with an example of estimating p from binomial(n, p) where n is known. Two approaches are showcased: one just uses the MLE Σ X_i / n, the other takes a Bayesian estimate from a constant prior. If you plot the MSE against different values of p, the MSE of the MLE is a quadratic curve, while the MSE of the Bayesian estimate is a horizontal line. Therefore, neither dominates the other; plotting for different n reveals the Bayesian approach gets better as we increase n.

7.2.2 Best Unbiased Estimator

One way of resolving the bias-variance tradeoff is to ignore some of the choices! We can restrict ourselves to unbiased estimators and try to minimize the variance. In this setting, the best estimator is called a uniform minimum variance unbiased estimator (UMVUE). Finding a UMVUE is a hard problem; it is difficult to prove that a given estimator attains the global minimum.

Finding Lower Bounds The Cramer-Rao theorem sometimes helps. It is an application of the Cauchy-Schwarz inequality and provides a lower bound for the variance of an unbiased estimator. If this limit is reached, we can say for sure we have a UMVUE. The bound is indeed attainable for some distributions, such as the Poisson. The big honking problem with the CR theorem is that the bound is not always sharp: sometimes it is not attainable. Since CR is an application of the CS inequality, we can state the condition under which equality holds. This can either give us hints about the shape of the UMVUE, or let us prove the bound is unreachable.

Sufficiency and Completeness for UMVUE Somewhat surprising and arbitrary, but sufficient or complete statistics can be used to improve unbiased estimators, by conditioning the estimator on the sufficient statistic! The Rao-Blackwell theorem states that if W is any unbiased estimator of an arbitrary function τ(θ), the following is a uniformly better unbiased estimator:

    φ(T) = E(W|T)

This was a WTF moment for me. However, note that

    Var_θ W = Var_θ [E(W|T)] + E[Var(W|T)] ≥ Var_θ [E(W|T)]

for any T: you can condition on any statistic, even a totally irrelevant one, and it can never actually hurt. Then why the requirement for sufficiency? If T is not sufficient, the resulting estimator can become a function of θ, so it isn't really an estimator.

Can we go further than that? Can we create a UMVUE from sufficient statistics? We analyze the properties of a best estimator to find out. Suppose we have an estimator W satisfying E_θ W = τ(θ). Also say we have an unbiased estimator of 0, U. Then the estimator φ_a = W + aU is still an unbiased estimator of τ(θ). What is its variance?

    Var_θ φ_a = Var_θ (W + aU) = Var_θ W + a^2 Var_θ U + 2a\, Cov_θ(W, U)

Now, if we could concoct a U with Cov_θ(W, U) < 0, we could choose a to make Var_θ φ_a < Var_θ W. Note that U is essentially random noise for the estimation, yet it could give us an improvement. This doesn't make sense at all. Theorem 7.3.20 brings order by stating that W is the UMVUE iff it is uncorrelated with all unbiased estimators of 0.

Wait, that definition sounds familiar. Recall the definition of complete statistics? Now we have to guess they are somewhat related. They indeed are, by Theorem 7.3.23, which states: let T be a complete sufficient statistic for a parameter θ, and let φ(T) be any estimator based only on T; then φ(T) is the unique best unbiased estimator of its expected value.
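A simulation sketch of the Cramer-Rao bound from the Finding Lower Bounds discussion above being attained (assuming numpy): for iid Poisson(λ), the bound for unbiased estimators of λ is λ/n, and the sample mean achieves it.

```python
# For Poisson(lambda), the Cramer-Rao lower bound for unbiased estimators of
# lambda is lambda / n; the sample mean is unbiased and attains it.
import numpy as np

rng = np.random.default_rng(5)
lam, n, reps = 3.0, 25, 200_000

xbar = rng.poisson(lam, size=(reps, n)).mean(axis=1)

print(xbar.mean())   # ~3.0   (unbiased)
print(xbar.var())    # ~0.12  (empirical variance of the estimator)
print(lam / n)       # 0.12   (Cramer-Rao lower bound)
```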
7.2.3 General Loss Functions

MSE is a special case of a loss function, so we can generalize on that. Absolute error loss and squared error loss are the most common functions, but you can come up with different styles, of course. Evaluating an estimator δ against a given loss function L(θ, a) is done via the risk function:

    R(θ, δ) = E_θ\, L(θ, δ(X))

which is still a function of the parameter. Also note that when we use the squared error loss, R becomes the MSE.

Stein's Loss Function For scale parameters such as σ², the values are bounded below by 0. A symmetric loss function such as squared error therefore penalizes overestimation more heavily, because the error can grow without bound only in that direction. An interesting loss function that tries to overcome this problem is introduced:

    L(σ^2, a) = \frac{a}{σ^2} − 1 − \log \frac{a}{σ^2}

which goes to infinity as a goes to either 0 or infinity. Say we want to estimate σ² using an estimator δ_b = bS² for a constant b. Then

    R(σ^2, δ_b) = E\left[ \frac{bS^2}{σ^2} − 1 − \log \frac{bS^2}{σ^2} \right] = b − \log b − 1 − E \log \frac{S^2}{σ^2}

which is minimized at b = 1.

Bayes Risk Instead of viewing the risk as a function of θ, we can take a prior distribution on θ and compute the expected risk. This takes us to the Bayes risk:

    \int_Θ R(θ, δ)\, π(θ)\, dθ

Finding the estimator δ that minimizes the Bayes risk seems daunting but is tractable. Note:

    \int_Θ R(θ, δ)\, π(θ)\, dθ = \int_Θ \left[ \int_\chi L(θ, δ(x))\, f(x|θ)\, dx \right] π(θ)\, dθ = \int_\chi \left[ \int_Θ L(θ, δ(x))\, π(θ|x)\, dθ \right] m(x)\, dx

The quantity in square brackets in the last expression, the expected loss given an observation, is called the posterior expected loss. It is a function of x, not θ. So we can minimize the posterior expected loss for each x and thereby minimize the Bayes risk. A general recipe for doing this is not available, but the book contains some examples.

8 Hypothesis Testing

This chapter deals with ways of testing hypotheses, which are statements about the population parameter.

8.1 Terminology

In all testing, we use two hypotheses: the null hypothesis H0 and the alternative hypothesis H1, one of which we will accept and the other reject. We can always formulate the hypotheses as two sets: H0 is θ ∈ Θ0 and H1 is θ ∈ Θ0^C. We decide which hypothesis to accept on the basis of a random sample. The subset of the sample space that makes us reject H0 is called the rejection region. Typically, a hypothesis test is performed by calculating a test statistic and rejecting/accepting H0 depending on whether the statistic falls into a specific set.

8.2 Methods of Finding Tests

We discuss three methods of constructing tests.

8.2.1 Likelihood Ratio Tests

Likelihood ratio tests are very widely applicable, and are also optimal in some cases. We can prove the t-test is a special case of the LRT. The LRT compares the maximum likelihood under the null hypothesis with the unrestricted maximum likelihood. The test statistic λ is defined as

    λ(x) = \frac{\sup_{Θ_0} L(θ|x)}{\sup_{Θ} L(θ|x)} = \frac{\sup_{Θ_0} L(θ|x)}{L(\hat{θ}|x)} \quad (\hat{θ} \text{ is the MLE of } θ)

When do we reject H0? We want to reject the null hypothesis when the alternative hypothesis is much more plausible; so the smaller λ is, the more likely we are to reject H0. Therefore the rejection region is {x | λ(x) ≤ c} where 0 ≤ c ≤ 1. Also note that λ based on a sufficient statistic of θ is equivalent to the regular λ.

Example: Normal LRT for testing θ = θ0, known variance Setup: we are drawing a sample from an n(θ, 1) population, with H0: θ = θ0 and H1: θ ≠ θ0.
Since the MLE is x̄, the LRT statistic is

    λ(x) = \frac{f(x|θ_0)}{f(x|\bar{x})} = \text{(some simplification)} = \exp\left[ −n(\bar{x} − θ_0)^2 / 2 \right]

Say we reject H0 when λ ≤ c; we can rewrite this condition as

    |\bar{x} − θ_0| ≥ \sqrt{−2 (\log c) / n}

Therefore, we are simply testing whether the difference between the sample mean and the asserted mean exceeds some positive number.

Example: Normal LRT for testing µ ≤ µ0, unknown variance Setup: we are drawing from n(µ, σ²) where both parameters are unknown, with H0: µ ≤ µ0 and H1: µ > µ0. The unrelated parameter σ² is called a nuisance parameter. It can be shown (exercise 8.37) that the test relying on this statistic is equivalent to Student's t-test.

8.2.2 Bayesian Tests

Bayesian tests obtain the posterior distribution from the prior distribution and the sample, using the Bayesian estimation technique discussed in Chapter 7, and use the posterior distribution to run the test. We might choose to reject H0 only if P(θ ∈ Θ0^C | X) is greater than some large number, say 0.99.

8.2.3 Union-Intersection and Intersection-Union Tests

These are ways of structuring the rejection region when the null hypothesis is not simple, for example when the null hypothesis Θ0 is best expressed as an intersection or a union of multiple sets. If we have tests for each of the individual sets that constitute the union/intersection, how can we test the entire hypothesis?

• When the null hypothesis is an intersection of sets, any individual test rejecting results in the rejection of the overall null hypothesis.
• When the null hypothesis is a union of sets, the overall null hypothesis is rejected only when every individual test rejects.

8.3 Evaluating Tests

How do we assess the goodness of tests?

8.3.1 Types of Errors and Power Function

Since the null hypothesis can be true or not, and we can either accept or reject it, there are four possibilities in a hypothesis test, summarized in the table below:

    Truth    Accept H0        Reject H0
    H0       Correct          Type I Error
    H1       Type II Error    Correct

To reiterate:

• A Type I error incorrectly rejects H0 when H0 is true. Since we usually want to prove the alternative hypothesis, this is a false positive: we assert H1 when it is not true. Often this is the error people want to avoid more.
• A Type II error incorrectly accepts H0 when H1 is true. This is a false negative: the alternative hypothesis is falsely declared negative.

The probabilities of these two errors are characterized by the power function. The power function takes θ as input and calculates the probability that X falls in the rejection region:

    β(θ) = P_θ(X ∈ R)

What is an ideal power function? We want β(θ) to be 0 when θ ∈ Θ0, and 1 when θ ∈ Θ0^C. We can plot β over all possible values of θ and compare different tests.

Example: Normal Power Function Setup: drawing from n(θ, σ²), where σ² is known, with H0: θ ≤ θ0 and H1: θ > θ0. Say we have a test that rejects H0 when

    \frac{\bar{X} − θ_0}{σ/\sqrt{n}} > c

The power function of this test is

    β(θ) = P_θ\left( \frac{\bar{X} − θ_0}{σ/\sqrt{n}} > c \right) = P_θ\left( \frac{\bar{X} − θ}{σ/\sqrt{n}} > c + \frac{θ_0 − θ}{σ/\sqrt{n}} \right) = P\left( Z > c + \frac{θ_0 − θ}{σ/\sqrt{n}} \right)

where Z is a standard normal random variable. As we increase θ from −∞ to ∞, this probability goes from 0 to 1. Changing c changes when we reject H0 and affects the error probabilities.
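A sketch computing curves like those in the figure below (assuming numpy/scipy): the power function β(θ) = P(Z > c + (θ0 − θ)/(σ/√n)) evaluated for a few cutoffs c and sample sizes n.

```python
# Power function of the one-sided normal test.
import numpy as np
from scipy import stats

def power(theta, theta0=0.0, sigma=1.0, n=10, c=0.5):
    return stats.norm.sf(c + (theta0 - theta) / (sigma / np.sqrt(n)))

thetas = np.linspace(-1.0, 3.0, 9)
for c in (0.1, 0.5, 1.0):
    print(c, np.round(power(thetas, c=c), 3))   # larger c shifts the whole curve down
for n in (10, 20, 100):
    print(n, np.round(power(thetas, n=n), 3))   # larger n makes the curve closer to a step
```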
[Figure: power functions β(θ) of this test. Panel (a) varies the cutoff c = 0.1, 0.5, 1.0; panel (b) varies the sample size n = 10, 20, 100. The horizontal axis is θ, with θ0 marked; the vertical axis is β(θ).]

Figure (a) gives an example of different values of c. Since H0 is true when θ is to the left of θ0, the type I error probability there is the distance between the x-axis and the power function curve. To the right of θ0, the type II error probability is the distance between y = 1 and the curve. We can see that increasing c trades off the maximum type I error against the maximum type II error: with a higher c, we have a stricter criterion for rejecting H0, which results in fewer false positives but more false negatives.

Consequences of the Power Function Figure (b) above shows the power function for different n. We see that as we increase n, the power function approaches the ideal step function. Typically, the power function will depend on the sample size n. If n can be chosen by the experimenter, examining the power function is useful for determining the sample size before the experiment is performed.

Size and Level of Tests For a fixed sample size, it is usually impossible to make both error probabilities arbitrarily small. As most people care more about type I errors (since the alternative hypothesis is what we want to prove), we classify tests by the maximum possible type I error probability over Θ0:

    \sup_{θ∈Θ_0} β(θ)

• We want this number to be low, since β(θ) is ideally 0 on Θ0. A lower size/level means stricter control of false positives.
• Note that since we do not have a prior distribution over the parameters, we use the maximum error probability rather than an expected error probability.
• When this bound α is tight (sup_{θ∈Θ0} β(θ) = α), the test is called a size α test. When the bound is not necessarily tight (sup_{θ∈Θ0} β(θ) ≤ α), the test is called a level α test.
  – So levels are upper bounds on a test's size; every test is a level 1 test, but that is not informative.

Unbiased Tests A test with power function β(θ) is unbiased if

    \sup_{θ∈Θ_0} β(θ) ≤ \inf_{θ∈Θ_0^C} β(θ)

Colloquially: when the alternative hypothesis is true, the test should be at least as likely to reject the null hypothesis as when the null hypothesis is true.

8.3.2 Most Powerful Tests: Uniformly Most Powerful Tests

Depending on the problem, there might be multiple level α tests. Which should we use among them? Since we have only controlled type I errors so far, it is a good idea to look at type II errors this time. The uniformly most powerful (UMP) test is the most powerful test across all θ ∈ Θ0^C: its power function β must be at least as large as any other level α test's power function, for every θ ∈ Θ0^C.

Neyman-Pearson Lemma UMP is a very strong requirement; we can easily imagine situations where a UMP test does not exist. However, in some cases it does. When we are considering only two simple hypotheses H0: θ = θ0 and H1: θ = θ1, the test that rejects H0 when and only when f(x|θ1) > k f(x|θ0), for some k ≥ 0, is a UMP level α test, where α = P_{θ0}(X ∈ R). Colloquially, if your test has the above form, it is UMP for its own class (level α).

Extending the Neyman-Pearson Lemma to a One-Sided Test In the case of a one-sided hypothesis (for example, θ > θ0), we can extend the above lemma to find a UMP test. First, we need the following concept: a family of pdfs has the MLR (monotone likelihood ratio) property if, whenever θ2 > θ1, the likelihood ratio f(t|θ2)/f(t|θ1) is a monotone function of t.
(This holds for all the regular exponential families g(t|θ) = h(t) c(θ) e^{w(θ)t} when w(θ) is a nondecreasing function.) Now, we state the following theorem:

Karlin-Rubin Theorem

Take H0: θ ≤ θ0 and H1: θ > θ0. Suppose that T is a sufficient statistic for θ and that its family of distributions has a nondecreasing MLR. Then, for any t0, rejecting H0 iff T > t0 is a UMP level α test, where α = Pθ0(T > t0).

In layman's terms: when the conditions are met, a hypothesis about a parameter being above or below a threshold can be translated into the sufficient statistic being above or below a threshold. So this is actually very intuitive, maybe even trivial, stuff.

When UMP Cannot Be Found

In the case of a two-sided hypothesis, the above theorem is not applicable. For example: drawing from n(θ, σ²), σ² known. Say H0: θ = θ0 and H1: θ ≠ θ0. We look at two alternative values of θ, θ1 < θ0 < θ2. We are able to show that different tests are optimal for θ1 and θ2 while having the same level. Details, writing Z = √n(X̄ − θ)/σ:

• Test 1 rejects H0 when X̄ < θ0 − σzα/√n, i.e. Z < −zα − √n(θ − θ0)/σ
• Test 2 rejects H0 when X̄ > θ0 + σzα/√n, i.e. Z > zα + √n(θ0 − θ)/σ

Note they are pathological devices, only controlling the Type I error for the sake of the proof; both tests have spectacular Type II errors. Anyways, as X̄ ∼ n(θ, σ²/n), Z is a standard normal variable, so both tests have level α. However, we can easily see the power of Test 1 is higher than that of Test 2 for θ1 < θ0, and vice versa for θ2 > θ0. So there is no UMP level α test.

Unbiased UMPs

Note the above tests are pathetically biased. What we should do is reject H0 when

|X̄ − θ0| > σ zα/2 / √n

Obviously. The figure below shows the power functions of the three tests:

[Figure: power functions of the three tests ("biased 1", "biased 2", and "best unbiased") plotted against θ, with θ0 marked.]

Note that the best unbiased test has a lower probability of rejecting H0 for some range of θ, but it is clearly better than both biased tests in general.

8.3.3 Size of UIT and IUT

The size/level analysis of UIT and IUT is different. UITs are uniformly less powerful than LRTs, but they might be easier to reason about.

8.3.4 p-Values

The definition of a p-value is very unintuitive, actually. In the definition, a p-value p(X) is a test statistic. Small values of p(X) give evidence that H1 is true. A p-value is valid if, for every θ ∈ Θ0 and every 0 ≤ α ≤ 1,

Pθ(p(X) ≤ α) ≤ α

So WTF does this mean? p(X) ≤ α can be interpreted as the condition "the p-value takes the value α, or something more extreme". The probability of this happening, given a θ in the null hypothesis, should be at most α. So this sort of resonates with the colloquial definition of a p-value. Also note that a test that rejects H0 when p(X) ≤ α is a level α test, directly from the above definition.

Defining Valid p-Values

I guess this is the more intuitive definition: let W(X) be a test statistic such that large values of W give evidence that H1 is true. Then let

p(x) = sup_{θ∈Θ0} Pθ(W(X) ≥ W(x))

so this is the colloquial definition. We can prove that p(x) is a valid p-value according to the above definition.

Another method for defining p-values, which depends on sufficient statistics, is discussed as well. If S(X) is a sufficient statistic for θ when the null hypothesis is true, then

p(x) = P(W(X) ≥ W(x) | S = S(x))

is a valid p-value. This makes sense as well; if we have a sufficient statistic we need no stinking inference about θ....
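To make the sup-based definition concrete, here is a small sketch of my own (the sample and parameter values are made up) for the one-sided normal mean test with known σ, where the supremum over Θ0 = {θ ≤ θ0} is attained at θ0:

import numpy as np
from scipy.stats import norm

def p_value_one_sided(x, theta0, sigma):
    # Valid p-value for H0: theta <= theta0 with known sigma.
    # W(x) = sqrt(n) * (xbar - theta0) / sigma; the supremum of
    # P_theta(W(X) >= W(x)) over theta <= theta0 is attained at theta0,
    # so p(x) = P(Z >= W(x)).
    x = np.asarray(x)
    w = np.sqrt(len(x)) * (x.mean() - theta0) / sigma
    return norm.sf(w)

rng = np.random.default_rng(0)
sample = rng.normal(loc=0.3, scale=1.0, size=30)   # true mean slightly above theta0 = 0
print(p_value_one_sided(sample, theta0=0.0, sigma=1.0))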
Calculating p-Values Calculating the supremum probability isn’t always easy; but it can be derived from the properties of the distribution. Our usual normal tests are good examples. In these cases, the test statistic follows Student’s t distribution, enabling us to nail the supremum. 8.3.5 Loss Function Optimality Since there are just two actions in a hypothesis test ”we either accept H0 or H1 ), a loss function is defined simply by: L (θ, a0 ) = 0 C θ ∈ Θ0 II θ∈ L (θ, a1 ) = ΘC 0 C 0 I θ ∈ Θ0 θ ∈ ΘC 0 where ai is the action of accepting Hi . CI and CII are the costs of type I and type II errors, respectively, and they can be a function of θ as well. 9 Interval Estimation 9.1 Introduction Interval estimation techniques give us confidence intervals of parameters; for example, we can assert µ ∈ [L (x) , U (x)] where L (x) and U (x) are functions of the random sample that predicts the parameter. What do we gain from going to interval estimations from point estimations? The probability of a point estimation being correct is 0, if the parameter is continuous. However, once we have an interval estimation, we can have nonzero probability of our prediction being correct. Here are some definitions: • A coverage probability of [L (X) , U (X)] is a function of θ, that represents the probability that the interval covers the true parameter. 32 Pθ (θ ∈ [L (X) , U (X)]) Note that the interval estimates are functions of X, thus are random variables themselves. • A confidence coefficient is the infimum of coverage probability across all possible inf P (θ ∈ [L (X) , U (X)]) θ I want to emphasize the consequences of this definition: don’t think interval estimation in terms of conditional probability! A coverage probability is not related to the conditional prob¯ seemed like a very obvious way for ability. Personally conditional probability of θ given X interval estimation, but it requires a prior distribution of θ and is strictly not a frequentist approach. The coverage probability is more like a likelihood function. It’s just ConverageProb (θ) = ˆ f (x, θ) dx C where f is the pdf and C = {x|L (x) ≤ θ ≤ U (x)}. This also changes how we interpret the results of an estimation process: the claim that 95% confidence interval of θ is [1, 10] does not mean P (θ ∈ [1, 10]) = 0.95, but P (x|θ) ≥ 0.95 for all θ ∈ [1, 10]. 9.1.1 Coverage Probability Example: Uniform Scale Distribution In many cases, the coverage probability is constant across different parameters. However, it may not be so in certain cases. The following example demonstrates this. Say we draw Xi ∼ uniform (0, θ). The sufficient statistic for θ is Y = max Xi . Now, what are our estimates for θ? We examine two candidates: • Scale intervals:[aY, bY ] where 1 ≤ a < b • Location intervals: [Y + c, Y + d] where 0 ≤ c < d Now let’s send that θ to infinitely. Location interval will have a coverage probability of 0, since the size of the interval stays constant regardless of θ. However, the scale interval manages to have a positive coverage probability because the interval’s size roughly grows with θ. ”These are more intuitive descriptions; formally they are proven by integrating pdfs of Y /θ. Also see the later section on pivotal quantities.) 9.2 Methods of Finding Interval Estimators 9.2.1 Equivalence of Hypothesis Test and Interval Estimations: Inverting a Test Statistic Hypothesis tests and interval estimations are obviously, very closely related. They both take a random sample to make an inference about the population parameter. 
To draw a poor analogy: both processes throw darts to a board. Hypothesis testing tell you the target position to 33 shoot for, then you throw the dart, and we will know if we landed within a certain range from the target. Interval estimation lets you throw the dart first; and gives us target positions you might have shot for. Here’s a general recipe for converting a hypothesis test into an interval estimation; just go over all the possible values of θ, use a test of size α for H0 : θ = θ0 where θ0 is a possible value. If this test passes, θ0 falls within the confidence interval. So: C (T (x)) = {θ : P (T (x) |θ) ≥ α} Of course, in real life, this is not done by enumerating all possible values but analytically. Normal Test Example Let Xi ∼ n µ, σ 2 . Consider testing H0 : µ = µ0 and H1 : µ 6= µ0 . Hypothesis test draws a region on the dart, called the acceptance region A (µ0 ), upon which we will accept H0 . Let’s use a test √ with acceptance region x : |¯ x − µ0 | ≤ zα/2 σ/ n . This test has a size of α. Now, let’s throw the dart and get a concrete x ¯. Now, we wiggle µ0 so its acceptance region will still contain x ¯. It’s easy to see the acceptance region will contain x ¯ when σ σ x ¯ − zα/2 √ ≤ µ0 ≤ x ¯ + zα/2 √ n n Yay, this is a range of parameters. Now what is the confidence coefficient? From the definition of the test, we know P (x ∈ R (µ0 ) |µ = µ0 ) = α ⇐⇒ P (x ∈ A (µ0 ) |µ = µ0 ) = 1 − α So it’s 1 − α! Formal Definition (Theorem 9.2.2) For each θ0 ∈ Θ, let A (θ0 ) be the acceptance region of a level α test of H0 : θ = θ0 . For each x ∈ X , define a set C (x) in the parameter space by C (x) = {θ0 : x ∈ A (θ0 )} Then the random set C (x) is a 1 − α confidence set. The converse holds true as well; if C is a 1 − α confidence set then A (θ0 ) = {x : θ0 ∈ C (x)} is a valid acceptance region for a level α test. 34 Shapes of Confidence Interval Also, note the above definition do not include the form of H1 . In general, one-based H1 produces one-sided C (x) and two-sided H1 produces two-sided C (x). The biased property of the inverted test also carries over to the confidence set, as can be expected. What do we do if we want C (x) = (−∞, U (x)] i.e. only upper-bounded? If a test uses H1 : µ < µ0 , its acceptance region will look like {x : x ¯ < µ0 − t} where t is a function of various things ”sample size, nuisance parameters, etc..). i.e. the acceptance region is left-bounded if looked from µ0 . Now, once we have x ¯, we can move µ0 infinitely to the left, but cannot move µ0 too far to the right. So the confidence interval is one-sided only with a upper bound. 9.2.2 Using Pivotal Quantities A pivotal quantity is a random variable whose distribution does not depend on the parameter. √ ¯ 2 2 For example, if Xi ∼ n µ, σ with σ known, X − µ / (σ/ n) is normally distributed and is a pivotal quantity. Then, we can trivially construct a confidence interval for µ. First note that P ¯ −µ X √ ≤ a = P (−a ≤ Z ≤ a) = α −a ≤ σ/ n Where a is chosen to make the last equality hold. So what µ will make this inequality hold? It’s easy to see the following will do it. σ σ µ:x ¯ − a√ ≤ µ ≤ x ¯ + a√ n n = σ σ µ:x ¯ − z1−α √ ≤ µ ≤ x ¯ + z1−α/2 √ n n Note that we choose a = z1−α/2 which splits the probability equally here. It’s a matter of choice, but for symmetric distributions such as normal, it seems natural. For asymmetric distributions such as chi-squared, a different approach will be better ”we will revisit this). 
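A minimal sketch of the resulting interval (my own illustration; the sample and parameter values are made up), assuming we draw from n(µ, σ²) with σ known:

import numpy as np
from scipy.stats import norm

def z_interval(x, sigma, alpha=0.05):
    # 1 - alpha confidence interval for mu from the pivot sqrt(n) * (xbar - mu) / sigma.
    x = np.asarray(x)
    half_width = norm.ppf(1 - alpha / 2) * sigma / np.sqrt(len(x))
    return x.mean() - half_width, x.mean() + half_width

rng = np.random.default_rng(1)
sample = rng.normal(loc=2.0, scale=1.5, size=40)
print(z_interval(sample, sigma=1.5))   # should cover the true mean 2.0 about 95% of the time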
In General If Q (x, θ) is a pivotal quantity, we can a and b so that Q (x, θ) will fall in that interval with a probability of α ”we can plug in concrete numbers, because we know Q’s distribution). Then, we have: C (x) = {θ0 : a ≤ Q (x, θ0 ) ≤ b} Now, we unravel Q’s definition and leave only the desired parameter in the middle, moving everything to the left and right sides. Then we have a confidence interval. 9.2.3 Pivoting CDFs Using Probability Integral Transformation The previous strategies can sometimes result in non-interval confidence sets, as shown in example 9.2.11. This strategy, when applicable, will always result in an interval. Say T (x) is 35 a sufficient statistic for θ with cdf FT (t|θ). Recall, from probability integral transform, that FT (T |θ) ∼ uniform (0, 1). So this is a pivotal quantity! For the confidence set to be an interval, we need to have FT (t|θ) to be monotone in θ. If it is monotonically increasing, we set θL (t) and θU (t) to be solution of the following system: FT (t|θL (t)) = α1 FT (t|θU (t)) = 1 − α2 where α1 + α2 = α. ”We usually set α1 = α2 = α/2 unless additional information. We will discuss about this in section 9.3) Note the above equations do not have to be solved analytically – we can solve them numerically as well. 9.2.4 Bayesian Inference As noted in the beginning of the chapter, the frequentist approach does not allow to say the parameter belongs to the confidence set with a probability of α: parameter is a fixed value and it will belong to the set with a probability of 1 or 0. The Bayesian setup precisely allows us to do that. The Bayesian equivalent of a confidence set is called a credible set, to avoid the confusion between the two approaches. The equivalent of a coverage probability is called a credible probability. In many cases, each approach looks poor if examined using the criteria from a different approach. The textbook shows some examples where the credible set has a limiting converage probability of 0, and the confidence set having a limiting credibility of 0. 9.3 Methods of Evaluating Interval Estimators We can have multiple ways of coming up with interval estimation, with the same confidence coefficient. Some is bound to be better than the others. But how do we evaluate them? 9.3.1 Size and Coverage Probability When two interval estimations have the same confidence coefficient, it is obvious to prefer a smaller interval. When we estimate the population mean of a normal distribution, we can pick any α1 , α2 ≥ 0 where α = α1 + α2 such that P (zα1 ≤ Z ≤ zα2 ) = 1 − α Obviously there are multiple choices for those. Intuition tells us that splitting probabilities evenly for the left and right hand side of the sample mean is the way to go, and it indeed is. It is justified by the following theorem: Theorem 9.3.2 Let f (x) be a unimodal pdf. We consider a class of intervals [a, b] where ´b f (x) dx = 1 − α. If f (a0 ) = f (b0 ) > 0, and x∗ ∈ [a0 , b0 ] where x∗ is the mode of f , then a 36 [a0 , b0 ] is the shortest interval which satisfies the 1 − α probability constraint. The textbook discusses a caveat for this ”See example 9.3.4) 9.3.2 Test-Related Optimality Optimality criteria for tests carry over to their related interval estimations. For example, we have some guarantee for interval estimations which are inversions of UMP tests. However, note that UMP controls type-II errors; therefore the guarantee we have for the interval estimation also looks different. UMP-inverted estimations give us optimal probability of false coverage. 
Colloquially, it is a function of θ and θ′ and measures the probability of θ′ being covered when θ is the truth. Note the probability of false coverage has to take different forms according to the form of the original test. Say the the interval estimation is one-sided only with upper bound. We only define the probability of false coverage for θ < θ′ , obviously – θ′ less than or equal to θ are correct. Intuition Behind False Coverage Probability We do offer some intuition about the linkage between the false coverage probability and the size of the interval; if an interval contains less false parameters, it’s more likely to be short; there are some loose links discussed in the book ”see theorem 9.3.9) which says the expected length of the interval for a given parameter value is equivalent to an integral over false positive probability. Uniformly Most Accurate (UMA) Intervals A conversion of a UMP test yields a UMA confidence set, which has the smallest probability of false coverage. Also note that UMP tests are mostly one-sided – Karlin-Rubin works only for one sided intervals. So most UMA intervals are one sided. Inverting Unbiased Tests The biasedness of tests carry over to interval estimation. In case UMP does not exist, we can invert the unbiased test to get an unbiased interval. 9.3.3 Bayesian Optimality When we have a posterior distribution, we can order all intervals of probability of 1 − α by their size. A corollary to Theorem 9.3.2 mentioned above gives us that when the posterior distribution is unimodal, {θ : π (θ|x) ≥ k} is the shortest credible interval for a credible probability of 1 − α = a region is called the highest posterior density ”HPD) region. 37 ´ π(θ|x)≥k π (θ|x) dx. Such 9.3.4 Loss Function Optimality So far we have set the minimum coverage probability we want, and then find the best tests among them. We can do something in the middle by using a generalized loss function such as L (θ, C) = b · Length (C) − IC (θ) where IC (θ) = 1 if θ ∈ C, 0 otherwise. By varying b, we can very the relative importance of the length and coverage probability. However, the use of decision theory in interval estimation problems is not widespread; it is hard to solve in many cases, and this can sometimes lead to unexpected types of sets. 10 Asymptotic Evaluations This chapter looks at asymptotic behaviors of several topics ”point estimation, hypothesis testing, and interval estimation). That is, we send sample size to infinity and see what happens. The detailed treatment of the topic seems to be the most theoretical part of the book; I just skimmed over the chapter and will try to summarize only the big ideas here. 10.1 Point Estimation 10.1.1 Criteria A point estimator’s asymptotic behavior is characterized by two properties; • Consistency means the estimator converges to the correct value as the sample size becomes infinite. lim Pθ (|Wn − θ| < ǫ) = 1 n→∞ • Efficiency looks at the variance of the estimator as the sample size becomes infinite. If the variance reaches the lower bound defined by the Cramer-Rao, the estimator is called asymptotically efficient. Concretely, an asymptotically efficient estimator Wn for a parameter τ (θ) if √ n [Wn − τ (θ)] → n [0, v (θ)] and v (θ) achieves the Cramer-Rao Lower Bound. It is notable that MLEs are in general best estimators: they are consistent estimators, and asymptotically efficient at the same time. ”Some regularity conditions are required, but it is explicitly said to hold in most common situations and you don’t really want to care about it.) 
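A small Monte Carlo illustration of these two properties for the simplest case I can think of (my own sketch, not from the book): the MLE of a normal mean is the sample mean, and its variance should track the Cramer-Rao lower bound σ²/n as n grows.

import numpy as np

rng = np.random.default_rng(2)
sigma, n_reps = 2.0, 5000
for n in (10, 100, 1000):
    # n_reps independent samples of size n; the MLE of the mean is the sample mean.
    mles = rng.normal(loc=5.0, scale=sigma, size=(n_reps, n)).mean(axis=1)
    print(n, mles.var(), sigma**2 / n)   # empirical variance vs. the CRLB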
38 10.1.2 Comparing Consistent Estimators MLE is everyone’s favorite estimator. However, other estimators may have other desirable properties ”robustness, ease of calculation, etc) so we need to be able to see what we are giving up in terms of efficiency. Comparing different asymptotically consistent estimators is done by looking at their variances, through the idea of asymptotic relative efficiency ”ARE). If two estimators Wn and Vn satisfy √ n [Wn − τ (θ)] √ n [Vn − τ (θ)] Then, the ARE is defined by 2 → n 0, σW → n 0, σV2 ARE (Vn , Wn ) = 2 σW σV2 ARE is going to be a function of the parameters – so we will be able to see where it peaks, where is it larger/smaller than 1, etc. 10.1.3 Asymptotic Behavior of Bootstrapping Bootstraps are good ways to estimate variances of arbitrary estimators. For any estimator ˆ θˆ (x) = θ, Var θˆ = ∗ nn 1 X ˆ∗ ¯ˆ∗ 2 θ −θ n n − 1 i=1 i Of course, we cannot sample all nn possible samples. So we can always do a partial bootstrap by taking B random resamples. The book shows some examples where bootstrap yields a better variance estimate than the Delta method ”which exploits that when estimates are asymptotically efficient, they will reach Cramer-Rao bound – so naturally it will underestimate.). Parametric and Nonparametric Bootstrap The usual type of bootstrapping, which draws data ”with replacement) to generate random samples is called a nonparametric bootstrap. On the contrary, parametric bootstrapping asˆ and generate random samples sumes a distribution. Say Xi ∼ f (x|θ). We take MLE estimate θ, from there. X1∗ , X2∗ , · · · , Xn∗ ∼ f x|θˆ It is same as the usual nonparametric bootstrap from there. 39 Consistency and Efficiency The textbook does not cover a lot of material about the evaluation of bootstrapping. In general, it is an effective and reasonable way. 10.2 Robustness 1. It should have a reasonably good efficiency at the assumed model. 2. It should be robust in the sense that small deviations from the model assumptions should impair the performance only slightly. 3. Somewhat larger deviations from the model should not cause a catastrophe. 10.2.1 Robustness of Mean and Median The mean reaches the Cramer-Rao bound, so it is an efficient estimator. Also, if there is a small variation from the normal distribution assumption, it will fare pretty well. We can try to see this by a δ-contamination model. There, the distribution is the assumed model with a probability of 1−δ, some other distribution with δ. Note when the two distributions are similar, mean’s variance is still small. However, if the other distribution is Cauchy, for example, the variance will go to infinity. That brings us to the notion of breakdown value. A breakdown value b is the maximal portion of the sample which can go to infinity before the statistic goes infinity, or meaningless . Of course, the mean’s breakdown value is 0, where the median has 0.5. How can we compare the mean and the median? We can prove median is asymptotically normal, and use ARE to compare those in different types of distributions. Some examples in the book show it will fare better than the mean in double exponential distribution, but not in normal or logistic. So it fares better with thicker tails, as we expect. 10.2.2 M-estimators M-estimators is a generalized form of estimator. Most estimators minimize some type of criteria - for example, squared error gives the mean, absolute error gives the median, and the negative log likelihood will give you MLE. 
M-estimators are a class of estimators which minimize n X i=1 ρ (xi − a) Huber’s Loss Huber’s loss is a Frankenstein-style loss function created by patching squared and absolute loss together. 40 ρ (x) = 1 x2 2 k |x| − 1 k 2 2 if |x| ≤ k otherwise Works like quadratic around zero, and linear away. It’s differentiable, and continuous. k is a tunable parameter. Increasing k would be like decreasing the robustness to outliers. The minimizer to this loss function is called the Huber estimator and it is asymptotically normal with mean θ. If you do an ARE comparison with mean and median; • Huber is close to mean in normal, and better than mean at logistic or double exponential. • Huber is worse than median in double exponential, much better than it in normal or logistic. 10.3 Hypothesis Testing 10.3.1 Asymptotic Distribution of LRT When H0 : θ = θ0 and H1 : θ 6= θ0 . Suppose Xi ∼ f (x|θ). Then under H0 , as n → ∞, −2 log λ (X) → χ21 in distribution Regardless of the original distribution! Kicks ass. A more general version, which do not specify H0 and H1 explicitly, states that the quantity −2 log λ (X) will still converge to chi- squared distribution with its df equal to the difference between the number of free parameters in Θ0 and the number of free parameters in Θ1 . How do we define the number of free parameters? Most often, Θ can be represented as a subset of Rq and Θ0 can be represented as a subset of Rp . Then q − p = v is the df for the test statistic. 10.3.2 Wald’s Test and Score Test Two more types of tests are discussed. Wald Test When we have an estimator Wn for a parameter θ, which will asymtotically normal, we can use this as a basis for testing θ = θ0 . In general, a Wald test is a test based on a statistic of the form Zn = Wn − θ0 Sn Sn is standard error for Wn . ”Last time I’ve seen this was in logistic regression; they did Wald Test on regression coefficients to derive p-values of coefficient being nonzero.) 41 Score Test 10.4 Interval Estimation 11 Analysis of Variance and Regression 11.1 One-way ANOVA ANOVA is a method of comparing means of several populations, often assumed to be normally distributed. Normally, the data are assumed to be follow the model Yij = θi + ǫij where θi are unknown means and ǫij are error random variables. The classic oneway ANOVA assumptions are as follows: 1. Eǫij = 0, Varǫij = σi2 < ∞, and all errors are uncorrelated. 2. ǫij are independent, normally distributed 3. σi2 = σ 2 for all i ”also known as homoscedascity) 11.1.1 Different ANOVA Hypothesis The classic one-way ANOVA null hypothesis states H0 : θ 1 = θ 2 = θ 3 = · · · = θ k which is kind of uninformative, and doesn’t give us much information. The text starts with a more useful type of hypothesis, using contrasts. A contrast is a linear combination of variables /parameters where the weights sum up to 0. We now run the test with different hypothesis, the null being H0 : k X ai θi = 0 i=1 Now, by choosing the weights carefully, we can ask other types of interesting questions. a = (1, −1, 0, 0, · · · , 0) will ask if θ1 = θ2 . a = 1, − 21 , − 12 , 0, 0, · · · , 0 will ask if θ1 = (θ2 + θ3 ) /2, etc. 11.1.2 Inference Regarding Linear Combination of Means The means of each sample Y¯i are normal: ni 1 X Yij ∼ n θi , σ 2 /ni Y¯i = ni j=1 42 The linear combination of normal variables are once again, normal, with the following parameters: k X i=1 ai Y¯i ∼ n k X ai θi , σ 2 i=1 k X a2 i i=1 ni ! Since we don’t know the variance, we can replace this as the sample variance. 
Si2 is the regular sample variance from the ith sample. Then, the pooled estimator is given by Sp2 = k k ni 1 XX 1 X 2 (ni − 1) Si2 = (yij − y¯i· ) N − k i=1 N − k i=1 j=1 The estimator Sp2 has an interesting interpretation: it is the mean square within treatment groups. This makes sense, since σ 2 only affects the variance within a group in our model. Then, replacing the variance with Sp2 , we get the following which has usual Student’s t-distribution with N − k degrees of freedom. Pk ¯ Pk a i θi i=1 ai Yi − i=1 q P ∼ tN −k k 2 Sp i=1 a2i /ni Now, we can do the usual t-test. In cases where we are only checking the equivalence of two means, this is equivalent to the two-sample t test is that here information from other groups are factored in estimating Sp2 . 11.1.3 The ANOVA F Test How do we test the classic hypothesis? We can think of it as an intersection of multiple hypothesis: θ ∈ {θ : θ1 = θ2 = · · · = θk } ⇐⇒ θ ∈ Θa for all contrast a ⇐⇒ θ ∈ ∩a∈A Θa We can reject this intersection if the test fails for any a. We can test an individual hypothesis with H0a by the following statistic: Pk P i=1 ai Y¯i − ki=1 ai θi q P Ta = k Sp2 i=1 a2i /n where we will reject it if Ta > k for some k. Now, all tests will pass if and only if sup Ta ≤ k a∈A Where A = {a : P a = 0}. How can we find this supremum? Lemma 11.2.7 gives us the exact form of a and the value of the supremum. However, the important thing is that: 43 sup Ta2 = a∈A Pk i=1 ni 2 Y¯i − Y¯ − θi − θ¯ Sp2 ∼ (k − 1) Fk−1,N −k the statistic of which is called the F -statistic, and it gives us the F-test. Now, we can reject H0 when 2 ¯ / (k − 1) ¯ n Y − Y i i=1 i Pk Sp2 > Fk−1,N −k,α What is the rationale behind looking at F statistic? The denominator is the estimated variance within groups. The numerator is the mean square between treatment groups, weighted 2 by the size of the group. Y¯i − Y¯ is the squared error between the group mean and the grand mean. ni weights them by the size of the group. Dividing by k − 1 is getting the average error per group. Now, the ratio between these two quantities should be higher when inter-group variance is high relative to the intra-group variance. 11.1.4 Simultaneous Estimation of Contrasts A couple of strategies for making inferences about multiple equalities are discussed; the Bonferroni procedure and the Scheffe’s procedure. The Bonferroni procedure allows you to make inferences about m pairs of means being equal. You have to set m in advance, and adjust the level of the test so the intersection tests will be of desired power. Scheffe’s procedure is more notable, which allows you to construct confidence intervals on any arbitrary contrasts after the procedure is done. It is noted as a legitimate use of data snooping. However, at the cost of power of inference, the intervals are usually wider. It goes like: p If M = (k − 1) Fk−1,N −k,α , then the probability is 1 − α that v v u u k k k k 2 u X u X X X a a2i i t 2 ¯ ¯ a i Y i − M Sp ai Yi + M tSp2 ai θi ≤ ≤ n n i=1 i=1 i=1 i i=1 i i=1 k X 11.1.5 Partitioning Sum of Squares ANOVA provides a useful way of thinking about the way in which different treatments affect a measured variable. We can allocate variation of the measured variable to different sources, because: ni k X X i=1 j=1 2 (yij − y¯) = k X i=1 2 ni (¯ yi· − y¯) + ni k X X i=1 j=1 (yij − y¯i· ) 2 2 which can easily be proved because (yij − y¯) = ((yij − y¯i ) + (¯ yi − y¯)) and when you evalu2 ate the square, cross terms are zero. 
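Here is a small numerical check of this partition and of the F statistic above (my own sketch; the three groups below are simulated, and scipy's f_oneway is used only as a cross-check):

import numpy as np
from scipy.stats import f_oneway

rng = np.random.default_rng(3)
groups = [rng.normal(loc=m, scale=1.0, size=30) for m in (0.0, 0.0, 0.5)]

all_y = np.concatenate(groups)
grand_mean = all_y.mean()
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
ss_total = ((all_y - grand_mean) ** 2).sum()

k, N = len(groups), len(all_y)
F = (ss_between / (k - 1)) / (ss_within / (N - k))
print(np.isclose(ss_total, ss_between + ss_within))   # the partition identity
print(F, f_oneway(*groups).statistic)                 # the two F values should agree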
The sums of squares are also chi-square distributed, after scaling. 44 11.2 Simple Linear Regression Simple linear regression is discussed in three different contexts - as a minimizer to the leastsquares without any statistical assumptions, as a best linear unbiased estimators under some variance assumptions, as an inference mechanism under distribution assumptions. Not surprisingly, we will be able to draw more powerful conclusions when we assume more. 11.2.1 General Model In all three different contexts, the actual line stays the same. Say (xi , yi ) are pairs of examples, where xi are the predictor variables, yi being response variables. Then x ¯ and y¯ are means of xi and yi , respectively, and Sxx = n X i=1 Syy = n X i=1 Sxy = n X i=1 (xi − x ¯) (yi − y¯) 2 2 (xi − x ¯) (yi − y¯) And then we will fit the following line y = bx + a with: b= Sxy Sxx a = y¯ − b¯ x Note the slope can be interpreted as Cov (X, Y ) /VarX. 11.2.2 Least Square Solution Least square provides a way to fit a line to the given data. No statistical inferences can be drawn here. Let’s say we want to minimize the residual sum of squares RSS = n X i=1 (yi − (c + dxi )) 2 Now c can be determined easily - rewrite the summand as 2 (yi − (c + dxi )) = ((y − dxi ) − c) 2 and the minimizer of this is just the average of y − dxi which is y¯ − d¯ x. d can be determined from differentiating the quadratic formula and setting it to 0. Also note that changing the direction of the regression ”using y to predict x) will give you a different regression line: this is obvious since b becomes Syy /Sxy . 45 11.2.3 Best Linear Unbiased Estimators: BLUE Let’s add some contexts: we now think of the values yi as observed values of uncorrelated random variables Yi . xi are known, fixed values chosen by the experimenter. We assume the following model: EYi = α + βxi where VarYi = σ 2 , which is a common variance across variables. Or equivalently, set Eǫi = 0, Varǫi = σ 2 and have Yi = α + βxi + ǫi Now, let us estimate α and β as a linear combination of Yi s n X di Yi i=1 Furthermore, we only look at unbiased estimators. With an unbiased estimator of slope β must satisfy E n X di Yi = β i=1 We can transform LHS as E n X i=1 di Yi = n X di EYi = n X di (α + βxi ) = α i=1 i=1 i=1 n X di ! +β For this to be β, we need the following conditions to hold true: n X di = 0 and i=1 n X n X i=1 di xi ! di x i = 1 i=1 So the minimum variance estimator which satisfies the above conditions is called the best linear unbiased estimator ”BLUE). The di s which satisfy this could be find using a similar strategy for maximizing Ta in section 11.1.3 above. After the dust settles, we have: di = (xi − x ¯) Sxx which seems to have an interesting interpretation. Higher Sxx make the coefficients smaller, and xi deviating more from x ¯ makes coefficients larger. What is the variance of β now? Varb = σ 2 n X i=1 46 d2i = σ2 Sxx 11.2.4 Normal Assumptions Now, we can assume normality for the variables which let us make further claims regarding the estimators. The text discusses two ways of doing this, which are practically equivalent. The more common one is the conditional normal model, which states Yi ∼ n α + βxi , σ 2 which is a special case of the model discussed above. Even less general is the bivarite normal model, which assumes the pair (Xi , Yi ) follows a bivariate normal distribution. However, in general we don’t care about the distribution of Xi , but only the conditional distribution of Yi . So bivariate normal assumptions are not used often. 
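As a concrete sketch of the fitted line from section 11.2.1 (my own example; the data below are simulated, and I only verify that the closed-form slope and intercept match numpy's least-squares routine):

import numpy as np

rng = np.random.default_rng(4)
x = np.linspace(0, 10, 50)
y = 1.5 + 0.8 * x + rng.normal(scale=1.0, size=x.size)   # alpha = 1.5, beta = 0.8 plus noise

# Closed-form simple linear regression: b = Sxy / Sxx, a = ybar - b * xbar.
Sxx = ((x - x.mean()) ** 2).sum()
Sxy = ((x - x.mean()) * (y - y.mean())).sum()
b = Sxy / Sxx
a = y.mean() - b * x.mean()

# Cross-check against numpy's least squares on the design matrix [1, x].
a_np, b_np = np.linalg.lstsq(np.column_stack([np.ones_like(x), x]), y, rcond=None)[0]
print(a, b, a_np, b_np)   # the two pairs should match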
Also, note both models satisfy the assumptions we have made in the above section. Maximum Likelihood Under this distribution assumption, we can try to find the MLE of β. We expect to find the same formula - and we actually do. The log likelihood function is maximized at the same choice of β and α. What about the MLE of σ 2 ? It is given by n 2 1 X ˆ i σ ˆ = yi − α ˆ − βx n i=1 2 which is the variance of the error ”RSS) - makes sense because RSS are effectively ǫi with Varǫi = σ 2 ! However, note that σ ˆ 2 is not an unbiased estimator of σ 2 . Distributions of Estimators Under Normality Assumption ˆ and S 2 are The sampling distributions of the maximum likelihood estimates α ˆ , β, with n σ2 X 2 α ˆ ∼ n α, x nSxx i=1 i ! σ2 ˆ β ∼ n β, Sxx −σ 2 x ¯ Cov α ˆ , βˆ = Sxx Furthermore, α ˆ , βˆ and S 2 are independent and (n − 2) S 2 ∼ χ2n−2 σ2 When σ 2 is unknown, we can still make inferences about them using S 2 since we get Student’s t-distribution using: 47 α ˆ−α p Pn ∼ tn−2 S ( i=1 x2i ) / (nSxx ) and βˆ − β √ ∼ tn−2 S/ Sxx Significance of the Slope The t-test for significance of β will reject H0 : β = 0 when βˆ − 0 > tn−2,α/2 √ S/ Sxx This is equivalent to the following, since t-distribution squared is distributed following the F -distribution. βˆ2 > F1,n−2,α S 2 /Sxx The LHS quantity, the F -statistic, can be interpreted as follows: 2 Sxy /Sxx Regression sum of squares βˆ2 = = 2 S /Sxx RSS/ (n − 2) Residual sum of squares/df which is nicely summarized in an ANOVA table. Partitioning Sum of Squares As an another similarity to ANOVA, we can express the total sum of squares in the data set by a sum of regression sum of squares and the residual sum of squares: n X i=1 2 (yi − y) = n X i=1 2 (ˆ yi − y) + n X i=1 (yi − yˆi ) 2 When we split the sum of squares, we can take the ratio between the regression sum of squares and the total sum of squares as the coefficient of determination, called r2 : Pn 2 2 Sxy (ˆ yi − y) r2 = Pni=1 = 2 Sxx Syy ˆi ) i=1 (yi − y The last portion of this inequality is not very obvious... look at Exercise 11.34 for more intuition. 48 Prediction at Specified x0 We are able to discuss the distribution of the response variable at a specified position x0 . Call ˆ 0 which is an this Y0 . Under our assumptions, E (Y |x0 ) = α + βx0 which is estimated by α ˆ + βx unbiased estimator. What is the variance of this estimator? ˆ 0 = Varˆ Var α ˆ + βx α + Varβˆ x20 + 2x0 Cov α ˆ , βˆ n σ 2 X 2 σ 2 x20 2σ 2 x0 x ¯ xi + − nSxx i=1 Sxx Sxx ! 2 (x0 − x) 1 + = σ2 n Sxx = Now we have a normal distribution for Y0 . For inference, the following quantity follows a Student’s t-distribution: ˆ − (α + βx0 ) α ˆ + βx q0 ∼ tn−2 2 S n1 + (x0S−x) xx which can be used to make confidence interval for EY0 = α + βx0 . Prediction Interval ˆ 0 , which is our estimate The previous estimation and inference was done on the estimator α ˆ +βx for the mean of Y0 . Now, can we make intervals for Y0 itself? Obviously, the interval is going to be larger - we should add variance from the distribution of Y0 as well. Here’s the definition: a 100 (1 − α) % prediction interval for an unobserved random variable Y based on the observed data X is a random interval [L (X) , U (X)] such that Pθ (L (X) ≤ Y ≤ U (X)) ≥ 1 − α for all θ. The variance of Y0 is given by summing up the variance of the mean estimator and the common variance σ 2 . Simulataneous Estimation and Confidence Bands We can create confidence bands around the fitted line, which gives us confidence intervals for the mean of Y at that x. 
This is similar to getting confidence bands in ANOVA, and the same two processes apply: Bonferroni and Scheffe. Without further details, we state the Scheffe band: Under the conditional normal regression model, the probability is at least 1 − α that ˆ − Mα S α ˆ + βx where Mα = s p 2F2,n−2,α . 2 1 (x − x) ˆ + Mα S < α + βx < α ˆ + βx + n Sxx 49 s 1 (x − x) + n Sxx 2 12 Regression Models The last chapter! Yay! Can’t believe I made it so far. ”Well, yeah, I skipped over a good amount of material...) Anyways, this chapter covers a number of different models for regression. 12.1 Errors in Variables (EIV) Models In EIV models, contrary to the traditional regression methods, the xs, as well as ys, are realized values of a random variable whose mean we cannot observe: EXi = ξi . The means of the two families of variables are linked by a linear relationship. If EYi = ηi , ηi = α + βξi In this model, there is really no distinction between the predictor variable and the response variable. 12.1.1 Functional And Structural Relationship There are two different types of EIV models. The more obvious one is the linear functional relationship model, where ξi s are fixed, unknown parameters. Adding more parameterization 2 gives us the linear structural relationship model, where ξi ∼ iid n ξ, σξ . In practice, they share a lot of properties and the functional model is used more often. 12.1.2 Mathematical Solution: Orthogonal Least Squares OLS regression measures the vertical distance between each point and the line, since we trust xs to be correct. In EIV, there is no reason to do that and we switch to orthogonal regression. Here, the deviation is the distance between the point and the regression line. The line segment spanning this distance is orthogonal to the regression line, thus the name. The formula for this in case of a simple regression is given in the book. Orthogonal least squares line always lies between the two OLS regression lines - y on x and x on y. 12.1.3 Maximum Likelihood Estimation The MLE of the functional linear model is discussed. The obvious likelihood function, however, does not have a finite maximum. ”Setting derivatives to zero results in a saddle point.) To avoid this problem, we change the model where we do not know the variances of the two errors ”one in x and one in y), but their ratio λ. σδ2 = λσǫ2 Note since VarX = σδ2 , this includes the regular regression model when λ = 0 =⇒ VarX = 0. The maximization can be done analytically. This MLE, when λ = 1, will be the result of the 50 orthogonal least squares. When we send λ → 0, it will become the regular OLS results. Cool right? Case for the structural model is discussed, but I’m going to just skip over it. 12.1.4 Confidence Sets Omitted. 12.2 Logistic Regression And GLM 12.2.1 Generalized Linear Model A GLM consists of three components: the random component ”response variables), the systematic component ”a function h (x) of predictor variables, linear in the parameter), and the link function g (µ). Then the model states g (EYi ) = h (x) Important points: the response variables are supposed to come from a specified exponential family. 12.2.2 Logistic Regression In logistic regression, Yi ∼ Bernoulli (πi ), g is the logit function. Let us limit h to have the form α + βxi for easier discussion. 
Then we have log πi 1 − πi = α + βxi Note that log (π/ (1 − π)) is the natural parameter of the Bernoulli family, since the pmf can be represented as y π (1 − π) 1−y π = (1 − π) exp y log 1−π when the natural parameter is used as the link function, as in this case, it is called the canonical link. We can rewrite the link equation which gives us better intution about how the probability and the linear combination is related. πi = eα+βxi 1 + eα+βxi Estimating logistic regression is done by MLE, as we don’t have a clear equivalent of least squares. This will be done numerically. 51 12.3 Robust Regression Remind the relationship between mean ”minimizes L2 mean) and the median ”minimizes L1 mean). There is a median-equivalent for least squares; which is called LAD ”least absolute deviation) regression. It minimizes n X i=1 |yi − (a + bxi )| This is L1 regression. ”Solvable by LP.) As can be expected, it is quite more robust against outliers. However, the asymptotic normality analysis gives as the ARE of least squares and LAD is 4f (0) ”f is the standard normal pdf) which gives us about 64%. So we give up a good 2 bit of efficiency against least squares. 12.3.1 Huber Loss Analogeous to M-estimator, we can find regression functions that minimizes the Huber loss. The analysis of this is complicated and is omitted from the book as well. However, it hits a good middle ground between the two extreme regression techniques. The book demonstrates this over three datasets; where the errors are generated from normal, logistic, and double exponential distributions. Then, the AREs are calculated between least squares, LAD and M-estimator. The result is very good. Here I replicate the table: Error Normal Logistic Double Exponential vs. least squares 0.98 1.03 1.07 vs. LAD 1.39 1.27 1.14 Almost as good as least squares in normal, completely kick arse in other cases. Very impressive!! Also, note LAD is worse off than least squares in everything. What a piece of crap. Anyways, I sort of understand why professor Boyd said Huber loss will improve things greatly! 52 Linear Algebra Lecture Notes jongman@gmail.com January 19, 2015 This lecture note summarizes my takeaways from taking Gilbert Strang’s Linear Algebra course online. 1 Solving Linear Systems 1.1 Interpreting Linear Systems Say we have the following linear system: 2 4 1 0 3 5 14 1 x1 7 x2 = 35 14 1 x3 There are two complementing ways to interpret this. 1.1.1 The Row-wise Interpretation In the classical row-wise picture, each equation becomes a hyperplane ”or a line) in a hyperspace ”or space). For example, 2x1 + 4x2 + x3 = 14. The solution is where the three hyperplanes meet. 1.1.2 The Column-wise Interpretation In the column-wise picture, we think in terms of column vectors. We want to represent the right hand side as a linear combination of column vectors of A. 1.2 Elimination Elimination is a series of row operations that will change your given matrix into a upper-triangular matrix. The allowed operations are as of follows. • Adding a multiple of a row to another row • Changing the order of rows Elimination, combined with back substitution, is how software packages solve systems of linear equations. 1 1.2.1 Row Operations and Column Operations Say you are multiplying a 1 × 3 row vector with a 3 × 3 matrix. What is the result? It is a linear combination of rows of the matrix. 
h 1 3 i a 2 d g b c e f = 1 × [a, b, c] + 3 × [d, e, f ] + 2 × [g, h, i] h i Similarly, multiplying a matrix with a column vector on its right side gives us a liner combination of columns of the matrix. We conclude that multiplying on the left will give us row operations; multiplying on the right gives us column operations. 1.2.2 Representing Elimination With A Series of Matrices Since elimination is a purely row-wise operation, we can represent it with a series of multiplication operations on the left of the matrix. The matrices that are multiplied to do the elimination are called the elementary matrices or the permutation matrices, depending on what they are trying to do. Now, the elimination process of a 3 × 3 matrix A can be represented as: E23 E12 A = U Now, keep an eye on E23 E12 : if you multiply these together, it will be a single matrix that will do the entire elimination by itself! 1.2.3 Side: Multiple Interpretations of Matrix Multiplications Say we are multiplying two matices A × B = C. Multiple ways to interpret this operation: • Dot product approach: Cij = P Aik Bkj ”all indices are row first) • Column-wise approach: C i = A × B i . Columns of C are linear combinations of columns in A. • Row-wise approach: Ci = Ai × B. Rows of C are linear combinations of rows in B. • Column multiplied by rows: Note that a column vector multiplied by a row vector is a full matrix. Now, we can think of C as a sum of products between ith column of A and ith row of B! • Blockwise: If we split up A and B into multiple blocks where the sizes would match, we can do regular multiplication using those blocks! If A and B were both split into 2 × 2 chunks, each block being a square. Then, C11 = A11 B11 + A12 B21 ! 1.3 Finding Inverses with Gauss-Jordan Say we want to find an inverse of A. We have the following equation: A [c1 c2 · · · cn ] = In 2 where ci is the i-th column of A−1 . Now, each column in In are linear combinations of columns of A - namely, ith column of In is Aci . So each column in In gives us a system of linear equations, that can be solved by Gauss elimination. The way of solving n linear systems at once is called the Gauss-Jordan method. We work with an augmented matrix of form [A|In ] and we eliminate A to be In . We can say: Y Ej [A|In ] = [In |?] Say we found a set of Es that make the above equation hold. Then, we got: Y Ej A = In ⇐⇒ Y Ej = A−1 = Y Ej I n the last equality telling us that the right half of the augmented matrix after the elimination is A−1 , thus proving the validity of the algorithm. 1.3.1 Inverses of Products and Transposes What is (AB) −1 ? We can easily see B −1 A−1 is the answer because: B −1 A−1 AB = I = ABB −1 A−1 Now, (AT )−1 ? We can start from A−1 A = I and transpose both sides. Now we get A−1 So A−1 1.4 T is the inverse of AT . T AT = I T = I Elimination = Factorization; LU Decomposition Doing a Gaussian elimination to a matrix A will reduce it to a upper triangular matrix U . How are A and U related? LU decomposition tells us that there is a matrix L that connects the two matrices. ”Also, note we ignore row exchanges for now) We take all the elimination matrices used to transform A to E. Since row exchanges are not allowed, all of these elements are lower triangular. ”We take an upper row, and add an multiple of it to a lower row). 
Example: let " 2 A= 8 # 1 7 We can represent the elimination with a elimination matrix: " #" # " 1 0 2 1 0 A= −4 1 8 −4 1 Thus 3 # " 1 2 = 7 0 # 1 3 " # 1 2 4 7 = | {z } A " | 1 0 #" # 1 2 −4 1 0 3 {z } | {z } U E −1 We can factor out a diagonal matrix so U will only have ones on the diagonal: A= " #" 1 0 −4 1 2 0 #" 0 1 3 0 1 2 1 # 1.4.1 A case for 3 × 3 matrices Say n = 3. Then we can represent the elimination is E32 E31 E21 A = U Now, the following holds: −1 −1 −1 A = E21 E31 E32 U = LU and we call the product of elimination matrices L. Question: how is L always lower triangular? Let’s start with E32 E31 E21 . Since each elimination matrix subtracts an upper from from lower rows - so everything is moving downwards . A nonzero number cannot move up . Now how do we calculate L? We go through an example, and we make the following claim: if you are not using row exchanges, the multipliers will go into L! That means, the multiplier we used to make U21 zero will go into L21 . This is checked with an example. 1.4.2 Time Complexity for Gaussian elimination Of course, the naive way gives O n3 . To eliminate using the i-th ”0-based) row, we would have to change 2 2 (n − i) cells. So, n2 + (n − 1) + · · · + 12 . Integrating n2 gives 31 n3 – which is the ballpark range for this sum. 1.5 Permutation Matrices Say we have a list of all possible permutation matrices for a matrix of size n × n. There are n! possibilities: we have n! different permutations. What if we multiply two of those? The result is another permutation, so this set is closed under multiplication. What if we invert one of those? The result is also a permutation - so this is closed under inversion as well. 1.5.1 On Transpose and Symmetric Matrices Trivial result: AT A is always symmetric for any A. Proof? T AT A = AT AT T = AT A 4 1.5.2 Permutation and LU matrices How do we account for row exchanges in LU decomposition? We exchange rows before we start! P A = LU 2 Vector Spaces and Subspaces 2.1 Definitions and Examples • Closed under scalar multiplication • Closed under vector addition Some things are obvious: • Closed under linear combination • Contains 0 2.2 Subspaces A subspace is a subset of a vector space, which is a vector space itself. For example, R2 has three kinds of subsets: the plane itself, any line that goes through (0, 0), and {(0, 0)}. If you take a union of two subspaces, in general, it is not a subspace. However, their intersection is still a subspace. 2.3 Spaces of a matrix Given a vector A, the set of all possible linear combinations of its vectors is called the column space of A: C(A). A column space is inherently related to the solvability of linear systems. Say we want to solve Ax = b. Any possible value of Ax is in the column space by definition; so Ax = b is solvable iff when b ∈ C(A). The Null space is defined by the solution space of Ax = 0. Null and column spaces can have different sizes: if A ∈ Rm×n , N (A) ∈ Rn and C (A) ∈ Rm . How do we find column and null spaces? Once again, elimination. 2.4 Finding the Null Space: Solving Ax = 0 Say we eliminated A to get a row echelon form matrix U . Here are some definitions. Pivot columns The columns that contain the pivots used in the elimination. Free columns The rest of the columns. Rank of a matrix Number of pivot columns. 5 Why are free columns called free ? In solving Ax = 0, we can assign arbitrary values to the variables associated with free columns. The rest of the variables will be uniquely defined from those values. 
To find the entire null space, we construct a particular solution for each of the free variables. We can set its value 1, with rest of the free variables 0. Now we can get a particular solution. We repeat for all the n − r free variables, and take their linear combination. We now know this set spans all possible values for the free variables. 2.4.1 Reduced-row Echelon Form Reduced-row echelon form does one more elimination upwards, and make the pivots 1. Making pivots won’t change the solution since we are solving Ax = 0. Also, note that since every entry ”except the pivot itself) of the pivot column is eliminated, if we take the pivot rows and columns we get an identity matrix of size r. The typical form of an rref can be shown as a block matrix: R= " I F 0 0 # where I is the identity matrix, the pivot part, and F is the free part. Note that you can read off the particular solutions directly off the matrix: now each row of the equation Rx = 0 takes the following form: xp + axf1 + bxf2 + cxf3 + · · · = 0 where xp is a pivot variable, and xfi s are the free variables. And now getting the value of xp is extremely easy. We can abstract even further; think of a null space matrix N such that RN = 0. Each column of this matrix is the particular solution. And we can set: N = " −F I # From the block multiplication, we know RN = 0 and how each column of N looks like. 2.5 Solving Ax = b Now we solve a generic linear system. First, some solvability condition: we can solve it if b ∈ C (A). Finding the solution space is pretty simple: • Find a particular solution by setting all free vars to 0, and solving for pivot variables. • Add it to the null space! Is the solution space going to be a subspace? No, unless it goes through origin. 2.5.1 Rank and Number of Solutions The key takeaway is that you can predict the number of solutions by only looking at the rank of the matrix. Say A is a m × n matrix. What is the rank r? 6 • Full rank square matrix: When r = n = m, the rref becomes I and we have exactly one solution for any b in Ax = b. " # I and we • Full column rank: This happens in tall matrices, where r = n < m. The rref looks like 0 n o have no free variables, so N (A) = ~0 . Also, for any b, there might be 0 solutions ”when the zero row should equal a nonzero bi ) or exactly one solution. h i • Full row rank: This happens in wide matrices, where r = m < n. The rref looks like I F ”in prac- tice, the columns of I and F are intertwined.). Since we have no zero rows, the number of solutions is not going to be 0. Also, since there is a free variable, we are always getting a ∞ number of solutions. # " I F . We get 0 or ∞ solutions depending on b. • Not-full column/row rank: The rref looks like 0 0 3 Linear Independence, Basis and Dimension First, a lemma: when A ∈ Rm×n where m < n, there is a nonzero solution to Ax = 0 since we always get a free variable. 3.1 Linear Independence Linear independence A set of vectors v1 , v2 , · · · , vn is linearly independent when no linear combination of them ”except for the 0 combination) result in a 0 vector. i h Equivalently, say A = v1 v2 · · · vn . The column vectors are linearly independent: • N (A) = {0} • A is full column rank. A corollary of the above lemma: 3 vectors in a 2-dimensional space cannot be independent. 3.2 Spanning, Basis, and Dimensions Definition: A set of vectors {v1 , v2, · · · , vl } span a space iff the space consists of all linear combinations of those vectors. 
A basis for a space is an independent set of vectors {v1 , v2 , · · · , vd } which span the space. The number of vectors d is called the dimension of the space. Here are some facts: • A set of vectors in Rn {v1 , · · · , vn } gives a basis iff the n × n matrix with those as columns gives an invertible matrix. • Every basis has the same number of vectors, the number being the dimension of the space. 7 3.2.1 Relationship Between Rank and Dimension The rank of matrix A is the number of pivot columns. At the same time, it is the dimension of the column space of A - C (A). OTOH, what is the dimension of N (A)? For each free variable, we get a special solution with that free variable set to 1 and other free variable to 0. Each of these special solutions are independent, and span the null space! So the dimension of N (A) = n − r. 3.3 Four Fundamental Spaces of A Given a A ∈ Rn×m , here are the four fundamental spaces. • Column space C (A) ∈ Rn • Null space N (A) ∈ Rm • Row space R (A) = all linear combinations of rows of A = C AT ∈ Rm • Null space of AT - often called the left null space of A : N AT ∈ Rn This is called the left null space because, N AT = y|AT y = 0 = y|y T A = 0 the latter equality derived from taking the transpose. We should understand what dimensions and basis do they have; and also how do they relate with each other. First, the dimension and the basis: • C (A) has dimension r, and the pivot columns are the basis. • N (A) has dimension n − r , and the special solutions are the basis. • R (A) has dimension r, and the first r rows in the rref are the basis. • N AT has dimension m − r, and the last m − r rows in the elementary matrix E s.t. EA = R. IOW, the E that comes out of Gauss-Jordan. Note that summing the dimension of C (A) and N (A) gives n, where summing dimensions of R (A) and N AT gives m! 3.3.1 Elimination and Spaces What does taking rref do to the matrix’s spaces? We do know C (A) 6= C (R) since the last row of R can potentially be zero. However, elimination does not change the row space; and its basis is the first r rows in R. 3.3.2 Sets of Matrices as Vector Space Suppose the following set: all sets of 3 × 3 matrices. It is a vector space because it is closed under addition, multiplication by scalar! What are some possible subspaces? All upper triangular matrices, all symmetric matrices, all diagonal matrices, multiples of I, .. 8 We discuss an interesting properties of these different subspaces: dim (A + B) = dim A + dim B − dim (A ∩ B) 3.3.3 Solutions As Vector Space What are the solutions of the following differential equation? ∂2y +y =0 ∂x2 We can say: y = a sin x + b cos x. Now, we can think of the solution as a vector space with sin x and cos x as their basis. Note these are functions, not vectors! So this is a good example why the idea of basis and vector spaces plays a large role outside the world of vectors and matrices. 3.3.4 Rank One Matrices Rank one matrices are special, as we can decompose it into a product of column vector and row vector A = uv T . They are building blocks of other higher-rank matrices. A four-rank matrix can be constructed by adding four rank-one matrices together. 4 Applications 4.1 Graphs and Networks The most important model in applied math! A graph with m edges and n nodes can be represented by an incidence matrix of size m × n, each row representing an edge. The entry Aij = 1 if edge i is coming into node j, −1 if edge i leaves node j, and 0 otherwise. Note this is different from the adjacency matrix form I am used to. 
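A tiny sketch of such an incidence matrix (my own toy example, not from the lecture): 4 nodes and 5 edges, where each edge goes from a tail node to a head node.

import numpy as np

# Edge i goes tail -> head: entry is -1 at the tail (edge leaves) and +1 at the head (edge enters).
edges = [(0, 1), (1, 2), (0, 2), (2, 3), (0, 3)]
m, n = len(edges), 4

A = np.zeros((m, n))
for i, (tail, head) in enumerate(edges):
    A[i, tail] = -1
    A[i, head] = 1

print(A)
print(np.linalg.matrix_rank(A))   # n - 1 = 3 for a connected graph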
Some remarks about how notions in electric network and linear algebra concepts relate to each other. • Loops: When 2 or more edges form a loop ”not necessarily a cycle), those rows are _not_ independent and vice versa. • Sparsity: This is a very sparse matrix! However, in applied linear algebra, it is way more common to have structured matrices. • Null space: Say x is a vector of potentials for each nodes - then Ax means differences between potentials of nodes. So Ax = 0 gives you pairs of nodes for which the potentials are the same. If the graph is connected, the null space is a single dimensional space - c~1. • N AT : What does AT y = 0 mean? Kirchhoff’s circuit law is like the flow preservation property in electrical networks; net incoming/outgoing current is zero for any node. If y is a vector of currents, AT y is a vector of net incoming currents. – The basis of this null spaces are related to the loops in the graph. Suppose we pick a loop and send a unit current along it. This gives us a basis. 9 – Repeat, and we can take all loops one by one and take all the basis! – Say there are two loops: a − b − c − a and a − b − e − a. Is a − c − b − e − a a valid loop? No, it is the sum ”or difference) of two loops and the special solution will be dependent of the previous two special solutions. – Now, realize that the number of the loop is dim N AT = m − r! • R AT : What do the pivot columns represent? It is a fucking spanning tree! Whoa If there is a cycle those rows are not going to be independent! Taking all of this together, we can derive Euler’s formula! • dim N AT = m − r • r = n − 1 ”since the pivot columns represent a spanning tree which always have n − 1 edges!) • Then Number of loops = Number of edges − (Number of nodes − 1) Now, in graph theory speak: F = E − V + 1 ⇐⇒ V − E + F = 1 Ladies and gentlemen, let me introduce Euler’s formula. Holy crap. Also, we can merge everything in a big equation. So far, we know: • e = Ax ”potential differences) • y = Ce ”Ohm’s law) • AT y = f ”Kirchhoff’s law) So we get: AT CAx = f whatever that means haha. 5 Orthogonality 5.1 Definition What does it mean for two subspaces/vectors/basis to be orthogonal? Vector orthogonality: Two vectors x and y are orthogonal iff xT y = 0. We can connect this to Pythagorean 2 2 2 theorem - kxk + kyk = kx + yk iff x and y are orthogonal. 2 2 2 kxk + kyk = xT x + y T y = kx + yk = xT x + y T y + 2xT y ⇐⇒ xT y = 0 10 Subspace orthogonality: two subspace S is orthogonal to subspace T when every vector in S is orthogonal to every vector in T . Examples: • Are xz and yz planes orthogonal in R3 ? No they aren’t: they have a nonzero intersection! The vectors in that intersection cannot be orthogonal to themselves. Facts: • Row space orthogonal to null space. Why? x ∈ N (A) iff 0 r1 · x . . . . Ax = . = . 0 rn · x So x is orthogonal to all rows. And of course, it will be orthogonal to all linear combinations of the rows. • Row space and null spaces are orthogonal complements of R3 : nullspace contains all vectors orthogonal to the row space. 5.2 Projections 5.2.1 Why Project? If Ax = b cannot be solved in general. So what do we do? We find the closest vector in C (A) that is closest to b, which is a projection! 5.2.2 2D Case Suppose we project vector b onto a subspace that is multiple of a. Say that the projected point is ax. We know the vector from ax to b is orthogonal to a. 
We have: 0 = aT (ax − b) = aT ax − aT b ⇐⇒ x = aT b aT a Look at the projected point ax: ax = a aT b = aT a aaT aT a b Note the last formula - that rank one matrix is the projection matrix P ! It has the following properties: P 2 = P and P T = P 5.2.3 General Case It’s the same derivation! If p is the projected point, we can write it as Aˆ x. Then the error vector b − Aˆ x is perpendicular to the column space of A. So: −1 T 0 = AT (b − Aˆ x) = AT b − AT Aˆ x ⇐⇒ x ˆ = AT A A b 11 Welcome pseudoinverse! Now, get p by −1 T p = Aˆ x = A AT A A b {z } | P We got the projection matrix! 5.3 Least Squares When A is tall, Ax = b is not generally solvable exactly. We multiply AT to both sides of the equation to get AT Ax = AT b Hoping AT A is invertible and we can solve this exactly. When is AT A invertible? It turns out rank of AT A is equal to rank of A - so AT A is invertible only when A is full column rank. 5.3.1 Invertibility ofAT A Assume A is full column rank. Let’s prove the following set of equivalent statements: AT Ax = 0 ⇐⇒ x = 0 ⇐⇒ N AT A = {0} ⇐⇒ AT A is invertible Take the first equation, and calculate dot products of each side with x: T xT AT Ax = 0 = (Ax) Ax ⇐⇒ Ax = 0 Since A is full column rank, N (A) = {0}. So x must be 0. 5.3.2 Least Squares as A Decomposition If least square decomposes a vector b into p + e, p ∈ C (A) and e ∈ N (A). Now, if p = P b, what is e? Of course, e = (I − P ) b. We get p + e = (P + I − P ) b = b. 5.4 Orthonormal Basis A set of orthonormal vectors {q1 , q2, · · · } is a set of unit vectors where every pair is perpendicular. We can write this with a matrix Q: Q = [q1 q2, · · · ] where qi s are column vectors. Now, the above requirement can be written as: QT Q = I This matrix Q is called a orthonormal matrix. ”Historically, we only call it orthogonal when it’s a square...) What happens when Q is square? Since QT Q = I, Q = Q−1 . 12 5.4.1 Rationale The projection matrix can be found as P = Q QT Q −1 QT = QQT 5.4.2 Gram-Schmidt Given a set of vectors, how can we make them orthonormal? Well... I sorta do know... tedious process to do by hand. • Take an arbitrary vector, and normalize it. Include it in the result set. • For each other vector v, – For each vector u in the result set, subtract the projection of v onto u from v: v = v − uT v · u – Normalize the resulting v and include it in the result set. 5.4.3 QR Decomposition How do we write Gram-Schmidt in terms of matrices? We could write Gaussian elimination by P A = LU We write this as: A = QR Note that, because of how Gram-Schmidt works, R is going to be a lower triangular matrix! The first column of Q is going to be the first column of A, scaled! Also, the second column is going to be a linear combination of first two columns of A and etc. 6 Determinants 6.1 Properties 1. Identity matrix has determinant 1. 2. When you swap two rows, the sign of the determinant will change. 3. The determinant is a linear function of the matrix: ”I’m not saying det A + det B = det (A + B) – this is not true) ”a) If you multiply a row by a scalar t, the determinant will be multiplied by t. h h iT h iT + det a · · · y = det a · · · x · · · b ”b) det a · · · x + y · · · b The following can be derived from the above points: 13 ··· b iT • If two rows are equal, determinant is 0. • If we subtract a multiple of a row from another row, determinant doesn’t change. ”This can be proved from 3b and 4) • If there’s a row of zeros, determinant is 0. • For a upper triangular matrix, determinant is the product of the diagonal products. 
”Proof: do reverse elimination to get a diagonal matrix with same determinant, use rule 3a repeatedly until we get I) • det A ⇐⇒ A is singular • Determinant is multiplicative: det AB = det A · det B – So: det A−1 = 1 det A • det A = det AT ”this means, swapping columns will flip the sign) – Proof: let A = LU . det A = det (LU ) = det L det U . det L = 1, and det U = det U T since it is upper triangular. 6.2 Calculation 6.2.1 Big Formula X det A = p∈permute(1···n) ± Y aipi Argh.. messy. Whatever. 6.2.2 Cofactor Taking the big formula, and collecting terms by the number in the first row gives the cofactor expansion formula. Say the matrix is 3 × 3: det A = a11 (a22 a33 − a23 a32 ) + a12 (−a21 a33 + a23 a31 ) + a13 (a21 a32 − a23 a31 ) Notice the quantities in parenthesis are either determinants of the 2 × 2 matrices, or their negatives. Formally, Cij is defined to be a cofactor Cij = ± det (smaller matrix with row iand col jremoved) where the sign is + when i + j is even, - when i + j is even. This follows the checkerboard pattern. ”Formally, cofactors with the sign are called minors.) The resulting cofactor formula for the determinant is: det A = a11 C11 + a12 C12 + · · · + a1n C1n 14 6.3 Applications 6.3.1 Formula for A−1 A−1 = 1 CT det A where C is the matrix of cofactors in A. How do we verify it? Let’s check if AC T = (det A) I Expand the elements of the matrix - we actually see the diagonals are the determinants, from the cofactor formula! But what about the off-diagonal elements? We claim those calculations are actually using cofactor formula for a matrix with two equal rows! Aaaaah..... 6.3.2 Ax = b and Cramer’s Rule Of course, we know x= 1 CT b det A What are the entries of x in this formula? Since ci b is always a determinant of some matrix, we can write it as: xi = det Bi det A for some matrix Bi . Cramer realized that Bi is A with column 1 replaced by b. This are beautiful, but not practical ways of calculating stuff. 6.3.3 Determinant and Volume The determinant of a n × n matrix A is the volume of a box in a n dimensional space with n sides of it coinciding with the column/row vectors of A. Let’s look at some example cases to get ourselves convinced: • A = I: it’s going to be a ”hyper)cube with volume 1. • A = Q ”orthonormal): another cube, rotated. From here we can rediscover QT Q = I - take determi2 nants from both sides: (det Q) = 1. So det Q = ±1. Also, we revisited many determinant properties and made sure they hold in this context as well. 7 Eigenvalues and Eigenvectors 7.1 Definition You can think of a matrix as a linear mapping, using f (x) = Ax. For a matrix, we can find vector”s) x such that f (x) is parallel to x. Formally: 15 Ax = λx where λ is a scalar, called the eigenvalue. The xs are called the eigenvectors. 7.1.1 Example: Projection Matrix Say there is a projection matrix, which projects stuff into a plane: when a vector is already in the plane, it won’t change. So they will be eigenvector with eigenvalue 1. We can say there are two perpendicular eigenvectors that span the plane. Are there any other eigenvalue? ”Intuitively, we expect to find one, since we are in a 3D space) Yes, find the normal vector that goes through the origin. This will become a 0 vector - so this one is an eigenvector with eigenvalue 0. 7.1.2 Example 2: Permutation Matrix Let A= " 0 1 # 1 0 Trivially, [1, 1] is an eigenvector with eigenvalue 1. Also, [−1, 1] is an eigenvector with eigenvalue -1. T T 7.1.3 Properties • n × n matrices will have n eigenvalues. 
• Sum of the eigenvalues equal the trace of the matrix ”sum of the diagonal entries). • The determinant is the product of the eigenvalues. 7.2 Calculation Rewrite the equation as: (A − λI) x = 0 Now, for this to be true for nontrivial x, A − λI has to be singular: this is equvalent to: det (A − λI) = 0 We solve this to find λs. After that, we can use elimination to find x. 7.2.1 Properties If we add aI to a matrix, each of its eigenvalues will increase by a. The eigenvectors will stay the same! See: if Ax = λx, (A + aI) x = λx + ax = (λ + a) x 16 What do we know about general addition? If we know eigenvalues and eigenvectors of A and B, what do we know about A + B? First guess would be their eigenvalues be added, but false. Because they can have different eigenvectors. 7.2.2 When Things Are Not So Rosy Think of a rotation matrix; what vector is parallel to itself after rotating 90 degrees? None! When we carry out the calculation, we get complex eigenvalues. Also, for some matrices we can have duplicate eigenvalues, and no independent eigenvectors. Why is this important? We will see. 7.3 Diagonalization of A Matrix Let us assume that A has n linearly independent eigenvectors, x1 to xn , each associated with eigenvalues λ1 to λn . We can put them in columns of a n × n matrix, and call this S. Then, we can write what we know as follows: h AS = A x1 x2 i h x n = λ1 x 1 ··· λ2 x 2 ··· λn x n i How do we write the latter with a matrix representation? We can multiply S with a diagonal matrix with eigenvalues along the diagonal: h λ1 x 1 λ2 x 2 ··· ··· λ1 0 i 0 λn x n = S .. . λ2 .. . ··· .. . 0 .. . 0 ··· λn 0 We call the diagonal matrix Λ. Now we have: 0 = SΛ AS = SΛ ⇐⇒ S −1 AS = Λ ⇐⇒ A = SΛS −1 Note since we assumed the eigenvectors are linearly independent, S is invertible. From this, we can infer interesting properties of eigenvalues. What are the eigenvectors and eigenvalues of A2 ? Intuitively, of course, the eigenvectors are the same - for an eigenvector x, Ax is a scalar multiplication of x. So A2 x will still be a scalar multiplication, with a factor of λ2 . However, we can see it from diagonalization as well: A2 = SΛS −1 2 = SΛS −1 SΛS −1 = SΛ2 S −1 In general, if you take powers of a matrix, it will have the same eigenvectors, but the eigenvalues will get powered. 7.3.1 Understanding Powers of Matrix Via Eigenvalues What does lim An n→∞ look like? Does it ever go to 0? We can find out, by looking at its eigenvalues: 17 An = SΛn S −1 For An to go to 0, Λn has to go to zero - so the absolute value of all eigenvalues are less than 1! 7.3.2 Understanding Diagonalizability A is sure to have n independent eigenvectors ”and thus be diagonalizable) if all the eigenvalues are different. Note there are cases where there are repeated eigenvalues and there are independent eigenvectors ”well, take I for an example). 7.4 Applications 7.4.1 Solving Recurrences Let’s solve a difference equation: uk+1 = Auk with a given u0 . How to find uk = Ak u0 , without actually powering the matrix? The idea is to rewrite u0 as a linear combination of normalized eigenvectors of A: u0 = c 1 x 1 + c 2 x2 + · · · + c n x n Now, Au0 is: Au0 = Ac1 x1 + Ac2 x2 + · · · + Acn xn = c1 λ1 x1 + c2 λ2 x2 + · · · Yep, we got a pattern! Now we know: Ak u0 = X ci λki xi i The idea of eigenvalues kind of sank in after looking at this example; they, sort of, decompose the linear mapping represented by A into orthogonal basis. After you represent a random vector in this space, the effects of A can be isolated in each direction. 
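Here is a minimal numpy sketch of that recipe (the 2 x 2 matrix and starting vector are arbitrary choices of mine, just to check the algebra): expand u0 in the eigenvector basis, power the eigenvalues, and compare against naively powering A.

import numpy as np

A = np.array([[0.9, 0.3],
              [0.1, 0.7]])
u0 = np.array([1.0, 2.0])
k = 20

lam, S = np.linalg.eig(A)          # columns of S are the eigenvectors x_i

# u0 = S c, so the coefficients are c = S^{-1} u0.
c = np.linalg.solve(S, u0)

# A^k u0 = sum_i c_i * lam_i^k * x_i, no matrix powering needed.
uk_eig = S @ (c * lam**k)

uk_naive = np.linalg.matrix_power(A, k) @ u0
print(np.allclose(uk_eig, uk_naive))   # True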
So it actually describes A pretty well! The name eigen now kind of makes sense. 7.4.2 Deriving Closed Form Solution of Fibonacci Sequence Let f0 = 0, f1 = 1, fn = fn−1 + fn−2 . I know that I can write un = " fn fn−1 # and un+1 " 1 = 1 # 1 un = Aun 0 What are the eigenvalues of A? The characteristic polynomial comes out to be (1 − λ) · (−λ) − 1 = λ2 − λ − 1 = 0 18 Plugging this into the quadratic formula, we get: √ √ √ 1± 5 1+ 5 1− 5 λ= : λ1 = , λ2 = 2 2 2 Since λ1 > 1, λ2 < 1, when n grows, λ1 will dominate the growth of the fibonacci number. So we know: √ !100 1+ 5 2 f100 ≈ c · Why? Rewrite u0 as u0 = c 1 x1 + c 2 x2 and we know 100 u100 = c1 λ100 1 x 1 + c 2 λ2 x 2 and with λ2 < 1, the second term is meaningless when k is large. Whoa whoa.... nice... For the sake of completeness, let’s calculate the eigenvectors as well. The eigenvectors are: x1 = " λ1 1 # , x2 = " λ2 1 # Solve for c1 and c2 by equating: u0 = " # 0 1 = c 1 x1 + c 2 x2 = " √ # c1 +c2 +(c1 −c2 ) 5 2 c1 + c2 Now we have the formula for uk - thus a closed form solution for fk . 7.4.3 Differential Equations Consider the following system of differential equations: du1 = −u1 + 2u2 dt du2 = u1 − 2u2 dt with initial value u1 (0) = 1, u2 (0) = 0. Let u (t) = [u1 (t) , u2 (t)] and we can write above as: T du = Au dt with A= " −1 1 2 # −2 The eigenvalues of A are 0 and -3, and the associated eigenvectors are x1 = [2, 1] and x2 = [1, −1] , T respectively. 19 T Now, the solution is a linear combination of two special solutions, each corresponding with an eigenvalue: u (t) = c1 eλ1 t x1 + c2 eλ2 t x2 which we can check by plugging into the above representation. We can solve for c1 and c2 using the initial condition - and we now know everything. So eigenvalues still give us insights about which parts of the solution blows up, or goes to 0, etc. Of course, since eigenvalues now sit on the exponent, it has to be negative to go to 0. If the eigenvalue is 0, the corresponding portion will stay constant. Note: the eigenvalues might be imaginary - in that case, only the real portion counts in terms of asymptotic behavior. For example, e(−3+6i)t = e−3t e6it = e−3t (cos t + i sin t) 6 and the latter part’s magnitude is 1. 7.4.4 Thinking In Terms of S and Λ In du = Au dt A is a non-diagonal matrix, and represents the interaction, relation between the variables. This is coupling; we can decouple variables by using eigenvalues. Now, decompose u into a linear combination of eigenvectors by setting u = Sv. We get: du dv dv =S = ASv ⇐⇒ = S −1 ASv = Λv dt dt dt Wow, we now have a set of equations like: v1′ = λ1 v1 , v2′ = λ2 v2 , and so on. Same as in the difference equation example. Now, how do we express the solution in this? Since dv = Λv dt we would like to express v as v (t) = eΛt v (0) which gives u (t) = SeΛt S −1 u (0) which gives the decoupling effect we are looking for. But WTF is that eΛt ? First let’s define eAt as: eAt = I + At + ∞ 2 n X (At) (At) + ···+ = 2 n! n=0 Just like the power series definition for eax = 1 + ax + 20 (ax)2 2! + (ax)3 3! + · · · . Now what is SeΛt S −1 ? e At =e SΛS −1 So it’s same as eAt ! 7.5 X SΛS −1 = n! n tn = X SΛn S −1 tn n! =S X Λ n tn n! S −1 = SeΛt S −1 Applications: Markov Matrices and Fourier Series 7.5.1 Markov Matrix What is a Markov Matrix? Nonnegative square matrix, with all columns summing up to 1. Now, we know those processes are never going to blow up. Maybe they will reach a steady state. We already know: • Any eval cannot be greater than 1, because it will make things blow up. 
• If an eval equals 1, the evec for that eval will be the steady state. We will find out that Markov matrices always have 1 as an evalue. Also, it will never be a repeated eigenvalue. 7.5.2 Proof of 1 Being An Eigenvector Since A has 1 as column sums, A − 1I has zero sum columns. Now, the rows are dependent: add up all rows and they sum to 0. This leads us to the corollary that A and AT has same eigenvalues ”because T det (A − λI) = det (A − λI) = det AT − λI ) 7.5.3 Projections with Orthonormal Basis If we have a set of orthonormal basis, arbitrary vector v can be represented as: v = Qx Since Q is a projection matrix, Q−1 = QT . So we can solve for x as: x = QT v Nothing Fourier related so far. What now? 7.5.4 Fourier Series We write a given function f as a linear combination of sin and cos: f (x) = a0 + a1 cos x + b1 sin x + a2 cos 2x + b2 sin 2x + · · · This infinite series is called the Fourier series. We now work in function space; instead of orthogonal vectors we use orthogonal functions: 1, cos x, sin x, cos 2x, and so on. We represent a function with a linear combination of those basis functions. But what does it mean that two functions are orthogonal? How is dot product defined between two functions? We define: f T g (x) = ˆ f (x) g (x) dx 21 We can calculate this between constants, sines and cosines because they are all periodic. Now, a0 is easy to determine - take the average value. What is a1 ? ˆ 2π f (x) cos xdx = a1 0 ˆ 0 2π 1 cos xdx =⇒ a1 = π 2 ˆ 2π f (x) cos xdx 0 where the latter equality comes from the fact the basis functions are orthogonal. ”Btw, I’m not sure how we can fix the bounds on the above integral. Maybe I should go back to the book.) 7.6 Symmetric Matrices and Positive Definiteness When A ∈ Sn where Sn is the set of n × n symmetric matrices, we state: • All eigenvalues are real • We can choose eigenvectors such that they are all orthogonal. The exact proof is left to the book. The usual diagonalization, A = SΛS −1 now becomes A = QΛQ−1 = QΛQT , the latter equality coming from the fact that Q has orthonormal columns, so QQT = I. This is called the spectral theorem; spectrum means the eigenvalues of the matrix. 7.6.1 Proof Why real eigenvalues? Let’s say Ax = λx We can take conjugates on both sides; ¯x ¯x = λ¯ A¯ However, since we assume A to be real, we know: ¯x ¯x = A¯ A¯ x = λ¯ Try using symmetry, by taking transposing stuff: ¯ x ¯T AT = xT A = x ¯T λ The second equality coming from symmetry assumption. Now, multiply x ¯T to the both sides of the first equation: x ¯T Ax = λ¯ xT x And we multiply x to the right side of the last equality: ¯ Tx x ¯T Ax = λx Then 22 λ¯ xT x = x ¯T Ax = λxT x Now, we know: ¯ λ=λ thus λ is real - if x ¯T x is nonzero. x ¯T x = X i ”unless x is 0. But if x is 0, λ = 0) (a + ib) (a − ib) = X a 2 + b2 > 0 i 7.6.2 When A is Complex ¯ The last equality can work if: We repeat the above argument, without assuming A = A. A=A T 7.6.3 Rank One Decomposition Recall, if A is symmetric, we can write: A = QΛQT = [q1 q2 · · · ] λ1 λ2 q1T T q2 = λ1 q1 q1T + λ2 q2 q2T + · · · .. .. . . So every symmetric matrix can be decomposed as a linear combination of perpendicular projection ”rank one) matrix! 7.6.4 Pivots And Eigenvalues Number of positive/negative eigenvalues for symmetric matrices can be determined from the signs of the pivots. The number of positive eigenvalues is the same as the number of positive pivots. 7.6.5 Positive Definiteness A PSD matrix is a symmetric matrix. If symmetric matrices are good matrices, PSD are excellent . 
It is a symmetric matrix with all eigenvalues are positive. Of course, all the pivots are positive. So, for 2 × 2 matrices, PSD matrices always have positive determinants and positive trace. However, this is not a sufficient condition for positive definiteness, as can be demonstrated in the following matrix: " # −1 0 0 −2 We state that a matrix is positive definite iff all its subdeterminants are positive; they determinants of submatrices formed by taking a m × m submatrix from the left top corner. 23 To summarize: • All eigenvalues are positive • All pivots are positive • All subdeterminants are positive 7.7 Complex Numbers and Examples Introducing complex numbers and FFT. 7.7.1 Redefining Inner Products for Complex Vectors If z ∈ Cn , z T z is not going to give me the length of the vector squared, as in real space. Because z T z = P P 2 2 2 j (aj + ibj ) = j aj − bj 6= kzk. As seen in the proof of real eigenvalues of symmetric vectors, we need: z¯T z. For simplicity, we write: z¯T z = z H z = X 2 kzi k = kzk 2 where H stands for Hermitian. So, from now on, let’s use Hermitian instead of usual inner product. 7.7.2 Redefining Symmetric Matrices We also claim our notion of symmetric matrix A = AT is no good for symmetrix matrix. We want A¯T = A 2 for obvious reasons ”so the diagonal elements of A¯T A are kai k ). We define a Hermitian matrix A to satisfy: AH , A¯T = A 7.7.3 Orthonormal Basis Now, for a matrix Q with orthonormal columns, we say: QH Q = I Also we call these matrices unitary. 7.7.4 Fourier Matrices 1 1 Fn = 1 . . . 1 w 1 w2 w2 .. . w4 .. . 1 wn−1 24 w2(n−1) ··· · · · · · · · · · ··· to generalize, we have (Fn )ij = wij . Note both indices are 0-based. Also, we want wn = 1 - nth primitive root of unity. So we use w = cos 2π 2π + sin = ei(2π/n) n n ”Of course I know we can use modular arithmetic to find a different primitive root of unity.. but meh.) One remarkable thing about Fourier matrices is that their columns are orthonormal. Then the following is true: T FnH Fn = I ⇐⇒ Fn−1 = F¯n which makes it easy to invert Fourier transforms! 7.7.5 Idea Behind FFT Of course, FFT is a divide-and-conquer algorithm. In the lecture, larger order Fourier matrices are connected to a smaller order matrice by noticing: Fn = " I D I −D #" Fn/2 0 0 Fn/2 # P for which P is a odd-even permutation matrix. Exploiting the structures of these matrices, this multiplication can be done in linear time. 7.8 Positive Definiteness and Quadratic Forms 7.8.1 Tests of Positive Definiteness These are equivalence conditions of positive definiteness for a symmetric matrix A: 1. All eigenvalues are positive. 2. All subdeterminants are positive. 3. All pivots are positive. 4. xT Ax > 0 for all x. ”Also, if all the positiveness in above definition is swapped by nonnegativeness, we get a positive semidefinite matrix.) Now what’s special with the new 4th property? Actually, property 4 is the definition of positive definiteness in most texts; property 1-3 are actually just the tests for it. What does the product xT Ax mean? If we do it by hand we get: xT Ax = X Aij xi xj i,j This is called the quadratic form. Now, the question is: is this positive for all x or not? 25 7.8.2 Graphs of Quadratic Forms Say x ∈ R2 and let z = xT Ax; what’s the graph of this function? If x is not positive ”semi)definite, we get a saddle point. A saddle point is a maximum in one direction whilest being a minimum in another. ”Actually these directions are eigenvalue directions.) What happens when we have a positive definite matrix? 
We know z will be 0 at the origin, so this must be the global minimum. Therefore, we want the first derivative to be all 0. However, this is not enough to ensure minimum point. We want to refactor quadratic form as a sum of squares form ” the completing the squares trick). So we can ensure that xT Ax is positive everywhere except 0, given a particular example. But how do we do it in a general case? The course reveals that, actually, Gaussian elimination is equivalent to completing the squares! Holy crap... Positive coefficients on squares mean positive pivots, which means positive definiteness! And if we try to recall Calculus we were presented with a magic formula called the second derivative test - which was just checking if second derivative matrix was positive definite! Niiicee. 7.8.3 Geometry of Positive Definiteness; Ellipsoids If a matrix is positive definite, we know xT Ax > 0 for except x = 0. If we set xT Ax = c for a constant c, this is an equation of a ellipse - or an ellipsoid. The major/middle/minor axis of this ellipsoid will be determined by the eigenvectors of A, their lengths being determined by the eigenvalues. 7.8.4 AT A is Positive Definite! Yep, AT A is always positive definite regardless of A. Because covariances.. but here’s a proof: xT AT Ax = (Ax) (Ax) = kAxk > 0 unless x = 0 2 T 7.9 Similar Matrices 7.9.1 Definition Two n × n matrices A and B are similar, if for some invertible M , A = M BM −1 . Example: A is similar to Λ because Λ = S −1 AS. Why are they similar? They have the same eigenvalues. How do we prove it? Let Ax = λx. Then Ax = M BM −1 x = λx ⇐⇒ BM −1 x = λM −1 x. Now, M −1 x is an eigenvector of B with the same eigenvalue. So this measure divides matrices into groups with identical set of eigenvalues. The most preferable of them, obviously, are diagonal matrices. They have trivial eigenvectors. 7.9.2 Repeated Eigenvalues And Jordan Form What if two or more eigenvalues are the same? Then the matrix might not be diagonalizable - it might not have a full set of independent eigenvectors. What happens now? Using above definition of similar matrices, there can be two families of matrices with same set of eigenvalues, but not similar to each 26 other. One is the diagonal matrix, in its own family. The other family contains all the other matrices with given eigenvalues. In the latter family, the matrix that is as close as possible to the diagonal matrix, is called the Jordan Form for the family of matrices. 7.9.3 Updated Definition Looking at some examples reveals that similar matrices actually have the same number of eigenvalues as well. So when eigenvalues are repeated, diagonal matrices have n eigenvectors - other matrices don’t. 7.9.4 Jordan Theorem Every square matrix A is similar to a Jordan matrix J which is composed of Jordan blocks in the main diagonal. For an eigenvalue λi repeated r times, the Jordan block is a r × r matrix with λi on the diagonals, and 1s on the above diagonal with rest of the entries 0. 7.10 Singular Value Decomposition 7.10.1 Introduction We can factorize symmetric matrices as A = QΛQT . We can also factor a general matrix A as A = SΛS −1 but this can’t be done in some cases, when S is not invertible. SVD is a generalized diagonalization approach which defines a decomposition A = U ΣV T for any matrix A with a diagonal Σ and orthonormal U and V . In other words, we can say: A = U ΣV T = U ΣV −1 ⇐⇒ AV = U Σ where V is a orthnormal basis for the row space, U is an orthonormal basis for the column space. Hmm. 
Do the dimensions even match? 7.10.2 Calculation How can we find V ? Let’s think of AT A. AT A = V ΣT U T U ΣV T = V Σ2 V T Yes! We can factorize AT A ”which is posdef) to find its eigenvalues and orthnormal eigenvectors. The eigenvectors will be V , and we can construct Σ by taking the square root of eigenvalues of AT A ”which is possible, since AT A is posdef). Similarly, looking at AAT will let us find the U s. Of course, our assertion ”that U and V are basis of column and row spaces, respectively, and why the eigenvalues of AT A and AAT are the same) needs to be proven. ”In the lecture, the latter fact is stated as obvious.. where did I miss it?) 27 8 Linear Transformation 8.1 Linear Transformation and Matrices A linear transformation is a function T with the following properties: • T (u + v) = T (u) + T (v) • T (cu) = cT (u) Of course, all linear transformations have an unique matrix representation, and vice versa. How can we find a matrix given a linear transform? Let’s say we have a linear transform T : Rn → Rm . We have a basis for the input space v1 , · · · , vn and a basis for the output space w1 , · · · , wm . How do we find the matrix for this transformation? Transform the first basis v1 to get: T (v1 ) = c1 w1 + c2 w2 + · · · + cm wm . And those coefficients c1 · · · cm take up the first column of the matrix. 8.2 Change of Basis Image compression: say we are compressing a grayscale image. Using standard basis to represent this image does not exploit the fact that neighboring pixels tend to have similar luminousity value. Are there other bases that can give us good sparse approximations? JPEG uses the Fourier basis. It divides images into 8 × 8 blocks and changes the basis to Fourier basis, and we can threshold the coefficients. Hmm, actually, I want to try this out. Also, wavelets are a popular choice recently. 8.2.1 Same Linear Transformation from Different Bases Say we have a linear transformation T . If we put vectors in coordinate system defined by a set of basis v1 · · · vn , we will get a matrix for the transformation. Let’s call it A. If we use another set of basis w1 · · · wn , we get another matrix which we call B. A and B are similar as in 7.9! 8.2.2 Eigenvector basis Say we have an eigenvector basis: T (vi ) = λi vi . Then, the change of basis matrix is diagonal! 8.3 Left, right, pseudoinverse Let’s generalize the idea of an inverse. A 2-sided inverse is the usual inverse: AA−1 = I = A−1 A A matrix A has a 2-sided inverse when it’s square, and has full rank: n = m = r. 28 8.3.1 Left and Right Inverses Suppose a full column-rank matrix n = r. The null space is just {0}, and for any Ax = b there is either 0 or 1 solution. Since A is full column rank, AT A is full rank and invertible. Note that we have a left inverse −1 T here: AT A A . If we multiply this inverse and A we get: −1 T −1 T AT A A A = AT A A A =I Resulting I is a n × n matrix. Similarly, suppose a full row-rank matrix ”m = r). The null space has dimension n − r, so for every −1 Ax = b there are infinitely many solutions. Now we have a right inverse: AT AAT . 8.3.2 Left Inverse and Project What do we get if we multiply the left inverse on the right? −1 T A AT A A This is a projection matrix! Similarly, if we multiply the right inverse on the left, we will get a projection matrix that projects onto its row space. 8.3.3 Pseudoinverse What is the closest thing to an inverse when A is neither full row or column rank? 
Think of this: take any vector in the row space, and transform it by A, you’ll get something from a column space. ”In this regard, A is a projection onto the column space.) The two spaces ”row and column) have the same dimension, so they should have a 1:1 ”or invertible) relationship - so the matrix that undoes this transformation is called the pseudoinverse, A+ . 8.3.4 Pseudoinverse from SVD How do we find a pseudoinverse? Maybe we can start from SVD: A = U ΣV −1 where Σ is a ”sort-of) diagonal matrix sized m × n with singular values on the diagonal. What is its pseudoinverse? Well that’s easy - we can get a n×m matrix and put inverse singular values on the diagonal. Then what happens in ΣΣ+ ? We get a m × m matrix with first r diagonal entries 1. Σ+ Σ works similarly. How do we go from Σ+ to A+ ? Because U and V are orthonormal, we can just use: A+ = V Σ+ U T 29 Singular Vector Decomposition January 19, 2015 Abstract This is my writeup for Strang’s Introduction to Linear Algebra, Chapter 6, and the accompanying video lectures. I have never had an intuitive understanding about SVDs... now I sorta do. Let’s go over the basics, derivation, and intuition. 1 Eigenvalues and Eigenvectors 1.1 Introduction Let A be a n × n matrix. Then x ∈ Rn is an eigenvector of A iff: Ax = λx ”1) for some number λ. In laymen’s terms, x does not change its direction when multiplied by - it merely scales by λ. λ is called the eigenvalue associated with x. Of course, any nonzero multiple of x is going to be a eigenvector associated λ; it is customary we take unit vectors. 1.1.1 Calculation How do we find eigenvalues and eigenvectors given A? Rewrite ”1) as: Ax = λIx ⇐⇒ (A − λI) x = 0 In other words, x must be in null space of A − λI. In order to have nontrivial x, A − λI must be singular. That’s how we find eigenvalues - solve: det |A − λI| = 0 This boils down to solving a nth order polynomial. The roots can be anything - they can be repeated, negative, or even complex. 1.1.2 Properties These nice properties hold: • If A is singular, 0 is an eigenvalue - of course, because det A = det |A − 0I| = 0. 1 • Determinant of A equals the product of all eigenvalues. • Trace of A equals the sum of all eigenvalues. 1.1.3 Eigenvectors Are In C (A) Something that was not immediately obvious was that each x was in the column space C (A). x comes from null space of A − λI, which doesn’t seem to be related to A - so it sounded surprising to me first. However, recall the definition: Ax = λx. Therefore, λx ∈ C (A) and so is x. 1.2 Diagonalization Everything we want to do with eigenvectors work fine when we have n linearly independent eigenvectors. For now, let us assume we can choose such eigenvectors. Then, we can perform something that is very nice called diagonalization. Let x1 · · · xn be n eigenvectors, each associated with eigenvalues λ1 · · · λn . It is defined: Axi = λi xi for i = 1 · · · n We can write this in matrix form. Let S be a matrix with xi as column vectors. This is a n × n matrix, and the above LHS can be written as: h A x1 x2 ··· x3 i xn = AS Let Λ be a diagonal matrix where Λii = λi , then RHS can be written as: h λ1 x 1 λ2 x 2 λ3 x 3 ··· i λn xn = SΛ Joining the two gives AS = SΛ S is invertible since its columns are independent. Then we can write: A = SΛS −1 which is called the diagonalization of A. 1.2.1 Matrices That Are Not Diagonalizable When is A not diagonalizable? Say we have a repeated eigenvalue, λd which is repeated r times. A − λd will be singular, but its rank might be greater than n − r. 
In such case, rank of the null space of the matrix is less than r, and we won’t be able to choose r vectors that are independent. 1.3 Intuition A n × n matrix can be seen as a function ”actually it’s a linear transformation): it takes a vector, and returns a vector. Eigenvectors can be regarded as axis of this function, where those on this axis are not changed 2 ”at least in terms of directions). If we use eigenvectors as axis, we can represent any vector in C (A) in this coordinate system, by representing it as a linear combination of eigenvectors. After we do this change of coordinates, the transformation is done by multiplying each coordinate with its eigenvalues individually. There are no interactions between each axis anymore. What happens when x ∈ / C (A)? Of course, even if x ∈ / C (A), Ax ∈ C (A). Say x = a + b where a ∈ C (A) and b ∈ / C (A) ⇐⇒ b ∈ N (A). Then Ax = Aa + Ab = Aa. Therefore we can just project x onto C (A) first and proceed as if x was in C (A). So, now we see x as: projecting x onto C (A) and do a change of coordinates, multiply each coordinate, and convert back to the original coordinate system. Note the diagonalization: A = SΛS −1 Multiplying x by A is multiplying 3 matrices - S −1 which maps x onto the eigenvector coordinate system, Λ that multiplies each coordinates ”remember that Λ is diagonal!), and S that maps the result onto the original coordinate system again. 1.3.1 Application: Powers of Matrix Using this diagonalization, powers of matrices can be calculated easily. Let what is A100 ? Since A2 is A2 = SΛS −1 SΛS −1 = SΛ2 S −1 A100 = SΛ100 S −1 . And exponential of a diagonal matrix is really trivial. 2 Symmetric And Positive Definite-ness 2.1 Orthonormal Eigenvectors When A is symmetric, we have special characteristics that make the diagonalization even better. Those are: 1. All eigenvalues are real. 2. We can always choose n orthonormal eigenvectors; not only they are independent, but they can be chosen to be perpendicular to each other! The latter point now means that S T S = I ⇐⇒ S = S −1 . We should actually call this matrix Q now, as in QR decomposition. That gives the following diagonalization: A = QΛQT 2.1.1 Proof ¯ The second The first property is proven by taking complex conjugate of λ and discovering that λ = λ. property is more complex to prove; it comes from Schur’s theorem. Schur’s theorem states that: 3 Every square matrix can be factored into QT Q−1 where T is upper triangular and Q is orthonormal. I think this is very intuitive - it’s an analogue of using QR decomposition to find an orthonormal basis for the column space. Now, R is upper triangular - because first column in A is represented as a multiple of the first column of Q. Second column in A is represented as linear combinations of first two columns in R, etc. Now, we can think of decomposing R into R = T Q−1 where Q−1 maps x onto this coordinate system, and T working in this coordinate system. Then QT Q−1 works just like the diagonalization in 1.3. Now we use induction and prove the first row and column of T are zeros, except for the top left element. If we continue like this, we get a diagonal T which means we will have: A = QΛQ−1 = QΛQT 2.1.2 Sum of Rank One Matrices The symmetric diagonalization can be rewritten as A = QΛQT = λ1 x1 xT1 + λ2 x2 xT2 + · · · Yep, it’s a weighted sum of rank one matrices! Nice! Even better, x1 xT1 are projection matrices! Because kxi k = 1 since orthonormal, the projection matrix formula −1 T A AT A A becomes just xi xTi in this case. 
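A small numpy sketch of this rank-one rebuild (the symmetric matrix is just a random stand-in I generate on the spot): eigh gives orthonormal eigenvectors, and summing the weighted projections recovers A.

import numpy as np

rng = np.random.default_rng(0)
B = rng.standard_normal((4, 4))
A = (B + B.T) / 2                    # an arbitrary symmetric matrix

lam, Q = np.linalg.eigh(A)           # orthonormal eigenvectors in columns of Q

# Rebuild A as a weighted sum of rank-one projection matrices q_i q_i^T.
A_rebuilt = sum(lam[i] * np.outer(Q[:, i], Q[:, i]) for i in range(4))

print(np.allclose(A, A_rebuilt))         # True
print(np.allclose(Q.T @ Q, np.eye(4)))   # the q_i really are orthonormal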
So, we do have a great interpretation of this; multiplication by a symmetric matrix can be done by mapping x onto different, orthogonal 1D spaces. Then, each coordinate will be multiplied by a certain number. Later we can map it back to the original coordinates. This provides another insight regarding what eigenvalues and eigenvectors actually mean. The eigenvector associated with the largest eigenvalue is the axis where multiplying A makes the biggest change! Since truncated SVD is just a generalization of this diagonalization, it is very intuitive how PCA can be done by SVD. 2.2 Positive Definite Matrices 2.2.1 Definition A special subset of symmetric matrices are called positive definite. The following facts are all equivalent; if one holds, everything else will. 1. If we do elimination, all pivots will be positive. 2. All eigenvalues are positive. 4 3. All n upper-left determinants are positive. 4. xT Ax is positive except at x = 0. 2.2.2 Gaussian Elimination is Equivalent to Completing the Square How on earth is property 4 related to other properties? Why are they equivalent? The course proceeds with an example, and states the following without further generalization. P xT Ax is effectively a second order polynomial: ij xi xj Aij . How on earth do we know if this is bounded at 0? We complete the squares! Represent that formula by a sum of squares - and since squares are non- negative we know the polynomial will be positive unless all numbers are 0. The big revealation is that this process is same as Gaussian elimination. Positive coefficients on squares mean positive pivots! Holy crap. Thus, if 1-3 holds, 4 will hold. If 4 holds, the result of the completed squares will show that 1-3 holds. 2.2.3 Relation to Second Order Test in Calculus Recall the awkward second order test in Multivariate Calculus, which was stated without proof? It actually checks if the Hessian is positive definite. Hahaha! 2.2.4 AT A is Always PosDef A special form, AT A, always yields positive definite matrix regarless of the shape of R - if R has independent columns. This is why covariance matrices are always positive! Actually, the converse always holds too: if B is posdef, there always exists an A such that AT A = B. This can be found by the Cholesky decomposition of B! Recall: all covariance matrices are positive definite, and all positive definite matrices are covariance matrices of some distribution . 3 Singular Value Decomposition 3.1 Definition and Intuition SVD is a generalization of the symmetric diagonalization. It is applicable to any matrix ”doesn’t have to be symmetric, or even square). Therefore it is the most useful. How can this be done? Instead of using a single set of orthonormal basis, we use two sets U and V . Then a matrix A can be factored as: A = U ΣV T where U and V are orthonormal. U is a basis of column space of A, V is a basis of row space of A. Σ is a ”sort-of) diagonal matrix with singular values on its diagonals. 3.1.1 Intuition The same intuition from symmetric diagonalization applies here. Given x, we can first map it to the row space. ”Since V is orthonormal, V T = V −1 - so think of it as the mapping to the row space.) Now, x is rep- 5 resented as a linear combination of an orthnormal basis of the row space of A. We now multiply each coordinate by a number, and convert back to original coordinates by U . Since Ax is always in C (A), columns of U are the basis of C (A). Hmm, sounds like magic. I know I can map to a space, and recover original coordinate system from that space. 
But I am mapping x to row space and recovering from column space as if we mapped x to the column space as if we mapped to column space in the first place. WTF? This holds because columns of V are sort-of eigenvectors of A; if we transform vi , we get a multiple of ui . Those are called singular vectors, and the relation goes like this: Avi = σi ui where σi are called singular values. Also, orthogonality gives that all vi and ui s are orthogonal to each other! Woah, there’s too much magic in here.. :) 3.1.2 Analogue to Rank-one Decomposition We can also write: A = U ΣV T = σ1 u1 v1T + σ2 u2 v2T + · · · we can easily see how PCA works with SVD. 3.2 Derivation When m 6= n for A ∈ Rm×n , the dimensions are kind of match up. But I will gloss over the details. ”It’s past midnight) Let U be the eigenvector matrix for AAT , V be the eigenvector matrix for AT A and everything works out. We can choose these matrices to be orthnormal, because AT A and AAT are symmetric. Now we assert we can find Σ such that: A = U ΣV T Multiply both sides by AT on the left: AT A = U ΣV T T U ΣV T = V ΣU T U ΣV T = V Σ2 V T the last equality coming from that we chose Σ to be orthonormal. Yep, RHS is the symmetric decomposition of AT A - I said it works out. So the diagonal entries of Σ will contain square roots of eigenvalues of AT A. Now we can close the loop by proving AV Σ−1 = U First, we prove that columns of AV Σ−1 have unit length: AT AV = V Σ2 ⇐⇒ V T AT AV = V T V Σ2 T ⇐⇒ (AV ) AV = Σ2 6 ”2) The last equality shows that columns of AV are orthogonal, and the ith singular value gives its length squared. Next, we prove AV Σ−1 is indeed an eigenvector of AAT . Multiply ”2) by A on both sides, on the left, and rearrange to get: AAT AV = ΣAV Note that column vectors of AV are actually eigenvectors of matrix AAT . ”Smart parenthesis will let you realize this). 4 That Was It! Wasn’t too hard right? 7 Multivariable Calculus Lecture Notes jongman@gmail.com January 21, 2015 Contents 1 Directional Derivatives 2 2 Lagrange Multipliers 2 3 Non-independent Variables 3 4 Double Integrals 4 5 Polar Coordinates 4 5.1 Moment of Inertia . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 Change Of Variables 4 4 6.1 General Method and Jacobian . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 6.2 Outline of Proof . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 6.3 Revisiting Polar Coordinates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 6.4 Determining Boundaries with Changed Variables . . . . . . . . . . . . . . . . . . . . . . . . . . 5 7 Vector Fields and Line Integrals 5 7.1 Parameterized Representation of Trajectory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 7.2 Geometric Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 7.3 Gradient Field . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 7.3.1 The Fundamental Theorem of Calculus for Line Integrals . . . . . . . . . . . . . . . . . 7 7.3.2 Testing If F Is A Gradient Field . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 7.3.3 Identifying Potentials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7.4 Curl . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 8 8 Green’s Theorem 8.1 Proof . . . . . . . . . . . . . . . . . . 
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 8 9 Flux and Green’s Theorem in Normal Form 9 9.1 Interpretation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 9.2 Green’s Theorem in Normal Form . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 10 Triple Integrals and Spherical Coordinates 9 10.1 Cylinderical Coordinates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 10.2 Spherical Coordinates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 1 11 Vector Fields in Space 10 11.1 Flux in Space . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11.1.1 Examples: Spheres and Cylinders . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 10 11.1.2 General Case . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 11.1.3 Even More General: Parametric Description . . . . . . . . . . . . . . . . . . . . . . . . . 11 11.1.4 When You Only Know The Normal Vector . . . . . . . . . . . . . . . . . . . . . . . . . . . 11.2 Divergence Theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 12 11.2.1 Del Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 11.2.2 Physical Interpretation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 11.2.3 Proof . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 11.2.4 Application: Diffusion Equations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 11.3 Line Integrals In The Space . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11.3.1 Testing If A Vector Field Is A Gradient Field . . . . . . . . . . . . . . . . . . . . . . . . . . 14 14 11.3.2 Using Antiderivatives to Find The Potential . . . . . . . . . . . . . . . . . . . . . . . . . . 14 11.4 Curl in 3D . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11.5 Stokke’s Theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 15 11.5.1 Orientation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 11.5.2 Relationship To Green’s Theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 11.5.3 Outline of Proof . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 11.6 Green and Related Theorems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 11.7 Simply-connectedness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 11.8 Path Independence and Surface Independence 16 . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 Exam Review 17 13 Extra: Matrix Derivatives 17 13.1 Derivative Matrix and Gradient . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 13.2 The Chain Rule for Derivative Matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 13.3 Hessian Matrix 18 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 Directional Derivatives What is the derivative of function f along a vector ~v ? Simply, ∇f · ~v |~v | which is the gradient vector projected onto the unit vector. Makes so much sense. 
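As a sanity check, here is a minimal numpy sketch (the function, point, and direction are arbitrary picks of mine): the finite-difference slope along the unit vector should match the gradient dotted with v/|v|.

import numpy as np

def f(p):
    x, y = p
    return x**2 * y + np.sin(y)

def grad_f(p):
    x, y = p
    return np.array([2 * x * y, x**2 + np.cos(y)])

p = np.array([1.0, 2.0])
v = np.array([3.0, 4.0])
v_hat = v / np.linalg.norm(v)

directional = grad_f(p) @ v_hat                 # gradient projected onto v_hat

h = 1e-6
finite_diff = (f(p + h * v_hat) - f(p)) / h     # slope of f along v_hat
print(directional, finite_diff)                 # agree to about 1e-5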
2 Lagrange Multipliers Instead of solving regular minimization/maximization problems, one can solve such problems when the inputs are constrained , where they are on a curve. Restating, Maximize/minimize f (x, y, z) such that g (x, y, z) = c 2 How do we solve this? Note that when the extreme will happen on one of the boundary, or where the level curve of f is parallel to the curve g = c. Here is a colloquial, informal argument; if they were not parallel, working along the curve in the right direction will increase/decrease the value of f , thus it won’t be a maximum/minimum. Finding such point is easy: the gradient vector of f is perpendicular to the level curve, and the gradient of g is perpendicular to the constraint curve. So we set ∇f = λ∇g which gives us a system of equations. The λ is called the Lagrange Multiplier. Note the Lagrange Multiplier condition being satisfied is only a necessary condition for minimum/maximum points; always check the boundary, as well as minimum/maximum criteria. 3 Non-independent Variables What happens when some of the variables are related in a multivariate function? Let w = x2 + y 2 + z 2 , where z = x2 + y 2 . What is ∂w ∂x ? Turns out this problem is ill-defined. All the materials so far assumed all the variables are independent; however in this problem it is not. The restriction equation is a curve; we are differentiating the value of w along a curve. It depends on which direction you are differentiating – in other words, which direction you are holding constant. A new notation is in order: ∂w ∂x y means differentiating w with respect to x, while holding y constant. Therefore, z will become a dependent variable and change as we go. This has the effect of slicing the constraint surface with a xy plane and differentiating on the resulting curve. Now how do we calculate these? • Total differentials: take total differentials, let the differentials for the fixed variables to be 0. Solve for interested derivative. dw = 2xdx + 2ydy + 2zdz = 2xdx + 2zdz dz = 2xdx + 2ydy = 2xdx Combining, Solving for dw/dx dw = 2xdx + 2 x2 + y 2 2xdx dw = 2x + 4x3 + 4xy 2 dx • Chain rule: 3 ∂w ∂x y ∂y ∂z ∂x +wy +wz = 2x + 4xz = 2x + 4x3 + 4xy 2 = wx ∂x ∂x ∂x | {z } | {z } | {z } 1 0 2x 4 Double Integrals Iterative integral method is introduced. 5 Polar Coordinates Sometimes it is helpful to use polar coordinates, r and θ. When we use this change of variables, we cannot say: dxdy = drdθ Why? We are dividing the region as a set of infinitesimal rectangles, one side of which is dr, but the another side is r · dθ. So we have dxdy = r · drdθ 5.1 Moment of Inertia I= ¨ 2 d (x, y) δ (x, y) dA d being the distance from the axis of rotation, δ being the density. 6 Change Of Variables 6.1 General Method and Jacobian Say we have a function of x and y, and want to integrate it by u (x, y), v (x, y). ”This usually happens when this change simplifies the integrand greatly, or the boundaries of the integration) How do we do this in general? When we transform dxdy into dudv – what is the conversion factor for transformation? It is the absolute value of the Jacobian, which is defined as: ∂ (u, v) ux J= = ∂ (x, y) vx uy vy which is the determinant of the matrix of partial derivatives. We have: ˆ dudv = ˆ |J| dxdy You can also remember it as |J|∂ (x, y) = ∂ (u, v) 4 Note the absolute value: the area of the area element is always positive. Also note ∂ (u, v) ∂ (x, y) · =1 ∂ (x, y) ∂ (u, v) which is useful for easier calculation. 
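A quick numerical sketch of that reciprocal relation (the linear change of variables u = x + y, v = x - y is my own toy example): the two Jacobian determinants multiply to 1, and |J| is the factor by which area elements get scaled.

import numpy as np

# Forward map (x, y) -> (u, v) = (x + y, x - y), and its inverse
# x = (u + v)/2, y = (u - v)/2.
J_uv_xy = np.array([[1.0,  1.0],     # rows: [du/dx, du/dy]
                    [1.0, -1.0]])    #       [dv/dx, dv/dy]
J_xy_uv = np.array([[0.5,  0.5],     # rows: [dx/du, dx/dv]
                    [0.5, -0.5]])    #       [dy/du, dy/dv]

a = np.linalg.det(J_uv_xy)           # d(u,v)/d(x,y) = -2
b = np.linalg.det(J_xy_uv)           # d(x,y)/d(u,v) = -0.5
print(a, b, a * b)                   # -2.0 -0.5 1.0

# du dv = |J| dx dy here, so the unit square in xy-space maps to a
# region of area |-2| = 2 in uv-space.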
6.2 Outline of Proof We take total derivatives of the relationship to get: " ∆u ∆v # = " ux ∆x + uy ∆y vx ∆x + vy ∆y # = " ux vx uy vy #" ∆x ∆y # Now transform(∆x, 0) and (0, ∆y) to uv-coordinate space. We get (ux ∆x, vx ∆y) and (uy ∆y, vy ∆y). Now, take the cross product of these variables which is the area of the translated rectangle: dA′ = dudv = (ux vy − vx uy ) ∆x∆y = Jdxdy 6.3 Revisiting Polar Coordinates Let’s try to revalidate our polar coordinates formula using the Jacobian. We have the following transformation: " x y # = " r cos θ r sin θ # The Jacobian is: x r J= yr So xθ cos θ = yθ sin θ −r sin θ =r r cos θ dxdy = |r| dudv = rdudv which justifies our formula for polar coordinates. 6.4 Determining Boundaries with Changed Variables There are no general methods, a careful case-by-case analysis is needed. One good tip is transform the integrated region into uv-coordinate space ”not always possible, when we are in higher dimension), and consider each edge one by one. 7 Vector Fields and Line Integrals Vector field is defined by a function from R2 → R2 : F~ (x, y) = M (x, y) ˆi + N (x, y) ˆj 5 A line integral integrates a function of the vector field along a trajectory. For example, you can find the work done by the forces in the vector field, to a particle moving along a given trajectory. The work is defined as the dot product of the force and the displacement: W = F~ · ∆r So the quantity we want to calculate is: ˆ F~ (x, y) ∆r C which is very vague. What should we do? 7.1 Parameterized Representation of Trajectory Let ~r be a function of time parameter, t. Then we can slice the trajectory by infinitesimal dt and have: ˆ F~ (x, y) dr = C ˆ dr F~ (x (t) , y (t)) dt dt the latter integral is just a single variable integral with respect to t. Effectively, we are integrating the dot product of the force vector and the velocity vector. The same calculation could be carried out using a different notation: letting dr = hdx, dyi and realizing ~ F = hM, N i gives ˆ M dx + N dy C We cannot integrate this directly; M and N both rely on each of the two parameters. We can get dx and dy by using total differentials and represent them at t. This results in the same integral. 7.2 Geometric Approach Notice: d~r = T~ ds where T~ is the tangent vector along the trajectory, and ds is the length element. Noticing this will let you integrate some of the integrals just intuitively. 7.3 Gradient Field In physics, we see something called gradient field; vector field where F is defined as a gradient of a function. F~ = ∇f The following three statements are equivalent. 1. F is a gradient field. 6 2. Path-dependency: if two paths have coinciding starting and ending points, the work done by the field along the two paths are the same. 3. Conservativeness: the work done to any closed path is 0. The proofs are not too involved. ”Don’t think we covered the proof from 3 to 1 though..) 7.3.1 The Fundamental Theorem of Calculus for Line Integrals States ˆ ∇f · d~r = f (P1 ) − f (P0 ) C where P1 and P0 are ending and starting points of the path. Outline of proof: ˆ ∇f · d~r = C ˆ fx C dy dx + fy dt dt with the second equality coming from the chain rule. dt = ˆ C df dt = dt ˆ df C 7.3.2 Testing If F Is A Gradient Field It is claimed that F~ = M ˆi + N ˆj is a gradient field iff: My = N x and the field is defined & differentiable in the entire plane. 7.3.3 Identifying Potentials Say the above equation holds - can we recover the potential function from the gradients? Two ways are discussed: 1. 
Set up an arbitrary path from (0, 0) to (x1 , y1 ) and do a line integral. The value will be f (x1 , y1 )−f (0, 0) but as we don’t care about constants ”potentials are equivalent up to a constant) we can just calculate f (x1 , y1 ) and take it to the potential function. 2. Use antiderivatives - we know fx = M and fy = N . So calculate ˆ M dx = u (x, y) + g (y) x note g (y) appears in place of the integration constant. We can differentiate this with regard to y and try to match it with N . You generally don’t want to integrate N wrt y and match the two equations when trigs are involved it might not be easy to see the equivalence of two portions of the equations, causing confusion. I guess this kind of serves as a constructive proof for the equivalence in the beginning of 7.3. 7 7.4 Curl The curl is defined as: curlF = Nx − My from the definition, we know the curl is 0 when F is a gradient field. In physics context, the curl measures the angular/rotational velocity/torque. Of course, I don’t care about those applications. 8 Green’s Theorem Green’s Thorem relates seemingly unrelated concepts, the curl and the line integral. If C is a closed curve that encloses the region R in counterclockwise direction, and F~ is defined everywhere, Green’s theorem says: ˛ F~ · d~r = C ˛ M dx + N dy = C ¨ Nx − My dA = R ¨ curlF dA R Note this can be used to prove that F~ is conservative if the curl is 0. 8.1 Proof Here’s an outline of the proof. First, we take the special case N = 0. Why does this suffice? The similar strategy can be used to prove the case for M = 0, and then we have two equalities which can be added to get the desired equality. ˛ ¨ M dx = ˛C −My dA ¨R N dy = Nx dA R C Second, we note that we are able to divide the region freely, calculate the integral separately, and sum them. The line which separates the two regions will get counted twice; but if we keep the counterclockwise direction, the two curves will cancel each other out. So, we say we will slice the region into vertically simple slices, which is defined by a ≤ x ≤ b, f1 (x) ≤ y ≤ f2 (x). Then ˛ M dx = ˆ b M (x, f1 (x)) dx − b M (x, f2 (x)) dx a a C ˆ Note the first equality holds because the line integrals are 0 in the vertical edges of the region, since dx = 0. What about the curl integral? ¨ b −My dA = ˆ b = ˆ f2 (x) − f1 (x) a R ˆ ∂M dydx ∂y M (x, f1 (x)) − M (x, f2 (x)) dx a which is equivalent to the expression for ¸ C M dx above and thus gives the proof. 8 9 Flux and Green’s Theorem in Normal Form Flux along a curve C is defined by ˆ F ·n ˆ ds C where n ˆ is a unit vector normal to C, rotated 90 degrees clockwise from the curve. If you think about it, it is actually rotating all the vectors in the vector field and calculating the work done by the field. Therefore, the Flux becomes, if F = hP, Qi, the rotated field F ′ = h−Q, P i ”note the field is rotated clockwise) ˆ F ·n ˆ ds = C ˆ −Qdx + P dy C and after that it’s just the regular line integral. 9.1 Interpretation If F is a velocity field, the flux measures the amount that passes through the given line, each unit time. 9.2 Green’s Theorem in Normal Form If F = hP, Qi, ˛ F ·n ˆ ds = C ˛ −Qdx + P dy = C ¨ Px + Qy dA = ¨ divF dA we know this is equivalent to the regular Green’s theorem. Also, Px + Qy = divF is called the divergence which measures the expanding-ness of the field. 10 Triple Integrals and Spherical Coordinates This portion of the course deals with integrating in the 3D space. The most obvious way is to use (x, y, z) coordinates – but that can be tricky at times. 
Often, using alternative coordinate system can make things easier. 10.1 Cylinderical Coordinates This is an extension of the polar coordinates, keeping z axis intact. Each point in space is represented by the triplet (r, θ, z). The volume element dV becomes r · dr · dθ · dz. Similarly, dS = rdθdz 10.2 Spherical Coordinates A point is represented by a triplet of coordinates (ρ, φ, θ). ρ represents the distance from the origin (ρ ≥ 0), φ represents the angle from the positive z-axis (0 ≤ φ ≤ π), θ represents the usual angle from the positive x-axis. First, we see z = ρ cos φ. r, the distance from the origin to the projection onto XY -plane, is given by ρ sin φ; therefore we have (x, y, z) = (ρ sin φ cos θ, ρ sin φ sin θ, ρ cos φ) 9 Mostly, they are integrated in the following order: dρdφdθ. What does a volume element dV look like? Note it is approximately a cuboid. The depth of the cuboid is easily found; it is ∆ρ. What are the lengths of the sides? The top-bottom sides ”parallel to the xy plane) are basically arcs of circles of length r = ρ sin φ. Therefore, its length is given by ρ sin φ∆θ. The left-right sides are arcs of circles of length ρ, so they are ρ∆φ. So we have: dV = ρ2 sin φdρdφdθ Similarly, the surface element can be found as dS = ρ2 sin φdφdθ 11 Vector Fields in Space Vector fields in spaces are defined. They are not that different from the 2D cousin. 11.1 Flux in Space The flux is not defined for lines in space; they are defined for surfaces. You should set up a double integral which represents the surface, and take the integral of F · n ˆ dS where F is the vector field, n ˆ is the normal vector to the surface, and dS is the surface area element. Note that there is not a preset way to set the orientation of n ˆ as in 2D cases – you’ll have to set it explicitly. Then Flux = ¨ F~ · ~ndS S If F is a velocity field, the flux of F through the surface S represents the amount of the matter that passes through S per unit time. 11.1.1 Examples: Spheres and Cylinders For a sphere of radius a centered at the origin, we have: 1 n ˆ = ± hx, y, zi a For a cylinder of radius a with its center coinciding with the z axis, we have: 1 n ˆ = ± hx, y, 0i a 11.1.2 General Case When z = f (x, y), you should set up bounds for x and y according to the shadow the surface casts on the xy plane. Also, we have: n ˆ dS = ± h−fx , −fy , 1i dxdy so if F = hP, Q, Si, 10 ¨ F~ · n ˆ dS = ± R ¨ −P fx − Qfy − Rdxdy R What is the proof/intuition behind this? Find a rectangle in the shadow of the curve with two sides ∆x and ∆y: the corresponding part of the curve will roughly look like a parallelogram, if the shadow rectangle is small enough. How do we find the area of a parallelogram? Of course: cross products. Also note the cross product will give us a normal vector; so it gives two in one! Say the lower-left point of the shadow rectangle is (x, y). The corresponding point in the curve will be (x, y, f (x, y)). We now can find two vectors ~u and ~v which coincide with the sides of the parallelogram, and their cross product will be: ~u × ~v = ~ndS Fortunately, they are easy to find. ~u goes from (x, y, f (x, y)) to (x + ∆x, y, f (x + ∆x, y)). Note that f (x + ∆x, y) ≈ f (x, y) + ∆x · fx (x, y) . 
Using a similar process for v⃗, we have:

    u⃗ = ⟨Δx, 0, f_x(x, y) · Δx⟩ = ⟨1, 0, f_x⟩ Δx
    v⃗ = ⟨0, Δy, f_y(x, y) · Δy⟩ = ⟨0, 1, f_y⟩ Δy

Now, writing the cross product as the usual symbolic determinant with rows (î, ĵ, k̂), (1, 0, f_x), (0, 1, f_y):

    u⃗ × v⃗ = (⟨1, 0, f_x⟩ × ⟨0, 1, f_y⟩) Δx Δy = ⟨−f_x, −f_y, 1⟩ dx dy

11.1.3 Even More General: Parametric Description

S: (x, y, z) where each coordinate is a function of two parameters: x = x(u, v), y = y(u, v), z = z(u, v). Now we want to express n̂ dS as (something) du dv. Therefore, we take a position r⃗(u, v) and build a parallelogram with two sides starting at r⃗. Using the same technique as above, we get

    n̂ ΔS = ± (∂r⃗/∂u Δu) × (∂r⃗/∂v Δv) = ± (∂r⃗/∂u × ∂r⃗/∂v) Δu Δv

Note the analogous pattern between this and the formula above.

11.1.4 When You Only Know The Normal Vector

Say you know an equation of a plane; then the normal vector comes for free. Another case: if your surface is given by g(x, y, z) = 0, you have ∇g as a normal vector. You can still use the above methods, but there is a shortcut.

Slanted Plane. If your surface is a slanted plane: find the angle the plane and the xy plane make. Call this α. Then the area element ΔS of the plane is related to the area element ΔA of its shadow by:

    ΔA = cos α · ΔS

which is trivial. However, note that we can find cos α by taking the inner product between the two normal vectors:

    cos α = (N⃗ · k̂) / |N⃗|

So we have

    ΔS = (|N⃗| / (N⃗ · k̂)) ΔA

and

    n̂ ΔS = ± (N⃗ / (N⃗ · k̂)) ΔA = ± (N⃗ / (N⃗ · k̂)) dx dy

g(x, y, z) = 0. Say we have a surface described by z − f(x, y) = 0. Then

    N⃗ = ∇g = ⟨−f_x, −f_y, 1⟩

Apply the above method and it is equivalent to what we saw in 11.1.2.

11.2 Divergence Theorem

Also known as the Gauss–Green theorem, it is the 3D analogue of Green's theorem in normal form. It goes like this: if a closed surface S completely encloses a region D, we orient the normal vectors to point outwards, and F⃗ is a vector field defined and differentiable everywhere in D, then

    ∯_S F⃗ · dS⃗ = ∯_S F⃗ · n̂ dS = ∭_D div F⃗ dV

Now what is the divergence in 3D? Very easy to remember:

    div (P î + Q ĵ + R k̂) = P_x + Q_y + R_z

11.2.1 Del Notation

Before we proceed to the proof, the del notation is introduced:

    ∇ = ⟨∂/∂x, ∂/∂y, ∂/∂z⟩

which is a very informal notation, but we'll just use it for now. So now we can think of the divergence as:

    ∇ · F⃗ = ∂P/∂x + ∂Q/∂y + ∂R/∂z

This is not the same as a gradient; notice the dot there. The gradient is a vector; the divergence is a scalar function.

11.2.2 Physical Interpretation

The divergence of a vector field corresponds to a source rate, which is the amount of flux generated per unit volume. If our vector field is the velocity field of an incompressible fluid, then the surface integral of the flux is the amount of fluid leaving D per unit time.

11.2.3 Proof

We employ the exact same strategy we used for proving Green's theorem. Here's an outline.

• First, assume a vector field with only a z component. (We can later do the same thing for x and y, and sum up the three identities to get the theorem.)
• Assume a vertically simple region – something that lives between two graphs z₁(x, y) and z₂(x, y).
• Now evaluate the integral ∭ R_z dz dx dy and the flux ∯ ⟨0, 0, R⟩ · n̂ dS and hope they meet in the middle.
  – The latter integral can be done by dividing the entire surface into three pieces: the top, the bottom, and the sides. We can expect 0 flux through the sides since our vector field doesn't have any x or y component.
  – The top and bottom integrals can be done by using the formula in 11.1.2. (Caveat: we'll need to take care of the different orientation of the normal vector on top and bottom!)
• Then we can move on to the general case: we can always slice any region into vertically simple pieces, take the flux integral over each, and sum them up; the sum will be the same.

11.2.4 Application: Diffusion Equations

The diffusion equation governs how a new liquid diffuses within an ambient (immobile) liquid: ink in a cup of water, smoke in the air, and so on. We want to find the function u(x, y, z, t) that represents the concentration at a given point and time, from its partial derivatives. We have the equation:

    ∂u/∂t = k ∇·(∇u) = k ∇²u

where ∇²u = ∂²u/∂x² + ∂²u/∂y² + ∂²u/∂z² is the Laplacian, i.e. div(∇u). The same equation can be used to model heat in immobile air.

So how is the vector field defined? Given a concentration, the flow at every point moves in the direction where the concentration decreases the fastest; that is, the negative gradient. In fact:

    F⃗ = −k ∇u

Now, what's the relationship between F⃗ and ∂u/∂t? It's the divergence theorem! Take a small closed region D with surface S. The flux of F⃗ through S is given as:

    ∭_D div F⃗ dV = ∯_S F⃗ · n̂ dS = amount of smoke/liquid leaving through S per unit time = − d/dt ∭_D u dV

The first equality holds by the divergence theorem, and the last equality is justified by our definition above. We can rewrite the RHS as:

    ∭_D div F⃗ dV = − d/dt ∭_D u dV = − ∭_D ∂u/∂t dV

The latter equality holds because the derivative of a sum is the sum of the derivatives. Since this holds for every region D, we have:

    div F⃗ = − ∂u/∂t

11.3 Line Integrals In Space

This doesn't change things much, but checking whether a vector field is a gradient field becomes more complicated. Let

    F⃗ = P î + Q ĵ + R k̂

Then

    Work = ∫_C F⃗ · dr⃗ = ∫_C P dx + Q dy + R dz

Similar to the planar case, we parameterize the curve and express the integral in terms of the parameter. (See above.)

11.3.1 Testing If A Vector Field Is A Gradient Field

Contrary to the 2D case where we only checked M_y = N_x, we now have three different conditions to check. We exploit the fact that second derivatives are the same regardless of the order in which you take them. So we check:

• P_y = Q_x
• P_z = R_x
• Q_z = R_y

The field is a gradient field iff these hold and the field is defined on a simply connected region.

11.3.2 Using Antiderivatives to Find The Potential

Say we want to solve:

• f_x = 2xy
• f_y = x² + z³
• f_z = 3yz² − 4z³

Calculate

    ∫ f_x dx = x²y + g(y, z)

Note the integration constant g(y, z) is a function of y and z only. Then differentiate by y and compare with f_y:

    f_y = x² + z³ = ∂/∂y (x²y + g(y, z)) = x² + g_y

That tells us g_y = z³, so g = yz³ + h(z). Of course, we will find h(z) by comparing the derivative with f_z:

    f_z = 3yz² − 4z³ = 3yz² + h′(z)

which tells us h(z) = −z⁴ + c. Then we have

    f(x, y, z) = x²y + yz³ − z⁴ + c

11.4 Curl in 3D

Curl measures how much a vector field fails to be conservative. It is a vector-valued function:

    curl F⃗ = (R_y − Q_z) î + (P_z − R_x) ĵ + (Q_x − P_y) k̂

For F⃗ defined in a simply connected region, F⃗ is conservative iff its curl is 0. How to remember this: write it as the symbolic determinant with rows (î, ĵ, k̂), (∂/∂x, ∂/∂y, ∂/∂z), (P, Q, R), i.e.

    ∇ × F⃗ = curl F⃗

Geometrically, the curl measures the rotation component in a velocity field. Its direction points along the axis of rotation, and its magnitude gives twice the angular velocity.

11.5 Stokes' Theorem

If C is a closed curve and S is any surface bounded by C,

    ∮_C F⃗ · dr⃗ = ∬_S (∇ × F⃗) · n̂ dS = ∬_S curl F⃗ · n̂ dS

Any surface, for real?

11.5.1 Orientation

The orientations of C and S need to be compatible. What does that mean? Right-hand rule: if we walk along C with S on our left side, the normal vector should be pointing up for us.
Right hand rule: point thumb along C in the positive direction. Point index finger towards interior of S, tangent to S. Middle finger points n ˆ. 15 11.5.2 Relationship To Green’s Theorem When we limit a surface and the curve to a 2D plane, we get Green’s theorem. 11.5.3 Outline of Proof • By Green’s theorem, we know it holds for C and S in x, y plane. • By swapping coordinates accordingly, we can say the same for any coordinate plane; yz and xz planes. • Actually, it holds for any right-handed coordinates; whatever rotation you use, it will hold by Green’ theorem. ”Using that work, flux, curl make sense independently of coordinates.) • Given any S, we can decompose it into tiny, almost flat pieces. I guess we know where we are going! • So using the same cancelling argument, the only line integrals that are calculated are the ones along the edges. • In each of the small surface, since it’s almost flat , Green’s theorem works and the line integral equals the flux. • Now: the sum of flux = the sum of line integrals = line integral along C. 11.6 Green and Related Theorems Thesis 2 dimensions Total work ”line integral) equals total curl/flux of curl ”surface integral) Green’s theorem Flux ”line/surface integral) equals total divergence ”surface/volume integral) Green’s theorem in normal form Note that for 2D ˆ flux of curl equals total z component of 3D curl, which is • since the normal vector of 2D xy plane is k, the total 2D curl. • the surface and the volume are the same things so the Green’s theorem in normal form works in surface form. The direction conventions are set that the theorems will be compatible between 2D and 3D. 11.7 Simply-connectedness A region is simply-connected if every closed loop inside it bound at least one surface inside the region. ”The surface can be arbitrary.) Simply connectedness is critical to applying Green’s or Stoke’s theorem. 11.8 Path Independence and Surface Independence Green’s theorem gives you path independence when F~ is a gradient vector; the line integral will be the same whatever line you choose, when the endpoints are the same. Now note that Stoke’s theorem does somesthing similar; we are getting surface independence , where the surface integrals are the same whenever the edges are the same. 16 S Di In the 2D case, the gradient field property gives this independence; what is its 3D equivalent that yields surface independence? Take two surfaces S1 and S2 bounded by a curve C. What are the difference between two surface integrals? Merge the two surfaces S1 and S2 to get a surface S which is closed. We can calculate the flux across that surface by divergence theorem! Jeez, my head is spinning. ˚ D div ∇ × F~ dV If this divergence integral is always 0, this gives you the surface independence. We can check: ∇ × F~ = hRy − Qz , Pz − Rx , Qx − Py i The divergence of this is div ∇ × F~ = (Ry − Qz )x + (Pz − Rx )y + (Qx − Py )z = Rxy − Qxz + Pyz − Rxy + Qxz − Pyz =0 Yes, the divergence of a curl is always zero! 12 Exam Review We did triple, double, and single integrals in space: 13 Extra: Matrix Derivatives This summarizes the treatment of multivariate calculus through gradient and Jacobian. Although they are mostly technical matter, these have been important in many cases. 13.1 Derivative Matrix and Gradient If f : Rn → Rm , we have f1 f2 f (x1 , x2 , · · · , xn ) = f3 . . . fm the derivative matrix of f , Df is a m × n matrix consisting of first-order partial derivatives: ∂f1 ∂x1 ∂f2 ∂x1 ∂f1 ∂x2 ∂f2 ∂x2 ∂f1 ∂x3 ∂f2 ∂x3 ··· .. . ··· .. . 
∂fm ∂x1 ∂fm ∂x2 ∂fm ∂x3 ··· . . . .. . 17 ∂f1 ∂xn ∂f2 ∂xn .. . ∂fm ∂xn ∂fi . Note Dfij = ∂x j When m = 1, which means a f is a real-valued function, we define ∇f , the gradient, to be a column vector consisting of partial derivatives: ∂f 1 ∂x ∂f ∂x 2 ∂f ∂x3 ∂f ∂f ∂f ∂f ∇f = , , ,··· , = . ∂x1 ∂x2 ∂x3 ∂xn . . ∂f ∂xn The latter round bracket notation is introduced for brevity on paper. Note Df is a row vector in this case; Df = ∇f T . 13.2 The Chain Rule for Derivative Matrices We have D (f ◦ g) (x) = Df (g (x)) Dg (x) where the multiplication is a matrix multiplication. Note the order is important! If we change the order it might not work. 13.3 Hessian Matrix When f is real-valued, we can define the Hessian matrix to be a n × n matrix with second-order partial derivatives: ∂2f 2 ∂∂x2 f1 ∂x2 ∂x1 H (f ) = .. . ∂2f ∂xn ∂x1 ∂2f ∂x1 ∂x2 ∂2f ∂x2 ∂x2 ··· .. . ··· .. . ∂2f ∂xn ∂x2 ··· ∂2f ∂x1 ∂xn ∂2f ∂x2 ∂xn .. . ∂2f ∂xn ∂xn Note this is a symmetric matrix. Note that D∇f = ∂f 1 ∂x ∂f ∂x 2 ∂f D ∂x3 2 ∂ f 2 ∂∂x2 f1 ∂x2 ∂x1 = . ... . . ∂2f ∂f ∂xn ∂xn ∂x1 ∂2f ∂x1 ∂x2 ∂2f ∂x2 ∂x2 ··· .. . ··· .. . ∂2f ∂xn ∂x2 ··· 18 ∂2f ∂x1 ∂xn ∂2f ∂x2 ∂xn .. . ∂2f ∂xn ∂xn = H(f ) Convex Optimization Lecture Notes ”Incomplete) jongman@gmail.com January 21, 2015 Contents 1 Introduction 5 2 Convex Sets 5 2.1 Types of Sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.1.1 Affine Sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 5 2.1.2 Convex Sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 2.1.3 Cones . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.2 Important Sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 5 2.2.1 Norm Balls / Norm Cones . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 2.2.2 Positive Semidefinite Cone . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.3 Operations that Preserve Convexity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 6 2.3.1 Linear-fractional and Perspective Functions . . . . . . . . . . . . . . . . . . . . . . . . . 6 2.4 Generalized Inequalities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.4.1 Positive Semidefinite Cone . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.4.2 Minimum and minimal Elements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 7 7 2.5 Supporting Hyperplane Theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 2.6 Dual Cones . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.6.1 Dual Generalized Inequality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 7 2.6.2 Dual Generalized Inequality and Minimum/Minimal Elements . . . . . . . . . . . . . . 7 3 Convex Functions 8 3.1 Basic Properties and Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.1.1 Extended-value Extensions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 8 3.1.2 Equivalent Conditions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 3.1.3 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 3.1.4 Epigraph . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
. . . . . . . . 9 3.1.5 Jensen’s Inequality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.2 Operations That Preserve Convexity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 9 3.2.1 Composition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 3.3 The Conjugate Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 3.4 Quasiconvex Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.4.1 Properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 11 3.4.2 Operations That Preserve Quasiconvexity . . . . . . . . . . . . . . . . . . . . . . . . . . 11 3.5 Log-concavity and Log-convexity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 1 4 Convex Optimization Problems 11 4.1 Optimization Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.1.1 Feasibility Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 11 4.1.2 Transformations and Equivalent Problems . . . . . . . . . . . . . . . . . . . . . . . . . . 4.2 Convex Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 12 4.2.1 Optimality Conditions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 4.2.2 Equivalent Convex Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 4.3 Quasiconvex Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 4.3.1 Solving Quasiconvex Optimization Problems . . . . . . . . . . . . . . . . . . . . . . . . . 13 4.3.2 Example: Convex over Concave 14 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.4 Linear Optimization Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 4.4.1 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 4.4.2 Linear-fractional Programming . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 4.5 Quadratic Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 4.5.1 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 4.5.2 Second Order Cone Programming . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 4.5.3 SOCP: Robust Linear Programming . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 4.5.4 SOCP: Stochastic Inequality Constraints . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.6 Geometric Programming . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 16 4.6.1 Monomials and Posynomials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 4.6.2 Geometric Programming . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 4.6.3 Convex Transformation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 4.7 Generalized Inequality Constraints . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.7.1 Conic Form Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 17 4.7.2 SDP: Semidefinite Programming . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 4.8 Vector Optimization . . . . . . . . . . . . . . . . . . . . . 
. . . . . . . . . . . . . . . . . . . . . . . 4.8.1 Optimal Values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.8.2 Pareto Optimality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 18 18 4.8.3 Scalarization for Pareto Optimality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 4.8.4 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 5 Duality 5.1 The Lagrangian And The Dual . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 19 5.1.1 The Lagrangian . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 5.1.2 The Lagrangian Dual Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 5.1.3 Lagrangian Dual and Lower Bounds . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 5.1.4 Intuitions Behind Lagrangian Dual . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 5.1.5 LP example and Finite Dual Conditions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 5.1.6 Conjugate Functions and Lagrange Dual . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 5.2 The Lagrange Dual Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5.2.1 Duality Gap . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 21 5.2.2 Strong Duality And Slater’s Constraint Qualification . . . . . . . . . . . . . . . . . . . . 21 5.2.3 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5.3 Geometric Interpretation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5.3.1 Strong Duality of Convex Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 21 21 5.4 Optimality Conditions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 2 5.4.1 Certificate of Suboptimality and Stopping Criteria . . . . . . . . . . . . . . . . . . . . . . 22 5.4.2 Complementary Slackness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5.4.3 KKT Optimality Conditions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 23 5.5 Solving The Primal Problem via The Dual . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5.6 Sensitivity Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23 23 5.6.1 Global Sensitivity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5.6.2 Local Sensitivity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 24 5.7 Examples and Reformulating Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 5.7.1 Introducing variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5.7.2 Making explicit constraints implicit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5.7.3 Transforming the objective . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25 25 25 5.8 Generalized Inequailities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25 6 Approximation and Fitting 6.1 Norm Approximation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25 25 6.1.1 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . 
6.1.2 Different Penalty Functions and Their Consequences 6.1.3 Outliers and Robustness . . . . . . . . . . . . . . . . . 6.1.4 Least-norm Problems . . . . . . . . . . . . . . . . . . . 6.2 Regularized Approximations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25 26 26 27 27 6.2.1 Bi-criterion Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6.2.2 Regularization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6.2.3 ℓ1 Regularization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 28 28 6.2.4 Signal Reconstruction Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6.3 Robust Approximation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 29 6.3.1 Stochastic Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6.3.2 Worst-case Robust Approximation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6.4 Function Fitting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29 30 30 6.4.1 Constraints . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6.4.2 Sparse Descriptions and Basis Pursuit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6.4.3 Checking Model Consistency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30 30 31 7 Statistical Estimation 7.1 Parametric Distribution Estimation . . . 7.1.1 Logistic Regression Example . . . 7.1.2 MAP Estimation . . . . . . . . . . . 7.2 Nonparameteric Distribution Estimation 31 . . . . . . . . . . . . . . . . . . . . . . . . 31 32 32 32 7.2.1 Priors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7.2.2 Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 33 7.3 Optimal Detector Design And Hypothesis Testing 7.4 Chebyshev and Chernoff Bounds . . . . . . . . . . 7.4.1 Chebyshev Bounds . . . . . . . . . . . . . . 7.4.2 Chernoff Bounds . . . . . . . . . . . . . . . 7.5 Experiment Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 33 33 34 34 7.5.1 Further Modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 3 . . . . . . . . . . . . . . 8 Geometric Problems 34 8.1 Point-to-Set Distance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8.1.1 PCA Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 35 8.2 Distance between Sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8.3 Euclidean Distance and Angle Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35 35 8.3.1 Expressing Constraints in Terms of G . . . . . . . . . . . . . . . . . . . . . . . . . . 
. . . 8.3.2 Well-Condition Constraints . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8.3.3 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35 36 36 8.4 Extremal Volume Ellipsoids . . . . . . . . . . 8.4.1 Lowner-John Ellipsoid . . . . . . . . . 8.4.2 Maximum Volume Inscribed Ellipsoid 8.4.3 Affine Invariance . . . . . . . . . . . . 8.5 Centering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36 36 36 37 37 8.5.1 Chebychev Center . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8.5.2 Maximum Volume Ellipsoid Center . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8.5.3 Analytic Center . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37 37 38 8.6 Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38 8.6.1 Linear Discrimination . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8.6.2 Robust Linear Discrimination . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8.6.3 Nonlinear Discrimination . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38 38 39 8.7 Placement and Location . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8.8 Floor Planning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39 39 9 Numerical Linear Algebra Background 40 10 Unconstrained Optimization 40 10.1 Unconstrained Minimization Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40 10.1.1 Strong Convexity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10.1.2 Conditional Number of Sublevel Sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40 41 10.2 Descent Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10.3 Gradient Descent . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10.3.1 Performance Analysis on Toy Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41 41 42 10.4 Steepest Descent . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42 10.4.1 Steepest Descent With an ℓ1 Norm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10.4.2 Performance and Choice of Norm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42 42 10.5 Newton’s Method . . . . . . . . 10.5.1 The Newton Decrement 10.5.2 Newton’s Method . . . . 10.5.3 Convergence Analysis . 10.5.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42 43 43 43 44 10.6 Self-Concordant Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44 4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 Introduction 2 Convex Sets This chapter introduces numerous convex sets. 
2.1 Types of Sets

2.1.1 Affine Sets

A set is affine when it contains every line that passes through any two points belonging to the set. Examples:

• Hyperplanes
• Solution sets of systems of linear equations.

An affine combination is defined to be a linear combination whose coefficients sum to 1. An affine set is closed under affine combinations.

2.1.2 Convex Sets

A set is convex when it contains the line segment connecting any two points belonging to the set. A convex combination is defined to be an affine combination with nonnegative coefficients. A convex set is closed under convex combinations. All affine sets are convex.

2.1.3 Cones

A set is a cone when any nonnegative scalar multiple of any element belongs to the set. A cone feels like an ice-cream cone: it starts at the origin and gets wider going away from the origin. It could be analogous to a pie slice. Given a point in a cone, the cone contains the ray (half-line) originating from the origin and going through that point. A conic combination is a linear combination with nonnegative coefficients. A convex cone is closed under conic combinations.

2.2 Important Sets

• Hyperplanes are affine.
• Halfspaces are convex.
• Polyhedra are intersections of halfspaces, so they are convex.
• Euclidean balls are defined by the ℓ2 norm; they are convex. Ellipsoids are related to balls through a positive definite matrix P.

2.2.1 Norm Balls / Norm Cones

Given a norm, a norm ball replaces the ℓ2 norm used for Euclidean balls with that norm. A norm cone is a different beast; it is the set

    C = {(x, t) | ‖x‖ ≤ t, t ≥ 0} ⊆ R^{n+1}

where x ∈ R^n. The norm cone with the ℓ2 norm is called the second-order cone.

2.2.2 Positive Semidefinite Cone

The following notations are used:

• S^n is the set of all symmetric n × n matrices.
• S^n_+ is the set of all positive semidefinite n × n matrices.
• S^n_{++} is the set of all positive definite n × n matrices.

Because

    θ₁, θ₂ ≥ 0, A, B ∈ S^n_+ =⇒ θ₁A + θ₂B ∈ S^n_+

the set is a convex cone.

2.3 Operations that Preserve Convexity

Some operations preserve the convexity of a set:

• Intersections
• Taking the image under an affine function

2.3.1 Linear-fractional and Perspective Functions

The perspective function P(z, t) = z/t preserves convexity. (The domain requires t > 0.) Similarly, a linear-fractional function

    f(x) = (Ax + b) / (cᵀx + d)

which is formed by composing the perspective function with an affine function, preserves convexity as well.

2.4 Generalized Inequalities

A cone can be used to define inequalities, if it meets certain criteria. It goes like:

    x ⪯_K y ⟺ y − x ∈ K

where K is a cone. It certainly makes sense: since K is a convex cone (closed under sums), we get transitivity; if K is pointed, we get antisymmetry; since K contains the origin, we get reflexivity. This is a useful concept that will be exploited a lot later in the course.

2.4.1 Positive Semidefinite Cone

Actually, using the positive semidefinite cone to compare matrices is such standard practice that for the rest of the book, matrix inequalities are automatically taken with respect to the PSD cone.

2.4.2 Minimum and Minimal Elements

Generalized inequalities do not always give you a single element that is the minimum; we sometimes get a set of minimal elements – elements that no other element is smaller than – which are incomparable to each other.

2.5 Supporting Hyperplane Theorem

This has not proved to be useful in the course yet; when it is needed I will come back and fill it in...

2.6 Dual Cones

Let K be a cone. Then the dual cone is

    K* = {y | xᵀy ≥ 0 for all x ∈ K}

The idea of it is hard to express intuitively.
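A small numerical illustration may help (a minimal sketch assuming numpy, which these notes don't otherwise use; the particular cone is just an example I picked). Take the narrow 2-D cone K generated by (1, 0) and (1, 1); a direction y lies in K* exactly when it has nonnegative inner product with both generators.

    import numpy as np

    # K = cone{(1, 0), (1, 1)}: a "sharp" 45-degree wedge in the plane.
    # y is in the dual cone K* iff y has nonnegative inner product with every
    # generator (nonnegative combinations cannot flip the sign of the product).
    G = np.array([[1.0, 0.0],
                  [1.0, 1.0]])

    angles = np.linspace(0.0, 2.0 * np.pi, 100000, endpoint=False)
    dirs = np.stack([np.cos(angles), np.sin(angles)], axis=1)   # unit directions

    in_dual = np.all(dirs @ G.T >= 0, axis=1)
    print("angular width of K*: %.1f degrees" % (360.0 * in_dual.mean()))  # ~135.0

The 45-degree cone has a 135-degree dual, which is exactly the sharp-versus-obtuse behavior described next.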
When the cone is sharp, the dual cone will be obstuse, and vice versa. Also, when K is a proper cone, K∗∗ = K. 2.6.1 Dual Generalized Inequality The relationship K∗ is a generalized inequality induced by the dual cone K∗. In some ways it can relate to the original inequality K . Most notably, x K y ⇐⇒ y − x ∈ K ⇐⇒ λT (y − x) ≥ 0 ⇐⇒ λT x ≤ λT y where λ is any element of K∗. The takeaway of this is that we can use dual cones to compare values with respect to the original cone. See below for how this is used. 2.6.2 Dual Generalized Inequality and Minimum/Minimal Elements Minimum Element From above, we can see x ∈ S will be a minimum element with respect to K, when it is a unique minimizer λT x for all λ ∈ K∗. Geometrically, when x is a minimum element, the hyperplane passing through x with λ as a normal vector z|λT (x − z) = 0 is a strict supporting hyperplane; it touches S only at x. To see this: say x 6= y ∈ S. Since λT x < λT y by our assumption, we have λT (y − z) > 0. Minimal Elements Here, we have a gap between necessary and sufficient conditions. Necessity If x is a minimizer of λT x for all λ ∈ K∗, x is a minimal element. Sufficiency Even if x is a minimal element, it is possible that x is not a minimizer of λT x for some λ ∈ K∗. 7 However, if S is convex, the two conditions are indeed equivalent. 3 Convex Functions 3.1 Basic Properties and Definitions A function f is convex if the domain is convex, and for any x, y ∈ domf and θ ∈ [0, 1] we have θ · f (x) + (1 − θ) · f (y) ≥ f (θx + (1 − θ) y) Colloquially, you can say the chord is above the curve . 3.1.1 Extended-value Extensions We can augment a function f : f˜ (x) = f (x) ∞ x ∈ domf otherwise which is called the extended-value extension of f . Care must be taken in ensuring the extension is still convex/concave: the extension can break such properties. 3.1.2 Equivalent Conditions The following conditions, paired with the domain being convex, is equivalent to convexity. First Order Condition T f (y) ≥ f (x) + ∇f (x) (y − x) Geometrically, you can explain this as the tangential hyperplane which meets a convex function at a certain point, it will be a global lower bound in the entire domain. It has outstanding consequences; by examining a single point, we get information about the entire function. Second Order Condition ∇2 f (x) 0 which says the Hessian is positive semidefinite. 3.1.3 Examples • ln x is concave on R++ • eax is convex on R • xa is convex on R++ if a ≥ 1, concave 0 ≤ a ≤ 1 • |x| is convex on R a • Negative entropy: x log x is convex on R++ 8 • Norms: every norm is convex on Rn • Max functions: max {x1 , x2, x3 , · · · } is convex in Rn • Quadratic-over-linear: x2 /y is convex in {y > 0} • Log-sum-exp: as a soft approximation of the max function, f (x) = log ( • Geometric mean: ( Qn i=1 xi ) 1/n is concave on Rn++ P exi ) is convex on Rn . • Log-determinant: log det X is convex on positive definite X. 3.1.4 Epigraph An epigraph for a function f : Rn → R is a set of points above the graph. Naturally, it’s a subset of Rn+1 and sometimes is an useful way to think about convex functions. A function is convex iff its epigraph is convex! 3.1.5 Jensen’s Inequality The definition of convex function extends to multiple, even infinite, sums: f as long as 3.2 P X X xi θi ≤ f (xi ) θi θ = 1. This can be used to prove a swath of other inequalities. Operations That Preserve Convexity • Nonnegative weighted sums • Composition with an affine mapping: if f is convex/concave, f (Ax + b) also is. 
• Pointwise maximum/supremum: f (x) = maxi fi (x) is convex if all fi are convex. This can be used to prove that: – Sum of k largest elements is convex: it’s the max of • Max eigenvalue of symmetric matrix is convex. n k combinations. • Minimization: g (x) = inf f (x, y) y∈C is convex if f is convex. You can prove this by using epigraphs – you are slicing the epigraph of g. • Perspective function 9 3.2.1 Composition Setup: h: Rk → R and g: Rn → Rk . The composition f (x) = h (g (x)). Then: • f is convex: – if h is convex and nondecreasing, and g is convex, or – if h is convex and nonincreasing, and g is concave • f is concave: – if h is concave and nondecreasing, and g is concave, or – if h is concave and nonincreasing, and g is convex To summarize: if h is nondecreasing, and h and g has same curvature ”convexity/concavity), f follows. If h is nonincreasing, and h and g have differing curvature, f follows h. When determining nonincreasing/nondecreasing property of h, use its extended-value extension. Since extended-value extension can break nonincreasing/nondecreasing properties, you might want to come up with alternative definitions which are defined everywhere. Composition Examples • exp g (x) is convex if g is. • log g (x) is concave if g is concave and positive. • 1 g(x) is convex if g is positive and concave. • g (x) is convex if p ≥ 1 and g is convex and nonnegative. p Vector Composition If f (x) = h (g1 (x) , g2 (x) , g3 (x) , · · · ) the above rules still hold, except: • All gs need to have the same curvature. • h need to be nonincreasing/nondecreasing with respect to every input. 3.3 The Conjugate Function Given an f : Rn → R, the function f ∗: Rn → R is the conjugate: f ∗ (y) = sup x∈domf y T x − f (x) It is closely related to Lagrange Duals. I will omit further materials on this. 10 3.4 Quasiconvex Functions Quasiconvex functions are defined by having all its sublevel sets convex. Colloquially, they are unimodal functions. Quasiconvex functions are solved by bisection method + solving feasibility problems. The linear-fractional function f (x) = aT x + b cT x + b is quasiconvex. ”Can prove: let f (x) = α and try to come up with the definition of the sublevel set.) Remember how we solved this by bisection methods? :-) 3.4.1 Properties • First-order condition: f (y) ≤ f (x) =⇒ ∇f (x) (y − x) ≤ 0. Note: the gradient defines a supporting T hyperplane. Since f is quasiconvex, all points y with f (y) ≤ f (x) must lie in one side of the hyperplane. • Second-order condition: f ′′ (x) ≥ 0 3.4.2 Operations That Preserve Quasiconvexity • Nonnegative weighted sum • Composition: if h is nondecreasing and g is quasiconvex, then h (g (x)) is quasiconvex. • Composition with affine/linear fractional transformation. • Minimization along an axis: g (x) = miny f (x, y) is quasiconvex if f is. 3.5 Log-concavity and Log-convexity The function is log-concave or log-convex if its log is concave/convex. • The pdf of a Gaussian distribution is log-concave. • The gamma function is log-concave. 4 Convex Optimization Problems 4.1 Optimization Problems A typical optimization problem formulation looks like: minimize f0 (x) subject to fi (x) ≤ 0 (i = 1, · · · , m) hi (x) = 0 (i = 1, · · · , p) 4.1.1 Feasibility Problem There are cases where you want to find a single x which satisfies all equality and inequality constraints. These are feasibility problems. 11 4.1.2 Transformations and Equivalent Problems Each problem can have multiple representations which are same in nature but expressed differently. 
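As a tiny concrete illustration of two equivalent formulations (a sketch assuming numpy, which is not used in the original notes): minimizing ‖Ax − b‖₂ and minimizing ‖Ax − b‖₂² are different objective functions related by a monotone transformation, so they share the same minimizer even though the optimal values differ.

    import numpy as np

    # The squared and un-squared least-squares objectives have the same argmin,
    # because squaring is monotone on nonnegative values.  Data is made up.
    rng = np.random.default_rng(0)
    A = rng.standard_normal((20, 3))
    b = rng.standard_normal(20)

    x_star = np.linalg.lstsq(A, b, rcond=None)[0]    # minimizer of ||Ax - b||^2
    f = lambda x: np.linalg.norm(A @ x - b)          # un-squared objective

    # x_star also beats a cloud of random perturbations on the un-squared objective.
    trials = x_star + 0.1 * rng.standard_normal((1000, 3))
    print(all(f(x_star) <= f(x) for x in trials))    # True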
Different expression can have different properties. • Nonzero equivalence constraints: move everything to LHS. • Minimization/maximization: flip signs. • Transformation of objective/constraint function: transforming objective functions through monotonic functions can yield an equivalent problem. • Slack variables: f (x) ≤ 0 is swapped out by f (x) + s = 0 and s ≥ 0. • Swapping implicit/explicit constraint: move an implicit constraint to explicit, by using extended value extension. 4.2 Convex Optimization A convex optimization problem looks just like the above definition, but have a few differences: • All f s are convex. • All gs are affine. Thus, the set of equivalence constraints can be expressed by aTi x = bi thus Ax = b Note all convex optimization problems do not have any locally optimal points; all optimals are global. Also, another important property arises from them: the feasible set of a convex optimization problem is convex, because it is an intersection of convex and affine sets – a sublevel set of a convex function is convex, and the feasible set is an intersection of sublevel sets and an affine set. Another important thing to note is that convexity is a function of the problem description; different formulations can make a non-convex problem convex and vice versa. It’s one of the most important points of the course. 4.2.1 Optimality Conditions If the objective function f0 is differentiable, x is the unique optimal solution if for all feasible y, we have T ∇f0 (x) (y − x) ≥ 0 this comes trivially from the first order condition of a quasiconvex function ”a convex function is always quasiconvex). Optimality conditions for some special ”mostly trivial) cases of problems are discussed: • Unconstrained problem: Set gradient to 0. • Only equality constraint: We can derive the below from the general optimality condition described above. ∇f0 (x) + AT ν = 0 (ν ∈ Rp ) Here’s a brief outline: for any feasible y, we need to have ∇f0 (x) (y − x) ≥ 0. Note that y − x ∈ N (A) ⊥ ”the null space) because Ax = b = Ay =⇒ A (x − y) = 0. Now, this means ∇f0 (x) ∈ N (A) = R AT , T the last term being the column space of AT . Now, we can let 12 ∇f0 (x) = AT (−ν) for some ν and we are done. • Minimize over nonnegative orthant: for each i, we need to have ∇f (x) = 0 0 i ∇f (x) ≥ 0 0 i if xi > 0 if xi = 0 This is both intuitive, and relates to a concept ”KKT conditions) which is discussed later. If xi is at the boundary, it can have some positive gradient along that axis; we can still be optimal because decreasing xi will make it infeasible. Otherwise, we need to have zero gradients. 4.2.2 Equivalent Convex Problems Just like what we discussed for general optimization problems, but specially for convex problems. • Eliminating affine constraint: Ax = b is equivalent to x ∈ F z + x0 where the column space of F is N (A) and x0 is a particular solution of Ax = b. So we can just put f0 (F z + x0 ) in place of f0 (x). • Uneliminating affine constraint: going in the opposite direction; if we are dealing with f0 (Ai x + bi ), let Ai x + bi = yi . – On eliminating/uneliminating affine constraint: on a naive view, eliminating affine constraint always seem like a good idea. However, it isn’t so; it is usually better to keep the affine constraint, and only do the elimination if it is immediately computationally advantageous. ”This will be discussed later in the course.. but I don’t remember where) • Slack variables • Epigraph form: minimize t subject to f0 (x) − t ≤ 0 is effectively minimizing f0 (x). 
This seems stupid, but this gives us a convenient framework, because we can make objectives linear. 4.3 Quasiconvex Optimization When the objective function is quasiconvex, it is called a quasiconvex optimization problem. The biggest difference is that we will now have local optimal points; quasiconvex functions are allowed to have flat portions which give rise to local optimal points. 4.3.1 Solving Quasiconvex Optimization Problems Quasiconvex optimization problems are solved by bisection methods; at each iteration we ask if the sublevel set empty for given threshold. We can solve this by a convex feasibility problem. 13 4.3.2 Example: Convex over Concave Say p (x) is convex, q (x) is concave. Then f (x) = p (x) /q (x) is quasiconvex! How do you know? Consider the sublevel set: {x : f (x) ≤ t} = {x : p (x) /q (x) ≤ t} = {x : p (x) − t · q (x) ≤ 0} and p (x) − t · q (x) is convex! So the sublevel sets are convex. 4.4 Linear Optimization Problems In an LP problem, objectives and constraints are all affine functions. LP algorithms are very, very advanced and all these problems are readily solvable in today’s computers. It is a very mature technology. 4.4.1 Examples • Chebyshev Center of a Polyhedron: note that the ball lying inside a halfplane aTi x ≤ bi can be represented as kuk2 ≤ r =⇒ aTi (xc + u) ≤ bi Since sup aTi u = r kai k2 kuk2 ≤r we can rewrite the constraint as aTi xc + r kai k2 ≤ bi which is a linear constraint on xc and r. Therefore, having this inequality constraint for all sides of the polyhedron gives a LP problem. • Piecewise-linear minimization. Minimize: max aTi x + bi i This is equivalent to LP: minimize t subject to aTi x + bi ≤ t! This can be a quick, dirty, cheap way to solve convex optimization problems. 4.4.2 Linear-fractional Programming If the objective is linear-fractional, while the constraints are affine, it becomes a LFP problem. This is a quasiconvex problem, but it can also be translated into a LP problem. I will skip the formulation here. 14 4.5 Quadratic Optimization QP is a special kind of convex optimization where the objective is a convex quadratic function, and the constraint functions are affine. 1 minimize xT P x + q T x + r 2 subject to Gx h Ax = b and P ∈ Sn+ . When the inequality constraint is quadratic as well, it becomes a QCQP ”Quadratically Con- strainted Quadratic Programming) problem. 4.5.1 Examples • Least squares: needs no more introduction. When linear inequality constraints are added, it is no longer analytically solvable, but still is very tractable. • Isotonic regression: we add the following constraint to a least squares algorithm: x1 ≤ x2 ≤ · · · ≤ xn . This is still very easy in QP! • Distance between polyhedra: Minimizing Euclidean distance is a QP problem, and the constraints ”two polyhedras) are convex. • Classic Markowitz Portfolio Optimization – Given an expected return vector p¯ and the covariance matrix Σ, find the minimum variance portfolio with expected return greater than, or equal to, rmin . This is trivially representable in QP. – Many extensions are possible; allow short positions, transaction costs, etc. 4.5.2 Second Order Cone Programming SOCP is closely related to QP. It has a linear objective, but a second-order cone inequality constraint: minimize f T x subject to |Ai x + bi |2 ≤ cTi x + di Fx = g The inequality constraint forces the tuple Ai x + bi , cTi + di to lie in the second-order cone in Rn+1 . When ci = 0 for all i, we can make them regular quadratic constraints and this becomes a QCQP. 
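To make the standard form concrete, here is a minimal sketch (assuming the cvxpy package, which these notes don't otherwise use, and made-up data): minimize a linear objective over a Euclidean ball, i.e. a single second-order cone constraint with A = I and c = 0 – the QCQP special case just mentioned.

    import numpy as np
    import cvxpy as cp

    # minimize f^T x  subject to  ||x - x0||_2 <= r   (one second-order cone constraint)
    f = np.array([3.0, -4.0])
    x0 = np.array([1.0, 2.0])
    r = 2.0

    x = cp.Variable(2)
    prob = cp.Problem(cp.Minimize(f @ x), [cp.norm(x - x0, 2) <= r])
    prob.solve()

    # The optimum walks from the ball's center straight against f for a distance r.
    print(x.value)                           # approximately [-0.2, 3.6]
    print(x0 - r * f / np.linalg.norm(f))    # analytic answer, same point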
So basically, using a second-order cone instead of a ”possibly open) polyhedra in LP. t. Note that the linear objective does not make SOCP weaker than QCQP. You can minimize t where f0 (x) ≤ 15 4.5.3 SOCP: Robust Linear Programming Suppose we have a LP minimize cT x subject to aTi x ≤ bi but the numbers given in the problem could be inaccurate. As an example, let’s just assume that the true value of ai can lie in a ellipsoid defined by Pi , centered at the given value: ai ∈ E = {a¯i + Pi u| kuk2 ≤ 1} and other values (c and bi ) are fixed. We want the inequalities to hold for all possible value of a. The inequality constraint can be cast as sup (a¯i + Pi u) = a¯i T x + PiT x2 ≤ bi kuk2 ≤1 which is actually a SOCP constraint. Note that the additional norm term PiT x2 acts as a regularization term; they prevent x from being large in directions with considerable uncertainty in the parameters ai . 4.5.4 SOCP: Stochastic Inequality Constraints When ai are normally distributed vectors with mean a¯i and covariance matrix Σi , the following constraint P aTi x ≤ bi ≥ η says that a linear inequality constraint will hold with a probability of η or better. This can be cast as a SOCP constraint as well. Since x will be concrete numbers, we can say aTi x ∼ n u¯i , σ 2 . Then P aTi x ≤ bi = P ui − u¯i bi − u¯i ≤ σ σ =P Z≤ bi − u¯i σ ≥ η ⇐⇒ Φ bi − u ¯i σ ≥ η ⇐⇒ bi − µ¯i ≥ Φ−1 (η) σ The last condition can be rephrased as 1/2 a¯i T x + Φ−1 (η) Σi x ≤ bi 2 which is a SOCP constraint. 4.6 Geometric Programming Geometric programming problems involve products of powers of variables, not weighted sums of variables. 4.6.1 Monomials and Posynomials A monomial function f is a product of powers of variables in the form 16 f (x) = cxa1 1 xa2 2 · · · xann where c > 0. A sum of monomials are called a posynomial; which looks like f (x) = K X k ck xa1 1k xa2 2k · · · xannk 4.6.2 Geometric Programming A GP problem looks like: minimize f0 (x) subject to fi (x) ≤ 1 (i = 1, 2, 3, · · · , m) hi (x) = 1 (i = 1, 2, 3, · · · , p) where f are posynomials and h are monomials. The domain of this problem is Rn++ . 4.6.3 Convex Transformation GP problems are not convex in general, but a change of variables will turn a GP into a convex optimization problem. Letting yi = log xi ⇐⇒ xi = eyi yields a monomial f (x) to f (x1 , x2 , x3 , · · · ) = f (ey1 , ey2 , ey3 , · · · ) = c · e a 1 y1 · e a 2 y 2 · e a 3 y3 · · · = exp aT y + b which is now an exponential of affine function. Similarly, a posynomial will be converted into a sum of exponentials of affine functions. Now, taking log of the objective and the constraints. The posynomials turn into log-sum-exp ”which are convex), the monomials will be become affine. Thus, this is our regular convex problem now. 4.7 Generalized Inequality Constraints 4.7.1 Conic Form Problems Conic form problem is a generalization of LP, replacing componentwise inequality with generalized linear inequality with a cone K. minimize cT x subject to F x + g K 0 Ax = b 17 The SOCP can be expressed as a conic form problem if we set Ki to be a second-order cone in Rni +1 : minimize cT x subject to − Ai x + bi , cTi x + di Ki 0 (i = 1, · · · , m) Fx = g from which the name of SOCP comes. 4.7.2 SDP: Semidefinite Programming A special form of conic program, where K is Sn+ , which is the set of positive semidefinite matrices, is called a SDP. 
It has the form: minimize cT x subject to x1 F1 + x2 F2 + · · · + xn Fn 0 Ax = b 4.8 Vector Optimization We can generalize the regular convex optimization by letting the objective function take vector values; we can now use proper cones and generalized inequalities to find the best vector value. These are called vector optimization problems. 4.8.1 Optimal Values When a point x∗ is better or equal to than every other point in the domain of the problem, x∗ is called the optimal. In a vector optimization problem, if an optimal exists, it is unique. ”Why? Vector optimization requires a proper cone; proper cones are pointed – they do not contain lines. However, if x1 and x2 are both optimal, p = x1 − x2 and −p are both in the cone, making it improper.) 4.8.2 Pareto Optimality In many problems we do not have a minimum value achievable, but a set of minimal values. They are incomparable to each other. A point x ∈ D is pareto-optimal when for all y that f0 (y) K f0 (x) implies f0 (y) = f0 (x). Note that there can be multiple values with the same minimal value. Note that every pareto value has to lie on the boundary of the set of achievable values. 4.8.3 Scalarization for Pareto Optimality A standard technique for finding pareto optimal points is to scalarize vector objectives by taking a weighted sum. This can be explained in terms of dual generalized inequality. Pick any λ ∈ K∗ and solve the following problem: 18 minimize λT f0 (x) subject to fi (x) ≤ 0 hi (x) = 0 By what we discussed in 2.6.2, a pareto optimal point must be a minimizer of this objective for any λ ∈ K∗. Now what happens when the problem is convex? Each λ with λ ≻K∗ 0 will likely give us a different pareto point. Note λ K∗ 0 might not give us such guarantee, some elements might not be pareto optimal. 4.8.4 Examples • Regularized linear regression tries to minimize RMSE and norm of the coefficient at the same time so we optimize kAx − bk2 + λxT x. Changing λ lets us explore all pareto optimal points. 2 5 Duality This chapter explores many important ideas. Duality is introduced and used as a tool to derive optimality conditions. KKT conditions are explained. 5.1 The Lagrangian And The Dual 5.1.1 The Lagrangian The Lagrangian L associated with a convex optimization problem minimize f0 (x) subject to fi (x) ≤ 0 (i = 1, · · · , m) hi (x) = 0 (i = 1, · · · , p) is a function taking x and the weights as input, and returning a weighted sum of the objective and constraints: L (x, λ, ν) = f0 (x) + m X λi fi (x) + i=1 p X νi hi (x) i=1 So positive values of fi (x) are going to penalize the objective function. The weights are called dual variables or Lagrangian multiplier vectors. 5.1.2 The Lagrangian Dual Function The Lagrangian dual function takes λ and ν, and minimizes L over all possible x. g (λ, ν) = inf L (x, λ, ν) = inf x∈D x∈D f0 (x) + m X i=1 19 λi fi (x) + p X i=1 νi hi (x) ! 5.1.3 Lagrangian Dual and Lower Bounds It is easy to see that for any elementwise positive λ, the Lagrangian dual function provides a lower bound on the optimal value p∗ of the original problem. This is very easy to see; if x is any feasible point, the dual function value is the sum of ”possibly suboptimal) value p; if xp is the feasible optimal point, we have negative values of fi (i ≥ 1) and zeros for hi . 
Then, g (λ, ν) = inf L (x, λ, ν) ≤ L (xp , λ, ν) ≤ f (xp ) = p∗ x∈D 5.1.4 Intuitions Behind Lagrangian Dual An alternative way to express constraints is to introduce indicator functions in the objectives: minimize f0 (x) + m X I− (fi (x)) + i=1 p X I0 (hi (x)) i=1 the indicator functions will have a value of 0 when the constraint is met, ∞ otherwise. Now, these represent how much you are irritated by a violated constraint. We can replace them with a linear function - just a different set of preferences. Instead of hard constraints, we are imposing soft constraints. 5.1.5 LP example and Finite Dual Conditions A linear program’s lagrange dual function is g (λ, ν) = inf L (x, λ, ν) = −bT ν + inf c + AT ν − λ x x T x The dual value can be found analytically, since it is a affine function of x. Whenever any element of c + AT ν − λ is nonzero, we can manipulate x to make the dual value −∞. So it is finite only on a line where c + AT ν − λ = 0, which is a surprisingly common occurrence. 5.1.6 Conjugate Functions and Lagrange Dual The two functions are closely related, and Lagrangian dual can be expressed in terms of the conjugate function of the objective function, which makes duals easier to derive if the conjugate is readily known. 5.2 The Lagrange Dual Problem There’s one more thing which is named Lagrangian: the dual problem. The dual problem is the optimization problem maximize g (λ, ν) subject to λ 0 A pair of (λ, ν) is called dual feasible if it is a feasible point of this problem. The solution of this problem, (λ∗, ν∗) is called dual optimal or optimal Lagrange multipliers. The dual problem is always convex; whether or not the primal problem is convex. Why? g is a pointwise infimum of affine functions of λ and ν. Note the langrange dual for many problems were bounded only for a subset of the domain. We can bake this restriction into the problem explicitly, as a constraint. 20 5.2.1 Duality Gap Langrangian dual’s solution d∗ are related to the solution of the primal problem p∗, notably: d∗ ≤ p∗ Regardless of the original problem being convex. When the inequality is not strict, this is called a weak duality. The difference p ∗ −d∗ is called the duality gap. Duality can be used to provide lower bound of the primal p roblem. 5.2.2 Strong Duality And Slater’s Constraint Qualification When the duality gap is 0, we say strong duality holds for the problem. That means the lower bound obtained from the dual equals to the optimal solution of the problem; therefore solving the dual is ”sort of) same as solving the primal. For obvious reasons, strong duality is very desirable but it doesn’t hold in general. But for convex problems, we usually ”but not always) have strong duality. Given a convex problem, how do we know if strong duality holds? There are many qualifications, which ensures strong duality if the qualifications are satisfied. The text discusses one such qualification; Slater’s constraint qualification. The condition is quiet simple: if there exists x ∈ relintD such that all inequality conditions are strictly held, we have strong duality. Put another way: fi (x) < 0 (i = 0, 1, 2, · · · ) , Ax = b Also, it is noted that affine inequality constraints are allowed to held weakly. 5.2.3 Examples • Least-squares: since there are no infeasibility constraints, Slater’s condition just equals feasibility: so as long as the primal problem is feasible, strong duality holds. • QCQP: The Lagrangian is a quadratic form. 
When all λs are nonnegative, we have a positive semidefinite form and we can solve minimization over x analytically. • Nonconvex example: Minimizing a nonconvex quadratic function over the unit ball has strong duality. 5.3 Geometric Interpretation This section introduces some ways to think about Lagrange dual functions, which offer some intuition about why Slater’s condition works, and why most convex problems have strong duality. 5.3.1 Strong Duality of Convex Problems Let’s try to explain figure 5.3 to 5.5 from the book. Consider following set G: G = {(f1 (x) , · · · , fm (x) , h1 (x) , · · · , hp (x) , f0 (x)) |x ∈ D} = {(u, v, t) |x ∈ D} Note ui = fi (x), vi = gi (x) and t = f0 (x). Now the Lagrangian of this problem 21 L (λ, ν, x) = X λi ui + X ν i vi + t can be interpreted as a hyperplane passing through x with normal vector (λ, ν, 1)1 and that hyperplane will meet t-axis at the value of the Lagrangian. ”See figure 5.3 from the book.) Now, the Lagrange dual function g (λ, ν) = inf L (λ, ν, x) x∈D will find x in the border of D: intuitively, the Lagrangian can still be decreased by wiggling x if x ∈ relintD. Therefore, the value of the Lagrange dual function can now be interpreted as a supporting hyperplane with normal vector (λ, ν, 1). Next, we solve the Lagrange dual problem which maximizes the position where the hyperplane hits the t-axis. Can we hit p∗, the optimal value? When G is convex, the feasible portion of G ”i.e. u 0 and ν = 0) is convex again, and we can find a supporting hyperplane that meets G at the optimal point! But when G is not, p∗ can hide in a nook inside G and the supporting hyperplane might not meet p∗ at all. 5.4 Optimality Conditions 5.4.1 Certificate of Suboptimality and Stopping Criteria We know, without assuming strong duality, g (λ, ν) ≤ p⋆ ≤ f0 (x) Now, f0 (x) − g (λ, ν) gives a upper bound on f0 (x) − p⋆, the quantity which shows how suboptimal x is. This gives us a stopping criteria for iterative algorithms; when f0 (x) − g (λ, ν) ≤ ǫ, it is a certificate that x is less than ǫ suboptimal. The quantity will never drop below the duality gap, so if you want this to work for arbitrarily small ǫ we would need strong duality. 5.4.2 Complementary Slackness Suppose the primal and dual optimal values are attained and equal. Then, f0 (x⋆ ) = g (λ⋆ , ν ⋆ ) X X = inf f0 (x) + λi fi (x) + νi hi (x) x X X ≤ f0 (x⋆ ) + λi fi (x⋆ ) + νi hi (x⋆ ) ≤ f0 (x⋆ ) ”assumed 0 duality gap) ”definition of Lagrangian dual function) ”taking infimum is less than or equal to any x) ”λ⋆i are nonnegative, fi values are nonpositive and hi values are 0) So all inequalities can be replaced by equalities! In particular, it means two things. First, x⋆ minimizes the Lagrangian. Next, X λ⋆i fi (x⋆ ) = 0 Since each term in this sum is nonpositive, we can conclude all terms are 0: so for all i ∈ [1, m] we have: 1 Yep, there are some notation abusing here since λ and ν themselves are vectors. 22 λ⋆ = 0 i either f (x⋆ ) = 0 i This condition is called complementary slackness. 5.4.3 KKT Optimality Conditions KKT is a set of conditions for a tuple (x⋆ , λ⋆ , ν ⋆ ) which are primal and dual feasible. It is a necessary condition for x∗ and (λ∗, ν∗) being optimal points for their respective problems with zero duality gap. That is, all optimal points must satisfy these conditions. KKT condition is: • x⋆ is prime feasible: fi (x⋆ ) ≤ 0 for all i, hi (x⋆ ) = 0 for all i. 
• (λ⋆, ν⋆) is dual feasible: λ⋆_i ≥ 0
• Complementary slackness: λ⋆_i f_i(x⋆) = 0
• The gradient of the Lagrangian vanishes: ∇f_0(x⋆) + Σ λ⋆_i ∇f_i(x⋆) + Σ ν⋆_i ∇h_i(x⋆) = 0

Note the last condition is something we didn't see before. It makes intuitive sense though - the optimal point for the dual problem must minimize the Lagrangian. Since the primal problem is convex, the Lagrangian is convex in x - and the only point with zero gradient is the minimum.

KKT and Convex Problems

When the primal problem is convex, the KKT conditions are necessary and sufficient for optimality. This has immense importance: we can frame solving convex optimization problems as solving the KKT conditions. Sometimes the KKT conditions can be solved analytically, giving us a closed-form solution for the optimization problem.

The text also mentions that when Slater's condition is satisfied for a convex problem, an arbitrary x is primal optimal iff there is a (λ, ν) that satisfies KKT along with x. I am actually not sure why Slater's condition is needed for this claim, but the lecture doesn't make a big deal out of it, so meh..

5.5 Solving The Primal Problem via The Dual

When we have strong duality and the dual problem is easier to solve (due to some exploitable structure or an analytical solution), one might solve the dual first to find the dual optimal point (λ⋆, ν⋆), and then find the x that minimizes the Lagrangian. If this x is primal feasible, we have a solution! Otherwise, what do we do? If the Lagrangian is strictly convex, then there is a unique minimizer: if this minimizer is infeasible, we conclude the primal optimum is not attained.

5.6 Sensitivity Analysis

The Lagrange multipliers of the dual problem can be used to infer the sensitivity of the optimal value with respect to perturbations of the constraints. What kind of perturbations? We can tighten or relax the constraints of an arbitrary optimization problem by changing them to:

f_i(x) ≤ u_i (i = 1, 2, · · · , m)
h_i(x) = v_i (i = 1, 2, · · · , p)

Letting u_i > 0 means we have more freedom regarding the value of f_i; u_i < 0 tightens the constraint.

5.6.1 Global Sensitivity

The Lagrange multipliers give information about how the optimal value changes when we do this. Denote the optimal value of the perturbed problem as a function of u and v: p⋆(u, v). When strong duality holds, any x feasible for the perturbed problem satisfies

f_0(x) ≥ p⋆(0, 0) − λ⋆T u − ν⋆T v

which can be obtained by manipulating the definitions. Using this lower bound, we can make some inferences about how the optimal value will change with respect to u and v. Basically, when the lower bound increases greatly, we infer the optimal value will increase greatly. However, when the lower bound decreases, we don't have such an assurance². Examples:

• When λ⋆_i is large and we tighten the constraint (u_i < 0), this increases the lower bound a lot; the optimal value will increase greatly.
• When λ⋆_i is small and we loosen the constraint (u_i > 0), this decreases the lower bound a bit, but it might not decrease the optimal value much.

5.6.2 Local Sensitivity

The text shows an interesting identity:

λ⋆_i = −∂p⋆(0, 0)/∂u_i

So λ⋆_i gives you the slope of the optimal value with respect to the particular constraint. All of this, along with complementary slackness, can be used to interpret Lagrange multipliers; they tell you how tight a given inequality constraint is. Suppose we found that λ⋆_1 = 0.1 and λ⋆_2 = 100 after solving a problem.
By complementary slackness, we know f1 (x⋆ ) = f2 (x⋆ ) = 0 and they are both tight. However, when we do decrease u2 , we know p⋆ will move much more abruptly, because of the slope interpretation above. On the other hand, what happens when we increase u2 ? Locally we know p⋆ will start to decrease fast; but it doesn’t tell us how it will behave when we keep increasing u2 . 5.7 Examples and Reformulating Problems As different formulations can change convex problems to non-convex and vice versa, dual problems are affected by how the problem is exactly formulated. Because of this, a problem that looks unnecessarily complicated might end up to be a better representation. The text gives some examples of this. 2 Actually I’m a bit curious regarding this as well - lower bound increasing might not increase the optimal value when optimal value was well above the lower bound to begin with. 24 5.7.1 Introducing variables 5.7.2 Making explicit constraints implicit 5.7.3 Transforming the objective 5.8 Generalized Inequailities How does the idea of Lagrangian dual extend to problems with vector inequalities? Well, it generalizes pretty well - we can define everything pretty similarly. Except how the nonnegative restriction for λ becomes a nonnegative restriction with dual cone. Here are some intuition behind this difference. Say we have the following problem: minimize f0 (x) s.t. fi (x) Ki 0 hi (x) = 0 (i = 1, 2, · · · , m) (i = 1, 2, · · · , p) Now the Lagrangian multiplier λ is vector valued. The Lagrangian becomes: f0 (x) + X λTi fi (x) + X νi hi (x) fi (x) is nonpositive with respect to Ki means that −fi (x) ∈ Ki . Remember we want each product (x) to be nonpositive - otherwise this dual won’t be a lower bound anymore. Now we try to find the set of λ that makes λT y negative for all −y ∈ K. We will need to make λT y positive for all y ∈ K. What is λTi fi this set? The dual cone. The dual of SDP is also given as an example. The actual derivation involves more linear algebra than I am comfortable with ”shameful) so I’m skipping things here. 6 Approximation and Fitting With this chapter begins part II of the book on applications of convex optimization. Hopefully, I will be less confused/frustrated by materials in this part. :-) 6.1 Norm Approximation This section discusses various forms of the linear approximation problem: minx |Ax − b| with different norms and constraints. Without doubt, this is one of the most important optimization problems. 6.1.1 Examples • ℓ2 norm: we get least squares. • ℓ∞ norm: Chebyshev approximation problem. Reduces to an LP which is as easy as least squares, but no one discusses it! 25 • ℓ1 norm: Sum of absolute residuals norm. Also reduces to LP, extremely interesting, as we will discuss further in this chapter. 6.1.2 Different Penalty Functions and Their Consequences The shape of the norm used in the approximation affects the results tremendously. The most common norms are ℓp norms - given a residual vector r, X i |ri | p !1/p We can ignore the powering by 1/p and just minimize the base of the exponentiation. Now we can think of ℓp norms giving separate penalties to each of the residual vector. Note that most norms do the same - so we can think in terms of a penalty function φ (r) when we think about norms. The text examines a few notable penalty functions: • Linear: sum of absolute values; associated with ℓ1 norm. • Quadratic: sum of squared errors; associated with ℓ2 norm. • Deadzone-linear: zero penalty for small enough residuals; grows linearly after the barrier. 
• Log-barrier: grows infinitely as we get near the preset barrier. Now how does these affect our solution? The penalty function measures our level of irritation with regard to the residual. When φ (r) grows rapidly as r becomes large, we are immensly irritated. When φ (r) shrinks rapidly as r becomes small, we don’t care as much. This simple description actually explains the stark difference between ℓ1 and ℓ2 norms. With a ℓ1 norm, the slope of the penalty does not change when the residual gets smaller. Therefore, we still have enough urge to shrink the residual until it becomes 0. On the other hand, with a ℓ2 norm the penalty will quickly get smaller when the residual gets smaller than 1. Now, once we go below 1, we do not have as much motivation to shrink it further - the penalty does not decrease as much. What happens when the residual is large? Then, ℓ1 is actually less irritated by ℓ2 ; the penalty grows much more rapidly. These explanations let us predict how the residuals from both penalty functions will be distributed. ℓ1 will give us a lot of zeros, and a handful of very large residuals. ℓ2 will only have a very small number of large residuals; and it won’t have as many zeros - many residuals will be near zero, but not exactly. The figures in the text confirms this theory. This actually was one of the most valuable intuitions I got out of this course. Awesome. Another little gem discussed in the lecture is that, contrary to the classic approach to fitting problems, the actual algorithms that find the x are not your tools anymore - they are standard now. The penalty function is your tool - you shape your problem to fit your actual needs. This is a very interesting, and at the same time very powerful perspective! 6.1.3 Outliers and Robustness Different penalty functions behave differently when outliers are present. As we can guess, quadratic loss functions are affected much worse than linear losses. When a penalty function is not sensitive to outliers, it is called robust. Linear loss function is an obvious example of this. The text introduces another robust penalty function, which is the Huber penalty function. It is a hybrid between quadratic and linear losses. 26 φhub (u) = u2 (|u| < M ) M (2 |u| − M ) otherwise Huber function grows linearly after the preset barrier. It is the closest thing to a constant-beyondbarrier loss function, without losing convexity. When all the residuals are small, we get the exact least square results - but if there are large residuals, we don’t go nuts with it. It is said in the lecture that 80% of all applications of linear regression could benefit from this. A bold, but very interesting claim. 6.1.4 Least-norm Problems A closely related problem is least-norm problem which has the following form: minimize |x| subject to Ax = b which obviously is meaningful only when Ax = b is underdetermined. This can be cast as a norm approximation problem by noting that the solution space is given by a particular solution, and the null space of A. Let Z consist of column vectors that are basis for N (A), and we minimize: |x0 + Zu| Two concrete examples are discussed in the lecture. • If we use ℓ2 norm, we have a closed form solution using the KKT conditions. • If we use ℓ1 norm, it can be modeled as an LP. This approach is in vogue, say in last 10 years or so. We are now looking for a sparse x. 6.2 Regularized Approximations Regularization is a practice of minimizing the norm of the coefficient |x|, as well as the norm of the residual. 
It is a popular practice in multiple disciplines. Why do it? The text introduces a few examples. First of all, it can be a way to express prior knowledge of, or a preference for, smaller coefficients. There might be cases where our model is not a good approximation of reality when x gets large. Personally, the interpretation that made the most sense to me is that it can be a way of taking variations/errors in the matrix A into account. For example, say we assume an error ∆ in our matrix A. Then we are minimizing (A + ∆)x − b = Ax − b + ∆x; the error is multiplied by x! We don't want a large x.

6.2.1 Bi-criterion Formulation

Regularization can be cast as a bi-criterion problem, as we have two objectives to minimize. We can trace the optimal trade-off curve between the two objectives. On one end, where ‖x‖ = 0, we have Ax = 0 and the residual norm is ‖b‖. At the other end, there can be multiple Pareto-optimal points which minimize ‖Ax − b‖. (When both norms are ℓ2, it is unique.)

6.2.2 Regularization

The actual practice of regularization is more concrete than merely trying to minimize the two objectives; it is a scalarization method. We minimize

‖Ax − b‖ + γ ‖x‖

where γ is a problem parameter. (In practice it is typically set by cross validation or manual intervention.) Practically, γ is the knob we turn when solving the problem. Another common practice is taking a weighted sum of squared norms:

‖Ax − b‖² + δ ‖x‖²

Note it is not obvious that the two problems sweep out the same tradeoff curve. (They do, and you can find the mapping between γ and δ for a specific problem.) The most prominent scheme is Tikhonov regularization / ridge regression. We minimize:

‖Ax − b‖₂² + δ ‖x‖₂²

which even has an analytic solution. The text also mentions a smoothing regularization scheme - the penalty is on Dx instead of x. D can change depending on your criterion of fitness of solutions. For example, if we want x to be smooth, we can roughly penalize its second derivative by setting D to the Toeplitz matrix

D =
⎡ 1 −2  1  0  0 · · · ⎤
⎢ 0  1 −2  1  0 · · · ⎥
⎢ 0  0  1 −2  1 · · · ⎥
⎣ ⋮   ⋮   ⋮   ⋮   ⋮  ⋱ ⎦

so that the elements of Dx are approximately the second differences, ±(2x_i − x_{i−1} − x_{i+1}).

6.2.3 ℓ1 Regularization

ℓ1 regularization is introduced as a heuristic for finding sparse solutions. We minimize:

‖Ax − b‖₂ + γ ‖x‖₁

The optimal tradeoff curve here can be an approximation of the optimal tradeoff curve between ‖Ax − b‖₂ and the cardinality card x, the number of nonzero elements of x. ℓ1 regularization can be solved as an SOCP.

6.2.4 Signal Reconstruction Problem

An important class of problems is introduced: signal reconstruction. There is an underlying signal x which is observed with some noise, resulting in a corrupted observation. x is assumed to be smooth. What is the most plausible guess for the time series x? This can be cast as a bicriterion problem: first we want to minimize ‖x̂ − x_cor‖₂, where x̂ is our guess and x_cor is the corrupted observation. On the other hand, we think smooth x̂ are more likely, so we also minimize a penalization function φ(x̂). Two penalization schemes are introduced: quadratic smoothing and total variation smoothing. In short, they are ℓ2 and ℓ1 penalties, respectively. When the underlying process has jumps, as you can expect, total variation smoothing preserves those jumps, while quadratic smoothing smooths out the transitions. Some more insights are shared in the lecture videos. Recall ℓ1 regularization gives you a small number of nonzero regularized terms.
So if you are penalizing

φ(x̂) = Σ |x̂_{i+1} − x̂_i|

the (approximate) first derivative is going to be sparse. What does the resulting function look like? Piecewise constant. Similarly, say we penalize the approximate second derivative, Σ |2x̂_i − x̂_{i−1} − x̂_{i+1}|? We get piecewise linear! The theme goes on - if we take third differences, we get piecewise quadratic (actually, splines).

6.3 Robust Approximation

How do we solve the approximation problem when A is noisy? Let's say A = Ā + U, where Ā represents the componentwise mean and U represents the random component with zero mean. How do we handle this? The prevalent method is to ignore that A has possible errors. That is okay, as long as you do a posterior analysis of the method: change A by a small amount, run the approximation again, and see how the solution changes.

6.3.1 Stochastic Formulation

A reasonable formulation is to minimize the expected error:

minimize E ‖Ax − b‖

This is intractable in general, but tractable in some special cases, including when we minimize the squared ℓ2 norm:

minimize E ‖Ax − b‖₂²

Then, writing P = E UᵀU:

E ‖Ax − b‖₂² = E (Āx − b + Ux)ᵀ(Āx − b + Ux)
            = (Āx − b)ᵀ(Āx − b) + E xᵀUᵀUx
            = ‖Āx − b‖₂² + xᵀPx
            = ‖Āx − b‖₂² + ‖P^{1/2}x‖₂²

Tada, we got Tikhonov regularization! This makes perfect sense - increasing the magnitude of x will increase the variation of Ax, which in turn increases the average value of ‖Ax − b‖ by Jensen's inequality. So we are trying to balance making ‖Āx − b‖ small against making the variance small. This is a nice interpretation of Tikhonov regularization as well.

6.3.2 Worst-case Robust Approximation

Instead of taking the expected value of the error, we can try to minimize the supremum of the error over a set A of possible values of A. The text describes several types of A that lead to explicit solutions:

• When A is a finite set
• When A is a mean plus U, where U is an error in a norm ball.
• When each row of A is a mean plus P_i, where P_i describes an ellipsoid of possible values.
• more examples..

Worst-case robust least squares is mentioned in the lecture. This is not a convex problem, but it can be solved exactly. In fact, any optimization problem with two quadratic functions can be solved exactly (see the appendix of the book).

6.4 Function Fitting

In a function fitting problem, we try to approximate an unknown function by a linear combination of basis functions. We determine the coefficient vector x which yields the function

f(u) = Σ x_i f_i(u)

where f_i() is the i-th basis function. Typical basis functions are powers of u: the possible set of f is then the set of polynomials. You can also use piecewise linear and piecewise polynomial functions; piecewise polynomials give you spline functions.

6.4.1 Constraints

We can impose various constraints on the function being fitted. The text introduces some tractable sets of constraints.

• Function value interpolation: the function value at a given point, f(v) = Σ x_i f_i(v), is a linear function of x. Therefore, equality and inequality constraints on function values are actually linear constraints.
• Derivative constraints: the derivative at a given point, ∇f(v) = Σ x_i ∇f_i(v), is also a linear function of x.

6.4.2 Sparse Descriptions and Basis Pursuit

In basis pursuit problems, we want to find a sparse f out of a very large number of basis functions. By a sparse f, we mean there are few nonzero entries in the coefficient vector x.
Mathematically, this is equivalent to the regressor selection problem ”quite unsurprisingly), so the similar set of heuristics can be used. First, we can use ℓ1 regularization to approximate optimizing for cardx. 30 6.4.3 Checking Model Consistency The text introduces an interesting problem - given a set of data points, is there a convex function that satisfies all those data? Fortunately, recall the first order convexity condition from 3.1.2 - using this, ensuring the convexity of a function is as easy as finding the gradients at the data points so that the first order condition is satisfied. We want to find g1 , · · · , gm so that: yj ≥ yi + giT (uj − ui ) for any pair of i and j. Fitting Convex Function To The Data We can fit a convex function to the data by finding fitted values of y, and ensuring the above condition holds for the fitted values. Formally, solve: minimize (yi − yˆi ) 2 subject toˆ yj ≥ yˆi + giT (uj − ui ) for any pair of i, j This is a regular QP. Note the result of this problem is not a functional representation of the fitted function, as in regular regression problems. Rather, we get the value of the function - so it’s a point-value representation. Bounding Values Say we want to find out if a new data point is irregular or not - is it consistent with what we saw earlier? In other words, given a new unew , what is the range of values possible given the previous data? We can minimize/maximize for yˆnew subject to the first order constraint, to find the range. These problems are LP. 7 Statistical Estimation I was stuck in this chapter for too long. It’s time to finish this chapter no matter what. This chapter shows some example applications of convex optimization in statistical settings. 7.1 Parametric Distribution Estimation The first example is MLE fitting - the most obvious, but the most useful. We of course require the constraints on x to be convex optimization friendly. A linear model with IID noise is discussed: yi = aTi x + vi The MLE is of course xml = argmaxx l (x) = argmaxx log px (y) px (y) depends on the distribution of vi s. Different assumptions on this distribution leads to different fitting methods: • Gaussian noise gives you OLS 31 • Laplacian noise gives you ℓ1 regularized regression ”of course, Laplacian distribution has a sharp peak at 0, which equates to having high incentive to reduce residual when residual is really small) • Uniform noise Also note that we need to have log px (y) to be concave in x, not y: and exponential families of distribution meet this criteria. Also, in many cases, your natural choice of parameters might not yield a log likelihood function that is concave. Usually with a change of variables, we achieve this. Also, we discuss that these distributions are equivalent to different penalty schemes - as demonstrated by the equivalence of L2 with Gaussian, L1 with Laplacian. There are 1:1 correspondence. If you have a penalty function p (v), the corresponding distribution is ep(v) normalized! 7.1.1 Logistic Regression Example exp aT u+b We model p = S a u + b = 1+exp(aT u+b) where u are the explanatory variables, a and b are model parameters. Say we have n = q + m examples, the first q of them having yi = 1 and next m of them having yi = 0. T Then, the likelihood function has: q Y i=1 pi n Y i=q+1 (1 − pi ) Take log and plug in above equation for p and we get the following concave function: q X i=1 n X log 1 + exp aT ui + b aT ui + b − i=q+1 7.1.2 MAP Estimation MAP is the Bayes equivalent of MLE. 
The underlying philosophy is vastly different, but the optimization technicality remains more or less the same, except a term that describes the prior distribution. 7.2 Nonparameteric Distribution Estimation A nonparameteric distribution is one which we don’t have any closed formula for. So we will estimate a vector p where prob (x = αk ) = pk which lies in Rn . 7.2.1 Priors • Expected value of any function are just linear equality in terms of p, so we can express them easily. • The variance of the random variable is a concave function of p. Therefore, a lower bound on the variance can be expressed within the convex setting. • The entropy of X is a concave, so we can express the lower bound as well. • The KL-divergence between p and q is convex. So we can impose upper bound here. 32 7.2.2 Objectives • We can minimize/maximize expected values because they are affine to p. • We can find MLE because log likelihood for p in this setting is always concave. • We can find maximum entropy. • We can find minimum KL-divergence between p and q. 7.3 Optimal Detector Design And Hypothesis Testing I will only cover this section briefly. Problem setup: the parameter θ can take m values. For each value of θ, we have a nonparameteric distribution over n possible values α1 · · · αn . The probabilities can be represented by a matrix P = Rn×m . We call each θ a different hypothesis. We want to find which θ generated given sample. So the detector we want to design is a function from a sample to θ. We can create either deterministic or probabilistic detectors; like in game theory, introducing extra randomness can improve the detector in many ways. For a simple and convincing example; say we have a binary problem. Draw an ROC curve which shows the tradeoff between false positive and false negative errors. A deterministic detector might not be able to hit the sweet spot where pf n = pf p depending on the θs - but probabilistic detectors can. 7.4 Chebyshev and Chernoff Bounds 7.4.1 Chebyshev Bounds Chebyshev bounds give an upper bound on a probability of a set based on known quantities; many inequalities follow this form. For example, Markov’s inequality says: If X ∈ R+ has EX = µ then we have prob (X ≥ 1) ≤ µ. ”Of course, this inequality is completely useless when µ > 1 but that’s how all these inequalities are.) This section looks at cases where we can find such bounds using convex optimization. In this setup, our prior knowledge is represented as a pair of functions and their expected values. The set whose probability we want to find bounds for is given as C. We want something like: prob (X ∈ C) ≤ Ef (X) for some function f whose expectation we can take. The recipe is to concoct an f which is a linear combination of the prior knowledges. Then Ef (X) is simply a linear combination of the expectations. How do we ensure the EV is above prob (X ∈ C)? We can impose that f (x) ≥ 1C (x) pointwise, where 1C is an indicator function for C. We can now state the following problem: minimize X ai xi = Efi (X) xi = Ef (X) X subject tof (z) = xi fi (z) ≥ 1 if z ∈ C X f (z) = xi fi (z) ≥ 0 if z ∈ S\C X This is a convex optimization problem, since the constraints are convex. For example, the first constraint can be recast as 33 g1 (x) = 1 − inf f (z) < 0 z∈C which is surely convex. There is another formulation where we solve a case where the first two moments are specified; but I am omitting it. 7.4.2 Chernoff Bounds This section deals with Chernoff bounds, which has a different form, but the same concept. 
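To make the Chebyshev-bound recipe above concrete, here is a minimal sketch (my own, not from the text) using cvxpy on a finite sample space: the only prior knowledge is E[X] = mu, the bounding function f is restricted to affine functions x0 + x1·z, and we bound prob(X ≥ t). The names S, mu, t and the numbers are illustrative assumptions; with only a mean constraint the LP should recover Markov's bound mu/t.

```python
import numpy as np
import cvxpy as cp

# Finite sample space S = {0, 1, ..., 100}; prior knowledge: E[X] = mu.
S = np.arange(101)
mu, t = 10.0, 30                      # bound prob(X >= t)

# f(z) = x0 * f0(z) + x1 * f1(z) with f0(z) = 1, f1(z) = z.
F = np.column_stack([np.ones_like(S, dtype=float), S])  # rows are (f0(z), f1(z))
a = np.array([1.0, mu])                                  # (E f0(X), E f1(X))

x = cp.Variable(2)
constraints = [F @ x >= 0,            # f(z) >= 0 on all of S (covers S \ C)
               F[S >= t] @ x >= 1]    # f(z) >= 1 on C = {z : z >= t}
problem = cp.Problem(cp.Minimize(a @ x), constraints)    # minimize E f(X)
problem.solve()
print(problem.value, mu / t)          # LP bound vs. Markov's bound mu/t
```

Adding more known expectations (second moments, probabilities of subsets, etc.) just adds columns to F and entries to a, which is exactly how the text's formulation tightens the bound.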
7.5 Experiment Design We discuss various solutions to the experiment design problem as an application. The setup is as follows. We have a fixed menu of p different experiments which is represented by ai (1 ≤ i ≤ p). We will perform m experiments, each experiment taken from the menu. For each experiment, we get yi as the result which is yi = aTi x + wi where wi are independent unit Gaussian noise. The maximum likelihood estimate is of course given by least squares. Then, the associated error e = x ˆ − x has zero mean and has covariance matrix E: E = EeeT = X ai aTi How do we minimize E? What kind of metrics do we use? −1 7.5.1 Further Modeling First, this is an offline problem and we don’t actually care about the order we perform. So the only thing we care is that for each experiment on the menu, how many times do we perform it. So the optimization variables are a list of nonnegative integers mi which sum up to m. Of course, the above problem is combinatorially hard and we relax it a bit, by modeling what fraction of m do we run each experiments. Still, the objective E is a vector ”actually a matrix) so we need some scalarization scheme to minimize it. The text discusses some strategies including: • D-optimial design: minimize determinant of E. Since determinant is the volume of the box, we are in effect minimizing the volume of the confidence ellipsoid. • E-optimal design: we minimize the largest eigenvalue of E. Rationale: the diameter of the confidence ellipsoid is proportional to norm of the matrix. • A-optimal design: we minimize the trace. This is, effectively, minimizing the error squared. 8 Geometric Problems 8.1 Point-to-Set Distance A project of the point x0 to a closed set C is defined as the closest point in C that minimizes the distance from x0 . When C is closed and convex, and the norm is strictly convex ”e.g. Euclidean), we can prove the 34 projection is unique. When the set C is convex, finding the projection is a convex optimization problem. Some examples are discussed - planes, halfplanes, and a proper cone. Finding the separating hyperplane between a point and a convex set is discussed as well. When we use Euclidean norm, we have a geometric, intuitive way to find one - take x0 and its projection p (x0 ), and use ~ 0 ) − ~x and passes the mean of two points. However, for other norms, the hyperplane which is normal to p (x we have to construct such hyperplane using dual problem; if we find a particular Lagrangian multiplier for which the dual problem is feasible, we know that multiplier constitutes a separating hyperplane. 8.1.1 PCA Example Suppose the set C of m × n matrices with at most k rank. A projection of X0 onto C which minimizes the Euclidean norm is achieved by a truncated SVD - yes PCA! 8.2 Distance between Sets Distance between two convex sets is an convex optimization problem, of course. The dual of this problem can be interpreted as a problem finding a separating hyperplane between the two sets. The argument can be made: if strong duality holds, a positive distance implies an existence of a separating hyperplane. 8.3 Euclidean Distance and Angle Problems This section deals with problems where Euclidean distances and angles between vectors are constrained. Setup: n vectors in Rn , for which we assume their Euclidean lengths are known: li = kai k2 . 
Distance and angular constraints can be cast as a constraint on G, which is the Gram matrix of A which has ai as column vectors: G = AT A G will be our optimization variable; after the optimization we can back out the interested vectors by Cholesky factorization. This is a SDP since G is always positive semidefinite. 8.3.1 Expressing Constraints in Terms of G • Diagonal entries will give length squared: Gii = li2 • The distance between vector i and j dij can be written as: kai − aj k2 = li2 + lj2 − 2aTi aj which means Gij is an affine function of d2ij : Gij = 1/2 = li2 + lj2 − 2Gij li2 +lj2 −d2ij 2 1/2 This means range constraints on d2ij can be a pair of linear constraints on Gij . • Gij is an affine function of the correlation coefficient ρij . • Gij is also an affine function of cosine of the angle between two vectors: cos α. Since cos−1 is monotonic, we can use this to constrain the range of α. 35 8.3.2 Well-Condition Constraints The condition number of A, σ1 /σn , is a quasiconvex function of G. So we can impose a maximum value or try to minimize it using quasiconvex optimization. Two additional approaches to well-conditionness are discussed - dual basis and maximizing log det G . 8.3.3 Examples • When we only care about angles between vectors ”or correlations) we can set li = 1 for all i. • When we only care about distance between vectors, we can assume that the mean of the vectors are 0. This can be solved using the squared lengths as the optimization variable. Since Gij = li2 + lj2 − 2d2ij /2, we get: G = z1T + 1z T − D /2 which should be PSD ”zi = li2 ). 8.4 Extremal Volume Ellipsoids This section deals with problems which approximates given sets with ellipsoids. 8.4.1 Lowner-John Ellipsoid The LJ ellipsoid ǫlj for a set C is defined as the minimum-volume ellipsoid that contains C. This can be cast as a convex optimization problem, however is only tractable when C is tractable. ”Of course, C has an infinite number of points or whatever, it’s not going to be tractable..) We set our optimization variable A and b such that: εlj = {v| kAv + bk2 ≤ 1} The volume of the LJ ellipsoid is proportional to det A−1 , so that’s what we optimize for. We minimize: log det A−1 subject to supv∈C kAv + bk2 ≤ 1. As a trivial example, consider when C is a finite set of size m; then the constraints translate into m convex constraints on A and b. A notable feature of LJ ellipsoid is that its efficiency can be bounded; if you shrink an LJ ellipsoid by a factor or n ”the dimension), it is guaranteed to fit inside C ”of course, when C is bounded and has nonempty √ interior). So roughly we have a factor of n approximation. ”Argh.. the proof is tricky. Uses modified problem’s KKT conditions.) √ When the set is symmetric about a point x0 , the factor 1/n can be improved to 1/ n. 8.4.2 Maximum Volume Inscribed Ellipsoid A related problem tries to find the maximum volume ellipsoid which lies inside a bounded, convex set C with nonempty interrior. We use a different formulation of the ellipsoid now; it’s a forward projection of a unit ball. 36 ε = {Bu + d| kuk2 ≤ 1} Now its volume is proportional to det B. The constraint would be: sup IC (Bu + d) ≤ 0 kuk2 ≤1 Max Ellipsoid Inside A Polyhedron A polyhedron is described by a set of m linear inequalities: C = x|aTi x ≤ bi We can now optimize regarding B and d. We can translate the constraint as: sup IC (Bu + d) ≤ 0 ⇐⇒ kuk2 ≤1 which is a convex constraint on B and d. 
sup aTi (Bu + d) ≤ bi kuk2 ≤1 ⇐⇒ BaTi 2 + aTi d ≤ bi 8.4.3 Affine Invariance If T is an invertible matrix, it is stated that the transformed LJ ellipsoid will still cover the set C after transformation. It holds for maximum volume inscribed ellipsoid as well. 8.5 Centering 8.5.1 Chebychev Center Given a bounded, nonempty-interior set C ∈ Rn , a Chebychev centering problem finds a point where the depth is maximized, which is defined as: depth (x, C) = dist (x, Rn \C) So it’s a point which is farthest from the exterior of C. This is not always tractable; suppose C is defined as a set of convex inequalities fi (x) ≤ 0. Then, Chebychev center could be found by solving: maximize subject to R gi (x, R) ≤ 0 (i = 0, 1, · · · ) where gi is a pointwise maximum of fi (x + Ru) where kuk2 ≤ 1. Since fi is convex and x + Ru is affine in x and R, gi is a convex function. However, it’s hard to evaluate gi ; since we have to find a pointwise maximum of convex functions. Therefore, Chebychev center problem is feasible only for specific classes of C. For example, when C is a polyhedron. ”An LP can solve this case) 8.5.2 Maximum Volume Ellipsoid Center A generalization of Chebychev center is MVE center; the center of the maximum volume inscribed ellipsoid. When the MVE problem is solvable, MVE center is trivially attained. 37 8.5.3 Analytic Center An analytic center works with the logarithmic barrier − log x. If C is defined as fi (x) ≤ 0 for all i, the analytic center minimizes − X log (−fi (x)) i This makes sense; when x is feasible, the absolute value of fi (x) kind of denotes the margin between x and infeasible regions. Analytic center tries to maximize the product of those margins. The analytic center is not invariant under different representations of the same set C, obviously. 8.6 Classification This section deals with two sets of data {x1 , x2 , · · · , xN } and {y1 , y2 , · · · , yM }. We want to find a function f (x) such that f (xi ) > 0 and f (yi ) < 0. 8.6.1 Linear Discrimination Linear discrimination finds an affine function f (x) = aT x − b which satisfies the above requirements. Since these requirements are homogeneous in a and b, we can scale them arbitrarily so the following are satisfied: aT x − b ≥ 1 i aT y − b ≤ −1 i 8.6.2 Robust Linear Discrimination If two sets can be linearly discriminated, there will always be multiple functions that separate them. One way to choose among them is to maximize the minimum distance from the line to each sample; in other words, maximum margin or the thickest slab . This leads to the following problem: maximize t subject to aT xi − b ≥ t aT yi − b ≤ −t kak2 ≤ 1 Note the last requirement; we normalize a, since we will be able to arbitrarily increase t unless we normalize a. Support Vector Classifier When two sets cannot be linearly separated, we can relax the constraints f (xi ) > 1 and f (yi ) < −1 by rewriting them as: f (xi ) > 1 − ui , and f (yi ) < −1 + vi 38 where ui and vi are nonnegative. Those numbers can be interpreted as a measure of how much each constraint is violated. We can try to make these sparse by optimizing for the sum; this is an ℓ1 norm and u and v will ”hopefully) be sparse. Support Vector Machine The above are two approaches for robust linear discrimination. First tries to maximize the width of the slab. Second tries to minimize the number of misclassified points ”actually, it optimizes its proxy). We can consider the trade-off between the two. 
Note the width of the slab z| − 1 ≤ aT z − b ≤ 1 Can be calculated by the distance between the two hyperplanes aT z = b − 1 and aT z = b + 1. Let aT z1 = b − 1 and aT z2 = b + 1. aT (z1 − z2 ) = 2. It follows kz1 − z2 k2 = 2/ kak2 . Now we can solve the following multicriterion optimization problem: minimize kak2 + γ 1T u + 1T v subject to aT xi − b ≥ 1 − ui aT yi − b ≤ −1 + vi u 0, v 0 We have SVM! Logistic Regression Another way to do approximate linear discrimination is logistic regression. This should be very familiar now; the negative log likelihood function is convex. 8.6.3 Nonlinear Discrimination We can create nonlinear separation space by introducing quadratic and polynomial features. For polynomial discrimination, we can do a bisection on the degree to find the smallest polynomial that can separate the input. 8.7 Placement and Location Placement problem deals with n points in Rk , where some locations are given, and the rest of the problems are the optimization variables. The treatment is rather basic. In essense, • You can minimize the sum of distances between connected nodes when the distance metric is convex. • You can place upper bounds on distance between pairs of points, or lengths of certain paths. • When the underlying connectivity represents a DAG, you can also minimize the max distance from a source node to a sink node using a DP-like argument. 8.8 Floor Planning A floor planning tries to place a number of axis-aligned rectangles without overlaps, optimizing for the size of the bounding rectangle. This is a hard combinatorial optimization problem in general, but specifying 39 relative positioning of the boxes can make these problems convex. A relative positioning constraint gives how individual pairs of rectangles are positioned. For example, rectangle i must be either above, below, left to, right to rectangle j. These can be cast as linear inequalities. For example, we can specify that rectangle i is left to rectangle j by specifying: xi + w i ≤ xj Some other constraints we can use: • Minimum area for each rectangle • Aspect ratio constraints are simple linear ”in)equalities. • Alignment constraints: for example, two rectangles are centered at the same line • Symmetry constraints • Distance constraints: given relative positioning constraints, ℓ1 or ℓ∞ constraints can be cast pretty easily. Optimizing for the area of the bounding box gives you a geometric programming problem. 9 Numerical Linear Algebra Background 10 Unconstrained Optimization Welcome to part 3! For the rest of the material I plan to skip over the theoretical parts, only covering the motivation and rationale of the algorithms. 10.1 Unconstrained Minimization Problems An unconstrained minimization doesn’t have any constraints but a function f () we need to minimize. We assume f to be convex and differentiable, so the optimality can be checked by looking at the gradient ∇f . This can sometimes be solved analytically ”for example, least squares), but in general we need to resort to an iterative method ”for example, geometric programming or analytic center). 10.1.1 Strong Convexity For most of this chapter we assume that f is strongly convex; which means that there exists m > 0 such that ∇2 f (x) mI for all symmetric x. This feels like an analogue of having a positive second order coefficient - is this different from having posdef Hessian? ”Hopefully video lecture provides some insights) Anyways, this is an extremely strong assumption, and we can’t, in general, expect our functions to be strongly convex. 
Then why assume this? We are looking at theoretical convergence, which is already not attainable ”because no algorithm is going to run infinitely). Professor says it’s more of a feel good stuff so let’s make assumptions that can shorten the proof. 40 Strong convexity has interesting consequences; the usual convexity bound can be improved so we have: T f (y) ≥ f (x) + ∇f (x) (y − x) + m 2 ky − xk2 2 We can analytically find the minimum point of RHS of this equation, and plug it back into the RHS to get the lower bound of f (y): f (y) ≥ f (x) − 1 2 k∇f (x)k2 2m So this practically means that we have a near-optimal point when the gradient is small. When we know m 2 this can be a hard guarantee, but m is not attainable in general. Therefore, we resort to making k∇f (x)k2 small enough so that we have a high chance of being near optimal. 10.1.2 Conditional Number of Sublevel Sets Conditional numbers of sublevel sets have a strong effect on efficiency of some algorithms. Conditional number of a set is defined as the ratio between maximum and minimum width of the set. The width for a convex set C along a direction q (kqk2 = 1) is defined by: W (C, q) = sup q T z − inf q T z z∈C z∈C 10.2 Descent Methods The family of iterative optimization algorithms which generate a new solution x(k+1) from x(k) by taking x(k+1) = x(k) + t(k) ∆x(k) where t(k) is called the step size, and ∆x(k) is called the search direction. Depending on how we choose t and ∆x, we get different algorithms. There are two popular ways of choosing t: • Exact line search minimizes f (t) = f x(k) + t · ∆x(k) exactly, by either analytic or iterative means. This is used when this minimization problem can be solved efficiently. • Backtracking search tries to find a t where the objective function sufficiently decreases. The exact details isn’t very important; it is employed when the minimization problem is harder to solve. The algorithm is governed by two parameters, which, practically, does not drastically change the performance of the search. 10.3 Gradient Descent Taking ∆x(k) = −∇f (x) will give you gradient descent algorithm. Some results for the convergence analysis are displayed. The lower bound for iteration is given as: log ((f (x0 ) − p∗ ) /ǫ) log (1/c) where p∗ is the optimal value, and we stop when we have f x(k) −p∗ < ǫ. The numerator is intuitive; the denominator involves the condition number and roughly equal to m/M . Therefore, as condition number increases, the number of required iterations will grow linearly. Given a constant condition number, this 41 bound shows that the error will decrease exponentially. For some reason this is called linear convergence in optimization context. 10.3.1 Performance Analysis on Toy Problems Exact search and backtracking line search is compared on toy problems; the number of iteration can differ by a factor of 2 or something like that. Also, we look at an example where we play with the condition number of the Hessian of f ; and the number of iteration can really blow up. 10.4 Steepest Descent Steepest descent algorithm generalizes gradient descent by employing a different norm. Given a norm, the normalized steepest descent direction is given by n o T ∆x = argmin ∇f (x) v| kvk = 1 Geometrically, we look at all vectors in a unit ball centered at current x and try to minimize f (x). When we use Euclidean norm, we regain gradient descent. 
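As a concrete companion to 10.2-10.3 above, here is a minimal numpy sketch (mine, not from the text) of gradient descent with backtracking line search on a toy quadratic whose condition number can be dialed up; the parameters alpha, beta, the tolerance, and the toy objective are illustrative choices, not anything prescribed by the book.

```python
import numpy as np

def gradient_descent(f, grad, x0, alpha=0.3, beta=0.8, tol=1e-8, max_iter=10_000):
    """Gradient descent with backtracking line search (parameters alpha, beta)."""
    x = x0.astype(float)
    for k in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) <= tol:            # stop when the gradient is small
            return x, k
        t = 1.0
        # Backtracking: shrink t until the sufficient-decrease condition holds.
        while f(x - t * g) > f(x) - alpha * t * (g @ g):
            t *= beta
        x = x - t * g
    return x, max_iter

# Toy problem: f(x) = 1/2 (x1^2 + kappa * x2^2); kappa is the condition number.
kappa = 100.0
f = lambda x: 0.5 * (x[0] ** 2 + kappa * x[1] ** 2)
grad = lambda x: np.array([x[0], kappa * x[1]])

x_star, iters = gradient_descent(f, grad, np.array([1.0, 1.0]))
print(iters)   # grows roughly linearly as kappa (the condition number) increases
```

Re-running with kappa = 1 versus kappa = 1000 is a quick way to see the condition-number dependence discussed above.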
Also, in some cases, we can think of SD as GD after a change of coordinates ”intuitively this makes sense, because using a different norm is essentially employing a different view on the coordinate system). 10.4.1 Steepest Descent With an ℓ1 Norm When we use ℓ1 norm, SD essentially becomes the coordinate descent algorithm. It can be trivially shown: take the basis vector with the largest gradient component, and minimize along that direction. Since we are using ℓ1 norm, we can never take a steeper descent. 10.4.2 Performance and Choice of Norm Without any problem-specific assumptions, essentially same as GD. However, remember that the condition number greatly affects the performance of GD - and change of coordinates can change the sublevel set’s condition number. Therefore, if we can choose a norm such that the sublevel sets will approximate an ellipsoid/sphere, SD works very well. A Hessian at the optimal point, if attainable, will minimize the condition number greatly. 10.5 Newton’s Method Newton’s method is the workhorse of convex optimization. The major motivation is that it tries to minimize the quadratic approximation of f () at x. To do this, we choose a Newton step ∆xnt : ∆xnt = −∇2 f (x) −1 ∇f (x) Several properties and interpretations are discussed. • The Newton step minimizes the second-order Taylor approximation of f . So, when f () roughly follows a quadratic form, Newton’s method is tremendously efficient. 42 • It’s the steepest descent direction for the quadratic norm defined by the Hessian. Recall that the Hessian at the optimal point is a great choice for a norm for SD - so when we have a near-optimal point, this choice minimizes the condition number greatly. • Solution of linearized optimality condition: we want to find v such that ∇f (x + v) = 0. And approximately: ∇f (x + v) ≈ ∇f (x) + ∇2 f (x) v = 0 and the Newton update is a solution for this. • Newton step is affinely invariant; so multiplying only a single coordinate by a constant factor will not change convergence. This is a big advantage over the usual gradient descent. Therefore, Newton’s method is much more resistant to high condition number sublevel sts. In practice, extremely high condition number can still hinder us because of finite precision arithmetic, yet it is still a big improvement. 10.5.1 The Newton Decrement The Newton decrement is a scalar value which is closely related; it is used as a stopping criterion as well: 1/2 T −1 λ (x) = ∇f (x) ∇2 f (x) ∇f (x) This is related to our estimate of our error f (x) − p∗ by the following relationship: f (x) − p∗ ≈ f (x) − inf fˆ (y) = y 1 2 λ 2 We stop when this value ”λ2 /2) is less than ǫ. 10.5.2 Newton’s Method Newton’s method closesly follows the gradient descent algorithm, except it uses the Newton decrement for stopping criterion, which is checked before making the update. 10.5.3 Convergence Analysis The story told by the convergence analysis is interesting. There exists a threshold for k∇f (x)k2 , when broken, makes the algorithm converge quadratically. This condition ”broken threshold for gradient), once attained, will hold in all further iterations. Therefore, the algorithm works in two separate stages. • In the dampen Newton phase, line search can give us an update size t < 1, and f will decrease by at least γ, another constant. • Pure Newton phase will follow, where we will only use full updates ”t = 1) and we get quadratic convergence. 43 10.5.4 Summary • Very fast convergence: especially, quadratic convergence when we reach the optimal point. 
• Affine invariant: much more resistant to high condition numbers. • Performance does not depend much on the correct choice of parameters, unlike SD. 10.6 Self-Concordant Functions This section covers an alternative assumption on f which allows us a better ”or more elegant) analysis on the performance of Newton method. This seems like more of an aesthetic, theoretic result, so unless some insights come up in the video lectures, I am going to skip it. 44 Murphy Ch13: Sparse Linear Models September 11, 2014 1.3 Useful for cases where we have more features than examples. Real world applications, kernel basis functions, representing signals in terms of wavelet basis functions. 1 1. There are lots of heuristics. Forward selection, backward selection. Orthogonal least squares (where we try to fit residuals from previous models). Orthogonal matching pursuits (variant of orthogonal least squares, but we do not refit linear model after we choose a new feature). And on, and on, and on. Bayesian variable selection We want to determine a binary vector γ which means “relevance” of each variable. A Bayesian approach dictates: p (γ|D) = P 2. Stochastic search: use MCMC. Metropolis Hastings algorithm? However it is unnecessarily inefficient, and there are better approximations. p (D|γ) ′ γ ′ p (D|γ ) 3. EM: EM cannot be applied to spike-and-slab model because we cannot back out γ from w, or something like that. We can make another approximation to make EM work (gah). Bernoulli-Gaussian is intractable in EM, but there’s another approximation. If dimension is small enough, we can either 1. take the combination with largest posterior 2. take features with individual posterior probability bigger than 0.5 1.1 Algorithms Spike and slab model 2 The above approach is intractable when dimension is large, of course. Let us approximate it by a model. 2.1 p (γ|D) ∝ p (γ) p (D|γ) ℓ1 regularization: basics Sparsity argument (1) A Laplacian prior results in ℓ1 regularization (yeah I already know this). Then why is this sparse? The peThe spike and slab model represents the prior part nalized optimization is the lagrangian dual of the conp (γ) by treating each variable individually, and giving strained optimization problem. (And this should have each variable a fixed probability to be relevant: π0 . This zero duality gap, I guess) Then we get this famous figure is a dial we use to control sparcity. The likelihood part is modeled as a linear model of variables with normal error. The weight of the linear model comes from γ, and also from a normal prior we also get to set. (when rj = 0, wj is set to be fixed at given parameter.) The prior part peaks sharply at sparse models, and the normal part is kind of spreads further... so this is called the spike and slab model. Then for a given bound on kwk1 , we can draw the feasible region as a diamond around the origin. Also, 1.2 Bernoulli-Gaussian for each value of the objective, we can draw a contour This is essentially similar to spike and slab. Instead, w of weights that results in that objective. Note the two is P not determined by γ - the input to the likelihood is regions are in a tradeoff relationship. If we increase the bound, the diamond will get bigger, and the contour will j γj wj xij . Also it is shown that this can be massaged into ℓ0 be smaller (we will go closer to the least squares soluregularization, where the penalization is on the number tion). Note that at the optimal point, the contour and the diamond will touch each other (if it strictly overlaps, of features. 
Meh. 1 we can always make the diamond smaller but retain the 2.5 Bayesian Inference same objective value). And geometrically it is more likely to hit the corner of the diamond because it is spiky in There are some results stating that the spike-and-slab model results in better prediction accuracy than lasso. higher dimensions. 2.2 Soft Thresholding 3 Another notable property is the soft thresholding effect. We use subgradients to prove optimality of the ℓ1 loss. This involves a quantity called cj , which is proportional to the correlation between the jth feature and the residual due to the other features. Then, the final weight wj is given as: (cj + λ) /aj if cj < −λ w ˆj = 0 if cj ∈ [−λ, λ] (cj − λ) /aj if cj > λ 3.1 3.2 LARS and homotopy methods Homotopy methods exploit the fact that it is sometimes faster to calculate the weights with given λk when λk ≈ λk−1 . LARS is a special case of such methods - starts from a large enough λ so that only one variable will be chosen. From there, we can add one variable at a time fast, using an analytical result. LARS cannot be used for ℓ1 -regularized GLMs. Regularization paths 3.3 For different values of λ, we can plot how the weight of jth feature changes - this is called the regularization path. The figure in the book shows that the weights for ridge regression deviates gradually from 0; but for lasso they stay at 0 until it becomes nonzero suddenly. It is noted that the regularization path is piecewise linear, and LARS exploits this to solve this problem faster(roughly, the same time required for a single least squares). 2.4 Coordinate descent When all coordinates are fixed except one, the optimal solution might be calculated analytically. This gives rise to the coordinate descent algorithm. So, the weights will be zero if the absolute value of the correlation power is smaller than λ. Also notable is that the weight in other cases are also “shrunk”, which makes lasso a biased estimator. So this is the problem we deal with Hierarchical Adaptive Lasso. 2.3 ℓ1 regularization: algorithms Proximal and gradient projection methods FISTA belongs to this category! When the objective f can be decomposed of two convex functions L and R, L differentiable and R _not_ differentiable, ISTA optimizes f in an iterative fashion. It approximates L with a quadratic function Lq (θ) around last guess. So the update looks like θk+1 = argminz R (z) + Lq (z) Model selection Random facts and strategies. For some popular forms of R we can solve this easily. FISTA approximates L around a point other than the most recent guess. Intuitively, it was said (during the study group session) that this is effectively the “heavyball” method. • Naive strategy: use lasso to pick features, and do a least squares on the chosen variables. (well, in practice, lasso is also used to fight collinearity as well - so I think this is a poor idea) • Theoretical considerations: using cross-validation for model selection might not be optimal. Cross validation optimizes model prediction accuracy which might be different from finding the correct model. 3.4 EM for lasso Gah. Approximate Laplace prior with a Gaussian scale mixture and we have a setup for EM. Why? The variance-gamma distribution, which is a normal distribution with its variance following a gamma distribution, is also called the generalized Laplace distribution. It actually includes the usual Laplace as a special case. I initially thought, WTF? But turns out the formulation is useful for extending ℓ1 . 
• Problem: ℓ1 regularization is unstable which means small perturbations in data can result in drastically different results. (Therefore Bayesian is more stable) Bagging can help - do a number of bootstrapped lasso and find the probability of each variable being included. This is called bolasso. 2 4 Extensions of ℓ1 regularization come by giving separate penalty parameter for each variable. Using cross-validation to tune those D parameters might be infeasible in traditional sense, but using a Bayesian modeling we can just use EM to fit them. The model looks like: γj → τj2 → wj . If we integrate τ s out, we get a Laplace distribution (wow). γ follows inverse Gamma, which is the conjugate for Laplace (wow), so EM works kind of smoothly.. Impressive. So this is like a generalized version of lasso, I guess. You are searching through different “Laplace-like” priors which include Laplace. The contour of the penalty looks more like a starfish rather than a diamond. • Group lasso solves the problem of one variable being related to multiple variables. For instance, when we have a categorical variable. The approach is to group coefficients together, and penalize each group’s square root of sum of squared coefficients. The text claims that the square root gives you sparsity. Intuitively, if each group consists of just one variable, this is the same as ℓ1 so that kind of make sense. However, still not sure why they do not just use ℓ1 . This can be fit using proximal methods or EM. • Fused lasso is useful for denoising stuff. Usually, the Automatic relevance determinadesign matrix is set to I and we have one weight per 6 one observation. The penalty is given by the ℓ1 norm tion & sparse Bayesian learning of the coefficients and ℓ1 norm of differences between adjacent coefficients. You can also use EM to fit this, All approaches so far had the form: τj2 → wj → y ← X again using the GSM formulation of Laplace. - and we integrated τ out. Instead, we can integrate out w and maximize the likelihood with regard to τ 2 . This • Elastic net is a mixture of lasso and ridge. There are gives rise to a method called ARD or SBL. results that state lasso works poorly with correlated variables - lasso will only pick one of them. Elastic net supposedly solves all these problems. 6.1 ARD for linear regression Conventionally, we denote weight precision (stdev) by αj = 1/τ 2 , and measurement precision by βj = 1/σ2 . −1 and y ∼ N y|wT x, 1/β . Then, w ∼ N w|0, A (A = diag(α)) Now we can integrate out w, and put a Gamma prior on both α and β. Now the likelihood is – Also mentionable is the grouping effect, where a function of parameters to the distribution of α and β, highly correlated variables split up the weights which is maximized. that lasso would have given to only one. Why is it sparse? Not really sure, the sparsity argu– However, the vanilla elastic net is known to ment here is definitely one of the crappier arguments in give you rather unimpressive results. It is this book. :-( said it’s because the shrinkage is performed ARD is different from regular MAP in the sense that twice - and you have to undo the ℓ2 shrink- we integrate out the prior parameter (in this case, α) age. Hmmm........ where can I find research but here we integrate out w. However, there are some regarding this? Maybe Hastie? results that they are connected - ARD is in fact MAP of a certain problem. 
– You can fit elastic net using a standard lasso algorithm by augmenting the design matrix and response vector (the same trick we used for nnlasso). 5 Non-convex regularizers 6.2 Algorithms for ARD Two problems with Laplace prior is claimed: the peak and the tails are not heavy enough. (?) So the noise ARD objective is not convex, so the results will depend are insufficiently shrunk, and the relevant coefficients are on the initial value. shrunk too much. The text says there are more flexible • Of course, EM can be used. priors that have larger spike at 0 and heavier tails. Practically, it is said that they often outperform ℓ1 regular• Or a fixed-point algorithm can be used as well. We ization. (!) know that at the optimal point the gradients disapdℓ dℓ = 0), so we enforce it iteratively = dβ pear ( dα j 5.1 Hierarchical adaptive lasso until everything converges. So the tradeoff: higher λs suppress noise better, but • Iteratively reweighted ℓ1 algorithm shrinks relevant parameters too much. This can be over- 3 6.2.1 For Logistic Regression When we do binary logistic regression, we cannot use EM because Gaussian prior is not conjugate to the logistic likelihood. We can do some approximations (such as Laplace approximation) to use EM though. 7 Sparse Coding Sparse coding is similar to sparse PCA/ICA in idea. However, sparse coding generates lots of factors, usually more than the number of dimensions; however, the coefficients are promoted to be sparse. So out of a lot of factors, we choose a small number of them to represent each data. 7.1 Learning a sparse factor loading matrix We can maximize the following likelihood: log p (D|W) = N X i=1 log ˆ zi N xi |Wzi , σ 2 I p (zi ) dzi Two problems - this is too hard to optimize, and sparsity is not being promoted. We approximate it by summing maximum likelihood for each zi instead. Also we use a Laplacian prior for z. After the approximation and massaging the formula, this is optimized in the following fashion - pick a guess for Z, find optimal W. Now estimate Z using W, and repeat. These two steps can be done by least squares and lasso, respectively. There are different variants of this - trying to approximate X by WZ (non-negative sparse coding), adding sparsity constraint on both Z and W (sparse matrix factorization). 7.2 Compressed sensing The rough idea: we do not get to observe x, but its low dimensional projection y = Rx + ǫ. How do we optimize it? We model x = Wz with a sparsity promoting prior on z. Picking the right basis is important - wavelet, domain specific. 4 Gaussian Processes January 19, 2015 1 Introduction In supervised learning, we try to learn an unknown function f for some inputs xi and outputs yi , such that f (xi ) = yi , possibly corrupted by noise. Then, we work with the distribution of function over given data: p (f |X, y). Usually, we assume a specific form of f , so we learn a parameter set θ and work with the distribution p (θ|X, y). Nonparametric regression techniques, such as LOWESS (locally weighted scatterplot smoother), overcome this problem by using the given data as our knowledge representation directly. Gaussian Process regression is a Bayesian equivalent of such technique. How is GP different from LOWESS? We make an important assumption: any vector [f (xi )] for any xi is jointly Gaussian. Also, the covariance between two function values f (x1 ) and f (x2 ) relies on the user-provided kernel κ (x1 , x2 ). 
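A minimal numpy sketch (mine, not from the notes) of what the "jointly Gaussian" assumption buys us: drawing sample functions from a zero-mean GP prior whose covariance comes from a squared-exponential kernel. The parameters ell and sigma_f mirror the kernel parameters discussed below; the grid, seed, and jitter are arbitrary choices.

```python
import numpy as np

def sq_exp_kernel(x1, x2, ell=1.0, sigma_f=1.0):
    """Squared-exponential kernel kappa(x1, x2) evaluated on two 1-D grids."""
    d = x1[:, None] - x2[None, :]
    return sigma_f ** 2 * np.exp(-0.5 * (d / ell) ** 2)

# The GP assumption: for any finite set of inputs, the vector of function values
# is jointly Gaussian with covariance given by the kernel. So "drawing a function"
# is just drawing from a multivariate normal.
xs = np.linspace(0, 5, 200)
K = sq_exp_kernel(xs, xs)
rng = np.random.default_rng(0)
samples = rng.multivariate_normal(np.zeros(len(xs)),
                                  K + 1e-10 * np.eye(len(xs)),  # jitter for stability
                                  size=3)
# each row of `samples` is one function drawn from the GP prior
```

Shrinking ell makes the draws wigglier and increasing sigma_f stretches them vertically, which previews the kernel-parameter discussion in the next section.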
Gaussian Processes
January 19, 2015

1 Introduction

In supervised learning, we try to learn an unknown function f for some inputs xi and outputs yi, such that f(xi) = yi, possibly corrupted by noise. Then we work with the distribution of functions given the data, p(f | X, y). Usually we assume a specific form for f, so we learn a parameter set θ and work with the distribution p(θ | X, y). Nonparametric regression techniques, such as LOWESS (locally weighted scatterplot smoother), overcome this restriction by using the given data directly as the knowledge representation. Gaussian Process regression is a Bayesian equivalent of such techniques.

How is a GP different from LOWESS? We make an important assumption: the vector [f(xi)] for any set of xi is jointly Gaussian. Also, the covariance between two function values f(x1) and f(x2) is given by a user-provided kernel κ(x1, x2).

2 GP Regression

2.1 Noise-free Observations

We usually assume the mean function E[f(x)] = 0, because GPs are usually flexible enough that we don't need an explicit intercept. (However, μ is not omitted from the discussion below, because some modeling techniques employ a parametric representation of μ.)

Say we have training examples X and y. Since observations are noiseless, we assume yi = f(xi) and write y = f. Now we can take test inputs X∗ and predict the distribution of the corresponding outputs f∗. How? By our assumption on f, the vector of fs is jointly Gaussian:

[f; f∗] ∼ N( [μ; μ∗], [[K, K∗], [K∗ᵀ, K∗∗]] )

where K = κ(X, X), K∗ = κ(X, X∗) and K∗∗ = κ(X∗, X∗) are covariance matrices calculated from the user-specified κ. Then, by the standard rules for conditioning Gaussians (of course I don't know how to do this off the top of my head, but see section 4.3), we get a distribution for f∗ conditioned on the observed variables:

p(f∗ | X∗, X, f) = N(f∗ | μ∗, Σ∗)

where

μ∗ = μ(X∗) + K∗ᵀ K⁻¹ (f − μ(X))
Σ∗ = K∗∗ − K∗ᵀ K⁻¹ K∗

We now have a posterior distribution over the output values at X∗.

2.2 Noisy Observations

With observation noise we now have

cov[yp, yq] = κ(xp, xq) + σy² δpq

where δpq is the Kronecker delta; in other words, the covariance of the training outputs becomes K + σy²I and the same conditioning goes through.

2.3 Kernel Parameters

Different kernel parameters result in different functions. Say we use a squared-exponential kernel:

κ(xp, xq) = σf² exp( −(xp − xq)² / (2ℓ²) ) + σy² δpq

• Increasing ℓ discounts distances more, which increases the covariance between any two points; the function becomes smoother (less wiggly) as a result.
• Increasing σf increases the vertical scale of the function.
• Increasing σy increases the assumed observation noise, and thus the uncertainty around the observed points.

This basically means we encode our beliefs about the function's behavior in the kernel.

2.4 Estimating Kernel Parameters

Kernel parameters are chosen by maximizing the marginal likelihood p(y | X).
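A small sketch of noisy GP regression with the squared-exponential kernel and a zero mean function, directly following the conditioning formulas above; the helper names and parameter values are mine, and a real implementation would use a Cholesky factorization instead of an explicit inverse.

    import numpy as np

    def sq_exp_kernel(A, B, ell=1.0, sigma_f=1.0):
        # Squared-exponential kernel matrix between the rows of A and the rows of B.
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return sigma_f**2 * np.exp(-0.5 * d2 / ell**2)

    def gp_posterior(X, y, X_star, ell=1.0, sigma_f=1.0, sigma_y=0.1):
        # Posterior mean and covariance of f* given noisy observations y (zero prior mean).
        K = sq_exp_kernel(X, X, ell, sigma_f) + sigma_y**2 * np.eye(len(X))
        K_s = sq_exp_kernel(X, X_star, ell, sigma_f)
        K_ss = sq_exp_kernel(X_star, X_star, ell, sigma_f)
        K_inv = np.linalg.inv(K)
        mu_star = K_s.T @ K_inv @ y
        Sigma_star = K_ss - K_s.T @ K_inv @ K_s
        return mu_star, Sigma_star

    X = np.linspace(0, 5, 10).reshape(-1, 1)
    y = np.sin(X).ravel()
    mu, cov = gp_posterior(X, y, np.array([[2.5]]))
    print(mu, np.sqrt(np.diag(cov)))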
Murphy Ch16: Adaptive Basis Functions
January 21, 2015

Instead of manually concocting kernels, we try to learn features from the input data. Adaptive basis function (ABF) methods use weak learners as a way to come up with features from the data.

1 Trees

1.1 CART

CART (Classification and Regression Trees) is a classical decision tree model. Basically, at each node of the tree it splits the input space into two chunks by an axis-parallel hyperplane. When we reach a leaf of the tree, the final prediction/classification is made by a constant, or by a simple learner such as a linear regression.

1.1.1 Growing and Shrinking Trees

Given a loss function, we can always try to find the pair of split variable and threshold which minimizes the average error, assuming the next nodes will be leaves. We can try all variables; for each variable we sort the observed values and try each threshold between two adjacent values. We stop growing the tree when some stopping criterion is met; this criterion is a heuristic. For example, you can limit the depth, or require a minimum reduction in cost, etc. Growing a tree greedily is sometimes too myopic; we might not make any splits at all. An alternative is to grow a full tree and prune it.

1.1.2 Pros and Cons

CART models are easy to interpret, but they do not predict very accurately and have high variance.

1.2 Random Forests

In a random forest, we fit M different trees, each trained on a different subset of the data (bootstrapped samples, and choosing a random subset of the features). So this is bagged decision trees. For the performance of RF compared with other tools, see Caruana (2006).

Why does RF perform well? Recall that the error can be decomposed into bias and variance. Decision trees have very high variance and low bias. Taking the average of multiple decision trees keeps the bias (if the bias is the same across all trees) and lowers the variance - by a factor of M if the trees were independent, and by less in practice since they are correlated. (See mathematicalmonk for a proof.) In practice, not only do we take subsets of data points and attributes, we also add random types of splitting operators to increase the randomness.

2 Generalized Additive Models

A generalized additive model employs multiple (possibly nonparametric) models, one for each feature. Each model fj tries to predict yi from feature xij, so the whole model looks like:

f(x) = α + f1(x1) + · · · + fD(xD)

fj() can be any scatterplot smoother; regularized splines are common. (Hmm, why not use something nonparametric like LOESS?)

2.1 Backfitting

GAMs can be fit by an iterative scheme: at each stage, each fj is fit to the residual from all models except fj. Also, note α is not identifiable in the above form; we need to require Σᵢ₌₁ᴺ fj(xij) = 0 for each j to ensure identifiability.

2.2 MARS

Multivariate Adaptive Regression Splines introduce interaction terms to GAMs. Interaction terms are generated by multiplying basis functions of the form (xj − t)₊ and (t − xj)₊, where t is an observed value of xj.

3 Boosting

3.1 Forward Stagewise Additive Boosting

Boosting tries to optimize the sum of losses given a loss function L():

f∗ = argmin_f Σᵢ L(yᵢ, f(xᵢ))

We get different algorithms for different choices of L(). Typical choices include the squared loss (in regression contexts) and the log loss or exponential loss (in classification contexts). At each stage, we try to fit the residual between y and the current prediction by minimizing

(βm, γm) = argmin_{β,γ} Σᵢ L(yᵢ, f_{m−1}(xᵢ) + β φ(xᵢ; γ))

where γ is the parameter set for the weak learner and β is its weight. Then fm is taken to be

fm(x) = f_{m−1}(x) + βm φ(x; γm)

In practice, shrinking β by multiplying it with some 0 < ν ≤ 1 works better than the plain update.

3.2 Different Flavors of Boosting

Different loss functions result in different boosting algorithms. Examples:

• Squared loss results in L2Boosting.
• Exponential loss results in AdaBoost.
• Log loss results in LogitBoost (since the exponential loss explodes for outliers, this is more robust).

These approaches typically require that you do a weighted fit at each stage.

3.3 Functional Gradient Boosting

So far, we tried to predict the residual between the target and our prediction at each stage of the boosting process. Instead, we can try to approximate the gradient of the loss function. The target of the regression looks like:

g_im = ∂L(yᵢ, f_{m−1}(xᵢ)) / ∂f_{m−1}(xᵢ)

And we do a line search to find the ρm which minimizes L(f_{m−1} − ρ gm). This does not require the regression procedure to work on weighted sets, which is an advantage. Also, it can easily be extended to different types of losses, such as the Huber loss.

3.4 Sparse Boosting

Sparse boosting uses simple linear regression as the adaptive basis function. This is known as forward stagewise linear regression (how is this different from stepwise regression?).

3.5 MART

When shallow CART models are used as the weak learner for gradient boosting, it is called MART. This has a slight modification to the gradient boosting algorithm though: we re-estimate the parameters at the leaves of the tree to minimize the loss, instead of minimizing the gradient.
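A minimal sketch of L2Boosting with shallow regression trees as the weak learner (roughly the MART flavor for squared loss, without the leaf re-estimation step); the function names, shrinkage value, and use of scikit-learn trees are my own choices.

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    def l2_boost(X, y, n_stages=200, nu=0.1, max_depth=2):
        f0 = y.mean()
        f = np.full(len(y), f0)
        trees = []
        for _ in range(n_stages):
            residual = y - f                              # negative gradient of the squared loss
            tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, residual)
            f += nu * tree.predict(X)                     # shrunken update f_m = f_{m-1} + nu * phi
            trees.append(tree)
        return lambda X_new: f0 + nu * sum(t.predict(X_new) for t in trees)

    rng = np.random.default_rng(0)
    X = rng.uniform(0, 6, size=(300, 1))
    y = np.sin(X[:, 0]) + 0.3 * rng.normal(size=300)
    predict = l2_boost(X, y)
    print(predict(np.array([[1.5]])))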
3.6 Interpretations

Boosting can be interpreted as a form of ℓ1 regularization. There is also a Bayesian interpretation (well, sort of broken IMO) of boosting.

4 Notes

• Subsampling: empirically, taking the max is preferred over the average for some reason.
• Feeding artificially distorted inputs (hmm).
• Convolutional neural nets use human (domain specific) expertise in feature generation.

Murphy Ch14: Kernel Methods
January 21, 2015

Kernels are measures of similarity defined for two elements of an abstract space. Of course, this is hugely useful.

1 Kernel Functions

A kernel function κ(x, x′) ≥ 0 is a real-valued function for x, x′ ∈ χ. Typically this function is symmetric. This section goes through different types of kernels and their properties. This page also seems to provide valuable insights: http://mlg.eng.cam.ac.uk/duvenaud/cookbook/index.html

1.1 RBF Kernels

The radial basis function kernel, also called the squared exponential kernel or the Gaussian kernel, looks like:

κ(x, x′) = exp( −½ (x − x′)ᵀ Σ⁻¹ (x − x′) )

Σ is commonly diagonal, so the function ends up looking like:

κ(x, x′) = exp( −½ Σᵢ (xᵢ − x′ᵢ)² / σᵢ² )

where σᵢ is called the characteristic length scale of dimension i. When all dimensions share the same characteristic length (you probably want to standardize the data beforehand), we get:

κ(x, x′) = exp( −‖x − x′‖² / (2σ²) )

which depends only on the distance between the two vectors (the radial basis), so this is called the RBF kernel.

1.2 Cosine Similarity

The cosine similarity between TF/IDF vectors, used for document similarity, can also be thought of as a kernel.

1.3 Mercer Kernels

Mercer kernels are a class of kernels with a very desirable property, and many kernels we use are Mercer kernels. A Mercer kernel satisfies the following: for any set of inputs xᵢ, the Gram matrix K with Kᵢⱼ = κ(xᵢ, xⱼ) is positive definite. For this reason a Mercer kernel is also called a positive definite kernel.

The Mercer kernel's importance comes from the fact that it can be expressed as an inner product. We can decompose the Gram matrix K as

K = UᵀΛU

and set φ(xᵢ) = Λ^{1/2} U:,i, which gives

Kᵢⱼ = φ(xᵢ)ᵀ φ(xⱼ)

So you can find the feature vector implied by the kernel function. This fact is very important for applying the kernel trick, described below.

1.4 Linear Kernels

κ(x, x′) = xᵀx′. Wat.

1.5 Matern Kernel

The Matern kernel is commonly used in GP regression... but I don't know what is going on here.

1.6 Domain Specific Kernels

As an example of a highly domain-specific kernel, we can define a kernel measuring the similarity between two strings:

κ(x, x′) = Σ_s w_s φ_s(x) φ_s(x′)

where s ranges over all possible substrings and φ_s(x) is the number of times s occurs in x. Another example is the pyramid matching kernel; when spatial matching between two sets of points is desired, we can group the points into histograms at multiple resolutions and compare the histograms.

1.7 Kernels From Probabilistic Generative Models

1.7.1 Probability Product Kernels

Say we have a generative model p(x|θ). Given two examples xᵢ and xⱼ, we can estimate the parameter from one and see how probable the other is under it. This gives rise to the probability product kernel.

1.7.2 Fisher Kernel

I don't fully understand the details, but the Fisher kernel relies on this idea: if two observations are similar, they will want to update the MLE in a similar direction, so their gradients should look similar.

2 Using Kernels Inside GLMs

We can easily use kernels in any GLM, by using a feature vector comprised of kernel evaluations against some centroids:

φ(x) = [κ(x, μ1), κ(x, μ2), κ(x, μ3), · · ·]

where the μᵢ ∈ χ are a set of K centroids.
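A small sketch of kernelized features inside a GLM: build φ(x) from RBF evaluations against a few centroids and feed it to a logistic regression; the toy data, the choice of centroids, and all parameter values are made up for illustration.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def rbf(x, mu, sigma=1.0):
        return np.exp(-np.sum((x - mu) ** 2) / (2.0 * sigma**2))

    def kernel_features(X, centroids, sigma=1.0):
        # phi(x) = [kappa(x, mu_1), ..., kappa(x, mu_K)] for every row x of X.
        return np.array([[rbf(x, mu, sigma) for mu in centroids] for x in X])

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 2))
    y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1).astype(int)   # a non-linear concept

    centroids = X[:10]                                   # a handful of training points as centroids
    Phi = kernel_features(X, centroids)
    clf = LogisticRegression().fit(Phi, y)
    print(clf.score(Phi, y))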
Of course, picking the right set of centroids is very tricky. If we have a low dimensional input space we can make a grid of centroids, but that's plain impossible in a high dimensional space. One way of solving this problem is to make every observation a centroid of its own, and use a sparsity-promoting prior to make sure we don't get a hairy overfitted monster. There are a few approaches:

• L1VM, the ℓ1-regularized vector machine, uses ℓ1 regularization.
• RVM, the relevance vector machine, uses techniques similar to the ARD/SBL algorithms.
• SVM modifies the likelihood term (into a hinge loss, I guess) to promote sparsity.

3 The Kernel Trick

Instead of replacing every feature vector with its kernel-based representation φ(x) = [κ(x, μ1), κ(x, μ2), κ(x, μ3), · · ·], we can rewrite algorithms so that every inner product is replaced with a call to κ. This is called the kernel trick. For example, we can replace the following calculation:

‖xᵢ − xⱼ‖² = xᵢᵀxᵢ + xⱼᵀxⱼ − 2xᵢᵀxⱼ = κ(xᵢ, xᵢ) + κ(xⱼ, xⱼ) − 2κ(xᵢ, xⱼ)

The latter equivalence requires that the kernel is Mercer. (Note that the x in the first two expressions is actually φ(x) - I don't know why the book uses x.)

3.1 Kernelized Ridge Regression

Unlike k-means or nearest neighbors, ridge regression is not based on the notion of similarity. So how do we kernelize it? The closed form solution for ordinary ridge regression looks like this:

w = (XᵀX + λI_D)⁻¹ Xᵀ y

Using the matrix inversion lemma, this can be rewritten as:

w = Xᵀ (XXᵀ + λI_N)⁻¹ y

See that XXᵀ? We're going to replace it with the Gram matrix K!

3.2 Kernel PCA

The detailed treatment is kind of complex; here's a brief description of the idea. We again want to replace XXᵀ with the Gram matrix. We usually calculate eigenvectors of XᵀX, but to kernelize we need to get them from XXᵀ (so we can put K in its place). Fortunately, getting the eigenvectors of XᵀX from those of XXᵀ requires only some simple algebraic manipulation. After that, we have to take care of normalization of the eigenvectors and centering; but I leave that to future me to understand if the need arises...

4 Support Vector Machines

This section gives a brief treatment of SVMs, both for regression and classification. There seem to be two different ways to derive SVMs: the large-margin approach (which coincides with the introduction in Boyd) and the loss-function approach. The loss-function approach treats the ‖w‖² term as an ℓ2 regularization term.

4.1 Loss Function Treatment

SVM regression uses the epsilon-insensitive loss function, which is flat at 0 around 0 and grows linearly after that. It's a symmetric version of the hinge loss. Obviously this is not differentiable, but we can introduce slack variables - and we get a constrained problem with a differentiable objective. Also, one can show that the optimal weights ŵ are given by

ŵ = Σᵢ αᵢ xᵢ

where α is a nonnegative, sparse vector. So the weight vector is a weighted sum of input samples; the samples with nonzero weight are the support vectors that give the method its name. SVM classification uses the hinge loss, obviously.

4.2 Multiclass Classification

There's no elegant generalization of SVMs to multiclass problems, unlike the Bayesian approaches. (It's quite evident that Murphy doesn't think highly of SVMs ;)

4.3 Choosing Parameters

We need to choose both the kernel parameter and the regularization multiplier. This can be done in a grid-search fashion, or in a LARS-like fashion, exploiting the fact that narrow kernels need high regularization and vice versa.
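A hedged sketch of the grid-search approach to choosing the kernel parameter and the regularization multiplier for an RBF SVM; the toy data and the parameter ranges are arbitrary.

    import numpy as np
    from sklearn.svm import SVC
    from sklearn.model_selection import GridSearchCV

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 2))
    y = (X[:, 0] * X[:, 1] > 0).astype(int)

    param_grid = {
        "C": [0.1, 1, 10, 100],        # regularization multiplier (inverse strength)
        "gamma": [0.01, 0.1, 1, 10],   # RBF kernel width parameter
    }
    search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
    search.fit(X, y)
    print(search.best_params_)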
5 Comparison of Discriminative Kernel Methods

This section describes the differences between the kernel-based methods above. The conclusion seems to be that most methods have comparable performance - and L1VM should be the fastest. SVMs don't give you well-calibrated probabilities, so avoid them unless you have a structured output problem.

6 Kernels for Building Generative Models

This section deals with a different type of kernel: smoothing kernels. Topics like Nadaraya-Watson and LOESS are discussed.

6.1 Smoothing Kernel

A smoothing kernel is more or less a pdf; it integrates to 1. The Gaussian is a popular smoothing kernel; the Epanechnikov and tri-cube kernels are also mentioned.

6.2 Kernel Density Estimation

KDE is a nonparametric method to come up with a distribution given samples:

p(x|D) = (1/N) Σᵢ κ_h(x − xᵢ)

which is also called the Parzen window density estimator. h is a bandwidth parameter you get to set. If you want to use this to estimate a class-conditional density, you might want to let each point have a different bandwidth and expand it until you hit the K nearest samples. (This doesn't mean much if you are using Gaussian kernels; something with compact support would be needed.) Anyway, this is similar to K-nearest-neighbor density estimation.

6.3 Kernel Regression

Kernel regression is a natural extension of nearest neighbors:

f(x) = Σᵢ wᵢ(x) yᵢ

where wᵢ is just the normalized kernel weight between xᵢ and x:

wᵢ(x) = κ_h(x − xᵢ) / Σⱼ κ_h(x − xⱼ)

So the regression takes a weighted mean of the observed ys. You don't have to use a smoothing kernel for this; replace κ_h(x − xᵢ) with κ(x, xᵢ) and you've got yourself LOESS, locally-weighted regression.
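A tiny sketch of Nadaraya-Watson kernel regression in one dimension with a Gaussian smoothing kernel, following the formula above; the names and the bandwidth value are mine.

    import numpy as np

    def gaussian_kernel(u, h):
        return np.exp(-0.5 * (u / h) ** 2)

    def nw_regression(x_query, X, y, h=0.5):
        # Prediction is a kernel-weighted average of the observed ys.
        w = gaussian_kernel(x_query - X, h)
        return np.sum(w * y) / np.sum(w)

    X = np.linspace(0, 2 * np.pi, 50)
    y = np.sin(X) + 0.1 * np.random.default_rng(0).normal(size=50)
    print(nw_regression(np.pi / 2, X, y, h=0.3))   # should be close to 1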