Logistic_notes.pdf

Count Data
1. Estimating & testing proportions:
Ten customers, 2 purchase a product. We estimate the
probability p of purchase as p=0.20 for all customers.
Could p really be 0.50 in the population?
Binomial
n independent trials
Each results in event (Y=1) or nonevent (Y=0)
p = probability of event, constant across all trials.
Mean of Y is p
Variance is E{(Y-p)^2} = p(1-p)^2 + (1-p)(0-p)^2 = p(1-p)
S = sum of Y’s =observed number of events in n
trials
Pr{S=r} = [n!/(r!(n-r)!)] p^r (1-p)^(n-r)
n! = n(n-1)(n-2)…(1), 0! = 1! = 1.
p known ⇒ Pr{S} is a probability function.
p to be estimated and r known ⇒ Pr{S} is now a function
of p, known as the likelihood function L(p).
Its logarithm is ln(L(p)).
Ex (r=2, n=10): L(p) = 45 p^2 (1-p)^8, maximum at p=0.20.
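A quick numeric check (a sketch in SAS, separate from the course demos; the dataset name like is made up): evaluate L(p) on a grid and pick the maximizer.

/* Evaluate L(p) = Pr{S=2 | p, n=10} = 45 p^2 (1-p)^8 on a grid of p. */
/* PDF('BINOMIAL', 2, p, 10) is the binomial probability of 2 events  */
/* in 10 trials.                                                      */
data like;
   do p = 0.01 to 0.99 by 0.01;
      L = pdf('binomial', 2, p, 10);
      output;
   end;
run;

proc sql;   /* grid maximizer: p = 0.20 */
   select p, L from like having L = max(L);
quit;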
Maximum Likelihood:
Visually: Plot L(p) versus p
Find the value of p that makes what we saw (2 events, 8
non-events) most likely i.e.
Find p to maximize L(p) = 45 p^2 (1-p)^8 i.e.
Find p to maximize the simpler L(p) = p^2 (1-p)^8 i.e.
Find p to maximize ln(L(p)) = 2ln(p)+8ln(1-p) i.e.
Find p to make 2/p - 8/(1-p) = 0 (chain rule), which gives p = 0.20, or…
Find p to minimize -2ln(L(p)) = -4ln(p)-16ln(1-p)
Can’t maximize analytically? Use a Gauss-Newton (Newton-Raphson) search.
Gauss-Newton to solve f(x) = 0:
(1) Make a guess for x
(2) Iterate this: x ← x − f(x)/f′(x)
Example: f(x) is our derivative (of -2 ln(L)); start at x = 0.9.
[Plot: top curve = the derivative (right scale); bottom curve = -2 ln(L(x)) (left scale).]
10 Gauss-Newton steps:
Step      p        Change     N2LL (-2 ln L)   Deriv     Deriv2
  1    0.90000   -0.09692       29.6495       155.556   1604.94
  2    0.80308   -0.18211       19.2630        76.269    418.80
  3    0.62096   -0.29383        9.8146        35.771    121.74
  4    0.32714   -0.15886        3.1956        11.552     72.72
  5    0.16828    0.02758        2.4633        -4.533    164.38
  6    0.19585    0.00408        2.3958        -0.527    129.02
  7    0.19993    0.00007        2.3947        -0.008    125.06
  8    0.20000    0.00000        2.3947        -0.000    125.00
  9    0.20000    0.00000        2.3947        -0.000    125.00
 10    0.20000    0.00000        2.3947         0.000    125.00

Run Logistic_A.sas demo.
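The steps in the table can be reproduced with a short data step (a sketch of the computation; Logistic_A.sas is the actual demo):

data newton;
   p = 0.9;                                          /* starting guess       */
   do step = 1 to 10;
      n2ll   = -2*log(45) - 4*log(p) - 16*log(1-p);  /* -2 ln L(p), with the */
                                                     /* constant 45 included */
      deriv  = -4/p + 16/(1-p);                      /* d(-2 ln L)/dp        */
      deriv2 = 4/p**2 + 16/(1-p)**2;                 /* second derivative    */
      change = -deriv/deriv2;                        /* Gauss-Newton step    */
      output;
      p = p + change;
   end;
run;

proc print data=newton; run;

Note the N2LL column includes the constant -2 ln(45), which is why it bottoms out at 2.3947 rather than 0; the location of the minimum (p = 0.20) is unaffected.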
2. Contingency Tables
Observed
                Coupon   No Coupon
Purchase          86        24
No Purchase       14        76

Expected (under H0: no coupon effect)
                Coupon   No Coupon
Purchase          55        55
No Purchase       45        45
Pearson Chi-square (k = 1 df): χ^2 = Σ over all cells of (O - E)^2/E = 77.65
Compare to the Chi-square 1 df 95% point, 1.96^2 = 3.84 (significant).
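Worked out: (86-55)^2/55 + (24-55)^2/55 + (14-45)^2/45 + (76-45)^2/45 = 17.47 + 17.47 + 21.36 + 21.36 = 77.65.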
Likelihood
                Coupon         No Coupon
Purchase        p1^86          p2^24
No Purchase     (1-p1)^14      (1-p2)^76

Likelihood is C p1^86 (1-p1)^14 p2^24 (1-p2)^76
Max at p1 = 0.86, p2 = 0.24
Max ln(L) is 86 ln(.86) + … + 76 ln(.76) = -95.6043
Under H0: p1 = p2:
Max at p1 = p2 = 0.55, (1-p1) = (1-p2) = 0.45
Max ln(L) is 86 ln(.55) + … + 76 ln(.45) = -137.628
Likelihood ratio χ^2 test (change in -2 ln(L) values) =
2(137.628 - 95.6043) = 84.0468
Close to, but not the same as, the Pearson Chi-square (77.65).
See Logistic_A.sas demo, last part.
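Both statistics also come out of PROC FREQ (a sketch with made-up dataset and variable names, not the demo itself):

data coupon;
   input coupon $ purchase $ count;
   datalines;
Yes Yes 86
Yes No  14
No  Yes 24
No  No  76
;

proc freq data=coupon;
   weight count;
   tables coupon*purchase / chisq expected;  /* Pearson and likelihood-ratio chi-squares */
run;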
3. Logistic Regression:
X = food storage temperature (degrees C)
Y = 1 if spoilage after 2 months, 0 otherwise
X: -14 -8 -9 -6 2 3 8 9 10 16
Y: 0 0 0 1 0 0 1 1 1 1
Regress Y on X:
Problem: Predicted probabilities >1 or < 0.
Idea: Convert p to logit
Logit = ln(p/(1-p)) = ln(odds)
Model Logit = β0 + β1X
p = exp(Logit)/(1 + exp(Logit)) =
exp(β0 + β1X)/(1 + exp(β0 + β1X))
So… use exp(β0 + β1X)/(1 + exp(β0 + β1X)) for p in the
likelihood function (you know X), then find the betas that
maximize this function. Equivalently, minimize -2
ln(likelihood).
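One way to carry out that maximization directly is PROC NLMIXED, which finds the betas minimizing -2 ln(likelihood) numerically (a sketch; the dataset and variable names are the ones assumed by the PROC LOGISTIC call later in these notes):

data logistic;
   input temperature spoiled;
   datalines;
-14 0
-8 0
-9 0
-6 1
2 0
3 0
8 1
9 1
10 1
16 1
;

proc nlmixed data=logistic;
   parms b0=0 b1=0;                      /* starting values          */
   logit = b0 + b1*temperature;
   p = exp(logit)/(1 + exp(logit));      /* the modeled probability  */
   model spoiled ~ binary(p);            /* Bernoulli log-likelihood */
run;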
Any betas whose -2 ln(likelihood) differs from that of the
maximum likelihood betas by an amount exceeding the
Chi-square (2 df) 95% point would be rejected in a 5%
hypothesis test. Therefore, if we truncate our plot at the
right height, we cut off the rejected set of betas and are
left with an approximate 95% confidence region for the pair
of betas.
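A sketch of how such a plot could be built (the grid ranges here are guesses; Logistic_B.sas is the actual demo): compute -2 ln(L) on a grid of betas, then keep the betas within the chi-square cutoff of the minimum.

data grid;
   array x{10} _temporary_ (-14 -8 -9 -6 2 3 8 9 10 16);
   array y{10} _temporary_ (0 0 0 1 0 0 1 1 1 1);
   do b0 = -3 to 3 by 0.05;
      do b1 = 0 to 0.6 by 0.01;
         n2ll = 0;                       /* -2 ln L at (b0, b1) */
         do i = 1 to 10;
            p = exp(b0 + b1*x{i})/(1 + exp(b0 + b1*x{i}));
            n2ll = n2ll - 2*(y{i}*log(p) + (1-y{i})*log(1-p));
         end;
         output;
      end;
   end;
   keep b0 b1 n2ll;
run;
/* Betas with n2ll <= min(n2ll) + quantile('chisq', 0.95, 2), i.e.
   min + 5.99, form the approximate 95% confidence region.        */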
Run demo: Logistic_B.sas
intercept = -0.2878, slope = 0.2083
Pairs: one 0 and one 1.
Concordant: the actual 1 has a higher predicted probability
than the actual 0 (the 1 is to the right of the 0 when slope > 0).
Discordant: the actual 0 has the higher predicted probability
of being a 1.
We have 5 0’s and 5 1’s, so 5x5 = 25 pairs.
Two of those 25 (circled) are discordant and there are no
ties, so 23/25 = 92% are concordant.
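Counting the pairs directly (a sketch; because the slope is positive, the predicted probability is increasing in X, so pairs can be compared on X alone):

data _null_;
   array x1{5} _temporary_ (-6 8 9 10 16);    /* X values of the five 1's */
   array x0{5} _temporary_ (-14 -8 -9 2 3);   /* X values of the five 0's */
   nc = 0; nd = 0; nt = 0;
   do i = 1 to 5;
      do j = 1 to 5;
         if x1{i} > x0{j} then nc = nc + 1;        /* concordant */
         else if x1{i} < x0{j} then nd = nd + 1;   /* discordant */
         else nt = nt + 1;                         /* tied       */
      end;
   end;
   put nc= nd= nt=;   /* prints nc=23 nd=2 nt=0 */
run;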
proc logistic data=logistic;
   model spoiled(event="1") = temperature / itprint ctable pprob=0.5;
run;
Percent Concordant    92.0     Somers' D    0.840
Percent Discordant     8.0     Gamma        0.840
Percent Tied           0.0     Tau-a        0.467
Pairs                   25     c            0.920
Prior probability 0.5: Classify any point with higher
probability than 0.5 as 1, others as 0. You will have some
misclassifications.
Classification Table

Prob        Correct             Incorrect                       Percentages
Level    Event  NonEvent     Event  NonEvent    Correct  Sensitivity  Specificity  False POS  False NEG
0.500      4        3          2        1         70.0       80.0         60.0        33.3       25.0

Split point at the X with -0.2878 + 0.2083X = 0, i.e. X ≈ 1.38
(why? because p = 0.5 exactly where the logit is 0).
4 correct events (at X = 8, 9, 10, 16).
3 correct non-events (at X = -14, -9, -8).
2 incorrect events (at X = 2, 3).
1 incorrect non-event (at X = -6).
Sensitivity: probability of calling an actual event an event:
1+4 = 5 actual events, we predicted 4 of them, so 4/5 = 80%.
Specificity: probability of calling an actual non-event a
non-event: 3+2 = 5 actual non-events, of which we predicted 3,
so 3/5 = 60%.
(denominators = numbers of actuals)
False positives: we predicted 2+4 = 6 events but were wrong
twice, so 2/6 = 33.3%.
False negatives: we predicted 1+3 = 4 non-events but were wrong
once, so 1/4 = 25%.
(denominators = numbers of predictions)
Odds Ratio:
Old Logit = β0 + β1 X = ln(odds at X)
New Logit = β0 + β1 (X+1) = ln(odds at X+1)
New Logit - Old Logit = β1 =
ln(new odds) - ln(old odds) = ln((new odds)/(old odds)) =
ln(odds ratio), so…
odds ratio = exp(β1)
e^β = 1 + β + β^2/2! + β^3/3! + … (Taylor series)
e^β is approximately 1 + β when β is small
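Ex: here β1 = 0.2083, so each 1-degree increase in storage temperature multiplies the odds of spoilage by exp(0.2083) ≈ 1.23; the small-β approximation gives 1 + 0.2083 ≈ 1.21.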
Other Stats (source: SAS online help):
The following statistics are all rank-based correlation
statistics for assessing the predictive ability of a model:
nc = # concordant (23), nd = # discordant (2),
N = # points (10), t = # pairs with different responses (25)

c (area under the ROC curve)    (nc + ½(# ties))/t
Somers' D                       (nc - nd)/t
Goodman-Kruskal Gamma           (nc - nd)/(nc + nd)
Kendall's Tau-a                 (nc - nd)/(½N(N-1))
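Plugging in: c = (23 + 0)/25 = 0.920, Somers' D = (23 - 2)/25 = 0.840, Gamma = (23 - 2)/(23 + 2) = 0.840, Tau-a = (23 - 2)/(½·10·9) = 21/45 = 0.467, matching the PROC LOGISTIC output shown earlier.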
Appendix: Details
Exactly what is the food example likelihood function?
X: -14 -8 -9 -6 2 3 8 9 10 16
Y: 0 0 0 1 0 0 1 1 1 1
L = (1-p1)(1-p2)(1-p3)(p4)(1-p5)(1-p6)(p7)(p8)(p9)(p10)

  = [1 - exp(β0 - 14β1)/(1 + exp(β0 - 14β1))]
  × [1 - exp(β0 - 8β1)/(1 + exp(β0 - 8β1))]
  × [1 - exp(β0 - 9β1)/(1 + exp(β0 - 9β1))]
  × [exp(β0 - 6β1)/(1 + exp(β0 - 6β1))]
  × ⋯
  × [exp(β0 + 16β1)/(1 + exp(β0 + 16β1))]
This is a function, L(β0,β1), of β0 and β1
Recall: exp(X) is just another way of writing e^X.
Algebra: If Logit = L = ln(p/(1-p)) then
e^L = p/(1-p), e^L - p·e^L = p, e^L = p(1 + e^L), and so
p = e^L/(1 + e^L) = 1/(e^-L + 1) = 1/(1 + e^-L)
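Ex: at the fitted betas and X = 16, Logit = -0.2878 + 0.2083(16) = 3.045, so p = 1/(1 + e^(-3.045)) ≈ 0.95.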