Introduction
The examples are from [this] textbook, and my class notes are [here].
Example 6.1 The Delivery Time Data
# load data
ex31 = read.table("ex31.txt",header = T)
head(ex31)
## Observation Delivery_Time_y Number_of_Cases_x1 Distance_x2_.ft.
## 1 1 16.68 7 560
## 2 2 11.50 3 220
## 3 3 12.03 3 340
## 4 4 14.88 4 80
## 5 5 13.75 6 150
## 6 6 18.11 7 330
# model
lm1 <- lm(ex31$Delivery_Time_y ~ ex31$Number_of_Cases_x1 + ex31$Distance_x2_.ft., data = ex31)
# hat diagonal
ex31$hii <- hatvalues(lm1)
head(ex31)
## Observation Delivery_Time_y Number_of_Cases_x1 Distance_x2_.ft. hii
## 1 1 16.68 7 560 0.10180178
## 2 2 11.50 3 220 0.07070164
## 3 3 12.03 3 340 0.09873476
## 4 4 14.88 4 80 0.08537479
## 5 5 13.75 6 150 0.07501050
## 6 6 18.11 7 330 0.04286693
# Statistics for detecting influential observations
print(influence.measures(lm1))
## Influence measures of
## lm(formula = ex31$Delivery_Time_y ~ ex31$Number_of_Cases_x1 + ex31$Distance_x2_.ft., data = ex31) :
##
## dfb.1_ dfb.e31.N dfb.e31.D dffit cov.r cook.d hat inf
## 1 -0.18727 0.41131 -0.43486 -0.5709 0.871 1.00e-01 0.1018
## 2 0.08979 -0.04776 0.01441 0.0986 1.215 3.38e-03 0.0707
## 3 -0.00352 0.00395 -0.00285 -0.0052 1.276 9.46e-06 0.0987
## 4 0.45196 0.08828 -0.27337 0.5008 0.876 7.76e-02 0.0854
## 5 -0.03167 -0.01330 0.02424 -0.0395 1.240 5.43e-04 0.0750
## 6 -0.01468 0.00179 0.00108 -0.0188 1.200 1.23e-04 0.0429
## 7 0.07807 -0.02228 -0.01102 0.0790 1.240 2.17e-03 0.0818
## 8 0.07120 0.03338 -0.05382 0.0938 1.206 3.05e-03 0.0637
## 9 -2.57574 0.92874 1.50755 4.2961 0.342 3.42e+00 0.4983 *
## 10 0.10792 -0.33816 0.34133 0.3987 1.305 5.38e-02 0.1963
## 11 -0.03427 0.09253 -0.00269 0.2180 1.172 1.62e-02 0.0861
## 12 -0.03027 -0.04867 0.05397 -0.0677 1.291 1.60e-03 0.1137
## 13 0.07237 -0.03562 0.01134 0.0813 1.207 2.29e-03 0.0611
## 14 0.04952 -0.06709 0.06182 0.0974 1.228 3.29e-03 0.0782
## 15 0.02228 -0.00479 0.00684 0.0426 1.192 6.32e-04 0.0411
## 16 -0.00269 0.06442 -0.08419 -0.0972 1.369 3.29e-03 0.1659
## 17 0.02886 0.00649 -0.01570 0.0339 1.219 4.01e-04 0.0594
## 18 0.24856 0.18973 -0.27243 0.3653 1.069 4.40e-02 0.0963
## 19 0.17256 0.02357 -0.09897 0.1862 1.215 1.19e-02 0.0964
## 20 0.16804 -0.21500 -0.09292 -0.6718 0.760 1.32e-01 0.1017
## 21 -0.16193 -0.29718 0.33641 -0.3885 1.238 5.09e-02 0.1653
## 22 0.39857 -1.02541 0.57314 -1.1950 1.398 4.51e-01 0.3916 *
## 23 -0.15985 0.03729 -0.05265 -0.3075 0.890 2.99e-02 0.0413
## 24 -0.11972 0.40462 -0.46545 -0.5711 0.948 1.02e-01 0.1206
## 25 -0.01682 0.00085 0.00559 -0.0176 1.231 1.08e-04 0.0666
The column `hii` shows the hat diagonals for the soft drink delivery time data. Since \(p=3\) and \(n=25\), any point whose hat diagonal \(h_{ii}\) exceeds \(\frac{2p}{n}=\frac{2(3)}{25}=0.24\) is flagged as a leverage point. By this criterion, observations 9 and 22 are leverage points.
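This flagging can be reproduced directly. The sketch below (assuming the `lm1` fit from above) checks that `hatvalues()` matches the diagonal of the hat matrix \(H = X(X'X)^{-1}X'\), then applies the \(2p/n\) cutoff:

```r
# sketch, assuming lm1 fit above: verify hatvalues() against H = X (X'X)^{-1} X'
X <- model.matrix(lm1)                  # 25 x 3 design matrix
H <- X %*% solve(t(X) %*% X) %*% t(X)   # hat matrix
all.equal(unname(diag(H)), unname(hatvalues(lm1)))  # TRUE

# flag leverage points using the 2p/n cutoff
p <- ncol(X)
n <- nrow(X)
which(hatvalues(lm1) > 2 * p / n)       # observations 9 and 22
```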
To illustrate the effect of these two points on the model, three additional analyses were performed: one deleting observation 9, a second deleting observation 22, and a third deleting both 9 and 22. The results are as follows:
# remove 9
data1 <- ex31[-c(9),]
lmd1 <- lm(data1$Delivery_Time_y ~ data1$Number_of_Cases_x1 + data1$Distance_x2_.ft., data = data1)
summary(lmd1)
##
## Call:
## lm(formula = data1$Delivery_Time_y ~ data1$Number_of_Cases_x1 +
## data1$Distance_x2_.ft., data = data1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.0325 -1.2331 0.0199 1.4730 4.8167
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.447238 0.952469 4.669 0.000131 ***
## data1$Number_of_Cases_x1 1.497691 0.130207 11.502 1.58e-10 ***
## data1$Distance_x2_.ft. 0.010324 0.002854 3.618 0.001614 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.43 on 21 degrees of freedom
## Multiple R-squared: 0.9487, Adjusted R-squared: 0.9438
## F-statistic: 194.2 on 2 and 21 DF, p-value: 2.859e-14
# remove 22
data2 <- ex31[-c(22),]
lmd2 <- lm(data2$Delivery_Time_y ~ data2$Number_of_Cases_x1 + data2$Distance_x2_.ft., data = data2)
summary(lmd2)
##
## Call:
## lm(formula = data2$Delivery_Time_y ~ data2$Number_of_Cases_x1 +
## data2$Distance_x2_.ft., data = data2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.7075 -0.9139 0.5079 1.4274 5.6756
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.915740 1.105105 1.734 0.09766 .
## data2$Number_of_Cases_x1 1.786324 0.201762 8.854 1.56e-08 ***
## data2$Distance_x2_.ft. 0.012369 0.003768 3.282 0.00355 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.173 on 21 degrees of freedom
## Multiple R-squared: 0.9564, Adjusted R-squared: 0.9523
## F-statistic: 230.5 on 2 and 21 DF, p-value: 5.155e-15
# remove both 9 and 22
data3 <- ex31[-c(9, 22),]
lmd3 <- lm(data3$Delivery_Time_y ~ data3$Number_of_Cases_x1 + data3$Distance_x2_.ft., data = data3)
summary(lmd3)
##
## Call:
## lm(formula = data3$Delivery_Time_y ~ data3$Number_of_Cases_x1 +
## data3$Distance_x2_.ft., data = data3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.0596 -1.2531 -0.1362 1.5153 5.1396
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.642692 1.125981 4.123 0.000527 ***
## data3$Number_of_Cases_x1 1.455607 0.180483 8.065 1.03e-07 ***
## data3$Distance_x2_.ft. 0.010549 0.002988 3.531 0.002099 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.483 on 20 degrees of freedom
## Multiple R-squared: 0.9072, Adjusted R-squared: 0.8979
## F-statistic: 97.75 on 2 and 20 DF, p-value: 4.739e-11
Deleting observation 9 produces only a minor change in \(\hat\beta_1\), but results in approximately a 28% change in \(\hat\beta_2\) and a 90% change in \(\hat\beta_0\). This illustrates the influence of observation 9 on the regression coefficient associated with \(x_2\) (distance); in effect, observation 9 may be causing curvature in the \(x_2\) direction. Deleting point 22 produces relatively smaller changes, and deleting both points produces changes similar to those observed when deleting only observation 9.
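The percentage changes above can be computed in one step. A sketch, assuming `ex31` is loaded as above; note that the formula uses bare column names, so `data =` actually controls which rows are fit (the `ex31$` style used earlier would keep pulling from the full data frame when refitting on a subset):

```r
# sketch, assuming ex31 is loaded: percent change in coefficients after deletion
fit_full <- lm(Delivery_Time_y ~ Number_of_Cases_x1 + Distance_x2_.ft., data = ex31)
pct_change <- function(drop_rows) {
  fit_del <- update(fit_full, data = ex31[-drop_rows, ])   # refit without the rows
  round(100 * (coef(fit_del) - coef(fit_full)) / coef(fit_full), 1)
}
pct_change(9)         # ~ +90% intercept, ~ -28% distance coefficient
pct_change(22)
pct_change(c(9, 22))
```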
Example 6.2 The Delivery Time Data
# Statistics for detecting influential observations
print(influence.measures(lm1))
## Influence measures of
## lm(formula = ex31$Delivery_Time_y ~ ex31$Number_of_Cases_x1 + ex31$Distance_x2_.ft., data = ex31) :
##
## dfb.1_ dfb.e31.N dfb.e31.D dffit cov.r cook.d hat inf
## 1 -0.18727 0.41131 -0.43486 -0.5709 0.871 1.00e-01 0.1018
## 2 0.08979 -0.04776 0.01441 0.0986 1.215 3.38e-03 0.0707
## 3 -0.00352 0.00395 -0.00285 -0.0052 1.276 9.46e-06 0.0987
## 4 0.45196 0.08828 -0.27337 0.5008 0.876 7.76e-02 0.0854
## 5 -0.03167 -0.01330 0.02424 -0.0395 1.240 5.43e-04 0.0750
## 6 -0.01468 0.00179 0.00108 -0.0188 1.200 1.23e-04 0.0429
## 7 0.07807 -0.02228 -0.01102 0.0790 1.240 2.17e-03 0.0818
## 8 0.07120 0.03338 -0.05382 0.0938 1.206 3.05e-03 0.0637
## 9 -2.57574 0.92874 1.50755 4.2961 0.342 3.42e+00 0.4983 *
## 10 0.10792 -0.33816 0.34133 0.3987 1.305 5.38e-02 0.1963
## 11 -0.03427 0.09253 -0.00269 0.2180 1.172 1.62e-02 0.0861
## 12 -0.03027 -0.04867 0.05397 -0.0677 1.291 1.60e-03 0.1137
## 13 0.07237 -0.03562 0.01134 0.0813 1.207 2.29e-03 0.0611
## 14 0.04952 -0.06709 0.06182 0.0974 1.228 3.29e-03 0.0782
## 15 0.02228 -0.00479 0.00684 0.0426 1.192 6.32e-04 0.0411
## 16 -0.00269 0.06442 -0.08419 -0.0972 1.369 3.29e-03 0.1659
## 17 0.02886 0.00649 -0.01570 0.0339 1.219 4.01e-04 0.0594
## 18 0.24856 0.18973 -0.27243 0.3653 1.069 4.40e-02 0.0963
## 19 0.17256 0.02357 -0.09897 0.1862 1.215 1.19e-02 0.0964
## 20 0.16804 -0.21500 -0.09292 -0.6718 0.760 1.32e-01 0.1017
## 21 -0.16193 -0.29718 0.33641 -0.3885 1.238 5.09e-02 0.1653
## 22 0.39857 -1.02541 0.57314 -1.1950 1.398 4.51e-01 0.3916 *
## 23 -0.15985 0.03729 -0.05265 -0.3075 0.890 2.99e-02 0.0413
## 24 -0.11972 0.40462 -0.46545 -0.5711 0.948 1.02e-01 0.1206
## 25 -0.01682 0.00085 0.00559 -0.0176 1.231 1.08e-04 0.0666
Looking at the cook.d column, \(D_9=3.419318\), which indicates that deleting observation 9 would move the least-squares estimate approximately to the boundary of a 96% confidence region around \(\hat\beta\). The next largest value is \(D_{22}=0.4510455\); deleting point 22 would move the estimate of \(\beta\) approximately to the edge of a 35% confidence region.
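The confidence-region interpretation comes from referring \(D_i\) to the \(F(p,\, n-p)\) distribution: the level of the region whose boundary the deleted-point estimate reaches is approximately \(F_{p,\,n-p}(D_i)\). A sketch, assuming `lm1` from above:

```r
# sketch, assuming lm1 from above: confidence level associated with each Cook's D
D <- cooks.distance(lm1)
p <- length(coef(lm1))                # p = 3
n <- length(D)                        # n = 25
round(pf(D[c(9, 22)], p, n - p), 2)   # approx. 0.96 and 0.35, as in the text

# summary() of influence.measures() prints only the starred (flagged) rows
summary(influence.measures(lm1))
```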