데이터 및 코드: https://github.com/datascienceabe/study_open/tree/master/Linear_Regression

출처: Montgomery, D. C., Peck, E. A., & Vining, G. G. (2012). Introduction to linear regression analysis (Vol. 821). John Wiley & Sons.
데이터: http://bcs.wiley.com/he-bcs/Books?action=resource&bcsId=9068&itemId=0470542810&resourceId=36322

4장, 6장
목차

Multiple Linear Regression
Lack of fit test
Diagonostics for Leverage and Influence

Model Adequacy Checking¶

<가정>
1.y와 x의 관계가 근사하게 linear
2.error의 기댓값 0
3.error의 variance constant
4.error는 상관관계 없음
5.error는 normal 분포=> Residual Analysis를 위해 추가

setwd('C:/Users/bki19/desktop/Linear_Regression/data')

df2<-read.csv('./Delivery_Time2.csv')
colnames(df2)<-c('Time','Case','Distance')
fit2<-lm(Time~.,data=df2)

1. Residual Analysis¶

1) Methods of scaling residuals

1) Residual
$e_{i}=y_{i}-\hat{y}_{i}$
데이터와 fit의 차이로 regression model로 설명 안되는 변동성을 측정할 수 있음

2) Standarized Residual
$d_{i}=\frac{e_{i}}{\sqrt{MS_{res}} }$

3) Studentized Residual(Internally Studentized Residual)
$r_{i}=\frac{e_{i}}{\sqrt{MS_{res}} [1-( \frac{1}{n}+\frac{(x_{i}-\bar{x})^{2} }{S_{xx}} )] }$
=> $x_{i}$ 가 평균 값에 가까워질 수록 커짐
=> 데이터가 많아질 수록 Standarized Residual과 차이가 없어짐

3) PRESS residual
$\frac{e_{(i)}}{\sqrt{Var[e_{(i)}]} }= \frac{e_{(i)}}{\sqrt{\sigma^2(1-h_{ii}) }}$ => (i)번 째 관측치가 없다고 생각하고 나머지로 적합시켰을 때의 residual을 계산

4) R-student (Externally Studentized residual)
$t_{i}=\frac{e_{i}}{\sqrt{S_{(i)}^{2}(1-h_{ii}) } }$
=> Internally Studentized residual과 큰 차이가 없는 경우가 많지만 i번째 관측치가 influential하면 값이 확 커짐

5) Hat element
=> x의 공간에서 i 번째 관측치의 위치를 측정 => 공간의 중심에서 멀어질 수록 variance 낮아져서 hat element의 값은 커짐

PRESS <- function(linear.model) {
    pr <- residuals(linear.model)/(1 - lm.influence(linear.model)$hat)
    return (pr)
}

Resid<-data.frame(residual=resid(fit2), stardardized=rstandard(fit2) *(1-lm.influence(fit2)$hat)^0.5,in_studentized=rstandard(fit2),hat=lm.influence(fit2)$hat,PRESS=PRESS(fit2),ex_studentized=rstudent(fit2)   )
Resid

$e_{9}=7.4197$ 로 큼
standardized로 봐도 다른거는 +-2안에 있는데 이거는 밖에 있음
internally studentized로 보면 3.2138로 확연히 큼
모형이 이 점에 대해 fit이 잘 안됨을 알 수 있음

df2[9,]

2) Residual Plot analysis

Residuals vs Fitted values
$y$ 와 $\epsilon$ 의 관계를 짐작하여 모델의 부적합성을 볼 수 있음
1) Fitted가 커질 수록 Residual이 퍼지는 경우 => $y$ 가 커질 수록 $Var[\epsilon]$ 이 커져 분산이 constant가 아닐 가능성(variance가 y의 증가 함수)
2) Fitted가 커질 수록 Reisdual이 퍼지다가 감소하는 경우 => y의 값이 1보다 작아서 variance가 binomial처럼 증가 하다 감소
=>1),2)에 대해서는 x나 y에 transformation하거나 WLS 이용
3) Fitted와 Residual이 비선형 관계
=>x나 y에 transformation

par(mfrow=c(2,2))
plot(fit2)

Normality: plot을 보면 직선에 놓여 있지 않음(꼬리가 너무 두꺼워 보임) -> 한 개 이상의 아웃라이어 있을 듯
Residual vs Fitted value: Misspecification of regressor 없어 보임, inequality of variance 없어 보임

Residuals vs Regressors

par(mfrow=c(1,2))
plot(df2$Case,rstudent(fit2) );abline(h=0, col="red")
plot(df2$Distance,rstudent(fit2) );abline(h=0, col="red")

regressor와 residual의 관계를 plotting 하는 것은 transformation의 필요성을 항상 볼 수 있는 것은 아님
Omitted variable과 residual의 관계를 보면 potentially include 할지 볼 수 있음
interation effect는 못봄
Multicollinearity가 강하면 잘 못 된 관계를 보여 줄 수 있음

Partial Regression Plot
다른 regressor가 주어졌을 때, marginal relationship을 볼 수 있음
이 관계를 보고 transformation을 판단

Y_1<-lm(Time~Distance,data=df2)
X_1<-lm(Case~Distance,data=df2)
Y_2<-lm(Time~Case,data=df2)
X_2<-lm(Distance~Case,data=df2)

par(mfrow=c(1,2))
plot( resid(X_1),resid(Y_1) )
plot( resid(X_2),resid(Y_2) )

두 변수다 linear relationship이 명백

3. PRESS STATISTICS

Regression 모델이 새로운 데이터를 얼마나 잘 예측할 수 있을지 측정
작을 수록 좋음
$PRESS= \sum[y_{i}-\hat{y}_{(i)} ]^{2}=\sum(\frac{e_{i}}{1-h_{ii}})$

233.7317
sum( PRESS(fit2)^2)

SSres에 비해 PRESS STAT이 2배가 넘음 -> 절반이 넘는 PRESS STAT이 point 9에 기여했기 때문
이 모델이 extrapolation에 취약함을 알 수 있음
PRESS STAT은 모델 간 비교에도 사용 가능

$R_{prediction}^{2}=1-\frac{PRESS}{SS_{T}}$
회귀 모형의 예측력에 대한 지표

1-(sum( PRESS(fit2)^2)/5784.5426)

회귀 모형이 새로운 관측치를 예측하는데 있어 92.09%의 변동성을 설명할 수 있음
원래 데이터에서는 LSE 적합으로 95.96% 설명 가능

2. Lack of Fit Test

Systematic lack of fit이 존재하는지 판단하는 절차
예를 들어 Quadratic term을 추가할지, 혹은 다른 regressor를 추가해야 되는지
Replicated observation 필요

$i$ 번째 regressor $x_{i}$ 에 대응하는 response가 $n_{i}$ 개 존재 (distinct response)
$y_{ij}$ : $x_{i}$ 에 대한 $j$ 번째 관측치
총 $n= \sum_{i=1}^{m}n_{i}$ 의 관측치 존재

$SS_{Res}=SS_{PE}+SS_{LOF}$
$SS_{Res}$ 는 fit이 잘 안되서 발생하는 $SS_{LOF}$ 와 자연적으로 발생하는 $SS_{PE}$ 으로 나눠짐
$\sum_{i=1}^{m}\sum_{j=1}^{n_{i}}(y_{ij}-\hat{y}_{i} )^{2}=\sum_{i=1}^{n}\sum_{j=1}^{n_{i}}(y_{ij}-\bar{y}_{i} )^{2}+ \sum_{i=1}^{m}n_{i}(\bar{y}_{i} -\hat{y}_{i} )^{2}$
=> pure error는 같은 x 값에 중복되는 y 값의 변동에 의한 값
=> Lack of fit error는 x 값에 대응 되는 y 값의 평균과 적합 된 y 값의 변동

$H_{0}: E[y_{i}]=\beta_{0}+\beta_{1}x_{i}$
$H_{a}: not H_{0}$

$F_{0}=\frac{SS_{LOF}/(m-2)}{SS_{PE}/(n-m)} \sim F_{m-2,n-m} Under H_{0}$
만약 $H_{0}$ 가 참이면 $E(MS_{LOF})=\sigma^{2}$ 이지만 참이 아니면 $E(MS_{LOF})>\sigma^{2}$ 가 되어 $F_{0}$ 값이 커짐
Reject H0면, 더 적합한 모형을 찾아야 됨
주의 할 것은 $H_{0}$ 를 기각 못해 적합한 모형을 찾아도 prediction 관점에서 좋은 모형이란 것은 아님

df_lack<-read.csv('./ex4.8.csv')

fit_lack<-lm(y~x,data=df_lack)
plot(df_lack)
lines(df_lack[,1],fitted(fit_lack),col=4)

summary(fit_lack)

Call:
lm(formula = y ~ x, data = df_lack)

Residuals:
    Min      1Q  Median      3Q     Max 
-6.4536 -1.6158  0.5638  2.6358  7.4246 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  13.2139     2.6649   4.959 0.000172 ***
x             2.1304     0.5645   3.774 0.001839 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.084 on 15 degrees of freedom
Multiple R-squared:  0.487, Adjusted R-squared:  0.4528 
F-statistic: 14.24 on 1 and 15 DF,  p-value: 0.001839

$i$ 번째 regressor $x_{i}$ 에 대응하는 response가 $n_{i}$ 개 존재 (distinct response)
$y_{ij}$ : $x_{i}$ 에 대한 $j$ 번째 관측치
총 $n= \sum_{i=1}^{m}n_{i}$ 의 관측치 존재

$SS_{Res}=SS_{PE}+SS_{LOF}$
$SS_{Res}$ 는 fit이 잘 안되서 발생하는 $SS_{LOF}$ 와 자연적으로 발생하는 $SS_{PE}$ 으로 나눠짐
$\sum_{i=1}^{m}\sum_{j=1}^{n_{i}}(y_{ij}-\hat{y}_{i} )^{2}=\sum_{i=1}^{n}\sum_{j=1}^{n_{i}}(y_{ij}-\bar{y}_{i} )^{2}+ \sum_{i=1}^{m}n_{i}(\bar{y}_{i} -\hat{y}_{i} )^{2}$
=> pure error는 같은 x 값에 중복되는 y 값의 변동에 의한 값
=> Lack of fit error는 x 값에 대응 되는 y 값의 평균과 적합 된 y 값의 변동

fit0 <- lm(y ~ 1,data=df_lack) #  y is being predicted by no other variable so the natural prediction is the mean of y
anova(fit0,fit_lack)

$SS_{LOF}=237.4788$
P-value가 매우 작아 모델이 데이터에 잘 적합 됐다는 귀무가설 기각

3. Diagonostics for Leverage and Influence¶

Leverage point: 일반적이지 않은 관측치가 regression coefficient를 추정할 때 영향은 없음
Influential point: 일반적이지 않은 관측치로 regression coefficient를 추정할 때 큰 영향을 줌

df<-read.csv('./Rocket_Prop.csv')
df<-df[,c(2,3)]
colnames(df)<-c('Shear_length','Age_Propellant')
fit<-lm(Shear_length~Age_Propellant,data=df)
resid(fit)

5,6번의 절대값이 커 Outlier일 가능성이 큼

par(mfrow=c(2,2))
plot(fit)

summary(fit)
summary(lm(Shear_length~Age_Propellant,data=df[-c(5,6),]))

Call:
lm(formula = Shear_length ~ Age_Propellant, data = df)

Residuals:
    Min      1Q  Median      3Q     Max 
-215.98  -50.68   28.74   66.61  106.76 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)    2627.822     44.184   59.48  < 2e-16 ***
Age_Propellant  -37.154      2.889  -12.86 1.64e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 96.11 on 18 degrees of freedom
Multiple R-squared:  0.9018, Adjusted R-squared:  0.8964 
F-statistic: 165.4 on 1 and 18 DF,  p-value: 1.643e-10

Call:
lm(formula = Shear_length ~ Age_Propellant, data = df[-c(5, 6), 
    ])

Residuals:
    Min      1Q  Median      3Q     Max 
-118.07  -35.67   11.31   44.75   83.98 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)    2658.973     30.533   87.08  < 2e-16 ***
Age_Propellant  -37.694      1.979  -19.05 2.02e-12 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 62.97 on 16 degrees of freedom
Multiple R-squared:  0.9578, Adjusted R-squared:  0.9551 
F-statistic: 362.9 on 1 and 16 DF,  p-value: 2.023e-12

5,6번 제거 후 MS_res 5000 이상이나 감소했지만 회귀 계수는 거의 변화 없음->influential point 아님
만약 진짜 나쁜 값이 었다면 제거 후 파라미터 추정 값의 정확성 향상 되거나, CI, PI 길이 줄어 들었어야 됨
그렇다면 influential point는 어떻게 판별할까?

1) Hat element

df2<-read.csv('./Delivery_Time2.csv')
colnames(df2)<-c('Time','Case','Distance')
fit2<-lm(Time~.,data=df2)
p=dim(df2)[2]
n<-dim(df2)[1]
Criterion=2*p/n #If h_ii> Criterion then leverage point
hii<-lm.influence(fit2)$hat

$\bar{h}=\frac{p}{n}$
$\frac{2p}{n}$ 보다 크면 데이터 공간에서 먼 곳 있으므로 leverage point로 진단
Hat element은 x 공간에서의 위치만 보기 때문에 R-student 같은 다른 척도들과 같이 봐야 됨

OUTLIERS<-rep(NA,n)
for (i in 1:n){
    if (hii[i]>Criterion){
        OUTLIERS[i]<-i
    }
}

na.omit(OUTLIERS) #hii that bigger than Criterion

hat element로 봤을 때 포인트 9,22가 leverage point
위에서 internally나 externally studentized residual은 22에 대해 별로 크지 않았어서 influential point 아닐 가능성이 큼
하지만 포인트 9에 대해서는 커서 influential 가능성 높음

#deleting leverage point
coef(lm(Time~.,data=df2)) #9,22 in
coef(lm(Time~.,data=df2[-9,])) #22 in
coef(lm(Time~.,data=df2[-22,])) #9 in
coef(lm(Time~.,data=df2[-c(9,22),])) #no

9번을 지웠을 때 $\beta_{1}$ 은 28%정도 변화하지만 $\beta_{2}$ 는 90퍼센트나 변화
이것은 $x_2$ 의 9번 관측치가 매우 크기 때문일 듯

2) Cook's D : Measure of Inlfuence

모든 관측치로 추정 된 $\hat{\beta}$ 와 $i$ 번째 관측치를 제거하고 계산한 $\hat{\beta}_{(i)}$ 의 거리의 제곱
$D_{i}=\frac{(\hat{\beta}_{(i)}-\hat{\beta})^{'}\mathbf{X^{'}X}(\hat{\beta}_{(i)}-\hat{\beta}) }{pMS_{res}}=\frac{(\hat{y}_{(i)}-\hat{y})^{'}\mathbf{X^{'}X}(\hat{y}_{(i)}-\hat{y}) }{pMS_{res}}$
$D_{i}$ 가 큰 관측치는 $\beta$ 에 상당한 영향을 주는 관측치
$F_{0.5,p,n-p}$ 는 1에 가까워 $D_{i}$ 가 1보다 크면 influential point라고 진단
가장 직관적인 해석은 i번째 관측치를 없다고 생각했을 떄 적합된 y값의 변화(Eulidean distance)

#install.packages("car", repos='http://cran.us.r-project.org')
library(car)

cooks.distance(fit2)

which.max(cooks.distance(fit2))

Cook's distance로 봤을 때 가장 높은 것은 9번째 값으로 이 값을 지우면 LSE 추정값이 coefficient의 96% CI의 경계에서 벗어날 것이라는 것을 의미
두번째로 높은 22를 지우면 값을 지우면 LSE 추정값이 coefficient의 35% CI의 경계에서 벗어날 것이라는 것을 의미
이것으로 보아 9번은 influential but 22 is not

3) Influential point는 어떻게 다뤄야 할까?

Influential point가 중요한 정보일 수 있지만 추정할 때에는 방해가 될 수 있음
robust estimation의 필요성

residual	stardardized	in_studentized	hat	PRESS	ex_studentized
-5.0280843	-1.54260631	-1.62767993	0.10180178	-5.59796734	-1.69562881
1.1463854	0.35170879	0.36484267	0.07070164	1.23360321	0.35753764
-0.0497937	-0.01527661	-0.01609165	0.09873476	-0.05524867	-0.01572177
4.9243539	1.51078203	1.57972040	0.08537479	5.38401290	1.63916491
-0.4443983	-0.13634053	-0.14176094	0.07501050	-0.48043610	-0.13856493
-0.2895743	-0.08884082	-0.09080847	0.04286693	-0.30254339	-0.08873728
0.8446235	0.25912883	0.27042496	0.08179867	0.91986749	0.26464769
1.1566049	0.35484408	0.36672118	0.06372559	1.23532680	0.35938983
7.4197062	2.27635117	3.21376278	0.49829216	14.78889824	4.31078012
2.3764129	0.72907878	0.81325432	0.19629595	2.95682585	0.80677584
2.2374930	0.68645843	0.71807970	0.08613260	2.44837821	0.70993906
-0.5930409	-0.18194377	-0.19325733	0.11365570	-0.66908638	-0.18897451
1.0270093	0.31508443	0.32517935	0.06112463	1.09387183	0.31846924
1.0675359	0.32751789	0.34113547	0.07824332	1.15815364	0.33417725
0.6712018	0.20592338	0.21029137	0.04111077	0.69997845	0.20566324
-0.6629284	-0.20338513	-0.22270023	0.16594043	-0.79482144	-0.21782566
0.4363603	0.13387449	0.13803929	0.05943202	0.46393280	0.13492400
3.4486213	1.05803019	1.11295196	0.09626046	3.81594602	1.11933065
1.7931935	0.55014821	0.57876634	0.09644857	1.98460588	0.56981420
-5.7879699	-1.77573772	-1.87354643	0.10168486	-6.44313971	-1.99667657
-2.6141789	-0.80202492	-0.87784258	0.16527689	-3.13179171	-0.87308697
-3.6865279	-1.13101946	-1.44999541	0.39157522	-6.05913500	-1.48962473
-4.6075679	-1.41359270	-1.44368977	0.04126005	-4.80585779	-1.48246718
-4.5728535	-1.40294240	-1.49605875	0.12060826	-5.20001871	-1.54221512
-0.2125839	-0.06522033	-0.06750861	0.06664345	-0.22776283	-0.06596332

Res.Df	RSS	Df	Sum of Sq	F	Pr(>F)
16	487.6126	NA	NA	NA	NA
15	250.1338	1	237.4788	14.2411	0.001839409

Recent Posts

3. 모형 적합성 평가 (R)

Model Adequacy Checking¶

1. Residual Analysis¶

2. Lack of Fit Test

3. Diagonostics for Leverage and Influence¶

댓글 쓰기

0 댓글

Recent Posts

다변량 통계 분석: 3. MANOVA (R)

추천시스템: 1. 연관성 규칙 (Association Rule Mining - Apriori) (Python)

추천 시스템: 4. 협업 필터링 (Collaborative Filtering) (Python)

Popular Posts

2. 다중 선형 회귀 분석 (R)

4. 변수 변환 (R)

페이지

Tags - Data Science

Tags - Economy

Profile

Recent in Statistics

다변량 통계 분석: 3. MANOVA (R)

추천시스템: 1. 연관성 규칙 (Association Rule Mining - Apriori) (Python)

Recent in Economy

Most Popular

2. 다중 선형 회귀 분석 (R)

4. 변수 변환 (R)

Social Widget

Recent Posts

3. 모형 적합성 평가 (R)

Model Adequacy Checking¶

1. Residual Analysis¶

2. Lack of Fit Test

3. Diagonostics for Leverage and Influence¶

관심 있을 만한 글

댓글 쓰기

0 댓글

Recent Posts

Popular Posts

페이지

Tags - Data Science

Tags - Economy

Profile

Recent in Statistics

Recent in Economy

Most Popular

Social Widget