2 In the simple linear regression model \(y=\beta_{0}+\beta_{1}x+u\), suppose that \(E(u)\neq 0\). Letting \(\alpha_{0}=E(u)\), show that the model can always be rewritten with the same slope, but a new intercept and error, where the new error has a zero expected value.
Solution:
\[\begin{align} y=&\beta_{0}+\beta_{1}x+u\quad \mbox{then we can add and subtract $\alpha_{0}$}\\ y=&(\alpha_{0}+\beta_{0}) +\beta_{1}x+ (u-\alpha_{0})\\ \end{align}\]
Define the new error as \(e=u-\alpha_{0}\), so that \(E(e) = 0\). The new intercept is \(\alpha_{0}+\beta_{0}\), but the slope is still \(\beta_{1}\).
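A quick numerical illustration (simulated data, used only to check the algebra): shifting the error mean into the intercept leaves the slope estimate essentially unchanged.

#simulated example (not from the text): error with nonzero mean alpha0 = 2
set.seed(123)
x <- rnorm(1000)
u <- rnorm(1000, mean = 2)   # E(u) = 2, so the zero-mean assumption fails
y <- 1 + 0.5*x + u           # true beta0 = 1, beta1 = 0.5
coef(lm(y ~ x))              # intercept close to 1 + 2 = 3, slope still close to 0.5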
3 The following table contains the ACT scores and the GPA (grade point average) for eight college students. Grade point average is based on a four-point scale and has been rounded to one digit after the decimal point.
Student | GPA | ACT |
---|---|---|
1 | 2.8 | 21 |
2 | 3.4 | 24 |
3 | 3 | 26 |
4 | 3.5 | 27 |
5 | 3.6 | 29 |
6 | 3 | 25 |
7 | 2.7 | 25 |
8 | 3.7 | 30 |
\[ \widehat{GPA}=\widehat{\beta}_{0}+\widehat{\beta}_{1}ACT \]
Comment on the direction of the relationship. Does the intercept have a useful interpretation here? Explain. How much higher is the GPA predicted to be if the ACT score is increased by five points?
Solution:
The estimated relationship between GPA and ACT is:
\[ \widehat{GPA}=0.568+0.102ACT \]
In R we can call:
table<-read.csv(file="https://raw.githubusercontent.com/rhozon/homeworkI/main/gpaxact.csv",head=TRUE,sep=";")
GPA.fitted<-lm(GPA~ACT,data=table)
Dependent variable: GPA | Coefficient (std. error) |
---|---|
Constant | 0.568 (0.928) |
ACT | 0.102** (0.036) |
Observations | 8 |
R2 | 0.577 |
Adjusted R2 | 0.507 |
Residual Std. Error | 0.269 (df = 6) |
F Statistic | 8.199** (df = 1; 6) |
Note: | *p<0.1; **p<0.05; ***p<0.01 |
The direction of the relationship is positive: when ACT increases by one point, GPA is predicted to increase by about 0.10 points.
The intercept is the predicted GPA when ACT equals zero: a hypothetical student with an ACT score of 0 would have a predicted GPA of 0.568. Since no student scores anywhere near zero on the ACT, the intercept has no useful interpretation by itself.
If the ACT score is increased by 5 points, the predicted GPA is about 0.51 points higher (0.102 × 5).
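As a quick check, the five-point effect can be computed directly from the fitted model object created above:

coef(GPA.fitted)["ACT"]*5 # roughly 0.51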
Solution:
The fitted values are obtained by plugging each student's ACT score into the estimated OLS equation. For the second student, for example, the fitted value is \(0.568 + 0.102 \times 24 \approx 3.02\) (3.021 with the unrounded coefficients).
The residuals are obtained as \(GPA-\widehat{GPA}\); as the table below shows, they sum to zero (up to rounding), \(\displaystyle\sum_{i=1}^{n}(GPA_{i}-\widehat{GPA}_{i})=0\).
Student | GPA | ACT | GPA.fitted | residuals |
---|---|---|---|---|
1 | 2.8 | 21 | 2.714 | 0.086 |
2 | 3.4 | 24 | 3.021 | 0.379 |
3 | 3.0 | 26 | 3.225 | -0.225 |
4 | 3.5 | 27 | 3.327 | 0.173 |
5 | 3.6 | 29 | 3.532 | 0.068 |
6 | 3.0 | 25 | 3.123 | -0.123 |
7 | 2.7 | 25 | 3.123 | -0.423 |
8 | 3.7 | 30 | 3.634 | 0.066 |
Sum | | | | 0.000 |
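For reference, the fitted values and residuals in this table can be reproduced directly from the estimated model:

#fitted values and residuals from the estimated model
round(data.frame(GPA = table$GPA, ACT = table$ACT,
                 GPA.fitted = fitted(GPA.fitted),
                 residuals = resid(GPA.fitted)), 3)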
In R we can call:
round(sum(residuals(GPA.fitted)), 3)
0
Solution:
Using the estimated OLS equation with \(ACT=20\): \(\widehat{GPA}=0.568+0.102\times 20 \approx 2.61\).
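The same prediction can be obtained in R with predict():

predict(GPA.fitted, newdata = data.frame(ACT = 20)) # about 2.61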
Solution:
Given \(R^{2}=0.5774\), we can say that about 57.74% of the variation in GPA is explained by ACT.
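In R, the \(R^2\) can be extracted directly from the model summary:

summary(GPA.fitted)$r.squared # about 0.577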
10 Let \(\widehat{\beta}_{0}\) and \(\widehat{\beta}_{1}\) be the OLS intercept and slope estimators, respectively, and let \(\overline{u}\) be the sample average of errors (not the residuals!).
Solution:
Write \(d_{i}=x_{i}-\overline{x}\) and \(SST_{x}=\displaystyle\sum_{i=1}^{n}(x_{i}-\overline{x})^{2}\). The OLS slope estimator can be decomposed as \(\widehat{\beta}_{1}=\beta_{1}+\frac{\displaystyle\sum_{i=1}^{n}(x_{i}-\overline{x})u_{i}}{\displaystyle\sum_{i=1}^{n}(x_{i}-\overline{x})^{2}}=\beta_{1}+\displaystyle\sum_{i=1}^{n}w_{i}u_{i}\), where \(w_{i}=d_{i}/SST_{x}\). Note for later that \(\displaystyle\sum_{i=1}^{n}w_{i}=0\), because \(\displaystyle\sum_{i=1}^{n}d_{i}=0\).
Solution:
Because \(E(\widehat{\beta}_{1})=\beta_{1}\) and \(E(\overline{u})=0\), we have \(Cov(\widehat{\beta}_{1},\overline{u})=E[(\widehat{\beta}_{1}-\beta_{1})\overline{u}]\), and we show that this expectation is zero.
Using part (i), \(E[(\widehat{\beta}_{1}-\beta_{1})\overline{u}]=E\left[\left(\displaystyle\sum^{n}_{i=1}w_{i}u_{i}\right)\overline{u}\right]=\displaystyle\sum^{n}_{i=1}w_{i}E(u_{i}\overline{u}).\)
Because the \(u_{i}\) are pairwise uncorrelated (they are independent), \(E(u_{i}\overline{u})=E(u_{i}^{2})/n=\sigma^{2}/n\) (since \(E(u_{i}u_{h})=0\) for \(i\neq h\)). Therefore, \(\displaystyle\sum_{i=1}^{n}w_{i}E(u_{i}\overline{u})=\displaystyle\sum^{n}_{i=1}w_{i}(\sigma^{2}/n)=(\sigma^{2}/n)\displaystyle\sum^{n}_{i=1}w_{i}=0\), because \(\displaystyle\sum^{n}_{i=1}w_{i}=0\).
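A small Monte Carlo sketch (with a hypothetical data-generating process chosen only for illustration) agrees with this result: across many simulated samples, the covariance between \(\widehat{\beta}_{1}\) and \(\overline{u}\) is essentially zero.

#Monte Carlo check that beta1-hat and u-bar are uncorrelated
set.seed(42)
n <- 50; reps <- 5000
x <- rnorm(n)                      # keep the regressor fixed across replications
b1hat <- numeric(reps); ubar <- numeric(reps)
for (r in 1:reps) {
  u <- rnorm(n)                    # errors with zero mean and constant variance
  y <- 1 + 2*x + u
  b1hat[r] <- coef(lm(y ~ x))[2]
  ubar[r]  <- mean(u)
}
cov(b1hat, ubar)                   # close to zero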
Solution:
Recall the OLS intercept formula \(\widehat{\beta}_{0}=\overline{y}-\widehat{\beta}_{1}\overline{x}\). Averaging the model over the sample gives \(\overline{y}=\beta_{0}+\beta_{1}\overline{x}+\overline{u}\), and substituting this into the intercept formula yields:
\[ \widehat{\beta}_{0}=(\beta_{0}+\beta_{1}\overline{x}+\overline{u})-\widehat{\beta}_{1}\overline{x}=\beta_{0}+\overline{u}-(\widehat{\beta}_{1}-\beta_{1})\overline{x} \]
Solution:
Because \(\widehat{\beta}_{1}\) and \(\overline{u}\) are uncorrelated,
\[ Var(\widehat{\beta}_{0})=Var(\overline{u})+Var(\widehat{\beta}_{1})\overline{x}^{2}=\\ \sigma^{2}/n+(\sigma^{2}/SST_{x})\overline{x}^{2}=\\ \sigma^{2}/n+\sigma^{2}\overline{x}^{2}/SST_{x} \]
Solution:
Substituting \(SST_{x}/n=n^{-1}\displaystyle\sum^{n}_{i=1}x_{i}^{2}-\overline{x}^{2}\) gives
\[ Var(\widehat{\beta}_{0})=\sigma^{2}\left[SST_{x}/n+\overline{x}^{2}\right]/SST_{x}=\\ \sigma^{2}\left[n^{-1}\displaystyle\sum^{n}_{i=1}x_{i}^{2}-\overline{x}^{2}+\overline{x}^{2}\right]/SST_{x}=\\ \sigma^{2}\left(n^{-1}\displaystyle\sum^{n}_{i=1}x^{2}_{i}\right)/SST_{x} \]
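As a numerical sanity check (using, say, the ACT values from the table above as the \(x_{i}\) and \(\sigma^{2}=1\)), the two expressions for \(Var(\widehat{\beta}_{0})\) coincide:

#check that sigma^2/n + sigma^2*xbar^2/SSTx equals sigma^2*mean(x^2)/SSTx
x <- c(21, 24, 26, 27, 29, 25, 25, 30)     # the ACT scores, used as example x values
sigma2 <- 1
SSTx <- sum((x - mean(x))^2)
sigma2/length(x) + sigma2*mean(x)^2/SSTx   # expression from part (iii)
sigma2*mean(x^2)/SSTx                      # simplified expression from part (iv)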
3 The following model is a simplified version of the multiple regression model used by Biddle and Hamermesh (1990) to study the tradeoff between time spent sleeping and working and to look at other factors affecting sleep:
\[ sleep=\beta_{0}+\beta_{1}totwrk+\beta_{2}educ+\beta_{3}age+u, \]
where \(sleep\) and \(totwrk\) (total work) are measured in minutes per week and \(educ\) and \(age\) are measured in years. (See also Computer Exercise C3 in Chapter 2.)
Solution:
library(wooldridge)
str(sleep75)
'data.frame': 706 obs. of 34 variables:
$ age : int 32 31 44 30 64 41 35 47 32 30 ...
$ black : int 0 0 0 0 0 0 0 0 0 0 ...
$ case : int 1 2 3 4 5 6 7 8 9 10 ...
$ clerical: num 0 0 0 0 0 0 0 0 0 0 ...
$ construc: num 0 0 0 0 0 0 0 0 0 0 ...
$ educ : int 12 14 17 12 14 12 12 13 17 15 ...
$ earns74 : num 0 9500 42500 42500 2500 ...
$ gdhlth : int 0 1 1 1 1 1 1 1 1 1 ...
$ inlf : int 1 1 1 1 1 1 1 1 1 1 ...
$ leis1 : int 3529 2140 4595 3211 4052 4812 4787 3544 4359 4211 ...
$ leis2 : int 3479 2140 4505 3211 4007 4797 4157 3469 4359 4061 ...
$ leis3 : int 3479 2140 4227 3211 4007 4797 4157 3439 4121 4061 ...
$ smsa : int 0 0 1 0 0 0 0 1 0 1 ...
$ lhrwage : num 1.956 0.358 3.022 2.264 1.012 ...
$ lothinc : num 10.08 0 0 0 9.33 ...
$ male : int 1 1 1 0 1 1 1 1 1 1 ...
$ marr : int 1 0 1 1 1 1 1 1 1 1 ...
$ prot : int 1 1 0 1 1 1 1 1 0 0 ...
$ rlxall : int 3163 2920 3038 3083 3493 4078 3810 3033 3606 3168 ...
$ selfe : int 0 1 1 1 0 0 0 1 0 1 ...
$ sleep : int 3113 2920 2670 3083 3448 4063 3180 2928 3368 3018 ...
$ slpnaps : int 3163 2920 2760 3083 3493 4078 3810 3003 3368 3168 ...
$ south : int 0 1 0 0 0 0 0 0 0 0 ...
$ spsepay : num 0 0 20000 5000 2400 0 12000 0 0 6000 ...
$ spwrk75 : int 0 0 1 1 1 0 1 0 0 1 ...
$ totwrk : int 3438 5020 2815 3786 2580 1205 2113 3608 2353 2851 ...
$ union : int 0 0 0 0 0 0 0 0 1 0 ...
$ worknrm : int 3438 5020 2815 3786 2580 0 2113 3608 2353 2851 ...
$ workscnd: int 0 0 0 0 0 1205 0 0 0 0 ...
$ exper : int 14 11 21 12 44 23 17 28 9 9 ...
$ yngkid : int 0 0 0 0 0 0 1 0 0 0 ...
$ yrsmarr : int 13 0 0 12 33 23 0 24 11 7 ...
$ hrwage : num 7.07 1.43 20.53 9.62 2.75 ...
$ agesq : int 1024 961 1936 900 4096 1681 1225 2209 1024 900 ...
- attr(*, "time.stamp")= chr "25 Jun 2011 23:03"
We can run the OLS regression in R
fit<-lm(sleep~totwrk+educ+age,data=sleep75)
Dependent variable: sleep | Coefficient (std. error) |
---|---|
Constant | 3,638.245*** (112.275) |
totwrk | -0.148*** (0.017) |
educ | -11.134* (5.885) |
age | 2.200 (1.446) |
Observations | 706 |
R2 | 0.113 |
Adjusted R2 | 0.110 |
Residual Std. Error | 419.359 (df = 702) |
F Statistic | 29.919*** (df = 3; 702) |
Note: | *p<0.1; **p<0.05; ***p<0.01 |
#generating the dataset
library(xlsx)
write.xlsx(sleep75, file="sleep75.xlsx")
You can download the dataset here
Hence, if adults trade off sleep for work, more work implies less sleep (other things equal), so \(\beta_1 < 0\).
Solution
The signs of \(\beta_2\) and \(\beta_3\) are not obvious. One could argue that more educated people like to get more out of life, and so, other things equal, they sleep less (\(\beta_2 < 0\)). The relationship between sleeping and age is more complicated than this model suggests, and economists are not in the best position to judge such things.
\[ \widehat{sleep}=3,638.25-.148totwrk-11.13educ+2.20age\\ n=706,\,\, R^{2}=.113. \]
If someone works five more hours per week, by how many minutes is sleep predicted to fall? Is this a large tradeoff?
Solution
Since \(totwrk\) is in minutes, we must convert five hours into minutes: \(\Delta totwrk = 5(60) = 300\). Then sleep is predicted to fall by \(0.148(300) = 44.4\) minutes. For a week, 45 minutes less sleep is not an overwhelming change.
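In R, using the model fitted above:

coef(fit)["totwrk"]*5*60 # about -44.4 minutes of sleep per week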
Solution
More education implies less predicted time sleeping, but the effect is quite small. If we assume the difference between college and high school is four years, the college graduate sleeps about 45 minutes less per week, other things equal.
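Again from the fitted model:

coef(fit)["educ"]*4 # about -45 minutes of sleep per week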
Solution
As shown in part (i), \(R^2=0.113\), so the three explanatory variables explain only about 11.3% of the variation in sleep. One important factor in the error term is general health.
Others are marital status and whether the person has children. Health (however we measure it), marital status, and the number and ages of children would generally be correlated with \(totwrk\). (For example, less healthy people would tend to work less.)
9 The following equation describes the median housing price in a community in terms of amount of pollution (nox for nitrous oxide) and the average number of rooms in houses in the community (rooms):
\[ \log(price)=\beta_{0} +\beta_{1}\log(nox)+\beta_{2}rooms+u. \]
Solution
We hope \(\beta_1 < 0\) because more pollution can be expected to lower housing values; \(\beta_1\) is the elasticity of price with respect to \(nox\). \(\beta_2\) is probably positive because \(rooms\) roughly measures the size of a house. (However, it does not allow us to distinguish homes where each room is large from homes where each room is small.)
Solution
If we assume that \(rooms\) increases with quality of the home, then \(\log(nox)\) and \(rooms\) are negatively correlated when poorer neighborhoods have more pollution, something that is often true. If \(\beta_2 > 0\) and \(Corr(x_{1}, x_{2}) < 0\), the simple regression estimator \(\widetilde{\beta}_1\) has a downward bias. But because \(\beta_1 < 0\), this means that the simple regression, on average, overstates the importance of pollution: \(E(\widetilde{\beta}_1)\) is more negative than \(\beta_1\).
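For reference, the direction of the bias follows from the standard omitted-variable-bias formula, here with \(x_{1}=\log(nox)\) and \(x_{2}=rooms\):

\[ E(\widetilde{\beta}_{1})=\beta_{1}+\beta_{2}\delta_{1}, \]

where \(\delta_{1}\) is the slope from the regression of \(rooms\) on \(\log(nox)\); with \(\beta_{2}>0\) and \(\delta_{1}<0\), the bias term \(\beta_{2}\delta_{1}\) is negative.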
\[ \widehat{\log(price)}=11.71-1.043\log(nox), n=506,\,R^{2}=.264\\ \widehat{\log(price)}=9.23-.718\log(nox)+.306rooms, n=506,\,R^{2}=.514 \]
Is the relationship between the simple and multiple regression estimates of the elasticity of \(price\) with respect to \(nox\) what you would have predicted, given your answer in part (ii)? Does this mean that -.718 is definitely closer to the true elasticity than -1.043?
Solution:
This is what we expect from a typical sample, based on our analysis in part (ii). The simple regression estimate, −1.043, is more negative (larger in magnitude) than the multiple regression estimate, −0.718. Because these estimates come from a single sample, we can never know which is closer to \(\beta_1\); but if this is a “typical” sample, \(\beta_1\) is closer to −0.718.
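These two fitted equations can be reproduced in R. Assuming they come from the hprice2 dataset in the wooldridge package (the text does not name the data file, so this is an assumption), the calls would be:

library(wooldridge)
simple   <- lm(lprice ~ lnox, data = hprice2)
multiple <- lm(lprice ~ lnox + rooms, data = hprice2)
coef(simple)   # elasticity of about -1.04
coef(multiple) # elasticity of about -0.72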
16 The following equations were estimated using the data in LAWSCH85:
\[\begin{align} \widehat{lsalary}=& 9.9 -.0041rank +.294GPA\\ &(.24)\quad(.0003)\quad(.069)\\ &n=142\quad R^{2}=.8238 \end{align}\]
\[\begin{align} \widehat{lsalary}=& 9.86 -.0038rank +.295GPA+.00017age\\ &(.29)\quad(.0004)\,\,\,\,\quad(.083)\quad\,\,(.00036)\\ &n=99\quad R^{2}=.8036 \end{align}\]
How can it be that the \(R^{2}\) is smaller when the variable \(age\) is added to the equation?
Solution:
By using the wooldridge package in R we can see:
library(wooldridge)
#inspect the dataset
str(lawsch85)
'data.frame': 156 obs. of 21 variables:
$ rank : int 128 104 34 49 95 98 124 157 145 91 ...
$ salary : num 31400 33098 32870 35000 33606 ...
$ cost : int 8340 6980 16370 17566 8350 8350 6020 5986 4785 7680 ...
$ LSAT : int 155 160 155 157 162 161 155 152 155 160 ...
$ GPA : num 3.15 3.5 3.25 3.2 3.38 ...
$ libvol : int 216 256 424 329 332 311 220 230 230 157 ...
$ faculty: int 45 44 78 136 56 40 40 45 101 44 ...
$ age : int 12 113 134 89 70 29 61 60 70 128 ...
$ clsize : int 210 190 270 277 150 156 151 149 322 70 ...
$ north : int 1 0 0 0 0 0 0 0 0 0 ...
$ south : int 0 1 0 0 0 0 1 1 0 1 ...
$ east : int 0 0 1 1 0 0 0 0 1 0 ...
$ west : int 0 0 0 0 1 1 0 0 0 0 ...
$ lsalary: num 10.4 10.4 10.4 10.5 10.4 ...
$ studfac: num 4.67 4.32 3.46 2.04 2.68 ...
$ top10 : int 0 0 0 0 0 0 0 0 0 0 ...
$ r11_25 : int 0 0 0 0 0 0 0 0 0 0 ...
$ r26_40 : int 0 0 1 0 0 0 0 0 0 0 ...
$ r41_60 : int 0 0 0 1 0 0 0 0 0 0 ...
$ llibvol: num 5.38 5.55 6.05 5.8 5.81 ...
$ lcost : num 9.03 8.85 9.7 9.77 9.03 ...
- attr(*, "time.stamp")= chr "25 Jun 2011 23:03"
head(lawsch85)
rank salary cost LSAT GPA libvol faculty age clsize north south east west lsalary studfac top10 r11_25 r26_40 r41_60 llibvol lcost
1 128 31400 8340 155 3.15 216 45 12 210 1 0 0 0 10.35456 4.666667 0 0 0 0 5.375278 9.028818
2 104 33098 6980 160 3.50 256 44 113 190 0 1 0 0 10.40723 4.318182 0 0 0 0 5.545177 8.850804
3 34 32870 16370 155 3.25 424 78 134 270 0 0 1 0 10.40032 3.461539 0 0 1 0 6.049734 9.703206
4 49 35000 17566 157 3.20 329 136 89 277 0 0 1 0 10.46310 2.036765 0 0 0 1 5.796058 9.773721
5 95 33606 8350 162 3.38 332 56 70 150 0 0 0 1 10.42246 2.678571 0 0 0 0 5.805135 9.030017
6 98 31700 8350 161 3.40 311 40 29 156 0 0 0 1 10.36407 3.900000 0 0 0 0 5.739793 9.030017
sum(is.na(lawsch85))
101
na_count <-sapply(lawsch85, function(y) sum(length(which(is.na(y)))))
na_count <- data.frame(na_count)
na_count
na_count
rank 0
salary 8
cost 6
LSAT 6
GPA 7
libvol 1
faculty 4
age 45
clsize 3
north 0
south 0
east 0
west 0
lsalary 8
studfac 6
top10 0
r11_25 0
r26_40 0
r41_60 0
llibvol 1
lcost 6
sum(na_count)
101
#generating the dataset
library(xlsx)
write.xlsx(lawsch85, file="lawsch85.xlsx")
Now you can download the lawsch85.xlsx here
By running the two regression models, we can have:
mod1<-lm(lsalary~rank+GPA,data=lawsch85)
mod2<-lm(lsalary~rank+GPA+age,data=lawsch85)
Dependent variable: lsalary | mod1 (1) | mod2 (2) |
---|---|---|
Constant | 9.899*** (0.245) | 9.860*** (0.293) |
rank | -0.004*** (0.0003) | -0.004*** (0.0004) |
GPA | 0.294*** (0.069) | 0.295*** (0.083) |
age | | 0.0002 (0.0004) |
Observations | 142 | 99 |
R2 | 0.824 | 0.804 |
Adjusted R2 | 0.821 | 0.797 |
Residual Std. Error | 0.117 (df = 139) | 0.120 (df = 95) |
F Statistic | 324.858*** (df = 2; 139) | 129.535*** (df = 3; 95) |
Note: | *p<0.1; **p<0.05; ***p<0.01 | |
The coefficient on \(age\) is statistically insignificant in model 2. More importantly, the two \(R^2\) values cannot be compared directly because the regressions are estimated on different samples (note the different \(n\) in the two estimated equations): \(age\) has 45 missing values, so adding it shrinks the estimation sample from 142 to 99 law schools. The residual standard error is also slightly larger in model 2 (0.120 vs. 0.117), with fewer degrees of freedom. The missing-value counts for the variables used in the two models are shown below:
library(dplyr)
regvars<-lawsch85%>%
select(lsalary,rank,GPA,age)
na_count_reg <-sapply(regvars, function(y) sum(length(which(is.na(y)))))
na_count_reg <- data.frame(na_count_reg)
na_count_reg
na_count_reg
lsalary 8
rank 0
GPA 7
age 45
Although the overall \(F\) test is significant in both models, the \(t\) statistic on \(age\) shows that including it does not improve the model’s ability to explain \(lsalary\).
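One way to make the fits comparable is to re-estimate model 1 on the same 99 observations used by model 2 (a sketch; by default lm() drops any row with a missing value in one of its variables):

common <- subset(lawsch85, !is.na(lsalary) & !is.na(rank) & !is.na(GPA) & !is.na(age))
mod1_common <- lm(lsalary ~ rank + GPA, data = common)
summary(mod1_common)$r.squared # R2 of model 1 computed on the same sample as model 2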