See the repo on GitHub


2 In the simple linear regression model \(y=\beta_{0}+\beta_{1}x+u\), suppose that \(E(u)\neq 0\). Letting \(\alpha_{0}=E(u)\), show that the model can always be rewritten with the same slope, but a new intercept and error, where the new error has a zero expected value.

Solution:

\[\begin{align} y&=\beta_{0}+\beta_{1}x+u\quad \mbox{then we can add $\alpha_{0}$ and subtract $\alpha_{0}$}\\ y&=(\alpha_{0}+\beta_{0})+\beta_{1}x+(u-\alpha_{0})\\ \end{align}\]

We can call the new error \(e=u-\alpha_{0}\), so that \(E(e)=E(u)-\alpha_{0}=0\). The new intercept is \(\alpha_{0}+\beta_{0}\), but the slope is still \(\beta_{1}\).
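
A minimal simulation sketch makes this concrete (the parameter values below are assumptions for illustration): the fitted intercept picks up \(\beta_{0}+\alpha_{0}\), while the slope estimate is unaffected.

set.seed(123)
n <- 10000
beta0 <- 1; beta1 <- 0.5; alpha0 <- 2  # illustrative values, not from the text
x <- rnorm(n)
u <- rnorm(n, mean = alpha0)           # E(u) = alpha0, not zero
y <- beta0 + beta1 * x + u
coef(lm(y ~ x))                        # intercept near beta0 + alpha0 = 3, slope near beta1 = 0.5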


3 The following table contains the ACT scores and the GPA (grade point average) for eight college students. Grade point average is based on a four-point scale and has been rounded to one digit after the decimal.

Student   GPA   ACT
1         2.8   21
2         3.4   24
3         3.0   26
4         3.5   27
5         3.6   29
6         3.0   25
7         2.7   25
8         3.7   30
  (i) Estimate the relationship between GPA and ACT using OLS; that is, obtain the intercept and slope estimates in the equation

\[ \widehat{GPA}=\widehat{\beta}_{0}+\widehat{\beta}_{1}ACT \]

Comment on the direction of the relationship. Does the intercept have a useful interpretation here? Explain. How much higher is the GPA predicted to be if the ACT score is increased by five points?

Solution:

Download the table here

The estimated relationship between GPA and ACT is:

\[ \widehat{GPA}=0.568+0.102ACT \]

In R we can call:

# read the eight observations (semicolon-separated CSV)
table <- read.csv(file = "https://raw.githubusercontent.com/rhozon/homeworkI/main/gpaxact.csv", header = TRUE, sep = ";")

# fit the simple regression of GPA on ACT
GPA.fitted <- lm(GPA ~ ACT, data = table)

                      Dependent variable: GPA
---------------------------------------------
Constant              0.568
                      (0.928)
ACT                   0.102**
                      (0.036)
---------------------------------------------
Observations          8
R2                    0.577
Adjusted R2           0.507
Residual Std. Error   0.269 (df = 6)
F Statistic           8.199** (df = 1; 6)
---------------------------------------------
Note: *p<0.1; **p<0.05; ***p<0.01

The direction of the relationship is positive: a one-point increase in the ACT score is predicted to increase GPA by about 0.10 points.

The intercept is the predicted GPA when ACT equals zero, i.e. a student with an ACT score of 0 would be predicted to have GPA = 0.568. It has no useful interpretation here: no one attending college scores anywhere near zero on the ACT, so ACT = 0 lies far outside the range of the data.

If the ACT score is increased by 5 points, the predicted GPA will be about 0.51 (= 0.102 × 5) points higher.
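
As a cross-check, the two estimates can be computed by hand from the eight observations with the usual OLS formulas \(\widehat{\beta}_{1}=S_{xy}/S_{xx}\) and \(\widehat{\beta}_{0}=\overline{y}-\widehat{\beta}_{1}\overline{x}\); a minimal sketch:

GPA <- c(2.8, 3.4, 3.0, 3.5, 3.6, 3.0, 2.7, 3.7)
ACT <- c(21, 24, 26, 27, 29, 25, 25, 30)
b1 <- sum((ACT - mean(ACT)) * (GPA - mean(GPA))) / sum((ACT - mean(ACT))^2)
b0 <- mean(GPA) - b1 * mean(ACT)
c(b0 = b0, b1 = b1)  # approximately 0.568 and 0.102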

  (ii) Compute the fitted values and residuals for each observation and verify that the residuals (approximately) sum to zero.

Solution:

The fitted values are obtained by plugging each ACT value into the estimated OLS equation. For example, for the second student we estimate a GPA of 0.568 + 0.102 × 24 ≈ 3.02 (3.021 when the unrounded coefficient estimates are used, as in the table below).

The residuals are given by \(GPA-\widehat{GPA}\), and the sum \(\displaystyle\sum_{i=1}^{n}(GPA-\widehat{GPA})\) is approximately zero:

Student   GPA   ACT   GPA.fitted   residuals
1         2.8   21    2.714         0.086
2         3.4   24    3.021         0.379
3         3.0   26    3.225        -0.225
4         3.5   27    3.327         0.173
5         3.6   29    3.532         0.068
6         3.0   25    3.123        -0.123
7         2.7   25    3.123        -0.423
8         3.7   30    3.634         0.066
Sum                                 0.000

In R we can call:

# residuals() extracts the OLS residuals; their sum is zero up to rounding
round(sum(residuals(GPA.fitted)), 10)
 0
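
The whole table above can be reproduced in one call by binding the fitted values and residuals to the data (a sketch, reusing the table data frame and the GPA.fitted model from part (i)):

cbind(table, GPA.fitted = fitted(GPA.fitted), residuals = resid(GPA.fitted))
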
  (iii) What is the predicted value of GPA when ACT = 20?

Solution:

Using the estimated OLS equation, \(\widehat{GPA}=0.568+0.102\times 20 \approx 2.61\) (2.612 with the unrounded coefficient estimates).
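
The same prediction can be obtained with predict(), reusing the GPA.fitted model:

predict(GPA.fitted, newdata = data.frame(ACT = 20))  # approximately 2.61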

  (iv) How much of the variation in GPA for these eight students is explained by ACT? Explain.

Solution:

Since \(R^{2}=0.5774\), about 57.74% of the variation in GPA is explained by ACT.
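
In R, the \(R^{2}\) can be read directly from the model summary:

summary(GPA.fitted)$r.squared  # approximately 0.577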


10 Let \(\widehat{\beta}_{0}\) and \(\widehat{\beta}_{1}\) be the OLS intercept and slope estimators, respectively, and let \(\overline{u}\) be the sample average of errors (not the residuals!).

  (i) Show that \(\widehat{\beta}_{1}\) can be written as \(\widehat{\beta}_{1}=\beta_{1}+\displaystyle\sum_{i=1}^{n}w_{i}u_{i},\) where \(w_{i}=d_{i}/\mbox{SST}_{x}\) and \(d_{i}=x_{i}-\overline{x}\).

Solution:

Start from the OLS slope formula \(\widehat{\beta}_{1}=\frac{\displaystyle\sum_{i=1}^{n}(x_{i}-\overline{x})y_{i}}{\displaystyle\sum_{i=1}^{n}(x_{i}-\overline{x})^{2}}\) and substitute \(y_{i}=\beta_{0}+\beta_{1}x_{i}+u_{i}\). Because \(\displaystyle\sum_{i=1}^{n}(x_{i}-\overline{x})=0\) and \(\displaystyle\sum_{i=1}^{n}(x_{i}-\overline{x})x_{i}=SST_{x}\), this yields \(\widehat{\beta}_{1}=\beta_{1}+\frac{\displaystyle\sum_{i=1}^{n}(x_{i}-\overline{x})u_{i}}{\displaystyle\sum_{i=1}^{n}(x_{i}-\overline{x})^{2}}\). Defining \(d_{i}=x_{i}-\overline{x}\) and \(w_{i}=d_{i}/SST_{x}\), the last term is exactly \(\displaystyle\sum_{i=1}^{n}w_{i}u_{i}\).
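
The identity can be verified numerically; a minimal simulation sketch (all parameter values are assumptions for illustration):

set.seed(1)
n <- 50
x <- rnorm(n); u <- rnorm(n)
y <- 1 + 2 * x + u                          # beta0 = 1, beta1 = 2
w <- (x - mean(x)) / sum((x - mean(x))^2)   # w_i = d_i / SST_x
unname(coef(lm(y ~ x))[2]) - (2 + sum(w * u))  # essentially zero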

  (ii) Use part (i), along with \(\displaystyle\sum_{i=1}^{n}w_{i}=0\), to show that \(\widehat{\beta}_{1}\) and \(\overline{u}\) are uncorrelated. [Hint: You are being asked to show that \(E[(\widehat{\beta}_{1}-\beta_{1})\cdot \overline{u}]=0\).]

Solution:

Because \(E(\widehat{\beta}_{1})=\beta_{1}\) and \(E(\overline{u})=0\), we have \(Cov(\widehat{\beta}_{1},\overline{u})=E[(\widehat{\beta}_{1}-\beta_{1})\overline{u}]\), so it suffices to show that the latter is zero.

By part (i), \(E[(\widehat{\beta}_{1}-\beta_{1})\overline{u}]=E\left[\left(\displaystyle\sum^{n}_{i=1}w_{i}u_{i}\right)\overline{u}\right]=\displaystyle\sum^{n}_{i=1}w_{i}E(u_{i}\overline{u})\), treating the \(x_{i}\), and hence the \(w_{i}\), as nonrandom.

Because the \(u_{i}\) are pairwise uncorrelated, \(E(u_{i}\overline{u})=E\left(u_{i}\cdot\frac{1}{n}\displaystyle\sum_{h=1}^{n}u_{h}\right)=E(u_{i}^{2})/n=\sigma^{2}/n\) (the cross terms vanish because \(E(u_{i}u_{h})=0\) for \(i\neq h\)). Therefore, \(\displaystyle\sum_{i=1}^{n}w_{i}E(u_{i}\overline{u})=\sum^{n}_{i=1}w_{i}(\sigma^{2}/n)=(\sigma^{2}/n)\displaystyle\sum^{n}_{i=1}w_{i}=0\), since \(\displaystyle\sum^{n}_{i=1}w_{i}=\displaystyle\sum^{n}_{i=1}d_{i}/SST_{x}=0\).
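
A Monte Carlo sketch of this result (assumed design, keeping \(x\) fixed across replications): over many samples, the sample covariance between \(\widehat{\beta}_{1}\) and \(\overline{u}\) should be close to zero.

set.seed(2)
x <- rnorm(50)  # fixed regressor values
reps <- replicate(5000, {
  u <- rnorm(50)
  y <- 1 + 2 * x + u
  c(unname(coef(lm(y ~ x))[2]), mean(u))  # (beta1_hat, u_bar)
})
cov(reps[1, ], reps[2, ])  # approximately zero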

  (iii) Show that \(\widehat{\beta}_{0}\) can be written as \(\widehat{\beta}_{0}=\beta_{0}+\overline{u}-(\widehat{\beta}_{1}-\beta_{1})\overline{x}\).

Solution:

Recall the OLS intercept formula \(\widehat{\beta}_{0}=\overline{y}-\widehat{\beta}_{1}\overline{x}\). Averaging \(y_{i}=\beta_{0}+\beta_{1}x_{i}+u_{i}\) over the sample gives \(\overline{y}=\beta_{0}+\beta_{1}\overline{x}+\overline{u}\), and substituting this in, we obtain:

\[ \widehat{\beta}_{0}=(\beta_{0}+\beta_{1}\overline{x}+\overline{u})-\widehat{\beta}_{1}\overline{x}=\beta_{0}+\overline{u}-(\widehat{\beta}_{1}-\beta_{1})\overline{x} \]

  (iv) Use parts (ii) and (iii) to show that \(Var(\widehat{\beta}_{0})=\sigma^{2}/n+\sigma^{2}(\overline{x})^{2}/SST_{x}\).

Solution:

Because \(\widehat{\beta}_{1}\) and \(\overline{u}\) are uncorrelated by part (ii), the variance of the expression in part (iii) is the sum of the variances:

\[ Var(\widehat{\beta}_{0})=Var(\overline{u})+Var(\widehat{\beta}_{1})(\overline{x})^{2}=\\ \sigma^{2}/n+(\sigma^{2}/SST_{x})(\overline{x})^{2}=\\ \sigma^{2}/n+\sigma^{2}(\overline{x})^{2}/SST_{x} \]

  (v) Do the algebra to simplify the expression in part (iv) to equation (2.58). [Hint: \(SST_{x}/n=n^{-1}\displaystyle\sum_{i=1}^{n}x^{2}_{i}-(\overline{x})^{2}\).]

Solution:

The substitution gives

\[ Var(\widehat{\beta}_{0})=\sigma^{2}\left[SST_{x}/n+(\overline{x})^{2}\right]/SST_{x}=\\ \sigma^{2}\left[n^{-1}\sum^{n}_{i=1}x_{i}^{2}-(\overline{x})^{2}+(\overline{x})^{2}\right]/SST_{x}=\\ \sigma^{2}\left(n^{-1}\sum^{n}_{i=1}x^{2}_{i}\right)/SST_{x} \]
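
A simulation sketch can confirm the simplified formula (the design and \(\sigma^{2}=1\) are assumptions for illustration): the empirical variance of \(\widehat{\beta}_{0}\) across replications should match \(\sigma^{2}(n^{-1}\sum x_{i}^{2})/SST_{x}\).

set.seed(3)
n <- 30
x <- runif(n, 0, 10)  # fixed regressor values
SSTx <- sum((x - mean(x))^2)
b0_hats <- replicate(10000, {
  y <- 1 + 2 * x + rnorm(n)   # sigma^2 = 1
  unname(coef(lm(y ~ x))[1])
})
c(empirical = var(b0_hats), formula = mean(x^2) / SSTx)  # the two should be close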


3 The following model is a simplified version of the multiple regression model used by Biddle and Hamermesh (1990) to study the tradeoff between time spent sleeping and working and to look at other factors affecting sleep:

\[ sleep=\beta_{0}+\beta_{1}totwrk+\beta_{2}educ+\beta_{3}age+u, \]

where \(sleep\) and \(totwrk\) (total work) are measured in minutes per week and \(educ\) and \(age\) are measured in years. (See also Computer Exercise C3 in Chapter 2.)

  (i) If adults trade off sleep for work, what is the sign of \(\beta_{1}\)?

Solution:

library(wooldridge)

# inspect the dataset
str(sleep75)
'data.frame':   706 obs. of  34 variables:
 $ age     : int  32 31 44 30 64 41 35 47 32 30 ...
 $ black   : int  0 0 0 0 0 0 0 0 0 0 ...
 $ case    : int  1 2 3 4 5 6 7 8 9 10 ...
 $ clerical: num  0 0 0 0 0 0 0 0 0 0 ...
 $ construc: num  0 0 0 0 0 0 0 0 0 0 ...
 $ educ    : int  12 14 17 12 14 12 12 13 17 15 ...
 $ earns74 : num  0 9500 42500 42500 2500 ...
 $ gdhlth  : int  0 1 1 1 1 1 1 1 1 1 ...
 $ inlf    : int  1 1 1 1 1 1 1 1 1 1 ...
 $ leis1   : int  3529 2140 4595 3211 4052 4812 4787 3544 4359 4211 ...
 $ leis2   : int  3479 2140 4505 3211 4007 4797 4157 3469 4359 4061 ...
 $ leis3   : int  3479 2140 4227 3211 4007 4797 4157 3439 4121 4061 ...
 $ smsa    : int  0 0 1 0 0 0 0 1 0 1 ...
 $ lhrwage : num  1.956 0.358 3.022 2.264 1.012 ...
 $ lothinc : num  10.08 0 0 0 9.33 ...
 $ male    : int  1 1 1 0 1 1 1 1 1 1 ...
 $ marr    : int  1 0 1 1 1 1 1 1 1 1 ...
 $ prot    : int  1 1 0 1 1 1 1 1 0 0 ...
 $ rlxall  : int  3163 2920 3038 3083 3493 4078 3810 3033 3606 3168 ...
 $ selfe   : int  0 1 1 1 0 0 0 1 0 1 ...
 $ sleep   : int  3113 2920 2670 3083 3448 4063 3180 2928 3368 3018 ...
 $ slpnaps : int  3163 2920 2760 3083 3493 4078 3810 3003 3368 3168 ...
 $ south   : int  0 1 0 0 0 0 0 0 0 0 ...
 $ spsepay : num  0 0 20000 5000 2400 0 12000 0 0 6000 ...
 $ spwrk75 : int  0 0 1 1 1 0 1 0 0 1 ...
 $ totwrk  : int  3438 5020 2815 3786 2580 1205 2113 3608 2353 2851 ...
 $ union   : int  0 0 0 0 0 0 0 0 1 0 ...
 $ worknrm : int  3438 5020 2815 3786 2580 0 2113 3608 2353 2851 ...
 $ workscnd: int  0 0 0 0 0 1205 0 0 0 0 ...
 $ exper   : int  14 11 21 12 44 23 17 28 9 9 ...
 $ yngkid  : int  0 0 0 0 0 0 1 0 0 0 ...
 $ yrsmarr : int  13 0 0 12 33 23 0 24 11 7 ...
 $ hrwage  : num  7.07 1.43 20.53 9.62 2.75 ...
 $ agesq   : int  1024 961 1936 900 4096 1681 1225 2209 1024 900 ...
 - attr(*, "time.stamp")= chr "25 Jun 2011 23:03"

We can run the OLS regression in R:

fit <- lm(sleep ~ totwrk + educ + age, data = sleep75)

                      Dependent variable: sleep
-----------------------------------------------
Constant              3,638.245***
                      (112.275)
totwrk                -0.148***
                      (0.017)
educ                  -11.134*
                      (5.885)
age                   2.200
                      (1.446)
-----------------------------------------------
Observations          706
R2                    0.113
Adjusted R2           0.110
Residual Std. Error   419.359 (df = 702)
F Statistic           29.919*** (df = 3; 702)
-----------------------------------------------
Note: *p<0.1; **p<0.05; ***p<0.01

#generating the dataset
library(xlsx)
write.xlsx(sleep75, file="sleep75.xlsx")

You can download the dataset here

Hence, if adults trade off sleep for work, more work implies less sleep (other things equal), so \(\beta_{1} < 0\).

  (ii) What signs do you think \(\beta_{2}\) and \(\beta_{3}\) will have?

Solution:

The signs of \(\beta_2\) and \(\beta_3\) are not obvious. One could argue that more educated people like to get more out of life, and so, other things equal, they sleep less (\(\beta_2 < 0\)). The relationship between sleeping and age is more complicated than this model suggests, and economists are not in the best position to judge such things.

  (iii) Using the data in SLEEP75, the estimated equation is

\[ \widehat{sleep}=3,638.25-.148totwrk-11.13educ+2.20age\\ n=706,\,\, R^{2}=.113. \]

If someone works five more hours per week, by how many minutes is sleep predicted to fall ? Is this a large tradeoff ?

Solution:

Since \(totwrk\) is in minutes, we must convert five hours into minutes: \(\Delta totwrk = 5(60) = 300\). Then sleep is predicted to fall by \(0.148(300) = 44.4\) minutes. For a week, roughly 45 minutes less sleep is not an overwhelming change.
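
The same number can be computed from the stored coefficient, reusing the fit object estimated in part (i):

coef(fit)["totwrk"] * 300  # predicted change in sleep, in minutes per week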

  (iv) Discuss the sign and magnitude of the estimated coefficient on \(educ\).

Solution:

More education implies less predicted time spent sleeping, but the effect is quite small. If we assume the difference between college and high school is four years, the college graduate is predicted to sleep about 45 minutes less per week (\(11.13\times 4\approx 44.5\)), other things equal.

  (v) Would you say \(totwrk\), \(educ\) and \(age\) explain much of the variation in sleep? What other factors might affect the time spent sleeping? Are these likely to be correlated with \(totwrk\)?

Solution:

As the \(R^{2}=0.113\) shows, the three explanatory variables explain only about 11.3% of the variation in sleep. One important factor in the error term is general health.

Another is marital status, and whether the person has children. Health (however we measure that), marital status, and number and ages of children would generally be correlated with totwrk. (For example, less healthy people would tend to work less.)


9 The following equation describes the median housing price in a community in terms of amount of pollution (nox for nitrous oxide) and the average number of rooms in houses in the community (rooms):

\[ \log(price)=\beta_{0} +\beta_{1}\log(nox)+\beta_{2}rooms+u. \]

  (i) What are the probable signs of \(\beta_{1}\) and \(\beta_{2}\)? What is the interpretation of \(\beta_{1}\)? Explain.

Solution:

We hope \(\beta_1 < 0\) because more pollution can be expected to lower housing values; \(\beta_1\) is the elasticity of price with respect to \(nox\). \(\beta_2\) is probably positive because \(rooms\) roughly measures the size of a house. (However, it does not allow us to distinguish homes where each room is large from homes where each room is small.)

  (ii) Why might \(nox\) [or more precisely, \(\log(nox)\)] and \(rooms\) be negatively correlated? If this is the case, does the simple regression of \(\log(price)\) on \(\log(nox)\) produce an upward or a downward biased estimator of \(\beta_{1}\)?

Solution:

If we assume that \(rooms\) increases with quality of the home, then \(\log(nox)\) and \(rooms\) are negatively correlated when poorer neighborhoods have more pollution, something that is often true. If \(\beta_{2}>0\) and \(Corr(x_{1},x_{2})<0\), the simple regression estimator \(\widetilde{\beta}_{1}\) has a downward bias. But because \(\beta_{1}<0\), this means that the simple regression, on average, overstates the importance of pollution (\(E(\widetilde{\beta}_{1})\) is more negative than \(\beta_{1}\)).
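
The omitted variable bias formula makes this precise. If \(\widetilde{\delta}_{1}\) denotes the slope from the auxiliary regression of \(rooms\) on \(\log(nox)\), then

\[ E(\widetilde{\beta}_{1})=\beta_{1}+\beta_{2}\widetilde{\delta}_{1}, \]

so \(\beta_{2}>0\) together with \(\widetilde{\delta}_{1}<0\) makes the bias term \(\beta_{2}\widetilde{\delta}_{1}\) negative.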

  (iii) Using the data in HPRICE2, the following equations were estimated:

\[ \widehat{\log(price)}=11.71-1.043\log(nox),\quad n=506,\,R^{2}=.264\\ \widehat{\log(price)}=9.23-.718\log(nox)+.306\,rooms,\quad n=506,\,R^{2}=.514 \]

Is the relationship between the simple and multiple regression estimates of the elasticity of \(price\) with respect to \(nox\) what you would have predicted, given your answer in part (ii)? Does this mean that −.718 is definitely closer to the true elasticity than −1.043?

Solution:

This is what we expect from a typical sample, based on our analysis in part (ii). The simple regression estimate, −1.043, is more negative (larger in magnitude) than the multiple regression estimate, −0.718. Because these estimates come from only one sample, we can never know which is closer to \(\beta_{1}\). But if this is a “typical” sample, \(\beta_{1}\) is closer to −0.718.
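
Both estimates can be reproduced in R, assuming the hprice2 data shipped with the wooldridge package (a sketch):

library(wooldridge)
coef(lm(log(price) ~ log(nox), data = hprice2))          # simple regression
coef(lm(log(price) ~ log(nox) + rooms, data = hprice2))  # multiple regression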


16 The following equations were estimated using the data in LAWSCH85:

\[\begin{align} \widehat{lsalary}=& 9.9 -.0041rank +.294GPA\\ &(.24)\quad(.0003)\quad(.069)\\ &n=142\quad R^{2}=.8238 \end{align}\]

\[\begin{align} \widehat{lsalary}=& 9.86 -.0038rank +.295GPA+.00017age\\ &(.29)\quad(.0004)\,\,\,\,\quad(.083)\quad\,\,(.00036)\\ &n=99\quad R^{2}=.8036 \end{align}\]

How can it be that the \(R^{2}\) is smaller when the variable \(age\) is added to the equation?

Solution:

By using the wooldridge package in R we can see:

library(wooldridge)

#inspect the dataset
str(lawsch85)
'data.frame':   156 obs. of  21 variables:
 $ rank   : int  128 104 34 49 95 98 124 157 145 91 ...
 $ salary : num  31400 33098 32870 35000 33606 ...
 $ cost   : int  8340 6980 16370 17566 8350 8350 6020 5986 4785 7680 ...
 $ LSAT   : int  155 160 155 157 162 161 155 152 155 160 ...
 $ GPA    : num  3.15 3.5 3.25 3.2 3.38 ...
 $ libvol : int  216 256 424 329 332 311 220 230 230 157 ...
 $ faculty: int  45 44 78 136 56 40 40 45 101 44 ...
 $ age    : int  12 113 134 89 70 29 61 60 70 128 ...
 $ clsize : int  210 190 270 277 150 156 151 149 322 70 ...
 $ north  : int  1 0 0 0 0 0 0 0 0 0 ...
 $ south  : int  0 1 0 0 0 0 1 1 0 1 ...
 $ east   : int  0 0 1 1 0 0 0 0 1 0 ...
 $ west   : int  0 0 0 0 1 1 0 0 0 0 ...
 $ lsalary: num  10.4 10.4 10.4 10.5 10.4 ...
 $ studfac: num  4.67 4.32 3.46 2.04 2.68 ...
 $ top10  : int  0 0 0 0 0 0 0 0 0 0 ...
 $ r11_25 : int  0 0 0 0 0 0 0 0 0 0 ...
 $ r26_40 : int  0 0 1 0 0 0 0 0 0 0 ...
 $ r41_60 : int  0 0 0 1 0 0 0 0 0 0 ...
 $ llibvol: num  5.38 5.55 6.05 5.8 5.81 ...
 $ lcost  : num  9.03 8.85 9.7 9.77 9.03 ...
 - attr(*, "time.stamp")= chr "25 Jun 2011 23:03"
head(lawsch85)
  rank salary  cost LSAT  GPA libvol faculty age clsize north south east west  lsalary  studfac top10 r11_25 r26_40 r41_60  llibvol    lcost
1  128  31400  8340  155 3.15    216      45  12    210     1     0    0    0 10.35456 4.666667     0      0      0      0 5.375278 9.028818
2  104  33098  6980  160 3.50    256      44 113    190     0     1    0    0 10.40723 4.318182     0      0      0      0 5.545177 8.850804
3   34  32870 16370  155 3.25    424      78 134    270     0     0    1    0 10.40032 3.461539     0      0      1      0 6.049734 9.703206
4   49  35000 17566  157 3.20    329     136  89    277     0     0    1    0 10.46310 2.036765     0      0      0      1 5.796058 9.773721
5   95  33606  8350  162 3.38    332      56  70    150     0     0    0    1 10.42246 2.678571     0      0      0      0 5.805135 9.030017
6   98  31700  8350  161 3.40    311      40  29    156     0     0    0    1 10.36407 3.900000     0      0      0      0 5.739793 9.030017
sum(is.na(lawsch85))
 101
# count the missing values in each column
na_count <- sapply(lawsch85, function(y) sum(is.na(y)))

na_count <- data.frame(na_count)

na_count
        na_count
rank           0
salary         8
cost           6
LSAT           6
GPA            7
libvol         1
faculty        4
age           45
clsize         3
north          0
south          0
east           0
west           0
lsalary        8
studfac        6
top10          0
r11_25         0
r26_40         0
r41_60         0
llibvol        1
lcost          6
sum(na_count)
 101
#generating the dataset
library(xlsx)
write.xlsx(lawsch85, file="lawsch85.xlsx")

Now you can download the lawsch85.xlsx here

Running the two regression models, we obtain:

mod1 <- lm(lsalary ~ rank + GPA, data = lawsch85)

mod2 <- lm(lsalary ~ rank + GPA + age, data = lawsch85)

                      Dependent variable: lsalary
                      mod1 (1)            mod2 (2)
--------------------------------------------------
Constant              9.899***            9.860***
                      (0.245)             (0.293)
rank                  -0.004***           -0.004***
                      (0.0003)            (0.0004)
GPA                   0.294***            0.295***
                      (0.069)             (0.083)
age                                       0.0002
                                          (0.0004)
--------------------------------------------------
Observations          142                 99
R2                    0.824               0.804
Adjusted R2           0.821               0.797
Residual Std. Error   0.117 (df = 139)    0.120 (df = 95)
F Statistic           324.858***          129.535***
                      (df = 2; 139)       (df = 3; 95)
--------------------------------------------------
Note: *p<0.1; **p<0.05; ***p<0.01

The coefficient on the \(age\) variable is statistically insignificant in model 2.

Because the two models are estimated on different samples (note the different \(n\) in the two estimated equations), their \(R^{2}\) values are not comparable. The residual standard error in model 2 is higher than in model 1 (0.120 vs. 0.117), even though the extra regressor consumes a degree of freedom. As seen above, the \(age\) variable has 45 missing values, which is why the sample shrinks when it is added; we can check the missing counts for just the variables that enter the two models:

library(dplyr)

# keep only the variables that enter the two models
regvars <- lawsch85 %>%
  select(lsalary, rank, GPA, age)

na_count_reg <- sapply(regvars, function(y) sum(is.na(y)))

na_count_reg <- data.frame(na_count_reg)

na_count_reg
        na_count_reg
lsalary            8
rank               0
GPA                7
age               45

Although the overall \(F\) test is significant in both models, the \(t\) statistic on \(age\) shows that including it does not improve the model's fit in explaining \(lsalary\).
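
One way to put the two fits on an equal footing is to re-estimate both models on the common complete-case sample; a sketch, reusing the regvars data frame built above:

common <- na.omit(regvars)  # drop rows with any missing value
mod1c <- lm(lsalary ~ rank + GPA, data = common)
mod2c <- lm(lsalary ~ rank + GPA + age, data = common)
c(summary(mod1c)$r.squared, summary(mod2c)$r.squared)  # now directly comparable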