2 Consider an equation to explain salaries of CEOs in terms of annual firm sales, return on equity (roe, in percentage form), and return on the firm's stock (ros, in percentage form):
\[ \log(salary) = \beta_0 + \beta_1 \log(sales) + \beta_2 roe + \beta_3 ros + u \]
Answer:
\[ H_0 : \beta_3 = 0,\\ H_1 : \beta_3 > 0 \]
By what percentage is \(salary\) predicted to increase if \(ros\) increases by 50 points? Does \(ros\) have a practically large effect on \(salary\)?
Answer:
By exploring the CEOSAL1 data, we can see:
library(wooldridge)
str(ceosal1)
'data.frame': 209 obs. of 12 variables:
$ salary : int 1095 1001 1122 578 1368 1145 1078 1094 1237 833 ...
$ pcsalary: int 20 32 9 -9 7 5 10 7 16 5 ...
$ sales : num 27595 9958 6126 16246 21783 ...
$ roe : num 14.1 10.9 23.5 5.9 13.8 ...
$ pcroe : num 106.4 -30.6 -16.3 -25.7 -3 ...
$ ros : int 191 13 14 -21 56 55 62 44 37 37 ...
$ indus : int 1 1 1 1 1 1 1 1 1 1 ...
$ finance : int 0 0 0 0 0 0 0 0 0 0 ...
$ consprod: int 0 0 0 0 0 0 0 0 0 0 ...
$ utility : int 0 0 0 0 0 0 0 0 0 0 ...
$ lsalary : num 7 6.91 7.02 6.36 7.22 ...
$ lsales : num 10.23 9.21 8.72 9.7 9.99 ...
- attr(*, "time.stamp")= chr "25 Jun 2011 23:03"
write.csv(ceosal1,"ceosal1.csv")
library(xlsx)
write.xlsx(ceosal1,"ceosal1.xlsx")
Now you can download this data here.
Using R, the estimated model looks like:
fit<-lm(lsalary~lsales+roe+ros, data=ceosal1)
summary(fit)
Call:
lm(formula = lsalary ~ lsales + roe + ros, data = ceosal1)
Residuals:
Min 1Q Median 3Q Max
-0.96060 -0.27144 -0.03264 0.22563 2.79805
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.3117125 0.3154329 13.669 < 2e-16 ***
lsales 0.2803149 0.0353200 7.936 1.34e-13 ***
roe 0.0174168 0.0040923 4.256 3.17e-05 ***
ros 0.0002417 0.0005418 0.446 0.656
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.4832 on 205 degrees of freedom
Multiple R-squared: 0.2827, Adjusted R-squared: 0.2722
F-statistic: 26.93 on 3 and 205 DF, p-value: 1.001e-14
Click here to see how to run this regression in Stata.
The proportionate effect on \(\widehat{salary}\) is \(.00024(50) = .012\), or 1.2%. Therefore, a 50-point ceteris paribus increase in \(ros\) is predicted to increase \(salary\) by only 1.2%. Practically speaking, this is a very small effect for such a large change in \(ros\).
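This arithmetic can be checked directly in base R (no data needed; .00024 is the rounded \(ros\) coefficient from the regression output above):

```r
b_ros <- 0.00024            # rounded ros coefficient from the fitted model
# Effect of a 50-point increase in ros, expressed in percent
effect_pct <- 100 * b_ros * 50
effect_pct                  # 1.2 (percent)
```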
Answer:
The 10% critical value for a one-tailed test, using \(df = \infty\), is obtained from Table G.2 as 1.282. The \(t\) statistic on \(ros\) is \(.00024/.00054 ≈ .44\), which is well below the critical value. Therefore, we fail to reject \(H_0\) at the 10% significance level.
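The same conclusion follows by computing the \(t\) statistic and the exact 10% one-tailed critical value in base R, using the coefficient and standard error reported above (205 residual degrees of freedom):

```r
b  <- 0.0002417              # ros coefficient from the summary above
se <- 0.0005418              # its standard error
t_ros  <- b / se             # t statistic
crit10 <- qt(0.90, df = 205) # 10% one-tailed critical value
c(t = t_ros, crit = crit10)  # t is about 0.45, crit about 1.29
t_ros > crit10               # FALSE: fail to reject H0
```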
Answer:
Based on this sample, the estimated \(ros\) coefficient appears to be different from zero only because of sampling variation. On the other hand, including \(ros\) may not be causing any harm; it depends on how correlated it is with the other independent variables (although these are very significant even with \(ros\) in the equation).
If you are a policy maker trying to estimate the causal effect of per-student spending on math test performance, explain why the first equation is more relevant than the second. What is the estimated effect of a 10% increase in expenditures per student?
9 In Problem 3 in Chapter 3, we estimated the equation:
\[\begin{align} \widehat{sleep} =& 3,638.25 - .148 totwrk - 11.13 educ + 2.20 age\\ & (112.28)\quad (.0172)\quad\quad\quad (5.88)\quad\quad (1.45)\\ &\quad\quad\quad\quad n=706, R^2 = .113, \end{align}\]
where we now report standard errors along with the estimates.
Show your work.
Answer:
With \(df = 706 − 4 = 702\), we use the standard normal critical value (\(df = \infty\) in Table G.2), which is 1.96 for a two-tailed test at the 5% level. Now \(t_{educ} = −11.13/5.88 ≈ −1.89\), so \(|t_{educ}| = 1.89 < 1.96\), and we fail to reject \(H_0: \beta_{educ} = 0\) at the 5% level. Also, \(t_{age} ≈ 1.52\), so \(age\) is also statistically insignificant at the 5% level.
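These \(t\) ratios can be reproduced in base R from the reported estimates and standard errors; the exact critical value with 702 df is essentially the normal 1.96:

```r
t_educ <- -11.13 / 5.88             # t statistic for educ
t_age  <-   2.20 / 1.45             # t statistic for age
crit5  <- qt(0.975, df = 702)       # two-tailed 5% critical value
round(c(t_educ, t_age, crit5), 3)   # -1.893  1.517  1.963
abs(c(t_educ, t_age)) > crit5       # FALSE FALSE: both insignificant
```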
Are \(educ\) and \(age\) jointly significant in the original equation at the 5% level? Justify your answer.
Answer:
We can compute the \(R^2\) form of the \(F\) statistic for joint significance: \(F = \frac{(0.113−0.103)/2}{(1−0.113)/702} \approx 3.96\). The 5% critical value from the \(F_{2,702}\) distribution can be approximated using a denominator \(df = \infty\): 3.00. Therefore, \(educ\) and \(age\) are jointly significant at the 5% level. (In fact, the \(p\)-value is about 0.019, so \(educ\) and \(age\) are jointly significant even at the 2% level.)
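The \(F\) statistic, its exact critical value, and the \(p\)-value follow directly from the two \(R^2\) values (0.103 is the \(R^2\) of the restricted model that drops \(educ\) and \(age\)):

```r
r2_ur <- 0.113    # unrestricted R-squared
r2_r  <- 0.103    # restricted R-squared (educ and age dropped)
q  <- 2           # number of restrictions
df <- 702         # residual df in the unrestricted model
F  <- ((r2_ur - r2_r) / q) / ((1 - r2_ur) / df)
F                                 # about 3.96
qf(0.95, q, df)                   # 5% critical value, about 3.01
pf(F, q, df, lower.tail = FALSE)  # p-value, about 0.019
```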
Answer:
Not really. These variables are jointly significant, but including them changes the coefficient on \(totwrk\) only from −0.151 to −0.148.
Answer:
The \(t\) and \(F\) statistics that we used assume homoskedasticity. If there is heteroskedasticity in the equation, the tests are no longer valid.
4 In the simple regression model (5.16), under the first four Gauss-Markov assumptions, we showed that estimators of the form (5.17) are consistent for the slope, \(\beta_1\). Given such an estimator, define an estimator of \(\beta_0\) by \(\widetilde{\beta}_0 = \overline{y}-\widetilde{\beta}_{1}\overline{x}\). Show that \(plim \widetilde{\beta_0}=\beta_0\).
Answer:
Write \(y = \beta_0 + \beta_1 x + u\), and take the expected value: \(E(y) = \beta_0 + \beta_{1}E(x) + E(u)\), or \(\mu_y = \beta_0 + \beta_{1}\mu_x\), since \(E(u) = 0\), where \(\mu_y = E(y)\) and \(\mu_x = E(x)\). We can rewrite this as \(\beta_0 = \mu_y − \beta_1 \mu_x\). Now, \(\widetilde{\beta}_0 = \overline{y} − \widetilde{\beta}_{1}\overline{x}\). Taking the plim of this we have \(plim(\widetilde{\beta}_0) = plim(\overline{y} − \widetilde{\beta}_1\overline{x}) = plim(\overline{y}) − plim(\widetilde{\beta}_1)\cdot plim(\overline{x}) = \mu_y − \beta_1 \mu_x = \beta_0\), where we use the fact that \(plim(\overline{y}) = \mu_y\) and \(plim(\overline{x}) = \mu_x\) by the law of large numbers, and \(plim(\widetilde{\beta}_1) = \beta_1\). We have also used parts of Property PLIM.2 from Appendix C.
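A quick simulation illustrates the consistency result. Here \(\beta_0 = 2\) and \(\beta_1 = 3\) are hypothetical values chosen for the sketch; the intercept estimator \(\widetilde{\beta}_0 = \overline{y} − \widetilde{\beta}_1\overline{x}\) lands close to the true \(\beta_0\) when \(n\) is large:

```r
set.seed(1)
n <- 1e5
x <- rnorm(n, mean = 1)      # E(x) = 1
u <- rnorm(n)                # E(u) = 0
y <- 2 + 3 * x + u           # true beta0 = 2, beta1 = 3
b1 <- cov(x, y) / var(x)     # a consistent estimator of the slope
b0 <- mean(y) - b1 * mean(x) # the intercept estimator from the exercise
c(b0 = b0, b1 = b1)          # both close to the true values (2, 3)
```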
3 Using the data in RDCHEM, the following equation was obtained by OLS:
\[\begin{align} \widehat{rditens}=&2.613+.00030sales-.0000000070 sales^2\\ &\quad(.429)\quad(.00014)\quad\quad(.0000000037)\\ &n=32,R^2 = .1484 \end{align}\]
Answer:
\[\begin{align} \frac{\partial\, \widehat{rdintens}}{\partial\, sales} &= .00030 − 2(.0000000070)\,sales = 0\\ \Rightarrow\quad sales^* &= \frac{.00030}{2(.0000000070)} \approx 21{,}428.57 \end{align}\]
At about $21,428.57 million of \(sales\) (roughly $21.4 billion), \(rdintens\) reaches its highest point. When \(sales\) exceeds $21,428.57 million, the marginal effect of \(sales\) on \(rdintens\) becomes negative.
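The turning point is simple arithmetic on the two reported coefficients:

```r
b1 <- 0.00030                 # coefficient on sales
b2 <- 0.0000000070            # absolute value of the coefficient on sales^2
sales_star <- b1 / (2 * b2)   # sales level where the marginal effect is zero
sales_star                    # 21428.57 (millions of dollars)
```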
Answer:
\[ H_0 : \beta_2 = 0,\\ H_1 : \beta_2 \neq 0 \]
\(t = −0.0000000070 / 0.0000000037 ≈ −1.89\), and the two-tailed critical value with 29 degrees of freedom is about 2.045 at the 5% level and 1.699 at the 10% level.
If \(|t| > \mbox{critical}\,\, t\) => Reject \(H_0 \\\)
Since \(1.89 > 1.699\) but \(1.89 < 2.045\), \(sales^2\) has a statistically significant impact on \(rdintens\) at the 10% level only, which gives some (though not strong) justification for keeping \(sales^2\) in the model.
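As a check, the \(t\) ratio and the exact critical values with 29 df can be computed in base R from the reported coefficient and standard error:

```r
t_sq <- -0.0000000070 / 0.0000000037   # t statistic for sales^2
crit5  <- qt(0.975, df = 29)           # 5% two-tailed critical value
crit10 <- qt(0.95,  df = 29)           # 10% two-tailed critical value
round(c(t_sq, crit5, crit10), 3)       # -1.892  2.045  1.699
abs(t_sq) > c(crit5, crit10)           # FALSE TRUE: significant only at 10%
```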
Answer:
Using R we can do:
str(rdchem)
'data.frame': 32 obs. of 8 variables:
$ rd : num 430.6 59 23.5 3.5 1.7 ...
$ sales : num 4570 2830 597 134 42 ...
$ profits : num 186.9 467 107.4 -4.3 8 ...
$ rdintens: num 9.42 2.08 3.94 2.62 4.05 ...
$ profmarg: num 4.09 16.5 18 -3.22 19.05 ...
$ salessq : num 20886730 8008900 356170 17849 1764 ...
$ lsales : num 8.43 7.95 6.39 4.89 3.74 ...
$ lrd : num 6.065 4.078 3.157 1.253 0.531 ...
- attr(*, "time.stamp")= chr "25 Jun 2011 23:03"
library(dplyr)
rdchem<-rdchem%>%
mutate(salesbil = sales/1000,
salesbil2 = sales^2 / 1000^2) #Adding the new variables on the dataset
str(rdchem)
'data.frame': 32 obs. of 10 variables:
$ rd : num 430.6 59 23.5 3.5 1.7 ...
$ sales : num 4570 2830 597 134 42 ...
$ profits : num 186.9 467 107.4 -4.3 8 ...
$ rdintens : num 9.42 2.08 3.94 2.62 4.05 ...
$ profmarg : num 4.09 16.5 18 -3.22 19.05 ...
$ salessq : num 20886730 8008900 356170 17849 1764 ...
$ lsales : num 8.43 7.95 6.39 4.89 3.74 ...
$ lrd : num 6.065 4.078 3.157 1.253 0.531 ...
$ salesbil : num 4.57 2.83 0.597 0.134 0.042 ...
$ salesbil2: num 20.88673 8.0089 0.35617 0.01785 0.00176 ...
- attr(*, "time.stamp")= chr "25 Jun 2011 23:03"
I can generate the new dataset to run in other econometric software:
write.csv(rdchem,"rdchem.csv")
write.xlsx(rdchem,"rdchem.xlsx")
You can download the .csv file here and the .xlsx file here
The new fitted model is:
summary(lm(rdintens~salesbil+salesbil2, data=rdchem))
Call:
lm(formula = rdintens ~ salesbil + salesbil2, data = rdchem)
Residuals:
Min 1Q Median 3Q Max
-2.1418 -1.3630 -0.2257 1.0688 5.5808
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.612512 0.429442 6.084 1.27e-06 ***
salesbil 0.300571 0.139295 2.158 0.0394 *
salesbil2 -0.006946 0.003726 -1.864 0.0725 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.788 on 29 degrees of freedom
Multiple R-squared: 0.1484, Adjusted R-squared: 0.08969
F-statistic: 2.527 on 2 and 29 DF, p-value: 0.09733
Answer:
By comparing the standard errors and the \(R^2\) we can see:
\[\begin{align} \widehat{rdintens}=&2.613+.00030sales-.0000000070 sales^2\\ &\quad(.429)\quad(.00014)\quad\quad(.0000000037)\\ &n=32,R^2 = .1484 \end{align}\]
versus
\[\begin{align} \widehat{rdintens}=&2.613+0.301salesbil-0.007salesbil2\\ &\quad(.429)\quad(0.139)\quad(0.004)\\ &n=32,R^2 = .1484 \end{align}\]
The two equations are the same model with \(sales\) rescaled from millions to billions of dollars: each coefficient and its standard error are multiplied by the same factor, so every \(t\) ratio and the fit (\(R^2 = .1484\)) are identical. The second version may be preferred simply because its coefficients are of readable magnitude.
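The equivalence can be verified by rescaling the reported coefficients: dividing \(sales\) by 1,000 multiplies the coefficient on \(sales\) by 1,000 and the coefficient on \(sales^2\) by \(1{,}000^2 = 10^6\):

```r
b_sales   <- 0.00030          # coefficient per million dollars of sales
b_salessq <- -0.0000000070    # coefficient on sales^2
# Rescale to billions of dollars
b_sales   * 1000              # 0.30   (vs. 0.300571 reported; rounding aside)
b_salessq * 1e6               # -0.007 (vs. -0.006946 reported)
```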
Answer:
By declaring the hypothesis test:
\[ H_0 : \widehat{\beta}_{lexppp} = 0\\ H_0 : \widehat{\beta}_{lexppp} \neq 0 \]
By dividing the coefficient \(\widehat{\beta}_{lexppp}\) by his standard error ($t_{{lexppp}}={lexppp}/se(_{lexppp})9.01/4.01 2.231.96 $) and compare it to the critical value at 5% level of significance at 224 degrees of freedom with four explanatory variables.
The coefficient of (\(lexppp\)) is statistically significant because the null hypothesis is rejected.
For the second regression model the coefficient of (\(lexppp\)) is statistically insignificant because the null hypothesis is not rejected (\(t_{\beta_{lexppp}}=1.93/2.82\)).
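Both \(t\) ratios, and the exact 5% critical value with 224 df, can be reproduced from the estimates and standard errors in the regression output reported below:

```r
t1 <- 9.00648 / 4.03530       # t for lexppp in the first model
t2 <- 1.93215 / 2.82480       # t for lexppp in the model adding read4
crit5 <- qt(0.975, df = 224)  # exact 5% two-tailed critical value
round(c(t1, t2, crit5), 3)    # 2.232 0.684 1.971
c(t1, t2) > crit5             # TRUE FALSE
```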
The estimated change in math scores follows from the level-log form of the model:
\[\begin{align} \widehat{\Delta math4} =& (9.01/100)(\%\Delta exppp)\\ =& 0.0901 \times 10 \approx 0.9 \end{align}\]
That is, a 10% increase in expenditure per student is predicted to increase the math pass rate \(math4\) by about 0.9 percentage points.
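In a level-log model the slope divided by 100 gives the effect of a 1% change in the regressor, so:

```r
b_lexppp   <- 9.01                        # coefficient on log(exppp)
pct_change <- 10                          # a 10% increase in exppp
delta_math4 <- (b_lexppp / 100) * pct_change
delta_math4   # 0.901 percentage points on the math4 pass rate
```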
Answer:
By calling the dataset, we can see:
str(meapsingle)
'data.frame': 229 obs. of 18 variables:
$ dcode : int 63010 63010 63270 63270 63010 63010 63010 63130 63130 63130 ...
$ bcode : int 3030 3133 2023 2978 316 5670 1494 1631 1753 2254 ...
$ math4 : num 92.8 100 72.1 76.1 95.2 88.6 95.2 66.7 83.9 95.7 ...
$ read4 : num 82.5 94.3 46.5 65.7 80.6 72.7 90.5 46.3 44.6 56.5 ...
$ enroll : int 607 370 220 356 329 331 288 452 428 238 ...
$ exppp : num 6620 6620 5608 5830 6620 ...
$ free : num 1 0 5.9 8.1 0.3 1.2 12.2 50.2 40.2 24.4 ...
$ reduced : num 0.7 0 5 2.8 0.3 0.9 5.2 17.5 10 17.6 ...
$ lunch : num 1.7 0 10.9 10.9 0.6 2.1 17.4 67.7 50.2 42 ...
$ medinc : int 110322 110322 65119 65119 109313 109313 109313 43750 43750 43750 ...
$ totchild: int 4076 4076 2524 2524 3486 3486 3486 4651 4651 4651 ...
$ married : int 3542 3542 2091 2091 3241 3241 3241 3258 3258 3258 ...
$ single : int 534 534 433 433 245 245 245 1393 1393 1393 ...
$ pctsgle : num 13.1 13.1 17.16 17.16 7.03 ...
$ zipcode : int 48009 48009 48017 48017 48025 48025 48025 48030 48030 48030 ...
$ lenroll : num 6.41 5.91 5.39 5.87 5.8 ...
$ lexppp : num 8.8 8.8 8.63 8.67 8.8 ...
$ lmedinc : num 11.6 11.6 11.1 11.1 11.6 ...
We can generate the dataset in .csv or .xlsx format using:
write.csv(meapsingle, "meapsingle.csv")
write.xlsx(meapsingle,"meapsingle.xlsx")
The files are available here and here.
By comparing the two regression models:
model1<-lm(math4~lexppp + free + lmedinc + pctsgle, data=meapsingle)
model2<-lm(math4~lexppp + free + lmedinc + pctsgle +read4, data=meapsingle)
summary(model1)
Call:
lm(formula = math4 ~ lexppp + free + lmedinc + pctsgle, data = meapsingle)
Residuals:
Min 1Q Median 3Q Max
-33.259 -7.422 1.615 7.274 49.524
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 24.48949 59.23781 0.413 0.6797
lexppp 9.00648 4.03530 2.232 0.0266 *
free -0.42164 0.07064 -5.969 9.27e-09 ***
lmedinc -0.75221 5.35816 -0.140 0.8885
pctsgle -0.27444 0.16086 -1.706 0.0894 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 11.59 on 224 degrees of freedom
Multiple R-squared: 0.4716, Adjusted R-squared: 0.4622
F-statistic: 49.98 on 4 and 224 DF, p-value: < 2.2e-16
summary(model2)
Call:
lm(formula = math4 ~ lexppp + free + lmedinc + pctsgle + read4,
data = meapsingle)
Residuals:
Min 1Q Median 3Q Max
-29.5690 -4.6729 -0.0349 4.3644 24.8425
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 149.37870 41.70293 3.582 0.000419 ***
lexppp 1.93215 2.82480 0.684 0.494688
free -0.06004 0.05399 -1.112 0.267297
lmedinc -10.77595 3.75746 -2.868 0.004529 **
pctsgle -0.39663 0.11143 -3.559 0.000454 ***
read4 0.66656 0.04249 15.687 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 8.012 on 223 degrees of freedom
Multiple R-squared: 0.7488, Adjusted R-squared: 0.7432
F-statistic: 132.9 on 5 and 223 DF, p-value: < 2.2e-16
By putting the results side by side:
=========================================================================
                             Dependent variable: math4
                     ----------------------------------------------------
                          model1 (1)                 model2 (2)
-------------------------------------------------------------------------
Constant                   24.489                    149.379***
                          (59.238)                   (41.703)
lexppp                      9.006**                    1.932
                           (4.035)                    (2.825)
free                       -0.422***                  -0.060
                           (0.071)                    (0.054)
lmedinc                    -0.752                    -10.776***
                           (5.358)                    (3.757)
pctsgle                    -0.274*                    -0.397***
                           (0.161)                    (0.111)
read4                                                  0.667***
                                                      (0.042)
-------------------------------------------------------------------------
Observations                 229                        229
R2                         0.472                      0.749
Adjusted R2                0.462                      0.743
Residual Std. Error       11.594 (df = 224)           8.012 (df = 223)
F Statistic               49.979*** (df = 4; 224)   132.941*** (df = 5; 223)
=========================================================================
Note: *p<0.1; **p<0.05; ***p<0.01
By including the variable \(read4\) we see an increase in the \(R^2\), but \(lexppp\) and \(free\) lose statistical significance, while \(lmedinc\) and \(pctsgle\) become significant.
Answer:
The causal question should drive model choice: a specification selected merely to maximize statistical fit can easily lead to an inadequate interpretation of the economic phenomenon. Because \(read4\) is itself a school outcome influenced by spending, controlling for it absorbs the effect of \(lexppp\) that a policy maker wants to estimate, even though it raises the \(R^2\). The pattern of significance across the set of explanatory variables provides a more robust basis for interpretation than a higher, but misleading, \(R^2\).
C8 Use the data in HPRICE1 for this exercise.
\[ price = \beta_0 + \beta_1 lotsize + \beta_2 sqrft + \beta_3 bdrms + u \]
and report the results in the usual form, including the standard error of the regression. Obtain the predicted price when we plug in \(lotsize = 10{,}000\), \(sqrft = 2{,}300\), and \(bdrms = 4\); round this price to the nearest dollar.
Answer:
Calling R to access the dataset:
str(hprice1)
'data.frame': 88 obs. of 10 variables:
$ price : num 300 370 191 195 373 ...
$ assess : num 349 352 218 232 319 ...
$ bdrms : int 4 3 3 3 4 5 3 3 3 3 ...
$ lotsize : num 6126 9903 5200 4600 6095 ...
$ sqrft : int 2438 2076 1374 1448 2514 2754 2067 1731 1767 1890 ...
$ colonial: int 1 1 0 1 1 1 1 1 0 0 ...
$ lprice : num 5.7 5.91 5.25 5.27 5.92 ...
$ lassess : num 5.86 5.86 5.38 5.45 5.77 ...
$ llotsize: num 8.72 9.2 8.56 8.43 8.72 ...
$ lsqrft : num 7.8 7.64 7.23 7.28 7.83 ...
- attr(*, "time.stamp")= chr "25 Jun 2011 23:03"
Generating dataset in .csv and .xlsx format for download:
write.csv(hprice1, "hprice1.csv")
write.xlsx(hprice1, "hprice1.xlsx")
Now estimating the model:
model <- lm(price~lotsize + sqrft + bdrms, data=hprice1)
summary(model)
Call:
lm(formula = price ~ lotsize + sqrft + bdrms, data = hprice1)
Residuals:
Min 1Q Median 3Q Max
-120.026 -38.530 -6.555 32.323 209.376
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.177e+01 2.948e+01 -0.739 0.46221
lotsize 2.068e-03 6.421e-04 3.220 0.00182 **
sqrft 1.228e-01 1.324e-02 9.275 1.66e-14 ***
bdrms 1.385e+01 9.010e+00 1.537 0.12795
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 59.83 on 84 degrees of freedom
Multiple R-squared: 0.6724, Adjusted R-squared: 0.6607
F-statistic: 57.46 on 3 and 84 DF, p-value: < 2.2e-16
Using the estimated coefficients we can compute the predicted price. Note that the exercise specifies \(lotsize = 10{,}000\):
estimated <- summary(model)$coef[1,1]+summary(model)$coef[2,1]*10000+summary(model)$coef[3,1]*2300+summary(model)$coef[4,1]*4
estimated
[1] 336.7067
This is the same as taking \(\widehat{price}=-2.177\times 10^{1} + 2.068\times 10^{-3}lotsize + 1.228\times 10^{-1}sqrft + 1.385\times 10^{1}bdrms\) and plugging in the given values (using the unrounded coefficients):
\[ \widehat{price}=-21.770 + 0.0020677\times 10{,}000 + 0.122778\times 2{,}300 + 13.85252\times 4 \approx 336.71 \]
Since \(price\) is measured in thousands of dollars, the predicted price rounded to the nearest dollar is about US$ 336,707.
Answer:
confint(model)
2.5 % 97.5 %
(Intercept) -80.384661400 36.844045104
lotsize 0.000790769 0.003344644
sqrft 0.096454149 0.149102222
bdrms -4.065140551 31.770184040
The estimated \(\widehat{\beta}_1 = 2.068\times 10^{-3}\) lies within the interval [0.000791, 0.003345], \(\widehat{\beta}_2 = 1.228\times 10^{-1}\) within [0.096454, 0.149102], and \(\widehat{\beta}_3 = 1.385\times 10^{1}\) within [−4.065, 31.770]. Note that the interval for \(bdrms\) contains zero, consistent with its insignificant \(t\) statistic.
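confint() is just \(\widehat{\beta}_j \pm t_{0.975,\,84}\,se(\widehat{\beta}_j)\); the \(lotsize\) interval, for example, can be reconstructed from the reported estimate and standard error:

```r
b  <- 0.0020677                 # lotsize coefficient (2.068e-03 in the output)
se <- 0.00064213                # its standard error (6.421e-04)
ci <- b + c(-1, 1) * qt(0.975, df = 84) * se
ci   # approximately (0.000791, 0.003345), matching confint(model)
```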
Answer:
First we can simulate the house described in (i), using the lot size specified in the exercise:
lotsize <- c(10000)
sqrft <- c(2300)
bdrms <- c(4)
unknown.price.zero <- data.frame(lotsize, sqrft, bdrms)
By using the predict function:
predict(model, newdata = unknown.price.zero, interval = "confidence")
The fit column reproduces the point prediction of about 336.71, and lwr and upr give the bounds of the 95% confidence interval for the expected selling price of houses with \(lotsize = 10{,}000\), \(sqrft = 2{,}300\), and \(bdrms = 4\). Note that this is an interval for the mean price of such houses, not for the price of one particular house; the latter (interval = "prediction") would be wider, since it also reflects the variance of the error \(u\).
Wooldridge, J.M. Introductory Econometrics: A Modern Approach, 6th ed. Cengage Learning, 2016.