
Hypothesis Testing for Multiple Regression

In a multiple regression model, more than one explanatory variable is used to explain or predict one response variable $y$.
  • We conduct an F-test to see if the overall model is significant in predicting $y$.
  • Specifically, we are assessing how good all the explanatory variables are, collectively, at predicting $y$.

Hypotheses for F-test:

$H_o: \beta_1=\beta_2=\beta_3=\dots=\beta_k=0$ ("The overall model is not significant.")
$H_a:$ at least one $\beta_i\ne0$ ("The overall model is significant.")

"F" for "full" model
$\boxed{F=\frac{\frac{SSR}{k}}{\frac{SSE}{\left(n-k-1\right)}}=\frac{MSR}{MSE}}$

where,
  • $k=$ # of explanatory variables, $x_i$'s
  • df numerator $=k$
  • df denominator $=n-k-1$
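
As a minimal sketch of the boxed formula, the F-statistic can be computed directly from the two sums of squares. The function name `f_statistic` and the numbers used below are made up for illustration; they do not come from any example in this section.

```python
def f_statistic(ssr, sse, n, k):
    """Return (F, df_numerator, df_denominator) for the overall F-test."""
    df_num = k            # one df per explanatory variable
    df_den = n - k - 1    # observations minus the k slopes and the intercept
    if df_den <= 0:
        raise ValueError("need more than k + 1 observations")
    msr = ssr / df_num    # mean square regression
    mse = sse / df_den    # mean square error
    return msr / mse, df_num, df_den


# Hypothetical inputs: SSR = 300, SSE = 100, n = 25 observations, k = 4 predictors.
F, df_num, df_den = f_statistic(300.0, 100.0, n=25, k=4)
print(F, df_num, df_den)  # 15.0 4 20
```

Here MSR is 300/4 = 75 and MSE is 100/20 = 5, so F is 15 on 4 and 20 degrees of freedom.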


Important
  • The F-stat only tells us if the overall model is sufficient.
  • It does not tell us which individual explanatory variables are significant!
  • Each explanatory variable will have its own t-score so you will be able to assess the significance of each one by running t-tests.
  • There is only one F-score in a regression model.

F-Distribution

The F-distribution is one-sided and skewed to the right. It starts at 0 and goes to infinity. The larger the F-score, the stronger the evidence that the overall model is significant.






Example: Hypothesis Testing for Multiple Regression

We wish to predict grade using 4 predictor variables:

$x_1=$ hours of studying
$x_2=$ student's IQ
$x_3=$ student's cumulative GPA
$x_4=$ hours spent playing video games
$y=$ grade

We randomly sampled 16 students. Results:



We test if the overall model is appropriate to predict grade. Hypotheses:

$H_o: \beta_1=\beta_2=\beta_3=\beta_4=0$
$H_a:$ at least one $\beta_i\ne0$
We conduct an F-test to test if the overall model is sufficient:

$\boxed{F=\frac{MSR}{MSE}=\frac{MSM}{MSE}}$

df numerator $=k$
df denominator $=n-k-1$



(i) What percent of grade is explained by the model?
$R^2=\frac{SSR}{SST}=\frac{6058.70905}{6224.9375}=0.9733$

97.33% of the variation in grade is explained by the model.
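
The same arithmetic can be sketched in a few lines of Python. The two sums of squares are the ones quoted in part (i); the variable names are our own.

```python
ssr = 6058.70905   # regression (explained) sum of squares
sst = 6224.9375    # total sum of squares

r_squared = ssr / sst          # share of the variation in grade explained by the model
unexplained = 1 - r_squared    # share left in the residuals

print(f"R^2 = {r_squared:.4f}")  # R^2 = 0.9733
print(f"unexplained share = {unexplained:.4f}")
```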


(ii) Based on the F-stat and its p-value, how do you conclude?
(a) The overall model is sufficient.
(b) The overall model is not sufficient.
(c) All the explanatory variables in the model are significant.
(d) None of the explanatory variables in the model are significant.

The F-score is 100.32 and the p-value is approximately 0.00, so we reject $H_o$: (a) the overall model is sufficient.
The F-stat only tells us if the overall model is sufficient. It does not tell us which individual explanatory variables are significant!

(iii) What does the coefficient $b_4$ (i.e. hours spent playing video games) tell us?

For each hour spent playing video games, your grade is increased/reduced by 1.45 percentage points, all else equal. A better grade arises if a student spends more/less time playing video games.
Reduced; less

(iv) Which of the following must be true about $x_4$?
(a) It is a significant explanatory variable on its own.
(b) It is a significant explanatory variable in this multiple regression model.
(c) It is not a significant explanatory variable on its own.
(d) It is not a significant explanatory variable in this multiple regression model.

To only assess one explanatory variable in the multiple regression model, we look at its t-score:
$t_4=-3.68433$
The p-value is 0.0036, which is less than 1%, indicating that "Games (hrs)" is a very significant explanatory variable.

$x_4$ is a significant explanatory variable as a "team member" in this multiple regression model, so the answer is (b). We don't know if it is significant by itself; to find out, we would need to fit a simple linear regression model with $x_4$ alone. It very well could be significant, but we can't tell from the information provided.
"Beyonce"
(v) Sammy studied for 40 hours, has an IQ of 130, has a GPA of 3.50, and played 3 hours of Mario Kart. Predict his grade.

$\hat{y}=42.099+0.535\left(40\right)+0.037\left(130\right)+3.446\left(3.50\right)-1.454\left(3\right)=76.008$
We predict his grade will be 76%.
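
A short sketch of the same plug-in calculation, using the rounded coefficients from the prediction equation above. The dictionary keys and the helper `predict_grade` are names we chose for illustration.

```python
# Rounded coefficients as they appear in the prediction equation.
intercept = 42.099
coefs = {"study_hours": 0.535, "iq": 0.037, "gpa": 3.446, "game_hours": -1.454}


def predict_grade(student):
    """Intercept plus each coefficient times the student's value for that variable."""
    return intercept + sum(coefs[name] * student[name] for name in coefs)


sammy = {"study_hours": 40, "iq": 130, "gpa": 3.50, "game_hours": 3}
print(round(predict_grade(sammy), 3))  # 76.008
```

Changing one entry of `sammy` while holding the others fixed shows the "all else equal" reading of each coefficient: one more hour of games lowers the prediction by 1.454.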
(vi) At the 5% significance level, test if GPA should be included in the model.

$H_o:\beta_3=0$
$H_a:\beta_3\ne0$

$df=n-k-1=16-4-1=11$

GPA: $t_3=\frac{b_3}{SE\left(b_3\right)}=\frac{3.44591}{3.12076}=1.1049$

The t-score is less than 2 so we can already tell it is not significant. But let's verify that using the t-table.

CV(0.05,2-tail,11)=2.201. The t-score (1.1049) is less than CV (2.201), so we fail to reject $H_o$.

Given the t-score (1.1049) and df (11), the p-value is between 0.20 and 0.30, which means the p-value is greater than the significance level (0.05), so we fail to reject $H_o$.




[Software] Exact p-value =TDIST(1.1049,11,2) $=0.293>\alpha=0.05$
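
If no spreadsheet is at hand, the two-tailed p-value and the critical value can be approximated with the Python standard library alone. The sketch below integrates the t density numerically with Simpson's rule and finds the critical value by bisection. This is one possible approach rather than what the spreadsheet does internally, and the function names are our own. The results should land close to the 0.293 and 2.201 quoted above.

```python
import math


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)


def two_tailed_p(t, df, steps=2000):
    """P(|T| > |t|): integrate the density from 0 to |t| with Simpson's rule."""
    t = abs(t)
    h = t / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3  # P(0 < T < |t|)
    return 1 - 2 * area


def critical_value(alpha, df):
    """Two-tailed critical value: the t at which two_tailed_p equals alpha."""
    lo, hi = 0.0, 100.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if two_tailed_p(mid, df) > alpha:
            lo = mid  # p still too large, so the critical value is further out
        else:
            hi = mid
    return (lo + hi) / 2


df = 16 - 4 - 1  # n - k - 1
p_gpa = two_tailed_p(1.1049, df)
cv = critical_value(0.05, df)
print(round(p_gpa, 3), round(cv, 3))
print("reject Ho" if p_gpa < 0.05 else "fail to reject Ho")
```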


GPA is/is not a significant variable for explaining grade; it should/should not be removed from the model.

is not; should be removed

(vii) Given the t-stats for each explanatory variable, what do you recommend for the model?
Hours of study: KEEP REMOVE
IQ: KEEP REMOVE
GPA: KEEP REMOVE
Hours of playing video games: KEEP REMOVE

Keep: hours study, games
Remove: IQ, GPA

Example: Hypothesis Testing for Multiple Regression

We want to predict the earnings $y$ of an Instagram "Influencer" based on three explanatory variables:

$x_1=$ number of followers (in thousands)
$x_2=$ hours of volunteer work
$x_3=$ grade

$H_o: \beta_1=\beta_2=\beta_3=0$ ("The overall model is not significant.")
$H_a:$ at least one $\beta_i\ne0$ ("The overall model is significant.")


PARTIAL OUTPUT

(a) Determine the F-statistic:
$F=\frac{\frac{SSR}{k}}{\frac{SSE}{\left(n-k-1\right)}}=\frac{MSR}{MSE}$

$F=\frac{81568.86}{1095.85}=74.43$

(b) What are the degrees of freedom? (For the F-stat, there are two df's.)

df numerator $=k=3$

df denominator $=n-k-1=15-3-1=11$

The sample size $n$ is not 14! The df (total) $=n-1=14$, so $n=15$.
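
Parts (a) and (b) can be checked with a few lines of Python. The two mean squares are the ones quoted in part (a). The total df of 14 is inferred from the note above, since the output itself is not reproduced here. The variable names are our own.

```python
msr = 81568.86   # mean square regression, from part (a)
mse = 1095.85    # mean square error, from part (a)
k = 3            # followers, volunteer hours, grade
df_total = 14    # assumed "Total" df in the ANOVA table, equal to n - 1

n = df_total + 1       # sample size: 15, not 14
df_num = k
df_den = n - k - 1     # equivalently df_total - k
F = msr / mse

print(n, df_num, df_den, round(F, 2))  # 15 3 11 74.43
```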
(c) At the 5% significance level, what is the critical value for F? [Use F-table]
$CV=3.5874$
The F-score must be larger than 3.5874 to reject $H_o$.



(d) At the 1% significance level, what is the critical value for F? [Use F-table]
$CV=6.2167$
The F-score must be larger than 6.2167 to reject $H_o$.



(e) What is the p-value for the F-statistic? [Use F-table]
The F-score (74.43) is larger than the 1% critical value (6.2167), so less than 1% of the area under the F-distribution lies to the right of it.

Therefore, the p-value is less than 0.01.

You can also see the p-value in the ANOVA table under "Significance F", which simply means the "p-value of the F-stat".

We can reject $H_o$ at the 1% significance level (and certainly at the 5% significance level).
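
The F-table only brackets the p-value. As a rough sketch, the right-tail area can also be computed with the standard library, using the identity $P(F>f)=I_x\left(\frac{d_2}{2},\frac{d_1}{2}\right)$ with $x=\frac{d_2}{d_2+d_1f}$, where $I_x$ is the regularized incomplete beta function. The continued-fraction routine below is a textbook numerical method, and all function names are our own. In practice a statistics package would normally be used.

```python
import math


def _beta_cf(a, b, x, max_iter=200, eps=3e-14):
    """Continued fraction for the incomplete beta function (modified Lentz method)."""
    tiny = 1e-300
    qab, qap, qam = a + b, a + 1.0, a - 1.0
    c = 1.0
    d = 1.0 - qab * x / qap
    d = 1.0 / (d if abs(d) > tiny else tiny)
    h = d
    for m in range(1, max_iter + 1):
        m2 = 2 * m
        # Even step of the recurrence.
        aa = m * (b - m) * x / ((qam + m2) * (a + m2))
        d = 1.0 + aa * d
        d = 1.0 / (d if abs(d) > tiny else tiny)
        c = 1.0 + aa / c
        c = c if abs(c) > tiny else tiny
        h *= d * c
        # Odd step of the recurrence.
        aa = -(a + m) * (qab + m) * x / ((a + m2) * (qap + m2))
        d = 1.0 + aa * d
        d = 1.0 / (d if abs(d) > tiny else tiny)
        c = 1.0 + aa / c
        c = c if abs(c) > tiny else tiny
        delta = d * c
        h *= delta
        if abs(delta - 1.0) < eps:
            break
    return h


def reg_inc_beta(a, b, x):
    """Regularized incomplete beta function I_x(a, b)."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    front = math.exp(math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
                     + a * math.log(x) + b * math.log(1.0 - x))
    if x < (a + 1.0) / (a + b + 2.0):
        return front * _beta_cf(a, b, x) / a
    return 1.0 - front * _beta_cf(b, a, 1.0 - x) / b


def f_right_tail(f, df_num, df_den):
    """P(F > f) for an F distribution with (df_num, df_den) degrees of freedom."""
    x = df_den / (df_den + df_num * f)
    return reg_inc_beta(df_den / 2.0, df_num / 2.0, x)


p_value = f_right_tail(74.43, 3, 11)
print(p_value < 0.01)                           # True
print(round(f_right_tail(3.5874, 3, 11), 3))    # close to 0.05
print(round(f_right_tail(6.2167, 3, 11), 3))    # close to 0.01
```

Feeding the two critical values back in is a useful sanity check: they should return tail areas of about 5% and 1%, matching the F-table.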


Wow! The F-score is huge! Does that mean all the explanatory variables are significant?

NO!

Recall:
  • The F-stat only tells us if the overall model is sufficient. It does not tell us which individual explanatory variables are significant!
  • Each explanatory variable will have its own t-score so you will be able to assess the significance of each one by running t-tests.
  • There is only one F-score in a regression model.

Let's see the full ANOVA table:


  • You see that only Followers is a significant explanatory variable because its p-value is low. How much time an Instagram "Influencer" volunteers and their grade are not good predictors of their earnings, based on their large p-values.
  • Also notice that only the confidence interval for the Followers coefficient $\beta_1$ does not contain 0.

$H_o: \beta_1=\beta_2=\beta_3=0$ ("The overall model is not significant.")
$H_a:$ at least one $\beta_i\ne0$ ("The overall model is significant.")

It is true: at least one explanatory variable is significant. In this example, it's just Followers. That is enough to reject the null hypothesis and conclude that the overall model is significant.

Finally, notice how high $R^2$ is. This suggests that the one significant explanatory variable, Followers, is doing almost all the work in explaining earnings.

Michaela is good at statistics but is famous for her cooking website. She believes that the number of new membership subscriptions per month depends on money spent on advertising, number of new recipes posted, number of times her page is shared on social media, and number of guest appearances she makes on TV. She randomly samples 18 months and applies multiple regression:



(i) How is the model overall for explaining new membership subscriptions? Select the null hypothesis. (Check all that apply.)

An account manager's salary ($Y$) is estimated using a regression model. There are 3 explanatory variables: years of experience, number of complaints, and height (cm). Salary is in thousands of dollars.

Use the partial Excel output provided below to answer the series of questions.

(i) Which is the correct regression equation?