Wize University Statistics Textbook > Multiple Regression

Hypothesis Testing for Multiple Regression

In a multiple regression model, more than one explanatory variable is used to explain or predict a single response variable $y$.
We conduct an F-test to see whether the overall model is significant in predicting $y$.
Specifically, we are assessing how good all the explanatory variables are, collectively, at predicting $y$.

Hypotheses for F-test:

$H_o: \beta_1=\beta_2=\beta_3=\dots=\beta_k=0$     ("The overall model is not significant.")
$H_a:$ at least one $\beta_i\ne0$     ("The overall model is significant.")

"F" for "full" model

$\boxed{F=\frac{\frac{SSR}{k}}{\frac{SSE}{(n-k-1)}}=\frac{MSR}{MSE}}$

where,
$k=$ number of explanatory variables ($x_i$'s)
$n=$ sample size
df numerator $=k$
df denominator $=n-k-1$
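
As a minimal sketch of the formula above, the F-statistic can be computed directly from the sums of squares. The function name `f_statistic` and the input values are hypothetical, chosen only to keep the arithmetic easy to follow; they do not come from any example in this section.

```python
def f_statistic(ssr, sse, n, k):
    """Return (F, df_numerator, df_denominator) for the overall F-test.

    ssr: regression sum of squares, sse: error sum of squares,
    n: sample size, k: number of explanatory variables.
    """
    df_num = k
    df_den = n - k - 1
    msr = ssr / df_num   # mean square regression
    mse = sse / df_den   # mean square error
    return msr / mse, df_num, df_den


# Hypothetical inputs (not from the text): SSR = 300, SSE = 100, n = 25, k = 4
f_value, df_num, df_den = f_statistic(300.0, 100.0, 25, 4)
print(f_value, df_num, df_den)  # 15.0 4 20
```

Returning the two degrees of freedom alongside F is a convenience, since both are needed to look up a critical value in the F-table.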


Important
The F-stat only tells us if the overall model is sufficient. 
It does not tell us which individual explanatory variables are significant! 
Each explanatory variable will have its own t-score so you will be able to assess the significance of each one by running t-tests. 
There is only one F-score in a regression model. 
F-Distribution 
The F-distribution is one-sided and skewed to the right. It starts at 0 and extends to infinity. The larger the F-score, the stronger the evidence that the overall model is significant.
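
The shape described above can be checked with a small simulation. This is only a sketch: it builds F-distributed values as a ratio of two scaled chi-square draws using Python's standard library, and the degrees of freedom (4 and 11), the sample count, the seed, and the function name `simulate_f` are all illustrative assumptions.

```python
import random
import statistics


def simulate_f(df_num, df_den, size, seed=0):
    """Draw F-distributed values as a ratio of scaled chi-square draws."""
    rng = random.Random(seed)
    draws = []
    for _ in range(size):
        # A chi-square variable with d degrees of freedom is Gamma(d/2, scale=2).
        chi_num = rng.gammavariate(df_num / 2, 2)
        chi_den = rng.gammavariate(df_den / 2, 2)
        draws.append((chi_num / df_num) / (chi_den / df_den))
    return draws


samples = simulate_f(df_num=4, df_den=11, size=20000)
print(min(samples) >= 0)                                      # values never fall below 0
print(statistics.mean(samples) > statistics.median(samples))  # right skew pulls the mean above the median
```

Both checks should print `True`: the draws are never negative, and the long right tail drags the mean above the median.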


Example: Hypothesis Testing for Multiple Regression
We wish to predict grade using 4 predictor variables:

$x_1=$ hours of studying
$x_2=$ student's IQ
$x_3=$ student's cumulative GPA
$x_4=$ hours spent playing video games
$y=$ grade

We randomly sampled 16 students. Results:

We test if the overall model is appropriate to predict grade. Hypotheses:

$H_o: \beta_1=\beta_2=\beta_3=\beta_4=0$
$H_a:$ at least one $\beta_i\ne0$
  
We conduct an F-test to test if the overall model is sufficient:

$\boxed{F=\frac{MSR}{MSE}=\frac{MSM}{MSE}}$

df numerator $=k=4$
df denominator $=n-k-1=16-4-1=11$


(i) What percent of the variation in grade is explained by the model?

$R^2=\frac{SSR}{SST}=\frac{6058.70905}{6224.9375}=0.9733$

97.33% of the variation in grade is explained by the model.
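
The same arithmetic in a short Python sketch, using the two sums of squares quoted above (the variable names are illustrative):

```python
ssr = 6058.70905   # regression sum of squares, from the calculation above
sst = 6224.9375    # total sum of squares, from the calculation above

r_squared = ssr / sst          # share of total variation explained by the model

print(round(r_squared, 4))     # 0.9733
print(f"{r_squared:.2%}")      # 97.33%
```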

(ii) Based on the F-stat and its p-value, how do you conclude?
(a) The overall model is sufficient.
(b) The overall model is not sufficient.
(c) All the explanatory variables in the model are significant.
(d) None of the explanatory variables in the model are significant. 

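The F-score is 100.32 and its p-value is approximately 0.00, which is below any conventional significance level, so we reject $H_o$: the overall model is sufficient, answer (a).
The F-stat only tells us if the overall model is sufficient. It does not tell us which individual explanatory variables are significant, so (c) and (d) cannot be concluded from it.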

  
(iii) What does the coefficient $b_4$ (i.e. hours spent playing video games) tell us?

For each additional hour spent playing video games, the predicted grade is increased/reduced by 1.45 percentage points, all else equal. A better grade results if a student spends more/less time playing video games.

Reduced; less

(iv) Which of the following must be true about $x_4$?
(a) It is a significant explanatory variable on its own.
(b) It is a significant explanatory variable in this multiple regression model.
(c) It is not a significant explanatory variable on its own.
(d) It is not a significant explanatory variable in this multiple regression model.

To assess a single explanatory variable in the multiple regression model, we look at its t-score:
$t_4=-3.68433$
The p-value is 0.0036, which is less than 1%, indicating that "Games (hrs)" is a very significant explanatory variable.

$x_4$ is a significant explanatory variable as a "team member" in this multiple regression model, so the answer is (b). We don't know whether it is significant by itself; to find out, we would need to fit it alone in a simple linear regression model. It very well could be significant, but we can't tell from the information provided.
"Beyonce"

(v) Sammy studied for 40 hours, has an IQ of 130, has a GPA of 3.50, and played 3 hours of Mario Kart. Predict his grade.

$\hat{y}=42.099+0.535(40)+0.037(130)+3.446(3.50)-1.454(3)=76.008$
We predict his grade will be 76%.
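
The prediction above is a plug-in calculation, which the following sketch reproduces. The coefficients are the ones in the equation; the dictionary keys and the function name `predict_grade` are hypothetical labels introduced for illustration.

```python
# Coefficients as they appear in the prediction equation above.
intercept = 42.099
coefficients = {
    "study_hours": 0.535,
    "iq": 0.037,
    "gpa": 3.446,
    "game_hours": -1.454,
}


def predict_grade(values):
    """Evaluate y-hat = b0 + sum(b_i * x_i) for one student."""
    return intercept + sum(coefficients[name] * x for name, x in values.items())


sammy = {"study_hours": 40, "iq": 130, "gpa": 3.50, "game_hours": 3}
print(round(predict_grade(sammy), 3))  # 76.008
```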

   
(vi) At the 5% significance level, test if GPA should be included in the model. 

$H_o:\beta_3=0$
$H_a:\beta_3\ne0$

$df=n-k-1=11$

GPA: $t_3=\frac{b_3}{SE(b_3)}=\frac{3.44591}{3.12076}=1.1049$

The t-score is less than 2 so we can already tell it is not significant. But let's verify that using the t-table.

CV(0.05, 2-tail, 11) = 2.201. The t-score (1.1049) is less than the critical value (2.201), so we fail to reject $H_o$.

Given the t-score (1.1049) and df (11), the p-value is between 0.20 and 0.30, which is greater than the significance level (0.05), so we fail to reject $H_o$.

[Software] Exact p-value = TDIST(1.1049,11,2) = 0.293 > $\alpha=0.05$
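
The test for GPA can be sketched with Python's standard library alone. The p-value here is approximated by numerically integrating the t density with Simpson's rule, which stands in for the spreadsheet's TDIST function; the helper names `t_pdf` and `two_tailed_p` and the step count are illustrative assumptions. Dividing the rounded values shown above gives about 1.104, marginally below the 1.1049 quoted, and the conclusion is the same.

```python
import math


def t_pdf(x, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)


def two_tailed_p(t, df, steps=2000):
    """Two-tailed p-value via Simpson's rule on the density over [0, |t|]."""
    upper = abs(t)
    h = upper / steps                      # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_half = total * h / 3           # approximates P(0 < T < |t|)
    return 1 - 2 * central_half


b3, se_b3, df = 3.44591, 3.12076, 11       # values from the example above
t3 = b3 / se_b3
critical_value = 2.201                     # two-tailed 5% value for df = 11, from the text
p_value = two_tailed_p(t3, df)

print(round(t3, 3))                # 1.104
print(abs(t3) > critical_value)    # False, so we fail to reject Ho
print(round(p_value, 2))           # 0.29
```

Both routes agree: the t-score is inside the critical value and the p-value is well above 0.05.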


GPA is/is not a significant variable for explaining grade; it should/should not be removed from the model.

is not; should be removed

(vii) Given the t-stats for each explanatory variable, what do you recommend for the model?

Hours of study:                 KEEP    REMOVE
IQ:                             KEEP    REMOVE
GPA:                            KEEP    REMOVE
Hours of playing video games:   KEEP    REMOVE

Keep: hours study, games
Remove: IQ, GPA


Example: Hypothesis Testing for Multiple Regression
We want to predict the earnings $y$ of an Instagram "Influencer" based on three explanatory variables:

$x_1=$ number of followers (in thousands)
$x_2=$ hours of volunteer work
$x_3=$ grade

$H_o: \beta_1=\beta_2=\beta_3=0$        ("The overall model is not significant.")
$H_a:$ at least one $\beta_i\ne0$        ("The overall model is significant.")


PARTIAL OUTPUT

  
(a) Determine the F-statistic:
$F=\frac{\frac{SSR}{k}}{\frac{SSE}{(n-k-1)}}=\frac{MSR}{MSE}$

$F=\frac{81568.86}{1095.85}=74.43$
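
A quick check of this ratio in Python, using the two mean squares quoted above (variable names are illustrative):

```python
msr = 81568.86   # mean square regression, from the calculation above
mse = 1095.85    # mean square error, from the calculation above

f_stat = msr / mse
print(round(f_stat, 2))  # 74.43
```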

  
(b) What are the degrees of freedom? (For the F-stat, there are two df's.)

df numerator $=k=3$

df denominator $=n-k-1=15-3-1=11$

The sample size $n$ is 15, not 14. The total df shown in the output is $n-1=14$.
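
The bookkeeping can be sketched as follows, assuming the total df in the output is 14, as the note above implies. The function name `regression_df` is a hypothetical helper.

```python
def regression_df(n, k):
    """Degrees of freedom for a regression with n observations and k predictors."""
    return {"regression": k, "error": n - k - 1, "total": n - 1}


df_total = 14          # total df, as implied by the note above
k = 3                  # followers, volunteer hours, grade
n = df_total + 1       # sample size recovered from the total df

print(n)                     # 15
print(regression_df(n, k))   # {'regression': 3, 'error': 11, 'total': 14}
```

Note that the regression and error df add up to the total df (3 + 11 = 14), which is a handy consistency check on any ANOVA table.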

(c) At the 5% significance level, what is the critical value for F? [Use F-table]

$CV=3.5874$
The F-score must be larger than 3.5874 to reject $H_o$.


(d) At the 1% significance level, what is the critical value for F? [Use F-table]

$CV=6.2167$
The F-score must be larger than 6.2167 to reject $H_o$.


(e) What is the p-value for the F-statistic? [Use F-table]

$F=74.43>6.2167$, the critical value at the 1% significance level.

Therefore, the p-value is less than 0.01.

You can also see the p-value in the ANOVA table under "Significance F", which simply means the "p-value of the F-stat".

We can reject Ho at the 1% significance level (and certainly at the 5% significance level). 
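
The critical-value decision rule used in parts (c) to (e) can be sketched like this. The F-score and the two critical values are the ones quoted in this example; the function name `decide` is a hypothetical helper.

```python
def decide(f_stat, critical_value):
    """Apply the rule: reject Ho when the F-score exceeds the critical value."""
    return "reject Ho" if f_stat > critical_value else "fail to reject Ho"


f_stat = 74.43
critical_values = {0.05: 3.5874, 0.01: 6.2167}   # F-table values for df = (3, 11), from the text

for alpha, cv in critical_values.items():
    print(f"alpha = {alpha}: F = {f_stat} vs CV = {cv} -> {decide(f_stat, cv)}")
```

Because the critical value grows as the significance level shrinks, rejecting at 1% always implies rejecting at 5%.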

Wow! The F-score is huge! Does that mean all the explanatory variables are significant?

NO!

Recall:
The F-stat only tells us if the overall model is sufficient. It does not tell us which individual explanatory variables are significant! 
Each explanatory variable will have its own t-score so you will be able to assess the significance of each one by running t-tests. 
There is only one F-score in a regression model. 

Let's see the full ANOVA table:

  

You see that only Followers is a significant explanatory variable because its p-value is low. How much time an Instagram "Influencer" volunteers and their grade are not good predictors of their earnings, based on their large p-values.
Also notice that the confidence interval for the Followers coefficient $\beta_1$ is the only one that does not contain 0.

$H_o: \beta_1=\beta_2=\beta_3=0$        ("The overall model is not significant.")
$H_a:$ at least one $\beta_i\ne0$        ("The overall model is significant.")

It is true: at least one explanatory variable is significant. In this example, it's just Followers. That is enough to reject the null hypothesis and conclude that the overall model is significant. 

Finally, notice how high $R^2$ is. This suggests that the one significant explanatory variable, Followers, is doing almost all the work in explaining earnings.

Michaela is good at statistics but is famous for her cooking website. She believes that the number of new membership subscriptions per month depends on money spent on advertising, number of new recipes posted, number of times her page is shared on social media, and number of guest appearances she makes on TV. She randomly samples 18 months and applies multiple regression:

(i) How is the model overall for explaining new membership subscriptions? Select the null hypothesis. (Check all that apply.)

$H_o$: The overall model is significant.

$H_o$: The overall model is not significant.

$H_o$: $\beta_1=\beta_2=\beta_3=\beta_4=0$

$H_o$: All the explanatory variables are not statistically significantly different from zero.


An account manager's salary ($Y$) is estimated using a regression model. There are 3 explanatory variables: years of experience, number of complaints, and height (cm). Salary is in thousands of dollars.

Use the partial Excel output provided below to answer the series of questions.

(i) Which is the correct regression equation? 

$\hat y=90.58x_1+2.31x_2-2.45x_3+0.34x_4$

$\hat y=90.58+2.31x_1-2.45x_2+0.34x_3$

$\hat y=90.58+2.31x_1+2.45x_2+0.34x_3$

$\hat y=90.58+2.31b_1-2.45b_2+0.34b_3$


Extra Practice

Hypothesis Testing for Multiple Regression

University XYZ offers four general programs of study: Science, Art, Business, and Social Sciences. They want to build a model to predict how well admissions candidates would perform if they were admitted to the university. In building the model, they collect the following information for 9 randomly sampled current students, based on their original applications to the university: high school GPA, SAT scores, program of study the candidate applied to, and gender.
a)  For the indicator variable "Application Program of Study" how many coefficients will you need in your regression equation?
b)  The university decides to initially omit gender from the regression model.  What is the regression equation?

Hypothesis Testing for Multiple Regression

We want to estimate the price of houses in a given neighbourhood. We sample 9 houses and record the following output. Note that $X_3=1$ if the house has a pool and $X_3=0$ if it doesn't.
a) Predict the price of a 3000 sqft house built in 1995, with a pool.
b) Interpret the meaning of $b_3$.
