
Estimating the Coefficients of the Linear Regression Model


As you know, the simple linear regression equation is:

$$\boxed{\hat{y}=b_o+b_1x}$$
We use the statistics from our sample ($b_o$ and $b_1$) to make inferences about the parameters in the population ($\beta_o$ and $\beta_1$).
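As a rough illustration of how the sample statistics can be computed, here is a minimal sketch. It assumes the standard least-squares formulas (slope = sum of cross-deviations divided by sum of squared x-deviations, intercept chosen so the line passes through the point of means), which this lesson does not spell out. The function name `fit_line` and the data are made up for the example, and `b0` in the code stands for $b_o$.

```python
def fit_line(x, y):
    """Return (b0, b1) for the least-squares line y_hat = b0 + b1 * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Slope: sum of cross-deviations divided by sum of squared x-deviations.
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = s_xy / s_xx
    # Intercept: the fitted line passes through (x_bar, y_bar).
    b0 = y_bar - b1 * x_bar
    return b0, b1


# Made-up sample data, for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
b0, b1 = fit_line(x, y)
print(f"y_hat = {b0:.2f} + {b1:.2f} x")
```

For this small sample the estimates work out to an intercept of 2.2 and a slope of 0.6.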


The least-squares regression line is an estimate of the true population regression line, which is represented by this formal model:

$$\boxed{Y=\beta_o+\beta_1X+\varepsilon}$$

$Y$ is the unknown dependent variable.
  • All $Y_i$'s are independent of one another.
  • $Y$ is assumed to be normally distributed with mean $E(Y)=\beta_o+\beta_1X_i$ and a standard deviation $\sigma_Y$ that is constant, regardless of what $X$ is.
What is $\varepsilon$?

The notation $\varepsilon$, the residual or error, is the deviation of the actual values of $Y$ from their means $E(Y)$.
  • The error term includes everything that separates your model from actual reality. This includes:
      • Other explanatory variables that are not included in the model.
      • Poor fit (e.g. a linear model doesn't fit a quadratic relationship)
      • Unpredictable effects
      • Random error
  • We assume that $\varepsilon$ is normally distributed with mean 0 and standard deviation $\sigma_{\varepsilon}$.

The regression line shows how Y changes with X:
$X$ is the known independent variable
$\beta_o$ is the true intercept of the population regression line
$\beta_1$ is the true slope of the population regression line

Example
Unlike the other terms above (i.e. $\beta_o+\beta_1X_i$), which are all constants, $\varepsilon$ is a random variable.
  • The average value of all the $\varepsilon_i$'s is 0.
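To make the role of the error term concrete, the sketch below simulates data from the population model. The parameter values (`beta0`, `beta1`, `sigma`), the range of `x`, and the sample size are all hypothetical choices for illustration. In any finite sample the average error is only approximately 0.

```python
import random

rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical population parameters, chosen only for illustration.
beta0, beta1, sigma = 3.0, 0.5, 2.0

xs = [rng.uniform(0, 10) for _ in range(10_000)]
# Each error is drawn from a normal distribution with mean 0 and sd sigma.
errors = [rng.gauss(0, sigma) for _ in xs]
# Each observed Y is its mean E(Y) = beta0 + beta1 * x plus a random error.
ys = [beta0 + beta1 * x + e for x, e in zip(xs, errors)]

mean_error = sum(errors) / len(errors)
print(f"average error: {mean_error:.3f}")
```

The constants `beta0` and `beta1` never change from one observation to the next, while each entry of `errors` is a fresh random draw.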








Measures of Variation in Regression

The coefficient of determination (R squared), $R^2$, measures how close the data are to the regression model, or how much of the variation in the response variable $Y$ can be explained by the explanatory variable $X$.

Example

$X=$ length of a movie (minutes)
$Y=$ time it takes to edit a movie (days)
If $R^2=0.63$: "About 63% of the variation in the time it takes to edit a movie ($Y$) can be explained by the length of a movie ($X$)."
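For simple linear regression with one explanatory variable, $R^2$ equals the square of the sample correlation between $X$ and $Y$. The sketch below uses that fact to compute $R^2$ directly from data. The function name `r_squared` and the movie numbers are invented for illustration and are not taken from the example above.

```python
def r_squared(x, y):
    """R^2 for simple linear regression, computed as the squared correlation."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_yy = sum((yi - y_bar) ** 2 for yi in y)
    # r = s_xy / sqrt(s_xx * s_yy), so r^2 = s_xy^2 / (s_xx * s_yy).
    return s_xy ** 2 / (s_xx * s_yy)


# Hypothetical data: movie length (minutes) and editing time (days).
length_min = [90, 100, 110, 120, 130]
edit_days = [20, 24, 23, 30, 28]
r2 = r_squared(length_min, edit_days)
print(f"R^2 = {r2:.2f}")
```

With these made-up numbers $R^2$ is roughly 0.76, which would be read as "about 76% of the variation in editing time can be explained by movie length."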

The variation of $Y$ can be broken down into three measures:
  1. SSR: Sum of Squares (Regression)
  2. SSE: Sum of Squares (Error)
  3. SST: Sum of Squares (Total)

Watch Out!
This part can be confusing for some students, but it is very important and useful!


SSR: Sum of Squares (Regression)

SSR is the quantified measure of the variation that is attributed to the relationship between $X$ and $Y$.
  • In other words, SSR measures the explained variability in the regression model.
  • The explained variation of y is based on the vertical distance between the predicted value $\hat{y}_i$ and the sample mean $\overline{y}$. Squaring these distances and summing them over all observations gives SSR:
$$\boxed{\sum(\hat{y}_i-\overline{y})^2}$$
  • Therefore, the explained variation for each $y_i$ is:
$$\boxed{\hat{y}_i-\overline{y}}$$

→ This is known as the regression.


Wize Concept
Some textbooks use $SSM$ (Sum of Squares Model) instead of $SSR$ (Sum of Squares Regression). They both refer to how much the regression model can explain, so they are the same thing.



SSE: Sum of Squares (Error)

SSE is the quantified measure of the variation that is not attributed to the relationship between $X$ and $Y$. It may be due to:
  1. Other explanatory variables that are not included in the model.
  2. Random error
  • In other words, SSE measures the unexplained variability in the regression model.
  • The unexplained variation of y is based on the vertical distance between the actual value $y_i$ and the predicted value $\hat{y}_i$. Squaring these distances and summing them over all observations gives SSE:
$$\boxed{\sum(y_i-\hat{y}_i)^2}$$

  • Therefore, the unexplained variation for each $y_i$ is:
$$\boxed{y_i-\hat{y}_i}$$

→ This is known as the error.


Watch Out!
In this course, SSR does not mean "Sum of Squares Residuals"! SSR is the Sum of Squares Regression. (Some textbooks do use SSR for the sum of squared residuals, so check the notation of your source.)


SST: Sum of Squares (Total)

SST is the quantified measure of the variation that is attributed to the relationship between $X$ and $Y$ plus the variation that is not attributed to that relationship.
  • In other words, SST measures the explained variability in the regression model PLUS the unexplained variability in the regression model.
  • The total variation of y is the sum of the squares of the differences between each actual value $y_i$ and the sample mean $\overline{y}$:

$$\boxed{\sum\left(y_i-\overline{y}\right)^2}$$
  • Then, for each $y_i$, we get:

$$\boxed{y_i-\overline{y}}$$

Thus:
[Total variation of y] = [Explained variation of y] + [Unexplained variation of y]

$$\boxed{y_i-\overline{y}=\left(\hat{y}_i-\overline{y}\right)+\left(y_i-\hat{y}_i\right)}$$


Notice that the $\hat{y}_i$'s on the right side of the equation cancel each other out. What is left is $y_i-\overline{y}$.


Square both sides and sum over all observations. For the least-squares line the cross-product term sums to zero, so we get:

[Total variation of y] = [Total explained variation of y] + [Total unexplained variation of y]

$$\boxed{\sum\left(y_i-\overline{y}\right)^2=\sum\left(\hat{y}_i-\overline{y}\right)^2+\sum\left(y_i-\hat{y}_i\right)^2}$$

This can be rewritten as:

[Sum of Squares (Total)] = [Sum of Squares (Regression)] + [Sum of Squares (Error)]

or

$$\boxed{SST=SSR+SSE}$$
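The identity can be checked numerically. The sketch below fits a least-squares line to a small made-up data set, computes the three sums of squares from their definitions, and confirms that they add up. The function name `sums_of_squares` and the data are hypothetical. The last line uses the standard relationship $R^2 = SSR/SST$, which this lesson describes in words but does not write as a formula.

```python
def sums_of_squares(x, y):
    """Return (SST, SSR, SSE) for the least-squares line fitted to x and y."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Least-squares slope and intercept.
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    y_hat = [b0 + b1 * xi for xi in x]
    # Total: actual values vs. the sample mean.
    sst = sum((yi - y_bar) ** 2 for yi in y)
    # Regression (explained): predicted values vs. the sample mean.
    ssr = sum((yh - y_bar) ** 2 for yh in y_hat)
    # Error (unexplained): actual values vs. predicted values.
    sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))
    return sst, ssr, sse


# Made-up sample data, for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
sst, ssr, sse = sums_of_squares(x, y)
print(f"SST = {sst:.2f}, SSR = {ssr:.2f}, SSE = {sse:.2f}")
print("SST equals SSR + SSE:", abs(sst - (ssr + sse)) < 1e-9)
print(f"R^2 = SSR / SST = {ssr / sst:.2f}")
```

For this data the sums come out to SST = 6, SSR = 3.6 and SSE = 2.4, so $R^2 = 0.6$. The comparison uses a small tolerance because floating-point sums are rarely exactly equal.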


Example





Practice: Measures of Variation in Regression


$SST=1070$
$SSE=450$


Find $SSR$.
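One way to check your answer is to rearrange the identity $SST=SSR+SSE$ from this lesson and substitute the given values:

```latex
SSR = SST - SSE = 1070 - 450 = 620
```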
Extra Practice