Linear Regression

This page features information about creating and interpreting regression models.


Simple Linear Regression

The goal of simple linear regression is to find the equation of a line through the data that captures the relationship between the feature and the response, so that we can predict future values. The equation is of the form:

\(y_i = \alpha + \beta x_i + \epsilon_i\)

where

  • \(y_i\) is the observed response at \(x_i\)
  • \(\alpha\) is the “y-intercept”
  • \(\beta\) is the “slope”
  • \(\epsilon_i \sim N(0, \sigma^2)\), and the \(\epsilon_i\) are independent

In vector (or set) form:

\(Y = \alpha + \beta X + \epsilon\)

where

  • \(X\): the independent variable, the predictor, the explanatory variable, the feature
  • \(Y\): the dependent variable, the response variable
  • \(\epsilon\): the random deviation, random error – accounts for the fact that the world is uncertain and that there are random deviations around the true process.

The Expected Value of the Response Variable

Given \(Y = \alpha + \beta X + \epsilon\), what is \(E[Y]\) for a fixed value of \(X\)?

\(E[Y \mid X] = E[\alpha + \beta X + \epsilon \mid X]\)

\(= \alpha + \beta X + E[\epsilon]\)

\(= \alpha + \beta X + 0\)

\(= \alpha + \beta X\)

Interpreting Parameters

\(\alpha\): the intercept of our true regression line represents the expected response when the feature is \(0\).

\(\beta\): the slope of our true regression line represents the expected increase in our response from a one-unit increase in our feature.

Minimizing the Residuals

Given our data, \((x_1, y_1), \dots, (x_n, y_n)\) how do we minimize the residuals, \(\epsilon_i = y_i - (\alpha + \beta x_i)\)? In other words, how do we minimize the sum of the squared errors?

[Figure: a vertical segment from a data point to the regression line, labeled \(e_i\)]

Given:

  • \(y_i\): the actual value of the data point
  • \(\hat{y_i}\): the predicted value of the \(i^{th}\) data point

\(SSE = \sum\limits_{i=1}^{n} (y_i - \hat{y_i})^2\)

The point estimates (single values estimated from the data) of the slope and intercept parameters are called the least-squares estimates, and are defined to be the values that minimize the SSE. The SSE can be thought of as a measure of how much variation in \(Y\) is left unexplained by the model.

How Do We Find the Parameter Estimates?

\(SSE = \sum\limits_{i=1}^{n} (y_i - \hat{y_i})^2\)

\(= \sum\limits_{i=1}^{n} (y_i - (\alpha + \beta x_i))^2\)

After making this substitution, compute…

  • \(\frac{\partial SSE}{\partial \alpha} = 0\)
  • \(\frac{\partial SSE}{\partial \beta} = 0\)

Which yield the following solutions, respectively…

  • \(\hat{\alpha} = \bar{y} - \hat{\beta} \bar{x}\)
  • \(\hat{\beta} = \frac{\overline{xy} - \bar{x} \bar{y}}{\overline{x^2} - (\bar{x})^2}\)
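
These closed-form estimates are easy to check numerically. Here is a minimal sketch in R, using a small made-up dataset (the x and y vectors below are invented for illustration) and comparing against lm():

Code
# Minimal sketch: least-squares estimates computed by hand on made-up data
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 8.1, 9.8)

beta_hat  <- (mean(x * y) - mean(x) * mean(y)) / (mean(x^2) - mean(x)^2)
alpha_hat <- mean(y) - beta_hat * mean(x)

c(alpha_hat, beta_hat)
coef(lm(y ~ x))  # should match (intercept, slope)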

Estimating the Variance

Given that the parameter \(\sigma^2\) determines the spread of the data about the true regression line, our estimate of the variance is \(\hat{\sigma}^2\), which is equivalent to:

\(\hat{\sigma}^2 = \frac{SSE}{n-2}\)
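
Continuing the made-up-data sketch above, \(\hat{\sigma}^2\) can be checked against the (squared) residual standard error that summary() reports:

Code
# Continuing the sketch: estimate sigma^2 as SSE / (n - 2)
fit <- lm(y ~ x)
sse <- sum(residuals(fit)^2)
sigma2_hat <- sse / (length(y) - 2)

sigma2_hat
summary(fit)$sigma^2  # residual standard error, squared; should match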

Coefficient of Determination

The coefficient of determination, \(R^2\), quantifies how well the model explains the data. \(R^2\) ranges from \(0\) to \(1\).

The regression sum of squares is given by:

\(SSR = \sum\limits_{i=1}^n (\hat{y_i} - \bar{y})^2\)

and gives a sense of how much variation in \(Y\) is explained by our model.

A quantitative measure of the total amount of variation in observed \(Y\) values is given by the total sum of squares:

\(SST = \sum\limits_{i=1}^n (y_i - \bar{y})^2\)

SST is the SSE we would get if we used the mean of the data as our model.

Note that the sum of squared deviations about the least-squares line is smaller than the sum of squared deviations about any other line.

\(SSE \leq SST\)

[Figure: SSE, SSR, and SST illustrated on the same scatter plot]

\(R^2\)

Therefore, the ratio of \(\frac{SSE}{SST}\) is the proportion of the total variation in the data (SST) that cannot be explained by the SLR model (SSE). So we define the coefficient of determination \(R^2\) to be the proportion of variance that can be explained by the model.

\(R^2\) has a magnitude between \(0\) and \(1\), with values closer to \(1\) being better (i.e. the higher the number, the more of the variation is explained by the model).

\(R^2 = 1 - \frac{SSE}{SST}\)
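
Continuing the same sketch, SSE, SSR, SST, and \(R^2\) can all be computed directly and checked against the R-squared that summary() reports:

Code
# Continuing the sketch: SSE, SSR, SST, and R^2 by hand
fit   <- lm(y ~ x)
y_hat <- fitted(fit)

sse <- sum((y - y_hat)^2)
ssr <- sum((y_hat - mean(y))^2)
sst <- sum((y - mean(y))^2)

c(SSE = sse, SSR = ssr, SST = sst)  # note SSR + SSE = SST
1 - sse / sst                       # R^2
summary(fit)$r.squared              # should match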

Slope Distribution

Distribution

\(\hat{\beta} \sim N(\beta, \frac{\sigma^2}{\sum_{i=1}^n (x_i - \bar{x})^2})\)

\(SE(\hat{\beta}) = \frac{\hat{\sigma}}{\sqrt{\sum_{i=1}^n (x_i - \bar{x})^2}}\)

Hypothesis Tests

\(H_0\): \(\beta = 0\)

\(H_A\): \(\beta \neq 0\)

Confidence Intervals

\(\hat{\beta} \pm t_{\alpha / 2, df = n-2} SE(\hat{\beta})\)

Note the statistic in the confidence interval: we use t-tests for the individual coefficients of linear regression models, specifically to determine whether a feature has an effect on the response. In the case of a small enough p-value, we can reject the null hypothesis in favor of the alternative hypothesis, which means there is statistical evidence that the associated feature (variable) has an effect on the response variable.
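
Continuing the made-up-data sketch from earlier, this interval can be computed by hand with qt() or obtained directly from confint():

Code
# Continuing the sketch: 95% CI for the slope, by hand and with confint()
fit <- lm(y ~ x)
est <- coef(summary(fit))["x", "Estimate"]
se  <- coef(summary(fit))["x", "Std. Error"]
n   <- length(y)

est + c(-1, 1) * qt(0.975, df = n - 2) * se
confint(fit, "x", level = 0.95)  # should match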

SUMMARY OF SLR

Main Formula

\(Y = \alpha + \beta X + \epsilon\)

Variance & Slope Distribution (\(\beta\)s)

\(\hat{\sigma}^2 = \frac{SSE}{n-2}\)

\(SE(\hat{\beta}) = \frac{\hat{\sigma}}{\sqrt{\sum_{i=1}^n (x_i - \bar{x})^2}}\)

CI for t-tests on \(\beta\)

\(\hat{\beta} \pm t_{\alpha / 2, df = n-2} SE(\hat{\beta})\)

Coefficient of Determination

  • SSE: a measure of how much variation in \(Y\) is left unexplained by the model; \(SSE = \sum\limits_{i=1}^{n} (y_i - \hat{y_i})^2\)
  • SSR: how much variation in \(Y\) is explained by our model; \(SSR = \sum\limits_{i=1}^n (\hat{y_i} - \bar{y})^2\)
  • SST: the total amount of variation in the observed \(Y\) values (what we would get if we used the mean of the data as our model); \(SST = \sum\limits_{i=1}^n (y_i - \bar{y})^2\)
  • \(R^2\): the proportion of variance that can be explained by the model; \(R^2 = 1 - \frac{SSE}{SST}\)

The closer to 1 \(R^2\) is, the better the model fits the data.

Workflow for Simple Linear Regression

  1. Plot the data as a scatter plot
     1. Does linearity seem appropriate?
     2. Compute and overlay the best-fit line
  2. Consider the assumptions of SLR
     1. Plot a histogram of the residuals (are they normal?): WANT NORMAL
     2. Plot the residuals against \(x\) (are they changing?): WANT RANDOM

[Plots to include: scatter plot with fitted line, residual histogram, residuals vs. \(x\); see the sketch at the end of the R example below.]

Simple Linear Regression in R

The following functions are essential for implementing and interpreting simple linear regression in R:

Load Library

Code
# import libraries
library(tidyverse)

Load Data (use txhousing from the built-in data)

Code
df <- txhousing
head(df)
# A tibble: 6 × 9
  city     year month sales   volume median listings inventory  date
  <chr>   <int> <int> <dbl>    <dbl>  <dbl>    <dbl>     <dbl> <dbl>
1 Abilene  2000     1    72  5380000  71400      701       6.3 2000 
2 Abilene  2000     2    98  6505000  58700      746       6.6 2000.
3 Abilene  2000     3   130  9285000  58100      784       6.8 2000.
4 Abilene  2000     4    98  9730000  68600      785       6.9 2000.
5 Abilene  2000     5   141 10590000  67300      794       6.8 2000.
6 Abilene  2000     6   156 13910000  66900      780       6.6 2000.

Create a simple linear regression model to see if we can explain and predict volume from listings

Code
slr <- lm(volume ~ listings, data = df)
summary(slr)

Call:
lm(formula = volume ~ listings, data = df)

Residuals:
       Min         1Q     Median         3Q        Max 
-704067392  -27385680  -12084812    -951885 1686521981 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -1501601.1  1750052.1  -0.858    0.391    
listings       36990.0      258.1 143.319   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 130500000 on 7174 degrees of freedom
  (1426 observations deleted due to missingness)
Multiple R-squared:  0.7411,    Adjusted R-squared:  0.7411 
F-statistic: 2.054e+04 on 1 and 7174 DF,  p-value: < 2.2e-16

From this summary, we can see that listings is a statistically significant factor (low p-value) and that the \(R^2\) for the model is somewhat acceptable at \(0.7411\). The negative intercept does raise some concern: it would mean that for a time period with \(0\) listings, we would expect a negative volume. The model also says that for each new listing, we can expect the volume to increase by about \(36990\).
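
Following the workflow above, here is a minimal sketch of the diagnostic plots for this slr model; drop_na() (from tidyr, loaded with the tidyverse) keeps only the rows lm() actually used:

Code
# Sketch of the workflow's diagnostic plots for the slr model
df_slr <- drop_na(df, volume, listings)  # rows without missing values, as used by lm()
df_slr$resid <- residuals(slr)

# 1. Scatter plot with the best-fit line overlaid
ggplot(df_slr, aes(x = listings, y = volume)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "lm", se = FALSE)

# 2a. Histogram of the residuals (want roughly normal)
ggplot(df_slr, aes(x = resid)) +
  geom_histogram(bins = 50)

# 2b. Residuals against the feature (want a random scatter around 0)
ggplot(df_slr, aes(x = listings, y = resid)) +
  geom_point(alpha = 0.3) +
  geom_hline(yintercept = 0)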


Multiple Linear Regression

We can extend the simple linear regression model to accommodate more variables. Note that we’ll now refer to \(\alpha\) as \(\beta_0\).

The main vector model can be represented as:

\(Y = \beta_0 + \beta_1 X_1 + \dots + \beta_p X_p + \epsilon\)

where, for each of the \(n\) data points (each vector has the same number of data points), we assume for \(i = 1, \dots, n\) that the \(i^{th}\) response is represented as:

\(y_i = \beta_0 + \beta_1 x_{i, 1} + \dots + \beta_p x_{i, p} + \epsilon_i\)

Matrix Algebra Applications & Solutions

If we had a perfectly square (and invertible) \(X\) matrix, then we could solve for \(\beta\) via:

\(Y = X \beta \rightarrow X^{-1} Y = \beta\).

However, we can almost guarantee that \(X\) will not be square. Consider the following linear algebra fact:

Given a matrix \(A\) of any dimensions, both \(A^TA\) and \(AA^T\) will result in square matrices.

Therefore, given \(Y = X \beta\), we can solve via the following:

\((X^TX)^{-1}X^TY = \beta\)
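
A minimal sketch of this solution in R, on a small made-up dataset (the variables below are invented for illustration), compared against lm():

Code
# Sketch: solving the normal equations by hand on made-up data
set.seed(1)
n  <- 50
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 3 + 2 * x1 - 1 * x2 + rnorm(n)

X <- cbind(1, x1, x2)  # design matrix with an intercept column
beta_hat <- solve(t(X) %*% X) %*% t(X) %*% y

t(beta_hat)
coef(lm(y ~ x1 + x2))  # should match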

Interpreting MLR

Given \(y_i = \beta_0 + \beta_1 x_{i, 1} + \dots + \beta_p x_{i, p} + \epsilon_i\), the parameter \(\beta_k\) is the expected change in the response associated with a unit change in the value of feature \(x_k\) while all of the other features are held fixed.

Quantifying Goodness of Fit

\(R^2\) vs. \(R_a^2\)

Like in SLR, we can also calculate measures like SSE, SST, and \(R^2\) for MLR.

  • \(SSE = \sum\limits_{i=1}^n (y_i - \hat{y_i})^2 = \sum\limits_{i=1}^n (y_i - (\beta_0 + \beta_1 x_{i, 1} + \dots + \beta_p x_{i, p}))^2\)
  • \(SST = \sum\limits_{i=1}^n (y_i - \bar{y})^2\)
  • \(R^2 = 1 - \frac{SSE}{SST}\)

Although we can calculate \(R^2\) for an MLR model, \(R^2\) never decreases as we add features, even unhelpful ones. The Adjusted \(R^2\), \(R_a^2\), is a better indicator of goodness of fit for MLR models: the adjusted version penalizes the model for having features that are not reducing the SSE.

\(R_a^2 = 1 - \frac{SSE/(n-p-1)}{SST/(n-1)}\)
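
A quick sketch: computing \(R_a^2\) by hand for the slr model fit earlier and checking it against the adjusted R-squared that summary() reports:

Code
# Sketch: adjusted R^2 by hand for the slr model
y_used <- model.response(model.frame(slr))  # responses actually used in the fit
sse <- sum(residuals(slr)^2)
sst <- sum((y_used - mean(y_used))^2)
n <- length(y_used)
p <- 1  # one feature: listings

1 - (sse / (n - p - 1)) / (sst / (n - 1))
summary(slr)$adj.r.squared  # should match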

Covariance & Correlation

We can discover relationships among features by performing a correlation analysis: if the value of one feature changes, how will this affect the other features?

Traditionally, these measurements are calculated via:

  • Covariance: \(Cov(X, Y) = E[(X - E[X])(Y -E[Y])]\)
  • Correlation (Pearson’s): \(\rho(X, Y) = \frac{Cov(X, Y)}{\sqrt{Var(X) Var(Y)}}\)

However, in an MLR, we estimate these relationships using formulas analogous to the sample variance.

  • Sample Covariance: \(S^2_{XY} = \frac{1}{n-1} \sum\limits_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})\)
  • Sample Correlation: \(\hat{\rho}(X, Y) = \frac{S^2_{XY}}{\sqrt{S^2_{X}S^2_{Y}}}\)

where \(S^2_{X}\) and \(S^2_{Y}\) are the sample variances of \(X\) and \(Y\), respectively.
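
In R, the sample versions come from cov() and cor(); a small sketch on a couple of the txhousing columns (using complete observations, since some rows have missing values):

Code
# Sample covariance and correlation between two features (complete rows only)
cov(df$sales, df$listings, use = "complete.obs")
cor(df$sales, df$listings, use = "complete.obs")

# Correlation matrix across several numeric columns
cor(df[, c("sales", "volume", "listings")], use = "complete.obs")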

Individual t-tests

Suppose we want to test:

  • \(H_0\): \(\beta_j = c\)
  • \(H_A\): \(\beta_j \neq c\)

for some \(c \in \mathbb{R}\).

We would use a t-test with the test statistic:

\(t_{stat} = \frac{\hat{\beta_j} - c}{SE(\hat{\beta_j})}\)

Recall our CI for \(\beta\):

CI: \(\hat{\beta} \pm t_{\alpha/2, df = n - 2} \frac{\hat{\sigma}}{\sqrt{\sum_{i=1}^n(x_i - \bar{x})^2}}\)

\(= \hat{\beta} \pm t_{\alpha/2, df = n - 2} \, SE(\hat{\beta})\)

Compare the test statistic with the t-test critical value (\(t_{\alpha/2, df = n - 2}\)) and proceed from there.

MLR in R: Individual t-test Output from summary

From the same dataset used for SLR, create an MLR model. Note: we can specify the variables explicitly, as in y ~ x1 + x2 + ... + xn, or include all of the variables besides the response by using y ~ . in the formula.

Code
mlr <- lm(volume ~ month + sales + listings, data = df)
summary(mlr)

Call:
lm(formula = volume ~ month + sales + listings, data = df)

Residuals:
       Min         1Q     Median         3Q        Max 
-425617453   -8326819    1919907    8285352  451201576 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -7064965.5  1055382.2  -6.694 2.33e-11 ***
month        -103150.3   141167.5  -0.731    0.465    
sales         274509.7     1078.4 254.559  < 2e-16 ***
listings      -12218.7      209.8 -58.251  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 41190000 on 7172 degrees of freedom
  (1426 observations deleted due to missingness)
Multiple R-squared:  0.9742,    Adjusted R-squared:  0.9742 
F-statistic: 9.035e+04 on 3 and 7172 DF,  p-value: < 2.2e-16

In the above summary printout, note that:

  • Std. Error = \(SE(\hat{\beta})\)
  • t value: \(t_{stat}\)
  • Pr(>|t|): p-value

Note that the lm() and summary() functions assume a hypothesis test of:

  • \(H_0\): \(\beta_j = 0\)
  • \(H_A\): \(\beta_j \neq 0\)

So, depending on the significance level we had set beforehand, we either reject the null hypothesis in favor of the alternative hypothesis or fail to reject the null hypothesis. In the case of a small enough p-value, we can reject the null hypothesis in favor of the alternative hypothesis which means there is statistical evidence that the associated feature (variable) has an effect on the response variable.
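
As a sketch, the t value and p-value for a single coefficient (here listings, testing \(\beta_j = 0\)) can be recomputed straight from the coefficient table:

Code
# Recompute the t value and p-value for listings from the summary coefficient table
coefs <- coef(summary(mlr))
est <- coefs["listings", "Estimate"]
se  <- coefs["listings", "Std. Error"]

t_stat <- (est - 0) / se  # testing H0: beta_listings = 0
p_val  <- 2 * pt(-abs(t_stat), df = df.residual(mlr))

c(t_stat, p_val)  # should match the "t value" and "Pr(>|t|)" columns above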

Type I Errors in Multiple t-tests

However, it is not a good idea to conduct several t-tests at a time!

Example: Suppose that we conduct 10 hypothesis tests at significance level \(\alpha = 0.05\), so the probability of a type I error is 0.05 for any individual test.

What is the probability of a type I error occurring in at least one of these 10 tests?

\(P(\text{at least 1 type I error}) = 1 - P(\text{no type I error})\)

\(= 1 - (0.95)^{10}\)

\(\approx 0.40\)

Which is high!
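
This back-of-the-envelope calculation is easy to verify in R:

Code
# Probability of at least one type I error across 10 independent tests at alpha = 0.05
1 - (1 - 0.05)^10  # approximately 0.401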

Multicollinearity / Collinearity / Non-Identifiability

Multicollinearity, also known as non-identifiability, can occur when there is linear dependence between columns of the data. “Strict” non-identifiability, which arises from true linear dependence and can be diagnosed by NA rows for MLR coefficients in a model summary, is rare. “Near” non-identifiability is more common but can be tricky to diagnose. One method of diagnosing “near” collinearity is to recognize “fishy” coefficients, such as a negative estimate when we expect a positive one.

This is a concern because linear dependence makes \(X^TX\) non-invertible, which breaks the least-squares solution \((X^TX)^{-1}X^TY\) and is therefore disruptive when creating MLR models.

F-test

How do we control the overall type I error rate when we need to test whether to drop several predictors from a model? We can use a simultaneous test: the F-test.

Consider two different models from a dataset with \(p\) features:

  • Full Model: \(\Omega\): \(Y = \beta_0 + \beta_1 X_1 + \dots + \beta_p X_p\)
  • Reduced Model: \(\omega\): \(Y = \beta_0 + \beta_1 X_1 + \dots + \beta_q X_q\), where \(q < p\)

We can consider testing the following hypothesis:

  • \(H_0\): \(\omega\) is sufficient
  • \(H_A\): \(\omega\) is not sufficient

The Full F-test

The null hypothesis corresponds to the most reduced model possible.

The idea:

  • \(H_0\): \(y_i = \beta_0 + \epsilon_i\)
  • \(H_A\): at least one \(\beta_j \neq 0\), where \(j = 1, \dots, p\)

In other words, we want to check whether ALL coefficients are 0:

  • \(H_0\): \(\beta_1 = \beta_2 = \dots = \beta_p = 0\)
  • \(H_A\): \(\beta_k \neq 0\) for at least one value of \(k = 1, \dots, p\)

The null hypothesis in the MLR case says that there is no useful linear relationship between \(y\) and any of the \(p\) predictors. A small enough p-value suggests strong evidence against the null hypothesis, or in other words that the model is better than the most reduced model possible. This test is especially useful in multiple linear regression (MLR), where relying on many individual t-tests for evidence against the null hypothesis may result in type I errors.

\(F_{stat} = \frac{\frac{SST - SSE}{p}}{\frac{SSE}{n-p-1}}\)

where \(p = df_{SST} - df_{SSE}\)

The \(F_{stat}\) is a measure of how much better our model is than just using the mean.

If \(H_0\) were true, then we would expect to see \(F_{stat} \approx 1\). In other words, if we see an \(F_{stat}\) around 1, we will likely fail to reject the null hypothesis.

If \(H_A\) were true, then \(SSE < SST \rightarrow F_{stat} >> 1\)
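
A sketch of the full F-statistic computed by hand for the mlr model fit earlier, which should match the F-statistic reported by summary(mlr):

Code
# Full F-statistic for the mlr model, computed by hand
y_used <- model.response(model.frame(mlr))  # responses actually used in the fit
sse <- sum(residuals(mlr)^2)
sst <- sum((y_used - mean(y_used))^2)
n <- length(y_used)
p <- 3  # month, sales, listings

((sst - sse) / p) / (sse / (n - p - 1))  # compare with the F-statistic in summary(mlr)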

The Partial F-test

Suppose we have the following setup:

  • Full Model: \(y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4\)
  • Reduced Model: \(y = \beta_0 + \beta_2 x_2 + \beta_4 x_4\)

Are the missing features important, or are we okay with the reduced model?

Hypothesis Test for the Partial F-test:

  • \(H_0\): \(\beta_1 = \beta_3 = 0\)
  • \(H_A\): at least one of \(\beta_1\), \(\beta_3\) \(\neq 0\)

Strategy: fit the full and reduced models, then determine if the difference in performance is real or just due to chance.

Let’s examine some measurements from the full and reduced models.

  • \(SSE_{full}\): variation unexplained by the full model
  • \(SSE_{reduced}\): variation unexplained by the reduced model

Intuitively, if \(SSE_{full}\) is much smaller than \(SSE_{reduced}\), the full model fits the data much better than the reduced model. Therefore, the appropriate test statistic should depend on the difference between the unexplained variation.

Given a full model with \(p\) features and a reduced model with \(k\) features:

\(F_{stat-partial} = \frac{(SSE_{reduced} - SSE_{full}) / (p - k)}{SSE_{full} / (n-p-1)} \sim F_{p-k, n-p-1}\)

with the rejection region: \(F \geq F_{\alpha, p-k, n-p-1}\)

Note that \(p-k\) comes from the differences in degrees of freedom:

\(df_{reduced} - df_{full} = (n-k-1) - (n-p-1) = p - k\)

Large \(F_{stat} \rightarrow\) Low P-Value from F-distribution \(\rightarrow\) Well Fitting Model

Principle of Parsimony: the principle that the most acceptable explanation of an occurrence, phenomenon, or event is the simplest, involving the fewest entities, assumptions, or changes.

We’re striving to create an MLR model which explains the variation in the data with relatively few features that are easily interpreted.

Feature Selection

The number of models to test can exponentially increase with the number of features.

Given \(p\) features, there are \(\sum\limits_{i=0}^p \binom{p}{i} = 2^p\) possible models.

What are some ways we can reduce this impossibly large number of models to try?

Forward Selection

A greedy algorithm for adding features.

  1. Fit a null model with an intercept but no slopes
  2. Fit p individual SLR models - one for each possible feature. Add to the null model the one that improves the performance the most, based on some measure (e.g. decreases SSE the most, increases the F-statistic the most, etc.)
  3. Fit (p - 1) MLR models - one for each of the remaining features along with the feature we isolated from step 2. Add the one that improves model performance the most.
  4. Repeat until some stopping criterion is reached (e.g. some threshold SSE, or some fixed number of features, etc.)

Backward Selection

A greedy algorithm for removing features.

  1. Fit model with available features.
  2. Remove the feature with the largest p-value (i.e. the least significant feature).
  3. Repeat until some stopping criterion is reached (e.g. some threshold SSE, or some fixed number of features, etc.).
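
As a sketch, R's built-in step() function performs this kind of greedy search, though note that it ranks candidate models by AIC rather than by SSE or individual p-values as described above (drop_na() is used so every candidate model is fit on the same rows):

Code
# Sketch: greedy feature selection with step(), which uses AIC as its criterion
df_complete <- drop_na(df, volume, month, sales, listings)

full <- lm(volume ~ month + sales + listings, data = df_complete)
null <- lm(volume ~ 1, data = df_complete)

# Forward selection: start from the intercept-only model and add one feature at a time
fwd <- step(null, scope = formula(full), direction = "forward")

# Backward selection: start from the full model and drop one feature at a time
bwd <- step(full, direction = "backward")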

Further Interpretations and R Examples

Basic Interpretations

Let’s recreate our MLR from earlier:

Code
mlr <- lm(volume ~ month + sales + listings, data = df)

summary

Code
summary(mlr)

Call:
lm(formula = volume ~ month + sales + listings, data = df)

Residuals:
       Min         1Q     Median         3Q        Max 
-425617453   -8326819    1919907    8285352  451201576 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -7064965.5  1055382.2  -6.694 2.33e-11 ***
month        -103150.3   141167.5  -0.731    0.465    
sales         274509.7     1078.4 254.559  < 2e-16 ***
listings      -12218.7      209.8 -58.251  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 41190000 on 7172 degrees of freedom
  (1426 observations deleted due to missingness)
Multiple R-squared:  0.9742,    Adjusted R-squared:  0.9742 
F-statistic: 9.035e+04 on 3 and 7172 DF,  p-value: < 2.2e-16

anova

Code
anova(mlr)
Analysis of Variance Table

Response: volume
            Df     Sum Sq    Mean Sq    F value    Pr(>F)    
month        1 1.5469e+17 1.5469e+17     91.194 < 2.2e-16 ***
sales        1 4.5387e+20 4.5387e+20 267563.442 < 2.2e-16 ***
listings     1 5.7558e+18 5.7558e+18   3393.153 < 2.2e-16 ***
Residuals 7172 1.2166e+19 1.6963e+15                         
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The anova table gives us:

  • Df: degrees of freedom
  • Sum Sq (feature rows): the variation in \(Y\) explained by each feature; these rows together make up SSR
  • Sum Sq (Residuals row): SSE
  • Mean Sq: Sum Sq divided by Df
  • Summing all of the Sum Sq entries (features plus Residuals) gives SST

residuals

Code
# plot the residuals against the observation index
ggplot() +
  geom_point(aes(x = seq(1, length(residuals(mlr))), y = residuals(mlr)))

Not run here, but fitted(mlr) will return the predicted response values.


F-tests

Let’s check out the F-tests.

Code
summary(mlr)

Call:
lm(formula = volume ~ month + sales + listings, data = df)

Residuals:
       Min         1Q     Median         3Q        Max 
-425617453   -8326819    1919907    8285352  451201576 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -7064965.5  1055382.2  -6.694 2.33e-11 ***
month        -103150.3   141167.5  -0.731    0.465    
sales         274509.7     1078.4 254.559  < 2e-16 ***
listings      -12218.7      209.8 -58.251  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 41190000 on 7172 degrees of freedom
  (1426 observations deleted due to missingness)
Multiple R-squared:  0.9742,    Adjusted R-squared:  0.9742 
F-statistic: 9.035e+04 on 3 and 7172 DF,  p-value: < 2.2e-16

Here we notice that most of the individual t-tests are statistically significant, though we should be wary of type I errors. The t-test associated with month, however, is not significant, suggesting there is no evidence that its parameter isn’t zero. The F-statistic is large and has a very low p-value associated with it, which is strong evidence that at least one of the \(\beta_{k}\)s is not \(0\).

Let’s test a model that removes month using a partial F-test.

Code
mlr_reduced <- lm(volume ~ sales + listings, data = df)
Code
anova(mlr_reduced, mlr)
Analysis of Variance Table

Model 1: volume ~ sales + listings
Model 2: volume ~ month + sales + listings
  Res.Df        RSS Df  Sum of Sq      F Pr(>F)
1   7173 1.2167e+19                            
2   7172 1.2166e+19  1 9.0568e+14 0.5339  0.465

The p-value for the F-test is very large, indicating that we don’t have enough evidence to reject the null hypothesis, which means that the reduced model is likely sufficient.