Linear regression analysis

Methods of Corpus Linguistics (class 5)

Mariana Montes

Outline

  • Simple linear regression
  • Multiple linear regression
  • Categorical predictors
  • Easystats

Simple linear regression

Set up code

In case you want to run this by yourself…

library(tidyverse)
library(kableExtra)
set.seed(2022)

utt_lengths <- tibble(age = 3:17) %>% 
  mutate(utterance_length = 3 + 0.5*age + rnorm(n(), sd = 0.3))
m1 <- lm(utterance_length ~ age, data = utt_lengths)
utt_lengths$fit <- m1$fitted.values

Scatterplot

What is the relationship between the age of a child and the mean length of their utterances (in a corpus)?

g <- ggplot(utt_lengths, aes(
  x = age,
  y = utterance_length
  )) +
  geom_point(size = 3) +
  labs(x = "Age",
       y = "Utterance length") +
  xlim(0,18) + ylim(0,15) +
  theme_minimal(base_size = 20) +
  theme(aspect.ratio = 1)
g

Intercept

The average utterance length is 7.885:

g_int <- g +
  geom_hline(
    aes(
      yintercept = mean(utterance_length)
      ),
    linetype = 2,
    color = "coral",
    size = 1
    )
g_int

Fitted model

But we try to find a line that fits the trend the best:

g_fit <- g +
  geom_abline(
    intercept = m1$coefficients[[1]],
    slope = m1$coefficients[[2]],
    linetype = 2,
    color = "darkcyan",
    size = 1
    )
g_fit

Error

Sum of squares total

Code
g_int +
  geom_segment(aes(xend = age, yend = mean(utterance_length)), color = "coral3", size = 2)

Sum of squares error

Code
g_fit +
  geom_segment(aes(xend = age, yend = fit), color = "cyan4", size = 2)

Data

Code
utt_lengths %>% mutate(intercept = mean(utterance_length)) %>%
  slice_sample(n = 7) %>% kb_style()
age utterance_length fit intercept
17 11.48 11.52 7.88
14 9.94 9.96 7.88
16 11.03 11.00 7.88
8 6.13 6.85 7.88
9 7.18 7.37 7.88
15 10.21 10.48 7.88
7 6.40 6.33 7.88

lm() output

Code
summary(m1)

Call:
lm(formula = utterance_length ~ age, data = utt_lengths)

Residuals:
   Min     1Q Median     3Q    Max 
-0.717 -0.154 -0.016  0.174  0.518 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   2.6958     0.2036    13.2  6.3e-09 ***
age           0.5189     0.0187    27.8  5.8e-13 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.313 on 13 degrees of freedom
Multiple R-squared:  0.983, Adjusted R-squared:  0.982 
F-statistic:  771 on 1 and 13 DF,  p-value: 5.84e-13

Simple linear regression

Simple

one predictor variable

Linear

linear relation between estimated parameters and response variable

Notation

y ~ x

Estimation: OSL

Ordinary least squares, minimizing sum of squares of residuals

Multiple linear regression

Add a predictor

set.seed(7)
utt_lengths <- utt_lengths %>% 
  mutate(pages_read = n()/2 + rnorm(n(), mean = 6, sd = 3))

m2 <- lm(utterance_length ~ age + pages_read, data = utt_lengths)
utt_lengths$fit2 <- m2$fitted.values

utt_lengths %>% slice_tail(n = 7) %>% kb_style()
age utterance_length fit pages_read fit2
11 8.72 8.40 14.0 8.35
12 9.07 8.92 20.1 9.04
13 9.80 9.44 14.6 9.38
14 9.94 9.96 21.6 10.10
15 10.21 10.48 20.3 10.56
16 11.03 11.00 14.5 10.89
17 11.48 11.52 19.2 11.54

Scatterplot and fitted model

Model

Code
summary(m2)

Call:
lm(formula = utterance_length ~ age + pages_read, data = utt_lengths)

Residuals:
    Min      1Q  Median      3Q     Max 
-0.6100 -0.1804 -0.0159  0.2139  0.4207 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   2.3841     0.3157    7.55  6.7e-06 ***
age           0.5052     0.0212   23.82  1.8e-11 ***
pages_read    0.0295     0.0232    1.27     0.23    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.306 on 12 degrees of freedom
Multiple R-squared:  0.985, Adjusted R-squared:  0.983 
F-statistic:  405 on 2 and 12 DF,  p-value: 9.72e-12

Multiple linear regression

Multiple

more than one predictor

Notation

y ~ x1 + x2 (fitting on a plane)

y ~ x1 + x2 + ... + xn (fitting on a hyperplane)

Estimation: OSL

Ordinary least squares, minimizing sum of squares of residuals

Categorical predictors

Two levels

  • E.g. “monolingual family”: yes/no

  • Translated into 1 dummy predictor with two levels

    • 0: reference level, e.g. no

    • 1: other level, e.g. yes

  • Line fitted between 0 and 1, slope is the difference in y when yes (compared to no)

Two levels: example

utt_lengths_cat <- utt_lengths %>% mutate(mono = factor(c(rep("no", 5), "yes", "no", "no", rep("yes", 7))))
Code
ggplot(utt_lengths_cat, aes(x = as.numeric(mono)-1, y = utterance_length)) +
  geom_point(size = 3) + geom_smooth(method = "lm", se = FALSE, color = "darkolivegreen4") +
  labs(x = "Is the family monolingual?", y = "Utterance length") +
  ylim(c(0,15)) + scale_x_continuous(breaks = c(0, 1), labels = c("no", "yes")) +
  theme_minimal(base_size = 20) + theme(aspect.ratio = 1)

Code
summary(lm(utterance_length ~ mono, data = utt_lengths_cat))

Call:
lm(formula = utterance_length ~ mono, data = utt_lengths_cat)

Residuals:
   Min     1Q Median     3Q    Max 
-3.419 -0.788  0.253  0.928  2.100 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)    5.983      0.566    10.6  9.5e-08 ***
monoyes        3.566      0.776     4.6    5e-04 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.5 on 13 degrees of freedom
Multiple R-squared:  0.619, Adjusted R-squared:  0.59 
F-statistic: 21.1 on 1 and 13 DF,  p-value: 5e-04

More than two levels

  • E.g. “L1”: EN, ES, FR…

  • Translated into n-1 dummy predictors with two levels

    • 1: one of the levels, not the reference level, e.g. Es

    • 0: the rest of the levels, including the reference level (e.g. EN)

  • (Hyper)plane fitted between combinations of 0 and 1, slope is the difference in y when compared to the reference level.

Data: categorical predictors

mean_utts <- c(EN = 11, ES = 7.4, ZH = 7.8)
set.seed(2022)
by_lang <- tibble(L1 = factor(rep(c("EN", "ES", "ZH"), 15))) %>% 
  mutate(utterance_length = rnorm(n(), mean = mean_utts[L1], sd = 0.8))
kb_style(filter(by_lang, L1 == "EN"), "float_left")
kb_style(filter(by_lang, L1 == "ES"), "float_left")
kb_style(filter(by_lang, L1 == "ZH"), "left")
L1 utterance_length
EN 11.72
EN 9.84
EN 10.15
EN 11.19
EN 10.21
EN 10.94
EN 11.82
EN 11.31
EN 10.72
EN 11.26
L1 utterance_length
ES 6.46
ES 7.13
ES 7.62
ES 8.21
ES 7.47
ES 6.88
ES 8.09
ES 8.29
ES 6.71
ES 6.99
L1 utterance_length
ZH 7.08
ZH 5.48
ZH 8.40
ZH 7.65
ZH 7.76
ZH 7.04
ZH 8.09
ZH 8.77
ZH 8.32
ZH 7.61

Model with categorical predictors

Code
m_lang1 <- lm(utterance_length ~ L1, data = by_lang)
summary(m_lang1)

Call:
lm(formula = utterance_length ~ L1, data = by_lang)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.2294 -0.5615  0.0488  0.6111  1.0603 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   10.928      0.189    57.9  < 2e-16 ***
L1ES          -3.620      0.267   -13.6  < 2e-16 ***
L1ZH          -3.219      0.267   -12.1  3.1e-15 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.731 on 42 degrees of freedom
Multiple R-squared:  0.84,  Adjusted R-squared:  0.833 
F-statistic:  111 on 2 and 42 DF,  p-value: <2e-16
Code
es_first <- mutate(by_lang, L1 = fct_relevel(L1, "ES"))
m_lang2 <- lm(utterance_length ~ L1, data = es_first)
summary(m_lang2)

Call:
lm(formula = utterance_length ~ L1, data = es_first)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.2294 -0.5615  0.0488  0.6111  1.0603 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)    7.308      0.189    38.7   <2e-16 ***
L1EN           3.620      0.267    13.6   <2e-16 ***
L1ZH           0.401      0.267     1.5     0.14    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.731 on 42 degrees of freedom
Multiple R-squared:  0.84,  Adjusted R-squared:  0.833 
F-statistic:  111 on 2 and 42 DF,  p-value: <2e-16

Categorical predictors

  • Comparison is done between each level and the reference level, not in all combinations

  • t-test: does the slope of an individual dummy predictor differ from 0?

  • F-test of nested models: do the dummy predictors jointly reduce unexplained variation?

Easystats

{report} package

Code
report(m1)
We fitted a linear model (estimated using OLS) to predict utterance_length with
age (formula: utterance_length ~ age). The model explains a statistically
significant and substantial proportion of variance (R2 = 0.98, F(1, 13) =
771.19, p < .001, adj. R2 = 0.98). The model's intercept, corresponding to age
= 0, is at 2.70 (95% CI [2.26, 3.14], t(13) = 13.24, p < .001). Within this
model:

  - The effect of age is statistically significant and positive (beta = 0.52, 95%
CI [0.48, 0.56], t(13) = 27.77, p < .001; Std. beta = 0.99, 95% CI [0.91,
1.07])

Standardized parameters were obtained by fitting the model on a standardized
version of the dataset. 95% Confidence Intervals (CIs) and p-values were
computed using a Wald t-distribution approximation.
Code
report(m2)
We fitted a linear model (estimated using OLS) to predict utterance_length with
age (formula: utterance_length ~ age + pages_read). The model explains a
statistically significant and substantial proportion of variance (R2 = 0.99,
F(2, 12) = 404.69, p < .001, adj. R2 = 0.98). The model's intercept,
corresponding to age = 0, is at 2.38 (95% CI [1.70, 3.07], t(12) = 7.55, p <
.001). Within this model:

  - The effect of age is statistically significant and positive (beta = 0.51, 95%
CI [0.46, 0.55], t(12) = 23.82, p < .001; Std. beta = 0.97, 95% CI [0.88,
1.05])
  - The effect of pages read is statistically non-significant and positive (beta
= 0.03, 95% CI [-0.02, 0.08], t(12) = 1.27, p = 0.228; Std. beta = 0.05, 95% CI
[-0.04, 0.14])

Standardized parameters were obtained by fitting the model on a standardized
version of the dataset. 95% Confidence Intervals (CIs) and p-values were
computed using a Wald t-distribution approximation. and We fitted a linear
model (estimated using OLS) to predict utterance_length with pages_read
(formula: utterance_length ~ age + pages_read). The model explains a
statistically significant and substantial proportion of variance (R2 = 0.99,
F(2, 12) = 404.69, p < .001, adj. R2 = 0.98). The model's intercept,
corresponding to pages_read = 0, is at 2.38 (95% CI [1.70, 3.07], t(12) = 7.55,
p < .001). Within this model:

  - The effect of age is statistically significant and positive (beta = 0.51, 95%
CI [0.46, 0.55], t(12) = 23.82, p < .001; Std. beta = 0.97, 95% CI [0.88,
1.05])
  - The effect of pages read is statistically non-significant and positive (beta
= 0.03, 95% CI [-0.02, 0.08], t(12) = 1.27, p = 0.228; Std. beta = 0.05, 95% CI
[-0.04, 0.14])

Standardized parameters were obtained by fitting the model on a standardized
version of the dataset. 95% Confidence Intervals (CIs) and p-values were
computed using a Wald t-distribution approximation.

{parameters} package

Code
print_md(model_parameters(m1))
plot(model_parameters(m1))
Parameter Coefficient SE 95% CI t(13) p
(Intercept) 2.70 0.20 (2.26, 3.14) 13.24 < .001
age 0.52 0.02 (0.48, 0.56) 27.77 < .001

Code
print_md(model_parameters(m2))
plot(model_parameters(m2))
Parameter Coefficient SE 95% CI t(12) p
(Intercept) 2.38 0.32 (1.70, 3.07) 7.55 < .001
age 0.51 0.02 (0.46, 0.55) 23.82 < .001
pages read 0.03 0.02 (-0.02, 0.08) 1.27 0.228

Code
print_md(compare_parameters(m1, m2))
plot(compare_parameters(m1, m2))
Parameter m1 m2
(Intercept) 2.70 (2.26, 3.14) 2.38 (1.70, 3.07)
age 0.52 (0.48, 0.56) 0.51 (0.46, 0.55)
pages read 0.03 (-0.02, 0.08)
Observations 15 15

{performance} package I

Code
print_md(model_performance(m1))
Indices of model performance
AIC BIC R2 R2 (adj.) RMSE Sigma
11.54 13.67 0.98 0.98 0.29 0.31
Code
print_md(model_performance(m2))
Indices of model performance
AIC BIC R2 R2 (adj.) RMSE Sigma
11.65 14.48 0.99 0.98 0.27 0.31
Code
m_perf <- compare_performance(m1, m2)
print_md(m_perf)
plot(m_perf)
Comparison of Model Performance Indices
Name Model AIC (weights) BIC (weights) R2 R2 (adj.) RMSE Sigma
m1 lm 11.5 (0.51) 13.7 (0.60) 0.98 0.98 0.29 0.31
m2 lm 11.6 (0.49) 14.5 (0.40) 0.99 0.98 0.27 0.31

{performance} package II

Code
check_heteroscedasticity(m1)
OK: Error variance appears to be homoscedastic (p = 0.237).
Code
check_model(m1, check = c("qq", "ncv"))

Code
check_heteroscedasticity(m2)
OK: Error variance appears to be homoscedastic (p = 0.947).
Code
check_model(m2, check = c("qq", "ncv"))

Next: Logistic regression analysis