Association measures

Methods of Corpus Linguistics (class 4)

Mariana Montes

Outline

  • Introduction
  • Effect size measures
  • Strength of evidence measures
  • Usage in practice

Introduction

Sources

Terms

Observed frequencies

                    Target item   Other items   Total
Target context      O11           O12           R1
Reference context   O21           O22           R2
Total               C1            C2            N

Expected frequencies

                    Target item   Other items   Total
Target context      E11           E12           R1
Reference context   E21           E22           R2
Total               C1            C2            N
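The expected frequencies are what we would observe if item and context were independent: \(E_{ij} = R_i \times C_j / N\). A minimal sketch in base R, using the counts for water/nn near hot/jj reported later in these slides (the object names are illustrative, not part of mclm):

```r
# Observed counts for water/nn near hot/jj, copied from the tables in these slides.
# Rows: target vs. reference context. Columns: target item vs. other items.
observed <- matrix(
  c(13, 764,
    413, 1174453),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    context = c("target", "reference"),
    item = c("target", "other")
  )
)

R <- rowSums(observed)  # R1, R2
C <- colSums(observed)  # C1, C2
N <- sum(observed)

# Expected frequencies under independence: E_ij = R_i * C_j / N
expected <- outer(R, C) / N
round(expected, 3)
```

The top-left cell, E11, should correspond to the exp_a column of the mclm output (0.282 for water/nn).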

Set up data

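library(tidyverse)
library(mclm)
library(kableExtra) # provides kbl() and kable_paper(), used for the tables below
corpus_folder <- here::here("studies", "_corpora", "brown") # adapt path!!
# Keep only the files whose path matches "/c[a-z]" (file names starting with "c")
brown_fnames <- get_fnames(corpus_folder) %>% keep_re("/c[a-z]")
# Count co-occurrences around hot/jj and compute the association scores
hot_assoc <- surf_cooc(brown_fnames, "^hot/jj", re_token_splitter = "\\s+") %>% 
  assoc_scores() %>% as_tibble()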
head(hot_assoc)
# A tibble: 6 × 17
  type       a     b     c       d   dir exp_a  DP_rows RR_rows    OR       MS
  <chr>  <dbl> <dbl> <dbl>   <dbl> <dbl> <dbl>    <dbl>   <dbl> <dbl>    <dbl>
1 ,/,       49   728 58104 1116762     1  38.4  0.0136    1.28  1.29  0.000843
2 the/at    39   738 68974 1105892    -1  45.6 -0.00851   0.855 0.847 0.000565
3 ./.       38   739 48774 1126092     1  32.3  0.00739   1.18  1.19  0.000778
4 and/cc    36   741 28506 1146360     1  18.9  0.0221    1.91  1.95  0.00126 
5 a/at      28   749 22915 1151951     1  15.2  0.0165    1.85  1.88  0.00122 
6 of/in     21   756 35007 1139859    -1  23.2 -0.00277   0.907 0.904 0.000600
# … with 6 more variables: Dice <dbl>, PMI <dbl>, chi2_signed <dbl>,
#   G_signed <dbl>, t <dbl>, p_fisher_1 <dbl>

Effect size measures

Difference of proportions (\(\Delta p\))

\[\frac{O_{11}}{R_1}-\frac{O_{21}}{R_2}\]

  • \(< 0\): repulsion
  • \(0\): neutral
  • \(> 0\): attraction
hot_assoc %>% arrange(desc(DP_rows)) %>% 
  select(type, a, b, c, d, DP_rows) %>% head(10) %>% 
  kbl() %>% kable_paper(font_size = 22)
type a b c d DP_rows
and/cc 36 741 28506 1146360 0.022
a/at 28 749 22915 1151951 0.017
water/nn 13 764 413 1174453 0.016
,/, 49 728 58104 1116762 0.014
was/bedz 13 764 9793 1165073 0.008
cold/jj 6 771 133 1174733 0.008
./. 38 739 48774 1126092 0.007
with/in 10 767 7251 1167615 0.007
it/pps 9 768 5844 1169022 0.007
sun/nn 5 772 96 1174770 0.006
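As a check on the formula, \(\Delta p\) can be recomputed by hand from the four cells. A small sketch for water/nn, assuming that the columns a, b, c, d correspond to \(O_{11}\), \(O_{12}\), \(O_{21}\), \(O_{22}\) (the variable names are illustrative):

```r
# Counts for water/nn copied from the table above (a, b, c, d).
o11 <- 13; o12 <- 764; o21 <- 413; o22 <- 1174453

r1 <- o11 + o12  # size of the target context
r2 <- o21 + o22  # size of the reference context

# Difference between the proportion of the item in each context
delta_p <- o11 / r1 - o21 / r2
round(delta_p, 3)  # 0.016, the DP_rows value in the table
```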

Relative risk

\[\frac{O_{11}/R_1}{O_{21}/R_2}\]

  • \(< 1\): repulsion
  • \(1\): neutral
  • \(> 1\): attraction
hot_assoc %>% arrange(desc(RR_rows)) %>% 
  select(type, a, b, c, d, RR_rows) %>% head(10) %>% 
  kbl() %>% kable_paper(font_size = 22)
type a b c d RR_rows
cereal/nn 4 773 12 1174854 504.02
soup/nn 3 774 13 1174853 348.94
cup/nn 3 774 40 1174826 113.40
coffee/nn 4 773 72 1174794 84.00
sun/nn 5 772 96 1174770 78.75
weather/nn 3 774 63 1174803 72.00
cold/jj 6 771 133 1174733 68.21
water/nn 13 764 413 1174453 47.59
summer/nn 3 774 128 1174738 35.44
day/nn 4 773 623 1174243 9.71
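The same hand computation works for the relative risk. A sketch for cereal/nn, again assuming that a, b, c, d stand for \(O_{11}\), \(O_{12}\), \(O_{21}\), \(O_{22}\):

```r
# Counts for cereal/nn copied from the table above (a, b, c, d).
o11 <- 4; o12 <- 773; o21 <- 12; o22 <- 1174854

# Proportion of the item in the target and in the reference context
p_target <- o11 / (o11 + o12)
p_reference <- o21 / (o21 + o22)

relative_risk <- p_target / p_reference
round(relative_risk, 2)  # 504.02, the RR_rows value in the table
```

Read as: cereal/nn is about 500 times more frequent near hot/jj than elsewhere, a figure that rests on only four occurrences.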

Odds ratio

\[\frac{O_{11}/O_{12}}{O_{21}/O_{22}}\]

  • \(< 1\): repulsion
  • \(1\): neutral
  • \(> 1\): attraction
hot_assoc %>% arrange(desc(OR)) %>% 
  select(type, a, b, c, d, OR) %>% head(10) %>% 
  kbl() %>% kable_paper(font_size = 22)
type a b c d OR
cereal/nn 4 773 12 1174854 506.62
soup/nn 3 774 13 1174853 350.28
cup/nn 3 774 40 1174826 113.84
coffee/nn 4 773 72 1174794 84.43
sun/nn 5 772 96 1174770 79.26
weather/nn 3 774 63 1174803 72.28
cold/jj 6 771 133 1174733 68.74
water/nn 13 764 413 1174453 48.39
summer/nn 3 774 128 1174738 35.57
day/nn 4 773 623 1174243 9.75

Log odds ratio

\[\log\left(\frac{O_{11}/O_{12}}{O_{21}/O_{22}}\right)\]

  • \(< 0\): repulsion
  • \(0\): neutral
  • \(> 0\): attraction
hot_assoc %>% mutate(log_OR = log(OR)) %>% arrange(desc(log_OR)) %>% 
  select(type, a, b, c, d, OR, log_OR) %>% head(10) %>% 
  kbl() %>% kable_paper(font_size = 22)
type a b c d OR log_OR
cereal/nn 4 773 12 1174854 506.62 6.23
soup/nn 3 774 13 1174853 350.28 5.86
cup/nn 3 774 40 1174826 113.84 4.74
coffee/nn 4 773 72 1174794 84.43 4.44
sun/nn 5 772 96 1174770 79.26 4.37
weather/nn 3 774 63 1174803 72.28 4.28
cold/jj 6 771 133 1174733 68.74 4.23
water/nn 13 764 413 1174453 48.39 3.88
summer/nn 3 774 128 1174738 35.57 3.57
day/nn 4 773 623 1174243 9.75 2.28
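A sketch of both the odds ratio and its logarithm for cereal/nn, computed by hand (illustrative variable names). It also shows why the log scale is convenient: swapping target and reference context only flips the sign.

```r
# Counts for cereal/nn copied from the table above (a, b, c, d).
o11 <- 4; o12 <- 773; o21 <- 12; o22 <- 1174854

# Odds of the item in the target context divided by its odds in the reference context
odds_ratio <- (o11 / o12) / (o21 / o22)
log_odds_ratio <- log(odds_ratio)
round(c(OR = odds_ratio, log_OR = log_odds_ratio), 2)  # 506.62 and 6.23, as in the table

# Swapping the two contexts inverts the odds ratio...
odds_ratio_swapped <- (o21 / o22) / (o11 / o12)
# ...which on the log scale is simply the negative of the original value.
log_odds_ratio_swapped <- log(odds_ratio_swapped)
```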

Pointwise mutual information (PMI)

\[\log_2 \left(\frac{O_{11}}{E_{11}}\right)\]

  • \(< 0\): repulsion
  • \(0\): neutral
  • \(> 0\): attraction
hot_assoc %>% arrange(desc(PMI)) %>% 
  select(type, a, exp_a, PMI) %>% head(10) %>% 
  kbl() %>% kable_paper(font_size = 22)
type a exp_a PMI
cereal/nn 4 0.011 8.56
soup/nn 3 0.011 8.15
cup/nn 3 0.028 6.72
coffee/nn 4 0.050 6.32
sun/nn 5 0.067 6.23
weather/nn 3 0.044 6.10
cold/jj 6 0.092 6.03
water/nn 13 0.282 5.53
summer/nn 3 0.087 5.12
day/nn 4 0.414 3.27
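PMI compares the observed co-occurrence frequency with the expected one. A sketch for cereal/nn, with \(E_{11}\) computed from the marginals (illustrative variable names; the counts b, c, d come from the earlier tables):

```r
# Counts for cereal/nn (a, b, c, d) as reported in the earlier tables.
o11 <- 4; o12 <- 773; o21 <- 12; o22 <- 1174854

r1 <- o11 + o12              # size of the target context
c1 <- o11 + o21              # total frequency of the item
n <- o11 + o12 + o21 + o22   # corpus size

e11 <- r1 * c1 / n           # expected co-occurrence frequency under independence
pmi <- log2(o11 / e11)
round(c(exp_a = e11, PMI = pmi), 3)
```

The result should match the exp_a and PMI columns up to rounding (0.011 and 8.56).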

Dice coefficient

\[\frac{2O_{11}}{R_1+C_1}\]

Harmonic mean of \(\frac{O_{11}}{R_1}\) and \(\frac{O_{11}}{C_1}\)

  • Range: 0-1
hot_assoc %>% arrange(desc(Dice)) %>% 
  select(type, a, b, c, Dice) %>% head(10) %>% 
  kbl() %>% kable_paper(font_size = 22)
type a b c Dice
water/nn 13 764 413 0.022
cold/jj 6 771 133 0.013
sun/nn 5 772 96 0.011
cereal/nn 4 773 12 0.010
coffee/nn 4 773 72 0.009
soup/nn 3 774 13 0.008
cup/nn 3 774 40 0.007
weather/nn 3 774 63 0.007
summer/nn 3 774 128 0.007
day/nn 4 773 623 0.006
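A sketch that computes the Dice coefficient for water/nn in two ways, with the formula and as a harmonic mean, to show that both definitions agree (illustrative variable names):

```r
# Counts for water/nn copied from the table above (a, b, c).
o11 <- 13; o12 <- 764; o21 <- 413

r1 <- o11 + o12  # size of the target context
c1 <- o11 + o21  # total frequency of the item

# Direct formula
dice <- 2 * o11 / (r1 + c1)

# Harmonic mean of the two conditional proportions
p_given_context <- o11 / r1
p_given_item <- o11 / c1
dice_harmonic <- 2 * p_given_context * p_given_item / (p_given_context + p_given_item)

round(c(dice, dice_harmonic), 3)  # both 0.022, the Dice value in the table
```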

Summary of effect size measures

  • Intuitive
  • Unstable, especially with low frequencies (which we often have): a single extra occurrence can change the value considerably
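To make the fragility concrete, here is a small sketch that moves one single token from the target to the reference context and compares the effect on the relative risk of a rare item (cereal/nn) and a frequent one (and/cc). The modified counts are hypothetical, and relative_risk() is an illustrative helper, not an mclm function.

```r
# Relative risk from the four cells of the contingency table (illustrative helper).
relative_risk <- function(o11, o12, o21, o22) {
  (o11 / (o11 + o12)) / (o21 / (o21 + o22))
}

# Observed counts from the tables above.
rr_cereal <- relative_risk(4, 773, 12, 1174854)      # cereal/nn (rare item)
rr_and    <- relative_risk(36, 741, 28506, 1146360)  # and/cc (frequent item)

# HYPOTHETICAL counts: one occurrence moved from the target to the reference context.
rr_cereal_minus1 <- relative_risk(3, 774, 13, 1174853)
rr_and_minus1    <- relative_risk(35, 742, 28507, 1146359)

# Relative change in the measure caused by that single token.
change_cereal <- (rr_cereal - rr_cereal_minus1) / rr_cereal
change_and    <- (rr_and - rr_and_minus1) / rr_and
round(c(cereal = change_cereal, and = change_and), 3)
```

With one token fewer, cereal/nn has exactly the counts of soup/nn and its relative risk drops by roughly a third, whereas the value for and/cc changes by only a few percent.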

Strength of evidence measures

About these measures

  • How certain are we that there is a difference?

  • Take into account effect size and amount of information.

    • If you have a big difference you don’t need much data.

    • If you have a subtle difference you need a lot of data.

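  • BUT on their own they do not distinguish attraction from repulsion: both yield high values. The signed variants used below (chi2_signed, G_signed) add the direction of the association.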

Chi-square (\(\chi^2\))

  • \(\sum_{i=1}^{2}\sum_{j=1}^{2}\frac{(O_{ij}-E_{ij})^2}{E_{ij}}\)

  • Not reliable for low-frequency items (i.e. when any of the expected frequencies is below 5)

hot_assoc %>% arrange(desc(chi2_signed)) %>% 
  select(type, a, exp_a, b, c, d, chi2_signed) %>% head(10) %>% 
  kbl() %>% kable_paper(font_size = 22)
type a exp_a b c d chi2_signed
cereal/nn 4 0.011 773 12 1174854 1506.1
soup/nn 3 0.011 774 13 1174853 845.7
water/nn 13 0.282 764 413 1174453 575.1
cold/jj 6 0.092 771 133 1174733 380.3
sun/nn 5 0.067 772 96 1174770 364.9
cup/nn 3 0.028 774 40 1174826 310.9
coffee/nn 4 0.050 773 72 1174794 310.8
weather/nn 3 0.044 774 63 1174803 200.5
summer/nn 3 0.087 774 128 1174738 98.1
day/nn 4 0.414 773 623 1174243 31.1
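A sketch of the \(\chi^2\) computation for cereal/nn in base R, following the formula cell by cell (illustrative object names; no continuity correction is applied). It also shows why the value should be read with caution here: the smallest expected frequency is far below 5.

```r
# Observed counts for cereal/nn, copied from the table above (a, b, c, d).
observed <- matrix(c(4, 773, 12, 1174854), nrow = 2, byrow = TRUE)

# Expected frequencies under independence: E_ij = R_i * C_j / N
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Sum of (O - E)^2 / E over the four cells
chi2 <- sum((observed - expected)^2 / expected)
round(chi2, 1)  # 1506.1, the (unsigned) value in the table

# The smallest expected frequency is well below 5, so the approximation is unreliable.
min(expected)
```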

Log-likelihood ratio: \(G\) or \(G^2\)

\[2\sum_{i=1}^{2}\sum_{j=1}^{2}\left(O_{ij} \times \log\left(\frac{O_{ij}}{E_{ij}}\right) \right)\]

  • Also problematic with low frequencies, but less so than \(\chi^2\) (expected values can be as low as 3).
hot_assoc %>% arrange(desc(G_signed)) %>% 
  select(type, a, exp_a, b, c, G_signed) %>% head(10) %>% 
  kbl() %>% kable_paper(font_size = 22)
type a exp_a b c G_signed
water/nn 13 0.282 764 413 74.8
cereal/nn 4 0.011 773 12 40.6
cold/jj 6 0.092 771 133 38.6
sun/nn 5 0.067 772 96 33.6
soup/nn 3 0.011 774 13 28.5
coffee/nn 4 0.050 773 72 27.3
cup/nn 3 0.028 774 40 22.2
weather/nn 3 0.044 774 63 19.6
summer/nn 3 0.087 774 128 15.5
and/cc 36 18.864 741 28506 12.7
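The same cell-by-cell approach gives \(G^2\). A sketch for water/nn in base R (illustrative object names; it assumes no cell is zero, since \(\log(0)\) is undefined):

```r
# Observed counts for water/nn (a, b, c, d) as reported in the earlier tables.
observed <- matrix(c(13, 764, 413, 1174453), nrow = 2, byrow = TRUE)

# Expected frequencies under independence: E_ij = R_i * C_j / N
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# G^2 = 2 * sum of O * log(O / E) over the four cells (natural logarithm)
g2 <- 2 * sum(observed * log(observed / expected))
round(g2, 1)  # 74.8, the (unsigned) value in the table
```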

Fisher exact test

  • We’ll use the p-value: the lower it is, the stronger the evidence of an association.

  • Suitable for low frequencies: exact, but computationally expensive.

hot_assoc %>% arrange(p_fisher_1) %>% 
  select(type, a, b, c, d, p_fisher_1) %>% head(10) %>% 
  kbl(digits = 5) %>% kable_paper(font_size = 22)
type a b c d p_fisher_1
water/nn 13 764 413 1174453 0.00000
cereal/nn 4 773 12 1174854 0.00000
cold/jj 6 771 133 1174733 0.00000
sun/nn 5 772 96 1174770 0.00000
soup/nn 3 774 13 1174853 0.00000
coffee/nn 4 773 72 1174794 0.00000
cup/nn 3 774 40 1174826 0.00000
weather/nn 3 774 63 1174803 0.00001
summer/nn 3 774 128 1174738 0.00010
and/cc 36 741 28506 1146360 0.00024
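For a single item, a comparable p-value can be obtained with base R's fisher.test(). The sketch below runs a one-sided test for attraction on weather/nn; that this corresponds to how mclm computes p_fisher_1 is an assumption, not something stated in these slides.

```r
# Observed counts for weather/nn, copied from the table above (a, b, c, d).
observed <- matrix(c(3, 774, 63, 1174803), nrow = 2, byrow = TRUE)

# One-sided Fisher exact test for attraction (odds ratio greater than 1), base R.
fisher_attraction <- fisher.test(observed, alternative = "greater")
fisher_attraction$p.value
```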

Usage in practice

1. assoc object

Obtain a dataframe with association measures between different items and the target context.

  • collocations: e.g. between different words and hot/jj

  • keywords: e.g. between different words and the academic files of the corpus.

Code: mclm::assoc_scores() or mclm::assoc_abcd().

What counts as association?

  • Set a threshold (always arbitrary, may be theoretically informed).

  • Use a ranking.

  • Combine: choose the n-best elements.

Tip

This can also be combined with different measures, e.g. use a frequency threshold and minimum values of PMI and \(G^2\), then rank by either PMI or \(G^2\).
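A minimal sketch of such a combination with dplyr, on a handful of rows copied from the tables in these slides. The thresholds are arbitrary illustrations (10.83 is roughly the \(\chi^2\) critical value for p < .001 with one degree of freedom), and PMI is recomputed from the rounded exp_a values, so it differs slightly from the PMI column shown earlier.

```r
library(dplyr)

# A few rows copied from the tables in these slides (exp_a is rounded there).
scores <- tibble(
  type = c("water/nn", "cereal/nn", "cold/jj", "sun/nn", "soup/nn", "and/cc"),
  a = c(13, 4, 6, 5, 3, 36),
  exp_a = c(0.282, 0.011, 0.092, 0.067, 0.011, 18.864),
  G_signed = c(74.8, 40.6, 38.6, 33.6, 28.5, 12.7)
) %>%
  mutate(PMI = log2(a / exp_a))

# Illustrative thresholds: minimum frequency, minimum PMI, minimum G^2.
collocates <- scores %>%
  filter(a >= 4, PMI >= 2, G_signed >= 10.83)

# The same set of items, ranked by each of the two measures.
by_pmi <- collocates %>% arrange(desc(PMI)) %>% pull(type)
by_g <- collocates %>% arrange(desc(G_signed)) %>% pull(type)
by_pmi
by_g
```

Here soup/nn is dropped by the frequency threshold and and/cc by the PMI threshold; the remaining items come out in a different order depending on the measure used for ranking.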

How to choose a measure?

  • It depends on the goal, previous literature…

  • For the final paper, comparing measures is a valid objective.

Suggestion

Combine an effect-size measure and a strength-of-evidence measure. Use them for thresholds and compare the rankings. A good pair is PMI and \(G^2\).

Next: Logistic regression