Association measures

Methods of Corpus Linguistics (class 4)

Mariana Montes

Outline

Introduction
Effect size measures
Strength of evidence measures
Usage in practice

Introduction

Sources

Check the documentation of mclm::assoc_scores().
Evert (2007) in Toledo.

Terms

Observed frequencies

	Target item	Other items	Total
Target context	O₁₁	O₁₂	R₁
Reference context	O₂₁	O₂₂	R₂
Total	C₁	C₂	N

Expected frequencies

	Target item	Other items	Total
Target context	E₁₁	E₁₂	R₁
Reference context	E₂₁	E₂₂	R₂
Total	C₁	C₂	N

Tip

See slides on Contingency tables.

Set up data

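library(tidyverse)
library(mclm)
library(kableExtra) # provides kbl() and kable_paper(), used to print the tables

# Folder with the Brown corpus files
corpus_folder <- here::here("studies", "_corpora", "brown") # adapt path!!
brown_fnames <- get_fnames(corpus_folder) %>% keep_re("/c[a-z]")

# Co-occurrence frequencies around hot/jj, turned into association scores
hot_assoc <- surf_cooc(brown_fnames, "^hot/jj", re_token_splitter = "\\s+") %>% 
  assoc_scores() %>% as_tibble()
head(hot_assoc)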

# A tibble: 6 × 17
  type       a     b     c       d   dir exp_a  DP_rows RR_rows    OR       MS
  <chr>  <dbl> <dbl> <dbl>   <dbl> <dbl> <dbl>    <dbl>   <dbl> <dbl>    <dbl>
1 ,/,       49   728 58104 1116762     1  38.4  0.0136    1.28  1.29  0.000843
2 the/at    39   738 68974 1105892    -1  45.6 -0.00851   0.855 0.847 0.000565
3 ./.       38   739 48774 1126092     1  32.3  0.00739   1.18  1.19  0.000778
4 and/cc    36   741 28506 1146360     1  18.9  0.0221    1.91  1.95  0.00126 
5 a/at      28   749 22915 1151951     1  15.2  0.0165    1.85  1.88  0.00122 
6 of/in     21   756 35007 1139859    -1  23.2 -0.00277   0.907 0.904 0.000600
# … with 6 more variables: Dice <dbl>, PMI <dbl>, chi2_signed <dbl>,
#   G_signed <dbl>, t <dbl>, p_fisher_1 <dbl>
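
The columns a, b, c and d hold the four observed frequencies (O₁₁, O₁₂, O₂₁, O₂₂) and exp_a holds E₁₁. As a minimal sketch, assuming the usual definition of expected frequencies under independence (E_ij = R_i × C_j / N, which these slides do not spell out), the exp_a value of the first row can be reproduced by hand. The object names `obs` and `expected` are only illustrative.

```r
# Observed frequencies of ,/, around hot/jj (first row of the output above)
obs <- matrix(
  c(49,    728,      # target context:    a (O11), b (O12)
    58104, 1116762), # reference context: c (O21), d (O22)
  nrow = 2, byrow = TRUE
)

# Expected frequencies under independence: E_ij = R_i * C_j / N
expected <- outer(rowSums(obs), colSums(obs)) / sum(obs)

round(expected[1, 1], 1) # 38.4, the value of exp_a for ,/,
```

The expected table has the same row and column totals as the observed one; only the distribution over the four cells differs.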

Effect size measures

Difference of proportions (\(\Delta p\))

\[\frac{O_{11}}{R_1}-\frac{O_{21}}{R_2}\]

\(< 0\): repulsion
\(0\): neutral
\(> 0\): attraction

hot_assoc %>% arrange(desc(DP_rows)) %>% 
  select(type, a, b, c, d, DP_rows) %>% head(10) %>% 
  kbl() %>% kable_paper(font_size = 22)

type	a	b	c	d	DP_rows
and/cc	36	741	28506	1146360	0.022
a/at	28	749	22915	1151951	0.017
water/nn	13	764	413	1174453	0.016
,/,	49	728	58104	1116762	0.014
was/bedz	13	764	9793	1165073	0.008
cold/jj	6	771	133	1174733	0.008
./.	38	739	48774	1126092	0.007
with/in	10	767	7251	1167615	0.007
it/pps	9	768	5844	1169022	0.007
sun/nn	5	772	96	1174770	0.006
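
To make the formula concrete, here is a small sketch that recomputes \(\Delta p\) from the four cells. `delta_p()` is a hypothetical helper written for this illustration, not an mclm function; the counts come from the tables above.

```r
# Hypothetical helper: difference of proportions from the four observed cells
delta_p <- function(O11, O12, O21, O22) {
  O11 / (O11 + O12) - O21 / (O21 + O22)
}

dp_and <- delta_p(36, 741, 28506, 1146360) # and/cc
dp_the <- delta_p(39, 738, 68974, 1105892) # the/at

round(dp_and, 3) # 0.022, as in the table
dp_the < 0       # TRUE: repulsion, in line with dir = -1 for the/at
```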

Relative risk

\[\frac{O_{11}/R_1}{O_{21}/R_2}\]

\(< 1\): repulsion
\(1\): neutral
\(> 1\): attraction

hot_assoc %>% arrange(desc(RR_rows)) %>% 
  select(type, a, b, c, d, RR_rows) %>% head(10) %>% 
  kbl() %>% kable_paper(font_size = 22)

type	a	b	c	d	RR_rows
cereal/nn	4	773	12	1174854	504.02
soup/nn	3	774	13	1174853	348.94
cup/nn	3	774	40	1174826	113.40
coffee/nn	4	773	72	1174794	84.00
sun/nn	5	772	96	1174770	78.75
weather/nn	3	774	63	1174803	72.00
cold/jj	6	771	133	1174733	68.21
water/nn	13	764	413	1174453	47.59
summer/nn	3	774	128	1174738	35.44
day/nn	4	773	623	1174243	9.71

Odds ratio

\[\frac{O_{11}/O_{12}}{O_{21}/O_{22}}\]

\(< 1\): repulsion
\(1\): neutral
\(> 1\): attraction

hot_assoc %>% arrange(desc(OR)) %>% 
  select(type, a, b, c, d, OR) %>% head(10) %>% 
  kbl() %>% kable_paper(font_size = 22)

type	a	b	c	d	OR
cereal/nn	4	773	12	1174854	506.62
soup/nn	3	774	13	1174853	350.28
cup/nn	3	774	40	1174826	113.84
coffee/nn	4	773	72	1174794	84.43
sun/nn	5	772	96	1174770	79.26
weather/nn	3	774	63	1174803	72.28
cold/jj	6	771	133	1174733	68.74
water/nn	13	764	413	1174453	48.39
summer/nn	3	774	128	1174738	35.57
day/nn	4	773	623	1174243	9.75
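
Relative risk and odds ratio give almost the same values and the same ranking here. The sketch below suggests why: when the item is rare in both contexts, O₁₂ is close to R₁ and O₂₂ is close to R₂, so the two ratios nearly coincide. Both functions are hypothetical helpers for illustration; the counts are those of cereal/nn.

```r
# Hypothetical helpers, written directly from the two formulas
relative_risk <- function(O11, O12, O21, O22) {
  (O11 / (O11 + O12)) / (O21 / (O21 + O22))
}
odds_ratio <- function(O11, O12, O21, O22) {
  (O11 / O12) / (O21 / O22)
}

rr_cereal <- relative_risk(4, 773, 12, 1174854)
or_cereal <- odds_ratio(4, 773, 12, 1174854)

round(c(RR = rr_cereal, OR = or_cereal), 2) # RR 504.02, OR 506.62
```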

Log odds ratio

\[\log\left(\frac{O_{11}/O_{12}}{O_{21}/O_{22}}\right)\]

\(< 0\): repulsion
\(0\): neutral
\(> 0\): attraction

hot_assoc %>% mutate(log_OR = log(OR)) %>% arrange(desc(log_OR)) %>% 
  select(type, a, b, c, d, OR, log_OR) %>% head(10) %>% 
  kbl() %>% kable_paper(font_size = 22)

type	a	b	c	d	OR	log_OR
cereal/nn	4	773	12	1174854	506.62	6.23
soup/nn	3	774	13	1174853	350.28	5.86
cup/nn	3	774	40	1174826	113.84	4.74
coffee/nn	4	773	72	1174794	84.43	4.44
sun/nn	5	772	96	1174770	79.26	4.37
weather/nn	3	774	63	1174803	72.28	4.28
cold/jj	6	771	133	1174733	68.74	4.23
water/nn	13	764	413	1174453	48.39	3.88
summer/nn	3	774	128	1174738	35.57	3.57
day/nn	4	773	623	1174243	9.75	2.28

Pointwise mutual information (PMI)

\[\log_2 \left(\frac{O_{11}}{E_{11}}\right)\]

\(< 0\): repulsion
\(0\): neutral
\(> 0\): attraction

hot_assoc %>% arrange(desc(PMI)) %>% 
  select(type, a, exp_a, PMI) %>% head(10) %>% 
  kbl() %>% kable_paper(font_size = 22)

type	a	exp_a	PMI
cereal/nn	4	0.011	8.56
soup/nn	3	0.011	8.15
cup/nn	3	0.028	6.72
coffee/nn	4	0.050	6.32
sun/nn	5	0.067	6.23
weather/nn	3	0.044	6.10
cold/jj	6	0.092	6.03
water/nn	13	0.282	5.53
summer/nn	3	0.087	5.12
day/nn	4	0.414	3.27
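
As a worked example, the PMI of cereal/nn can be recomputed from its four cells. This is a sketch that assumes E₁₁ = R₁ × C₁ / N; the variable names follow the notation of the contingency tables.

```r
# Observed frequencies of cereal/nn around hot/jj (a, b, c, d in the tables)
O11 <- 4
O12 <- 773
O21 <- 12
O22 <- 1174854

R1 <- O11 + O12             # size of the target context
C1 <- O11 + O21             # frequency of cereal/nn in the whole corpus
N  <- O11 + O12 + O21 + O22 # corpus size

E11 <- R1 * C1 / N          # expected frequency under independence
pmi <- log2(O11 / E11)

round(E11, 3) # 0.011
round(pmi, 2) # 8.56
```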

Dice coefficient

\[\frac{2O_{11}}{R_1+C_1}\]

Harmonic mean of \(\frac{O_{11}}{R_1}\) and \(\frac{O_{11}}{C_1}\)

Range: 0 to 1 (1 only if the item and the target context always occur together)

hot_assoc %>% arrange(desc(Dice)) %>% 
  select(type, a, b, c, Dice) %>% head(10) %>% 
  kbl() %>% kable_paper(font_size = 22)

type	a	b	c	Dice
water/nn	13	764	413	0.022
cold/jj	6	771	133	0.013
sun/nn	5	772	96	0.011
cereal/nn	4	773	12	0.010
coffee/nn	4	773	72	0.009
soup/nn	3	774	13	0.008
cup/nn	3	774	40	0.007
weather/nn	3	774	63	0.007
summer/nn	3	774	128	0.007
day/nn	4	773	623	0.006
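
The equivalence between the formula and the harmonic mean can be checked numerically. A minimal sketch with the counts of water/nn; `p_context` and `p_item` are illustrative names for the two proportions.

```r
# Observed frequencies of water/nn around hot/jj
O11 <- 13
O12 <- 764
O21 <- 413

R1 <- O11 + O12 # size of the target context
C1 <- O11 + O21 # frequency of water/nn in the whole corpus

dice <- 2 * O11 / (R1 + C1)

p_context <- O11 / R1 # share of the target context taken up by the item
p_item    <- O11 / C1 # share of the item's occurrences in the target context
harmonic_mean <- 2 * p_context * p_item / (p_context + p_item)

all.equal(dice, harmonic_mean) # TRUE
round(dice, 3)                 # 0.022
```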

Sum up effect size measures

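Intuitive: the values directly express how much more (or less) frequent the item is in the target context.

Fragile, especially with low frequencies (which we often have): one occurrence more or less can change the value drastically.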
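
A small sketch of this fragility, using PMI. The second item is hypothetical (it is not in the data): it occurs once in the whole corpus, and that single occurrence happens to be near hot/jj. `pmi()` is an illustrative helper that applies the formula directly; mclm::assoc_scores() may treat tables with zero cells differently.

```r
# Hypothetical helper: PMI from the four observed cells
pmi <- function(O11, O12, O21, O22) {
  N   <- O11 + O12 + O21 + O22
  E11 <- (O11 + O12) * (O11 + O21) / N
  log2(O11 / E11)
}

pmi_cereal <- pmi(4, 773, 12, 1174854) # cereal/nn, counts from the tables
pmi_hapax  <- pmi(1, 776, 0, 1174866)  # hypothetical item with corpus frequency 1

round(pmi_cereal, 2)   # 8.56
round(pmi_hapax, 1)    # 10.6
pmi_hapax > pmi_cereal # TRUE: one single occurrence outranks cereal/nn
```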

Strength of evidence measures

About these measures

How certain are we that there is a difference?
Take into account effect size and amount of information.
- If you have a big difference you don’t need much data.
- If you have a subtle difference you need a lot of data.

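BUT they combine attraction and repulsion: strong attraction and strong repulsion both yield high values! The signed versions in the output (chi2_signed, G_signed) are negative in case of repulsion.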

Chi-square (\(\chi^2\))

\[\sum_{i=1}^{2}\sum_{j=1}^{2}\frac{(O_{ij}-E_{ij})^2}{E_{ij}}\]

Not a good test for low-frequency items, i.e. when one of the expected frequencies is below 5.

hot_assoc %>% arrange(desc(chi2_signed)) %>% 
  select(type, a, exp_a, b, c, d, chi2_signed) %>% head(10) %>% 
  kbl() %>% kable_paper(font_size = 22)

type	a	exp_a	b	c	d	chi2_signed
cereal/nn	4	0.011	773	12	1174854	1506.1
soup/nn	3	0.011	774	13	1174853	845.7
water/nn	13	0.282	764	413	1174453	575.1
cold/jj	6	0.092	771	133	1174733	380.3
sun/nn	5	0.067	772	96	1174770	364.9
cup/nn	3	0.028	774	40	1174826	310.9
coffee/nn	4	0.050	773	72	1174794	310.8
weather/nn	3	0.044	774	63	1174803	200.5
summer/nn	3	0.087	774	128	1174738	98.1
day/nn	4	0.414	773	623	1174243	31.1
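
The top value in the table can be reproduced from the formula. This is a sketch under two assumptions: expected frequencies are computed as R_i × C_j / N, and the sign is taken to be positive when O₁₁ is at least E₁₁ (a guess at how the signed column is built, to be checked against the mclm documentation).

```r
# Observed frequencies of cereal/nn around hot/jj
obs <- matrix(c(4, 773, 12, 1174854), nrow = 2, byrow = TRUE)
expected <- outer(rowSums(obs), colSums(obs)) / sum(obs)

# Sum of (O - E)^2 / E over the four cells
chi2 <- sum((obs - expected)^2 / expected)

# Assumed sign convention: +1 for attraction, -1 for repulsion
direction <- if (obs[1, 1] >= expected[1, 1]) 1 else -1
chi2_signed <- direction * chi2

round(chi2_signed, 1) # 1506.1
any(expected < 5)     # TRUE: the rule of thumb is violated for this item
```

Almost all of the statistic comes from the single cell O₁₁, whose expected frequency is about 0.011.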

Log-likelihood ratio: \(G\) or \(G^2\)

\[2\sum_{i=1}^{2}\sum_{j=1}^{2}\left(O_{ij} \times \log\left(\frac{O_{ij}}{E_{ij}}\right) \right)\]

Also problematic with low frequencies, but less so than \(\chi^2\) (expected values can be as low as 3).

hot_assoc %>% arrange(desc(G_signed)) %>% 
  select(type, a, exp_a, b, c, G_signed) %>% head(10) %>% 
  kbl() %>% kable_paper(font_size = 22)

type	a	exp_a	b	c	G_signed
water/nn	13	0.282	764	413	74.8
cereal/nn	4	0.011	773	12	40.6
cold/jj	6	0.092	771	133	38.6
sun/nn	5	0.067	772	96	33.6
soup/nn	3	0.011	774	13	28.5
coffee/nn	4	0.050	773	72	27.3
cup/nn	3	0.028	774	40	22.2
weather/nn	3	0.044	774	63	19.6
summer/nn	3	0.087	774	128	15.5
and/cc	36	18.864	741	28506	12.7
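
As with \(\chi^2\), the statistic can be recomputed from the formula. A minimal sketch for water/nn, assuming natural logarithms and expected frequencies R_i × C_j / N. It only works as written because no cell is zero; a zero cell would need special handling.

```r
# Observed frequencies of water/nn around hot/jj
obs <- matrix(c(13, 764, 413, 1174453), nrow = 2, byrow = TRUE)
expected <- outer(rowSums(obs), colSums(obs)) / sum(obs)

# 2 * sum of O * log(O / E) over the four cells
G <- 2 * sum(obs * log(obs / expected))

round(G, 1) # 74.8, as in the table
```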

Fisher exact test

We’ll use the p-value: the lower it is, the stronger the evidence of association.
Used for low frequencies: accurate but computationally expensive.

hot_assoc %>% arrange(p_fisher_1) %>% 
  select(type, a, b, c, d, p_fisher_1) %>% head(10) %>% 
  kbl(digits = 5) %>% kable_paper(font_size = 22)

type	a	b	c	d	p_fisher_1
water/nn	13	764	413	1174453	0.00000
cereal/nn	4	773	12	1174854	0.00000
cold/jj	6	771	133	1174733	0.00000
sun/nn	5	772	96	1174770	0.00000
soup/nn	3	774	13	1174853	0.00000
coffee/nn	4	773	72	1174794	0.00000
cup/nn	3	774	40	1174826	0.00000
weather/nn	3	774	63	1174803	0.00001
summer/nn	3	774	128	1174738	0.00010
and/cc	36	741	28506	1146360	0.00024
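
For a single item, a comparable p-value can be obtained with base R's `fisher.test()`. This sketch assumes that p_fisher_1 is a one-sided p-value and, for items attracted to the context, corresponds to `alternative = "greater"`; check the mclm documentation before relying on an exact match.

```r
# Rows: target context, reference context; columns: target item, other items
water <- matrix(c(13, 764, 413, 1174453), nrow = 2, byrow = TRUE)
and   <- matrix(c(36, 741, 28506, 1146360), nrow = 2, byrow = TRUE)

# One-sided test: is the item more frequent in the target context?
p_water <- fisher.test(water, alternative = "greater")$p.value
p_and   <- fisher.test(and, alternative = "greater")$p.value

p_water < 0.000005 # TRUE, which is why the table shows 0.00000
p_and > p_water    # TRUE: weaker evidence for and/cc than for water/nn
```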

Usage in practice

1. `assoc` object

Obtain a dataframe with association measures between different items and the target context.

collocations: e.g. between different words and hot/jj
keywords: e.g. between different words and the academic files of the corpus.

Code: mclm::assoc_scores() or mclm::assoc_abcd().

What counts as association?

Set a threshold (always arbitrary, may be theoretically informed).
Use a ranking.
Combine: choose the n-best elements.

Tip

This can also be combined with different measures, e.g. use a frequency threshold and a minimum value of PMI and of \(G^2\), and rank by either PMI or \(G^2\).
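
A sketch of such a combination with dplyr. The small data frame `candidates` copies values from the tables in these slides so that the example runs on its own. The three thresholds are arbitrary illustrations; 3.84 is the critical value of \(\chi^2\) with one degree of freedom at p < 0.05.

```r
library(dplyr)

# Values copied from the PMI and G tables above
candidates <- data.frame(
  type     = c("water/nn", "cereal/nn", "cold/jj", "sun/nn", "soup/nn", "summer/nn"),
  a        = c(13, 4, 6, 5, 3, 3),
  PMI      = c(5.53, 8.56, 6.03, 6.23, 8.15, 5.12),
  G_signed = c(74.8, 40.6, 38.6, 33.6, 28.5, 15.5)
)

top_collocates <- candidates %>%
  filter(a >= 4, PMI >= 2, G_signed >= 3.84) %>% # thresholds (arbitrary)
  arrange(desc(PMI)) %>%                         # ranking
  slice_head(n = 3)                              # n-best elements

top_collocates$type # "cereal/nn" "sun/nn" "cold/jj"
```

With the full object, the same pipeline would start from hot_assoc instead of `candidates`.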

How to choose a measure?

It depends on the goal, previous literature…
For the final paper, comparing measures is a valid objective.

Suggestion

Combine an effect-size measure and a strength-of-evidence measure. Use them for thresholds and compare the rankings. A good pair is PMI and \(G^2\).

Next: Logistic regression
