Methods of Corpus Linguistics (class 4)
Check the documentation of mclm::assoc_scores().
Evert (2007) in Toledo.
Introduction
Observed frequencies:
Context | Target item | Other items | Total |
---|---|---|---|
Target context | O11 | O12 | R1 |
Reference context | O21 | O22 | R2 |
Total | C1 | C2 | N |
Expected frequencies:
Context | Target item | Other items | Total |
---|---|---|---|
Target context | E11 | E12 | R1 |
Reference context | E21 | E22 | R2 |
Total | C1 | C2 | N |
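The expected frequencies follow from the marginal totals: E_ij = R_i × C_j / N. A minimal base-R sketch, assuming the counts of and/cc as a collocate of hot/jj reported later in this class, and assuming that the mclm columns a, b, c, d correspond to O11, O12, O21, O22. The object names are only illustrative.

```r
# Observed frequencies for and/cc as a collocate of hot/jj
# (assumption: mclm columns a, b, c, d = O11, O12, O21, O22)
obs <- matrix(c(36, 741, 28506, 1146360), nrow = 2, byrow = TRUE,
              dimnames = list(c("target_context", "reference_context"),
                              c("target_item", "other_items")))
R <- rowSums(obs)  # R1, R2
C <- colSums(obs)  # C1, C2
N <- sum(obs)
# Expected frequencies under independence: E_ij = R_i * C_j / N
expd <- outer(R, C) / N
round(expd, 3)
```

The top-left cell of expd should agree with the exp_a value that assoc_scores() reports for and/cc.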
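library(tidyverse)
library(mclm)
library(kableExtra) # provides kbl() and kable_paper(), used for the tables below
corpus_folder <- here::here("studies", "_corpora", "brown") # adapt this path to your own setup
# keep only the files whose path matches "/c[a-z]"
brown_fnames <- get_fnames(corpus_folder) %>% keep_re("/c[a-z]")
# surface co-occurrences of hot/jj, then association scores as a tibble
hot_assoc <- surf_cooc(brown_fnames, "^hot/jj", re_token_splitter = "\\s+") %>%
  assoc_scores() %>%
  as_tibble()
head(hot_assoc)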
# A tibble: 6 × 17
type a b c d dir exp_a DP_rows RR_rows OR MS
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 ,/, 49 728 58104 1116762 1 38.4 0.0136 1.28 1.29 0.000843
2 the/at 39 738 68974 1105892 -1 45.6 -0.00851 0.855 0.847 0.000565
3 ./. 38 739 48774 1126092 1 32.3 0.00739 1.18 1.19 0.000778
4 and/cc 36 741 28506 1146360 1 18.9 0.0221 1.91 1.95 0.00126
5 a/at 28 749 22915 1151951 1 15.2 0.0165 1.85 1.88 0.00122
6 of/in 21 756 35007 1139859 -1 23.2 -0.00277 0.907 0.904 0.000600
# … with 6 more variables: Dice <dbl>, PMI <dbl>, chi2_signed <dbl>,
# G_signed <dbl>, t <dbl>, p_fisher_1 <dbl>
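Difference of proportions (DP_rows): the proportion of the target item in the target context minus its proportion in the reference context.
\[\frac{O_{11}}{R_1}-\frac{O_{21}}{R_2}\]
hot_assoc %>% arrange(desc(DP_rows)) %>%
  select(type, a, b, c, d, DP_rows) %>% head(10) %>%
  kbl() %>% kable_paper(font_size = 22)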
type | a | b | c | d | DP_rows |
---|---|---|---|---|---|
and/cc | 36 | 741 | 28506 | 1146360 | 0.022 |
a/at | 28 | 749 | 22915 | 1151951 | 0.017 |
water/nn | 13 | 764 | 413 | 1174453 | 0.016 |
,/, | 49 | 728 | 58104 | 1116762 | 0.014 |
was/bedz | 13 | 764 | 9793 | 1165073 | 0.008 |
cold/jj | 6 | 771 | 133 | 1174733 | 0.008 |
./. | 38 | 739 | 48774 | 1126092 | 0.007 |
with/in | 10 | 767 | 7251 | 1167615 | 0.007 |
it/pps | 9 | 768 | 5844 | 1169022 | 0.007 |
sun/nn | 5 | 772 | 96 | 1174770 | 0.006 |
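As a check on the formula, the difference of proportions can be recomputed from the four cells alone. This is a base-R sketch, not the mclm implementation; dp_rows() is a hypothetical helper, and the counts are copied from the output above.

```r
# Difference of proportions from the four cells of the contingency table
# (hypothetical helper, not part of mclm)
dp_rows <- function(o11, o12, o21, o22) {
  o11 / (o11 + o12) - o21 / (o21 + o22)
}

dp_water <- dp_rows(13, 764, 413, 1174453)    # water/nn
dp_the   <- dp_rows(39, 738, 68974, 1105892)  # the/at
round(c(water = dp_water, the = dp_the), 4)
```

A negative value, as for the/at, means the item is proportionally less frequent around hot/jj than in the reference context.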
Effect size measures
Relative risk (RR_rows): the ratio of the same two proportions.
\[\frac{O_{11}/R_1}{O_{21}/R_2}\]
hot_assoc %>% arrange(desc(RR_rows)) %>%
  select(type, a, b, c, d, RR_rows) %>% head(10) %>%
  kbl() %>% kable_paper(font_size = 22)
type | a | b | c | d | RR_rows |
---|---|---|---|---|---|
cereal/nn | 4 | 773 | 12 | 1174854 | 504.02 |
soup/nn | 3 | 774 | 13 | 1174853 | 348.94 |
cup/nn | 3 | 774 | 40 | 1174826 | 113.40 |
coffee/nn | 4 | 773 | 72 | 1174794 | 84.00 |
sun/nn | 5 | 772 | 96 | 1174770 | 78.75 |
weather/nn | 3 | 774 | 63 | 1174803 | 72.00 |
cold/jj | 6 | 771 | 133 | 1174733 | 68.21 |
water/nn | 13 | 764 | 413 | 1174453 | 47.59 |
summer/nn | 3 | 774 | 128 | 1174738 | 35.44 |
day/nn | 4 | 773 | 623 | 1174243 | 9.71 |
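Odds ratio (OR): the odds of the target item in the target context divided by its odds in the reference context.
\[\frac{O_{11}/O_{12}}{O_{21}/O_{22}}\]
hot_assoc %>% arrange(desc(OR)) %>%
  select(type, a, b, c, d, OR) %>% head(10) %>%
  kbl() %>% kable_paper(font_size = 22)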
type | a | b | c | d | OR |
---|---|---|---|---|---|
cereal/nn | 4 | 773 | 12 | 1174854 | 506.62 |
soup/nn | 3 | 774 | 13 | 1174853 | 350.28 |
cup/nn | 3 | 774 | 40 | 1174826 | 113.84 |
coffee/nn | 4 | 773 | 72 | 1174794 | 84.43 |
sun/nn | 5 | 772 | 96 | 1174770 | 79.26 |
weather/nn | 3 | 774 | 63 | 1174803 | 72.28 |
cold/jj | 6 | 771 | 133 | 1174733 | 68.74 |
water/nn | 13 | 764 | 413 | 1174453 | 48.39 |
summer/nn | 3 | 774 | 128 | 1174738 | 35.57 |
day/nn | 4 | 773 | 623 | 1174243 | 9.75 |
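The relative risk and odds ratio tables are nearly identical because every collocate here is rare in both contexts, so O12 is close to R1 and O22 is close to R2. A small base-R sketch with the counts for cereal/nn; the function names are hypothetical helpers, not mclm functions.

```r
# Relative risk and odds ratio from the same four cells
# (hypothetical helpers, not part of mclm)
rr_rows <- function(o11, o12, o21, o22) {
  (o11 / (o11 + o12)) / (o21 / (o21 + o22))
}
odds_ratio <- function(o11, o12, o21, o22) (o11 / o12) / (o21 / o22)

# cereal/nn: rare in both contexts, so the two measures nearly coincide
rr_cereal <- rr_rows(4, 773, 12, 1174854)
or_cereal <- odds_ratio(4, 773, 12, 1174854)
round(c(RR = rr_cereal, OR = or_cereal, log_OR = log(or_cereal)), 2)
```

Taking the natural logarithm of the odds ratio makes the scale symmetric around 0: attraction gives positive values, repulsion negative ones.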
Log odds ratio: the natural logarithm of the odds ratio.
\[\log\left(\frac{O_{11}/O_{12}}{O_{21}/O_{22}}\right)\]
hot_assoc %>% mutate(log_OR = log(OR)) %>% arrange(desc(log_OR)) %>%
  select(type, a, b, c, d, OR, log_OR) %>% head(10) %>%
  kbl() %>% kable_paper(font_size = 22)
type | a | b | c | d | OR | log_OR |
---|---|---|---|---|---|---|
cereal/nn | 4 | 773 | 12 | 1174854 | 506.62 | 6.23 |
soup/nn | 3 | 774 | 13 | 1174853 | 350.28 | 5.86 |
cup/nn | 3 | 774 | 40 | 1174826 | 113.84 | 4.74 |
coffee/nn | 4 | 773 | 72 | 1174794 | 84.43 | 4.44 |
sun/nn | 5 | 772 | 96 | 1174770 | 79.26 | 4.37 |
weather/nn | 3 | 774 | 63 | 1174803 | 72.28 | 4.28 |
cold/jj | 6 | 771 | 133 | 1174733 | 68.74 | 4.23 |
water/nn | 13 | 764 | 413 | 1174453 | 48.39 | 3.88 |
summer/nn | 3 | 774 | 128 | 1174738 | 35.57 | 3.57 |
day/nn | 4 | 773 | 623 | 1174243 | 9.75 | 2.28 |
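Pointwise mutual information (PMI): the base-2 logarithm of the observed frequency over the expected frequency.
\[\log_2 \left(\frac{O_{11}}{E_{11}}\right)\]
hot_assoc %>% arrange(desc(PMI)) %>%
  select(type, a, exp_a, PMI) %>% head(10) %>%
  kbl() %>% kable_paper(font_size = 22)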
type | a | exp_a | PMI |
---|---|---|---|
cereal/nn | 4 | 0.011 | 8.56 |
soup/nn | 3 | 0.011 | 8.15 |
cup/nn | 3 | 0.028 | 6.72 |
coffee/nn | 4 | 0.050 | 6.32 |
sun/nn | 5 | 0.067 | 6.23 |
weather/nn | 3 | 0.044 | 6.10 |
cold/jj | 6 | 0.092 | 6.03 |
water/nn | 13 | 0.282 | 5.53 |
summer/nn | 3 | 0.087 | 5.12 |
day/nn | 4 | 0.414 | 3.27 |
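PMI depends only on the ratio of observed to expected frequency, so an item with very few co-occurrences can outrank a frequent one. A base-R sketch under the same assumptions as before; pmi() is a hypothetical helper, not an mclm function.

```r
# PMI = log2(O11 / E11), with E11 = R1 * C1 / N
# (hypothetical helper, not part of mclm)
pmi <- function(o11, o12, o21, o22) {
  n <- o11 + o12 + o21 + o22
  e11 <- (o11 + o12) * (o11 + o21) / n
  log2(o11 / e11)
}

pmi_cereal <- pmi(4, 773, 12, 1174854)      # only 4 co-occurrences
pmi_and    <- pmi(36, 741, 28506, 1146360)  # 36 co-occurrences
round(c(cereal = pmi_cereal, and = pmi_and), 2)
```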
Dice coefficient (Dice):
\[\frac{2O_{11}}{R_1+C_1}\]
This is the harmonic mean of \(\frac{O_{11}}{R_1}\) and \(\frac{O_{11}}{C_1}\).
hot_assoc %>% arrange(desc(Dice)) %>%
  select(type, a, b, c, Dice) %>% head(10) %>%
  kbl() %>% kable_paper(font_size = 22)
type | a | b | c | Dice |
---|---|---|---|---|
water/nn | 13 | 764 | 413 | 0.022 |
cold/jj | 6 | 771 | 133 | 0.013 |
sun/nn | 5 | 772 | 96 | 0.011 |
cereal/nn | 4 | 773 | 12 | 0.010 |
coffee/nn | 4 | 773 | 72 | 0.009 |
soup/nn | 3 | 774 | 13 | 0.008 |
cup/nn | 3 | 774 | 40 | 0.007 |
weather/nn | 3 | 774 | 63 | 0.007 |
summer/nn | 3 | 774 | 128 | 0.007 |
day/nn | 4 | 773 | 623 | 0.006 |
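The two formulations of the Dice coefficient can be compared numerically. A base-R sketch with the counts for water/nn; both functions are hypothetical helpers written for this illustration.

```r
# Dice coefficient, written in two equivalent ways
# (hypothetical helpers, not part of mclm)
dice <- function(o11, o12, o21) 2 * o11 / ((o11 + o12) + (o11 + o21))
harmonic_mean <- function(x, y) 2 / (1 / x + 1 / y)

# water/nn: O11 = 13, O12 = 764, O21 = 413
dice_water <- dice(13, 764, 413)
hm_water   <- harmonic_mean(13 / (13 + 764), 13 / (13 + 413))
round(c(dice = dice_water, harmonic_mean = hm_water), 3)
```

Unlike the other effect size measures, Dice does not use O22, so the size of the rest of the corpus has no influence on it.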
How certain are we that there is a difference?
Take into account both the effect size and the amount of information.
If you have a big difference you don’t need much data.
If you have a subtle difference you need a lot of data.
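The distinction can be illustrated with two made-up tables that have identical proportions but differ in size by a factor of 10. The counts below are hypothetical (not from the Brown corpus), and the helper names are only illustrative.

```r
# Hypothetical counts: same proportions, ten times more data
small <- matrix(c(20, 80, 10, 90), nrow = 2, byrow = TRUE)
large <- small * 10

# Effect size: odds ratio of a 2x2 table
or_from_table <- function(m) (m[1, 1] / m[1, 2]) / (m[2, 1] / m[2, 2])
# Strength of evidence: Pearson chi-squared without continuity correction
chi2_from_table <- function(m) {
  unname(chisq.test(m, correct = FALSE)$statistic)
}

c(OR_small = or_from_table(small), OR_large = or_from_table(large))
c(chi2_small = chi2_from_table(small), chi2_large = chi2_from_table(large))
```

The odds ratio is the same for both tables, while the chi-squared statistic grows in proportion to the amount of data.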
Strength of evidence measures
Chi-squared statistic (chi2_signed):
\[\sum_{i=1}^{2}\sum_{j=1}^{2}\frac{(O_{ij}-E_{ij})^2}{E_{ij}}\]
Not reliable for low-frequency items, i.e. when one of the expected frequencies is below 5.
hot_assoc %>% arrange(desc(chi2_signed)) %>%
  select(type, a, exp_a, b, c, d, chi2_signed) %>% head(10) %>%
  kbl() %>% kable_paper(font_size = 22)
type | a | exp_a | b | c | d | chi2_signed |
---|---|---|---|---|---|---|
cereal/nn | 4 | 0.011 | 773 | 12 | 1174854 | 1506.1 |
soup/nn | 3 | 0.011 | 774 | 13 | 1174853 | 845.7 |
water/nn | 13 | 0.282 | 764 | 413 | 1174453 | 575.1 |
cold/jj | 6 | 0.092 | 771 | 133 | 1174733 | 380.3 |
sun/nn | 5 | 0.067 | 772 | 96 | 1174770 | 364.9 |
cup/nn | 3 | 0.028 | 774 | 40 | 1174826 | 310.9 |
coffee/nn | 4 | 0.050 | 773 | 72 | 1174794 | 310.8 |
weather/nn | 3 | 0.044 | 774 | 63 | 1174803 | 200.5 |
summer/nn | 3 | 0.087 | 774 | 128 | 1174738 | 98.1 |
day/nn | 4 | 0.414 | 773 | 623 | 1174243 | 31.1 |
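The top-ranked items all have very small expected frequencies, which is exactly the situation the rule of thumb warns about. A base-R sketch that recomputes the unsigned statistic for cereal/nn and reports the smallest expected frequency; chi2_2x2() is a hypothetical helper, not the mclm implementation.

```r
# Pearson chi-squared for a 2x2 table, without continuity correction
# (hypothetical helper, not part of mclm)
chi2_2x2 <- function(o11, o12, o21, o22) {
  obs <- matrix(c(o11, o12, o21, o22), nrow = 2, byrow = TRUE)
  expd <- outer(rowSums(obs), colSums(obs)) / sum(obs)
  list(statistic = sum((obs - expd)^2 / expd),
       min_expected = min(expd))
}

cereal <- chi2_2x2(4, 773, 12, 1174854)
round(cereal$statistic, 1)  # compare with the chi2_signed column
cereal$min_expected < 5     # is the rule of thumb violated?
```

Almost all of the statistic comes from the single cell O11, where 4 observations are set against an expected frequency of about 0.011.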
Log-likelihood ratio G (G_signed):
\[2\sum_{i=1}^{2}\sum_{j=1}^{2}\left(O_{ij} \times \log\left(\frac{O_{ij}}{E_{ij}}\right) \right)\]
hot_assoc %>% arrange(desc(G_signed)) %>%
  select(type, a, exp_a, b, c, G_signed) %>% head(10) %>%
  kbl() %>% kable_paper(font_size = 22)
type | a | exp_a | b | c | G_signed |
---|---|---|---|---|---|
water/nn | 13 | 0.282 | 764 | 413 | 74.8 |
cereal/nn | 4 | 0.011 | 773 | 12 | 40.6 |
cold/jj | 6 | 0.092 | 771 | 133 | 38.6 |
sun/nn | 5 | 0.067 | 772 | 96 | 33.6 |
soup/nn | 3 | 0.011 | 774 | 13 | 28.5 |
coffee/nn | 4 | 0.050 | 773 | 72 | 27.3 |
cup/nn | 3 | 0.028 | 774 | 40 | 22.2 |
weather/nn | 3 | 0.044 | 774 | 63 | 19.6 |
summer/nn | 3 | 0.087 | 774 | 128 | 15.5 |
and/cc | 36 | 18.864 | 741 | 28506 | 12.7 |
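G is less extreme than chi-squared for very rare items: cereal/nn scores 1506.1 on chi-squared but only 40.6 on G, and water/nn moves to the top. A base-R sketch of the formula; g_2x2() is a hypothetical helper, not the mclm implementation, and treating empty cells as contributing 0 is an assumption of this sketch.

```r
# Log-likelihood ratio G for a 2x2 table
# (hypothetical helper, not part of mclm)
g_2x2 <- function(o11, o12, o21, o22) {
  obs <- matrix(c(o11, o12, o21, o22), nrow = 2, byrow = TRUE)
  expd <- outer(rowSums(obs), colSums(obs)) / sum(obs)
  # assumption: cells with O = 0 contribute 0 (the limit of x * log(x))
  terms <- ifelse(obs > 0, obs * log(obs / expd), 0)
  2 * sum(terms)
}

g_water  <- g_2x2(13, 764, 413, 1174453)
g_cereal <- g_2x2(4, 773, 12, 1174854)
round(c(water = g_water, cereal = g_cereal), 1)
```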
Fisher's exact test (p_fisher_1)
We’ll use the p-value: the lower the better.
Used for low frequencies: accurate but computationally expensive.
hot_assoc %>% arrange(p_fisher_1) %>%
  select(type, a, b, c, d, p_fisher_1) %>% head(10) %>%
  kbl(digits = 5) %>% kable_paper(font_size = 22)
type | a | b | c | d | p_fisher_1 |
---|---|---|---|---|---|
water/nn | 13 | 764 | 413 | 1174453 | 0.00000 |
cereal/nn | 4 | 773 | 12 | 1174854 | 0.00000 |
cold/jj | 6 | 771 | 133 | 1174733 | 0.00000 |
sun/nn | 5 | 772 | 96 | 1174770 | 0.00000 |
soup/nn | 3 | 774 | 13 | 1174853 | 0.00000 |
coffee/nn | 4 | 773 | 72 | 1174794 | 0.00000 |
cup/nn | 3 | 774 | 40 | 1174826 | 0.00000 |
weather/nn | 3 | 774 | 63 | 1174803 | 0.00001 |
summer/nn | 3 | 774 | 128 | 1174738 | 0.00010 |
and/cc | 36 | 741 | 28506 | 1146360 | 0.00024 |
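Base R offers the same kind of test through stats::fisher.test(). The sketch below assumes that p_fisher_1 is a one-sided p-value for O11 being higher than expected; this reading is an assumption and should be checked against the documentation of mclm::assoc_scores(). fisher_p() is a hypothetical wrapper.

```r
# One-sided Fisher exact test: is O11 higher than expected?
# (hypothetical wrapper; assumes p_fisher_1 is one-sided in this direction)
fisher_p <- function(o11, o12, o21, o22) {
  obs <- matrix(c(o11, o12, o21, o22), nrow = 2, byrow = TRUE)
  fisher.test(obs, alternative = "greater")$p.value
}

p_summer <- fisher_p(3, 774, 128, 1174738)     # summer/nn
p_and    <- fisher_p(36, 741, 28506, 1146360)  # and/cc
signif(c(summer = p_summer, and = p_and), 2)
```

Printing with signif() rather than a fixed number of decimals keeps very small p-values distinguishable instead of showing them all as 0.00000.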
assoc object
Obtain a dataframe with association measures between different items and the target context.
collocations: e.g. between different words and hot/jj
keywords: e.g. between different words and the academic files of the corpus.
Code: mclm::assoc_scores() or mclm::assoc_abcd().
Usage in practice
Set a threshold (always arbitrary, may be theoretically informed).
Use a ranking.
Combine: choose the n-best elements.
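The three strategies can be sketched in base R on the G_signed values from the table earlier in this class. The cut-off of 20 and n = 5 are arbitrary values chosen for illustration only.

```r
# G_signed values copied from the table in this class
scores <- data.frame(
  type = c("water/nn", "cereal/nn", "cold/jj", "sun/nn", "soup/nn",
           "coffee/nn", "cup/nn", "weather/nn", "summer/nn", "and/cc"),
  G_signed = c(74.8, 40.6, 38.6, 33.6, 28.5, 27.3, 22.2, 19.6, 15.5, 12.7),
  stringsAsFactors = FALSE
)

# 1. Threshold: keep items at or above an (arbitrary) cut-off of 20
above_threshold <- scores[scores$G_signed >= 20, ]

# 2. Ranking: sort from strongest to weakest evidence
ranked <- scores[order(scores$G_signed, decreasing = TRUE), ]
ranked$rank <- seq_len(nrow(ranked))

# 3. n-best: rank, then keep the top n (here n = 5, also arbitrary)
n_best <- head(ranked, 5)
n_best$type
```

With dplyr loaded, slice_max(scores, G_signed, n = 5) would select the same five items.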
It depends on the goal, previous literature…
For the final paper, comparing measures is a valid objective.