Contingency tables

Methods of Corpus Linguistics (class 3)

Mariana Montes

Outline

  • Hot example
  • Observed frequencies
  • Expected frequencies
  • Target and reference contexts
  • Summing up

Hot example

Set up data

library(tidyverse)
library(mclm)
corpus_folder <- here::here("studies", "_corpora", "brown")
brown_fnames <- get_fnames(corpus_folder) %>% keep_re("/c[a-z]")
flist <- freqlist(brown_fnames, re_token_splitter = re("\\s+"))
print(flist, n = 5)
Frequency list (types in list: 63517, tokens in list: 1162192)
rank   type abs_freq nrm_freq
---- ------ -------- --------
   1 the/at    69013  593.818
   2    ,/,    58153  500.373
   3    ./.    48812  419.999
   4  of/in    35028  301.396
   5 and/cc    28542  245.588
...

Frequency of hot

flist %>% keep_re("^hot/")
Frequency list (types in list: 3, tokens in list: 130)
<total number of tokens: 1162192>
rank orig_rank      type abs_freq nrm_freq
---- --------- --------- -------- --------
   1       805    hot/jj      123    1.058
   2     14852 hot/jj-tl        5    0.043
   3     28147 hot/jj-hl        2    0.017

Concordance of hot

hot <- conc(brown_fnames, "\\bhot/jj")
hot
Concordance-based data frame (number of observations: 118)
idx                             left|match |right                           
  1 .../-- After/in a/at long/jj ,/,|hot/jj|controversy/nn ,/, Miller/np-...
  2 ...$ Dave/np Mills/np in/in a/at|hot/jj|duel/nn in/in 1.10.1/cd ./. K...
  3 ...opped/vbd this/dt suddenly/rb|hot/jj|potato/nn in/in a/at very/ql ...
  4 ...wps got/vbd off/rp to/in a/at|hot/jj|start/nn and/cc made/vbd a/at...
  5 .../at cup/nn of/in steaming/vbg|hot/jj|coffee/nn awaiting/vbg him/pp...
  6 ...A/at measure/nn of/in how/wrb|hot/jj|the/at stock/nn was/bedz ,/, ...
  7 ...oods/nns-tl issue/nn was/bedz|hot/jj|long/rb before/cs it/pps was/...
  8 ...ock/nn issue/vb such/abl a/at|hot/jj|one/cd ?/. ?/. The/at answer/...
  9 ... Foods/nns-tl stock/nn the/at|hot/jj|issue/nn that/cs it/pps was/b...
 10 ..., introduced/vbd the/at ``/``|hot/jj|dog/nn ''/'' and/cc paved/vbd...
 11 ...g/vbg temperatures/nns and/cc|hot/jj|summer/nn pavements/nns are/b...
 12 ...der/rbr ./. An/at ordinary/jj|hot/jj|bath/nn or/cc shower/nn will/...
 13 ...smell/nn and/cc feel/nn of/in|hot/jj|,/, wet/jj woolen/jj sleeves/...
 14 ...ns ./. No/at matter/nn how/ql|hot/jj|the/at day/nn ,/, they/ppss a...
 15 ... who/wps was/bedz blowing/vbg|hot/jj|and/cc cold/jj ,/, exalting/v...
 16 ...der/in sunny/jj skies/nns ,/,|hot/jj|sun/nn ,/, and/cc a/at fresh/...
 17 ...nd/vb the/at summer/nn too/ql|hot/jj|for/in comfort/nn ./. And/cc ...
 18 ...n't/doz* mean/vb that/cs a/at|hot/jj|rodder/nn must/md necessarily...
 19 ...ctive/jj and/cc successful/jj|hot/jj|rodder/nn for/in years/nns wi...
 20 ...d/vbn to/in its/pp$ normal/jj|hot/jj|operating/vbg pressure/nn and...
 21 ... rated/vbn at/in a/at very/ql|hot/jj|2,460/cd fps/nn ,/, and/cc it...
 22 ...napkins/nns that/wps kept/vbd|hot/jj|a/at platter/nn of/in oyster/...
 23 ...nn that/wps kept/vbd them/ppo|hot/jj|when/wrb served/vbn --/-- was...
 24 ...n ./. Heat/vb and/cc serve/vb|hot/jj|on/in toast/nn ./. The/at ome...
 25 ...f/cs they/ppss are/ber too/ql|hot/jj|,/, and/cc to/to stop/vb flam...
 26 ...nns ./. And/cc lots/nns of/in|hot/jj|pads/nns !/. !/. Do/do keep/v...
 27 .../nn chilled/vbn or/cc soup/nn|hot/jj|./. Be/be sure/jj to/to get/v...
 28 ...e/at chili/nn and/cc kraut/nn|hot/jj|with/in the/at franks/nns ./....
 29 ...p$ daytime/jj naps/nns and/cc|hot/jj|meals/nns ,/, and/cc be/be pu...
 30 ...shed/vbn and/cc free/jj of/in|hot/jj|weather/nn nerves/nns ./. You...
...
...

This data frame has 6 columns:
   column
1 glob_id
2      id
3  source
4    left
5   match
6   right

Concordance of hot

hot %>% 
  arrange(right)
Concordance-based data frame (number of observations: 118)
idx                             left|match |right                           
  1 ...t was/bedz the/at word/nn ,/,|hot/jj|!/. !/. Hair/nn like/cs a/at ...
  2 .../vbz in/in the/at long/jj (/(|hot/jj|)/) run/nn to/to take/vb good...
  3 ...f/cs they/ppss are/ber too/ql|hot/jj|,/, and/cc to/to stop/vb flam...
  4 ...d/cc the/at dice/nns were/bed|hot/jj|,/, but/cc he/pps couldn't/md...
  5 ...pps had/hvd been/ben still/rb|hot/jj|,/, she/pps might/md even/rb ...
  6 ...e/pps felt/vbd cold/jj and/cc|hot/jj|,/, sticky/jj and/cc chilly/j...
  7 ...ed/vbn on/in sensation/nn (/(|hot/jj|,/, sweet/jj ,/, bitter/jj ,/...
  8 ...dz unusually/rb dry/jj and/cc|hot/jj|,/, the/at spring/nn produced...
  9 ...smell/nn and/cc feel/nn of/in|hot/jj|,/, wet/jj woolen/jj sleeves/...
 10 .../nn chilled/vbn or/cc soup/nn|hot/jj|./. Be/be sure/jj to/to get/v...
 11 ...n and/cc felt/vbd a/at bit/nn|hot/jj|./. He/pps stayed/vbd home/nr...
 12 ...d away/rb ./. It/pps was/bedz|hot/jj|./. The/at dogs/nns were/bed ...
 13 ...and/cc the/at sun/nn was/bedz|hot/jj|./. The/at new/jj Riverside/n...
 14 ... rated/vbn at/in a/at very/ql|hot/jj|2,460/cd fps/nn ,/, and/cc it...
 15 ...napkins/nns that/wps kept/vbd|hot/jj|a/at platter/nn of/in oyster/...
 16 ...vbd around/rb ,/, suddenly/rb|hot/jj|all/ql over/rp ,/, finding/vb...
 17 ... who/wps was/bedz blowing/vbg|hot/jj|and/cc cold/jj ,/, exalting/v...
 18 ...onal/jj wars/nns ,/, both/abx|hot/jj|and/cc cold/jj ./. In/in ever...
 19 .... His/pp$ prescription/nn :/:|hot/jj|and/cc cold/jj compresses/nns...
 20 ...heated/jj in/in winter/nn ,/,|hot/jj|and/cc damp/jj under/in the/a...
 21 .../cs your/pp$ blood/nn got/vbd|hot/jj|and/cc danced/vbd with/in the...
 22 .../. The/at theatre/nn was/bedz|hot/jj|and/cc they/ppss were/bed dru...
 23 ...The/at sun/nn ,/, blazing/vbg|hot/jj|as/cs prophesied/vbn ,/, was/...
 24 ...rb it/pps was/bedz already/rb|hot/jj|at/in 7:30/cd A.M./rb ,/, and...
 25 ...d out/rp pink/jj from/in a/at|hot/jj|bath/nn ,/, and/cc I/ppss gav...
 26 ...der/rbr ./. An/at ordinary/jj|hot/jj|bath/nn or/cc shower/nn will/...
 27 ...at drifting/vbg odor/nn of/in|hot/jj|biscuits/nns ./. The/at old/j...
 28 ...pss want/vb to/to buy/vb a/at|hot/jj|Bodhisattva/np ./. Additional...
 29 ...t/nn of/in Mrs./np Fogg's/np$|hot/jj|broth/nn before/cs starting/v...
 30 .../nn and/cc wiping/vbg his/pp$|hot/jj|brow/nn ./. It/pps may/md app...
...
...

This data frame has 6 columns:
   column
1 glob_id
2      id
3  source
4    left
5   match
6   right

Concordance of hot

hot %>% 
  arrange(right) %>% 
  print_kwic(from = 25, n = 15)
idx                             left|match |right                           
...
 25 ...d out/rp pink/jj from/in a/at|hot/jj|bath/nn ,/, and/cc I/ppss gav...
 26 ...der/rbr ./. An/at ordinary/jj|hot/jj|bath/nn or/cc shower/nn will/...
 27 ...at drifting/vbg odor/nn of/in|hot/jj|biscuits/nns ./. The/at old/j...
 28 ...pss want/vb to/to buy/vb a/at|hot/jj|Bodhisattva/np ./. Additional...
 29 ...t/nn of/in Mrs./np Fogg's/np$|hot/jj|broth/nn before/cs starting/v...
 30 .../nn and/cc wiping/vbg his/pp$|hot/jj|brow/nn ./. It/pps may/md app...
 31 ...eakfast/nn of/in fruit/nn ,/,|hot/jj|cereal/nn ,/, milk/nn ,/, hon...
 32 ...sed/vbd to/to like/vb any/dti|hot/jj|cereal/nn ,/, now/rb that's/d...
 33 .../nns they/ppss miss/vb the/at|hot/jj|cereal/nn ./. The/at school/n...
 34 ...to/to come/vb down/rp with/in|hot/jj|chills/nns and/cc puzzling/jj...
 35 ... sprinkled/vbd sugar/nn on/in|hot/jj|coals/nns and/cc held/vbd the...
 36 ...t nonexistent/jj cup/nn of/in|hot/jj|coffee/nn ,/, and/cc that/cs ...
 37 ...h/nn for/in a/at cup/nn of/in|hot/jj|coffee/nn ./. They/ppss are/b...
 38 .../at cup/nn of/in steaming/vbg|hot/jj|coffee/nn awaiting/vbg him/pp...
 39 ...n boiled/vbn ,/, applying/vbg|hot/jj|compresses/nns throughout/in ...
...

Co-occurrences with hot

hot_cooc <- surf_cooc(brown_fnames, "^hot/jj", re_token_splitter = "\\s+")


head(hot_cooc$target_freqlist, 7)
Frequency list (types in list: 7, tokens in list: 226)
<total number of tokens: 777>
rank orig_rank   type abs_freq nrm_freq
---- --------- ------ -------- --------
   1         1    ,/,       49  630.631
   2         2 the/at       39  501.931
   3         3    ./.       38  489.060
   4         4 and/cc       36  463.320
   5         5   a/at       28  360.360
   6         6  of/in       21  270.270
   7         7  in/in       15  193.050
head(hot_cooc$ref_freqlist, 7)
Frequency list (types in list: 7, tokens in list: 282996)
<total number of tokens: 1174866>
rank orig_rank   type abs_freq nrm_freq
---- --------- ------ -------- --------
   1         1 the/at    68974  587.080
   2         2    ,/,    58104  494.559
   3         3    ./.    48774  415.145
   4         4  of/in    35007  297.966
   5         5 and/cc    28506  242.632
   6         6   a/at    22915  195.044
   7         7  in/in    20716  176.326

Hot coffee

map(hot_cooc, keep_re, "^coffee/")
$target_freqlist
Frequency list (types in list: 1, tokens in list: 4)
<total number of tokens: 777>
rank orig_rank      type abs_freq nrm_freq
---- --------- --------- -------- --------
   1        24 coffee/nn        4    51.48

$ref_freqlist
Frequency list (types in list: 2, tokens in list: 74)
<total number of tokens: 1174866>
rank orig_rank         type abs_freq nrm_freq
---- --------- ------------ -------- --------
   1      1450    coffee/nn       72    0.613
   2     25804 coffee/nn-tl        2    0.017

Observed frequencies

Important frequencies

  • Frequency of hot/jj context = \(f(n)\) = 777
  • Frequency of coffee/nn = \(f(c)\) = 76
  • Frequency of coffee/nn in the context of hot/jj = \(f(n,c)\) = 4
  • Sum of the hot/jj context and not-hot/jj context = \(N\) = 1175643

a, b, c, d…

                    Coffee     Not coffee  Total
Context of hot      a          b           m = a + b
Not context of hot  c          d           n = c + d
Total               k = a + c  l = b + d   N = m + n
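The four cells follow from the frequencies listed under "Important frequencies". A minimal base R sketch with the values of the hot/coffee example; the variable names (f_n, f_c, f_nc, cells) are chosen here for illustration only:

```r
# Frequencies from the hot/coffee example
f_n  <- 777      # size of the hot/jj context
f_c  <- 76       # frequency of coffee/nn
f_nc <- 4        # coffee/nn inside the hot/jj context
N    <- 1175643  # target context + reference context

# Cells a, b and c follow directly; d is whatever remains of N
cells <- c(a = f_nc, b = f_n - f_nc, c = f_c - f_nc)
cells["d"] <- N - sum(cells)

# Marginal totals as defined in the table
m <- cells[["a"]] + cells[["b"]]  # row total: context of hot
n <- cells[["c"]] + cells[["d"]]  # row total: not context of hot
k <- cells[["a"]] + cells[["c"]]  # column total: coffee
l <- cells[["b"]] + cells[["d"]]  # column total: not coffee

cells
```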

O, R, C, N

                    Coffee  Not coffee  Total
Context of hot      O11     O12         R1
Not context of hot  O21     O22         R2
Total               C1      C2          N

Building the contingency table

                    Coffee  Not coffee  Total
Context of hot      4       773         777
Not context of hot  72      1 174 794   1 174 866
Total               76      1 175 567   1 175 643
  • How important is it that coffee occurs 4 times in the context of hot when

    • there are 773 other events in the context of hot

    • and coffee also occurs 72 times outside the context of hot?
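One way to approach that question is to store the observed table as a matrix and compare the share of coffee inside and outside the context of hot. This is only a sketch in base R; the object names obs and row_share are illustrative:

```r
# Observed frequencies from the hot/coffee example
obs <- matrix(
  c( 4,     773,
    72, 1174794),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    context = c("hot", "not hot"),
    item    = c("coffee", "not coffee")
  )
)

# Row and column totals (base R labels them "Sum")
addmargins(obs)

# Share of each item within each context (rows sum to 1)
row_share <- prop.table(obs, margin = 1)
row_share[, "coffee"]
```

The proportion of coffee in the hot context is 4/777, against 72/1 174 866 elsewhere.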

Expected frequencies

Expected a, b, c, d…

                    Coffee     Not coffee  Total
Context of hot      (m * k)/N  (m * l)/N   m
Not context of hot  (n * k)/N  (n * l)/N   n
Total               k          l           N
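The formulas in this table amount to multiplying each row total by each column total and dividing by N. A minimal base R sketch using the observed hot/coffee counts; obs and expected are illustrative names:

```r
# Observed frequencies from the hot/coffee example
obs <- matrix(
  c( 4,     773,
    72, 1174794),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    context = c("hot", "not hot"),
    item    = c("coffee", "not coffee")
  )
)

# Expected frequency of each cell: (row total * column total) / N
expected <- outer(rowSums(obs), colSums(obs)) / sum(obs)
round(expected, 2)
```

The expected table keeps the same marginal totals as the observed one; only the distribution over the four cells changes.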

E, R, C, N

                    Coffee  Not coffee  Total
Context of hot      E11     E12         R1
Not context of hot  E21     E22         R2
Total               C1      C2          N
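In this notation, the expected frequency of each cell is the product of its row total and its column total, divided by the overall total. For the hot/coffee example, with \(R_1 = 777\), \(C_1 = 76\) and \(N = 1175643\):

\[
E_{ij} = \frac{R_i \, C_j}{N},
\qquad
E_{11} = \frac{R_1 \, C_1}{N} = \frac{777 \times 76}{1175643} \approx 0.05
\]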

Expected frequencies

                    Coffee  Not coffee    Total
Context of hot      0.05    776.95        777
Not context of hot  75.95   1 174 790.05  1 174 866
Total               76.00   1 175 567.00  1 175 643

Cf. Observed frequencies.

                    Coffee  Not coffee  Total
Context of hot      4       773         777
Not context of hot  72      1 174 794   1 174 866
Total               76      1 175 567   1 175 643

Target and reference contexts

Contexts in the contingency table

                   Target item  Other items  Total
Target context     O11          O12          R1
Reference context  O21          O22          R2
Total              C1           C2           N

Collocation analysis

  • The target context is the surroundings of an item, the node, e.g. hot.

    • In surface collocations, a certain window around the node, e.g. 3 tokens to either side.

    • In textual collocations, the text in which the node occurs.

    • In syntactic collocations, a certain syntactic relationship with the node.

  • The reference context is all the contexts not surrounding the node.
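To make the surface window concrete, here is a small base R sketch that splits a token sequence into target and reference contexts for a window of 3 tokens on either side of the node. The sentence is invented for illustration, and leaving the node itself out of both contexts is an assumption of this sketch, not a statement about how surf_cooc() works internally:

```r
# Invented toy sentence in the word/tag format of the Brown corpus
tokens <- c("a/at", "cup/nn", "of/in", "steaming/vbg", "hot/jj",
            "coffee/nn", "on/in", "the/at", "table/nn", "./.")

window   <- 3                         # tokens to either side of the node
node_pos <- grep("^hot/jj$", tokens)  # positions of the node

# Mark every position that falls inside a window around some node
in_target <- rep(FALSE, length(tokens))
for (p in node_pos) {
  idx <- max(1, p - window):min(length(tokens), p + window)
  in_target[idx] <- TRUE
}

# Assumption: the node is excluded from both contexts
is_node       <- seq_along(tokens) %in% node_pos
target_tokens <- tokens[in_target & !is_node]
ref_tokens    <- tokens[!in_target & !is_node]

table(target_tokens)  # frequency list of the target context
table(ref_tokens)     # frequency list of the reference context
```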

Collocation analysis - table

                     Collocate  Other items  Total
Context of node      O11        O12          R1
Not context of node  O21        O22          R2
Total                C1         C2           N

Distinctive collocation analysis

  • The target context is the surroundings of a node, e.g. hot.

  • The reference context is the surroundings of a second node for contrast, e.g. cold.

Note

In (distinctive) collocation analysis, the “target” item is the collocate.

Distinctive collocation analysis - table

                             Collocate  Other items  Total
Context of node              O11        O12          R1
Context of alternative node  O21        O22          R2
Total                        C1         C2           N

Keyword analysis

  • The target context is a text or corpus.

  • The reference context is another, larger reference corpus.

Note

In distinctive keyword analysis the reference context is another target corpus.

Keyword analysis - table

                    Target item  Other items  Total
Target text/corpus  O11          O12          R1
Reference corpus    O21          O22          R2
Total               C1           C2           N
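The same table can be filled in from two plain frequency lists. The sketch below uses two invented mini "corpora" and base R's table(); the helper keyword_table() is hypothetical and only meant to show where each cell comes from:

```r
# Invented toy corpora, already tokenized
target_corpus    <- c("the", "hot", "coffee", "was", "hot")
reference_corpus <- c("the", "cold", "tea", "was", "cold",
                      "and", "the", "coffee", "was", "cold")

target_freqs <- table(target_corpus)
ref_freqs    <- table(reference_corpus)

# Hypothetical helper: contingency table for one target item
keyword_table <- function(item) {
  O11 <- if (item %in% names(target_freqs)) target_freqs[[item]] else 0
  O21 <- if (item %in% names(ref_freqs)) ref_freqs[[item]] else 0
  matrix(
    c(O11, sum(target_freqs) - O11,
      O21, sum(ref_freqs) - O21),
    nrow = 2, byrow = TRUE,
    dimnames = list(
      corpus = c("target", "reference"),
      item   = c(item, "other items")
    )
  )
}

hot_table <- keyword_table("hot")
hot_table
```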

Collostructional analysis

  • The target context is a certain slot in a construction.

  • The reference context is a comparable slot in a comparable construction.

Note

We talk about distinctive collexeme analysis when that second construction is very specific.

Collostructional analysis - table

                         Target item  Other items  Total
Target construction      O11          O12          R1
Comparable construction  O21          O22          R2
Total                    C1           C2           N

Summing up

Procedure

  1. For each type in the corpus, compute its frequency in the target context and in the reference context.

  2. This yields two frequency lists, from which all necessary values of the contingency table can be derived.

  3. Based on the contingency tables, compute the association strength between each type and the target context.

  4. Rank and/or filter.
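The four steps can be sketched in base R for three types taken from the hot example (coffee/nn, water/nn and the/at, with their target and reference frequencies as printed by assoc_scores()). The ratio of observed to expected frequency is used here only as a simple stand-in for an association measure, and the helper get_freq() is hypothetical:

```r
# Steps 1-2: two frequency lists (values from the hot example) and their totals
target_freqs <- c("coffee/nn" = 4,  "water/nn" = 13,  "the/at" = 39)
ref_freqs    <- c("coffee/nn" = 72, "water/nn" = 413, "the/at" = 68974)
target_total <- 777
ref_total    <- 1174866

# Hypothetical helper: look up frequencies, using 0 for absent types
get_freq <- function(freqs, types) {
  out <- freqs[types]
  out[is.na(out)] <- 0
  unname(out)
}

types  <- union(names(target_freqs), names(ref_freqs))
scores <- data.frame(
  type = types,
  a    = get_freq(target_freqs, types),
  c    = get_freq(ref_freqs, types),
  stringsAsFactors = FALSE
)
scores$b <- target_total - scores$a
scores$d <- ref_total - scores$c

# Step 3: expected frequency of a, and observed/expected as a stand-in measure
N            <- target_total + ref_total
scores$exp_a <- target_total * (scores$a + scores$c) / N
scores$ratio <- scores$a / scores$exp_a

# Step 4: rank by the chosen measure, strongest attraction first
scores <- scores[order(-scores$ratio), ]
scores
```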

Code

Main function mclm::assoc_scores().

  1. Obtain two frequency lists:

    • “Manually” (with freqlist() and different corpora, for example).

    • With surf_cooc() for surface collocations.

    • With text_cooc() for textual collocations.

  2. Give the frequency lists to assoc_scores().

Hot example

assoc_scores(hot_cooc)
Association scores (types in list: 44)
       type  a    PMI G_signed|  b     c       d dir  exp_a DP_rows
 1      ,/, 49  0.350    2.824|728 58104 1116762   1 38.434   0.014
 2   the/at 39 -0.226   -1.069|738 68974 1105892  -1 45.612  -0.009
 3      ./. 38  0.236    1.010|739 48774 1126092   1 32.261   0.007
 4   and/cc 36  0.932   12.660|741 28506 1146360   1 18.864   0.022
 5     a/at 28  0.885    8.898|749 22915 1151951   1 15.163   0.017
 6    of/in 21 -0.141   -0.213|756 35007 1139859  -1 23.151  -0.003
 7    in/in 15  0.131    0.122|762 20716 1154150   1 13.701   0.002
 8 was/bedz 13  1.004    5.120|764  9793 1165073   1  6.481   0.008
 9 water/nn 13  5.529   74.799|764   413 1174453   1  0.282   0.016
10  with/in 10  1.059    4.321|767  7251 1167615   1  4.799   0.007
11   it/pps  9  1.218    4.975|768  5844 1169022   1  3.868   0.007
12    to/to  8 -0.301   -0.380|769 14909 1159957  -1  9.859  -0.002
13    on/in  7  0.795    1.796|770  6098 1168768   1  4.035   0.004
14  cold/jj  6  6.029   38.634|771   133 1174733   1  0.092   0.008
15    to/in  6 -0.283   -0.249|771 11040 1163826  -1  7.300  -0.002
16    at/in  5  0.499    0.538|772  5348 1169518   1  3.538   0.002
17  from/in  5  0.801    1.299|772  4337 1170529   1  2.870   0.003
18   sun/nn  5  6.227   33.572|772    96 1174770   1  0.067   0.006
19  this/dt  5  0.558    0.664|772  5133 1169733   1  3.396   0.002
20           4 -1.166   -3.520|773 13577 1161289  -1  8.976  -0.006
...
<number of extra columns to the right: 7>

Hot-coffee association

assoc_scores(hot_cooc)["coffee/nn",]
Association scores (types in list: 1)
       type a   PMI G_signed|  b  c       d dir exp_a DP_rows RR_rows
1 coffee/nn 4 6.315   27.349|773 72 1174794   1  0.05   0.005  84.003
<number of extra columns to the right: 6>
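Several columns of this row can be recomputed by hand from the contingency table. The sketch below uses the usual textbook definitions (expected frequency of a, PMI as the base-2 log of observed over expected, DP as a difference of row proportions, RR as their ratio); that mclm computes them in exactly this way is an assumption here, although the results agree with the printed values to three decimals:

```r
# Observed cells for coffee/nn in the context of hot/jj
a <- 4; b <- 773; c_ref <- 72; d <- 1174794

m <- a + b       # size of the target context
n <- c_ref + d   # size of the reference context
N <- m + n

exp_a   <- m * (a + c_ref) / N      # expected frequency of a
pmi     <- log2(a / exp_a)          # pointwise mutual information
dp_rows <- a / m - c_ref / n        # difference of row proportions
rr_rows <- (a / m) / (c_ref / n)    # ratio of row proportions

round(c(exp_a = exp_a, PMI = pmi, DP_rows = dp_rows, RR_rows = rr_rows), 3)
```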

Next: Association measures