Keywords and collocation analysis on the BASE corpus

Based on Dirk Speelman’s course material

1 Setup

This document illustrates a keyword analysis and a collocation analysis. They both use the BASE corpus and the same criteria for association strength.

For the analysis we’ll mainly use the tidyverse package and the mclm package. We’ll also use here to find the corpus and, for the report itself when printing tables, kableExtra.

library(tidyverse)
library(mclm)
library(here)
library(kableExtra) # for the report
options(digits = 3) # to print up to 3 digits

1.1 Association strength criteria

Both of the studies rely on the assoc_scores objects of the mclm package, which contain frequencies and association scores, i.e. measures based on the frequency of a certain event in a target context and in a reference context. For keyword analysis, the target context is a target (sub)corpus and the reference context is a reference (sub)corpus. For collocation analysis based on surface co-occurrences, the target context is the text surrounding the occurrences of the node term and the reference context is all the other text in the corpus.

In order to define keywords (for the first study) or collocations (for the second study) we will filter the output of assoc_scores() based on the following three criteria:

A frequency of three or higher in the target context.
A PMI score of two or higher.
A signed G score of two or higher.

A PMI score of two or higher means that the probability of the keyword in the target context is at least four times higher than its probability in all the data taken together.

Incidentally: a PMI of one indicates that this probability is twice as high, a PMI of two indicates it is four times as high, a PMI of three indicates it is eight times as high, a PMI of four indicates it is sixteen times as high, etc.
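
As a rough check of this interpretation, the sketch below computes PMI by hand in base R, as the base-2 logarithm of the ratio between the two probabilities described above. The helper pmi_by_hand() is a hypothetical name, not an mclm function, and the counts in the first example are made up; the second example uses the counts for "the" that are printed in the frequency lists in Section 2.1. Values returned by assoc_scores() may differ slightly if the package applies corrections to the counts.

```r
# Hypothetical helper (not part of mclm): PMI as the base-2 logarithm of the
# ratio between the probability of an item in the target context and its
# probability in all the data taken together.
pmi_by_hand <- function(freq_target, size_target, freq_ref, size_ref) {
  p_target  <- freq_target / size_target
  p_overall <- (freq_target + freq_ref) / (size_target + size_ref)
  log2(p_target / p_overall)
}

# Made-up counts: 40 occurrences in a target of 1,000 tokens and
# 60 occurrences in a reference of 9,000 tokens.
pmi_toy <- pmi_by_hand(40, 1000, 60, 9000)
pmi_toy     # 2: the item is four times as probable in the target context
2^pmi_toy   # 4: back from PMI to the probability ratio

# Counts for "the" as printed in Section 2.1: well below the threshold of 2.
pmi_the <- pmi_by_hand(595, 10361, 86625, 1614252)
pmi_the
```

Raising 2 to the power of a PMI value is a quick way to read it as "how many times more probable", which is the reading used for the threshold above.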

Remember that, as a strength of evidence measure, $G^2$ is higher when there is more evidence that two words are not independent, but it does not distinguish between attraction and repulsion. Therefore, a signed G score as returned by assoc_scores() is a modification that adds a minus sign to $G^2$ when the observed frequency of the event is lower than its expected frequency. A threshold of two or higher is not a very strict criterion. With a signed $G^2$ of 2 there is only mild evidence for attraction. In a traditional $G$ test (log likelihood ratio test) for a two-by-two contingency table, a $G^2$ score of 0 would indicate no evidence for any association whatsoever and a $G^2$ score of 3.84 would be needed for the test to indicate a significant association (at a 95% confidence level). So a value of 2 would indicate that there is not enough evidence to establish significance.
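
To make the sign convention concrete, here is a minimal base R sketch of a signed $G^2$ for a two-by-two table of target and reference counts. The function signed_g2() and its argument names are hypothetical, not part of mclm, and the counts are made up. The scores returned by assoc_scores() may differ slightly if the package applies corrections to the counts.

```r
# Hypothetical helper (not part of mclm): signed G^2 for a two-by-two table.
signed_g2 <- function(target_freq, target_other, ref_freq, ref_other) {
  obs <- c(target_freq, target_other, ref_freq, ref_other)
  n <- sum(obs)
  row_totals <- c(target_freq + target_other, ref_freq + ref_other)
  col_totals <- c(target_freq + ref_freq, target_other + ref_other)
  # expected frequencies under independence, in the same order as obs
  expected <- c(row_totals[1] * col_totals, row_totals[2] * col_totals) / n
  # a cell with observed frequency 0 contributes 0 to the sum
  terms <- ifelse(obs > 0, obs * log(obs / expected), 0)
  g2 <- 2 * sum(terms)
  # add a minus sign when the item is rarer in the target than expected
  if (obs[1] < expected[1]) -g2 else g2
}

g_attraction <- signed_g2(20, 80, 10, 90)   # more frequent in target: positive
g_repulsion  <- signed_g2(5, 95, 10, 90)    # less frequent in target: negative
g_neutral    <- signed_g2(10, 90, 10, 90)   # observed equals expected: 0

# Critical value of the chi-squared distribution with one degree of freedom
critical_value <- qchisq(0.95, df = 1)      # approximately 3.84
```

Comparing the absolute value of a score against critical_value reproduces the significance check mentioned above.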

Selection criteria

In short, the selection criteria that were chosen here value both the effect size and amount of evidence, but could be said to be more demanding with respect to the former than with respect to the latter. This choice is acceptable, as long as the researcher is aware of prioritizing (to some extent) effect size over amount of evidence.

1.2 Data

The first step in analyzing the data is to read the corpus. The first line of the code below sets the path to the “BASE” directory where all the corpus files are stored, in this case inside a “_corpora” folder, inside a “studies” folder at the top level of the project. In order to run this on your own computer, set corpus_folder to the path to where your copy of the corpus is stored (inside your project).

The second line of the code collects all the file names in that folder in an fnames object and keeps those with the “txt” extension.

The hide_path argument in the print() method for an fnames object allows us to hide a (redundant) bit of the filenames when printing them.

corpus_folder <- here("studies", "_corpora", "BASE")
fnames_BASE <- get_fnames(corpus_folder) %>% 
  keep_re("[.]txt")

print(fnames_BASE, 10, hide_path = corpus_folder)

Filename collection of length 198
              filename
   -------------------
 1 txt/ah/ahlct001.txt
 2 txt/ah/ahlct002.txt
 3 txt/ah/ahlct003.txt
 4 txt/ah/ahlct004.txt
 5 txt/ah/ahlct005.txt
 6 txt/ah/ahlct006.txt
 7 txt/ah/ahlct007.txt
 8 txt/ah/ahlct008.txt
 9 txt/ah/ahlct009.txt
10 txt/ah/ahlct010.txt
...

The functions to actually read the corpus for the analyses will be freqlist() and surf_cooc(). In both cases we’ll use three non-default settings that are more appropriate for the format of the BASE corpus.

In re_token_splitter we use the regular expression \s+; in other words, we treat all chunks of whitespace as token separators. We do this because the default tokenizer, which roughly identifies the chunks of alphanumeric characters as tokens, would e.g. cut up the corpus snippet a simple [0.4] example into the tokens a, simple, 0, 4, and example, which is not what we want. The tokenizer we do use would cut the same snippet up into the tokens a, simple, [0.4], and example. This is still not exactly what we want, but see the next point.
In re_drop_token we use [:\[\]] in order to drop all the tokens that match this regular expression; in other words we drop all tokens that contain either a colon, an opening square bracket or a closing square bracket. So, in the aforementioned example, the pseudo-token [0.4], which actually is a pause indication, would be dropped eventually. Tokens that contain a colon are also dropped, because those are speaker identifiers in the BASE corpus, not real tokens.
In file_encoding we specify windows-1252, which indeed is the encoding used in the BASE corpus.
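
The effect of the first two settings can be tried out on the example snippet with base R alone. This is only a sketch: the first split uses a simple "anything that is not alphanumeric" pattern as a rough stand-in for the default tokenizer, not the actual mclm default, and freqlist() and surf_cooc() do the real work on the corpus files.

```r
snippet <- "a simple [0.4] example"

# Rough stand-in for a tokenizer that keeps chunks of alphanumeric characters
# (an approximation for illustration, not the actual mclm default).
alnum_tokens <- strsplit(snippet, "[^[:alnum:]]+")[[1]]
alnum_tokens
# "a" "simple" "0" "4" "example"

# Splitting on whitespace instead, as in re_token_splitter.
tokens <- strsplit(snippet, "\\s+", perl = TRUE)[[1]]
tokens
# "a" "simple" "[0.4]" "example"

# Dropping tokens that contain a colon or a square bracket, as in re_drop_token.
to_drop <- grepl("[:\\[\\]]", tokens, perl = TRUE)
kept_tokens <- tokens[!to_drop]
kept_tokens
# "a" "simple" "example"
```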

1.3 Steps

The main function we will use is assoc_scores(), which creates an object of class assoc_scores, i.e. a special kind of dataframe with association scores information. In the case of keyword analysis (Section 2) we’ll run it with two frequency lists created from different subcorpora, whereas for collocation analysis (Section 3) we’ll provide a cooc_info object created with surf_cooc().

By default, assoc_scores() will not return values where the frequency in the target context was lower than 3, so we don’t need to do anything else to define our first criterion. For the other two criteria, instead, we’ll need to filter the assoc_scores object to only retain elements with a high enough PMI and signed G. Section 4 will illustrate steps to follow that are common to both workflows.
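
The filtering logic itself can be sketched on a plain data frame. Everything in the example below is made up for illustration: toy_scores merely stands in for an assoc_scores object, and the column names a, PMI and G_signed are an assumption about what assoc_scores() returns, so check them against your own output.

```r
# Made-up scores standing in for the output of assoc_scores(); the column
# names a, PMI and G_signed are assumed here, so check them on your own object.
toy_scores <- data.frame(
  type     = c("alpha", "beta", "gamma", "delta"),
  a        = c(12, 5, 3, 40),          # frequency in the target context
  PMI      = c(3.1, 2.4, 1.2, 0.3),
  G_signed = c(25.0, 1.5, 4.0, -2.0)
)

# Keep only the rows that meet all three criteria.
selected <- subset(toy_scores, a >= 3 & PMI >= 2 & G_signed >= 2)
selected$type
# "alpha"
```

Only the first row survives: the second fails the signed G criterion and the last two fail the PMI criterion.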

2 Keyword analysis

For the keyword analysis, the target corpus will be the file ahlct001.txt, and the reference corpus, the remaining 197 files of our corpus. We will store the target filename in a variable called fnames_target and the reference corpus filenames in a variable called fnames_ref.

print()

The print() method invisibly returns the same object you are printing (without any modifications performed by arguments, such as number of items to print), so you can safely add it to an assignment in order to assign and print at the same time.
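
The same pattern can be tried with base R's print() on an ordinary vector, which also returns its argument invisibly. This is a small sketch of the idea, not of the mclm print methods themselves.

```r
# print() shows the object and returns it invisibly, so the assignment
# receives the same object that was printed.
x <- print(c(3, 1, 2))
same_object <- identical(x, c(3, 1, 2))
same_object
# TRUE
```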

# store names of target corpus files in fnames_target
fnames_target <- fnames_BASE %>%
  keep_re("ahlct001") %>%
  print(hide_path = corpus_folder)

Filename collection of length 1
             filename
  -------------------
1 txt/ah/ahlct001.txt

# store names of reference corpus files in fnames_ref
fnames_ref <- fnames_BASE %>%
  drop_re("ahlct001") %>%
  print(n = 10, hide_path = corpus_folder)

Filename collection of length 197
              filename
   -------------------
 1 txt/ah/ahlct002.txt
 2 txt/ah/ahlct003.txt
 3 txt/ah/ahlct004.txt
 4 txt/ah/ahlct005.txt
 5 txt/ah/ahlct006.txt
 6 txt/ah/ahlct007.txt
 7 txt/ah/ahlct008.txt
 8 txt/ah/ahlct009.txt
 9 txt/ah/ahlct010.txt
10 txt/ah/ahlct011.txt
...

2.1 Frequency lists

Next, we build the frequency lists, both for the target corpus and for the reference corpus. The former we store in a variable flist_target and the latter in a variable flist_ref.

In both cases we’ll use raw strings for the regular expressions, although they are a bit of an overkill with such simple expressions. We do it out of principle, to get used to their syntax.
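
For readers new to the syntax, the sketch below compares a raw string with its regular-string equivalent and tries the drop pattern on two tokens using base R's grepl(). It assumes R 4.0 or later, where raw strings are available; the two tokens are taken from the example snippet in Section 1.2.

```r
# A raw string and its regular-string equivalent: no doubled backslashes needed.
same_pattern <- identical(r"--[\s+]--", "\\s+")
same_pattern
# TRUE

# (?x) lets us pad the pattern with spaces for readability; (?i) makes it
# case-insensitive. Spaces outside the character class are ignored.
pattern <- r"--[(?xi)  [:\[\]] ]--"
matches <- grepl(pattern, c("simple", "[0.4]"), perl = TRUE)
matches
# FALSE TRUE
```

The dashes and square brackets around the pattern are only delimiters of the raw string; they are not part of the regular expression.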

# build frequency list for target corpus
flist_target <- fnames_target %>%
  freqlist(re_token_splitter = r"--[(?xi)    \s+   ]--", # whitespace as token splitter
           re_drop_token     = r"--[(?xi)  [:\[\]] ]--", # drop tokens with :, [ or ]
           file_encoding     = "windows-1252") %>%
  print()
# build frequency list for reference corpus
flist_ref <- fnames_ref %>%
  freqlist(re_token_splitter = r"--[(?xi)  \s+   ]--",
           re_drop_token     = r"--[(?xi)  [:\[\]] ]--",
           file_encoding     = "windows-1252") %>%
  print()

Frequency list (types in list: 2528, tokens in list: 10361)
rank type abs_freq nrm_freq
---- ---- -------- --------
   1  the      595    574.3
   2   of      336    324.3
   3  and      319    307.9
   4    a      277    267.3
   5   to      248    239.4
   6   in      216    208.5
   7   er      175    168.9
   8    i      168    162.1
   9  you      118    113.9
  10  her      106    102.3
  11   he      102     98.4
  12  was      102     98.4
  13   is       96     92.7
  14  she       81     78.2
  15 that       81     78.2
  16   it       78     75.3
  17 with       71     68.5
  18   as       64     61.8
  19  for       64     61.8
  20  his       64     61.8
...

Frequency list (types in list: 36491, tokens in list: 1614252)
rank type abs_freq nrm_freq
---- ---- -------- --------
   1  the    86625    536.6
   2   of    48929    303.1
   3  and    44915    278.2
   4   to    42614    264.0
   5   er    39352    243.8
   6    a    36154    224.0
   7 that    31807    197.0
   8   in    30177    186.9
   9  you    29426    182.3
  10   is    25986    161.0
  11   it    20365    126.2
  12   so    16849    104.4
  13    i    16757    103.8
  14 this    14907     92.3
  15   we    13097     81.1
  16 have    10326     64.0
  17 what    10261     63.6
  18   on    10183     63.1
  19   be    10177     63.0
  20  but    10134     62.8
...
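
The nrm_freq column in the two printed lists is consistent with a frequency per 10,000 tokens, which can be checked by hand from the token totals in the header lines. The sketch below uses only figures printed above; the per-10,000 scaling is inferred from those figures rather than quoted from the mclm documentation.

```r
# Figures copied from the printed frequency lists above.
target_tokens <- 10361
ref_tokens    <- 1614252

# Assumption: nrm_freq is the frequency per 10,000 tokens.
nrm_target_the <- 595   / target_tokens * 10000
nrm_ref_the    <- 86625 / ref_tokens    * 10000
round(nrm_target_the, 1)   # 574.3, as printed for "the" in the target list
round(nrm_ref_the, 1)      # 536.6, as printed for "the" in the reference list
```

Normalized frequencies are what make the two lists comparable, since the reference corpus is far larger than the target corpus.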

We can turn freqlist objects into tibbles and print them nicely with kableExtra. When the output is HTML, we can also print the table in a scrollable box, like in Table 1.

flist_target %>% 
  as_tibble() %>% 
  kbl(col.names = c("Rank", "Type", "Absolute", "Relative")) %>% 
  kable_minimal(full_width = FALSE) %>% 
  add_header_above(c(" " = 2, "Frequency" = 2)) %>% 
  scroll_box(height = "400px")

Table 1: Frequency list of the target corpus.
		Frequency
Rank	Type	Absolute	Relative
1	the	595	574.269
2	of	336	324.293
3	and	319	307.885
4	a	277	267.349
5	to	248	239.359
6	in	216	208.474
7	er	175	168.903
8	i	168	162.147
9	you	118	113.889
10	her	106	102.307
11	he	102	98.446
12	was	102	98.446
13	is	96	92.655
14	she	81	78.178
15	that	81	78.178
16	it	78	75.282
17	with	71	68.526
18	as	64	61.770
19	for	64	61.770
20	his	64	61.770
21	my	62	59.840
22	this	61	58.875
23	at	59	56.944
24	from	58	55.979
25	we	58	55.979
26	about	56	54.049
27	so	56	54.049
28	by	53	51.153
29	it's	48	46.328
30	on	47	45.362
31	but	46	44.397
32	know	46	44.397
33	up	44	42.467
34	when	44	42.467
35	have	37	35.711
36	all	36	34.746
37	him	36	34.746
38	like	35	33.781
39	me	34	32.815
40	they	34	32.815
41	or	32	30.885
42	what	32	30.885
43	which	32	30.885
44	no	31	29.920
45	not	30	28.955
46	had	29	27.990
47	their	29	27.990
48	who	29	27.990
49	into	28	27.024
50	sea	28	27.024
51	be	27	26.059
52	do	27	26.059
53	kind	27	26.059
54	out	26	25.094
55	write	26	25.094
56	our	25	24.129
57	right	25	24.129
58	slavery	24	23.164
59	them	24	23.164
60	will	24	23.164
61	an	23	22.199
62	more	23	22.199
63	yeah	23	22.199
64	could	22	21.233
65	very	22	21.233
66	because	21	20.268
67	called	21	20.268
68	where	21	20.268
69	would	21	20.268
70	are	20	19.303
71	man	20	19.303
72	only	20	19.303
73	some	20	19.303
74	then	20	19.303
75	dead	19	18.338
76	each	19	18.338
77	now	19	18.338
78	us	19	18.338
79	way	19	18.338
80	has	18	17.373
81	land	18	17.373
82	there	18	17.373
83	white	18	17.373
84	you're	18	17.373
85	first	17	16.408
86	if	17	16.408
87	one	17	16.408
88	turner	17	16.408
89	another	16	15.443
90	before	16	15.443
91	down	16	15.443
92	just	16	15.443
93	through	16	15.443
94	african	15	14.477
95	come	15	14.477
96	great	15	14.477
97	manu	15	14.477
98	poem	15	14.477
99	shop	15	14.477
100	even	14	13.512
101	face	14	13.512
102	he's	14	13.512
103	say	14	13.512
104	wanted	14	13.512
105	can't	13	12.547
106	fish	13	12.547
107	gladstone	13	12.547
108	men	13	12.547
109	mouth	13	12.547
110	over	13	12.547
111	than	13	12.547
112	your	13	12.547
113	after	12	11.582
114	can	12	11.582
115	going	12	11.582
116	hand	12	11.582
117	other	12	11.582
118	think	12	11.582
119	time	12	11.582
120	were	12	11.582
121	writing	12	11.582
122	don't	11	10.617
123	go	11	10.617
124	language	11	10.617
125	love	11	10.617
126	nigger	11	10.617
127	novel	11	10.617
128	off	11	10.617
129	painting	11	10.617
130	sense	11	10.617
131	something	11	10.617
132	well	11	10.617
133	get	10	9.652
134	passage	10	9.652
135	people	10	9.652
136	really	10	9.652
137	rohini	10	9.652
138	see	10	9.652
139	sex	10	9.652
140	shah	10	9.652
141	still	10	9.652
142	words	10	9.652
143	years	10	9.652
144	back	9	8.686
145	being	9	8.686
146	black	9	8.686
147	century	9	8.686
148	child	9	8.686
149	did	9	8.686
150	end	9	8.686
151	guyana	9	8.686
152	here	9	8.686
153	reading	9	8.686
154	said	9	8.686
155	should	9	8.686
156	that's	9	8.686
157	there's	9	8.686
158	thought	9	8.686
159	troilus	9	8.686
160	two	9	8.686
161	until	9	8.686
162	whole	9	8.686
163	yes	9	8.686
164	against	8	7.721
165	around	8	7.721
166	behind	8	7.721
167	felt	8	7.721
168	hands	8	7.721
169	himself	8	7.721
170	i'm	8	7.721
171	its	8	7.721
172	most	8	7.721
173	mother	8	7.721
174	old	8	7.721
175	own	8	7.721
176	read	8	7.721
177	these	8	7.721
178	too	8	7.721
179	took	8	7.721
180	why	8	7.721
181	woman	8	7.721
182	work	8	7.721
183	awakened	7	6.756
184	been	7	6.756
185	between	7	6.756
186	blood	7	6.756
187	body	7	6.756
188	came	7	6.756
189	caught	7	6.756
190	earth	7	6.756
191	got	7	6.756
192	head	7	6.756
193	home	7	6.756
194	latin	7	6.756
195	left	7	6.756
196	live	7	6.756
197	many	7	6.756
198	mean	7	6.756
199	much	7	6.756
200	myself	7	6.756
201	reached	7	6.756
202	saw	7	6.756
203	ship	7	6.756
204	space	7	6.756
205	things	7	6.756
206	thistlewood	7	6.756
207	used	7	6.756
208	village	7	6.756
209	we're	7	6.756
210	again	6	5.791
211	always	6	5.791
212	anyway	6	5.791
213	away	6	5.791
214	beauty	6	5.791
215	big	6	5.791
216	book	6	5.791
217	breath	6	5.791
218	british	6	5.791
219	centre	6	5.791
220	classics	6	5.791
221	desire	6	5.791
222	eighteenth	6	5.791
223	english	6	5.791
224	estate	6	5.791
225	eyes	6	5.791
226	family	6	5.791
227	god	6	5.791
228	guyanese	6	5.791
229	house	6	5.791
230	how	6	5.791
231	idea	6	5.791
232	instead	6	5.791
233	last	6	5.791
234	life	6	5.791
235	look	6	5.791
236	looked	6	5.791
237	made	6	5.791
238	makes	6	5.791
239	miriam	6	5.791
240	moon	6	5.791
241	namex	6	5.791
242	new	6	5.791
243	night	6	5.791
244	once	6	5.791
245	place	6	5.791
246	pounds	6	5.791
247	stillborn	6	5.791
248	utterly	6	5.791
249	want	6	5.791
250	water	6	5.791
251	went	6	5.791
252	without	6	5.791
253	women	6	5.791
254	wrote	6	5.791
255	'cause	5	4.826
256	abandoned	5	4.826
257	across	5	4.826
258	among	5	4.826
259	ancestry	5	4.826
260	animals	5	4.826
261	becomes	5	4.826
262	boy	5	4.826
263	bush	5	4.826
264	cabin	5	4.826
265	captain	5	4.826
266	colours	5	4.826
267	curtain	5	4.826
268	day	5	4.826
269	different	5	4.826
270	doing	5	4.826
271	drowned	5	4.826
272	except	5	4.826
273	flesh	5	4.826
274	gave	5	4.826
275	imagination	5	4.826
276	indian	5	4.826
277	kampta	5	4.826
278	looking	5	4.826
279	magazine	5	4.826
280	marginal	5	4.826
281	matter	5	4.826
282	miles	5	4.826
283	nothing	5	4.826
284	pages	5	4.826
285	passages	5	4.826
286	peripheral	5	4.826
287	plantation	5	4.826
288	plot	5	4.826
289	put	5	4.826
290	rage	5	4.826
291	servant	5	4.826
292	set	5	4.826
293	skin	5	4.826
294	sky	5	4.826
295	slave	5	4.826
296	slipped	5	4.826
297	small	5	4.826
298	sometimes	5	4.826
299	source	5	4.826
300	stone	5	4.826
301	storm	5	4.826
302	story	5	4.826
303	subject	5	4.826
304	sun	5	4.826
305	terms	5	4.826
306	they're	5	4.826
307	today	5	4.826
308	towards	5	4.826
309	tribe	5	4.826
310	upon	5	4.826
311	whatever	5	4.826
312	within	5	4.826
313	actually	4	3.861
314	air	4	3.861
315	am	4	3.861
316	amazonian	4	3.861
317	anarch	4	3.861
318	aperture	4	3.861
319	art	4	3.861
320	asked	4	3.861
321	basically	4	3.861
322	bearing	4	3.861
323	beyond	4	3.861
324	break	4	3.861
325	caesar	4	3.861
326	call	4	3.861
327	case	4	3.861
328	coins	4	3.861
329	colour	4	3.861
330	coming	4	3.861
331	coolie	4	3.861
332	course	4	3.861
333	cries	4	3.861
334	criseyde	4	3.861
335	darkness	4	3.861
336	deep	4	3.861
337	didn't	4	3.861
338	ears	4	3.861
339	ellar	4	3.861
340	entered	4	3.861
341	fall	4	3.861
342	five	4	3.861
343	flooded	4	3.861
344	floor	4	3.861
345	future	4	3.861
346	good	4	3.861
347	grandpa	4	3.861
348	greatest	4	3.861
349	ground	4	3.861
350	i'll	4	3.861
351	i-,	4	3.861
352	jungle	4	3.861
353	kaka	4	3.861
354	knew	4	3.861
355	lachrimae	4	3.861
356	lay	4	3.861
357	let	4	3.861
358	lets	4	3.861
359	lies	4	3.861
360	little	4	3.861
361	long	4	3.861
362	longer	4	3.861
363	lot	4	3.861
364	might	4	3.861
365	mind	4	3.861
366	mine	4	3.861
367	must	4	3.861
368	never	4	3.861
369	nor	4	3.861
370	obviously	4	3.861
371	overboard	4	3.861
372	perhaps	4	3.861
373	periphery	4	3.861
374	pond	4	3.861
375	probably	4	3.861
376	rains	4	3.861
377	reach	4	3.861
378	s-,	4	3.861
379	slaves	4	3.861
380	someone	4	3.861
381	speak	4	3.861
382	speech	4	3.861
383	stars	4	3.861
384	strength	4	3.861
385	sudden	4	3.861
386	surface	4	3.861
387	tell	4	3.861
388	therefore	4	3.861
389	those	4	3.861
390	trying	4	3.861
391	turner's	4	3.861
392	use	4	3.861
393	waters	4	3.861
394	ways	4	3.861
395	we'll	4	3.861
396	who's	4	3.861
397	abandons	3	2.895
398	accompany	3	2.895
399	accountant	3	2.895
400	africa	3	2.895
401	afterwards	3	2.895
402	aggressive	3	2.895
403	almost	3	2.895
404	along	3	2.895
405	any	3	2.895
406	area	3	2.895
407	beatings	3	2.895
408	beautiful	3	2.895
409	begin	3	2.895
410	below	3	2.895
411	bleak	3	2.895
412	bloody	3	2.895
413	booths	3	2.895
414	boys	3	2.895
415	bright	3	2.895
416	brown	3	2.895
417	bury	3	2.895
418	caribbean	3	2.895
419	carries	3	2.895
420	catch	3	2.895
421	cemetery	3	2.895
422	close	3	2.895
423	closed	3	2.895
424	collapsed	3	2.895
425	coloured	3	2.895
426	counting	3	2.895
427	culture	3	2.895
428	curse	3	2.895
429	dangerous	3	2.895
430	desperation	3	2.895
431	die	3	2.895
432	diomede	3	2.895
433	direction	3	2.895
434	distance	3	2.895
435	dragged	3	2.895
436	dry	3	2.895
437	early	3	2.895
438	effort	3	2.895
439	eighteen-forty	3	2.895
440	ellar's	3	2.895
441	endlessly	3	2.895
442	england	3	2.895
443	enormously	3	2.895
444	enough	3	2.895
445	entering	3	2.895
446	examiner	3	2.895
447	eye	3	2.895
448	faces	3	2.895
449	fat	3	2.895
450	father	3	2.895
451	feet	3	2.895
452	few	3	2.895
453	fifty	3	2.895
454	footnotes	3	2.895
455	form	3	2.895
456	found	3	2.895
457	four	3	2.895
458	friend	3	2.895
459	full	3	2.895
460	gets	3	2.895
461	giving	3	2.895
462	gladstone's	3	2.895
463	gods	3	2.895
464	goes	3	2.895
465	grasp	3	2.895
466	grave	3	2.895
467	grotesque	3	2.895
468	grows	3	2.895
469	guidance	3	2.895
470	happened	3	2.895
471	hear	3	2.895
472	historic	3	2.895
473	history	3	2.895
474	hundred	3	2.895
475	i've	3	2.895
476	jamaican	3	2.895
477	killed	3	2.895
478	learn	3	2.895
479	learning	3	2.895
480	least	3	2.895
481	leave	3	2.895
482	legs	3	2.895
483	less	3	2.895
484	light	3	2.895
485	london	3	2.895
486	make	3	2.895
487	melody	3	2.895
488	memory	3	2.895
489	minutes	3	2.895
490	miriam's	3	2.895
491	mist	3	2.895
492	money	3	2.895
493	motive	3	2.895
494	mrs	3	2.895
495	name	3	2.895
496	neck	3	2.895
497	onto	3	2.895
498	open	3	2.895
499	opens	3	2.895
500	oxford	3	2.895
501	pain	3	2.895
502	paki	3	2.895
503	part	3	2.895
504	passed	3	2.895
505	period	3	2.895
506	picked	3	2.895
507	poetry	3	2.895
508	presence	3	2.895
509	pressed	3	2.895
510	prose	3	2.895
511	pushed	3	2.895
512	putting	3	2.895
513	rainy	3	2.895
514	remember	3	2.895
515	rerum	3	2.895
516	rest	3	2.895
517	returned	3	2.895
518	rima	3	2.895
519	rum	3	2.895
520	sailors	3	2.895
521	same	3	2.895
522	savannah	3	2.895
523	saying	3	2.895
524	scars	3	2.895
525	seascape	3	2.895
526	season	3	2.895
527	sexually	3	2.895
528	shame	3	2.895
529	shape	3	2.895
530	shock	3	2.895
531	sight	3	2.895
532	sign	3	2.895
533	since	3	2.895
534	sisters	3	2.895
535	slot	3	2.895
536	song	3	2.895
537	soon	3	2.895
538	spaces	3	2.895
539	special	3	2.895
540	status	3	2.895
541	stood	3	2.895
542	such	3	2.895
543	sunt	3	2.895
544	take	3	2.895
545	talk	3	2.895
546	teeth	3	2.895
547	thing	3	2.895
548	thirty	3	2.895
549	though	3	2.895
550	three	3	2.895
551	throughout	3	2.895
552	thy	3	2.895
553	top	3	2.895
554	tries	3	2.895
555	turn	3	2.895
556	turned	3	2.895
557	turns	3	2.895
558	universe	3	2.895
559	university	3	2.895
560	uttered	3	2.895
561	valleys	3	2.895
562	vast	3	2.895
563	voice	3	2.895
564	waited	3	2.895
565	waiting	3	2.895
566	wall	3	2.895
567	wanting	3	2.895
568	welcome	3	2.895
569	west	3	2.895
570	whom	3	2.895
571	word	3	2.895
572	world	3	2.895
573	writer	3	2.895
574	young	3	2.895
575	younger	3	2.895
576	a-	2	1.930
577	able	2	1.930
578	above	2	1.930
579	afraid	2	1.930
580	africans	2	1.930
581	also	2	1.930
582	amazon	2	1.930
583	anger	2	1.930
584	arms	2	1.930
585	aside	2	1.930
586	atlantic	2	1.930
587	authoritarian	2	1.930
588	authority	2	1.930
589	autobiographical	2	1.930
590	automatically	2	1.930
591	basest	2	1.930
592	beadless	2	1.930
593	beads	2	1.930
594	became	2	1.930
595	beckon	2	1.930
596	bed	2	1.930
597	believed	2	1.930
598	belly	2	1.930
599	beside	2	1.930
600	bewildered	2	1.930
601	birth	2	1.930
602	bit	2	1.930
603	blackness	2	1.930
604	blindly	2	1.930
605	blue	2	1.930
606	board	2	1.930
607	boldest	2	1.930
608	bone	2	1.930
609	booth	2	1.930
610	bottom	2	1.930
611	boundaries	2	1.930
612	bow	2	1.930
613	brilliant	2	1.930
614	brooding	2	1.930
615	brothers	2	1.930
616	bundle	2	1.930
617	bus	2	1.930
618	calf	2	1.930
619	care	2	1.930
620	centuries	2	1.930
621	certain	2	1.930
622	chap	2	1.930
623	cheeks	2	1.930
624	childbirth	2	1.930
625	children	2	1.930
626	cigarette	2	1.930
627	circus	2	1.930
628	class	2	1.930
629	clear	2	1.930
630	closes	2	1.930
631	coastal	2	1.930
632	coconuts	2	1.930
633	coin	2	1.930
634	cold	2	1.930
635	comes	2	1.930
636	commerce	2	1.930
637	community	2	1.930
638	constant	2	1.930
639	corn	2	1.930
640	corner	2	1.930
641	corners	2	1.930
642	couch	2	1.930
643	couple	2	1.930
644	courtly	2	1.930
645	crammed	2	1.930
646	crashing	2	1.930
647	creatures	2	1.930
648	cry	2	1.930
649	customers	2	1.930
650	dared	2	1.930
651	days	2	1.930
652	dazed	2	1.930
653	death	2	1.930
654	debased	2	1.930
655	deserving	2	1.930
656	desperate	2	1.930
657	died	2	1.930
658	direct	2	1.930
659	does	2	1.930
660	doesn't	2	1.930
661	doubts	2	1.930
662	drowning	2	1.930
663	drunk	2	1.930
664	dunciad	2	1.930
665	dust	2	1.930
666	easily	2	1.930
667	easy	2	1.930
668	edge	2	1.930
669	eh	2	1.930
670	elders	2	1.930
671	ends	2	1.930
672	equiano	2	1.930
673	eschatological	2	1.930
674	especially	2	1.930
675	eventually	2	1.930
676	ever	2	1.930
677	every	2	1.930
678	everything	2	1.930
679	exactly	2	1.930
680	existence	2	1.930
681	expecting	2	1.930
682	experience	2	1.930
683	external	2	1.930
684	faith	2	1.930
685	falls	2	1.930
686	far	2	1.930
687	fields	2	1.930
688	figure	2	1.930
689	find	2	1.930
690	flow	2	1.930
691	folds	2	1.930
692	foot	2	1.930
693	foreign	2	1.930
694	foretell	2	1.930
695	fortune	2	1.930
696	frantically	2	1.930
697	free	2	1.930
698	freedom	2	1.930
699	fright	2	1.930
700	fruit	2	1.930
701	g-,	2	1.930
702	gathered	2	1.930
703	genuine	2	1.930
704	ghosts	2	1.930
705	girl	2	1.930
706	girls	2	1.930
707	glass	2	1.930
708	goat	2	1.930
709	gown	2	1.930
710	grain	2	1.930
711	greeks	2	1.930
712	grew	2	1.930
713	grown	2	1.930
714	guilt	2	1.930
715	guyana's	2	1.930
716	half	2	1.930
717	hard	2	1.930
718	hate	2	1.930
719	held	2	1.930
720	herds	2	1.930
721	herself	2	1.930
722	hidden	2	1.930
723	hindu	2	1.930
724	holes	2	1.930
725	holocaust	2	1.930
726	hook	2	1.930
727	huge	2	1.930
728	human	2	1.930
729	humiliation	2	1.930
730	hundred-and-	2	1.930
731	hut	2	1.930
732	idiot	2	1.930
733	ignorant	2	1.930
734	image	2	1.930
735	impatient	2	1.930
736	important	2	1.930
737	impose	2	1.930
738	inherently	2	1.930
739	instructions	2	1.930
740	involved	2	1.930
741	joke	2	1.930
742	jouti	2	1.930
743	kaka's	2	1.930
744	keeps	2	1.930
745	key	2	1.930
746	kinds	2	1.930
747	knock	2	1.930
748	knowing	2	1.930
749	laid	2	1.930
750	later	2	1.930
751	leaned	2	1.930
752	leaving	2	1.930
753	levels	2	1.930
754	lights	2	1.930
755	lips	2	1.930
756	lives	2	1.930
757	logies	2	1.930
758	loss	2	1.930
759	lost	2	1.930
760	magazines	2	1.930
761	magicians	2	1.930
762	making	2	1.930
763	mangoes	2	1.930
764	manu's	2	1.930
765	martin	2	1.930
766	marvelling	2	1.930
767	middle	2	1.930
768	migrate	2	1.930
769	mocking	2	1.930
770	moment	2	1.930
771	momentum	2	1.930
772	monotonously	2	1.930
773	months	2	1.930
774	morning	2	1.930
775	mortar	2	1.930
776	mouths	2	1.930
777	moved	2	1.930
778	mud	2	1.930
779	muddy	2	1.930
780	mythology	2	1.930
781	nameless	2	1.930
782	naming	2	1.930
783	nature	2	1.930
784	neither	2	1.930
785	neuroses	2	1.930
786	next	2	1.930
787	nineteen	2	1.930
788	noise	2	1.930
789	nose	2	1.930
790	odd	2	1.930
791	often	2	1.930
792	oh	2	1.930
793	ornamental	2	1.930
794	others	2	1.930
795	page	2	1.930
796	passionate	2	1.930
797	past	2	1.930
798	paths	2	1.930
799	pavement	2	1.930
800	pay	2	1.930
801	peculiar	2	1.930
802	peepshow	2	1.930
803	pence	2	1.930
804	perilous	2	1.930
805	perish	2	1.930
806	piccadilly	2	1.930
807	pictures	2	1.930
808	pincher	2	1.930
809	places	2	1.930
810	plague	2	1.930
811	plane	2	1.930
812	pleasure	2	1.930
813	pocket	2	1.930
814	pool	2	1.930
815	poor	2	1.930
816	possibility	2	1.930
817	previous	2	1.930
818	pride	2	1.930
819	promised	2	1.930
820	protection	2	1.930
821	pure	2	1.930
822	purse	2	1.930
823	quick	2	1.930
824	rack	2	1.930
825	rain	2	1.930
826	rainforest	2	1.930
827	raw	2	1.930
828	re-,	2	1.930
829	recognize	2	1.930
830	relationship	2	1.930
831	relax	2	1.930
832	remains	2	1.930
833	response	2	1.930
834	river	2	1.930
835	romance	2	1.930
836	round	2	1.930
837	rubbish	2	1.930
838	rudeness	2	1.930
839	sake	2	1.930
840	salt	2	1.930
841	school	2	1.930
842	scunt	2	1.930
843	search	2	1.930
844	searched	2	1.930
845	secret	2	1.930
846	secretly	2	1.930
847	secure	2	1.930
848	seemed	2	1.930
849	seeps	2	1.930
850	self	2	1.930
851	setting	2	1.930
852	settled	2	1.930
853	seven-and-a-half-thousand	2	1.930
854	shah's	2	1.930
855	she's	2	1.930
856	shores	2	1.930
857	show	2	1.930
858	shyness	2	1.930
859	silver	2	1.930
860	sits	2	1.930
861	slap	2	1.930
862	sleep	2	1.930
863	sought	2	1.930
864	sound	2	1.930
865	stare	2	1.930
866	stared	2	1.930
867	staring	2	1.930
868	stayed	2	1.930
869	steal	2	1.930
870	steel	2	1.930
871	strange	2	1.930
872	street	2	1.930
873	strip	2	1.930
874	struggle	2	1.930
875	sublime	2	1.930
876	suddenly	2	1.930
877	suggest	2	1.930
878	sunk	2	1.930
879	sure	2	1.930
880	swum	2	1.930
881	tabla	2	1.930
882	taken	2	1.930
883	talking	2	1.930
884	talks	2	1.930
885	tanda's	2	1.930
886	tax	2	1.930
887	tea	2	1.930
888	teach	2	1.930
889	tempts	2	1.930
890	ten	2	1.930
891	terrified	2	1.930
892	thomas	2	1.930
893	throwing	2	1.930
894	thrown	2	1.930
895	tilt	2	1.930
896	tongue	2	1.930
897	toys	2	1.930
898	trail	2	1.930
899	trance	2	1.930
900	transformed	2	1.930
901	treasures	2	1.930
902	trees	2	1.930
903	tribes	2	1.930
904	tropical	2	1.930
905	under	2	1.930
906	unstable	2	1.930
907	urged	2	1.930
908	urns	2	1.930
909	v-,	2	1.930
910	voices	2	1.930
911	vulgar	2	1.930
912	walcott	2	1.930
913	walk	2	1.930
914	walked	2	1.930
915	warming	2	1.930
916	watching	2	1.930
917	wear	2	1.930
918	while	2	1.930
919	whilst	2	1.930
920	whisper	2	1.930
921	whispered	2	1.930
922	wind	2	1.930
923	window	2	1.930
924	wisdom	2	1.930
925	wood	2	1.930
926	worms	2	1.930
927	wounds	2	1.930
928	y-,	2	1.930
929	you'll	2	1.930
930	you've	2	1.930
931	a-,	1	0.965
932	a-n-a-r-c-h	1	0.965
933	aboard	1	0.965
934	abolition	1	0.965
935	abor-,	1	0.965
936	aborigines	1	0.965
937	aborted	1	0.965
938	abscond	1	0.965
939	absent	1	0.965
940	absolutely	1	0.965
941	acceptable	1	0.965
942	accepted	1	0.965
943	accustomed	1	0.965
944	acha	1	0.965
945	aching	1	0.965
946	acknowledged	1	0.965
947	acquired	1	0.965
948	adjoining	1	0.965
949	admiration	1	0.965
950	admire	1	0.965
951	admired	1	0.965
952	admixture	1	0.965
953	adorned	1	0.965
954	adv-,	1	0.965
955	advance	1	0.965
956	affect	1	0.965
957	age	1	0.965
958	agitated	1	0.965
959	agreement	1	0.965
960	ahead	1	0.965
961	ain't	1	0.965
962	alarm	1	0.965
963	already	1	0.965
964	although	1	0.965
965	ambiguity	1	0.965
966	ambushes	1	0.965
967	amen	1	0.965
968	amerindian	1	0.965
969	amerindians	1	0.965
970	amusement	1	0.965
971	ancestral	1	0.965
972	anchored	1	0.965
973	ancient	1	0.965
974	anew	1	0.965
975	angel	1	0.965
976	angelic	1	0.965
977	anglican	1	0.965
978	anguish	1	0.965
979	anniversary	1	0.965
980	announces	1	0.965
981	answer	1	0.965
982	antique	1	0.965
983	antisocial	1	0.965
984	anything	1	0.965
985	anywhere	1	0.965
986	ap-,	1	0.965
987	apologies	1	0.965
988	apologist	1	0.965
989	apologize	1	0.965
990	appear	1	0.965
991	appearance	1	0.965
992	appease	1	0.965
993	arbitrary	1	0.965
994	arcades	1	0.965
995	aristotle	1	0.965
996	arose	1	0.965
997	arrange	1	0.965
998	arrived	1	0.965
999	arrogance	1	0.965
1000	arrow	1	0.965
1001	article	1	0.965
1002	artifice	1	0.965
1003	as-,	1	0.965
1004	aspects	1	0.965
1005	assault	1	0.965
1006	assigned	1	0.965
1007	assumed	1	0.965
1008	astonished	1	0.965
1009	await	1	0.965
1010	awaken	1	0.965
1011	awakens	1	0.965
1012	awed	1	0.965
1013	awhile	1	0.965
1014	awoke	1	0.965
1015	babbled	1	0.965
1016	babbling	1	0.965
1017	babies	1	0.965
1018	backdam	1	0.965
1019	backsides	1	0.965
1020	backwards	1	0.965
1021	bags	1	0.965
1022	baju's	1	0.965
1023	baked	1	0.965
1024	banged	1	0.965
1025	bank	1	0.965
1026	barely	1	0.965
1027	barred	1	0.965
1028	barrels	1	0.965
1029	barren	1	0.965
1030	base	1	0.965
1031	based	1	0.965
1032	basis	1	0.965
1033	bastards	1	0.965
1034	bathos	1	0.965
1035	battleground	1	0.965
1036	bawling	1	0.965
1037	bear	1	0.965
1038	beasts	1	0.965
1039	beaten	1	0.965
1040	beforehand	1	0.965
1041	beggar	1	0.965
1042	beggared	1	0.965
1043	begging	1	0.965
1044	beginning	1	0.965
1045	beguiled	1	0.965
1046	beholds	1	0.965
1047	bellies	1	0.965
1048	belonged	1	0.965
1049	beloved	1	0.965
1050	bench	1	0.965
1051	bends	1	0.965
1052	beneath	1	0.965
1053	bereft	1	0.965
1054	beseech	1	0.965
1055	bespeak	1	0.965
1056	best	1	0.965
1057	betoken	1	0.965
1058	betrayal	1	0.965
1059	betrayed	1	0.965
1060	betrayer	1	0.965
1061	better	1	0.965
1062	billboards	1	0.965
1063	bitch	1	0.965
1064	bites	1	0.965
1065	bits	1	0.965
1066	bizarre	1	0.965
1067	bla-,	1	0.965
1068	blacker	1	0.965
1069	blade	1	0.965
1070	blasted	1	0.965
1071	bleakly	1	0.965
1072	blessed	1	0.965
1073	blindfolds	1	0.965
1074	blinds	1	0.965
1075	blood-cloth	1	0.965
1076	blossoming	1	0.965
1077	blow	1	0.965
1078	blowing	1	0.965
1079	blurb	1	0.965
1080	boiled	1	0.965
1081	boils	1	0.965
1082	bore	1	0.965
1083	bosom	1	0.965
1084	both	1	0.965
1085	bourg-,	1	0.965
1086	bourgeoisie	1	0.965
1087	bowl	1	0.965
1088	box	1	0.965
1089	boys'	1	0.965
1090	braced	1	0.965
1091	breakfast	1	0.965
1092	breaks	1	0.965
1093	breast	1	0.965
1094	breasts	1	0.965
1095	breathless	1	0.965
1096	breeding	1	0.965
1097	brick	1	0.965
1098	brides	1	0.965
1099	briefly	1	0.965
1100	brings	1	0.965
1101	bro-,	1	0.965
1102	broad	1	0.965
1103	brought	1	0.965
1104	bruise	1	0.965
1105	bruised	1	0.965
1106	bruises	1	0.965
1107	bruising	1	0.965
1108	bubbling	1	0.965
1109	bugger	1	0.965
1110	buoying	1	0.965
1111	burden	1	0.965
1112	burial	1	0.965
1113	buries	1	0.965
1114	burning	1	0.965
1115	buy	1	0.965
1116	c-s	1	0.965
1117	cackled	1	0.965
1118	cakes	1	0.965
1119	calculating	1	0.965
1120	calling	1	0.965
1121	canals	1	0.965
1122	canefields	1	0.965
1123	cannon	1	0.965
1124	cannot	1	0.965
1125	canonical	1	0.965
1126	canyons	1	0.965
1127	capitalizing	1	0.965
1128	captain's	1	0.965
1129	careful	1	0.965
1130	caressed	1	0.965
1131	cargo	1	0.965
1132	carnival	1	0.965
1133	carve	1	0.965
1134	carved	1	0.965
1135	cast	1	0.965
1136	catching	1	0.965
1137	caused	1	0.965
1138	ceremonies	1	0.965
1139	chained	1	0.965
1140	chains	1	0.965
1141	chants	1	0.965
1142	chapter	1	0.965
1143	character	1	0.965
1144	chasing	1	0.965
1145	chaucer	1	0.965
1146	chaucer's	1	0.965
1147	cheap	1	0.965
1148	checked	1	0.965
1149	checks	1	0.965
1150	cheque	1	0.965
1151	cherubims	1	0.965
1152	chests	1	0.965
1153	chisel	1	0.965
1154	choice	1	0.965
1155	choose	1	0.965
1156	christian	1	0.965
1157	chronicles	1	0.965
1158	chuck	1	0.965
1159	church	1	0.965
1160	cinemas	1	0.965
1161	circumference	1	0.965
1162	civilization	1	0.965
1163	clasped	1	0.965
1164	clean	1	0.965
1165	clearly	1	0.965
1166	clears	1	0.965
1167	cleeps	1	0.965
1168	clenched	1	0.965
1169	climax	1	0.965
1170	climbing	1	0.965
1171	clogged	1	0.965
1172	cloth	1	0.965
1173	clothes	1	0.965
1174	cloudless	1	0.965
1175	clouds	1	0.965
1176	clowns	1	0.965
1177	coarse	1	0.965
1178	coarsely	1	0.965
1179	coast	1	0.965
1180	coastline	1	0.965
1181	coax	1	0.965
1182	cobs	1	0.965
1183	cocaine	1	0.965
1184	collapse	1	0.965
1185	collide	1	0.965
1186	colonies	1	0.965
1187	colonize	1	0.965
1188	columbus	1	0.965
1189	comedy	1	0.965
1190	comely	1	0.965
1191	command	1	0.965
1192	commer-,	1	0.965
1193	common	1	0.965
1194	compassion	1	0.965
1195	complete	1	0.965
1196	composed	1	0.965
1197	compound	1	0.965
1198	concealing	1	0.965
1199	concentration	1	0.965
1200	concerned	1	0.965
1201	conclusion	1	0.965
1202	conference	1	0.965
1203	confidently	1	0.965
1204	confront	1	0.965
1205	confronting	1	0.965
1206	confuses	1	0.965
1207	congealed	1	0.965
1208	conjures	1	0.965
1209	connect	1	0.965
1210	connection	1	0.965
1211	conquistador	1	0.965
1212	consciousness	1	0.965
1213	consideration	1	0.965
1214	considering	1	0.965
1215	conspiracy	1	0.965
1216	content	1	0.965
1217	context	1	0.965
1218	continues	1	0.965
1219	convinced	1	0.965
1220	coolies	1	0.965
1221	coral	1	0.965
1222	core	1	0.965
1223	cork	1	0.965
1224	corpse	1	0.965
1225	cosmos	1	0.965
1226	cougars	1	0.965
1227	count	1	0.965
1228	counted	1	0.965
1229	counterfeit	1	0.965
1230	country	1	0.965
1231	courtship	1	0.965
1232	cousin	1	0.965
1233	cover	1	0.965
1234	cowardice	1	0.965
1235	cowrie	1	0.965
1236	cows	1	0.965
1237	crab-back	1	0.965
1238	crazy	1	0.965
1239	creased	1	0.965
1240	created	1	0.965
1241	creates	1	0.965
1242	creating	1	0.965
1243	creature	1	0.965
1244	creolization	1	0.965
1245	crest	1	0.965
1246	crevices	1	0.965
1247	cried	1	0.965
1248	crimson	1	0.965
1249	criseyde's	1	0.965
1250	criteria	1	0.965
1251	critic	1	0.965
1252	critical	1	0.965
1253	crop	1	0.965
1254	cross-legged	1	0.965
1255	crossed	1	0.965
1256	crowd	1	0.965
1257	crucifixes	1	0.965
1258	crudest	1	0.965
1259	crumbling	1	0.965
1260	cubla	1	0.965
1261	cuff	1	0.965
1262	cultural	1	0.965
1263	cunning	1	0.965
1264	cunt	1	0.965
1265	cunt-doll	1	0.965
1266	cup	1	0.965
1267	curls	1	0.965
1268	currents	1	0.965
1269	curry	1	0.965
1270	curson	1	0.965
1271	curves	1	0.965
1272	cut	1	0.965
1273	d-phil	1	0.965
1274	dams	1	0.965
1275	daniel	1	0.965
1276	dark	1	0.965
1277	dart	1	0.965
1278	dawn	1	0.965
1279	day's	1	0.965
1280	dazzle	1	0.965
1281	debauchery	1	0.965
1282	debt	1	0.965
1283	decipher	1	0.965
1284	decorative	1	0.965
1285	deed	1	0.965
1286	deeds	1	0.965
1287	defensive	1	0.965
1288	defined	1	0.965
1289	defining	1	0.965
1290	definite	1	0.965
1291	defoe	1	0.965
1292	defoe's	1	0.965
1293	degradation	1	0.965
1294	degrees	1	0.965
1295	deliberately	1	0.965
1296	delivered	1	0.965
1297	demons	1	0.965
1298	deny	1	0.965
1299	depths	1	0.965
1300	descriptions	1	0.965
1301	deserve	1	0.965
1302	design	1	0.965
1303	destination	1	0.965
1304	destroy	1	0.965
1305	detects	1	0.965
1306	devices	1	0.965
1307	devoted	1	0.965
1308	devotion	1	0.965
1309	diamond	1	0.965
1310	diet	1	0.965
1311	digging	1	0.965
1312	diminished	1	0.965
1313	diomede's	1	0.965
1314	dips	1	0.965
1315	directly	1	0.965
1316	dirt	1	0.965
1317	dirty	1	0.965
1318	disappear	1	0.965
1319	disappeared	1	0.965
1320	disappears	1	0.965
1321	discernible	1	0.965
1322	dishevelled	1	0.965
1323	disperse	1	0.965
1324	display	1	0.965
1325	dissertation	1	0.965
1326	dissolve	1	0.965
1327	divided	1	0.965
1328	done	1	0.965
1329	door	1	0.965
1330	doorways	1	0.965
1331	dough	1	0.965
1332	douse	1	0.965
1333	dozen	1	0.965
1334	drank	1	0.965
1335	dread	1	0.965
1336	dreadful	1	0.965
1337	dream	1	0.965
1338	dreamed	1	0.965
1339	dregs	1	0.965
1340	dress	1	0.965
1341	drew	1	0.965
1342	drifting	1	0.965
1343	drink	1	0.965
1344	drives	1	0.965
1345	dropped	1	0.965
1346	droppings	1	0.965
1347	dug	1	0.965
1348	dull	1	0.965
1349	dumping	1	0.965
1350	dung	1	0.965
1351	during	1	0.965
1352	dutifully	1	0.965
1353	dying	1	0.965
1354	e-,	1	0.965
1355	earlier	1	0.965
1356	ease	1	0.965
1357	easier	1	0.965
1358	eaten	1	0.965
1359	ecstatic	1	0.965
1360	edit	1	0.965
1361	eggs	1	0.965
1362	eight	1	0.965
1363	eighteen	1	0.965
1364	eighteen-	1	0.965
1365	either	1	0.965
1366	elapsed	1	0.965
1367	electronic	1	0.965
1368	eloquently	1	0.965
1369	else	1	0.965
1370	emerges	1	0.965
1371	empire	1	0.965
1372	emptiness	1	0.965
1373	endless	1	0.965
1374	endow	1	0.965
1375	endowed	1	0.965
1376	enfolding	1	0.965
1377	engrossed	1	0.965
1378	enrich	1	0.965
1379	enslaved	1	0.965
1380	entrance	1	0.965
1381	entreaties	1	0.965
1382	epic	1	0.965
1383	epistemological	1	0.965
1384	equal	1	0.965
1385	eradication	1	0.965
1386	erases	1	0.965
1387	escarpments	1	0.965
1388	eskimos	1	0.965
1389	essay	1	0.965
1390	evening	1	0.965
1391	everybody	1	0.965
1392	evil	1	0.965
1393	evoke	1	0.965
1394	ex-student	1	0.965
1395	examination	1	0.965
1396	examine	1	0.965
1397	examined	1	0.965
1398	examining	1	0.965
1399	exchanges	1	0.965
1400	excitement	1	0.965
1401	exercised	1	0.965
1402	exhaustion	1	0.965
1403	exotic	1	0.965
1404	explores	1	0.965
1405	exploring	1	0.965
1406	exposed	1	0.965
1407	extravagant	1	0.965
1408	failed	1	0.965
1409	faithful	1	0.965
1410	falling	1	0.965
1411	familiar	1	0.965
1412	familiarity	1	0.965
1413	fanned	1	0.965
1414	fanning	1	0.965
1415	fantastic	1	0.965
1416	fare	1	0.965
1417	fascist	1	0.965
1418	fashion	1	0.965
1419	fate	1	0.965
1420	fatten	1	0.965
1421	fattening	1	0.965
1422	favour	1	0.965
1423	favourite	1	0.965
1424	favours	1	0.965
1425	fearful	1	0.965
1426	feel	1	0.965
1427	feign	1	0.965
1428	fell	1	0.965
1429	fence	1	0.965
1430	fidgeting	1	0.965
1431	field	1	0.965
1432	fiercely	1	0.965
1433	fifteen-thousand	1	0.965
1434	fifty-thousand-million	1	0.965
1435	file	1	0.965
1436	filled	1	0.965
1437	filling	1	0.965
1438	film	1	0.965
1439	filter	1	0.965
1440	fine	1	0.965
1441	fingers	1	0.965
1442	finish	1	0.965
1443	finished	1	0.965
1444	fireside	1	0.965
1445	fished	1	0.965
1446	fist	1	0.965
1447	fit	1	0.965
1448	five-hundredth	1	0.965
1449	flames	1	0.965
1450	flashed	1	0.965
1451	flat	1	0.965
1452	flayed	1	0.965
1453	flecks	1	0.965
1454	flicked	1	0.965
1455	flicking	1	0.965
1456	flies	1	0.965
1457	float	1	0.965
1458	flood	1	0.965
1459	flooding	1	0.965
1460	flowers	1	0.965
1461	flowers'	1	0.965
1462	flush	1	0.965
1463	fly	1	0.965
1464	foil	1	0.965
1465	foliage	1	0.965
1466	followed	1	0.965
1467	fondled	1	0.965
1468	food	1	0.965
1469	footstep	1	0.965
1470	forbidden	1	0.965
1471	force	1	0.965
1472	foreground	1	0.965
1473	forehead	1	0.965
1474	forget	1	0.965
1475	forgive	1	0.965
1476	forgot	1	0.965
1477	forgotten	1	0.965
1478	forlornly	1	0.965
1479	forms	1	0.965
1480	forth	1	0.965
1481	fortifications	1	0.965
1482	fountain	1	0.965
1483	fountaining	1	0.965
1484	fours	1	0.965
1485	frail	1	0.965
1486	frangipani	1	0.965
1487	freshly	1	0.965
1488	freshness	1	0.965
1489	fried	1	0.965
1490	friends	1	0.965
1491	frig-,	1	0.965
1492	frigates	1	0.965
1493	frigged	1	0.965
1494	front	1	0.965
1495	froze	1	0.965
1496	fuck-arse	1	0.965
1497	fumbled	1	0.965
1498	further	1	0.965
1499	gaiety	1	0.965
1500	game	1	0.965
1501	gang	1	0.965
1502	garden	1	0.965
1503	gardeners	1	0.965
1504	gasps	1	0.965
1505	gate	1	0.965
1506	gather	1	0.965
1507	gaudy	1	0.965
1508	generally	1	0.965
1509	generations	1	0.965
1510	gentle	1	0.965
1511	genuinely	1	0.965
1512	germans	1	0.965
1513	germany	1	0.965
1514	gestures	1	0.965
1515	gifts	1	0.965
1516	gilmore	1	0.965
1517	gilmore's	1	0.965
1518	girl's	1	0.965
1519	give	1	0.965
1520	gives	1	0.965
1521	glad	1	0.965
1522	glanced	1	0.965
1523	glances	1	0.965
1524	glared	1	0.965
1525	gleaming	1	0.965
1526	glimpses	1	0.965
1527	glittering	1	0.965
1528	globe	1	0.965
1529	gloomy	1	0.965
1530	glow	1	0.965
1531	gobble	1	0.965
1532	god's	1	0.965
1533	golding	1	0.965
1534	gone	1	0.965
1535	goods	1	0.965
1536	gorgeous	1	0.965
1537	gothic	1	0.965
1538	gouging	1	0.965
1539	grabbed	1	0.965
1540	gradual	1	0.965
1541	grandpa's	1	0.965
1542	grap-,	1	0.965
1543	grappled	1	0.965
1544	graves	1	0.965
1545	gravestone	1	0.965
1546	gravestones	1	0.965
1547	graveyard	1	0.965
1548	graze	1	0.965
1549	gri-,	1	0.965
1550	grief	1	0.965
1551	grievance	1	0.965
1552	grin	1	0.965
1553	gripping	1	0.965
1554	grope	1	0.965
1555	gropes	1	0.965
1556	grunted	1	0.965
1557	grunting	1	0.965
1558	guarded	1	0.965
1559	gulped	1	0.965
1560	guy-,	1	0.965
1561	ha-,	1	0.965
1562	habituated	1	0.965
1563	halfway	1	0.965
1564	handles	1	0.965
1565	hang	1	0.965
1566	happen	1	0.965
1567	hardening	1	0.965
1568	harvest	1	0.965
1569	harvests	1	0.965
1570	haughty	1	0.965
1571	haven	1	0.965
1572	hazarded	1	0.965
1573	headed	1	0.965
1574	headset	1	0.965
1575	headstone	1	0.965
1576	heal	1	0.965
1577	healed	1	0.965
1578	hearts	1	0.965
1579	heathen	1	0.965
1580	heavy	1	0.965
1581	heed	1	0.965
1582	hence	1	0.965
1583	hercules	1	0.965
1584	herd	1	0.965
1585	heritage	1	0.965
1586	hers	1	0.965
1587	hibiscus	1	0.965
1588	hides	1	0.965
1589	high	1	0.965
1590	higher	1	0.965
1591	hispaniola	1	0.965
1592	hitherto	1	0.965
1593	hoes	1	0.965
1594	hog	1	0.965
1595	hogarth's	1	0.965
1596	hold	1	0.965
1597	holding	1	0.965
1598	honour	1	0.965
1599	hooked	1	0.965
1600	hoping	1	0.965
1601	horrible	1	0.965
1602	horse	1	0.965
1603	hospital	1	0.965
1604	hot	1	0.965
1605	hue	1	0.965
1606	hugging	1	0.965
1607	huh	1	0.965
1608	humiliated	1	0.965
1609	hundred-and-fifty	1	0.965
1610	hungry	1	0.965
1611	hunter	1	0.965
1612	hunting	1	0.965
1613	husband	1	0.965
1614	i'd	1	0.965
1615	ice	1	0.965
1616	ideologically	1	0.965
1617	idle	1	0.965
1618	idly	1	0.965
1619	ignorantly	1	0.965
1620	illegal	1	0.965
1621	illusion	1	0.965
1622	imaginations	1	0.965
1623	imagined	1	0.965
1624	imagines	1	0.965
1625	immediately	1	0.965
1626	immensities	1	0.965
1627	immortalize	1	0.965
1628	imperiously	1	0.965
1629	implanted	1	0.965
1630	impregnated	1	0.965
1631	impressing	1	0.965
1632	imprint	1	0.965
1633	imprisoned	1	0.965
1634	including	1	0.965
1635	indelicately	1	0.965
1636	indented	1	0.965
1637	indians	1	0.965
1638	indies	1	0.965
1639	indifferently	1	0.965
1640	individual	1	0.965
1641	infested	1	0.965
1642	influenced	1	0.965
1643	ing-,	1	0.965
1644	injured	1	0.965
1645	innocent	1	0.965
1646	inscribed	1	0.965
1647	inside	1	0.965
1648	insisted	1	0.965
1649	instinct	1	0.965
1650	instruct	1	0.965
1651	instruction	1	0.965
1652	instrument	1	0.965
1653	insufficient	1	0.965
1654	insurance	1	0.965
1655	intellectually	1	0.965
1656	interior	1	0.965
1657	internal	1	0.965
1658	interrupt	1	0.965
1659	invent	1	0.965
1660	inventions	1	0.965
1661	invocations	1	0.965
1662	involuntarily	1	0.965
1663	iris	1	0.965
1664	iron	1	0.965
1665	isn't	1	0.965
1666	italian	1	0.965
1667	itch	1	0.965
1668	itself	1	0.965
1669	jacket	1	0.965
1670	jamaica	1	0.965
1671	jaws	1	0.965
1672	jellyfish	1	0.965
1673	jest	1	0.965
1674	jesus	1	0.965
1675	jhal	1	0.965
1676	jocularly	1	0.965
1677	john	1	0.965
1678	johnson	1	0.965
1679	joined	1	0.965
1680	journal	1	0.965
1681	jubilation	1	0.965
1682	jump	1	0.965
1683	jumps	1	0.965
1684	keeper	1	0.965
1685	kept	1	0.965
1686	kick	1	0.965
1687	kidnapping	1	0.965
1688	kill	1	0.965
1689	killing	1	0.965
1690	kingdom	1	0.965
1691	kisses	1	0.965
1692	knight	1	0.965
1693	knives	1	0.965
1694	knows	1	0.965
1695	kwesi	1	0.965
1696	labba	1	0.965
1697	labouring	1	0.965
1698	lacked	1	0.965
1699	lame	1	0.965
1700	lamentation	1	0.965
1701	lance	1	0.965
1702	lands	1	0.965
1703	landscape	1	0.965
1704	languages	1	0.965
1705	lap	1	0.965
1706	large	1	0.965
1707	largely	1	0.965
1708	lashes	1	0.965
1709	late	1	0.965
1710	laughed	1	0.965
1711	laughter	1	0.965
1712	laurels	1	0.965
1713	lawn	1	0.965
1714	lays	1	0.965
1715	lea-,	1	0.965
1716	leading	1	0.965
1717	leaf	1	0.965
1718	lean	1	0.965
1719	learned	1	0.965
1720	leaves	1	0.965
1721	led	1	0.965
1722	legendary	1	0.965
1723	lesser	1	0.965
1724	lest	1	0.965
1725	letter	1	0.965
1726	lettering	1	0.965
1727	letterings	1	0.965
1728	letters	1	0.965
1729	level	1	0.965
1730	lewis	1	0.965
1731	liberally	1	0.965
1732	library	1	0.965
1733	lifeless	1	0.965
1734	lift	1	0.965
1735	lilies	1	0.965
1736	lined	1	0.965
1737	lingered	1	0.965
1738	linton	1	0.965
1739	lip	1	0.965
1740	liquids	1	0.965
1741	listen	1	0.965
1742	literary	1	0.965
1743	littered	1	0.965
1744	lived	1	0.965
1745	livid	1	0.965
1746	living	1	0.965
1747	livingness	1	0.965
1748	local	1	0.965
1749	lock	1	0.965
1750	lofty	1	0.965
1751	loneliness	1	0.965
1752	lonely	1	0.965
1753	longing	1	0.965
1754	longs	1	0.965
1755	looks	1	0.965
1756	loop	1	0.965
1757	loose	1	0.965
1758	loosed	1	0.965
1759	loosen	1	0.965
1760	loosening	1	0.965
1761	lopsided	1	0.965
1762	losses	1	0.965
1763	lots	1	0.965
1764	loud	1	0.965
1765	louder	1	0.965
1766	loved	1	0.965
1767	lover's	1	0.965
1768	loving	1	0.965
1769	lowers	1	0.965
1770	lucri-,	1	0.965
1771	lullaby	1	0.965
1772	lungs	1	0.965
1773	lurked	1	0.965
1774	lying	1	0.965
1775	m-,	1	0.965
1776	m-a	1	0.965
1777	machines	1	0.965
1778	magical	1	0.965
1779	magician	1	0.965
1780	maju	1	0.965
1781	male	1	0.965
1782	malice	1	0.965
1783	man's	1	0.965
1784	manner	1	0.965
1785	margins	1	0.965
1786	mark	1	0.965
1787	marked	1	0.965
1788	marks	1	0.965
1789	married	1	0.965
1790	marvelled	1	0.965
1791	mashing	1	0.965
1792	mask	1	0.965
1793	massage	1	0.965
1794	massive	1	0.965
1795	master's	1	0.965
1796	masterpiece	1	0.965
1797	materialized	1	0.965
1798	may	1	0.965
1799	maybe	1	0.965
1800	meaning	1	0.965
1801	means	1	0.965
1802	meant	1	0.965
1803	meantime	1	0.965
1804	meats	1	0.965
1805	meet	1	0.965
1806	mentio-,	1	0.965
1807	mere	1	0.965
1808	merely	1	0.965
1809	mid-ground	1	0.965
1810	mile	1	0.965
1811	mingling	1	0.965
1812	mispronouncing	1	0.965
1813	misses	1	0.965
1814	mock	1	0.965
1815	mocked	1	0.965
1816	monkeys	1	0.965
1817	morally	1	0.965
1818	mother's	1	0.965
1819	mothers	1	0.965
1820	motives	1	0.965
1821	mouldy	1	0.965
1822	mound	1	0.965
1823	movement	1	0.965
1824	movements	1	0.965
1825	moves	1	0.965
1826	mr	1	0.965
1827	multiply	1	0.965
1828	multiplying	1	0.965
1829	munificence	1	0.965
1830	murdered	1	0.965
1831	mushrooms	1	0.965
1832	muttering	1	0.965
1833	mysterious	1	0.965
1834	mysteriously	1	0.965
1835	nag	1	0.965
1836	naked	1	0.965
1837	named	1	0.965
1838	names	1	0.965
1839	narrow	1	0.965
1840	nationalistic	1	0.965
1841	nauseous	1	0.965
1842	nearly	1	0.965
1843	necessary	1	0.965
1844	necklaces	1	0.965
1845	necks	1	0.965
1846	need	1	0.965
1847	needed	1	0.965
1848	negotiate	1	0.965
1849	neo-	1	0.965
1850	nero	1	0.965
1851	net	1	0.965
1852	newly	1	0.965
1853	news	1	0.965
1854	nibbling	1	0.965
1855	niggers	1	0.965
1856	nights	1	0.965
1857	nineteen-ninety	1	0.965
1858	nineteenth	1	0.965
1859	ninety-four	1	0.965
1860	ninety-three	1	0.965
1861	ninety-two	1	0.965
1862	nipple	1	0.965
1863	noah	1	0.965
1864	nobel	1	0.965
1865	noblest	1	0.965
1866	noblewoman	1	0.965
1867	nodded	1	0.965
1868	none	1	0.965
1869	nostalgia	1	0.965
1870	nostrils	1	0.965
1871	nothingness	1	0.965
1872	noticed	1	0.965
1873	numbers	1	0.965
1874	nurture	1	0.965
1875	obliterating	1	0.965
1876	obscene	1	0.965
1877	obscure	1	0.965
1878	obsessed	1	0.965
1879	obsessively	1	0.965
1880	obstinately	1	0.965
1881	occupied	1	0.965
1882	odyssey	1	0.965
1883	offering	1	0.965
1884	office	1	0.965
1885	omeros	1	0.965
1886	one's	1	0.965
1887	ones	1	0.965
1888	oozing	1	0.965
1889	opening	1	0.965
1890	ordinariness	1	0.965
1891	ordinary	1	0.965
1892	organized	1	0.965
1893	original	1	0.965
1894	ornaments	1	0.965
1895	otters	1	0.965
1896	ours	1	0.965
1897	outrageous	1	0.965
1898	outside	1	0.965
1899	outweighed	1	0.965
1900	overcome	1	0.965
1901	overpay	1	0.965
1902	overwhelm	1	0.965
1903	overwhelmed	1	0.965
1904	oxen	1	0.965
1905	packed	1	0.965
1906	paid	1	0.965
1907	paintings	1	0.965
1908	palette	1	0.965
1909	panic	1	0.965
1910	papa	1	0.965
1911	paper	1	0.965
1912	parameters	1	0.965
1913	parcelled	1	0.965
1914	parent	1	0.965
1915	parlours	1	0.965
1916	particles	1	0.965
1917	particular	1	0.965
1918	parts	1	0.965
1919	patiently	1	0.965
1920	pattern	1	0.965
1921	patterned	1	0.965
1922	pausing	1	0.965
1923	paying	1	0.965
1924	pe-,	1	0.965
1925	pearls	1	0.965
1926	pebbles	1	0.965
1927	peculiarly	1	0.965
1928	peeled	1	0.965
1929	peeping	1	0.965
1930	peoples	1	0.965
1931	percentage	1	0.965
1932	percentages	1	0.965
1933	performance	1	0.965
1934	permission	1	0.965
1935	permitted	1	0.965
1936	person	1	0.965
1937	persuaded	1	0.965
1938	perverse	1	0.965
1939	philosophically	1	0.965
1940	photograph	1	0.965
1941	phrase	1	0.965
1942	physically	1	0.965
1943	pick	1	0.965
1944	picnic	1	0.965
1945	piece	1	0.965
1946	piercing	1	0.965
1947	pillars	1	0.965
1948	pink	1	0.965
1949	pinned	1	0.965
1950	pipe	1	0.965
1951	piranha	1	0.965
1952	pirate	1	0.965
1953	pitched	1	0.965
1954	placed	1	0.965
1955	placing	1	0.965
1956	plaid	1	0.965
1957	plainest	1	0.965
1958	plantains	1	0.965
1959	plaque	1	0.965
1960	plastic	1	0.965
1961	plato	1	0.965
1962	playful	1	0.965
1963	playground	1	0.965
1964	playing	1	0.965
1965	pleased	1	0.965
1966	pleasing	1	0.965
1967	plough	1	0.965
1968	plucked	1	0.965
1969	plucks	1	0.965
1970	po-,	1	0.965
1971	pockets	1	0.965
1972	poetic	1	0.965
1973	poisoned	1	0.965
1974	poofs	1	0.965
1975	pope	1	0.965
1976	portent	1	0.965
1977	portions	1	0.965
1978	possess	1	0.965
1979	possessed	1	0.965
1980	possession	1	0.965
1981	possessions	1	0.965
1982	potions	1	0.965
1983	power	1	0.965
1984	practised	1	0.965
1985	praise	1	0.965
1986	preacher	1	0.965
1987	precarious	1	0.965
1988	precious	1	0.965
1989	prefer	1	0.965
1990	pregnancy	1	0.965
1991	preparations	1	0.965
1992	prepared	1	0.965
1993	press	1	0.965
1994	pretending	1	0.965
1995	prick	1	0.965
1996	priest	1	0.965
1997	prince	1	0.965
1998	print	1	0.965
1999	pristine	1	0.965
2000	privilege	1	0.965
2001	prize	1	0.965
2002	professor	1	0.965
2003	profitably	1	0.965
2004	profound	1	0.965
2005	profusion	1	0.965
2006	property	1	0.965
2007	prophecy	1	0.965
2008	prophesy	1	0.965
2009	proprietor	1	0.965
2010	proprietors	1	0.965
2011	protectively	1	0.965
2012	proudly	1	0.965
2013	provoked	1	0.965
2014	pseudo-jamaican	1	0.965
2015	published	1	0.965
2016	puffed	1	0.965
2017	punishment	1	0.965
2018	purchase	1	0.965
2019	purity	1	0.965
2020	purple	1	0.965
2021	purpose	1	0.965
2022	push	1	0.965
2023	pushing	1	0.965
2024	pussy	1	0.965
2025	puzzled	1	0.965
2026	qualification	1	0.965
2027	quality	1	0.965
2028	queers	1	0.965
2029	quell	1	0.965
2030	question	1	0.965
2031	queue	1	0.965
2032	quickened	1	0.965
2033	rab-,	1	0.965
2034	racks	1	0.965
2035	raggedly	1	0.965
2036	rags	1	0.965
2037	rainwater	1	0.965
2038	raised	1	0.965
2039	raises	1	0.965
2040	rank	1	0.965
2041	rape	1	0.965
2042	rapture	1	0.965
2043	rash	1	0.965
2044	rastafarian	1	0.965
2045	rather	1	0.965
2046	rats	1	0.965
2047	raven	1	0.965
2048	ravish	1	0.965
2049	reaches	1	0.965
2050	ready	1	0.965
2051	realization	1	0.965
2052	realize	1	0.965
2053	reaped	1	0.965
2054	rear	1	0.965
2055	reassurance	1	0.965
2056	received	1	0.965
2057	receives	1	0.965
2058	reckoning	1	0.965
2059	reclaimed	1	0.965
2060	reclimbing	1	0.965
2061	recognition	1	0.965
2062	reconceptualizing	1	0.965
2063	recover	1	0.965
2064	recreate	1	0.965
2065	redemptive	1	0.965
2066	refashioned	1	0.965
2067	reflect	1	0.965
2068	reflected	1	0.965
2069	reflection	1	0.965
2070	regain	1	0.965
2071	regular	1	0.965
2072	relief	1	0.965
2073	remaining	1	0.965
2074	remembered	1	0.965
2075	remove	1	0.965
2076	renewed	1	0.965
2077	reparation	1	0.965
2078	repeated	1	0.965
2079	report	1	0.965
2080	research	1	0.965
2081	resolve	1	0.965
2082	restaurants	1	0.965
2083	restoring	1	0.965
2084	restricted	1	0.965
2085	result	1	0.965
2086	resurrection	1	0.965
2087	retreat	1	0.965
2088	retreated	1	0.965
2089	return	1	0.965
2090	returns	1	0.965
2091	revelation	1	0.965
2092	revengefulness	1	0.965
2093	reverence	1	0.965
2094	revisualization	1	0.965
2095	revisualizing	1	0.965
2096	revive	1	0.965
2097	rhetoric	1	0.965
2098	riches	1	0.965
2099	richly	1	0.965
2100	richness	1	0.965
2101	rid-,	1	0.965
2102	rioting	1	0.965
2103	ripped	1	0.965
2104	rise	1	0.965
2105	ritual	1	0.965
2106	roam	1	0.965
2107	roamed	1	0.965
2108	roar	1	0.965
2109	rolling	1	0.965
2110	romantic	1	0.965
2111	romantics	1	0.965
2112	roni-,	1	0.965
2113	rooted	1	0.965
2114	roots	1	0.965
2115	rope	1	0.965
2116	ropes	1	0.965
2117	rose	1	0.965
2118	rouge	1	0.965
2119	roundness	1	0.965
2120	royalties	1	0.965
2121	royalty	1	0.965
2122	rub	1	0.965
2123	rubies	1	0.965
2124	runs	1	0.965
2125	rush	1	0.965
2126	rushed	1	0.965
2127	ruskin	1	0.965
2128	rusty	1	0.965
2129	saba	1	0.965
2130	sadu	1	0.965
2131	safety	1	0.965
2132	sailor's	1	0.965
2133	sale	1	0.965
2134	salty	1	0.965
2135	sat	1	0.965
2136	saturday	1	0.965
2137	savagely	1	0.965
2138	saving	1	0.965
2139	says	1	0.965
2140	scaling	1	0.965
2141	scarce	1	0.965
2142	scarify	1	0.965
2143	scatter	1	0.965
2144	scattered	1	0.965
2145	scavenged	1	0.965
2146	scented	1	0.965
2147	scholarship	1	0.965
2148	scorn	1	0.965
2149	screamed	1	0.965
2150	screaming	1	0.965
2151	screeching	1	0.965
2152	scrolls	1	0.965
2153	sea's	1	0.965
2154	seal's	1	0.965
2155	secondborn	1	0.965
2156	seconds	1	0.965
2157	secreted	1	0.965
2158	security	1	0.965
2159	seed	1	0.965
2160	seeing	1	0.965
2161	seeking	1	0.965
2162	seen	1	0.965
2163	seeped	1	0.965
2164	self-	1	0.965
2165	selflessly	1	0.965
2166	sell	1	0.965
2167	selling	1	0.965
2168	senior	1	0.965
2169	senses	1	0.965
2170	sensing	1	0.965
2171	serious	1	0.965
2172	seriously	1	0.965
2173	services	1	0.965
2174	seven	1	0.965
2175	seventeen	1	0.965
2176	seventeen-eighties	1	0.965
2177	sh-,	1	0.965
2178	shaped	1	0.965
2179	shaping	1	0.965
2180	shark	1	0.965
2181	sharp	1	0.965
2182	sharpening	1	0.965
2183	shave	1	0.965
2184	shelf	1	0.965
2185	shells	1	0.965
2186	shield	1	0.965
2187	shiny	1	0.965
2188	shipped	1	0.965
2189	ships	1	0.965
2190	shit	1	0.965
2191	shoals	1	0.965
2192	shocked	1	0.965
2193	shone	1	0.965
2194	shops	1	0.965
2195	short	1	0.965
2196	shorten	1	0.965
2197	shouted	1	0.965
2198	shove	1	0.965
2199	showed	1	0.965
2200	shows	1	0.965
2201	shrieks	1	0.965
2202	shrimps	1	0.965
2203	shuddered	1	0.965
2204	shut	1	0.965
2205	shy	1	0.965
2206	sickle	1	0.965
2207	sideways	1	0.965
2208	silent	1	0.965
2209	silently	1	0.965
2210	simpler	1	0.965
2211	sin	1	0.965
2212	sinking	1	0.965
2213	sinning	1	0.965
2214	sixty	1	0.965
2215	skels	1	0.965
2216	sketches	1	0.965
2217	skirt	1	0.965
2218	sky's	1	0.965
2219	slag	1	0.965
2220	slapped	1	0.965
2221	slaps	1	0.965
2222	sleeping	1	0.965
2223	slight	1	0.965
2224	slithers	1	0.965
2225	slob	1	0.965
2226	slow	1	0.965
2227	sluggishly	1	0.965
2228	smell	1	0.965
2229	smells	1	0.965
2230	smile	1	0.965
2231	smiling	1	0.965
2232	smoked	1	0.965
2233	smooched	1	0.965
2234	smouldered	1	0.965
2235	snakes	1	0.965
2236	snaps	1	0.965
2237	sniffs	1	0.965
2238	snorting	1	0.965
2239	so-,	1	0.965
2240	social	1	0.965
2241	sofa	1	0.965
2242	softening	1	0.965
2243	soothe	1	0.965
2244	soothes	1	0.965
2245	sort	1	0.965
2246	sounds	1	0.965
2247	south	1	0.965
2248	southernly	1	0.965
2249	spanking	1	0.965
2250	spears	1	0.965
2251	spectacle	1	0.965
2252	speed	1	0.965
2253	spheres	1	0.965
2254	spills	1	0.965
2255	spite	1	0.965
2256	sponges	1	0.965
2257	spontaneously	1	0.965
2258	sport	1	0.965
2259	spot	1	0.965
2260	sprawled	1	0.965
2261	sprawling	1	0.965
2262	sprightly	1	0.965
2263	spurting	1	0.965
2264	squirming	1	0.965
2265	stampeded	1	0.965
2266	standing	1	0.965
2267	stares	1	0.965
2268	start	1	0.965
2269	steadfast	1	0.965
2270	stepping	1	0.965
2271	steps	1	0.965
2272	sternly	1	0.965
2273	sticking	1	0.965
2274	stiff	1	0.965
2275	stifle	1	0.965
2276	stomach	1	0.965
2277	stomp	1	0.965
2278	stonemason	1	0.965
2279	stones	1	0.965
2280	stopped	1	0.965
2281	stopping	1	0.965
2282	stories	1	0.965
2283	storms	1	0.965
2284	stormy	1	0.965
2285	stout	1	0.965
2286	straight	1	0.965
2287	streaked	1	0.965
2288	streets	1	0.965
2289	strict	1	0.965
2290	stripes	1	0.965
2291	striptease	1	0.965
2292	strong	1	0.965
2293	stronger	1	0.965
2294	struck	1	0.965
2295	structures	1	0.965
2296	struggled	1	0.965
2297	stubble	1	0.965
2298	stubbornness	1	0.965
2299	stuck	1	0.965
2300	student	1	0.965
2301	study	1	0.965
2302	stuff	1	0.965
2303	stupefaction	1	0.965
2304	style	1	0.965
2305	subdue	1	0.965
2306	substantial	1	0.965
2307	subtracted	1	0.965
2308	sucks	1	0.965
2309	suffer	1	0.965
2310	suffered	1	0.965
2311	suffering	1	0.965
2312	suffolk	1	0.965
2313	sugared	1	0.965
2314	sugarloaves	1	0.965
2315	suit	1	0.965
2316	sulks	1	0.965
2317	sunken	1	0.965
2318	supervision	1	0.965
2319	suppose	1	0.965
2320	suppressing	1	0.965
2321	surmise	1	0.965
2322	surrounded	1	0.965
2323	surrounding	1	0.965
2324	survive	1	0.965
2325	sustaining	1	0.965
2326	swallow	1	0.965
2327	swallowed	1	0.965
2328	swallowing	1	0.965
2329	swarming	1	0.965
2330	swatter	1	0.965
2331	swearing	1	0.965
2332	sweat	1	0.965
2333	swelled	1	0.965
2334	swelling	1	0.965
2335	swiftly	1	0.965
2336	swim	1	0.965
2337	swimming	1	0.965
2338	switched	1	0.965
2339	sword	1	0.965
2340	swords	1	0.965
2341	t-,	1	0.965
2342	table	1	0.965
2343	takes	1	0.965
2344	taking	1	0.965
2345	tally	1	0.965
2346	tanda	1	0.965
2347	tastes	1	0.965
2348	taught	1	0.965
2349	taxes	1	0.965
2350	taxman	1	0.965
2351	tears	1	0.965
2352	temples	1	0.965
2353	tended	1	0.965
2354	terribly	1	0.965
2355	terror	1	0.965
2356	test	1	0.965
2357	tested	1	0.965
2358	texture	1	0.965
2359	th-,	1	0.965
2360	theme	1	0.965
2361	they'll	1	0.965
2362	thin	1	0.965
2363	thinking	1	0.965
2364	thinks	1	0.965
2365	thistlewood's	1	0.965
2366	thorns	1	0.965
2367	thousand-and-a-half	1	0.965
2368	threatening	1	0.965
2369	threatens	1	0.965
2370	three-hundred	1	0.965
2371	throat	1	0.965
2372	thumb	1	0.965
2373	thumbed	1	0.965
2374	tie	1	0.965
2375	tied	1	0.965
2376	tightly	1	0.965
2377	tiny	1	0.965
2378	tired	1	0.965
2379	tiring	1	0.965
2380	titanic	1	0.965
2381	tobacco	1	0.965
2382	toenail	1	0.965
2383	together	1	0.965
2384	told	1	0.965
2385	tolerance	1	0.965
2386	tomb	1	0.965
2387	tomb's	1	0.965
2388	tongueless	1	0.965
2389	tonight	1	0.965
2390	tools	1	0.965
2391	touch	1	0.965
2392	tour	1	0.965
2393	trace	1	0.965
2394	trade	1	0.965
2395	trader	1	0.965
2396	tragedy	1	0.965
2397	trailing	1	0.965
2398	train	1	0.965
2399	transformation	1	0.965
2400	traveller	1	0.965
2401	treasure	1	0.965
2402	tree	1	0.965
2403	trench	1	0.965
2404	trespass	1	0.965
2405	tribute	1	0.965
2406	triumph	1	0.965
2407	trojan	1	0.965
2408	trooped	1	0.965
2409	trouble	1	0.965
2410	troy	1	0.965
2411	trumpets	1	0.965
2412	trussed	1	0.965
2413	truth	1	0.965
2414	try	1	0.965
2415	tugged	1	0.965
2416	tumble	1	0.965
2417	tune	1	0.965
2418	twenty	1	0.965
2419	twenty-five	1	0.965
2420	twigs	1	0.965
2421	twinkling	1	0.965
2422	twirls	1	0.965
2423	twisted	1	0.965
2424	two-hundred	1	0.965
2425	type	1	0.965
2426	u-,	1	0.965
2427	unable	1	0.965
2428	unborn	1	0.965
2429	underpinnings	1	0.965
2430	understand	1	0.965
2431	undertaker's	1	0.965
2432	unending	1	0.965
2433	unfamiliar	1	0.965
2434	unfolding	1	0.965
2435	unfulfilled	1	0.965
2436	ungratefully	1	0.965
2437	unhappiness	1	0.965
2438	unhewn	1	0.965
2439	universal	1	0.965
2440	unlike	1	0.965
2441	unnamed	1	0.965
2442	unpastes	1	0.965
2443	unpublished	1	0.965
2444	unties	1	0.965
2445	unwrap	1	0.965
2446	urien	1	0.965
2447	using	1	0.965
2448	utter	1	0.965
2449	valour	1	0.965
2450	value	1	0.965
2451	vanquished	1	0.965
2452	vaster	1	0.965
2453	veil	1	0.965
2454	venture	1	0.965
2455	ventured	1	0.965
2456	ver-,	1	0.965
2457	videos	1	0.965
2458	view	1	0.965
2459	viewer's	1	0.965
2460	villagers	1	0.965
2461	villages	1	0.965
2462	violent	1	0.965
2463	wa-,	1	0.965
2464	wake	1	0.965
2465	walking	1	0.965
2466	walls	1	0.965
2467	wander	1	0.965
2468	wandered	1	0.965
2469	waning	1	0.965
2470	wanking	1	0.965
2471	war	1	0.965
2472	warehouse	1	0.965
2473	wares	1	0.965
2474	warriors	1	0.965
2475	wash	1	0.965
2476	washed	1	0.965
2477	washing	1	0.965
2478	wasn't	1	0.965
2479	watches	1	0.965
2480	waves	1	0.965
2481	wayward	1	0.965
2482	weapon	1	0.965
2483	wears	1	0.965
2484	weaves	1	0.965
2485	weaving	1	0.965
2486	week	1	0.965
2487	weeks	1	0.965
2488	weep	1	0.965
2489	weeping	1	0.965
2490	western	1	0.965
2491	wet	1	0.965
2492	what's	1	0.965
2493	wheeze	1	0.965
2494	whereby	1	0.965
2495	wherever	1	0.965
2496	whether	1	0.965
2497	whim	1	0.965
2498	whip	1	0.965
2499	whips	1	0.965
2500	whores	1	0.965
2501	wife	1	0.965
2502	wild	1	0.965
2503	wildest	1	0.965
2504	willed	1	0.965
2505	win	1	0.965
2506	windows	1	0.965
2507	wine	1	0.965
2508	wings	1	0.965
2509	winking	1	0.965
2510	winning	1	0.965
2511	wipe	1	0.965
2512	wish	1	0.965
2513	wished	1	0.965
2514	withdrawal	1	0.965
2515	witter	1	0.965
2516	woke	1	0.965
2517	won	1	0.965
2518	wonderland	1	0.965
2519	workers	1	0.965
2520	works	1	0.965
2521	wriggled	1	0.965
2522	wringing	1	0.965
2523	wrinkled	1	0.965
2524	writhes	1	0.965
2525	yard	1	0.965
2526	year	1	0.965
2527	yolk	1	0.965
2528	zong	1	0.965

2.2 Association scores

Next, we calculate the association scores with a call to assoc_scores(), passing first the target frequency list flist_target and then the reference frequency list flist_ref. We’ll store the result in a variable called scores_kw, which we can then print sorted by PMI or by signed $G^2$.

# calculate scores
scores_kw <- assoc_scores(flist_target, flist_ref)

# print scores, sorted by PMI
print(scores_kw, sort_order = "PMI")

Association scores (types in list: 575, sort order criterion: PMI)
          type    a  PMI G_signed|    b   c       d dir exp_a DP_rows
 1        manu 15.5 7.25    152.3|10346 0.5 1614253   1 0.102   0.001
 2   gladstone 13.5 7.24    132.2|10348 0.5 1614253   1 0.089   0.001
 3      rohini 10.5 7.23    102.1|10352 0.5 1614253   1 0.070   0.001
 4     troilus  9.5 7.22     92.1|10352 0.5 1614253   1 0.064   0.001
 5 thistlewood  7.5 7.20     72.1|10354 0.5 1614253   1 0.051   0.001
 6    guyanese  6.5 7.19     62.1|10356 0.5 1614253   1 0.045   0.001
 7   stillborn  6.5 7.19     62.1|10356 0.5 1614253   1 0.045   0.001
 8       cabin  5.5 7.17     52.2|10356 0.5 1614253   1 0.038   0.001
 9      kampta  5.5 7.17     52.2|10356 0.5 1614253   1 0.038   0.001
10      guyana  9.0 7.14     84.5|10352 1.0 1614251   1 0.064   0.001
11      anarch  4.5 7.14     42.3|10358 0.5 1614253   1 0.032   0.000
12    aperture  4.5 7.14     42.3|10358 0.5 1614253   1 0.032   0.000
13      coolie  4.5 7.14     42.3|10358 0.5 1614253   1 0.032   0.000
14    criseyde  4.5 7.14     42.3|10358 0.5 1614253   1 0.032   0.000
15       ellar  4.5 7.14     42.3|10358 0.5 1614253   1 0.032   0.000
16     grandpa  4.5 7.14     42.3|10358 0.5 1614253   1 0.032   0.000
17        kaka  4.5 7.14     42.3|10358 0.5 1614253   1 0.032   0.000
18   lachrimae  4.5 7.14     42.3|10358 0.5 1614253   1 0.032   0.000
19   overboard  4.5 7.14     42.3|10358 0.5 1614253   1 0.032   0.000
20    awakened  7.0 7.10     64.8|10354 1.0 1614251   1 0.051   0.001
...
<number of extra columns to the right: 7>

# print scores, sorted by G_signed
print(scores_kw, sort_order = "G_signed")

Association scores (types in list: 575, sort order criterion: G_signed)
        type     a  PMI G_signed|    b      c       d dir  exp_a
 1       her 106.0 4.14    420.1|10255  838.0 1613414   1  6.020
 2       she  81.0 3.55    256.0|10280 1005.0 1613247   1  6.926
 3   slavery  24.0 6.88    206.8|10337    8.0 1614244   1  0.204
 4       sea  28.0 5.69    176.1|10333   57.0 1614195   1  0.542
 5      manu  15.5 7.25    152.3|10346    0.5 1614253   1  0.102
 6    turner  17.0 6.79    143.0|10344    7.0 1614245   1  0.153
 7 gladstone  13.5 7.24    132.2|10348    0.5 1614253   1  0.089
 8        my  62.0 2.70    129.4|10299 1432.0 1612820   1  9.528
 9        he 102.0 1.76    107.0|10259 4612.0 1609640   1 30.064
10       his  64.0 2.30    103.9|10297 1968.0 1612284   1 12.959
11    rohini  10.5 7.23    102.1|10352    0.5 1614253   1  0.070
12    nigger  11.0 6.95     96.7|10350    3.0 1614249   1  0.089
13   african  15.0 5.64     93.2|10346   32.0 1614220   1  0.300
14   troilus   9.5 7.22     92.1|10352    0.5 1614253   1  0.064
15      shah  10.0 7.03     90.3|10351    2.0 1614250   1  0.077
16      dead  19.0 4.60     87.6|10342  104.0 1614148   1  0.784
17      shop  15.0 5.39     87.1|10346   41.0 1614211   1  0.357
18    guyana   9.0 7.14     84.5|10352    1.0 1614251   1  0.064
19       him  36.0 2.87     82.2|10325  739.0 1613513   1  4.943
20      fish  13.0 5.38     75.2|10348   36.0 1614216   1  0.312
...
<number of extra columns to the right: 8>

Haldane-Anscombe correction

It might seem disconcerting that some frequencies (a column) have decimals. This is because some values in the contingency table were 0 and 0.5 was added to all the cells to avoid divisions by 0 (the Haldane-Anscombe correction). If you would rather use a different small number in these cases, you can set haldane = FALSE in your assoc_scores() call and set your desired small value in the small_pos argument.
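As a quick check of how the correction works, the following sketch recomputes the PMI of manu by hand. The raw counts are not printed above: they are inferred here from the corrected values and from the row totals of the other rows, so they should be read as an assumption. The formulas for the expected frequency and PMI follow the usual definitions rather than the internals of mclm.

```r
# Raw counts for "manu" (assumed: inferred from the printed table, not returned by mclm)
raw <- c(a = 15, b = 10346, c = 0, d = 1614252)

# Haldane-Anscombe correction: because one cell is 0, add 0.5 to every cell
corrected <- if (any(raw == 0)) raw + 0.5 else raw

# Expected frequency of cell a under independence, and PMI as log2(observed / expected)
n     <- sum(corrected)
exp_a <- (corrected[["a"]] + corrected[["b"]]) * (corrected[["a"]] + corrected[["c"]]) / n
pmi   <- log2(corrected[["a"]] / exp_a)

round(c(a = corrected[["a"]], exp_a = exp_a, PMI = pmi), 3)
```

The three values should land close to the a, exp_a and PMI columns printed for manu above.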

2.3 Filtering of keywords by PMI and signed $G^2$

We can use filter() to filter the keywords (i.e. the rows of scores_kw) by PMI and signed $G^2$. We’ll store the result in a variable called top_scores_kw and again print the result, first sorted by PMI, then by signed $G^2$. This allows us to explore which words are ranked higher by each of the measures.

top_scores_kw <- scores_kw %>% 
  filter(PMI >= 2 & G_signed >= 2)

# print top_scores_kw, sorted by PMI
top_scores_kw %>%
  print(sort_order = "PMI")

Association scores (types in list: 269, sort order criterion: PMI)
          type    a  PMI G_signed|    b   c       d dir exp_a DP_rows
 1        manu 15.5 7.25    152.3|10346 0.5 1614253   1 0.102   0.001
 2   gladstone 13.5 7.24    132.2|10348 0.5 1614253   1 0.089   0.001
 3      rohini 10.5 7.23    102.1|10352 0.5 1614253   1 0.070   0.001
 4     troilus  9.5 7.22     92.1|10352 0.5 1614253   1 0.064   0.001
 5 thistlewood  7.5 7.20     72.1|10354 0.5 1614253   1 0.051   0.001
 6    guyanese  6.5 7.19     62.1|10356 0.5 1614253   1 0.045   0.001
 7   stillborn  6.5 7.19     62.1|10356 0.5 1614253   1 0.045   0.001
 8       cabin  5.5 7.17     52.2|10356 0.5 1614253   1 0.038   0.001
 9      kampta  5.5 7.17     52.2|10356 0.5 1614253   1 0.038   0.001
10      guyana  9.0 7.14     84.5|10352 1.0 1614251   1 0.064   0.001
11      anarch  4.5 7.14     42.3|10358 0.5 1614253   1 0.032   0.000
12    aperture  4.5 7.14     42.3|10358 0.5 1614253   1 0.032   0.000
13      coolie  4.5 7.14     42.3|10358 0.5 1614253   1 0.032   0.000
14    criseyde  4.5 7.14     42.3|10358 0.5 1614253   1 0.032   0.000
15       ellar  4.5 7.14     42.3|10358 0.5 1614253   1 0.032   0.000
16     grandpa  4.5 7.14     42.3|10358 0.5 1614253   1 0.032   0.000
17        kaka  4.5 7.14     42.3|10358 0.5 1614253   1 0.032   0.000
18   lachrimae  4.5 7.14     42.3|10358 0.5 1614253   1 0.032   0.000
19   overboard  4.5 7.14     42.3|10358 0.5 1614253   1 0.032   0.000
20    awakened  7.0 7.10     64.8|10354 1.0 1614251   1 0.051   0.001
...
<number of extra columns to the right: 7>

# print top_scores_kw, sorted by G_signed
top_scores_kw %>%
  print(sort_order = "G_signed")

Association scores (types in list: 269, sort order criterion: G_signed)
        type     a  PMI G_signed|    b      c       d dir  exp_a
 1       her 106.0 4.14    420.1|10255  838.0 1613414   1  6.020
 2       she  81.0 3.55    256.0|10280 1005.0 1613247   1  6.926
 3   slavery  24.0 6.88    206.8|10337    8.0 1614244   1  0.204
 4       sea  28.0 5.69    176.1|10333   57.0 1614195   1  0.542
 5      manu  15.5 7.25    152.3|10346    0.5 1614253   1  0.102
 6    turner  17.0 6.79    143.0|10344    7.0 1614245   1  0.153
 7 gladstone  13.5 7.24    132.2|10348    0.5 1614253   1  0.089
 8        my  62.0 2.70    129.4|10299 1432.0 1612820   1  9.528
 9       his  64.0 2.30    103.9|10297 1968.0 1612284   1 12.959
10    rohini  10.5 7.23    102.1|10352    0.5 1614253   1  0.070
11    nigger  11.0 6.95     96.7|10350    3.0 1614249   1  0.089
12   african  15.0 5.64     93.2|10346   32.0 1614220   1  0.300
13   troilus   9.5 7.22     92.1|10352    0.5 1614253   1  0.064
14      shah  10.0 7.03     90.3|10351    2.0 1614250   1  0.077
15      dead  19.0 4.60     87.6|10342  104.0 1614148   1  0.784
16      shop  15.0 5.39     87.1|10346   41.0 1614211   1  0.357
17    guyana   9.0 7.14     84.5|10352    1.0 1614251   1  0.064
18       him  36.0 2.87     82.2|10325  739.0 1613513   1  4.943
19      fish  13.0 5.38     75.2|10348   36.0 1614216   1  0.312
20      poem  15.0 4.81     73.7|10346   69.0 1614183   1  0.536
...
<number of extra columns to the right: 8>
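The effect of the filter is easiest to see on a few rows. In the unfiltered list sorted by signed $G^2$, he sits at rank 9, but it is missing from the filtered list because its PMI (1.76) is below the threshold of 2. The sketch below reproduces this with four rows copied from the output above; it uses a plain data frame and base R subset() as a stand-in for the assoc_scores object and filter().

```r
# Four rows copied from the printed scores (type, PMI, G_signed only)
scores <- data.frame(
  type     = c("her", "my", "he", "his"),
  PMI      = c(4.14, 2.70, 1.76, 2.30),
  G_signed = c(420.1, 129.4, 107.0, 103.9),
  stringsAsFactors = FALSE
)

# Same condition as in the filter() call: both thresholds must be met
kept    <- subset(scores, PMI >= 2 & G_signed >= 2)
dropped <- setdiff(scores$type, kept$type)

kept
dropped
```

A word with a very high signed $G^2$ can therefore still be excluded when its PMI is modest, which is typical of very frequent words.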

From here you can keep reading to see the steps for collocation analysis or skip to Section 4 for tips on how to move forward.

3 Collocation analysis

For collocation analysis, we will use the full BASE corpus (fnames_BASE) and, instead of the freqlist() function, the surf_cooc() function. This function requires a re_node argument that is a regular expression capturing what should be identified as a node token. In this case we ask for (?xi) ^ big $.

What does this regex mean?

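The regular expression (?xi) ^ big $ asks for the beginning of a line, followed by ‘big’, followed by the end of the line. The (?xi) prefix sets two flags: x (free-spacing mode) makes the regex engine ignore the whitespace inside the pattern, and i makes the match case-insensitive. When the corpus is tokenized, as is the case for freqlist() and surf_cooc(), but not for conc(), one line corresponds to one token: we are asking for tokens that match big exactly.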
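The behaviour of this pattern can be tried out in base R with grepl(). The token vector below is made up for the illustration, and perl = TRUE is set so that the (?xi) inline flags are interpreted as in Perl-compatible regular expressions.

```r
# Toy tokens (hypothetical); each element stands for one token, i.e. one "line"
tokens <- c("big", "Big", "BIG", "bigger", "a big", "ambiguous")

# (?x) ignores the spaces in the pattern, (?i) ignores case,
# ^ and $ anchor the match to the whole token
is_node <- grepl("(?xi)  ^ big $", tokens, perl = TRUE)

setNames(is_node, tokens)
```

Only the first three tokens count as node tokens; bigger, a big and ambiguous contain the string big but do not match it exactly.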

The rest of the arguments are the same as those set in freqlist(), as explained in Section 1.2.

coocs <- fnames_BASE %>% 
  surf_cooc("(?xi)  ^ big $",
            re_token_splitter = r"--[(?xi)    \s+   ]--",
            re_drop_token     = r"--[(?xi)  [:\[\]] ]--",
            file_encoding     = "windows-1252")
coocs$target_freqlist
coocs$ref_freqlist

Frequency list (types in list: 1013, tokens in list: 3949)
rank  type abs_freq nrm_freq
---- ----- -------- --------
   1     a      314    795.1
   2   the      218    552.0
   3    of      100    253.2
   4   and       91    230.4
   5    in       87    220.3
   6    er       81    205.1
   7    is       81    205.1
   8   you       62    157.0
   9    to       61    154.5
  10  this       54    136.7
  11   one       51    129.1
  12    it       44    111.4
  13  it's       43    108.9
  14  very       42    106.4
  15  that       41    103.8
  16    so       40    101.3
  17   how       34     86.1
  18    be       32     81.0
  19 there       31     78.5
  20   but       29     73.4
...

Frequency list (types in list: 36957, tokens in list: 1619982)
rank type abs_freq nrm_freq
---- ---- -------- --------
   1  the    87002    537.1
   2   of    49165    303.5
   3  and    45143    278.7
   4   to    42801    264.2
   5   er    39446    243.5
   6    a    36117    222.9
   7 that    31847    196.6
   8   in    30306    187.1
   9  you    29482    182.0
  10   is    26001    160.5
  11   it    20399    125.9
  12    i    16901    104.3
  13   so    16865    104.1
  14 this    14914     92.1
  15   we    13149     81.2
  16 have    10339     63.8
  17 what    10281     63.5
  18   on    10202     63.0
  19   be    10172     62.8
  20  but    10151     62.7
...

The object coocs is a cooc_info object, i.e. a list of two frequency lists: a target frequency list with the co-occurrence frequencies of items in the vicinity of the node big (by default, 3 tokens to either side), and a reference frequency list with the frequencies of items in the rest of the corpus.
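The idea behind the two lists can be sketched in base R on a toy token vector. This is not how surf_cooc() is implemented: the sentence, the variable names and the decision to leave the node tokens themselves out of both lists are assumptions made for the illustration.

```r
# Toy corpus (hypothetical), already split into tokens
tokens <- c("a", "big", "difference", "in", "the", "big", "bang",
            "theory", "is", "a", "very", "long", "story")

window   <- 3                                   # tokens to either side of a node
node_pos <- which(grepl("(?xi) ^ big $", tokens, perl = TRUE))

# All positions that fall within the window around any node occurrence
near <- unique(unlist(lapply(node_pos, function(p) (p - window):(p + window))))
near <- near[near >= 1 & near <= length(tokens)]

# Target context: tokens near a node; reference context: everything else.
# Node tokens are excluded from both lists in this sketch.
target_pos <- setdiff(near, node_pos)
ref_pos    <- setdiff(seq_along(tokens), c(near, node_pos))

target_freqs <- sort(table(tokens[target_pos]), decreasing = TRUE)
ref_freqs    <- sort(table(tokens[ref_pos]), decreasing = TRUE)

target_freqs
ref_freqs
```

Because the windows of the two occurrences of big overlap, each token in the overlap is counted only once, so the target list holds fewer than 2 × 6 tokens.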

3.1 Association scores

Next, we calculate the association scores with a call to assoc_scores(), providing the full object coocs instead of the two separate frequency lists. We store the result in a variable called scores_colloc. Once we have our scores_colloc, we can sort them by PMI and by signed $G^2$.

# calculate scores
scores_colloc <- assoc_scores(coocs)

# print scores, sorted by PMI
print(scores_colloc, sort_order = "PMI")

Association scores (types in list: 220, sort order criterion: PMI)
            type  a  PMI G_signed|   b   c       d dir exp_a DP_rows
 1         shiny  3 7.95     29.4|3946   2 1619980   1 0.012   0.001
 2           toe  3 7.95     29.4|3946   2 1619980   1 0.012   0.001
 3          bang  6 6.95     47.9|3943  14 1619968   1 0.049   0.002
 4       premium  3 6.02     19.6|3946  16 1619966   1 0.046   0.001
 5 organisations  5 5.88     31.6|3944  30 1619952   1 0.085   0.001
 6  corporations  3 5.57     17.6|3946  23 1619959   1 0.063   0.001
 7 multinational  3 5.10     15.6|3946  33 1619949   1 0.088   0.001
 8       climate  4 4.83     19.3|3945  54 1619928   1 0.141   0.001
 9         twice  4 4.83     19.3|3945  54 1619928   1 0.141   0.001
10 philosophical  3 4.81     14.4|3946  41 1619941   1 0.107   0.001
11           gap  5 4.58     22.4|3944  81 1619901   1 0.209   0.001
12             l  3 4.46     13.0|3946  53 1619929   1 0.136   0.001
13         tests  4 4.26     16.2|3945  82 1619900   1 0.209   0.001
14       vessels  3 4.20     11.9|3946  64 1619918   1 0.163   0.001
15    difference 19 4.17     74.9|3930 414 1619568   1 1.053   0.005
16         river  3 4.16     11.8|3946  66 1619916   1 0.168   0.001
17        debate  4 4.10     15.3|3945  92 1619890   1 0.233   0.001
18           box  4 4.07     15.2|3945  94 1619888   1 0.238   0.001
19    assumption  5 3.92     18.0|3944 131 1619851   1 0.331   0.001
20    dictionary  3 3.78     10.2|3946  87 1619895   1 0.219   0.001
...
<number of extra columns to the right: 7>

# print scores, sorted by G_signed
print(scores_colloc, sort_order = "G_signed")

Association scores (types in list: 220, sort order criterion: G_signed)
            type   a  PMI G_signed|   b     c       d dir  exp_a
 1             a 314 1.83    358.7|3635 36117 1583865   1 88.591
 2    difference  19 4.17     74.9|3930   414 1619568   1  1.053
 3       problem  22 2.94     51.7|3927  1159 1618823   1  2.872
 4          bang   6 6.95     47.9|3943    14 1619968   1  0.049
 5           one  51 1.56     43.5|3898  7041 1612941   1 17.246
 6        really  29 2.16     42.2|3920  2633 1617349   1  6.473
 7         quite  25 2.32     40.7|3924  2033 1617949   1  5.005
 8          very  42 1.67     40.0|3907  5387 1614595   1 13.202
 9           how  34 1.89     39.8|3915  3734 1616248   1  9.163
10 organisations   5 5.88     31.6|3944    30 1619952   1  0.085
11         shiny   3 7.95     29.4|3946     2 1619980   1  0.012
12           toe   3 7.95     29.4|3946     2 1619980   1  0.012
13       there's  24 1.79     25.7|3925  2822 1617160   1  6.921
14         thank   8 3.61     25.6|3941   261 1619721   1  0.654
15           too  12 2.75     25.5|3937   722 1619260   1  1.785
16           gap   5 4.58     22.4|3944    81 1619901   1  0.209
17         great  10 2.80     21.8|3939   582 1619400   1  1.440
18     companies   6 3.72     20.0|3943   181 1619801   1  0.455
19       premium   3 6.02     19.6|3946    16 1619966   1  0.046
20       climate   4 4.83     19.3|3945    54 1619928   1  0.141
...
<number of extra columns to the right: 8>
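To see where the numbers in these tables come from, the scores for difference can be recomputed by hand from its four cell counts. The sketch below uses the textbook definitions of expected frequencies, PMI and $G^2$ for a two-by-two table; it is meant as a sanity check on the printed values, not as a description of how mclm computes them internally.

```r
# Cell counts for "difference", copied from the table above
obs <- c(a = 19, b = 3930, c = 414, d = 1619568)
n   <- sum(obs)

# Marginal totals: rows = target vs. reference context, columns = this type vs. other types
row1 <- obs[["a"]] + obs[["b"]]
row2 <- obs[["c"]] + obs[["d"]]
col1 <- obs[["a"]] + obs[["c"]]
col2 <- obs[["b"]] + obs[["d"]]

# Expected frequencies under independence
expd <- c(a = row1 * col1, b = row1 * col2, c = row2 * col1, d = row2 * col2) / n

pmi <- log2(obs[["a"]] / expd[["a"]])          # log2 of observed over expected
g2  <- 2 * sum(obs * log(obs / expd))          # log-likelihood ratio statistic
g_signed <- if (obs[["a"]] < expd[["a"]]) -g2 else g2   # minus sign for repulsion

round(c(exp_a = expd[["a"]], PMI = pmi, G_signed = g_signed), 3)
```

The results should agree, up to rounding, with the exp_a, PMI and G_signed columns printed for difference.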

3.2 Filtering of collocates by PMI and signed $G^2$

We’ll use filter() to filter the collocates (i.e. the object scores_colloc) by PMI and signed $G^2$, store the result in a variable called top_scores_colloc and print the result, first sorted by PMI, then by signed $G^2$. This allows us to explore which words are ranked higher by each of the measures.

top_scores_colloc <- scores_colloc %>% 
  filter(PMI >= 2 & G_signed >= 2)

# print top_scores_colloc, sorted by PMI
top_scores_colloc %>%
  print(sort_order = "PMI")

Association scores (types in list: 61, sort order criterion: PMI)
            type  a  PMI G_signed|   b   c       d dir exp_a DP_rows
 1         shiny  3 7.95     29.4|3946   2 1619980   1 0.012   0.001
 2           toe  3 7.95     29.4|3946   2 1619980   1 0.012   0.001
 3          bang  6 6.95     47.9|3943  14 1619968   1 0.049   0.002
 4       premium  3 6.02     19.6|3946  16 1619966   1 0.046   0.001
 5 organisations  5 5.88     31.6|3944  30 1619952   1 0.085   0.001
 6  corporations  3 5.57     17.6|3946  23 1619959   1 0.063   0.001
 7 multinational  3 5.10     15.6|3946  33 1619949   1 0.088   0.001
 8       climate  4 4.83     19.3|3945  54 1619928   1 0.141   0.001
 9         twice  4 4.83     19.3|3945  54 1619928   1 0.141   0.001
10 philosophical  3 4.81     14.4|3946  41 1619941   1 0.107   0.001
11           gap  5 4.58     22.4|3944  81 1619901   1 0.209   0.001
12             l  3 4.46     13.0|3946  53 1619929   1 0.136   0.001
13         tests  4 4.26     16.2|3945  82 1619900   1 0.209   0.001
14       vessels  3 4.20     11.9|3946  64 1619918   1 0.163   0.001
15    difference 19 4.17     74.9|3930 414 1619568   1 1.053   0.005
16         river  3 4.16     11.8|3946  66 1619916   1 0.168   0.001
17        debate  4 4.10     15.3|3945  92 1619890   1 0.233   0.001
18           box  4 4.07     15.2|3945  94 1619888   1 0.238   0.001
19    assumption  5 3.92     18.0|3944 131 1619851   1 0.331   0.001
20    dictionary  3 3.78     10.2|3946  87 1619895   1 0.219   0.001
...
<number of extra columns to the right: 7>

# print top_scores_colloc, sorted by G_signed
top_scores_colloc %>%
  print(sort_order = "G_signed")

Association scores (types in list: 61, sort order criterion: G_signed)
            type  a  PMI G_signed|   b    c       d dir exp_a DP_rows
 1    difference 19 4.17     74.9|3930  414 1619568   1 1.053   0.005
 2       problem 22 2.94     51.7|3927 1159 1618823   1 2.872   0.005
 3          bang  6 6.95     47.9|3943   14 1619968   1 0.049   0.002
 4        really 29 2.16     42.2|3920 2633 1617349   1 6.473   0.006
 5         quite 25 2.32     40.7|3924 2033 1617949   1 5.005   0.005
 6 organisations  5 5.88     31.6|3944   30 1619952   1 0.085   0.001
 7         shiny  3 7.95     29.4|3946    2 1619980   1 0.012   0.001
 8           toe  3 7.95     29.4|3946    2 1619980   1 0.012   0.001
 9         thank  8 3.61     25.6|3941  261 1619721   1 0.654   0.002
10           too 12 2.75     25.5|3937  722 1619260   1 1.785   0.003
11           gap  5 4.58     22.4|3944   81 1619901   1 0.209   0.001
12         great 10 2.80     21.8|3939  582 1619400   1 1.440   0.002
13     companies  6 3.72     20.0|3943  181 1619801   1 0.455   0.001
14       premium  3 6.02     19.6|3946   16 1619966   1 0.046   0.001
15       climate  4 4.83     19.3|3945   54 1619928   1 0.141   0.001
16         twice  4 4.83     19.3|3945   54 1619928   1 0.141   0.001
17    assumption  5 3.92     18.0|3944  131 1619851   1 0.331   0.001
18  corporations  3 5.57     17.6|3946   23 1619959   1 0.063   0.001
19          area  8 2.78     17.2|3941  472 1619510   1 1.167   0.002
20        market 10 2.39     17.0|3939  776 1619206   1 1.911   0.002
...
<number of extra columns to the right: 7>

4 Post-processing

The rest of the steps can be applied to any assoc_scores object, i.e. either the output of a keyword analysis or that of a collocation analysis.

4.1 Saving the results to file

We can use write_assoc() to write an assoc_scores object to a file. That file is a tab-delimited text file. It can easily be imported into spreadsheet tools, and it can also be read back into R in future sessions with read_assoc().

top_scores_kw %>%
  write_assoc("ahlct001_top_keywords.csv") 
# top_scores_kw <- read_assoc("ahlct001_top_keywords.csv")

top_scores_colloc %>%
  write_assoc("big_top_collocates.csv") 
# top_scores_colloc <- read_assoc("big_top_collocates.csv")

4.2 A nicer way of showing the scores in a report

We can turn the scores into a tibble with the function as_tibble(). This allows people familiar with the tidyverse to use the rich set of tidyverse functions that are applicable to tibbles.

In Table 2 we print the top thirty keywords (according to PMI, and sorted by descending PMI) the tidyverse way:

top_scores_kw %>% # also valid for top_scores_colloc
  as_tibble() %>%
  select(type, a, PMI, G_signed) %>% # select 4 columns
  arrange(desc(PMI)) %>%             # sort by PMI (descending) 
  head(30) %>%                       # select top 30 rows
  kbl(col.names = c("Type", "Frequency", "PMI", r"(Signed $G^2$)")) %>% 
  kable_minimal() %>% 
  scroll_box(height = "400px")

Table 2: Top 30 keywords of the ‘ahlct001’ file in the BASE corpus, sorted by PMI.
Type	Frequency	PMI	Signed $G^2$
manu	15.5	7.25	152.3
gladstone	13.5	7.24	132.2
rohini	10.5	7.23	102.1
troilus	9.5	7.22	92.1
thistlewood	7.5	7.20	72.1
guyanese	6.5	7.19	62.1
stillborn	6.5	7.19	62.1
cabin	5.5	7.17	52.2
kampta	5.5	7.17	52.2
guyana	9.0	7.14	84.5
anarch	4.5	7.14	42.3
aperture	4.5	7.14	42.3
coolie	4.5	7.14	42.3
criseyde	4.5	7.14	42.3
ellar	4.5	7.14	42.3
grandpa	4.5	7.14	42.3
kaka	4.5	7.14	42.3
lachrimae	4.5	7.14	42.3
overboard	4.5	7.14	42.3
awakened	7.0	7.10	64.8
beatings	3.5	7.10	32.4
booths	3.5	7.10	32.4
diomede	3.5	7.10	32.4
ellar's	3.5	7.10	32.4
gladstone's	3.5	7.10	32.4
jamaican	3.5	7.10	32.4
melody	3.5	7.10	32.4
miriam's	3.5	7.10	32.4
mist	3.5	7.10	32.4
paki	3.5	7.10	32.4

We can do the same for the collocates, this time sorting by signed $G^2$ instead, as shown in Table 3:

top_scores_colloc %>% # also valid for top_scores_kw
  as_tibble() %>%
  select(type, a, PMI, G_signed) %>% # select 4 columns
  arrange(desc(G_signed)) %>%        # sort by G_signed (descending)  
  head(30) %>%                       # select top 30 rows
  kbl(col.names = c("Type", "Frequency", "PMI", r"(Signed $G^2$)")) %>% 
  kable_minimal() %>% 
  scroll_box(height = "400px")

Table 3: Top 30 collocates of ‘big’ in the BASE corpus, sorted by signed $G^2$.
Type	Frequency	PMI	Signed $G^2$
difference	19	4.17	74.9
problem	22	2.94	51.7
bang	6	6.95	47.9
really	29	2.16	42.2
quite	25	2.32	40.7
organisations	5	5.88	31.6
shiny	3	7.95	29.4
toe	3	7.95	29.4
thank	8	3.61	25.6
too	12	2.75	25.5
gap	5	4.58	22.4
great	10	2.80	21.8
companies	6	3.72	20.0
premium	3	6.02	19.6
climate	4	4.83	19.3
twice	4	4.83	19.3
assumption	5	3.92	18.0
corporations	3	5.57	17.6
area	8	2.78	17.2
market	10	2.39	17.0
issue	7	2.96	16.6
question	11	2.19	16.3
tests	4	4.26	16.2
multinational	3	5.10	15.6
debate	4	4.10	15.3
box	4	4.07	15.2
such	9	2.35	15.0
company	5	3.38	14.5
philosophical	3	4.81	14.4
enough	7	2.66	14.1

4.3 Plotting the association scores

If we store the tibble version of the results in a variable, such as top_scores_df, we can reuse that object in several subsequent instructions without having to recreate it time and again. Let’s work with the keywords here.

top_scores_df <- as_tibble(top_scores_kw)

To illustrate the use of the tibble version of the results, we can, for instance, generate plots on the basis of the object top_scores_df. Our base plot will map the PMI scores on the x-axis and the signed $G^2$ values on the y-axis. We’ll also set a common theme for all plots with theme_set().

theme_set(theme_minimal(base_size = 15))
g <- top_scores_df %>%
  ggplot(aes(x = PMI, y = G_signed)) +
  labs(x = "PMI", y = "Signed G")

In Figure 1, we build a simple scatter plot with points representing the different words. The plot gives us an idea of the extent to which the two measures correlate.

g + geom_point()

Figure 1: Scatterplot of PMI by signed $G^2$ in the keyword analysis.

We see that the measures do correlate a bit, but definitely not perfectly. In fact, should you build similar plots for other combinations of measures, you’ll find that some pairs correlate much more clearly than what we see here.

Let’s inspect the extent to which absolute frequencies can explain the discrepancies between PMI and signed $G^2$. In Figure 2 the frequencies in the target corpus (i.e. the values in the a cell of the contingency tables) are mapped to the size of the symbols. We see that high frequencies tend to boost signed $G^2$ scores, whereas low frequencies appear to increase the probability of obtaining a high PMI score.

g + geom_point(aes(size = a))

Figure 2: Scatterplot of PMI by signed $G^2$ in the keyword analysis, with size reflecting frequency.
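One way to make sense of this pattern is a small thought experiment: take the contingency table of a rare collocate such as shiny (counts copied from Section 3.1) and multiply every cell by ten, as if the same proportions had been observed in a corpus ten times as large. The helper pmi_and_g2() below is a hypothetical function written for this sketch using the textbook formulas; it is not part of mclm.

```r
# Hypothetical helper: PMI and signed G^2 for a 2x2 table given as c(a, b, c, d)
pmi_and_g2 <- function(cells) {
  m    <- matrix(cells, nrow = 2, byrow = TRUE)   # rows: target, reference context
  expd <- outer(rowSums(m), colSums(m)) / sum(m)  # expected frequencies under independence
  g2   <- 2 * sum(m * log(m / expd))
  c(PMI      = log2(m[1, 1] / expd[1, 1]),
    G_signed = if (m[1, 1] < expd[1, 1]) -g2 else g2)
}

shiny <- c(a = 3, b = 3946, c = 2, d = 1619980)

small <- pmi_and_g2(shiny)        # counts as observed
large <- pmi_and_g2(shiny * 10)   # same proportions, ten times the data

rbind(small, large)
```

PMI depends only on the proportions and stays the same, whereas signed $G^2$ grows in proportion to the amount of data. This is consistent with frequent words being pushed up on the $G^2$ axis while rare words can reach high PMI scores with little evidence.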

We can also plot the names of the words instead of using symbols, using the geom_text() function, as shown in Figure 3.

Notice that in top_scores_df the names of the types are stored in a column called type.

g + geom_text(aes(label = type))

Figure 3: Scatterplot of PMI by signed $G^2$ in the keyword analysis, with types instead of dots.

For a more sophisticated plot, the ggrepel package allows us to place text labels close to their data points while avoiding overlap. To create Figure 4 we define a smaller dataframe with the subset of keywords for which signed $G^2$ is larger than 100 and provide it as data for ggrepel::geom_text_repel(). The x and y aesthetics are inherited from the ggplot() call in g.

high_G_signed <- top_scores_df %>% 
  filter(G_signed > 100) # extract types with high G_signed

g + geom_point() +
  ggrepel::geom_text_repel(data = high_G_signed, aes(label = type))

Figure 4: Scatterplot of PMI by signed $G^2$ in the keyword analysis, with labels for the types with the highest signed $G^2$ values.