Genitive Alternation I: Retrieval

Based on Dirk Speelman’s course material

This is the first part of a variation study focusing on the genitive alternation in English, using the Brown corpus. This part focuses on retrieving observations of the alternation, whereas the second part will illustrate the analysis with conditional inference trees and mixed-effects logistic regression. One of the reasons to split the analysis in two is that, more often than not, between retrieval and analysis there is an additional step of manual cleaning or annotation of the dataset.

Setup

For this study we need to load the packages tidyverse and mclm. We will also use kableExtra to print some tables.

We’ll collect the files from the Brown corpus. To replicate this code, adjust the path to wherever you stored your copy of the corpus. In this case, we also add a keep_re() call to capture the pattern shared by all corpus files and not by metadata files such as “CONTENTS” and “README”. The hide_path argument hides a part of the filename when printing, to ease reading the table.

For more information on the Brown corpus, such as its components and tagset, see the Wikipedia article: https://en.wikipedia.org/wiki/Brown_Corpus.
corpus_folder <- here::here("studies", "_corpora", "brown") # adjust to your path
brown_fnames <- get_fnames(corpus_folder) %>% 
  keep_re("/c[a-z][0-9]{2}")
print(brown_fnames, n = 10, hide_path = corpus_folder)
Filename collection of length 500
   filename
   --------
 1     ca01
 2     ca02
 3     ca03
 4     ca04
 5     ca05
 6     ca06
 7     ca07
 8     ca08
 9     ca09
10     ca10
...
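
To see why this pattern keeps the corpus files and drops the metadata files, the same regular expression can be tried on a handful of paths. This is only a sketch: the paths are made up for illustration, and base R's grepl() stands in for keep_re() without claiming to reproduce its exact behaviour.

```r
# Hypothetical paths; only the part after the last slash matters here.
toy_paths <- c(
  "brown/ca01",
  "brown/cr09",
  "brown/CONTENTS",
  "brown/README"
)

# A slash, a lowercase "c", one more lowercase letter and two digits.
is_corpus_file <- grepl("/c[a-z][0-9]{2}", toy_paths, perl = TRUE)

# Keep only the paths that match.
toy_paths[is_corpus_file]
```

The metadata files are dropped because their names start with an uppercase letter and contain no digits.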

POS tags

The text in this corpus looks as follows, with word forms followed by a part-of-speech tag, with a slash in between:

The/at Fulton/np-tl County/nn-tl Grand/jj-tl Jury/nn-tl said/vbd Friday/nr an/at
investigation/nn of/in Atlanta's/np$ recent/jj primary/nn election/nn
produced/vbd ``/`` no/at evidence/nn ''/'' that/cs any/dti irregularities/nns
took/vbd place/nn ./.

Some of the POS tags of Brown are of particular importance for this case study.

Useful POS-tags of the Brown corpus for the genitive alternation study.

| POS tag | Description | Example |
|---|---|---|
| at | article | the/at |
| in | preposition | of/in |
| jj | adjective | recent/jj |
| nn | common noun | place/nn |
| np | proper noun | Atlanta/np |
| -tl (suffix to a POS tag) | title or part of title | Fulton/np-tl County/nn-tl Grand/jj-tl Jury/nn-tl |
| $ (suffix to a POS tag) | genitive marker | Atlanta's/np$ recent/jj primary/nn |
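
The word/tag format can be taken apart with a few base R calls. The sketch below is not part of the retrieval workflow; it only illustrates how the slash separates word form and tag and how the -tl and $ suffixes show up in the tag. The object names are chosen for this example.

```r
# Part of the corpus excerpt shown above.
brown_line <- paste(
  "The/at Fulton/np-tl County/nn-tl Grand/jj-tl Jury/nn-tl said/vbd Friday/nr",
  "an/at investigation/nn of/in Atlanta's/np$ recent/jj primary/nn election/nn"
)

# Split on whitespace, then separate word form and tag at the last slash.
tokens <- strsplit(brown_line, "\\s+", perl = TRUE)[[1]]
tagged <- data.frame(
  word = sub("/[^/]*$", "", tokens),
  tag  = sub("^.*/", "", tokens)
)

# Flag the two suffixes that matter for this study.
tagged$is_title    <- grepl("-tl", tagged$tag, fixed = TRUE)
tagged$is_genitive <- grepl("$", tagged$tag, fixed = TRUE)
tagged
```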

Data retrieval

Our goal is to obtain the attestations of the genitive alternation in the Brown corpus, e.g. examples of “the student’s idea” and “the idea of the student”. In this alternation, the main elements we are interested in are called “Possessor” and “Possessed”, because the prototypical situation of the genitive is that of possession. However, the terminology can be misleading, because this is not the only meaning of the genitive constructions. For example, in “the car’s owner”, car takes the role of Possessor and owner that of Possessed, even though it is the owner who possesses the car.

An alternation study aims to describe the aspects of language use that favor or disfavor one variant of the alternation against the other. This translates to studies where the alternation itself, e.g. the choice between an s form or an of form, becomes the response variable or outcome, and different characteristics of the context become predictors or regressors. In this study, we’ll capture the following language-internal predictors:

possessor_type

Whether the role of the Possessor is filled by a common noun or a proper noun.

possessor_size

The size of the Possessor slot, in characters.

possessed_size

The size of the Possessed slot, in characters.

size_difference

The difference between the size of the possessor and that of the possessed.

The reason for taking the size of the constituents, i.e. Possessor and Possessed, is the theory within linguistics that speakers try to push longer constituents towards the end. In the genitive alternation, in which each variant has a different word order, this becomes relevant. Concretely, in the s variant we have the Possessor before the Possessed (“The student’s idea”), whereas in the of variant we have the Possessed before the Possessor (“The idea of the student”). If we indeed tend to push longer constituents towards the end, a long Possessor such as “the keen and insightful student” would favor the of variant (“The idea of the keen and insightful student”) rather than the s variant (“The keen and insightful student’s idea”).

In practice, we will also transform the sizes to their logarithm, in order to lessen the impact of extremely long constituents.
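
As a rough illustration of these predictors, the sketch below computes sizes, their difference and a log-transformed size for two toy Possessors. Several things are assumptions of this example rather than the procedure used in the study: the helper slot_size() is hypothetical, sizes are counted on the word forms after removing the tags (spaces included), and the difference is taken as Possessor minus Possessed.

```r
# Hypothetical helper: drop the "/tag" part of every token and count characters.
# Counting the spaces between words is an assumption of this sketch.
slot_size <- function(slot) {
  words <- gsub("/\\S+", "", slot, perl = TRUE)
  nchar(words)
}

toy <- data.frame(
  possessor = c("student/nn", "keen/jj insightful/jj student/nn"),
  possessed = c("idea/nn", "idea/nn")
)

toy$possessor_size  <- slot_size(toy$possessor)
toy$possessed_size  <- slot_size(toy$possessed)
toy$size_difference <- toy$possessor_size - toy$possessed_size

# The logarithm compresses the gap between short and very long constituents.
toy$log_possessor_size <- log(toy$possessor_size)
toy
```

On the raw scale the long Possessor is more than three times the size of the short one; on the log scale the gap is much smaller.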

The data retrieval strategy consists of four queries with non-overlapping results, one for each combination of the outcome gen_type and the predictor possessor_type.

Queries as combinations of outcome (the rows) and type of Possessor (columns).

| | common nouns | proper nouns |
|---|---|---|
| of-genitive | query 1 | query 2 |
| s-genitive | query 3 | query 4 |

For each of these queries, we will:

  1. Collect the observations with mclm::conc(), generating a concordance.
  2. Add columns with the values of gen_type (our outcome) and possessor_type, since they are the same for all observations in that query.
  3. Extract the possessed and possessor based on their position in the query. For that purpose, the regular expressions used to query the corpus include capturing groups surrounding each of the constituents.

This will be achieved with the following lines of code:

cd <- brown_fnames %>% 
  conc(pattern) %>%  # whatever regex pattern we have defined
  mutate(
    gen_type = "of",                                     # or "s"
    possessor_type = "common",                           # or "proper"
    possessed = re_replace_first(match, pattern, "\\1"), # or "\\2" for s-genitive
    possessor = re_replace_first(match, pattern, "\\2")  # or "\\1" for s-genitive
  )
  4. Once we have more than one concordance, we can merge them with merge_conc().

The first concordance will be stored in a variable called cd, to which we will merge each new concordance, called cd_new. Alternatively, you could name each concordance differently and then merge them into a larger concordance conc_merged, for example.

The first approach avoids keeping duplicate objects and using more memory than necessary. cd_new is overwritten at every step and, at the end, you only have one large concordance cd and a small concordance cd_new with the latest query. In contrast, with the second approach you end up with four small concordances and one large one, so their contents are duplicated.

The second approach makes sense if you need to keep the datasets separate. When you’re writing a long script or processing a dataset in different ways, overwriting a variable can cause confusion, because it is easy to lose track of what the variable contains at any given point.

Think about this in your own analyses.
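
The two bookkeeping styles can be compared on toy data. In this sketch plain data frames stand in for concordances and rbind() stands in for merge_conc(); all object names are made up for the example.

```r
# Toy stand-ins for two concordances; real ones would come from conc().
q1 <- data.frame(match = c("a", "b"), gen_type = "of")
q2 <- data.frame(match = "c", gen_type = "s")

# Approach 1: one growing object plus a temporary one that gets overwritten.
cd_toy     <- q1
cd_toy_new <- q2
cd_toy     <- rbind(cd_toy, cd_toy_new)

# Approach 2: keep every piece under its own name and merge at the end.
pieces      <- list(of_common = q1, s_common = q2)
conc_merged <- do.call(rbind, pieces)

# Both routes end with the same rows.
nrow(cd_toy) == nrow(conc_merged)
```

With the first approach only cd_toy and the latest cd_toy_new remain in memory; with the second, every piece is still available for inspection after the merge.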

Of-genitive with common nouns

In the first query we will retrieve the attestations of the of-genitive with common nouns, e.g. “the idea of the student”.

Regular expression

If you would like to practice regular expressions, the {mclmtutorials} package includes a tutorial on how to use them in {mclm}. Just install the package along with {learnr} and {gradethis} and run the tutorial:

# install.packages("learnr")
# remotes::install_github("masterclm/mclmtutorials")
# remotes::install_github("rstudio/gradethis")
learnr::run_tutorial("regex", "mclmtutorials")

As a first step, we construct a regular expression that will match, in the non-tokenized corpus, the sequence we are interested in, and we store it in a variable pattern. Notice that we use a raw string (preceded by r and, within the quotation marks, delimited by two hyphens and a square bracket on each side), so that backslashes don’t need to be doubled, and that we flag the regular expression with an x and an i. The x allows us to insert spaces and line breaks for more readability (free-spacing mode), and the i makes the search case insensitive.

pattern <- r"--[(?xi)
          \b         the / at          \s+
          ( (?:  [^\s/]+ / jj  [^\s]*  \s+ )*
            (?:  [^\s/]+ / nn  [^\s]*  \s+ )*
                 [^\s/]+ / nn  [^\s]*
          )                            \s+
                      of / in          \s+
                     the / at          \s+
          ( (?:  [^\s/]+ / jj  [^\s]*  \s+ )*
            (?:  [^\s/]+ / nn  [^\s]*  \s+ )*
                 [^\s/]+ / nn  [^\s$]* 
          )                            \s    
]--"
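
The raw-string syntax and the two flags can be tried in isolation with base R. This sketch uses grepl() with perl = TRUE rather than mclm, and the short patterns are made up for the example.

```r
# A raw string keeps backslashes as they are: both objects hold the same regex.
escaped  <- "\\bthe/at\\s+"
from_raw <- r"--[\bthe/at\s+]--"
identical(escaped, from_raw)

# With (?x), spaces and line breaks inside the pattern are ignored;
# with (?i), "The" matches as well as "the".
spaced <- r"--[(?xi)
  \b  the / at  \s+
]--"
hits <- grepl(
  spaced,
  c("The/at jury/nn", "the/at jury/nn", "these/dts funds/nns"),
  perl = TRUE
)
hits
```

The third string is rejected because "the" in "these" is not followed by the slash and the at tag.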

With this pattern, we are asking for:

  • a word boundary,

  • followed by the/at and whitespace,

  • optionally followed by adjectives, each followed by whitespace, and then at least one common noun — this is the first capturing group, the Possessed,

  • then whitespace, followed by of/in, whitespace, the/at and whitespace,

  • optionally followed by adjectives, each followed by whitespace, and then at least one common noun without the genitive tag — this is the second capturing group, the Possessor.

In line 2 of the code we have \b, followed by the string the/at, followed by \s+. This means that we want the literal string the/at preceded by a word boundary and followed by one or more whitespace characters. At the start of the pattern, the word boundary is satisfied when the/at follows a whitespace character or a punctuation mark, or when it opens the text.

Lines 3-5 and 9-11 are surrounded by parentheses, turning them into capturing groups. Later, when we want to extract the Possessed and Possessor from the match, we can ask for the text matching each of the capturing groups.

Within the capturing groups we have non-capturing groups, marked by parentheses with ?: at the beginning. Without the ?:, they would be included in the numbering of the capturing groups and we couldn’t be sure which number belongs to the full Possessed and Possessor groups. We still need the parentheses in order to apply the final asterisks (in lines 3, 4, 9 and 10) to everything that they surround. But what are they matching, exactly?

The main capturing groups for Possessed and Possessor want to match a common noun optionally preceded by any number of adjectives and then any number of nouns. The match for a common noun is [^\s/]+ / nn [^\s]*. [\s/] matches either a whitespace character or a slash; adding the ^ inverts the match to anything but a whitespace character or a slash; + requires one or more of them and *, zero or more. Therefore, this regex asks for a sequence of characters that are neither whitespace characters nor slashes (e.g. idea or student), followed by a slash and then nn, followed by an optional sequence of characters that are not whitespace characters, in case the part-of-speech tag ends with an s (for plural) or -tl. We use jj instead of nn when we ask for adjectives, and we also reject $ at the end of the Possessor slot, because we don’t want an s-genitive to match.
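
The noun sub-pattern can be checked on single tokens. The sketch below adds ^ and $ anchors, which are not in the original pattern, so that each token has to match as a whole; the token list is made up from examples in the corpus excerpt.

```r
toy_tokens <- c("idea/nn", "ideas/nns", "Jury/nn-tl",
                "recent/jj", "Atlanta's/np$", "student's/nn$")

# Any common noun, whatever follows "nn" in the tag.
any_noun <- grepl("^[^\\s/]+/nn[^\\s]*$", toy_tokens, perl = TRUE)

# A common noun whose tag contains no "$" after "nn", as at the end of the
# Possessor slot.
plain_noun <- grepl("^[^\\s/]+/nn[^\\s$]*$", toy_tokens, perl = TRUE)

data.frame(toy_tokens, any_noun, plain_noun)
```

Only the last token differs between the two columns: it is a common noun, but its genitive tag excludes it from the stricter version.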

Within each non-capturing group, therefore, we want to match either an adjective (lines 3 and 9) or a noun (lines 4 and 10), but we still get a match if neither of them occurs. They are also followed by \s+ to capture the whitespace characters between the words.

Finally, lines 7 and 8 match the core of the of-genitive construction, i.e. of/in and the/at, surrounded and separated by whitespace characters.
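
Before running the pattern on the whole corpus, it can be tried on a toy sentence. The sketch below copies the pattern into a separate object, of_common, and uses base R's regexec() and regmatches() instead of mclm; the sentences are invented for the example.

```r
of_common <- r"--[(?xi)
  \b         the / at          \s+
  ( (?:  [^\s/]+ / jj  [^\s]*  \s+ )*
    (?:  [^\s/]+ / nn  [^\s]*  \s+ )*
         [^\s/]+ / nn  [^\s]*
  )                            \s+
              of / in          \s+
             the / at          \s+
  ( (?:  [^\s/]+ / jj  [^\s]*  \s+ )*
    (?:  [^\s/]+ / nn  [^\s]*  \s+ )*
         [^\s/]+ / nn  [^\s$]*
  )                            \s
]--"

toy_text <- "He/pps liked/vbd the/at idea/nn of/in the/at keen/jj student/nn ./."

# regexec() returns the full match followed by one element per capturing group.
parts <- regmatches(toy_text, regexec(of_common, toy_text, perl = TRUE))[[1]]
parts

# A Possessor that ends in a genitive tag is rejected by the final [^\s$]*.
genitive_accepted <- grepl(
  of_common,
  "the/at idea/nn of/in the/at student's/nn$ ./.",
  perl = TRUE
)
genitive_accepted
```

The second and third elements of parts correspond to the Possessed and the Possessor, in that order.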

Code

We build the concordance with conc(), which takes a corpus or filenames, e.g. brown_fnames, and a regular expression to match, e.g. pattern. We can then explore the concordance with print_kwic(), explore() or View(). This really is advisable, in order to inspect how good the regex pattern is and see if you might want to refine it to capture patterns that were missed, or to exclude patterns that should be discarded. Table 1 prints the first 6 items in the concordance with kableExtra.

With practice, you’ll find the balance between refining the automatic process and performing manual cleaning. In some cases, it’s quicker to fix the automatic procedure; in other cases, manual adjustments are easier and more reliable.

cd <- brown_fnames %>%
  conc(pattern)
print_kwic(cd, n = 10)
idx                      left|        match        |right                   
  1 ... suit/nn to/to test/vb|the/at va...at act/nn|,/, and/cc then/rb th...
  2 ...n/cd dollars/nns at/in|the/at en...j year/nn|next/ap Aug./np 31/cd...
  3 ...e/ber teaching/vbg ./.|The/at re...rement/nn|would/md be/be in/in ...
  4 ...ill/md retire/vb at/in|the/at cl...n term/nn|./. Dr./nn-tl Clark/n...
  5 ... ``/`` Actually/rb ,/,|the/at ab...rocess/nn|may/md have/hv consti...
  6 .../at 23d/od ward/nn ./.|The/at ca...udges/nns|in/in the/at 58th/od ...
  7 ...O/nn line/nn ./. On/in|the/at ne... sheet/nn|must/md be/be set/vbn...
  8 ...the/at alliance/nn ,/,|the/at us...ion/nn-tl|for/in-tl Economic/jj...
  9 ...ns as/ql well/rb as/cs|the/at ma...errent/nn|./. This/dt increase/...
 10 ...ey/ppss count/vb on/in|the/at ai...tries/nns|attending/vbg the/at ...
...
cd %>% 
  select(left, match, right) %>% 
  head(6) %>% 
  kbl(align = c("r", "c", "l")) %>%
  kable_paper(font_size = 15)
Table 1: Concordance with first results of first query.

| left | match | right |
|---|---|---|
| ns million/cd worth/nn of/in highway/nn reconstruction/nn bonds/nns ./. The/at bond/nn issue/nn will/md go/vb to/in the/at state/nn courts/nns for/in a/at friendly/jj test/nn suit/nn to/to test/vb | the/at validity/nn of/in the/at act/nn | ,/, and/cc then/rb the/at sales/nns will/md begin/vb and/cc contracts/nns let/vbn for/in repair/nn work/nn on/in some/dti of/in Georgia's/np$ most/ql heavily/rb traveled/vbn highways/nns ./. A/at H |
| servative/jj ''/'' his/pp$ estimate/nn that/cs it/pps would/md produce/vb 17/cd million/cd dollars/nns to/to help/vb erase/vb an/at anticipated/vbn deficit/nn of/in 63/cd million/cd dollars/nns at/in | the/at end/nn of/in the/at current/jj fiscal/jj year/nn | next/ap Aug./np 31/cd ./. He/pps told/vbd the/at committee/nn the/at measure/nn would/md merely/rb provide/vb means/nns of/in enforcing/vbg the/at escheat/nn law/nn which/wdt has/hvz been/ben on/in |
| over/np also/rb would/md require/vb junior-senior/jj high/nn teachers/nns to/to have/hv at/in least/ap 24/cd semester/nn hours/nns credit/vb in/in the/at subject/nn they/ppss are/ber teaching/vbg ./. | The/at remainder/nn of/in the/at 4-year/jj college/nn requirement/nn | would/md be/be in/in general/jj subjects/nns ./. ``/`` A/at person/nn with/in a/at master's/nn$ degree/nn in/in physics/nn ,/, chemistry/nn ,/, math/nn or/cc English/np ,/, yet/rb who/wps has/hvz n |
| /np Clark/np of/in Hays/np ,/, Kan./np as/cs the/at school's/nn$ new/jj president/nn ./. Dr./nn-tl Clark/np will/md succeed/vb Dr./nn-tl J./np R./np McLemore/np ,/, who/wps will/md retire/vb at/in | the/at close/nn of/in the/at present/jj school/nn term/nn | ./. Dr./nn-tl Clark/np holds/vbz an/at earned/vbn Doctor/nn-tl of/in-tl Education/nn-tl degree/nn from/in the/at University/nn-tl of/in-tl Oklahoma/np-tl ./. He/pps also/rb received/vbd a/at Master |
| d/jj jury/nn room/nn ''/'' ./. He/pps said/vbd this/dt constituted/vbd a/at ``/`` very/ql serious/jj misuse/nn ''/'' of/in the/at Criminal/jj-tl court/nn-tl processes/nns ./. ``/`` Actually/rb ,/, | the/at abuse/nn of/in the/at process/nn | may/md have/hv constituted/vbn a/at contempt/nn of/in the/at Criminal/jj-tl court/nn-tl of/in-tl Cook/np county/nn ,/, altho/cs vindication/nn of/in the/at authority/nn of/in that/dt court/nn is/bez n |
| at 21st/od and/cc 28th/od precincts/nns of/in the/at 29th/od ward/nn ,/, the/at 18th/od precinct/nn of/in the/at 4th/od ward/nn ,/, and/cc the/at 9th/od precinct/nn of/in the/at 23d/od ward/nn ./. | The/at case/nn of/in the/at judges/nns | in/in the/at 58th/od precinct/nn of/in the/at 23d/od ward/nn had/hvd been/ben heard/vbn previously/rb and/cc taken/vbn under/in advisement/nn by/in Karns/np ./. Two/cd other/ap cases/nns also/rb were/ |

Because cd is also a data frame with a match column, we can inspect the matched text directly by extracting that column.

head(cd$match)
[1] "the/at validity/nn of/in the/at act/nn "                              
[2] "the/at end/nn of/in the/at current/jj fiscal/jj year/nn "             
[3] "The/at remainder/nn of/in the/at 4-year/jj college/nn requirement/nn "
[4] "the/at close/nn of/in the/at present/jj school/nn term/nn "           
[5] "the/at abuse/nn of/in the/at process/nn "                             
[6] "The/at case/nn of/in the/at judges/nns "                              

Once we have a decent concordance, we can add the variables that characterize it. All of these observations will have the value "of" in the gen_type variable and the value "common" in the possessor_type variable. In addition, we can extract the constituents of the Possessed and Possessor slots with mclm::re_replace_first(). Its first argument is the text to search (the elements of the match column), the second is a regular expression to match in that text (the pattern we used to retrieve it), and the third is a replacement string. "\\1" and "\\2" stand for the contents of the first and second capturing groups in pattern, respectively. In other words, we find the portion of each element of match that matches the pattern (which is all of it, since that is how it was constructed) and replace it with either the first capturing group, to fill the possessed column, or the second capturing group, to fill the possessor column.

cd <- cd %>%
  mutate(
    gen_type = "of",
    possessor_type = "common",
    possessed = re_replace_first(match, pattern, "\\1"), 
    possessor = re_replace_first(match, pattern, "\\2") 
)
cd %>% 
  as_tibble() %>% 
  select(match, possessed, possessor) %>% 
  head() %>% 
  kbl() %>% 
  kable_paper()
Table 2: Subset of annotated observations of the first dataset.

| match | possessed | possessor |
|---|---|---|
| the/at validity/nn of/in the/at act/nn | validity/nn | act/nn |
| the/at end/nn of/in the/at current/jj fiscal/jj year/nn | end/nn | current/jj fiscal/jj year/nn |
| The/at remainder/nn of/in the/at 4-year/jj college/nn requirement/nn | remainder/nn | 4-year/jj college/nn requirement/nn |
| the/at close/nn of/in the/at present/jj school/nn term/nn | close/nn | present/jj school/nn term/nn |
| the/at abuse/nn of/in the/at process/nn | abuse/nn | process/nn |
| The/at case/nn of/in the/at judges/nns | case/nn | judges/nns |
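
The idea of replacing a whole match with one of its capturing groups can be reproduced in base R with sub() and backreferences. The pattern below is a deliberately simplified stand-in for the query pattern (it only splits a match around "of/in the/at"), so it illustrates the mechanism, not the actual retrieval.

```r
matches <- c(
  "the/at validity/nn of/in the/at act/nn ",
  "the/at end/nn of/in the/at current/jj fiscal/jj year/nn "
)

# Simplified stand-in pattern: group 1 before "of/in the/at", group 2 after it.
simple <- r"--[(?xi) ^ the/at \s+ (.+?) \s+ of/in \s+ the/at \s+ (.+?) \s* $]--"

# The whole string matches, so it is replaced by the requested group.
possessed <- sub(simple, "\\1", matches, perl = TRUE)
possessor <- sub(simple, "\\2", matches, perl = TRUE)

# Optionally drop the tags to keep only the word forms.
possessor_words <- gsub("/\\S+", "", possessor, perl = TRUE)

data.frame(possessed, possessor, possessor_words)
```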

Of-genitive with proper nouns

The second query also captures the of-genitive variant, but with proper nouns in the Possessor slot instead, e.g. “the rivers of Belgium”.

Regular expression

The regular expression is very similar to the one for the first query, with a few differences:

  • It does not ask for or even accept an article after of/in.

  • The noun(s) requested in lines 10 and 11 are proper nouns instead of common nouns.

  • Between the optional adjective(s) and the noun(s) of the second capturing group, i.e. the Possessor slot, we also accept optional items with any part-of-speech as long as they also end in -tl (line 9).

Rewriting variables

Notice that we are rewriting the variable pattern with the regex for the second query, so, if you suddenly wanted to rerun the first query, you would need to change pattern again.

pattern <- r"--[(?xi)
       \b         the / at                 \s+
       ( (?:  [^\s/]+ / jj         [^\s]*  \s+ )*
         (?:  [^\s/]+ / nn         [^\s]*  \s+ )*
              [^\s/]+ / nn         [^\s]*
         )                                 \s+
                   of / in                 \s+
       ( (?:  [^\s/]+ / jj         [^\s]*  \s+ )*
         (?:  [^\s/]+ / [^\s]+ -tl [^\s$]* \s+ )*
         (?:  [^\s/]+ / np         [^\s$]* \s+ )*
              [^\s/]+ / np         [^\s$]* 
       )                                   \s
]--"
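
A quick check on toy strings shows what the second pattern accepts and rejects. The sketch copies the pattern into an object called of_proper and uses base R functions; the first and third strings are invented, and the second one is taken from the corpus.

```r
of_proper <- r"--[(?xi)
  \b         the / at                 \s+
  ( (?:  [^\s/]+ / jj         [^\s]*  \s+ )*
    (?:  [^\s/]+ / nn         [^\s]*  \s+ )*
         [^\s/]+ / nn         [^\s]*
  )                                   \s+
              of / in                 \s+
  ( (?:  [^\s/]+ / jj         [^\s]*  \s+ )*
    (?:  [^\s/]+ / [^\s]+ -tl [^\s$]* \s+ )*
    (?:  [^\s/]+ / np         [^\s$]* \s+ )*
         [^\s/]+ / np         [^\s$]*
  )                                   \s
]--"

toy_texts <- c(
  rivers = "the/at rivers/nns of/in Belgium/np ./.",
  title  = "the/at recommendation/nn of/in Supt./nn-tl Chester/np O./np Strickland/np ./.",
  common = "the/at idea/nn of/in the/at student/nn ./."
)

# The of-genitive with an article and a common noun belongs to the first query.
found <- grepl(of_proper, toy_texts, perl = TRUE)
found

# The -tl item and the preceding proper nouns end up inside the Possessor group.
title_text   <- toy_texts[["title"]]
title_groups <- regmatches(title_text, regexec(of_proper, title_text, perl = TRUE))[[1]]
title_groups[3]
```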

Code

The steps are the same as for the first query. We store the object in a different variable, cd_new, and we assign the value proper to possessor_type instead of common. Afterwards, we rewrite cd to be the combination of cd (the first query) and cd_new (the second query) using merge_conc(), an mclm wrapper for bind_rows().

cd_new <- brown_fnames %>%
  conc(pattern) %>%
  mutate(
    gen_type = "of",
    possessor_type = "proper",
    possessed = re_replace_first(match, pattern, "\\1"), 
    possessor = re_replace_first(match, pattern, "\\2")  
  )
# print_kwic(cd_new, n = 10)
cd_new %>% 
  select(left, match, right, possessed, possessor) %>% 
  head(6) %>% 
  kbl(align = c("r", "c", "l", "l", "l")) %>%
  kable_paper(font_size = 15)
Table 3: Concordance with first results of second query and first annotation.

| left | match | right | possessed | possessor |
|---|---|---|---|---|
| -tl Department/nn-tl ``/`` has/hvz seen/vbn fit/jj to/to distribute/vb these/dts funds/nns through/in the/at welfare/nn departments/nns of/in all/abn the/at counties/nns in/in the/at state/nn with/in | the/at exception/nn of/in Fulton/np-tl | County/nn-tl ,/, which/wdt receives/vbz none/pn of/in this/dt money/nn ./. The/at jurors/nns said/vbd they/ppss realize/vb ``/`` a/at proportionate/jj distribution/nn of/in these/dts funds/nns migh | exception/nn | Fulton/np-tl |
| Austin/np-hl ,/,-hl Texas/np-hl --/-- Committee/nn approval/nn of/in Gov./nn-tl Price/np Daniel's/np$ ``/`` abandoned/vbn property/nn ''/'' act/nn seemed/vbd certain/jj Thursday/nr despite/in | the/at adamant/jj protests/nns of/in Texas/np | bankers/nns ./. Daniel/np personally/rb led/vbd the/at fight/nn for/in the/at measure/nn ,/, which/wdt he/pps had/hvd watered/vbn down/rp considerably/rb since/in its/pp$ rejection/nn by/in two/cd | adamant/jj protests/nns | Texas/np |
| n on/in the/at hearing/nn ,/, since/cs the/at bill/nn was/bedz introduced/vbn only/rb last/ap Monday/nr ./. Austin/np-hl ,/,-hl Texas/np-hl --/-- Senators/nns unanimously/rb approved/vbd Thursday/nr | the/at bill/nn of/in Sen./nn-tl George/np Parkhouse/np | of/in Dallas/np authorizing/vbg establishment/nn of/in day/nn schools/nns for/in the/at deaf/jj in/in Dallas/np and/cc the/at four/cd other/ap largest/jjt counties/nns ./. The/at bill/nn is/bez des | bill/nn | Sen./nn-tl George/np Parkhouse/np |
| --/-- Principals/nns of/in the/at 13/cd schools/nns in/in the/at Denton/np-tl Independent/jj-tl School/nn-tl District/nn-tl have/hv been/ben re-elected/vbn for/in the/at 1961-62/cd session/nn upon/in | the/at recommendation/nn of/in Supt./nn-tl Chester/np O./np Strickland/np | ./. State/nn and/cc federal/jj legislation/nn against/in racial/jj discrimination/nn in/in employment/nn was/bedz called/vbn for/in yesterday/nr in/in a/at report/nn of/in a/at ``/`` blue/jj ribbon | recommendation/nn | Supt./nn-tl Chester/np O./np Strickland/np |
| b of/in the/at circumstances/nns that/wps have/hv brought/vbn these/dts troubles/nns about/rp ,/, has/hvz been/ben conspicuous/jj by/in its/pp$ absence/nn ./. Explosion/nn-hl avoided/vbn-hl In/in | the/at case/nn of/in Portugal/np | ,/, which/wdt a/at few/ap weeks/nns ago/rb was/bedz rumored/vbn ready/jj to/to walk/vb out/rp of/in the/at NATO/nn Council/nn-tl should/md critics/nns of/in its/pp$ Angola/np policy/nn prove/vb harsh/ | case/nn | Portugal/np |
| some/dti disappointment/nn that/cs the/at United/vbn-tl States/nns-tl leadership/nn has/hvz not/* been/ben as/ql much/rb in/in evidence/nn as/cs hoped/vbn for/in ./. One/cd diplomat/nn described/vbd | the/at tenor/nn of/in Secretary/nn-tl of/in-tl State/nn-tl Dean/np | Rusk's/np$ speeches/nns as/cs ``/`` inconclusive/jj ''/'' ./. But/cc he/pps hastened/vbd to/to add/vb that/cs ,/, if/cs United/vbn-tl States/nns-tl policies/nns were/bed not/* always/rb clear/jj ,/, d | tenor/nn | Secretary/nn-tl of/in-tl State/nn-tl Dean/np |

cd <- merge_conc(cd, cd_new)
nrow(cd)
[1] 4157

S-genitive with common nouns

The third query retrieves attestations of the s-genitive variant with common nouns in the Possessor slot, e.g. “the student’s idea”.

Regular expression

The main regular expression symbols are the same as those used in the previous queries. There are two main differences with the first regular expression. First, there is no word between the two capturing groups, only whitespace, and the noun in line 5 must include $ somewhere in its part-of-speech tag. Second, the first capturing group now represents the Possessor and the second the Possessed. This doesn’t affect the writing of the regular expression itself, other than requiring $ in the tag that closes the first group and rejecting it in the tag that closes the second. However, it will affect the code below when extracting the Possessed and Possessor variables.

Again, here we overwrite the pattern variable with the new regular expression.

pattern <- r"--[(?xi)
      \b         the / at                  \s+
      (  (?: [^\s/]+ / jj  [^\s]*          \s+ )*
         (?: [^\s/]+ / nn  [^\s]*          \s+ )*
             [^\s/]+ / nn  [^\s]* [$] [^\s]*
       )                                   \s+
      (  (?: [^\s/]+ / jj  [^\s]*          \s+ )*
         (?: [^\s/]+ / nn  [^\s]*          \s+ )*
             [^\s/]+ / nn  [^\s$]*            
       )                                   \s 
]--"
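
The swapped order of the capturing groups can be verified on a short example. The sketch copies the pattern into an object called s_common and relies on base R only; the first text is a fragment from the corpus and the second is invented.

```r
s_common <- r"--[(?xi)
  \b         the / at                  \s+
  (  (?: [^\s/]+ / jj  [^\s]*          \s+ )*
     (?: [^\s/]+ / nn  [^\s]*          \s+ )*
         [^\s/]+ / nn  [^\s]* [$] [^\s]*
  )                                    \s+
  (  (?: [^\s/]+ / jj  [^\s]*          \s+ )*
     (?: [^\s/]+ / nn  [^\s]*          \s+ )*
         [^\s/]+ / nn  [^\s$]*
  )                                    \s
]--"

toy_text <- "The/at mayor's/nn$ present/jj term/nn of/in office/nn expires/vbz"
parts <- regmatches(toy_text, regexec(s_common, toy_text, perl = TRUE))[[1]]

# Group 1 is now the Possessor and group 2 the Possessed.
slots <- c(possessor = parts[2], possessed = parts[3])
slots

# An of-genitive has no noun with a genitive tag, so it is not retrieved here.
of_matched <- grepl(s_common, "the/at idea/nn of/in the/at student/nn ./.", perl = TRUE)
of_matched
```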

Code

Now that we have merged the first two queries into cd, we don’t need cd_new anymore, so we can overwrite it with the output of the third query. Again, we call conc() with brown_fnames and the new pattern and assign the values that correspond to this concordance: s for gen_type and common for possessor_type, and the appropriate capturing groups of the pattern for possessed and possessor. Since the word order is inverted in relation to the of-genitive, now the first capturing group corresponds to the Possessor and the second one to the Possessed.

Finally, we use merge_conc() to merge the output of the first two queries, cd, with the output of the third query, cd_new, overwriting cd.

cd_new <- brown_fnames %>%
  conc(pattern) %>%
  mutate(
    gen_type = "s",
    possessor_type = "common",
    possessed = re_replace_first(match, pattern, "\\2"), 
    possessor = re_replace_first(match, pattern, "\\1")  
  )

# print_kwic(cd_new, n = 10)
cd_new %>% 
  select(left, match, right, possessed, possessor) %>% 
  head(6) %>% 
  kbl(align = c("r", "c", "l", "l", "l")) %>%
  kable_paper(font_size = 15)
Table 4: Concordance with first results of third query and first annotation.

| left | match | right | possessed | possessor |
|---|---|---|---|---|
| re-set/vb the/at effective/jj date/nn so/cs that/cs an/at orderly/jj implementation/nn of/in the/at law/nn may/md be/be effected/vbn ''/'' ./. The/at grand/jj jury/nn took/vbd a/at swipe/nn at/in | the/at State/nn-tl Welfare/nn-tl Department's/nn$-tl handling/nn | of/in federal/jj funds/nns granted/vbn for/in child/nn welfare/nn services/nns in/in foster/jj homes/nns ./. ``/`` This/dt is/bez one/cd of/in the/at major/jj items/nns in/in the/at Fulton/np-tl Co | handling/nn | State/nn-tl Welfare/nn-tl Department's/nn$-tl |
| Atlanta/np-tl Bar/nn-tl Association/nn-tl and/cc an/at interim/nn citizens/nns committee/nn ./. ``/`` These/dts actions/nns should/md serve/vb to/to protect/vb in/in fact/nn and/cc in/in effect/nn | the/at court's/nn$ wards/nns | from/in undue/jj costs/nns and/cc its/pp$ appointed/vbn and/cc elected/vbn servants/nns from/in unmeritorious/jj criticisms/nns ''/'' ,/, the/at jury/nn said/vbd ./. Regarding/in Atlanta's/np$ new/ | wards/nns | court's/nn$ |
| e/np of/in Griffin/np ./. Attorneys/nns for/in the/at mayor/nn said/vbd that/cs an/at amicable/jj property/nn settlement/nn has/hvz been/ben agreed/vbn upon/rb ./. The/at petition/nn listed/vbd | the/at mayor's/nn$ occupation/nn | as/cs ``/`` attorney/nn ''/'' and/cc his/pp$ age/nn as/cs 71/cd ./. It/pps listed/vbd his/pp$ wife's/nn$ age/nn as/cs 74/cd and/cc place/nn of/in birth/nn as/cs Opelika/np ,/, Ala./np ./. The/at pe | occupation/nn | mayor's/nn$ |
| more/ap than/in a/at year/nn ./. The/at Hartsfield/np home/nr is/bez at/in 637/cd E./np Pelham/np Rd./nn-tl Aj/nn ./. Henry/np L./np Bowden/np was/bedz listed/vbn on/in the/at petition/nn as/cs | the/at mayor's/nn$ attorney/nn | ./. Hartsfield/np has/hvz been/ben mayor/nn of/in Atlanta/np ,/, with/in exception/nn of/in one/cd brief/jj interlude/nn ,/, since/in 1937/cd ./. His/pp$ political/jj career/nn goes/vbz back/rb to/ | attorney/nn | mayor's/nn$ |
| ith/in exception/nn of/in one/cd brief/jj interlude/nn ,/, since/in 1937/cd ./. His/pp$ political/jj career/nn goes/vbz back/rb to/in his/pp$ election/nn to/in city/nn council/nn in/in 1923/cd ./. | The/at mayor's/nn$ present/jj term/nn | of/in office/nn expires/vbz Jan./np 1/cd ./. He/pps will/md be/be succeeded/vbn by/in Ivan/np Allen/np Jr./np ,/, who/wps became/vbd a/at candidate/nn in/in the/at Sept./np 13/cd primary/nn after/cs M | present/jj term/nn | mayor's/nn$ |
| ue/vb the/at new/jj rural/jj roads/nns bonds/nns ./. Schley/np County/nn-tl Rep./nn-tl B./np D./np Pelham/np will/md offer/vb a/at resolution/nn Monday/nr in/in the/at House/nn-tl to/to rescind/vb | the/at body's/nn$ action/nn | of/in Friday/nr in/in voting/vbg itself/ppl a/at $10/nns per/in day/nn increase/nn in/in expense/nn allowances/nns ./. Pelham/np said/vbd Sunday/nr night/nn there/ex was/bedz research/nn being/beg | action/nn | body's/nn$ |

cd <- merge_conc(cd, cd_new)
nrow(cd)
[1] 5207

S-genitive with proper nouns

The fourth query captures the s-genitive variant with proper nouns in the Possessor slot, e.g. “Belgium’s rivers”.

Regular expression

Again we overwrite pattern with the regular expression for the last query, which is very similar to that of the third query. The differences mirror those between the second and the first: it asks for proper nouns instead of common nouns in the Possessor slot (lines 4 and 5), it no longer starts with the/at, and it allows any item with -tl in its part-of-speech tag to precede the Possessor noun (line 3).

pattern <- r"--[(?xi)
        \b  ( (?: [^\s/]+ / jj         [^\s]*          \s+ )*
              (?: [^\s/]+ / [^\s]+ -tl [^\s]*          \s+ )*
              (?: [^\s/]+ / np         [^\s$]*         \s+ )*
                  [^\s/]+ / np         [^\s]* [$] [^\s]*        
            )                                          \s+
            ( (?: [^\s/]+ / jj         [^\s]*          \s+ )*
              (?: [^\s/]+ / nn         [^\s]*          \s+ )*
                  [^\s/]+ / nn         [^\s$]*           
            )                                          \s
]--"
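
Since a single stretch of text can contain several s-genitives, the fourth pattern can be tried with gregexpr(), which returns all matches rather than the first one. The sketch copies the pattern into an object called s_proper and uses a fragment from the corpus; it relies on base R rather than on conc().

```r
s_proper <- r"--[(?xi)
  \b  ( (?: [^\s/]+ / jj         [^\s]*          \s+ )*
        (?: [^\s/]+ / [^\s]+ -tl [^\s]*          \s+ )*
        (?: [^\s/]+ / np         [^\s$]*         \s+ )*
            [^\s/]+ / np         [^\s]* [$] [^\s]*
      )                                          \s+
      ( (?: [^\s/]+ / jj         [^\s]*          \s+ )*
        (?: [^\s/]+ / nn         [^\s]*          \s+ )*
            [^\s/]+ / nn         [^\s$]*
      )                                          \s
]--"

toy_text <- paste(
  "to/to work/vb for/in Lt./nn-tl Gov./nn-tl Garland/np Byrd's/np$ campaign/nn ./.",
  "Caldwell's/np$ resignation/nn had/hvd been/ben expected/vbn"
)

# All non-overlapping matches, with the trailing whitespace trimmed.
all_matches <- regmatches(toy_text, gregexpr(s_proper, toy_text, perl = TRUE))[[1]]
all_matches <- trimws(all_matches)
all_matches
```

The titles tagged with -tl and the first name are included in the first match because they precede the proper noun that carries the genitive marker.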

Code

Again we overwrite the now useless cd_new with the output of the fourth query and assign the appropriate values for the common variables. The possessor_type is now proper, but the rest of the variables take the same values as in the previous query. At the end, we overwrite cd by merging the old cd, which contains the output of the first three queries, and cd_new.

cd_new <- brown_fnames %>%
  conc(pattern) %>%
  mutate(
    gen_type = "s",
    possessor_type = "proper",
    possessed = re_replace_first(match, pattern, "\\2"), 
    possessor = re_replace_first(match, pattern, "\\1")  
  )

# print_kwic(cd_new, n = 10)
cd_new %>% 
  select(left, match, right, possessed, possessor) %>% 
  head(6) %>% 
  kbl(align = c("r", "c", "l", "l", "l")) %>%
  kable_paper(font_size = 15)
Table 5: Concordance with first results of fourth query and first annotation.

| left | match | right | possessed | possessor |
|---|---|---|---|---|
| The/at Fulton/np-tl County/nn-tl Grand/jj-tl Jury/nn-tl said/vbd Friday/nr an/at investigation/nn of/in | Atlanta's/np$ recent/jj primary/nn election/nn | produced/vbd ``/`` no/at evidence/nn ''/'' that/cs any/dti irregularities/nns took/vbd place/nn ./. The/at jury/nn further/rbr said/vbd in/in term-end/nn presentments/nns that/cs the/at City/nn-tl | recent/jj primary/nn election/nn | Atlanta's/np$ |
| nterest/nn in/in the/at election/nn ,/, the/at number/nn of/in voters/nns and/cc the/at size/nn of/in this/dt city/nn ''/'' ./. The/at jury/nn said/vbd it/pps did/dod find/vb that/cs many/ap of/in | Georgia's/np$ registration/nn | and/cc election/nn laws/nns ``/`` are/ber outmoded/jj or/cc inadequate/jj and/cc often/rb ambiguous/jj ''/'' ./. It/pps recommended/vbd that/cs Fulton/np legislators/nns act/vb ``/`` to/to have/hv | registration/nn | Georgia's/np$ |
| at result/nn of/in city/nn personnel/nns policies/nns ''/'' ./. It/pps urged/vbd that/cs the/at city/nn ``/`` take/vb steps/nns to/to remedy/vb ''/'' this/dt problem/nn ./. Implementation/nn of/in | Georgia's/np$ automobile/nn title/nn law/nn | was/bedz also/rb recommended/vbn by/in the/at outgoing/jj jury/nn ./. It/pps urged/vbd that/cs the/at next/ap Legislature/nn-tl ``/`` provide/vb enabling/vbg funds/nns and/cc re-set/vb the/at effec | automobile/nn title/nn law/nn | Georgia's/np$ |
| t's/nn$ wards/nns from/in undue/jj costs/nns and/cc its/pp$ appointed/vbn and/cc elected/vbn servants/nns from/in unmeritorious/jj criticisms/nns ''/'' ,/, the/at jury/nn said/vbd ./. Regarding/in | Atlanta's/np$ new/jj multi-million-dollar/jj airport/nn | ,/, the/at jury/nn recommended/vbd ``/`` that/cs when/wrb the/at new/jj management/nn takes/vbz charge/nn Jan./np 1/cd the/at airport/nn be/be operated/vbn in/in a/at manner/nn that/wps will/md elimin | new/jj multi-million-dollar/jj airport/nn | Atlanta's/np$ |
| /nn opposes/vbz in/in its/pp$ platform/nn ./. Sam/np Caldwell/np ,/, State/nn-tl Highway/nn-tl Department/nn-tl public/jj relations/nns director/nn ,/, resigned/vbd Tuesday/nr to/to work/vb for/in | Lt./nn-tl Gov./nn-tl Garland/np Byrd's/np$ campaign/nn | ./. Caldwell's/np$ resignation/nn had/hvd been/ben expected/vbn for/in some/dti time/nn ./. He/pps will/md be/be succeeded/vbn by/in Rob/np Ledford/np of/in Gainesville/np ,/, who/wps has/hvz been/ | campaign/nn | Lt./nn-tl Gov./nn-tl Garland/np Byrd's/np$ |
| ll/np ,/, State/nn-tl Highway/nn-tl Department/nn-tl public/jj relations/nns director/nn ,/, resigned/vbd Tuesday/nr to/to work/vb for/in Lt./nn-tl Gov./nn-tl Garland/np Byrd's/np$ campaign/nn ./. | Caldwell's/np$ resignation/nn | had/hvd been/ben expected/vbn for/in some/dti time/nn ./. He/pps will/md be/be succeeded/vbn by/in Rob/np Ledford/np of/in Gainesville/np ,/, who/wps has/hvz been/ben an/at assistant/nn more/ap than/i | resignation/nn | Caldwell's/np$ |

cd <- merge_conc(cd, cd_new)
nrow(cd)
[1] 7344

Automatic annotation

We now have a concordance cd with 7344 observations matching four different patterns. We can get a quick overview of the size of each subset with xtabs(). Here we can see that the most common pattern is the of genitive with a common noun in the Possessor slot, but also that within the s variant the proper noun is more common.

xtabs(~ gen_type +
        possessor_type,
      data = cd)
        possessor_type
gen_type common proper
      of   3151   1006
      s    1050   2137
Tip

It’s a good idea to practice quick overview functions, such as xtabs() and table() from base R, as well as count() and summarize() from tidyverse. Running sophisticated analyses should not push us away from simpler ways of exploring the data.
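As a small illustration of these overview functions, the sketch below tabulates the same two variables in three ways. The six rows of toy are invented for the example and are not taken from cd; only the column names match the concordance.

```r
library(dplyr)

# Toy data: invented rows standing in for the real concordance `cd`
toy <- data.frame(
  gen_type       = c("of", "of", "of", "s", "s", "s"),
  possessor_type = c("common", "common", "proper", "common", "proper", "proper")
)

# Base R: cross-tabulation with a formula interface
xtabs(~ gen_type + possessor_type, data = toy)

# Base R: the same cross-tabulation from two vectors
table(toy$gen_type, toy$possessor_type)

# tidyverse: one row per combination, with the frequency in column `n`
toy_counts <- toy %>% count(gen_type, possessor_type)
toy_counts
```

The first two calls return a contingency table, whereas count() returns a data frame in long format, which is easier to pipe into further dplyr steps.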

Before saving our concordance, we can add some automatic annotation. The following dplyr::mutate() call adds or manipulates columns in different ways:

cd <- cd %>% mutate(
  possessed = re_replace_all(possessed, 
                             r"--[(?xi) ([^\s/]+) / [^\s]+ ]--", 
                             "\\1") %>% 
              tolower(),
  possessor = re_replace_all(possessor, 
                             r"--[(?xi) ([^\s/]+?) ('s)? / [^\s]+ ]--", 
                             "\\1") %>% 
              tolower(),
  comp = re_retrieve_first(source, ".(.)..$", requested_group = 1),
  left_tagged = left,
  left        = re_replace_all(left_tagged, 
                               r"--[(?xi) ([^\s/]+) / [^\s]+ ]--", 
                               "\\1"),
  match_tagged = match,
  match        = re_replace_all(match_tagged, 
                                r"--[(?xi) ([^\s/]+) / [^\s]+ ]--", 
                                "\\1"),
  right_tagged = right,
  right        = re_replace_all(right_tagged, 
                                r"--[(?xi) ([^\s/]+) / [^\s]+ ]--", 
                                "\\1"),
  size_possessed = log(nchar(possessed)),
  size_possessor = log(nchar(possessor)),
  size_diff      = size_possessor - size_possessed
)
  • Lines 2-4 remove the part-of-speech tags from the text in the possessed column, overwriting it. They do so by replacing all matches of the regular expression in line 3 found in the text given in line 2, i.e. the content of the possessed column, with the match of the first capturing group. The regular expression captures sequences of characters that are neither whitespace nor slashes, followed by a slash and then by another sequence of non-whitespace characters, e.g. the/at and example/nn. The capturing group surrounds the first sequence before the slash, and therefore the result replaces the/at example/nn with the example.

  • Line 5 converts the new contents of possessed to lowercase, which is useful if you want to use the column as a random effect in regression analysis.

  • Lines 6-8 do the same as lines 2-4 but for the possessor column; in addition, they exclude the optional ’s sequence by leaving it out of the capturing group. Line 9 does the same as line 5, for the new contents of possessor.

  • Line 10 creates a column comp that reads the column source, where the names of the files are stored, and extracts the third to last character by using re_retrieve_first() with the pattern .(.)..$ and specifying that we want the first captured group. This character corresponds to the genre assignment of the file and could be used in modelling.

  • Lines 12-14, 16-18 and 20-22 do the same as lines 2-4 for the columns left, match and right respectively. However, since we might want to keep the original, tagged text as well, lines 11, 15 and 19 first store the original values of these columns in new columns: left_tagged, match_tagged and right_tagged respectively.

  • Lines 23 and 24 count the number of characters in each row of possessed and possessor with nchar() and then take their logarithm with log(). For example, for a word such as example, nchar("example") returns 7, and log(nchar("example")) returns 1.946; for the example, nchar("the example") returns 11, and log(nchar("the example")) returns 2.398. You can retrieve the original length by applying exp(): exp(log(nchar("the example"))) returns 11. Line 25 computes the difference between the logged size of the possessor and that of the possessed, giving us positive numbers when the possessor is longer, negative numbers when the possessed is longer, and zero when both have the same length.
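To see these steps on a single observation, here is a minimal sketch applied to the first row of Table 5. It uses base R's gsub() with perl = TRUE as a stand-in for re_replace_all(), so that it runs without the corpus; that substitution is an assumption made for illustration, not the code used in this study.

```r
# Tagged values taken from the first row of Table 5
tagged_possessed <- "recent/jj primary/nn election/nn"
tagged_possessor <- "Atlanta's/np$"

# Strip the tags and lowercase; gsub(..., perl = TRUE) stands in for
# mclm::re_replace_all() and takes the same regular expressions
possessed <- tolower(gsub(r"--[(?xi) ([^\s/]+) / [^\s]+ ]--", "\\1",
                          tagged_possessed, perl = TRUE))
possessor <- tolower(gsub(r"--[(?xi) ([^\s/]+?) ('s)? / [^\s]+ ]--", "\\1",
                          tagged_possessor, perl = TRUE))

possessed  # "recent primary election"
possessor  # "atlanta" (the 's is left out of the capturing group)

# Logged lengths and their difference
size_possessed <- log(nchar(possessed))
size_possessor <- log(nchar(possessor))
size_diff <- size_possessor - size_possessed
size_diff  # negative, because the possessor is shorter than the possessed
```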

In this example, re_replace_all() and re_retrieve_first() seem to be doing the same thing: they match a regular expression to a piece of text and extract the contents of the first captured group. The difference is that re_replace_all() replaces every match in a piece of text with the replacement string: if there are multiple matches, because there are multiple words, each wordform/pos sequence is replaced with wordform. re_retrieve_first(), instead, extracts only the first match of the pattern in the text, regardless of whether there are capturing groups; specifying requested_group = 1 narrows the return value down to the part matched by the first capturing group.
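The contrast between the whole match and the captured group can be sketched in base R, with regexec() and regmatches() standing in for re_retrieve_first(). This is only an approximation for this simple case, and the file name "ca01" is used as a bare string rather than a full path.

```r
fname <- "ca01"

# regexec() finds the first match and its capturing groups;
# regmatches() returns them as c(whole match, group 1, ...)
hit <- regmatches(fname, regexec(".(.)..$", fname))[[1]]

hit[1]  # "ca01": the whole match, i.e. the last four characters
hit[2]  # "a": the first capturing group, the third-to-last character
```

With a full path the whole match would still be the last four characters of the string, so the captured character stays the same.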

The code above has some repetition: the re_replace_all() calls in lines 2, 6, 12, 16 and 20 all have the same structure. The first call and the three last calls only differ in the first argument (possessed, left_tagged, match_tagged or right_tagged), whereas the second call has a different first argument and a different pattern.

It is often useful, as long as it doesn’t hamper interpretation, to avoid repetition in coding. First, the more you type, the more chances you have of mistyping something. Second, if you want to “always do the same thing” but then you change your mind on what that thing should be, e.g. you want to refine the regex, it would be better to only have to change it once, and not every time you do it.

This can be achieved with variables and functions. For example, instead of typing or copy-pasting the regular expression multiple times, you could write the following variable and then just use pattern instead of the regex in lines 3, 13, 17 and 21. We did the same with the queries above, where conc() received pattern rather than a literal regex.

pattern <- r"--[(?xi) ([^\s/]+) / [^\s]+ ]--"

Moreover, since the whole function call is pretty much the same, you could create your own function with one argument: the variable you want to change. After the code below, you could replace the full call from lines 2 to 4 with a simple remove_tags(possessed) in line 2, and the same, changing the name of the column, with the last three calls of re_replace_all().

remove_tags <- function(column) {
  re_replace_all(column,
                 r"--[(?xi) ([^\s/]+) / [^\s]+ ]--", 
                 "\\1")
}

However, this doesn’t work with the second call of re_replace_all(), since it uses a different pattern. You could either keep it that way, i.e. not replacing it with remove_tags(), or add an optional argument for the regular expression. The default value could be the most common regular expression, and then when you call remove_tags() on possessor you could provide the appropriate regex.

remove_tags <- function(
    column,
    regex = r"--[(?xi) ([^\s/]+) / [^\s]+ ]--"
    ) {
  re_replace_all(column, regex, "\\1")
}

With this definition, the code from above would look as follows:

cd <- cd %>% mutate(
  possessed = remove_tags(possessed) %>% tolower(),
  possessor = remove_tags(possessor, 
                          r"--[(?xi) ([^\s/]+?) ('s)? / [^\s]+ ]--") %>% 
              tolower(),
  comp = re_retrieve_first(source, ".(.)..$", requested_group = 1),
  left_tagged = left,
  left        = remove_tags(left_tagged),
  match_tagged = match,
  match        = remove_tags(match_tagged),
  right_tagged = right,
  right        = remove_tags(right_tagged),
  size_possessed = log(nchar(possessed)),
  size_possessor = log(nchar(possessor)),
  size_diff      = size_possessor - size_possessed
)

In fact, you could make the code even shorter by replacing the new lines 7-12 with a single call to across(): across(c(left, match, right), remove_tags, .names = "{.col}_untagged"). In this case, instead of storing copies of the original columns with a “_tagged” suffix, it leaves left, match and right untouched and adds an “_untagged” suffix to the new, cleaned columns.
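A minimal sketch of the across() variant on a one-row toy data frame, with values adapted from the first row of Table 5. Here remove_tags() is redefined with base R's gsub() in place of re_replace_all() so that the example runs on its own; that replacement is an assumption for illustration.

```r
library(dplyr)

# Stand-in for the remove_tags() defined above, using gsub() instead of
# mclm::re_replace_all() so that no corpus or mclm installation is needed
remove_tags <- function(column,
                        regex = r"--[(?xi) ([^\s/]+) / [^\s]+ ]--") {
  gsub(regex, "\\1", column, perl = TRUE)
}

# Toy concordance line (shortened context from the first row of Table 5)
toy <- data.frame(
  left  = "investigation/nn of/in",
  match = "Atlanta's/np$ recent/jj primary/nn election/nn",
  right = "produced/vbd"
)

# Apply remove_tags() to the three columns; the originals are kept and
# the cleaned versions are stored in new columns with an "_untagged" suffix
toy_clean <- toy %>%
  mutate(across(c(left, match, right), remove_tags,
                .names = "{.col}_untagged"))

names(toy_clean)
toy_clean$match_untagged  # "Atlanta's recent primary election"
```

Note that code further down the pipeline would then have to refer to left_untagged, match_untagged and right_untagged rather than to left, match and right.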

Citing examples in markdown

With this new version of the concordance, which you can explore with print_kwic(cd), View(cd) or explore(cd), you can also extract full examples. If you use R Markdown or Quarto to write a report, rather than copy-pasting an example to describe, you can join the new left, match and right columns and call a specific example to illustrate.

In the following code:

  • The call to mutate() and short_names() shortens the names of the files.

  • The calls to unite() create a column with the name of the first argument (conc_line or id) by joining the rest of the columns mentioned (left, match and right, or source and id). By default the values are separated with an underscore, but the sep argument allows you to use a different separator.

  • The call to select() extracts only the id and conc_line columns. Afterwards, deframe() turns the two-column tibble into a named character vector: the values are the contexts and their names, the ids.

examples <- cd %>% 
  as_tibble() %>% 
  mutate(source = short_names(source)) %>% 
  unite(conc_line, left, match, right, sep = " ") %>% 
  unite(id, source, id) %>% 
  select(id, conc_line) %>%
  deframe()

head(examples)
                                                                                                                                                                                                                                                                                                         ca01_1 
                         "ns million worth of highway reconstruction bonds . The bond issue will go to the state courts for a friendly test suit to test  the validity of the act  , and then the sales will begin and contracts let for repair work on some of Georgia's most heavily traveled highways . A H" 
                                                                                                                                                                                                                                                                                                         ca02_2 
             "servative '' his estimate that it would produce 17 million dollars to help erase an anticipated deficit of 63 million dollars at  the end of the current fiscal year  next Aug. 31 . He told the committee the measure would merely provide means of enforcing the escheat law which has been on" 
                                                                                                                                                                                                                                                                                                         ca02_3 
"over also would require junior-senior high teachers to have at least 24 semester hours credit in the subject they are teaching .  The remainder of the 4-year college requirement  would be in general subjects . `` A person with a master's degree in physics , chemistry , math or English , yet who has n" 
                                                                                                                                                                                                                                                                                                         ca02_4 
                            "/np Clark of Hays , Kan. as the school's new president . Dr. Clark will succeed Dr. J. R. McLemore , who will retire at  the close of the present school term  . Dr. Clark holds an earned Doctor of Education degree from the University of Oklahoma . He also received a Master" 
                                                                                                                                                                                                                                                                                                         ca03_5 
                                "d jury room '' . He said this constituted a `` very serious misuse '' of the Criminal court processes . `` Actually ,  the abuse of the process  may have constituted a contempt of the Criminal court of Cook county , altho vindication of the authority of that court is n" 
                                                                                                                                                                                                                                                                                                         ca03_6 
                            "at 21st and 28th precincts of the 29th ward , the 18th precinct of the 4th ward , and the 9th precinct of the 23d ward .  The case of the judges  in the 58th precinct of the 23d ward had been heard previously and taken under advisement by Karns . Two other cases also were/" 
examples[["ca01_1"]]
[1] "ns million worth of highway reconstruction bonds . The bond issue will go to the state courts for a friendly test suit to test  the validity of the act  , and then the sales will begin and contracts let for repair work on some of Georgia's most heavily traveled highways . A H"

Then you could write, in an empty line of your report, a numbered example with R inline code; this lets you easily cross-reference your example with (@label), which returns (1).

(@label) `r examples[["ca01_1"]]`
(1) ns million worth of highway reconstruction bonds . The bond issue will go to the state courts for a friendly test suit to test the validity of the act , and then the sales will begin and contracts let for repair work on some of Georgia’s most heavily traveled highways . A H

In any case, you probably want to edit your example a bit: removing unnecessary text from the extremes, adding italics or bold to some sections or even, if your corpus is in a language other than English, adding glosses and translations to print with glossr.

Save your progress and be free!

We are DONE. Congratulations! We can now store the output in a tab-separated file with write_conc(). Then you can open the file with any spreadsheet software for manual cleaning and/or annotation and then read it again from R with read_conc(filename), as we will do in the second part of this study.

write_conc(cd, file.path(data_folder, "genitive-alternation.tab"))
CSV

Spreadsheet programs may call this format a “csv” (comma-separated values) file. What matters is that it’s a plain text file. A proper csv file uses commas to separate the different columns, or semicolons in places where commas are used to indicate decimals. However, this is meant for numbers: texts, such as concordances, will inevitably contain commas and probably also semicolons. Therefore, we use tabs to separate the columns instead.

The tidyverse functions to read and write tab-separated files are read_tsv() and write_tsv() respectively.
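As a quick check that tabs are a safe separator for text with commas and semicolons, the following sketch writes a toy data frame to a temporary tab-separated file and reads it back. The two rows and the temporary file are invented for the example; for real concordances, write_conc() and read_conc() remain the functions used in this study.

```r
library(readr)

# Toy data: invented text values containing commas and a semicolon
toy <- data.frame(
  id   = c(1, 2),
  text = c("the validity of the act , and then",
           "physics , chemistry ; math")
)

# Write to a temporary tab-separated file and inspect the raw lines
tmp <- tempfile(fileext = ".tab")
write_tsv(toy, tmp)
readLines(tmp)

# Read it back; the text column survives the round trip unchanged
toy_back <- read_tsv(tmp, show_col_types = FALSE)
identical(toy_back$text, toy$text)
```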

Footnotes

  1. The ’s is part of the token it’s attached to.