Genitive Alternation I: Retrieval
Based on Dirk Speelman’s course material
This is the first part of a variation study focusing on the genitive alternation in English, using the Brown corpus. This part focuses on retrieving observations of the alternation, whereas the second part will illustrate the analysis with conditional inference trees and mixed-effects logistic regression. One of the reasons to split the analysis in two is that, more often than not, between retrieval and analysis there is an additional step of manual cleaning or annotation of the dataset.
Setup
For this study we need to activate the packages tidyverse and mclm. We will also use kableExtra to print some tables.
We’ll collect the files from the Brown corpus. To replicate this code, adjust the path to wherever you stored your copy of the corpus. In this case, we also add a keep_re() call to capture the pattern shared by all corpus files and not by metadata files such as “CONTENTS” and “README”. The hide_path argument hides a part of the filename when printing, to ease reading the table.
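library(tidyverse)
library(mclm)
library(kableExtra)
corpus_folder <- here::here("studies", "_corpora", "brown") # adjust to your path
brown_fnames <- get_fnames(corpus_folder) %>%
  keep_re("/c[a-z][0-9]{2}") # keep corpus files such as ca01, drop CONTENTS and README
print(brown_fnames, n = 10, hide_path = corpus_folder)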
Filename collection of length 500
filename
--------
1 ca01
2 ca02
3 ca03
4 ca04
5 ca05
6 ca06
7 ca07
8 ca08
9 ca09
10 ca10
...
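To see what the filter retains, the same regular expression can be tried on a handful of paths with base R. This is only a sketch: the paths below are made up for illustration, and grepl() stands in for keep_re().

```r
# Hypothetical paths, invented for illustration only
files <- c(
  "/corpora/brown/ca01",
  "/corpora/brown/cr09",
  "/corpora/brown/CONTENTS",
  "/corpora/brown/README"
)

# Same regular expression as in the keep_re() call above:
# a slash, a lowercase "c", another lowercase letter and two digits
is_corpus_file <- grepl("/c[a-z][0-9]{2}", files)

kept <- files[is_corpus_file]
print(basename(kept))
```

The match is case sensitive, which is why the uppercase metadata file names are left out.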
Data retrieval
Our goal is to obtain the attestations of the genitive alternation in the Brown corpus, e.g. examples of “the student’s idea” and “the idea of the student”. In this alternation, the main elements we are interested in are called “Possessor” and “Possessed”, because the prototypical situation of the genitive is that of possession. However, the terminology can be misleading, because this is not the only meaning of the genitive constructions. For example, in “the car’s owner”, car takes the role of Possessor and owner that of Possessed, even though it is the owner who possesses the car.
An alternation study aims to describe the aspects of language use that favor or disfavor one variant of the alternation against the other. This translates to studies where the alternation itself, e.g. the choice between an s form or an of form, becomes the response variable or outcome, and different characteristics of the context become predictors or regressors. In this study, we’ll capture the following language-internal predictors:
- possessor_type: Whether the role of the Possessor is filled by a common noun or a proper noun.
- possessor_size: The size of the Possessor slot, in characters.
- possessed_size: The size of the Possessed slot, in characters.
- size_difference: The difference between the size of the Possessor and that of the Possessed.
The reason for taking the size of the constituents, i.e. Possessor and Possessed, is the theory within linguistics that speakers try to push longer constituents towards the end. In the genitive alternation, in which each variant has a different word order, this becomes relevant. Concretely, in the s variant we have the Possessor before the Possessed (“The student’s idea”), whereas in the of variant we have the Possessed before the Possessor (“The idea of the student”). If we indeed tend to push longer constituents towards the end, a long Possessor such as “the keen and insightful student” would favor the of variant (“The idea of the keen and insightful student”) rather than the s variant (“The keen and insightful student’s idea”).
In practice, we will also transform the sizes to their logarithm, in order to lessen the impact of extremely long constituents.
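As a small illustration of the effect of the logarithm, the sketch below compares raw and logged sizes for the example phrases used above. The variable names are only illustrative, and the strings leave out the article, as an assumption about what ends up in each slot.

```r
possessed <- "idea"
possessors <- c("student", "keen and insightful student")

# Raw sizes in characters
raw_possessed <- nchar(possessed)
raw_possessor <- nchar(possessors)

# Logged sizes and their difference (positive: the Possessor is longer)
size_possessed <- log(raw_possessed)
size_possessor <- log(raw_possessor)
size_diff <- size_possessor - size_possessed

print(data.frame(
  possessor = possessors,
  raw_diff  = raw_possessor - raw_possessed,
  size_diff = size_diff
))
```

Since a difference of logarithms is the logarithm of a ratio, size_diff expresses how many times longer the Possessor is than the Possessed, rather than how many characters longer.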
The strategy of data retrieval consists of four queries with non-overlapping results, aimed at collecting four different patterns, combinations of the outcome gen_type and the possessor_type.
gen_type | common nouns | proper nouns |
---|---|---|
of-genitive | query 1 | query 2 |
s-genitive | query 3 | query 4 |
For each of these queries, we will:
- Collect the observations with mclm::conc(), generating a concordance.
- Add columns with the values of gen_type (our outcome) and possessor_type, since they are the same for all observations in that query.
- Extract the possessed and possessor based on their position in the query. For that purpose, the regular expressions used to query the corpus include capturing groups surrounding each of the constituents.
This will be achieved with the following lines of code:
cd <- brown_fnames %>%
  conc(pattern) %>% # whatever regex pattern we have defined
  mutate(
    gen_type = "of", # or "s"
    possessor_type = "common", # or "proper"
    possessed = re_replace_first(match, pattern, "\\1"), # or "\\2" for s-genitive
    possessor = re_replace_first(match, pattern, "\\2") # or "\\1" for s-genitive
  )
- Once we have more than one concordance, we can merge them with merge_conc().
Of-genitive with common nouns
In the first query we will retrieve the attestations of the of-genitive with common nouns, e.g. “the idea of the student”.
Regular expression
If you would like to practice regular expressions, the {mclmtutorials} package includes a tutorial on how to use them in {mclm}. Just install the package along with {learnr} and {gradethis} and run the tutorial:
# install.packages("learnr")
# remotes::install_github("masterclm/mclmtutorials")
# remotes::install_github("rstudio/gradethis")
learnr::run_tutorial("regex", "mclmtutorials")
As a first step, we construct a regular expression that will match, in the non-tokenized corpus, the sequence we are interested in, and we’ll store it in a variable pattern. Notice that we use a raw string (preceded by r and, within the quotation marks, surrounded by two hyphens and a square bracket) and that we flag the regular expression with an x and an i. The x allows us to insert spaces and line breaks for more readability (free-spacing mode), and the i makes the search case insensitive.
With this pattern, we are asking for:
a word boundary,
followed by the/at and whitespace,
optionally followed by adjectives, each followed by whitespace, and then at least one common noun — this is the first capturing group, the Possessed,
then whitespace, followed by of/in, whitespace, the/at and whitespace,
optionally followed by adjectives, each followed by whitespace, and then at least one common noun without the genitive tag — this is the second capturing group, the Possessor.
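The description above can be turned into a free-spacing regular expression along the following lines. This is a sketch written from the description, not necessarily the expression used in the original study. The name pattern_sketch and the toy string are illustrative assumptions, and base R's regexec() is used so that it runs without the corpus.

```r
# Sketch of a pattern following the description above (an assumption,
# not necessarily the expression used in the original study)
pattern_sketch <- r"--[(?xi)
  \b the/at \s+
  (                                     # group 1: the Possessed
    (?: [^\s/]+ / jj [^\s]* \s+ )*      #   optional adjectives
    [^\s/]+ / nns? [^\s$]*              #   a common noun
    (?: \s+ [^\s/]+ / nns? [^\s$]* )*   #   optionally more common nouns
  )
  \s+ of/in \s+ the/at \s+
  (                                     # group 2: the Possessor
    (?: [^\s/]+ / jj [^\s]* \s+ )*      #   optional adjectives
    [^\s/]+ / nns? [^\s$]*              #   a common noun, no genitive tag
    (?: \s+ [^\s/]+ / nns? [^\s$]* )*   #   optionally more common nouns
  )
  (?! \S )                              # the last tag must end here
]--"

toy <- "dollars/nns at/in the/at end/nn of/in the/at current/jj fiscal/jj year/nn next/ap Aug./np"

# Full match first, then the contents of the two capturing groups
hit <- regmatches(toy, regexec(pattern_sketch, toy, perl = TRUE))[[1]]
print(hit)
```

The final negative lookahead keeps a genitive Possessor such as student's/nn$ from being matched up to the dollar sign.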
Code
We build the concordance with conc(), which takes a corpus or filenames, e.g. brown_fnames, and a regular expression to match, e.g. pattern. We can then explore the concordance with print_kwic(), explore() or View(). This really is advisable, in order to inspect how good the regex pattern is and see if you might want to refine it to capture patterns that were missed, or to exclude patterns that should be discarded. Table 1 prints the first 6 items in the concordance with kableExtra.
With practice, you’ll find the balance between refining the automatic process and performing manual cleaning. In some cases, it’s quicker to fix the automatic procedure; in other cases, manual adjustments are easier and more reliable.
cd <- brown_fnames %>%
  conc(pattern)
print_kwic(cd, n = 10)
idx left| match |right
1 ... suit/nn to/to test/vb|the/at va...at act/nn|,/, and/cc then/rb th...
2 ...n/cd dollars/nns at/in|the/at en...j year/nn|next/ap Aug./np 31/cd...
3 ...e/ber teaching/vbg ./.|The/at re...rement/nn|would/md be/be in/in ...
4 ...ill/md retire/vb at/in|the/at cl...n term/nn|./. Dr./nn-tl Clark/n...
5 ... ``/`` Actually/rb ,/,|the/at ab...rocess/nn|may/md have/hv consti...
6 .../at 23d/od ward/nn ./.|The/at ca...udges/nns|in/in the/at 58th/od ...
7 ...O/nn line/nn ./. On/in|the/at ne... sheet/nn|must/md be/be set/vbn...
8 ...the/at alliance/nn ,/,|the/at us...ion/nn-tl|for/in-tl Economic/jj...
9 ...ns as/ql well/rb as/cs|the/at ma...errent/nn|./. This/dt increase/...
10 ...ey/ppss count/vb on/in|the/at ai...tries/nns|attending/vbg the/at ...
...
left | match | right |
---|---|---|
ns million/cd worth/nn of/in highway/nn reconstruction/nn bonds/nns ./. The/at bond/nn issue/nn will/md go/vb to/in the/at state/nn courts/nns for/in a/at friendly/jj test/nn suit/nn to/to test/vb | the/at validity/nn of/in the/at act/nn | ,/, and/cc then/rb the/at sales/nns will/md begin/vb and/cc contracts/nns let/vbn for/in repair/nn work/nn on/in some/dti of/in Georgia's/np$ most/ql heavily/rb traveled/vbn highways/nns ./. A/at H |
servative/jj ''/'' his/pp$ estimate/nn that/cs it/pps would/md produce/vb 17/cd million/cd dollars/nns to/to help/vb erase/vb an/at anticipated/vbn deficit/nn of/in 63/cd million/cd dollars/nns at/in | the/at end/nn of/in the/at current/jj fiscal/jj year/nn | next/ap Aug./np 31/cd ./. He/pps told/vbd the/at committee/nn the/at measure/nn would/md merely/rb provide/vb means/nns of/in enforcing/vbg the/at escheat/nn law/nn which/wdt has/hvz been/ben on/in |
over/np also/rb would/md require/vb junior-senior/jj high/nn teachers/nns to/to have/hv at/in least/ap 24/cd semester/nn hours/nns credit/vb in/in the/at subject/nn they/ppss are/ber teaching/vbg ./. | The/at remainder/nn of/in the/at 4-year/jj college/nn requirement/nn | would/md be/be in/in general/jj subjects/nns ./. ``/`` A/at person/nn with/in a/at master's/nn$ degree/nn in/in physics/nn ,/, chemistry/nn ,/, math/nn or/cc English/np ,/, yet/rb who/wps has/hvz n |
/np Clark/np of/in Hays/np ,/, Kan./np as/cs the/at school's/nn$ new/jj president/nn ./. Dr./nn-tl Clark/np will/md succeed/vb Dr./nn-tl J./np R./np McLemore/np ,/, who/wps will/md retire/vb at/in | the/at close/nn of/in the/at present/jj school/nn term/nn | ./. Dr./nn-tl Clark/np holds/vbz an/at earned/vbn Doctor/nn-tl of/in-tl Education/nn-tl degree/nn from/in the/at University/nn-tl of/in-tl Oklahoma/np-tl ./. He/pps also/rb received/vbd a/at Master |
d/jj jury/nn room/nn ''/'' ./. He/pps said/vbd this/dt constituted/vbd a/at ``/`` very/ql serious/jj misuse/nn ''/'' of/in the/at Criminal/jj-tl court/nn-tl processes/nns ./. ``/`` Actually/rb ,/, | the/at abuse/nn of/in the/at process/nn | may/md have/hv constituted/vbn a/at contempt/nn of/in the/at Criminal/jj-tl court/nn-tl of/in-tl Cook/np county/nn ,/, altho/cs vindication/nn of/in the/at authority/nn of/in that/dt court/nn is/bez n |
at 21st/od and/cc 28th/od precincts/nns of/in the/at 29th/od ward/nn ,/, the/at 18th/od precinct/nn of/in the/at 4th/od ward/nn ,/, and/cc the/at 9th/od precinct/nn of/in the/at 23d/od ward/nn ./. | The/at case/nn of/in the/at judges/nns | in/in the/at 58th/od precinct/nn of/in the/at 23d/od ward/nn had/hvd been/ben heard/vbn previously/rb and/cc taken/vbn under/in advisement/nn by/in Karns/np ./. Two/cd other/ap cases/nns also/rb were/ |
Because cd is also a dataframe, of which match is a column, we can also inspect the elements in the match column by extracting it.
head(cd$match)
[1] "the/at validity/nn of/in the/at act/nn "
[2] "the/at end/nn of/in the/at current/jj fiscal/jj year/nn "
[3] "The/at remainder/nn of/in the/at 4-year/jj college/nn requirement/nn "
[4] "the/at close/nn of/in the/at present/jj school/nn term/nn "
[5] "the/at abuse/nn of/in the/at process/nn "
[6] "The/at case/nn of/in the/at judges/nns "
Once we have a decent concordance, we can add variables that are characteristic of it. All of these observations will have the value of in the gen_type variable and the value common in the possessor_type variable. In addition, we can extract the constituents of the Possessed and Possessor slots with mclm::re_replace_first(). The first argument is a text to match (the elements in the match column); the second is a regular expression to match in the text (the pattern we use to retrieve the text), and the third is a replacement string. "\\1" and "\\2" correspond to the contents of the first and second capturing groups in pattern, respectively. In other words, we find the portion of text in each element of match that matches the pattern (which is all of it, since that was how it was constructed) and extract either the first capturing group, to fill the possessed column, or the second capturing group, to fill the possessor column.
cd <- cd %>%
  mutate(
    gen_type = "of",
    possessor_type = "common",
    possessed = re_replace_first(match, pattern, "\\1"),
    possessor = re_replace_first(match, pattern, "\\2")
  )
cd %>%
  as_tibble() %>%
  select(match, possessed, possessor) %>%
  head() %>%
  kbl() %>%
  kable_paper()
match | possessed | possessor |
---|---|---|
the/at validity/nn of/in the/at act/nn | validity/nn | act/nn |
the/at end/nn of/in the/at current/jj fiscal/jj year/nn | end/nn | current/jj fiscal/jj year/nn |
The/at remainder/nn of/in the/at 4-year/jj college/nn requirement/nn | remainder/nn | 4-year/jj college/nn requirement/nn |
the/at close/nn of/in the/at present/jj school/nn term/nn | close/nn | present/jj school/nn term/nn |
the/at abuse/nn of/in the/at process/nn | abuse/nn | process/nn |
The/at case/nn of/in the/at judges/nns | case/nn | judges/nns |
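The logic of replacing a whole match with the contents of one capturing group can be tried on a single string with base R's sub(). The sketch below uses a deliberately simplified stand-in for the query pattern (simple_pattern is an illustrative name) and sub() in place of re_replace_first().

```r
# One element of the match column, copied from the output above
match_text <- "the/at end/nn of/in the/at current/jj fiscal/jj year/nn "

# Simplified stand-in for the query pattern, with two capturing groups
simple_pattern <- r"--[(?i)^the/at\s+(.+?)\s+of/in\s+the/at\s+(.+?)\s*$]--"

# Replace the matched text with the first or the second capturing group
possessed <- sub(simple_pattern, "\\1", match_text, perl = TRUE)
possessor <- sub(simple_pattern, "\\2", match_text, perl = TRUE)

print(c(possessed = possessed, possessor = possessor))
```

Because the pattern covers the whole string, only the captured group remains. Any text outside the matched portion would be left in place by the replacement.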
Of-genitive with proper nouns
The second query also captures the of-genitive variant, but with proper nouns in the Possessor slot instead, e.g. “the rivers of Belgium”.
Regular expression
The regular expression is very similar to the one for the first query, with a few differences:
It does not ask for, or even accept, an article after of/in.
The noun(s) requested in lines 10 and 11 are proper nouns instead of common nouns.
Between the optional adjective(s) and the noun(s) of the second capturing group, i.e. the Possessor slot, we also accept optional items with any part-of-speech as long as they also end in -tl (line 9).
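Under the same assumptions as before, the three differences could be sketched as follows. This is an illustrative reconstruction, not the original expression, so the line numbers mentioned in the text need not line up with it. pattern_sketch and the toy string are hypothetical.

```r
# Sketch of the of-genitive pattern with a proper-noun Possessor
# (an assumption based on the description, not the original expression)
pattern_sketch <- r"--[(?xi)
  \b the/at \s+
  (                                     # group 1: the Possessed
    (?: [^\s/]+ / jj [^\s]* \s+ )*      #   optional adjectives
    [^\s/]+ / nns? [^\s$]*              #   a common noun
    (?: \s+ [^\s/]+ / nns? [^\s$]* )*   #   optionally more common nouns
  )
  \s+ of/in \s+                         # no article after of/in
  (                                     # group 2: the Possessor
    (?: [^\s/]+ / jj [^\s]* \s+ )*      #   optional adjectives
    (?: [^\s/]+ / [^\s$]+ -tl \s+ )*    #   optional items tagged ...-tl
    [^\s/]+ / nps? [^\s$]*              #   a proper noun
    (?: \s+ [^\s/]+ / nps? [^\s$]* )*   #   optionally more proper nouns
  )
  (?! \S )
]--"

toy <- "approved/vbd Thursday/nr the/at bill/nn of/in Sen./nn-tl George/np Parkhouse/np of/in Dallas/np"
hit <- regmatches(toy, regexec(pattern_sketch, toy, perl = TRUE))[[1]]
print(hit)
```

A string such as the/at validity/nn of/in the/at act/nn does not match this sketch, since the article after of/in is neither an adjective, a -tl item nor a proper noun.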
Code
The steps are the same as for the first query. We store the object in a different variable, cd_new, and we assign the value proper to possessor_type instead of common. Afterwards, we rewrite cd to be the combination of cd (the first query) and cd_new (the second query) using merge_conc(), an mclm wrapper for bind_rows().
cd_new <- brown_fnames %>%
  conc(pattern) %>%
  mutate(
    gen_type = "of",
    possessor_type = "proper",
    possessed = re_replace_first(match, pattern, "\\1"),
    possessor = re_replace_first(match, pattern, "\\2")
  )
# print_kwic(cd_new, n = 10)
left | match | right | possessed | possessor |
---|---|---|---|---|
-tl Department/nn-tl ``/`` has/hvz seen/vbn fit/jj to/to distribute/vb these/dts funds/nns through/in the/at welfare/nn departments/nns of/in all/abn the/at counties/nns in/in the/at state/nn with/in | the/at exception/nn of/in Fulton/np-tl | County/nn-tl ,/, which/wdt receives/vbz none/pn of/in this/dt money/nn ./. The/at jurors/nns said/vbd they/ppss realize/vb ``/`` a/at proportionate/jj distribution/nn of/in these/dts funds/nns migh | exception/nn | Fulton/np-tl |
Austin/np-hl ,/,-hl Texas/np-hl --/-- Committee/nn approval/nn of/in Gov./nn-tl Price/np Daniel's/np$ ``/`` abandoned/vbn property/nn ''/'' act/nn seemed/vbd certain/jj Thursday/nr despite/in | the/at adamant/jj protests/nns of/in Texas/np | bankers/nns ./. Daniel/np personally/rb led/vbd the/at fight/nn for/in the/at measure/nn ,/, which/wdt he/pps had/hvd watered/vbn down/rp considerably/rb since/in its/pp$ rejection/nn by/in two/cd | adamant/jj protests/nns | Texas/np |
n on/in the/at hearing/nn ,/, since/cs the/at bill/nn was/bedz introduced/vbn only/rb last/ap Monday/nr ./. Austin/np-hl ,/,-hl Texas/np-hl --/-- Senators/nns unanimously/rb approved/vbd Thursday/nr | the/at bill/nn of/in Sen./nn-tl George/np Parkhouse/np | of/in Dallas/np authorizing/vbg establishment/nn of/in day/nn schools/nns for/in the/at deaf/jj in/in Dallas/np and/cc the/at four/cd other/ap largest/jjt counties/nns ./. The/at bill/nn is/bez des | bill/nn | Sen./nn-tl George/np Parkhouse/np |
--/-- Principals/nns of/in the/at 13/cd schools/nns in/in the/at Denton/np-tl Independent/jj-tl School/nn-tl District/nn-tl have/hv been/ben re-elected/vbn for/in the/at 1961-62/cd session/nn upon/in | the/at recommendation/nn of/in Supt./nn-tl Chester/np O./np Strickland/np | ./. State/nn and/cc federal/jj legislation/nn against/in racial/jj discrimination/nn in/in employment/nn was/bedz called/vbn for/in yesterday/nr in/in a/at report/nn of/in a/at ``/`` blue/jj ribbon | recommendation/nn | Supt./nn-tl Chester/np O./np Strickland/np |
b of/in the/at circumstances/nns that/wps have/hv brought/vbn these/dts troubles/nns about/rp ,/, has/hvz been/ben conspicuous/jj by/in its/pp$ absence/nn ./. Explosion/nn-hl avoided/vbn-hl In/in | the/at case/nn of/in Portugal/np | ,/, which/wdt a/at few/ap weeks/nns ago/rb was/bedz rumored/vbn ready/jj to/to walk/vb out/rp of/in the/at NATO/nn Council/nn-tl should/md critics/nns of/in its/pp$ Angola/np policy/nn prove/vb harsh/ | case/nn | Portugal/np |
some/dti disappointment/nn that/cs the/at United/vbn-tl States/nns-tl leadership/nn has/hvz not/* been/ben as/ql much/rb in/in evidence/nn as/cs hoped/vbn for/in ./. One/cd diplomat/nn described/vbd | the/at tenor/nn of/in Secretary/nn-tl of/in-tl State/nn-tl Dean/np | Rusk's/np$ speeches/nns as/cs ``/`` inconclusive/jj ''/'' ./. But/cc he/pps hastened/vbd to/to add/vb that/cs ,/, if/cs United/vbn-tl States/nns-tl policies/nns were/bed not/* always/rb clear/jj ,/, d | tenor/nn | Secretary/nn-tl of/in-tl State/nn-tl Dean/np |
cd <- merge_conc(cd, cd_new)
nrow(cd)
[1] 4157
S-genitive with common nouns
The third query retrieves attestations of the s-genitive variant with common nouns in the Possessor slot, e.g. “the student’s idea”.
Regular expression
The main regular expression symbols are the same as those used in the previous queries. There are two main differences from the first regular expression. First, we don’t have a word between the two capturing groups; instead, we only have whitespace, and the noun in line 5 must include $ somewhere in its part-of-speech tag. Second, the first capturing group now represents the Possessor, and the second capturing group, the Possessed. This doesn’t affect the writing of the regular expression itself, other than the requirement of $ at the end of the first component and that it should not be present in the second component. However, it will affect the code below when extracting the Possessed and Possessor variables.
Again, here we overwrite the pattern variable with the new regular expression.
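A sketch of what such an expression might look like is given below. It is a reconstruction from the description, with hypothetical names, so the line numbers mentioned in the text need not line up with it.

```r
# Sketch of the s-genitive pattern with a common-noun Possessor
# (an assumption based on the description, not the original expression)
pattern_sketch <- r"--[(?xi)
  \b the/at \s+
  (                                     # group 1: the Possessor
    (?: [^\s/]+ / jj [^\s]* \s+ )*      #   optional adjectives
    (?: [^\s/]+ / nns? [^\s$]* \s+ )*   #   optional common nouns
    [^\s/]+ / nns? \$ [^\s]*            #   a noun with $ in its tag
  )
  \s+                                   # only whitespace in between
  (                                     # group 2: the Possessed
    (?: [^\s/]+ / jj [^\s]* \s+ )*      #   optional adjectives
    [^\s/]+ / nns? [^\s$]*              #   a common noun without $
    (?: \s+ [^\s/]+ / nns? [^\s$]* )*   #   optionally more common nouns
  )
  (?! \S )
]--"

toy <- "in/in 1923/cd ./. The/at mayor's/nn$ present/jj term/nn of/in office/nn expires/vbz"
hit <- regmatches(toy, regexec(pattern_sketch, toy, perl = TRUE))[[1]]

# The order is inverted with respect to the of-genitive
possessor <- hit[2]
possessed <- hit[3]
print(c(possessor = possessor, possessed = possessed))
```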
Code
Now that we have merged the first two queries into cd, we don’t need cd_new anymore, so we can overwrite it with the output of the third query. Again, we call conc() with brown_fnames and the new pattern and assign the values that correspond to this concordance: s for gen_type and common for possessor_type, and the appropriate capturing groups of the pattern for possessed and possessor. Since the word order is inverted in relation to the of-genitive, now the first capturing group corresponds to the Possessor and the second one to the Possessed.
Finally, we use merge_conc() to merge the output of the first two queries, cd, with the output of the third query, cd_new, overwriting cd.
cd_new <- brown_fnames %>%
  conc(pattern) %>%
  mutate(
    gen_type = "s",
    possessor_type = "common",
    possessed = re_replace_first(match, pattern, "\\2"),
    possessor = re_replace_first(match, pattern, "\\1")
  )
# print_kwic(cd_new, n = 10)
left | match | right | possessed | possessor |
---|---|---|---|---|
re-set/vb the/at effective/jj date/nn so/cs that/cs an/at orderly/jj implementation/nn of/in the/at law/nn may/md be/be effected/vbn ''/'' ./. The/at grand/jj jury/nn took/vbd a/at swipe/nn at/in | the/at State/nn-tl Welfare/nn-tl Department's/nn$-tl handling/nn | of/in federal/jj funds/nns granted/vbn for/in child/nn welfare/nn services/nns in/in foster/jj homes/nns ./. ``/`` This/dt is/bez one/cd of/in the/at major/jj items/nns in/in the/at Fulton/np-tl Co | handling/nn | State/nn-tl Welfare/nn-tl Department's/nn$-tl |
Atlanta/np-tl Bar/nn-tl Association/nn-tl and/cc an/at interim/nn citizens/nns committee/nn ./. ``/`` These/dts actions/nns should/md serve/vb to/to protect/vb in/in fact/nn and/cc in/in effect/nn | the/at court's/nn$ wards/nns | from/in undue/jj costs/nns and/cc its/pp$ appointed/vbn and/cc elected/vbn servants/nns from/in unmeritorious/jj criticisms/nns ''/'' ,/, the/at jury/nn said/vbd ./. Regarding/in Atlanta's/np$ new/ | wards/nns | court's/nn$ |
e/np of/in Griffin/np ./. Attorneys/nns for/in the/at mayor/nn said/vbd that/cs an/at amicable/jj property/nn settlement/nn has/hvz been/ben agreed/vbn upon/rb ./. The/at petition/nn listed/vbd | the/at mayor's/nn$ occupation/nn | as/cs ``/`` attorney/nn ''/'' and/cc his/pp$ age/nn as/cs 71/cd ./. It/pps listed/vbd his/pp$ wife's/nn$ age/nn as/cs 74/cd and/cc place/nn of/in birth/nn as/cs Opelika/np ,/, Ala./np ./. The/at pe | occupation/nn | mayor's/nn$ |
more/ap than/in a/at year/nn ./. The/at Hartsfield/np home/nr is/bez at/in 637/cd E./np Pelham/np Rd./nn-tl Aj/nn ./. Henry/np L./np Bowden/np was/bedz listed/vbn on/in the/at petition/nn as/cs | the/at mayor's/nn$ attorney/nn | ./. Hartsfield/np has/hvz been/ben mayor/nn of/in Atlanta/np ,/, with/in exception/nn of/in one/cd brief/jj interlude/nn ,/, since/in 1937/cd ./. His/pp$ political/jj career/nn goes/vbz back/rb to/ | attorney/nn | mayor's/nn$ |
ith/in exception/nn of/in one/cd brief/jj interlude/nn ,/, since/in 1937/cd ./. His/pp$ political/jj career/nn goes/vbz back/rb to/in his/pp$ election/nn to/in city/nn council/nn in/in 1923/cd ./. | The/at mayor's/nn$ present/jj term/nn | of/in office/nn expires/vbz Jan./np 1/cd ./. He/pps will/md be/be succeeded/vbn by/in Ivan/np Allen/np Jr./np ,/, who/wps became/vbd a/at candidate/nn in/in the/at Sept./np 13/cd primary/nn after/cs M | present/jj term/nn | mayor's/nn$ |
ue/vb the/at new/jj rural/jj roads/nns bonds/nns ./. Schley/np County/nn-tl Rep./nn-tl B./np D./np Pelham/np will/md offer/vb a/at resolution/nn Monday/nr in/in the/at House/nn-tl to/to rescind/vb | the/at body's/nn$ action/nn | of/in Friday/nr in/in voting/vbg itself/ppl a/at $10/nns per/in day/nn increase/nn in/in expense/nn allowances/nns ./. Pelham/np said/vbd Sunday/nr night/nn there/ex was/bedz research/nn being/beg | action/nn | body's/nn$ |
cd <- merge_conc(cd, cd_new)
nrow(cd)
[1] 5207
S-genitive with proper nouns
The fourth query captures the s-genitive variant with proper nouns in the Possessor slot, e.g. Belgium’s rivers.
Regular expression
Again we overwrite pattern with the regular expression for the last query, which is very similar to the third query. The difference, like the difference between the second and the first, is that it asks for proper nouns instead of common nouns in the Possessor slot (lines 4 and 5; it also excludes articles at the beginning) and that it allows for any item with -tl in its part-of-speech tag preceding the Possessor noun (line 3).
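As with the other queries, a possible shape for this expression is sketched below. It is an illustrative reconstruction with hypothetical names, so the line numbers mentioned in the text need not line up with it. It uses a lookbehind at the start, which is an assumption of this sketch, to make sure the match begins at the start of a token.

```r
# Sketch of the s-genitive pattern with a proper-noun Possessor
# (an assumption based on the description, not the original expression)
pattern_sketch <- r"--[(?xi)
  (?<! \S )                             # start at the beginning of a token
  (                                     # group 1: the Possessor
    (?: [^\s/]+ / [^\s$]+ -tl \s+ )*    #   optional items tagged ...-tl
    (?: [^\s/]+ / nps? [^\s$]* \s+ )*   #   optional proper nouns
    [^\s/]+ / nps? \$ [^\s]*            #   a proper noun with $ in its tag
  )
  \s+
  (                                     # group 2: the Possessed
    (?: [^\s/]+ / jj [^\s]* \s+ )*      #   optional adjectives
    [^\s/]+ / nns? [^\s$]*              #   a common noun without $
    (?: \s+ [^\s/]+ / nns? [^\s$]* )*   #   optionally more common nouns
  )
  (?! \S )
]--"

toy <- "to/to work/vb for/in Lt./nn-tl Gov./nn-tl Garland/np Byrd's/np$ campaign/nn ./."
hit <- regmatches(toy, regexec(pattern_sketch, toy, perl = TRUE))[[1]]
print(c(possessor = hit[2], possessed = hit[3]))
```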
Code
Again we overwrite the now useless cd_new with the output of the fourth query and assign the appropriate values for the common variables. The possessor_type is now proper, but the rest of the variables take the same values as in the previous query. At the end, we overwrite cd by merging the old cd, which contains the output of the first three queries, and cd_new.
cd_new <- brown_fnames %>%
  conc(pattern) %>%
  mutate(
    gen_type = "s",
    possessor_type = "proper",
    possessed = re_replace_first(match, pattern, "\\2"),
    possessor = re_replace_first(match, pattern, "\\1")
  )
# print_kwic(cd_new, n = 10)
left | match | right | possessed | possessor |
---|---|---|---|---|
The/at Fulton/np-tl County/nn-tl Grand/jj-tl Jury/nn-tl said/vbd Friday/nr an/at investigation/nn of/in | Atlanta's/np$ recent/jj primary/nn election/nn | produced/vbd ``/`` no/at evidence/nn ''/'' that/cs any/dti irregularities/nns took/vbd place/nn ./. The/at jury/nn further/rbr said/vbd in/in term-end/nn presentments/nns that/cs the/at City/nn-tl | recent/jj primary/nn election/nn | Atlanta's/np$ |
nterest/nn in/in the/at election/nn ,/, the/at number/nn of/in voters/nns and/cc the/at size/nn of/in this/dt city/nn ''/'' ./. The/at jury/nn said/vbd it/pps did/dod find/vb that/cs many/ap of/in | Georgia's/np$ registration/nn | and/cc election/nn laws/nns ``/`` are/ber outmoded/jj or/cc inadequate/jj and/cc often/rb ambiguous/jj ''/'' ./. It/pps recommended/vbd that/cs Fulton/np legislators/nns act/vb ``/`` to/to have/hv | registration/nn | Georgia's/np$ |
at result/nn of/in city/nn personnel/nns policies/nns ''/'' ./. It/pps urged/vbd that/cs the/at city/nn ``/`` take/vb steps/nns to/to remedy/vb ''/'' this/dt problem/nn ./. Implementation/nn of/in | Georgia's/np$ automobile/nn title/nn law/nn | was/bedz also/rb recommended/vbn by/in the/at outgoing/jj jury/nn ./. It/pps urged/vbd that/cs the/at next/ap Legislature/nn-tl ``/`` provide/vb enabling/vbg funds/nns and/cc re-set/vb the/at effec | automobile/nn title/nn law/nn | Georgia's/np$ |
t's/nn$ wards/nns from/in undue/jj costs/nns and/cc its/pp$ appointed/vbn and/cc elected/vbn servants/nns from/in unmeritorious/jj criticisms/nns ''/'' ,/, the/at jury/nn said/vbd ./. Regarding/in | Atlanta's/np$ new/jj multi-million-dollar/jj airport/nn | ,/, the/at jury/nn recommended/vbd ``/`` that/cs when/wrb the/at new/jj management/nn takes/vbz charge/nn Jan./np 1/cd the/at airport/nn be/be operated/vbn in/in a/at manner/nn that/wps will/md elimin | new/jj multi-million-dollar/jj airport/nn | Atlanta's/np$ |
/nn opposes/vbz in/in its/pp$ platform/nn ./. Sam/np Caldwell/np ,/, State/nn-tl Highway/nn-tl Department/nn-tl public/jj relations/nns director/nn ,/, resigned/vbd Tuesday/nr to/to work/vb for/in | Lt./nn-tl Gov./nn-tl Garland/np Byrd's/np$ campaign/nn | ./. Caldwell's/np$ resignation/nn had/hvd been/ben expected/vbn for/in some/dti time/nn ./. He/pps will/md be/be succeeded/vbn by/in Rob/np Ledford/np of/in Gainesville/np ,/, who/wps has/hvz been/ | campaign/nn | Lt./nn-tl Gov./nn-tl Garland/np Byrd's/np$ |
ll/np ,/, State/nn-tl Highway/nn-tl Department/nn-tl public/jj relations/nns director/nn ,/, resigned/vbd Tuesday/nr to/to work/vb for/in Lt./nn-tl Gov./nn-tl Garland/np Byrd's/np$ campaign/nn ./. | Caldwell's/np$ resignation/nn | had/hvd been/ben expected/vbn for/in some/dti time/nn ./. He/pps will/md be/be succeeded/vbn by/in Rob/np Ledford/np of/in Gainesville/np ,/, who/wps has/hvz been/ben an/at assistant/nn more/ap than/i | resignation/nn | Caldwell's/np$ |
cd <- merge_conc(cd, cd_new)
nrow(cd)
[1] 7344
Automatic annotation
We now have a concordance cd with 7344 observations matching four different patterns. We can get a quick overview of the size of each subset with xtabs(). Here we can see that the most common pattern is the of genitive with a common noun in the Possessor slot, but also that within the s variant the proper noun is more common.
xtabs(~ gen_type +
possessor_type,
data = cd)
possessor_type
gen_type common proper
of 3151 1006
s 1050 2137
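The comparison within each variant is easier to read as row proportions. The sketch below rebuilds the table from the counts printed above, so that it runs without the corpus, and applies base R's prop.table(). The object names are illustrative.

```r
# Counts copied from the xtabs() output above
counts <- matrix(
  c(3151, 1006,
    1050, 2137),
  nrow = 2, byrow = TRUE,
  dimnames = list(gen_type = c("of", "s"),
                  possessor_type = c("common", "proper"))
)

# Row proportions: share of each possessor type within each variant
row_props <- prop.table(counts, margin = 1)
print(round(row_props, 3))
```

With the real concordance, the result of the xtabs() call could be passed to prop.table() directly.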
Before saving our concordance, we can add some automatic annotation. The following dplyr::mutate() call adds or manipulates columns in different ways:
cd <- cd %>% mutate(
  possessed = re_replace_all(possessed,
                             r"--[(?xi) ([^\s/]+) / [^\s]+ ]--",
                             "\\1") %>%
    tolower(),
  possessor = re_replace_all(possessor,
                             r"--[(?xi) ([^\s/]+?) ('s)? / [^\s]+ ]--",
                             "\\1") %>%
    tolower(),
  comp = re_retrieve_first(source, ".(.)..$", requested_group = 1),
  left_tagged = left,
  left = re_replace_all(left_tagged,
                        r"--[(?xi) ([^\s/]+) / [^\s]+ ]--",
                        "\\1"),
  match_tagged = match,
  match = re_replace_all(match_tagged,
                         r"--[(?xi) ([^\s/]+) / [^\s]+ ]--",
                         "\\1"),
  right_tagged = right,
  right = re_replace_all(right_tagged,
                         r"--[(?xi) ([^\s/]+) / [^\s]+ ]--",
                         "\\1"),
  size_possessed = log(nchar(possessed)),
  size_possessor = log(nchar(possessor)),
  size_diff = size_possessor - size_possessed
)
- Lines 2-4 remove the part-of-speech tags from the text in the possessed column, overwriting it. They do so by replacing all matches of the regular expression in line 3 as found in the text in line 2, i.e. the content of the possessed column, with the match of the first capturing group. The regular expression captures sequences of characters that are neither whitespace nor slashes, followed by a slash and then by another sequence of non-whitespace characters, e.g. the/at and example/nn. The capturing group surrounds the first sequence before the slash, and therefore the result would replace the/at example/nn with the example.
- Line 5 converts the new contents of possessed to lowercase, which is useful if you want to use the column as a random effect in regression analysis.
- Lines 6-8 do the same as lines 2-4 but for the possessor column, and as such also exclude the optional ’s sequence by leaving it out of the capturing group. Line 9 does the same as line 5, for the new contents of possessor.
- Line 10 creates a column comp that reads the column source, where the names of the files are stored, and extracts the third to last character by using re_retrieve_first() with the pattern .(.)..$ and specifying that we want the first captured group. This character corresponds to the genre assignment of the file and could be used in modelling.
- Lines 12-14, 16-18 and 20-22 do the same as lines 2-4 for the columns left, match and right respectively. However, since we might want to keep the original, tagged text as well, lines 11, 15 and 19 store the original values of these columns in new columns, left_tagged, match_tagged and right_tagged respectively.
- Lines 23 and 24 count the number of characters in each row of possessed and possessor with nchar() and then take their logarithm with log(). For example, for a word such as example, nchar("example") returns 7, and log(nchar("example")) returns 1.946; for the example, nchar("the example") returns 11, and log(nchar("the example")) returns 2.398. You can retrieve the original length by applying exp(): exp(log(nchar("the example"))) returns 11. Line 25 computes the difference between the logged size of the possessor and that of the possessed, giving us positive numbers when the possessor is longer and negative otherwise.
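The effect of these regular expressions can be checked on a couple of strings taken from the tables above. The sketch uses base R's gsub() and regexec() as stand-ins for re_replace_all() and re_retrieve_first(), and the variable names are illustrative.

```r
# The two regular expressions used in the mutate() call
tag_re <- r"--[(?xi) ([^\s/]+) / [^\s]+ ]--"
possessor_re <- r"--[(?xi) ([^\s/]+?) ('s)? / [^\s]+ ]--"

possessed_tagged <- "present/jj term/nn"
possessor_tagged <- "State/nn-tl Welfare/nn-tl Department's/nn$-tl"

# Keep only the word forms and convert them to lowercase
possessed <- tolower(gsub(tag_re, "\\1", possessed_tagged, perl = TRUE))
possessor <- tolower(gsub(possessor_re, "\\1", possessor_tagged, perl = TRUE))

# Third-to-last character of a file name: the genre letter
comp <- regmatches("ca01", regexec(".(.)..$", "ca01"))[[1]][2]

print(c(possessed = possessed, possessor = possessor, comp = comp))
```

The lazy quantifier in possessor_re is what lets the optional ’s fall outside the capturing group.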
re_replace_all() and re_retrieve_first() seem to be doing the same thing: they match a regular expression to a piece of text and extract the contents of the first captured group. The difference is that re_replace_all() will replace all matches in a piece of text with the contents of the replacement string: if there are multiple matches, because there are multiple words, it will replace each group of wordform/pos with wordform. re_retrieve_first(), instead, will extract the first match of the pattern in the text, regardless of whether there are capturing groups; specifying requested_group = 1 narrows down the return value to the match of the first capturing group in the match.
Citing examples in markdown
With this new version of the concordance, which you can explore with print_kwic(cd), View(cd) or explore(cd), you can also extract full examples. If you use R Markdown or Quarto to write a report, rather than copy-pasting an example to describe, you can join the new left, match and right columns and call a specific example to illustrate.
In the following code:
- The call to mutate() and short_names() shortens the names of the files.
- The calls to unite() create a column with the name of the first argument (conc_line or id) by joining the rest of the columns mentioned (left, match and right, or source and id). By default they are separated with an underscore, but the sep argument allows you to use a different separator.
- The call to select() extracts only the id and conc_line columns. Afterwards, deframe() turns the two-column tibble into a named character vector: the values are the contexts and their names, the ids.
examples <- cd %>%
  as_tibble() %>%
  mutate(source = short_names(source)) %>%
  unite(conc_line, left, match, right, sep = " ") %>%
  unite(id, source, id) %>%
  select(id, conc_line) %>%
  deframe()
head(examples)
ca01_1
"ns million worth of highway reconstruction bonds . The bond issue will go to the state courts for a friendly test suit to test the validity of the act , and then the sales will begin and contracts let for repair work on some of Georgia's most heavily traveled highways . A H"
ca02_2
"servative '' his estimate that it would produce 17 million dollars to help erase an anticipated deficit of 63 million dollars at the end of the current fiscal year next Aug. 31 . He told the committee the measure would merely provide means of enforcing the escheat law which has been on"
ca02_3
"over also would require junior-senior high teachers to have at least 24 semester hours credit in the subject they are teaching . The remainder of the 4-year college requirement would be in general subjects . `` A person with a master's degree in physics , chemistry , math or English , yet who has n"
ca02_4
"/np Clark of Hays , Kan. as the school's new president . Dr. Clark will succeed Dr. J. R. McLemore , who will retire at the close of the present school term . Dr. Clark holds an earned Doctor of Education degree from the University of Oklahoma . He also received a Master"
ca03_5
"d jury room '' . He said this constituted a `` very serious misuse '' of the Criminal court processes . `` Actually , the abuse of the process may have constituted a contempt of the Criminal court of Cook county , altho vindication of the authority of that court is n"
ca03_6
"at 21st and 28th precincts of the 29th ward , the 18th precinct of the 4th ward , and the 9th precinct of the 23d ward . The case of the judges in the 58th precinct of the 23d ward had been heard previously and taken under advisement by Karns . Two other cases also were/"
examples[["ca01_1"]]
[1] "ns million worth of highway reconstruction bonds . The bond issue will go to the state courts for a friendly test suit to test the validity of the act , and then the sales will begin and contracts let for repair work on some of Georgia's most heavily traveled highways . A H"
Then you could write, in an empty line of your report, a numbered example with R inline code; this lets you easily cross-reference your example with (@label), which returns (1).
(@label) `r examples[["ca01_1"]]`
(1) ns million worth of highway reconstruction bonds . The bond issue will go to the state courts for a friendly test suit to test the validity of the act , and then the sales will begin and contracts let for repair work on some of Georgia’s most heavily traveled highways . A H
In any case, you probably want to edit your example a bit: removing unnecessary text from the extremes, editing sections to include italics or bold or even, if your corpus is in a language other than English, adding glosses and translations to print with glossr.
Save your progress and be free!
We are DONE. Congratulations! We can now store the output in a tab-separated file with write_conc(). Then you can open the file with any spreadsheet software for manual cleaning and/or annotation and then read it again from R with read_conc(filename), as we will do in the second part of this study.
write_conc(cd, file.path(data_folder, "genitive-alternation.tab"))
Footnotes
The ’s is part of the token it’s attached to.