Methods of Corpus Linguistics (class 2)
usethis::use_git()
Initialize project
This only needs to be done once at the beginning.
Initialize project
If you make changes on the remote, use git pull
before making changes in the local repo.
Avoid .RData with Tools > Global Options > General > Workspace/History
You can work with Git(Hub) on the Git tab of RStudio or on the Git Bash Terminal
Initialize project
Optionally, you can start “new work” on a new branch and then merge it to main
.
Add corpus
Download the corpora from Toledo (mcl.zip
file with various corpora) and copy/move the brown
folder to your project. There are different options.
To be accessed with here::here("brown")
.
project
|_brown
|_project.Rproj
|_.gitignore
To be accessed with here::here("corpus", "brown")
.
project
|_corpus
| \_brown
|_project.Rproj
|_.gitignore
To be accessed with here::here("data", "corpus", "brown")
.
project
|_data
| \_corpus
| \_brown
|_project.Rproj
|_.gitignore
Add corpus
We don’t want to track the corpus on git (because of size and licenses).
Open .gitignore
.
Add a line for the folder to ignore, e.g. /brown/
.
(Check git status
!)
Add corpus
Easier to run the code again and to share it.
Load packages first.
Do not change your working directory; use here()
or reliable relative paths.
Explore corpus
Explore corpus
In the console:
path_to_corpus
= print(path_to_corpus)
print(brown_fnames, hide_path = path_to_corpus)
explore(brown_fnames)
drop_re(brown_fnames, "/c[a-z]")
Explore corpus
Create it on the script, inspect in the console (or from the script).
Explore corpus
Frequency list (types in list: 63517, tokens in list: 1162192)
rank type abs_freq nrm_freq
---- ------ -------- --------
1 the/at 69013 593.818
2 ,/, 58153 500.373
3 ./. 48812 419.999
...
[1] 1162192
[1] 63517
Explore corpus
Explore corpus
Explore corpus
Explore corpus
It might be in a README file, online, as a paper…
What time period(s) is/are covered?
What genre(s)? Language varieties?
Written? Transcripts of oral texts?
Is it a monitor corpus?
Describe corpus
Create basic Quarto document
Set meta data on the YAML choosing output
Optional: render to check it’s working
Remove current text and write your own
Describe corpus
Inline code surrounded by backticks and starting with “r”.
Code chunks: to run arbitrary code, create tables and plots, print glosses with {glossr}.
Read external scripts with source()
or with the code
or file
chunk options.
Describe corpus
The Brown corpus used in this project has `r prettyNum(n_tokens(flist))` tokens
and `r n_types(flist)` types,
giving us a type-token ratio of `r round(n_types(flist)/n_tokens(flist), 2)`.
The Brown corpus used in this project has 1162192 tokens and 63517 types, giving us a type-token ratio of 0.05.
Describe corpus
Frequency list (types in list: 5, tokens in list: 239548)
<total number of tokens: 1162192>
rank orig_rank type abs_freq nrm_freq
---- --------- ------ -------- --------
1 1 the/at 69013 593.818
2 2 ,/, 58153 500.373
3 3 ./. 48812 419.999
4 4 of/in 35028 301.396
5 5 and/cc 28542 245.588
Describe corpus
knitr::kable()
rank | orig_rank | type | abs_freq | nrm_freq |
---|---|---|---|---|
1 | 1 | the/at | 69013 | 593.8175 |
2 | 2 | ,/, | 58153 | 500.3734 |
3 | 3 | ./. | 48812 | 419.9994 |
4 | 4 | of/in | 35028 | 301.3960 |
5 | 5 | and/cc | 28542 | 245.5876 |
Describe corpus
kableExtra::kbl()
rank | orig_rank | type | abs_freq | nrm_freq |
---|---|---|---|---|
1 | 1 | the/at | 69013 | 593.8175 |
2 | 2 | ,/, | 58153 | 500.3734 |
3 | 3 | ./. | 48812 | 419.9994 |
4 | 4 | of/in | 35028 | 301.3960 |
5 | 5 | and/cc | 28542 | 245.5876 |
Describe corpus
In both cases, you might want to use the include: false
chunk option to avoid printing neither the code itself or its output.
Describe corpus
If you were in your explore-corpus
branch and want to bring changes to main
Describe corpus