Tibble manipulation

Third assignment

Author

Mariana Montes

The goal of this assignment is to practice working with tables using tidyverse. Tidyverse is a collection of R packages for data wrangling and visualization (among other things). A great resource to learn how to use it is R for data Science.

For this assignment I ask that you create both a script to manipulate the table and then a Quarto file to print and cross-reference the table (combining what you learned in the second assignment).

1 Instructions

Create a branch for the assignment, e.g. tibble. You will work here and only move your changes to main if and when you want to submit.
Create an R script where you will insert the necessary code to do the following:
1. Load the appropriate libraries (tidyverse and mclm).
2. Read the brown corpus.
3. Create an association scores table of the collocations of a word of your choosing.
4. Save the file.
  1. OPTION A: Manipulate the table as done in class: turn it into a tibble with as_tibble(), modify columns, select some columns to show, filter the rows, rearrange the order. Then write it to a file with write_tsv().
  2. OPTION B: Save the association scores object to a file with write_assoc().
Create a Quarto report where you will only load the {mclm} and {kableExtra} packages.
1. Read the association scores object:¹
  1. OPTION A: If you wrote it with write_tsv(), use read_tsv().
  2. OPTION B: If you wrote it with write_assoc(), you can either use read_tsv() or read_assoc() followed by as_tibble().
2. If you hadn’t manipulated the table, this is the time to do so.
3. Print the table with {kableExtra}, editing it as well if you so wish. Don’t forget to add a caption!
4. Include some text cross-referencing the table and maybe commenting on the result.

Render Latex in table header

If you try to use Latex in a table, for example r"($\chi^2$)" to obtain $\chi^2$, you might notice that it prints well in the interactive session but not in the rendered HTML document. This is (I think) a bug somewhere in the rendering of tables, but there is a workaround:

Somewhere in your Quarto document, paste the following text (as normal text, not as R code):

<script type="text/x-mathjax-config">MathJax.Hub.Config({tex2jax: {inlineMath: [["$","$"]]}})</script>
<script async src="https://mathjax.rstudio.com/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML"></script>

This will activate parsing of Latex inside the HTML tables.

In PDF output instead you just have to add the option escape = FALSE to your kbl() call. Notice, however, that you should not have unescaped Latex characters in other elements of the table! (No underscores, for example).

2 Tips

2.1 Manipulating the table

Use mutate() to change the values of a column.
Use filter() to subset the rows based on values in the columns. You can also use the slice_ family of functions to subset with other criteria:
- slice_head(n = 3) to select the first three rows; slice_tail(n = 5) to select the last five rows.
- slice_sample(n = 10) to select ten random rows, slice_sample(prop = 0.5) to select a random 50% of the rows.
Use select() to subset the columns. You can also use rename() to rename columns without removing the rest.
Use arrange() to sort the tibble based on the values in a column.

Example

An example is the code below, which starts with an assoc object (product of assoc_scores()) and ends with a tibble with a selection of rows and columns.

subsetted <- hot_assoc %>% 
  as_tibble() %>% 
  filter(PMI > 1, G_signed >= 5, endsWith(type, "nn")) %>% 
  select(type, freq = a, PMI, ends_with("signed"), OR) %>% 
  mutate(
    log_OR = log(OR),
    type = str_remove(type, "/nn")
  )

In line 2 we use as_tibble() to turn the assoc object into a tibble to manipulate with {tidyverse} functions.
In line 3 we use filter() to subset the rows that have PMI larger than 1, G_signed larger than or equal to 5, and a type ending with “nn”, i.e. nouns.
In line 4 we use select() to subset the columns type, a, PMI and OR as well as those ending in “signed” and at the same time rename a to “freq”.
In lines 5 through 8 we use mutate() to create a new column log_OR that contains the logarithm of the OR column, and we modify the type column to remove the “/nn” ending from its elements.
In line 1 we assign the whole operation, initially applied to hot_assoc, to a variable called subsetted.

Each operation of the pipe acts on the output of the operation before it.

2.2 Association scores

Use assoc_scores() after surf_cooc() to create an association scores object.
Use write_assoc(scores_object, filename) to save the object from your R script.
Use scores_object <- read_assoc(filename) to read the object in the Quarto file.

2.3 KableExtra

Check out the documentation for HTML or PDF output to learn about {kableExtra} features.

3 Git workflow

git status # check that you're on main, nothing to commit...
git branch tibble
git checkout tibble
# work on your .qmd file, render
git status # check everything is fine
git add .
git commit -m "practice with tibbles"
# you may also make several commits as you add a figure, a table...
git checkout main
git status # check everything is fine. New files should not be there
git merge tibble
# Now the .qmd file, the rendered file and the help files should be present
git push
# and send me a message!

Footnotes

If you use read_tsv(), the show_col_types = FALSE argument will hide the printed output with the description of the column types, e.g. my_data <- read_tsv("filepath", show_col_types = FALSE).↩︎