Foundations of programming and data analysis using R

Learning Objectives

Introduction to tidyverse
Introduction to Quarto

In-class assignment from Lecture 1

Draft a study protocol
- 2 slides max including the following:
  - Study objective
  - Primary endpoint
  - Secondary endpoint(s)
  - Design, observation window
  - Power
  - Data
  - Statistical analysis (with the knowledge you have now; if you have no idea, we can discuss and revisit)
  - Sex- and gender-based analysis considerations
  - Comment on feasibility and problems foreseen

Resources for sex- and gender-based analysis

SABV in Biomedicine Checklist

Cornelison, T. L., & Clayton, J. A. (2017). Considering Sex as a Biological Variable in Biomedical Research. Gender and the Genome, 1(2), 89-93. Reproduced by permission of the authors. (Cornelison and Clayton 2017)

Guidelines for the Analysis of Gender and Health by Liverpool School of Tropical Medicine Gender and Health Group

“… hypotheses need to be specific about whether the question is being asked in relation to men and women or only one sex - if the trial is carried out only on one sex, it needs to be made clear that the findings may only be applicable to that sex…”

This tutorial is prepared using the following references

Tidyverse Skills for Data Science, https://jhudatascience.org/tidyversecourse/
Alexander, R. (2023). Telling Stories with Data: With Applications in R. CRC Press. (Alexander 2023)
Wickham, H., Çetinkaya-Rundel, M., & Grolemund, G. (2023). R for Data Science. O’Reilly Media, Inc. (Wickham, Çetinkaya-Rundel, and Grolemund 2023)
Quarto guide, https://quarto.org/docs/computations/r.html

1. Introduction to tidyverse

What is the Tidyverse?

The tidyverse consists of a few key packages:

dplyr: data manipulation
ggplot2: data visualization
tibble: tibbles, a modern re-imagining of data frames
tidyr: data tidying
readr: data import
purrr: functional programming, e.g. alternate approaches to apply

1.1 Data import

For this course, we will focus on the two most common data file types: CSV (comma-separated values) and Excel.

Excel files

We can use the function read_excel() in the package readxl to read Excel files (both .xls and .xlsx).

##install.packages("readxl")
library(readxl)
args(read_excel)

function (path, sheet = NULL, range = NULL, col_names = TRUE, 
    col_types = NULL, na = "", trim_ws = TRUE, skip = 0, n_max = Inf, 
    guess_max = min(1000, n_max), progress = readxl_progress(), 
    .name_repair = "unique") 
NULL

# read Excel file into R
df_excel <- read_excel("myspreadsheet.xlsx")

A few features of read_excel():
- converts blank cells to missing data (NA).
- sheet: specifies the sheet to read from the workbook, either by name (string) or by position (integer).
- col_names: specifies whether the first row of the spreadsheet should be used as column names (default: TRUE). Additionally, if a character vector is passed, this will rename the columns explicitly at time of import.
- skip: specifies the number of rows to skip before reading information from the file into R.
  - Often blank rows or information about the data are stored at the top of the spreadsheet that you want R to ignore.
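
As a minimal sketch of these arguments, the example below reads one sheet from the example workbook that ships with readxl, located with readxl_example(). Using this bundled workbook instead of a course file is an assumption made so that the code runs anywhere; the object names are illustrative only.

```r
library(readxl)

# Path to an example workbook bundled with the readxl package
xlsx_path <- readxl_example("datasets.xlsx")

# List the sheet names available in the workbook
excel_sheets(xlsx_path)

# Read the sheet named "mtcars"; the first row supplies the column names
# (an integer such as sheet = 2 would select a sheet by position instead)
cars_named <- read_excel(xlsx_path, sheet = "mtcars", n_max = 5)

# Skip the header row and treat every remaining row as data;
# readxl then generates placeholder column names
cars_raw <- read_excel(xlsx_path, sheet = "mtcars",
                       skip = 1, col_names = FALSE, n_max = 5)

cars_named
cars_raw
```

Comparing the two results shows the effect of skip and col_names: the values are the same, but only the first call keeps the original column names.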

Comma-separated values (CSV) files

We can use the function read_csv() in tidyverse (from the readr package) to read CSV files.

##install.packages("tidyverse")
library(tidyverse)
args(read_csv)

function (file, col_names = TRUE, col_types = NULL, col_select = NULL, 
    id = NULL, locale = default_locale(), na = c("", "NA"), quoted_na = TRUE, 
    quote = "\"", comment = "", trim_ws = TRUE, skip = 0, n_max = Inf, 
    guess_max = min(1000, n_max), name_repair = "unique", num_threads = readr_threads(), 
    progress = show_progress(), show_col_types = should_show_types(), 
    skip_empty_rows = TRUE, lazy = should_read_lazy()) 
NULL

By default, read_csv() converts blank cells to missing data (NA).
col_names = FALSE to specify that the first row does NOT contain column names.
skip = 1 will skip the first row. You can set the number to any number you want.
n_max = 100 will only read in the first 100 rows. You can set the number to any number you want.
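
The options above can be tried on a small toy file. The sketch below writes a few made-up lines to a temporary CSV (the values and object names are assumptions for illustration, not course data) and reads it back with different settings.

```r
library(readr)

# Write a small toy CSV to a temporary file (made-up values)
csv_path <- tempfile(fileext = ".csv")
writeLines(
  c("Toy note line to skip",
    "year,month,average",
    "2022,1,417.1",
    "2022,2,",
    "2022,3,418.3"),
  csv_path
)

# skip = 1 ignores the note line; the next line supplies the column names
toy <- read_csv(csv_path, skip = 1, show_col_types = FALSE)
toy$average   # the blank cell in the second data row is read as NA

# n_max = 2 reads only the first two data rows
toy_first2 <- read_csv(csv_path, skip = 1, n_max = 2, show_col_types = FALSE)

# col_names = FALSE treats every remaining line as data;
# readr names the columns X1, X2, X3
toy_noheader <- read_csv(csv_path, skip = 2, col_names = FALSE,
                         show_col_types = FALSE)
```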

Other packages for data import:
- package readr reads delimited text files (txt, csv, tsv) and RDS files; RData (or rda) files are loaded with base R's load().
- package haven reads SPSS, Stata, and SAS files.

For this tutorial, we will be working with the co2_mm_gl_clean.csv dataset.
The dataset contains monthly globally averaged CO2 records between 1979 and 2022.
Lan, X., Tans, P. and K.W. Thoning: Trends in globally-averaged CO2 determined from NOAA Global Monitoring Laboratory measurements. Version 2023-01 NOAA/GML (https://gml.noaa.gov/ccgg/trends/)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message= FALSE, fig.align="center")
options(scipen = 999, pillar.print_max = Inf)
co2 <- read_csv(file = "data/co2_mm_gl_clean.csv")

Looking at the data

co2	Look at the whole data frame
head(co2)	Look at the first few rows
tail(co2)	Look at the last few rows
colnames(co2)	Names of the columns in the data frame
attributes(co2)	Attributes of the data frame
dim(co2)	Dimensions of the data frame
ncol(co2)	Number of columns
nrow(co2)	Number of rows
summary(co2)	Summary statistics
str(co2)	Structure of the data frame

library(DT)
# look at the data;
datatable(co2,
          rownames = FALSE,
          options = list(dom = 't'))

1.2 Data manipulation with tidyverse (a crash introduction)

Pipe operator %>%
Pipes are operators that send what comes before the pipe to what comes after, as the first argument of the next function.
They are used frequently in the tidyverse!
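
A minimal sketch of what the pipe does, using a made-up numeric vector: the piped version and the nested version compute the same thing, but the piped one reads from left to right.

```r
library(dplyr)  # attaching dplyr makes the %>% pipe available

x <- c(1, 4, 9)

# Nested call: read from the inside out
nested <- sum(sqrt(x))

# Piped call: each result is passed on as the first argument of the next function
piped <- x %>%
  sqrt() %>%
  sum()

identical(nested, piped)
```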

Selecting Columns

Example: using %>% to subset data
- select() in dplyr subsets columns by name
- filter() in dplyr subsets rows using column values

# selecting columns;
co2 %>%
  select(year, month, average) %>%
  head()

# drop some variables;
co2 %>%
  select(-month) %>%
  head()

Sometimes, we have a lot of variables to select, and if they have a common naming scheme, this can be very easy.

co2 %>%
  select(contains("bound")) %>%
  head()

Other helpful functions to be used within select()
- starts_with: starts with a prefix
- ends_with: ends with a suffix
- contains: contains a literal string
- matches: matches a regular expression
- num_range: a numerical range like wk1, wk2, wk3.
  - select(num_range("wk", 1:3))
- everything: all variables.
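
The helpers listed above can be sketched on a toy data frame. The column names below are made up purely so that each helper has something to match.

```r
library(dplyr)

# Toy data frame with made-up column names and values, for illustration only
toy <- tibble(
  id       = 1:2,
  wk1      = c(10, 11),
  wk2      = c(12, 13),
  wk3      = c(14, 15),
  bp_start = c(120, 118),
  bp_end   = c(115, 117)
)

toy %>% select(starts_with("bp"))      # bp_start, bp_end
toy %>% select(ends_with("end"))       # bp_end
toy %>% select(contains("k"))          # wk1, wk2, wk3
toy %>% select(matches("^wk[0-9]$"))   # wk1, wk2, wk3
toy %>% select(num_range("wk", 1:2))   # wk1, wk2
toy %>% select(id, everything())       # all columns, with id first
```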

Selecting rows

# selecting observations in year 2022;
co2 %>%
  select(year, month, average) %>%
  filter(year == 2022)

# A tibble: 10 × 3
    year month average
   <dbl> <dbl>   <dbl>
 1  2022     1    417.
 2  2022     2    418.
 3  2022     3    418.
 4  2022     4    418.
 5  2022     5    418.
 6  2022     6    417.
 7  2022     7    416.
 8  2022     8    414.
 9  2022     9    415.
10  2022    10    416.

# selecting observations between 2020 and 2022;
co2 %>%
  select(year, month, average) %>%
  filter(year <= 2022 & year >= 2020)

Creating new variables

Example: using %>% and mutate() to create new variables

co2 %>%
  mutate(lowerbound_2sd = average - 2*stddev,
         upperbound_2sd = average + 2*stddev) %>%
  head()

# A tibble: 6 × 7
   year month decimal average stddev lowerbound_2sd upperbound_2sd
  <dbl> <dbl>   <dbl>   <dbl>  <dbl>          <dbl>          <dbl>
1  1979     1   1979.    337.   0.1            336.           337.
2  1979     2   1979.    337.   0.09           337.           337.
3  1979     3   1979.    338.   0.1            338.           338.
4  1979     4   1979.    338.   0.11           338.           339.
5  1979     5   1979.    338.   0.04           338.           338.
6  1979     6   1979.    337.   0.17           337.           338.

# creating new variables based on conditions of another variable;
# suppose we want to create a year group variable;

co2 %>% 
  mutate(year_group = case_when(
    year < 1980 ~ '1970-1979',
    1980 <= year & year < 1990 ~ '1980-1989',
    1990 <= year & year < 2000 ~ '1990-1999',
    2000 <= year & year < 2010 ~ '2000-2009',
    2010 <= year & year < 2020 ~ '2010-2019',
    2020 <= year & year < 2030 ~ '2020-2029',
  )) %>%
  head()

# A tibble: 6 × 6
   year month decimal average stddev year_group
  <dbl> <dbl>   <dbl>   <dbl>  <dbl> <chr>     
1  1979     1   1979.    337.   0.1  1970-1979 
2  1979     2   1979.    337.   0.09 1970-1979 
3  1979     3   1979.    338.   0.1  1970-1979 
4  1979     4   1979.    338.   0.11 1970-1979 
5  1979     5   1979.    338.   0.04 1970-1979 
6  1979     6   1979.    337.   0.17 1970-1979

co2 <- co2 %>% #updating the data object
  mutate(year_group = case_when(
    year < 1980 ~ '1970-1979',
    1980 <= year & year < 1990 ~ '1980-1989',
    1990 <= year & year < 2000 ~ '1990-1999',
    2000 <= year & year < 2010 ~ '2000-2009',
    2010 <= year & year < 2020 ~ '2010-2019',
    2020 <= year & year < 2030 ~ '2020-2029',
  ))

Grouping and Summarizing Data

We can use group_by() and summarize() to calculate group-based statistics.
Example: suppose we want to calculate the mean, minimum, and maximum CO2 by year (aggregated over months).

co2 %>%
  select(year, average) %>%
  group_by(year) %>%
  summarise(
    `mean co2 by month` = mean(average),
    `min co2 by month` = min(average),
    `max co2 by month` = max(average)
  )

# A tibble: 44 × 4
    year `mean co2 by month` `min co2 by month` `max co2 by month`
   <dbl>               <dbl>              <dbl>              <dbl>
 1  1979                337.               334.               338.
 2  1980                339.               337.               340.
 3  1981                340.               338.               342.
 4  1982                341.               338.               343.
 5  1983                343.               341.               344.
 6  1984                344.               342.               345.
 7  1985                346.               343.               347.
 8  1986                347.               345.               348.
 9  1987                349.               347.               350.
10  1988                351.               349.               352.
11  1989                353.               350.               354.
12  1990                354.               352.               356.
13  1991                355.               353.               357.
14  1992                356.               354.               358.
15  1993                357.               354.               358.
16  1994                358.               356.               360.
17  1995                360.               358.               362.
18  1996                362.               360.               363.
19  1997                363.               360.               365.
20  1998                366.               364.               367.
21  1999                368.               365.               369.
22  2000                369.               367.               370.
23  2001                371.               368.               372.
24  2002                373.               370.               374.
25  2003                375.               373.               377.
26  2004                377.               374.               378.
27  2005                379.               377.               381.
28  2006                381.               378.               383.
29  2007                383.               380.               384.
30  2008                385.               383.               387.
31  2009                386.               384.               388.
32  2010                389.               386.               390.
33  2011                391.               388.               392.
34  2012                393.               390.               394.
35  2013                395.               393.               397.
36  2014                397.               395.               399.
37  2015                400.               397.               402.
38  2016                403.               401.               405.
39  2017                405.               403.               407.
40  2018                408.               405.               409.
41  2019                410.               408.               412.
42  2020                412.               410.               414.
43  2021                415.               412.               417.
44  2022                417.               414.               418.

We observe an increasing trend in global CO2 concentration over the years.

Renaming columns

#example syntax;
data %>%
  rename(new_name = oldname,
         new_name2 = oldname2)
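
A runnable version of this syntax, sketched on a toy data frame with made-up column names and values:

```r
library(dplyr)

# Toy data frame, for illustration only
toy <- tibble(yr = c(2021, 2022), avg = c(415, 417))

# rename() takes new_name = old_name and leaves all other columns untouched
toy_renamed <- toy %>%
  rename(year = yr,
         average = avg)

colnames(toy_renamed)
```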

Reshaping datasets - wide vs long data

The R cheatsheet on reshaping data covers the pivot functions. To reshape from wide to long, pivot_longer() requires the following arguments:
- a data frame
- cols: names of the columns we wish to gather
- names_to: name of the new column that will hold the gathered column names
- values_to: name of the new column containing the variable values

Going the other way, suppose we want to reshape the long co2 data to wide data with months 1 to 12 as columns, using pivot_wider():

co2_wide <- co2 %>%
  select(year, month, average) %>%
  pivot_wider(names_from = month,
              names_prefix = "mth",
              values_from = average)

head(co2_wide)

# A tibble: 6 × 13
   year  mth1  mth2  mth3  mth4  mth5  mth6  mth7  mth8  mth9 mth10 mth11 mth12
  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1  1979  337.  337.  338.  338.  338.  337.  336.  334.  335.  336.  337.  338.
2  1980  339.  339.  340.  340.  340.  340.  338.  337.  337.  338.  339.  340.
3  1981  340.  341.  341.  342.  341.  341.  339.  338.  338.  339.  340.  341.
4  1982  341.  342.  342.  343.  342.  341.  340.  338.  338.  340.  341.  342.
5  1983  342.  343.  343.  344.  344.  344.  342.  341.  341.  342.  343.  343.
6  1984  344.  345.  345.  345.  345.  345.  343.  342.  342.  343.  344.  345.
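
The reverse operation, from wide back to long, uses pivot_longer() with the cols, names_to, and values_to arguments. The sketch below uses a small made-up wide table shaped like co2_wide, not the real data; names_prefix is used to strip the "mth" prefix.

```r
library(dplyr)
library(tidyr)

# Toy wide table with made-up values: one row per year, one column per month
toy_wide <- tibble(
  year = c(2021, 2022),
  mth1 = c(415, 417),
  mth2 = c(416, 418)
)

toy_long <- toy_wide %>%
  pivot_longer(cols = starts_with("mth"),  # columns to gather
               names_to = "month",         # new column holding the old column names
               names_prefix = "mth",       # drop the prefix so only the month number remains
               values_to = "average") %>%  # new column holding the values
  mutate(month = as.numeric(month))        # names arrive as text, so convert to numbers

toy_long
```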

Merging data frames

We can also join two or more data frames together.
- left_join()
  - keeps all the entries that are present in the left (first) table and excludes any that are only in the right table
- right_join()
  - keeps all the entries that are present in the right table and excludes any that are only in the left table.
- inner_join()
  - keeps only the entries that are present in both tables. inner_join is the only function that guarantees you won’t generate any missing entries.
- full_join()
  - keeps all of the entries in both tables, regardless of whether or not they appear in the other table.
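
The four joins can be compared side by side on two toy tables. The table names, the key column id, and all values are made up for illustration.

```r
library(dplyr)

# Toy tables sharing the key column "id"
tbl_left  <- tibble(id = c(1, 2, 3), a = c("a1", "a2", "a3"))
tbl_right <- tibble(id = c(2, 3, 4), b = c("b2", "b3", "b4"))

joined_left  <- left_join(tbl_left, tbl_right, by = "id")   # ids 1, 2, 3; b is NA for id 1
joined_right <- right_join(tbl_left, tbl_right, by = "id")  # ids 2, 3, 4; a is NA for id 4
joined_inner <- inner_join(tbl_left, tbl_right, by = "id")  # ids 2, 3; no NA introduced
joined_full  <- full_join(tbl_left, tbl_right, by = "id")   # ids 1, 2, 3, 4

joined_full
```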

1.3 Data visualization with ggplot2

> Data visualization is key to data story telling.

ggplot2 is a powerful package that enables publication-ready plots
- easy customization (compared with SAS)
- clear syntax and lots of online templates
- lots of extensions for styling

How ggplot2 works, https://codeahoy.com/learn/rtutorial/ch8/

ggplot components (from Wickham, 2009)

layer is a collection of geometric elements and statistical transformations.
aesthetic(aes) is something you can see.
- x, y: variable along the x and y axis
- colour: color of geoms according to data
- fill: the inside color of the geom
- group: what group a geom belongs to
- shape: the figure used to plot a point
- linetype: the type of line used (solid, dashed, etc)
- size: size scaling for an extra dimension
- alpha: the transparency of the geom
geometric elements (geoms), represent what you actually see in the plot: points, lines, polygons, etc.
- geom_area() draws an area plot
- geom_bar(stat="identity") draws a bar chart
- geom_line() draws a line
- geom_point() draws a scatterplot
- geom_rect() draws rectangles
scales map values in the data space to values in the aesthetic space.
- This includes the use of colour, shape or size.
- Scales also draw the legend and axes
coordinate (coord) describes how data coordinates are mapped to the plane of the graphic.
- provides axes and gridlines to help read the graph.
facet specifies how to break up and display subsets of data as small multiples.
theme controls display style, like the font size and background colour.
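
To see how these components line up in code, here is a minimal annotated sketch. It uses the built-in mtcars data as a stand-in (an assumption made so the example is self-contained); the particular variables, palette, and axis limits are arbitrary choices.

```r
library(ggplot2)

# mtcars ships with R and is used here only as a stand-in dataset
p <- ggplot(data = mtcars,
            aes(x = wt, y = mpg, colour = factor(cyl))) +   # aesthetics
  geom_point(size = 2, alpha = 0.7) +                       # geom (one layer)
  scale_colour_brewer(palette = "Dark2",
                      name = "Cylinders") +                 # scale, also draws the legend
  coord_cartesian(xlim = c(1, 6)) +                         # coordinate system
  facet_wrap(vars(am)) +                                    # facets (small multiples)
  theme_bw()                                                # theme

p
```

Each line after ggplot() adds one component with +, so a plot can be built up and adjusted one piece at a time.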

Layers

ggplot(aes(x = decimal, y = average), data = co2)

Geometry

ggplot(aes(x = decimal, y = average), data = co2)+
  geom_point() +
  geom_line(color = "blue")

Compare this with the plot provided at https://gml.noaa.gov/ccgg/trends/

Overview of the various geoms

geom_abline: Reference lines: horizontal, vertical, and diagonal
geom_area: Ribbons and area plots
geom_bar: Bar charts
geom_boxplot: A box and whiskers plot
geom_contour: 2d contours of a 3d surface
geom_count: Count overlapping points
geom_crossbar: Vertical intervals: lines, crossbars & errorbars
geom_curve: Line segments and curves
geom_density: Smoothed density estimates
geom_dotplot: Dot plot
geom_errorbar: Vertical intervals: lines, crossbars & errorbars
geom_errorbarh: Horizontal error bars
geom_freqpoly: Histograms and frequency polygons
geom_hex: Hexagonal heatmap of 2d bin counts
geom_histogram: Histograms and frequency polygons
geom_hline: Reference lines: horizontal, vertical, and diagonal
geom_jitter: Jittered points
geom_label: Text
geom_line: Connect observations
geom_linerange: Vertical intervals: lines, crossbars & errorbars
geom_map: Polygons from a reference map
geom_path: Connect observations
geom_pointrange: Vertical intervals: lines, crossbars & errorbars
geom_polygon: Polygons
geom_qq: A quantile-quantile plot
geom_qq_line: A quantile-quantile plot
geom_quantile: Quantile regression
geom_raster: Rectangles
geom_ribbon: Ribbons and area plots
geom_rug: Rug plots in the margins
geom_segment: Line segments and curves
geom_smooth: Smoothed conditional means
geom_step: Connect observations
geom_text: Text
geom_tile: Rectangles
geom_violin: Violin plot
geom_vline: Reference lines: horizontal, vertical, and diagonal

labs, axis, facet, and theme

ggplot(aes(x = month, y = average), data = co2)+
  geom_point(alpha = 0.5) +  
  geom_line(color = "blue") +
  labs(
    x = 'Month',
    y = 'CO2 mole fraction (ppm)',
    title = 'Global Monthly Mean CO2'
  ) +
  scale_x_continuous(breaks = c(1,5,9,12)) +
  scale_y_continuous(breaks = seq(from=300,to=450,by=50)) +
  facet_wrap(vars(year)) +
  theme_bw()

Adding statistics

co2 %>%
  filter(year == 2021) %>%
  ggplot(aes(x = month, y = average))+
  geom_point() +
  geom_line(color="blue") +
  geom_smooth(formula = y ~ x, method = 'lm', color = "red") +
  geom_smooth(formula = y ~ splines::bs(x,3), method = 'lm', color = "orange") +
  scale_x_continuous(breaks = seq(1,12,1)) +
  labs(x = 'Month', y = 'CO2 mole fraction (ppm)', title = 'Global Monthly Mean CO2 in 2021')+ 
  theme_bw()

Colours in R

library(RColorBrewer)
display.brewer.all()

Other representative plots

Boxplot

Suppose we want to look at the CO2 distribution by year group.

ggplot(data = co2, aes(x=year_group, y=average, fill=year_group)) + 
  geom_boxplot(alpha=0.3) +
  scale_fill_brewer(palette="Reds")+ 
  labs(x = 'Year Group', 
       y = 'CO2 mole fraction (ppm)', 
       title = 'Global Monthly Mean CO2 by Year Group',
       fill = 'Year group')+ 
  theme_bw()

Heatmaps
We are interested in the pattern of CO2 concentration by year and month.
There are three data dimensions: year, month, and the value of CO2.

ggplot(data = co2, aes(y=factor(year), x=factor(month), fill=average)) + 
  geom_tile(colour = "white") +
  scale_x_discrete(position = "top") +
  scale_fill_distiller(palette = "Reds", direction = 1) +  
  labs(x = 'Month', 
       y = 'Year', 
       title = 'Global Monthly Mean CO2')+ 
  theme_bw()

The R Graph Gallery, https://r-graph-gallery.com/index.html, is a very useful reference for plotting with ggplot2 in R.

2. Introduction to Quarto

Quarto is an open-source scientific and technical publishing system built on Pandoc.

From (Wickham, Çetinkaya-Rundel, and Grolemund 2023)

Quarto provides a unified authoring framework for data science, combining your code, its results, and your prose.
Quarto documents are fully reproducible and support dozens of output formats, like PDFs, Word files, presentations, and more.
Like R Markdown, but better!

Quarto files are designed to be used in three ways:

For communicating to decision-makers, who want to focus on the conclusions, not the code behind the analysis.
For collaborating with other data scientists (including future you!), who are interested in both your conclusions, and how you reached them (i.e. the code).
As an environment in which to do data science, as a modern-day lab notebook where you can capture not only what you did, but also what you were thinking.

List of Quarto formats

Basic documents in html, pdf, docx
Presentations in beamer, pptx, and revealjs (html slides)
Websites and blogs with Quarto Websites and Quarto Blogs

Quarto is a one-stop-shop that renders multiple output formats!

How it works

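When you render a Quarto document, knitr first executes all of the code chunks and creates a new markdown (.md) document that includes the code and its output. The markdown file is then processed by pandoc, which creates the finished format.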

Getting started

Be sure that you have installed Quarto, https://quarto.org/docs/get-started/
We can then create a new Quarto document within RStudio: “File” -> “New File” -> “Quarto Document”.
The Quarto R package is a convenience for command line rendering from R, and is not required for using Quarto with R.

install.packages("quarto")

After opening a new Quarto document and selecting “Source” view, you will see the default top matter, contained between a pair of lines of three dashes. This is also known as the YAML header.

Quarto basics

A Quarto file has a .qmd extension.
From (Wickham, Çetinkaya-Rundel, and Grolemund 2023), below is an example Quarto file

---
title: "Diamond sizes"
date: 2023-09-07
format: html
knitr:
  opts_chunk:
    comment: "#>"
    collapse: true
    echo: true
    warning: false
    message: false
---

```{r}
#| label: setup
#| include: true

library(tidyverse)

smaller <- diamonds %>% 
  filter(carat <= 2.5)
```

We have data about `r nrow(diamonds)` diamonds.
Only `r nrow(diamonds) - nrow(smaller)` are larger than 2.5 carats.
The distribution of the remainder is shown below:

```{r}
#| label: plot-smaller-diamonds
#| echo: true

smaller %>% 
  ggplot(aes(x = carat)) + 
  geom_freqpoly(binwidth = 0.01)
```

It contains three important types of content:

An (optional) YAML header surrounded by ---s.
Chunks of R code surrounded by ```.
- You can run each code chunk by clicking the Run icon (it looks like a play button at the top of the chunk), or by pressing Cmd/Ctrl + Shift + Enter.
Text mixed with simple text formatting like # heading and _italics_.

When you open a .qmd document in RStudio, you get a notebook interface where code and output are interleaved.

Header matters

From (Alexander 2023)

Top matter consists of defining aspects such as the title, author, and date. It is contained within three dashes at the top of a Quarto document.
For instance, the following would specify a title, an author, and a date. Writing date: today instead of a fixed date makes the date update automatically to the day the document was rendered.
An abstract is a short summary of the paper, and we could add that to the top matter.
By default, Quarto will create an HTML document, but we can change the output format to produce a PDF.
You can also use this section to define global options!

---
title: "My report"
author: "Name"
date: 2023-09-07
abstract: "This is my abstract."
format: html
---

---
title: "My report"
author: "Name"
date: 2023-09-07
abstract: "This is my abstract."
format: pdf
---

Source editor

You can use either the visual editor or the source editor to edit a Quarto document. In the source editor, the document is written in markdown, for example:

## Text formatting

*italic* **bold** ~~strikeout~~ `code`

superscript^2^ subscript~2~

[underline]{.underline} [small caps]{.smallcaps}

## Headings

# 1st Level Header

## 2nd Level Header

### 3rd Level Header

## Lists

-   Bulleted list item 1

-   Item 2

    -   Item 2a

    -   Item 2b

1.  Numbered list item 1

2.  Item 2.
    The numbers are incremented automatically in the output.

## Links and images

<http://example.com>

[linked phrase](http://example.com)

![optional caption text](quarto.png){fig-alt="Quarto logo and the word quarto spelled in small case letters"}

## Tables

| First Header | Second Header |
|--------------|---------------|
| Content Cell | Content Cell  |
| Content Cell | Content Cell  |

Code chunks

Chunk label

Chunks can be given an optional label, e.g. #| label: setup in the example file above. A label allows easy navigation between coding sections using the navigator drop-down (bottom-left of the script editor) and makes it easy to reference figures generated by code chunks.

ggplot(airquality, aes(Temp, Ozone)) + 
  geom_point() + 
  geom_smooth(method = "loess", se = FALSE)
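
As a sketch of a labelled chunk, the lines below would form the body of an R code chunk in a .qmd file. The label fig-airquality and the caption text are made-up examples. In Quarto, a label that starts with fig- combined with a fig-cap option lets the text cross-reference the figure as @fig-airquality.

```r
#| label: fig-airquality
#| fig-cap: "Ozone versus temperature in the airquality data"
#| warning: false

library(ggplot2)

# airquality ships with R; rows with missing Ozone are dropped when drawing
p <- ggplot(airquality, aes(Temp, Ozone)) +
  geom_point() +
  geom_smooth(method = "loess", formula = y ~ x, se = FALSE)

p
```

Outside of a Quarto document, the #| lines are ordinary R comments, so the code also runs unchanged in the console.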

Chunk options

Chunk output can be customized with options, fields supplied to chunk header. Knitr provides almost 60 options that you can use to customize your code chunks. Here we’ll cover the most important chunk options that you’ll use frequently. You can see the full list at https://yihui.org/knitr/options.

The most important set of options controls if your code block is executed and what results are inserted in the finished report:

eval: false prevents code from being evaluated. (And obviously if the code is not run, no results will be generated). This is useful for displaying example code, or for disabling a large block of code without commenting each line.
include: false runs the code, but doesn’t show the code or results in the final document. Use this for setup code that you don’t want cluttering your report.
echo: false prevents code, but not the results from appearing in the finished file. Use this when writing reports aimed at people who don’t want to see the underlying R code.
message: false or warning: false prevents messages or warnings from appearing in the finished file.
results: hide hides printed output; fig-show: hide hides plots.
error: true causes the render to continue even if code returns an error. This is rarely something you’ll want to include in the final version of your report, but can be very useful if you need to debug exactly what is going on inside your .qmd. It’s also useful if you’re teaching R and want to deliberately include an error. The default, error: false causes rendering to fail if there is a single error in the document.

The following table summarizes which types of output each option suppresses:

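| Option           | Run code | Show code | Output | Plots | Messages | Warnings |
|------------------|----------|-----------|--------|-------|----------|----------|
| `eval: false`    | X        |           | X      | X     | X        | X        |
| `include: false` |          | X         | X      | X     | X        | X        |
| `echo: false`    |          | X         |        |       |          |          |
| `results: hide`  |          |           | X      |       |          |          |
| `fig-show: hide` |          |           |        | X     |          |          |
| `message: false` |          |           |        |       | X        |          |
| `warning: false` |          |           |        |       |          | X        |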

Render

When a Quarto document is rendered, R code blocks are automatically executed. You can render Quarto documents in a variety of ways:

Using the Render button in RStudio:

The Render button sits in the toolbar right above the document in RStudio.

The Render button will render the first format listed in the document YAML. If no format is specified, then it will render to HTML.

From the R console using the quarto R package:

library(quarto)
quarto_render("document.qmd") # all formats
quarto_render("document.qmd", output_format = "pdf")

The function quarto_render() is a wrapper around quarto render and by default, will render all formats listed in the document YAML.

Exercise

Try rendering the diamond-sizes.qmd document to both HTML and PDF formats.

References

Alexander, Rohan. 2023. Telling Stories with Data: With Applications in R. CRC Press.

Cornelison, Terri Lynn, and Janine Austin Clayton. 2017. “Considering Sex as a Biological Variable in Biomedical Research.” Gender and the Genome 1 (2): 89–93.

Wickham, Hadley, Mine Çetinkaya-Rundel, and Garrett Grolemund. 2023. R for Data Science. O’Reilly Media, Inc.