# install.packages("tidyverse", dependencies = TRUE)
library(tidyverse) # load package library
1. Getting started with R
1.1 R and RStudio
R is a language and environment for statistical computing and graphics (https://cran.r-project.org/manuals.html).
Many users of R like to use RStudio as the preferred interface for programming in R.
RStudio is an Integrated Development Environment (IDE) for R.
- easy to navigate
- lots of point and click features and customizations
- RStudio is not just for R
RStudio layout
When you open RStudio, your interface is made up of four panes as shown below. These can be organised via menu options View > Panes >
R Packages
Packages are the fundamental units of reproducible R code.
They include reusable R functions, the documentation that describes how to use them, and sample data.
We install a package using the install.packages() function, or we can use the Packages tab in RStudio. Once we have the package installed, we must load it with library() so we can use its functions within R.
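A script that has to run on a machine where a package may or may not be installed can check for it first. The sketch below is one possible pattern, not the only one. The package name "stats" is chosen only because it ships with R, so nothing is downloaded; substitute the package you actually need.

```r
# Hypothetical example: "stats" ships with R, so the install step is skipped.
pkg <- "stats"

# requireNamespace() returns TRUE when the package is already installed
if (!requireNamespace(pkg, quietly = TRUE)) {
  install.packages(pkg)  # only runs when the package is missing
}

# character.only = TRUE lets library() accept a name stored in a variable
library(pkg, character.only = TRUE)

# check that the package is now attached to the search path
"package:stats" %in% search()
```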
R script
We can run code in the console at the prompt where R will evaluate it and print the results.
Best practice is to write code in a new script file so it can be saved, edited, and reproduced.
To open a new script, we select File > New File > R Script.
To “run code” that was written in the script file, you can highlight the code lines you wish to be evaluated and press
- Ctrl+Enter (Windows)
- Cmd+Return (Mac).
Additionally, you can comment or uncomment script lines by pressing
- Ctrl+Shift+C (Windows)
- Cmd+Shift+C (Mac).
The comment operator in R is #. You can find more RStudio default keyboard shortcuts here.
Customization
You can customize your RStudio session under the Options dialog Tools > Global Options menu (or RStudio > Preferences on a Mac).
A list of customization categories can be found here.
Working directory
The working directory is the default location where R will look for files you want to load and where it will put any files you save.
You can use the getwd() function to display your current working directory and the setwd() function to set your working directory to a new folder on your computer.
getwd() #show my current working directory;
[1] "D:/GitHub/Rworkshop"
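A short sketch of changing the working directory and then restoring it is given below. The temporary folder from tempdir() is used only so the example runs on any computer; in practice you would pass the path to your own project folder.

```r
old_dir <- getwd()   # remember where we started
setwd(tempdir())     # tempdir() is a scratch folder R creates for this session
getwd()              # now shows the temporary folder
list.files()         # files R can see here without a full path
setwd(old_dir)       # return to the original working directory
```

Saving the old directory first makes it easy to undo the change at the end of a script.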
Getting help with R
The help section of R is extremely useful if you need more information about the packages and functions that you currently have loaded.
You can initiate R help using the help() function or ?, the help operator.
help(ggplot)
1.2 Basic R
- In this subsection, I will briefly outline some common R functions and commands for arithmetic and for creating and working with objects such as vectors and matrices.
R is case sensitive.
Commands are separated by a newline in the console.
The # character can be used to make comments. R does not execute the rest of the line after the # symbol - it ignores it.
Previous commands can be accessed via the up and down arrow keys on the keyboard.
When naming objects in R, avoid spaces and special characters (e.g., !@#$%^&*()+=;:’“<>?/) and avoid starting names with numbers. Underscores and periods are allowed, as in soil_pH.
It’s common to see error and warning messages pop up as output in the Console.
- best solution: searching for online answers!
Arithmetic
2+3
3-2
2*3
2^3
3/2
2 + (2 + 3) * 2 - 5
pi
[1] 3.141593
exp(1)
[1] 2.718282
exp(3)
[1] 20.08554
log(exp(1), base = exp(1)) #playing with Euler's number;
[1] 1
log(3, base = exp(1)) #default natural logarithms;
[1] 1.098612
log(3, base = 10)
[1] 0.4771213
log10(3)
[1] 0.4771213
log(-1) #warning message;
[1] NaN
- Some of the other available useful functions are: abs(), sqrt(), ceiling(), floor(), trunc(), round().
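The rounding functions listed above differ mainly in how they treat negative numbers and exact halves. The small vector below is made up for illustration.

```r
x <- c(-2.7, 1.5, 2.5)   # illustrative values, not from the workshop data

abs(x)                 # absolute value: 2.7 1.5 2.5
sqrt(16)               # square root: 4
ceiling(x)             # round up: -2 2 3
floor(x)               # round down: -3 1 2
trunc(x)               # drop the decimal part (towards zero): -2 1 2
round(x)               # nearest integer, exact halves go to the even digit: -3 2 2
round(pi, digits = 2)  # keep two decimal places: 3.14
```

Note that round(2.5) gives 2 rather than 3, because R rounds exact halves to the nearest even digit.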
Working with objects
R is an object-oriented programming language.
We can create objects and save them in our workspace (environment).
Creating an object involves three parts: 1) a value we’re interested in, 2) an identifier, and 3) the assignment operator.
- The value can take any form: a number, a string of characters, a data frame, a plot, or a function.
- The identifier is the name you assign to the value.
- The assignment operator <- resembles an arrow and is used to link the value to the identifier.
# Creating a scalar called "a" and assigning a value of 2
a <- 2
# Creating a scalar called "b" and assigning a value of 3
b <- 3
# Adding "a" and "b" and saving under "d"
d <- a + b
# Printing the value of "d"
d
[1] 5
# Updating the value of a scalar
# Adds 5 to the old value of "a" and saves it again under the name "a".
a <- a + 5
a
[1] 7
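To see which objects currently live in the workspace, and to remove ones that are no longer needed, base R provides ls() and rm(). The sketch below reuses the names a and b from the example above; it is an illustration rather than part of the original walkthrough.

```r
a <- 2
b <- 3

ls()             # names of the objects in the current environment
"b" %in% ls()    # TRUE: b exists

rm(b)            # remove b from the workspace
"b" %in% ls()    # FALSE: b is gone, a is still there
```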
Logic check
- TRUE or FALSE?
Operator | Description |
---|---|
== | exactly equal to |
!= | not equal to |
< | less than |
<= | less than or equal to |
> | greater than |
>= | greater than or equal to |
x \| y | x or y |
x & y | x and y |
a<5 # checks if a is less than 5 or not
a>5 # checks if a is greater than 5 or not
a<=5 # less or equal
a>=5 # greater or equal
a==4 #( == stands for equal)
a!=4 #( != stands for not equal)
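The table also lists the operators & and |, which combine two conditions. A minimal sketch, assuming a holds the value 7 as in the running example (the negation operator ! is base R but is not in the table above):

```r
a <- 7

a > 5 & a < 10   # both conditions hold: TRUE
a < 5 | a == 7   # at least one condition holds: TRUE
a < 5 & a == 7   # the first condition fails: FALSE
!(a == 7)        # negation of a TRUE condition: FALSE
```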
Data structures
- Vectors
- Matrices
- arrays
- Data frames
- List
Vectors
A vector holds elements of a single type; if you combine values of mixed types, R converts (coerces) them to one common type.
The general form is vector.name <- c(value1, value2, value3, ...).
The function c() means combine or concatenate and is used to create vectors.
Types of elements:
- numeric(double)
- integer
- character
- logical: TRUE, FALSE
- Special values: NA(not available or missing), NULL(empty), NaN(not a number), Inf(infinite)
You can use typeof() or class() to examine an object’s type, or use the is() function.
# a vector of a single numeric element;
x <- 3
x
[1] 3
typeof(x) #also try class(x);
[1] "double"
is(x)
[1] "numeric" "vector"
# a character vector
x <- c("red", "green", "yellow")
x
[1] "red" "green" "yellow"
typeof(x) #also try class(x);
[1] "character"
length(x)
[1] 3
nchar(x) #number of characters for each element;
[1] 3 5 6
# encode a vector as a factor (or category);
y <- factor(c("red", "green", "yellow", "red", "red", "green"))
y
[1] red green yellow red red green
Levels: green red yellow
attributes(y)
$levels
[1] "green" "red" "yellow"
$class
[1] "factor"
as.numeric(y) # we can return factors with numeric labels;
[1] 2 1 3 2 2 1
# we can update the levels;
levels(y)<- c("green","yellow","red")
attributes(y)
$levels
[1] "green" "yellow" "red"
$class
[1] "factor"
# we can also label numeric vector with factor levels;
z <- factor(c(1,2,3,1,1,2), levels = c(1,2,3), labels = c("red", "green","yellow"))
z
[1] red green yellow red red green
Levels: red green yellow
attributes(z)
$levels
[1] "red" "green" "yellow"
$class
[1] "factor"
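Two points from the list of element types are easy to check directly: what happens when types are mixed inside c(), and how the special values behave. The vectors below are made up for illustration.

```r
# Mixing types: everything is converted to the most flexible type present
mixed <- c(1, "a", TRUE)
mixed                  # "1" "a" "TRUE"
typeof(mixed)          # "character"
typeof(c(1L, 2.5))     # integer + double gives "double"
typeof(c(TRUE, 2L))    # logical + integer gives "integer"

# Special values
v <- c(1, NA, 3)
is.na(v)               # FALSE TRUE FALSE
mean(v)                # NA, because one value is missing
mean(v, na.rm = TRUE)  # 2, after dropping the missing value
1 / 0                  # Inf
0 / 0                  # NaN
length(NULL)           # 0, NULL is an empty object
```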
# using the repeat command;
# the following line repeats 3, 5 times
rep(x=3,each=5)
[1] 3 3 3 3 3
# using sequence command;
1:10
[1] 1 2 3 4 5 6 7 8 9 10
seq(from=1, to=10, by=1)
[1] 1 2 3 4 5 6 7 8 9 10
rep(x=1:2, each = 2)
[1] 1 1 2 2
- Logical check for a vector
- Just like a scalar, we can evaluate logical conditions using a vector as well.
- This is an element-wise operation.
- R will check every element of the vector
- The output will be a TRUE/FALSE vector.
# Let's start with a new vector which has 5 elements
x <- c(3,6,2,8,10)
x>5
[1] FALSE TRUE FALSE TRUE TRUE
x==2
[1] FALSE FALSE TRUE FALSE FALSE
sum(x>5)
[1] 3
- select or remove elements from a vector
- we use square brackets [ ] after the vector name and an index to operate.
#Starting with same x vector
# x= c(3,6,2,8,10)
x[1] # gives us the first element
[1] 3
x[c(1,3,4)] # return the 1st, 3rd and 4th element
[1] 3 2 8
x[-1] # remove the first element
[1] 6 2 8 10
x[-1:-2] # remove first and second elements
[1] 2 8 10
x[-c(1,2)]
[1] 2 8 10
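The logical checks and the bracket index can be combined: a TRUE/FALSE vector placed inside [ ] keeps only the elements where the condition is TRUE. A short sketch with the same x vector (which() is a base R helper not used in the text above):

```r
x <- c(3, 6, 2, 8, 10)

x[x > 5]            # keep elements greater than 5: 6 8 10
which(x > 5)        # positions of those elements: 2 4 5
x[x > 2 & x < 10]   # combine two conditions: 3 6 8
mean(x > 5)         # proportion of elements greater than 5: 0.6
```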
- Calculating summary statistics of a vector
set.seed(123)
r <- sample(x = 1:100, size = 100, replace = TRUE)
mean(r) #calculate the mean of a vector
[1] 52.15
var(r) #variance of a vector
[1] 874.7348
sd(r) #standard deviation of a vector
[1] 29.57592
min(r) #minimum of a vector
[1] 4
max(r) #maximum of a vector
[1] 99
median(r)#median
[1] 50
range(r) #range
[1] 4 99
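It can help to see how these summaries relate to each other. The sketch below uses a small made-up vector v (not the sampled r above) to check that var() uses the n - 1 denominator and that sd() is its square root; summary() and quantile() are further base R helpers.

```r
v <- c(2, 4, 4, 4, 5, 5, 7, 9)   # illustrative data
n <- length(v)

# sample variance computed by hand, with n - 1 in the denominator
manual_var <- sum((v - mean(v))^2) / (n - 1)

all.equal(manual_var, var(v))    # TRUE
all.equal(sqrt(var(v)), sd(v))   # TRUE

summary(v)                           # min, quartiles, mean, and max in one call
quantile(v, probs = c(0.25, 0.75))   # selected percentiles
```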
Matrices
- matrices have two dimensions, rows and columns
# matrix in R;
matrix(data = 1:16, nrow=4, ncol=4, byrow=TRUE)
[,1] [,2] [,3] [,4]
[1,] 1 2 3 4
[2,] 5 6 7 8
[3,] 9 10 11 12
[4,] 13 14 15 16
matrix(data = 1:16, nrow=4, ncol=4, byrow=FALSE)
[,1] [,2] [,3] [,4]
[1,] 1 5 9 13
[2,] 2 6 10 14
[3,] 3 7 11 15
[4,] 4 8 12 16
# creating matrix using diagonal;
diag(c(1,1,1))
[,1] [,2] [,3]
[1,] 1 0 0
[2,] 0 1 0
[3,] 0 0 1
# matrix calculation;
X <- matrix(data = 1:16, nrow=4, ncol=4, byrow=TRUE)
diag(X)
[1] 1 6 11 16
t(X) #transpose;
[,1] [,2] [,3] [,4]
[1,] 1 5 9 13
[2,] 2 6 10 14
[3,] 3 7 11 15
[4,] 4 8 12 16
Y <- matrix(seq(1,32, by=2), nrow=4, byrow=TRUE)
Y
[,1] [,2] [,3] [,4]
[1,] 1 3 5 7
[2,] 9 11 13 15
[3,] 17 19 21 23
[4,] 25 27 29 31
# matrix operation;
X + Y
[,1] [,2] [,3] [,4]
[1,] 2 5 8 11
[2,] 14 17 20 23
[3,] 26 29 32 35
[4,] 38 41 44 47
Y - X
[,1] [,2] [,3] [,4]
[1,] 0 1 2 3
[2,] 4 5 6 7
[3,] 8 9 10 11
[4,] 12 13 14 15
3 * X
[,1] [,2] [,3] [,4]
[1,] 3 6 9 12
[2,] 15 18 21 24
[3,] 27 30 33 36
[4,] 39 42 45 48
X * Y #element-wise product;
[,1] [,2] [,3] [,4]
[1,] 1 6 15 28
[2,] 45 66 91 120
[3,] 153 190 231 276
[4,] 325 378 435 496
X %*% Y #matrix multiplication;
[,1] [,2] [,3] [,4]
[1,] 170 190 210 230
[2,] 378 430 482 534
[3,] 586 670 754 838
[4,] 794 910 1026 1142
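Matrix elements are selected with [row, column], and the difference between * and %*% is easiest to see on a small case. The 2 x 2 matrix A below is a made-up example; X is rebuilt so the block runs on its own.

```r
X <- matrix(data = 1:16, nrow = 4, ncol = 4, byrow = TRUE)

dim(X)        # number of rows and columns: 4 4
X[2, 3]       # element in row 2, column 3: 7
X[1, ]        # the whole first row: 1 2 3 4
X[, 4]        # the whole fourth column: 4 8 12 16
X[1:2, 1:2]   # top-left 2 x 2 block

A <- matrix(1:4, nrow = 2)   # filled by column: rows are (1, 3) and (2, 4)
A * A         # element-wise: rows (1, 9) and (4, 16)
A %*% A       # matrix product: rows (7, 15) and (10, 22)
```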
Data frames
A data frame is a group of vectors of the same length.
Two dimensions: columns are variables and rows are observations
Unlike matrix, a data frame can contain different data types (e.g., numeric or character)
site_id <- c("A", "B", "C", "D") #identifies the soil sampling site;
soil_pH <- c(6.1, 7.4, 5.1, 6) #soil pH
num_species <- c(17, 23, 7, 15) #number of species
treated <- c("yes", "yes", "no", "no") #treatment status;
# use data.frame function to create a data frame;
soil_data <- data.frame(site_id, soil_pH, num_species, treated)
# view data;
soil_data
site_id soil_pH num_species treated
1 A 6.1 17 yes
2 B 7.4 23 yes
3 C 5.1 7 no
4 D 6.0 15 no
str(soil_data)
'data.frame': 4 obs. of 4 variables:
$ site_id : chr "A" "B" "C" "D"
$ soil_pH : num 6.1 7.4 5.1 6
$ num_species: num 17 23 7 15
$ treated : chr "yes" "yes" "no" "no"
dim(soil_data)
[1] 4 4
nrow(soil_data)
[1] 4
ncol(soil_data)
[1] 4
colnames(soil_data)
[1] "site_id" "soil_pH" "num_species" "treated"
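Once a data frame exists, columns are usually accessed with $ and rows with [row, column]. The sketch below rebuilds soil_data so it runs on its own; the acidic column and its pH cutoff of 6 are hypothetical additions for illustration.

```r
soil_data <- data.frame(
  site_id     = c("A", "B", "C", "D"),
  soil_pH     = c(6.1, 7.4, 5.1, 6),
  num_species = c(17, 23, 7, 15),
  treated     = c("yes", "yes", "no", "no")
)

soil_data$soil_pH                         # one column as a vector
soil_data[2, ]                            # the second row (site B)
soil_data[soil_data$treated == "yes", ]   # only the treated sites

# mean number of species among treated sites: 20
mean(soil_data$num_species[soil_data$treated == "yes"])

# hypothetical new column flagging sites with pH below 6
soil_data$acidic <- soil_data$soil_pH < 6
soil_data
```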
Lists
- highly flexible objects
- lists can contain anything as their elements
example_list <- list(
  num = seq(from=1, to=10, by=2),
  char = c("apple", "pineapple"),
  logic = c(TRUE, TRUE, FALSE)
)
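Elements of a list can be pulled out by name with $ or [[ ]]. A single bracket [ ] returns a smaller list rather than the element itself. A short sketch using the same example_list:

```r
example_list <- list(
  num   = seq(from = 1, to = 10, by = 2),
  char  = c("apple", "pineapple"),
  logic = c(TRUE, TRUE, FALSE)
)

names(example_list)            # "num" "char" "logic"
example_list$num               # 1 3 5 7 9
example_list[["char"]][2]      # "pineapple"
class(example_list["num"])     # "list": single brackets keep the list wrapper
class(example_list[["num"]])   # "numeric": double brackets return the element
str(example_list)              # compact overview of every element
```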
1.3 Advanced topics - iterative programming
if statements in R
- An if statement in R has the following structure:
if (condition){expression}
A simple example:
x <- 3
if(x==3){print("x is 3")}
[1] "x is 3"
if else statement
if(condition){
  expression1
} else {
  expression2
}
- we can also use the ifelse() function: ifelse(condition, expression1, expression2)
y <- c(6:-4)
sqrt(y) #- gives warning
Warning in sqrt(y): NaNs produced
[1] 2.449490 2.236068 2.000000 1.732051 1.414214 1.000000 0.000000 NaN
[9] NaN NaN NaN
sqrt(ifelse(y >= 0, y, NA)) # no warning
[1] 2.449490 2.236068 2.000000 1.732051 1.414214 1.000000 0.000000 NA
[9] NA NA NA
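Because ifelse() works element by element, it is also a convenient way to recode a whole vector in one step. A minimal sketch with the same y vector; the labels are made up for illustration.

```r
y <- c(6:-4)

# one label per element, chosen by the condition
sign_label <- ifelse(y >= 0, "non-negative", "negative")
sign_label

# count how many elements fall under each label
table(sign_label)
```

Unlike if, which expects a single TRUE or FALSE, ifelse() returns a result of the same length as the condition.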
multiple conditions
if (condition1) {
  expression1
} else if (condition2) {
  expression2
} else if (condition3) {
  expression3
} else {
  expression4
}
# current value of x is 3
if(x==4){
  print("x is 4")
} else if (x>4){
  print("x is greater than 4")
} else if (x<4){
  print("x is less than 4")
}
[1] "x is less than 4"
For Loops
- perform a particular action for every iteration of some sequence
for (i in sequence){
  expression
}
- a simple example
for (month in 1:12) {
print(paste('Month:', month))
}
[1] "Month: 1"
[1] "Month: 2"
[1] "Month: 3"
[1] "Month: 4"
[1] "Month: 5"
[1] "Month: 6"
[1] "Month: 7"
[1] "Month: 8"
[1] "Month: 9"
[1] "Month: 10"
[1] "Month: 11"
[1] "Month: 12"
- a slightly more complex example combining for loop and if statements - counting even numbers
x <- c(2,5,3,9,8,11,6)
count <- 0
for (val in x) {
  if(val %% 2 == 0) count = count+1
}
print(count)
[1] 3
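The same count can be written in two other ways: looping over positions with seq_along(), or skipping the loop entirely with a vectorised check. Both are sketches of alternatives to the loop above, not replacements for it.

```r
x <- c(2, 5, 3, 9, 8, 11, 6)

# loop over positions instead of values
count <- 0
for (i in seq_along(x)) {
  if (x[i] %% 2 == 0) {
    count <- count + 1
  }
}
count              # 3

# vectorised version: TRUE counts as 1 when summed
sum(x %% 2 == 0)   # 3
```

For simple element-wise checks like this one, the vectorised form is usually shorter and faster in R.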
apply family
apply family functions can be used in place of a for loop:
- apply(): apply over the margins of an array (e.g. the rows or columns of a matrix)
- lapply(): apply over an object and return a list
- sapply(): apply over an object and return a simplified object (a vector or array) if possible
- vapply(): similar to sapply() but you specify the type of object returned by the iterations
- mapply(): multivariate version of sapply()
- tapply(): used to apply a function over subsets of a vector
# a matrix with apply;
mymatrix <- matrix(1:9,nrow=3)
mymatrix
[,1] [,2] [,3]
[1,] 1 4 7
[2,] 2 5 8
[3,] 3 6 9
# calculate row sum;
apply(X=mymatrix,MARGIN=1,FUN = sum)
[1] 12 15 18
# a list with lapply
mylist <- list(A=matrix(1:9,nrow=3),B=1:5,C=8)
# calculate sum for each element of the list;
lapply(mylist,FUN = sum)
$A
[1] 45
$B
[1] 15
$C
[1] 8
# calculate sum for each element of the list and simplify it to a vector;
sapply(mylist, FUN = sum)
A B C
45 15 8
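The remaining members of the family, vapply(), mapply(), and tapply(), are not shown above. The sketch below gives one small use of each; the species and treatment vectors copy the values of the earlier soil example, and the numbers passed to mapply() are made up.

```r
mylist <- list(A = matrix(1:9, nrow = 3), B = 1:5, C = 8)

# vapply: like sapply, but each result must be a single integer
vapply(mylist, FUN = length, FUN.VALUE = integer(1))   # A = 9, B = 5, C = 1

# mapply: walk along two vectors in parallel
mapply(function(a, b) a + b, 1:3, 4:6)                 # 5 7 9

# tapply: apply a function within each group of a vector
num_species <- c(17, 23, 7, 15)
treated     <- c("yes", "yes", "no", "no")
tapply(num_species, treated, mean)                     # no = 11, yes = 20
```

vapply() stops with an error if a result does not match FUN.VALUE, which makes it a safer choice than sapply() inside scripts.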
Using loops effectively in statistical modelling
Looping tools such as lapply() can be handy in statistical modelling!
Data: Motor Trend Car Road Tests
- A data frame with 32 observations on 11 (numeric) variables.
- mpg Miles/(US) gallon
- cyl Number of cylinders
- disp Displacement (cu.in.)
- hp Gross horsepower
- drat Rear axle ratio
- wt Weight (1000 lbs)
- qsec 1/4 mile time
- vs Engine (0 = V-shaped, 1 = straight)
- am Transmission (0 = automatic, 1 = manual)
- gear Number of forward gears
- carb Number of carburetors
library(DT)
datatable(mtcars,
options = list(dom = 't'))
# creating a list of variables that are predictive of the fuel consumption;
predictors <- colnames(mtcars)[-1]
predictors
[1] "cyl" "disp" "hp" "drat" "wt" "qsec" "vs" "am" "gear" "carb"
#run unadjusted regression analysis for each predictor;
# m1 <- lm(mpg~cyl,data = mtcars)
# m2 <- lm(mpg~disp,data = mtcars)
# m3 <- lm(mpg~hp,data = mtcars)
# make a list of model formulas: list(mpg ~ cyl, mpg ~ disp, ...);
list_model_formulas <- sapply(predictors,function(x)as.formula(paste('mpg~',x)))
# making a list of unadjusted models;
list_models <- lapply(list_model_formulas,function(x){lm(x,data=mtcars)})
#extract model results;
results <- lapply(list_models, function(x){return(summary(x)$coef)})
results
$cyl
Estimate Std. Error t value Pr(>|t|)
(Intercept) 37.88458 2.0738436 18.267808 8.369155e-18
cyl -2.87579 0.3224089 -8.919699 6.112687e-10
$disp
Estimate Std. Error t value Pr(>|t|)
(Intercept) 29.59985476 1.229719515 24.070411 3.576586e-21
disp -0.04121512 0.004711833 -8.747152 9.380327e-10
$hp
Estimate Std. Error t value Pr(>|t|)
(Intercept) 30.09886054 1.6339210 18.421246 6.642736e-18
hp -0.06822828 0.0101193 -6.742389 1.787835e-07
$drat
Estimate Std. Error t value Pr(>|t|)
(Intercept) -7.524618 5.476663 -1.373942 0.1796390847
drat 7.678233 1.506705 5.096042 0.0000177624
$wt
Estimate Std. Error t value Pr(>|t|)
(Intercept) 37.285126 1.877627 19.857575 8.241799e-19
wt -5.344472 0.559101 -9.559044 1.293959e-10
$qsec
Estimate Std. Error t value Pr(>|t|)
(Intercept) -5.114038 10.0295433 -0.5098974 0.61385436
qsec 1.412125 0.5592101 2.5252133 0.01708199
$vs
Estimate Std. Error t value Pr(>|t|)
(Intercept) 16.616667 1.079711 15.389917 8.846603e-16
vs 7.940476 1.632370 4.864385 3.415937e-05
$am
Estimate Std. Error t value Pr(>|t|)
(Intercept) 17.147368 1.124603 15.247492 1.133983e-15
am 7.244939 1.764422 4.106127 2.850207e-04
$gear
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.623333 4.916379 1.143796 0.261753365
gear 3.923333 1.308131 2.999191 0.005400948
$carb
Estimate Std. Error t value Pr(>|t|)
(Intercept) 25.872334 1.8368072 14.08549 9.218370e-15
carb -2.055719 0.5685456 -3.61575 1.084446e-03
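A list of coefficient tables is awkward to compare across predictors. One possible follow-up, sketched here as an assumption rather than as part of the original analysis, is to keep only the slope row from each model and stack the rows into a single data frame. The name slope_table and its column names are hypothetical; reformulate() is a base R helper that builds the formula from character strings.

```r
predictors <- colnames(mtcars)[-1]

# fit one unadjusted model per predictor and keep only its slope row
slope_rows <- lapply(predictors, function(p) {
  fit <- lm(reformulate(p, response = "mpg"), data = mtcars)
  est <- summary(fit)$coefficients[p, ]   # row for the predictor itself
  data.frame(
    predictor = p,
    estimate  = est[["Estimate"]],
    std_error = est[["Std. Error"]],
    p_value   = est[["Pr(>|t|)"]]
  )
})

# stack the one-row data frames into a single table
slope_table <- do.call(rbind, slope_rows)

# order predictors from smallest to largest p-value
slope_table[order(slope_table$p_value), ]
```

Each row of slope_table should match the second row of the corresponding table printed above.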