Introduction to Tidyverse

Fundamentals of Data Science for NHS using R

Today’s Plan

What is R?
What is RStudio?
What is Tidyverse?
Practical Session

Let’s introduce the main characters of our story!

What is R?

R is a language and environment for statistical computing and graphics.

R provides a wide variety of statistical and graphical techniques, and is highly extensible.

Why Use R?

R is free and open source
Any type of statistics/data analysis can be done in R
State-of-the-art graphics
Import data from a wide variety of sources
R runs on all common platforms (Windows, macOS, Linux)
The online community is helpful, supportive and inclusive

R Popularity

R is popular, even though it is mainly used for statistics and data analysis!

R is ranked 16th in the TIOBE index and 7th in the PYPL index

NHS-R Community

Promoting the Use of R in the UK Health & Care System.

R Packages

R Packages are extensions of the base language (base-R). They contain functions and data that are not included in the basic installation.

They can be installed by users via a centralised software repository.

R’s most important software repository is the Comprehensive R Archive Network (CRAN).

Currently, the CRAN package repository features 19359 available packages.

What is RStudio?

RStudio is an integrated development environment (IDE) for R developed by Posit PBC.

RStudio PBC changed its name to Posit PBC last summer.

Posit Cloud

Posit

Data Science Workflow

Tidy Data

Tidy data is data where:

Every column is a variable.
Every row is an observation.
Every cell is a single value.
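As a minimal sketch of these three rules, the example below reshapes a small table into tidy form. The data are made up for illustration (the `ward` and `admissions` names and all the numbers are hypothetical), and it assumes the tidyr and tibble packages are installed.

```r
library(tidyr)
library(tibble)

# Hypothetical, made-up data: one row per ward, one column per year.
# The column names 2022 and 2023 are really values of a "year" variable,
# so this table is not tidy.
untidy <- tibble(
  ward   = c("A", "B"),
  `2022` = c(10, 20),
  `2023` = c(12, 18)
)

# pivot_longer() moves the year columns into rows:
# each row is now one observation (one ward in one year).
tidy <- untidy |>
  pivot_longer(
    cols      = c(`2022`, `2023`),
    names_to  = "year",
    values_to = "admissions"
  )

tidy
```

After reshaping, every column is a variable (`ward`, `year`, `admissions`), every row is an observation, and every cell holds a single value.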

What is the Tidyverse?

The Tidyverse is a collection of R packages designed for data science.

Tidyverse Critics

Not everyone agrees with the Tidyverse philosophy:

Others instead think that it should be the first thing you learn when you meet R for the first time:

Tidyverse 9 Core Packages

ggplot2

Data visualisation

dplyr

Data manipulation

tidyr

Data tidying

readr

Data import (CSV and TSV files)

purrr

Functional programming

tibble

tibble data structure

stringr

String manipulation

forcats

Categorical variables

lubridate

Date manipulation

Other Tidyverse Packages

hms

Time-of-day values manipulation

Import other formats

There are dedicated packages to import the following formats:

SPSS, SAS, Stata (haven)
JSON (jsonlite)
xls and xlsx (readxl)
XML (xml2)

and more…

reprex

Ask for help in a way that everybody can understand.

magrittr

Provides the %>% operator which is essential to write cleaner code.

From R 4.1.0 (May 2021) the pipe operator has been implemented in the base-R language and its syntax is |>.

See this link for differences.
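To make the pipe concrete, here is a small sketch that computes the same value three ways. It assumes R 4.1.0 or later and that magrittr is installed; the vector `x` is just an illustrative input.

```r
library(magrittr)

x <- c(1, 4, 9)

# Nested calls are read from the inside out
nested <- sum(sqrt(x))

# The magrittr pipe passes the left-hand side as the first argument
piped_magrittr <- x %>% sqrt() %>% sum()

# The native pipe (R >= 4.1.0) reads the same way
piped_native <- x |> sqrt() |> sum()

c(nested, piped_magrittr, piped_native)
```

All three expressions give the same result; the piped versions read left to right, in the order the steps are applied.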

Learning Resources

R for Data Science

This is where you should start!

R for Data Science work-in-progress 2nd edition by Hadley Wickham, Mine Çetinkaya-Rundel, Garrett Grolemund.

Workshop Part 1: R and RStudio basics

Customise RStudio

The available options for the RStudio IDE are accessible from the Tools > Options menu.

A full reference of the available options can be found here.

Keyboard Shortcuts

A reference of the available shortcuts can be accessed from the Help > Keyboard Shortcuts Help menu.

Alternatively, use the keyboard shortcut Alt + Shift + K.

Install packages

We can install a package using the install.packages function.

install.packages("package_name")

Warning

The package name MUST be quoted!

Important

It’s good practice to run install.packages in your console rather than including it in your script.

Load packages

We can load a package using the library function.

library(package_name)

Warning

The package name does not need to be quoted here (library("package_name") also works).

Important

It’s good practice to write all your library calls at the beginning of your script.
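Putting the two practices together, the top of a script for this workshop might look like the sketch below. It assumes the tidyverse and palmerpenguins packages have already been installed from the console.

```r
# Run once in the console, not in the script:
# install.packages(c("tidyverse", "palmerpenguins"))

# All library calls go at the top of the script
library(tidyverse)       # attaches the core packages, including dplyr and tidyr
library(palmerpenguins)  # provides the penguins dataset
```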

Workshop Part 2: Introduction to dplyr

Palmer Penguins

The goal of palmerpenguins is to provide a great dataset for data exploration and visualization.

Palmer Penguins Variables

What is dplyr?

dplyr is a package that contains functions for data manipulation.
It is automatically loaded when we call library(tidyverse) as it is one of the core packages.
First release 7th January 2014.
Original author Hadley Wickham.
dplyr is a grammar of data manipulation.

dplyr cheat sheet

Single table verbs (for today…)

Column verbs
- select() subset by column
Row verbs
- slice() (and its friends) subset rows by position
- arrange() order rows by column values
- distinct() select unique rows
Group verbs
- count() count observations by group

Inspect the Data

We can print the first rows of a dataset in the console by typing its name (a tibble shows the first 10 rows by default)

dataset

We can see the whole dataset using view()

dataset |> view()

Another useful function is glimpse()

dataset |> glimpse()
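As a concrete sketch with the Palmer Penguins data, assuming the dplyr and palmerpenguins packages are installed (view() is left out because it opens the interactive RStudio viewer):

```r
library(dplyr)
library(palmerpenguins)

# Printing a tibble shows the first 10 rows and as many columns as fit
penguins

# glimpse() lists every column with its type and first few values
penguins |> glimpse()

# Number of rows and columns
penguins_dim <- dim(penguins)
penguins_dim
```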

`select` specific variables

We can select specific variables. List their names as arguments of the select function.

dataset |>
    select(var1_name, var2_name, ...)
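For instance, with the penguins dataset (assuming dplyr and palmerpenguins are installed), a minimal sketch that keeps three columns:

```r
library(dplyr)
library(palmerpenguins)

# Keep only the three named columns, in the order listed
selected <- penguins |>
    select(species, island, body_mass_g)

names(selected)
```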

`select` drop variables

We can select all variables but specific ones. List their names with a minus sign before the name of each variable you want to exclude.

dataset |>
    select(-var1_name, -var2_name, ...)
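A short sketch of dropping columns from the penguins dataset, again assuming dplyr and palmerpenguins are installed:

```r
library(dplyr)
library(palmerpenguins)

# Keep every column except island and year
dropped <- penguins |>
    select(-island, -year)

names(dropped)
```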

`select` by matching strings

We can use helpers to match strings in the variable name.

starts_with()

ends_with()

contains()

For example

dataset |>
    select(starts_with("string_to_be_matched"))

Multiple conditions can be combined using the logical operators and (&), or (|). We can also use the negation (!) operator.
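The helpers and operators above can be sketched on the penguins dataset as follows (assuming dplyr and palmerpenguins are installed):

```r
library(dplyr)
library(palmerpenguins)

# Columns whose name starts with "bill"
bill_cols <- penguins |> select(starts_with("bill"))

# Columns whose name ends with "mm"
mm_cols <- penguins |> select(ends_with("mm"))

# Columns whose name contains "length"
length_cols <- penguins |> select(contains("length"))

# Combine helpers: ends with "mm" AND does NOT start with "bill"
combined <- penguins |> select(ends_with("mm") & !starts_with("bill"))

names(combined)
```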

`select` by variable type

We can use where() to specify variable type(s) to be selected. Possible arguments are

is.integer for integer

is.double for decimal

is.numeric for numerical (both integer and decimal)

is.factor for categorical

is.character for strings

For example

dataset |>
    select(where(is.character))
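Applied to the penguins dataset, a sketch of selecting by type could look like this (assuming dplyr and palmerpenguins are installed). Note that in penguins the text-like columns are stored as factors, so is.character matches no columns there.

```r
library(dplyr)
library(palmerpenguins)

# All numeric columns (integer and double)
numeric_cols <- penguins |> select(where(is.numeric))

# All categorical (factor) columns
factor_cols <- penguins |> select(where(is.factor))

# penguins has no character columns, so this returns zero columns
character_cols <- penguins |> select(where(is.character))

names(numeric_cols)
names(factor_cols)
```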

`slice` subset data by row number

slice(x) selects row number x.

# select the 274th row
dataset |>
    slice(274)

We can also select more than one row by listing the corresponding row numbers.

# select the 7th, 143rd, 213th rows
dataset |>
    slice(7, 143, 213)

To list consecutive rows use the : operator.

# select rows from the 75th to the 100th 
dataset |>
    slice(75:100)
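The same calls on the penguins dataset, as a minimal sketch (assuming dplyr and palmerpenguins are installed):

```r
library(dplyr)
library(palmerpenguins)

# A single row
one_row <- penguins |> slice(274)

# Several rows by position
three_rows <- penguins |> slice(7, 143, 213)

# A range of consecutive rows: 75 to 100 inclusive
range_rows <- penguins |> slice(75:100)

nrow(range_rows)
```

Both ends of the range are included, so 75:100 returns 26 rows.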

`slice_head` and `slice_tail`

slice_head(n = x) selects the first x rows

# select the first 5 rows
dataset |>
    slice_head(n = 5)

slice_tail(n = x) selects the last x rows

# select the last 7 rows
dataset |>
    slice_tail(n = 7)

Alternatively we can use the prop argument to select a specified proportion of the data.
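A brief sketch of the prop argument on the penguins dataset (assuming dplyr and palmerpenguins are installed). The number of rows returned is rounded down.

```r
library(dplyr)
library(palmerpenguins)

# First 10% of the rows: 344 * 0.1 = 34.4, rounded down to 34
first_tenth <- penguins |>
    slice_head(prop = 0.1)

# Last 25% of the rows: 344 * 0.25 = 86
last_quarter <- penguins |>
    slice_tail(prop = 0.25)

nrow(first_tenth)
nrow(last_quarter)
```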

Other `slice` verbs

slice_sample() randomly selects rows
slice_min() and slice_max() select rows with the lowest or highest values of a variable
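These verbs can be sketched on the penguins dataset as follows (assuming dplyr and palmerpenguins are installed). The seed value is arbitrary and only makes the random sample reproducible.

```r
library(dplyr)
library(palmerpenguins)

# Three rows chosen at random
set.seed(1)
random_rows <- penguins |> slice_sample(n = 3)

# Row(s) with the highest body mass
heaviest <- penguins |> slice_max(body_mass_g, n = 1)

# Row(s) with the lowest body mass
lightest <- penguins |> slice_min(body_mass_g, n = 1)

heaviest
lightest
```

By default slice_min() and slice_max() keep ties (with_ties = TRUE), so they can return more than n rows when several rows share the same value.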

`arrange` rows by column values

We can order rows by values of selected columns. List their names as arguments of the arrange function.

dataset |>
    arrange(var1_name, var2_name, ...)

Use desc() to sort a variable in descending order.
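For example, a sketch on the penguins dataset (assuming dplyr and palmerpenguins are installed) that sorts by species and then, within each species, from heaviest to lightest:

```r
library(dplyr)
library(palmerpenguins)

sorted <- penguins |>
    arrange(species, desc(body_mass_g))

sorted |> select(species, body_mass_g)
```

Missing values are placed at the end of the rows they tie with, whether or not desc() is used.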

`distinct` rows by column values

We can select unique/distinct rows.

dataset |>
    distinct()

Observe that

dataset |>
    select(var1_name, var2_name, ...) |>
    distinct()

is equivalent to

dataset |>
    distinct(var1_name, var2_name, ...)
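The equivalence can be checked on the penguins dataset with a short sketch like this one (assuming dplyr and palmerpenguins are installed):

```r
library(dplyr)
library(palmerpenguins)

# Select first, then keep unique rows
via_select <- penguins |>
    select(species, island) |>
    distinct()

# Pass the columns directly to distinct()
via_distinct <- penguins |>
    distinct(species, island)

via_select
via_distinct
```

Both versions return one row per species and island combination found in the data.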

`count` observations by group

We can count the number of observations for each unique value (or combination of values) of one or more variables.

dataset |>
    count(var1_name, var2_name, ...)
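A minimal sketch on the penguins dataset (assuming dplyr and palmerpenguins are installed):

```r
library(dplyr)
library(palmerpenguins)

# Number of penguins per species; the count is stored in column n
by_species <- penguins |>
    count(species)

# Number of penguins per species and island, largest groups first
by_species_island <- penguins |>
    count(species, island, sort = TRUE)

by_species
by_species_island
```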

A verb from tidyr

Missing values
- drop_na() drop rows containing missing values

dataset |>
    drop_na()
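On the penguins dataset this could look like the sketch below (assuming dplyr, tidyr and palmerpenguins are installed). Naming a column inside drop_na() restricts the check to that column.

```r
library(dplyr)
library(tidyr)
library(palmerpenguins)

# Drop rows with a missing value in any column
complete_rows <- penguins |>
    drop_na()

# Drop rows only where bill_length_mm is missing
with_bill_length <- penguins |>
    drop_na(bill_length_mm)

nrow(penguins)
nrow(complete_rows)
nrow(with_bill_length)
```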

Thank you!

In the next episode

Column verbs
- mutate() and transmute() create, modify, and delete columns
Row verbs
- filter() subset rows by column values
Group verbs
- group_by() group rows
- summarise() collapse row groups into a single row

Honorable mentions

Column verbs
- relocate() change column order
- rename() change column name