Introduction to Tidyverse

Fundamentals of Data Science for NHS using R

Today’s Plan

  • What is R?
  • What is RStudio?
  • What is Tidyverse?
  • Practical Session

Let’s introduce the main characters of our story!

What is R?

R is a language and environment for statistical computing and graphics.

R provides a wide variety of statistical and graphical techniques, and is highly extensible.

Why Use R?

  • R is free and open source
  • Any type of statistics/data analysis can be done in R
  • State-of-the-art graphics
  • Import data from a wide variety of sources
  • R runs on all common platforms (Windows, macOS, Linux)
  • The online community is helpful, supportive and inclusive

R Popularity

R is popular even though it is designed mainly for statistics and data analysis!

R is ranked 16th in the TIOBE index and 7th in the PYPL index

PYPL index, April 2023

NHS-R Community

Promoting the Use of R in the UK Health & Care System.

R Packages

R Packages are extensions of the base language (base-R). They contain functions and data that are not included in the basic installation.

They can be installed by users via a centralised software repository.

R’s most important software repository is called Comprehensive R Archive Network (CRAN).

Currently, the CRAN package repository features 19359 available packages.

What is RStudio?

RStudio is an integrated development environment (IDE) for R developed by Posit PBC.

RStudio PBC announced its change of name to Posit PBC in summer 2022.

Posit Cloud

Data Science Workflow

Tidy Data

Tidy data is data where:

  1. Every column is a variable.
  2. Every row is an observation.
  3. Every cell is a single value.
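
A minimal sketch of the three rules, assuming a made-up table of yearly admissions (the trust names and numbers are hypothetical). The wide table stores the year variable in the column names; pivot_longer() from tidyr reshapes it so that each column is a variable and each row is one observation.

```r
library(tidyr)

# Hypothetical untidy data: the years are column names, not values
untidy <- tibble(
  trust  = c("A", "B"),
  `2021` = c(120, 95),
  `2022` = c(130, 101)
)

# Tidy version: one row per trust and year, one column per variable
tidy <- untidy |>
  pivot_longer(
    cols      = c(`2021`, `2022`),
    names_to  = "year",
    values_to = "admissions"
  )

tidy
```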

What is the Tidyverse?

The Tidyverse is a collection of R packages designed for data science.

Tidyverse Critics

Not everyone agrees with the Tidyverse philosophy:

Others instead think that it should be the first thing to learn when you meet R for the first time:

Tidyverse 9 Core Packages

ggplot2

Data visualisation

dplyr

Data manipulation

tidyr

Data tidying

readr

Data import (CSV and TSV files)

purrr

Functional programming

tibble

tibble data structure

stringr

String manipulation

forcats

Categorical variables

lubridate

Date and date-time manipulation

Other Tidyverse Packages

hms

Time-of-day values manipulation

Import other formats

There are dedicated packages to import the following formats:

and more…

reprex

Ask for help in a way that everybody can understand and reproduce.

magrittr

Provides the %>% operator, which passes the result of one step to the next and helps us write cleaner code.

Since R 4.1.0 (May 2021) a pipe operator has also been available in the base-R language; its syntax is |>.

See this link for differences.
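
As a small illustration, the sketch below writes the same computation three ways: nested calls, the magrittr pipe and the native pipe. The vector x is just an example value. One difference worth knowing: magrittr uses . as its placeholder, while the native pipe uses _ (available from R 4.2.0 and only for named arguments).

```r
library(magrittr)

x <- c(1, 4, 9)

# Nested calls are read from the inside out
nested <- sum(sqrt(x))

# magrittr pipe: read from left to right
with_magrittr <- x %>% sqrt() %>% sum()

# Native pipe (R >= 4.1.0): same idea, no package needed
with_native <- x |> sqrt() |> sum()
```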

Learning Resources

R for Data Science

This is where you should start!

R for Data Science work-in-progress 2nd edition by Hadley Wickham, Mine Çetinkaya-Rundel, Garrett Grolemund.

Workshop Part 1: R and RStudio basics

Customise RStudio

The available options for the RStudio IDE are accessible from the Tools > Global Options menu.

A full reference of the available options can be found here.

Keyboard Shortcuts

 

A reference of the available shortcuts can be accessed from the Help > Keyboard Shortcuts Help menu.

Alternatively use keyboard shortcut Alt + Shift + K.

Install packages

We can install a package using the install.packages function.

install.packages("package_name")

Warning

The package name MUST be quoted!

Important

It’s good practice to run install.packages in your console and not to include it in your script.

Load packages

We can load a package using the library function.

library(package_name)

Warning

The package name does not need to be quoted, although library("package_name") also works.

Important

It’s good practice to write all your library calls at the beginning of your script.

Workshop Part 2: Introduction to dplyr

Palmer Penguins

The goal of palmerpenguins is to provide a great dataset for data exploration and visualization.

Palmer Penguins Variables

What is dplyr?

  • dplyr is a package that contains functions for data manipulation.

  • It is automatically loaded when we call library(tidyverse) as it is one of the core packages.

  • First release 7th January 2014.

  • Original author Hadley Wickham.

  • dplyr is a grammar of data manipulation.

dplyr cheat sheet

Single table verbs (for today…)

  • Column verbs
    • select() subset by column
  • Row verbs
    • slice() (and its friends) subset rows by position
    • arrange() order rows by column values
    • distinct() select unique rows
  • Group verbs
    • count() count observations by group

Inspect the Data

We can print the first rows of a dataset in the console by typing its name

dataset

We can see the whole dataset using view()

dataset |> view()

Another useful function is glimpse()

dataset |> glimpse()
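
A quick sketch of the three ways to inspect data. The tibble toy is a made-up stand-in for dataset: its column names are borrowed from palmerpenguins, but the values are hypothetical.

```r
library(dplyr)

# Hypothetical toy data; column names borrowed from palmerpenguins
toy <- tibble(
  species     = c("Adelie", "Gentoo", "Chinstrap"),
  island      = c("Torgersen", "Biscoe", "Dream"),
  body_mass_g = c(3750L, 5000L, 3800L)
)

toy              # typing the name prints the tibble
toy |> glimpse() # one line per column, with its type

# view() opens a viewer window, so only call it in an interactive session
if (interactive()) toy |> tibble::view()
```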

select specific variables

We can select specific variables. List their names as arguments of the select function.

dataset |>
    select(var1_name, var2_name, ...)
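
For instance, assuming a small made-up tibble called toy (hypothetical values, column names borrowed from palmerpenguins), this keeps only two of its four columns:

```r
library(dplyr)

# Hypothetical toy data
toy <- tibble(
  species        = c("Adelie", "Gentoo", "Chinstrap"),
  island         = c("Torgersen", "Biscoe", "Dream"),
  bill_length_mm = c(39.1, 46.1, 49.5),
  body_mass_g    = c(3750L, 5000L, 3800L)
)

# Keep only species and body_mass_g, in the order listed
selected <- toy |>
  select(species, body_mass_g)

selected
```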

select drop variables

We can select all variables but specific ones. List their names with a minus sign before the name of each variable you want to exclude.

dataset |>
    select(-var1_name, -var2_name, ...)
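
A short sketch on the same kind of made-up tibble (toy and its values are hypothetical): dropping two columns leaves the other two untouched.

```r
library(dplyr)

# Hypothetical toy data
toy <- tibble(
  species        = c("Adelie", "Gentoo", "Chinstrap"),
  island         = c("Torgersen", "Biscoe", "Dream"),
  bill_length_mm = c(39.1, 46.1, 49.5),
  body_mass_g    = c(3750L, 5000L, 3800L)
)

# Drop island and bill_length_mm, keep everything else
remaining <- toy |>
  select(-island, -bill_length_mm)

remaining
```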

select by matching strings

We can use helpers to match strings in the variable name.

starts_with()

ends_with()

contains()

For example

dataset |>
    select(starts_with("string_to_be_matched"))

Multiple conditions can be combined using the logical operators and (&), or (|). We can also use the negation (!) operator.
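
The sketch below shows one helper on its own and two combinations. The tibble toy is made up for illustration; only its column names matter here.

```r
library(dplyr)

# Hypothetical toy data; the helpers only look at the column names
toy <- tibble(
  species           = c("Adelie", "Gentoo"),
  bill_length_mm    = c(39.1, 46.1),
  bill_depth_mm     = c(18.7, 13.2),
  flipper_length_mm = c(181L, 211L),
  body_mass_g       = c(3750L, 5000L)
)

# Columns whose name starts with "bill"
bill_cols <- toy |>
  select(starts_with("bill"))

# and + negation: ends with "mm" but does not start with "bill"
mm_not_bill <- toy |>
  select(ends_with("mm") & !starts_with("bill"))

# or: name contains "length" or "mass"
length_or_mass <- toy |>
  select(contains("length") | contains("mass"))
```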

select by variable type

We can use where() to specify variable type(s) to be selected. Possible arguments are

is.integer for integer

is.double for decimal

is.numeric for numerical (both integer and decimal)

is.factor for categorical

is.character for strings

For example

dataset |>
    select(where(is.character))
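
To see how the type predicates behave, here is a sketch on a made-up tibble with one column of each type (toy and its values are hypothetical):

```r
library(dplyr)

# Hypothetical toy data with a factor, a character, a double and an integer
toy <- tibble(
  species        = factor(c("Adelie", "Gentoo", "Chinstrap")),
  island         = c("Torgersen", "Biscoe", "Dream"),
  bill_length_mm = c(39.1, 46.1, 49.5),
  body_mass_g    = c(3750L, 5000L, 3800L)
)

numeric_cols <- toy |> select(where(is.numeric))  # integer and double
integer_cols <- toy |> select(where(is.integer))  # integer only
factor_cols  <- toy |> select(where(is.factor))   # categorical
```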

slice subset data by row number

slice(x) selects the xth row.

# select the 274th row
dataset |>
    slice(274)

We can also select more than one row by listing the corresponding row numbers.

# select the 7th, 143rd, 213th rows
dataset |>
    slice(7, 143, 213)

To list consecutive rows use the : operator.

# select rows from the 75th to the 100th 
dataset |>
    slice(75:100)
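
The same three patterns on a tiny made-up tibble, where the id column makes it easy to check which rows come back (toy is hypothetical):

```r
library(dplyr)

# Hypothetical toy data: id equals the row number
toy <- tibble(id = 1:10, value = (1:10) * 10)

single_row  <- toy |> slice(3)        # the 3rd row
several     <- toy |> slice(2, 5, 9)  # the 2nd, 5th and 9th rows
consecutive <- toy |> slice(4:6)      # rows 4 to 6
```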

slice_head and slice_tail

  • slice_head(n = x) selects the first x rows
# select the first 5 rows
dataset |>
    slice_head(n = 5)
  • slice_tail(n = x) selects the last x rows
# select the last 7 rows
dataset |>
    slice_tail(n = 7)

Alternatively we can use the prop argument to select a specified proportion of the data.
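
A sketch of the prop argument on a made-up 10-row tibble (toy is hypothetical). With 10 rows, prop = 0.5 keeps 5 rows and prop = 0.2 keeps 2.

```r
library(dplyr)

# Hypothetical toy data: id equals the row number
toy <- tibble(id = 1:10)

first_half <- toy |> slice_head(prop = 0.5)  # first 50% of the rows
last_fifth <- toy |> slice_tail(prop = 0.2)  # last 20% of the rows
```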

Other slice verbs

  • slice_sample() randomly selects rows

  • slice_min() and slice_max() select rows with the lowest or highest values of a variable
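
A brief sketch of these verbs on made-up data (toy and its values are hypothetical). set.seed() is used only to make the random sample repeatable.

```r
library(dplyr)

# Hypothetical toy data
toy <- tibble(
  species     = c("Adelie", "Adelie", "Gentoo", "Chinstrap"),
  body_mass_g = c(3750L, 3400L, 5000L, 3800L)
)

# The two heaviest penguins, heaviest first
heaviest <- toy |> slice_max(body_mass_g, n = 2)

# The lightest penguin
lightest <- toy |> slice_min(body_mass_g, n = 1)

# Two rows picked at random
set.seed(1)
random_rows <- toy |> slice_sample(n = 2)
```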

arrange rows by column values

We can order rows by values of selected columns. List their names as arguments of the arrange function.

dataset |>
    arrange(var1_name, var2_name, ...)

Use desc() to sort a variable in descending order.
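
For example, assuming a small made-up tibble (toy is hypothetical), we can sort by species in ascending order and, within each species, by body mass in descending order:

```r
library(dplyr)

# Hypothetical toy data
toy <- tibble(
  species     = c("Gentoo", "Adelie", "Chinstrap", "Adelie"),
  body_mass_g = c(5000L, 3400L, 3800L, 3750L)
)

# species A to Z, then heaviest first within each species
sorted <- toy |>
  arrange(species, desc(body_mass_g))

sorted
```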

distinct rows by column values

We can select unique/distinct rows.

dataset |>
    distinct()

Observe that

dataset |>
    select(var1_name, var2_name, ...) |>
    distinct()

is equivalent to

dataset |>
    distinct(var1_name, var2_name, ...)
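
The equivalence can be checked on a small made-up tibble with one repeated combination (toy is hypothetical):

```r
library(dplyr)

# Hypothetical toy data: rows 2 and 3 share the same species and island
toy <- tibble(
  species     = c("Adelie", "Adelie", "Adelie", "Gentoo"),
  island      = c("Torgersen", "Biscoe", "Biscoe", "Biscoe"),
  body_mass_g = c(3750L, 3400L, 3600L, 5000L)
)

# select first, then keep unique rows
two_steps <- toy |>
  select(species, island) |>
  distinct()

# same result in one step
one_step <- toy |>
  distinct(species, island)
```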

count observations by group

We can count the observations for each unique value (or combination of values) of one or more variables.

dataset |>
    count(var1_name, var2_name, ...)
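
A sketch on made-up data (toy is hypothetical). The counts are returned in a new column called n.

```r
library(dplyr)

# Hypothetical toy data
toy <- tibble(
  species = c("Adelie", "Adelie", "Adelie", "Gentoo"),
  island  = c("Torgersen", "Biscoe", "Biscoe", "Biscoe")
)

# Number of rows per species
by_species <- toy |> count(species)

# Number of rows per species and island combination
by_species_island <- toy |> count(species, island)
```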

A verb from tidyr

  • Missing values
    • drop_na() drop rows containing missing values
dataset |>
    drop_na()
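
A sketch on made-up data with a few missing values (toy is hypothetical). Called without arguments, drop_na() removes every row with at least one NA; naming a column restricts the check to that column.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data with missing values
toy <- tibble(
  species     = c("Adelie", "Adelie", "Gentoo", "Chinstrap"),
  body_mass_g = c(3750L, NA, 5000L, 3800L),
  sex         = c("male", "female", NA, "female")
)

# Drop rows with a missing value in any column
complete_rows <- toy |> drop_na()

# Drop rows only when sex is missing
known_sex <- toy |> drop_na(sex)
```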

Thank you!

In the next episode

  • Column verbs
    • mutate() and transmute() create, modify, and delete columns
  • Row verbs
    • filter() subset rows by column values
  • Group verbs
    • group_by() group rows
    • summarise() collapse row groups into a single row

Honorable mentions

  • Column verbs
    • relocate() change column order
    • rename() change column name