Data Manipulation

Fundamentals of Data Science for NHS using R

Single table verbs (we have seen so far)

Column verbs
- select() subset by column
Row verbs
- slice() (and its friends) subset rows by position
- arrange() order rows by column values
- distinct() select unique rows
Group verbs
- count() count observation by group

The pipe `|>` operator

The pipe operator allows us to write more readable code:

function_n(... function_2(function_1(dataset)))

is the same as

dataset |>
    function_1() |>
    function_2() |>
    ... |>
    function_n()

Any questions?

Single table verbs (for today’s session)

Column verbs
- mutate() and transmute() create, modify, and delete columns
Row verbs
- filter() subset rows by column values
Group verbs
- group_by() group rows
- summarise() collapse row groups into a single row

A Useful function from base-R

summary produces useful summary statistics:

for a categorical variable it returns the count of each level, and the count of NA’s (if any).
for a numerical variable it returns minimum, 1^st quartile, median, mean, 3^rd quartile, maximum, and the count of NA’s (if any).

As usual we can ‘pipe’ this function:

dataset |>
    summary()

`filter` rows

We can select specific rows satisfying a specific condition.

Write the condition as argument of the filter function.

dataset |>
    filter(condition)

Important

When using strings in your conditions, don’t forget to enclose them in speechmarks e.g. "string"

Comparison Operators

Conditions can be expressed using

Operator	Syntax
equals	`==`
not equals	`!=`
less than	`<`
greater than	`>`
less than or equal to	`<=`
greater than or equal to	`>=`

Boolean Operators

Conditions can be combined using

Operator	Syntax
and	`&`
or	`\|`
not	`!`
xor	`xor()`

Use parentheses, ( and ), to build more ariticulated conditions.

Some tips

Tip 1 between

x >= a & x <= b

is the same as

between(x, a, b)

Tip 2 %in%

x == value_1 | x == value_2 | ... | x == value_n

is the same as

x %in% c(value_1, value_2, ..., value_n)

Missing Values (`NA`)

In the R language missing values are denoted by NA (not available). We can select missing values using the function is.na.

For example we can select all the observations having missing values in a specific variable in the following way:

dataset |>
    filter(is.na(variable_name))

We can select all the non missing values by using !is.na().

Apply a function across multiple variables (part 1)

if_any() and if_all() apply the same predicate function to a selection of variables.

if_any() is TRUE when the predicate is TRUE for any of the selected variables.
if_all() is TRUE when the predicate is TRUE for all the selected variables.

`if_any()` and `if_all()` syntax

if_any()

dataset |>
    filter(if_any(c(variable_1, variable_2, ..., variable_n), function)

if_all()

dataset |>
    filter(if_all(c(variable_1, variable_2, ..., variable_n), function)

Important

The function used MUST be a predicate function, i.e. a function that returns TRUE or FALSE (for example is.na).

“Smart” selections

Tip

We can select all variables but specific ones by listing their names with a minus sign before the name of each variable you want to exclude (same as for select).

Tip

To select all the variables use everything().

Statistical Functions (position)

mean() top evaluate the mean
median() top evaluate the median (50th percentile)

Important

These functions allow the option na.rm = TRUE / FALSE to exclude/include missing values. Observe that if there are missing values that are not removed the output would be an error.

Statistical Functions (spread)

IQR() top evaluate the interquartile range
mad() top evaluate the median absolute deviation
sd() top evaluate the standard deviation
var() top evaluate the variance

Important

These functions allow the option na.rm = TRUE / FALSE.

Statistical Functions (rank)

max() top evaluate the interquartile range
min() top evaluate the median absolute deviation
quantile( , prob = x) top evaluate the 100x^th percentile

Important

These functions allow the option na.rm = TRUE / FALSE.

Create, modify and delete columns

mutate() adds new variables and preserves existing ones
transmute() adds new variables and drops existing ones

New variables overwrite existing variables of the same name.

dataset |>
    mutate(variable = ...)

A verb from tidyr

Missing values
- drop_na() drop rows containing missing values

dataset |>
    drop_na()

Row Number

We can include the row number by using the row_number() function.

Conditional Behaviour

if_else(condition, true, false)

condition logical vector
true, false values to use for TRUE and FALSE values of condition

Tip

var_name = if_else(is.na(var_name), value_if_na, var_name)

is the same as

variable_name = replace_na(variable_name, value_if_na)

More than two choices

case_when(
    condition_1 ~ value_1,
    condition_2 ~ value_2,
    ...,
    condition_n ~ value_n)

If no cases match, NA is returned. That’s why it is prefereable to write

case_when(
    condition_1 ~ value_1,
    condition_2 ~ value_2,
    ...,
    condition_n ~ value_n,
    TRUE ~ value_otherwise)

Grouped Data

dplyr verbs can be applied to grouped tables by including the group_by() verb in your pipeline.

The argument should be the variable (or variables!) to use to perform the grouping.

dataset |>
    group_by(variable_1, variable_2, ...) |> 
    dplyr_verb(...)

Summaries

summarise() computes the summary functions listed as argument.

dataset |>
    summarise(
        summary_name_1 = summary_function_1(...),
        summary_name_2 = summary_function_2(...),
        ...
    )

Tip

Since summarise is a dplyr verb, it can be combined with group_by to evaluate numerical summaries by group.

Apply a function across multiple variables (part 2)

across() makes it easy to apply the same function to multiple variables.

Tip

When using across, the function doesn’t have to be a predicate function!

Thank you!

In the next episode

Categorical Variables and Dates!

Data Manipulation

Single table verbs (we have seen so far)

The pipe |> operator

Any questions?

Single table verbs (for today’s session)

A Useful function from base-R

filter rows

Comparison Operators

Boolean Operators

Some tips

Missing Values (NA)

Apply a function across multiple variables (part 1)

if_any() and if_all() syntax

“Smart” selections

Statistical Functions (position)

Statistical Functions (spread)

Statistical Functions (rank)

Create, modify and delete columns

A verb from tidyr

Row Number

Conditional Behaviour

More than two choices

Grouped Data

Summaries

Apply a function across multiple variables (part 2)

Thank you!

In the next episode

The pipe `|>` operator

`filter` rows

Missing Values (`NA`)

`if_any()` and `if_all()` syntax