Data Manipulation

Fundamentals of Data Science for NHS using R

Single table verbs (we have seen so far)

  • Column verbs
    • select() subset by column
  • Row verbs
    • slice() (and its friends) subset rows by position
    • arrange() order rows by column values
    • distinct() select unique rows
  • Group verbs
    • count() count observation by group

The pipe |> operator

The pipe operator allows us to write more readable code:

function_n(... function_2(function_1(dataset)))

is the same as

dataset |>
    function_1() |>
    function_2() |>
    ... |>
    function_n()

Any questions?

Single table verbs (for today’s session)

  • Column verbs
    • mutate() and transmute() create, modify, and delete columns
  • Row verbs
    • filter() subset rows by column values
  • Group verbs
    • group_by() group rows
    • summarise() collapse row groups into a single row

A Useful function from base-R

summary produces useful summary statistics:

  • for a categorical variable it returns the count of each level, and the count of NA’s (if any).

  • for a numerical variable it returns minimum, 1st quartile, median, mean, 3rd quartile, maximum, and the count of NA’s (if any).

As usual we can ‘pipe’ this function:

dataset |>
    summary()

filter rows

We can select specific rows satisfying a specific condition.

Write the condition as argument of the filter function.

dataset |>
    filter(condition)

Important

When using strings in your conditions, don’t forget to enclose them in speechmarks e.g. "string"

Comparison Operators

Conditions can be expressed using

Operator Syntax
equals ==
not equals !=
less than <
greater than >
less than or equal to <=
greater than or equal to >=

Boolean Operators

Conditions can be combined using

Operator Syntax
and &
or |
not !
xor xor()

Use parentheses, ( and ), to build more ariticulated conditions.

Some tips

Tip 1 between

x >= a & x <= b

is the same as

between(x, a, b)

Tip 2 %in%

x == value_1 | x == value_2 | ... | x == value_n

is the same as

x %in% c(value_1, value_2, ..., value_n)

Missing Values (NA)

In the R language missing values are denoted by NA (not available). We can select missing values using the function is.na.

For example we can select all the observations having missing values in a specific variable in the following way:

dataset |>
    filter(is.na(variable_name))

We can select all the non missing values by using !is.na().

Apply a function across multiple variables (part 1)

if_any() and if_all() apply the same predicate function to a selection of variables.

  • if_any() is TRUE when the predicate is TRUE for any of the selected variables.

  • if_all() is TRUE when the predicate is TRUE for all the selected variables.

if_any() and if_all() syntax

  • if_any()
dataset |>
    filter(if_any(c(variable_1, variable_2, ..., variable_n), function)
  • if_all()
dataset |>
    filter(if_all(c(variable_1, variable_2, ..., variable_n), function)

Important

The function used MUST be a predicate function, i.e. a function that returns TRUE or FALSE (for example is.na).

“Smart” selections

Tip

We can select all variables but specific ones by listing their names with a minus sign before the name of each variable you want to exclude (same as for select).

Tip

To select all the variables use everything().

Statistical Functions (position)

  • mean() top evaluate the mean

  • median() top evaluate the median (50th percentile)

Important

These functions allow the option na.rm = TRUE / FALSE to exclude/include missing values. Observe that if there are missing values that are not removed the output would be an error.

Statistical Functions (spread)

  • IQR() top evaluate the interquartile range

  • mad() top evaluate the median absolute deviation

  • sd() top evaluate the standard deviation

  • var() top evaluate the variance

Important

These functions allow the option na.rm = TRUE / FALSE.

Statistical Functions (rank)

  • max() top evaluate the interquartile range

  • min() top evaluate the median absolute deviation

  • quantile( , prob = x) top evaluate the 100xth percentile

Important

These functions allow the option na.rm = TRUE / FALSE.

Create, modify and delete columns

  • mutate() adds new variables and preserves existing ones

  • transmute() adds new variables and drops existing ones

New variables overwrite existing variables of the same name.

dataset |>
    mutate(variable = ...)

A verb from tidyr

  • Missing values
    • drop_na() drop rows containing missing values
dataset |>
    drop_na()

Row Number

We can include the row number by using the row_number() function.

Conditional Behaviour

if_else(condition, true, false)

  • condition logical vector

  • true, false values to use for TRUE and FALSE values of condition

Tip

var_name = if_else(is.na(var_name), value_if_na, var_name)

is the same as

variable_name = replace_na(variable_name, value_if_na)

More than two choices

case_when(
    condition_1 ~ value_1,
    condition_2 ~ value_2,
    ...,
    condition_n ~ value_n)

If no cases match, NA is returned. That’s why it is prefereable to write

case_when(
    condition_1 ~ value_1,
    condition_2 ~ value_2,
    ...,
    condition_n ~ value_n,
    TRUE ~ value_otherwise)

Grouped Data

dplyr verbs can be applied to grouped tables by including the group_by() verb in your pipeline.

The argument should be the variable (or variables!) to use to perform the grouping.

dataset |>
    group_by(variable_1, variable_2, ...) |> 
    dplyr_verb(...)

Summaries

summarise() computes the summary functions listed as argument.

dataset |>
    summarise(
        summary_name_1 = summary_function_1(...),
        summary_name_2 = summary_function_2(...),
        ...
    )

Tip

Since summarise is a dplyr verb, it can be combined with group_by to evaluate numerical summaries by group.

Apply a function across multiple variables (part 2)

across() makes it easy to apply the same function to multiple variables.

Tip

When using across, the function doesn’t have to be a predicate function!

Thank you!

In the next episode

Categorical Variables and Dates!