Fundamentals of Data Science for NHS using R

- select(): subset by column
- slice() (and its friends): subset rows by position
- arrange(): order rows by column values
- distinct(): select unique rows
- count(): count observations by group

|> operator

The pipe operator allows us to write more readable code:

dataset |> function_1() |> function_2() |> ... |> function_n()

is the same as

function_n(... function_2(function_1(dataset)))
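The equivalence between the piped and the nested form can be checked on a small example. This is only a sketch: the `patients` tibble and its columns are invented for illustration, and the base pipe `|>` assumes R 4.1 or later.

```r
library(dplyr)

# Hypothetical toy data, invented for illustration only
patients <- tibble(
  id  = 1:5,
  age = c(34, 71, 52, 71, 18)
)

# Nested form: has to be read from the inside out
nested <- head(arrange(distinct(patients, age), age), 2)

# Piped form: reads from top to bottom, one step per line
piped <- patients |>
  distinct(age) |>
  arrange(age) |>
  head(2)

# Both forms give the same table
identical(nested, piped)
```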
- mutate() and transmute(): create, modify, and delete columns
- filter(): subset rows by column values
- group_by(): group rows
- summarise(): collapse row groups into a single row

summary() produces useful summary statistics:

- for a categorical variable it returns the count of each level, and the count of NA’s (if any).
- for a numerical variable it returns minimum, 1st quartile, median, mean, 3rd quartile, maximum, and the count of NA’s (if any).
As usual we can ‘pipe’ this function:
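A minimal sketch of piping into `summary()`, assuming a small invented tibble called `patients` with one categorical (factor) and one numerical column:

```r
library(dplyr)

# Hypothetical toy data: one factor and one numeric column, each with an NA
patients <- tibble(
  sex = factor(c("F", "M", "F", NA)),
  age = c(34, 71, NA, 18)
)

# Summary of every column in the table
patients |>
  summary()

# Summary of a single numerical column
age_summary <- patients |>
  pull(age) |>
  summary()

age_summary
```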
filter rows

We can select specific rows satisfying a specific condition. Write the condition as the argument of the filter() function.
Important
When using strings in your conditions, don’t forget to enclose them in quotation marks, e.g. "string".
Conditions can be expressed using the following comparison operators:
Operator | Syntax |
---|---|
equals | == |
not equals | != |
less than | < |
greater than | > |
less than or equal to | <= |
greater than or equal to | >= |
Conditions can be combined using the following logical operators:
Operator | Syntax |
---|---|
and | & |
or | \| |
not | ! |
xor | xor() |
Use parentheses, ( and ), to build more articulated conditions.
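The operators above can be combined inside `filter()`. The following is a sketch on invented data; the `patients` tibble and its column names are assumptions made for the example.

```r
library(dplyr)

# Hypothetical toy data, invented for illustration only
patients <- tibble(
  id   = 1:6,
  sex  = c("F", "M", "F", "M", "F", "M"),
  age  = c(34, 71, 52, 45, 18, 80),
  ward = c("A", "B", "A", "C", "B", "A")
)

# String values go in quotation marks
females <- patients |>
  filter(sex == "F")

# Parentheses make the grouping explicit:
# (female AND older than 50) OR (on ward C)
selected <- patients |>
  filter((sex == "F" & age > 50) | ward == "C")

selected
```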
Tip 1 between
x >= a & x <= b
is the same as
between(x, a, b)
Tip 2 %in%
x == value_1 | x == value_2 | ... | x == value_n
is the same as
x %in% c(value_1, value_2, ..., value_n)
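Both tips can be verified on a small table. This sketch uses an invented `patients` tibble; the cut-offs 40 and 75 and the ward labels are arbitrary choices for the example.

```r
library(dplyr)

# Hypothetical toy data, invented for illustration only
patients <- tibble(
  id   = 1:6,
  age  = c(34, 71, 52, 45, 18, 80),
  ward = c("A", "B", "A", "C", "B", "A")
)

# Tip 1: a range check written with two comparisons, then with between()
by_comparison <- patients |> filter(age >= 40 & age <= 75)
by_between    <- patients |> filter(between(age, 40, 75))

# Tip 2: a chain of equality checks, then the same test with %in%
by_or <- patients |> filter(ward == "A" | ward == "C")
by_in <- patients |> filter(ward %in% c("A", "C"))

identical(by_comparison, by_between)
identical(by_or, by_in)
```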
Missing values (NA)

In the R language missing values are denoted by NA (not available). We can select missing values using the function is.na().

For example, we can select all the observations with a missing value in a specific variable by using is.na(variable) as the filter condition.

We can select all the non-missing values by using !is.na().
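As a sketch of both cases, assuming an invented `patients` tibble whose `age` column contains missing values:

```r
library(dplyr)

# Hypothetical toy data with missing ages
patients <- tibble(
  id  = 1:4,
  age = c(34, NA, 52, NA)
)

# Observations where age is missing
missing_age <- patients |>
  filter(is.na(age))

# Observations where age is not missing
known_age <- patients |>
  filter(!is.na(age))
```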
if_any() and if_all() apply the same predicate function to a selection of variables.

- if_any() is TRUE when the predicate is TRUE for any of the selected variables.
- if_all() is TRUE when the predicate is TRUE for all the selected variables.

if_any() and if_all() syntax
Important
The function used MUST be a predicate function, i.e. a function that returns TRUE or FALSE (for example is.na).
Tip
We can select all variables except specific ones by listing their names with a minus sign before the name of each variable we want to exclude (same as for select).
Tip
To select all the variables use everything().
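The following sketch shows `if_any()` and `if_all()` inside `filter()`, including the minus-sign and `everything()` selections. The `patients` tibble is invented for illustration, and the `~ !is.na(.x)` lambda is one possible way of writing a custom predicate.

```r
library(dplyr)

# Hypothetical toy data with missing values in two columns
patients <- tibble(
  id     = 1:4,
  age    = c(34, NA, 52, NA),
  weight = c(70, 82, NA, NA)
)

# Rows with a missing value in at least one of the selected variables
any_missing <- patients |>
  filter(if_any(c(age, weight), is.na))

# Rows where all the selected variables are missing
all_missing <- patients |>
  filter(if_all(c(age, weight), is.na))

# Select every variable except id by prefixing it with a minus sign
any_missing_not_id <- patients |>
  filter(if_any(-id, is.na))

# everything() selects all the variables; keep only complete rows
complete_rows <- patients |>
  filter(if_all(everything(), ~ !is.na(.x)))
```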
- mean(): to evaluate the mean
- median(): to evaluate the median (50th percentile)

Important
These functions allow the option na.rm = TRUE / FALSE to exclude/include missing values. Observe that if there are missing values that are not removed, the output will be NA.
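A short base R sketch of the `na.rm` option; the vector `ages` is invented. With the default `na.rm = FALSE`, a missing value propagates into the result, so `mean()` and `median()` return NA.

```r
# Invented ages with one missing value
ages <- c(34, 71, NA, 18)

# Default na.rm = FALSE: the missing value propagates
mean_with_na <- mean(ages)                  # NA

# na.rm = TRUE: missing values are dropped before computing
mean_ages   <- mean(ages, na.rm = TRUE)     # 41
median_ages <- median(ages, na.rm = TRUE)   # 34
```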
- IQR(): to evaluate the interquartile range
- mad(): to evaluate the median absolute deviation
- sd(): to evaluate the standard deviation
- var(): to evaluate the variance

Important
These functions allow the option na.rm = TRUE / FALSE.
- max(): to evaluate the maximum
- min(): to evaluate the minimum
- quantile( , probs = x): to evaluate the 100x-th percentile

Important
These functions allow the option na.rm = TRUE / FALSE.
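The spread and position functions can be tried on a small vector. This is a base R sketch on invented numbers; `x` and the result names are chosen only for the example.

```r
# Invented measurements with one missing value
x <- c(2, 4, 4, 5, 7, 9, NA)

smallest <- min(x, na.rm = TRUE)
largest  <- max(x, na.rm = TRUE)

# quantile() takes the probability through its probs argument;
# probs = 0.5 gives the 50th percentile, i.e. the median
q50 <- quantile(x, probs = 0.5, na.rm = TRUE)

# Measures of spread, all with missing values removed
spread <- c(
  iqr = IQR(x, na.rm = TRUE),
  mad = mad(x, na.rm = TRUE),
  sd  = sd(x, na.rm = TRUE),
  var = var(x, na.rm = TRUE)
)

spread
```

Unlike `mean()`, which returns NA, `quantile()` (and therefore `IQR()`) stops with an error when the data contain NA and `na.rm` is left at FALSE.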
- mutate(): adds new variables and preserves existing ones
- transmute(): adds new variables and drops existing ones

New variables overwrite existing variables of the same name.
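A sketch of the difference between the two verbs, using an invented `patients` tibble; the BMI formula is only there to give the new column something to compute.

```r
library(dplyr)

# Hypothetical toy data, invented for illustration only
patients <- tibble(
  id        = 1:3,
  height_m  = c(1.60, 1.75, 1.82),
  weight_kg = c(60.4, 80.2, 95.7)
)

# mutate() keeps the existing columns and appends bmi
with_bmi <- patients |>
  mutate(bmi = weight_kg / height_m^2)

# transmute() keeps only the columns named in the call
only_bmi <- patients |>
  transmute(id, bmi = weight_kg / height_m^2)

# Reusing an existing name overwrites that column
rounded <- patients |>
  mutate(weight_kg = round(weight_kg))
```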
drop_na(): drop rows containing missing values

We can include the row number by using the row_number() function.
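A sketch combining the two functions on invented data. Note that `drop_na()` comes from the tidyr package, so it is loaded alongside dplyr here.

```r
library(dplyr)
library(tidyr)  # drop_na() is provided by tidyr

# Hypothetical toy data with one missing age
patients <- tibble(
  name = c("Ann", "Bob", "Cal", "Dee"),
  age  = c(34, NA, 52, 18)
)

# Drop incomplete rows, then number the rows that remain
cleaned <- patients |>
  drop_na() |>
  mutate(row = row_number())

cleaned
```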
if_else(condition, true, false)

- condition: logical vector
- true, false: values to use for TRUE and FALSE values of condition

Tip
variable_name = if_else(is.na(variable_name), value_if_na, variable_name)
is the same as
variable_name = replace_na(variable_name, value_if_na)
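The tip can be checked directly. In this sketch the `patients` tibble and the replacement value 0 are assumptions made for the example; `replace_na()` comes from tidyr.

```r
library(dplyr)
library(tidyr)  # replace_na() is provided by tidyr

# Hypothetical toy data with missing visit counts
patients <- tibble(
  id     = 1:4,
  visits = c(2, NA, 5, NA)
)

# Replace missing values with if_else()
with_if_else <- patients |>
  mutate(visits = if_else(is.na(visits), 0, visits))

# The same replacement with replace_na()
with_replace_na <- patients |>
  mutate(visits = replace_na(visits, 0))

identical(with_if_else, with_replace_na)
```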
In case_when(), if no cases match, NA is returned. That’s why it is preferable to write a final catch-all condition.
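The note about NA being returned when no cases match mirrors the documented behaviour of dplyr's `case_when()`. Assuming that is the function meant here, the sketch below contrasts a call without and with a final catch-all case; the data and the group labels are invented.

```r
library(dplyr)

# Hypothetical toy data, including one missing age
patients <- tibble(
  id  = 1:4,
  age = c(10, 40, 70, NA)
)

# Without a catch-all, rows matching no case get NA
without_default <- patients |>
  mutate(group = case_when(
    age < 18 ~ "child",
    age < 65 ~ "adult"
  ))

# A final TRUE case catches everything not matched earlier
with_default <- patients |>
  mutate(group = case_when(
    age < 18 ~ "child",
    age < 65 ~ "adult",
    TRUE     ~ "other"
  ))
```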
dplyr verbs can be applied to grouped tables by including the group_by() verb in your pipeline.

The argument should be the variable (or variables!) to use to perform the grouping.
summarise() computes the summary functions listed as arguments.

Tip
Since summarise() is a dplyr verb, it can be combined with group_by() to evaluate numerical summaries by group.
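A sketch of a grouped summary, assuming an invented `patients` tibble grouped by a hypothetical `ward` column:

```r
library(dplyr)

# Hypothetical toy data, invented for illustration only
patients <- tibble(
  ward = c("A", "A", "B", "B", "B"),
  age  = c(30, 50, 20, NA, 40)
)

# One output row per ward, one column per summary function
by_ward <- patients |>
  group_by(ward) |>
  summarise(
    n        = n(),
    mean_age = mean(age, na.rm = TRUE),
    max_age  = max(age, na.rm = TRUE)
  )

by_ward
```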
across() makes it easy to apply the same function to multiple variables.

Tip
When using across(), the function doesn’t have to be a predicate function!
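A sketch of `across()` with non-predicate functions, once inside `summarise()` and once inside `mutate()`. The `patients` tibble is invented, and the `~ mean(.x, na.rm = TRUE)` lambda is one possible way to pass extra arguments.

```r
library(dplyr)

# Hypothetical toy data, invented for illustration only
patients <- tibble(
  ward   = c("A", "A", "B", "B"),
  age    = c(30, 50, 20, 40),
  weight = c(60.4, 80.6, 70.2, NA)
)

# The same summary function applied to several columns, by group
means_by_ward <- patients |>
  group_by(ward) |>
  summarise(across(c(age, weight), ~ mean(.x, na.rm = TRUE)))

# The same transformation applied to several columns
rounded <- patients |>
  mutate(across(c(age, weight), round))
```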
Categorical Variables and Dates!