library(tidyverse)
library(nycflights13)
Fundamentals of Data Science for NHS using R
Typically you have many tables of data, and you must combine them to answer the questions that you’re interested in.
Collectively, multiple tables of data are called relational data because it is the relations, not just the individual datasets, that are important.
Relations are always defined between a pair of tables.
Mutating joins
, which add new variables to one data frame from matching observations in another.Filtering joins
, which filter observations from one data frame based on whether or not they match an observation in the other table.Set operations
, which treat observations as if they were set elements.
nycflights13
using the two-table verbs from dplyr
.?nycflights13
nycflights13
contains four tibbles
airlines
lets you look up the full carrier name from its abbreviated carrier
codeairports
gives information about each airport, identified by the faa
airport codeplanes
gives information about each plane, identified by its tailnum
weather
gives the weather at each NYC airport for each hour
Examine the variables that exist across the 4 tibbles in nycflights13
and draw a diagram depicting the relationships that exists between flights
and the other tables.
flights
flights
connects to planes
via a single variable, tailnum
flights
connects to airlines
through the carrier
variableflights
connects to airports
in two ways: via the origin
and dest
variablesflights
connects to weather
via origin
, and year
, month
, day
and hour
variablesnycflights13
weather
and airports
and how should it appear in the diagram?key
?key
is a variable (or set of variables) that uniquely identifies an observation.
primary key
uniquely identifies an observation in its own table. For example, planes$tailnum
is a primary key because it uniquely identifies each plane in the planes table.foreign key
uniquely identifies an observation in another table. For example, flights$tailnum
is a foreign key because it appears in the flights table where it matches each flight to a unique plane.
flights
does not have an explicit primary key.mutate()
and row_number()
.surrogate
key and makes it easier to check back with the original data following manipulation.
relation
.
A mutating join allows you to combine variables from two tables. It first matches observations by their keys, then copies across variables from one table to the other.
To help us understand joins, let’s create the following two tables.
The coloured column represents the “key” variable: these are used to match the rows between the tables.
In an actual join, matches will be indicated with dots. The number of dots = the number of matches = the number of rows in the output.
In an actual join, matches will be indicated with dots. The number of dots = the number of matches = the number of rows in the output.
inner_join
The simplest type of join is the inner join. An inner join matches pairs of observations whenever their keys are equal. We use by
to tell dplyr which variable is the key.
outer_join
An outer join keeps observations that appear in at least one of the tables. There are three types of outer joins:
outer_join
left_join()
Imagine you want to add the full airline name to the flights data. You can combine the airlines and flights data frames with left_join()
.
Explore the different outer join options across the tables.
So far all the diagrams have assumed that the keys are unique. There are two possibilities when keys are not unique.
This is useful when you want to add in additional information as there is typically a one-to-many relationship.
When you join duplicated keys, you get all possible combinations.
You can use other values for by
to connect the tables in other ways:
by = NULL
, uses all variables that appear in both tables, the so called natural
join.by = "x"
. This is like a natural join, but uses only some of the common variables.by = c("a" = "b")
. This will match variable a
in table x
to variable b
in table y
. The variables from x
will be used in the output.
Filtering joins match observations in the same way as mutating joins, but affect the observations, not the variables. There are two types:
semi_join(x, y)
keeps all observations in x
that have a match in y
.anti_join(x, y)
drops all observations in x
that have a match in y
.
Semi-joins
are useful for matching filtered summary tables back to the original rows. For example, imagine you’ve found the top ten most popular destinations.
Now you want to find each flight that went to one of those destinations. You could construct a filter yourself.
Instead you can use a semi-join, which connects the two tables like a mutating join, but instead of adding new columns, only keeps the rows in x that have a match in y.
Graphically, a semi-join looks like this:
Only the existence of a match is important; it doesn’t matter which observation is matched. This means that filtering joins never duplicate rows like mutating joins do.
The inverse of a semi-join is an anti-join. An anti-join keeps the rows that don’t have a match.
Anti-joins are useful for diagnosing join mismatches. For example, when connecting flights and planes, you might be interested to know that there are many flights that don’t have a match in planes.
The final type of two-table verb are the set operations. All these operations work with a complete row, comparing the values of every variable. These expect the x and y inputs to have the same variables:
intersect(x, y):
return only observations in both x
and y
.union(x, y):
return unique observations in x
and y
.setdiff(x, y):
return observations in x
, but not in y
.Data Visualisation!