library(tidyverse)
library(nycflights13)Fundamentals of Data Science for NHS using R
Typically you have many tables of data, and you must combine them to answer the questions that you’re interested in.
Collectively, multiple tables of data are called relational data because it is the relations, not just the individual datasets, that are important.
Relations are always defined between a pair of tables.
Mutating joins, which add new variables to one data frame from matching observations in another.Filtering joins, which filter observations from one data frame based on whether or not they match an observation in the other table.Set operations, which treat observations as if they were set elements.
nycflights13 using the two-table verbs from dplyr.?nycflights13nycflights13 contains four tibbles
airlines lets you look up the full carrier name from its abbreviated carrier codeairports gives information about each airport, identified by the faa airport codeplanes gives information about each plane, identified by its tailnumweather gives the weather at each NYC airport for each hour
Examine the variables that exist across the 4 tibbles in nycflights13 and draw a diagram depicting the relationships that exists between flights and the other tables.
flights
flights connects to planes via a single variable, tailnumflights connects to airlines through the carrier variableflights connects to airports in two ways: via the origin and dest variablesflights connects to weather via origin, and year, month, day and hour variablesnycflights13weather and airports and how should it appear in the diagram?key?key is a variable (or set of variables) that uniquely identifies an observation.
primary key uniquely identifies an observation in its own table. For example, planes$tailnum is a primary key because it uniquely identifies each plane in the planes table.foreign key uniquely identifies an observation in another table. For example, flights$tailnum is a foreign key because it appears in the flights table where it matches each flight to a unique plane.
flights does not have an explicit primary key.mutate() and row_number().surrogate key and makes it easier to check back with the original data following manipulation.
relation.
A mutating join allows you to combine variables from two tables. It first matches observations by their keys, then copies across variables from one table to the other.
To help us understand joins, let’s create the following two tables.
The coloured column represents the “key” variable: these are used to match the rows between the tables.
In an actual join, matches will be indicated with dots. The number of dots = the number of matches = the number of rows in the output.
In an actual join, matches will be indicated with dots. The number of dots = the number of matches = the number of rows in the output.
inner_joinThe simplest type of join is the inner join. An inner join matches pairs of observations whenever their keys are equal. We use by to tell dplyr which variable is the key.
outer_join
An outer join keeps observations that appear in at least one of the tables. There are three types of outer joins:
outer_joinleft_join()
Imagine you want to add the full airline name to the flights data. You can combine the airlines and flights data frames with left_join().
Explore the different outer join options across the tables.
So far all the diagrams have assumed that the keys are unique. There are two possibilities when keys are not unique.
This is useful when you want to add in additional information as there is typically a one-to-many relationship.
When you join duplicated keys, you get all possible combinations.
You can use other values for by to connect the tables in other ways:
by = NULL, uses all variables that appear in both tables, the so called natural join.by = "x". This is like a natural join, but uses only some of the common variables.by = c("a" = "b"). This will match variable a in table x to variable b in table y. The variables from x will be used in the output.
Filtering joins match observations in the same way as mutating joins, but affect the observations, not the variables. There are two types:
semi_join(x, y) keeps all observations in x that have a match in y.anti_join(x, y) drops all observations in x that have a match in y.
Semi-joins are useful for matching filtered summary tables back to the original rows. For example, imagine you’ve found the top ten most popular destinations.
Now you want to find each flight that went to one of those destinations. You could construct a filter yourself.
Instead you can use a semi-join, which connects the two tables like a mutating join, but instead of adding new columns, only keeps the rows in x that have a match in y.
Graphically, a semi-join looks like this:
Only the existence of a match is important; it doesn’t matter which observation is matched. This means that filtering joins never duplicate rows like mutating joins do.
The inverse of a semi-join is an anti-join. An anti-join keeps the rows that don’t have a match.
Anti-joins are useful for diagnosing join mismatches. For example, when connecting flights and planes, you might be interested to know that there are many flights that don’t have a match in planes.
The final type of two-table verb are the set operations. All these operations work with a complete row, comparing the values of every variable. These expect the x and y inputs to have the same variables:
intersect(x, y): return only observations in both x and y.union(x, y): return unique observations in x and y.setdiff(x, y): return observations in x, but not in y.Data Visualisation!
