library(tidyverse)
Fundamentals of Data Science for NHS using R
Two types of questions will always be useful for making discoveries within your data. You can loosely word these questions as:
dplyr and ggplot2 to interactively ask questions, answer them with data, and then ask new questions.
diamonds data
A dataset containing the prices and other attributes of almost 54,000 diamonds.
What are the limitations of this plot?
mpg dataset
Fuel economy data from 1999 to 2008 for 38 popular models of cars.
mpg dataset.
hwy) and class?
We can reorder class using the reorder() function. Its FUN argument is a numeric summary function.
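As a minimal sketch of the reordering described above, assuming the mpg data from ggplot2 and choosing median as the summary function (the choice of median is an illustrative assumption, not something the slide specifies):

```r
library(ggplot2)

# Reorder the levels of class by the median of hwy within each class,
# so the boxes appear in order of increasing median highway mileage.
class_by_hwy <- reorder(mpg$class, mpg$hwy, FUN = median)

p_reorder <- ggplot(mpg) +
  geom_boxplot(aes(x = reorder(class, hwy, FUN = median), y = hwy))
p_reorder
```

Any function that returns a single number per group (for example mean or max) could be passed as FUN instead.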
If you have long variable names, geom_boxplot() will work better if you flip it 90 degrees. You can do that with coord_flip().
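A short sketch of the flipped boxplot, again assuming the mpg data and a median ordering chosen only for illustration:

```r
library(ggplot2)

# Same boxplot of hwy by class, rotated 90 degrees so the class labels
# run down the vertical axis and have room to be read.
p_flip <- ggplot(mpg) +
  geom_boxplot(aes(x = reorder(class, hwy, FUN = median), y = hwy)) +
  coord_flip()
p_flip
```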
geom_count
aes(x = color, y = cut) rather than aes(x = cut, y = color) in the example above?
You’ve already seen one great way to visualise the covariation between two continuous variables: draw a scatterplot with geom_point().
carat and price
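One way the carat and price scatterplot might be drawn, assuming the diamonds data that ships with ggplot2:

```r
library(ggplot2)

# One point per diamond: carat on the x-axis, price on the y-axis.
p_scatter <- ggplot(diamonds) +
  geom_point(aes(x = carat, y = price))
p_scatter
```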
Scatterplots become less useful as the size of your dataset grows, because points begin to overplot, and pile up into areas of uniform black.
geom_bin2d() and geom_hex() divide the coordinate plane into 2d bins and then use a fill color to display how many points fall into each bin.
geom_bin2d() creates rectangular bins
geom_hex() creates hexagonal bins
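The two binning geoms can be sketched as follows, assuming the diamonds data with carat and price as the two continuous variables. Note that rendering geom_hex() requires the hexbin package to be installed.

```r
library(ggplot2)

# Rectangular 2d bins; fill shows the number of diamonds in each bin.
p_bin2d <- ggplot(diamonds) +
  geom_bin2d(aes(x = carat, y = price))

# Hexagonal 2d bins; drawing this plot needs the hexbin package.
p_hex <- ggplot(diamonds) +
  geom_hex(aes(x = carat, y = price))

p_bin2d
```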
varwidth = TRUE.
Instead of summarising the conditional distribution with a boxplot, you could use a frequency polygon.
What do you need to consider when using cut_width() vs cut_number()? How does that impact a visualisation of the 2d distribution of carat and price?
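To explore the question above, the following sketch bins carat both ways and draws a frequency polygon of price for each bin. The bin width of 0.5, the 5 groups, and the price binwidth of 500 are illustrative assumptions, not values from the slides.

```r
library(ggplot2)

# cut_width(): every bin spans the same range of carat,
# so the number of diamonds per bin can differ a lot.
width_bins <- cut_width(diamonds$carat, width = 0.5)
table(width_bins)

# cut_number(): every bin holds roughly the same number of diamonds,
# so the range of carat covered by each bin differs instead.
number_bins <- cut_number(diamonds$carat, n = 5)
table(number_bins)

# Frequency polygons of price, one line per carat bin.
p_width <- ggplot(diamonds, aes(x = price, colour = cut_width(carat, 0.5))) +
  geom_freqpoly(binwidth = 500)
p_number <- ggplot(diamonds, aes(x = price, colour = cut_number(carat, 5))) +
  geom_freqpoly(binwidth = 500)
p_number
```

Comparing the two tables shows the trade-off: equal-width bins are easy to interpret but sparse bins produce noisy lines, while equal-count bins give every line similar support at the cost of uneven carat ranges.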
Patterns in your data provide clues about relationships. If you spot a pattern, ask yourself:
On-time data for all flights that departed NYC (i.e. JFK, LGA or EWR) in 2013.