library(tidyverse)
Fundamentals of Data Science for NHS using R
You always need to investigate the quality of your data.
A variable is a quantity, quality, or property that you can measure.
A value is the state of a variable when you measure it; it may change from measurement to measurement.
An observation is a set of measurements made under similar conditions; it contains several values.
Tabular data is a set of values, each associated with a variable and an observation. Tabular data is tidy if each value is placed in its own “cell”, each variable in its own column, and each observation in its own row.
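As a minimal sketch of what "tidy" means in practice, the example below reshapes `table4a`, a small example table shipped with tidyr in which one variable (year) is spread across two column names. The choice of `table4a` and the output column names `year` and `cases` are assumptions made for illustration.

```r
library(tidyr)

# table4a stores one variable (year) in the column names `1999` and `2000`,
# so a single row holds two observations.
table4a

# Move the two year columns into a `year` variable and their cells into a
# `cases` variable, so that every row is exactly one observation.
tidy_cases <- pivot_longer(
  table4a,
  cols = c(`1999`, `2000`),
  names_to = "year",
  values_to = "cases"
)

tidy_cases
```

After reshaping, each country-year combination sits in its own row and each variable in its own column.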
Use dplyr and ggplot2 to interactively ask questions, answer them with data, and then ask new questions.
Two types of questions will always be useful for making discoveries within your data. You can loosely word these questions as:
Anscombe's quartet comprises four datasets. Each dataset consists of eleven (x, y) points.
The Datasaurus dataset was initially created by Alberto Cairo. It is composed of 142 observations with a bivariate normal distribution.
The diamonds data is a dataset containing the prices and other attributes of almost 54,000 diamonds.
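A short sketch of how two of these datasets can be inspected, assuming the `anscombe` data frame that ships with base R (columns `x1` to `x4` and `y1` to `y4`) and the `diamonds` data from ggplot2. The correlation check is an illustrative choice, not part of the original material.

```r
library(ggplot2)  # provides the diamonds data

# Correlation between x and y within each of the four Anscombe datasets.
anscombe_cor <- sapply(1:4, function(i) {
  cor(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]])
})

# The four correlations are nearly identical, even though the datasets
# look very different when plotted.
round(anscombe_cor, 3)

# Size of the diamonds data: rows are diamonds, columns are attributes.
dim(diamonds)
```

Near-identical summary statistics across visibly different datasets are the usual argument for plotting data before modelling it.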
Every variable has its own pattern of variation, which can reveal interesting information. The best way to understand that pattern is to visualise the distribution of the variable’s values. To examine the distribution of a categorical variable, use a bar chart.
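For instance, a bar chart of a categorical variable might be sketched as follows. The choice of the `cut` variable from `diamonds` is an assumption made for illustration.

```r
library(ggplot2)
library(dplyr)

# Bar chart: one bar per category, bar height = number of observations.
p_bar <- ggplot(diamonds, aes(x = cut)) +
  geom_bar()
p_bar

# The bar heights can also be computed directly as a table of counts.
cut_counts <- count(diamonds, cut)
cut_counts
```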
To examine the distribution of a continuous variable, use a histogram:
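A minimal histogram sketch, assuming the `carat` variable from `diamonds` and a bin width of 0.5 (both chosen here for illustration):

```r
library(ggplot2)
library(dplyr)

# Histogram: the x axis is split into equal-width bins and the bar height
# shows how many observations fall into each bin.
p_hist <- ggplot(diamonds, aes(x = carat)) +
  geom_histogram(binwidth = 0.5)
p_hist

# The same bin counts computed by hand with cut_width().
carat_bins <- diamonds %>%
  count(bin = cut_width(carat, 0.5))
carat_bins
```

It is worth trying several bin widths, since different widths can reveal different patterns.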
What is the histogram telling us?
Clusters of similar values suggest that subgroups exist in your data. To understand the subgroups, ask:
Outliers are observations that are unusual; data points that don’t seem to fit the pattern. Sometimes outliers are data entry errors; other times outliers suggest important new science.
The y variable measures one of the three dimensions of these diamonds, in mm.
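One way to look for unusual values of `y` is sketched below. The zoomed y-axis limit of 50 and the cut-offs of 3 mm and 20 mm are assumptions chosen for illustration, not thresholds stated in the material.

```r
library(ggplot2)
library(dplyr)

# With almost 54,000 diamonds, rare bins are invisible at full scale.
# coord_cartesian() zooms the count axis without dropping any data.
p_zoom <- ggplot(diamonds, aes(x = y)) +
  geom_histogram(binwidth = 0.5) +
  coord_cartesian(ylim = c(0, 50))
p_zoom

# Pull out the rows with implausibly small or large y values
# (illustrative cut-offs).
unusual <- diamonds %>%
  filter(y < 3 | y > 20) %>%
  select(price, x, y, z) %>%
  arrange(y)
unusual
```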
If you’ve encountered unusual values in your dataset, and simply want to move on to the rest of your analysis, you have two options.
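Two common ways of handling unusual values are dropping the affected rows or replacing the unusual values with `NA`. The sketch below assumes these are the two options meant, and reuses illustrative cut-offs of 3 mm and 20 mm for the `y` variable of `diamonds`.

```r
library(ggplot2)  # provides the diamonds data
library(dplyr)

# Approach A (assumed): drop every row that has an unusual y value.
dropped <- diamonds %>%
  filter(between(y, 3, 20))

# Approach B (assumed): keep every row, but replace the unusual y values
# with NA so the other measurements in those rows are not lost.
recoded <- diamonds %>%
  mutate(y = if_else(y < 3 | y > 20, NA_real_, y))

nrow(diamonds) - nrow(dropped)  # rows removed by approach A
sum(is.na(recoded$y))           # values set to NA by approach B
```

Replacing with `NA` keeps the rest of each row available for analysis, at the cost of having to handle missing values later.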
Other times you want to understand what makes observations with missing values different to observations with recorded values.
On-time data for all flights that departed NYC (i.e. JFK, LGA or EWR) in 2013.
For example, in nycflights13::flights, missing values in the dep_time variable indicate that the flight was cancelled.
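A sketch of how flights with a missing `dep_time` could be compared with the rest, assuming a missing departure time means the flight was cancelled. The helper columns `cancelled` and `sched_dep_hours` are names introduced here for illustration.

```r
library(nycflights13)
library(dplyr)
library(ggplot2)

flights_flagged <- flights %>%
  mutate(
    # Assumption: a missing departure time means the flight was cancelled.
    cancelled = is.na(dep_time),
    # sched_dep_time is stored as HHMM (e.g. 1530), so convert it to
    # decimal hours for plotting.
    sched_dep_hours = sched_dep_time %/% 100 + (sched_dep_time %% 100) / 60
  )

# Compare scheduled departure times of cancelled and non-cancelled flights.
p_cancelled <- ggplot(flights_flagged, aes(x = sched_dep_hours)) +
  geom_freqpoly(aes(colour = cancelled), binwidth = 1 / 4)
p_cancelled

# Number of flights in each group.
count(flights_flagged, cancelled)
```

Because there are far more non-cancelled than cancelled flights, the raw counts are hard to compare; plotting density instead of count is one way around that.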
EDA Part 2