library(tidyverse)
Fundamentals of Data Science for NHS using R
A variable that can take on one of a limited, and usually fixed, number of possible values
Factor variables in R are categorical variables that can be either numeric or string variables
For each of the following, what type of categorical variable is it?
Inspired by R for Data Science by Hadley Wickham and Garrett Grolemund. https://r4ds.had.co.nz/factors.html
forcats package
To work with factors, we’ll use the forcats package, which is part of the core tidyverse.
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.1 ✔ readr 2.1.4
✔ forcats 1.0.0 ✔ stringr 1.5.0
✔ ggplot2 3.4.2 ✔ tibble 3.2.1
✔ lubridate 1.9.2 ✔ tidyr 1.3.0
✔ purrr 1.0.1
To create a factor variable we use the factor() function.
factor()
The levels argument of the factor() function determines the categories of the factor variable; it is optional, and the default is the sorted list of all the distinct values of the data vector.
The labels argument is another optional argument: a vector of values that will be the labels of the categories in the levels argument.
The exclude argument is also optional; it defines which levels will be classified as NA in any output using the factor variable.
unique
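The three arguments described above can be sketched on a small vector. The data and the object names (`responses`, `f_default`, `f_ordered`, `f_excluded`) are made up for illustration and are not from the course material.

```r
# Hypothetical data: survey answers recorded as strings
responses <- c("low", "high", "medium", "low", "unknown")

# Default behaviour: levels are the sorted distinct values of the vector
f_default <- factor(responses)
levels(f_default)

# levels fixes the categories and their order; labels renames them.
# Values that are not listed in levels (here "unknown") become NA.
f_ordered <- factor(
  responses,
  levels = c("low", "medium", "high"),
  labels = c("Low", "Medium", "High")
)
f_ordered

# exclude: values listed here are dropped from the levels and treated as NA
f_excluded <- factor(responses, exclude = "unknown")
levels(f_excluded)
```

Setting `levels` explicitly is usually the safer choice when the categories have a natural order, because the default alphabetical order rarely matches it.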
Sometimes you’d prefer that the order of the levels match the order of the first appearance in the data. You can do that when creating the factor by setting levels to unique(x).
fct_inorder
Or you can do it after the fact, with fct_inorder().
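Both routes to first-appearance ordering can be compared side by side. This is a minimal sketch; the vector `x` is invented for illustration.

```r
library(forcats)

# Hypothetical data: months in the order they happen to appear
x <- c("Dec", "Apr", "Jan", "Mar", "Apr")

# Option 1: set the order when creating the factor
f1 <- factor(x, levels = unique(x))
levels(f1)

# Option 2: create the factor first (alphabetical levels),
# then reorder the levels by first appearance
f2 <- fct_inorder(factor(x))
levels(f2)
```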
We will use a sample of data from the General Social Survey. This is a long-running US survey conducted by the independent research organization NORC at the University of Chicago.
factors in a tibble
When factors are stored in a tibble, you can’t see their levels so easily. One way to see them is with count() or with a bar chart.
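A short sketch of both approaches follows. It assumes the survey sample is the `gss_cat` dataset that ships with forcats and uses its `race` column; the slides may use a different object or column.

```r
library(dplyr)
library(forcats)
library(ggplot2)

# Assumption: gss_cat (bundled with forcats) is the General Social Survey sample

# Tabulate how many rows fall in each level of the factor
race_counts <- gss_cat %>%
  count(race)
race_counts

# The same information as a bar chart
p <- ggplot(gss_cat, aes(x = race)) +
  geom_bar()
p
```

Note that `count()` only lists levels that occur in the data; `levels(gss_cat$race)` shows every level, including unused ones.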
More powerful than changing the order of the levels is changing their values. fct_recode() allows you to recode, or change, the value of each level.
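As an illustration, here is a minimal recode on a made-up factor (`party` and its abbreviations are hypothetical). In `fct_recode()`, the new value goes on the left and the old value on the right.

```r
library(forcats)

# Hypothetical factor with abbreviated level names
party <- factor(c("Ind", "Rep", "Dem", "Rep"))

# New name on the left, existing level on the right
party_recoded <- fct_recode(
  party,
  "Independent" = "Ind",
  "Republican"  = "Rep",
  "Democrat"    = "Dem"
)
levels(party_recoded)
```

Levels that are not mentioned are left unchanged.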
If you want to collapse a lot of levels, fct_collapse() is a useful variant of fct_recode(). For each new level, you can provide a vector of old levels.
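The sketch below collapses five made-up levels into three groups. The factor `party` and the group names are illustrative assumptions.

```r
library(forcats)

# Hypothetical factor with fine-grained levels
party <- factor(c("Strong rep", "Weak rep", "Ind", "Weak dem", "Strong dem", "Ind"))

# Each new level is given a vector of the old levels it should absorb
party_grouped <- fct_collapse(
  party,
  Republican  = c("Strong rep", "Weak rep"),
  Independent = "Ind",
  Democrat    = c("Weak dem", "Strong dem")
)
table(party_grouped)
```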
Sometimes you just want to lump together all the small groups to make a plot or table simpler, using fct_lump().
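A small sketch of lumping, using invented data. Here `n = 2` keeps the two most frequent levels and moves everything else into "Other"; recent forcats versions also offer more specific variants such as `fct_lump_n()`.

```r
library(forcats)

# Hypothetical data: two common categories and two rare ones
x <- c(rep("A", 5), rep("B", 3), "C", "D")

# Keep the 2 most frequent levels, lump the rest into "Other"
lumped <- fct_lump(x, n = 2)
table(lumped)
```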
As with strings and factors, there is a tidyverse package, lubridate, to help you work with dates more easily. It is part of the core tidyverse too.
When working with dates and times in R, you can consider either dates, times, or date-times. Date-times refer to dates plus times, specifying an exact moment in time.
For strings, you can call a function named using y, m, and d in the order in which the year (y), month (m), and day (d) appear in your data, for example ymd(), mdy(), or dmy().
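For example, the same calendar date written in three different orders can be parsed by picking the matching function. The date strings are made up for illustration.

```r
library(lubridate)

# Same date, three different string layouts
d1 <- ymd("2023-04-17")   # year, month, day
d2 <- mdy("04/17/2023")   # month, day, year
d3 <- dmy("17.04.2023")   # day, month, year

# All three parse to the same Date object
d1 == d2
d2 == d3
```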
To work with date-time objects, you further include hour (h), minute (m), and second (s) in the function name, for example ymd_hms().
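A minimal sketch of date-time parsing, with invented timestamps. When no time zone is given, lubridate parses the value as UTC.

```r
library(lubridate)

# Year-month-day followed by hour, minute, second
dt1 <- ymd_hms("2023-04-17 14:30:05")

# Month-day-year followed by hour and minute only
dt2 <- mdy_hm("04/17/2023 14:30")

dt1
dt2
```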
If you have a dataset where month, day, year, and/or time information are included in separate columns, the functions within lubridate can take this separate information and create a date or date-time object.
make_date()
make_datetime()
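The sketch below builds both kinds of object from separate columns. The `appointments` tibble and its column names are hypothetical.

```r
library(dplyr)
library(lubridate)

# Hypothetical data with date and time parts in separate columns
appointments <- tibble(
  year   = c(2023, 2023),
  month  = c(4, 12),
  day    = c(17, 1),
  hour   = c(9, 14),
  minute = c(30, 0)
)

appointments <- appointments %>%
  mutate(
    date      = make_date(year, month, day),                   # Date column
    date_time = make_datetime(year, month, day, hour, minute)  # POSIXct column (UTC)
  )
appointments
```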
To extract a component of a date object, lubridate provides accessor functions such as year(), month(), and day().
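A few of lubridate's accessor functions applied to one made-up date-time:

```r
library(lubridate)

dt <- ymd_hms("2023-04-17 14:30:05")

year(dt)    # calendar year
month(dt)   # month as a number
mday(dt)    # day of the month
yday(dt)    # day of the year
wday(dt)    # day of the week as a number (by default the week starts on Sunday)
hour(dt)
minute(dt)
second(dt)
```

`month()` and `wday()` also accept `label = TRUE` to return names instead of numbers.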
It is also important to be able to perform operations over dates.
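Subtracting one date from another gives the time difference between them. The as.duration() function converts this difference to a duration, which can also be expressed in years.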
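As a sketch of this idea, the example below subtracts two made-up dates and converts the result. The dates and object names are illustrative; `time_length()` is used here as one way to get a plain number of years.

```r
library(lubridate)

# Hypothetical start and end dates
start <- ymd("2024-02-01")
end   <- ymd("2024-03-01")

# Subtracting two dates gives a time difference in days
diff_days <- end - start
diff_days

# Convert the difference to a duration (stored as an exact number of seconds)
dur <- as.duration(diff_days)
dur

# Express the duration as a number of years
time_length(dur, unit = "years")
```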
lubridate functions
Addition, subtraction, multiplication, and division are possible with date objects, and they accurately take into account things like leap years and the different number of days in each month.
This capability and the additional functions that exist within lubridate can be enormously helpful when working with dates and date-time objects.
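A few made-up cases illustrate how date arithmetic handles leap years and month lengths. The `%m+%` operator is lubridate's way to add months without rolling past the end of a shorter month.

```r
library(lubridate)

# Adding one day at the end of February: leap year vs. non-leap year
leap     <- ymd("2024-02-28") + days(1)
non_leap <- ymd("2023-02-28") + days(1)
leap
non_leap

# Periods can be multiplied before being added to a date
two_weeks_later <- ymd("2024-01-01") + days(7) * 2
two_weeks_later

# Adding a month to 31 January lands on the last day of February
month_later <- ymd("2024-01-31") %m+% months(1)
month_later

# Number of days in 2024, a leap year
days_in_2024 <- as.numeric(ymd("2025-01-01") - ymd("2024-01-01"))
days_in_2024
```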
This data includes on-time data for all flights that departed NYC in 2013.
Relational data: