Categorical Variables

Fundamentals of Data Science for NHS using R

Today’s Plan

 

  • What is a categorical variable?
  • Creating a Factor
  • Using Factors to create new variables
  • Exploring dates and times

Let’s start at the beginning…

What is a categorical variable?

 

  • A variable that can take on one of a limited, and usually fixed, number of possible values

  • In R, categorical variables are stored as factors, which can be built from either numeric or string values

What are the advantages of categorical variables?

 

  • They can be used in statistical modelling
  • Useful for visualisations
  • More efficient use of memory

Types of categorical variables

 

  • There are 3 types of categorical variables
    • Binary - variables with only 2 options
    • Nominal - variables with no inherent order or ranking sequence
    • Ordinal - variables with an ordered series
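
As a rough illustration of the three types, the sketch below builds one factor of each kind in base R. The example values (smoker status, blood group, pain level) are made up for illustration and are not part of the course data.

```r
# Binary: exactly two categories
smoker <- factor(c("yes", "no", "no", "yes"), levels = c("no", "yes"))

# Nominal: categories with no inherent order
blood_group <- factor(c("A", "O", "B", "AB", "O"))

# Ordinal: categories with a meaningful order
pain <- factor(c("mild", "severe", "moderate", "mild"),
               levels = c("mild", "moderate", "severe"),
               ordered = TRUE)

nlevels(smoker)          # number of categories
is.ordered(blood_group)  # FALSE for a nominal factor
below_severe <- pain < "severe"  # comparisons only make sense for ordered factors
below_severe
```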

What type are these?

 

For each of the following, what type of categorical variable are these?

  • Sex
  • Race
  • Age Group
  • Educational level

Workshop Part 1: Introduction to factors

 

Inspired by R for Data Science by Hadley Wickham and Garrett Grolemund. https://r4ds.had.co.nz/factors.html

The forcats package

 

To work with factors, we’ll use the forcats package, which is part of the core tidyverse.

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.1     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.2     ✔ tibble    3.2.1
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.1     

Creating a factor

 

To create a factor variable we use the factor function

  • factor()
factor(x = character(), levels, labels = levels,
       exclude = NA, ordered = is.ordered(x), nmax = NA)

The factor() function

 

  • The only required argument is a vector of values, which can be either string or numeric.
  • Optional arguments include levels, which determines the categories of the factor variable. The default is the sorted list of all the distinct values of the data vector.

The factor() function

 

  • The labels argument is another optional argument which is a vector of values that will be the labels of the categories in the levels argument.
  • The exclude argument is also optional; it lists values to leave out when the levels are formed, so those values appear as NA in the factor.
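
A minimal sketch of how these arguments behave, using base R only. The responses vector is hypothetical, and the code 9 is assumed to mean "not answered".

```r
# Hypothetical survey responses; 9 is assumed to mean "not answered"
responses <- c(1, 2, 2, 3, 9, 1)

# Default levels: the sorted distinct values of the vector
default_levels <- levels(factor(responses))

# exclude removes a value from the levels, so it becomes NA
excluded_levels <- levels(factor(responses, exclude = 9))

# levels picks the categories and labels names them;
# any value not listed in levels (here 9) becomes NA
satisfaction <- factor(responses,
                       levels = c(1, 2, 3),
                       labels = c("Low", "Medium", "High"))

table(satisfaction, useNA = "ifany")
```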

Order of levels using unique

 

Sometimes you’d prefer that the order of the levels match the order of the first appearance in the data.

y <- factor(x, levels = unique(x))

Order of levels using fct_inorder

 

Or you can do it after the fact, with fct_inorder()

y <- x |> factor() |> fct_inorder()
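
To compare the approaches side by side, here is a small sketch. The vector x is made up for illustration; the forcats package is assumed to be installed.

```r
library(forcats)

# Hypothetical month abbreviations in the order they were recorded
x <- c("Dec", "Apr", "Jan", "Mar", "Apr")

# Default: levels are sorted alphabetically
default_order <- levels(factor(x))

# Levels follow the order of first appearance in the data
unique_order <- levels(factor(x, levels = unique(x)))

# Same result, applied after the factor has been created
inorder_order <- levels(fct_inorder(factor(x)))

unique_order
```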

Workshop Part 2: General Social Survey

 

We will use a sample of data from the General Social Survey. This is a long-running US survey conducted by the independent research organization NORC at the University of Chicago.

gss_cat

Viewing factors in a tibble

 

When factors are stored in a tibble, you can’t see their levels so easily. One way to see them is with count() or with a bar chart.

gss_cat |> count(race)
gss_cat |>
    ggplot(mapping = aes(x = race)) +
    geom_bar()
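
If you only need the level names rather than the counts, base R's levels() also works on a tibble column. A short sketch, assuming forcats is installed (it provides gss_cat) and using the race column as the example:

```r
library(forcats)

# Pull the column out of the tibble and list its levels
race_levels <- levels(gss_cat$race)
race_levels

# How many levels the factor has, including any with no observations
nlevels(gss_cat$race)
```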

Modifying factor levels

 

More powerful than changing the order of the levels is changing their values. fct_recode() allows you to recode, or change, the value of each level.

dataset |>
    mutate(var = fct_recode(var,
                            "First" = "a",
                            "Second" = "b",
                            # give two old levels the same new name to merge them
                            "Third" = "c",
                            "Third" = "d"
    ))
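
A runnable version of the same idea on a small vector. The ward codes are hypothetical; forcats is assumed to be installed.

```r
library(forcats)

# Hypothetical ward codes, made up for illustration
ward <- factor(c("a", "b", "c", "d", "a"))

# New name on the left, old level on the right
ward_named <- fct_recode(ward,
  "First"  = "a",
  "Second" = "b",
  "Third"  = "c",
  "Third"  = "d"   # same new name, so "c" and "d" are merged
)

levels(ward_named)
```

Levels that are not mentioned are left unchanged.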

Modifying factor levels

 

If you want to collapse a lot of levels, fct_collapse() is a useful variant of fct_recode(). For each new level, you can provide a vector of old levels.

dataset |>
    mutate(var = fct_collapse(var,
                              First = c("a", "b", "c"),
                              Second = c("d", "e", "f"),
                              Third = c("g", "h")
    ))
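
The same pattern on a small, made-up factor, so the effect on the counts is visible. forcats is assumed to be installed.

```r
library(forcats)

# Hypothetical factor with eight single-letter levels
codes <- factor(c("a", "b", "c", "d", "e", "f", "g", "h"))

# Each new level takes a character vector of old levels
grouped <- fct_collapse(codes,
  First  = c("a", "b", "c"),
  Second = c("d", "e", "f"),
  Third  = c("g", "h")
)

table(grouped)
```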

Modifying factor levels

 

Sometimes you just want to lump together all the small groups to make a plot or table simpler. You can do this using fct_lump().

gss_cat |> 
    mutate(relig = fct_lump(relig))
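
forcats also provides fct_lump_n(), which keeps a chosen number of the most frequent levels and lumps the rest into "Other". A small sketch on made-up data (the specialty names are hypothetical):

```r
library(forcats)

# Hypothetical referrals by specialty
specialty <- factor(c(rep("Cardiology", 5), rep("Oncology", 4),
                      "Dermatology", "Urology"))

# Keep the 2 most frequent levels; everything else becomes "Other"
lumped <- fct_lump_n(specialty, n = 2)

table(lumped)
```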

Workshop Part 3: Working with Dates and Times

 

  • Why is working with dates so important?
    • Often want to plot data over time
    • Group data by year, month or day
    • Calculate how long something has taken

Working with date and times in R

 

As with strings and factors, there is a tidyverse package to help you work with dates more easily: lubridate. It is part of the core tidyverse too.

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.1     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.2     ✔ tibble    3.2.1
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.1     

Working with dates and times

 

When working with dates and times in R, you can consider either dates, times, or date-times. Date-times refer to dates plus times, specifying an exact moment in time.

Creating date objects from strings

 

For strings, you can call a function using y, m, and d in the order in which the year (y), month (m), and day (d) appear in your data.

# year-month-day
ymd("2022-10-10")

# month-day-year
mdy("October 10th, 2022")

# day-month-year
dmy("10-Oct-2022")

Creating date-time from strings

 

To work with date-time objects, you add hour (h), minute (m), and second (s) to the function name.

# year-month-day hour:minute:second
ymd_hms("2022-10-10 09:30:59")

Creating objects from individual parts

 

If you have a dataset where month, day, year, and/or time information are included in separate columns, the functions within lubridate can take this separate information and create a date or date-time object.

  • We will be using the functions
    • make_date()
    • make_datetime()
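
A minimal sketch of both functions on a made-up data frame. The appts data and its column names are hypothetical; lubridate is assumed to be installed.

```r
library(lubridate)

# Hypothetical appointments with the date parts in separate columns
appts <- data.frame(
  year   = c(2022, 2022),
  month  = c(10, 11),
  day    = c(10, 3),
  hour   = c(9, 14),
  minute = c(30, 5)
)

# Date only: year, month, day
appts$date <- make_date(appts$year, appts$month, appts$day)

# Date-time: year, month, day, hour, minute (time zone defaults to UTC)
appts$date_time <- make_datetime(appts$year, appts$month, appts$day,
                                 appts$hour, appts$minute)

appts
```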

Getting components of dates

 

To extract a component of your date object you can do that with the following functions:

mydate <- ymd("2022-10-10")

## extract year information
year(mydate)

## extract day of the month
mday(mydate)

## extract weekday information
wday(mydate)

## label with actual day of the week
wday(mydate, label = TRUE)

Time Spans

 

It is also important to be able to perform operations over dates.

## subtract birthday from today's date
## replace "your_birthday" with a real date, e.g. "1990-01-31"
mydate <- ymd("your_birthday")

age <- today() - mydate
age

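The as.duration() function converts this difference to a duration, which prints in seconds along with an approximate value in years.

## a duration object shows this information in seconds and approximate years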
as.duration(age)
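
If you want age in completed years, one option is to build an interval and divide it by a one-year period. This is a sketch with hypothetical dates; a fixed reference date is used instead of today() so the result is reproducible.

```r
library(lubridate)

# Hypothetical dates, chosen for illustration
birthday  <- ymd("1990-01-31")
reference <- ymd("2022-10-10")

# Difference in days, then as a duration
age <- reference - birthday
age_duration <- as.duration(age)
age_duration

# Completed years between the two dates
whole_years <- interval(birthday, reference) %/% years(1)
whole_years
```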

lubridate functions

Addition, subtraction, multiplication, and division are possible with dates and time spans, and lubridate accurately takes into account things like leap years and the different number of days in each month.

This capability and the additional functions that exist within lubridate can be enormously helpful when working with dates and date-time objects.
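
A few small examples of this arithmetic, as a sketch. The dates are arbitrary and chosen only to show leap-year handling; lubridate is assumed to be installed.

```r
library(lubridate)

# Adding one day across the end of February
leap_day     <- ymd("2024-02-28") + days(1)   # 2024 is a leap year
non_leap_day <- ymd("2023-02-28") + days(1)   # 2023 is not

leap_day
non_leap_day
leap_year(2024)

# Periods follow the calendar: one year later is the same date next year
next_year <- ymd("2024-01-01") + years(1)
next_year

# Durations are a fixed number of seconds, so the result can differ
ymd("2024-01-01") + dyears(1)

# Time spans can be multiplied
two_weeks_later <- ymd("2022-10-10") + days(7) * 2
two_weeks_later
```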

nycflights13

 

This data includes on-time data for all flights that departed NYC in 2013.

#install.packages('nycflights13')
library(nycflights13)

Thank you!

In the next episode

 

Relational data:

  • Mutating joins
  • Filtering joins
  • Set operations