Categorical Variables

What is a categorical variable?

A variable that can take on one of a limited, and usually fixed, number of possible values
In R, categorical variables are stored as factors, which can be built from either numeric or string values

What are the advantages of categorical variables?

They can be used in statistical modelling
Useful for visualisations
More efficient use of memory

Types of categorical variables

There are 3 types of categorical variables
- Binary - variables with only 2 options
- Nominal - variables with no inherent order or ranking sequence
- Ordinal - variables with an ordered series
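
As a rough illustration of the three types, the sketch below builds one factor of each kind in base R. The variable names and values (`smoker`, `blood_type`, `satisfaction`) are made up for this example and are not part of the workshop data.

```r
# Binary: only two possible values
smoker <- factor(c("yes", "no", "no", "yes"))

# Nominal: several values with no inherent order
blood_type <- factor(c("A", "O", "AB", "B", "O"))

# Ordinal: values with a meaningful order, declared with ordered = TRUE
satisfaction <- factor(
  c("low", "high", "medium", "low"),
  levels = c("low", "medium", "high"),
  ordered = TRUE
)

nlevels(smoker)          # number of categories
levels(blood_type)       # categories, sorted by default
satisfaction < "high"    # comparisons are only defined for ordered factors
```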

What type are these?

For each of the following, what type of categorical variable are these?

Sex
Race
Age Group
Educational level

Workshop Part 1: Introduction to factors

Inspired by R for Data Science by Hadley Wickham and Garrett Grolemund. https://r4ds.had.co.nz/factors.html

The `forcats` package

To work with factors, we’ll use the forcats package, which is part of the core tidyverse.

library(tidyverse)


── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.1     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.2     ✔ tibble    3.2.1
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.1

Creating a factor

To create a factor variable we use the factor function

factor()

factor(x = character(), levels, labels = levels,
       exclude = NA, ordered = is.ordered(x), nmax = NA)

The `factor()` function

The only required argument is a vector of values, which can be either string or numeric.
Optional arguments include the levels argument, which determines the categories of the factor variable. Its default is the sorted list of all the distinct values in the data vector.

The `factor()` function

The labels argument is another optional argument: a vector of values that will be used as the labels of the categories given in the levels argument.
The exclude argument is also optional; it defines which values are left out of the levels and therefore appear as NA in the factor variable.
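
A minimal sketch of how these arguments interact, using a made-up vector of numeric survey codes (the codes and labels are assumptions for illustration only):

```r
# Hypothetical survey responses coded as numbers; 9 stands for "no answer"
x <- c(1, 2, 2, 3, 9, 1)

# levels fixes the categories, labels gives each one a readable name
f <- factor(
  x,
  levels = c(1, 2, 3, 9),
  labels = c("Low", "Medium", "High", "No answer")
)
levels(f)

# exclude = 9 removes that value from the levels, so it becomes NA
g <- factor(x, exclude = 9)
levels(g)
is.na(g)
```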

Order of levels using `unique`

Sometimes you’d prefer that the order of the levels match the order of the first appearance in the data.

y <- factor(x, levels = unique(x))

Order of levels using `fct_inorder`

Or you can do it after the fact, with fct_inorder()

y <- x |> factor() |> fct_inorder()
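
To see the difference, here is a small sketch with a made-up character vector `x`; both approaches should give the same level order.

```r
library(forcats)

# Hypothetical data: month abbreviations in the order they were recorded
x <- c("Dec", "Apr", "Jan", "Mar", "Apr")

# Default behaviour: levels are sorted alphabetically
levels(factor(x))

# Levels in order of first appearance, set when the factor is created
y1 <- factor(x, levels = unique(x))

# The same ordering, applied after the factor has been created
y2 <- x |> factor() |> fct_inorder()

levels(y1)
identical(levels(y1), levels(y2))
```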

Workshop Part 2: General Social Survey

We will use a sample of data from the General Social Survey. This is a long-running US survey conducted by the independent research organization NORC at the University of Chicago.

gss_cat

Viewing `factors` in a `tibble`

When factors are stored in a tibble, you can’t see their levels so easily. One way to see them is with count() or with a bar chart.

# count the observations in each level of a factor column, here race
gss_cat |> count(race)

dataset |>
    ggplot(mapping = aes(x = var)) +
    geom_bar()
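
As a concrete version of the count approach, the sketch below uses the `race` column of `gss_cat`. The choice of column is only an example; any factor column should work the same way.

```r
library(dplyr)
library(forcats)  # gss_cat ships with forcats

# One row per level that occurs in the data, with its frequency
race_counts <- gss_cat |> count(race)
race_counts

# levels() lists every level, including any that have no observations
levels(gss_cat$race)
```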

Modifying factor levels

More powerful than changing the order of the levels is changing their values. fct_recode() allows you to recode, or change, the value of each level.

dataset |>
    mutate(var = fct_recode(var,
                            "First" = "a",
                            "Second" = "b",
                            # assign two old levels to the same new level
                            "Third" = "c",
                            "Third" = "d"
    ))
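
A self-contained sketch of recoding on a toy factor (the values "a" to "d" are placeholders, as in the template above):

```r
library(forcats)

f <- factor(c("a", "b", "c", "d", "a"))

# New names go on the left, old levels on the right;
# repeating a new name merges the old levels into it
recoded <- fct_recode(f,
  "First"  = "a",
  "Second" = "b",
  "Third"  = "c",
  "Third"  = "d"
)

levels(recoded)
table(recoded)
```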

Modifying factor levels

If you want to collapse a lot of levels, fct_collapse() is a useful variant of fct_recode(). For each new level, you can provide a vector of old levels.

dataset |>
    mutate(var = fct_collapse(var,
                              First = c("a", "b", "c"),
                              Second = c("d", "e", "f"),
                              Third = c("g", "h")
    ))
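
The same idea on a toy factor, as a sketch with placeholder levels "a" to "h":

```r
library(forcats)

f <- factor(c("a", "b", "c", "d", "e", "f", "g", "h"))

# Each new level receives a vector of the old levels it should absorb
collapsed <- fct_collapse(f,
  First  = c("a", "b", "c"),
  Second = c("d", "e", "f"),
  Third  = c("g", "h")
)

levels(collapsed)
table(collapsed)
```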

Modifying factor levels

Sometimes you just want to lump together all the small groups to make a plot or table simpler. fct_lump() does this.

gss_cat |> 
    mutate(var = fct_lump(var))
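
A small sketch of lumping on a made-up factor. The `n = 1` argument, which keeps only the single most frequent level, is an assumption chosen for this example; without it, fct_lump() decides for itself how many small levels to merge.

```r
library(forcats)

# Hypothetical data: "a" is common, the other levels are rare
f <- factor(c("a", "a", "a", "a", "b", "b", "c", "d"))

# Keep the most frequent level and lump the rest into "Other"
lumped <- fct_lump(f, n = 1)

levels(lumped)
table(lumped)
```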

Workshop Part 3: Working with Dates and Times

Why is working with dates so important?
- Often want to plot data over time
- Group data by year, month or day
- Calculate how long something has taken

Working with dates and times in R

As with strings and factors, there is a tidyverse package to help you work with dates more easily: lubridate. It is part of the core tidyverse too.

library(tidyverse)


── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.1     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.2     ✔ tibble    3.2.1
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.1

Working with dates and times

When working with dates and times in R, you can consider either dates, times, or date-times. Date-times refer to dates plus times, specifying an exact moment in time.

Creating date objects from strings

For strings, you can call a function using y, m, and d in the order in which the year (y), month (m), and day (d) appear in your data.

# year-month-day
ymd("2022-10-10")

# month-day-year
mdy("October 10th, 2022")

# day-month-year
dmy("10-Oct-2022")

Creating date-time from strings

To work with date-time objects, you have to further include hour (h), minute (m), and second (s) in the function name.

# year-month-day hour:minute:second
ymd_hms("2022-10-10 09:30:59")

Creating objects from individual parts

If you have a dataset where month, day, year, and/or time information are included in separate columns, the functions within lubridate can take this separate information and create a date or date-time object.

We will be using the functions
- make_date()
- make_datetime()
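
A minimal sketch of both functions, make_date() and make_datetime(), on a small made-up table. The `parts` tibble and its column names are assumptions for illustration.

```r
library(dplyr)
library(lubridate)

# Hypothetical data with each date component in its own column
parts <- tibble(
  year   = c(2013, 2013),
  month  = c(1, 6),
  day    = c(1, 15),
  hour   = c(5, 14),
  minute = c(17, 30)
)

parts <- parts |>
  mutate(
    # date only
    date = make_date(year, month, day),
    # date plus time of day (UTC unless a tz argument is given)
    date_time = make_datetime(year, month, day, hour, minute)
  )

parts
```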

Getting components of dates

To extract a component of your date object you can do that with the following functions:

mydate <- ymd("2022-10-10")

## extract year information
year(mydate)

## extract day of the month
mday(mydate)

## extract weekday information
wday(mydate)

## label with actual day of the week
wday(mydate, label = TRUE)

Time Spans

It is also important to be able to perform operations over dates.

## subtract birthday from today's date
## replace "your_birthday" with a date string in year-month-day order
mydate <- ymd("your_birthday")

age <- today() - mydate
age

The as.duration() function can express this time span in years.

## a duration object reports the span in seconds and, approximately, in years
as.duration(age)
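
Here is the same calculation with two fixed dates, so that the result does not depend on the day it is run. The dates are arbitrary examples, and dividing by dyears(1) is one possible way to get a plain number of years.

```r
library(lubridate)

start <- ymd("2000-10-10")
end   <- ymd("2022-10-10")

# Subtracting two dates gives a difftime measured in days
span <- end - start
span

# As a duration: an exact number of seconds, with an approximate size in years
as.duration(span)

# Dividing by a one-year duration gives the span as a number of years
span_years <- as.duration(span) / dyears(1)
span_years
```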

`lubridate` functions

Addition, subtraction, multiplication, and division all work with date objects, and lubridate accurately accounts for things like leap years and the different number of days in each month.

This capability and the additional functions that exist within lubridate can be enormously helpful when working with dates and date-time objects.
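
A short sketch of date arithmetic around the end of February, comparing a leap year (2020) with a non-leap year (2021). The dates are arbitrary examples.

```r
library(lubridate)

# Adding one day to 28 February
leap_next   <- ymd("2020-02-28") + days(1)   # 2020 has a 29 February
common_next <- ymd("2021-02-28") + days(1)   # 2021 does not

leap_next
common_next

# Length of February in each year, from subtracting two dates
feb_2020 <- ymd("2020-03-01") - ymd("2020-02-01")
feb_2021 <- ymd("2021-03-01") - ymd("2021-02-01")

feb_2020
feb_2021
```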

nycflights13

This data includes on-time data for all flights that departed NYC in 2013.

#install.packages('nycflights13')
library(nycflights13)
