Data Visualisation I

Fundamentals of Data Science for NHS using R

Data Visualisation in base R

hist(data, col = "green") # data is a vector

Data Visualisation in ggplot2

data |> # data is a tibble
    ggplot(mapping = aes(x = x)) + geom_histogram(fill = "green", col = "black") +
    labs(title = "Histogram of data") + theme_minimal(base_size = 16)

Why should we prefer ggplot2?

Because

Unlike most other graphics packages, ggplot2 has an underlying grammar, based on the Grammar of Graphics, that allows you to compose graphs by combining independent components. This makes ggplot2 powerful. Rather than being limited to sets of pre-defined graphics, you can create novel graphics that are tailored to your specific problem.

Hadley Wickham,

ggplot2: Elegant Graphics for Data Analysis

The Leo Chart

Grammar?

The Grammar of Graphics book

  • Theoretical foundation of graphical applications and packages including ggplot2.

  • Should you read it? Maybe in the future!

  • To have an intuition (and other things about ggplot2) watch

Learning Resources

R for Data Science

  • R for Data Science work-in-progress 2nd edition by Hadley Wickham, Mine Çetinkaya-Rundel, Garrett Grolemund.

  • This is still the best place to start, including for data visualisation! (see chapters 2 and 10)

R Graphics Cookbook

ggplot2 book

  • ggplot2 (3e) by Hadley Wickham & Danielle Navarro & Thomas Lin Pedersen.

  • This is what you should read to fully understand how ggplot2 works.

Ok… But I want to draw plots!

Exercise 1

Install and load tidyverse and palmerpenguins in your environment!
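One possible way to do this is sketched below. It assumes both packages can be installed from CRAN; the helper vectors pkgs and to_install are names chosen only for this illustration.

```r
# Packages needed for the session (names taken from the exercise text)
pkgs <- c("tidyverse", "palmerpenguins")

# Install only the packages that are not already available
to_install <- pkgs[!vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)]
if (length(to_install) > 0) install.packages(to_install)

# Attach the packages for the current session
library(tidyverse)
library(palmerpenguins)

# Quick look at the penguins data set used in the exercises
glimpse(penguins)
```

Installing is needed only once per machine, while the library() calls are needed in every new R session.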

A plotting template

ggplot(dataset, mapping = aes(x = X, y = Y)) +
  geom_TYPE()

# or

dataset |>
    ggplot(mapping = aes(x = X, y = Y)) +
    geom_TYPE()
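As a minimal sketch of how the template is filled in, the example below uses the mpg data set that ships with ggplot2 (not the penguins data used in the exercises). The choice of displ and hwy as variables and of geom_point as the geom is only for illustration.

```r
library(ggplot2)

# mpg ships with ggplot2:
# displ = engine displacement in litres, hwy = highway miles per gallon
p <- ggplot(mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point()  # geom_TYPE() replaced by a concrete geom

# Printing the object draws the plot
p
```

The dataset, the aesthetic mapping, and the geom are the three slots of the template; swapping any one of them gives a different plot.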

Scatter Plots

Exercise 2

Draw a scatter plot using the variable bill_length_mm for the x-axis, and the variable bill_depth_mm on the y-axis.

Exercise 3

Draw a scatter plot using the variable bill_length_mm for the x-axis, and the variable bill_depth_mm on the y-axis. Use different shapes for different islands.

Exercise 4

Draw a scatter plot using the variable bill_length_mm for the x-axis, and the variable bill_depth_mm on the y-axis. Use different shapes for different islands. Increase the size of the points to 7. We will keep this size for the rest of the session unless otherwise stated.

Exercise 5

Draw a scatter plot using the variable bill_length_mm for the x-axis, and the variable bill_depth_mm on the y-axis. Use different colours for different islands.

Exercise 6

Draw a scatter plot using the variable bill_length_mm for the x-axis, and the variable bill_depth_mm on the y-axis. Use different colours for different values of the continuous variable flipper_length_mm.

Exercise 7

Draw a scatter plot using the variable flipper_length_mm for the x-axis, and the variable body_mass_g on the y-axis. Use different colours for different values of the new continuous variable obtained as the ratio of bill_length_mm to bill_depth_mm.

Exercise 8

Repeat the previous exercise using the dplyr verb mutate calling the new variable ratio.
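One possible approach is sketched below, assuming palmerpenguins is installed. The object name penguins_ratio is chosen only for this illustration, and the point size of 7 follows the convention set in Exercise 4.

```r
library(dplyr)
library(ggplot2)
library(palmerpenguins)

# Add the new column first, so it can be reused in later plots
penguins_ratio <- penguins |>
  mutate(ratio = bill_length_mm / bill_depth_mm)

# Map the new continuous variable to the colour aesthetic
p <- penguins_ratio |>
  ggplot(mapping = aes(x = flipper_length_mm, y = body_mass_g, colour = ratio)) +
  geom_point(size = 7)

p
```

Compared with computing the ratio inside aes(), creating the column with mutate gives the legend a readable title and keeps the plotting code short.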

Exercise 9

Draw a scatter plot using the variable bill_length_mm for the x-axis, and the variable bill_depth_mm on the y-axis. Use different colours for different species.

Exercise 10

Draw a scatter plot using the variable bill_length_mm for the x-axis, and the variable bill_depth_mm on the y-axis. Highlight the penguins that belong to the “Chinstrap” species.

Exercise 11

Draw a scatter plot using the variable bill_length_mm for the x-axis, and the variable bill_depth_mm on the y-axis. Use different colours for different species, and different shapes for different islands.

Exercise 12

Draw a scatter plot using the variable bill_length_mm for the x-axis, and the variable bill_depth_mm on the y-axis. Use different colours for different species. Change opacity to 0.4.

Exercise 13

Draw a scatter plot using the variable bill_length_mm for the x-axis, and the variable bill_depth_mm on the y-axis. Use a different colour for each sex. Change opacity to 0.4.

Exercise 14

Draw a scatter plot using the variable bill_length_mm for the x-axis, and the variable bill_depth_mm on the y-axis. Use different colours for different penguin species. Change opacity to 0.4. Use a different facet (panel) for each species.

Exercise 15

Draw a scatter plot using the variable bill_length_mm for the x-axis, and the variable bill_depth_mm on the y-axis. Use a different colour for each sex. Change opacity to 0.4. Use a different facet (panel) for each species.

Exercise 16

Find the mean of bill_length_mm and bill_depth_mm for every group identified by species and sex.

# A tibble: 6 × 4
# Groups:   species [3]
  species   sex    bill_length_mm bill_depth_mm
  <fct>     <fct>           <dbl>         <dbl>
1 Adelie    female           37.3          17.6
2 Adelie    male             40.4          19.1
3 Chinstrap female           46.6          17.6
4 Chinstrap male             51.1          19.3
5 Gentoo    female           45.6          14.2
6 Gentoo    male             49.5          15.7
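A table like the one above could be obtained along the following lines. This is a sketch that assumes rows with a missing sex are dropped first, since the table shows no NA group; the object name group_means is chosen only for this illustration.

```r
library(dplyr)
library(palmerpenguins)

group_means <- penguins |>
  # Assumption: penguins with unknown sex are excluded
  filter(!is.na(sex)) |>
  group_by(species, sex) |>
  # Mean of both bill measurements within each species/sex group
  summarise(across(c(bill_length_mm, bill_depth_mm),
                   \(v) mean(v, na.rm = TRUE)))

group_means
```

After summarise, the result is still grouped by species, which matches the "Groups: species [3]" line in the printed output.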

Exercise 17

Draw a scatter plot using the variable bill_length_mm for the x-axis, and the variable bill_depth_mm on the y-axis. Use a different colour for each sex. Change opacity to 0.4. Use a different facet (panel) for each combination of species and sex.

Histograms

Exercise 18

Draw a histogram with the distribution of the continuous variable flipper_length_mm.

Exercise 19

Draw a histogram with the distribution of the continuous variable flipper_length_mm. Use different colours for different species.

Exercise 20

Draw a histogram with the distribution of the continuous variable flipper_length_mm. Use different colours for different species. Change opacity to 0.4.

Exercise 21

Draw a histogram with the distribution of the continuous variable flipper_length_mm. Use different colours for different species. Change opacity to 0.4. Use the option position = "stack". What changed compared to the previous exercise?

Exercise 22

Draw a histogram with the distribution of the continuous variable flipper_length_mm. Use different colours for different species. Change opacity to 0.4. Use the option position = "identity". What changed compared to the previous exercise?

Exercise 23

Draw a histogram with the distribution of the continuous variable flipper_length_mm. Use different colours for different species. Change opacity to 0.4. Use the option position = "identity". This time, also use different colours for the borders of the histogram.

Exercise 24

Draw a histogram with the distribution of the continuous variable flipper_length_mm. Use different colours for different species. Change opacity to 0.4. Use the option position = "dodge". What changed compared to the previous exercise?
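The three position options used in these exercises can be compared side by side with a sketch like the one below. It assumes palmerpenguins is installed; the object names and the choice of bins = 30 are only for illustration.

```r
library(ggplot2)
library(palmerpenguins)

# Shared data and aesthetic mapping for all three variants
base_plot <- ggplot(penguins,
                    mapping = aes(x = flipper_length_mm, fill = species))

# "stack": bars of different species are piled on top of each other
p_stack <- base_plot +
  geom_histogram(bins = 30, alpha = 0.4, position = "stack")

# "identity": every species starts from zero, so bars overlap
p_identity <- base_plot +
  geom_histogram(bins = 30, alpha = 0.4, position = "identity")

# "dodge": bars of different species are placed next to each other
p_dodge <- base_plot +
  geom_histogram(bins = 30, alpha = 0.4, position = "dodge")

p_stack
p_identity
p_dodge
```

With position = "identity" the transparency matters most, because it is the only way to see a bar hidden behind another one.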

Density

Exercise 25

Draw a density plot with the distribution of the continuous variable flipper_length_mm.

Exercise 26

Draw a density plot with the distribution of the continuous variable flipper_length_mm. Use different colours for different species.

Exercise 27

Draw a density plot with the distribution of the continuous variable flipper_length_mm. Use different colours for different species. Change opacity to 0.6.

Exercise 28

Draw a density plot with the distribution of the continuous variable flipper_length_mm. Use different colours for different species. Change opacity to 0.6. Also change the colour and the size of the border of the density plot.

Axes

Exercise 29

Reproduce the following plot. Observe the values on the x-axis.

Exercise 30

Repeat the previous exercise using the function seq.

Exercise 31

Repeat the previous exercise also changing the values on the y-axis.
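Axis values can be set with the breaks argument of the continuous scale functions, as in the sketch below. The target plot is not reproduced here, so the plot type and the break values (170 to 230 in steps of 10 on the x-axis, 0 to 40 in steps of 5 on the y-axis) are assumptions chosen only for illustration.

```r
library(ggplot2)
library(palmerpenguins)

p <- ggplot(penguins, mapping = aes(x = flipper_length_mm)) +
  geom_histogram(bins = 30) +
  # seq() generates evenly spaced break values (hypothetical ranges)
  scale_x_continuous(breaks = seq(from = 170, to = 230, by = 10)) +
  scale_y_continuous(breaks = seq(from = 0, to = 40, by = 5))

p
```

Using seq avoids typing every value by hand and makes it easy to change the step later.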

Labels

Exercise 32

Reproduce the following plot.

Fixed Ratio

Exercise 33

Change the ratio between the x-axis and the y-axis. Keep the ratio fixed even if you resize the picture.
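One way to fix the aspect ratio is coord_fixed, sketched below on a scatter plot of the bill measurements. The choice of plot and of ratio = 1 is an assumption made only for illustration.

```r
library(ggplot2)
library(palmerpenguins)

p <- ggplot(penguins, mapping = aes(x = bill_length_mm, y = bill_depth_mm)) +
  geom_point(size = 7) +
  # ratio = 1: one unit on the y-axis is drawn as long as one unit on the x-axis
  coord_fixed(ratio = 1)

p
```

Because the ratio belongs to the coordinate system, it is preserved when the plotting window is resized.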

Is data visualisation useful?

A New Dataset!

data <- anscombe |>
    pivot_longer(everything(),
                 cols_vary = "slowest",
                 names_to = c(".value", "group"),
                 names_pattern = "(.)(.)")

# A tibble: 44 × 3
   group     x     y
   <chr> <dbl> <dbl>
 1 1        10  8.04
 2 1         8  6.95
 3 1        13  7.58
 4 1         9  8.81
 5 1        11  8.33
 6 1        14  9.96
 7 1         6  7.24
 8 1         4  4.26
 9 1        12 10.8 
10 1         7  4.82
11 1         5  5.68
12 2        10  9.14
13 2         8  8.14
14 2        13  8.74
15 2         9  8.77
16 2        11  9.26
17 2        14  8.1 
18 2         6  6.13
19 2         4  3.1 
20 2        12  9.13
21 2         7  7.26
22 2         5  4.74
23 3        10  7.46
24 3         8  6.77
25 3        13 12.7 
26 3         9  7.11
27 3        11  7.81
28 3        14  8.84
29 3         6  6.08
30 3         4  5.39
31 3        12  8.15
32 3         7  6.42
33 3         5  5.73
34 4         8  6.58
35 4         8  5.76
36 4         8  7.71
37 4         8  8.84
38 4         8  8.47
39 4         8  7.04
40 4         8  5.25
41 4        19 12.5 
42 4         8  5.56
43 4         8  7.91
44 4         8  6.89

Exercise 34

Obtain the following numerical summaries

# A tibble: 4 × 6
  group mean_x var_x mean_y var_y cor_xy
  <chr>  <dbl> <dbl>  <dbl> <dbl>  <dbl>
1 1          9    11   7.50  4.13  0.816
2 2          9    11   7.50  4.13  0.816
3 3          9    11   7.5   4.12  0.816
4 4          9    11   7.50  4.12  0.817
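The summaries above could be computed along these lines. This is a sketch that first rebuilds the long-format data exactly as in the "A New Dataset!" slide (the cols_vary argument needs a recent version of tidyr); the object name summaries is chosen only for this illustration.

```r
library(dplyr)
library(tidyr)

# Reshape anscombe so that each row is one (group, x, y) observation
data <- anscombe |>
  pivot_longer(everything(),
               cols_vary = "slowest",
               names_to = c(".value", "group"),
               names_pattern = "(.)(.)")

# Mean, variance and correlation within each of the four groups
summaries <- data |>
  group_by(group) |>
  summarise(mean_x = mean(x),
            var_x  = var(x),
            mean_y = mean(y),
            var_y  = var(y),
            cor_xy = cor(x, y))

summaries
```

The four groups are almost indistinguishable from these numbers alone, which is what motivates plotting them in the next exercise.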

Exercise 35

Plot the data using an appropriate technique we have seen today.

Thank you!

In the next episode

More Data Visualisation!