Fundamentals of Machine Learning for NHS using R
Field of study that gives computers the ability to learn without being explicitly programmed
Arthur Samuel is credited with this definition, but there is no known primary source for it!
Was a pioneer of artificial-intelligence research
In 1959, while working at IBM, he published a paper that popularized the term machine learning
Programming computers to learn from experience should eventually eliminate the need for much of this detailed programming effort.
A computer can be programmed so that it will learn to play a better game of checkers than can be played by the person who wrote the program.
February 1996: Deep Blue became the first computer program to defeat a reigning world champion in a classical game under tournament regulations (Kasparov still won the match 4–2)
May 1997: Deep Blue became the first computer program to defeat a reigning world champion in a match under tournament regulations (Deep Blue–Kasparov 3½–2½)
A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E
The term machine learning refers to the automated detection of meaningful patterns in data.
Email spam and malware filters (Gmail, Outlook, …)
Commute estimation (Google Maps, Waze, …)
Fraud detection (banking)
Online customer support (chatbots)
Image recognition (Facebook, Pinterest, …)
Search engine results refining (Google)
Product recommendation (Amazon)
Healthcare
Virtual personal smart assistants (Alexa, Cortana, Google Assistant, Siri)
…
Learning Problems
Statistical Inference
Hybrid Learning Problems
Learning Techniques
In supervised learning we are given a dataset and already know what our correct output should look like, assuming that there is a relationship between the input and the output.
In unsupervised learning we have no idea what our results should look like. Unsupervised methods try to capture patterns without knowing what these patterns look like.
Supervised learning aims to build predictive models
Unsupervised learning aims to build descriptive models
We will talk ONLY about supervised learning techniques
In supervised learning we are assuming that there is a relationship between input (usually denoted as \(X\)) and output (usually denoted as \(Y\)), and we want to express it in terms of a mathematical function.
Given two sets, a function is a relationship that associates with each element of the first set one and only one element of the second set.
If \(f\) is a function from the set \(X\) to the set \(Y\) we write
\[f: X \to Y\]
The element \(y\) of \(Y\) associated by \(f\) to the element \(x\) of \(X\) is denoted by \(f(x)\). Therefore we write
\[y = f(x)\]
We said that \(X\) and \(Y\) are sets; more precisely, they are the input set and the output set respectively.
But we will also use \(X\) and \(Y\) to name variables.
Various terms are used interchangeably:
\(X\): “independent variable”, “input variable”, “predictor”, “feature”, “attribute”.
\(Y\): “dependent variable”, “output variable”, “target”, “response”, “outcome”.
Start with \(N\) input-output pairs of observed data
\[ \begin{multline} (x^{(1)},y^{(1)}), (x^{(2)},y^{(2)}), \dots, (x^{(N)},y^{(N)}) = \\ = \left\{(x^{(i)},y^{(i)})\right\}_{i = 1}^N \end{multline} \]
Problem: find the function \(f: X \to Y\) that has generated the observed outputs given the corresponding inputs.
We observe \(N\) input-output pairs \(\left\{(x^{(i)},y^{(i)})\right\}_{i = 1}^N\).
We assume that there is some relationship between the predictor \(X\) and the response \(Y\), which can be written in the form
\[Y = f(X) + \varepsilon\]
where \(\varepsilon\) is a random error term that is independent of \(X\) and “on average” is equal to zero.
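As a concrete illustration, here is a minimal R simulation of this setup; the true function \(f\), the sample size and the variable names below are made up for illustration only.
set.seed(1)                             # for reproducibility
N <- 100                                # number of observations (made up)
x <- runif(N, min = 0, max = 10)        # inputs
f <- function(x) 2 + 0.5 * x            # the "true" function, unknown in practice
epsilon <- rnorm(N, mean = 0, sd = 1)   # random error: independent of x, zero mean
y <- f(x) + epsilon                     # observed outputs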
In general the number of predictors is more than one.
So instead of having just \(X\), we have \(X_1, X_2, \dots, X_m\).
Hence we observe tuples rather than pairs
\[\left\{(x^{(i)}_1, x^{(i)}_2, \dots, x^{(i)}_m,y^{(i)})\right\}_{i = 1}^N\]
Therefore the relationship between input and output is written as
\[Y = f(X_1, X_2, \dots, X_m) + \varepsilon\]
We can write
\[X = (X_1, X_2, \dots, X_m) \]
Therefore the relationship between input and output is written as
\[Y = f(X) + \varepsilon\]
and we silently remember that the input can have \(m\) dimensions.
The problem is simple to state: find \(f\).
In essence, statistical learning refers to a set of approaches for estimating \(f\).
The issue is that we will never find the exact \(f\). Therefore we lower our expectations and set our goal to finding an estimate of \(f\), which we will write as \(\hat{f}\).
\[\hat{Y} = \hat{f}(X_{\text{new}})\]
Categorical variables can take a limited number of possible values, assigning each observation to a group
Numerical variables can take values within a (possibly unbounded) numerical interval (a quick R illustration follows)
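A quick R illustration of the distinction, with made-up variable names and values:
# Categorical: a limited set of possible values, each observation assigned to a group
smoking_status <- factor(c("never", "current", "former", "never"))
levels(smoking_status)                  # the possible groups

# Numerical: values within a (possibly unbounded) numerical interval
systolic_bp <- c(118, 132.5, 145, 121)  # e.g. blood pressure in mmHg
range(systolic_bp)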
What kind of output are we trying to predict?
In classification the objective is to predict a categorical outcome
In regression the objective is to predict a numerical outcome
However, in classification problems, rather than predicting a particular class, we often want to predict the probability of a particular class.
Assume that \(f\) has a particular functional form depending on a set of parameters.
Estimate parameters using observed data.
Advantages: easy to estimate.
Disadvantages: assumptions may be very wrong!
Don’t make particular assumptions on the functional form of \(f\).
Estimate a function that adapts smoothly to the observed data “avoiding excessive oscillations and edgy behaviours”.
Advantages: can estimate effectively many functions.
Disadvantages: need a lot of data!
In parametric methods we make an assumption about the functional form of the function \(f\). In other words, we assume that \(f\) can be written in a particular form. We call it the hypothesis function, denote it as \(h\), and it will depend on one or more parameters.
A very simple model for regression is given by
\[h(x) = \beta_0 \qquad \beta_0 \in \mathbb{R}\]
What does \(h(x) = \beta_0\) represent?
How can we find the best \(\beta_0\)?
How do we find the “best” one?
What does “best” actually mean in this context?
One ingredient is still missing…
We need a way to measure how good our fit is, given the choice of parameters. This is done using a so-called loss function that we will denote as \(\mathcal{L}(\hat{y}, y)\), where \(\hat{y} = h(x)\).
Basically we:
Select a hypothesis function \(h(x)\)
Make a choice for its parameters
Predict \(\hat{y} = h(x)\) for all observed \(x\)’s
Compare \(y\) and \(\hat{y}\) using the loss function
Repeat 2. to 4. for many parameter choices and then select the one with minimum loss
This procedure is known as Empirical Risk Minimization.
In regression problems a common choice for the loss function is the square loss.
\[ \mathcal{L}(\hat{y}, y) = \frac{1}{N} \sum_{i = 1}^N \left( \hat{y}^{(i)} - y^{(i)} \right)^2 \]
where for \(i = 1, \dots, N\)
\[ \hat{y}^{(i)} = h(x^{(i)}) \]
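Translated directly from the formula, a minimal R version of the square loss could look like this:
# Mean of the squared differences between predictions and observed outcomes
square_loss <- function(y_hat, y) {
  mean((y_hat - y)^2)
}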
Since \(\hat{y}^{(i)} = h(x^{(i)}) = \beta_0\) for \(i = 1, \dots, N\) then our problem is
\[ \min_{\beta_0} \left[ \frac{1}{N} \sum_{i = 1}^N \left( \beta_0 - y^{(i)} \right)^2 \right] \]
With some maths, pen and paper it is possible to show that the best \(\beta_0\), the one that minimizes the loss, is the average of the observed \(y\)’s.
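A quick numerical check of this fact in R, on made-up data: optimize() searches for the \(\beta_0\) with the smallest loss, and the result agrees with the sample mean (up to numerical tolerance).
set.seed(1)
y <- rnorm(50, mean = 10, sd = 2)               # made-up observed outcomes

loss_constant <- function(b0) mean((b0 - y)^2)  # loss for h(x) = b0

optimize(loss_constant, interval = c(-100, 100))$minimum  # numerical minimizer
mean(y)                                                   # the same value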
A very commonly used model for regression (with one predictor) is given by
\[h(x) = \beta_0 + \beta_1 x \qquad \beta_0, \beta_1 \in \mathbb{R}\]
What does \(h(x) = \beta_0 + \beta_1 x\) represent?
How can we find the best \(\beta_0\) and \(\beta_1\)?
How do we find the “best” one?
Since \(\hat{y}^{(i)} = h(x^{(i)}) = \beta_0 + \beta_1 x^{(i)}\) for \(i = 1, \dots, N\), then our problem, also known as ordinary least squares (OLS), is
\[ \min_{\beta_0, \beta_1} \left[ \frac{1}{N} \sum_{i = 1}^N \left( \beta_0 + \beta_1 x^{(i)} - y^{(i)} \right)^2 \right] \]
Again, with some maths, pen and paper, it is possible to find a closed-form solution, but it is not that easy to write (or to find!).
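In practice we let the computer do the work. Below is a minimal R sketch on made-up data: optim() minimizes the square loss numerically, and lm() gives the closed-form OLS estimates for comparison.
set.seed(1)
x <- runif(100, 0, 10)
y <- 3 + 2 * x + rnorm(100)                      # made-up data: intercept 3, slope 2

ols_loss <- function(beta) mean((beta[1] + beta[2] * x - y)^2)

optim(c(0, 0), ols_loss)$par   # numerical estimates of beta_0 and beta_1
coef(lm(y ~ x))                # closed-form OLS estimates, essentially the same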
In case of more predictors, the model is of the form
\[h(x) = \beta_0 + \beta_1 x_1 + \dots + \beta_m x_m \qquad \beta_0, \beta_1, \dots, \beta_m \in \mathbb{R}\]
What does \(h(x) = \beta_0 + \beta_1 x_1 + \dots + \beta_m x_m\) represent?
How can we find the best \(\beta_0,\beta_1, \dots, \beta_m\)?
Since
\[\hat{y}^{(i)} = h(x^{(i)}) = \beta_0 + \beta_1 x_1^{(i)} + \dots + \beta_m x_m^{(i)}\]
for \(i = 1, \dots, N\), then our problem is
\[ \min_{\beta_0, \beta_1, \dots, \beta_m} \left[ \frac{1}{N} \sum_{i = 1}^N \left( \beta_0 + \beta_1 x_1^{(i)} + \dots + \beta_m x_m^{(i)} - y^{(i)} \right)^2 \right] \]
We wrote the general statement of the problem for linear regression!
Linear regression is one possible model, there are many more!
In “classical” statistics a similar model is discussed, but there are assumptions to check and additional results that can be derived. We won’t discuss most of them.
How can we find the best parameters?
The idea is to start from a random point and then repeatedly update our approximation of the location of the minimum by moving in the direction opposite to the gradient at that point
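Below is a minimal gradient-descent sketch in R for the one-predictor OLS loss. The data, learning rate and starting values are assumptions for illustration; it is not necessarily the exact code that produced the traces that follow.
set.seed(1)
x <- runif(100, 0, 5)
y <- -3 + 2 * x + rnorm(100)           # made-up data

b_0 <- 40; b_1 <- -10                  # arbitrary starting point
lr  <- 0.01                            # learning rate (step size)

for (epoch in 1:2000) {
  y_hat  <- b_0 + b_1 * x
  grad_0 <- 2 * mean(y_hat - y)        # partial derivative of the loss w.r.t. b_0
  grad_1 <- 2 * mean((y_hat - y) * x)  # partial derivative of the loss w.r.t. b_1
  b_0 <- b_0 - lr * grad_0             # step in the direction opposite to the gradient
  b_1 <- b_1 - lr * grad_1
}
c(b_0, b_1)                            # estimates after 2000 epochs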
First 10 iterations
# A tibble: 10 × 4
epoch b_0 b_1 loss
<int> <dbl> <dbl> <dbl>
1 1 40 -10 1431.
2 2 40.3 -6.50 591.
3 3 40.3 -5.26 487.
4 4 40.1 -4.81 471.
5 5 40.0 -4.63 465.
6 6 39.7 -4.55 461.
7 7 39.5 -4.50 456.
8 8 39.3 -4.46 452.
9 9 39.1 -4.42 447.
10 10 38.9 -4.39 443.
Last 10 iterations
# A tibble: 10 × 4
epoch b_0 b_1 loss
<int> <dbl> <dbl> <dbl>
1 1991 -3.11 2.02 2.32
2 1992 -3.11 2.02 2.32
3 1993 -3.11 2.02 2.32
4 1994 -3.11 2.02 2.32
5 1995 -3.11 2.02 2.32
6 1996 -3.11 2.02 2.32
7 1997 -3.11 2.02 2.32
8 1998 -3.11 2.02 2.32
9 1999 -3.11 2.02 2.32
10 2000 -3.11 2.02 2.32
Finding the best values for the parameters is a job that we let computers do. From a machine learning practitioner’s perspective, having an idea of how gradient descent works is enough: you don’t need to know all the mathematical details.
The logistic function is defined by
\[\sigma(x) = \frac{1}{1+e^{-x}}\]
It is also known as the sigmoid function (its inverse is the logit function).
If \(x\) is large and positive, then \(\sigma(x) \approx 1\)
If \(x\) is large and negative, then \(\sigma(x) \approx 0\) (see the quick R check below)
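A one-line R definition and a quick check of this behaviour:
sigmoid <- function(x) 1 / (1 + exp(-x))

sigmoid(10)    # ~ 0.99995   (large positive x)
sigmoid(-10)   # ~ 0.0000454 (large negative x)
sigmoid(0)     # 0.5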
A possible model for classification (with one predictor) is given by
\[h(x) = \sigma(\beta_0 + \beta_1 x) = \frac{1}{1+e^{-(\beta_0 + \beta_1 x)}} \qquad \beta_0, \beta_1 \in \mathbb{R}\]
What does \(h(x)\) represent?
How can we find the best \(\beta_0\) and \(\beta_1\)?
Logistic regression for classification?!?!
The regression part of the name logistic regression derives from technical similarities with linear regression in how the parameters of the two models are estimated. It’s just a name, a legacy from the past. However, logistic regression is used only for classification.
In classification problems a common choice for the loss function is the cross-entropy loss.
\[ \mathcal{L}(\hat{y}, y) = -\frac{1}{N} \sum_{i = 1}^N \left(y^{(i)} \log \hat{y}^{(i)} + (1 - y^{(i)}) \log (1 - \hat{y}^{(i)}) \right) \]
where for \(i = 1, \dots, N\)
\[ \hat{y}^{(i)} = h(x^{(i)}) = \sigma(\beta_0 + \beta_1 x^{(i)}) = \frac{1}{1+e^{-(\beta_0 + \beta_1 x^{(i)})}} \]
Our problem is
\[ \min_{\beta_0, \beta_1} \left[ -\frac{1}{N} \sum_{i = 1}^N \left(y^{(i)} \log \hat{y}^{(i)} + (1 - y^{(i)}) \log (1 - \hat{y}^{(i)}) \right) \right] \]
where
\[\hat{y}^{(i)} = h(x^{(i)}) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x^{(i)})}}\]
for \(i = 1, \dots, N\).
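Written directly from the formula, a minimal R sketch of the cross-entropy loss for the one-predictor logistic model (the function and argument names are made up):
cross_entropy <- function(b_0, b_1, x, y) {
  y_hat <- 1 / (1 + exp(-(b_0 + b_1 * x)))            # predicted probabilities
  -mean(y * log(y_hat) + (1 - y) * log(1 - y_hat))    # average cross-entropy
}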
Let’s focus on a single observation.
If \(y = 1\), then
\[\text{loss} = -\left(y \log \hat{y} + (1 - y) \log(1 - \hat{y}) \right)\]
becomes
\[\text{loss} = -\log \hat{y}\]
Hence the loss is lower for \(\hat{y}\) approaching \(1\).
If \(y = 0\), then
\[\text{loss} = -\left(y \log \hat{y} + (1 - y) \log(1 - \hat{y}) \right)\]
becomes
\[\text{loss} = -\log(1 - \hat{y})\]
Hence the loss is lower for \(\hat{y}\) approaching \(0\).
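A quick numerical check of these two cases in R:
# If y = 1, the loss is -log(y_hat): small when y_hat is close to 1
-log(0.99)       # ~ 0.01
-log(0.01)       # ~ 4.6

# If y = 0, the loss is -log(1 - y_hat): small when y_hat is close to 0
-log(1 - 0.01)   # ~ 0.01
-log(1 - 0.99)   # ~ 4.6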
Assuming that the hypothesis function has the form used in logistic regression may seem unnatural.
However, two good features come with logistic regression:
Predictions are values between \(0\) and \(1\), therefore they can be interpreted as probabilities
Compared to linear regression, logistic regression is less affected by outliers.
In case of more predictors, the model is of the form
\[ \begin{multline} h(x) = \sigma(\beta_0 + \beta_1 x_1 + \dots + \beta_m x_m) \\ = \frac{1}{1+e^{-(\beta_0 + \beta_1 x_1 + \dots + \beta_m x_m)}} \\ \beta_0, \beta_1, \dots, \beta_m \in \mathbb{R} \end{multline} \]
How can we find the best \(\beta_0,\beta_1, \dots, \beta_m\)?
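Again we let the computer do this, either with gradient descent (traces below) or, in everyday R practice, with the built-in glm() and family = binomial. A minimal sketch on made-up data:
set.seed(1)
N  <- 200
x1 <- rnorm(N); x2 <- rnorm(N)                      # made-up predictors
p  <- 1 / (1 + exp(-(-1 + 2 * x1 - 0.5 * x2)))      # "true" probabilities
y  <- rbinom(N, size = 1, prob = p)                 # 0/1 outcomes

fit <- glm(y ~ x1 + x2, family = binomial)          # maximizes the likelihood,
coef(fit)                                           # equivalent to minimizing cross-entropy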
First 10 iterations
# A tibble: 10 × 4
epoch b_0 b_1 cross_entropy
<int> <dbl> <dbl> <dbl>
1 1 7 -8 12.9
2 2 6.99 -7.97 12.8
3 3 6.99 -7.95 12.8
4 4 6.98 -7.92 12.7
5 5 6.98 -7.89 12.7
6 6 6.97 -7.86 12.7
7 7 6.97 -7.84 12.6
8 8 6.96 -7.81 12.6
9 9 6.96 -7.78 12.6
10 10 6.95 -7.76 12.5
Last 10 iterations
# A tibble: 10 × 4
epoch b_0 b_1 cross_entropy
<int> <dbl> <dbl> <dbl>
1 4991 -1.64 3.45 0.121
2 4992 -1.64 3.45 0.121
3 4993 -1.64 3.45 0.121
4 4994 -1.64 3.45 0.121
5 4995 -1.64 3.45 0.121
6 4996 -1.64 3.45 0.121
7 4997 -1.64 3.45 0.121
8 4998 -1.64 3.45 0.121
9 4999 -1.64 3.45 0.121
10 5000 -1.64 3.45 0.121
Yes!
During this course you’ll learn how to train different families of models. Then when you have a new problem, you’ll try the models you have in your toolbox and pick the best one.
Why should we try (and therefore learn to use) different families of functions and not try the “best” family?
The No Free Lunch theorem states that there is no single model that works best for every situation. Because the assumptions of a great model for one problem may not hold for another, it is common in machine learning to try several models and pick the one that performs best for a specific problem.
R opinionated (of course!) references:
Bradley Boehmke and Brandon M. Greenwell, Hands-On Machine Learning with R, 2019 free access
Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani, An Introduction to Statistical Learning: with Applications in R (2nd Edition), 2021, free access
Max Kuhn and Kjell Johnson, Applied Predictive Modeling, 2013
Max Kuhn and Kjell Johnson, Feature Engineering and Selection: A Practical Approach for Predictive Models, 2019, free access
Max Kuhn and Julia Silge, Tidy Modeling with R, 2022, free access
Brett Lantz, Machine Learning with R (3rd Edition), 2019
If you really want to know more mathematical details, the following references will help. But be warned, they are not easy reading…
Luc Devroye, Gábor Lugosi, and László Györfi, A Probabilistic Theory of Pattern Recognition, 1996
Trevor Hastie, Robert Tibshirani, and Jerome Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction (2nd Edition), 2009
Shai Shalev-Shwartz and Shai Ben-David, Understanding Machine Learning, 2014
Questions?