Classification Part 1

Fundamentals of Machine Learning for NHS using R

Classification

Part 1 Logistic Regression
Part 2 Decision Trees & Random Forests

Logistic Regression

Logistic regression is foremost used to model a binary (0,1) variable \(Y\), based on one or more other variables, called predictors.
The binary variable being modelled is generally referred to as the response variable, or the dependent variable.

Logistic Regression

For a model to fit the data well, it is assumed that:
- The predictors are uncorrelated with one another.
- That they are significantly related to the response variable.
- The observations or data elements of a model are also uncorrelated.

Logistic Regression

Logistic regression is typically utilised for 3 purposes:
- To predict the probability that the response variable \(Y\) is equal to 1.
- To categorise outcomes or predictors.
- To assess the odds or risk associated with model predictors.

Logistic Regression

Aim: To predict the dependent variable \(Y\). \[P(Y=0)=1-P(Y=1)\] Denote \(k\) independent variables \[x_1,x_2, .,x_k\]

Logistic Regression

At the centre of logistic regression analysis is the task of estimating the log odds of an event. \[ Odds = P(Y=1)/P(Y=0) \]
Odds \(> 1 if (Y=1)\) is more likely
Odds \(< 1 if (Y=0)\) is more likely

Logistic Regression

Suppose we roll a dice numbered 1 to 6 and we are interested in the outcome of rolling greater than a 2. \[ P(Roll>2) = 4/6 \\ P(Roll \leq 2) = 2/6 \] Therefore, the odds of rolling >2 \[ Odds=P(Roll>2)/P(Roll \leq 2)= 4/2 = 2 \] Odds 2:1

Logistic regression

However, there is a problem using this calculation in the model. If probability =1 then \[ Odds = 1/0= \infty \]
So, we use log odds instead. \[ Odds = Ln(P(Y=1)/P(Y=0) ) \] It maps probability ranging between 0 and 1 to log odds ranging from negative infinity to positive infinity.

Logistic regression

Logistic Regression

From mathematics, it can be shown that: \[ Ln\frac{P(Y=1)}{P(Y=0)}=\beta_0+\beta_1x_1+\beta_2x_2+\ldots+\beta_k x_k \] Hence we need to rearrange this function in terms of P(Y=1).

Logistic Regression

Given the fact that \[ \frac{P(Y=1)}{P(Y=0)}=\frac{P(Y=1)}{1 - P(Y=1)} \]

Logistic Regression

Taking logs of both sides you get \[ \frac{P(Y=1)}{1 - P(Y=1)} = e^{\beta_0+\beta_1x_1+\beta_2x_2+\ldots+\beta_k x_k} \]

Logistic Regression

The logit function or the log-odds is therefore \[ P(Y=1) = \frac{1}{1 + e^-(\beta_0+\beta_1x_1+\beta_2x_2+\ldots+\beta_k x_k)} \]

Logistic Regression

Positive values are predicted as class 1
Negative values are predicted as class 0
The coefficients are selected to:
- Predict a high probability for the positive cases
- Predict a low probability for the low cases

Logisitc Regression

Logistic Regression

After fitting the model, it is likely that researchers will want to examine the contribution of individual predictors.
To do so, they will want to examine the regression coefficients.
In logistic regression, there are several different tests designed to assess the significance of an individual predictor.

Logistic Regression

The AIC provides a method for assessing the quality of your model through comparison of related models.
Unlike adjusted R-squared, the number itself is not meaningful.
If you have more than one similar candidate models, then you should select the model that has the smallest AIC.
So its useful for comparing models, but isn’t interpretable on its own.

Workshop: Logistic Regression

Thank you!

Questions?