Feature Engineering

Fundamentals of Machine Learning for NHS using R

Today’s Plan

 

  • Feature Engineering
    • Feature engineering techniques including, but not limited to: transformations, feature extraction, feature reduction, and feature selection.

Data transformation

 

  • Once data cleaning is complete, we transform the cleaned data into alternative forms by changing the value, structure, or format of the data.
  • Many machine learning models perform poorly when the input numerical variables have very different scales.

  • Min-max normalisation
    • Values are shifted and rescaled so that they end up ranging from 0 to 1.
  • Standardisation
    • Subtracts the mean from each value and divides by the standard deviation, resulting in a new distribution with zero mean and unit variance (both rescalings are sketched below).
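
A minimal sketch of both rescalings in base R, using a hypothetical vector of waiting times (the data and variable names are illustrative, not from the course materials):

    # Hypothetical numeric predictor
    waiting_days <- c(12, 45, 7, 30, 90, 21)

    # Min-max normalisation: shift and rescale values into the range [0, 1]
    min_max <- (waiting_days - min(waiting_days)) /
      (max(waiting_days) - min(waiting_days))

    # Standardisation: subtract the mean and divide by the standard
    # deviation, giving a distribution with mean 0 and unit variance
    standardised <- (waiting_days - mean(waiting_days)) / sd(waiting_days)

    # Base R's scale() performs the same standardisation in one call
    scale(waiting_days)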

Feature Engineering

Categorical variables

 

  • Encoding categorical variables is also known as creating dummy variables.
  • Categorical features are replaced by one or more new features that take numerical values.
  • Any number of categories can be represented by introducing one new feature per category, as shown in the sketch below.
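
A minimal sketch of both encodings with base R's model.matrix(), using a hypothetical hospital-site variable:

    # Hypothetical nominal categorical feature
    site <- factor(c("Leeds", "York", "Leeds", "Hull"))

    # Dummy encoding: one 0/1 feature per category except a reference
    # level, which is absorbed into the intercept
    model.matrix(~ site)

    # One-hot encoding: one 0/1 feature per category
    # (dropping the intercept keeps every level)
    model.matrix(~ site - 1)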

When to use one-hot and dummy encoding

 

  • Both types of encoding can be used for ordinal and nominal categorical variables.
  • However, if you want to preserve the natural order of an ordinal categorical variable, use label encoding instead.

Label encoding
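
A minimal sketch of label encoding in base R, using a hypothetical quality variable: a factor stores one integer code per level, so converting an ordered factor to integers yields a label encoding that preserves the natural order.

    # Hypothetical ordinal feature with a natural order
    quality <- factor(c("Good", "Premium", "Good", "Premium"),
                      levels = c("Good", "Premium"),
                      ordered = TRUE)

    # Label encoding: replace each category with its integer code
    as.integer(quality)  # Good -> 1, Premium -> 2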

 

Advantages & disadvantages

 

  • One advantage of label encoding is that it does not expand the feature space: category names are simply replaced with numbers.
  • The major disadvantage of label encoding is that machine learning algorithms may infer numerical relationships between the encoded categories that do not exist.
  • For example, an algorithm may interpret Premium (2) as twice as good as Good (1).

Feature Engineering

Continuous variables

  • Binning takes a numeric predictor and pre-categorises, or “bins”, it into two or more groups.
  • Binning can be used to create new features or simply to categorise existing features.
  • Binning continuous predictors has drawbacks: categorising them can lead to a loss of precision in a model’s predictions.

Binning
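
A minimal sketch of binning with base R's cut(), using a hypothetical age variable; the breaks and labels are illustrative:

    # Hypothetical continuous predictor
    age <- c(5, 17, 34, 52, 71, 88)

    # Pre-categorise ("bin") the numeric values into groups
    age_band <- cut(age,
                    breaks = c(0, 18, 40, 65, Inf),
                    labels = c("Child", "Young adult", "Adult", "Older adult"))

    # Frequency of each bin
    table(age_band)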

 

Feature selection

 

  • Removing predictors - fewer predictors mean less computational time and complexity.
  • Collinearity - the situation where a pair of predictor variables are substantially correlated with each other.
  • If two predictors are highly correlated, should we simply discard one? (See the sketch below.)
  • We shouldn’t blindly follow this correlation rule, though: highly correlated features can instead be combined to create new features.
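
One common implementation of the “discard one of a highly correlated pair” rule is findCorrelation() from the caret package; a minimal sketch on R's built-in mtcars data, with an illustrative 0.9 cutoff:

    library(caret)

    # Pairwise correlations between the numeric predictors
    cor_matrix <- cor(mtcars)

    # Column indices to drop so that no remaining pair of predictors
    # has an absolute correlation above the cutoff
    high_cor <- findCorrelation(cor_matrix, cutoff = 0.9)
    reduced <- mtcars[, -high_cor]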

Feature reduction

 

  • These methods reduce the data by generating a smaller set of predictors.
  • The new predictors capture the majority of the information in the original variables.
  • Fewer variables can be used while still providing reasonable fidelity to the original data.
  • For most data reduction techniques, the new predictors are functions of the original predictors.
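
Principal component analysis is one such technique; a minimal sketch with base R's prcomp() on the built-in mtcars data:

    # Centre and scale the predictors, then compute principal components
    pca <- prcomp(mtcars, center = TRUE, scale. = TRUE)

    # Proportion of the original variance each new predictor captures
    summary(pca)

    # Keep the first few components as the smaller set of predictors;
    # each one is a linear combination of the original predictors
    reduced_predictors <- pca$x[, 1:3]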

Thank you!

Questions?