Regularization

Fundamentals of Machine Learning for NHS using R

A New Dataset

Ames Housing

Ask a home buyer to describe their dream house, and they probably won’t begin with the height of the basement ceiling or the proximity to an east-west railroad. But this playground competition’s dataset proves that much more influences price negotiations than the number of bedrooms or a white-picket fence.

Dataset Overview

  • 2,930 observations
  • 82 variables
  • sale_price is the target variable

Dataset Variables

Variable Name Type Description
order Numeric Unique observation ID
pid Nominal Parcel identification number - can be used with city web site for parcel review
ms_subclass Nominal

Identifies the type of dwelling involved in the sale

  • One Story 1946 and Newer All Styles
  • One Story 1945 and Older
  • One Story with Finished Attic All Ages
  • One and Half Story Unfinished All Ages
  • One and Half Story Finished All Ages
  • Two Story 1946 and Newer
  • Two Story 1945 and Older
  • Two and Half Story All Ages
  • Split or Multilevel
  • Split Foyer
  • Duplex All Styles and Ages
  • One Story PUD 1946 and Newer
  • One and Half Story PUD All Ages
  • Two Story PUD 1946 and Newer
  • PUD Multilevel Split Level Foyer
  • Two Family Conversion All Styles and Ages
ms_zoning Nominal

Identifies the general zoning classification of the sale

  • Agriculture
  • Commercial
  • Floating Village Residential
  • Industrial
  • Residential High Density
  • Residential Low Density
  • Residential Medium Density
lot_frontage Numeric Linear feet of street connected to property
lot_area Numeric Lot size in square feet
street Nominal

Type of road access to property

  • Gravel
  • Paved
alley Nominal

Type of alley access to property

  • Gravel
  • Paved
  • No Alley Access
lot_shape Nominal

General shape of property

  • Irregular
  • Moderately Irregular
  • Slightly Irregular
  • Regular
land_contour Nominal

Flatness of the property

  • Near Flat Level
  • Banked
  • Hillside
  • Depression
utilities Nominal

Type of utilities available

  • Electricity Gas
  • Electricity Gas Water
  • All Public Utilities
lot_configuration Nominal

Lot configuration

  • Inside Lot
  • Corner Lot
  • Cul de Sac
  • Frontage on 2 Sides
  • Frontage on 3 Sides
land_slope Nominal

Slope of property

  • Gentle Slope
  • Moderate Slope
  • Severe Slope
neighborhood Nominal

Physical locations within Ames city limits

  • Bloomington Heights
  • Bluestem
  • Briardale
  • Brookside
  • Clear Creek
  • College Creek
  • Crawford
  • Edwards
  • Gilbert
  • Greens
  • Green Hills
  • Iowa DOT and Rail Road
  • Landmark
  • Meadow Village
  • Mitchell
  • North Ames
  • Northridge
  • Northpark Villa
  • Northridge Heights
  • Northwest Ames
  • Old Town
  • South and West of Iowa State University
  • Sawyer
  • Sawyer West
  • Somerset
  • Stone Brook
  • Timberland
  • Veenker
  • Hayden Lake
condition_1 Nominal

Proximity to various conditions

  • Adjacent Arterial Street
  • Adjacent Feeder Street
  • Normal
  • Within 200 North South Railroad
  • Adjacent North South Railroad
  • Near Positive off Site Feature
  • Adjacent Positive off Site Feature
  • Within 200 East West Railroad
  • Adjacent East West Railroad
condition_2 Nominal

Proximity to various conditions (if more than one is present)

  • Adjacent Arterial Street
  • Adjacent Feeder Street
  • Normal
  • Within 200 North South Railroad
  • Adjacent North South Railroad
  • Near Positive off Site Feature
  • Adjacent Positive off Site Feature
  • Within 200 East West Railroad
  • Adjacent East West Railroad
building_type Nominal

Type of dwelling

  • One Family Detached
  • Two Family Conversion
  • Duplex
  • Townhouse End Unit
  • Townhouse Inside Unit
house_style Nominal

Style of dwelling

  • One Story
  • One and Half Story 2nd Level Finished
  • One and Half Story 2nd Level Unfinished
  • Two Story
  • Two and Half Story 2nd Level Finished
  • Two and Half Story 2nd Level Unfinished
  • Split Foyer
  • Split Level
overall_quality Nominal

Rates the overall material and finish of the house

  • Very Poor
  • Poor
  • Fair
  • Below Average
  • Average
  • Above Average
  • Good
  • Very Good
  • Excellent
  • Very Excellent
overall_condition Nominal

Rates the overall condition of the house

  • Very Poor
  • Poor
  • Fair
  • Below Average
  • Average
  • Above Average
  • Good
  • Very Good
  • Excellent
  • Very Excellent
year_built Numeric Original construction date
year_remod_add Numeric Remodel date (same as construction date if no remodeling or additions)
roof_style Nominal

Type of roof

  • Flat
  • Gable
  • Gambrel Barn
  • Hip
  • Mansard
  • Shed
roof_material Nominal

Roof material

  • Clay or Tile
  • Composite Shingle
  • Membrane
  • Metal
  • Roll
  • Gravel and Tar
  • Wood Shakes
  • Wood Shingles
exterior_1 Nominal

Exterior covering on house

  • Asbestos Shingles
  • Asphalt Shingles
  • Brick Common
  • Brick Face
  • Cinder Block
  • Cement Board
  • Hard Board
  • Imitation Stucco
  • Metal Siding
  • Other
  • Plywood
  • PreCast
  • Stone
  • Stucco
  • Vinyl Siding
  • Wood Siding
  • Wood Shingles
exterior_2 Nominal

Exterior covering on house (if more than one material)

  • Asbestos Shingles
  • Asphalt Shingles
  • Brick Common
  • Brick Face
  • Cinder Block
  • Cement Board
  • Hard Board
  • Imitation Stucco
  • Metal Siding
  • Other
  • Plywood
  • PreCast
  • Stone
  • Stucco
  • Vinyl Siding
  • Wood Siding
  • Wood Shingles
masonry_veneer_type Nominal

Masonry veneer type

  • Brick Common
  • Brick Face
  • Cinder Block
  • None
  • Stone
masonry_veneer_area Numeric Masonry veneer area in square feet
exterior_quality Nominal

Evaluates the quality of the material on the exterior

  • Poor
  • Fair
  • Average Typical
  • Good
  • Excellent
exterior_condition Nominal

Evaluates the present condition of the material on the exterior

  • Poor
  • Fair
  • Average Typical
  • Good
  • Excellent
foundation Nominal

Type of foundation

  • Brick and Tile
  • Cinder Block
  • Poured Concrete
  • Slab
  • Stone
  • Wood
basement_quality Nominal

Evaluates the height of the basement

  • No Basement
  • Poor
  • Fair
  • Average Typical
  • Good
  • Excellent
basement_condition Nominal

Evaluates the general condition of the basement

  • No Basement
  • Poor
  • Fair
  • Average Typical
  • Good
  • Excellent
basement_exposure Nominal

Refers to walkout or garden level walls

  • No Basement
  • No Exposure
  • Minimum Exposure
  • Average Exposure
  • Good Exposure
basement_fin_type_1 Nominal

Rating of basement finished area

  • No Basement
  • Unfinished
  • Low Quality
  • Average Rec Room
  • Below Average Living Quarters
  • Average Living Quarters
  • Good Living Quarters
basement_area_type_1 Numeric Type 1 finished square feet
basement_fin_type_2 Nominal

Rating of basement finished area (if multiple types)

  • No Basement
  • Unfinished
  • Low Quality
  • Average Rec Room
  • Below Average Living Quarters
  • Average Living Quarters
  • Good Living Quarters
basement_area_type_2 Numeric Type 2 finished square feet
basement_unfinished_area Numeric Unfinished square feet of basement area
basement_total_area Numeric Total square feet of basement area
heating Nominal

Type of heating

  • Floor Furnace
  • Gas Forced Warm Air Furnace
  • Gas Hot Water or Steam Heat
  • Gravity Furnace
  • Hot Water or Steam Heat other than Gas
  • Wall Furnace
heating_quality Nominal

Heating quality and condition

  • Poor
  • Fair
  • Average Typical
  • Good
  • Excellent
central_air Nominal

Central air conditioning

  • No
  • Yes
electrical Nominal

Electrical system

  • Mixed
  • Fuse Box Poor
  • Fuse Box Fair
  • Fuse Box Average
  • Standard Circuit Breakers and Romex
first_floor_area Numeric First Floor square feet
second_floor_area Numeric Second floor square feet
low_quality_finished_area Numeric Low quality finished square feet (all floors)
ground_living_area Numeric Above grade (ground) living area square feet
basement_full_bathrooms Numeric Basement full bathrooms
basement_half_bathrooms Numeric Basement half bathrooms
full_bathrooms Numeric Full bathrooms above grade
half_bathrooms Numeric Half baths above grade
bedroom_above_ground Numeric Bedrooms above grade (does NOT include basement bedrooms)
kitchen_above_ground Numeric Kitchens above grade
kitchen_quality Nominal

Kitchen quality

  • Poor
  • Fair
  • Average Typical
  • Good
  • Excellent
total_rooms_above_ground Numeric Total rooms above grade (does not include bathrooms)
functional Nominal

Home functionality (Assume typical unless deductions are warranted)

  • Salvage Only
  • Severely Damaged
  • Major Deductions 2
  • Major Deductions 1
  • Moderate Deductions
  • Minor Deductions 2
  • Minor Deductions 1
  • Typical Functionality
fireplaces Numeric Number of fireplaces
fireplace_quality Nominal

Fireplace quality

  • No Fireplace
  • Poor
  • Fair
  • Average Typical
  • Good
  • Excellent
garage_type Nominal

Garage location

  • More than One Type
  • Attached to Home
  • Basement Garage
  • Built-In
  • Car Port
  • Detached from Home
  • No Garage
garage_year_built Numeric Year garage was built
garage_finish Nominal

Interior finish of the garage

  • No Garage
  • Finished
  • Rough Finished
  • Unfinished
garage_cars Numeric Size of garage in car capacity
garage_area Numeric Size of garage in square feet
garage_quality Nominal

Garage quality

  • No Garage
  • Poor
  • Fair
  • Average Typical
  • Good
  • Excellent
garage_condition Nominal

Garage condition

  • No Garage
  • Poor
  • Fair
  • Average Typical
  • Good
  • Excellent
paved_drive Nominal

Paved driveway

  • Paved
  • Partial Pavement
  • Dirt Gravel
wood_deck_area Numeric Wood deck area in square feet
open_porch_area Numeric Open porch area in square feet
enclosed_porch_area Numeric Enclosed porch area in square feet
three_season_porch_area Numeric Three season porch area in square feet
screen_porch_area Numeric Screen porch area in square feet
pool_area Numeric Pool area in square feet
pool_quality Nominal

Pool quality

  • No Pool
  • Fair
  • Average Typical
  • Good
  • Excellent
fence Nominal

Fence quality

  • No Fence
  • Minimum Wood Wire
  • Good Wood
  • Minimum Privacy
  • Good Privacy
misc_feature Nominal

Miscellaneous feature not covered in other categories

  • Elevator
  • Second Garage if not Described in Garage Section
  • Other
  • Shed over 100 SF
  • Tennis Court
  • None
misc_value Numeric Value of miscellaneous feature ($)
month_sold Numeric Month Sold
year_sold Numeric Year Sold
sale_type Nominal

Type of sale

  • Warranty Deed Conventional
  • Warranty Deed Cash
  • Warranty Deed VA Loan
  • Home just Constructed and Sold
  • Court Officer Deed Estate
  • Contract 15% Down Payment Regular Terms
  • Contract Low Down Payment and Low Interest
  • Contract Low Interest
  • Contract Low Down
  • Other
sale_condition Nominal

Condition of sale

  • Normal Sale
  • Abnormal Sale
  • Adjoining Land Purchase
  • Allocation
  • Sale between Family Members
  • Home was not Completed when Last Assessed
sale_price Numeric Sale price ($)

Workshop 1: EDA

Review of Linear Regression

Model

\[ h(x) = \beta_0 + \beta_1 x_1 + \dots + \beta_m x_m \qquad \beta_0, \beta_1, \dots, \beta_m \in \mathbb{R} \]

Problem

How can we find the best \(\beta_0,\beta_1, \dots, \beta_m\)?

Statement of the Problem

Since

\[ \hat{y}^{(i)} = h(x^{(i)}) = \beta_0 + \beta_1 x_1^{(i)} + \dots + \beta_m x_m^{(i)} \]

for \(i = 1, \dots, N\), our problem is

\[ \min_{\beta_0, \beta_1, \dots, \beta_m} \left[ \frac{1}{N} \sum_{i = 1}^N \left( y^{(i)} - \beta_0 - \beta_1 x_1^{(i)} - \dots - \beta_m x_m^{(i)} \right)^2 \right] \]

We can rewrite the problem as

Let \(\beta = \left( \beta_0, \beta_1, \dots, \beta_m \right)\). Then we want to find

\[ \min_{\beta} \left[ \frac{1}{N} \sum_{i = 1}^N \left( y^{(i)} - \hat{y}^{(i)} \right)^2 \right] \]
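The minimization above can be checked numerically. The following is a minimal base-R sketch on synthetic data (the data, the helper `mse`, and all variable names are made up for illustration and are not part of the Ames workshop material). It computes the minimizer from the normal equations and compares it with `lm()`, which solves the same least-squares problem.

```r
# Synthetic data (hypothetical, for illustration only)
set.seed(1)
N  <- 100
x1 <- rnorm(N)
x2 <- rnorm(N)
y  <- 2 + 3 * x1 - 1 * x2 + rnorm(N, sd = 0.5)

# Mean squared error for a coefficient vector beta = (beta_0, beta_1, beta_2)
mse <- function(beta, X, y) mean((y - X %*% beta)^2)

# Design matrix with a column of ones for the intercept beta_0
X <- cbind(1, x1, x2)

# Minimizer of the MSE via the normal equations: (X'X) beta = X'y
beta_hat <- drop(solve(crossprod(X), crossprod(X, y)))

# lm() solves the same least-squares problem
fit <- lm(y ~ x1 + x2)

print(beta_hat)
print(coef(fit))
print(mse(beta_hat, X, y))
```

Moving any coefficient away from `beta_hat` increases the value returned by `mse`, which is what it means for `beta_hat` to be the minimizer.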

Cross-validation

Problem

When we have many predictors:

  • Some of them may not be needed to predict the target

  • Some others may be correlated, and this can affect our model performance
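To see why correlated predictors matter, here is a small simulation sketch in base R (synthetic data; the function `simulate_beta1` and its arguments are hypothetical names chosen for this illustration). It refits an ordinary linear regression on many simulated datasets and measures how much the estimated coefficient of `x1` varies when `x2` is almost a copy of `x1`, compared with when the two are nearly unrelated.

```r
set.seed(42)

# Refit lm() on n_sims synthetic datasets and collect the estimate of beta_1.
# A small noise_sd makes x2 nearly equal to x1 (strong correlation);
# a large noise_sd makes x2 almost independent of x1.
simulate_beta1 <- function(noise_sd, n_sims = 200, N = 50) {
  replicate(n_sims, {
    x1 <- rnorm(N)
    x2 <- x1 + rnorm(N, sd = noise_sd)
    y  <- 1 + x1 + x2 + rnorm(N)
    coef(lm(y ~ x1 + x2))[["x1"]]
  })
}

sd_correlated   <- sd(simulate_beta1(noise_sd = 0.1))
sd_uncorrelated <- sd(simulate_beta1(noise_sd = 10))

print(c(correlated = sd_correlated, uncorrelated = sd_uncorrelated))
```

The spread of the estimates is much larger in the correlated case: the model cannot tell the two predictors apart, so the individual coefficients fluctuate from sample to sample.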

Regularization: a possible solution

  • Regularized regression provides an approach to constrain the total size of coefficient estimates

  • Constraints help to reduce the magnitude and fluctuations of the coefficients and will reduce the variance of the model

  • This results in more stable coefficients and model performance

How?

A regularized regression model has a loss function of the form

\[ \min_{\beta} \left[ \frac{1}{N} \sum_{i = 1}^N \left( y^{(i)} - \hat{y}^{(i)} \right)^2 + \text{penalty} \right] \]

LASSO Regression

LASSO (least absolute shrinkage and selection operator) regression solves the following problem

\[ \min_{\beta} \left[ \frac{1}{N} \sum_{i = 1}^N \left( y^{(i)} - \hat{y}^{(i)} \right)^2 + \lambda \sum_{j = 1}^m |\beta_j| \right] \]

  • Size of penalty is controlled by \(\lambda\)

  • When \(\lambda = 0\), the problem is the same as linear regression

  • As \(\lambda \to +\infty\) the penalty forces coefficients towards \(0\)

  • Performs automated feature selection
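The behaviour described in the bullets above can be reproduced with a short coordinate-descent sketch in base R. This is an illustration under stated assumptions, not the solver used in the workshops: the data are synthetic, `soft_threshold` and `lasso_cd` are hypothetical helper names, and a fixed number of sweeps stands in for a proper convergence check. The intercept \(\beta_0\) is left unpenalized, as in the formula above. Because the loss uses \(\frac{1}{N}\) rather than \(\frac{1}{2N}\), the soft-threshold level is \(\lambda / 2\).

```r
# Soft-thresholding operator: shrinks z towards 0 and sets it to exactly 0
# when |z| <= t
soft_threshold <- function(z, t) sign(z) * pmax(abs(z) - t, 0)

# Coordinate descent for (1/N) * sum((y - y_hat)^2) + lambda * sum(|beta_j|)
lasso_cd <- function(X, y, lambda, n_iter = 200) {
  N       <- nrow(X)
  x_means <- colMeans(X)
  Xc      <- sweep(X, 2, x_means)   # centre predictors
  yc      <- y - mean(y)            # centre target
  beta    <- rep(0, ncol(X))
  for (it in seq_len(n_iter)) {
    for (j in seq_along(beta)) {
      # Partial residual: what is left after removing all other predictors
      r_j <- yc - drop(Xc[, -j, drop = FALSE] %*% beta[-j])
      rho <- sum(Xc[, j] * r_j) / N
      beta[j] <- soft_threshold(rho, lambda / 2) / (sum(Xc[, j]^2) / N)
    }
  }
  beta0 <- mean(y) - sum(x_means * beta)  # unpenalized intercept
  c(beta0, beta)
}

# Synthetic data: x3 is pure noise and does not influence y
set.seed(1)
N <- 200
X <- cbind(x1 = rnorm(N), x2 = rnorm(N), x3 = rnorm(N))
y <- 2 + 3 * X[, "x1"] - 1 * X[, "x2"] + rnorm(N, sd = 0.5)

beta_ols   <- lasso_cd(X, y, lambda = 0)    # same as linear regression
beta_lasso <- lasso_cd(X, y, lambda = 1)    # drops the noise feature
beta_big   <- lasso_cd(X, y, lambda = 100)  # all slopes forced to 0

print(round(rbind(beta_ols, beta_lasso, beta_big), 3))
```

With \(\lambda = 0\) the result agrees with `lm()`; with a moderate \(\lambda\) the coefficient of the noise feature `x3` is exactly zero, which is the feature selection effect; with a very large \(\lambda\) every slope is zero and only the intercept remains.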

Workshop 2: LASSO

Ridge Regression

Ridge regression solves the following problem

\[ \min_{\beta} \left[ \frac{1}{N} \sum_{i = 1}^N \left( y^{(i)} - \hat{y}^{(i)} \right)^2 + \lambda \sum_{j = 1}^m \beta_j^2 \right] \]

  • Size of penalty is controlled by \(\lambda\)

  • When \(\lambda = 0\), the problem is the same as linear regression

  • As \(\lambda \to +\infty\) the penalty forces coefficients towards \(0\) (but never fully to \(0\))

  • Retains all features
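Unlike LASSO, the Ridge problem has a closed-form solution. With centred predictors \(X_c\) and centred target \(y_c\), setting the gradient of the objective above to zero gives \((X_c^\top X_c + N \lambda I)\,\beta = X_c^\top y_c\). The base-R sketch below uses that formula on synthetic, strongly correlated predictors (the data and the helper name `ridge_fit` are assumptions made for illustration; the intercept is left unpenalized).

```r
# Closed-form ridge fit for (1/N) * sum((y - y_hat)^2) + lambda * sum(beta_j^2)
ridge_fit <- function(X, y, lambda) {
  N       <- nrow(X)
  x_means <- colMeans(X)
  Xc      <- sweep(X, 2, x_means)   # centre predictors
  yc      <- y - mean(y)            # centre target
  beta <- drop(solve(crossprod(Xc) + N * lambda * diag(ncol(X)),
                     crossprod(Xc, yc)))
  c(mean(y) - sum(x_means * beta), beta)  # intercept first, then slopes
}

# Synthetic data with two strongly correlated predictors
set.seed(1)
N  <- 100
x1 <- rnorm(N)
x2 <- x1 + rnorm(N, sd = 0.1)
X  <- cbind(x1, x2)
y  <- 1 + x1 + x2 + rnorm(N)

# One column of coefficients per value of lambda
lambdas <- c(0, 0.1, 1, 10)
fits    <- sapply(lambdas, function(l) ridge_fit(X, y, l))

# Euclidean norm of the slopes (intercept excluded) for each lambda
slope_norms <- sqrt(colSums(fits[-1, , drop = FALSE]^2))

print(round(fits, 3))
print(round(slope_norms, 3))
```

The norm of the slopes decreases as \(\lambda\) grows, but no slope becomes exactly zero, so all features stay in the model.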

Workshop 3: Ridge

Elastic Net

Combines both LASSO and Ridge penalties

\[ \min_{\beta} \left[ \frac{1}{N} \sum_{i = 1}^N \left( y^{(i)} - \hat{y}^{(i)} \right)^2 + \lambda_1 \sum_{j = 1}^m |\beta_j| + \lambda_2 \sum_{j = 1}^m \beta_j^2 \right] \]

  • Size of penalty is controlled by \(\lambda_1\) and \(\lambda_2\)

  • Provides the best of both worlds

Elastic Net: another way of writing it

\[ \min_{\beta} \left[ \frac{1}{N} \sum_{i = 1}^N \left( y^{(i)} - \hat{y}^{(i)} \right)^2 + \lambda \left( \alpha \sum_{j = 1}^m |\beta_j| + (1 - \alpha) \sum_{j = 1}^m \beta_j^2 \right) \right] \]

  • Size of penalty is controlled by \(\lambda\)

  • \(\alpha\), called mixture, is such that \(0 \leq \alpha \leq 1\):

    • For \(\alpha = 0\) we have Ridge penalty (\(\lambda = \lambda_2\))
    • For \(\alpha = 1\) we have LASSO penalty (\(\lambda = \lambda_1\))
    • For \(0 < \alpha < 1\) we have a mix of the two penalties and

\[ \lambda = \lambda_1 + \lambda_2 \qquad \alpha = \frac{\lambda_1}{\lambda_1 + \lambda_2} \qquad 1 - \alpha = \frac{\lambda_2}{\lambda_1 + \lambda_2} \]
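The equivalence of the two ways of writing the Elastic Net penalty can be verified with a few lines of base R. The function names below (`to_lambda_alpha`, `penalty_two`, `penalty_mix`) and the example numbers are hypothetical, chosen only to illustrate the conversion; the sketch assumes \(\lambda_1 + \lambda_2 > 0\).

```r
# Convert (lambda_1, lambda_2) into (lambda, alpha); assumes lambda1 + lambda2 > 0
to_lambda_alpha <- function(lambda1, lambda2) {
  lambda <- lambda1 + lambda2
  list(lambda = lambda, alpha = lambda1 / lambda)
}

# Penalty written with two separate weights
penalty_two <- function(beta, lambda1, lambda2) {
  lambda1 * sum(abs(beta)) + lambda2 * sum(beta^2)
}

# Penalty written with an overall size lambda and a mixture alpha
penalty_mix <- function(beta, lambda, alpha) {
  lambda * (alpha * sum(abs(beta)) + (1 - alpha) * sum(beta^2))
}

# Example slopes (the intercept is not penalized, so it is not included)
beta <- c(1.5, -2, 0.5)

params <- to_lambda_alpha(lambda1 = 0.3, lambda2 = 0.1)
p_two  <- penalty_two(beta, lambda1 = 0.3, lambda2 = 0.1)
p_mix  <- penalty_mix(beta, params$lambda, params$alpha)

print(params)
print(c(two_weights = p_two, mixture = p_mix))
```

Both forms give the same penalty value, so tuning \((\lambda, \alpha)\) explores the same family of models as tuning \((\lambda_1, \lambda_2)\).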

Workshop 4: Elastic Net

3 Types of Penalties

  • The LASSO penalty performs feature selection, which is helpful when we have many unimportant (noisy) features

  • The Ridge penalty is typically more effective at controlling multicollinearity

  • Elastic Net provides a mix of both