Overview

Motivation

  1. Generate predictions for small data sets
  2. Evaluate variable (feature) importance across many subpopulations of the data
  3. Quantify prediction uncertainty and variance
  4. Develop classifiers that are evaluated on unseen data

Main advantages

  • Predictions are generated across many different training/validation splits of the data
  • Predictor (feature) importances with respect to the dependent variable are generalized over many subpopulations of the data
  • No data leakage: predictions are made only on observations not used during training

Procedural overview

  • Monte Carlo simulation repeatedly splits the data into training and validation sets

  • K-fold cross-validation (K = 10 by default) on the training set is used to estimate “good” model parameters

  • The model with the “good” parameters is fit on the entire training set

  • The refitted model predicts the yet-to-be-seen validation set

  • Performance metrics are computed on bootstrap resamples (drawn with replacement) of the predicted observation probabilities
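
The procedure above can be sketched in a few dozen lines of Python using only the standard library. This is a minimal illustration under stated assumptions, not the package's implementation: a one-feature k-nearest-neighbours classifier stands in for the real model, its neighbour count k stands in for the “good” parameters, accuracy stands in for the performance metrics, and "observation probabilities" is read as the predicted probability for each validation observation. All function names (knn_prob, cv_select_k, bootstrap_accuracy, monte_carlo_cv), the 75/25 split, and the toy data are hypothetical.

```python
import random
import statistics

def knn_prob(train, x, k):
    """Predicted probability of class 1: share of the k nearest training points labelled 1."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return sum(label for _, label in nearest) / len(nearest)

def accuracy(pairs):
    """pairs holds (predicted probability, true label); classify at a 0.5 threshold."""
    return sum((p >= 0.5) == bool(y) for p, y in pairs) / len(pairs)

def cv_select_k(train, candidates, n_folds, rng):
    """K-fold cross-validation on the training set only; returns the best-scoring k."""
    shuffled = train[:]
    rng.shuffle(shuffled)
    folds = [shuffled[i::n_folds] for i in range(n_folds)]

    def cv_score(k):
        scores = []
        for i, held_out in enumerate(folds):
            fit = [p for j, f in enumerate(folds) if j != i for p in f]
            scores.append(accuracy([(knn_prob(fit, x, k), y) for x, y in held_out]))
        return statistics.mean(scores)

    return max(candidates, key=cv_score)

def bootstrap_accuracy(pairs, n_boot, rng):
    """Accuracy on n_boot resamples, drawn with replacement, of the validation predictions."""
    return [accuracy(rng.choices(pairs, k=len(pairs))) for _ in range(n_boot)]

def monte_carlo_cv(data, n_sims=5, train_frac=0.75, n_folds=10,
                   candidates=(1, 5, 15), n_boot=200, seed=0):
    rng = random.Random(seed)
    results = []
    for _ in range(n_sims):
        # Monte Carlo split: a fresh random training/validation partition each time.
        shuffled = data[:]
        rng.shuffle(shuffled)
        cut = int(train_frac * len(shuffled))
        train, valid = shuffled[:cut], shuffled[cut:]
        # Tune on the training set only, so the validation set stays unseen.
        k = cv_select_k(train, candidates, n_folds, rng)
        # "Refit" on the entire training set (kNN simply stores it), then predict.
        preds = [(knn_prob(train, x, k), y) for x, y in valid]
        boot = bootstrap_accuracy(preds, n_boot, rng)
        results.append({"k": k,
                        "accuracy": accuracy(preds),
                        "boot_mean": statistics.mean(boot),
                        "boot_sd": statistics.stdev(boot)})
    return results

# Hypothetical toy data: one feature x, label 1 when x plus noise is positive.
data_rng = random.Random(42)
data = []
for _ in range(120):
    x = data_rng.uniform(-3, 3)
    data.append((x, int(x + data_rng.gauss(0, 1) > 0)))

results = monte_carlo_cv(data)
for r in results:
    print(r)
```

Each entry in results corresponds to one Monte Carlo split. The spread of boot_mean across entries reflects variation between splits, while boot_sd reflects resampling uncertainty within a single validation set.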

References