Overview

Motivation

Estimate predictive performance on small datasets
Evaluate variable or feature importance across many subpopulations of the data
Quantify prediction uncertainty and variance
Develop classifiers that are evaluated on unseen data

Main advantages

Predictions are generated on many different training/validation data splits
The importance of each predictor (feature) for the dependent variable is estimated across many subpopulations of the data
No data leakage: predictions are made only on observations that were excluded from training

Procedural overview

Monte Carlo simulation repeatedly splits the data at random into training and validation sets
K-fold cross-validation (10 folds by default) on the training set is used to select “good” model parameters
The model with the “good” parameters is refit on the entire training set
The refitted model predicts the validation set, which it has not seen during fitting
Performance metrics are computed on bootstrap resamples (drawn with replacement) of the predicted probabilities for the validation observations
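
The steps above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not this package's implementation. The function names (make_data, knn_prob, select_k, mccv), the synthetic data, the k-nearest-neighbours classifier, the candidate values of k, and the use of accuracy as the metric are all hypothetical choices made for the example.

```python
import random
import statistics


def make_data(n_per_class=100, seed=1):
    """Synthetic 1-D, two-class data as (x, label) pairs (illustrative only)."""
    rng = random.Random(seed)
    zeros = [(rng.gauss(0.0, 1.0), 0) for _ in range(n_per_class)]
    ones = [(rng.gauss(3.0, 1.0), 1) for _ in range(n_per_class)]
    return zeros + ones


def knn_prob(fit_rows, x, k):
    """Probability of class 1: share of the k nearest neighbours labelled 1."""
    nearest = sorted(fit_rows, key=lambda row: abs(row[0] - x))[:k]
    return sum(label for _, label in nearest) / k


def accuracy(probs, labels):
    """Fraction of observations whose thresholded probability matches the label."""
    return sum((p >= 0.5) == (y == 1) for p, y in zip(probs, labels)) / len(labels)


def select_k(train, candidates, n_folds, rng):
    """K-fold cross-validation on the training set to pick a "good" k."""
    rows = train[:]
    rng.shuffle(rows)
    folds = [rows[i::n_folds] for i in range(n_folds)]

    def cv_score(k):
        scores = []
        for i, held_out in enumerate(folds):
            # Fit on every fold except the one currently held out.
            fit_rows = [r for j, fold in enumerate(folds) if j != i for r in fold]
            probs = [knn_prob(fit_rows, x, k) for x, _ in held_out]
            scores.append(accuracy(probs, [y for _, y in held_out]))
        return statistics.mean(scores)

    return max(candidates, key=cv_score)


def mccv(data, n_splits=20, valid_frac=0.2, n_folds=10, n_boot=50,
         candidates=(1, 5, 15), seed=0):
    """Monte Carlo cross-validation: one result dict per random split."""
    rng = random.Random(seed)
    results = []
    for _ in range(n_splits):
        # 1. Random split into training and validation sets.
        rows = data[:]
        rng.shuffle(rows)
        n_valid = int(len(rows) * valid_frac)
        valid, train = rows[:n_valid], rows[n_valid:]
        # 2. K-fold CV on the training set only, so validation rows never leak.
        k = select_k(train, candidates, n_folds, rng)
        # 3./4. "Refit" on the whole training set (for k-NN this just means
        # using all training rows) and predict the unseen validation set.
        probs = [knn_prob(train, x, k) for x, _ in valid]
        labels = [y for _, y in valid]
        # 5. Bootstrap (with replacement) the validation predictions.
        boot = []
        for _ in range(n_boot):
            idx = [rng.randrange(n_valid) for _ in range(n_valid)]
            boot.append(accuracy([probs[i] for i in idx], [labels[i] for i in idx]))
        results.append({
            "k": k,
            "accuracy": accuracy(probs, labels),
            "boot_mean": statistics.mean(boot),
            "boot_sd": statistics.stdev(boot),
        })
    return results


results = mccv(make_data())
print("mean validation accuracy:", statistics.mean([r["accuracy"] for r in results]))
```

Because each split yields its own accuracy and bootstrap spread, the list returned by mccv gives a distribution of performance rather than a single number, which is what makes the uncertainty estimates described under Motivation possible.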

References