Evidentiary and interpretable prediction 
Binary and Multi-classification algorithm for adverse outcome detection, survival classification, and endpoint prediction (see references for details)
 
Objectives of this project 
Build the mccv  python package: easily implement and perform MCCV for learning and prediction tasks. 
Showcase accessibly to build, validate, and interpret MCCV classifiers. 
Demonstrate use in both python and R for diverse community implementations. 
 
 
Installation 
mkdir ~/my_directory #choose where to clone the mccv repository
cd ~/my_directory
git clone https://github.com/ngiangre/mccv.git
cd mccv/
python3 -m pip install . 
Usage 
import  pandas as  pd=  pd.read_csv('data/data.csv' ,index_col= 0 ) # Feature column name is 'biomarker' and response column  name is 'status' 
     status  biomarker
obs                   
1         0   1.665731
2         0  -0.875837
3         0  -1.391374
4         0  -0.297352
5         1   0.189857 
 
import  mccv=  mccv.mccv(num_bootstraps= 200 )'biomarker' ]] )'status' ]] )#Output for  n in  mccv_obj.mccv_data:print (n)
Model Learning
   bootstrap                model  ...  train_roc_auc  validation_roc_auc
0          0  Logistic Regression  ...       0.529453            0.611111
1          1  Logistic Regression  ...       0.515235            0.732143
2          2  Logistic Regression  ...       0.543056            0.400000
3          3  Logistic Regression  ...       0.519728            0.727273
4          4  Logistic Regression  ...       0.554054            0.574074
[5 rows x 5 columns]
Feature Importance
   bootstrap    feature  importance                model
0          0  biomarker    1.009705  Logistic Regression
1          0  Intercept   -0.598575  Logistic Regression
0          1  biomarker    0.509433  Logistic Regression
1          1  Intercept   -0.226550  Logistic Regression
0          2  biomarker    1.598627  Logistic Regression
Patient Predictions
     bootstrap                model  y_pred   y_proba  y_true
obs                                                          
27           0  Logistic Regression       0  0.384723       1
87           0  Logistic Regression       1  0.601359       0
3            0  Logistic Regression       0  0.401320       0
56           0  Logistic Regression       1  0.512481       1
76           0  Logistic Regression       0  0.393009       0
Performance
                 model   metric  performance_bootstrap     value
0  Logistic Regression  roc_auc                      0  0.467487
1  Logistic Regression  roc_auc                      1  0.467776
2  Logistic Regression  roc_auc                      2  0.480176
3  Logistic Regression  roc_auc                      3  0.480679
4  Logistic Regression  roc_auc                      4  0.475859 
for  n in  mccv_obj.mccv_permuted_data:print (n)
Model Learning
   bootstrap                model  ...  train_roc_auc  validation_roc_auc
0          0  Logistic Regression  ...       0.506233            0.642857
1          1  Logistic Regression  ...       0.492030            0.703704
2          2  Logistic Regression  ...       0.510135            0.537037
3          3  Logistic Regression  ...       0.506944            0.703704
4          4  Logistic Regression  ...       0.589547            0.340909
[5 rows x 5 columns]
Feature Importance
   bootstrap    feature  importance                model
0          0  biomarker   -0.196116  Logistic Regression
1          0  Intercept    0.079220  Logistic Regression
0          1  biomarker   -0.628093  Logistic Regression
1          1  Intercept    0.236617  Logistic Regression
0          2  biomarker    0.166196  Logistic Regression
Patient Predictions
     bootstrap                model  y_pred   y_proba  y_true
obs                                                          
27           0  Logistic Regression       1  0.513536       1
87           0  Logistic Regression       0  0.470809       0
3            0  Logistic Regression       1  0.510160       0
56           0  Logistic Regression       0  0.488317       1
76           0  Logistic Regression       1  0.511844       1
Performance
                 model   metric  performance_bootstrap     value
0  Logistic Regression  roc_auc                      0  0.440616
1  Logistic Regression  roc_auc                      1  0.442506
2  Logistic Regression  roc_auc                      2  0.449941
3  Logistic Regression  roc_auc                      3  0.440162
4  Logistic Regression  roc_auc                      4  0.449896 
 
 
if (! requireNamespace ("readr" )){install.packages ("readr" )}
Loading required namespace: readr 
library (readr)<-  read_csv ("data/data.csv" ,col_types =  c ("iid" )) #set obs as integer, status as integer, and biomarker as double head (data)
# A tibble: 6 × 3
    obs status biomarker
  <int>  <int>     <dbl>
1     1      0     1.67 
2     2      0    -0.876
3     3      0    -1.39 
4     4      0    -0.297
5     5      1     0.190
6     6      0     2.20  
 
if (! requireNamespace ("reticulate" )){install.packages ("reticulate" )}=  reticulate:: import ('mccv' )=  mccv$ mccv (num_bootstraps =  as.integer (200 ))=  reticulate:: r_to_py (data[,c ('obs' ,'biomarker' )])=  X$ set_index (reticulate:: r_to_py ('obs' ))=  reticulate:: r_to_py (data[,c ('obs' ,'status' )])=  y$ set_index (reticulate:: r_to_py ('obs' ))$ set_X (X)$ set_Y (y)$ run_mccv ()$ run_permuted_mccv ()#Output lapply (mccv_obj$ mccv_data,head)
Warning in py_to_r.pandas.core.frame.DataFrame(object): index contains
duplicated values: row names not set
Warning in py_to_r.pandas.core.frame.DataFrame(object): index contains
duplicated values: row names not set 
$`Model Learning`
  bootstrap               model test_roc_auc train_roc_auc validation_roc_auc
1         0 Logistic Regression       1.0000     0.5294525          0.6111111
2         1 Logistic Regression       0.8000     0.5152355          0.7321429
3         2 Logistic Regression       1.0000     0.5430556          0.4000000
4         3 Logistic Regression       0.8750     0.5197279          0.7272727
5         4 Logistic Regression       0.8125     0.5540541          0.5740741
6         5 Logistic Regression       1.0000     0.5499325          0.5357143
$`Feature Importance`
  bootstrap   feature importance               model
1         0 biomarker  1.0097049 Logistic Regression
2         0 Intercept -0.5985751 Logistic Regression
3         1 biomarker  0.5094328 Logistic Regression
4         1 Intercept -0.2265503 Logistic Regression
5         2 biomarker  1.5986271 Logistic Regression
6         2 Intercept -0.9420031 Logistic Regression
$`Patient Predictions`
  bootstrap               model y_pred   y_proba y_true
1         0 Logistic Regression      0 0.3847230      1
2         0 Logistic Regression      1 0.6013587      0
3         0 Logistic Regression      0 0.4013202      0
4         0 Logistic Regression      1 0.5124811      1
5         0 Logistic Regression      0 0.3930090      0
6         0 Logistic Regression      0 0.4660667      1
$Performance
                model  metric performance_bootstrap     value
1 Logistic Regression roc_auc                     0 0.4674874
2 Logistic Regression roc_auc                     1 0.4677764
3 Logistic Regression roc_auc                     2 0.4801763
4 Logistic Regression roc_auc                     3 0.4806793
5 Logistic Regression roc_auc                     4 0.4758592
6 Logistic Regression roc_auc                     5 0.4687351 
lapply (mccv_obj$ mccv_permuted_data,head)
Warning in py_to_r.pandas.core.frame.DataFrame(object): index contains
duplicated values: row names not set
Warning in py_to_r.pandas.core.frame.DataFrame(object): index contains
duplicated values: row names not set 
$`Model Learning`
  bootstrap               model test_roc_auc train_roc_auc validation_roc_auc
1         0 Logistic Regression       0.5500     0.5062327          0.6428571
2         1 Logistic Regression       0.8000     0.4920305          0.7037037
3         2 Logistic Regression       0.5625     0.5101351          0.5370370
4         3 Logistic Regression       0.8000     0.5069444          0.7037037
5         4 Logistic Regression       0.9000     0.5895470          0.3409091
6         5 Logistic Regression       0.7000     0.5360111          0.5178571
$`Feature Importance`
  bootstrap   feature  importance               model
1         0 biomarker -0.19611610 Logistic Regression
2         0 Intercept  0.07921951 Logistic Regression
3         1 biomarker -0.62809256 Logistic Regression
4         1 Intercept  0.23661698 Logistic Regression
5         2 biomarker  0.16619555 Logistic Regression
6         2 Intercept -0.01455491 Logistic Regression
$`Patient Predictions`
  bootstrap               model y_pred   y_proba y_true
1         0 Logistic Regression      1 0.5135363      1
2         0 Logistic Regression      0 0.4708091      0
3         0 Logistic Regression      1 0.5101595      0
4         0 Logistic Regression      0 0.4883168      1
5         0 Logistic Regression      1 0.5118443      1
6         0 Logistic Regression      0 0.4973405      1
$Performance
                model  metric performance_bootstrap     value
1 Logistic Regression roc_auc                     0 0.4406164
2 Logistic Regression roc_auc                     1 0.4425061
3 Logistic Regression roc_auc                     2 0.4499406
4 Logistic Regression roc_auc                     3 0.4401616
5 Logistic Regression roc_auc                     4 0.4498963
6 Logistic Regression roc_auc                     5 0.4436607 
 
 
 
 
 
Contribute 
Please do! Reach out to Nick directly (nick.giangreco@gmail.com), make an issue, or make a pull request.
 
License 
This software is released under the MIT license, which can be found in LICENSE in the root directory of this repository.