ValidMind for model validation 3 — Developing a potential challenger model

Learn how to use ValidMind for your end-to-end model validation process with our series of four introductory notebooks. In this third notebook, develop a potential challenger model and then pass your model and its predictions to ValidMind.

A challenger model is an alternate model that attempts to outperform the champion model, ensuring that the best performing fit-for-purpose model is always considered for deployment. Challenger models also help avoid over-reliance on a single model, and allow testing of new features, algorithms, or data sources without disrupting the production lifecycle.

Learn by doing

Our course tailor-made for validators new to ValidMind combines this series of notebooks with a more in-depth introduction to the ValidMind Platform — Validator Fundamentals

Prerequisites

In order to develop potential challenger models with this notebook, you'll need to first have:

Need help with the above steps?

Refer to the first two notebooks in this series:

Setting up

This section should be quite familiar to you — as we performed the same actions in the previous notebook, 2 — Start the model validation process.

Initialize the ValidMind Library

As usual, let's first connect the ValidMind Library to the model we previously registered in the ValidMind Platform:

  1. In a browser, log in to ValidMind.

  2. In the left sidebar, navigate to Inventory and select the model you registered for this "ValidMind for model validation" series of notebooks.

  3. Go to Getting Started and click Copy snippet to clipboard.

Next, load your model identifier credentials from an .env file or replace the placeholder with your own code snippet:

# Make sure the ValidMind Library is installed

%pip install -q validmind

# Load your model identifier credentials from an `.env` file

%load_ext dotenv
%dotenv .env

# Or replace with your code snippet

import validmind as vm

vm.init(
    # api_host="...",
    # api_key="...",
    # api_secret="...",
    # model="...",
)
Note: you may need to restart the kernel to use updated packages.
2026-01-28 18:05:38,908 - INFO(validmind.api_client): 🎉 Connected to ValidMind!
📊 Model: [ValidMind Academy] Model validation (ID: cmalguc9y02ok199q2db381ib)
📁 Document Type: validation_report

Import the sample dataset

Next, we'll load the sample Bank Customer Churn Prediction dataset that was used to develop the champion model, which we will then independently preprocess:

# Load the sample dataset
from validmind.datasets.classification import customer_churn as demo_dataset

print(
    f"Loaded demo dataset with: \n\n\t• Target column: '{demo_dataset.target_column}' \n\t• Class labels: {demo_dataset.class_labels}"
)

raw_df = demo_dataset.load_data()
Loaded demo dataset with: 

    • Target column: 'Exited' 
    • Class labels: {'0': 'Did not exit', '1': 'Exited'}

Preprocess the dataset

We’ll apply a simple rebalancing technique to the dataset before continuing:

import pandas as pd

raw_copy_df = raw_df.sample(frac=1)  # Create a shuffled copy of the raw dataset

# Create a balanced dataset with the same number of exited and not exited customers
exited_df = raw_copy_df.loc[raw_copy_df["Exited"] == 1]
not_exited_df = raw_copy_df.loc[raw_copy_df["Exited"] == 0].sample(n=exited_df.shape[0])

balanced_raw_df = pd.concat([exited_df, not_exited_df])
balanced_raw_df = balanced_raw_df.sample(frac=1, random_state=42)
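
As a quick, optional sanity check (not part of the original notebook flow), we can confirm that the rebalanced dataset now contains an equal number of exited and retained customers:

# Optional sanity check: both classes should now have the same number of rows
print(balanced_raw_df["Exited"].value_counts())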

Let’s also quickly remove highly correlated features from the dataset using the output from a ValidMind test.

As you know, before we can run tests we'll need to initialize a ValidMind dataset object with the init_dataset function:

# Register new data and now 'balanced_raw_dataset' is the new dataset object of interest
vm_balanced_raw_dataset = vm.init_dataset(
    dataset=balanced_raw_df,
    input_id="balanced_raw_dataset",
    target_column="Exited",
)

With our balanced dataset initialized, we can then run our test and utilize the output to help us identify the features we want to remove:

# Run HighPearsonCorrelation test with our balanced dataset as input and return a result object
corr_result = vm.tests.run_test(
    test_id="validmind.data_validation.HighPearsonCorrelation",
    params={"max_threshold": 0.3},
    inputs={"dataset": vm_balanced_raw_dataset},
)

❌ High Pearson Correlation

The High Pearson Correlation test identifies pairs of features in the dataset that exhibit strong linear relationships, with the aim of detecting potential feature redundancy or multicollinearity. The results table lists the top ten feature pairs ranked by the absolute value of their Pearson correlation coefficients, along with their corresponding coefficients and Pass/Fail status based on a threshold of 0.3. Only one feature pair exceeds the threshold, while the remaining pairs display lower correlation values and are marked as Pass.

Key insights:

  • Single feature pair exceeds correlation threshold: The pair (Age, Exited) shows a Pearson correlation coefficient of 0.3499, surpassing the 0.3 threshold and resulting in a Fail status.
  • All other feature pairs below threshold: The remaining nine feature pairs have absolute correlation coefficients ranging from 0.1947 to 0.0419, all below the 0.3 threshold and marked as Pass.
  • Predominantly weak linear relationships: Most feature pairs exhibit weak linear associations, with coefficients clustered near zero.

The test results indicate that the dataset contains predominantly weak linear relationships among features, with only one pair—(Age, Exited)—exceeding the specified correlation threshold. This suggests limited risk of feature redundancy or multicollinearity based on linear associations, with the exception of the identified pair. The overall correlation structure supports the interpretability and stability of the feature set.

Parameters:

{
  "max_threshold": 0.3
}
            

Tables

Columns Coefficient Pass/Fail
(Age, Exited) 0.3499 Fail
(IsActiveMember, Exited) -0.1947 Pass
(Balance, NumOfProducts) -0.1711 Pass
(Balance, Exited) 0.1483 Pass
(Age, Balance) 0.0519 Pass
(Age, NumOfProducts) -0.0475 Pass
(NumOfProducts, Exited) -0.0462 Pass
(Balance, HasCrCard) -0.0452 Pass
(Tenure, IsActiveMember) -0.0447 Pass
(NumOfProducts, IsActiveMember) 0.0419 Pass
# From result object, extract table from `corr_result.tables`
features_df = corr_result.tables[0].data
features_df
Columns Coefficient Pass/Fail
0 (Age, Exited) 0.3499 Fail
1 (IsActiveMember, Exited) -0.1947 Pass
2 (Balance, NumOfProducts) -0.1711 Pass
3 (Balance, Exited) 0.1483 Pass
4 (Age, Balance) 0.0519 Pass
5 (Age, NumOfProducts) -0.0475 Pass
6 (NumOfProducts, Exited) -0.0462 Pass
7 (Balance, HasCrCard) -0.0452 Pass
8 (Tenure, IsActiveMember) -0.0447 Pass
9 (NumOfProducts, IsActiveMember) 0.0419 Pass
# Extract list of features that failed the test
high_correlation_features = features_df[features_df["Pass/Fail"] == "Fail"]["Columns"].tolist()
high_correlation_features
['(Age, Exited)']
# Extract feature names from the list of strings
high_correlation_features = [feature.split(",")[0].strip("()") for feature in high_correlation_features]
high_correlation_features
['Age']

We can then remove the highly correlated features, re-initialize the dataset with a different input_id, and re-run the test to confirm the fix:

# Remove the highly correlated features from the dataset
balanced_raw_no_age_df = balanced_raw_df.drop(columns=high_correlation_features)

# Re-initialize the dataset object
vm_raw_dataset_preprocessed = vm.init_dataset(
    dataset=balanced_raw_no_age_df,
    input_id="raw_dataset_preprocessed",
    target_column="Exited",
)
# Re-run the test with the reduced feature set
corr_result = vm.tests.run_test(
    test_id="validmind.data_validation.HighPearsonCorrelation",
    params={"max_threshold": 0.3},
    inputs={"dataset": vm_raw_dataset_preprocessed},
)

✅ High Pearson Correlation

The High Pearson Correlation test evaluates the linear relationships between feature pairs to identify potential redundancy or multicollinearity within the dataset. The results table presents the top ten absolute Pearson correlation coefficients, each paired with a Pass/Fail status based on a threshold of 0.3. All reported feature pairs have coefficients below the threshold, and each is marked as Pass, indicating no detected high linear correlations among the top pairs.

Key insights:

  • No feature pairs exceed correlation threshold: All reported Pearson correlation coefficients are below the 0.3 threshold, with the highest absolute value observed at 0.1947.
  • Weakest observed correlation is minimal: The lowest absolute coefficient among the top ten is 0.0303, indicating very weak linear association.
  • Consistent Pass status across all pairs: Every feature pair in the top ten is marked as Pass, reflecting the absence of strong linear relationships in the evaluated set.

The results indicate that the dataset does not exhibit high linear correlations among the top feature pairs, suggesting a low risk of feature redundancy or multicollinearity based on the tested threshold. The observed correlation structure supports the interpretability and stability of subsequent modeling efforts.

Parameters:

{
  "max_threshold": 0.3
}
            

Tables

Columns Coefficient Pass/Fail
(IsActiveMember, Exited) -0.1947 Pass
(Balance, NumOfProducts) -0.1711 Pass
(Balance, Exited) 0.1483 Pass
(NumOfProducts, Exited) -0.0462 Pass
(Balance, HasCrCard) -0.0452 Pass
(Tenure, IsActiveMember) -0.0447 Pass
(NumOfProducts, IsActiveMember) 0.0419 Pass
(Tenure, EstimatedSalary) 0.0338 Pass
(HasCrCard, IsActiveMember) -0.0338 Pass
(CreditScore, Balance) 0.0303 Pass

Split the preprocessed dataset

With our raw dataset rebalanced and the highly correlated features removed, let's now split the dataset into train and test sets in preparation for model evaluation testing:

# Encode categorical features in the dataset
balanced_raw_no_age_df = pd.get_dummies(
    balanced_raw_no_age_df, columns=["Geography", "Gender"], drop_first=True
)
balanced_raw_no_age_df.head()
CreditScore Tenure Balance NumOfProducts HasCrCard IsActiveMember EstimatedSalary Exited Geography_Germany Geography_Spain Gender_Male
5143 582 4 0.00 2 0 0 156153.27 0 False False True
5633 580 2 130334.84 2 1 1 51672.08 0 True False True
5567 706 5 0.00 2 1 1 81718.37 0 False False True
5948 778 1 151958.19 3 1 1 131238.37 1 True False False
4332 558 1 153697.53 2 0 0 89891.40 1 False False False
from sklearn.model_selection import train_test_split

# Split the dataset into train and test
train_df, test_df = train_test_split(balanced_raw_no_age_df, test_size=0.20)

X_train = train_df.drop("Exited", axis=1)
y_train = train_df["Exited"]
X_test = test_df.drop("Exited", axis=1)
y_test = test_df["Exited"]
# Initialize the split datasets
vm_train_ds = vm.init_dataset(
    input_id="train_dataset_final",
    dataset=train_df,
    target_column="Exited",
)

vm_test_ds = vm.init_dataset(
    input_id="test_dataset_final",
    dataset=test_df,
    target_column="Exited",
)

Import the champion model

With our raw dataset assessed and preprocessed, let's go ahead and import the champion model submitted by the model development team as a .pkl file: lr_model_champion.pkl

# Import the champion model
import pickle as pkl

with open("lr_model_champion.pkl", "rb") as f:
    log_reg = pkl.load(f)
/opt/hostedtoolcache/Python/3.11.14/x64/lib/python3.11/site-packages/sklearn/base.py:463: InconsistentVersionWarning:

Trying to unpickle estimator LogisticRegression from version 1.3.2 when using version 1.8.0. This might lead to breaking code or invalid results. Use at your own risk. For more info please refer to:
https://scikit-learn.org/stable/model_persistence.html#security-maintainability-limitations
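
Before moving on, we can optionally confirm what was loaded (a quick sanity check; the warning above simply flags that the pickle was created with an older scikit-learn version than the one installed):

# Confirm the loaded estimator type and inspect its hyperparameters
print(type(log_reg).__name__)
print(log_reg.get_params())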

Training a potential challenger model

We're curious how an alternate model compares to our champion model, so let's train a challenger model as a basis for our testing.

Our champion logistic regression model is a simpler, parametric model that assumes a linear relationship between the independent variables and the log-odds of the outcome. While logistic regression may not capture complex patterns as effectively, it offers a high degree of interpretability and is easier to explain to stakeholders. However, model risk is not assessed on the basis of a single factor in isolation, but rather in consideration of the trade-offs between predictive performance, ease of interpretability, and overall alignment with business objectives.

Random forest classification model

A random forest classification model is an ensemble machine learning algorithm that uses multiple decision trees to classify data. In ensemble learning, multiple models are combined to improve prediction accuracy and robustness.

Random forest classification models generally have higher accuracy because they capture complex, non-linear relationships, but as a result they lack transparency in their predictions.

# Import the Random Forest Classification model
from sklearn.ensemble import RandomForestClassifier

# Create the model instance with 50 decision trees
rf_model = RandomForestClassifier(
    n_estimators=50,
    random_state=42,
)

# Train the model
rf_model.fit(X_train, y_train)
RandomForestClassifier(n_estimators=50, random_state=42)

Initializing the model objects

Initialize the model objects

In addition to the initialized datasets, you'll also need to initialize a ValidMind model object (vm_model) for each of our two models, so that they can be passed to other functions for analysis and tests on the data.

You simply initialize this model object with vm.init_model():

# Initialize the champion logistic regression model
vm_log_model = vm.init_model(
    log_reg,
    input_id="log_model_champion",
)

# Initialize the challenger random forest classification model
vm_rf_model = vm.init_model(
    rf_model,
    input_id="rf_model",
)

Assign predictions

With our models registered, we'll move on to assigning both the predictive probabilities coming directly from each model, and the binary class predictions obtained after applying a cutoff threshold to those probabilities.

  • The assign_predictions() method from the Dataset object can link existing predictions to any number of models.
  • This method links the model's class prediction values and probabilities to our vm_train_ds and vm_test_ds datasets.

If no prediction values are passed, the method will compute predictions automatically:

# Champion — Logistic regression model
vm_train_ds.assign_predictions(model=vm_log_model)
vm_test_ds.assign_predictions(model=vm_log_model)

# Challenger — Random forest classification model
vm_train_ds.assign_predictions(model=vm_rf_model)
vm_test_ds.assign_predictions(model=vm_rf_model)
2026-01-28 18:05:51,731 - INFO(validmind.vm_models.dataset.utils): Running predict_proba()... This may take a while
2026-01-28 18:05:51,732 - INFO(validmind.vm_models.dataset.utils): Done running predict_proba()
2026-01-28 18:05:51,733 - INFO(validmind.vm_models.dataset.utils): Running predict()... This may take a while
2026-01-28 18:05:51,735 - INFO(validmind.vm_models.dataset.utils): Done running predict()
2026-01-28 18:05:51,737 - INFO(validmind.vm_models.dataset.utils): Running predict_proba()... This may take a while
2026-01-28 18:05:51,738 - INFO(validmind.vm_models.dataset.utils): Done running predict_proba()
2026-01-28 18:05:51,739 - INFO(validmind.vm_models.dataset.utils): Running predict()... This may take a while
2026-01-28 18:05:51,741 - INFO(validmind.vm_models.dataset.utils): Done running predict()
2026-01-28 18:05:51,744 - INFO(validmind.vm_models.dataset.utils): Running predict_proba()... This may take a while
2026-01-28 18:05:51,768 - INFO(validmind.vm_models.dataset.utils): Done running predict_proba()
2026-01-28 18:05:51,769 - INFO(validmind.vm_models.dataset.utils): Running predict()... This may take a while
2026-01-28 18:05:51,792 - INFO(validmind.vm_models.dataset.utils): Done running predict()
2026-01-28 18:05:51,796 - INFO(validmind.vm_models.dataset.utils): Running predict_proba()... This may take a while
2026-01-28 18:05:51,808 - INFO(validmind.vm_models.dataset.utils): Done running predict_proba()
2026-01-28 18:05:51,810 - INFO(validmind.vm_models.dataset.utils): Running predict()... This may take a while
2026-01-28 18:05:51,822 - INFO(validmind.vm_models.dataset.utils): Done running predict()
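
For reference, the automatic computation above is roughly equivalent to calling the underlying scikit-learn methods yourself: probabilities from predict_proba() and binary class predictions from a cutoff (0.5 by default). A minimal illustrative sketch, shown here only to make the cutoff explicit and not part of the ValidMind flow:

# Probabilities for the positive class from the champion model (illustration only)
champion_probs = log_reg.predict_proba(X_test)[:, 1]

# Apply a 0.5 cutoff to obtain binary class predictions
champion_preds = (champion_probs >= 0.5).astype(int)

print(champion_preds[:10])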

Running model evaluation tests

With our setup complete, let's run the rest of our validation tests. Since we have already verified the data quality of the dataset used to train our champion model, we will now focus on comprehensive performance evaluations of both the champion and challenger models.

Run model performance tests

Let's run some performance tests, beginning with independent testing of our champion logistic regression model, then moving on to our potential challenger model.

Use vm.tests.list_tests() to identify all the model performance tests for classification:


vm.tests.list_tests(tags=["model_performance"], task="classification")
ID Name Description Has Figure Has Table Required Inputs Params Tags Tasks
validmind.model_validation.sklearn.CalibrationCurve Calibration Curve Evaluates the calibration of probability estimates by comparing predicted probabilities against observed... True False ['model', 'dataset'] {'n_bins': {'type': 'int', 'default': 10}} ['sklearn', 'model_performance', 'classification'] ['classification']
validmind.model_validation.sklearn.ClassifierPerformance Classifier Performance Evaluates performance of binary or multiclass classification models using precision, recall, F1-Score, accuracy,... False True ['dataset', 'model'] {'average': {'type': 'str', 'default': 'macro'}} ['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance'] ['classification', 'text_classification']
validmind.model_validation.sklearn.ConfusionMatrix Confusion Matrix Evaluates and visually represents the classification ML model's predictive performance using a Confusion Matrix... True False ['dataset', 'model'] {'threshold': {'type': 'float', 'default': 0.5}} ['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance', 'visualization'] ['classification', 'text_classification']
validmind.model_validation.sklearn.HyperParametersTuning Hyper Parameters Tuning Performs exhaustive grid search over specified parameter ranges to find optimal model configurations... False True ['model', 'dataset'] {'param_grid': {'type': 'dict', 'default': None}, 'scoring': {'type': 'Union', 'default': None}, 'thresholds': {'type': 'Union', 'default': None}, 'fit_params': {'type': 'dict', 'default': None}} ['sklearn', 'model_performance'] ['clustering', 'classification']
validmind.model_validation.sklearn.MinimumAccuracy Minimum Accuracy Checks if the model's prediction accuracy meets or surpasses a specified threshold.... False True ['dataset', 'model'] {'min_threshold': {'type': 'float', 'default': 0.7}} ['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance'] ['classification', 'text_classification']
validmind.model_validation.sklearn.MinimumF1Score Minimum F1 Score Assesses if the model's F1 score on the validation set meets a predefined minimum threshold, ensuring balanced... False True ['dataset', 'model'] {'min_threshold': {'type': 'float', 'default': 0.5}} ['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance'] ['classification', 'text_classification']
validmind.model_validation.sklearn.MinimumROCAUCScore Minimum ROCAUC Score Validates model by checking if the ROC AUC score meets or surpasses a specified threshold.... False True ['dataset', 'model'] {'min_threshold': {'type': 'float', 'default': 0.5}} ['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance'] ['classification', 'text_classification']
validmind.model_validation.sklearn.ModelsPerformanceComparison Models Performance Comparison Evaluates and compares the performance of multiple Machine Learning models using various metrics like accuracy,... False True ['dataset', 'models'] {} ['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance', 'model_comparison'] ['classification', 'text_classification']
validmind.model_validation.sklearn.PopulationStabilityIndex Population Stability Index Assesses the Population Stability Index (PSI) to quantify the stability of an ML model's predictions across... True True ['datasets', 'model'] {'num_bins': {'type': 'int', 'default': 10}, 'mode': {'type': 'str', 'default': 'fixed'}} ['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance'] ['classification', 'text_classification']
validmind.model_validation.sklearn.PrecisionRecallCurve Precision Recall Curve Evaluates the precision-recall trade-off for binary classification models and visualizes the Precision-Recall curve.... True False ['model', 'dataset'] {} ['sklearn', 'binary_classification', 'model_performance', 'visualization'] ['classification', 'text_classification']
validmind.model_validation.sklearn.ROCCurve ROC Curve Evaluates binary classification model performance by generating and plotting the Receiver Operating Characteristic... True False ['model', 'dataset'] {} ['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance', 'visualization'] ['classification', 'text_classification']
validmind.model_validation.sklearn.RegressionErrors Regression Errors Assesses the performance and error distribution of a regression model using various error metrics.... False True ['model', 'dataset'] {} ['sklearn', 'model_performance'] ['regression', 'classification']
validmind.model_validation.sklearn.TrainingTestDegradation Training Test Degradation Tests if model performance degradation between training and test datasets exceeds a predefined threshold.... False True ['datasets', 'model'] {'max_threshold': {'type': 'float', 'default': 0.1}} ['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance', 'visualization'] ['classification', 'text_classification']
validmind.model_validation.statsmodels.GINITable GINI Table Evaluates classification model performance using AUC, GINI, and KS metrics for training and test datasets.... False True ['dataset', 'model'] {} ['model_performance'] ['classification']
validmind.ongoing_monitoring.CalibrationCurveDrift Calibration Curve Drift Evaluates changes in probability calibration between reference and monitoring datasets.... True True ['datasets', 'model'] {'n_bins': {'type': 'int', 'default': 10}, 'drift_pct_threshold': {'type': 'float', 'default': 20}} ['sklearn', 'binary_classification', 'model_performance', 'visualization'] ['classification', 'text_classification']
validmind.ongoing_monitoring.ClassDiscriminationDrift Class Discrimination Drift Compares classification discrimination metrics between reference and monitoring datasets.... False True ['datasets', 'model'] {'drift_pct_threshold': {'type': '_empty', 'default': 20}} ['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance'] ['classification', 'text_classification']
validmind.ongoing_monitoring.ClassificationAccuracyDrift Classification Accuracy Drift Compares classification accuracy metrics between reference and monitoring datasets.... False True ['datasets', 'model'] {'drift_pct_threshold': {'type': '_empty', 'default': 20}} ['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance'] ['classification', 'text_classification']
validmind.ongoing_monitoring.ConfusionMatrixDrift Confusion Matrix Drift Compares confusion matrix metrics between reference and monitoring datasets.... False True ['datasets', 'model'] {'drift_pct_threshold': {'type': '_empty', 'default': 20}} ['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance'] ['classification', 'text_classification']
validmind.ongoing_monitoring.ROCCurveDrift ROC Curve Drift Compares ROC curves between reference and monitoring datasets.... True False ['datasets', 'model'] {} ['sklearn', 'binary_classification', 'model_performance', 'visualization'] ['classification', 'text_classification']

We'll isolate the specific tests we want to run in a list called mpt.

As we learned in the previous notebook 2 — Start the model validation process, you can use a custom result_id to tag the individual result with a unique identifier by appending this result_id to the test_id with a : separator. We'll append an identifier for our champion model here:

mpt = [
    "validmind.model_validation.sklearn.ClassifierPerformance:logreg_champion",
    "validmind.model_validation.sklearn.ConfusionMatrix:logreg_champion",
    "validmind.model_validation.sklearn.MinimumAccuracy:logreg_champion",
    "validmind.model_validation.sklearn.MinimumF1Score:logreg_champion",
    "validmind.model_validation.sklearn.ROCCurve:logreg_champion"
]

Evaluate performance of the champion model

Now, let's run and log our batch of model performance tests using our testing dataset (vm_test_ds) for our champion model:

  • The test set serves as a proxy for real-world data, providing an unbiased estimate of model performance since it was not used during training or tuning.
  • The test set also acts as protection against selection bias and model tweaking, giving a final, more unbiased checkpoint.
for test in mpt:
    vm.tests.run_test(
        test,
        inputs={
            "dataset": vm_test_ds,
            "model": vm_log_model,
        },
    ).log()

Classifier Performance Logreg Champion

The Classifier Performance test evaluates the predictive effectiveness of the classification model by reporting precision, recall, F1-Score, accuracy, and ROC AUC metrics. The results are presented for each class, as well as macro and weighted averages, providing a comprehensive view of model performance across all classes. The accuracy and ROC AUC scores are also reported, summarizing the model's overall classification ability and its discrimination capacity between classes.

Key insights:

  • Balanced class-wise performance: Precision, recall, and F1-Score are similar for both classes (Class 0: F1 = 0.6473; Class 1: F1 = 0.6385), indicating consistent model behavior across classes.
  • Moderate overall accuracy: The model achieves an accuracy of 0.643, reflecting moderate correct classification rates on the test set.
  • Macro and weighted averages align: Macro and weighted averages for precision, recall, and F1-Score are all 0.6429, suggesting class balance in the dataset and uniform model performance.
  • ROC AUC indicates moderate discrimination: The ROC AUC score of 0.6726 demonstrates moderate ability to distinguish between classes.

The results indicate that the model exhibits consistent and balanced performance across both classes, with moderate accuracy and discrimination capability as reflected by the ROC AUC. The alignment of macro and weighted averages further supports the absence of significant class imbalance or performance disparity. Overall, the model demonstrates stable but moderate classification effectiveness on the evaluated dataset.

Tables

Precision, Recall, and F1

Class Precision Recall F1
0 0.6444 0.6503 0.6473
1 0.6415 0.6355 0.6385
Weighted Average 0.6430 0.6430 0.6429
Macro Average 0.6429 0.6429 0.6429

Accuracy and ROC AUC

Metric Value
Accuracy 0.6430
ROC AUC 0.6726
2026-01-28 18:06:04,551 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.model_validation.sklearn.ClassifierPerformance:logreg_champion does not exist in model's document

Confusion Matrix Logreg Champion

The Confusion Matrix test evaluates the classification performance of the logistic regression model by comparing predicted and actual class labels, providing a breakdown of true positives, true negatives, false positives, and false negatives. The resulting matrix visually displays the distribution of correct and incorrect predictions across both classes. The matrix quantifies the model's ability to distinguish between the two classes and highlights the types and frequencies of classification errors.

Key insights:

  • Balanced distribution of correct predictions: The model produced 204 true positives and 212 true negatives, indicating similar effectiveness in identifying both positive and negative cases.
  • Notable presence of misclassifications: There are 117 false negatives and 114 false positives, reflecting a comparable rate of both error types across the two classes.
  • Error rates are non-negligible: The number of false positives and false negatives is substantial relative to the number of correct predictions, indicating that misclassification risk is present for both classes.

The confusion matrix reveals that the logistic regression model demonstrates similar performance in correctly identifying both classes, with true positives and true negatives occurring at comparable frequencies. However, the presence of substantial false positives and false negatives indicates that the model's classification errors are distributed across both classes, warranting attention to both types of misclassification risk in further evaluation.

Figures

ValidMind Figure validmind.model_validation.sklearn.ConfusionMatrix:logreg_champion:9091
2026-01-28 18:06:17,592 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.model_validation.sklearn.ConfusionMatrix:logreg_champion does not exist in model's document

❌ Minimum Accuracy Logreg Champion

The Minimum Accuracy test evaluates whether the model's prediction accuracy meets or exceeds a specified threshold, in this case 0.7. The results table presents the model's observed accuracy score, the threshold applied, and the resulting pass/fail outcome. The model's accuracy score is 0.643, which is compared directly to the threshold value to determine test status.

Key insights:

  • Accuracy below threshold: The model achieved an accuracy score of 0.643, which is below the specified threshold of 0.7.
  • Test outcome is Fail: The test result is marked as "Fail" due to the accuracy score not meeting the minimum requirement.

The results indicate that the model did not achieve the minimum accuracy criterion established for this evaluation. The observed accuracy shortfall relative to the threshold signals that the model's predictive performance, as measured by overall correctness, does not satisfy the predefined acceptance standard for this test.

Tables

Score Threshold Pass/Fail
0.643 0.7 Fail
2026-01-28 18:06:22,510 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.model_validation.sklearn.MinimumAccuracy:logreg_champion does not exist in model's document

✅ Minimum F1 Score Logreg Champion

The MinimumF1Score:logreg_champion test evaluates whether the model's F1 score on the validation set meets or exceeds a predefined minimum threshold, ensuring balanced performance between precision and recall. The results table presents the observed F1 score, the minimum threshold for passing, and the pass/fail outcome. The model achieved an F1 score of 0.6385, compared against a threshold of 0.5, with the test outcome recorded as "Pass".

Key insights:

  • F1 score exceeds minimum threshold: The model's F1 score of 0.6385 is above the required threshold of 0.5.
  • Test outcome is Pass: The model meets the minimum performance standard for balanced precision and recall as defined by the test criteria.

The results indicate that the model demonstrates balanced classification performance on the validation set, with the F1 score surpassing the established minimum requirement. The test outcome confirms that the model satisfies the specified standard for combined precision and recall.

Tables

Score Threshold Pass/Fail
0.6385 0.5 Pass
2026-01-28 18:06:26,835 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.model_validation.sklearn.MinimumF1Score:logreg_champion does not exist in model's document

ROC Curve Logreg Champion

The ROC Curve test evaluates the binary classification performance of the log_model_champion by plotting the Receiver Operating Characteristic (ROC) curve and calculating the Area Under the Curve (AUC) on the test_dataset_final. The resulting plot displays the trade-off between the true positive rate and false positive rate across all classification thresholds, with the model's ROC curve compared against a baseline representing random classification. The AUC value is provided as a summary metric of the model's discriminative ability.

Key insights:

  • AUC indicates moderate discriminative power: The model achieves an AUC of 0.67, which is above the random baseline of 0.5, indicating the model can distinguish between the two classes better than chance.
  • ROC curve consistently above random line: The ROC curve remains above the diagonal line representing random performance, confirming the model's ability to provide meaningful separation between positive and negative classes across thresholds.
  • No evidence of near-random or inverted performance: The ROC curve does not approach or fall below the random line at any threshold, indicating the absence of high-risk model failure modes.

The ROC Curve test results demonstrate that log_model_champion exhibits moderate classification performance on the test dataset, with an AUC of 0.67 reflecting a measurable ability to differentiate between classes. The ROC curve's consistent position above the random baseline further supports the model's discriminative capability, with no indications of performance collapse or inversion.

Figures

ValidMind Figure validmind.model_validation.sklearn.ROCCurve:logreg_champion:eff5
2026-01-28 18:06:37,289 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.model_validation.sklearn.ROCCurve:logreg_champion does not exist in model's document
Note the output returned indicating that a test-driven block doesn't currently exist in your model's documentation for some test IDs.

That's expected: when we run validation tests, the logged results need to be manually added to your report as part of your compliance assessment process within the ValidMind Platform.

Log an artifact

As we can observe from the output above, our champion model doesn't pass the MinimumAccuracy test based on the default threshold of the out-of-the-box test, so let's log an artifact (finding) in the ValidMind Platform (Need more help?):

  1. From the Inventory in the ValidMind Platform, go to the model you connected to earlier.

  2. In the left sidebar that appears for your model, click Validation Report under Documents.

  3. Locate the Data Preparation section and click on 2.2.2. Model Performance to expand that section.

  4. Under the Model Performance Metrics section, locate Artifacts then click Link Artifact to Report:

    Screenshot showing the validation report with the link artifact option highlighted

  5. Select Validation Issue as the type of artifact.

  6. Click + Add Validation Issue to add a validation issue type artifact.

  7. Enter in the details for your validation issue, for example:

    • TITLE — Champion Logistic Regression Model Fails Minimum Accuracy Threshold
    • RISK AREA — Model Performance
    • DOCUMENTATION SECTION — 3.2. Model Evaluation
    • DESCRIPTION — The logistic regression champion model was subjected to a Minimum Accuracy test to determine whether its predictive accuracy meets the predefined performance threshold of 0.7. The model achieved an accuracy score of 0.643, which falls below the required minimum. As a result, the test produced a Fail outcome.
  8. Click Save.

  9. Select the validation issue you just added to link to your validation report and click Update Linked Artifacts to insert your validation issue.

  10. Click on the validation issue to expand the issue, where you can adjust details such as severity, owner, due date, status, etc. as well as include proposed remediation plans or supporting documentation as attachments.

Evaluate performance of challenger model

We've now conducted tests on our champion model similar to those run by the model development team, with the aim of verifying their results.

Next, let's see how our challenger model compares. We'll use the same batch of tests here as we did in mpt, but append a different result_id to indicate that these results should be associated with our challenger model:

mpt_chall = [
    "validmind.model_validation.sklearn.ClassifierPerformance:champion_vs_challenger",
    "validmind.model_validation.sklearn.ConfusionMatrix:champion_vs_challenger",
    "validmind.model_validation.sklearn.MinimumAccuracy:champion_vs_challenger",
    "validmind.model_validation.sklearn.MinimumF1Score:champion_vs_challenger",
    "validmind.model_validation.sklearn.ROCCurve:champion_vs_challenger"
]

We'll run each test once for each model with the same vm_test_ds dataset to compare them:

for test in mpt_chall:
    vm.tests.run_test(
        test,
        input_grid={
            "dataset": [vm_test_ds],
            "model": [vm_log_model, vm_rf_model],
        },
    ).log()

Classifier Performance Champion Vs Challenger

The Classifier Performance test evaluates the predictive effectiveness of classification models by reporting precision, recall, F1-Score, accuracy, and ROC AUC metrics. The results compare two models, "log_model_champion" and "rf_model," across these metrics for both classes, as well as macro and weighted averages. The tables present detailed class-level and aggregate performance scores, enabling direct comparison of model discrimination and overall accuracy.

Key insights:

  • rf_model outperforms log_model_champion across all metrics: rf_model achieves higher precision, recall, F1-Score, accuracy (0.7202), and ROC AUC (0.7926) compared to log_model_champion, which records accuracy of 0.643 and ROC AUC of 0.6726.
  • Consistent class-level performance for rf_model: rf_model shows balanced precision and recall for both classes (precision: 0.7126 for class 0, 0.7288 for class 1; recall: 0.7454 for class 0, 0.6947 for class 1), resulting in macro and weighted averages closely aligned at 0.72.
  • log_model_champion exhibits lower and more uniform scores: log_model_champion displays similar precision and recall for both classes (precision: 0.6444 for class 0, 0.6415 for class 1; recall: 0.6503 for class 0, 0.6355 for class 1), with macro and weighted averages at 0.6429 and 0.643, respectively.

The results indicate that rf_model demonstrates superior classification performance relative to log_model_champion, as evidenced by higher scores across all evaluated metrics. Both models exhibit balanced class-level metrics, but rf_model achieves notably higher discrimination and accuracy, as reflected in its ROC AUC and F1-Score values. The observed performance differentials provide clear evidence of rf_model's enhanced predictive capability in this test context.

Tables

model Class Precision Recall F1
log_model_champion 0 0.6444 0.6503 0.6473
log_model_champion 1 0.6415 0.6355 0.6385
log_model_champion Weighted Average 0.6430 0.6430 0.6429
log_model_champion Macro Average 0.6429 0.6429 0.6429
rf_model 0 0.7126 0.7454 0.7286
rf_model 1 0.7288 0.6947 0.7113
rf_model Weighted Average 0.7206 0.7202 0.7200
rf_model Macro Average 0.7207 0.7201 0.7200
model Metric Value
log_model_champion Accuracy 0.6430
log_model_champion ROC AUC 0.6726
rf_model Accuracy 0.7202
rf_model ROC AUC 0.7926
2026-01-28 18:06:44,765 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.model_validation.sklearn.ClassifierPerformance:champion_vs_challenger does not exist in model's document

Confusion Matrix Champion Vs Challenger

The Confusion Matrix test evaluates the classification performance of the champion (logistic regression) and challenger (random forest) models by comparing predicted versus actual class labels. The resulting matrices display the counts of True Positives (TP), True Negatives (TN), False Positives (FP), and False Negatives (FN) for each model, providing a detailed breakdown of correct and incorrect predictions. These results enable direct comparison of error types and overall predictive accuracy between the two models.

Key insights:

  • Challenger model reduces both FP and FN: The random forest model (challenger) records 83 False Positives and 98 False Negatives, compared to 114 False Positives and 117 False Negatives for the logistic regression (champion), indicating improved error control.
  • Higher correct classification in challenger: The challenger model achieves 223 True Positives and 243 True Negatives, both higher than the champion model’s 204 True Positives and 212 True Negatives.
  • Error distribution shifts favor challenger: The challenger model demonstrates a more favorable balance between correct and incorrect classifications, with lower misclassification rates across both positive and negative classes.

The confusion matrix results indicate that the challenger (random forest) model outperforms the champion (logistic regression) model in both reducing misclassification errors and increasing the number of correct predictions. The challenger model achieves higher counts of both True Positives and True Negatives, while simultaneously lowering both False Positives and False Negatives, reflecting a more effective classification performance across the evaluated dataset.

Figures

ValidMind Figure validmind.model_validation.sklearn.ConfusionMatrix:champion_vs_challenger:a311
ValidMind Figure validmind.model_validation.sklearn.ConfusionMatrix:champion_vs_challenger:018f
2026-01-28 18:06:57,952 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.model_validation.sklearn.ConfusionMatrix:champion_vs_challenger does not exist in model's document

❌ Minimum Accuracy Champion Vs Challenger

The Minimum Accuracy test evaluates whether each model's prediction accuracy meets or exceeds a specified threshold, with results presented for both the log_model_champion and rf_model. The table displays the accuracy scores, the threshold applied (0.7), and the corresponding pass/fail outcome for each model. The log_model_champion achieved an accuracy score of 0.643, while the rf_model achieved a score of 0.7202, allowing for direct comparison against the threshold.

Key insights:

  • rf_model meets accuracy threshold: The rf_model achieved an accuracy score of 0.7202, surpassing the minimum threshold of 0.7 and receiving a "Pass" outcome.
  • log_model_champion falls below threshold: The log_model_champion recorded an accuracy score of 0.643, which is below the threshold, resulting in a "Fail" outcome.
  • Clear differentiation in model performance: The two models display a marked difference in accuracy relative to the threshold, with only the rf_model meeting the minimum requirement.

The results indicate that, under the specified test conditions, the rf_model satisfies the minimum accuracy criterion, while the log_model_champion does not. This distinction highlights a performance gap between the models with respect to overall prediction accuracy as measured by the test.

Tables

model Score Threshold Pass/Fail
log_model_champion 0.6430 0.7 Fail
rf_model 0.7202 0.7 Pass
2026-01-28 18:07:04,550 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.model_validation.sklearn.MinimumAccuracy:champion_vs_challenger does not exist in model's document

✅ Minimum F1 Score Champion Vs Challenger

The MinimumF1Score:champion_vs_challenger test evaluates whether each model's F1 score on the validation set meets or exceeds a predefined minimum threshold, ensuring balanced performance between precision and recall. The results table presents F1 scores for two models—log_model_champion and rf_model—alongside the minimum threshold and pass/fail status. Both models are assessed against a threshold of 0.5, with their respective F1 scores and outcomes displayed.

Key insights:

  • Both models exceed the minimum F1 threshold: log_model_champion and rf_model achieved F1 scores of 0.6385 and 0.7113, respectively, both surpassing the 0.5 threshold.
  • rf_model demonstrates higher F1 performance: rf_model outperforms log_model_champion by 0.0728 in F1 score, indicating stronger balance between precision and recall on the validation set.
  • All models pass the test criteria: Both models are marked as "Pass," confirming that each meets the minimum F1 score requirement.

Both evaluated models satisfy the minimum F1 score criterion, indicating balanced classification performance on the validation set. The rf_model demonstrates comparatively higher F1 performance, suggesting improved precision-recall tradeoff relative to log_model_champion. No models fall below the established threshold, and all pass the test as specified.

Tables

model Score Threshold Pass/Fail
log_model_champion 0.6385 0.5 Pass
rf_model 0.7113 0.5 Pass
2026-01-28 18:07:08,961 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.model_validation.sklearn.MinimumF1Score:champion_vs_challenger does not exist in model's document

ROC Curve Champion Vs Challenger

The ROC Curve test evaluates the discrimination ability of binary classification models by plotting the True Positive Rate against the False Positive Rate at various thresholds and calculating the Area Under the Curve (AUC) score. The results present ROC curves and AUC values for two models—log_model_champion and rf_model—on the test_dataset_final, with each curve compared against a baseline representing random classification (AUC = 0.5). The ROC curves and corresponding AUC scores provide a visual and quantitative assessment of each model's ability to distinguish between the two classes.

Key insights:

  • rf_model demonstrates higher discrimination: The rf_model achieves an AUC of 0.79, indicating stronger separation between classes compared to the log_model_champion.
  • log_model_champion shows moderate performance: The log_model_champion records an AUC of 0.67, reflecting moderate discriminative ability above random chance but below that of the rf_model.
  • Both models outperform random classification: Both ROC curves are consistently above the diagonal line representing random performance (AUC = 0.5), confirming that each model provides meaningful predictive power on the test dataset.

The results indicate that both models possess discriminative capability, with the rf_model exhibiting notably stronger performance as measured by the AUC metric. The log_model_champion provides moderate classification ability, while the rf_model achieves a higher level of class separation, as reflected in the ROC curve and AUC score. Both models demonstrate predictive value beyond random assignment on the evaluated dataset.

Figures

ValidMind Figure validmind.model_validation.sklearn.ROCCurve:champion_vs_challenger:06c4
ValidMind Figure validmind.model_validation.sklearn.ROCCurve:champion_vs_challenger:b9f0
2026-01-28 18:07:23,868 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.model_validation.sklearn.ROCCurve:champion_vs_challenger does not exist in model's document
Based on the performance metrics, our challenger random forest classification model passes the MinimumAccuracy test where our champion did not.

In your validation report, support the recommendation in your validation issue's Proposed Remediation Plan to investigate the use of the challenger model by inserting the performance tests we logged with this notebook into the appropriate section.

Run diagnostic tests

Next, we want to compare the robustness and stability of our champion and challenger models.

Use list_tests() to list all available diagnosis tests applicable to classification tasks:

vm.tests.list_tests(tags=["model_diagnosis"], task="classification")
ID Name Description Has Figure Has Table Required Inputs Params Tags Tasks
validmind.model_validation.sklearn.OverfitDiagnosis Overfit Diagnosis Assesses potential overfitting in a model's predictions, identifying regions where performance between training and... True True ['model', 'datasets'] {'metric': {'type': 'str', 'default': None}, 'cut_off_threshold': {'type': 'float', 'default': 0.04}} ['sklearn', 'binary_classification', 'multiclass_classification', 'linear_regression', 'model_diagnosis'] ['classification', 'regression']
validmind.model_validation.sklearn.RobustnessDiagnosis Robustness Diagnosis Assesses the robustness of a machine learning model by evaluating performance decay under noisy conditions.... True True ['datasets', 'model'] {'metric': {'type': 'str', 'default': None}, 'scaling_factor_std_dev_list': {'type': 'List', 'default': [0.1, 0.2, 0.3, 0.4, 0.5]}, 'performance_decay_threshold': {'type': 'float', 'default': 0.05}} ['sklearn', 'model_diagnosis', 'visualization'] ['classification', 'regression']
validmind.model_validation.sklearn.WeakspotsDiagnosis Weakspots Diagnosis Identifies and visualizes weak spots in a machine learning model's performance across various sections of the... True True ['datasets', 'model'] {'features_columns': {'type': 'Optional', 'default': None}, 'metrics': {'type': 'Optional', 'default': None}, 'thresholds': {'type': 'Optional', 'default': None}} ['sklearn', 'binary_classification', 'multiclass_classification', 'model_diagnosis', 'visualization'] ['classification', 'text_classification']

Let’s now use the OverfitDiagnosis test to assess the models for potential signs of overfitting and to identify any sub-segments where performance may be inconsistent.

Overfitting occurs when a model learns the training data too well, capturing not only the true pattern but also noise and random fluctuations. This results in excellent performance on the training dataset but poor generalization to new, unseen data:

  • Since the training dataset (vm_train_ds) was used to fit the model, we use this set to establish a baseline performance for how well the model performs on data it has already seen.
  • The testing dataset (vm_test_ds) was never seen during training, and here simulates real-world generalization, or how well the model performs on new, unseen data.
vm.tests.run_test(
    test_id="validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger",
    input_grid={
        "datasets": [[vm_train_ds, vm_test_ds]],
        "model": [vm_log_model, vm_rf_model],
    },
).log()

Overfit Diagnosis Champion Vs Challenger

The Overfit Diagnosis test evaluates the extent to which model performance on the training set diverges from performance on the test set across feature segments, using AUC as the metric for classification models. The results are presented for both a logistic regression model (log_model_champion) and a random forest model (rf_model), with AUC gaps calculated for binned regions of key features. Visualizations and tabular data highlight regions where the absolute difference in AUC between training and test sets exceeds the default threshold of 0.04, indicating potential overfitting.

Key insights:

  • Localized overfitting in logistic regression model: For log_model_champion, AUC gaps above the 0.04 threshold are observed in specific segments, notably for CreditScore (0.1524 in [400, 450]), Tenure (0.111 in [2, 3]), Balance (0.1818 in [200718, 225808]), and EstimatedSalary (up to 0.144 in [139939, 159929]).
  • Widespread and pronounced overfitting in random forest model: For rf_model, nearly all feature segments exhibit AUC gaps well above the threshold, with values frequently exceeding 0.2 and reaching as high as 1.0 for Balance ([200718, 225808]). This pattern is consistent across CreditScore, Tenure, Balance, NumOfProducts, HasCrCard, IsActiveMember, EstimatedSalary, Geography_Germany, Geography_Spain, and Gender_Male.
  • Magnitude and consistency of AUC gaps: The random forest model shows consistently high AUC gaps across all feature bins, while the logistic regression model displays more moderate and localized gaps, with most segments remaining below the threshold except for a few isolated regions.
  • Feature segments with limited test data: The largest AUC gaps, particularly in the Balance feature for both models, are associated with bins containing very few test records (e.g., 5 test records for [200718, 225808]), which may contribute to instability in the observed metrics.

The Overfit Diagnosis test reveals that the logistic regression model demonstrates moderate and localized overfitting, with only a subset of feature segments exceeding the AUC gap threshold. In contrast, the random forest model exhibits pervasive and substantial overfitting across nearly all feature segments, with AUC gaps consistently and substantially above the threshold. The most extreme gaps are observed in regions with limited test data, suggesting that both model complexity and data sparsity contribute to the observed overfitting patterns. These findings provide a detailed view of model generalization behavior and highlight specific regions and features where overfitting is most pronounced.

Tables

| model | Feature | Slice | Number of Training Records | Number of Test Records | Training AUC | Test AUC | Gap |
|---|---|---|---|---|---|---|---|
| log_model_champion | CreditScore | (400.0, 450.0] | 47 | 10 | 0.6724 | 0.5200 | 0.1524 |
| log_model_champion | CreditScore | (500.0, 550.0] | 269 | 64 | 0.6734 | 0.6129 | 0.0605 |
| log_model_champion | CreditScore | (750.0, 800.0] | 251 | 63 | 0.6911 | 0.6401 | 0.0510 |
| log_model_champion | Tenure | (2.0, 3.0] | 279 | 68 | 0.6764 | 0.5654 | 0.1110 |
| log_model_champion | Tenure | (5.0, 6.0] | 239 | 68 | 0.6917 | 0.6092 | 0.0825 |
| log_model_champion | Tenure | (9.0, 10.0] | 136 | 31 | 0.6535 | 0.5882 | 0.0652 |
| log_model_champion | Balance | (-250.898, 25089.809] | 808 | 220 | 0.6654 | 0.6146 | 0.0508 |
| log_model_champion | Balance | (100359.236, 125449.045] | 621 | 141 | 0.7410 | 0.6895 | 0.0515 |
| log_model_champion | Balance | (150538.854, 175628.663] | 186 | 39 | 0.5683 | 0.5111 | 0.0572 |
| log_model_champion | Balance | (200718.472, 225808.281] | 12 | 5 | 0.1818 | 0.0000 | 0.1818 |
| log_model_champion | NumOfProducts | (0.997, 1.3] | 1510 | 374 | 0.6731 | 0.6325 | 0.0406 |
| log_model_champion | EstimatedSalary | (-188.318, 20001.354] | 242 | 64 | 0.7244 | 0.6768 | 0.0476 |
| log_model_champion | EstimatedSalary | (59980.902, 79970.676] | 270 | 68 | 0.6921 | 0.6497 | 0.0425 |
| log_model_champion | EstimatedSalary | (79970.676, 99960.45] | 267 | 63 | 0.7156 | 0.5731 | 0.1424 |
| log_model_champion | EstimatedSalary | (139939.998, 159929.772] | 237 | 54 | 0.7485 | 0.6045 | 0.1440 |
| log_model_champion | EstimatedSalary | (179919.546, 199909.32] | 264 | 47 | 0.7062 | 0.6015 | 0.1047 |
| log_model_champion | Gender_Male | (0.9, 1.0] | 1300 | 316 | 0.6858 | 0.6378 | 0.0480 |
| rf_model | CreditScore | (400.0, 450.0] | 47 | 10 | 1.0000 | 0.6600 | 0.3400 |
| rf_model | CreditScore | (450.0, 500.0] | 121 | 29 | 1.0000 | 0.7500 | 0.2500 |
| rf_model | CreditScore | (500.0, 550.0] | 269 | 64 | 1.0000 | 0.7483 | 0.2517 |
| rf_model | CreditScore | (550.0, 600.0] | 371 | 97 | 1.0000 | 0.7806 | 0.2194 |
| rf_model | CreditScore | (600.0, 650.0] | 494 | 107 | 1.0000 | 0.7841 | 0.2159 |
| rf_model | CreditScore | (650.0, 700.0] | 470 | 138 | 1.0000 | 0.8086 | 0.1914 |
| rf_model | CreditScore | (700.0, 750.0] | 410 | 94 | 1.0000 | 0.7559 | 0.2441 |
| rf_model | CreditScore | (750.0, 800.0] | 251 | 63 | 1.0000 | 0.8135 | 0.1865 |
| rf_model | CreditScore | (800.0, 850.0] | 141 | 42 | 1.0000 | 0.9032 | 0.0968 |
| rf_model | Tenure | (-0.01, 1.0] | 367 | 98 | 1.0000 | 0.7178 | 0.2822 |
| rf_model | Tenure | (1.0, 2.0] | 261 | 59 | 1.0000 | 0.8387 | 0.1613 |
| rf_model | Tenure | (2.0, 3.0] | 279 | 68 | 1.0000 | 0.7641 | 0.2359 |
| rf_model | Tenure | (3.0, 4.0] | 268 | 66 | 1.0000 | 0.8557 | 0.1443 |
| rf_model | Tenure | (4.0, 5.0] | 277 | 65 | 1.0000 | 0.8014 | 0.1986 |
| rf_model | Tenure | (5.0, 6.0] | 239 | 68 | 1.0000 | 0.8240 | 0.1760 |
| rf_model | Tenure | (6.0, 7.0] | 252 | 53 | 1.0000 | 0.7337 | 0.2663 |
| rf_model | Tenure | (7.0, 8.0] | 258 | 68 | 1.0000 | 0.8156 | 0.1844 |
| rf_model | Tenure | (8.0, 9.0] | 248 | 71 | 1.0000 | 0.8081 | 0.1919 |
| rf_model | Tenure | (9.0, 10.0] | 136 | 31 | 1.0000 | 0.8277 | 0.1723 |
| rf_model | Balance | (-250.898, 25089.809] | 808 | 220 | 1.0000 | 0.8158 | 0.1842 |
| rf_model | Balance | (25089.809, 50179.618] | 18 | 7 | 1.0000 | 0.7500 | 0.2500 |
| rf_model | Balance | (50179.618, 75269.427] | 115 | 15 | 1.0000 | 0.8889 | 0.1111 |
| rf_model | Balance | (75269.427, 100359.236] | 275 | 80 | 1.0000 | 0.7390 | 0.2610 |
| rf_model | Balance | (100359.236, 125449.045] | 621 | 141 | 1.0000 | 0.7723 | 0.2277 |
| rf_model | Balance | (125449.045, 150538.854] | 502 | 127 | 1.0000 | 0.7374 | 0.2626 |
| rf_model | Balance | (150538.854, 175628.663] | 186 | 39 | 1.0000 | 0.7593 | 0.2407 |
| rf_model | Balance | (175628.663, 200718.472] | 46 | 13 | 1.0000 | 0.7250 | 0.2750 |
| rf_model | Balance | (200718.472, 225808.281] | 12 | 5 | 1.0000 | 0.0000 | 1.0000 |
| rf_model | NumOfProducts | (0.997, 1.3] | 1510 | 374 | 1.0000 | 0.7062 | 0.2938 |
| rf_model | NumOfProducts | (1.9, 2.2] | 891 | 223 | 1.0000 | 0.6449 | 0.3551 |
| rf_model | NumOfProducts | (2.8, 3.1] | 151 | 39 | 1.0000 | 0.7315 | 0.2685 |
| rf_model | HasCrCard | (-0.001, 0.1] | 777 | 196 | 1.0000 | 0.7772 | 0.2228 |
| rf_model | HasCrCard | (0.9, 1.0] | 1808 | 451 | 1.0000 | 0.7987 | 0.2013 |
| rf_model | IsActiveMember | (-0.001, 0.1] | 1378 | 346 | 1.0000 | 0.7884 | 0.2116 |
| rf_model | IsActiveMember | (0.9, 1.0] | 1207 | 301 | 1.0000 | 0.7869 | 0.2131 |
| rf_model | EstimatedSalary | (-188.318, 20001.354] | 242 | 64 | 1.0000 | 0.7867 | 0.2133 |
| rf_model | EstimatedSalary | (20001.354, 39991.128] | 265 | 75 | 1.0000 | 0.8682 | 0.1318 |
| rf_model | EstimatedSalary | (39991.128, 59980.902] | 237 | 61 | 1.0000 | 0.7151 | 0.2849 |
| rf_model | EstimatedSalary | (59980.902, 79970.676] | 270 | 68 | 1.0000 | 0.7881 | 0.2119 |
| rf_model | EstimatedSalary | (79970.676, 99960.45] | 267 | 63 | 1.0000 | 0.8684 | 0.1316 |
| rf_model | EstimatedSalary | (99960.45, 119950.224] | 262 | 77 | 1.0000 | 0.7360 | 0.2640 |
| rf_model | EstimatedSalary | (119950.224, 139939.998] | 254 | 72 | 1.0000 | 0.8089 | 0.1911 |
| rf_model | EstimatedSalary | (139939.998, 159929.772] | 237 | 54 | 1.0000 | 0.7560 | 0.2440 |
| rf_model | EstimatedSalary | (159929.772, 179919.546] | 287 | 66 | 1.0000 | 0.7912 | 0.2088 |
| rf_model | EstimatedSalary | (179919.546, 199909.32] | 264 | 47 | 1.0000 | 0.8458 | 0.1542 |
| rf_model | Geography_Germany | (-0.001, 0.1] | 1807 | 454 | 1.0000 | 0.7780 | 0.2220 |
| rf_model | Geography_Germany | (0.9, 1.0] | 778 | 193 | 1.0000 | 0.7393 | 0.2607 |
| rf_model | Geography_Spain | (-0.001, 0.1] | 1949 | 496 | 1.0000 | 0.7907 | 0.2093 |
| rf_model | Geography_Spain | (0.9, 1.0] | 636 | 151 | 1.0000 | 0.7746 | 0.2254 |
| rf_model | Gender_Male | (-0.001, 0.1] | 1285 | 331 | 1.0000 | 0.8017 | 0.1983 |
| rf_model | Gender_Male | (0.9, 1.0] | 1300 | 316 | 1.0000 | 0.7722 | 0.2278 |

Figures

ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:2d87
ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:1a1a
ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:c5ee
ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:0444
ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:88d2
ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:4e80
ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:023e
ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:6062
ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:6aff
ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:81d3
ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:b18c
ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:9341
ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:38ba
ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:9d18
ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:638c
ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:c855
ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:2ce3
ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:c81f
ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:2d05
ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:d6be
2026-01-28 18:07:55,386 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger does not exist in model's document
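
If you want to dig into the flagged segments yourself, the segment-level results can be inspected with pandas. A minimal sketch, assuming you load rows like the ones in the table above into a DataFrame (the two example rows are copied from that table, and the 0.04 cutoff is purely illustrative):

import pandas as pd

# Hypothetical: a couple of segment-level rows from the table above,
# loaded into a DataFrame for ad-hoc inspection
overfit_df = pd.DataFrame(
    [
        {"model": "log_model_champion", "Feature": "CreditScore", "Slice": "(400.0, 450.0]", "Gap": 0.1524},
        {"model": "rf_model", "Feature": "NumOfProducts", "Slice": "(1.9, 2.2]", "Gap": 0.3551},
    ]
)

# Flag segments whose train/test AUC gap exceeds an illustrative cutoff,
# worst offenders first
cutoff = 0.04
worst_segments = overfit_df[overfit_df["Gap"] > cutoff].sort_values("Gap", ascending=False)
print(worst_segments)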

Let's also conduct robustness and stability testing of the two models with the RobustnessDiagnosis test. Robustness refers to a model's ability to maintain consistent performance when its inputs are perturbed, and stability refers to a model's ability to produce consistent outputs over time and across different data subsets.

Again, we'll use both the training and testing datasets to establish baseline performance and to simulate real-world generalization:

# Run and log the robustness diagnosis for both models across the train and test datasets
vm.tests.run_test(
    test_id="validmind.model_validation.sklearn.RobustnessDiagnosis:Champion_vs_LogRegression",
    input_grid={
        "datasets": [[vm_train_ds, vm_test_ds]],
        "model": [vm_log_model, vm_rf_model],
    },
).log()

❌ Robustness Diagnosis Champion Vs Log Regression

The Robustness Diagnosis test evaluates the resilience of machine learning models to input perturbations by measuring AUC decay under increasing levels of Gaussian noise. Results are presented for both the logistic regression champion model and a random forest model, with AUC and performance decay tracked across multiple perturbation scales for both train and test datasets. The visualizations and tabular data illustrate the comparative robustness of each model, highlighting the magnitude and pattern of performance degradation as noise increases.

Key insights:

  • Logistic regression model exhibits minimal AUC decay: Across all perturbation sizes (up to 0.5 standard deviations), the log_model_champion maintains stable AUC values on both train (0.684 to 0.6696) and test (0.6726 to 0.6691) datasets, with performance decay remaining below 0.016.
  • Random forest model shows pronounced sensitivity to noise: The rf_model experiences substantial AUC decline on the train set (from 1.0 to 0.7968) and a notable decrease on the test set (from 0.7926 to 0.7156) as perturbation size increases, with performance decay reaching 0.2032 (train) and 0.077 (test) at the highest noise level.
  • Threshold failures observed in random forest model: The rf_model fails the robustness threshold on the train set at perturbation sizes of 0.2 and above, and on the test set at the highest perturbation (0.5), as indicated by the "Passed: false" status.
  • Logistic regression model consistently passes robustness criteria: The log_model_champion passes the robustness threshold at all tested perturbation levels for both train and test datasets.

The results demonstrate that the logistic regression champion model maintains stable predictive performance under increasing Gaussian noise, with negligible AUC decay and consistent threshold compliance. In contrast, the random forest model exhibits marked performance degradation, particularly on the train set, and fails robustness criteria at moderate to high noise levels. These findings indicate a higher degree of robustness to input perturbations in the logistic regression model relative to the random forest model under the tested conditions.

Tables

| model | Perturbation Size | Dataset | Row Count | AUC | Performance Decay | Passed |
|---|---|---|---|---|---|---|
| log_model_champion | Baseline (0.0) | train_dataset_final | 2585 | 0.6840 | 0.0000 | True |
| log_model_champion | Baseline (0.0) | test_dataset_final | 647 | 0.6726 | 0.0000 | True |
| log_model_champion | 0.1 | train_dataset_final | 2585 | 0.6843 | -0.0003 | True |
| log_model_champion | 0.1 | test_dataset_final | 647 | 0.6710 | 0.0015 | True |
| log_model_champion | 0.2 | train_dataset_final | 2585 | 0.6794 | 0.0046 | True |
| log_model_champion | 0.2 | test_dataset_final | 647 | 0.6654 | 0.0072 | True |
| log_model_champion | 0.3 | train_dataset_final | 2585 | 0.6770 | 0.0070 | True |
| log_model_champion | 0.3 | test_dataset_final | 647 | 0.6663 | 0.0062 | True |
| log_model_champion | 0.4 | train_dataset_final | 2585 | 0.6757 | 0.0083 | True |
| log_model_champion | 0.4 | test_dataset_final | 647 | 0.6566 | 0.0159 | True |
| log_model_champion | 0.5 | train_dataset_final | 2585 | 0.6696 | 0.0144 | True |
| log_model_champion | 0.5 | test_dataset_final | 647 | 0.6691 | 0.0034 | True |
| rf_model | Baseline (0.0) | train_dataset_final | 2585 | 1.0000 | 0.0000 | True |
| rf_model | Baseline (0.0) | test_dataset_final | 647 | 0.7926 | 0.0000 | True |
| rf_model | 0.1 | train_dataset_final | 2585 | 0.9829 | 0.0171 | True |
| rf_model | 0.1 | test_dataset_final | 647 | 0.7895 | 0.0030 | True |
| rf_model | 0.2 | train_dataset_final | 2585 | 0.9400 | 0.0600 | False |
| rf_model | 0.2 | test_dataset_final | 647 | 0.7816 | 0.0110 | True |
| rf_model | 0.3 | train_dataset_final | 2585 | 0.8969 | 0.1031 | False |
| rf_model | 0.3 | test_dataset_final | 647 | 0.7743 | 0.0183 | True |
| rf_model | 0.4 | train_dataset_final | 2585 | 0.8445 | 0.1555 | False |
| rf_model | 0.4 | test_dataset_final | 647 | 0.7550 | 0.0376 | True |
| rf_model | 0.5 | train_dataset_final | 2585 | 0.7968 | 0.2032 | False |
| rf_model | 0.5 | test_dataset_final | 647 | 0.7156 | 0.0770 | False |

Figures

ValidMind Figure validmind.model_validation.sklearn.RobustnessDiagnosis:Champion_vs_LogRegression:42d8
ValidMind Figure validmind.model_validation.sklearn.RobustnessDiagnosis:Champion_vs_LogRegression:573c
2026-01-28 18:08:17,838 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.model_validation.sklearn.RobustnessDiagnosis:Champion_vs_LogRegression does not exist in model's document
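
The ValidMind test handles the perturbation and scoring for you; purely as an illustration of the underlying idea, here is a small self-contained sketch (on synthetic data, not the churn dataset) that adds Gaussian noise of increasing scale to the features and tracks how AUC decays from the unperturbed baseline:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the churn features, purely for illustration
X, y = make_classification(n_samples=2000, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = RandomForestClassifier(random_state=42).fit(X_train, y_train)
baseline_auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])

rng = np.random.default_rng(42)
feature_std = X_test.std(axis=0)

# Perturb the test features with Gaussian noise scaled to each feature's std deviation
for scale in [0.1, 0.2, 0.3, 0.4, 0.5]:
    noisy = X_test + rng.normal(0.0, scale * feature_std, size=X_test.shape)
    auc = roc_auc_score(y_test, model.predict_proba(noisy)[:, 1])
    print(f"scale={scale:.1f}  AUC={auc:.4f}  decay={baseline_auc - auc:.4f}")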

Run feature importance tests

We also want to verify the relative influence of different input features on our models' predictions, and to compare the champion and challenger models to see whether one offers more interpretable or intuitive importance scores for its features.

Use list_tests() to identify all the feature importance tests for classification:

# Store the feature importance tests
FI = vm.tests.list_tests(tags=["feature_importance"], task="classification", pretty=False)
FI
['validmind.model_validation.FeaturesAUC',
 'validmind.model_validation.sklearn.PermutationFeatureImportance',
 'validmind.model_validation.sklearn.SHAPGlobalImportance']

We'll only use our testing dataset (vm_test_ds) here to provide a realistic, unseen sample that mimics future or production data, since the training dataset has already influenced our models during learning:

# Run and log our feature importance tests for both models on the testing dataset
for test in FI:
    vm.tests.run_test(
        f"{test}:champion_vs_challenger",
        input_grid={
            "dataset": [vm_test_ds],
            "model": [vm_log_model, vm_rf_model],
        },
    ).log()

Features Champion Vs Challenger

The FeaturesAUC:champion_vs_challenger test evaluates the discriminatory power of each individual feature in a binary classification context by calculating the Area Under the Curve (AUC) for each feature separately. The results are presented as horizontal bar plots, with each bar representing the AUC score for a specific feature, allowing for direct comparison of univariate classification strength across all features. The AUC values range from approximately 0.41 to 0.63, with features ordered from highest to lowest AUC, providing a clear view of which features are most and least effective at distinguishing between the two classes on their own.

Key insights:

  • Geography_Germany exhibits highest univariate discrimination: Geography_Germany achieves the highest AUC score, exceeding 0.6, indicating the strongest individual ability to separate the two classes among all features evaluated.
  • Balance and CreditScore show moderate discriminatory power: Both Balance and CreditScore register AUC values above 0.5, reflecting moderate univariate classification strength.
  • Several features display limited univariate separation: Features such as NumOfProducts, IsActiveMember, and Gender_Male have AUC scores near or below 0.45, indicating limited ability to distinguish between classes when considered independently.
  • Consistent feature ranking across test runs: The ordering and relative magnitudes of AUC scores are consistent across both test result plots, supporting the stability of the univariate feature evaluation.

The results indicate that Geography_Germany, Balance, and CreditScore are the most individually informative features for binary class separation in the evaluated dataset, while other features contribute less discriminatory power on their own. The observed AUC distribution highlights a clear differentiation in univariate predictive strength across the feature set, with consistent rankings reinforcing the reliability of these findings.

Figures

ValidMind Figure validmind.model_validation.FeaturesAUC:champion_vs_challenger:a620
ValidMind Figure validmind.model_validation.FeaturesAUC:champion_vs_challenger:5982
2026-01-28 18:08:35,656 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.model_validation.FeaturesAUC:champion_vs_challenger does not exist in model's document
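
The same univariate idea can be reproduced directly with scikit-learn by scoring each feature on its own against the target with roc_auc_score. A minimal sketch, assuming x_test is a pandas DataFrame of the model's input features and y_test holds the corresponding Exited labels (both variable names are assumptions, not ones defined by this notebook):

import pandas as pd
from sklearn.metrics import roc_auc_score

def univariate_auc(X: pd.DataFrame, y: pd.Series) -> pd.Series:
    # AUC of each feature used alone as a score for the positive class;
    # values near 0.5 indicate little univariate separation
    return pd.Series(
        {col: roc_auc_score(y, X[col]) for col in X.columns}
    ).sort_values(ascending=False)

# Example usage (uncomment once x_test / y_test are defined):
# univariate_auc(x_test, y_test)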

Permutation Feature Importance Champion Vs Challenger

The Permutation Feature Importance (PFI) test evaluates the relative importance of each input feature by measuring the decrease in model performance when the feature's values are randomly permuted. The results are presented as bar plots for both the logistic regression (log_model_champion) and random forest (rf_model) models, with each bar representing the importance score for a given feature. The plots allow for direct comparison of feature influence across the two model types.

Key insights:

  • Divergent top features across models: The logistic regression model assigns highest importance to Geography_Germany, IsActiveMember, and Gender_Male, while the random forest model ranks NumOfProducts and Balance as most influential.
  • Geography_Germany consistently important: Geography_Germany is among the top three features for both models, indicating a stable influence on predictions across model architectures.
  • NumOfProducts critical for random forest: NumOfProducts is the most important feature in the random forest model, but is of moderate importance in the logistic regression model.
  • Low importance for EstimatedSalary and Geography_Spain: Both models assign minimal importance to EstimatedSalary and Geography_Spain, suggesting limited predictive contribution from these features.
  • Variation in IsActiveMember impact: IsActiveMember is highly important in the logistic regression model but less so in the random forest model, highlighting model-specific feature utilization.

The PFI results reveal distinct patterns of feature reliance between the logistic regression and random forest models. While some features such as Geography_Germany demonstrate consistent importance, others like NumOfProducts and IsActiveMember show model-dependent influence. Several features, including EstimatedSalary and Geography_Spain, contribute minimally to predictive performance in both models. These findings provide a clear view of how each model leverages available features, supporting further analysis of model behavior and risk.

Figures

ValidMind Figure validmind.model_validation.sklearn.PermutationFeatureImportance:champion_vs_challenger:f739
ValidMind Figure validmind.model_validation.sklearn.PermutationFeatureImportance:champion_vs_challenger:c17d
2026-01-28 18:08:55,506 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.model_validation.sklearn.PermutationFeatureImportance:champion_vs_challenger does not exist in model's document
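
For comparison, scikit-learn ships its own permutation importance utility, which mirrors what this test measures. A minimal sketch, assuming log_model is the fitted champion estimator and x_test / y_test hold the held-out features and labels (all three names are assumptions, and the scoring choice is illustrative):

from sklearn.inspection import permutation_importance

# Permute each feature several times and measure the mean drop in ROC AUC;
# larger drops indicate features the model relies on more heavily
result = permutation_importance(
    log_model,   # fitted champion estimator (assumed variable name)
    x_test,
    y_test,
    scoring="roc_auc",
    n_repeats=10,
    random_state=42,
)

for name, drop in sorted(
    zip(x_test.columns, result.importances_mean), key=lambda pair: -pair[1]
):
    print(f"{name:<20s} {drop:.4f}")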

SHAP Global Importance Champion Vs Challenger

The SHAP Global Importance test evaluates and visualizes the global feature importance for both the champion (logistic regression) and challenger (random forest) models using SHAP values. The results include mean importance plots and summary plots, which display the normalized SHAP values for each feature and illustrate the distribution and impact of feature values on model output. These visualizations provide a comparative perspective on how each model attributes importance to input features.

Key insights:

  • Champion model dominated by few features: The logistic regression model assigns the highest normalized SHAP importance to IsActiveMember, Geography_Germany, and Gender_Male, with IsActiveMember reaching 100% normalized importance and the next two features also showing high values. Remaining features contribute substantially less to the model's output.
  • Challenger model focuses on fewer variables: The random forest model's SHAP plots indicate that only Tenure and CreditScore are assigned notable importance, with other features not represented in the summary plots.
  • Distinct feature attribution patterns: The champion model distributes importance across a broader set of features, while the challenger model concentrates importance on a narrow subset.
  • SHAP value distributions are compact: Both models exhibit relatively tight SHAP value distributions for their most important features, with no evidence of extreme outliers or high variability in the summary plots.

The SHAP Global Importance analysis reveals that the champion model relies on a wider range of features, with a strong emphasis on IsActiveMember, Geography_Germany, and Gender_Male, while the challenger model attributes nearly all importance to Tenure and CreditScore. Both models display compact SHAP value distributions for their key features, indicating stable feature contributions without pronounced outlier effects. The observed differences in feature attribution highlight distinct model reasoning and may inform further model selection or refinement.

Figures

ValidMind Figure validmind.model_validation.sklearn.SHAPGlobalImportance:champion_vs_challenger:f5e5
ValidMind Figure validmind.model_validation.sklearn.SHAPGlobalImportance:champion_vs_challenger:d61a
ValidMind Figure validmind.model_validation.sklearn.SHAPGlobalImportance:champion_vs_challenger:762a
ValidMind Figure validmind.model_validation.sklearn.SHAPGlobalImportance:champion_vs_challenger:34e9
2026-01-28 18:09:15,823 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.model_validation.sklearn.SHAPGlobalImportance:champion_vs_challenger does not exist in model's document
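
If you want to reproduce the random forest's SHAP view outside of ValidMind, the shap package's TreeExplainer handles tree ensembles directly. A minimal sketch, assuming shap is installed, rf_classifier is the fitted RandomForestClassifier, and x_test is a DataFrame of its input features (both variable names are assumptions; the shape of the returned SHAP values varies across shap versions):

import numpy as np
import shap

# Exact SHAP values for tree ensembles; binary classifiers may return a
# per-class list (older shap) or a 3-D array (newer shap)
explainer = shap.TreeExplainer(rf_classifier)
shap_values = explainer.shap_values(x_test)

if isinstance(shap_values, list):        # older shap: one array per class
    values = shap_values[1]
elif np.ndim(shap_values) == 3:          # newer shap: (samples, features, classes)
    values = shap_values[:, :, 1]
else:
    values = shap_values

# Beeswarm-style summary of positive-class attributions
shap.summary_plot(values, x_test)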

In summary

In this third notebook, you learned how to:

Next steps

Finalize validation and reporting

Now that you're familiar with the basics of using the ValidMind Library to run and log validation tests, let's learn how to implement some custom tests and wrap up our validation: 4 — Finalize validation and reporting


Copyright © 2023-2026 ValidMind Inc. All rights reserved.
Refer to LICENSE for details.
SPDX-License-Identifier: AGPL-3.0 AND ValidMind Commercial