ValidMind for model validation 3 — Developing a potential challenger model

Learn how to use ValidMind for your end-to-end model validation process with our series of four introductory notebooks. In this third notebook, develop a potential challenger model and then pass your model and its predictions to ValidMind.

A challenger model is an alternate model that attempts to outperform the champion model, ensuring that the best performing fit-for-purpose model is always considered for deployment. Challenger models also help avoid over-reliance on a single model, and allow testing of new features, algorithms, or data sources without disrupting the production lifecycle.

Learn by doing

Our course tailor-made for validators new to ValidMind combines this series of notebooks with a more in-depth introduction to the ValidMind Platform — Validator Fundamentals.

Prerequisites

To develop potential challenger models with this notebook, you'll first need to have:

Need help with the above steps?

Refer to the first two notebooks in this series:

Setting up

This section should be familiar to you, as we performed the same actions in the previous notebook, 2 — Start the model validation process.

Initialize the ValidMind Library

As usual, let's first connect the ValidMind Library to the model we previously registered in the ValidMind Platform:

  1. In a browser, log in to ValidMind.

  2. In the left sidebar, navigate to Inventory and select the model you registered for this "ValidMind for model validation" series of notebooks.

  3. Go to Getting Started and click Copy snippet to clipboard.

Next, load your model identifier credentials from an .env file or replace the placeholder with your own code snippet:

# Make sure the ValidMind Library is installed

%pip install -q validmind

# Load your model identifier credentials from an `.env` file

%load_ext dotenv
%dotenv .env

# Or replace with your code snippet

import validmind as vm

vm.init(
    # api_host="...",
    # api_key="...",
    # api_secret="...",
    # model="...",
)
Note: you may need to restart the kernel to use updated packages.
2025-12-31 22:26:35,223 - INFO(validmind.api_client): 🎉 Connected to ValidMind!
📊 Model: [ValidMind Academy] Model validation (ID: cmalguc9y02ok199q2db381ib)
📁 Document Type: validation_report

Import the sample dataset

Next, we'll load the sample Bank Customer Churn Prediction dataset used to develop the champion model, which we will then independently preprocess:

# Load the sample dataset
from validmind.datasets.classification import customer_churn as demo_dataset

print(
    f"Loaded demo dataset with: \n\n\t• Target column: '{demo_dataset.target_column}' \n\t• Class labels: {demo_dataset.class_labels}"
)

raw_df = demo_dataset.load_data()
Loaded demo dataset with: 

    • Target column: 'Exited' 
    • Class labels: {'0': 'Did not exit', '1': 'Exited'}

Preprocess the dataset

We’ll apply a simple rebalancing technique to the dataset before continuing:

import pandas as pd

raw_copy_df = raw_df.sample(frac=1)  # Create a shuffled copy of the raw dataset

# Create a balanced dataset with the same number of exited and not exited customers
exited_df = raw_copy_df.loc[raw_copy_df["Exited"] == 1]
not_exited_df = raw_copy_df.loc[raw_copy_df["Exited"] == 0].sample(n=exited_df.shape[0])

balanced_raw_df = pd.concat([exited_df, not_exited_df])
balanced_raw_df = balanced_raw_df.sample(frac=1, random_state=42)
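
To confirm the rebalancing worked as intended, you can optionally check the class counts; this is a quick sanity check and not part of the original flow:

# Optional sanity check: both classes should now have the same number of rows
balanced_raw_df["Exited"].value_counts()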

Let’s also quickly remove highly correlated features from the dataset using the output from a ValidMind test.

As you know, before we can run tests we'll need to initialize a ValidMind dataset object with the init_dataset function:

# Register new data and now 'balanced_raw_dataset' is the new dataset object of interest
vm_balanced_raw_dataset = vm.init_dataset(
    dataset=balanced_raw_df,
    input_id="balanced_raw_dataset",
    target_column="Exited",
)

With our balanced dataset initialized, we can then run our test and utilize the output to help us identify the features we want to remove:

# Run HighPearsonCorrelation test with our balanced dataset as input and return a result object
corr_result = vm.tests.run_test(
    test_id="validmind.data_validation.HighPearsonCorrelation",
    params={"max_threshold": 0.3},
    inputs={"dataset": vm_balanced_raw_dataset},
)

❌ High Pearson Correlation

High Pearson Correlation is designed to identify highly correlated feature pairs in a dataset, suggesting feature redundancy or multicollinearity. The primary purpose of this test is to measure the linear relationship between features, which can indicate potential issues such as feature redundancy or multicollinearity that may affect the performance and interpretability of machine learning models.

The test operates by calculating pairwise Pearson correlations for all features in the dataset. It then sorts these correlations, removing duplicates and self-correlations. The Pearson correlation coefficient measures the strength and direction of the linear relationship between two variables, ranging from -1 to 1. A value close to 1 indicates a strong positive linear relationship, while a value close to -1 indicates a strong negative linear relationship. A value around 0 suggests no linear relationship. The test assigns a Pass or Fail status based on whether the absolute value of the correlation coefficient exceeds a pre-set threshold, which is 0.3 by default. The test also returns the top n strongest correlations, which is configurable, with a default of 10.

The primary advantages of this test include its ability to quickly and simply identify relationships between feature pairs, providing a transparent output that displays pairs of correlated variables, the Pearson correlation coefficient, and a Pass or Fail status for each. This transparency aids in the early identification of potential multicollinearity issues that may disrupt model training. By highlighting these relationships, the test allows developers and risk management teams to address potential impacts on model performance and interpretability, ensuring that the model remains robust and reliable.

It should be noted that the test is limited to identifying linear relationships and does not account for nonlinear dependencies. It is also sensitive to outliers, which can significantly affect the correlation coefficient. Additionally, the test only identifies redundancy within feature pairs and may not detect more complex relationships involving three or more variables. High correlation coefficients exceeding the threshold indicate a high risk of multicollinearity and model overfitting, as well as potential redundancy that can undermine the model's interpretability.

This test shows the results in a tabular format, where each row represents a pair of features with their corresponding Pearson correlation coefficient and Pass/Fail status. The table includes columns for the feature pairs, the calculated correlation coefficient, and whether the correlation passes the threshold test. The coefficients range from -0.1715 to 0.3512, with the threshold set at 0.3. Notably, the pair (Age, Exited) has a coefficient of 0.3512, which exceeds the threshold and is marked as Fail, indicating a strong linear relationship. Other pairs, such as (IsActiveMember, Exited) and (Balance, NumOfProducts), have coefficients below the threshold and are marked as Pass, suggesting weaker linear relationships.

The test results reveal the following key insights:

  • Age and Exited Correlation: The pair (Age, Exited) shows a correlation coefficient of 0.3512, which exceeds the threshold, indicating a significant linear relationship that may suggest multicollinearity or feature redundancy.
  • Weak Correlations Among Other Features: Other feature pairs, such as (IsActiveMember, Exited) and (Balance, NumOfProducts), have correlation coefficients well below the threshold, indicating weaker linear relationships and less risk of multicollinearity.
  • Overall Low Correlation: Most feature pairs exhibit low correlation coefficients, suggesting that the dataset generally lacks strong linear relationships between features, which is favorable for model interpretability and performance.

Based on these results, the dataset shows a generally low level of linear correlation among most feature pairs, with the exception of the (Age, Exited) pair, which indicates a potential area of concern for multicollinearity. This suggests that while the dataset is largely free from strong linear dependencies, attention should be given to the Age and Exited features to assess their impact on model performance and interpretability. The overall low correlation among other features supports the robustness of the model, minimizing the risk of overfitting and enhancing the clarity of individual feature contributions.

Parameters:

{
  "max_threshold": 0.3
}
            

Tables

Columns Coefficient Pass/Fail
(Age, Exited) 0.3512 Fail
(IsActiveMember, Exited) -0.1715 Pass
(Balance, NumOfProducts) -0.1700 Pass
(Balance, Exited) 0.1493 Pass
(CreditScore, Exited) -0.0480 Pass
(NumOfProducts, Exited) -0.0471 Pass
(NumOfProducts, IsActiveMember) 0.0469 Pass
(Tenure, EstimatedSalary) 0.0446 Pass
(Tenure, HasCrCard) 0.0410 Pass
(CreditScore, IsActiveMember) 0.0351 Pass
# From result object, extract table from `corr_result.tables`
features_df = corr_result.tables[0].data
features_df
Columns Coefficient Pass/Fail
0 (Age, Exited) 0.3512 Fail
1 (IsActiveMember, Exited) -0.1715 Pass
2 (Balance, NumOfProducts) -0.1700 Pass
3 (Balance, Exited) 0.1493 Pass
4 (CreditScore, Exited) -0.0480 Pass
5 (NumOfProducts, Exited) -0.0471 Pass
6 (NumOfProducts, IsActiveMember) 0.0469 Pass
7 (Tenure, EstimatedSalary) 0.0446 Pass
8 (Tenure, HasCrCard) 0.0410 Pass
9 (CreditScore, IsActiveMember) 0.0351 Pass
# Extract list of features that failed the test
high_correlation_features = features_df[features_df["Pass/Fail"] == "Fail"]["Columns"].tolist()
high_correlation_features
['(Age, Exited)']
# Extract feature names from the list of strings
high_correlation_features = [feature.split(",")[0].strip("()") for feature in high_correlation_features]
high_correlation_features
['Age']
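
Note that this extraction keeps only the first feature of each failing pair, which in this case is Age. If you'd rather review both members of every failing pair before deciding what to drop (remembering that Exited is the target column and should never be removed), here's a purely illustrative variation:

# Illustrative only: collect both feature names from each failing pair,
# excluding the target column, before deciding which to drop
failed_pairs = features_df[features_df["Pass/Fail"] == "Fail"]["Columns"].tolist()
candidate_features = sorted(
    {name.strip("() ") for pair in failed_pairs for name in pair.split(",")}
    - {"Exited"}
)
candidate_features  # ['Age'] in this case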

We can then re-initialize the dataset with the highly correlated features removed, using a different input_id, and re-run the test to confirm the issue is resolved:

# Remove the highly correlated features from the dataset
balanced_raw_no_age_df = balanced_raw_df.drop(columns=high_correlation_features)

# Re-initialize the dataset object
vm_raw_dataset_preprocessed = vm.init_dataset(
    dataset=balanced_raw_no_age_df,
    input_id="raw_dataset_preprocessed",
    target_column="Exited",
)
# Re-run the test with the reduced feature set
corr_result = vm.tests.run_test(
    test_id="validmind.data_validation.HighPearsonCorrelation",
    params={"max_threshold": 0.3},
    inputs={"dataset": vm_raw_dataset_preprocessed},
)

✅ High Pearson Correlation

High Pearson Correlation is designed to identify highly correlated feature pairs in a dataset, suggesting feature redundancy or multicollinearity. The primary purpose of this test is to measure the linear relationship between features, which can indicate potential issues such as multicollinearity that may affect the performance and interpretability of machine learning models.

The test operates by calculating pairwise Pearson correlations for all features in the dataset. It measures the strength and direction of the linear relationship between two variables, with the correlation coefficient ranging from -1 to 1. A value of 1 indicates a perfect positive linear relationship, -1 indicates a perfect negative linear relationship, and 0 indicates no linear relationship. The test sorts these correlations, removing duplicates and self-correlations, and evaluates them against a pre-set threshold, which is 0.3 by default. If the absolute value of a correlation exceeds this threshold, it suggests a significant linear relationship. The test then returns the top n strongest correlations, providing a Pass or Fail status based on the threshold.

The primary advantages of this test include its ability to quickly and effectively identify linear relationships between feature pairs, which is crucial for detecting multicollinearity early in the model development process. This transparency allows developers to understand which features may be redundant, potentially simplifying the model and improving its interpretability. By highlighting these relationships, the test aids in preventing overfitting and ensures that the model's predictions are based on authentic and independent variables. This is particularly useful in scenarios where model performance and clarity are critical, such as in financial or regulatory environments.

It should be noted that the test is limited to identifying linear relationships and does not account for nonlinear dependencies, which may also impact model performance. Additionally, the Pearson correlation is sensitive to outliers, which can skew the results and lead to misleading interpretations. The test only examines pairwise relationships, potentially missing more complex interactions among three or more variables. High correlation coefficients indicate a risk of multicollinearity, which can lead to overfitting and reduce the model's ability to generalize to new data.

This test shows a table format output, listing feature pairs, their correlation coefficients, and a Pass or Fail status based on the threshold of 0.3. Each row represents a pair of features, with the "Columns" field indicating the feature pair, the "Coefficient" field showing the calculated Pearson correlation, and the "Pass/Fail" field indicating whether the correlation exceeds the threshold. The coefficients range from -0.1715 to 0.1493, all of which are below the threshold, resulting in a Pass status for each pair. Notable observations include the highest correlation between "IsActiveMember" and "Exited" at -0.1715, and the lowest between "HasCrCard" and "IsActiveMember" at -0.0305. The results suggest that none of the feature pairs exhibit a strong linear relationship, indicating low risk of multicollinearity.

The test results reveal the following key insights:

  • Low Correlation Across Features: All feature pairs have correlation coefficients below the threshold of 0.3, indicating a low risk of multicollinearity.
  • Negative Correlation Observed: The strongest correlation is negative, between "IsActiveMember" and "Exited" with a coefficient of -0.1715, suggesting a slight inverse relationship.
  • Minimal Linear Relationships: The coefficients for all pairs are close to zero, with the highest positive correlation being 0.1493 between "Balance" and "Exited", indicating minimal linear relationships among the features.

Based on these results, the dataset exhibits low levels of linear correlation among the features, suggesting that multicollinearity is not a significant concern. The absence of strong correlations implies that the features are largely independent, which is beneficial for model interpretability and performance. This independence reduces the risk of overfitting and ensures that the model's predictions are based on distinct and meaningful variables. The results provide confidence that the features can be used effectively in model development without the need for extensive feature reduction or transformation to address multicollinearity.

Parameters:

{
  "max_threshold": 0.3
}
            

Tables

Columns Coefficient Pass/Fail
(IsActiveMember, Exited) -0.1715 Pass
(Balance, NumOfProducts) -0.1700 Pass
(Balance, Exited) 0.1493 Pass
(CreditScore, Exited) -0.0480 Pass
(NumOfProducts, Exited) -0.0471 Pass
(NumOfProducts, IsActiveMember) 0.0469 Pass
(Tenure, EstimatedSalary) 0.0446 Pass
(Tenure, HasCrCard) 0.0410 Pass
(CreditScore, IsActiveMember) 0.0351 Pass
(HasCrCard, IsActiveMember) -0.0305 Pass

Split the preprocessed dataset

With our raw dataset rebalanced and the highly correlated features removed, let's now split our dataset into train and test sets in preparation for model evaluation testing:

# Encode categorical features in the dataset
balanced_raw_no_age_df = pd.get_dummies(
    balanced_raw_no_age_df, columns=["Geography", "Gender"], drop_first=True
)
balanced_raw_no_age_df.head()
CreditScore Tenure Balance NumOfProducts HasCrCard IsActiveMember EstimatedSalary Exited Geography_Germany Geography_Spain Gender_Male
4990 696 9 0.00 1 0 0 10883.52 0 False True True
2539 635 4 140197.18 1 1 1 142935.83 0 False True True
6451 714 9 0.00 2 1 0 129192.55 0 False True True
5786 779 0 133295.98 1 1 0 22832.71 1 False True False
7 376 4 115046.74 4 1 0 119346.88 1 True False False
from sklearn.model_selection import train_test_split

# Split the dataset into train and test
train_df, test_df = train_test_split(balanced_raw_no_age_df, test_size=0.20)

X_train = train_df.drop("Exited", axis=1)
y_train = train_df["Exited"]
X_test = test_df.drop("Exited", axis=1)
y_test = test_df["Exited"]
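
Note that the split above is neither seeded nor stratified, so exact counts will vary between runs. If you want a reproducible, class-balanced split, you could optionally pass random_state and stratify instead; a sketch only — if you use it, re-derive X_train, y_train, X_test, and y_test afterward and re-initialize the split datasets below:

# Optional: reproducible, stratified split that preserves the Exited ratio
train_df, test_df = train_test_split(
    balanced_raw_no_age_df,
    test_size=0.20,
    random_state=42,
    stratify=balanced_raw_no_age_df["Exited"],
)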
# Initialize the split datasets
vm_train_ds = vm.init_dataset(
    input_id="train_dataset_final",
    dataset=train_df,
    target_column="Exited",
)

vm_test_ds = vm.init_dataset(
    input_id="test_dataset_final",
    dataset=test_df,
    target_column="Exited",
)

Import the champion model

With our raw dataset assessed and preprocessed, let's import the champion model submitted by the model development team as a .pkl file: lr_model_champion.pkl

# Import the champion model
import pickle as pkl

with open("lr_model_champion.pkl", "rb") as f:
    log_reg = pkl.load(f)
/opt/hostedtoolcache/Python/3.11.14/x64/lib/python3.11/site-packages/sklearn/base.py:463: InconsistentVersionWarning:

Trying to unpickle estimator LogisticRegression from version 1.3.2 when using version 1.8.0. This might lead to breaking code or invalid results. Use at your own risk. For more info please refer to:
https://scikit-learn.org/stable/model_persistence.html#security-maintainability-limitations
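
The InconsistentVersionWarning above appears because the champion model was pickled with scikit-learn 1.3.2 but unpickled under a newer version. If you want to double-check your environment, here's a minimal optional check — the pinned install shown in the comment is only a suggestion for reproducing the development environment:

# Optional: compare the installed scikit-learn version against the one used
# to pickle the champion model (1.3.2, per the warning above)
import sklearn

print(f"scikit-learn in this environment: {sklearn.__version__}")

# If needed, pin to the training version to avoid the warning, for example:
# %pip install -q "scikit-learn==1.3.2"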

Training a potential challenger model

We're curious how an alternate model compares to our champion model, so let's train a challenger model as a basis for our testing.

Our champion logistic regression model is a simpler, parametric model that assumes a linear relationship between the independent variables and the log-odds of the outcome. While logistic regression may not capture complex patterns as effectively, it offers a high degree of interpretability and is easier to explain to stakeholders. However, model risk is not assessed on a single factor in isolation, but rather by weighing trade-offs between predictive performance, ease of interpretability, and overall alignment with business objectives.

Random forest classification model

A random forest classification model is an ensemble machine learning algorithm that uses multiple decision trees to classify data. In ensemble learning, multiple models are combined to improve prediction accuracy and robustness.

Random forest classification models generally have higher accuracy because they capture complex, non-linear relationships, but as a result they lack transparency in their predictions.

# Import the Random Forest Classification model
from sklearn.ensemble import RandomForestClassifier

# Create the model instance with 50 decision trees
rf_model = RandomForestClassifier(
    n_estimators=50,
    random_state=42,
)

# Train the model
rf_model.fit(X_train, y_train)
RandomForestClassifier(n_estimators=50, random_state=42)
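Before handing the challenger over to the formal ValidMind tests below, you can optionally spot-check its accuracy on the held-out test set; this is only a quick sanity check, and the proper evaluation follows:

# Quick sanity check: mean accuracy of the challenger on the test set
rf_model.score(X_test, y_test)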

Initializing the model objects

Initialize the model objects

In addition to the initialized datasets, you'll also need to initialize a ValidMind model object (vm_model) for each of our two models, which can then be passed to other functions for analysis and tests on the data.

You simply initialize this model object with vm.init_model():

# Initialize the champion logistic regression model
vm_log_model = vm.init_model(
    log_reg,
    input_id="log_model_champion",
)

# Initialize the challenger random forest classification model
vm_rf_model = vm.init_model(
    rf_model,
    input_id="rf_model",
)

Assign predictions

With our model objects initialized, we'll move on to assigning both the predictive probabilities coming directly from each model, and the binary class predictions obtained by applying a cutoff threshold to those probabilities.

  • The assign_predictions() method from the Dataset object can link existing predictions to any number of models.
  • This method links the model's class prediction values and probabilities to our vm_train_ds and vm_test_ds datasets.

If no prediction values are passed, the method will compute predictions automatically:

# Champion — Logistic regression model
vm_train_ds.assign_predictions(model=vm_log_model)
vm_test_ds.assign_predictions(model=vm_log_model)

# Challenger — Random forest classification model
vm_train_ds.assign_predictions(model=vm_rf_model)
vm_test_ds.assign_predictions(model=vm_rf_model)
2025-12-31 22:26:54,341 - INFO(validmind.vm_models.dataset.utils): Running predict_proba()... This may take a while
2025-12-31 22:26:54,343 - INFO(validmind.vm_models.dataset.utils): Done running predict_proba()
2025-12-31 22:26:54,344 - INFO(validmind.vm_models.dataset.utils): Running predict()... This may take a while
2025-12-31 22:26:54,347 - INFO(validmind.vm_models.dataset.utils): Done running predict()
2025-12-31 22:26:54,349 - INFO(validmind.vm_models.dataset.utils): Running predict_proba()... This may take a while
2025-12-31 22:26:54,351 - INFO(validmind.vm_models.dataset.utils): Done running predict_proba()
2025-12-31 22:26:54,352 - INFO(validmind.vm_models.dataset.utils): Running predict()... This may take a while
2025-12-31 22:26:54,354 - INFO(validmind.vm_models.dataset.utils): Done running predict()
2025-12-31 22:26:54,357 - INFO(validmind.vm_models.dataset.utils): Running predict_proba()... This may take a while
2025-12-31 22:26:54,379 - INFO(validmind.vm_models.dataset.utils): Done running predict_proba()
2025-12-31 22:26:54,381 - INFO(validmind.vm_models.dataset.utils): Running predict()... This may take a while
2025-12-31 22:26:54,403 - INFO(validmind.vm_models.dataset.utils): Done running predict()
2025-12-31 22:26:54,406 - INFO(validmind.vm_models.dataset.utils): Running predict_proba()... This may take a while
2025-12-31 22:26:54,418 - INFO(validmind.vm_models.dataset.utils): Done running predict_proba()
2025-12-31 22:26:54,419 - INFO(validmind.vm_models.dataset.utils): Running predict()... This may take a while
2025-12-31 22:26:54,432 - INFO(validmind.vm_models.dataset.utils): Done running predict()
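
For intuition, the predictions assigned above are equivalent to what you'd get from scikit-learn directly; a minimal sketch for the champion on the test split (ValidMind manages the resulting prediction columns inside the dataset objects for you):

# Roughly what assign_predictions() computes when no precomputed values are passed
champion_probs = log_reg.predict_proba(X_test)[:, 1]  # class-1 probabilities
champion_preds = log_reg.predict(X_test)              # binary class predictions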

Running model evaluation tests

With our setup complete, let's run the rest of our validation tests. Since we have already verified the data quality of the dataset used to train our champion model, we will now focus on comprehensive performance evaluations of both the champion and challenger models.

Run model performance tests

Let's run some performance tests, beginning with independent testing of our champion logistic regression model, then moving on to our potential challenger model.

Use vm.tests.list_tests() to identify all the model performance tests for classification:


vm.tests.list_tests(tags=["model_performance"], task="classification")
ID Name Description Has Figure Has Table Required Inputs Params Tags Tasks
validmind.model_validation.sklearn.CalibrationCurve Calibration Curve Evaluates the calibration of probability estimates by comparing predicted probabilities against observed... True False ['model', 'dataset'] {'n_bins': {'type': 'int', 'default': 10}} ['sklearn', 'model_performance', 'classification'] ['classification']
validmind.model_validation.sklearn.ClassifierPerformance Classifier Performance Evaluates performance of binary or multiclass classification models using precision, recall, F1-Score, accuracy,... False True ['dataset', 'model'] {'average': {'type': 'str', 'default': 'macro'}} ['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance'] ['classification', 'text_classification']
validmind.model_validation.sklearn.ConfusionMatrix Confusion Matrix Evaluates and visually represents the classification ML model's predictive performance using a Confusion Matrix... True False ['dataset', 'model'] {'threshold': {'type': 'float', 'default': 0.5}} ['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance', 'visualization'] ['classification', 'text_classification']
validmind.model_validation.sklearn.HyperParametersTuning Hyper Parameters Tuning Performs exhaustive grid search over specified parameter ranges to find optimal model configurations... False True ['model', 'dataset'] {'param_grid': {'type': 'dict', 'default': None}, 'scoring': {'type': 'Union', 'default': None}, 'thresholds': {'type': 'Union', 'default': None}, 'fit_params': {'type': 'dict', 'default': None}} ['sklearn', 'model_performance'] ['clustering', 'classification']
validmind.model_validation.sklearn.MinimumAccuracy Minimum Accuracy Checks if the model's prediction accuracy meets or surpasses a specified threshold.... False True ['dataset', 'model'] {'min_threshold': {'type': 'float', 'default': 0.7}} ['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance'] ['classification', 'text_classification']
validmind.model_validation.sklearn.MinimumF1Score Minimum F1 Score Assesses if the model's F1 score on the validation set meets a predefined minimum threshold, ensuring balanced... False True ['dataset', 'model'] {'min_threshold': {'type': 'float', 'default': 0.5}} ['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance'] ['classification', 'text_classification']
validmind.model_validation.sklearn.MinimumROCAUCScore Minimum ROCAUC Score Validates model by checking if the ROC AUC score meets or surpasses a specified threshold.... False True ['dataset', 'model'] {'min_threshold': {'type': 'float', 'default': 0.5}} ['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance'] ['classification', 'text_classification']
validmind.model_validation.sklearn.ModelsPerformanceComparison Models Performance Comparison Evaluates and compares the performance of multiple Machine Learning models using various metrics like accuracy,... False True ['dataset', 'models'] {} ['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance', 'model_comparison'] ['classification', 'text_classification']
validmind.model_validation.sklearn.PopulationStabilityIndex Population Stability Index Assesses the Population Stability Index (PSI) to quantify the stability of an ML model's predictions across... True True ['datasets', 'model'] {'num_bins': {'type': 'int', 'default': 10}, 'mode': {'type': 'str', 'default': 'fixed'}} ['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance'] ['classification', 'text_classification']
validmind.model_validation.sklearn.PrecisionRecallCurve Precision Recall Curve Evaluates the precision-recall trade-off for binary classification models and visualizes the Precision-Recall curve.... True False ['model', 'dataset'] {} ['sklearn', 'binary_classification', 'model_performance', 'visualization'] ['classification', 'text_classification']
validmind.model_validation.sklearn.ROCCurve ROC Curve Evaluates binary classification model performance by generating and plotting the Receiver Operating Characteristic... True False ['model', 'dataset'] {} ['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance', 'visualization'] ['classification', 'text_classification']
validmind.model_validation.sklearn.RegressionErrors Regression Errors Assesses the performance and error distribution of a regression model using various error metrics.... False True ['model', 'dataset'] {} ['sklearn', 'model_performance'] ['regression', 'classification']
validmind.model_validation.sklearn.TrainingTestDegradation Training Test Degradation Tests if model performance degradation between training and test datasets exceeds a predefined threshold.... False True ['datasets', 'model'] {'max_threshold': {'type': 'float', 'default': 0.1}} ['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance', 'visualization'] ['classification', 'text_classification']
validmind.model_validation.statsmodels.GINITable GINI Table Evaluates classification model performance using AUC, GINI, and KS metrics for training and test datasets.... False True ['dataset', 'model'] {} ['model_performance'] ['classification']
validmind.ongoing_monitoring.CalibrationCurveDrift Calibration Curve Drift Evaluates changes in probability calibration between reference and monitoring datasets.... True True ['datasets', 'model'] {'n_bins': {'type': 'int', 'default': 10}, 'drift_pct_threshold': {'type': 'float', 'default': 20}} ['sklearn', 'binary_classification', 'model_performance', 'visualization'] ['classification', 'text_classification']
validmind.ongoing_monitoring.ClassDiscriminationDrift Class Discrimination Drift Compares classification discrimination metrics between reference and monitoring datasets.... False True ['datasets', 'model'] {'drift_pct_threshold': {'type': '_empty', 'default': 20}} ['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance'] ['classification', 'text_classification']
validmind.ongoing_monitoring.ClassificationAccuracyDrift Classification Accuracy Drift Compares classification accuracy metrics between reference and monitoring datasets.... False True ['datasets', 'model'] {'drift_pct_threshold': {'type': '_empty', 'default': 20}} ['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance'] ['classification', 'text_classification']
validmind.ongoing_monitoring.ConfusionMatrixDrift Confusion Matrix Drift Compares confusion matrix metrics between reference and monitoring datasets.... False True ['datasets', 'model'] {'drift_pct_threshold': {'type': '_empty', 'default': 20}} ['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance'] ['classification', 'text_classification']
validmind.ongoing_monitoring.ROCCurveDrift ROC Curve Drift Compares ROC curves between reference and monitoring datasets.... True False ['datasets', 'model'] {} ['sklearn', 'binary_classification', 'model_performance', 'visualization'] ['classification', 'text_classification']

We'll isolate the specific tests we want to run in mpt:

As we learned in the previous notebook 2 — Start the model validation process, you can use a custom result_id to tag the individual result with a unique identifier by appending this result_id to the test_id with a : separator. We'll append an identifier for our champion model here:

mpt = [
    "validmind.model_validation.sklearn.ClassifierPerformance:logreg_champion",
    "validmind.model_validation.sklearn.ConfusionMatrix:logreg_champion",
    "validmind.model_validation.sklearn.MinimumAccuracy:logreg_champion",
    "validmind.model_validation.sklearn.MinimumF1Score:logreg_champion",
    "validmind.model_validation.sklearn.ROCCurve:logreg_champion"
]

Evaluate performance of the champion model

Now, let's run and log our batch of model performance tests using our testing dataset (vm_test_ds) for our champion model:

  • The test set serves as a proxy for real-world data, providing an unbiased estimate of model performance since it was not used during training or tuning.
  • The test set also acts as protection against selection bias and model tweaking, giving a final, more unbiased checkpoint.
for test in mpt:
    vm.tests.run_test(
        test,
        inputs={
            "dataset": vm_test_ds,
            "model": vm_log_model,
        },
    ).log()

Classifier Performance Logreg Champion

Classifier Performance: Logreg Champion is designed to evaluate the performance of classification models by calculating key metrics such as precision, recall, F1-Score, accuracy, and ROC AUC. These metrics provide a comprehensive view of how well a model distinguishes between classes, making it suitable for both binary and multiclass classification tasks.

The test operates by utilizing scikit-learn's classification_report to compute precision, recall, F1-Score, and accuracy. Precision measures the proportion of true positive results in all positive predictions, indicating the model's ability to avoid false positives. Recall, or sensitivity, assesses the proportion of true positive results in all actual positives, reflecting the model's ability to capture all relevant instances. The F1-Score, a harmonic mean of precision and recall, balances these two metrics, providing a single score that accounts for both false positives and false negatives. Accuracy represents the overall correctness of the model's predictions, calculated as the ratio of correctly predicted instances to the total instances. The ROC AUC score, derived from the Receiver Operating Characteristic curve, quantifies the model's ability to distinguish between classes, with values ranging from 0 to 1, where 1 indicates perfect discrimination and 0.5 suggests no discrimination.

The primary advantages of this test include its versatility in handling both binary and multiclass models, making it a robust tool for various classification tasks. By employing a range of performance metrics, it offers a detailed analysis of model behavior, highlighting strengths and weaknesses in different aspects of classification. The inclusion of ROC AUC is particularly beneficial for evaluating models on unbalanced datasets, as it provides insight into the model's discriminatory power across different threshold settings. This comprehensive approach ensures that the test can effectively assess model performance in diverse scenarios, aiding in the identification of areas for improvement.

It should be noted that the test has limitations, such as assuming correctly identified labels for binary classification models, which may not always be the case in real-world applications. It is specifically designed for classification models and is not applicable to regression models, limiting its scope to classification tasks. Additionally, the test's insights may be constrained if the test dataset does not adequately represent real-world scenarios, potentially leading to over- or underestimation of model performance. Signs of high risk include low precision, recall, F1-Score, accuracy, and ROC AUC values, which indicate poor model performance and potential issues with class imbalance or model calibration.

This test shows the results in two tables: one for precision, recall, and F1-Score, and another for accuracy and ROC AUC. The first table presents these metrics for each class, as well as weighted and macro averages, allowing for a detailed examination of model performance across different classes. Precision, recall, and F1-Score values are provided for Class 0 and Class 1, with weighted and macro averages offering a summary view. The second table displays the overall accuracy and ROC AUC score, providing a snapshot of the model's general performance. The precision, recall, and F1-Score values range from 0 to 1, with higher values indicating better performance. The accuracy and ROC AUC scores also range from 0 to 1, with values closer to 1 suggesting better model performance. Notable observations include the relatively balanced precision and recall scores across classes, with a slight variation in F1-Score, and a ROC AUC score that suggests moderate discriminatory power.

The test results reveal the following key insights:

  • Balanced Class Performance: The precision, recall, and F1-Score for Class 0 and Class 1 are relatively balanced, with values around 0.64, indicating consistent performance across classes.
  • Moderate Overall Performance: The weighted and macro averages for precision, recall, and F1-Score are approximately 0.64, suggesting a moderate level of performance when considering all classes.
  • Accuracy and Discriminatory Power: The accuracy of 0.6399 and ROC AUC of 0.7044 indicate that the model has a moderate ability to correctly classify instances and distinguish between classes.

Based on these results, the model demonstrates a moderate level of performance, with balanced precision and recall across classes and a reasonable F1-Score. The accuracy and ROC AUC scores suggest that the model is capable of distinguishing between classes to a certain extent, though there is room for improvement. The balanced performance across classes indicates that the model does not favor one class over another, which is a positive aspect in scenarios where class balance is crucial. However, the moderate scores across all metrics suggest that further refinement and tuning may be necessary to enhance the model's overall performance and discriminatory power.

Tables

Precision, Recall, and F1

Class Precision Recall F1
0 0.6555 0.6418 0.6486
1 0.6238 0.6378 0.6307
Weighted Average 0.6402 0.6399 0.6400
Macro Average 0.6397 0.6398 0.6397

Accuracy and ROC AUC

Metric Value
Accuracy 0.6399
ROC AUC 0.7044
2025-12-31 22:27:08,070 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.model_validation.sklearn.ClassifierPerformance:logreg_champion does not exist in model's document

Confusion Matrix Logreg Champion

Confusion Matrix: logreg champion is designed to evaluate and visually represent the classification ML model's predictive performance using a Confusion Matrix heatmap. The primary purpose of this test is to assess how well the model can correctly classify True Positives, True Negatives, False Positives, and False Negatives, which are fundamental aspects of model accuracy.

The test operates by comparing the predicted results (y_test_predict) from the classification model against the actual values (y_test_true). A confusion matrix is constructed using the unique labels from y_test_true, utilizing scikit-learn's metrics. This matrix is then visually rendered with Plotly's create_annotated_heatmap function, providing a two-dimensional graphical representation of the model's performance. The matrix highlights the distribution of True Positives (TP), True Negatives (TN), False Positives (FP), and False Negatives (FN), offering insights into the model's classification capabilities. The values in the matrix indicate the number of instances for each category, with higher values for TP and TN generally indicating better model performance.

The primary advantages of this test include its ability to provide a simplified yet comprehensive visual snapshot of the classification model's predictive performance. It distinctly highlights True Positives, True Negatives, False Positives, and False Negatives, making it easier to identify potential areas for improvement. The matrix is particularly useful for multi-class classification problems, offering a straightforward view of complex model performances. Additionally, it aids in understanding the different types of errors the model could make, providing insights into Type-I and Type-II errors, which are crucial for refining model accuracy.

It should be noted that the test has limitations, particularly in cases of unbalanced classes, where the confusion matrix might misinterpret the accuracy of a model that predominantly predicts the majority class. It does not provide a single unified statistic to evaluate overall model performance, as different aspects are assessed separately. The matrix serves primarily as a descriptive tool and lacks the capability for statistical hypothesis testing. There is also a risk of misinterpretation, as the matrix does not directly provide precision, recall, or F1-score data, which must be computed separately for a more comprehensive evaluation.

This test shows a confusion matrix plot that visually represents the classification results. The matrix is divided into four quadrants: True Positives (199), True Negatives (215), False Positives (120), and False Negatives (113). The x-axis represents the predicted labels, while the y-axis represents the true labels. Each cell in the matrix indicates the count of instances for each classification category. The heatmap uses color intensity to reflect the magnitude of these counts, with darker shades indicating higher values. This visualization allows for quick assessment of the model's performance, highlighting areas where the model excels or struggles. The balance between True Positives and True Negatives versus False Positives and False Negatives provides insights into the model's accuracy and error distribution.

The test results reveal the following key insights:

  • Balanced True Positives and True Negatives: The model shows a relatively balanced number of True Positives (199) and True Negatives (215), indicating a reasonable level of accuracy in correctly identifying both classes.
  • Significant False Positives and False Negatives: There are notable counts of False Positives (120) and False Negatives (113), suggesting areas where the model's predictions could be improved to reduce misclassification.
  • Error Distribution: The distribution of errors between False Positives and False Negatives is relatively even, which may indicate a consistent pattern in the model's misclassification behavior.

Based on these results, the confusion matrix provides a clear view of the model's classification performance, highlighting both strengths and areas for improvement. The balance between True Positives and True Negatives suggests that the model is reasonably effective in identifying the correct classes. However, the presence of significant False Positives and False Negatives indicates potential areas for refinement. These insights can guide further model tuning and evaluation, focusing on reducing misclassification rates to enhance overall predictive accuracy. The visualization effectively communicates the model's behavior, offering a foundation for deeper analysis and understanding of its performance characteristics.

Figures

ValidMind Figure validmind.model_validation.sklearn.ConfusionMatrix:logreg_champion:52d4
2025-12-31 22:27:29,776 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.model_validation.sklearn.ConfusionMatrix:logreg_champion does not exist in model's document

❌ Minimum Accuracy Logreg Champion

Minimum Accuracy: logreg_champion is designed to ensure that the model's prediction accuracy meets or surpasses a specified threshold, which is crucial for validating the model's performance. The primary purpose of this test is to confirm that the model can reliably predict outcomes with a level of accuracy that is deemed acceptable for its intended application. This is particularly important in scenarios where accurate predictions are critical, such as in risk assessment or decision-making processes.

The test operates by calculating the model's accuracy score using the accuracy_score method from sklearn, which compares the true labels (y_true) with the predicted labels (class_pred). The accuracy score is a straightforward metric that represents the proportion of correct predictions out of the total number of predictions made by the model. This score is then compared against a predefined threshold, typically set at 0.7, to determine if the model's performance is satisfactory. The accuracy score ranges from 0 to 1, where a score closer to 1 indicates better performance. A score below the threshold suggests that the model may not be performing adequately, while a score above the threshold indicates acceptable performance.

The primary advantages of this test include its simplicity and ease of interpretation, making it a useful tool for quickly assessing the overall performance of a model. It is particularly beneficial in situations where the classes are balanced, as it provides a clear indication of the model's ability to correctly classify instances across all categories. Additionally, the test's versatility allows it to be applied to both binary and multiclass classification tasks, making it a valuable component of a comprehensive model evaluation strategy.

It should be noted that the test has limitations, particularly in scenarios where the dataset is imbalanced. In such cases, the accuracy score may be misleading, as it tends to favor the majority class, potentially giving an inaccurate perception of the model's performance. This limitation highlights the importance of considering additional metrics, such as precision and recall, to gain a more comprehensive understanding of the model's capabilities. Furthermore, the test does not account for the model's ability to manage false positives or false negatives, which can be critical in certain applications.

This test shows the results in a tabular format, presenting key metrics such as the model's accuracy score, the threshold used for comparison, and the pass/fail status of the test. The table indicates that the model achieved an accuracy score of 0.6399, which is below the threshold of 0.7, resulting in a "Fail" status. This suggests that the model's performance does not meet the minimum required standard for accuracy. The table is straightforward to interpret, with each column clearly labeled to provide a quick overview of the test outcome. The accuracy score, expressed as a decimal, represents the proportion of correct predictions, while the threshold indicates the minimum acceptable level of performance. The pass/fail status provides an immediate indication of whether the model meets the required standard.

The test results reveal the following key insights:

  • Model Fails to Meet Accuracy Threshold: The model's accuracy score of 0.6399 falls short of the 0.7 threshold, indicating that it does not meet the minimum performance criteria.
  • Performance Below Acceptable Standard: The "Fail" status highlights that the model's predictions are not sufficiently accurate, which could impact its reliability in practical applications.

Based on these results, the model's current accuracy level is insufficient to meet the predefined threshold, suggesting that it may not be suitable for deployment in its current state. The failure to achieve the required accuracy indicates potential issues with the model's ability to generalize from the training data to unseen data. This observation underscores the need for further investigation into the model's design, data preprocessing, or feature selection processes to identify areas for improvement. The insights gained from this test provide a clear indication of the model's current limitations and highlight the importance of addressing these issues to enhance its predictive capabilities.

Tables

Score Threshold Pass/Fail
0.6399 0.7 Fail
2025-12-31 22:27:39,520 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.model_validation.sklearn.MinimumAccuracy:logreg_champion does not exist in model's document

✅ Minimum F1 Score Logreg Champion

Minimum F1 Score: Logreg Champion is designed to ensure that the model's F1 score on the validation set meets or exceeds a predefined threshold, thereby confirming balanced performance between precision and recall. This test is particularly crucial in classification tasks where the distribution of positive and negative classes is skewed, as it provides a more comprehensive measure of a model's effectiveness than accuracy alone.

The test operates by calculating the F1 score using scikit-learn's metrics in Python. For binary classification problems, the F1 score is computed directly, while for multi-class problems, a macro averaging approach is used. The F1 score is a harmonic mean of precision and recall, where precision measures the proportion of true positive results in all positive predictions, and recall measures the proportion of true positive results in all actual positive cases. The F1 score ranges from 0 to 1, with 1 indicating perfect precision and recall balance. A score above the threshold indicates satisfactory model performance, while a score below suggests potential issues in balancing false positives and false negatives.

The primary advantages of this test include its ability to provide a balanced measure of a model's performance by accounting for both false positives and false negatives. This is particularly useful in scenarios with imbalanced class distributions, where accuracy can be misleading. The flexibility in setting the threshold value allows practitioners to define minimum acceptable performance standards tailored to specific business needs or regulatory requirements. This adaptability makes the test a valuable tool for ensuring that models meet predefined performance criteria.

It should be noted that the F1 score assumes an equal cost for false positives and false negatives, which may not align with all real-world scenarios. Additionally, while the F1 score is a robust measure of model performance, it may not be suitable for all types of models and machine learning tasks. Practitioners might need to rely on other metrics such as precision, recall, or the ROC-AUC score that align more closely with specific requirements. Furthermore, if a model returns an F1 score below the established threshold, it is regarded as high risk, indicating that the model may not be effectively identifying positive classes while minimizing false positives.

This test shows the results in a tabular format, presenting the F1 score achieved by the model, the predefined threshold, and a pass/fail status. The table includes a single row with columns labeled "Score," "Threshold," and "Pass/Fail." The "Score" column displays the F1 score obtained from the validation dataset, which is 0.6307 in this case. The "Threshold" column indicates the minimum acceptable F1 score, set at 0.5. The "Pass/Fail" column shows whether the model's performance meets the threshold, with a "Pass" indicating that the model's F1 score is above the threshold. This table provides a clear and concise summary of the model's performance against the predefined criteria.

The test results reveal the following key insights:

  • Model Performance Exceeds Threshold: The model achieved an F1 score of 0.6307, which is above the predefined threshold of 0.5, indicating that the model successfully balances precision and recall on the validation dataset.
  • Validation Set Adequacy: The pass status confirms that the model's performance on the validation set is adequate, suggesting that it effectively identifies positive cases while minimizing false positives.

Based on these results, the model demonstrates a satisfactory balance between precision and recall, as evidenced by the F1 score exceeding the threshold. This indicates that the model is well-suited for the classification task at hand, particularly in scenarios with imbalanced class distributions. The results suggest that the model is capable of maintaining a good level of performance, effectively identifying positive cases while minimizing false positives, which is crucial for ensuring reliable and accurate predictions in practical applications.

Tables

Score Threshold Pass/Fail
0.6307 0.5 Pass
2025-12-31 22:27:48,204 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.model_validation.sklearn.MinimumF1Score:logreg_champion does not exist in model's document

ROC Curve Logreg Champion

ROC Curve: Logreg Champion is designed to evaluate the performance of a binary classification model by generating a Receiver Operating Characteristic (ROC) curve and calculating the Area Under Curve (AUC) score. The ROC curve provides a visual representation of the trade-off between the True Positive Rate (TPR) and False Positive Rate (FPR) across different threshold levels, while the AUC score quantifies the model's ability to distinguish between the two classes.

The test operates by first selecting the target model and dataset for binary classification. It calculates the predicted probabilities for the test set and uses these predictions, along with the true class labels, to plot the ROC curve. The ROC curve is a graphical plot that illustrates the diagnostic ability of a binary classifier system as its discrimination threshold is varied. The AUC score is then computed, which provides a single scalar value representing the model's overall performance. The AUC ranges from 0 to 1, where a score of 0.5 indicates no discriminative ability (equivalent to random guessing), and a score closer to 1 indicates excellent discrimination between classes. The test also includes a line representing randomness (AUC of 0.5) for comparison. Any infinite values in the ROC thresholds are removed to ensure accurate plotting and calculation.

The primary advantages of this test include its ability to provide a comprehensive visual depiction of a model's discriminative power across all possible classification thresholds. Unlike other metrics that evaluate model performance at a single threshold, the ROC curve offers insights into how the model performs across a range of thresholds. The AUC score, which summarizes the entire ROC curve into a single value, remains consistent regardless of the class distribution in the dataset, making it particularly useful in scenarios with imbalanced classes. This consistency allows for a more reliable comparison of model performance across different datasets and conditions.

It should be noted that this test is specifically designed for binary classification tasks, limiting its applicability to other types of models. Additionally, the ROC curve may not perform well with models that produce probabilities heavily skewed towards 0 or 1, as it could still reflect high performance even if the majority of classifications are incorrect, provided the ranking order is maintained. This is known as the "Class Imbalance Problem," where the ROC curve might not accurately represent the model's true performance in highly imbalanced datasets. Furthermore, an AUC score near 0.5 indicates a high risk, as it suggests the model's performance is no better than random guessing.

This test shows a plot of the ROC curve for the logistic regression model on the test dataset, with the AUC score prominently displayed. The x-axis represents the False Positive Rate (FPR), while the y-axis represents the True Positive Rate (TPR). The ROC curve is plotted in magenta, showing the model's performance across various thresholds, while a dashed gray line represents the performance of a random classifier with an AUC of 0.5. The AUC score of 0.70 indicates that the model has a moderate ability to distinguish between the positive and negative classes. The curve's position above the line of randomness suggests that the model performs better than random guessing, but there is room for improvement. The plot provides a clear visual representation of the model's trade-offs between sensitivity and specificity.

The test results reveal the following key insights:

  • Moderate Discriminative Ability: The AUC score of 0.70 indicates that the model has a moderate ability to distinguish between the two classes, performing better than random guessing but not reaching optimal performance.
  • Curve Above Randomness: The ROC curve lies above the line of randomness, suggesting that the model has some discriminative power, although it is not exceptionally strong.
  • Potential for Improvement: The distance of the ROC curve from the top-left corner of the plot indicates that there is potential for improving the model's sensitivity and specificity.

Based on these results, the logistic regression model demonstrates a moderate level of performance in distinguishing between the positive and negative classes, as evidenced by the AUC score of 0.70. The ROC curve's position above the line of randomness confirms that the model is better than random guessing, but there is significant room for enhancement. The insights suggest that while the model is functional, further refinement and optimization could improve its discriminative power, particularly in scenarios where higher accuracy is required. The analysis provides a clear understanding of the model's current capabilities and highlights areas for potential development.

Figures

ValidMind Figure validmind.model_validation.sklearn.ROCCurve:logreg_champion:c401
2025-12-31 22:28:04,451 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.model_validation.sklearn.ROCCurve:logreg_champion does not exist in model's document
Note the output above indicating that a test-driven block doesn't currently exist in your model's documentation for some test IDs.

That's expected: when we run validation tests, the logged results need to be added manually to your report as part of your compliance assessment process within the ValidMind Platform.

Log an artifact

As we can observe from the output above, our champion model doesn't pass the MinimumAccuracy based on the default thresholds of the out-of-the-box test, so let's log an artifact (finding) in the ValidMind Platform (Need more help?):

  1. From the Inventory in the ValidMind Platform, go to the model you connected to earlier.

  2. In the left sidebar that appears for your model, click Validation Report under Documents.

  3. Locate the Data Preparation section and click on 2.2.2. Model Performance to expand that section.

  4. Under the Model Performance Metrics section, locate Artifacts then click Link Artifact to Report:

    Screenshot showing the validation report with the link artifact option highlighted

  5. Select Validation Issue as the type of artifact.

  6. Click + Add Validation Issue to add a validation issue type artifact.

  7. Enter the details for your validation issue, for example:

    • TITLE — Champion Logistic Regression Model Fails Minimum Accuracy Threshold
    • RISK AREA — Model Performance
    • DOCUMENTATION SECTION — 3.2. Model Evaluation
    • DESCRIPTION — The logistic regression champion model was subjected to a Minimum Accuracy test to determine whether its predictive accuracy meets the predefined performance threshold of 0.7. The model achieved an accuracy score of 0.6136, which falls below the required minimum. As a result, the test produced a Fail outcome.
  8. Click Save.

  9. Select the validation issue you just added to link to your validation report and click Update Linked Artifacts to insert your validation issue.

  10. Click on the validation issue to expand the issue, where you can adjust details such as severity, owner, due date, status, etc. as well as include proposed remediation plans or supporting documentation as attachments.

Evaluate performance of challenger model

We've now conducted similar tests as the model development team for our champion model, with the aim of verifying their test results.

Next, let's see how our challenger models compare. We'll use the same batch of tests here as we did in mpt, but append a different result_id to indicate that these results should be associated with our challenger model:

mpt_chall = [
    "validmind.model_validation.sklearn.ClassifierPerformance:champion_vs_challenger",
    "validmind.model_validation.sklearn.ConfusionMatrix:champion_vs_challenger",
    "validmind.model_validation.sklearn.MinimumAccuracy:champion_vs_challenger",
    "validmind.model_validation.sklearn.MinimumF1Score:champion_vs_challenger",
    "validmind.model_validation.sklearn.ROCCurve:champion_vs_challenger"
]

We'll run each test once for each model with the same vm_test_ds dataset to compare them:

for test in mpt_chall:
    vm.tests.run_test(
        test,
        input_grid={
            "dataset": [vm_test_ds],
            "model": [vm_log_model, vm_rf_model],
        },
    ).log()

Classifier Performance Champion Vs Challenger

Classifier Performance: Champion vs Challenger is designed to evaluate the performance of classification models by calculating key metrics such as precision, recall, F1-Score, accuracy, and ROC AUC scores. This test provides a comprehensive analysis of model performance, applicable to both binary and multiclass classification models, allowing for a detailed comparison of different models' effectiveness.

The test operates by utilizing scikit-learn's classification_report to compute precision, recall, F1-Score, and accuracy for each class in the model. Precision measures the proportion of true positive predictions among all positive predictions, indicating the model's ability to avoid false positives. Recall, or sensitivity, measures the proportion of true positive predictions among all actual positives, reflecting the model's ability to capture all relevant instances. The F1-Score is the harmonic mean of precision and recall, providing a balance between the two metrics. Accuracy represents the proportion of correct predictions among all predictions made. For multiclass models, macro and weighted averages of these scores are calculated to provide an overall performance measure. Additionally, the ROC AUC score is calculated to assess the model's ability to distinguish between classes, with values ranging from 0 to 1, where a score closer to 1 indicates better performance.
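
For reference, the core numbers in this comparison can be approximated directly with scikit-learn. The snippet below is a minimal sketch rather than the ValidMind implementation, and assumes hypothetical arrays y_true, y_pred, and y_prob (positive-class probabilities) for a single model:

from sklearn.metrics import classification_report, roc_auc_score

# Per-class precision, recall, and F1, plus accuracy and macro/weighted averages
report = classification_report(y_true, y_pred, output_dict=True)

precision_class_1 = report["1"]["precision"]
recall_class_1 = report["1"]["recall"]
f1_weighted = report["weighted avg"]["f1-score"]
accuracy = report["accuracy"]

# Discriminatory power across thresholds, using predicted probabilities
roc_auc = roc_auc_score(y_true, y_prob)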

The primary advantages of this test include its versatility in handling both binary and multiclass models, making it suitable for a wide range of classification tasks. By employing a variety of performance metrics, the test offers a comprehensive view of model performance, highlighting strengths and weaknesses across different aspects. The inclusion of the ROC AUC score is particularly beneficial for evaluating models on unbalanced datasets, as it provides insight into the model's discriminatory power beyond simple accuracy measures. This makes the test a valuable tool for model comparison and selection in diverse scenarios.

It should be noted that the test assumes correctly identified labels for binary classification models, which may not always be the case in real-world applications. It is specifically designed for classification models and is not applicable to regression models, limiting its use to certain types of machine learning tasks. Additionally, the test may provide limited insights if the test dataset does not adequately represent real-world scenarios, potentially leading to misleading conclusions about model performance. Signs of high risk include low values for precision, recall, F1-Score, accuracy, and ROC AUC, as well as imbalances in precision and recall scores, which could indicate poor model performance.

This test shows the results in tabular format, presenting precision, recall, F1-Score, accuracy, and ROC AUC scores for two models: log_model_champion and rf_model. Each table row corresponds to a specific class or an average metric, with columns representing the model, class, and respective scores. The tables allow for easy comparison of the models' performance across different metrics. Key measurements include precision, recall, and F1-Score for each class, as well as weighted and macro averages. Notable observations include the rf_model generally outperforming the log_model_champion across most metrics, with higher precision, recall, and F1-Score values. The rf_model also shows a higher accuracy and ROC AUC score, indicating better overall performance and discriminatory power.

The test results reveal the following key insights:

  • RF Model Outperforms in Precision and Recall: The rf_model demonstrates higher precision and recall scores for both classes compared to the log_model_champion, indicating a better balance between capturing true positives and minimizing false positives.
  • Higher F1-Score for RF Model: The rf_model achieves a higher F1-Score for both classes, with a weighted average of 0.6965, compared to 0.64 for the log_model_champion, suggesting a more balanced performance across precision and recall.
  • Superior Accuracy and ROC AUC for RF Model: The rf_model shows an accuracy of 0.6971 and a ROC AUC score of 0.7727, outperforming the log_model_champion, which has an accuracy of 0.6399 and a ROC AUC score of 0.7044, indicating better overall model performance and class separation.

Based on these results, the rf_model demonstrates superior performance compared to the log_model_champion, with higher precision, recall, F1-Score, accuracy, and ROC AUC scores. This suggests that the rf_model is more effective at distinguishing between classes and making accurate predictions. The higher ROC AUC score indicates that the rf_model has better discriminatory power, making it a more reliable choice for classification tasks. These insights highlight the rf_model as a more robust and capable model, particularly in scenarios where precision and recall are critical.

Tables

model Class Precision Recall F1
log_model_champion 0 0.6555 0.6418 0.6486
log_model_champion 1 0.6238 0.6378 0.6307
log_model_champion Weighted Average 0.6402 0.6399 0.6400
log_model_champion Macro Average 0.6397 0.6398 0.6397
rf_model 0 0.6969 0.7343 0.7151
rf_model 1 0.6973 0.6571 0.6766
rf_model Weighted Average 0.6971 0.6971 0.6965
rf_model Macro Average 0.6971 0.6957 0.6958

model Metric Value
log_model_champion Accuracy 0.6399
log_model_champion ROC AUC 0.7044
rf_model Accuracy 0.6971
rf_model ROC AUC 0.7727
2025-12-31 22:28:14,850 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.model_validation.sklearn.ClassifierPerformance:champion_vs_challenger does not exist in model's document

Confusion Matrix Champion Vs Challenger

Confusion Matrix: Champion vs Challenger is designed to evaluate and visually represent the classification ML model's predictive performance using a Confusion Matrix heatmap. The primary purpose is to assess how well the model can correctly classify True Positives, True Negatives, False Positives, and False Negatives, which are fundamental aspects of model accuracy.

The test operates by comparing the predicted results from the classification model against the actual values. A confusion matrix is constructed using the unique labels from the actual values, employing scikit-learn's metrics. This matrix is then visually rendered using Plotly's create_annotated_heatmap function, providing a two-dimensional graphical representation of the model's performance. The matrix highlights the distribution of True Positives, True Negatives, False Positives, and False Negatives, offering insights into the model's classification capabilities. The values in the matrix indicate the number of instances for each category, with higher True Positive and True Negative values generally indicating better model performance.
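
As a rough illustration (not the ValidMind implementation), the same 2x2 matrix can be produced with scikit-learn, assuming hypothetical arrays y_true and y_pred of 0/1 labels; ValidMind then renders the matrix as an annotated Plotly heatmap:

from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_true, y_pred, labels=[0, 1])

# For binary labels, ravel() returns the counts in the order TN, FP, FN, TP
tn, fp, fn, tp = cm.ravel()
print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")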

The primary advantages of this test include its ability to provide a simplified yet comprehensive visual snapshot of the classification model's predictive performance. It distinctly highlights True Positives, True Negatives, False Positives, and False Negatives, making it easier to identify potential areas for improvement. The matrix is particularly useful for multi-class classification problems, offering a straightforward view of complex model performances. Additionally, it aids in understanding the different types of errors the model could make, providing insights into Type-I and Type-II errors.

It should be noted that the test has limitations, especially in cases of unbalanced classes, where it might misinterpret the accuracy of a model that predominantly predicts the majority class. The confusion matrix does not provide a single unified statistic to evaluate overall model performance, requiring separate calculations for metrics like precision, recall, or F1-score. It mainly serves as a descriptive tool and lacks the capability for statistical hypothesis testing. There is also a risk of misinterpretation, as the matrix does not directly provide precision, recall, or F1-score data.

This test shows the results in the form of confusion matrix heatmaps for two models: log_model_champion and rf_model. Each matrix is a 2x2 grid displaying the counts of True Positives, True Negatives, False Positives, and False Negatives. The axes represent the predicted and true labels, with the diagonal elements (True Positives and True Negatives) indicating correct classifications. The off-diagonal elements (False Positives and False Negatives) represent misclassifications. The color intensity in the heatmap corresponds to the count of instances, with darker shades indicating higher values. Notable observations include the distribution of classification outcomes, with specific counts for each category, providing a clear view of the model's performance.

The test results reveal the following key insights:

  • Higher True Negatives in RF Model: The rf_model shows a higher count of True Negatives (246) compared to the log_model_champion (215), indicating better performance in correctly identifying negative instances.
  • Lower False Positives in RF Model: The rf_model has fewer False Positives (89) than the log_model_champion (120), suggesting improved precision in positive predictions.
  • Similar True Positives Across Models: Both models have comparable True Positive counts, with rf_model at 205 and log_model_champion at 199, showing similar effectiveness in identifying positive instances.
  • Slightly Lower False Negatives in RF Model: The rf_model has a slightly lower count of False Negatives (107) compared to log_model_champion (113), indicating marginally better recall.

Based on these results, the rf_model demonstrates superior performance in terms of True Negatives and False Positives, suggesting it is more effective in correctly identifying negative instances and reducing false alarms. Both models perform similarly in identifying positive instances, as indicated by their True Positive counts. The rf_model also shows a slight advantage in recall, with fewer False Negatives. These insights highlight the rf_model as a more balanced classifier, particularly in scenarios where minimizing false alarms is crucial. The analysis provides a clear understanding of each model's strengths and areas for potential improvement, guiding further model refinement and selection.

Figures

ValidMind Figure validmind.model_validation.sklearn.ConfusionMatrix:champion_vs_challenger:a381
ValidMind Figure validmind.model_validation.sklearn.ConfusionMatrix:champion_vs_challenger:0a36
2025-12-31 22:28:32,221 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.model_validation.sklearn.ConfusionMatrix:champion_vs_challenger does not exist in model's document

❌ Minimum Accuracy Champion Vs Challenger

Minimum Accuracy: Champion vs Challenger is designed to ensure that a model's prediction accuracy meets or exceeds a specified threshold, which is crucial for validating the model's performance. The primary purpose of this test is to confirm that the model can reliably predict outcomes with a level of accuracy that is deemed acceptable for its intended application. This is particularly important in scenarios where accurate predictions are critical, such as in medical diagnoses or financial forecasting.

The test operates by calculating the model's accuracy score using the accuracy_score method from sklearn, which compares the true labels (y_true) with the predicted labels (class_pred). The accuracy score is the ratio of correct predictions to the total number of predictions, providing a straightforward measure of the model's overall performance. This score is then compared to a predefined threshold, typically set at 0.7, to determine if the model passes the test. An accuracy score above this threshold indicates that the model is performing adequately, while a score below suggests potential issues with prediction reliability. The accuracy score ranges from 0 to 1, where a score closer to 1 indicates better performance.
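
The pass/fail logic is straightforward to reproduce. This is a minimal sketch under the assumption that y_true holds the actual labels and class_pred the model's predicted classes (both hypothetical names here), not the ValidMind implementation itself:

from sklearn.metrics import accuracy_score

threshold = 0.7  # default minimum accuracy used by the test
score = accuracy_score(y_true, class_pred)
print(f"Accuracy: {score:.4f} -> {'Pass' if score > threshold else 'Fail'}")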

The primary advantages of this test include its simplicity and effectiveness in providing a quick assessment of a model's performance across all classes. It is particularly useful in situations where the classes are balanced, as it offers a clear indication of the model's ability to correctly classify instances. Additionally, the test's versatility allows it to be applied to both binary and multiclass classification tasks, making it a valuable tool for a wide range of applications. Its straightforward nature means that it can be easily understood and implemented, providing immediate feedback on model performance.

It should be noted that the test has limitations, particularly in datasets with imbalanced classes, where accuracy can be misleading. In such cases, the model may appear to perform well by predominantly predicting the majority class, thus inflating the accuracy score without truly reflecting the model's ability to handle minority classes. This can lead to an inaccurate perception of the model's performance. Furthermore, the test does not account for other important metrics such as precision, recall, or the model's ability to manage false positives and false negatives, which are crucial for a comprehensive evaluation of model performance.

This test shows the results in a tabular format, presenting the accuracy scores of different models against the specified threshold. The table includes columns for the model name, the calculated accuracy score, the threshold value, and the pass/fail status. The accuracy scores are presented as decimal values, indicating the proportion of correct predictions. In this test, the log_model_champion achieved an accuracy score of 0.6399, while the rf_model scored 0.6971. Both models failed to meet the threshold of 0.7, resulting in a "Fail" status. The table provides a clear and concise overview of each model's performance relative to the threshold, highlighting areas where improvements may be necessary.

The test results reveal the following key insights:

  • Log Model Champion Fails to Meet Threshold: The log_model_champion achieved an accuracy score of 0.6399, which is below the threshold of 0.7, indicating that the model's predictions are not sufficiently accurate for the intended application.
  • RF Model Marginally Below Threshold: The rf_model scored 0.6971, just shy of the 0.7 threshold, suggesting that while the model is close to meeting the required accuracy, it still falls short and may require further tuning or data adjustments.

Based on these results, the models tested do not currently meet the minimum accuracy threshold of 0.7, with both the log_model_champion and rf_model failing to achieve the required level of prediction accuracy. The log_model_champion shows a more significant shortfall, while the rf_model is marginally below the threshold, indicating potential for improvement with minor adjustments. These insights suggest that further refinement and evaluation of the models are necessary to enhance their predictive capabilities and ensure they meet the desired performance standards. The results underscore the importance of considering additional metrics and methodologies to gain a comprehensive understanding of model performance, particularly in scenarios with imbalanced datasets.

Tables

model Score Threshold Pass/Fail
log_model_champion 0.6399 0.7 Fail
rf_model 0.6971 0.7 Fail
2025-12-31 22:28:45,418 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.model_validation.sklearn.MinimumAccuracy:champion_vs_challenger does not exist in model's document

✅ Minimum F1 Score Champion Vs Challenger

Minimum F1 Score: Champion vs Challenger is designed to evaluate whether the F1 score of a model on the validation set meets a predefined minimum threshold, ensuring a balanced performance between precision and recall. This test is crucial for classification tasks, particularly when the distribution of positive and negative classes is skewed, as it provides a more comprehensive measure of a model's effectiveness than accuracy alone.

The test operates by calculating the F1 score using scikit-learn's metrics in Python. For binary classification problems, the f1_score function is employed, while for multi-class problems, macro averaging is used to compute the F1 score. The F1 score is a harmonic mean of precision and recall, where precision measures the accuracy of positive predictions, and recall measures the ability to identify all positive instances. The score ranges from 0 to 1, with 1 indicating perfect precision and recall. A score above the predefined threshold is considered satisfactory, indicating that the model maintains a good balance between precision and recall.
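
Again as a minimal sketch (hypothetical y_true and y_pred arrays, not the ValidMind implementation), the check reduces to a single scikit-learn call compared against the threshold:

from sklearn.metrics import f1_score

threshold = 0.5  # minimum acceptable F1 score used here
score = f1_score(y_true, y_pred)                      # binary classification
# score = f1_score(y_true, y_pred, average="macro")   # multi-class alternative
print(f"F1: {score:.4f} -> {'Pass' if score > threshold else 'Fail'}")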

The primary advantages of this test include its ability to provide a balanced measure of a model's performance by accounting for both false positives and false negatives. This is particularly useful in scenarios with imbalanced class distributions, where accuracy can be misleading. The flexibility in setting the threshold value allows practitioners to define minimum acceptable performance standards tailored to specific requirements, ensuring that the model meets the necessary criteria for deployment.

It should be noted that the F1 score assumes an equal cost for false positives and false negatives, which may not be true in all real-world scenarios. Additionally, the test may not be suitable for all types of models and machine learning tasks, as some applications might require other metrics such as precision, recall, or the ROC-AUC score. A low F1 score, below the established threshold, is a sign of high risk, indicating that the model may not effectively balance precision and recall, potentially leading to suboptimal identification of positive classes.

This test shows the results in a tabular format, presenting the F1 scores for different models alongside their respective thresholds and pass/fail status. The table includes columns for the model name, the calculated F1 score, the predefined threshold, and whether the model passed or failed the test. The F1 scores are numerical values ranging from 0 to 1, with the threshold set at 0.5. A score above this threshold indicates a pass, suggesting that the model achieves a satisfactory balance between precision and recall. Notable observations include the specific F1 scores for each model and their pass/fail status, providing a clear indication of which models meet the performance criteria.

The test results reveal the following key insights:

  • Champion Model Performance: The log_model_champion achieved an F1 score of 0.6307, surpassing the threshold of 0.5, and thus passed the test, indicating a balanced performance between precision and recall.
  • Random Forest Model Performance: The rf_model recorded an F1 score of 0.6766, also exceeding the threshold of 0.5, resulting in a pass, which suggests a strong ability to balance precision and recall effectively.

Based on these results, both the log_model_champion and the rf_model demonstrate satisfactory performance by exceeding the minimum F1 score threshold of 0.5. The rf_model shows a slightly higher F1 score compared to the log_model_champion, indicating a marginally better balance between precision and recall. These insights suggest that both models are capable of effectively handling the classification task, with the rf_model potentially offering a slight edge in terms of performance. The results provide a clear indication of the models' ability to meet the predefined performance standards, ensuring their suitability for deployment in scenarios where balanced precision and recall are critical.

Tables

model Score Threshold Pass/Fail
log_model_champion 0.6307 0.5 Pass
rf_model 0.6766 0.5 Pass
2025-12-31 22:28:54,186 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.model_validation.sklearn.MinimumF1Score:champion_vs_challenger does not exist in model's document

ROC Curve Champion Vs Challenger

ROC Curve: Champion vs Challenger is designed to evaluate the performance of binary classification models by generating and plotting the Receiver Operating Characteristic (ROC) curve and calculating the Area Under Curve (AUC) score. The primary purpose of this test is to measure the model's ability to discriminate between two classes, such as default vs non-default, by illustrating the trade-off between the True Positive Rate (TPR) and False Positive Rate (FPR) across various threshold levels. A higher AUC score indicates better model performance in distinguishing between the positive and negative classes.

The test operates by selecting the target model and datasets that require binary classification. It calculates the predicted probabilities for the test set and uses this data, along with the true outcomes, to generate and plot the ROC curve. The ROC curve is a graphical representation that plots the TPR against the FPR at different threshold levels. The AUC score is computed to provide a numerical estimation of the model's performance, with values ranging from 0 to 1. An AUC of 0.5 suggests no discriminative ability, akin to random guessing, while a score closer to 1 indicates excellent discrimination. The test also includes a line representing randomness (AUC of 0.5) for comparison. Infinite values in the ROC threshold are eliminated to ensure accuracy, and the resulting ROC curve, AUC score, and thresholds are saved for future reference.
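
The underlying computation can be sketched with scikit-learn as follows, assuming hypothetical arrays y_true (labels) and y_prob (positive-class probabilities); the filtering step mirrors the removal of non-finite thresholds described above, though the exact ValidMind implementation may differ:

import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

fpr, tpr, thresholds = roc_curve(y_true, y_prob)
auc = roc_auc_score(y_true, y_prob)

# Drop any non-finite thresholds (and their ROC points) before saving
finite = np.isfinite(thresholds)
fpr, tpr, thresholds = fpr[finite], tpr[finite], thresholds[finite]

print(f"AUC: {auc:.2f}")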

The primary advantages of this test include its ability to provide a comprehensive visual depiction of a model's discriminative power across all possible classification thresholds. Unlike other metrics that only reveal model performance at a single threshold, the ROC curve offers a complete view of how the model performs at various levels. The AUC score, representing the entire ROC curve as a single data point, remains consistent regardless of the dataset's proportions, making it an ideal choice for evaluating models in diverse scenarios. This consistency allows for a reliable comparison of model performance across different datasets and conditions.

It should be noted that this test is exclusively structured for binary classification tasks, limiting its application to other model types. Additionally, models that output probabilities highly skewed towards 0 or 1 may not perform well under this test. The ROC curve can sometimes reflect high performance even when the majority of classifications are incorrect, as long as the model's ranking format is maintained. This phenomenon, known as the "Class Imbalance Problem," can lead to misleading interpretations of the model's effectiveness, especially in imbalanced datasets where one class significantly outweighs the other.

This test shows the ROC curves for two models, the "log_model_champion" and the "rf_model," plotted against the test dataset. The ROC curve for the "log_model_champion" shows an AUC of 0.70, while the "rf_model" displays an AUC of 0.77. The plots illustrate the TPR on the y-axis and the FPR on the x-axis, with a dashed line representing random performance (AUC of 0.5). The curves demonstrate how each model's TPR varies with changes in the FPR, providing insight into their discriminative capabilities. The AUC values indicate that both models perform better than random guessing, with the "rf_model" showing superior performance. The range of the AUC values, from 0.5 to 1, helps in assessing the models' effectiveness, with higher values indicating better discrimination between classes.

The test results reveal the following key insights:

  • RF Model Superior Performance: The "rf_model" achieves an AUC of 0.77, indicating a stronger ability to discriminate between classes compared to the "log_model_champion," which has an AUC of 0.70.
  • Consistent Discrimination: Both models demonstrate consistent performance above the random line, suggesting reliable discrimination capabilities across various thresholds.
  • Visual Comparison: The ROC curves provide a clear visual comparison, with the "rf_model" curve consistently lying above the "log_model_champion," highlighting its superior performance.

Based on these results, the "rf_model" exhibits a more robust discriminative ability compared to the "log_model_champion," as evidenced by its higher AUC score. The ROC curves provide a visual confirmation of this performance difference, with the "rf_model" consistently outperforming the "log_model_champion" across various thresholds. These insights suggest that the "rf_model" is more effective in distinguishing between the positive and negative classes, making it a preferable choice for binary classification tasks in this context. The analysis underscores the importance of using ROC curves and AUC scores to evaluate and compare model performance comprehensively.

Figures

ValidMind Figure validmind.model_validation.sklearn.ROCCurve:champion_vs_challenger:bb7b
ValidMind Figure validmind.model_validation.sklearn.ROCCurve:champion_vs_challenger:7266
2025-12-31 22:29:10,980 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.model_validation.sklearn.ROCCurve:champion_vs_challenger does not exist in model's document
Based on the performance metrics, our challenger random forest classification model consistently outperforms our champion, although in this run it also falls just short of the MinimumAccuracy threshold (0.6971 versus the champion's 0.6399, against a threshold of 0.7).

In your validation report, support the recommendation in your validation issue's Proposed Remediation Plan to investigate the use of the challenger model by inserting the performance tests logged with this notebook into the appropriate section.

Run diagnostic tests

Next, we want to compare the robustness and stability of our champion and challenger models.

Use list_tests() to list all available model diagnosis tests applicable to classification tasks:

vm.tests.list_tests(tags=["model_diagnosis"], task="classification")
ID Name Description Has Figure Has Table Required Inputs Params Tags Tasks
validmind.model_validation.sklearn.OverfitDiagnosis Overfit Diagnosis Assesses potential overfitting in a model's predictions, identifying regions where performance between training and... True True ['model', 'datasets'] {'metric': {'type': 'str', 'default': None}, 'cut_off_threshold': {'type': 'float', 'default': 0.04}} ['sklearn', 'binary_classification', 'multiclass_classification', 'linear_regression', 'model_diagnosis'] ['classification', 'regression']
validmind.model_validation.sklearn.RobustnessDiagnosis Robustness Diagnosis Assesses the robustness of a machine learning model by evaluating performance decay under noisy conditions.... True True ['datasets', 'model'] {'metric': {'type': 'str', 'default': None}, 'scaling_factor_std_dev_list': {'type': 'List', 'default': [0.1, 0.2, 0.3, 0.4, 0.5]}, 'performance_decay_threshold': {'type': 'float', 'default': 0.05}} ['sklearn', 'model_diagnosis', 'visualization'] ['classification', 'regression']
validmind.model_validation.sklearn.WeakspotsDiagnosis Weakspots Diagnosis Identifies and visualizes weak spots in a machine learning model's performance across various sections of the... True True ['datasets', 'model'] {'features_columns': {'type': 'Optional', 'default': None}, 'metrics': {'type': 'Optional', 'default': None}, 'thresholds': {'type': 'Optional', 'default': None}} ['sklearn', 'binary_classification', 'multiclass_classification', 'model_diagnosis', 'visualization'] ['classification', 'text_classification']

Let’s now assess the models for potential signs of overfitting and identify any sub-segments where performance may be inconsistent, using the OverfitDiagnosis test.

Overfitting occurs when a model learns the training data too well, capturing not only the true pattern but also noise and random fluctuations. This results in excellent performance on the training dataset but poor generalization to new, unseen data:

  • Since the training dataset (vm_train_ds) was used to fit the model, we use this set to establish a baseline performance for how well the model performs on data it has already seen.
  • The testing dataset (vm_test_ds) was never seen during training, and here simulates real-world generalization, or how well the model performs on new, unseen data.
vm.tests.run_test(
    test_id="validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger",
    input_grid={
        "datasets": [[vm_train_ds, vm_test_ds]],
        "model": [vm_log_model, vm_rf_model],
    },
).log()

Overfit Diagnosis Champion Vs Challenger

Overfit Diagnosis: Champion vs Challenger is designed to assess potential overfitting in a model's predictions by identifying regions where performance between training and testing sets deviates significantly. The primary purpose of this test is to pinpoint specific regions or feature segments where the model may be overfitting, thereby providing insights into the model's generalization capabilities.

The test operates by comparing the model's performance on training versus test data, grouped by feature columns. It calculates the difference between the training and test performance for each group and identifies regions where this difference exceeds a specified threshold. For classification models, the AUC metric is used, while regression models use the MSE metric. The threshold for identifying overfitting regions is set to 0.04 by default. The test calculates the performance metrics for each feature segment and plots regions where the performance gap exceeds the threshold. The AUC metric, ranging from 0 to 1, measures the model's ability to distinguish between classes, with higher values indicating better performance. A significant gap between training and test AUC suggests overfitting.
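
Conceptually, the check for a single feature segment boils down to the difference between the in-sample and out-of-sample metric. Below is a minimal sketch, assuming hypothetical arrays restricted to one segment of one feature (labels and predicted probabilities for train and test); the actual test repeats this across binned segments of every feature:

from sklearn.metrics import roc_auc_score

cut_off_threshold = 0.04  # default gap threshold used by the test

train_auc = roc_auc_score(y_train_segment, prob_train_segment)
test_auc = roc_auc_score(y_test_segment, prob_test_segment)

gap = train_auc - test_auc
flagged = gap > cut_off_threshold  # segment reported as a potential overfit region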

The primary advantages of this test include its ability to identify specific areas where overfitting occurs, supporting multiple performance metrics, and providing flexibility. It is applicable to both classification and regression models, making it versatile across different types of predictive modeling tasks. The visualization of overfitting segments aids in better understanding and debugging, allowing model developers to focus on specific areas that require attention. This targeted approach helps in refining models to improve their generalization performance.

It should be noted that the default threshold may not be suitable for all use cases and requires tuning to match specific model characteristics and data distributions. The test may not capture more subtle forms of overfitting that do not exceed the threshold, potentially missing areas where the model's performance is suboptimal. Additionally, the test assumes that the binning of features adequately represents the data segments, which may not always be the case. This can lead to misinterpretation if the feature segments do not align well with the underlying data distribution.

This test shows the results in both tabular and graphical formats, providing a comprehensive view of the overfitting regions. The tables present detailed information on the number of training and test records, training and test AUC, and the gap for each feature segment. The plots visually represent the AUC gap across different feature slices, with a red line indicating the cut-off threshold of 0.04. The x-axis represents the feature slices, while the y-axis shows the AUC gap. Notable observations include significant gaps in certain feature segments, indicating potential overfitting. For instance, the "rf_model" shows a substantial AUC gap in the "CreditScore" feature slice (400.0, 450.0], with a gap of 0.5, suggesting severe overfitting in this region.

The test results reveal the following key insights:

  • Severe Overfitting in RF Model: The "rf_model" exhibits significant overfitting in the "CreditScore" feature slice (400.0, 450.0], with an AUC gap of 0.5, indicating a large discrepancy between training and test performance.
  • Consistent Overfitting Across Features: Both models show consistent overfitting across various features, such as "Balance" and "NumOfProducts," with gaps exceeding the threshold, highlighting areas needing attention.
  • Log Model's Moderate Overfitting: The "log_model_champion" shows moderate overfitting in features like "Balance" and "Tenure," with gaps around 0.05 to 0.09, suggesting areas for potential improvement.
  • Feature-Specific Overfitting Patterns: Different features exhibit varying levels of overfitting, with some segments showing more pronounced gaps, indicating the need for feature-specific analysis and tuning.

Based on these results, the analysis highlights distinct patterns of overfitting in both models, with the "rf_model" showing more severe overfitting in specific feature segments compared to the "log_model_champion." The insights suggest that targeted interventions may be necessary to address these overfitting issues, particularly in features like "CreditScore" and "Balance." The results underscore the importance of feature-specific tuning and validation to enhance the model's generalization capabilities and reduce overfitting. The visualizations and tabular data provide a clear roadmap for identifying and addressing these areas, contributing to more robust model performance.

Tables

model Feature Slice Number of Training Records Number of Test Records Training AUC Test AUC Gap
log_model_champion CreditScore (750.0, 800.0] 248 61 0.6967 0.6462 0.0504
log_model_champion Tenure (8.0, 9.0] 278 64 0.7049 0.6504 0.0545
log_model_champion Balance (71516.268, 95355.024] 234 50 0.6740 0.5829 0.0911
log_model_champion Balance (190710.048, 214548.804] 18 9 0.5357 0.2222 0.3135
log_model_champion EstimatedSalary (59994.105, 79988.28] 273 63 0.6579 0.6061 0.0518
rf_model CreditScore (400.0, 450.0] 54 7 1.0000 0.5000 0.5000
rf_model CreditScore (450.0, 500.0] 121 26 1.0000 0.7959 0.2041
rf_model CreditScore (500.0, 550.0] 256 64 1.0000 0.8093 0.1907
rf_model CreditScore (550.0, 600.0] 387 89 1.0000 0.7482 0.2518
rf_model CreditScore (600.0, 650.0] 458 125 1.0000 0.7609 0.2391
rf_model CreditScore (650.0, 700.0] 472 128 1.0000 0.8133 0.1867
rf_model CreditScore (700.0, 750.0] 398 110 1.0000 0.6939 0.3061
rf_model CreditScore (750.0, 800.0] 248 61 1.0000 0.7667 0.2333
rf_model CreditScore (800.0, 850.0] 178 36 1.0000 0.9127 0.0873
rf_model Tenure (-0.01, 1.0] 379 88 1.0000 0.7515 0.2485
rf_model Tenure (1.0, 2.0] 278 70 1.0000 0.7946 0.2054
rf_model Tenure (2.0, 3.0] 267 76 1.0000 0.7091 0.2909
rf_model Tenure (3.0, 4.0] 243 62 1.0000 0.7927 0.2073
rf_model Tenure (4.0, 5.0] 266 62 1.0000 0.8833 0.1167
rf_model Tenure (5.0, 6.0] 232 64 1.0000 0.7721 0.2279
rf_model Tenure (6.0, 7.0] 238 66 1.0000 0.7768 0.2232
rf_model Tenure (7.0, 8.0] 269 64 1.0000 0.8804 0.1196
rf_model Tenure (8.0, 9.0] 278 64 1.0000 0.6860 0.3140
rf_model Tenure (9.0, 10.0] 135 31 1.0000 0.7051 0.2949
rf_model Balance (-238.388, 23838.756] 830 214 1.0000 0.8296 0.1704
rf_model Balance (47677.512, 71516.268] 80 23 1.0000 0.7235 0.2765
rf_model Balance (71516.268, 95355.024] 234 50 1.0000 0.7029 0.2971
rf_model Balance (95355.024, 119193.78] 539 113 1.0000 0.7038 0.2962
rf_model Balance (119193.78, 143032.536] 498 136 1.0000 0.7354 0.2646
rf_model Balance (143032.536, 166871.292] 268 72 1.0000 0.6646 0.3354
rf_model Balance (166871.292, 190710.048] 97 23 1.0000 0.8712 0.1288
rf_model Balance (190710.048, 214548.804] 18 9 1.0000 0.6667 0.3333
rf_model NumOfProducts (0.997, 1.3] 1520 362 1.0000 0.6692 0.3308
rf_model NumOfProducts (1.9, 2.2] 879 237 1.0000 0.7252 0.2748
rf_model NumOfProducts (2.8, 3.1] 151 39 1.0000 0.5536 0.4464
rf_model HasCrCard (-0.001, 0.1] 792 209 1.0000 0.7622 0.2378
rf_model HasCrCard (0.9, 1.0] 1793 438 1.0000 0.7771 0.2229
rf_model IsActiveMember (-0.001, 0.1] 1420 342 1.0000 0.7501 0.2499
rf_model IsActiveMember (0.9, 1.0] 1165 305 1.0000 0.7714 0.2286
rf_model EstimatedSalary (-188.362, 20005.755] 271 69 1.0000 0.7882 0.2118
rf_model EstimatedSalary (20005.755, 39999.93] 239 59 1.0000 0.7624 0.2376
rf_model EstimatedSalary (39999.93, 59994.105] 254 60 1.0000 0.7766 0.2234
rf_model EstimatedSalary (59994.105, 79988.28] 273 63 1.0000 0.7298 0.2702
rf_model EstimatedSalary (79988.28, 99982.455] 255 59 1.0000 0.5356 0.4644
rf_model EstimatedSalary (99982.455, 119976.63] 269 70 1.0000 0.8129 0.1871
rf_model EstimatedSalary (119976.63, 139970.805] 258 57 1.0000 0.7965 0.2035
rf_model EstimatedSalary (139970.805, 159964.98] 247 67 1.0000 0.7573 0.2427
rf_model EstimatedSalary (159964.98, 179959.155] 258 75 1.0000 0.8080 0.1920
rf_model EstimatedSalary (179959.155, 199953.33] 261 68 1.0000 0.8596 0.1404
rf_model Geography_Germany (-0.001, 0.1] 1800 459 1.0000 0.7533 0.2467
rf_model Geography_Germany (0.9, 1.0] 785 188 1.0000 0.7219 0.2781
rf_model Geography_Spain (-0.001, 0.1] 2001 506 1.0000 0.7646 0.2354
rf_model Geography_Spain (0.9, 1.0] 584 141 1.0000 0.8053 0.1947
rf_model Gender_Male (-0.001, 0.1] 1254 331 1.0000 0.8136 0.1864
rf_model Gender_Male (0.9, 1.0] 1331 316 1.0000 0.7136 0.2864

Figures

ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:8d2b
ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:ace8
ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:0630
ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:552f
ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:05f1
ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:32ba
ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:38ef
ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:97c8
ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:4e85
ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:4abe
ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:d4fe
ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:1f99
ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:93bf
ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:6417
ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:4b52
ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:4ca0
ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:5d07
ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:a9d6
ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:f750
ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:5a95
2025-12-31 22:29:44,601 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger does not exist in model's document

Let's also conduct robustness and stability testing of the two models with the RobustnessDiagnosis test. Robustness refers to a model's ability to maintain consistent performance when its inputs are noisy or otherwise perturbed, while stability refers to its ability to produce consistent outputs over time and across different data subsets.

Again, we'll use both the training and testing datasets to establish baseline performance and to simulate real-world generalization:

vm.tests.run_test(
    test_id="validmind.model_validation.sklearn.RobustnessDiagnosis:Champion_vs_LogRegression",
    input_grid={
        "datasets": [[vm_train_ds, vm_test_ds]],
        "model": [vm_log_model, vm_rf_model],
    },
).log()

❌ Robustness Diagnosis Champion Vs Log Regression

Robustness Diagnosis: Champion vs Log Regression is designed to evaluate the resilience of machine learning models when exposed to noisy input data. This test is crucial for understanding how models perform under real-world conditions where data may be imperfect or corrupted, ensuring that the models maintain their predictive power despite such challenges.

The test operates by introducing Gaussian noise to the numeric input features of the datasets at varying scales of standard deviation. The performance of the models is then assessed using metrics such as the Area Under the Curve (AUC) for classification tasks. AUC measures the ability of the model to distinguish between classes, with values ranging from 0 to 1, where 1 indicates perfect classification and 0.5 suggests no discriminative power. The test involves calculating the performance decay, which is the reduction in AUC as noise increases, and visualizing these results to understand the extent of performance degradation.
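
The perturbation loop can be sketched as follows. This is illustrative only, not the ValidMind implementation, and assumes a fitted scikit-learn model plus a hypothetical X_test (DataFrame of numeric features) and y_test; the real test perturbs both datasets at each noise scale and checks the decay against a threshold (0.05 by default):

import numpy as np
from sklearn.metrics import roc_auc_score

baseline_auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])

for scale in [0.1, 0.2, 0.3, 0.4, 0.5]:
    # Add Gaussian noise scaled to each feature's standard deviation
    noise = np.random.normal(0.0, scale * X_test.std().values, size=X_test.shape)
    auc = roc_auc_score(y_test, model.predict_proba(X_test + noise)[:, 1])
    decay = baseline_auc - auc
    print(f"scale={scale}: AUC={auc:.4f}, decay={decay:.4f}, passed={decay < 0.05}")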

The primary advantages of this test include its ability to provide insights into a model's robustness against noisy or corrupted data. By utilizing a variety of performance metrics suitable for both classification and regression tasks, the test offers a comprehensive view of model behavior under stress. The visualization of results aids in understanding the extent of performance degradation, making it easier to identify models that are more resilient to noise and those that may require further tuning or improvement.

It should be noted that the test has limitations, such as the use of Gaussian noise, which might not represent all types of real-world data perturbations. The performance thresholds used to evaluate decay are somewhat arbitrary and may need adjustment based on specific application requirements. Additionally, the test may not account for more complex or unstructured noise patterns that could affect model robustness, potentially leading to an incomplete assessment of a model's real-world performance.

This test shows the results in both tabular and graphical formats. The tables provide detailed numerical insights into the AUC values and performance decay for each model across different perturbation sizes. The plots visually represent the AUC decay as noise increases, with the x-axis indicating the perturbation size and the y-axis showing the AUC values. The graphs allow for a quick comparison of how each model's performance changes with increasing noise, highlighting any significant drops in AUC. Notable observations include the stability of the logistic regression model compared to the random forest model, which shows a more pronounced performance decay as noise increases.

The test results reveal the following key insights:

  • Logistic Regression Stability: The logistic regression model demonstrates relatively stable performance across varying noise levels, with minimal performance decay observed. The AUC remains above 0.66 even at the highest perturbation size.
  • Random Forest Sensitivity: The random forest model exhibits significant performance decay, particularly in the training dataset, where AUC drops from 1.0 to 0.7844 as noise increases. This indicates a higher sensitivity to noise compared to the logistic regression model.
  • Test Dataset Consistency: Both models maintain more consistent performance on the test dataset, with less pronounced decay compared to the training dataset, suggesting better generalization to unseen data.

Based on these results, the logistic regression model appears to be more robust to noise, maintaining stable performance across different perturbation levels. In contrast, the random forest model shows greater sensitivity to noise, particularly in the training dataset, which may indicate overfitting or a need for further tuning to improve robustness. These insights highlight the importance of evaluating model performance under noisy conditions to ensure reliability in real-world applications.

Tables

model Perturbation Size Dataset Row Count AUC Performance Decay Passed
log_model_champion Baseline (0.0) train_dataset_final 2585 0.6736 0.0000 True
log_model_champion Baseline (0.0) test_dataset_final 647 0.7044 0.0000 True
log_model_champion 0.1 train_dataset_final 2585 0.6734 0.0001 True
log_model_champion 0.1 test_dataset_final 647 0.7026 0.0018 True
log_model_champion 0.2 train_dataset_final 2585 0.6689 0.0047 True
log_model_champion 0.2 test_dataset_final 647 0.6970 0.0074 True
log_model_champion 0.3 train_dataset_final 2585 0.6704 0.0032 True
log_model_champion 0.3 test_dataset_final 647 0.6997 0.0047 True
log_model_champion 0.4 train_dataset_final 2585 0.6661 0.0075 True
log_model_champion 0.4 test_dataset_final 647 0.6903 0.0141 True
log_model_champion 0.5 train_dataset_final 2585 0.6624 0.0112 True
log_model_champion 0.5 test_dataset_final 647 0.6853 0.0191 True
rf_model Baseline (0.0) train_dataset_final 2585 1.0000 0.0000 True
rf_model Baseline (0.0) test_dataset_final 647 0.7727 0.0000 True
rf_model 0.1 train_dataset_final 2585 0.9822 0.0178 True
rf_model 0.1 test_dataset_final 647 0.7699 0.0029 True
rf_model 0.2 train_dataset_final 2585 0.9398 0.0602 False
rf_model 0.2 test_dataset_final 647 0.7751 -0.0024 True
rf_model 0.3 train_dataset_final 2585 0.8896 0.1104 False
rf_model 0.3 test_dataset_final 647 0.7681 0.0046 True
rf_model 0.4 train_dataset_final 2585 0.8366 0.1634 False
rf_model 0.4 test_dataset_final 647 0.7619 0.0108 True
rf_model 0.5 train_dataset_final 2585 0.7844 0.2156 False
rf_model 0.5 test_dataset_final 647 0.7459 0.0269 True

Figures

ValidMind Figure validmind.model_validation.sklearn.RobustnessDiagnosis:Champion_vs_LogRegression:b49b
ValidMind Figure validmind.model_validation.sklearn.RobustnessDiagnosis:Champion_vs_LogRegression:3d3a
2025-12-31 22:30:06,050 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.model_validation.sklearn.RobustnessDiagnosis:Champion_vs_LogRegression does not exist in model's document

Run feature importance tests

We also want to verify the relative influence of different input features on our models' predictions, and to compare our champion and challenger models to see whether one offers more interpretable or plausible feature importance scores.

Use list_tests() to identify all the feature importance tests for classification:

# Store the feature importance tests
FI = vm.tests.list_tests(tags=["feature_importance"], task="classification", pretty=False)
FI
['validmind.model_validation.FeaturesAUC',
 'validmind.model_validation.sklearn.PermutationFeatureImportance',
 'validmind.model_validation.sklearn.SHAPGlobalImportance']

We'll only use our testing dataset (vm_test_ds) here to provide a realistic, unseen sample that mimics future or production data, as the training dataset has already influenced our model during learning:

# Run and log our feature importance tests for both models on the testing dataset
for test in FI:
    vm.tests.run_test(
        f"{test}:champion_vs_challenger",
        input_grid={
            "dataset": [vm_test_ds],
            "model": [vm_log_model, vm_rf_model],
        },
    ).log()

Features AUC Champion Vs Challenger

Features AUC: Champion vs Challenger is designed to evaluate the discriminatory power of each individual feature within a binary classification model by calculating the Area Under the Curve (AUC) for each feature separately. The primary purpose of this test is to quantify how well each feature can differentiate between the two classes in a binary classification problem, serving as a tool for pre-modeling feature selection or post-modeling interpretation.

The test operates by treating the values of each feature as raw scores and computing the AUC against the actual binary outcomes. This involves plotting the true positive rate against the false positive rate at various threshold settings to derive the AUC value for each feature. The AUC is a measure of the feature's ability to distinguish between the classes, with values ranging from 0 to 1. A value of 0.5 suggests no discrimination (equivalent to random guessing), while values closer to 1 indicate strong discriminatory power. This test provides a straightforward indication of each feature's univariate classification strength, independent of other variables.
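
In other words, each feature is scored on its own as if it were the model output. A minimal sketch, assuming a hypothetical feature DataFrame X_test and binary labels y_test (not the ValidMind implementation itself):

from sklearn.metrics import roc_auc_score

# Univariate AUC per feature: the raw feature values act as the "score"
feature_aucs = {col: roc_auc_score(y_test, X_test[col]) for col in X_test.columns}

for name, auc in sorted(feature_aucs.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {auc:.3f}")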

The primary advantages of this test include its ability to isolate the individual contribution of features to the classification task without the influence of other variables. This makes it particularly useful for initial feature evaluation, allowing practitioners to identify which features are inherently strong predictors. Additionally, it provides insights into the model's reliance on individual features after training, helping to understand the model's behavior and potential areas of improvement. By focusing on univariate analysis, it simplifies the complexity often associated with multivariate interactions, making it accessible for quick assessments.

It should be noted that this test does not reflect the combined effects of features or any interaction between them, which can be critical in certain models. The AUC values are calculated without considering the model's use of the features, which could lead to different interpretations of feature importance when considering the model holistically. Furthermore, this metric is applicable only to binary classification tasks and cannot be directly extended to multiclass classification or regression without modifications. Signs of high risk include features with low AUC scores that may not contribute significantly to class differentiation, or unexpectedly high AUC scores for features not believed to be informative, which may suggest data leakage or other data issues.

This test shows the AUC scores for individual features in a binary classification model, presented in a bar plot format. Each bar represents a feature, with the length of the bar corresponding to the AUC score, which ranges from 0 to 1. The plot allows for easy comparison of the discriminatory power of each feature. Key measurements include the AUC values for features such as Balance, Geography_Germany, HasCrCard, and others. Notable observations include the feature Balance having the highest AUC score, indicating strong discriminatory power, while features like IsActiveMember and Gender_Male have lower scores, suggesting weaker individual predictive strength. The scale of the AUC values provides a clear indication of each feature's ability to differentiate between the classes, with higher values indicating better performance.

The test results reveal the following key insights:

  • Balance Shows Strong Discriminatory Power: The feature Balance has the highest AUC score, indicating it is a strong predictor of class differentiation.
  • Geography_Germany and HasCrCard Are Effective Predictors: These features also exhibit relatively high AUC scores, suggesting they contribute significantly to the model's classification ability.
  • Lower AUC for IsActiveMember and Gender_Male: These features have lower AUC scores, indicating they may not be as effective in distinguishing between the classes on their own.

Based on these results, the analysis highlights the varying discriminatory power of individual features within the model. Features like Balance and Geography_Germany demonstrate strong predictive capabilities, which could be leveraged for more effective classification. Conversely, features with lower AUC scores, such as IsActiveMember and Gender_Male, may require further investigation to understand their role in the model. This understanding of feature performance can guide future feature selection and model refinement efforts, ensuring that the most informative features are prioritized in the modeling process.

Figures

ValidMind Figure validmind.model_validation.FeaturesAUC:champion_vs_challenger:65d3
ValidMind Figure validmind.model_validation.FeaturesAUC:champion_vs_challenger:8222
2025-12-31 22:30:27,127 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.model_validation.FeaturesAUC:champion_vs_challenger does not exist in model's document

Permutation Feature Importance Champion Vs Challenger

Permutation Feature Importance: Champion vs Challenger is designed to evaluate the significance of each feature in a machine learning model by assessing the impact on model performance when feature values are randomly rearranged. The primary purpose of this test is to identify which features are most influential in the model's predictions, thereby providing insights into the model's decision-making process and potential areas of over-reliance or under-utilization of certain features.

The test operates by utilizing the permutation_importance method from the sklearn.inspection module. This method involves shuffling the values of each feature in the dataset and measuring the resulting change in the model's performance. The underlying principle is that if a feature is important, permuting its values will lead to a significant decrease in model performance. Conversely, if the model's performance remains largely unchanged, the feature is likely not crucial. The metric typically ranges from 0 to 1, where higher values indicate greater importance. The test requires the model's predictions before and after permutation to calculate the importance scores, which are then visualized to show the relative importance of each feature.
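
Since the test builds on scikit-learn's permutation_importance, a stand-alone sketch looks like this, assuming a fitted model plus hypothetical X_test / y_test (again, not the ValidMind implementation itself):

from sklearn.inspection import permutation_importance

result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)

# Mean drop in score across the shuffles, one value per feature
for name, score in zip(X_test.columns, result.importances_mean):
    print(f"{name}: {score:.4f}")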

The primary advantages of this test include its ability to provide clear insights into the importance of different features, which can reveal underlying data structures and potential overfitting. It is model-agnostic, meaning it can be applied to any classifier that provides a measure of prediction accuracy before and after feature permutation. This flexibility makes it particularly useful for comparing different models or understanding the feature dynamics in complex datasets. Additionally, it can highlight features that the model may be overly reliant on, which is crucial for ensuring model robustness and generalizability.

It should be noted that this test does not imply causality; it only indicates the amount of information a feature provides for the prediction task. It also does not account for interactions between features, which can lead to misleading importance scores when features are correlated. Furthermore, the test does not support models from certain libraries such as statsmodels, pytorch, or catboost, limiting its applicability in some contexts. High-risk signs include a model relying heavily on features whose values are highly variable or easily permuted, indicating potential instability, and features that domain knowledge suggests should be significant having little influence on the model's predictions.

This test shows the permutation importance of features for two models: a logistic regression model (log_model_champion) and a random forest model (rf_model). The results are presented in bar plots, with each bar representing a feature and its corresponding importance score. The x-axis shows the importance score, while the y-axis lists the features. In the logistic regression model, "IsActiveMember" and "Geography_Germany" are the most important features, with scores around 0.04. In contrast, the random forest model highlights "NumOfProducts" and "Balance" as the most significant, with scores exceeding 0.1. These plots allow for a visual comparison of feature importance across models, indicating which features each model relies on most heavily.

The test results reveal the following key insights:

  • Logistic Model Feature Reliance: The logistic regression model places significant importance on "IsActiveMember" and "Geography_Germany," suggesting these features are crucial for its predictions.
  • Random Forest Feature Dominance: The random forest model shows a strong reliance on "NumOfProducts" and "Balance," indicating these features are pivotal in its decision-making process.
  • Divergent Feature Importance: There is a notable difference in feature importance between the two models, with each model prioritizing different features, reflecting their distinct mechanisms and potential areas of overfitting.
  • Low Importance Features: Features like "HasCrCard" and "Geography_Spain" have low importance scores in both models, suggesting they contribute minimally to the predictions.

Based on these results, the logistic regression model and the random forest model exhibit distinct patterns of feature importance, highlighting their different approaches to prediction. The logistic model's reliance on membership status and geographic location contrasts with the random forest's focus on product numbers and account balance. This divergence suggests that each model captures different aspects of the data, which could be leveraged for complementary insights or ensemble approaches. Understanding these differences is crucial for interpreting model behavior and ensuring robust, reliable predictions across various scenarios.

Figures

ValidMind Figure validmind.model_validation.sklearn.PermutationFeatureImportance:champion_vs_challenger:a687
ValidMind Figure validmind.model_validation.sklearn.PermutationFeatureImportance:champion_vs_challenger:ff8b
2025-12-31 22:30:53,925 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.model_validation.sklearn.PermutationFeatureImportance:champion_vs_challenger does not exist in model's document
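
If you want to reproduce this comparison independently, sklearn.inspection.permutation_importance (the method referenced in the description above) can be called directly on each fitted model. The sketch below is a minimal illustration: `log_model_champion` and `rf_model` are the model names shown in the results, while `x_test` and `y_test` are assumed stand-ins for the held-out split created earlier in the notebook.

# Minimal sketch: permutation feature importance with scikit-learn.
# `log_model_champion` and `rf_model` are the fitted champion/challenger
# models; `x_test`/`y_test` are assumed names for a held-out DataFrame split.
import pandas as pd
from sklearn.inspection import permutation_importance

def pfi_table(model, X, y, n_repeats=10, random_state=42):
    result = permutation_importance(
        model, X, y, n_repeats=n_repeats, random_state=random_state
    )
    # Mean/std of the performance drop across the shuffled repeats
    return pd.DataFrame(
        {
            "importance_mean": result.importances_mean,
            "importance_std": result.importances_std,
        },
        index=X.columns,
    ).sort_values("importance_mean", ascending=False)

print(pfi_table(log_model_champion, x_test, y_test))
print(pfi_table(rf_model, x_test, y_test))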

SHAP Global Importance: Champion vs Challenger

SHAP Global Importance: Champion vs Challenger is designed to evaluate and visualize global feature importance using SHAP values for model explanation and risk identification. The primary purpose of this test is to elucidate model outcomes by attributing them to the contributing features, assigning a quantifiable global importance to each feature via their respective absolute Shapley values. This makes it suitable for tasks like classification, both binary and multiclass, and forms an essential part of model risk management.

The test operates by selecting a suitable explainer that aligns with the model's type. For tree-based models like XGBClassifier, RandomForestClassifier, and CatBoostClassifier, the TreeExplainer is used, whereas for linear models like LogisticRegression and LinearRegression, the LinearExplainer is employed. The explainer calculates the Shapley values, which are then visualized using two specific graphical representations: the Mean Importance Plot and the Summary Plot. The Mean Importance Plot portrays the significance of individual features based on their absolute Shapley values, calculating the average of these values across all instances to highlight global importance. The Summary Plot combines feature importance with their effects, where each dot represents a Shapley value for a certain feature in a specific case, with a color gradient indicating the feature's value.

The primary advantages of this test include its ability to provide a detailed perspective on how different features shape the model's decision-making logic for each instance. SHAP not only illustrates global feature significance but also offers clear insights into model behavior, making it particularly useful for understanding complex models. By visualizing the contribution of each feature, stakeholders can better comprehend the model's reasoning, which is crucial for tasks involving risk management and decision-making. This transparency helps in identifying potential biases and ensuring that the model's decisions align with business objectives.

It should be noted that high-dimensional data can complicate interpretation, making it challenging to derive clear insights. Associating importance with tangible real-world impact still involves a certain degree of subjectivity, as the significance of features may vary depending on the context. Signs of high risk include overemphasis on certain features in SHAP importance plots, which might hint at model overfitting, and anomalies such as unexpected features showing high importance, suggesting incorrect reasoning. A SHAP summary plot with high variability or widely scattered data points may also indicate potential issues.

This test shows the results through a series of plots, including the Mean Importance Plot and the Summary Plot. The Mean Importance Plot displays normalized feature importance for the champion model, with features like "IsActiveMember," "Geography_Germany," and "Gender_Male" showing the highest importance. The horizontal axis represents the normalized SHAP value as a percentage, indicating the contribution of each feature to the model's predictions. The Summary Plot provides a more detailed view, showing the impact of each feature on the model output, with a color gradient representing feature values. The vertical axis lists the features, while the horizontal axis shows the SHAP value's impact on the model output. The plots for the challenger model focus on "CreditScore" and "Tenure," with SHAP interaction values indicating how these features interact with others.

The test results reveal the following key insights:

  • IsActiveMember Dominance: The feature "IsActiveMember" shows the highest normalized SHAP value, indicating its significant influence on the champion model's predictions.
  • Geographical Influence: "Geography_Germany" and "Geography_Spain" exhibit notable importance, suggesting geographical factors play a crucial role in the model's decision-making.
  • Gender and Balance Impact: "Gender_Male" and "Balance" are also prominent, highlighting their substantial contribution to the model's output.
  • Challenger Model Focus: The challenger model emphasizes "CreditScore" and "Tenure," with interaction plots showing how these features interact with others, affecting the model's predictions.

Based on these results, the SHAP Global Importance test provides a comprehensive view of feature contributions for both the champion and challenger models. The insights reveal the dominant role of certain features, such as "IsActiveMember" and geographical factors, in shaping the champion model's predictions. The challenger model's focus on "CreditScore" and "Tenure" suggests a different approach to feature importance. These observations help in understanding the models' behavior, ensuring alignment with business objectives, and identifying potential areas for further investigation or refinement. The visualizations offer a clear and interpretable representation of feature importance, aiding stakeholders in making informed decisions.

Figures

ValidMind Figure validmind.model_validation.sklearn.SHAPGlobalImportance:champion_vs_challenger:676c
ValidMind Figure validmind.model_validation.sklearn.SHAPGlobalImportance:champion_vs_challenger:7de2
ValidMind Figure validmind.model_validation.sklearn.SHAPGlobalImportance:champion_vs_challenger:55ff
ValidMind Figure validmind.model_validation.sklearn.SHAPGlobalImportance:champion_vs_challenger:e352
2025-12-31 22:31:20,500 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.model_validation.sklearn.SHAPGlobalImportance:champion_vs_challenger does not exist in model's document
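
To cross-check these figures independently, the shap package can be used directly, mirroring the explainer selection described above (a LinearExplainer for the logistic regression champion, a TreeExplainer for the random forest challenger). This is a hedged sketch: `log_model_champion`, `rf_model`, and `x_test` are assumed to be the fitted models and feature-only test frame from earlier in the notebook.

# Minimal sketch: global SHAP importance for champion vs challenger.
# `log_model_champion`, `rf_model`, and `x_test` are assumed names from
# earlier in the notebook; adjust them to your own objects.
import shap

# Champion (linear model): LinearExplainer
champion_explainer = shap.LinearExplainer(log_model_champion, x_test)
champion_shap_values = champion_explainer.shap_values(x_test)
shap.summary_plot(champion_shap_values, x_test, plot_type="bar")

# Challenger (tree ensemble): TreeExplainer
challenger_explainer = shap.TreeExplainer(rf_model)
challenger_shap_values = challenger_explainer.shap_values(x_test)

# Depending on the shap version, classifiers return one array per class
# (a list) or a 3D array; keep the positive ("Exited") class either way.
if isinstance(challenger_shap_values, list):
    challenger_shap_values = challenger_shap_values[1]
elif getattr(challenger_shap_values, "ndim", 2) == 3:
    challenger_shap_values = challenger_shap_values[:, :, 1]

shap.summary_plot(challenger_shap_values, x_test)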

In summary

In this third notebook, you learned how to:

  • Import and independently preprocess the sample dataset used to develop the champion model
  • Train a potential challenger model
  • Pass your challenger model and its predictions to ValidMind
  • Run and log tests that compare the champion model against the challenger

Next steps

Finalize validation and reporting

Now that you're familiar with the basics of using the ValidMind Library to run and log validation tests, let's learn how to implement some custom tests and wrap up our validation: 4 — Finalize validation and reporting