ValidMind for model validation 3 — Developing a potential challenger model

Learn how to use ValidMind for your end-to-end model validation process with our series of four introductory notebooks. In this third notebook, develop a potential challenger model and then pass your model and its predictions to ValidMind.

A challenger model is an alternate model that attempts to outperform the champion model, ensuring that the best performing fit-for-purpose model is always considered for deployment. Challenger models also help avoid over-reliance on a single model, and allow testing of new features, algorithms, or data sources without disrupting the production lifecycle.

Learn by doing

Our course tailor-made for validators new to ValidMind combines this series of notebooks with a more in-depth introduction to the ValidMind Platform — Validator Fundamentals

Prerequisites

In order to develop potential challenger models with this notebook, you'll need to first have:

Need help with the above steps?

Refer to the first two notebooks in this series:

Setting up

This section should be quite familiar to you — as we performed the same actions in the previous notebook, 2 — Start the model validation process.

Initialize the ValidMind Library

As usual, let's first connect the ValidMind Library to the model we previously registered in the ValidMind Platform:

  1. In a browser, log in to ValidMind.

  2. In the left sidebar, navigate to Inventory and select the model you registered for this "ValidMind for model validation" series of notebooks.

  3. Go to Getting Started and click Copy snippet to clipboard.

Next, load your model identifier credentials from an .env file or replace the placeholder with your own code snippet:

# Make sure the ValidMind Library is installed

%pip install -q validmind

# Load your model identifier credentials from an `.env` file

%load_ext dotenv
%dotenv .env

# Or replace with your code snippet

import validmind as vm

vm.init(
    # api_host="...",
    # api_key="...",
    # api_secret="...",
    # model="...",
)
Note: you may need to restart the kernel to use updated packages.
2026-01-28 18:05:38,908 - INFO(validmind.api_client): 🎉 Connected to ValidMind!
📊 Model: [ValidMind Academy] Model validation (ID: cmalguc9y02ok199q2db381ib)
📁 Document Type: validation_report

Import the sample dataset

Next, we'll load the sample Bank Customer Churn Prediction dataset that was used to develop the champion model, which we will then independently preprocess:

# Load the sample dataset
from validmind.datasets.classification import customer_churn as demo_dataset

print(
    f"Loaded demo dataset with: \n\n\t• Target column: '{demo_dataset.target_column}' \n\t• Class labels: {demo_dataset.class_labels}"
)

raw_df = demo_dataset.load_data()
Loaded demo dataset with: 

    • Target column: 'Exited' 
    • Class labels: {'0': 'Did not exit', '1': 'Exited'}

Preprocess the dataset

We’ll apply a simple rebalancing technique to the dataset before continuing:

import pandas as pd

raw_copy_df = raw_df.sample(frac=1)  # Create a shuffled copy of the raw dataset

# Create a balanced dataset with the same number of exited and not exited customers
exited_df = raw_copy_df.loc[raw_copy_df["Exited"] == 1]
not_exited_df = raw_copy_df.loc[raw_copy_df["Exited"] == 0].sample(n=exited_df.shape[0])

balanced_raw_df = pd.concat([exited_df, not_exited_df])
balanced_raw_df = balanced_raw_df.sample(frac=1, random_state=42)
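
As a quick, optional sanity check (not part of the original notebook flow), we can confirm that the rebalanced dataset now contains an equal number of exited and retained customers:

# Optional sanity check: both classes should now have the same number of rows
print(balanced_raw_df["Exited"].value_counts())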

Let’s also quickly remove highly correlated features from the dataset using the output from a ValidMind test.

As you know, before we can run tests we'll need to initialize a ValidMind dataset object with the init_dataset function:

# Register new data and now 'balanced_raw_dataset' is the new dataset object of interest
vm_balanced_raw_dataset = vm.init_dataset(
    dataset=balanced_raw_df,
    input_id="balanced_raw_dataset",
    target_column="Exited",
)

With our balanced dataset initialized, we can then run our test and utilize the output to help us identify the features we want to remove:

# Run HighPearsonCorrelation test with our balanced dataset as input and return a result object
corr_result = vm.tests.run_test(
    test_id="validmind.data_validation.HighPearsonCorrelation",
    params={"max_threshold": 0.3},
    inputs={"dataset": vm_balanced_raw_dataset},
)

❌ High Pearson Correlation

The High Pearson Correlation test identifies pairs of features in the dataset that exhibit strong linear relationships, with the aim of detecting potential feature redundancy or multicollinearity. The results table lists the top ten feature pairs ranked by the absolute value of their Pearson correlation coefficients, along with their corresponding coefficients and Pass/Fail status based on a threshold of 0.3. Only one feature pair exceeds the threshold, while the remaining pairs display lower correlation values and are marked as Pass.

Key insights:

  • Single feature pair exceeds correlation threshold: The pair (Age, Exited) shows a Pearson correlation coefficient of 0.3499, surpassing the 0.3 threshold and resulting in a Fail status.
  • All other feature pairs below threshold: The remaining nine feature pairs have absolute correlation coefficients ranging from 0.1947 to 0.0419, all below the 0.3 threshold and marked as Pass.
  • Predominantly weak linear relationships: Most feature pairs exhibit weak linear associations, with coefficients clustered near zero.

The test results indicate that the dataset contains predominantly weak linear relationships among features, with only one pair—(Age, Exited)—exceeding the specified correlation threshold. This suggests limited risk of feature redundancy or multicollinearity based on linear associations, with the exception of the identified pair. The overall correlation structure supports the interpretability and stability of the feature set.

Parameters:

{
  "max_threshold": 0.3
}
            

Tables

Columns Coefficient Pass/Fail
(Age, Exited) 0.3499 Fail
(IsActiveMember, Exited) -0.1947 Pass
(Balance, NumOfProducts) -0.1711 Pass
(Balance, Exited) 0.1483 Pass
(Age, Balance) 0.0519 Pass
(Age, NumOfProducts) -0.0475 Pass
(NumOfProducts, Exited) -0.0462 Pass
(Balance, HasCrCard) -0.0452 Pass
(Tenure, IsActiveMember) -0.0447 Pass
(NumOfProducts, IsActiveMember) 0.0419 Pass
# From result object, extract table from `corr_result.tables`
features_df = corr_result.tables[0].data
features_df
Columns Coefficient Pass/Fail
0 (Age, Exited) 0.3499 Fail
1 (IsActiveMember, Exited) -0.1947 Pass
2 (Balance, NumOfProducts) -0.1711 Pass
3 (Balance, Exited) 0.1483 Pass
4 (Age, Balance) 0.0519 Pass
5 (Age, NumOfProducts) -0.0475 Pass
6 (NumOfProducts, Exited) -0.0462 Pass
7 (Balance, HasCrCard) -0.0452 Pass
8 (Tenure, IsActiveMember) -0.0447 Pass
9 (NumOfProducts, IsActiveMember) 0.0419 Pass
# Extract list of features that failed the test
high_correlation_features = features_df[features_df["Pass/Fail"] == "Fail"]["Columns"].tolist()
high_correlation_features
['(Age, Exited)']
# Extract feature names from the list of strings
high_correlation_features = [feature.split(",")[0].strip("()") for feature in high_correlation_features]
high_correlation_features
['Age']

We can then remove the highly correlated features, re-initialize the dataset with a different input_id, and re-run the test to confirm the fix:

# Remove the highly correlated features from the dataset
balanced_raw_no_age_df = balanced_raw_df.drop(columns=high_correlation_features)

# Re-initialize the dataset object
vm_raw_dataset_preprocessed = vm.init_dataset(
    dataset=balanced_raw_no_age_df,
    input_id="raw_dataset_preprocessed",
    target_column="Exited",
)
# Re-run the test with the reduced feature set
corr_result = vm.tests.run_test(
    test_id="validmind.data_validation.HighPearsonCorrelation",
    params={"max_threshold": 0.3},
    inputs={"dataset": vm_raw_dataset_preprocessed},
)

✅ High Pearson Correlation

The High Pearson Correlation test evaluates the linear relationships between feature pairs to identify potential redundancy or multicollinearity within the dataset. The results table presents the top ten absolute Pearson correlation coefficients, each paired with a Pass/Fail status based on a threshold of 0.3. All reported feature pairs have coefficients below the threshold, and each is marked as Pass, indicating no detected high linear correlations among the top pairs.

Key insights:

  • No feature pairs exceed correlation threshold: All reported Pearson correlation coefficients are below the 0.3 threshold, with the highest absolute value observed at 0.1947.
  • Weakest observed correlation is minimal: The lowest absolute coefficient among the top ten is 0.0303, indicating very weak linear association.
  • Consistent Pass status across all pairs: Every feature pair in the top ten is marked as Pass, reflecting the absence of strong linear relationships in the evaluated set.

The results indicate that the dataset does not exhibit high linear correlations among the top feature pairs, suggesting a low risk of feature redundancy or multicollinearity based on the tested threshold. The observed correlation structure supports the interpretability and stability of subsequent modeling efforts.

Parameters:

{
  "max_threshold": 0.3
}
            

Tables

Columns Coefficient Pass/Fail
(IsActiveMember, Exited) -0.1947 Pass
(Balance, NumOfProducts) -0.1711 Pass
(Balance, Exited) 0.1483 Pass
(NumOfProducts, Exited) -0.0462 Pass
(Balance, HasCrCard) -0.0452 Pass
(Tenure, IsActiveMember) -0.0447 Pass
(NumOfProducts, IsActiveMember) 0.0419 Pass
(Tenure, EstimatedSalary) 0.0338 Pass
(HasCrCard, IsActiveMember) -0.0338 Pass
(CreditScore, Balance) 0.0303 Pass

Split the preprocessed dataset

With our raw dataset rebalanced and the highly correlated features removed, let's now split the dataset into train and test sets in preparation for model evaluation testing:

# Encode categorical features in the dataset
balanced_raw_no_age_df = pd.get_dummies(
    balanced_raw_no_age_df, columns=["Geography", "Gender"], drop_first=True
)
balanced_raw_no_age_df.head()
CreditScore Tenure Balance NumOfProducts HasCrCard IsActiveMember EstimatedSalary Exited Geography_Germany Geography_Spain Gender_Male
5143 582 4 0.00 2 0 0 156153.27 0 False False True
5633 580 2 130334.84 2 1 1 51672.08 0 True False True
5567 706 5 0.00 2 1 1 81718.37 0 False False True
5948 778 1 151958.19 3 1 1 131238.37 1 True False False
4332 558 1 153697.53 2 0 0 89891.40 1 False False False
from sklearn.model_selection import train_test_split

# Split the dataset into train and test
train_df, test_df = train_test_split(balanced_raw_no_age_df, test_size=0.20)

X_train = train_df.drop("Exited", axis=1)
y_train = train_df["Exited"]
X_test = test_df.drop("Exited", axis=1)
y_test = test_df["Exited"]
# Initialize the split datasets
vm_train_ds = vm.init_dataset(
    input_id="train_dataset_final",
    dataset=train_df,
    target_column="Exited",
)

vm_test_ds = vm.init_dataset(
    input_id="test_dataset_final",
    dataset=test_df,
    target_column="Exited",
)

Import the champion model

With our raw dataset assessed and preprocessed, let's go ahead and import the champion model submitted by the model development team as a .pkl file: lr_model_champion.pkl

# Import the champion model
import pickle as pkl

with open("lr_model_champion.pkl", "rb") as f:
    log_reg = pkl.load(f)
/opt/hostedtoolcache/Python/3.11.14/x64/lib/python3.11/site-packages/sklearn/base.py:463: InconsistentVersionWarning:

Trying to unpickle estimator LogisticRegression from version 1.3.2 when using version 1.8.0. This might lead to breaking code or invalid results. Use at your own risk. For more info please refer to:
https://scikit-learn.org/stable/model_persistence.html#security-maintainability-limitations
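
Before moving on, we can optionally confirm what was loaded (a quick sanity check; the warning above simply flags that the pickle was created with an older scikit-learn version than the one installed):

# Confirm the loaded estimator type and inspect its hyperparameters
print(type(log_reg).__name__)
print(log_reg.get_params())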

Training a potential challenger model

We're curious how an alternate model compares to our champion model, so let's train a challenger model as a basis for our testing.

Our champion logistic regression model is a simpler, parametric model that assumes a linear relationship between the independent variables and the log-odds of the outcome. While logistic regression may not capture complex patterns as effectively, it offers a high degree of interpretability and is easier to explain to stakeholders. However, model risk is not assessed on the basis of a single factor in isolation, but rather in consideration of the trade-offs between predictive performance, ease of interpretability, and overall alignment with business objectives.

Random forest classification model

A random forest classification model is an ensemble machine learning algorithm that uses multiple decision trees to classify data. In ensemble learning, multiple models are combined to improve prediction accuracy and robustness.

Random forest classification models generally have higher accuracy because they capture complex, non-linear relationships, but as a result they lack transparency in their predictions.

# Import the Random Forest Classification model
from sklearn.ensemble import RandomForestClassifier

# Create the model instance with 50 decision trees
rf_model = RandomForestClassifier(
    n_estimators=50,
    random_state=42,
)

# Train the model
rf_model.fit(X_train, y_train)
RandomForestClassifier(n_estimators=50, random_state=42)

Initializing the model objects

Initialize the model objects

In addition to the initialized datasets, you'll also need to initialize a ValidMind model object (vm_model) for each of our two models, so that they can be passed to other functions for analysis and tests on the data.

You simply initialize this model object with vm.init_model():

# Initialize the champion logistic regression model
vm_log_model = vm.init_model(
    log_reg,
    input_id="log_model_champion",
)

# Initialize the challenger random forest classification model
vm_rf_model = vm.init_model(
    rf_model,
    input_id="rf_model",
)

Assign predictions

With our models registered, we'll move on to assigning both the predictive probabilities coming directly from each model, and the binary class predictions obtained after applying a cutoff threshold to those probabilities.

  • The assign_predictions() method from the Dataset object can link existing predictions to any number of models.
  • This method links the model's class prediction values and probabilities to our vm_train_ds and vm_test_ds datasets.

If no prediction values are passed, the method will compute predictions automatically:

# Champion — Logistic regression model
vm_train_ds.assign_predictions(model=vm_log_model)
vm_test_ds.assign_predictions(model=vm_log_model)

# Challenger — Random forest classification model
vm_train_ds.assign_predictions(model=vm_rf_model)
vm_test_ds.assign_predictions(model=vm_rf_model)
2026-01-28 18:05:51,731 - INFO(validmind.vm_models.dataset.utils): Running predict_proba()... This may take a while
2026-01-28 18:05:51,732 - INFO(validmind.vm_models.dataset.utils): Done running predict_proba()
2026-01-28 18:05:51,733 - INFO(validmind.vm_models.dataset.utils): Running predict()... This may take a while
2026-01-28 18:05:51,735 - INFO(validmind.vm_models.dataset.utils): Done running predict()
2026-01-28 18:05:51,737 - INFO(validmind.vm_models.dataset.utils): Running predict_proba()... This may take a while
2026-01-28 18:05:51,738 - INFO(validmind.vm_models.dataset.utils): Done running predict_proba()
2026-01-28 18:05:51,739 - INFO(validmind.vm_models.dataset.utils): Running predict()... This may take a while
2026-01-28 18:05:51,741 - INFO(validmind.vm_models.dataset.utils): Done running predict()
2026-01-28 18:05:51,744 - INFO(validmind.vm_models.dataset.utils): Running predict_proba()... This may take a while
2026-01-28 18:05:51,768 - INFO(validmind.vm_models.dataset.utils): Done running predict_proba()
2026-01-28 18:05:51,769 - INFO(validmind.vm_models.dataset.utils): Running predict()... This may take a while
2026-01-28 18:05:51,792 - INFO(validmind.vm_models.dataset.utils): Done running predict()
2026-01-28 18:05:51,796 - INFO(validmind.vm_models.dataset.utils): Running predict_proba()... This may take a while
2026-01-28 18:05:51,808 - INFO(validmind.vm_models.dataset.utils): Done running predict_proba()
2026-01-28 18:05:51,810 - INFO(validmind.vm_models.dataset.utils): Running predict()... This may take a while
2026-01-28 18:05:51,822 - INFO(validmind.vm_models.dataset.utils): Done running predict()
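
For reference, the automatic computation above is roughly equivalent to calling the underlying scikit-learn methods yourself: probabilities from predict_proba() and binary class predictions from a cutoff (0.5 by default). A minimal illustrative sketch, shown here only to make the cutoff explicit and not part of the ValidMind flow:

# Probabilities for the positive class from the champion model (illustration only)
champion_probs = log_reg.predict_proba(X_test)[:, 1]

# Apply a 0.5 cutoff to obtain binary class predictions
champion_preds = (champion_probs >= 0.5).astype(int)

print(champion_preds[:10])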

Running model evaluation tests

With our setup complete, let's run the rest of our validation tests. Since we have already verified the data quality of the dataset used to train our champion model, we will now focus on comprehensive performance evaluations of both the champion and challenger models.

Run model performance tests

Let's run some performance tests, beginning with independent testing of our champion logistic regression model, then moving on to our potential challenger model.

Use vm.tests.list_tests() to identify all the model performance tests for classification:


vm.tests.list_tests(tags=["model_performance"], task="classification")
ID Name Description Has Figure Has Table Required Inputs Params Tags Tasks
validmind.model_validation.sklearn.CalibrationCurve Calibration Curve Evaluates the calibration of probability estimates by comparing predicted probabilities against observed... True False ['model', 'dataset'] {'n_bins': {'type': 'int', 'default': 10}} ['sklearn', 'model_performance', 'classification'] ['classification']
validmind.model_validation.sklearn.ClassifierPerformance Classifier Performance Evaluates performance of binary or multiclass classification models using precision, recall, F1-Score, accuracy,... False True ['dataset', 'model'] {'average': {'type': 'str', 'default': 'macro'}} ['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance'] ['classification', 'text_classification']
validmind.model_validation.sklearn.ConfusionMatrix Confusion Matrix Evaluates and visually represents the classification ML model's predictive performance using a Confusion Matrix... True False ['dataset', 'model'] {'threshold': {'type': 'float', 'default': 0.5}} ['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance', 'visualization'] ['classification', 'text_classification']
validmind.model_validation.sklearn.HyperParametersTuning Hyper Parameters Tuning Performs exhaustive grid search over specified parameter ranges to find optimal model configurations... False True ['model', 'dataset'] {'param_grid': {'type': 'dict', 'default': None}, 'scoring': {'type': 'Union', 'default': None}, 'thresholds': {'type': 'Union', 'default': None}, 'fit_params': {'type': 'dict', 'default': None}} ['sklearn', 'model_performance'] ['clustering', 'classification']
validmind.model_validation.sklearn.MinimumAccuracy Minimum Accuracy Checks if the model's prediction accuracy meets or surpasses a specified threshold.... False True ['dataset', 'model'] {'min_threshold': {'type': 'float', 'default': 0.7}} ['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance'] ['classification', 'text_classification']
validmind.model_validation.sklearn.MinimumF1Score Minimum F1 Score Assesses if the model's F1 score on the validation set meets a predefined minimum threshold, ensuring balanced... False True ['dataset', 'model'] {'min_threshold': {'type': 'float', 'default': 0.5}} ['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance'] ['classification', 'text_classification']
validmind.model_validation.sklearn.MinimumROCAUCScore Minimum ROCAUC Score Validates model by checking if the ROC AUC score meets or surpasses a specified threshold.... False True ['dataset', 'model'] {'min_threshold': {'type': 'float', 'default': 0.5}} ['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance'] ['classification', 'text_classification']
validmind.model_validation.sklearn.ModelsPerformanceComparison Models Performance Comparison Evaluates and compares the performance of multiple Machine Learning models using various metrics like accuracy,... False True ['dataset', 'models'] {} ['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance', 'model_comparison'] ['classification', 'text_classification']
validmind.model_validation.sklearn.PopulationStabilityIndex Population Stability Index Assesses the Population Stability Index (PSI) to quantify the stability of an ML model's predictions across... True True ['datasets', 'model'] {'num_bins': {'type': 'int', 'default': 10}, 'mode': {'type': 'str', 'default': 'fixed'}} ['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance'] ['classification', 'text_classification']
validmind.model_validation.sklearn.PrecisionRecallCurve Precision Recall Curve Evaluates the precision-recall trade-off for binary classification models and visualizes the Precision-Recall curve.... True False ['model', 'dataset'] {} ['sklearn', 'binary_classification', 'model_performance', 'visualization'] ['classification', 'text_classification']
validmind.model_validation.sklearn.ROCCurve ROC Curve Evaluates binary classification model performance by generating and plotting the Receiver Operating Characteristic... True False ['model', 'dataset'] {} ['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance', 'visualization'] ['classification', 'text_classification']
validmind.model_validation.sklearn.RegressionErrors Regression Errors Assesses the performance and error distribution of a regression model using various error metrics.... False True ['model', 'dataset'] {} ['sklearn', 'model_performance'] ['regression', 'classification']
validmind.model_validation.sklearn.TrainingTestDegradation Training Test Degradation Tests if model performance degradation between training and test datasets exceeds a predefined threshold.... False True ['datasets', 'model'] {'max_threshold': {'type': 'float', 'default': 0.1}} ['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance', 'visualization'] ['classification', 'text_classification']
validmind.model_validation.statsmodels.GINITable GINI Table Evaluates classification model performance using AUC, GINI, and KS metrics for training and test datasets.... False True ['dataset', 'model'] {} ['model_performance'] ['classification']
validmind.ongoing_monitoring.CalibrationCurveDrift Calibration Curve Drift Evaluates changes in probability calibration between reference and monitoring datasets.... True True ['datasets', 'model'] {'n_bins': {'type': 'int', 'default': 10}, 'drift_pct_threshold': {'type': 'float', 'default': 20}} ['sklearn', 'binary_classification', 'model_performance', 'visualization'] ['classification', 'text_classification']
validmind.ongoing_monitoring.ClassDiscriminationDrift Class Discrimination Drift Compares classification discrimination metrics between reference and monitoring datasets.... False True ['datasets', 'model'] {'drift_pct_threshold': {'type': '_empty', 'default': 20}} ['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance'] ['classification', 'text_classification']
validmind.ongoing_monitoring.ClassificationAccuracyDrift Classification Accuracy Drift Compares classification accuracy metrics between reference and monitoring datasets.... False True ['datasets', 'model'] {'drift_pct_threshold': {'type': '_empty', 'default': 20}} ['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance'] ['classification', 'text_classification']
validmind.ongoing_monitoring.ConfusionMatrixDrift Confusion Matrix Drift Compares confusion matrix metrics between reference and monitoring datasets.... False True ['datasets', 'model'] {'drift_pct_threshold': {'type': '_empty', 'default': 20}} ['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance'] ['classification', 'text_classification']
validmind.ongoing_monitoring.ROCCurveDrift ROC Curve Drift Compares ROC curves between reference and monitoring datasets.... True False ['datasets', 'model'] {} ['sklearn', 'binary_classification', 'model_performance', 'visualization'] ['classification', 'text_classification']

We'll isolate the specific tests we want to run in a list called mpt.

As we learned in the previous notebook 2 — Start the model validation process, you can use a custom result_id to tag the individual result with a unique identifier by appending this result_id to the test_id with a : separator. We'll append an identifier for our champion model here:

mpt = [
    "validmind.model_validation.sklearn.ClassifierPerformance:logreg_champion",
    "validmind.model_validation.sklearn.ConfusionMatrix:logreg_champion",
    "validmind.model_validation.sklearn.MinimumAccuracy:logreg_champion",
    "validmind.model_validation.sklearn.MinimumF1Score:logreg_champion",
    "validmind.model_validation.sklearn.ROCCurve:logreg_champion"
]

Evaluate performance of the champion model

Now, let's run and log our batch of model performance tests using our testing dataset (vm_test_ds) for our champion model:

  • The test set serves as a proxy for real-world data, providing an unbiased estimate of model performance since it was not used during training or tuning.
  • The test set also acts as protection against selection bias and model tweaking, giving a final, more unbiased checkpoint.
for test in mpt:
    vm.tests.run_test(
        test,
        inputs={
            "dataset": vm_test_ds,
            "model": vm_log_model,
        },
    ).log()

Classifier Performance Logreg Champion

The Classifier Performance test evaluates the predictive effectiveness of the classification model by reporting precision, recall, F1-Score, accuracy, and ROC AUC metrics. The results are presented for each class, as well as macro and weighted averages, providing a comprehensive view of model performance across all classes. The accuracy and ROC AUC scores are also reported, summarizing the model's overall classification ability and its discrimination capacity between classes.

Key insights:

  • Balanced class-wise performance: Precision, recall, and F1-Score are similar for both classes (Class 0: F1 = 0.6473; Class 1: F1 = 0.6385), indicating consistent model behavior across classes.
  • Moderate overall accuracy: The model achieves an accuracy of 0.643, reflecting moderate correct classification rates on the test set.
  • Macro and weighted averages align: Macro and weighted averages for precision, recall, and F1-Score are all 0.6429, suggesting class balance in the dataset and uniform model performance.
  • ROC AUC indicates moderate discrimination: The ROC AUC score of 0.6726 demonstrates moderate ability to distinguish between classes.

The results indicate that the model exhibits consistent and balanced performance across both classes, with moderate accuracy and discrimination capability as reflected by the ROC AUC. The alignment of macro and weighted averages further supports the absence of significant class imbalance or performance disparity. Overall, the model demonstrates stable but moderate classification effectiveness on the evaluated dataset.

Tables

Precision, Recall, and F1

Class Precision Recall F1
0 0.6444 0.6503 0.6473
1 0.6415 0.6355 0.6385
Weighted Average 0.6430 0.6430 0.6429
Macro Average 0.6429 0.6429 0.6429

Accuracy and ROC AUC

Metric Value
Accuracy 0.6430
ROC AUC 0.6726
2026-01-28 18:06:04,551 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.model_validation.sklearn.ClassifierPerformance:logreg_champion does not exist in model's document

Confusion Matrix Logreg Champion

The Confusion Matrix test evaluates the classification performance of the logistic regression model by comparing predicted and actual class labels, providing a breakdown of true positives, true negatives, false positives, and false negatives. The resulting matrix visually displays the distribution of correct and incorrect predictions across both classes. The matrix quantifies the model's ability to distinguish between the two classes and highlights the types and frequencies of classification errors.

Key insights:

  • Balanced distribution of correct predictions: The model produced 204 true positives and 212 true negatives, indicating similar effectiveness in identifying both positive and negative cases.
  • Notable presence of misclassifications: There are 117 false negatives and 114 false positives, reflecting a comparable rate of both error types across the two classes.
  • Error rates are non-negligible: The number of false positives and false negatives is substantial relative to the number of correct predictions, indicating that misclassification risk is present for both classes.

The confusion matrix reveals that the logistic regression model demonstrates similar performance in correctly identifying both classes, with true positives and true negatives occurring at comparable frequencies. However, the presence of substantial false positives and false negatives indicates that the model's classification errors are distributed across both classes, warranting attention to both types of misclassification risk in further evaluation.

Figures

ValidMind Figure validmind.model_validation.sklearn.ConfusionMatrix:logreg_champion:9091
2026-01-28 18:06:17,592 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.model_validation.sklearn.ConfusionMatrix:logreg_champion does not exist in model's document

❌ Minimum Accuracy Logreg Champion

The Minimum Accuracy test evaluates whether the model's prediction accuracy meets or exceeds a specified threshold, in this case 0.7. The results table presents the model's observed accuracy score, the threshold applied, and the resulting pass/fail outcome. The model's accuracy score is 0.643, which is compared directly to the threshold value to determine test status.

Key insights:

  • Accuracy below threshold: The model achieved an accuracy score of 0.643, which is below the specified threshold of 0.7.
  • Test outcome is Fail: The test result is marked as "Fail" due to the accuracy score not meeting the minimum requirement.

The results indicate that the model did not achieve the minimum accuracy criterion established for this evaluation. The observed accuracy shortfall relative to the threshold signals that the model's predictive performance, as measured by overall correctness, does not satisfy the predefined acceptance standard for this test.

Tables

Score Threshold Pass/Fail
0.643 0.7 Fail
2026-01-28 18:06:22,510 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.model_validation.sklearn.MinimumAccuracy:logreg_champion does not exist in model's document

✅ Minimum F1 Score Logreg Champion

The MinimumF1Score:logreg_champion test evaluates whether the model's F1 score on the validation set meets or exceeds a predefined minimum threshold, ensuring balanced performance between precision and recall. The results table presents the observed F1 score, the minimum threshold for passing, and the pass/fail outcome. The model achieved an F1 score of 0.6385, compared against a threshold of 0.5, with the test outcome recorded as "Pass".

Key insights:

  • F1 score exceeds minimum threshold: The model's F1 score of 0.6385 is above the required threshold of 0.5.
  • Test outcome is Pass: The model meets the minimum performance standard for balanced precision and recall as defined by the test criteria.

The results indicate that the model demonstrates balanced classification performance on the validation set, with the F1 score surpassing the established minimum requirement. The test outcome confirms that the model satisfies the specified standard for combined precision and recall.

Tables

Score Threshold Pass/Fail
0.6385 0.5 Pass
2026-01-28 18:06:26,835 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.model_validation.sklearn.MinimumF1Score:logreg_champion does not exist in model's document

ROC Curve Logreg Champion

The ROC Curve test evaluates the binary classification performance of the log_model_champion by plotting the Receiver Operating Characteristic (ROC) curve and calculating the Area Under the Curve (AUC) on the test_dataset_final. The resulting plot displays the trade-off between the true positive rate and false positive rate across all classification thresholds, with the model's ROC curve compared against a baseline representing random classification. The AUC value is provided as a summary metric of the model's discriminative ability.

Key insights:

  • AUC indicates moderate discriminative power: The model achieves an AUC of 0.67, which is above the random baseline of 0.5, indicating the model can distinguish between the two classes better than chance.
  • ROC curve consistently above random line: The ROC curve remains above the diagonal line representing random performance, confirming the model's ability to provide meaningful separation between positive and negative classes across thresholds.
  • No evidence of near-random or inverted performance: The ROC curve does not approach or fall below the random line at any threshold, indicating the absence of high-risk model failure modes.

The ROC Curve test results demonstrate that log_model_champion exhibits moderate classification performance on the test dataset, with an AUC of 0.67 reflecting a measurable ability to differentiate between classes. The ROC curve's consistent position above the random baseline further supports the model's discriminative capability, with no indications of performance collapse or inversion.

Figures

ValidMind Figure validmind.model_validation.sklearn.ROCCurve:logreg_champion:eff5
2026-01-28 18:06:37,289 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.model_validation.sklearn.ROCCurve:logreg_champion does not exist in model's document
Note the output returned indicating that a test-driven block doesn't currently exist in your model's documentation for some test IDs.

That's expected: when we run validation tests, the logged results need to be manually added to your report as part of your compliance assessment process within the ValidMind Platform.

Log an artifact

As we can observe from the output above, our champion model doesn't pass the MinimumAccuracy test based on the default threshold of the out-of-the-box test, so let's log an artifact (finding) in the ValidMind Platform (Need more help?):

  1. From the Inventory in the ValidMind Platform, go to the model you connected to earlier.

  2. In the left sidebar that appears for your model, click Validation Report under Documents.

  3. Locate the Data Preparation section and click on 2.2.2. Model Performance to expand that section.

  4. Under the Model Performance Metrics section, locate Artifacts then click Link Artifact to Report:

    Screenshot showing the validation report with the link artifact option highlighted

  5. Select Validation Issue as the type of artifact.

  6. Click + Add Validation Issue to add a validation issue type artifact.

  7. Enter in the details for your validation issue, for example:

    • TITLE — Champion Logistic Regression Model Fails Minimum Accuracy Threshold
    • RISK AREA — Model Performance
    • DOCUMENTATION SECTION — 3.2. Model Evaluation
    • DESCRIPTION — The logistic regression champion model was subjected to a Minimum Accuracy test to determine whether its predictive accuracy meets the predefined performance threshold of 0.7. The model achieved an accuracy score of 0.643, which falls below the required minimum. As a result, the test produced a Fail outcome.
  8. Click Save.

  9. Select the validation issue you just added to link to your validation report and click Update Linked Artifacts to insert your validation issue.

  10. Click on the validation issue to expand the issue, where you can adjust details such as severity, owner, due date, status, etc. as well as include proposed remediation plans or supporting documentation as attachments.

Evaluate performance of challenger model

We've now conducted tests on our champion model similar to those run by the model development team, with the aim of verifying their results.

Next, let's see how our challenger model compares. We'll use the same batch of tests here as we did in mpt, but append a different result_id to indicate that these results should be associated with our challenger model:

mpt_chall = [
    "validmind.model_validation.sklearn.ClassifierPerformance:champion_vs_challenger",
    "validmind.model_validation.sklearn.ConfusionMatrix:champion_vs_challenger",
    "validmind.model_validation.sklearn.MinimumAccuracy:champion_vs_challenger",
    "validmind.model_validation.sklearn.MinimumF1Score:champion_vs_challenger",
    "validmind.model_validation.sklearn.ROCCurve:champion_vs_challenger"
]

We'll run each test once for each model with the same vm_test_ds dataset to compare them:

for test in mpt_chall:
    vm.tests.run_test(
        test,
        input_grid={
            "dataset": [vm_test_ds],
            "model": [vm_log_model, vm_rf_model],
        },
    ).log()

Classifier Performance Champion Vs Challenger

The Classifier Performance test evaluates the predictive effectiveness of classification models by reporting precision, recall, F1-Score, accuracy, and ROC AUC metrics. The results compare two models, "log_model_champion" and "rf_model," across these metrics for both classes, as well as macro and weighted averages. The tables present detailed class-level and aggregate performance scores, enabling direct comparison of model discrimination and overall accuracy.

Key insights:

  • rf_model outperforms log_model_champion across all metrics: rf_model achieves higher precision, recall, F1-Score, accuracy (0.7202), and ROC AUC (0.7926) compared to log_model_champion, which records accuracy of 0.643 and ROC AUC of 0.6726.
  • Consistent class-level performance for rf_model: rf_model shows balanced precision and recall for both classes (precision: 0.7126 for class 0, 0.7288 for class 1; recall: 0.7454 for class 0, 0.6947 for class 1), resulting in macro and weighted averages closely aligned at 0.72.
  • log_model_champion exhibits lower and more uniform scores: log_model_champion displays similar precision and recall for both classes (precision: 0.6444 for class 0, 0.6415 for class 1; recall: 0.6503 for class 0, 0.6355 for class 1), with macro and weighted averages at 0.6429 and 0.643, respectively.

The results indicate that rf_model demonstrates superior classification performance relative to log_model_champion, as evidenced by higher scores across all evaluated metrics. Both models exhibit balanced class-level metrics, but rf_model achieves notably higher discrimination and accuracy, as reflected in its ROC AUC and F1-Score values. The observed performance differentials provide clear evidence of rf_model's enhanced predictive capability in this test context.

Tables

model Class Precision Recall F1
log_model_champion 0 0.6444 0.6503 0.6473
log_model_champion 1 0.6415 0.6355 0.6385
log_model_champion Weighted Average 0.6430 0.6430 0.6429
log_model_champion Macro Average 0.6429 0.6429 0.6429
rf_model 0 0.7126 0.7454 0.7286
rf_model 1 0.7288 0.6947 0.7113
rf_model Weighted Average 0.7206 0.7202 0.7200
rf_model Macro Average 0.7207 0.7201 0.7200
model Metric Value
log_model_champion Accuracy 0.6430
log_model_champion ROC AUC 0.6726
rf_model Accuracy 0.7202
rf_model ROC AUC 0.7926
2026-01-28 18:06:44,765 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.model_validation.sklearn.ClassifierPerformance:champion_vs_challenger does not exist in model's document

Confusion Matrix Champion Vs Challenger

The Confusion Matrix test evaluates the classification performance of the champion (logistic regression) and challenger (random forest) models by comparing predicted versus actual class labels. The resulting matrices display the counts of True Positives (TP), True Negatives (TN), False Positives (FP), and False Negatives (FN) for each model, providing a detailed breakdown of correct and incorrect predictions. These results enable direct comparison of error types and overall predictive accuracy between the two models.

Key insights:

  • Challenger model reduces both FP and FN: The random forest model (challenger) records 83 False Positives and 98 False Negatives, compared to 114 False Positives and 117 False Negatives for the logistic regression (champion), indicating improved error control.
  • Higher correct classification in challenger: The challenger model achieves 223 True Positives and 243 True Negatives, both higher than the champion model’s 204 True Positives and 212 True Negatives.
  • Error distribution shifts favor challenger: The challenger model demonstrates a more favorable balance between correct and incorrect classifications, with lower misclassification rates across both positive and negative classes.

The confusion matrix results indicate that the challenger (random forest) model outperforms the champion (logistic regression) model in both reducing misclassification errors and increasing the number of correct predictions. The challenger model achieves higher counts of both True Positives and True Negatives, while simultaneously lowering both False Positives and False Negatives, reflecting a more effective classification performance across the evaluated dataset.

Figures

ValidMind Figure validmind.model_validation.sklearn.ConfusionMatrix:champion_vs_challenger:a311
ValidMind Figure validmind.model_validation.sklearn.ConfusionMatrix:champion_vs_challenger:018f
2026-01-28 18:06:57,952 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.model_validation.sklearn.ConfusionMatrix:champion_vs_challenger does not exist in model's document

❌ Minimum Accuracy Champion Vs Challenger

The Minimum Accuracy test evaluates whether each model's prediction accuracy meets or exceeds a specified threshold, with results presented for both the log_model_champion and rf_model. The table displays the accuracy scores, the threshold applied (0.7), and the corresponding pass/fail outcome for each model. The log_model_champion achieved an accuracy score of 0.643, while the rf_model achieved a score of 0.7202, allowing for direct comparison against the threshold.

Key insights:

  • rf_model meets accuracy threshold: The rf_model achieved an accuracy score of 0.7202, surpassing the minimum threshold of 0.7 and receiving a "Pass" outcome.
  • log_model_champion falls below threshold: The log_model_champion recorded an accuracy score of 0.643, which is below the threshold, resulting in a "Fail" outcome.
  • Clear differentiation in model performance: The two models display a marked difference in accuracy relative to the threshold, with only the rf_model meeting the minimum requirement.

The results indicate that, under the specified test conditions, the rf_model satisfies the minimum accuracy criterion, while the log_model_champion does not. This distinction highlights a performance gap between the models with respect to overall prediction accuracy as measured by the test.

Tables

model Score Threshold Pass/Fail
log_model_champion 0.6430 0.7 Fail
rf_model 0.7202 0.7 Pass
2026-01-28 18:07:04,550 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.model_validation.sklearn.MinimumAccuracy:champion_vs_challenger does not exist in model's document

✅ Minimum F1 Score Champion Vs Challenger

The MinimumF1Score:champion_vs_challenger test evaluates whether each model's F1 score on the validation set meets or exceeds a predefined minimum threshold, ensuring balanced performance between precision and recall. The results table presents F1 scores for two models—log_model_champion and rf_model—alongside the minimum threshold and pass/fail status. Both models are assessed against a threshold of 0.5, with their respective F1 scores and outcomes displayed.

Key insights:

  • Both models exceed the minimum F1 threshold: log_model_champion and rf_model achieved F1 scores of 0.6385 and 0.7113, respectively, both surpassing the 0.5 threshold.
  • rf_model demonstrates higher F1 performance: rf_model outperforms log_model_champion by 0.0728 in F1 score, indicating stronger balance between precision and recall on the validation set.
  • All models pass the test criteria: Both models are marked as "Pass," confirming that each meets the minimum F1 score requirement.

Both evaluated models satisfy the minimum F1 score criterion, indicating balanced classification performance on the validation set. The rf_model demonstrates comparatively higher F1 performance, suggesting improved precision-recall tradeoff relative to log_model_champion. No models fall below the established threshold, and all pass the test as specified.

Tables

model Score Threshold Pass/Fail
log_model_champion 0.6385 0.5 Pass
rf_model 0.7113 0.5 Pass
2026-01-28 18:07:08,961 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.model_validation.sklearn.MinimumF1Score:champion_vs_challenger does not exist in model's document

ROC Curve Champion Vs Challenger

The ROC Curve test evaluates the discrimination ability of binary classification models by plotting the True Positive Rate against the False Positive Rate at various thresholds and calculating the Area Under the Curve (AUC) score. The results present ROC curves and AUC values for two models—log_model_champion and rf_model—on the test_dataset_final, with each curve compared against a baseline representing random classification (AUC = 0.5). The ROC curves and corresponding AUC scores provide a visual and quantitative assessment of each model's ability to distinguish between the two classes.

Key insights:

  • rf_model demonstrates higher discrimination: The rf_model achieves an AUC of 0.79, indicating stronger separation between classes compared to the log_model_champion.
  • log_model_champion shows moderate performance: The log_model_champion records an AUC of 0.67, reflecting moderate discriminative ability above random chance but below that of the rf_model.
  • Both models outperform random classification: Both ROC curves are consistently above the diagonal line representing random performance (AUC = 0.5), confirming that each model provides meaningful predictive power on the test dataset.

The results indicate that both models possess discriminative capability, with the rf_model exhibiting notably stronger performance as measured by the AUC metric. The log_model_champion provides moderate classification ability, while the rf_model achieves a higher level of class separation, as reflected in the ROC curve and AUC score. Both models demonstrate predictive value beyond random assignment on the evaluated dataset.

Figures

ValidMind Figure validmind.model_validation.sklearn.ROCCurve:champion_vs_challenger:06c4
ValidMind Figure validmind.model_validation.sklearn.ROCCurve:champion_vs_challenger:b9f0
2026-01-28 18:07:23,868 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.model_validation.sklearn.ROCCurve:champion_vs_challenger does not exist in model's document
Based on the performance metrics, our challenger random forest classification model passes the MinimumAccuracy test where our champion did not.

In your validation report, support the recommendation in your validation issue's Proposed Remediation Plan to investigate the use of the challenger model by inserting the performance tests we logged with this notebook into the appropriate section.

Run diagnostic tests

Next, we want to compare the robustness and stability of our champion and challenger models.

Use list_tests() to list all available diagnosis tests applicable to classification tasks:

vm.tests.list_tests(tags=["model_diagnosis"], task="classification")
ID Name Description Has Figure Has Table Required Inputs Params Tags Tasks
validmind.model_validation.sklearn.OverfitDiagnosis Overfit Diagnosis Assesses potential overfitting in a model's predictions, identifying regions where performance between training and... True True ['model', 'datasets'] {'metric': {'type': 'str', 'default': None}, 'cut_off_threshold': {'type': 'float', 'default': 0.04}} ['sklearn', 'binary_classification', 'multiclass_classification', 'linear_regression', 'model_diagnosis'] ['classification', 'regression']
validmind.model_validation.sklearn.RobustnessDiagnosis Robustness Diagnosis Assesses the robustness of a machine learning model by evaluating performance decay under noisy conditions.... True True ['datasets', 'model'] {'metric': {'type': 'str', 'default': None}, 'scaling_factor_std_dev_list': {'type': 'List', 'default': [0.1, 0.2, 0.3, 0.4, 0.5]}, 'performance_decay_threshold': {'type': 'float', 'default': 0.05}} ['sklearn', 'model_diagnosis', 'visualization'] ['classification', 'regression']
validmind.model_validation.sklearn.WeakspotsDiagnosis Weakspots Diagnosis Identifies and visualizes weak spots in a machine learning model's performance across various sections of the... True True ['datasets', 'model'] {'features_columns': {'type': 'Optional', 'default': None}, 'metrics': {'type': 'Optional', 'default': None}, 'thresholds': {'type': 'Optional', 'default': None}} ['sklearn', 'binary_classification', 'multiclass_classification', 'model_diagnosis', 'visualization'] ['classification', 'text_classification']

Let’s now use the OverfitDiagnosis test to assess the models for potential signs of overfitting and to identify any sub-segments where performance may be inconsistent.

Overfitting occurs when a model learns the training data too well, capturing not only the true pattern but also noise and random fluctuations. This results in excellent performance on the training dataset but poor generalization to new, unseen data:

  • Since the training dataset (vm_train_ds) was used to fit the model, we use this set to establish a baseline performance for how well the model performs on data it has already seen.
  • The testing dataset (vm_test_ds) was never seen during training, and here simulates real-world generalization, or how well the model performs on new, unseen data.
vm.tests.run_test(
    test_id="validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger",
    input_grid={
        "datasets": [[vm_train_ds, vm_test_ds]],
        "model": [vm_log_model, vm_rf_model],
    },
).log()

Overfit Diagnosis Champion Vs Challenger

The Overfit Diagnosis test evaluates the extent to which model performance on the training set diverges from performance on the test set across feature segments, using AUC as the metric for classification models. The results are presented for both a logistic regression model (log_model_champion) and a random forest model (rf_model), with AUC gaps calculated for binned regions of key features. Visualizations and tabular data highlight regions where the absolute difference in AUC between training and test sets exceeds the default threshold of 0.04, indicating potential overfitting.

Key insights:

  • Localized overfitting in logistic regression model: For log_model_champion, AUC gaps above the 0.04 threshold are observed in specific segments, notably for CreditScore (0.1524 in [400, 450]), Tenure (0.111 in [2, 3]), Balance (0.1818 in [200718, 225808]), and EstimatedSalary (up to 0.144 in [139939, 159929]).
  • Widespread and pronounced overfitting in random forest model: For rf_model, nearly all feature segments exhibit AUC gaps well above the threshold, with values frequently exceeding 0.2 and reaching as high as 1.0 for Balance ([200718, 225808]). This pattern is consistent across CreditScore, Tenure, Balance, NumOfProducts, HasCrCard, IsActiveMember, EstimatedSalary, Geography_Germany, Geography_Spain, and Gender_Male.
  • Magnitude and consistency of AUC gaps: The random forest model shows consistently high AUC gaps across all feature bins, while the logistic regression model displays more moderate and localized gaps, with most segments remaining below the threshold except for a few isolated regions.
  • Feature segments with limited test data: The largest AUC gaps, particularly in the Balance feature for both models, are associated with bins containing very few test records (e.g., 5 test records for [200718, 225808]), which may contribute to instability in the observed metrics.

The Overfit Diagnosis test reveals that the logistic regression model demonstrates moderate and localized overfitting, with only a subset of feature segments exceeding the AUC gap threshold. In contrast, the random forest model exhibits pervasive and substantial overfitting across nearly all feature segments, with AUC gaps consistently and substantially above the threshold. The most extreme gaps are observed in regions with limited test data, suggesting that both model complexity and data sparsity contribute to the observed overfitting patterns. These findings provide a detailed view of model generalization behavior and highlight specific regions and features where overfitting is most pronounced.

Tables

| model | Feature | Slice | Number of Training Records | Number of Test Records | Training AUC | Test AUC | Gap |
|---|---|---|---|---|---|---|---|
| log_model_champion | CreditScore | (400.0, 450.0] | 47 | 10 | 0.6724 | 0.5200 | 0.1524 |
| log_model_champion | CreditScore | (500.0, 550.0] | 269 | 64 | 0.6734 | 0.6129 | 0.0605 |
| log_model_champion | CreditScore | (750.0, 800.0] | 251 | 63 | 0.6911 | 0.6401 | 0.0510 |
| log_model_champion | Tenure | (2.0, 3.0] | 279 | 68 | 0.6764 | 0.5654 | 0.1110 |
| log_model_champion | Tenure | (5.0, 6.0] | 239 | 68 | 0.6917 | 0.6092 | 0.0825 |
| log_model_champion | Tenure | (9.0, 10.0] | 136 | 31 | 0.6535 | 0.5882 | 0.0652 |
| log_model_champion | Balance | (-250.898, 25089.809] | 808 | 220 | 0.6654 | 0.6146 | 0.0508 |
| log_model_champion | Balance | (100359.236, 125449.045] | 621 | 141 | 0.7410 | 0.6895 | 0.0515 |
| log_model_champion | Balance | (150538.854, 175628.663] | 186 | 39 | 0.5683 | 0.5111 | 0.0572 |
| log_model_champion | Balance | (200718.472, 225808.281] | 12 | 5 | 0.1818 | 0.0000 | 0.1818 |
| log_model_champion | NumOfProducts | (0.997, 1.3] | 1510 | 374 | 0.6731 | 0.6325 | 0.0406 |
| log_model_champion | EstimatedSalary | (-188.318, 20001.354] | 242 | 64 | 0.7244 | 0.6768 | 0.0476 |
| log_model_champion | EstimatedSalary | (59980.902, 79970.676] | 270 | 68 | 0.6921 | 0.6497 | 0.0425 |
| log_model_champion | EstimatedSalary | (79970.676, 99960.45] | 267 | 63 | 0.7156 | 0.5731 | 0.1424 |
| log_model_champion | EstimatedSalary | (139939.998, 159929.772] | 237 | 54 | 0.7485 | 0.6045 | 0.1440 |
| log_model_champion | EstimatedSalary | (179919.546, 199909.32] | 264 | 47 | 0.7062 | 0.6015 | 0.1047 |
| log_model_champion | Gender_Male | (0.9, 1.0] | 1300 | 316 | 0.6858 | 0.6378 | 0.0480 |
| rf_model | CreditScore | (400.0, 450.0] | 47 | 10 | 1.0000 | 0.6600 | 0.3400 |
| rf_model | CreditScore | (450.0, 500.0] | 121 | 29 | 1.0000 | 0.7500 | 0.2500 |
| rf_model | CreditScore | (500.0, 550.0] | 269 | 64 | 1.0000 | 0.7483 | 0.2517 |
| rf_model | CreditScore | (550.0, 600.0] | 371 | 97 | 1.0000 | 0.7806 | 0.2194 |
| rf_model | CreditScore | (600.0, 650.0] | 494 | 107 | 1.0000 | 0.7841 | 0.2159 |
| rf_model | CreditScore | (650.0, 700.0] | 470 | 138 | 1.0000 | 0.8086 | 0.1914 |
| rf_model | CreditScore | (700.0, 750.0] | 410 | 94 | 1.0000 | 0.7559 | 0.2441 |
| rf_model | CreditScore | (750.0, 800.0] | 251 | 63 | 1.0000 | 0.8135 | 0.1865 |
| rf_model | CreditScore | (800.0, 850.0] | 141 | 42 | 1.0000 | 0.9032 | 0.0968 |
| rf_model | Tenure | (-0.01, 1.0] | 367 | 98 | 1.0000 | 0.7178 | 0.2822 |
| rf_model | Tenure | (1.0, 2.0] | 261 | 59 | 1.0000 | 0.8387 | 0.1613 |
| rf_model | Tenure | (2.0, 3.0] | 279 | 68 | 1.0000 | 0.7641 | 0.2359 |
| rf_model | Tenure | (3.0, 4.0] | 268 | 66 | 1.0000 | 0.8557 | 0.1443 |
| rf_model | Tenure | (4.0, 5.0] | 277 | 65 | 1.0000 | 0.8014 | 0.1986 |
| rf_model | Tenure | (5.0, 6.0] | 239 | 68 | 1.0000 | 0.8240 | 0.1760 |
| rf_model | Tenure | (6.0, 7.0] | 252 | 53 | 1.0000 | 0.7337 | 0.2663 |
| rf_model | Tenure | (7.0, 8.0] | 258 | 68 | 1.0000 | 0.8156 | 0.1844 |
| rf_model | Tenure | (8.0, 9.0] | 248 | 71 | 1.0000 | 0.8081 | 0.1919 |
| rf_model | Tenure | (9.0, 10.0] | 136 | 31 | 1.0000 | 0.8277 | 0.1723 |
| rf_model | Balance | (-250.898, 25089.809] | 808 | 220 | 1.0000 | 0.8158 | 0.1842 |
| rf_model | Balance | (25089.809, 50179.618] | 18 | 7 | 1.0000 | 0.7500 | 0.2500 |
| rf_model | Balance | (50179.618, 75269.427] | 115 | 15 | 1.0000 | 0.8889 | 0.1111 |
| rf_model | Balance | (75269.427, 100359.236] | 275 | 80 | 1.0000 | 0.7390 | 0.2610 |
| rf_model | Balance | (100359.236, 125449.045] | 621 | 141 | 1.0000 | 0.7723 | 0.2277 |
| rf_model | Balance | (125449.045, 150538.854] | 502 | 127 | 1.0000 | 0.7374 | 0.2626 |
| rf_model | Balance | (150538.854, 175628.663] | 186 | 39 | 1.0000 | 0.7593 | 0.2407 |
| rf_model | Balance | (175628.663, 200718.472] | 46 | 13 | 1.0000 | 0.7250 | 0.2750 |
| rf_model | Balance | (200718.472, 225808.281] | 12 | 5 | 1.0000 | 0.0000 | 1.0000 |
| rf_model | NumOfProducts | (0.997, 1.3] | 1510 | 374 | 1.0000 | 0.7062 | 0.2938 |
| rf_model | NumOfProducts | (1.9, 2.2] | 891 | 223 | 1.0000 | 0.6449 | 0.3551 |
| rf_model | NumOfProducts | (2.8, 3.1] | 151 | 39 | 1.0000 | 0.7315 | 0.2685 |
| rf_model | HasCrCard | (-0.001, 0.1] | 777 | 196 | 1.0000 | 0.7772 | 0.2228 |
| rf_model | HasCrCard | (0.9, 1.0] | 1808 | 451 | 1.0000 | 0.7987 | 0.2013 |
| rf_model | IsActiveMember | (-0.001, 0.1] | 1378 | 346 | 1.0000 | 0.7884 | 0.2116 |
| rf_model | IsActiveMember | (0.9, 1.0] | 1207 | 301 | 1.0000 | 0.7869 | 0.2131 |
| rf_model | EstimatedSalary | (-188.318, 20001.354] | 242 | 64 | 1.0000 | 0.7867 | 0.2133 |
| rf_model | EstimatedSalary | (20001.354, 39991.128] | 265 | 75 | 1.0000 | 0.8682 | 0.1318 |
| rf_model | EstimatedSalary | (39991.128, 59980.902] | 237 | 61 | 1.0000 | 0.7151 | 0.2849 |
| rf_model | EstimatedSalary | (59980.902, 79970.676] | 270 | 68 | 1.0000 | 0.7881 | 0.2119 |
| rf_model | EstimatedSalary | (79970.676, 99960.45] | 267 | 63 | 1.0000 | 0.8684 | 0.1316 |
| rf_model | EstimatedSalary | (99960.45, 119950.224] | 262 | 77 | 1.0000 | 0.7360 | 0.2640 |
| rf_model | EstimatedSalary | (119950.224, 139939.998] | 254 | 72 | 1.0000 | 0.8089 | 0.1911 |
| rf_model | EstimatedSalary | (139939.998, 159929.772] | 237 | 54 | 1.0000 | 0.7560 | 0.2440 |
| rf_model | EstimatedSalary | (159929.772, 179919.546] | 287 | 66 | 1.0000 | 0.7912 | 0.2088 |
| rf_model | EstimatedSalary | (179919.546, 199909.32] | 264 | 47 | 1.0000 | 0.8458 | 0.1542 |
| rf_model | Geography_Germany | (-0.001, 0.1] | 1807 | 454 | 1.0000 | 0.7780 | 0.2220 |
| rf_model | Geography_Germany | (0.9, 1.0] | 778 | 193 | 1.0000 | 0.7393 | 0.2607 |
| rf_model | Geography_Spain | (-0.001, 0.1] | 1949 | 496 | 1.0000 | 0.7907 | 0.2093 |
| rf_model | Geography_Spain | (0.9, 1.0] | 636 | 151 | 1.0000 | 0.7746 | 0.2254 |
| rf_model | Gender_Male | (-0.001, 0.1] | 1285 | 331 | 1.0000 | 0.8017 | 0.1983 |
| rf_model | Gender_Male | (0.9, 1.0] | 1300 | 316 | 1.0000 | 0.7722 | 0.2278 |

Figures

ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:2d87
ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:1a1a
ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:c5ee
ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:0444
ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:88d2
ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:4e80
ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:023e
ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:6062
ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:6aff
ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:81d3
ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:b18c
ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:9341
ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:38ba
ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:9d18
ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:638c
ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:c855
ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:2ce3
ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:c81f
ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:2d05
ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:d6be
2026-01-28 18:07:55,386 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger does not exist in model's document
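
If you want to dig into the flagged segments yourself, the segment-level results can be inspected with pandas. A minimal sketch, assuming you load rows like the ones in the table above into a DataFrame (the two example rows are copied from that table, and the 0.04 cutoff is purely illustrative):

import pandas as pd

# Hypothetical: a couple of segment-level rows from the table above,
# loaded into a DataFrame for ad-hoc inspection
overfit_df = pd.DataFrame(
    [
        {"model": "log_model_champion", "Feature": "CreditScore", "Slice": "(400.0, 450.0]", "Gap": 0.1524},
        {"model": "rf_model", "Feature": "NumOfProducts", "Slice": "(1.9, 2.2]", "Gap": 0.3551},
    ]
)

# Flag segments whose train/test AUC gap exceeds an illustrative cutoff,
# worst offenders first
cutoff = 0.04
worst_segments = overfit_df[overfit_df["Gap"] > cutoff].sort_values("Gap", ascending=False)
print(worst_segments)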

Let's also conduct robustness and stability testing of the two models with the RobustnessDiagnosis test. Robustness refers to a model's ability to maintain consistent performance when its inputs are perturbed, and stability refers to a model's ability to produce consistent outputs over time and across different data subsets.

Again, we'll use both the training and testing datasets to establish baseline performance and to simulate real-world generalization:

# Run and log the robustness diagnosis for both models across the train and test datasets
vm.tests.run_test(
    test_id="validmind.model_validation.sklearn.RobustnessDiagnosis:Champion_vs_LogRegression",
    input_grid={
        "datasets": [[vm_train_ds, vm_test_ds]],
        "model": [vm_log_model, vm_rf_model],
    },
).log()

❌ Robustness Diagnosis Champion Vs Log Regression

The Robustness Diagnosis test evaluates the resilience of machine learning models to input perturbations by measuring AUC decay under increasing levels of Gaussian noise. Results are presented for both the logistic regression champion model and a random forest model, with AUC and performance decay tracked across multiple perturbation scales for both train and test datasets. The visualizations and tabular data illustrate the comparative robustness of each model, highlighting the magnitude and pattern of performance degradation as noise increases.

Key insights:

  • Logistic regression model exhibits minimal AUC decay: Across all perturbation sizes (up to 0.5 standard deviations), the log_model_champion maintains stable AUC values on both train (0.684 to 0.6696) and test (0.6726 to 0.6691) datasets, with performance decay remaining below 0.016.
  • Random forest model shows pronounced sensitivity to noise: The rf_model experiences substantial AUC decline on the train set (from 1.0 to 0.7968) and a notable decrease on the test set (from 0.7926 to 0.7156) as perturbation size increases, with performance decay reaching 0.2032 (train) and 0.077 (test) at the highest noise level.
  • Threshold failures observed in random forest model: The rf_model fails the robustness threshold on the train set at perturbation sizes of 0.2 and above, and on the test set at the highest perturbation (0.5), as indicated by the "Passed: false" status.
  • Logistic regression model consistently passes robustness criteria: The log_model_champion passes the robustness threshold at all tested perturbation levels for both train and test datasets.

The results demonstrate that the logistic regression champion model maintains stable predictive performance under increasing Gaussian noise, with negligible AUC decay and consistent threshold compliance. In contrast, the random forest model exhibits marked performance degradation, particularly on the train set, and fails robustness criteria at moderate to high noise levels. These findings indicate a higher degree of robustness to input perturbations in the logistic regression model relative to the random forest model under the tested conditions.

Tables

| model | Perturbation Size | Dataset | Row Count | AUC | Performance Decay | Passed |
|---|---|---|---|---|---|---|
| log_model_champion | Baseline (0.0) | train_dataset_final | 2585 | 0.6840 | 0.0000 | True |
| log_model_champion | Baseline (0.0) | test_dataset_final | 647 | 0.6726 | 0.0000 | True |
| log_model_champion | 0.1 | train_dataset_final | 2585 | 0.6843 | -0.0003 | True |
| log_model_champion | 0.1 | test_dataset_final | 647 | 0.6710 | 0.0015 | True |
| log_model_champion | 0.2 | train_dataset_final | 2585 | 0.6794 | 0.0046 | True |
| log_model_champion | 0.2 | test_dataset_final | 647 | 0.6654 | 0.0072 | True |
| log_model_champion | 0.3 | train_dataset_final | 2585 | 0.6770 | 0.0070 | True |
| log_model_champion | 0.3 | test_dataset_final | 647 | 0.6663 | 0.0062 | True |
| log_model_champion | 0.4 | train_dataset_final | 2585 | 0.6757 | 0.0083 | True |
| log_model_champion | 0.4 | test_dataset_final | 647 | 0.6566 | 0.0159 | True |
| log_model_champion | 0.5 | train_dataset_final | 2585 | 0.6696 | 0.0144 | True |
| log_model_champion | 0.5 | test_dataset_final | 647 | 0.6691 | 0.0034 | True |
| rf_model | Baseline (0.0) | train_dataset_final | 2585 | 1.0000 | 0.0000 | True |
| rf_model | Baseline (0.0) | test_dataset_final | 647 | 0.7926 | 0.0000 | True |
| rf_model | 0.1 | train_dataset_final | 2585 | 0.9829 | 0.0171 | True |
| rf_model | 0.1 | test_dataset_final | 647 | 0.7895 | 0.0030 | True |
| rf_model | 0.2 | train_dataset_final | 2585 | 0.9400 | 0.0600 | False |
| rf_model | 0.2 | test_dataset_final | 647 | 0.7816 | 0.0110 | True |
| rf_model | 0.3 | train_dataset_final | 2585 | 0.8969 | 0.1031 | False |
| rf_model | 0.3 | test_dataset_final | 647 | 0.7743 | 0.0183 | True |
| rf_model | 0.4 | train_dataset_final | 2585 | 0.8445 | 0.1555 | False |
| rf_model | 0.4 | test_dataset_final | 647 | 0.7550 | 0.0376 | True |
| rf_model | 0.5 | train_dataset_final | 2585 | 0.7968 | 0.2032 | False |
| rf_model | 0.5 | test_dataset_final | 647 | 0.7156 | 0.0770 | False |

Figures

ValidMind Figure validmind.model_validation.sklearn.RobustnessDiagnosis:Champion_vs_LogRegression:42d8
ValidMind Figure validmind.model_validation.sklearn.RobustnessDiagnosis:Champion_vs_LogRegression:573c
2026-01-28 18:08:17,838 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.model_validation.sklearn.RobustnessDiagnosis:Champion_vs_LogRegression does not exist in model's document
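
The ValidMind test handles the perturbation and scoring for you; purely as an illustration of the underlying idea, here is a small self-contained sketch (on synthetic data, not the churn dataset) that adds Gaussian noise of increasing scale to the features and tracks how AUC decays from the unperturbed baseline:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the churn features, purely for illustration
X, y = make_classification(n_samples=2000, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = RandomForestClassifier(random_state=42).fit(X_train, y_train)
baseline_auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])

rng = np.random.default_rng(42)
feature_std = X_test.std(axis=0)

# Perturb the test features with Gaussian noise scaled to each feature's std deviation
for scale in [0.1, 0.2, 0.3, 0.4, 0.5]:
    noisy = X_test + rng.normal(0.0, scale * feature_std, size=X_test.shape)
    auc = roc_auc_score(y_test, model.predict_proba(noisy)[:, 1])
    print(f"scale={scale:.1f}  AUC={auc:.4f}  decay={baseline_auc - auc:.4f}")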

Run feature importance tests

We also want to verify the relative influence of different input features on our models' predictions, and to compare the champion and challenger models to see whether one offers more interpretable or intuitive importance scores for its features.

Use list_tests() to identify all the feature importance tests for classification:

# Store the feature importance tests
FI = vm.tests.list_tests(tags=["feature_importance"], task="classification", pretty=False)
FI
['validmind.model_validation.FeaturesAUC',
 'validmind.model_validation.sklearn.PermutationFeatureImportance',
 'validmind.model_validation.sklearn.SHAPGlobalImportance']

We'll only use our testing dataset (vm_test_ds) here to provide a realistic, unseen sample that mimics future or production data, since the training dataset has already influenced our models during learning:

# Run and log our feature importance tests for both models on the testing dataset
for test in FI:
    vm.tests.run_test(
        f"{test}:champion_vs_challenger",
        input_grid={
            "dataset": [vm_test_ds],
            "model": [vm_log_model, vm_rf_model],
        },
    ).log()

Features Champion Vs Challenger

The FeaturesAUC:champion_vs_challenger test evaluates the discriminatory power of each individual feature in a binary classification context by calculating the Area Under the Curve (AUC) for each feature separately. The results are presented as horizontal bar plots, with each bar representing the AUC score for a specific feature, allowing for direct comparison of univariate classification strength across all features. The AUC values range from approximately 0.41 to 0.63, with features ordered from highest to lowest AUC, providing a clear view of which features are most and least effective at distinguishing between the two classes on their own.

Key insights:

  • Geography_Germany exhibits highest univariate discrimination: Geography_Germany achieves the highest AUC score, exceeding 0.6, indicating the strongest individual ability to separate the two classes among all features evaluated.
  • Balance and CreditScore show moderate discriminatory power: Both Balance and CreditScore register AUC values above 0.5, reflecting moderate univariate classification strength.
  • Several features display limited univariate separation: Features such as NumOfProducts, IsActiveMember, and Gender_Male have AUC scores near or below 0.45, indicating limited ability to distinguish between classes when considered independently.
  • Consistent feature ranking across test runs: The ordering and relative magnitudes of AUC scores are consistent across both test result plots, supporting the stability of the univariate feature evaluation.

The results indicate that Geography_Germany, Balance, and CreditScore are the most individually informative features for binary class separation in the evaluated dataset, while other features contribute less discriminatory power on their own. The observed AUC distribution highlights a clear differentiation in univariate predictive strength across the feature set, with consistent rankings reinforcing the reliability of these findings.

Figures

ValidMind Figure validmind.model_validation.FeaturesAUC:champion_vs_challenger:a620
ValidMind Figure validmind.model_validation.FeaturesAUC:champion_vs_challenger:5982
2026-01-28 18:08:35,656 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.model_validation.FeaturesAUC:champion_vs_challenger does not exist in model's document
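
The same univariate idea can be reproduced directly with scikit-learn by scoring each feature on its own against the target with roc_auc_score. A minimal sketch, assuming x_test is a pandas DataFrame of the model's input features and y_test holds the corresponding Exited labels (both variable names are assumptions, not ones defined by this notebook):

import pandas as pd
from sklearn.metrics import roc_auc_score

def univariate_auc(X: pd.DataFrame, y: pd.Series) -> pd.Series:
    # AUC of each feature used alone as a score for the positive class;
    # values near 0.5 indicate little univariate separation
    return pd.Series(
        {col: roc_auc_score(y, X[col]) for col in X.columns}
    ).sort_values(ascending=False)

# Example usage (uncomment once x_test / y_test are defined):
# univariate_auc(x_test, y_test)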

Permutation Feature Importance Champion Vs Challenger

The Permutation Feature Importance (PFI) test evaluates the relative importance of each input feature by measuring the decrease in model performance when the feature's values are randomly permuted. The results are presented as bar plots for both the logistic regression (log_model_champion) and random forest (rf_model) models, with each bar representing the importance score for a given feature. The plots allow for direct comparison of feature influence across the two model types.

Key insights:

  • Divergent top features across models: The logistic regression model assigns highest importance to Geography_Germany, IsActiveMember, and Gender_Male, while the random forest model ranks NumOfProducts and Balance as most influential.
  • Geography_Germany consistently important: Geography_Germany is among the top three features for both models, indicating a stable influence on predictions across model architectures.
  • NumOfProducts critical for random forest: NumOfProducts is the most important feature in the random forest model, but is of moderate importance in the logistic regression model.
  • Low importance for EstimatedSalary and Geography_Spain: Both models assign minimal importance to EstimatedSalary and Geography_Spain, suggesting limited predictive contribution from these features.
  • Variation in IsActiveMember impact: IsActiveMember is highly important in the logistic regression model but less so in the random forest model, highlighting model-specific feature utilization.

The PFI results reveal distinct patterns of feature reliance between the logistic regression and random forest models. While some features such as Geography_Germany demonstrate consistent importance, others like NumOfProducts and IsActiveMember show model-dependent influence. Several features, including EstimatedSalary and Geography_Spain, contribute minimally to predictive performance in both models. These findings provide a clear view of how each model leverages available features, supporting further analysis of model behavior and risk.

Figures

ValidMind Figure validmind.model_validation.sklearn.PermutationFeatureImportance:champion_vs_challenger:f739
ValidMind Figure validmind.model_validation.sklearn.PermutationFeatureImportance:champion_vs_challenger:c17d
2026-01-28 18:08:55,506 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.model_validation.sklearn.PermutationFeatureImportance:champion_vs_challenger does not exist in model's document
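
For comparison, scikit-learn ships its own permutation importance utility, which mirrors what this test measures. A minimal sketch, assuming log_model is the fitted champion estimator and x_test / y_test hold the held-out features and labels (all three names are assumptions, and the scoring choice is illustrative):

from sklearn.inspection import permutation_importance

# Permute each feature several times and measure the mean drop in ROC AUC;
# larger drops indicate features the model relies on more heavily
result = permutation_importance(
    log_model,   # fitted champion estimator (assumed variable name)
    x_test,
    y_test,
    scoring="roc_auc",
    n_repeats=10,
    random_state=42,
)

for name, drop in sorted(
    zip(x_test.columns, result.importances_mean), key=lambda pair: -pair[1]
):
    print(f"{name:<20s} {drop:.4f}")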

SHAP Global Importance Champion Vs Challenger

The SHAP Global Importance test evaluates and visualizes the global feature importance for both the champion (logistic regression) and challenger (random forest) models using SHAP values. The results include mean importance plots and summary plots, which display the normalized SHAP values for each feature and illustrate the distribution and impact of feature values on model output. These visualizations provide a comparative perspective on how each model attributes importance to input features.

Key insights:

  • Champion model dominated by few features: The logistic regression model assigns the highest normalized SHAP importance to IsActiveMember, Geography_Germany, and Gender_Male, with IsActiveMember reaching 100% normalized importance and the next two features also showing high values. Remaining features contribute substantially less to the model's output.
  • Challenger model focuses on fewer variables: The random forest model's SHAP plots indicate that only Tenure and CreditScore are assigned notable importance, with other features not represented in the summary plots.
  • Distinct feature attribution patterns: The champion model distributes importance across a broader set of features, while the challenger model concentrates importance on a narrow subset.
  • SHAP value distributions are compact: Both models exhibit relatively tight SHAP value distributions for their most important features, with no evidence of extreme outliers or high variability in the summary plots.

The SHAP Global Importance analysis reveals that the champion model relies on a wider range of features, with a strong emphasis on IsActiveMember, Geography_Germany, and Gender_Male, while the challenger model attributes nearly all importance to Tenure and CreditScore. Both models display compact SHAP value distributions for their key features, indicating stable feature contributions without pronounced outlier effects. The observed differences in feature attribution highlight distinct model reasoning and may inform further model selection or refinement.

Figures

ValidMind Figure validmind.model_validation.sklearn.SHAPGlobalImportance:champion_vs_challenger:f5e5
ValidMind Figure validmind.model_validation.sklearn.SHAPGlobalImportance:champion_vs_challenger:d61a
ValidMind Figure validmind.model_validation.sklearn.SHAPGlobalImportance:champion_vs_challenger:762a
ValidMind Figure validmind.model_validation.sklearn.SHAPGlobalImportance:champion_vs_challenger:34e9
2026-01-28 18:09:15,823 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.model_validation.sklearn.SHAPGlobalImportance:champion_vs_challenger does not exist in model's document
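
If you want to reproduce the random forest's SHAP view outside of ValidMind, the shap package's TreeExplainer handles tree ensembles directly. A minimal sketch, assuming shap is installed, rf_classifier is the fitted RandomForestClassifier, and x_test is a DataFrame of its input features (both variable names are assumptions; the shape of the returned SHAP values varies across shap versions):

import numpy as np
import shap

# Exact SHAP values for tree ensembles; binary classifiers may return a
# per-class list (older shap) or a 3-D array (newer shap)
explainer = shap.TreeExplainer(rf_classifier)
shap_values = explainer.shap_values(x_test)

if isinstance(shap_values, list):        # older shap: one array per class
    values = shap_values[1]
elif np.ndim(shap_values) == 3:          # newer shap: (samples, features, classes)
    values = shap_values[:, :, 1]
else:
    values = shap_values

# Beeswarm-style summary of positive-class attributions
shap.summary_plot(values, x_test)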

In summary

In this third notebook, you learned how to:

Next steps

Finalize validation and reporting

Now that you're familiar with the basics of using the ValidMind Library to run and log validation tests, let's learn how to implement some custom tests and wrap up our validation: 4 — Finalize validation and reporting


Copyright © 2023-2026 ValidMind Inc. All rights reserved.
Refer to LICENSE for details.
SPDX-License-Identifier: AGPL-3.0 AND ValidMind Commercial