ValidMind for model validation 4 — Finalize testing and reporting
Learn how to use ValidMind for your end-to-end model validation process with our series of four introductory notebooks. In this last notebook, finalize the compliance assessment process and have a complete validation report ready for review.
This notebook will walk you through how to supplement ValidMind tests with your own custom tests and include them as additional evidence in your validation report. A custom test is any function that takes a set of inputs and parameters as arguments and returns one or more outputs:
The function can be as simple or as complex as you need it to be — it can use external libraries, make API calls, or do anything else that you can do in Python.
The only requirement is that the function signature and return values can be "understood" and handled by the ValidMind Library. As such, custom tests offer added flexibility by extending the default tests provided by ValidMind, enabling you to document any type of model or use case.
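As a quick illustration of that shape, here is a minimal, hypothetical sketch of a custom test; the test ID, function name, and parameter are illustrative only, and a full working example is built later in this notebook:

import pandas as pd
import validmind as vm


@vm.test("my_custom_tests.PositiveClassRate")  # hypothetical test ID, for illustration only
def positive_class_rate(dataset, positive_label=1):
    """Report how often the positive class appears in the dataset's target column."""
    # `dataset.y` exposes the true target values of a ValidMind dataset object
    rate = float((dataset.y == positive_label).mean())

    # Returning a DataFrame renders the result as a table alongside the test
    return pd.DataFrame({"Positive label": [positive_label], "Rate": [rate]})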
For a more in-depth introduction to custom tests, refer to our Implement custom tests notebook.
Learn by doing
Our course tailor-made for validators new to ValidMind combines this series of notebooks with a more in-depth introduction to the ValidMind Platform — Validator Fundamentals.
Prerequisites
In order to finalize validation and reporting, you'll need to first have:
Need help with the above steps?
Refer to the first three notebooks in this series:
# Make sure the ValidMind Library is installed
%pip install -q validmind

# Load your model identifier credentials from an `.env` file
%load_ext dotenv
%dotenv .env

# Or replace with your code snippet
import validmind as vm

vm.init(
    # api_host="...",
    # api_key="...",
    # api_secret="...",
    # model="...",
)
Note: you may need to restart the kernel to use updated packages.
2026-01-28 17:58:32,957 - INFO(validmind.api_client): 🎉 Connected to ValidMind!
📊 Model: [ValidMind Academy] Model validation (ID: cmalguc9y02ok199q2db381ib)
📁 Document Type: validation_report
Import the sample dataset
Next, we'll load the same sample Bank Customer Churn Prediction dataset that was used to develop the champion model, which we will independently preprocess:
# Load the sample dataset
from validmind.datasets.classification import customer_churn as demo_dataset

print(f"Loaded demo dataset with: \n\n\t• Target column: '{demo_dataset.target_column}' \n\t• Class labels: {demo_dataset.class_labels}")

raw_df = demo_dataset.load_data()
Loaded demo dataset with:
• Target column: 'Exited'
• Class labels: {'0': 'Did not exit', '1': 'Exited'}
# Initialize the raw dataset for use in ValidMind tests
vm_raw_dataset = vm.init_dataset(
    dataset=raw_df,
    input_id="raw_dataset",
    target_column="Exited",
)
import pandas as pd

raw_copy_df = raw_df.sample(frac=1)  # Create a copy of the raw dataset

# Create a balanced dataset with the same number of exited and not exited customers
exited_df = raw_copy_df.loc[raw_copy_df["Exited"] == 1]
not_exited_df = raw_copy_df.loc[raw_copy_df["Exited"] == 0].sample(n=exited_df.shape[0])

balanced_raw_df = pd.concat([exited_df, not_exited_df])
balanced_raw_df = balanced_raw_df.sample(frac=1, random_state=42)
Let’s also quickly remove highly correlated features from the dataset using the output from a ValidMind test:
# Register new data and now 'balanced_raw_dataset' is the new dataset object of interest
vm_balanced_raw_dataset = vm.init_dataset(
    dataset=balanced_raw_df,
    input_id="balanced_raw_dataset",
    target_column="Exited",
)
# Run HighPearsonCorrelation test with our balanced dataset as input and return a result object
corr_result = vm.tests.run_test(
    test_id="validmind.data_validation.HighPearsonCorrelation",
    params={"max_threshold": 0.3},
    inputs={"dataset": vm_balanced_raw_dataset},
)
❌ High Pearson Correlation
The High Pearson Correlation test identifies pairs of features in the dataset that exhibit strong linear relationships, with the aim of detecting potential feature redundancy or multicollinearity. The results table presents the top ten feature pairs ranked by the absolute value of their Pearson correlation coefficients, along with a Pass or Fail status based on a threshold of 0.3. Only one feature pair exceeds the threshold, while the remaining pairs display lower correlation values and pass the test.
Key insights:
One feature pair exceeds correlation threshold: The pair (Age, Exited) has a correlation coefficient of 0.3467, surpassing the 0.3 threshold and resulting in a Fail status.
All other feature pairs below threshold: The remaining nine feature pairs have absolute correlation coefficients ranging from 0.207 to 0.0367, all below the 0.3 threshold and marked as Pass.
Predominantly weak linear relationships: Most feature pairs exhibit weak linear associations, with coefficients clustered near zero.
The test results indicate that the dataset contains minimal evidence of strong linear relationships among most feature pairs, with only the (Age, Exited) pair exceeding the specified correlation threshold. The overall correlation structure suggests a low risk of widespread multicollinearity or feature redundancy, as the majority of feature pairs demonstrate weak linear dependencies.
Parameters:
{
"max_threshold": 0.3
}
Tables
| Columns | Coefficient | Pass/Fail |
|---|---|---|
| (Age, Exited) | 0.3467 | Fail |
| (IsActiveMember, Exited) | -0.2070 | Pass |
| (Balance, NumOfProducts) | -0.1793 | Pass |
| (Balance, Exited) | 0.1542 | Pass |
| (NumOfProducts, Exited) | -0.0577 | Pass |
| (Age, Balance) | 0.0550 | Pass |
| (Tenure, EstimatedSalary) | 0.0531 | Pass |
| (NumOfProducts, IsActiveMember) | 0.0485 | Pass |
| (Age, NumOfProducts) | -0.0394 | Pass |
| (Tenure, Balance) | -0.0367 | Pass |
# From result object, extract table from `corr_result.tables`
features_df = corr_result.tables[0].data
features_df
|   | Columns | Coefficient | Pass/Fail |
|---|---|---|---|
| 0 | (Age, Exited) | 0.3467 | Fail |
| 1 | (IsActiveMember, Exited) | -0.2070 | Pass |
| 2 | (Balance, NumOfProducts) | -0.1793 | Pass |
| 3 | (Balance, Exited) | 0.1542 | Pass |
| 4 | (NumOfProducts, Exited) | -0.0577 | Pass |
| 5 | (Age, Balance) | 0.0550 | Pass |
| 6 | (Tenure, EstimatedSalary) | 0.0531 | Pass |
| 7 | (NumOfProducts, IsActiveMember) | 0.0485 | Pass |
| 8 | (Age, NumOfProducts) | -0.0394 | Pass |
| 9 | (Tenure, Balance) | -0.0367 | Pass |
# Extract list of features that failed the test
high_correlation_features = features_df[features_df["Pass/Fail"] == "Fail"]["Columns"].tolist()
high_correlation_features
['(Age, Exited)']
# Extract feature names from the list of strings
high_correlation_features = [feature.split(",")[0].strip("()") for feature in high_correlation_features]
high_correlation_features
['Age']
# Remove the highly correlated features from the dataset
balanced_raw_no_age_df = balanced_raw_df.drop(columns=high_correlation_features)

# Re-initialize the dataset object
vm_raw_dataset_preprocessed = vm.init_dataset(
    dataset=balanced_raw_no_age_df,
    input_id="raw_dataset_preprocessed",
    target_column="Exited",
)
# Re-run the test with the reduced feature set
corr_result = vm.tests.run_test(
    test_id="validmind.data_validation.HighPearsonCorrelation",
    params={"max_threshold": 0.3},
    inputs={"dataset": vm_raw_dataset_preprocessed},
)
✅ High Pearson Correlation
The High Pearson Correlation test evaluates the linear relationships between feature pairs to identify potential redundancy or multicollinearity. The results table presents the top ten absolute Pearson correlation coefficients among feature pairs, each accompanied by a Pass/Fail status based on a threshold of 0.3. All reported coefficients are below the threshold, and all feature pairs received a Pass status.
Key insights:
No high correlations detected: All absolute Pearson correlation coefficients are below the 0.3 threshold, with the highest magnitude observed at 0.207 between IsActiveMember and Exited.
Consistent Pass status across all pairs: Every feature pair in the top ten correlations received a Pass, indicating no evidence of strong linear relationships among the evaluated features.
Low to moderate relationships observed: The reported coefficients range from -0.207 to 0.1542, reflecting only weak to very weak linear associations between the examined feature pairs.
The test results indicate an absence of strong linear dependencies among the evaluated features, with all pairwise correlations falling well below the specified threshold. This suggests a low risk of feature redundancy or multicollinearity within the dataset based on linear relationships, supporting the interpretability and stability of subsequent modeling efforts.
Parameters:
{
"max_threshold": 0.3
}
Tables
| Columns | Coefficient | Pass/Fail |
|---|---|---|
| (IsActiveMember, Exited) | -0.2070 | Pass |
| (Balance, NumOfProducts) | -0.1793 | Pass |
| (Balance, Exited) | 0.1542 | Pass |
| (NumOfProducts, Exited) | -0.0577 | Pass |
| (Tenure, EstimatedSalary) | 0.0531 | Pass |
| (NumOfProducts, IsActiveMember) | 0.0485 | Pass |
| (Tenure, Balance) | -0.0367 | Pass |
| (Tenure, IsActiveMember) | -0.0360 | Pass |
| (Tenure, Exited) | -0.0265 | Pass |
| (HasCrCard, IsActiveMember) | -0.0260 | Pass |
Split the preprocessed dataset
With our raw dataset rebalanced and highly correlated features removed, let's now split our dataset into train and test sets in preparation for model evaluation testing:
# Encode categorical features in the dataset
balanced_raw_no_age_df = pd.get_dummies(
    balanced_raw_no_age_df, columns=["Geography", "Gender"], drop_first=True
)
balanced_raw_no_age_df.head()
|      | CreditScore | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited | Geography_Germany | Geography_Spain | Gender_Male |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 5613 | 558 | 7 | 121235.05 | 2 | 1 | 1 | 116253.10 | 0 | False | False | False |
| 4434 | 432 | 2 | 135559.80 | 2 | 1 | 1 | 71856.30 | 0 | True | False | True |
| 6484 | 769 | 6 | 117852.26 | 2 | 1 | 0 | 147668.64 | 0 | False | False | False |
| 5796 | 762 | 10 | 168920.75 | 1 | 1 | 0 | 31445.03 | 1 | False | True | False |
| 1668 | 460 | 7 | 0.00 | 2 | 1 | 0 | 156150.08 | 1 | False | False | False |
from sklearn.model_selection import train_test_split

# Split the dataset into train and test
train_df, test_df = train_test_split(balanced_raw_no_age_df, test_size=0.20)

X_train = train_df.drop("Exited", axis=1)
y_train = train_df["Exited"]
X_test = test_df.drop("Exited", axis=1)
y_test = test_df["Exited"]
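Before predictions can be assigned to them below, the split DataFrames also need to be registered as ValidMind dataset objects. A minimal sketch of that step, assuming the input IDs train_dataset_final and test_dataset_final that appear in the test result descriptions later in this notebook:

# Register the split datasets with ValidMind
# (the input_id values are assumptions inferred from the result descriptions below)
vm_train_ds = vm.init_dataset(
    dataset=train_df,
    input_id="train_dataset_final",
    target_column="Exited",
)
vm_test_ds = vm.init_dataset(
    dataset=test_df,
    input_id="test_dataset_final",
    target_column="Exited",
)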
With our raw dataset assessed and preprocessed, let's go ahead and import the champion model submitted by the model development team as a .pkl file: lr_model_champion.pkl
# Import the champion model
import pickle as pkl

with open("lr_model_champion.pkl", "rb") as f:
    log_reg = pkl.load(f)
/opt/hostedtoolcache/Python/3.11.14/x64/lib/python3.11/site-packages/sklearn/base.py:463: InconsistentVersionWarning:
Trying to unpickle estimator LogisticRegression from version 1.3.2 when using version 1.8.0. This might lead to breaking code or invalid results. Use at your own risk. For more info please refer to:
https://scikit-learn.org/stable/model_persistence.html#security-maintainability-limitations
Train potential challenger model
We'll also train our random forest classification challenger model to see how it compares:
# Import the Random Forest Classification model
from sklearn.ensemble import RandomForestClassifier

# Create the model instance with 50 decision trees
rf_model = RandomForestClassifier(
    n_estimators=50,
    random_state=42,
)

# Train the model
rf_model.fit(X_train, y_train)
In addition to the initialized datasets, you'll also need to initialize a ValidMind model object (vm_model) for each of our two models, so that they can be passed to other functions for analysis and tests on the data:
# Initialize the champion logistic regression model
vm_log_model = vm.init_model(
    log_reg,
    input_id="log_model_champion",
)

# Initialize the challenger random forest classification model
vm_rf_model = vm.init_model(
    rf_model,
    input_id="rf_model",
)
# Assign predictions to Champion — Logistic regression model
vm_train_ds.assign_predictions(model=vm_log_model)
vm_test_ds.assign_predictions(model=vm_log_model)

# Assign predictions to Challenger — Random forest classification model
vm_train_ds.assign_predictions(model=vm_rf_model)
vm_test_ds.assign_predictions(model=vm_rf_model)
2026-01-28 17:58:51,388 - INFO(validmind.vm_models.dataset.utils): Running predict_proba()... This may take a while
2026-01-28 17:58:51,390 - INFO(validmind.vm_models.dataset.utils): Done running predict_proba()
2026-01-28 17:58:51,390 - INFO(validmind.vm_models.dataset.utils): Running predict()... This may take a while
2026-01-28 17:58:51,392 - INFO(validmind.vm_models.dataset.utils): Done running predict()
2026-01-28 17:58:51,394 - INFO(validmind.vm_models.dataset.utils): Running predict_proba()... This may take a while
2026-01-28 17:58:51,395 - INFO(validmind.vm_models.dataset.utils): Done running predict_proba()
2026-01-28 17:58:51,396 - INFO(validmind.vm_models.dataset.utils): Running predict()... This may take a while
2026-01-28 17:58:51,397 - INFO(validmind.vm_models.dataset.utils): Done running predict()
2026-01-28 17:58:51,399 - INFO(validmind.vm_models.dataset.utils): Running predict_proba()... This may take a while
2026-01-28 17:58:51,419 - INFO(validmind.vm_models.dataset.utils): Done running predict_proba()
2026-01-28 17:58:51,420 - INFO(validmind.vm_models.dataset.utils): Running predict()... This may take a while
2026-01-28 17:58:51,442 - INFO(validmind.vm_models.dataset.utils): Done running predict()
2026-01-28 17:58:51,445 - INFO(validmind.vm_models.dataset.utils): Running predict_proba()... This may take a while
2026-01-28 17:58:51,457 - INFO(validmind.vm_models.dataset.utils): Done running predict_proba()
2026-01-28 17:58:51,458 - INFO(validmind.vm_models.dataset.utils): Running predict()... This may take a while
2026-01-28 17:58:51,469 - INFO(validmind.vm_models.dataset.utils): Done running predict()
Implementing custom tests
Thanks to the model documentation, we know that the model development team implemented a custom test to further evaluate the performance of the champion model.
In a typical model validation situation, you would load a saved custom test provided by the model development team. In the following section, we'll have you implement the same custom test and make it available for reuse, to familiarize you with the process.
Let's implement the same custom inline test that calculates the confusion matrix for a binary classification model that the model development team used in their performance evaluations.
An inline test refers to a test written and executed within the same environment as the code being tested — in this case, right in this Jupyter Notebook — without requiring a separate test file or framework.
You'll note that the custom test function is just a regular Python function that can include and require any Python library as you see fit.
Create a confusion matrix plot
Let's first create a confusion matrix plot using the confusion_matrix function from the sklearn.metrics module:
import matplotlib.pyplot as plt
from sklearn import metrics

# Get the predicted classes
y_pred = log_reg.predict(vm_test_ds.x)

confusion_matrix = metrics.confusion_matrix(y_test, y_pred)

cm_display = metrics.ConfusionMatrixDisplay(
    confusion_matrix=confusion_matrix, display_labels=[False, True]
)
cm_display.plot()
Next, create a @vm.test wrapper that will allow you to create a reusable test. Note the following changes in the code below:
The function confusion_matrix takes two arguments, dataset and model. These are a VMDataset and a VMModel object, respectively.
VMDataset objects allow you to access the dataset's true (target) values by accessing the .y attribute.
VMDataset objects allow you to access the predictions for a given model by accessing the .y_pred() method.
The function docstring provides a description of what the test does. This will be displayed along with the result in this notebook as well as in the ValidMind Platform.
The function body calculates the confusion matrix using the sklearn.metrics.confusion_matrix function as we just did above.
The function then returns the ConfusionMatrixDisplay.figure_ object — this is important as the ValidMind Library expects the output of the custom test to be a plot or a table.
The @vm.test decorator is doing the work of creating a wrapper around the function that will allow it to be run by the ValidMind Library. It also registers the test so it can be found by the ID my_custom_tests.ConfusionMatrix.
@vm.test("my_custom_tests.ConfusionMatrix")def confusion_matrix(dataset, model):"""The confusion matrix is a table that is often used to describe the performance of a classification model on a set of data for which the true values are known. The confusion matrix is a 2x2 table that contains 4 values: - True Positive (TP): the number of correct positive predictions - True Negative (TN): the number of correct negative predictions - False Positive (FP): the number of incorrect positive predictions - False Negative (FN): the number of incorrect negative predictions The confusion matrix can be used to assess the holistic performance of a classification model by showing the accuracy, precision, recall, and F1 score of the model on a single figure. """ y_true = dataset.y y_pred = dataset.y_pred(model=model) confusion_matrix = metrics.confusion_matrix(y_true, y_pred) cm_display = metrics.ConfusionMatrixDisplay( confusion_matrix=confusion_matrix, display_labels=[False, True] ) cm_display.plot() plt.close() # close the plot to avoid displaying itreturn cm_display.figure_ # return the figure object itself
You can now run the newly created custom test on both the training and test datasets for both models using the run_test() function:
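A hedged sketch of what that call could look like for the champion model; the result ID suffix :champion matches the log output below, and both datasets are passed via input_grid:

# Champion with train and test datasets (sketch; exact inputs are an assumption)
vm.tests.run_test(
    test_id="my_custom_tests.ConfusionMatrix:champion",
    input_grid={
        "dataset": [vm_train_ds, vm_test_ds],
        "model": [vm_log_model],
    },
).log()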
The Confusion Matrix test evaluates the classification performance of the model by comparing predicted and true labels for both the training and test datasets. The resulting matrices display the counts of true positives, true negatives, false positives, and false negatives, providing a comprehensive view of the model's prediction accuracy and error types. The first matrix corresponds to the training dataset, while the second matrix summarizes results for the test dataset.
Key insights:
Balanced true positive and true negative rates: Both training and test datasets show similar counts for true positives and true negatives (train: 794 TP, 835 TN; test: 215 TP, 215 TN), indicating consistent model performance across classes.
Moderate false positive and false negative rates: The number of false positives and false negatives is comparable within each dataset (train: 488 FP, 488 FN; test: 98 FP, 119 FN), suggesting balanced misclassification rates.
Consistent performance across datasets: The relative proportions of each confusion matrix cell are similar between training and test datasets, indicating stable generalization from training to test data.
The confusion matrix results demonstrate that the model maintains balanced classification performance across both training and test datasets, with similar rates of correct and incorrect predictions for each class. The observed symmetry in true and false prediction counts suggests that the model does not exhibit a strong bias toward either class and generalizes consistently between datasets.
Figures
2026-01-28 17:59:08,777 - INFO(validmind.vm_models.result.result): Test driven block with result_id my_custom_tests.ConfusionMatrix:champion does not exist in model's document
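The equivalent call for the challenger model would look much the same; again a sketch, matching the :challenger result ID in the log output that follows:

# Challenger with train and test datasets (sketch; exact inputs are an assumption)
vm.tests.run_test(
    test_id="my_custom_tests.ConfusionMatrix:challenger",
    input_grid={
        "dataset": [vm_train_ds, vm_test_ds],
        "model": [vm_rf_model],
    },
).log()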
The Confusion Matrix:challenger test evaluates the classification performance of the model by comparing predicted and true labels for both the training and test datasets. The confusion matrices display the counts of true positives, true negatives, false positives, and false negatives, providing a comprehensive view of the model's ability to correctly classify each class. The results are presented separately for the train and test datasets, allowing for assessment of both in-sample and out-of-sample performance.
Key insights:
Perfect classification on training data: The training dataset confusion matrix shows 1,303 true negatives and 1,282 true positives, with zero false positives and zero false negatives, indicating no misclassifications on the training set.
Reduced accuracy on test data: The test dataset confusion matrix shows 227 true negatives and 229 true positives, with 86 false positives and 105 false negatives, indicating the presence of both types of misclassification in out-of-sample predictions.
Balanced class distribution in test set: The test set contains a similar number of true positives (229) and true negatives (227), suggesting a balanced representation of both classes in the evaluation.
The confusion matrix results indicate that the model achieves perfect separation of classes on the training data, with no observed misclassifications. However, performance on the test data shows a reduction in accuracy, with both false positives and false negatives present. The balanced class distribution in the test set supports a reliable assessment of model generalization. The observed discrepancy between training and test performance highlights the importance of evaluating out-of-sample results to understand model behavior under realistic conditions.
Figures
2026-01-28 17:59:24,095 - INFO(validmind.vm_models.result.result): Test driven block with result_id my_custom_tests.ConfusionMatrix:challenger does not exist in model's document
Note the output returned indicating that a test-driven block doesn't currently exist in your model's documentation for some test IDs.
That's expected: when we run validation tests, the logged results need to be manually added to your report as part of your compliance assessment process within the ValidMind Platform.
Add parameters to custom tests
Custom tests can take parameters just like any other function. To demonstrate, let's modify the confusion_matrix function to take an additional parameter normalize that will allow you to normalize the confusion matrix:
@vm.test("my_custom_tests.ConfusionMatrix")def confusion_matrix(dataset, model, normalize=False):"""The confusion matrix is a table that is often used to describe the performance of a classification model on a set of data for which the true values are known. The confusion matrix is a 2x2 table that contains 4 values: - True Positive (TP): the number of correct positive predictions - True Negative (TN): the number of correct negative predictions - False Positive (FP): the number of incorrect positive predictions - False Negative (FN): the number of incorrect negative predictions The confusion matrix can be used to assess the holistic performance of a classification model by showing the accuracy, precision, recall, and F1 score of the model on a single figure. """ y_true = dataset.y y_pred = dataset.y_pred(model=model)if normalize: confusion_matrix = metrics.confusion_matrix(y_true, y_pred, normalize="all")else: confusion_matrix = metrics.confusion_matrix(y_true, y_pred) cm_display = metrics.ConfusionMatrixDisplay( confusion_matrix=confusion_matrix, display_labels=[False, True] ) cm_display.plot() plt.close() # close the plot to avoid displaying itreturn cm_display.figure_ # return the figure object itself
Pass parameters to custom tests
You can pass parameters to custom tests by providing a dictionary of parameters to the run_test() function.
The parameters will override any default parameters set in the custom test definition. Note that dataset and model are still passed as inputs.
Since these are VMDataset or VMModel inputs, they have a special meaning.
Re-running and logging the custom confusion matrix with normalize=True for both models and our testing dataset looks like this:
# Champion with test dataset and normalize=True
vm.tests.run_test(
    test_id="my_custom_tests.ConfusionMatrix:test_normalized_champion",
    input_grid={
        "dataset": [vm_test_ds],
        "model": [vm_log_model],
    },
    params={"normalize": True},
).log()
Confusion Matrix Test Normalized Champion
The Confusion Matrix test evaluates the classification performance of the log_model_champion on the test_dataset_final by displaying the normalized proportions of true positives, true negatives, false positives, and false negatives. The matrix presents the fraction of predictions in each category, with values normalized such that each cell represents the proportion of total predictions. The results show the distribution of correct and incorrect predictions for both positive and negative classes, providing a comprehensive view of model accuracy and error types.
Key insights:
Balanced correct classification rates: The model correctly classifies both negative (True Negative: 0.33) and positive (True Positive: 0.33) classes at equal rates, indicating symmetric performance across classes.
Moderate false negative and false positive rates: The false negative rate (0.18) and false positive rate (0.15) are of similar magnitude, suggesting that misclassification is distributed relatively evenly between the two error types.
No class dominates prediction errors: The normalized values indicate that neither class is disproportionately affected by misclassification, with all matrix entries within the range of 0.15 to 0.33.
The confusion matrix reveals that the model exhibits balanced performance across both classes, with correct and incorrect predictions distributed symmetrically. The rates of true positives and true negatives are equal, and the error rates for false positives and false negatives are comparable, indicating no significant bias toward either class in prediction outcomes.
Parameters:
{
"normalize": true
}
Figures
2026-01-28 17:59:35,649 - INFO(validmind.vm_models.result.result): Test driven block with result_id my_custom_tests.ConfusionMatrix:test_normalized_champion does not exist in model's document
# Challenger with test dataset and normalize=True
vm.tests.run_test(
    test_id="my_custom_tests.ConfusionMatrix:test_normalized_challenger",
    input_grid={
        "dataset": [vm_test_ds],
        "model": [vm_rf_model],
    },
    params={"normalize": True},
).log()
Confusion Matrix Test Normalized Challenger
The Confusion Matrix test evaluates the classification performance of the rf_model on the test_dataset_final by displaying the normalized proportions of true positives, true negatives, false positives, and false negatives. The matrix presents the fraction of predictions in each category, with values normalized such that the sum of all cells equals 1. The results provide a visual summary of the model’s ability to correctly and incorrectly classify both positive and negative cases.
Key insights:
Balanced correct classification rates: The model correctly classifies both negative (True Negative: 0.35) and positive (True Positive: 0.35) cases at equal rates, indicating symmetric performance across classes.
Moderate false negative and false positive rates: The proportion of false negatives (0.16) and false positives (0.13) are similar in magnitude, suggesting that misclassification is distributed relatively evenly between the two error types.
No class dominance observed: The normalized confusion matrix does not indicate a strong bias toward either class, as the correct and incorrect prediction rates are comparable for both positive and negative labels.
The confusion matrix reveals that the rf_model demonstrates balanced classification performance, with equal rates of correct predictions for both classes and similar proportions of false positives and false negatives. This indicates that the model does not favor one class over the other and maintains consistent error rates across both positive and negative predictions.
Parameters:
{
"normalize": true
}
Figures
2026-01-28 17:59:49,481 - INFO(validmind.vm_models.result.result): Test driven block with result_id my_custom_tests.ConfusionMatrix:test_normalized_challenger does not exist in model's document
Use external test providers
Sometimes you may want to reuse the same set of custom tests across multiple models and share them with others in your organization, much as the model development team would have done with you in the example workflow featured in this series of notebooks. In this case, you can create an external custom test provider that lets you load custom tests from a local folder or a Git repository.
In this section, you will learn how to declare a local filesystem test provider that allows loading tests from a local folder, following these high-level steps:
Create a folder of custom tests from existing inline tests (tests that exist in your active Jupyter Notebook)
Let's start by creating a new folder that will contain reusable custom tests from your existing inline tests.
The following code snippet will create a new my_tests directory in the current working directory if it doesn't exist:
tests_folder ="my_tests"import os# create tests folderos.makedirs(tests_folder, exist_ok=True)# remove existing testsfor f in os.listdir(tests_folder):# remove files and pycacheif f.endswith(".py") or f =="__pycache__": os.system(f"rm -rf {tests_folder}/{f}")
After running the command above, confirm that a new my_tests directory was created successfully. For example:
~/notebooks/tutorials/model_validation/my_tests/
Save an inline test
The @vm.test decorator we used in Implementing custom tests above to register one-off custom tests also includes a convenience method on the function object that allows you to simply call <func_name>.save() to save the test to a Python file at a specified path.
While save() will get you started by creating the file and saving the function code with the correct name, it won't automatically include any imports, or other functions or variables defined outside of the function, that are needed for the test to run. To solve this, pass in the optional imports argument to ensure the necessary imports are added to the file.
The confusion_matrix test requires the following additional imports:
import matplotlib.pyplot as plt
from sklearn import metrics
Let's pass these imports to the save() method to ensure they are included in the file with the following command:
confusion_matrix.save(
    # Save it to the custom tests folder we created
    tests_folder,
    imports=["import matplotlib.pyplot as plt", "from sklearn import metrics"],
)
2026-01-28 17:59:50,042 - INFO(validmind.tests.decorator): Saved to /home/runner/work/documentation/documentation/site/notebooks/EXECUTED/model_validation/my_tests/ConfusionMatrix.py!
Be sure to add any necessary imports to the top of the file.
2026-01-28 17:59:50,043 - INFO(validmind.tests.decorator): This metric can be run with the ID: <test_provider_namespace>.ConfusionMatrix
# Saved from __main__.confusion_matrix
# Original Test ID: my_custom_tests.ConfusionMatrix
# New Test ID: <test_provider_namespace>.ConfusionMatrix
Now that your my_tests folder has a sample custom test, let's initialize a test provider that will tell the ValidMind Library where to find your custom tests:
ValidMind offers out-of-the-box test providers for local tests (tests in a folder) or a GitHub provider for tests in a GitHub repository.
You can also create your own test provider by creating a class that has a load_test method that takes a test ID and returns the test function matching that ID.
For most use cases, using a LocalTestProvider that allows you to load custom tests from a designated directory should be sufficient.
The most important attribute for a test provider is its namespace. This is a string that will be used to prefix test IDs in model documentation. This allows you to have multiple test providers with tests that can even share the same ID, but are distinguished by their namespace.
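To make the load_test contract described above concrete, here is a minimal, hedged sketch of what a hand-rolled test provider could look like; the class name and folder layout are illustrative, and for most use cases the built-in LocalTestProvider below is all you need:

import importlib.util
from pathlib import Path


class MyFolderTestProvider:
    """Illustrative test provider that loads `<TestName>.py` files from a folder.

    Any object exposing a `load_test(test_id)` method that returns the matching
    test function can act as a test provider.
    """

    def __init__(self, root_folder):
        self.root_folder = Path(root_folder)

    def load_test(self, test_id):
        # `test_id` is the portion after the provider namespace, e.g. "ConfusionMatrix"
        # or "classification.ConfusionMatrix" for tests kept in subfolders
        path = self.root_folder / (test_id.replace(".", "/") + ".py")

        # Import the module from the file and, by this provider's own convention,
        # return the function that shares the file's name
        spec = importlib.util.spec_from_file_location(test_id, path)
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)

        return getattr(module, path.stem)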
Let's go ahead and load the custom tests from our my_tests directory:
from validmind.tests import LocalTestProvider

# initialize the test provider with the tests folder we created earlier
my_test_provider = LocalTestProvider(tests_folder)

vm.tests.register_test_provider(
    namespace="my_test_provider",
    test_provider=my_test_provider,
)
# `my_test_provider.load_test()` will be called for any test ID that starts with `my_test_provider`
# e.g. `my_test_provider.ConfusionMatrix` will look for a function named `ConfusionMatrix` in `my_tests/ConfusionMatrix.py` file
Run test provider tests
Now that we've set up the test provider, we can run any test that's located in the tests folder by using the run_test() method as with any other test:
For tests that reside in a test provider directory, the test ID will be the namespace specified when registering the provider, followed by the path to the test file relative to the tests folder.
For example, the Confusion Matrix test we created earlier will have the test ID my_test_provider.ConfusionMatrix. You could organize the tests in subfolders, say classification and regression, and the test ID for the Confusion Matrix test would then be my_test_provider.classification.ConfusionMatrix.
Let's go ahead and re-run the confusion matrix test with our testing dataset for our two models by using the test ID my_test_provider.ConfusionMatrix. This should load the test from the test provider and run it as before.
# Champion with test dataset and test provider custom test
vm.tests.run_test(
    test_id="my_test_provider.ConfusionMatrix:champion",
    input_grid={
        "dataset": [vm_test_ds],
        "model": [vm_log_model],
    },
).log()
Confusion Matrix Champion
The Confusion Matrix test evaluates the classification performance of the log_model_champion on the test_dataset_final by comparing predicted and true class labels. The resulting 2x2 matrix displays the counts of true positives, true negatives, false positives, and false negatives, providing a comprehensive view of the model's prediction accuracy and error distribution. The matrix quantifies the model's ability to correctly identify both positive and negative cases, as well as the frequency of misclassifications.
Key insights:
Balanced true positive and true negative counts: The model correctly classified 215 positive cases (true positives) and 215 negative cases (true negatives), indicating symmetric performance across both classes.
Notable false negative and false positive rates: There are 119 false negatives and 98 false positives, reflecting a moderate level of misclassification in both directions.
Comparable error distribution across classes: The number of false negatives and false positives are of similar magnitude, suggesting that the model does not exhibit a strong bias toward over- or under-predicting either class.
The confusion matrix reveals that the log_model_champion demonstrates balanced classification performance, with equal accuracy in identifying both positive and negative cases. Misclassification rates are moderate and distributed similarly between false positives and false negatives, indicating no pronounced skew in prediction errors. This pattern suggests consistent model behavior across both classes on the test dataset.
Figures
2026-01-28 18:00:03,845 - INFO(validmind.vm_models.result.result): Test driven block with result_id my_test_provider.ConfusionMatrix:champion does not exist in model's document
# Challenger with test dataset and test provider custom test
vm.tests.run_test(
    test_id="my_test_provider.ConfusionMatrix:challenger",
    input_grid={
        "dataset": [vm_test_ds],
        "model": [vm_rf_model],
    },
).log()
Confusion Matrix Challenger
The Confusion Matrix test evaluates the classification performance of the rf_model on the test_dataset_final by comparing predicted and true labels. The resulting 2x2 matrix displays the counts of true positives, true negatives, false positives, and false negatives, providing a comprehensive view of the model's prediction accuracy and error distribution. The matrix enables assessment of the model's ability to correctly identify both positive and negative cases, as well as the types and frequencies of misclassifications.
Key insights:
Balanced correct classification of both classes: The model correctly classified 227 negative cases (true negatives) and 229 positive cases (true positives), indicating similar performance across both classes.
Moderate false positive and false negative rates: There were 86 false positives and 105 false negatives, reflecting a moderate level of misclassification for both types of errors.
Comparable error distribution: The counts of false positives and false negatives are of similar magnitude, suggesting that the model does not exhibit a strong bias toward over-predicting either class.
The confusion matrix reveals that the rf_model demonstrates balanced classification performance, with similar accuracy for both positive and negative classes. The distribution of misclassifications is relatively even, indicating no pronounced skew toward either false positives or false negatives. Overall, the model maintains a consistent error profile across both classes.
Figures
2026-01-28 18:00:17,870 - INFO(validmind.vm_models.result.result): Test driven block with result_id my_test_provider.ConfusionMatrix:challenger does not exist in model's document
Verify test runs
Our final task is to verify that all the tests provided by the model development team were run and reported accurately. Note the appended result_ids, which delineate which dataset each test was run with where relevant.
Here, we'll specify all the tests we'd like to independently rerun in a dictionary called test_config. Note here that inputs and input_grid expect the input_id of the dataset or model as the value rather than the variable name we specified:
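As a hedged sketch of the shape such a test_config dictionary could take, using a few of the test IDs and input_ids that appear in the results below (the full set of entries provided by the development team may differ):

# Sketch of a test_config mapping test IDs to their inputs and params
# (entries are illustrative; input_id strings refer to datasets registered earlier)
test_config = {
    "validmind.data_validation.DatasetDescription:raw_data": {
        "inputs": {"dataset": "raw_dataset"},
    },
    "validmind.data_validation.ClassImbalance:raw_data": {
        "inputs": {"dataset": "raw_dataset"},
        "params": {"min_percent_threshold": 10},
    },
    "validmind.data_validation.DescriptiveStatistics:preprocessed_data": {
        "inputs": {"dataset": "raw_dataset_preprocessed"},
    },
}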
for t in test_config:
    print(t)
    try:
        # Check if test has input_grid
        if 'input_grid' in test_config[t]:
            # For tests with input_grid, pass the input_grid configuration
            if 'params' in test_config[t]:
                vm.tests.run_test(t, input_grid=test_config[t]['input_grid'], params=test_config[t]['params']).log()
            else:
                vm.tests.run_test(t, input_grid=test_config[t]['input_grid']).log()
        else:
            # Original logic for regular inputs
            if 'params' in test_config[t]:
                vm.tests.run_test(t, inputs=test_config[t]['inputs'], params=test_config[t]['params']).log()
            else:
                vm.tests.run_test(t, inputs=test_config[t]['inputs']).log()
    except Exception as e:
        print(f"Error running test {t}: {str(e)}")
The DatasetDescription test provides a comprehensive summary of the dataset's structure, completeness, and feature characteristics. The results table details each column's data type, count, missingness, and the number of distinct values, offering a clear overview of the dataset composition. All columns are fully populated with no missing values, and the distinct value counts highlight the diversity and granularity of each feature. This summary enables a thorough understanding of the dataset's readiness for modeling and potential areas of complexity.
Key insights:
No missing values across all columns: All 11 columns report 0 missing entries, indicating complete data coverage for every feature.
High cardinality in key numeric features: The Balance and EstimatedSalary columns exhibit high distinct value counts (5088 and 8000 respectively), reflecting continuous or near-continuous distributions.
Low cardinality in categorical features: Categorical columns such as Geography, Gender, HasCrCard, IsActiveMember, and Exited have between 2 and 3 distinct values, supporting straightforward encoding and analysis.
Moderate diversity in demographic and behavioral features: CreditScore and Age show moderate distinct counts (452 and 69), while Tenure and NumOfProducts have lower diversity (11 and 4 distinct values).
The dataset is fully complete with no missing data, supporting robust downstream analysis. Numeric features display a range of cardinalities, from highly granular (EstimatedSalary, Balance) to more discretized (Tenure, NumOfProducts). Categorical features are well-defined with low cardinality, facilitating efficient encoding. The overall structure indicates a dataset suitable for machine learning applications, with no immediate data quality concerns observed in the summary statistics.
Tables
Dataset Description
| Name | Type | Count | Missing | Missing % | Distinct | Distinct % |
|---|---|---|---|---|---|---|
| CreditScore | Numeric | 8000.0 | 0 | 0.0 | 452 | 0.0565 |
| Geography | Categorical | 8000.0 | 0 | 0.0 | 3 | 0.0004 |
| Gender | Categorical | 8000.0 | 0 | 0.0 | 2 | 0.0002 |
| Age | Numeric | 8000.0 | 0 | 0.0 | 69 | 0.0086 |
| Tenure | Numeric | 8000.0 | 0 | 0.0 | 11 | 0.0014 |
| Balance | Numeric | 8000.0 | 0 | 0.0 | 5088 | 0.6360 |
| NumOfProducts | Numeric | 8000.0 | 0 | 0.0 | 4 | 0.0005 |
| HasCrCard | Categorical | 8000.0 | 0 | 0.0 | 2 | 0.0002 |
| IsActiveMember | Categorical | 8000.0 | 0 | 0.0 | 2 | 0.0002 |
| EstimatedSalary | Numeric | 8000.0 | 0 | 0.0 | 8000 | 1.0000 |
| Exited | Categorical | 8000.0 | 0 | 0.0 | 2 | 0.0002 |
2026-01-28 18:00:27,744 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.DatasetDescription:raw_data does not exist in model's document
The Descriptive Statistics test evaluates the distributional characteristics and diversity of both numerical and categorical variables in the dataset. The results present summary statistics for eight numerical variables, including measures of central tendency, dispersion, and range, as well as frequency-based summaries for two categorical variables. The numerical table details counts, means, standard deviations, and percentiles, while the categorical table reports unique value counts and the dominance of the most frequent category. These results provide a comprehensive overview of the dataset’s structure and highlight key aspects of variable distributions.
Key insights:
Wide range and skewness in balance values: The Balance variable exhibits a minimum of 0.0, a median of 97,264.0, and a maximum of 250,898.0, with a mean (76,434.10) substantially below the median, indicating a right-skewed distribution and a significant proportion of zero balances.
High concentration in categorical variables: Geography is dominated by France (50.12% of records), and Gender is predominantly Male (54.95%), indicating limited diversity in these categorical features.
Binary variables with balanced representation: HasCrCard and IsActiveMember are binary, with means of 0.70 and 0.52, respectively, suggesting moderate balance between categories.
Consistent sample sizes and low missingness: All variables report a count of 8,000, indicating no missing data across both numerical and categorical fields.
Substantial spread in estimated salary: EstimatedSalary ranges from 12.0 to 199,992.0, with a mean of 99,790.19 and a standard deviation of 57,520.51, reflecting high variability in this feature.
The dataset demonstrates complete data coverage with no missing values and a broad range of values across key numerical variables. Several variables, such as Balance and EstimatedSalary, display substantial dispersion and skewness, while categorical variables show limited diversity due to the dominance of specific categories. These characteristics provide important context for understanding the underlying data structure and potential sources of model risk related to feature distribution and representativeness.
Tables
Numerical Variables
| Name | Count | Mean | Std | Min | 25% | 50% | 75% | 90% | 95% | Max |
|---|---|---|---|---|---|---|---|---|---|---|
| CreditScore | 8000.0 | 650.1596 | 96.8462 | 350.0 | 583.0 | 652.0 | 717.0 | 778.0 | 813.0 | 850.0 |
| Age | 8000.0 | 38.9489 | 10.4590 | 18.0 | 32.0 | 37.0 | 44.0 | 53.0 | 60.0 | 92.0 |
| Tenure | 8000.0 | 5.0339 | 2.8853 | 0.0 | 3.0 | 5.0 | 8.0 | 9.0 | 9.0 | 10.0 |
| Balance | 8000.0 | 76434.0965 | 62612.2513 | 0.0 | 0.0 | 97264.0 | 128045.0 | 149545.0 | 162488.0 | 250898.0 |
| NumOfProducts | 8000.0 | 1.5325 | 0.5805 | 1.0 | 1.0 | 1.0 | 2.0 | 2.0 | 2.0 | 4.0 |
| HasCrCard | 8000.0 | 0.7026 | 0.4571 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| IsActiveMember | 8000.0 | 0.5199 | 0.4996 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| EstimatedSalary | 8000.0 | 99790.1880 | 57520.5089 | 12.0 | 50857.0 | 99505.0 | 149216.0 | 179486.0 | 189997.0 | 199992.0 |
Categorical Variables
| Name | Count | Number of Unique Values | Top Value | Top Value Frequency | Top Value Frequency % |
|---|---|---|---|---|---|
| Geography | 8000.0 | 3.0 | France | 4010.0 | 50.12 |
| Gender | 8000.0 | 2.0 | Male | 4396.0 | 54.95 |
2026-01-28 18:00:39,250 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.DescriptiveStatistics:raw_data does not exist in model's document
validmind.data_validation.MissingValues:raw_data
✅ Missing Values Raw Data
The Missing Values test evaluates dataset quality by measuring the proportion of missing values in each feature and comparing it to a predefined threshold. The results table presents, for each column, the number and percentage of missing values, along with a Pass/Fail status based on whether the missingness exceeds the threshold. All features in the dataset are shown with zero missing values, and each column is marked as passing the test.
Key insights:
No missing values detected: All features, including CreditScore, Geography, Gender, Age, Tenure, Balance, NumOfProducts, HasCrCard, IsActiveMember, EstimatedSalary, and Exited, have zero missing values.
Universal pass status: Every column meets the missing value threshold criterion, with 0.0% missingness and a Pass status across the dataset.
The dataset demonstrates complete data integrity with respect to missing values, as all features contain full data coverage and satisfy the established threshold. This indicates a high level of data quality for subsequent modeling or analysis steps.
Parameters:
{
"min_threshold": 1
}
Tables
| Column | Number of Missing Values | Percentage of Missing Values (%) | Pass/Fail |
|---|---|---|---|
| CreditScore | 0 | 0.0 | Pass |
| Geography | 0 | 0.0 | Pass |
| Gender | 0 | 0.0 | Pass |
| Age | 0 | 0.0 | Pass |
| Tenure | 0 | 0.0 | Pass |
| Balance | 0 | 0.0 | Pass |
| NumOfProducts | 0 | 0.0 | Pass |
| HasCrCard | 0 | 0.0 | Pass |
| IsActiveMember | 0 | 0.0 | Pass |
| EstimatedSalary | 0 | 0.0 | Pass |
| Exited | 0 | 0.0 | Pass |
2026-01-28 18:00:47,579 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.MissingValues:raw_data does not exist in model's document
validmind.data_validation.ClassImbalance:raw_data
✅ Class Imbalance Raw Data
The Class Imbalance test evaluates the distribution of target classes within the dataset to identify potential imbalances that could impact model performance. The results table presents the percentage of records for each class in the "Exited" target variable, alongside a pass/fail assessment based on a minimum threshold of 10%. The accompanying bar plot visually depicts the proportion of each class, with class 0 and class 1 shown as distinct bars representing their respective frequencies.
Key insights:
Both classes exceed the minimum threshold: Class 0 constitutes 79.80% and class 1 constitutes 20.20% of the dataset, with both surpassing the 10% minimum threshold.
No classes flagged for imbalance: The pass/fail assessment indicates that neither class is under-represented according to the defined criterion.
Class distribution is visually asymmetric: The bar plot highlights a notable difference in class proportions, with class 0 being the majority class.
The results indicate that, while the dataset is not perfectly balanced, both classes meet the minimum representation threshold set for this test. The observed class distribution is asymmetric, with a substantially higher proportion of class 0 compared to class 1, but no class falls below the specified risk threshold for imbalance.
Parameters:
{
"min_percent_threshold": 10
}
Tables
Exited Class Imbalance
| Exited | Percentage of Rows (%) | Pass/Fail |
|---|---|---|
| 0 | 79.80% | Pass |
| 1 | 20.20% | Pass |
Figures
2026-01-28 18:00:59,333 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.ClassImbalance:raw_data does not exist in model's document
validmind.data_validation.Duplicates:raw_data
✅ Duplicates Raw Data
The Duplicates test evaluates the presence of duplicate rows within the dataset to assess data quality and mitigate risks associated with redundant information. The results table presents the absolute number and percentage of duplicate rows detected, providing a quantitative overview of dataset uniqueness. The test was executed with a minimum threshold parameter set to 1, and the results are summarized in the table titled "Duplicate Rows Results for Dataset."
Key insights:
No duplicate rows detected: The dataset contains 0 duplicate rows, as indicated by the "Number of Duplicates" value.
Zero percent duplication: The "Percentage of Rows (%)" is 0.0%, confirming the absence of redundant entries in the dataset.
The results demonstrate that the dataset is free from duplicate rows, indicating a high level of data uniqueness and integrity. The absence of duplication supports reliable model training and reduces the risk of overfitting due to repeated information.
Parameters:
{
"min_threshold": 1
}
Tables
Duplicate Rows Results for Dataset
| Number of Duplicates | Percentage of Rows (%) |
|---|---|
| 0 | 0.0 |
2026-01-28 18:01:05,724 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.Duplicates:raw_data does not exist in model's document
The High Cardinality test evaluates the number of unique values in categorical columns to identify potential risks of overfitting and data noise. The results table presents the number and percentage of distinct values for each categorical column, along with a pass/fail status based on a threshold of 10% distinct values. Both "Geography" and "Gender" columns are assessed, with their respective distinct value counts and percentages reported.
Key insights:
All categorical columns pass cardinality threshold: Both "Geography" (3 distinct values, 0.0375%) and "Gender" (2 distinct values, 0.025%) are well below the 10% threshold, resulting in a "Pass" status for each.
Low cardinality observed across features: The number of unique values in both columns is minimal relative to the total sample size, indicating low cardinality throughout the assessed categorical features.
The results indicate that all evaluated categorical columns exhibit low cardinality, with distinct value counts and percentages substantially below the defined threshold. No evidence of high cardinality or associated overfitting risk is present in the tested features.
2026-01-28 18:01:12,152 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.HighCardinality:raw_data does not exist in model's document
validmind.data_validation.Skewness:raw_data
❌ Skewness Raw Data
The Skewness:raw_data test evaluates the asymmetry of numerical data distributions by calculating skewness values for each numeric column and comparing them to a maximum threshold of 1. The results table presents skewness values and pass/fail outcomes for each column, indicating whether the distributional asymmetry exceeds the defined threshold. Columns with skewness values below 1 are marked as "Pass," while those exceeding the threshold are marked as "Fail." This assessment provides a quantitative overview of distributional characteristics relevant to data quality and model performance.
Key insights:
Most columns exhibit low skewness: The majority of numeric columns, including CreditScore, Tenure, Balance, NumOfProducts, HasCrCard, IsActiveMember, and EstimatedSalary, have skewness values well below the threshold of 1 and pass the test.
Two columns exceed skewness threshold: Age (skewness = 1.0245) and Exited (skewness = 1.4847) exceed the maximum threshold, resulting in a fail outcome for these columns.
Skewness values are generally close to zero: Several columns, such as Tenure (0.0077), EstimatedSalary (0.0095), and CreditScore (‑0.062), display skewness values near zero, indicating near-symmetric distributions.
The skewness assessment reveals that most numeric columns in the dataset have distributions that are approximately symmetric or only moderately skewed, remaining within the defined threshold. However, Age and Exited display higher levels of asymmetry, exceeding the maximum skewness threshold and indicating notable distributional skew in these variables. The overall distributional profile suggests that, aside from these exceptions, the dataset maintains a balanced structure with respect to skewness.
Parameters:
{
"max_threshold": 1
}
Tables
Skewness Results for Dataset
| Column | Skewness | Pass/Fail |
|---|---|---|
| CreditScore | -0.0620 | Pass |
| Age | 1.0245 | Fail |
| Tenure | 0.0077 | Pass |
| Balance | -0.1353 | Pass |
| NumOfProducts | 0.7172 | Pass |
| HasCrCard | -0.8867 | Pass |
| IsActiveMember | -0.0796 | Pass |
| EstimatedSalary | 0.0095 | Pass |
| Exited | 1.4847 | Fail |
2026-01-28 18:01:24,737 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.Skewness:raw_data does not exist in model's document
validmind.data_validation.UniqueRows:raw_data
❌ Unique Rows Raw Data
The UniqueRows test evaluates the diversity of the dataset by measuring the proportion of unique values in each column and comparing it to a minimum percentage threshold. The results table presents, for each column, the number and percentage of unique values, along with a pass/fail outcome based on whether the percentage exceeds the 1% threshold. Columns such as CreditScore, Balance, and EstimatedSalary show high percentages of unique values and pass the test, while most categorical and low-cardinality columns do not meet the threshold and fail.
Key insights:
High uniqueness in continuous variables: EstimatedSalary (100%), Balance (63.6%), and CreditScore (5.65%) exceed the 1% uniqueness threshold, indicating substantial diversity in these columns.
Low uniqueness in categorical variables: Columns such as Geography (0.0375%), Gender (0.025%), HasCrCard (0.025%), IsActiveMember (0.025%), and Exited (0.025%) have very low percentages of unique values and fail the test.
Majority of columns fail uniqueness threshold: Only 3 out of 11 columns pass the test, with the remaining 8 columns—including Age (0.8625%), Tenure (0.1375%), and NumOfProducts (0.05%)—falling below the 1% threshold.
The results indicate that while continuous variables in the dataset exhibit high diversity, the majority of categorical and low-cardinality columns do not meet the minimum uniqueness threshold. This pattern reflects a concentration of unique values in a subset of features, with limited diversity observed in most categorical variables. The overall data structure is characterized by high uniqueness in select columns and low uniqueness in the majority of others.
Parameters:
{
"min_percent_threshold": 1
}
Tables
| Column | Number of Unique Values | Percentage of Unique Values (%) | Pass/Fail |
|---|---|---|---|
| CreditScore | 452 | 5.6500 | Pass |
| Geography | 3 | 0.0375 | Fail |
| Gender | 2 | 0.0250 | Fail |
| Age | 69 | 0.8625 | Fail |
| Tenure | 11 | 0.1375 | Fail |
| Balance | 5088 | 63.6000 | Pass |
| NumOfProducts | 4 | 0.0500 | Fail |
| HasCrCard | 2 | 0.0250 | Fail |
| IsActiveMember | 2 | 0.0250 | Fail |
| EstimatedSalary | 8000 | 100.0000 | Pass |
| Exited | 2 | 0.0250 | Fail |
2026-01-28 18:01:34,678 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.UniqueRows:raw_data does not exist in model's document
The TooManyZeroValues test identifies numerical columns with a proportion of zero values exceeding a defined threshold, set here at 0.03%. The results table summarizes the number and percentage of zero values for each numerical column, along with a pass/fail status based on the threshold. All four evaluated columns—Tenure, Balance, HasCrCard, and IsActiveMember—are reported with their respective zero value counts and fail the test due to exceeding the threshold.
Key insights:
All evaluated columns exceed zero value threshold: Each of the four numerical columns has a percentage of zero values significantly above the 0.03% threshold, resulting in a fail status for all.
High concentration of zeros in Balance and IsActiveMember: Balance contains 36.4% zero values, and IsActiveMember contains 48.01%, indicating substantial sparsity in these features.
Binary indicator columns show elevated zero rates: HasCrCard and IsActiveMember, likely representing binary indicators, have 29.74% and 48.01% zero values, respectively, reflecting a high proportion of one class.
Tenure column also affected: Tenure registers 4.04% zero values, which, while lower than other columns, still exceeds the threshold and results in a fail.
All assessed numerical columns display zero value proportions well above the defined threshold, with Balance and IsActiveMember exhibiting particularly high rates of zeros. The presence of elevated zero counts across both continuous and likely binary columns indicates a pattern of data sparsity or class imbalance in these features. This distribution warrants consideration in subsequent modeling steps, as it may influence feature utility and model performance.
Parameters:
{
"max_percent_threshold": 0.03
}
Tables
| Variable | Row Count | Number of Zero Values | Percentage of Zero Values (%) | Pass/Fail |
|---|---|---|---|---|
| Tenure | 8000 | 323 | 4.0375 | Fail |
| Balance | 8000 | 2912 | 36.4000 | Fail |
| HasCrCard | 8000 | 2379 | 29.7375 | Fail |
| IsActiveMember | 8000 | 3841 | 48.0125 | Fail |
2026-01-28 18:01:43,906 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.TooManyZeroValues:raw_data does not exist in model's document
The Interquartile Range Outliers Table (IQROutliersTable) test identifies and summarizes outliers in numerical features using the IQR method, with the outlier threshold set to 5. The results are presented in a summary table that would list, for each numerical feature, the count and distributional statistics of detected outliers. In this test run, the summary table contains no entries, indicating the absence of detected outliers across all evaluated numerical features.
Key insights:
No outliers detected in any feature: The summary table is empty, confirming that no numerical features exhibited values outside the IQR-based outlier thresholds.
Uniform data distribution within threshold: All numerical feature values fall within the calculated IQR bounds, given the threshold of 5.
The absence of detected outliers indicates that the dataset's numerical features are uniformly distributed within the specified IQR threshold. No evidence of extreme or anomalous values was observed under the applied test parameters.
Parameters:
{
"threshold": 5
}
Tables
Summary of Outliers Detected by IQR Method
2026-01-28 18:01:51,989 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.IQROutliersTable:raw_data does not exist in model's document
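For reference, the IQR rule with `threshold=5` flags values outside [Q1 − 5·IQR, Q3 + 5·IQR]. A hedged pandas sketch of that check (the ValidMind implementation may differ in details such as which columns it scans):

```python
import pandas as pd

def iqr_outliers(series: pd.Series, threshold: float = 5.0) -> pd.Series:
    """Return values outside [Q1 - threshold*IQR, Q3 + threshold*IQR]."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return series[(series < q1 - threshold * iqr) | (series > q3 + threshold * iqr)]

# With such a wide threshold, none of the numeric columns is expected to produce outliers,
# which is consistent with the empty summary table above.
for col in raw_df.select_dtypes(include="number").columns:
    print(col, len(iqr_outliers(raw_df[col], threshold=5)))
```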
The Descriptive Statistics test evaluates the distributional characteristics and diversity of both numerical and categorical variables in the preprocessed dataset. The results are presented in two summary tables: one for numerical variables, detailing central tendency, dispersion, and range; and one for categorical variables, summarizing value counts, unique value diversity, and the dominance of top categories. These tables provide a comprehensive overview of the dataset’s structure, supporting assessment of data quality and potential risk factors.
Key insights:
Wide range and skewness in Balance: The Balance variable exhibits a minimum of 0.0, a median of 102,386.0, and a maximum of 250,898.0, with a mean (81,585.1) substantially below the median, indicating right-skewness and a concentration of lower values.
CreditScore distribution is broad and symmetric: CreditScore ranges from 350.0 to 850.0, with a mean (647.7) closely aligned to the median (649.0), suggesting a relatively symmetric distribution.
Categorical variables show moderate diversity: Geography has three unique values, with France as the most frequent (46.04%), and Gender is nearly balanced (Male: 51.27%), indicating no single category dominates excessively.
Binary variables are well represented: HasCrCard and IsActiveMember are binary, with HasCrCard showing 69.5% positive responses and IsActiveMember at 47.3%, reflecting reasonable class balance.
The dataset demonstrates a broad and well-populated range for key numerical variables, with some evidence of skewness in Balance. Categorical variables display moderate diversity, with no overwhelming dominance by a single category. Binary variables are distributed without extreme imbalance. Overall, the data structure supports robust modeling, with distributional characteristics that warrant monitoring for potential skewness or concentration effects.
Tables

Numerical Variables

| Name | Count | Mean | Std | Min | 25% | 50% | 75% | 90% | 95% | Max |
|---|---|---|---|---|---|---|---|---|---|---|
| CreditScore | 3232.0 | 647.6711 | 98.7191 | 350.0 | 581.0 | 649.0 | 717.0 | 779.0 | 817.0 | 850.0 |
| Tenure | 3232.0 | 5.0285 | 2.9152 | 0.0 | 3.0 | 5.0 | 8.0 | 9.0 | 10.0 | 10.0 |
| Balance | 3232.0 | 81585.0888 | 61380.7207 | 0.0 | 0.0 | 102386.0 | 128886.0 | 150802.0 | 164733.0 | 250898.0 |
| NumOfProducts | 3232.0 | 1.5108 | 0.6703 | 1.0 | 1.0 | 1.0 | 2.0 | 2.0 | 3.0 | 4.0 |
| HasCrCard | 3232.0 | 0.6952 | 0.4604 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| IsActiveMember | 3232.0 | 0.4728 | 0.4993 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| EstimatedSalary | 3232.0 | 100296.6888 | 57771.2303 | 12.0 | 50956.0 | 100642.0 | 150557.0 | 179122.0 | 188759.0 | 199992.0 |

Categorical Variables

| Name | Count | Number of Unique Values | Top Value | Top Value Frequency | Top Value Frequency % |
|---|---|---|---|---|---|
| Geography | 3232.0 | 3.0 | France | 1488.0 | 46.04 |
| Gender | 3232.0 | 2.0 | Male | 1657.0 | 51.27 |
2026-01-28 18:02:02,433 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.DescriptiveStatistics:preprocessed_data does not exist in model's document
The Descriptive Statistics test evaluates the distribution, completeness, and data types of numerical and categorical variables in the dataset. The results present summary statistics for eight numerical variables and two categorical variables, including measures of central tendency, range, missingness, and unique value counts. All variables are reported with their respective data types and observed value ranges, providing a comprehensive overview of the dataset’s structure and integrity.
Key insights:
No missing values detected: All numerical and categorical variables report 0.0% missing values, indicating complete data coverage across all fields.
Consistent data types across variables: Numerical variables are represented as int64 or float64, while categorical variables are of object type, aligning with their expected formats.
Limited cardinality in categorical variables: Geography contains three unique values (France, Germany, Spain), and Gender contains two unique values (Female, Male), supporting straightforward categorical encoding.
Wide value ranges in numerical variables: CreditScore ranges from 350 to 850, Balance from 0.0 to 250,898.09, and EstimatedSalary from 11.58 to 199,992.48, reflecting substantial spread in key financial indicators.
Binary encoding for indicator variables: HasCrCard, IsActiveMember, and Exited are encoded as binary (0/1) int64 variables, supporting direct use in binary classification or indicator analysis.
The dataset exhibits complete data with no missing values and appropriate data types for all variables. Categorical variables display low cardinality, and numerical variables cover broad value ranges, particularly in financial fields. The structure and integrity of the data are well-documented, providing a reliable foundation for subsequent modeling and analysis.
Tables

| Numerical Variable | Num of Obs | Mean | Min | Max | Missing Values (%) | Data Type |
|---|---|---|---|---|---|---|
| CreditScore | 3232 | 647.6711 | 350.00 | 850.00 | 0.0 | int64 |
| Tenure | 3232 | 5.0285 | 0.00 | 10.00 | 0.0 | int64 |
| Balance | 3232 | 81585.0888 | 0.00 | 250898.09 | 0.0 | float64 |
| NumOfProducts | 3232 | 1.5108 | 1.00 | 4.00 | 0.0 | int64 |
| HasCrCard | 3232 | 0.6952 | 0.00 | 1.00 | 0.0 | int64 |
| IsActiveMember | 3232 | 0.4728 | 0.00 | 1.00 | 0.0 | int64 |
| EstimatedSalary | 3232 | 100296.6888 | 11.58 | 199992.48 | 0.0 | float64 |
| Exited | 3232 | 0.5000 | 0.00 | 1.00 | 0.0 | int64 |

| Categorical Variable | Num of Obs | Num of Unique Values | Unique Values | Missing Values (%) | Data Type |
|---|---|---|---|---|---|
| Geography | 3232.0 | 3.0 | ['France' 'Germany' 'Spain'] | 0.0 | object |
| Gender | 3232.0 | 2.0 | ['Female' 'Male'] | 0.0 | object |
2026-01-28 18:02:10,565 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.TabularDescriptionTables:preprocessed_data does not exist in model's document
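Both sets of summary tables can be sanity-checked directly against the preprocessed dataframe. A sketch, where `preprocessed_df` is a placeholder name for the dataframe behind the `preprocessed_data` input:

```python
# Cross-check summary statistics, data types, and missingness with pandas
print(preprocessed_df.describe(percentiles=[0.25, 0.5, 0.75, 0.9, 0.95]).T.round(4))
print(preprocessed_df.dtypes)
print((preprocessed_df.isnull().mean() * 100).round(2))  # missing values (%) per column
```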
The Missing Values test evaluates dataset quality by measuring the proportion of missing values in each feature and comparing it to a predefined threshold. The results table presents, for each column, the number and percentage of missing values, along with a Pass/Fail status based on whether the missingness exceeds the specified threshold. All features in the dataset are listed with their corresponding missing value statistics and test outcomes.
Key insights:
No missing values detected: All features report zero missing values, with both the number and percentage of missing entries recorded as 0.0%.
Universal test pass across features: Every column meets the missing value threshold criterion, resulting in a Pass status for all features.
The dataset demonstrates complete data integrity with respect to missing values, as all features contain fully populated entries and satisfy the established threshold. This indicates a high level of data quality for subsequent modeling or analysis steps.
Parameters:
{
"min_threshold": 1
}
Tables

| Column | Number of Missing Values | Percentage of Missing Values (%) | Pass/Fail |
|---|---|---|---|
| CreditScore | 0 | 0.0 | Pass |
| Geography | 0 | 0.0 | Pass |
| Gender | 0 | 0.0 | Pass |
| Tenure | 0 | 0.0 | Pass |
| Balance | 0 | 0.0 | Pass |
| NumOfProducts | 0 | 0.0 | Pass |
| HasCrCard | 0 | 0.0 | Pass |
| IsActiveMember | 0 | 0.0 | Pass |
| EstimatedSalary | 0 | 0.0 | Pass |
| Exited | 0 | 0.0 | Pass |
2026-01-28 18:02:15,261 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.MissingValues:preprocessed_data does not exist in model's document
The TabularNumericalHistograms:preprocessed_data test provides a visual summary of the distribution of each numerical feature in the dataset using histograms. The resulting plots display the frequency distribution for variables such as CreditScore, Tenure, Balance, NumOfProducts, HasCrCard, IsActiveMember, and EstimatedSalary. These visualizations enable assessment of central tendency, spread, skewness, and the presence of outliers or unusual patterns in the input data.
Key insights:
CreditScore distribution is unimodal and slightly right-skewed: The CreditScore histogram shows a single peak centered around 650–700, with a longer tail extending toward higher values, indicating mild right skewness and a concentration of scores in the mid-to-high range.
Tenure is nearly uniform except at endpoints: The Tenure variable displays an approximately uniform distribution across most values, with lower frequencies at the minimum and maximum tenure values.
Balance exhibits a strong spike at zero: The Balance histogram reveals a pronounced spike at zero, with the remainder of the distribution forming a bell-shaped curve centered around 120,000, indicating a substantial subset of accounts with no balance.
NumOfProducts is highly concentrated at lower values: The distribution of NumOfProducts is heavily concentrated at 1 and 2 products, with very few instances at 3 or 4, indicating limited product diversification among most customers.
HasCrCard and IsActiveMember are binary with class imbalance: Both HasCrCard and IsActiveMember show binary distributions, with HasCrCard skewed toward 1 (majority have a credit card) and IsActiveMember showing a moderate imbalance between active and inactive members.
EstimatedSalary is uniformly distributed: The EstimatedSalary histogram is approximately flat across the range, indicating a uniform distribution of salary values in the dataset.
The histograms collectively indicate that most numerical features are either uniformly or unimodally distributed, with some variables exhibiting notable skewness or concentration at specific values. The presence of a large number of zero balances and the concentration of product counts at lower values are prominent characteristics. Binary features display varying degrees of class imbalance. No extreme outliers or highly irregular distributions are observed in the visualized features.
Figures
2026-01-28 18:02:35,003 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.TabularNumericalHistograms:preprocessed_data does not exist in model's document
The TabularCategoricalBarPlots test evaluates the distribution of categorical variables by generating bar plots for each category within the dataset. The resulting plots display the frequency counts for each category in the "Geography" and "Gender" features, providing a visual summary of the dataset's categorical composition. These visualizations facilitate the identification of category balance and potential representation issues within the data.
Key insights:
Geography distribution is imbalanced: The "Geography" feature shows the highest count for France, followed by Germany and then Spain, with France having approximately 50% more instances than Spain.
Gender distribution is relatively balanced: The "Gender" feature displays similar counts for Male and Female categories, with only a modest difference between the two.
The categorical composition of the dataset reveals a notable imbalance in the "Geography" feature, with France being the most represented and Spain the least. In contrast, the "Gender" feature demonstrates a near-equal distribution between categories. These patterns provide a clear overview of category representation, highlighting areas where category imbalance may influence downstream modeling.
Figures
2026-01-28 18:02:45,130 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.TabularCategoricalBarPlots:preprocessed_data does not exist in model's document
The TargetRateBarPlots test visualizes the distribution and target rates of categorical features to provide insight into model decision patterns. The results display paired bar plots for each categorical variable, with the left plot showing the frequency of each category and the right plot depicting the mean target rate (proportion of positive class) for each category. The features analyzed include Geography and Gender, with each category’s count and corresponding target rate presented side by side for direct comparison.
Key insights:
Distinct target rate variation by Geography: The target rate for Germany is notably higher than for France and Spain, with Germany exceeding 0.6 while France and Spain are both near 0.4.
Balanced category representation in Gender: Male and Female categories have similar sample counts, each above 1500, indicating balanced representation in the dataset.
Gender target rate disparity: The target rate for Female is higher than for Male, with Female above 0.5 and Male below 0.45.
Uneven category counts in Geography: France has the highest count, followed by Germany and then Spain, indicating some imbalance in category frequencies.
The results reveal pronounced differences in target rates across both Geography and Gender categories, with Germany and Female categories exhibiting higher proportions of positive class outcomes. Category representation is balanced for Gender but shows moderate imbalance for Geography. These patterns highlight areas where model predictions and data composition differ across categorical groups.
Figures
2026-01-28 18:02:56,308 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.TargetRateBarPlots:preprocessed_data does not exist in model's document
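The per-category target rates shown in these plots can be reproduced with a simple group-by. Again a sketch using the `preprocessed_df` placeholder:

```python
# Category counts and mean target rate (share of Exited == 1) per category
for col in ["Geography", "Gender"]:
    print(preprocessed_df.groupby(col)["Exited"].agg(count="size", target_rate="mean").round(4), "\n")
```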
The Descriptive Statistics test evaluates the distributional characteristics of numerical variables in the development and test datasets. The results present summary statistics for each variable, including count, mean, standard deviation, minimum, maximum, and key percentiles. These statistics provide a comprehensive overview of the central tendency, dispersion, and range for each feature, enabling assessment of data quality and potential risk factors.
Key insights:
Consistent central tendencies across datasets: Means and medians (50th percentiles) for key variables such as CreditScore, Tenure, Balance, NumOfProducts, HasCrCard, IsActiveMember, and EstimatedSalary are closely aligned between the development and test datasets, indicating stable distributions.
Wide range and high variance in Balance and EstimatedSalary: Both Balance and EstimatedSalary exhibit large standard deviations (Balance: ~61,500–61,800; EstimatedSalary: ~57,300–57,900) and wide ranges, with minimum values near zero and maximums exceeding 199,000, reflecting substantial dispersion and potential for outliers.
Binary variables show expected distribution: HasCrCard and IsActiveMember are binary variables with means near 0.7 and 0.47, respectively, and standard deviations close to 0.46–0.50, consistent with their binary nature and indicating no missingness.
No evidence of missing data: All variables report counts equal to the total number of records in their respective datasets, indicating complete data coverage for the analyzed features.
The descriptive statistics indicate that the numerical variables in both the development and test datasets are well-aligned in terms of central tendency and spread, with no evidence of missing data. The wide dispersion observed in Balance and EstimatedSalary highlights the presence of substantial variability, which may influence model sensitivity to these features. Binary variables display balanced distributions, and overall, the datasets exhibit stable and consistent statistical properties across splits.
Tables

| dataset | Name | Count | Mean | Std | Min | 25% | 50% | 75% | 90% | 95% | Max |
|---|---|---|---|---|---|---|---|---|---|---|---|
| train_dataset_final | CreditScore | 2585.0 | 648.3965 | 98.5054 | 350.0 | 581.0 | 648.0 | 717.0 | 779.0 | 818.0 | 850.0 |
| train_dataset_final | Tenure | 2585.0 | 4.9988 | 2.9000 | 0.0 | 3.0 | 5.0 | 8.0 | 9.0 | 10.0 | 10.0 |
| train_dataset_final | Balance | 2585.0 | 81084.2264 | 61520.0012 | 0.0 | 0.0 | 101487.0 | 128744.0 | 150547.0 | 164361.0 | 250898.0 |
| train_dataset_final | NumOfProducts | 2585.0 | 1.5122 | 0.6729 | 1.0 | 1.0 | 1.0 | 2.0 | 2.0 | 3.0 | 4.0 |
| train_dataset_final | HasCrCard | 2585.0 | 0.6917 | 0.4619 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| train_dataset_final | IsActiveMember | 2585.0 | 0.4723 | 0.4993 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| train_dataset_final | EstimatedSalary | 2585.0 | 100084.7896 | 57898.2146 | 12.0 | 50073.0 | 99504.0 | 151317.0 | 179252.0 | 189373.0 | 199992.0 |
| test_dataset_final | CreditScore | 647.0 | 644.7728 | 99.5921 | 350.0 | 578.0 | 650.0 | 716.0 | 779.0 | 809.0 | 850.0 |
| test_dataset_final | Tenure | 647.0 | 5.1468 | 2.9744 | 0.0 | 3.0 | 5.0 | 8.0 | 9.0 | 10.0 | 10.0 |
| test_dataset_final | Balance | 647.0 | 83586.2161 | 60827.1306 | 0.0 | 0.0 | 105420.0 | 129698.0 | 151837.0 | 165360.0 | 210433.0 |
| test_dataset_final | NumOfProducts | 647.0 | 1.5054 | 0.6604 | 1.0 | 1.0 | 1.0 | 2.0 | 2.0 | 3.0 | 4.0 |
| test_dataset_final | HasCrCard | 647.0 | 0.7094 | 0.4544 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| test_dataset_final | IsActiveMember | 647.0 | 0.4745 | 0.4997 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| test_dataset_final | EstimatedSalary | 647.0 | 101143.3033 | 57297.7411 | 143.0 | 52326.0 | 104153.0 | 149426.0 | 178913.0 | 188084.0 | 199662.0 |
2026-01-28 18:03:14,102 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.DescriptiveStatistics:development_data does not exist in model's document
The Descriptive Statistics test evaluates the distributional characteristics and completeness of numerical variables in the train and test datasets. The results present summary statistics, including mean, minimum, maximum, and missing value percentages, for each numerical variable across both datasets. All variables are reported with their respective data types and observation counts, providing a comprehensive overview of the dataset structure and integrity.
Key insights:
No missing values across all variables: All numerical variables in both train and test datasets report 0.0% missing values, indicating complete data coverage for these fields.
Consistent data types and observation counts: All variables are typed as either int64 or float64, with observation counts matching the dataset sizes (2585 for train, 647 for test), confirming structural consistency.
Stable variable ranges between datasets: Minimum and maximum values for variables such as CreditScore, Tenure, Balance, NumOfProducts, HasCrCard, IsActiveMember, EstimatedSalary, and Exited are closely aligned between train and test datasets, with only minor variations.
Comparable central tendencies: Mean values for all variables are similar between train and test datasets, with differences remaining within a narrow range (e.g., CreditScore mean: 648.40 in train vs. 644.77 in test; EstimatedSalary mean: 100,084.79 in train vs. 101,143.30 in test).
The descriptive statistics indicate that the numerical variables are complete, consistently typed, and structurally aligned across both train and test datasets. Central tendencies and value ranges are stable, supporting the integrity and comparability of the data for subsequent modeling and analysis. No data quality issues or anomalies are observed in the reported statistics.
Tables

| dataset | Numerical Variable | Num of Obs | Mean | Min | Max | Missing Values (%) | Data Type |
|---|---|---|---|---|---|---|---|
| train_dataset_final | CreditScore | 2585 | 648.3965 | 350.00 | 850.00 | 0.0 | int64 |
| train_dataset_final | Tenure | 2585 | 4.9988 | 0.00 | 10.00 | 0.0 | int64 |
| train_dataset_final | Balance | 2585 | 81084.2264 | 0.00 | 250898.09 | 0.0 | float64 |
| train_dataset_final | NumOfProducts | 2585 | 1.5122 | 1.00 | 4.00 | 0.0 | int64 |
| train_dataset_final | HasCrCard | 2585 | 0.6917 | 0.00 | 1.00 | 0.0 | int64 |
| train_dataset_final | IsActiveMember | 2585 | 0.4723 | 0.00 | 1.00 | 0.0 | int64 |
| train_dataset_final | EstimatedSalary | 2585 | 100084.7896 | 11.58 | 199992.48 | 0.0 | float64 |
| train_dataset_final | Exited | 2585 | 0.4959 | 0.00 | 1.00 | 0.0 | int64 |
| test_dataset_final | CreditScore | 647 | 644.7728 | 350.00 | 850.00 | 0.0 | int64 |
| test_dataset_final | Tenure | 647 | 5.1468 | 0.00 | 10.00 | 0.0 | int64 |
| test_dataset_final | Balance | 647 | 83586.2161 | 0.00 | 210433.08 | 0.0 | float64 |
| test_dataset_final | NumOfProducts | 647 | 1.5054 | 1.00 | 4.00 | 0.0 | int64 |
| test_dataset_final | HasCrCard | 647 | 0.7094 | 0.00 | 1.00 | 0.0 | int64 |
| test_dataset_final | IsActiveMember | 647 | 0.4745 | 0.00 | 1.00 | 0.0 | int64 |
| test_dataset_final | EstimatedSalary | 647 | 101143.3033 | 142.81 | 199661.50 | 0.0 | float64 |
| test_dataset_final | Exited | 647 | 0.5162 | 0.00 | 1.00 | 0.0 | int64 |
2026-01-28 18:03:24,312 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.TabularDescriptionTables:development_data does not exist in model's document
The Class Imbalance test evaluates the distribution of target classes within the training and test datasets to identify potential imbalances that could affect model performance. The results present the percentage representation of each class in both datasets, with a minimum threshold of 10% set for each class to pass. Bar plots visualize the class proportions, supporting interpretation of the class distribution.
Key insights:
Balanced class distribution in training data: In the train_dataset_final, class 0 constitutes 50.41% and class 1 constitutes 49.59% of the records, both well above the 10% threshold.
Balanced class distribution in test data: In the test_dataset_final, class 1 constitutes 51.62% and class 0 constitutes 48.38% of the records, both exceeding the minimum threshold.
All classes pass imbalance criteria: Both classes in both datasets are marked as "Pass" for the class imbalance test, indicating no class falls below the specified threshold.
Visual confirmation of near-equal proportions: Bar plots for both datasets show visually similar heights for each class, confirming the quantitative findings.
The results indicate that both the training and test datasets exhibit a balanced distribution of the target classes, with each class representing approximately half of the records. No evidence of class imbalance is observed, and all classes meet the minimum representation criteria. This distribution supports the development and evaluation of models without risk of bias due to class under-representation.
Parameters:
{
"min_percent_threshold": 10
}
Tables

| dataset | Exited | Percentage of Rows (%) | Pass/Fail |
|---|---|---|---|
| train_dataset_final | 0 | 50.41% | Pass |
| train_dataset_final | 1 | 49.59% | Pass |
| test_dataset_final | 1 | 51.62% | Pass |
| test_dataset_final | 0 | 48.38% | Pass |
Figures
2026-01-28 18:03:35,823 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.ClassImbalance:development_data does not exist in model's document
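A quick way to confirm these proportions outside of ValidMind, as a sketch where `train_df` and `test_df` are placeholder names for the dataframes behind `train_dataset_final` and `test_dataset_final`:

```python
# Recompute class proportions for the final train/test splits
for name, df in [("train_dataset_final", train_df), ("test_dataset_final", test_df)]:
    pct = df["Exited"].value_counts(normalize=True).mul(100).round(2)
    print(name, pct.to_dict())
```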
The UniqueRows test evaluates the diversity of each column in the dataset by measuring the proportion of unique values relative to the total number of rows, with a minimum threshold set at 1%. The results table presents the number and percentage of unique values for each column in both the training and test datasets, along with a pass/fail outcome based on the threshold. Columns with a percentage of unique values below 1% are marked as "Fail," while those meeting or exceeding the threshold are marked as "Pass."
Key insights:
High uniqueness in continuous variables: Columns such as EstimatedSalary and Balance exhibit high percentages of unique values (100% and 67.7% in training; 100% and 69.4% in test), consistently passing the uniqueness threshold.
Low uniqueness in categorical variables: Columns representing categorical or binary features (e.g., HasCrCard, IsActiveMember, Geography_Germany, Geography_Spain, Gender_Male, Exited) have very low percentages of unique values (all below 1%) and fail the test in both datasets.
Mixed results for ordinal variables: CreditScore passes the threshold in both datasets (16.5% in training, 47.3% in test), while Tenure passes in the test set (1.7%) but fails in the training set (0.43%).
Consistent failure in product count features: NumOfProducts fails the uniqueness threshold in both datasets, with unique value percentages well below 1%.
The results indicate that continuous variables in both training and test datasets demonstrate high diversity, consistently exceeding the uniqueness threshold. In contrast, categorical and binary variables, as well as certain ordinal features, exhibit low uniqueness and do not meet the threshold, resulting in test failures for these columns. This pattern reflects the inherent nature of categorical and binary variables, which are limited in the number of possible unique values, and highlights the distinction in uniqueness profiles between variable types within the dataset.
Parameters:
{
"min_percent_threshold": 1
}
Tables

| dataset | Column | Number of Unique Values | Percentage of Unique Values (%) | Pass/Fail |
|---|---|---|---|---|
| train_dataset_final | CreditScore | 427 | 16.5184 | Pass |
| train_dataset_final | Tenure | 11 | 0.4255 | Fail |
| train_dataset_final | Balance | 1750 | 67.6983 | Pass |
| train_dataset_final | NumOfProducts | 4 | 0.1547 | Fail |
| train_dataset_final | HasCrCard | 2 | 0.0774 | Fail |
| train_dataset_final | IsActiveMember | 2 | 0.0774 | Fail |
| train_dataset_final | EstimatedSalary | 2585 | 100.0000 | Pass |
| train_dataset_final | Geography_Germany | 2 | 0.0774 | Fail |
| train_dataset_final | Geography_Spain | 2 | 0.0774 | Fail |
| train_dataset_final | Gender_Male | 2 | 0.0774 | Fail |
| train_dataset_final | Exited | 2 | 0.0774 | Fail |
| test_dataset_final | CreditScore | 306 | 47.2952 | Pass |
| test_dataset_final | Tenure | 11 | 1.7002 | Pass |
| test_dataset_final | Balance | 449 | 69.3972 | Pass |
| test_dataset_final | NumOfProducts | 4 | 0.6182 | Fail |
| test_dataset_final | HasCrCard | 2 | 0.3091 | Fail |
| test_dataset_final | IsActiveMember | 2 | 0.3091 | Fail |
| test_dataset_final | EstimatedSalary | 647 | 100.0000 | Pass |
| test_dataset_final | Geography_Germany | 2 | 0.3091 | Fail |
| test_dataset_final | Geography_Spain | 2 | 0.3091 | Fail |
| test_dataset_final | Gender_Male | 2 | 0.3091 | Fail |
| test_dataset_final | Exited | 2 | 0.3091 | Fail |
2026-01-28 18:03:51,084 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.UniqueRows:development_data does not exist in model's document
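The same uniqueness percentages can be recomputed in a few lines per dataset, as a sketch using the `train_df` / `test_df` placeholders introduced above:

```python
import pandas as pd

# Percentage of unique values per column and the corresponding pass/fail outcome
for name, df in [("train_dataset_final", train_df), ("test_dataset_final", test_df)]:
    unique_pct = (100 * df.nunique() / len(df)).round(4)
    outcome = (unique_pct >= 1).map({True: "Pass", False: "Fail"})
    print(name)
    print(pd.concat([unique_pct.rename("Percentage of Unique Values (%)"),
                     outcome.rename("Pass/Fail")], axis=1))
```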
The TabularNumericalHistograms test provides a visual assessment of the distribution of each numerical feature in both the training and test datasets. The resulting histograms display the frequency distribution for each variable, enabling identification of skewness, outliers, and other distributional characteristics. This visualization supports the evaluation of input data quality and the detection of potential risks related to data distribution.
Key insights:
CreditScore distribution is right-skewed: Both training and test datasets show a right-skewed distribution for CreditScore, with a concentration of values between 550 and 750 and a long tail toward higher scores.
Tenure is nearly uniform except at boundaries: Tenure displays an approximately uniform distribution across most values, with lower frequencies at the minimum and maximum values in both datasets.
Balance shows a high frequency at zero: A substantial proportion of records have a zero balance, with the remainder forming a bell-shaped distribution centered around 120,000–130,000.
NumOfProducts is heavily concentrated at lower values: The majority of records have one or two products, with very few instances at three or four products.
HasCrCard and IsActiveMember are binary with class imbalance: Both features are binary; HasCrCard is skewed toward 1, while IsActiveMember is more evenly split but still shows a higher count for 0 in the training set.
EstimatedSalary is uniformly distributed: EstimatedSalary appears to be evenly distributed across its range in both datasets, with no significant skewness or outliers.
Geography and Gender features show categorical splits: Geography_Germany and Geography_Spain indicate more records in the 'false' category, while Gender_Male is nearly balanced between true and false.
The histograms reveal that most numerical features exhibit stable and interpretable distributions across both training and test datasets. Notable characteristics include a high proportion of zero values in Balance, right-skewness in CreditScore, and strong concentration at lower values for NumOfProducts. Binary and categorical features display varying degrees of class imbalance. No extreme outliers or unexpected distributional anomalies are observed, supporting the overall integrity of the input data distributions.
Figures
2026-01-28 18:04:11,858 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.TabularNumericalHistograms:development_data does not exist in model's document
The Mutual Information test evaluates the statistical dependency between each feature and the target variable to quantify feature relevance for model development. The results are presented as normalized mutual information scores for both the training and test datasets, with a threshold of 0.01 used to distinguish features with minimal information content. Bar plots display the relative importance of each feature, highlighting the distribution and magnitude of mutual information across the feature set.
Key insights:
NumOfProducts consistently highest information score: NumOfProducts exhibits the highest mutual information score in both training (≈0.09) and test (≈0.096) datasets, indicating strong relevance to the target variable.
Feature importance distribution shifts between datasets: In the training set, Geography_Germany and Balance follow as the next most informative features, while in the test set, Balance, IsActiveMember, and Gender_Male show increased scores and prominence.
Several features below information threshold: In the training set, CreditScore, IsActiveMember, Geography_Spain, Gender_Male, Tenure, HasCrCard, and EstimatedSalary all fall below the 0.01 threshold, indicating limited information content. In the test set, only HasCrCard, EstimatedSalary, and Geography_Spain remain below this threshold.
Increased information content in test set for multiple features: IsActiveMember, Gender_Male, Geography_Germany, and Tenure display higher mutual information scores in the test set compared to the training set, with all exceeding the 0.01 threshold.
The mutual information analysis reveals a concentrated distribution of feature relevance, with NumOfProducts consistently providing the highest information content across both datasets. The test set demonstrates a broader set of features exceeding the information threshold, suggesting increased relevance for IsActiveMember, Gender_Male, Geography_Germany, and Tenure relative to the training set. Several features, including HasCrCard, EstimatedSalary, and Geography_Spain, consistently exhibit minimal information content, indicating limited direct association with the target variable. The observed shifts in feature importance between datasets highlight potential changes in feature-target relationships that warrant ongoing monitoring.
Parameters:
{
"min_threshold": 0.01
}
Figures
2026-01-28 18:04:31,027 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.MutualInformation:development_data does not exist in model's document
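Feature relevance scores of this kind can be approximated with scikit-learn's `mutual_info_classif`; exact values will differ from the test output depending on the estimator settings and normalization ValidMind applies internally. A sketch with the `train_df` placeholder:

```python
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

# Estimate mutual information between each feature and the target
X = train_df.drop(columns=["Exited"])
y = train_df["Exited"]
mi = pd.Series(mutual_info_classif(X, y, random_state=42), index=X.columns)
print(mi.sort_values(ascending=False).round(4))
print("Features at or above the 0.01 threshold:", list(mi[mi >= 0.01].index))
```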
The Pearson Correlation Matrix test evaluates the linear dependency between all pairs of numerical variables in the dataset, providing a matrix of Pearson correlation coefficients visualized as a heatmap. The results display the correlation structure for both the training and test datasets, with coefficients ranging from -1 to 1, and highlight any high correlations (absolute value > 0.7) for further review. The heatmaps show the magnitude and direction of relationships between variables, allowing for identification of potential redundancy or multicollinearity.
Key insights:
No high correlations detected: All pairwise correlation coefficients in both training and test datasets are below the 0.7 threshold, indicating the absence of strong linear dependencies among variables.
Consistent correlation structure across splits: The correlation patterns and magnitudes are stable between the training and test datasets, with the highest observed correlations (e.g., Balance and Geography_Germany, approximately 0.41–0.42) remaining moderate and consistent.
Low risk of multicollinearity: The lack of high-magnitude correlations suggests that the variables are largely independent, minimizing the risk of redundancy or instability in model estimation.
The correlation analysis reveals a stable and moderate correlation structure across both training and test datasets, with no evidence of strong linear relationships or multicollinearity among the variables. This indicates that the dataset maintains a high degree of variable independence, supporting reliable model estimation and interpretability.
Figures
2026-01-28 18:04:43,647 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.PearsonCorrelationMatrix:development_data does not exist in model's document
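The heatmap values can be reproduced with `DataFrame.corr()`. The sketch below lists the largest absolute pairwise correlations for the training split (placeholder `train_df`):

```python
import numpy as np

corr = train_df.corr(numeric_only=True)
mask = np.triu(np.ones(corr.shape, dtype=bool))      # drop the diagonal and duplicate pairs
top_pairs = corr.where(~mask).stack().abs().sort_values(ascending=False)
print(top_pairs.head(10).round(4))                   # highest values remain well below 0.7
```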
The High Pearson Correlation test identifies pairs of features in the dataset that exhibit strong linear relationships, with the aim of detecting potential feature redundancy or multicollinearity. The results present the top ten absolute Pearson correlation coefficients for both the training and test datasets, indicating whether each pair exceeds the predefined threshold of 0.3. Correlation coefficients, feature pairs, and pass/fail status are reported for each dataset, highlighting the most significant relationships observed.
Key insights:
Two feature pairs exceed correlation threshold: In both the training and test datasets, the pairs (Balance, Geography_Germany) and (Geography_Germany, Geography_Spain) display absolute correlation coefficients above the 0.3 threshold, with values ranging from 0.3627 to 0.421. These pairs are flagged as "Fail" by the test.
Consistent correlation structure across datasets: The same feature pairs exceed the threshold in both the training and test datasets, with similar coefficient magnitudes, indicating stable correlation patterns between these variables.
All other feature pairs below threshold: The remaining top correlations in both datasets have absolute coefficients below 0.3, resulting in a "Pass" status and indicating no further high linear relationships among other feature pairs.
The test results reveal that only two feature pairs—(Balance, Geography_Germany) and (Geography_Germany, Geography_Spain)—exhibit moderate linear correlations exceeding the specified threshold in both the training and test datasets. The consistency of these correlations across datasets suggests a stable relationship between these variables. All other feature pairs remain below the threshold, indicating limited risk of widespread multicollinearity or feature redundancy within the evaluated features.
2026-01-28 18:04:52,126 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.HighPearsonCorrelation:development_data does not exist in model's document
validmind.model_validation.ModelMetadata
Model Metadata
The ModelMetadata test compares the metadata of different models to assess consistency in architecture, framework, framework version, and programming language. The summary table presents metadata fields for each model, including modeling technique, framework, version, and programming language. Both models, log_model_champion and rf_model, are included in the comparison, with all relevant metadata fields populated.
Key insights:
Consistent modeling technique and framework: Both models use the SKlearnModel technique and the sklearn framework.
Identical framework versions: Framework Version 1.8.0 is used for both models.
Uniform programming language: Python is the programming language for both models.
The metadata comparison indicates complete alignment across all evaluated fields for the two models. No inconsistencies or missing metadata are observed, and the models share the same architecture, framework, version, and programming language. This uniformity supports ease of integration and management within the modeling environment.
Tables

| model | Modeling Technique | Modeling Framework | Framework Version | Programming Language |
|---|---|---|---|---|
| log_model_champion | SKlearnModel | sklearn | 1.8.0 | Python |
| rf_model | SKlearnModel | sklearn | 1.8.0 | Python |
2026-01-28 18:04:55,917 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.model_validation.ModelMetadata does not exist in model's document
The Model Parameters test extracts and displays all configuration parameters for each model to ensure transparency and reproducibility. The results present a structured table listing parameter names and their corresponding values for both the logistic regression model (log_model_champion) and the random forest model (rf_model). Each parameter setting is shown explicitly, providing a comprehensive view of the model configurations at the time of testing.
Key insights:
Explicit parameterization for both models: All parameters for log_model_champion and rf_model are listed, including regularization, solver, and iteration settings for the logistic regression model, and tree construction, sampling, and random state settings for the random forest model.
No missing or undefined parameters: Every parameter extracted via the scikit-learn API is present in the results, with no missing or null values observed.
Use of non-default and default values: The logistic regression model specifies penalty='l1' and solver='liblinear', while the random forest model uses n_estimators=50 and random_state=42, indicating a mix of explicitly set and default parameter values.
The extracted parameter set provides a complete and transparent record of model configuration for both the logistic regression and random forest models. All relevant parameters are captured, supporting reproducibility and facilitating future audits or validation efforts. The results indicate that both models are fully parameterized with no missing configuration elements.
Tables

| model | Parameter | Value |
|---|---|---|
| log_model_champion | C | 1 |
| log_model_champion | dual | False |
| log_model_champion | fit_intercept | True |
| log_model_champion | intercept_scaling | 1 |
| log_model_champion | max_iter | 100 |
| log_model_champion | penalty | l1 |
| log_model_champion | solver | liblinear |
| log_model_champion | tol | 0.0001 |
| log_model_champion | verbose | 0 |
| log_model_champion | warm_start | False |
| rf_model | bootstrap | True |
| rf_model | ccp_alpha | 0.0 |
| rf_model | criterion | gini |
| rf_model | max_features | sqrt |
| rf_model | min_impurity_decrease | 0.0 |
| rf_model | min_samples_leaf | 1 |
| rf_model | min_samples_split | 2 |
| rf_model | min_weight_fraction_leaf | 0.0 |
| rf_model | n_estimators | 50 |
| rf_model | oob_score | False |
| rf_model | random_state | 42 |
| rf_model | verbose | 0 |
| rf_model | warm_start | False |
2026-01-28 18:05:03,226 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.model_validation.sklearn.ModelParameters does not exist in model's document
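These values come straight from scikit-learn's `get_params()`, so the configuration can also be inspected directly on the estimators. A sketch where `log_reg` and `rf` are placeholder names for the fitted champion and challenger models:

```python
# Print the scikit-learn parameters that the ModelParameters test surfaces
for name, estimator in [("log_model_champion", log_reg), ("rf_model", rf)]:
    print(name)
    for param, value in sorted(estimator.get_params().items()):
        print(f"  {param} = {value}")
```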
validmind.model_validation.sklearn.ROCCurve
ROC Curve
The ROC Curve test evaluates the binary classification performance of the log_model_champion by plotting the Receiver Operating Characteristic (ROC) curve and calculating the Area Under the Curve (AUC) for both the training and test datasets. The resulting plots display the trade-off between the true positive rate and false positive rate at various thresholds, with the AUC quantifying the model's ability to distinguish between classes. The ROC curves and corresponding AUC values are presented for both the train_dataset_final and test_dataset_final.
Key insights:
AUC indicates moderate discriminative power on training data: The AUC for the train_dataset_final is 0.68, reflecting moderate ability to separate positive and negative classes.
Improved AUC on test data: The AUC for the test_dataset_final is 0.73, indicating stronger discriminative performance on unseen data compared to the training set.
ROC curves consistently above random baseline: Both ROC curves remain above the random classifier line (AUC = 0.5), demonstrating that the model provides meaningful class separation across thresholds.
The ROC Curve test results show that log_model_champion achieves moderate to good discrimination between classes, with AUC values of 0.68 on the training set and 0.73 on the test set. The model's performance is consistently above the random baseline, and the higher AUC on the test data suggests effective generalization to new samples. No indications of model collapse or lack of discriminative ability are observed in these results.
Figures
2026-01-28 18:05:14,617 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.model_validation.sklearn.ROCCurve does not exist in model's document
The Minimum ROC AUC Score test evaluates whether the model's ROC AUC score meets or exceeds a specified minimum threshold, providing an assessment of the model's ability to distinguish between classes. The results table presents ROC AUC scores for both the training and test datasets, alongside the applied threshold and the corresponding pass/fail outcome. Both datasets are evaluated against a minimum threshold of 0.5, with the observed scores and test outcomes reported for each.
Key insights:
ROC AUC scores exceed threshold on all datasets: Both the training (0.6766) and test (0.7271) datasets register ROC AUC scores above the minimum threshold of 0.5.
Test dataset shows higher discriminative performance: The ROC AUC score on the test dataset (0.7271) is higher than on the training dataset (0.6766), indicating improved class separation on unseen data.
Test passed for all evaluated datasets: The pass/fail status is "Pass" for both datasets, confirming that the model meets the minimum ROC AUC requirement across splits.
The results indicate that the model demonstrates adequate discriminative capability on both training and test datasets, with ROC AUC scores consistently above the defined threshold. The higher score on the test dataset suggests stable or improved generalization performance, and the model satisfies the minimum ROC AUC criterion across all evaluated data partitions.
Parameters:
{
"min_threshold": 0.5
}
Tables

| dataset | Score | Threshold | Pass/Fail |
|---|---|---|---|
| train_dataset_final | 0.6766 | 0.5 | Pass |
| test_dataset_final | 0.7271 | 0.5 | Pass |
2026-01-28 18:05:20,533 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.model_validation.sklearn.MinimumROCAUCScore does not exist in model's document
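The reported AUCs can be cross-checked with scikit-learn, as a sketch where `log_reg`, `X_train`, `y_train`, `X_test`, and `y_test` are placeholder names for the champion estimator and the final split features and targets:

```python
from sklearn.metrics import roc_auc_score

# Recompute ROC AUC on both splits and apply the same 0.5 minimum threshold
for name, X, y in [("train_dataset_final", X_train, y_train), ("test_dataset_final", X_test, y_test)]:
    auc = roc_auc_score(y, log_reg.predict_proba(X)[:, 1])
    print(f"{name}: ROC AUC = {auc:.4f} -> {'Pass' if auc >= 0.5 else 'Fail'}")
```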
In summary
In this final notebook and across our ValidMind for model validation series, you learned how to validate a model end-to-end with the ValidMind Library by running through some common scenarios in a typical model validation setting:
Verifying the data quality steps performed by the model development team
Independently replicating the champion model's results and conducting additional tests to assess performance, stability, and robustness
Setting up test inputs and a challenger model for comparative analysis
Running validation tests, analyzing results, and logging artifacts to ValidMind
Next steps
Work with your validation report
Now that you've logged all your test results and verified the work done by the model development team, head to the ValidMind Platform to wrap up your validation report. Continue to work on your validation report by:
Inserting additional test results: Click Link Evidence to Report under any section of 2. Validation in your validation report. (Learn more: Link evidence to reports)
Making qualitative edits to your test descriptions: Expand any linked evidence under Validator Evidence and click See evidence details to review and edit the ValidMind-generated test descriptions for quality and accuracy. (Learn more: Preparing validation reports)
Adding more findings: Click Link Finding to Report in any validation report section, then click + Create New Finding. (Learn more: Add and manage model findings)
Adding risk assessment notes: Click under Risk Assessment Notes in any validation report section to access the text editor and content editing toolbar, including an option to generate a draft with AI. Once generated, edit your ValidMind-generated test descriptions to adhere to your organization's requirements. (Learn more: Work with content blocks)
Assessing compliance: Under the Guideline for any validation report section, click ASSESSMENT and select the compliance status from the drop-down menu. (Learn more: Provide compliance assessments)
Collaborating with other stakeholders: Use the ValidMind Platform's real-time collaborative features to work seamlessly with the rest of your organization, including model developers. Propose suggested changes in the model documentation, work with versioned history, and use comments to discuss specific portions of the model documentation. (Learn more: Collaborate with others)
When your validation report is complete and ready for review, submit it for approval from the same ValidMind Platform where you made your edits and collaborated with the rest of your organization, ensuring transparency and a thorough model validation history. (Learn more: Submit for approval)
Learn more
Now that you're familiar with the basics, you can explore the following notebooks to get a deeper understanding of how the ValidMind Library assists you in streamlining model validation: