ValidMind for model validation 4 — Finalize testing and reporting

Learn how to use ValidMind for your end-to-end model validation process with our series of four introductory notebooks. In this last notebook, finalize the compliance assessment process and have a complete validation report ready for review.

This notebook will walk you through how to supplement ValidMind tests with your own custom tests and include them as additional evidence in your validation report. A custom test is any function that takes a set of inputs and parameters as arguments and returns one or more outputs:
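
For example, a minimal sketch of such a function might look like this (the test ID my_custom_tests.PositiveRate and the function itself are made up purely for illustration and aren't part of this notebook's workflow):

import pandas as pd
import validmind as vm

@vm.test("my_custom_tests.PositiveRate")  # hypothetical test ID, shown only as an example
def positive_rate(dataset, threshold: float = 0.5):
    """Reports the share of positive labels in the dataset's target column."""
    rate = float((dataset.y == 1).mean())  # dataset is a ValidMind VMDataset; .y holds the true target values
    # Custom tests return one or more tables and/or plots; here we return a one-row table
    return pd.DataFrame({"Positive rate": [rate], "Passes threshold": [rate >= threshold]})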

For a more in-depth introduction to custom tests, refer to our Implement custom tests notebook.

Learn by doing

Our course tailor-made for validators new to ValidMind combines this series of notebooks with a more in-depth introduction to the ValidMind Platform — Validator Fundamentals

Prerequisites

In order to finalize validation and reporting, you'll need to first have completed the previous three notebooks in this series.

Setting up

This section should be very familiar to you now — as we performed the same actions in the previous two notebooks in this series.

Initialize the ValidMind Library

As usual, let's first connect the ValidMind Library to the model we previously registered in the ValidMind Platform:

  1. In a browser, log in to ValidMind.

  2. In the left sidebar, navigate to Inventory and select the model you registered for this "ValidMind for model validation" series of notebooks.

  3. Go to Getting Started and click Copy snippet to clipboard.

Next, load your model identifier credentials from an .env file or replace the placeholder with your own code snippet:

# Make sure the ValidMind Library is installed

%pip install -q validmind

# Load your model identifier credentials from an `.env` file

%load_ext dotenv
%dotenv .env

# Or replace with your code snippet

import validmind as vm

vm.init(
    # api_host="...",
    # api_key="...",
    # api_secret="...",
    # model="...",
)
Note: you may need to restart the kernel to use updated packages.
2026-01-28 17:58:32,957 - INFO(validmind.api_client): 🎉 Connected to ValidMind!
📊 Model: [ValidMind Academy] Model validation (ID: cmalguc9y02ok199q2db381ib)
📁 Document Type: validation_report

Import the sample dataset

Next, we'll load the same sample Bank Customer Churn Prediction dataset that was used to develop the champion model, which we will then independently preprocess:

# Load the sample dataset
from validmind.datasets.classification import customer_churn as demo_dataset

print(
    f"Loaded demo dataset with: \n\n\t• Target column: '{demo_dataset.target_column}' \n\t• Class labels: {demo_dataset.class_labels}"
)

raw_df = demo_dataset.load_data()
Loaded demo dataset with: 

    • Target column: 'Exited' 
    • Class labels: {'0': 'Did not exit', '1': 'Exited'}
# Initialize the raw dataset for use in ValidMind tests
vm_raw_dataset = vm.init_dataset(
    dataset=raw_df,
    input_id="raw_dataset",
    target_column="Exited",
)
import pandas as pd

raw_copy_df = raw_df.sample(frac=1)  # Create a shuffled copy of the raw dataset

# Create a balanced dataset with the same number of exited and not exited customers
exited_df = raw_copy_df.loc[raw_copy_df["Exited"] == 1]
not_exited_df = raw_copy_df.loc[raw_copy_df["Exited"] == 0].sample(n=exited_df.shape[0])

balanced_raw_df = pd.concat([exited_df, not_exited_df])
balanced_raw_df = balanced_raw_df.sample(frac=1, random_state=42)

Let’s also quickly remove highly correlated features from the dataset using the output from a ValidMind test:

# Register the balanced dataset; 'balanced_raw_dataset' is now the dataset object of interest
vm_balanced_raw_dataset = vm.init_dataset(
    dataset=balanced_raw_df,
    input_id="balanced_raw_dataset",
    target_column="Exited",
)
# Run HighPearsonCorrelation test with our balanced dataset as input and return a result object
corr_result = vm.tests.run_test(
    test_id="validmind.data_validation.HighPearsonCorrelation",
    params={"max_threshold": 0.3},
    inputs={"dataset": vm_balanced_raw_dataset},
)

❌ High Pearson Correlation

The High Pearson Correlation test identifies pairs of features in the dataset that exhibit strong linear relationships, with the aim of detecting potential feature redundancy or multicollinearity. The results table presents the top ten feature pairs ranked by the absolute value of their Pearson correlation coefficients, along with a Pass or Fail status based on a threshold of 0.3. Only one feature pair exceeds the threshold, while the remaining pairs display lower correlation values and pass the test.

Key insights:

  • One feature pair exceeds correlation threshold: The pair (Age, Exited) has a correlation coefficient of 0.3467, surpassing the 0.3 threshold and resulting in a Fail status.
  • All other feature pairs below threshold: The remaining nine feature pairs have absolute correlation coefficients ranging from 0.207 to 0.0367, all below the 0.3 threshold and marked as Pass.
  • Predominantly weak linear relationships: Most feature pairs exhibit weak linear associations, with coefficients clustered near zero.

The test results indicate that the dataset contains minimal evidence of strong linear relationships among most feature pairs, with only the (Age, Exited) pair exceeding the specified correlation threshold. The overall correlation structure suggests a low risk of widespread multicollinearity or feature redundancy, as the majority of feature pairs demonstrate weak linear dependencies.

Parameters:

{
  "max_threshold": 0.3
}
            

Tables

Columns Coefficient Pass/Fail
(Age, Exited) 0.3467 Fail
(IsActiveMember, Exited) -0.2070 Pass
(Balance, NumOfProducts) -0.1793 Pass
(Balance, Exited) 0.1542 Pass
(NumOfProducts, Exited) -0.0577 Pass
(Age, Balance) 0.0550 Pass
(Tenure, EstimatedSalary) 0.0531 Pass
(NumOfProducts, IsActiveMember) 0.0485 Pass
(Age, NumOfProducts) -0.0394 Pass
(Tenure, Balance) -0.0367 Pass
# From result object, extract table from `corr_result.tables`
features_df = corr_result.tables[0].data
features_df
Columns Coefficient Pass/Fail
0 (Age, Exited) 0.3467 Fail
1 (IsActiveMember, Exited) -0.2070 Pass
2 (Balance, NumOfProducts) -0.1793 Pass
3 (Balance, Exited) 0.1542 Pass
4 (NumOfProducts, Exited) -0.0577 Pass
5 (Age, Balance) 0.0550 Pass
6 (Tenure, EstimatedSalary) 0.0531 Pass
7 (NumOfProducts, IsActiveMember) 0.0485 Pass
8 (Age, NumOfProducts) -0.0394 Pass
9 (Tenure, Balance) -0.0367 Pass
# Extract list of features that failed the test
high_correlation_features = features_df[features_df["Pass/Fail"] == "Fail"]["Columns"].tolist()
high_correlation_features
['(Age, Exited)']
# Extract feature names from the list of strings
high_correlation_features = [feature.split(",")[0].strip("()") for feature in high_correlation_features]
high_correlation_features
['Age']
# Remove the highly correlated features from the dataset
balanced_raw_no_age_df = balanced_raw_df.drop(columns=high_correlation_features)

# Re-initialize the dataset object
vm_raw_dataset_preprocessed = vm.init_dataset(
    dataset=balanced_raw_no_age_df,
    input_id="raw_dataset_preprocessed",
    target_column="Exited",
)
# Re-run the test with the reduced feature set
corr_result = vm.tests.run_test(
    test_id="validmind.data_validation.HighPearsonCorrelation",
    params={"max_threshold": 0.3},
    inputs={"dataset": vm_raw_dataset_preprocessed},
)

✅ High Pearson Correlation

The High Pearson Correlation test evaluates the linear relationships between feature pairs to identify potential redundancy or multicollinearity. The results table presents the top ten absolute Pearson correlation coefficients among feature pairs, each accompanied by a Pass/Fail status based on a threshold of 0.3. All reported coefficients are below the threshold, and all feature pairs received a Pass status.

Key insights:

  • No high correlations detected: All absolute Pearson correlation coefficients are below the 0.3 threshold, with the highest magnitude observed at 0.207 between IsActiveMember and Exited.
  • Consistent Pass status across all pairs: Every feature pair in the top ten correlations received a Pass, indicating no evidence of strong linear relationships among the evaluated features.
  • Low to moderate relationships observed: The reported coefficients range from -0.207 to 0.1542, reflecting only weak to very weak linear associations between the examined feature pairs.

The test results indicate an absence of strong linear dependencies among the evaluated features, with all pairwise correlations falling well below the specified threshold. This suggests a low risk of feature redundancy or multicollinearity within the dataset based on linear relationships, supporting the interpretability and stability of subsequent modeling efforts.

Parameters:

{
  "max_threshold": 0.3
}
            

Tables

Columns Coefficient Pass/Fail
(IsActiveMember, Exited) -0.2070 Pass
(Balance, NumOfProducts) -0.1793 Pass
(Balance, Exited) 0.1542 Pass
(NumOfProducts, Exited) -0.0577 Pass
(Tenure, EstimatedSalary) 0.0531 Pass
(NumOfProducts, IsActiveMember) 0.0485 Pass
(Tenure, Balance) -0.0367 Pass
(Tenure, IsActiveMember) -0.0360 Pass
(Tenure, Exited) -0.0265 Pass
(HasCrCard, IsActiveMember) -0.0260 Pass

Split the preprocessed dataset

With our raw dataset rebalanced and highly correlated features removed, let's now split our dataset into train and test sets in preparation for model evaluation testing:

# Encode categorical features in the dataset
balanced_raw_no_age_df = pd.get_dummies(
    balanced_raw_no_age_df, columns=["Geography", "Gender"], drop_first=True
)
balanced_raw_no_age_df.head()
CreditScore Tenure Balance NumOfProducts HasCrCard IsActiveMember EstimatedSalary Exited Geography_Germany Geography_Spain Gender_Male
5613 558 7 121235.05 2 1 1 116253.10 0 False False False
4434 432 2 135559.80 2 1 1 71856.30 0 True False True
6484 769 6 117852.26 2 1 0 147668.64 0 False False False
5796 762 10 168920.75 1 1 0 31445.03 1 False True False
1668 460 7 0.00 2 1 0 156150.08 1 False False False
from sklearn.model_selection import train_test_split

# Split the dataset into train and test
train_df, test_df = train_test_split(balanced_raw_no_age_df, test_size=0.20)

X_train = train_df.drop("Exited", axis=1)
y_train = train_df["Exited"]
X_test = test_df.drop("Exited", axis=1)
y_test = test_df["Exited"]
# Initialize the split datasets
vm_train_ds = vm.init_dataset(
    input_id="train_dataset_final",
    dataset=train_df,
    target_column="Exited",
)

vm_test_ds = vm.init_dataset(
    input_id="test_dataset_final",
    dataset=test_df,
    target_column="Exited",
)

Import the champion model

With our raw dataset assessed and preprocessed, let's go ahead and import the champion model submitted by the model development team as a .pkl file: lr_model_champion.pkl

# Import the champion model
import pickle as pkl

with open("lr_model_champion.pkl", "rb") as f:
    log_reg = pkl.load(f)
/opt/hostedtoolcache/Python/3.11.14/x64/lib/python3.11/site-packages/sklearn/base.py:463: InconsistentVersionWarning:

Trying to unpickle estimator LogisticRegression from version 1.3.2 when using version 1.8.0. This might lead to breaking code or invalid results. Use at your own risk. For more info please refer to:
https://scikit-learn.org/stable/model_persistence.html#security-maintainability-limitations

Train potential challenger model

We'll also train our random forest classification challenger model to see how it compares:

# Import the Random Forest Classification model
from sklearn.ensemble import RandomForestClassifier

# Create the model instance with 50 decision trees
rf_model = RandomForestClassifier(
    n_estimators=50,
    random_state=42,
)

# Train the model
rf_model.fit(X_train, y_train)
RandomForestClassifier(n_estimators=50, random_state=42)

Initialize the model objects

In addition to the initialized datasets, you'll also need to initialize a ValidMind model object (vm_model) for each of our two models, which can then be passed to other functions for analysis and tests on the data:

# Initialize the champion logistic regression model
vm_log_model = vm.init_model(
    log_reg,
    input_id="log_model_champion",
)

# Initialize the challenger random forest classification model
vm_rf_model = vm.init_model(
    rf_model,
    input_id="rf_model",
)
# Assign predictions to Champion — Logistic regression model
vm_train_ds.assign_predictions(model=vm_log_model)
vm_test_ds.assign_predictions(model=vm_log_model)

# Assign predictions to Challenger — Random forest classification model
vm_train_ds.assign_predictions(model=vm_rf_model)
vm_test_ds.assign_predictions(model=vm_rf_model)
2026-01-28 17:58:51,388 - INFO(validmind.vm_models.dataset.utils): Running predict_proba()... This may take a while
2026-01-28 17:58:51,390 - INFO(validmind.vm_models.dataset.utils): Done running predict_proba()
2026-01-28 17:58:51,390 - INFO(validmind.vm_models.dataset.utils): Running predict()... This may take a while
2026-01-28 17:58:51,392 - INFO(validmind.vm_models.dataset.utils): Done running predict()
2026-01-28 17:58:51,394 - INFO(validmind.vm_models.dataset.utils): Running predict_proba()... This may take a while
2026-01-28 17:58:51,395 - INFO(validmind.vm_models.dataset.utils): Done running predict_proba()
2026-01-28 17:58:51,396 - INFO(validmind.vm_models.dataset.utils): Running predict()... This may take a while
2026-01-28 17:58:51,397 - INFO(validmind.vm_models.dataset.utils): Done running predict()
2026-01-28 17:58:51,399 - INFO(validmind.vm_models.dataset.utils): Running predict_proba()... This may take a while
2026-01-28 17:58:51,419 - INFO(validmind.vm_models.dataset.utils): Done running predict_proba()
2026-01-28 17:58:51,420 - INFO(validmind.vm_models.dataset.utils): Running predict()... This may take a while
2026-01-28 17:58:51,442 - INFO(validmind.vm_models.dataset.utils): Done running predict()
2026-01-28 17:58:51,445 - INFO(validmind.vm_models.dataset.utils): Running predict_proba()... This may take a while
2026-01-28 17:58:51,457 - INFO(validmind.vm_models.dataset.utils): Done running predict_proba()
2026-01-28 17:58:51,458 - INFO(validmind.vm_models.dataset.utils): Running predict()... This may take a while
2026-01-28 17:58:51,469 - INFO(validmind.vm_models.dataset.utils): Done running predict()

Implementing custom tests

Thanks to the model documentation, we know that the model development team implemented a custom test to further evaluate the performance of the champion model.

In a usual model validation situation, you would load a saved custom test provided by the model development team. In the following section, we'll have you implement the same custom test and make it available for reuse, to familiarize you with the processes.

Want to learn more about custom tests?

Refer to our in-depth introduction to custom tests: Implement custom tests

Implement a custom inline test

Let's implement the same custom inline test that the model development team used in their performance evaluations: a test that calculates the confusion matrix for a binary classification model.

  • An inline test refers to a test written and executed within the same environment as the code being tested — in this case, right in this Jupyter Notebook — without requiring a separate test file or framework.
  • You'll note that the custom test function is just a regular Python function that can include and require any Python library as you see fit.

Create a confusion matrix plot

Let's first create a confusion matrix plot using the confusion_matrix function from the sklearn.metrics module:

import matplotlib.pyplot as plt
from sklearn import metrics

# Get the predicted classes
y_pred = log_reg.predict(vm_test_ds.x)

confusion_matrix = metrics.confusion_matrix(y_test, y_pred)

cm_display = metrics.ConfusionMatrixDisplay(
    confusion_matrix=confusion_matrix, display_labels=[False, True]
)
cm_display.plot()

Next, create a @vm.test wrapper that will allow you to create a reusable test. Note the following changes in the code below:

  • The function confusion_matrix takes two arguments, dataset and model. These are a VMDataset and a VMModel object, respectively.
    • VMDataset objects allow you to access the dataset's true (target) values through the .y attribute.
    • VMDataset objects allow you to access the predictions for a given model by calling the .y_pred() method.
  • The function docstring provides a description of what the test does. This will be displayed along with the result in this notebook as well as in the ValidMind Platform.
  • The function body calculates the confusion matrix using the sklearn.metrics.confusion_matrix function as we just did above.
  • The function then returns the ConfusionMatrixDisplay.figure_ object — this is important as the ValidMind Library expects the output of the custom test to be a plot or a table.
  • The @vm.test decorator is doing the work of creating a wrapper around the function that will allow it to be run by the ValidMind Library. It also registers the test so it can be found by the ID my_custom_tests.ConfusionMatrix.
@vm.test("my_custom_tests.ConfusionMatrix")
def confusion_matrix(dataset, model):
    """The confusion matrix is a table that is often used to describe the performance of a classification model on a set of data for which the true values are known.

    The confusion matrix is a 2x2 table that contains 4 values:

    - True Positive (TP): the number of correct positive predictions
    - True Negative (TN): the number of correct negative predictions
    - False Positive (FP): the number of incorrect positive predictions
    - False Negative (FN): the number of incorrect negative predictions

    The confusion matrix can be used to assess the holistic performance of a classification model by showing the accuracy, precision, recall, and F1 score of the model on a single figure.
    """
    y_true = dataset.y
    y_pred = dataset.y_pred(model=model)

    confusion_matrix = metrics.confusion_matrix(y_true, y_pred)

    cm_display = metrics.ConfusionMatrixDisplay(
        confusion_matrix=confusion_matrix, display_labels=[False, True]
    )
    cm_display.plot()

    plt.close()  # close the plot to avoid displaying it

    return cm_display.figure_  # return the figure object itself

You can now run the newly created custom test on both the training and test datasets for both models using the run_test() function:

# Champion train and test
vm.tests.run_test(
    test_id="my_custom_tests.ConfusionMatrix:champion",
    input_grid={
        "dataset": [vm_train_ds,vm_test_ds],
        "model" : [vm_log_model]
    }
).log()

Confusion Matrix Champion

The Confusion Matrix test evaluates the classification performance of the model by comparing predicted and true labels for both the training and test datasets. The resulting matrices display the counts of true positives, true negatives, false positives, and false negatives, providing a comprehensive view of the model's prediction accuracy and error types. The first matrix corresponds to the training dataset, while the second matrix summarizes results for the test dataset.

Key insights:

  • Balanced true positive and true negative rates: Both training and test datasets show similar counts for true positives and true negatives (train: 794 TP, 835 TN; test: 215 TP, 215 TN), indicating consistent model performance across classes.
  • Moderate false positive and false negative rates: The number of false positives and false negatives is comparable within each dataset (train: 488 FP, 488 FN; test: 98 FP, 119 FN), suggesting balanced misclassification rates.
  • Consistent performance across datasets: The relative proportions of each confusion matrix cell are similar between training and test datasets, indicating stable generalization from training to test data.

The confusion matrix results demonstrate that the model maintains balanced classification performance across both training and test datasets, with similar rates of correct and incorrect predictions for each class. The observed symmetry in true and false prediction counts suggests that the model does not exhibit a strong bias toward either class and generalizes consistently between datasets.

Figures

ValidMind Figure my_custom_tests.ConfusionMatrix:champion:f7ef
ValidMind Figure my_custom_tests.ConfusionMatrix:champion:87f3
2026-01-28 17:59:08,777 - INFO(validmind.vm_models.result.result): Test driven block with result_id my_custom_tests.ConfusionMatrix:champion does not exist in model's document
# Challenger train and test
vm.tests.run_test(
    test_id="my_custom_tests.ConfusionMatrix:challenger",
    input_grid={
        "dataset": [vm_train_ds,vm_test_ds],
        "model" : [vm_rf_model]
    }
).log()

Confusion Matrix Challenger

The Confusion Matrix:challenger test evaluates the classification performance of the model by comparing predicted and true labels for both the training and test datasets. The confusion matrices display the counts of true positives, true negatives, false positives, and false negatives, providing a comprehensive view of the model's ability to correctly classify each class. The results are presented separately for the train and test datasets, allowing for assessment of both in-sample and out-of-sample performance.

Key insights:

  • Perfect classification on training data: The training dataset confusion matrix shows 1,303 true negatives and 1,282 true positives, with zero false positives and zero false negatives, indicating no misclassifications on the training set.
  • Reduced accuracy on test data: The test dataset confusion matrix shows 227 true negatives and 229 true positives, with 86 false positives and 105 false negatives, indicating the presence of both types of misclassification in out-of-sample predictions.
  • Balanced class distribution in test set: The test set contains a similar number of true positives (229) and true negatives (227), suggesting a balanced representation of both classes in the evaluation.

The confusion matrix results indicate that the model achieves perfect separation of classes on the training data, with no observed misclassifications. However, performance on the test data shows a reduction in accuracy, with both false positives and false negatives present. The balanced class distribution in the test set supports a reliable assessment of model generalization. The observed discrepancy between training and test performance highlights the importance of evaluating out-of-sample results to understand model behavior under realistic conditions.

Figures

ValidMind Figure my_custom_tests.ConfusionMatrix:challenger:030a
ValidMind Figure my_custom_tests.ConfusionMatrix:challenger:72ad
2026-01-28 17:59:24,095 - INFO(validmind.vm_models.result.result): Test driven block with result_id my_custom_tests.ConfusionMatrix:challenger does not exist in model's document
Note the output returned indicating that a test-driven block doesn't currently exist in your model's documentation for some test IDs.

That's expected: when we run validation tests, the logged results need to be manually added to your report as part of your compliance assessment process within the ValidMind Platform.

Add parameters to custom tests

Custom tests can take parameters just like any other function. To demonstrate, let's modify the confusion_matrix function to take an additional parameter normalize that will allow you to normalize the confusion matrix:

@vm.test("my_custom_tests.ConfusionMatrix")
def confusion_matrix(dataset, model, normalize=False):
    """The confusion matrix is a table that is often used to describe the performance of a classification model on a set of data for which the true values are known.

    The confusion matrix is a 2x2 table that contains 4 values:

    - True Positive (TP): the number of correct positive predictions
    - True Negative (TN): the number of correct negative predictions
    - False Positive (FP): the number of incorrect positive predictions
    - False Negative (FN): the number of incorrect negative predictions

    The confusion matrix can be used to assess the holistic performance of a classification model by showing the accuracy, precision, recall, and F1 score of the model on a single figure.
    """
    y_true = dataset.y
    y_pred = dataset.y_pred(model=model)

    if normalize:
        confusion_matrix = metrics.confusion_matrix(y_true, y_pred, normalize="all")
    else:
        confusion_matrix = metrics.confusion_matrix(y_true, y_pred)

    cm_display = metrics.ConfusionMatrixDisplay(
        confusion_matrix=confusion_matrix, display_labels=[False, True]
    )
    cm_display.plot()

    plt.close()  # close the plot to avoid displaying it

    return cm_display.figure_  # return the figure object itself

Pass parameters to custom tests

You can pass parameters to custom tests by providing a dictionary of parameters to the run_test() function.

  • The parameters will override any default parameters set in the custom test definition. Note that dataset and model are still passed as inputs.
  • Since these are VMDataset or VMModel inputs, they are handled separately from params and continue to be passed through inputs (or input_grid).

Re-running and logging the custom confusion matrix with normalize=True for both models and our testing dataset looks like this:

# Champion with test dataset and normalize=True
vm.tests.run_test(
    test_id="my_custom_tests.ConfusionMatrix:test_normalized_champion",
    input_grid={
        "dataset": [vm_test_ds],
        "model" : [vm_log_model]
    },
    params={"normalize": True}
).log()

Confusion Matrix Test Normalized Champion

The Confusion Matrix test evaluates the classification performance of the log_model_champion on the test_dataset_final by displaying the normalized proportions of true positives, true negatives, false positives, and false negatives. The matrix presents the fraction of predictions in each category, with values normalized such that each cell represents the proportion of total predictions. The results show the distribution of correct and incorrect predictions for both positive and negative classes, providing a comprehensive view of model accuracy and error types.

Key insights:

  • Balanced correct classification rates: The model correctly classifies both negative (True Negative: 0.33) and positive (True Positive: 0.33) classes at equal rates, indicating symmetric performance across classes.
  • Moderate false negative and false positive rates: The false negative rate (0.18) and false positive rate (0.15) are of similar magnitude, suggesting that misclassification is distributed relatively evenly between the two error types.
  • No class dominates prediction errors: The normalized values indicate that neither class is disproportionately affected by misclassification, with all matrix entries within the range of 0.15 to 0.33.

The confusion matrix reveals that the model exhibits balanced performance across both classes, with correct and incorrect predictions distributed symmetrically. The rates of true positives and true negatives are equal, and the error rates for false positives and false negatives are comparable, indicating no significant bias toward either class in prediction outcomes.

Parameters:

{
  "normalize": true
}
            

Figures

ValidMind Figure my_custom_tests.ConfusionMatrix:test_normalized_champion:469c
2026-01-28 17:59:35,649 - INFO(validmind.vm_models.result.result): Test driven block with result_id my_custom_tests.ConfusionMatrix:test_normalized_champion does not exist in model's document
# Challenger with test dataset and normalize=True
vm.tests.run_test(
    test_id="my_custom_tests.ConfusionMatrix:test_normalized_challenger",
    input_grid={
        "dataset": [vm_test_ds],
        "model" : [vm_rf_model]
    },
    params={"normalize": True}
).log()

Confusion Matrix Test Normalized Challenger

The Confusion Matrix test evaluates the classification performance of the rf_model on the test_dataset_final by displaying the normalized proportions of true positives, true negatives, false positives, and false negatives. The matrix presents the fraction of predictions in each category, with values normalized such that the sum of all cells equals 1. The results provide a visual summary of the model’s ability to correctly and incorrectly classify both positive and negative cases.

Key insights:

  • Balanced correct classification rates: The model correctly classifies both negative (True Negative: 0.35) and positive (True Positive: 0.35) cases at equal rates, indicating symmetric performance across classes.
  • Moderate false negative and false positive rates: The proportion of false negatives (0.16) and false positives (0.13) are similar in magnitude, suggesting that misclassification is distributed relatively evenly between the two error types.
  • No class dominance observed: The normalized confusion matrix does not indicate a strong bias toward either class, as the correct and incorrect prediction rates are comparable for both positive and negative labels.

The confusion matrix reveals that the rf_model demonstrates balanced classification performance, with equal rates of correct predictions for both classes and similar proportions of false positives and false negatives. This indicates that the model does not favor one class over the other and maintains consistent error rates across both positive and negative predictions.

Parameters:

{
  "normalize": true
}
            

Figures

ValidMind Figure my_custom_tests.ConfusionMatrix:test_normalized_challenger:06ad
2026-01-28 17:59:49,481 - INFO(validmind.vm_models.result.result): Test driven block with result_id my_custom_tests.ConfusionMatrix:test_normalized_challenger does not exist in model's document

Use external test providers

Sometimes you may want to reuse the same set of custom tests across multiple models and share them with others in your organization, as the model development team would have done with you in the example workflow featured in this series of notebooks. In this case, you can create an external custom test provider that will allow you to load custom tests from a local folder or a Git repository.

In this section, you will learn how to declare a local filesystem test provider that allows loading tests from a local folder by following these high-level steps:

  1. Create a folder of custom tests from existing inline tests (tests that exist in your active Jupyter Notebook)
  2. Save an inline test to a file
  3. Define and register a LocalTestProvider that points to that folder
  4. Run test provider tests
  5. Add the test results to your documentation

Create custom tests folder

Let's start by creating a new folder that will contain reusable custom tests from your existing inline tests.

The following code snippet will create a new my_tests directory in the current working directory if it doesn't exist:

tests_folder = "my_tests"

import os

# create tests folder
os.makedirs(tests_folder, exist_ok=True)

# remove existing tests
for f in os.listdir(tests_folder):
    # remove files and pycache
    if f.endswith(".py") or f == "__pycache__":
        os.system(f"rm -rf {tests_folder}/{f}")

After running the command above, confirm that a new my_tests directory was created successfully. For example:

~/notebooks/tutorials/model_validation/my_tests/

Save an inline test

The @vm.test decorator we used in Implement a custom inline test above to register one-off custom tests also includes a convenience method on the function object that allows you to simply call <func_name>.save() to save the test to a Python file at a specified path.

While save() will get you started by creating the file and saving the function code with the correct name, it won't automatically include any imports, or any other functions or variables defined outside of the function, that are needed for the test to run. To solve this, pass in an optional imports argument to ensure the necessary imports are added to the file.

The confusion_matrix test requires the following additional imports:

import matplotlib.pyplot as plt
from sklearn import metrics

Let's pass these imports to the save() method to ensure they are included in the file with the following command:

confusion_matrix.save(
    # Save it to the custom tests folder we created
    tests_folder,
    imports=["import matplotlib.pyplot as plt", "from sklearn import metrics"],
)
2026-01-28 17:59:50,042 - INFO(validmind.tests.decorator): Saved to /home/runner/work/documentation/documentation/site/notebooks/EXECUTED/model_validation/my_tests/ConfusionMatrix.py!Be sure to add any necessary imports to the top of the file.
2026-01-28 17:59:50,043 - INFO(validmind.tests.decorator): This metric can be run with the ID: <test_provider_namespace>.ConfusionMatrix
  • # Saved from __main__.confusion_matrix
    # Original Test ID: my_custom_tests.ConfusionMatrix
    # New Test ID: <test_provider_namespace>.ConfusionMatrix
  • def ConfusionMatrix(dataset, model, normalize=False):

Register a local test provider

Now that your my_tests folder has a sample custom test, let's initialize a test provider that will tell the ValidMind Library where to find your custom tests:

  • ValidMind offers out-of-the-box test providers for local tests (tests in a folder) or a GitHub provider for tests in a GitHub repository.
  • You can also create your own test provider by creating a class that has a load_test method that takes a test ID and returns the test function matching that ID.
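
As a rough sketch, any class exposing that load_test method can act as a provider. The example below is hypothetical (the class name and package layout are made up) and only illustrates the contract; the LocalTestProvider used next covers the common case for you:

import importlib

class MyPackageTestProvider:
    """Hypothetical provider that loads tests from an importable Python package."""

    def __init__(self, package: str):
        self.package = package  # e.g. "my_org_tests" (assumed to exist on the Python path)

    def load_test(self, test_id: str):
        # For an ID like "ConfusionMatrix", import my_org_tests.ConfusionMatrix and
        # return the function of the same name defined in that module
        module = importlib.import_module(f"{self.package}.{test_id}")
        return getattr(module, test_id.rsplit(".", 1)[-1])
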
Want to learn more about test providers?

An extended introduction to test providers can be found in: Integrate external test providers

Initialize a local test provider

For most use cases, using a LocalTestProvider that allows you to load custom tests from a designated directory should be sufficient.

The most important attribute for a test provider is its namespace. This is a string that will be used to prefix test IDs in model documentation. This allows you to have multiple test providers with tests that can even share the same ID, but are distinguished by their namespace.

Let's go ahead and load the custom tests from our my_tests directory:

from validmind.tests import LocalTestProvider

# initialize the test provider with the tests folder we created earlier
my_test_provider = LocalTestProvider(tests_folder)

vm.tests.register_test_provider(
    namespace="my_test_provider",
    test_provider=my_test_provider,
)
# `my_test_provider.load_test()` will be called for any test ID that starts with `my_test_provider`
# e.g. `my_test_provider.ConfusionMatrix` will look for a function named `ConfusionMatrix` in `my_tests/ConfusionMatrix.py` file

Run test provider tests

Now that we've set up the test provider, we can run any test that's located in the tests folder by using the run_test() method as with any other test:

  • For tests that reside in a test provider directory, the test ID will be the namespace specified when registering the provider, followed by the path to the test file relative to the tests folder.
  • For example, the Confusion Matrix test we created earlier will have the test ID my_test_provider.ConfusionMatrix. You could organize the tests in subfolders, say classification and regression, and the test ID for the Confusion Matrix test would then be my_test_provider.classification.ConfusionMatrix.
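
Purely as an illustration (the my_tests folder in this notebook has no subfolders), running a test from such a hypothetical layout would look like this:

# Hypothetical: only works if my_tests/classification/ConfusionMatrix.py exists
vm.tests.run_test(
    test_id="my_test_provider.classification.ConfusionMatrix",
    inputs={"dataset": vm_test_ds, "model": vm_log_model},
)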

Let's go ahead and re-run the confusion matrix test with our testing dataset for our two models by using the test ID my_test_provider.ConfusionMatrix. This should load the test from the test provider and run it as before.

# Champion with test dataset and test provider custom test
vm.tests.run_test(
    test_id="my_test_provider.ConfusionMatrix:champion",
    input_grid={
        "dataset": [vm_test_ds],
        "model" : [vm_log_model]
    }
).log()

Confusion Matrix Champion

The Confusion Matrix test evaluates the classification performance of the log_model_champion on the test_dataset_final by comparing predicted and true class labels. The resulting 2x2 matrix displays the counts of true positives, true negatives, false positives, and false negatives, providing a comprehensive view of the model's prediction accuracy and error distribution. The matrix quantifies the model's ability to correctly identify both positive and negative cases, as well as the frequency of misclassifications.

Key insights:

  • Balanced true positive and true negative counts: The model correctly classified 215 positive cases (true positives) and 215 negative cases (true negatives), indicating symmetric performance across both classes.
  • Notable false negative and false positive rates: There are 119 false negatives and 98 false positives, reflecting a moderate level of misclassification in both directions.
  • Comparable error distribution across classes: The number of false negatives and false positives are of similar magnitude, suggesting that the model does not exhibit a strong bias toward over- or under-predicting either class.

The confusion matrix reveals that the log_model_champion demonstrates balanced classification performance, with equal accuracy in identifying both positive and negative cases. Misclassification rates are moderate and distributed similarly between false positives and false negatives, indicating no pronounced skew in prediction errors. This pattern suggests consistent model behavior across both classes on the test dataset.

Figures

ValidMind Figure my_test_provider.ConfusionMatrix:champion:9dee
2026-01-28 18:00:03,845 - INFO(validmind.vm_models.result.result): Test driven block with result_id my_test_provider.ConfusionMatrix:champion does not exist in model's document
# Challenger with test dataset and test provider custom test
vm.tests.run_test(
    test_id="my_test_provider.ConfusionMatrix:challenger",
    input_grid={
        "dataset": [vm_test_ds],
        "model" : [vm_rf_model]
    }
).log()

Confusion Matrix Challenger

The Confusion Matrix test evaluates the classification performance of the rf_model on the test_dataset_final by comparing predicted and true labels. The resulting 2x2 matrix displays the counts of true positives, true negatives, false positives, and false negatives, providing a comprehensive view of the model's prediction accuracy and error distribution. The matrix enables assessment of the model's ability to correctly identify both positive and negative cases, as well as the types and frequencies of misclassifications.

Key insights:

  • Balanced correct classification of both classes: The model correctly classified 227 negative cases (true negatives) and 229 positive cases (true positives), indicating similar performance across both classes.
  • Moderate false positive and false negative rates: There were 86 false positives and 105 false negatives, reflecting a moderate level of misclassification for both types of errors.
  • Comparable error distribution: The counts of false positives and false negatives are of similar magnitude, suggesting that the model does not exhibit a strong bias toward over-predicting either class.

The confusion matrix reveals that the rf_model demonstrates balanced classification performance, with similar accuracy for both positive and negative classes. The distribution of misclassifications is relatively even, indicating no pronounced skew toward either false positives or false negatives. Overall, the model maintains a consistent error profile across both classes.

Figures

ValidMind Figure my_test_provider.ConfusionMatrix:challenger:a410
2026-01-28 18:00:17,870 - INFO(validmind.vm_models.result.result): Test driven block with result_id my_test_provider.ConfusionMatrix:challenger does not exist in model's document

Verify test runs

Our final task is to verify that all the tests provided by the model development team were run and reported accurately. Note the result_ids appended to the relevant tests, which delineate which dataset we ran each test with.

Here, we'll specify all the tests we'd like to independently rerun in a dictionary called test_config. Note that inputs and input_grid expect the input_id of the dataset or model as the value, rather than the variable name we specified:

test_config = {
    # Run with the raw dataset
    'validmind.data_validation.DatasetDescription:raw_data': {
        'inputs': {'dataset': 'raw_dataset'}
    },
    'validmind.data_validation.DescriptiveStatistics:raw_data': {
        'inputs': {'dataset': 'raw_dataset'}
    },
    'validmind.data_validation.MissingValues:raw_data': {
        'inputs': {'dataset': 'raw_dataset'},
        'params': {'min_threshold': 1}
    },
    'validmind.data_validation.ClassImbalance:raw_data': {
        'inputs': {'dataset': 'raw_dataset'},
        'params': {'min_percent_threshold': 10}
    },
    'validmind.data_validation.Duplicates:raw_data': {
        'inputs': {'dataset': 'raw_dataset'},
        'params': {'min_threshold': 1}
    },
    'validmind.data_validation.HighCardinality:raw_data': {
        'inputs': {'dataset': 'raw_dataset'},
        'params': {
            'num_threshold': 100,
            'percent_threshold': 0.1,
            'threshold_type': 'percent'
        }
    },
    'validmind.data_validation.Skewness:raw_data': {
        'inputs': {'dataset': 'raw_dataset'},
        'params': {'max_threshold': 1}
    },
    'validmind.data_validation.UniqueRows:raw_data': {
        'inputs': {'dataset': 'raw_dataset'},
        'params': {'min_percent_threshold': 1}
    },
    'validmind.data_validation.TooManyZeroValues:raw_data': {
        'inputs': {'dataset': 'raw_dataset'},
        'params': {'max_percent_threshold': 0.03}
    },
    'validmind.data_validation.IQROutliersTable:raw_data': {
        'inputs': {'dataset': 'raw_dataset'},
        'params': {'threshold': 5}
    },
    # Run with the preprocessed dataset
    'validmind.data_validation.DescriptiveStatistics:preprocessed_data': {
        'inputs': {'dataset': 'raw_dataset_preprocessed'}
    },
    'validmind.data_validation.TabularDescriptionTables:preprocessed_data': {
        'inputs': {'dataset': 'raw_dataset_preprocessed'}
    },
    'validmind.data_validation.MissingValues:preprocessed_data': {
        'inputs': {'dataset': 'raw_dataset_preprocessed'},
        'params': {'min_threshold': 1}
    },
    'validmind.data_validation.TabularNumericalHistograms:preprocessed_data': {
        'inputs': {'dataset': 'raw_dataset_preprocessed'}
    },
    'validmind.data_validation.TabularCategoricalBarPlots:preprocessed_data': {
        'inputs': {'dataset': 'raw_dataset_preprocessed'}
    },
    'validmind.data_validation.TargetRateBarPlots:preprocessed_data': {
        'inputs': {'dataset': 'raw_dataset_preprocessed'},
        'params': {'default_column': 'loan_status'}
    },
    # Run with the training and test datasets
    'validmind.data_validation.DescriptiveStatistics:development_data': {
        'input_grid': {'dataset': ['train_dataset_final', 'test_dataset_final']}
    },
    'validmind.data_validation.TabularDescriptionTables:development_data': {
        'input_grid': {'dataset': ['train_dataset_final', 'test_dataset_final']}
    },
    'validmind.data_validation.ClassImbalance:development_data': {
        'input_grid': {'dataset': ['train_dataset_final', 'test_dataset_final']},
        'params': {'min_percent_threshold': 10}
    },
    'validmind.data_validation.UniqueRows:development_data': {
        'input_grid': {'dataset': ['train_dataset_final', 'test_dataset_final']},
        'params': {'min_percent_threshold': 1}
    },
    'validmind.data_validation.TabularNumericalHistograms:development_data': {
        'input_grid': {'dataset': ['train_dataset_final', 'test_dataset_final']}
    },
    'validmind.data_validation.MutualInformation:development_data': {
        'input_grid': {'dataset': ['train_dataset_final', 'test_dataset_final']},
        'params': {'min_threshold': 0.01}
    },
    'validmind.data_validation.PearsonCorrelationMatrix:development_data': {
        'input_grid': {'dataset': ['train_dataset_final', 'test_dataset_final']}
    },
    'validmind.data_validation.HighPearsonCorrelation:development_data': {
        'input_grid': {'dataset': ['train_dataset_final', 'test_dataset_final']},
        'params': {'max_threshold': 0.3, 'top_n_correlations': 10}
    },
    'validmind.model_validation.ModelMetadata': {
        'input_grid': {'model': ['log_model_champion', 'rf_model']}
    },
    'validmind.model_validation.sklearn.ModelParameters': {
        'input_grid': {'model': ['log_model_champion', 'rf_model']}
    },
    'validmind.model_validation.sklearn.ROCCurve': {
        'input_grid': {'dataset': ['train_dataset_final', 'test_dataset_final'], 'model': ['log_model_champion']}
    },
    'validmind.model_validation.sklearn.MinimumROCAUCScore': {
        'input_grid': {'dataset': ['train_dataset_final', 'test_dataset_final'], 'model': ['log_model_champion']},
        'params': {'min_threshold': 0.5}
    }
}

Then batch run and log our tests in test_config:

for t in test_config:
    print(t)
    try:
        # Check if test has input_grid
        if 'input_grid' in test_config[t]:
            # For tests with input_grid, pass the input_grid configuration
            if 'params' in test_config[t]:
                vm.tests.run_test(t, input_grid=test_config[t]['input_grid'], params=test_config[t]['params']).log()
            else:
                vm.tests.run_test(t, input_grid=test_config[t]['input_grid']).log()
        else:
            # Original logic for regular inputs
            if 'params' in test_config[t]:
                vm.tests.run_test(t, inputs=test_config[t]['inputs'], params=test_config[t]['params']).log()
            else:
                vm.tests.run_test(t, inputs=test_config[t]['inputs']).log()
    except Exception as e:
        print(f"Error running test {t}: {str(e)}")
validmind.data_validation.DatasetDescription:raw_data

Dataset Description Raw Data

The DatasetDescription test provides a comprehensive summary of the dataset's structure, completeness, and feature characteristics. The results table details each column's data type, count, missingness, and the number of distinct values, offering a clear overview of the dataset composition. All columns are fully populated with no missing values, and the distinct value counts highlight the diversity and granularity of each feature. This summary enables a thorough understanding of the dataset's readiness for modeling and potential areas of complexity.

Key insights:

  • No missing values across all columns: All 11 columns report 0 missing entries, indicating complete data coverage for every feature.
  • High cardinality in key numeric features: The Balance and EstimatedSalary columns exhibit high distinct value counts (5088 and 8000 respectively), reflecting continuous or near-continuous distributions.
  • Low cardinality in categorical features: Categorical columns such as Geography, Gender, HasCrCard, IsActiveMember, and Exited have between 2 and 3 distinct values, supporting straightforward encoding and analysis.
  • Moderate diversity in demographic and behavioral features: CreditScore and Age show moderate distinct counts (452 and 69), while Tenure and NumOfProducts have lower diversity (11 and 4 distinct values).

The dataset is fully complete with no missing data, supporting robust downstream analysis. Numeric features display a range of cardinalities, from highly granular (EstimatedSalary, Balance) to more discretized (Tenure, NumOfProducts). Categorical features are well-defined with low cardinality, facilitating efficient encoding. The overall structure indicates a dataset suitable for machine learning applications, with no immediate data quality concerns observed in the summary statistics.

Tables

Dataset Description

Name Type Count Missing Missing % Distinct Distinct %
CreditScore Numeric 8000.0 0 0.0 452 0.0565
Geography Categorical 8000.0 0 0.0 3 0.0004
Gender Categorical 8000.0 0 0.0 2 0.0002
Age Numeric 8000.0 0 0.0 69 0.0086
Tenure Numeric 8000.0 0 0.0 11 0.0014
Balance Numeric 8000.0 0 0.0 5088 0.6360
NumOfProducts Numeric 8000.0 0 0.0 4 0.0005
HasCrCard Categorical 8000.0 0 0.0 2 0.0002
IsActiveMember Categorical 8000.0 0 0.0 2 0.0002
EstimatedSalary Numeric 8000.0 0 0.0 8000 1.0000
Exited Categorical 8000.0 0 0.0 2 0.0002
2026-01-28 18:00:27,744 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.DatasetDescription:raw_data does not exist in model's document
validmind.data_validation.DescriptiveStatistics:raw_data

Descriptive Statistics Raw Data

The Descriptive Statistics test evaluates the distributional characteristics and diversity of both numerical and categorical variables in the dataset. The results present summary statistics for eight numerical variables, including measures of central tendency, dispersion, and range, as well as frequency-based summaries for two categorical variables. The numerical table details counts, means, standard deviations, and percentiles, while the categorical table reports unique value counts and the dominance of the most frequent category. These results provide a comprehensive overview of the dataset’s structure and highlight key aspects of variable distributions.

Key insights:

  • Wide range and skewness in balance values: The Balance variable exhibits a minimum of 0.0, a median of 97,264.0, and a maximum of 250,898.0, with a mean (76,434.10) substantially below the median, indicating a right-skewed distribution and a significant proportion of zero balances.
  • High concentration in categorical variables: Geography is dominated by France (50.12% of records), and Gender is predominantly Male (54.95%), indicating limited diversity in these categorical features.
  • Binary variables with balanced representation: HasCrCard and IsActiveMember are binary, with means of 0.70 and 0.52, respectively, suggesting moderate balance between categories.
  • Consistent sample sizes and low missingness: All variables report a count of 8,000, indicating no missing data across both numerical and categorical fields.
  • Substantial spread in estimated salary: EstimatedSalary ranges from 12.0 to 199,992.0, with a mean of 99,790.19 and a standard deviation of 57,520.51, reflecting high variability in this feature.

The dataset demonstrates complete data coverage with no missing values and a broad range of values across key numerical variables. Several variables, such as Balance and EstimatedSalary, display substantial dispersion and skewness, while categorical variables show limited diversity due to the dominance of specific categories. These characteristics provide important context for understanding the underlying data structure and potential sources of model risk related to feature distribution and representativeness.

Tables

Numerical Variables

Name Count Mean Std Min 25% 50% 75% 90% 95% Max
CreditScore 8000.0 650.1596 96.8462 350.0 583.0 652.0 717.0 778.0 813.0 850.0
Age 8000.0 38.9489 10.4590 18.0 32.0 37.0 44.0 53.0 60.0 92.0
Tenure 8000.0 5.0339 2.8853 0.0 3.0 5.0 8.0 9.0 9.0 10.0
Balance 8000.0 76434.0965 62612.2513 0.0 0.0 97264.0 128045.0 149545.0 162488.0 250898.0
NumOfProducts 8000.0 1.5325 0.5805 1.0 1.0 1.0 2.0 2.0 2.0 4.0
HasCrCard 8000.0 0.7026 0.4571 0.0 0.0 1.0 1.0 1.0 1.0 1.0
IsActiveMember 8000.0 0.5199 0.4996 0.0 0.0 1.0 1.0 1.0 1.0 1.0
EstimatedSalary 8000.0 99790.1880 57520.5089 12.0 50857.0 99505.0 149216.0 179486.0 189997.0 199992.0

Categorical Variables

Name Count Number of Unique Values Top Value Top Value Frequency Top Value Frequency %
Geography 8000.0 3.0 France 4010.0 50.12
Gender 8000.0 2.0 Male 4396.0 54.95
2026-01-28 18:00:39,250 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.DescriptiveStatistics:raw_data does not exist in model's document
validmind.data_validation.MissingValues:raw_data

✅ Missing Values Raw Data

The Missing Values test evaluates dataset quality by measuring the proportion of missing values in each feature and comparing it to a predefined threshold. The results table presents, for each column, the number and percentage of missing values, along with a Pass/Fail status based on whether the missingness exceeds the threshold. All features in the dataset are shown with zero missing values, and each column is marked as passing the test.

Key insights:

  • No missing values detected: All features, including CreditScore, Geography, Gender, Age, Tenure, Balance, NumOfProducts, HasCrCard, IsActiveMember, EstimatedSalary, and Exited, have zero missing values.
  • Universal pass status: Every column meets the missing value threshold criterion, with 0.0% missingness and a Pass status across the dataset.

The dataset demonstrates complete data integrity with respect to missing values, as all features contain full data coverage and satisfy the established threshold. This indicates a high level of data quality for subsequent modeling or analysis steps.

Parameters:

{
  "min_threshold": 1
}
            

Tables

Column Number of Missing Values Percentage of Missing Values (%) Pass/Fail
CreditScore 0 0.0 Pass
Geography 0 0.0 Pass
Gender 0 0.0 Pass
Age 0 0.0 Pass
Tenure 0 0.0 Pass
Balance 0 0.0 Pass
NumOfProducts 0 0.0 Pass
HasCrCard 0 0.0 Pass
IsActiveMember 0 0.0 Pass
EstimatedSalary 0 0.0 Pass
Exited 0 0.0 Pass
2026-01-28 18:00:47,579 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.MissingValues:raw_data does not exist in model's document
validmind.data_validation.ClassImbalance:raw_data

✅ Class Imbalance Raw Data

The Class Imbalance test evaluates the distribution of target classes within the dataset to identify potential imbalances that could impact model performance. The results table presents the percentage of records for each class in the "Exited" target variable, alongside a pass/fail assessment based on a minimum threshold of 10%. The accompanying bar plot visually depicts the proportion of each class, with class 0 and class 1 shown as distinct bars representing their respective frequencies.

Key insights:

  • Both classes exceed the minimum threshold: Class 0 constitutes 79.80% and class 1 constitutes 20.20% of the dataset, with both surpassing the 10% minimum threshold.
  • No classes flagged for imbalance: The pass/fail assessment indicates that neither class is under-represented according to the defined criterion.
  • Class distribution is visually asymmetric: The bar plot highlights a notable difference in class proportions, with class 0 being the majority class.

The results indicate that, while the dataset is not perfectly balanced, both classes meet the minimum representation threshold set for this test. The observed class distribution is asymmetric, with a substantially higher proportion of class 0 compared to class 1, but no class falls below the specified risk threshold for imbalance.

Parameters:

{
  "min_percent_threshold": 10
}
            

Tables

Exited Class Imbalance

Exited Percentage of Rows (%) Pass/Fail
0 79.80% Pass
1 20.20% Pass

Figures

ValidMind Figure validmind.data_validation.ClassImbalance:raw_data:c64f
2026-01-28 18:00:59,333 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.ClassImbalance:raw_data does not exist in model's document
validmind.data_validation.Duplicates:raw_data

✅ Duplicates Raw Data

The Duplicates test evaluates the presence of duplicate rows within the dataset to assess data quality and mitigate risks associated with redundant information. The results table presents the absolute number and percentage of duplicate rows detected, providing a quantitative overview of dataset uniqueness. The test was executed with a minimum threshold parameter set to 1, and the results are summarized in the table titled "Duplicate Rows Results for Dataset."

Key insights:

  • No duplicate rows detected: The dataset contains 0 duplicate rows, as indicated by the "Number of Duplicates" value.
  • Zero percent duplication: The "Percentage of Rows (%)" is 0.0%, confirming the absence of redundant entries in the dataset.

The results demonstrate that the dataset is free from duplicate rows, indicating a high level of data uniqueness and integrity. The absence of duplication supports reliable model training and reduces the risk of overfitting due to repeated information.

Parameters:

{
  "min_threshold": 1
}
            

Tables

Duplicate Rows Results for Dataset

Number of Duplicates Percentage of Rows (%)
0 0.0
2026-01-28 18:01:05,724 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.Duplicates:raw_data does not exist in model's document
validmind.data_validation.HighCardinality:raw_data

✅ High Cardinality Raw Data

The High Cardinality test evaluates the number of unique values in categorical columns to identify potential risks of overfitting and data noise. The results table presents the number and percentage of distinct values for each categorical column, along with a pass/fail status based on a threshold of 10% distinct values. Both "Geography" and "Gender" columns are assessed, with their respective distinct value counts and percentages reported.

Key insights:

  • All categorical columns pass cardinality threshold: Both "Geography" (3 distinct values, 0.0375%) and "Gender" (2 distinct values, 0.025%) are well below the 10% threshold, resulting in a "Pass" status for each.
  • Low cardinality observed across features: The number of unique values in both columns is minimal relative to the total sample size, indicating low cardinality throughout the assessed categorical features.

The results indicate that all evaluated categorical columns exhibit low cardinality, with distinct value counts and percentages substantially below the defined threshold. No evidence of high cardinality or associated overfitting risk is present in the tested features.

Parameters:

{
  "num_threshold": 100,
  "percent_threshold": 0.1,
  "threshold_type": "percent"
}
            

Tables

| Column | Number of Distinct Values | Percentage of Distinct Values (%) | Pass/Fail |
|---|---|---|---|
| Geography | 3 | 0.0375 | Pass |
| Gender | 2 | 0.0250 | Pass |
2026-01-28 18:01:12,152 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.HighCardinality:raw_data does not exist in model's document
validmind.data_validation.Skewness:raw_data

❌ Skewness Raw Data

The Skewness:raw_data test evaluates the asymmetry of numerical data distributions by calculating skewness values for each numeric column and comparing them to a maximum threshold of 1. The results table presents skewness values and pass/fail outcomes for each column, indicating whether the distributional asymmetry exceeds the defined threshold. Columns with skewness values below 1 are marked as "Pass," while those exceeding the threshold are marked as "Fail." This assessment provides a quantitative overview of distributional characteristics relevant to data quality and model performance.

Key insights:

  • Most columns exhibit low skewness: The majority of numeric columns, including CreditScore, Tenure, Balance, NumOfProducts, HasCrCard, IsActiveMember, and EstimatedSalary, have skewness values well below the threshold of 1 and pass the test.
  • Two columns exceed skewness threshold: Age (skewness = 1.0245) and Exited (skewness = 1.4847) exceed the maximum threshold, resulting in a fail outcome for these columns.
  • Skewness values are generally close to zero: Several columns, such as Tenure (0.0077), EstimatedSalary (0.0095), and CreditScore (‑0.062), display skewness values near zero, indicating near-symmetric distributions.

The skewness assessment reveals that most numeric columns in the dataset have distributions that are approximately symmetric or only moderately skewed, remaining within the defined threshold. However, Age and Exited display higher levels of asymmetry, exceeding the maximum skewness threshold and indicating notable distributional skew in these variables. The overall distributional profile suggests that, aside from these exceptions, the dataset maintains a balanced structure with respect to skewness.

Parameters:

{
  "max_threshold": 1
}
            

Tables

Skewness Results for Dataset

| Column | Skewness | Pass/Fail |
|---|---|---|
| CreditScore | -0.0620 | Pass |
| Age | 1.0245 | Fail |
| Tenure | 0.0077 | Pass |
| Balance | -0.1353 | Pass |
| NumOfProducts | 0.7172 | Pass |
| HasCrCard | -0.8867 | Pass |
| IsActiveMember | -0.0796 | Pass |
| EstimatedSalary | 0.0095 | Pass |
| Exited | 1.4847 | Fail |
2026-01-28 18:01:24,737 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.Skewness:raw_data does not exist in model's document
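
To reproduce the gist of the skewness check yourself, compute per-column skewness with pandas and compare it against the threshold. The absolute-value comparison below is an assumption for illustration; ValidMind's exact pass/fail semantics may differ:

```python
# Illustrative sketch only -- the absolute-value comparison is an assumption.
max_threshold = 1  # mirrors the test parameter

skewness = raw_df.select_dtypes("number").skew().to_frame("Skewness")
skewness["Pass/Fail"] = [
    "Pass" if abs(s) <= max_threshold else "Fail" for s in skewness["Skewness"]
]
print(skewness)
```
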
validmind.data_validation.UniqueRows:raw_data

❌ Unique Rows Raw Data

The UniqueRows test evaluates the diversity of the dataset by measuring the proportion of unique values in each column and comparing it to a minimum percentage threshold. The results table presents, for each column, the number and percentage of unique values, along with a pass/fail outcome based on whether the percentage exceeds the 1% threshold. Columns such as CreditScore, Balance, and EstimatedSalary show high percentages of unique values and pass the test, while most categorical and low-cardinality columns do not meet the threshold and fail.

Key insights:

  • High uniqueness in continuous variables: EstimatedSalary (100%), Balance (63.6%), and CreditScore (5.65%) exceed the 1% uniqueness threshold, indicating substantial diversity in these columns.
  • Low uniqueness in categorical variables: Columns such as Geography (0.0375%), Gender (0.025%), HasCrCard (0.025%), IsActiveMember (0.025%), and Exited (0.025%) have very low percentages of unique values and fail the test.
  • Majority of columns fail uniqueness threshold: Only 3 out of 11 columns pass the test, with the remaining 8 columns—including Age (0.8625%), Tenure (0.1375%), and NumOfProducts (0.05%)—falling below the 1% threshold.

The results indicate that while continuous variables in the dataset exhibit high diversity, the majority of categorical and low-cardinality columns do not meet the minimum uniqueness threshold. This pattern reflects a concentration of unique values in a subset of features, with limited diversity observed in most categorical variables. The overall data structure is characterized by high uniqueness in select columns and low uniqueness in the majority of others.

Parameters:

{
  "min_percent_threshold": 1
}
            

Tables

| Column | Number of Unique Values | Percentage of Unique Values (%) | Pass/Fail |
|---|---|---|---|
| CreditScore | 452 | 5.6500 | Pass |
| Geography | 3 | 0.0375 | Fail |
| Gender | 2 | 0.0250 | Fail |
| Age | 69 | 0.8625 | Fail |
| Tenure | 11 | 0.1375 | Fail |
| Balance | 5088 | 63.6000 | Pass |
| NumOfProducts | 4 | 0.0500 | Fail |
| HasCrCard | 2 | 0.0250 | Fail |
| IsActiveMember | 2 | 0.0250 | Fail |
| EstimatedSalary | 8000 | 100.0000 | Pass |
| Exited | 2 | 0.0250 | Fail |
2026-01-28 18:01:34,678 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.UniqueRows:raw_data does not exist in model's document
validmind.data_validation.TooManyZeroValues:raw_data

❌ Too Many Zero Values Raw Data

The TooManyZeroValues test identifies numerical columns with a proportion of zero values exceeding a defined threshold, set here at 0.03%. The results table summarizes the number and percentage of zero values for each numerical column, along with a pass/fail status based on the threshold. All four evaluated columns—Tenure, Balance, HasCrCard, and IsActiveMember—are reported with their respective zero value counts and fail the test due to exceeding the threshold.

Key insights:

  • All evaluated columns exceed zero value threshold: Each of the four numerical columns has a percentage of zero values significantly above the 0.03% threshold, resulting in a fail status for all.
  • High concentration of zeros in Balance and IsActiveMember: Balance contains 36.4% zero values, and IsActiveMember contains 48.01%, indicating substantial sparsity in these features.
  • Binary indicator columns show elevated zero rates: HasCrCard and IsActiveMember, likely representing binary indicators, have 29.74% and 48.01% zero values, respectively, reflecting a high proportion of one class.
  • Tenure column also affected: Tenure registers 4.04% zero values, which, while lower than other columns, still exceeds the threshold and results in a fail.

All assessed numerical columns display zero value proportions well above the defined threshold, with Balance and IsActiveMember exhibiting particularly high rates of zeros. The presence of elevated zero counts across both continuous and likely binary columns indicates a pattern of data sparsity or class imbalance in these features. This distribution warrants consideration in subsequent modeling steps, as it may influence feature utility and model performance.

Parameters:

{
  "max_percent_threshold": 0.03
}
            

Tables

| Variable | Row Count | Number of Zero Values | Percentage of Zero Values (%) | Pass/Fail |
|---|---|---|---|---|
| Tenure | 8000 | 323 | 4.0375 | Fail |
| Balance | 8000 | 2912 | 36.4000 | Fail |
| HasCrCard | 8000 | 2379 | 29.7375 | Fail |
| IsActiveMember | 8000 | 3841 | 48.0125 | Fail |
2026-01-28 18:01:43,906 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.TooManyZeroValues:raw_data does not exist in model's document
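
A rough pandas sketch of the zero-value check is shown below. The threshold handling and the set of columns evaluated are assumptions for illustration; the ValidMind test, for example, restricts the check to feature columns:

```python
# Illustrative sketch only -- not the ValidMind implementation.
import pandas as pd

max_percent_threshold = 0.03  # mirrors the test parameter

numeric_df = raw_df.select_dtypes("number")
zero_counts = (numeric_df == 0).sum()
report = pd.DataFrame(
    {
        "Number of Zero Values": zero_counts,
        "Percentage of Zero Values (%)": zero_counts / len(numeric_df) * 100,
    }
)
report = report[report["Number of Zero Values"] > 0].copy()
report["Pass/Fail"] = [
    "Pass" if pct <= max_percent_threshold else "Fail"
    for pct in report["Percentage of Zero Values (%)"]
]
print(report)
```
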
validmind.data_validation.IQROutliersTable:raw_data

IQR Outliers Table Raw Data

The Interquartile Range Outliers Table (IQROutliersTable) test identifies and summarizes outliers in numerical features using the IQR method, with the outlier threshold set to 5. The results are presented in a summary table that would list, for each numerical feature, the count and distributional statistics of detected outliers. In this test run, the summary table contains no entries, indicating the absence of detected outliers across all evaluated numerical features.

Key insights:

  • No outliers detected in any feature: The summary table is empty, confirming that no numerical features exhibited values outside the IQR-based outlier thresholds.
  • Uniform data distribution within threshold: All numerical feature values fall within the calculated IQR bounds, given the threshold of 5.

The absence of detected outliers indicates that the dataset's numerical features are uniformly distributed within the specified IQR threshold. No evidence of extreme or anomalous values was observed under the applied test parameters.

Parameters:

{
  "threshold": 5
}
            

Tables

Summary of Outliers Detected by IQR Method

2026-01-28 18:01:51,989 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.IQROutliersTable:raw_data does not exist in model's document
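
The IQR method flags values that fall more than `threshold` times the interquartile range outside the first and third quartiles. A minimal pandas sketch of that logic (illustrative only, not the ValidMind implementation):

```python
# Illustrative sketch only -- not the ValidMind implementation.
threshold = 5  # mirrors the test parameter

numeric_df = raw_df.select_dtypes("number")
q1, q3 = numeric_df.quantile(0.25), numeric_df.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - threshold * iqr, q3 + threshold * iqr

# Count values outside the [lower, upper] band for each column
outlier_counts = ((numeric_df < lower) | (numeric_df > upper)).sum()
print(outlier_counts[outlier_counts > 0])  # an empty result means no outliers detected
```
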
validmind.data_validation.DescriptiveStatistics:preprocessed_data

Descriptive Statistics Preprocessed Data

The Descriptive Statistics test evaluates the distributional characteristics and diversity of both numerical and categorical variables in the preprocessed dataset. The results are presented in two summary tables: one for numerical variables, detailing central tendency, dispersion, and range; and one for categorical variables, summarizing value counts, unique value diversity, and the dominance of top categories. These tables provide a comprehensive overview of the dataset’s structure, supporting assessment of data quality and potential risk factors.

Key insights:

  • Wide range and negative skew in Balance: The Balance variable exhibits a minimum of 0.0, a median of 102,386.0, and a maximum of 250,898.0, with a mean (81,585.1) substantially below the median, indicating a left-skewed distribution driven by the large concentration of zero balances.
  • CreditScore distribution is broad and symmetric: CreditScore ranges from 350.0 to 850.0, with a mean (647.7) closely aligned to the median (649.0), suggesting a relatively symmetric distribution.
  • Categorical variables show moderate diversity: Geography has three unique values, with France as the most frequent (46.04%), and Gender is nearly balanced (Male: 51.27%), indicating no single category dominates excessively.
  • Binary variables are well represented: HasCrCard and IsActiveMember are binary, with HasCrCard showing 69.5% positive responses and IsActiveMember at 47.3%, reflecting reasonable class balance.

The dataset demonstrates a broad and well-populated range for key numerical variables, with some evidence of skewness in Balance. Categorical variables display moderate diversity, with no overwhelming dominance by a single category. Binary variables are distributed without extreme imbalance. Overall, the data structure supports robust modeling, with distributional characteristics that warrant monitoring for potential skewness or concentration effects.

Tables

Numerical Variables

| Name | Count | Mean | Std | Min | 25% | 50% | 75% | 90% | 95% | Max |
|---|---|---|---|---|---|---|---|---|---|---|
| CreditScore | 3232.0 | 647.6711 | 98.7191 | 350.0 | 581.0 | 649.0 | 717.0 | 779.0 | 817.0 | 850.0 |
| Tenure | 3232.0 | 5.0285 | 2.9152 | 0.0 | 3.0 | 5.0 | 8.0 | 9.0 | 10.0 | 10.0 |
| Balance | 3232.0 | 81585.0888 | 61380.7207 | 0.0 | 0.0 | 102386.0 | 128886.0 | 150802.0 | 164733.0 | 250898.0 |
| NumOfProducts | 3232.0 | 1.5108 | 0.6703 | 1.0 | 1.0 | 1.0 | 2.0 | 2.0 | 3.0 | 4.0 |
| HasCrCard | 3232.0 | 0.6952 | 0.4604 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| IsActiveMember | 3232.0 | 0.4728 | 0.4993 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| EstimatedSalary | 3232.0 | 100296.6888 | 57771.2303 | 12.0 | 50956.0 | 100642.0 | 150557.0 | 179122.0 | 188759.0 | 199992.0 |

Categorical Variables

| Name | Count | Number of Unique Values | Top Value | Top Value Frequency | Top Value Frequency % |
|---|---|---|---|---|---|
| Geography | 3232.0 | 3.0 | France | 1488.0 | 46.04 |
| Gender | 3232.0 | 2.0 | Male | 1657.0 | 51.27 |
2026-01-28 18:02:02,433 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.DescriptiveStatistics:preprocessed_data does not exist in model's document
validmind.data_validation.TabularDescriptionTables:preprocessed_data

Tabular Description Tables Preprocessed Data

The Descriptive Statistics test evaluates the distribution, completeness, and data types of numerical and categorical variables in the dataset. The results present summary statistics for eight numerical variables and two categorical variables, including measures of central tendency, range, missingness, and unique value counts. All variables are reported with their respective data types and observed value ranges, providing a comprehensive overview of the dataset’s structure and integrity.

Key insights:

  • No missing values detected: All numerical and categorical variables report 0.0% missing values, indicating complete data coverage across all fields.
  • Consistent data types across variables: Numerical variables are represented as int64 or float64, while categorical variables are of object type, aligning with their expected formats.
  • Limited cardinality in categorical variables: Geography contains three unique values (France, Germany, Spain), and Gender contains two unique values (Female, Male), supporting straightforward categorical encoding.
  • Wide value ranges in numerical variables: CreditScore ranges from 350 to 850, Balance from 0.0 to 250,898.09, and EstimatedSalary from 11.58 to 199,992.48, reflecting substantial spread in key financial indicators.
  • Binary encoding for indicator variables: HasCrCard, IsActiveMember, and Exited are encoded as binary (0/1) int64 variables, supporting direct use in binary classification or indicator analysis.

The dataset exhibits complete data with no missing values and appropriate data types for all variables. Categorical variables display low cardinality, and numerical variables cover broad value ranges, particularly in financial fields. The structure and integrity of the data are well-documented, providing a reliable foundation for subsequent modeling and analysis.

Tables

| Numerical Variable | Num of Obs | Mean | Min | Max | Missing Values (%) | Data Type |
|---|---|---|---|---|---|---|
| CreditScore | 3232 | 647.6711 | 350.00 | 850.00 | 0.0 | int64 |
| Tenure | 3232 | 5.0285 | 0.00 | 10.00 | 0.0 | int64 |
| Balance | 3232 | 81585.0888 | 0.00 | 250898.09 | 0.0 | float64 |
| NumOfProducts | 3232 | 1.5108 | 1.00 | 4.00 | 0.0 | int64 |
| HasCrCard | 3232 | 0.6952 | 0.00 | 1.00 | 0.0 | int64 |
| IsActiveMember | 3232 | 0.4728 | 0.00 | 1.00 | 0.0 | int64 |
| EstimatedSalary | 3232 | 100296.6888 | 11.58 | 199992.48 | 0.0 | float64 |
| Exited | 3232 | 0.5000 | 0.00 | 1.00 | 0.0 | int64 |

| Categorical Variable | Num of Obs | Num of Unique Values | Unique Values | Missing Values (%) | Data Type |
|---|---|---|---|---|---|
| Geography | 3232.0 | 3.0 | ['France' 'Germany' 'Spain'] | 0.0 | object |
| Gender | 3232.0 | 2.0 | ['Female' 'Male'] | 0.0 | object |
2026-01-28 18:02:10,565 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.TabularDescriptionTables:preprocessed_data does not exist in model's document
validmind.data_validation.MissingValues:preprocessed_data

✅ Missing Values Preprocessed Data

The Missing Values test evaluates dataset quality by measuring the proportion of missing values in each feature and comparing it to a predefined threshold. The results table presents, for each column, the number and percentage of missing values, along with a Pass/Fail status based on whether the missingness exceeds the specified threshold. All features in the dataset are listed with their corresponding missing value statistics and test outcomes.

Key insights:

  • No missing values detected: All features report zero missing values, with both the number and percentage of missing entries recorded as 0.0%.
  • Universal test pass across features: Every column meets the missing value threshold criterion, resulting in a Pass status for all features.

The dataset demonstrates complete data integrity with respect to missing values, as all features contain fully populated entries and satisfy the established threshold. This indicates a high level of data quality for subsequent modeling or analysis steps.

Parameters:

{
  "min_threshold": 1
}
            

Tables

| Column | Number of Missing Values | Percentage of Missing Values (%) | Pass/Fail |
|---|---|---|---|
| CreditScore | 0 | 0.0 | Pass |
| Geography | 0 | 0.0 | Pass |
| Gender | 0 | 0.0 | Pass |
| Tenure | 0 | 0.0 | Pass |
| Balance | 0 | 0.0 | Pass |
| NumOfProducts | 0 | 0.0 | Pass |
| HasCrCard | 0 | 0.0 | Pass |
| IsActiveMember | 0 | 0.0 | Pass |
| EstimatedSalary | 0 | 0.0 | Pass |
| Exited | 0 | 0.0 | Pass |
2026-01-28 18:02:15,261 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.MissingValues:preprocessed_data does not exist in model's document
validmind.data_validation.TabularNumericalHistograms:preprocessed_data

Tabular Numerical Histograms Preprocessed Data

The TabularNumericalHistograms:preprocessed_data test provides a visual summary of the distribution of each numerical feature in the dataset using histograms. The resulting plots display the frequency distribution for variables such as CreditScore, Tenure, Balance, NumOfProducts, HasCrCard, IsActiveMember, and EstimatedSalary. These visualizations enable assessment of central tendency, spread, skewness, and the presence of outliers or unusual patterns in the input data.

Key insights:

  • CreditScore distribution is unimodal and near-symmetric: The CreditScore histogram shows a single peak centered around 650–700 with a slightly longer tail toward lower values, consistent with the close agreement between the reported mean (647.7) and median (649.0), and a concentration of scores in the mid-to-high range.
  • Tenure is nearly uniform except at endpoints: The Tenure variable displays an approximately uniform distribution across most values, with lower frequencies at the minimum and maximum tenure values.
  • Balance exhibits a strong spike at zero: The Balance histogram reveals a pronounced spike at zero, with the remainder of the distribution forming a bell-shaped curve centered around 120,000, indicating a substantial subset of accounts with no balance.
  • NumOfProducts is highly concentrated at lower values: The distribution of NumOfProducts is heavily concentrated at 1 and 2 products, with very few instances at 3 or 4, indicating limited product diversification among most customers.
  • HasCrCard and IsActiveMember are binary with class imbalance: Both HasCrCard and IsActiveMember show binary distributions, with HasCrCard skewed toward 1 (majority have a credit card) and IsActiveMember showing a moderate imbalance between active and inactive members.
  • EstimatedSalary is uniformly distributed: The EstimatedSalary histogram is approximately flat across the range, indicating a uniform distribution of salary values in the dataset.

The histograms collectively indicate that most numerical features are either uniformly or unimodally distributed, with some variables exhibiting notable skewness or concentration at specific values. The presence of a large number of zero balances and the concentration of product counts at lower values are prominent characteristics. Binary features display varying degrees of class imbalance. No extreme outliers or highly irregular distributions are observed in the visualized features.

Figures

ValidMind Figure validmind.data_validation.TabularNumericalHistograms:preprocessed_data:6708
ValidMind Figure validmind.data_validation.TabularNumericalHistograms:preprocessed_data:b637
ValidMind Figure validmind.data_validation.TabularNumericalHistograms:preprocessed_data:4886
ValidMind Figure validmind.data_validation.TabularNumericalHistograms:preprocessed_data:b756
ValidMind Figure validmind.data_validation.TabularNumericalHistograms:preprocessed_data:8793
ValidMind Figure validmind.data_validation.TabularNumericalHistograms:preprocessed_data:0b90
ValidMind Figure validmind.data_validation.TabularNumericalHistograms:preprocessed_data:babd
2026-01-28 18:02:35,003 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.TabularNumericalHistograms:preprocessed_data does not exist in model's document
validmind.data_validation.TabularCategoricalBarPlots:preprocessed_data

Tabular Categorical Bar Plots Preprocessed Data

The TabularCategoricalBarPlots test evaluates the distribution of categorical variables by generating bar plots for each category within the dataset. The resulting plots display the frequency counts for each category in the "Geography" and "Gender" features, providing a visual summary of the dataset's categorical composition. These visualizations facilitate the identification of category balance and potential representation issues within the data.

Key insights:

  • Geography distribution is imbalanced: The "Geography" feature shows the highest count for France, followed by Germany and then Spain, with France having approximately 50% more instances than Spain.
  • Gender distribution is relatively balanced: The "Gender" feature displays similar counts for Male and Female categories, with only a modest difference between the two.

The categorical composition of the dataset reveals a notable imbalance in the "Geography" feature, with France being the most represented and Spain the least. In contrast, the "Gender" feature demonstrates a near-equal distribution between categories. These patterns provide a clear overview of category representation, highlighting areas where category imbalance may influence downstream modeling.

Figures

ValidMind Figure validmind.data_validation.TabularCategoricalBarPlots:preprocessed_data:92b2
ValidMind Figure validmind.data_validation.TabularCategoricalBarPlots:preprocessed_data:a4ab
2026-01-28 18:02:45,130 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.TabularCategoricalBarPlots:preprocessed_data does not exist in model's document
validmind.data_validation.TargetRateBarPlots:preprocessed_data

Target Rate Bar Plots Preprocessed Data

The TargetRateBarPlots test visualizes the distribution and target rates of categorical features to provide insight into model decision patterns. The results display paired bar plots for each categorical variable, with the left plot showing the frequency of each category and the right plot depicting the mean target rate (proportion of positive class) for each category. The features analyzed include Geography and Gender, with each category’s count and corresponding target rate presented side by side for direct comparison.

Key insights:

  • Distinct target rate variation by Geography: The target rate for Germany is notably higher than for France and Spain, with Germany exceeding 0.6 while France and Spain are both near 0.4.
  • Balanced category representation in Gender: Male and Female categories have similar sample counts, each above 1500, indicating balanced representation in the dataset.
  • Gender target rate disparity: The target rate for Female is higher than for Male, with Female above 0.5 and Male below 0.45.
  • Uneven category counts in Geography: France has the highest count, followed by Germany and then Spain, indicating some imbalance in category frequencies.

The results reveal pronounced differences in target rates across both Geography and Gender categories, with Germany and Female categories exhibiting higher proportions of positive class outcomes. Category representation is balanced for Gender but shows moderate imbalance for Geography. These patterns highlight areas where model predictions and data composition differ across categorical groups.

Figures

ValidMind Figure validmind.data_validation.TargetRateBarPlots:preprocessed_data:fcaa
ValidMind Figure validmind.data_validation.TargetRateBarPlots:preprocessed_data:92f7
2026-01-28 18:02:56,308 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.TargetRateBarPlots:preprocessed_data does not exist in model's document
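
The target-rate plots are essentially a group-by on each categorical feature, counting rows and averaging the target. A quick pandas sketch is shown below; it is illustrative only and uses the `balanced_raw_df` built earlier rather than the exact preprocessed dataset the test ran on, so the numbers will differ slightly:

```python
# Illustrative sketch only -- not the ValidMind implementation.
for column in ["Geography", "Gender"]:
    summary = balanced_raw_df.groupby(column)["Exited"].agg(["count", "mean"])
    summary.columns = ["Count", "Target rate"]
    print(f"\n{column}:\n{summary}")
```
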
validmind.data_validation.DescriptiveStatistics:development_data

Descriptive Statistics Development Data

The Descriptive Statistics test evaluates the distributional characteristics of numerical variables in the development and test datasets. The results present summary statistics for each variable, including count, mean, standard deviation, minimum, maximum, and key percentiles. These statistics provide a comprehensive overview of the central tendency, dispersion, and range for each feature, enabling assessment of data quality and potential risk factors.

Key insights:

  • Consistent central tendencies across datasets: Means and medians (50th percentiles) for key variables such as CreditScore, Tenure, Balance, NumOfProducts, HasCrCard, IsActiveMember, and EstimatedSalary are closely aligned between the development and test datasets, indicating stable distributions.
  • Wide range and high variance in Balance and EstimatedSalary: Both Balance and EstimatedSalary exhibit large standard deviations (Balance: ~61,500–61,800; EstimatedSalary: ~57,300–57,900) and wide ranges, with minimum values near zero and maximums exceeding 199,000, reflecting substantial dispersion and potential for outliers.
  • Binary variables show expected distribution: HasCrCard and IsActiveMember are binary variables with means near 0.7 and 0.47, respectively, and standard deviations close to 0.46–0.50, consistent with their binary nature and indicating no missingness.
  • No evidence of missing data: All variables report counts equal to the total number of records in their respective datasets, indicating complete data coverage for the analyzed features.

The descriptive statistics indicate that the numerical variables in both the development and test datasets are well-aligned in terms of central tendency and spread, with no evidence of missing data. The wide dispersion observed in Balance and EstimatedSalary highlights the presence of substantial variability, which may influence model sensitivity to these features. Binary variables display balanced distributions, and overall, the datasets exhibit stable and consistent statistical properties across splits.

Tables

| dataset | Name | Count | Mean | Std | Min | 25% | 50% | 75% | 90% | 95% | Max |
|---|---|---|---|---|---|---|---|---|---|---|---|
| train_dataset_final | CreditScore | 2585.0 | 648.3965 | 98.5054 | 350.0 | 581.0 | 648.0 | 717.0 | 779.0 | 818.0 | 850.0 |
| train_dataset_final | Tenure | 2585.0 | 4.9988 | 2.9000 | 0.0 | 3.0 | 5.0 | 8.0 | 9.0 | 10.0 | 10.0 |
| train_dataset_final | Balance | 2585.0 | 81084.2264 | 61520.0012 | 0.0 | 0.0 | 101487.0 | 128744.0 | 150547.0 | 164361.0 | 250898.0 |
| train_dataset_final | NumOfProducts | 2585.0 | 1.5122 | 0.6729 | 1.0 | 1.0 | 1.0 | 2.0 | 2.0 | 3.0 | 4.0 |
| train_dataset_final | HasCrCard | 2585.0 | 0.6917 | 0.4619 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| train_dataset_final | IsActiveMember | 2585.0 | 0.4723 | 0.4993 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| train_dataset_final | EstimatedSalary | 2585.0 | 100084.7896 | 57898.2146 | 12.0 | 50073.0 | 99504.0 | 151317.0 | 179252.0 | 189373.0 | 199992.0 |
| test_dataset_final | CreditScore | 647.0 | 644.7728 | 99.5921 | 350.0 | 578.0 | 650.0 | 716.0 | 779.0 | 809.0 | 850.0 |
| test_dataset_final | Tenure | 647.0 | 5.1468 | 2.9744 | 0.0 | 3.0 | 5.0 | 8.0 | 9.0 | 10.0 | 10.0 |
| test_dataset_final | Balance | 647.0 | 83586.2161 | 60827.1306 | 0.0 | 0.0 | 105420.0 | 129698.0 | 151837.0 | 165360.0 | 210433.0 |
| test_dataset_final | NumOfProducts | 647.0 | 1.5054 | 0.6604 | 1.0 | 1.0 | 1.0 | 2.0 | 2.0 | 3.0 | 4.0 |
| test_dataset_final | HasCrCard | 647.0 | 0.7094 | 0.4544 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| test_dataset_final | IsActiveMember | 647.0 | 0.4745 | 0.4997 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| test_dataset_final | EstimatedSalary | 647.0 | 101143.3033 | 57297.7411 | 143.0 | 52326.0 | 104153.0 | 149426.0 | 178913.0 | 188084.0 | 199662.0 |
2026-01-28 18:03:14,102 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.DescriptiveStatistics:development_data does not exist in model's document
validmind.data_validation.TabularDescriptionTables:development_data

Tabular Description Tables Development Data

The Descriptive Statistics test evaluates the distributional characteristics and completeness of numerical variables in the train and test datasets. The results present summary statistics, including mean, minimum, maximum, and missing value percentages, for each numerical variable across both datasets. All variables are reported with their respective data types and observation counts, providing a comprehensive overview of the dataset structure and integrity.

Key insights:

  • No missing values across all variables: All numerical variables in both train and test datasets report 0.0% missing values, indicating complete data coverage for these fields.
  • Consistent data types and observation counts: All variables are typed as either int64 or float64, with observation counts matching the dataset sizes (2585 for train, 647 for test), confirming structural consistency.
  • Stable variable ranges between datasets: Minimum and maximum values for variables such as CreditScore, Tenure, Balance, NumOfProducts, HasCrCard, IsActiveMember, EstimatedSalary, and Exited are closely aligned between train and test datasets, with only minor variations.
  • Comparable central tendencies: Mean values for all variables are similar between train and test datasets, with differences remaining within a narrow range (e.g., CreditScore mean: 648.40 in train vs. 644.77 in test; EstimatedSalary mean: 100,084.79 in train vs. 101,143.30 in test).

The descriptive statistics indicate that the numerical variables are complete, consistently typed, and structurally aligned across both train and test datasets. Central tendencies and value ranges are stable, supporting the integrity and comparability of the data for subsequent modeling and analysis. No data quality issues or anomalies are observed in the reported statistics.

Tables

| dataset | Numerical Variable | Num of Obs | Mean | Min | Max | Missing Values (%) | Data Type |
|---|---|---|---|---|---|---|---|
| train_dataset_final | CreditScore | 2585 | 648.3965 | 350.00 | 850.00 | 0.0 | int64 |
| train_dataset_final | Tenure | 2585 | 4.9988 | 0.00 | 10.00 | 0.0 | int64 |
| train_dataset_final | Balance | 2585 | 81084.2264 | 0.00 | 250898.09 | 0.0 | float64 |
| train_dataset_final | NumOfProducts | 2585 | 1.5122 | 1.00 | 4.00 | 0.0 | int64 |
| train_dataset_final | HasCrCard | 2585 | 0.6917 | 0.00 | 1.00 | 0.0 | int64 |
| train_dataset_final | IsActiveMember | 2585 | 0.4723 | 0.00 | 1.00 | 0.0 | int64 |
| train_dataset_final | EstimatedSalary | 2585 | 100084.7896 | 11.58 | 199992.48 | 0.0 | float64 |
| train_dataset_final | Exited | 2585 | 0.4959 | 0.00 | 1.00 | 0.0 | int64 |
| test_dataset_final | CreditScore | 647 | 644.7728 | 350.00 | 850.00 | 0.0 | int64 |
| test_dataset_final | Tenure | 647 | 5.1468 | 0.00 | 10.00 | 0.0 | int64 |
| test_dataset_final | Balance | 647 | 83586.2161 | 0.00 | 210433.08 | 0.0 | float64 |
| test_dataset_final | NumOfProducts | 647 | 1.5054 | 1.00 | 4.00 | 0.0 | int64 |
| test_dataset_final | HasCrCard | 647 | 0.7094 | 0.00 | 1.00 | 0.0 | int64 |
| test_dataset_final | IsActiveMember | 647 | 0.4745 | 0.00 | 1.00 | 0.0 | int64 |
| test_dataset_final | EstimatedSalary | 647 | 101143.3033 | 142.81 | 199661.50 | 0.0 | float64 |
| test_dataset_final | Exited | 647 | 0.5162 | 0.00 | 1.00 | 0.0 | int64 |
2026-01-28 18:03:24,312 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.TabularDescriptionTables:development_data does not exist in model's document
validmind.data_validation.ClassImbalance:development_data

✅ Class Imbalance Development Data

The Class Imbalance test evaluates the distribution of target classes within the training and test datasets to identify potential imbalances that could affect model performance. The results present the percentage representation of each class in both datasets, with a minimum threshold of 10% set for each class to pass. Bar plots visualize the class proportions, supporting interpretation of the class distribution.

Key insights:

  • Balanced class distribution in training data: In the train_dataset_final, class 0 constitutes 50.41% and class 1 constitutes 49.59% of the records, both well above the 10% threshold.
  • Balanced class distribution in test data: In the test_dataset_final, class 1 constitutes 51.62% and class 0 constitutes 48.38% of the records, both exceeding the minimum threshold.
  • All classes pass imbalance criteria: Both classes in both datasets are marked as "Pass" for the class imbalance test, indicating no class falls below the specified threshold.
  • Visual confirmation of near-equal proportions: Bar plots for both datasets show visually similar heights for each class, confirming the quantitative findings.

The results indicate that both the training and test datasets exhibit a balanced distribution of the target classes, with each class representing approximately half of the records. No evidence of class imbalance is observed, and all classes meet the minimum representation criteria. This distribution supports the development and evaluation of models without risk of bias due to class under-representation.

Parameters:

{
  "min_percent_threshold": 10
}
            

Tables

| dataset | Exited | Percentage of Rows (%) | Pass/Fail |
|---|---|---|---|
| train_dataset_final | 0 | 50.41% | Pass |
| train_dataset_final | 1 | 49.59% | Pass |
| test_dataset_final | 1 | 51.62% | Pass |
| test_dataset_final | 0 | 48.38% | Pass |

Figures

ValidMind Figure validmind.data_validation.ClassImbalance:development_data:052d
ValidMind Figure validmind.data_validation.ClassImbalance:development_data:bdef
2026-01-28 18:03:35,823 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.ClassImbalance:development_data does not exist in model's document
validmind.data_validation.UniqueRows:development_data

❌ Unique Rows Development Data

The UniqueRows test evaluates the diversity of each column in the dataset by measuring the proportion of unique values relative to the total number of rows, with a minimum threshold set at 1%. The results table presents the number and percentage of unique values for each column in both the training and test datasets, along with a pass/fail outcome based on the threshold. Columns with a percentage of unique values below 1% are marked as "Fail," while those meeting or exceeding the threshold are marked as "Pass."

Key insights:

  • High uniqueness in continuous variables: Columns such as EstimatedSalary and Balance exhibit high percentages of unique values (100% and 67.7% in training; 100% and 69.4% in test), consistently passing the uniqueness threshold.
  • Low uniqueness in categorical variables: Columns representing categorical or binary features (e.g., HasCrCard, IsActiveMember, Geography_Germany, Geography_Spain, Gender_Male, Exited) have very low percentages of unique values (all below 1%) and fail the test in both datasets.
  • Mixed results for ordinal variables: CreditScore passes the threshold in both datasets (16.5% in training, 47.3% in test), while Tenure passes in the test set (1.7%) but fails in the training set (0.43%).
  • Consistent failure in product count features: NumOfProducts fails the uniqueness threshold in both datasets, with unique value percentages well below 1%.

The results indicate that continuous variables in both training and test datasets demonstrate high diversity, consistently exceeding the uniqueness threshold. In contrast, categorical and binary variables, as well as certain ordinal features, exhibit low uniqueness and do not meet the threshold, resulting in test failures for these columns. This pattern reflects the inherent nature of categorical and binary variables, which are limited in the number of possible unique values, and highlights the distinction in uniqueness profiles between variable types within the dataset.

Parameters:

{
  "min_percent_threshold": 1
}
            

Tables

| dataset | Column | Number of Unique Values | Percentage of Unique Values (%) | Pass/Fail |
|---|---|---|---|---|
| train_dataset_final | CreditScore | 427 | 16.5184 | Pass |
| train_dataset_final | Tenure | 11 | 0.4255 | Fail |
| train_dataset_final | Balance | 1750 | 67.6983 | Pass |
| train_dataset_final | NumOfProducts | 4 | 0.1547 | Fail |
| train_dataset_final | HasCrCard | 2 | 0.0774 | Fail |
| train_dataset_final | IsActiveMember | 2 | 0.0774 | Fail |
| train_dataset_final | EstimatedSalary | 2585 | 100.0000 | Pass |
| train_dataset_final | Geography_Germany | 2 | 0.0774 | Fail |
| train_dataset_final | Geography_Spain | 2 | 0.0774 | Fail |
| train_dataset_final | Gender_Male | 2 | 0.0774 | Fail |
| train_dataset_final | Exited | 2 | 0.0774 | Fail |
| test_dataset_final | CreditScore | 306 | 47.2952 | Pass |
| test_dataset_final | Tenure | 11 | 1.7002 | Pass |
| test_dataset_final | Balance | 449 | 69.3972 | Pass |
| test_dataset_final | NumOfProducts | 4 | 0.6182 | Fail |
| test_dataset_final | HasCrCard | 2 | 0.3091 | Fail |
| test_dataset_final | IsActiveMember | 2 | 0.3091 | Fail |
| test_dataset_final | EstimatedSalary | 647 | 100.0000 | Pass |
| test_dataset_final | Geography_Germany | 2 | 0.3091 | Fail |
| test_dataset_final | Geography_Spain | 2 | 0.3091 | Fail |
| test_dataset_final | Gender_Male | 2 | 0.3091 | Fail |
| test_dataset_final | Exited | 2 | 0.3091 | Fail |
2026-01-28 18:03:51,084 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.UniqueRows:development_data does not exist in model's document
validmind.data_validation.TabularNumericalHistograms:development_data

Tabular Numerical Histograms Development Data

The TabularNumericalHistograms test provides a visual assessment of the distribution of each numerical feature in both the training and test datasets. The resulting histograms display the frequency distribution for each variable, enabling identification of skewness, outliers, and other distributional characteristics. This visualization supports the evaluation of input data quality and the detection of potential risks related to data distribution.

Key insights:

  • CreditScore is concentrated in the mid-to-high range: Both training and test datasets show most CreditScore values between 550 and 750, with a slightly longer tail toward lower scores; this is consistent with the near-identical means and medians reported in the descriptive statistics above.
  • Tenure is nearly uniform except at boundaries: Tenure displays an approximately uniform distribution across most values, with lower frequencies at the minimum and maximum values in both datasets.
  • Balance shows a high frequency at zero: A substantial proportion of records have a zero balance, with the remainder forming a bell-shaped distribution centered around 120,000–130,000.
  • NumOfProducts is heavily concentrated at lower values: The majority of records have one or two products, with very few instances at three or four products.
  • HasCrCard and IsActiveMember are binary with class imbalance: Both features are binary; HasCrCard is skewed toward 1, while IsActiveMember is more evenly split but still shows a higher count for 0 in the training set.
  • EstimatedSalary is uniformly distributed: EstimatedSalary appears to be evenly distributed across its range in both datasets, with no significant skewness or outliers.
  • Geography and Gender features show categorical splits: Geography_Germany and Geography_Spain indicate more records in the 'false' category, while Gender_Male is nearly balanced between true and false.

The histograms reveal that most numerical features exhibit stable and interpretable distributions across both training and test datasets. Notable characteristics include a high proportion of zero values in Balance, a broad but near-symmetric spread in CreditScore, and strong concentration at lower values for NumOfProducts. Binary and categorical features display varying degrees of class imbalance. No extreme outliers or unexpected distributional anomalies are observed, supporting the overall integrity of the input data distributions.

Figures

ValidMind Figure validmind.data_validation.TabularNumericalHistograms:development_data:89f9
ValidMind Figure validmind.data_validation.TabularNumericalHistograms:development_data:d9b4
ValidMind Figure validmind.data_validation.TabularNumericalHistograms:development_data:6d03
ValidMind Figure validmind.data_validation.TabularNumericalHistograms:development_data:b0e0
ValidMind Figure validmind.data_validation.TabularNumericalHistograms:development_data:1c54
ValidMind Figure validmind.data_validation.TabularNumericalHistograms:development_data:b98e
ValidMind Figure validmind.data_validation.TabularNumericalHistograms:development_data:d027
ValidMind Figure validmind.data_validation.TabularNumericalHistograms:development_data:e419
ValidMind Figure validmind.data_validation.TabularNumericalHistograms:development_data:c3ed
ValidMind Figure validmind.data_validation.TabularNumericalHistograms:development_data:fa69
ValidMind Figure validmind.data_validation.TabularNumericalHistograms:development_data:cfe0
ValidMind Figure validmind.data_validation.TabularNumericalHistograms:development_data:6309
ValidMind Figure validmind.data_validation.TabularNumericalHistograms:development_data:e418
ValidMind Figure validmind.data_validation.TabularNumericalHistograms:development_data:0650
ValidMind Figure validmind.data_validation.TabularNumericalHistograms:development_data:af5a
ValidMind Figure validmind.data_validation.TabularNumericalHistograms:development_data:adc3
ValidMind Figure validmind.data_validation.TabularNumericalHistograms:development_data:4795
ValidMind Figure validmind.data_validation.TabularNumericalHistograms:development_data:9593
ValidMind Figure validmind.data_validation.TabularNumericalHistograms:development_data:03c1
ValidMind Figure validmind.data_validation.TabularNumericalHistograms:development_data:d5d9
2026-01-28 18:04:11,858 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.TabularNumericalHistograms:development_data does not exist in model's document
validmind.data_validation.MutualInformation:development_data

Mutual Information Development Data

The Mutual Information test evaluates the statistical dependency between each feature and the target variable to quantify feature relevance for model development. The results are presented as normalized mutual information scores for both the training and test datasets, with a threshold of 0.01 used to distinguish features with minimal information content. Bar plots display the relative importance of each feature, highlighting the distribution and magnitude of mutual information across the feature set.

Key insights:

  • NumOfProducts consistently highest information score: NumOfProducts exhibits the highest mutual information score in both training (≈0.09) and test (≈0.096) datasets, indicating strong relevance to the target variable.
  • Feature importance distribution shifts between datasets: In the training set, Geography_Germany and Balance follow as the next most informative features, while in the test set, Balance, IsActiveMember, and Gender_Male show increased scores and prominence.
  • Several features below information threshold: In the training set, CreditScore, IsActiveMember, Geography_Spain, Gender_Male, Tenure, HasCrCard, and EstimatedSalary all fall below the 0.01 threshold, indicating limited information content. In the test set, only HasCrCard, EstimatedSalary, and Geography_Spain remain below this threshold.
  • Increased information content in test set for multiple features: IsActiveMember, Gender_Male, Geography_Germany, and Tenure display higher mutual information scores in the test set compared to the training set, with all exceeding the 0.01 threshold.

The mutual information analysis reveals a concentrated distribution of feature relevance, with NumOfProducts consistently providing the highest information content across both datasets. The test set demonstrates a broader set of features exceeding the information threshold, suggesting increased relevance for IsActiveMember, Gender_Male, Geography_Germany, and Tenure relative to the training set. Several features, including HasCrCard, EstimatedSalary, and Geography_Spain, consistently exhibit minimal information content, indicating limited direct association with the target variable. The observed shifts in feature importance between datasets highlight potential changes in feature-target relationships that warrant ongoing monitoring.

Parameters:

{
  "min_threshold": 0.01
}
            

Figures

ValidMind Figure validmind.data_validation.MutualInformation:development_data:4321
ValidMind Figure validmind.data_validation.MutualInformation:development_data:990f
2026-01-28 18:04:31,027 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.MutualInformation:development_data does not exist in model's document
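
For a rough, independent look at feature-target dependency, scikit-learn's `mutual_info_classif` computes mutual information scores. The ValidMind test reports normalized scores on the final preprocessed datasets, so the values below will not match exactly; this sketch uses the `balanced_raw_df` built earlier with a simple one-hot encoding:

```python
# Illustrative sketch only -- un-normalized scores on a simple one-hot encoding
# of the balanced raw data; not the ValidMind implementation.
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

features = pd.get_dummies(balanced_raw_df.drop(columns=["Exited"]), drop_first=True)
scores = mutual_info_classif(features, balanced_raw_df["Exited"], random_state=42)

mi_scores = pd.Series(scores, index=features.columns).sort_values(ascending=False)
print(mi_scores)
```
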
validmind.data_validation.PearsonCorrelationMatrix:development_data

Pearson Correlation Matrix Development Data

The Pearson Correlation Matrix test evaluates the linear dependency between all pairs of numerical variables in the dataset, providing a matrix of Pearson correlation coefficients visualized as a heatmap. The results display the correlation structure for both the training and test datasets, with coefficients ranging from -1 to 1, and highlight any high correlations (absolute value > 0.7) for further review. The heatmaps show the magnitude and direction of relationships between variables, allowing for identification of potential redundancy or multicollinearity.

Key insights:

  • No high correlations detected: All pairwise correlation coefficients in both training and test datasets are below the 0.7 threshold, indicating the absence of strong linear dependencies among variables.
  • Consistent correlation structure across splits: The correlation patterns and magnitudes are stable between the training and test datasets, with the highest observed correlations (e.g., Balance and Geography_Germany, approximately 0.41–0.42) remaining moderate and consistent.
  • Low risk of multicollinearity: The lack of high-magnitude correlations suggests that the variables are largely independent, minimizing the risk of redundancy or instability in model estimation.

The correlation analysis reveals a stable and moderate correlation structure across both training and test datasets, with no evidence of strong linear relationships or multicollinearity among the variables. This indicates that the dataset maintains a high degree of variable independence, supporting reliable model estimation and interpretability.

Figures

ValidMind Figure validmind.data_validation.PearsonCorrelationMatrix:development_data:90b5
ValidMind Figure validmind.data_validation.PearsonCorrelationMatrix:development_data:7384
2026-01-28 18:04:43,647 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.PearsonCorrelationMatrix:development_data does not exist in model's document
validmind.data_validation.HighPearsonCorrelation:development_data

❌ High Pearson Correlation Development Data

The High Pearson Correlation test identifies pairs of features in the dataset that exhibit strong linear relationships, with the aim of detecting potential feature redundancy or multicollinearity. The results present the top ten absolute Pearson correlation coefficients for both the training and test datasets, indicating whether each pair exceeds the predefined threshold of 0.3. Correlation coefficients, feature pairs, and pass/fail status are reported for each dataset, highlighting the most significant relationships observed.

Key insights:

  • Two feature pairs exceed correlation threshold: In both the training and test datasets, the pairs (Balance, Geography_Germany) and (Geography_Germany, Geography_Spain) display absolute correlation coefficients above the 0.3 threshold, with values ranging from 0.3627 to 0.421. These pairs are flagged as "Fail" by the test.
  • Consistent correlation structure across datasets: The same feature pairs exceed the threshold in both the training and test datasets, with similar coefficient magnitudes, indicating stable correlation patterns between these variables.
  • All other feature pairs below threshold: The remaining top correlations in both datasets have absolute coefficients below 0.3, resulting in a "Pass" status and indicating no further high linear relationships among other feature pairs.

The test results reveal that only two feature pairs—(Balance, Geography_Germany) and (Geography_Germany, Geography_Spain)—exhibit moderate linear correlations exceeding the specified threshold in both the training and test datasets. The consistency of these correlations across datasets suggests a stable relationship between these variables. All other feature pairs remain below the threshold, indicating limited risk of widespread multicollinearity or feature redundancy within the evaluated features.

Parameters:

{
  "max_threshold": 0.3,
  "top_n_correlations": 10
}
            

Tables

| dataset | Columns | Coefficient | Pass/Fail |
|---|---|---|---|
| train_dataset_final | (Balance, Geography_Germany) | 0.4147 | Fail |
| train_dataset_final | (Geography_Germany, Geography_Spain) | -0.3627 | Fail |
| train_dataset_final | (IsActiveMember, Exited) | -0.1977 | Pass |
| train_dataset_final | (Geography_Germany, Exited) | 0.1931 | Pass |
| train_dataset_final | (Balance, NumOfProducts) | -0.1903 | Pass |
| train_dataset_final | (Balance, Geography_Spain) | -0.1624 | Pass |
| train_dataset_final | (Balance, Exited) | 0.1441 | Pass |
| train_dataset_final | (Gender_Male, Exited) | -0.1271 | Pass |
| train_dataset_final | (Tenure, EstimatedSalary) | 0.0660 | Pass |
| train_dataset_final | (NumOfProducts, Exited) | -0.0594 | Pass |
| test_dataset_final | (Balance, Geography_Germany) | 0.4210 | Fail |
| test_dataset_final | (Geography_Germany, Geography_Spain) | -0.3787 | Fail |
| test_dataset_final | (Geography_Germany, Exited) | 0.2500 | Pass |
| test_dataset_final | (IsActiveMember, Exited) | -0.2445 | Pass |
| test_dataset_final | (Balance, Exited) | 0.1940 | Pass |
| test_dataset_final | (Balance, Geography_Spain) | -0.1718 | Pass |
| test_dataset_final | (Gender_Male, Exited) | -0.1635 | Pass |
| test_dataset_final | (Balance, NumOfProducts) | -0.1341 | Pass |
| test_dataset_final | (HasCrCard, Geography_Germany) | 0.1086 | Pass |
| test_dataset_final | (Tenure, HasCrCard) | 0.0981 | Pass |
2026-01-28 18:04:52,126 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.HighPearsonCorrelation:development_data does not exist in model's document
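
The high-correlation check amounts to ranking the absolute pairwise Pearson coefficients and flagging pairs above the threshold. A rough pandas/NumPy sketch is shown below; it is illustrative only and runs on the balanced raw data rather than the final train/test splits used by the test:

```python
# Illustrative sketch only -- not the ValidMind implementation.
import numpy as np

max_threshold = 0.3  # mirrors the test parameter

corr = balanced_raw_df.select_dtypes("number").corr(method="pearson")
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))  # keep each pair once
pairs = upper.stack().rename("Coefficient").reset_index()
pairs["Pass/Fail"] = np.where(pairs["Coefficient"].abs() <= max_threshold, "Pass", "Fail")

top_10 = pairs.loc[pairs["Coefficient"].abs().sort_values(ascending=False).index].head(10)
print(top_10)
```
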
validmind.model_validation.ModelMetadata

Model Metadata

The ModelMetadata test compares the metadata of different models to assess consistency in architecture, framework, framework version, and programming language. The summary table presents metadata fields for each model, including modeling technique, framework, version, and programming language. Both models, log_model_champion and rf_model, are included in the comparison, with all relevant metadata fields populated.

Key insights:

  • Consistent modeling technique and framework: Both models use the SKlearnModel technique and the sklearn framework.
  • Identical framework versions: Framework Version 1.8.0 is used for both models.
  • Uniform programming language: Python is the programming language for both models.

The metadata comparison indicates complete alignment across all evaluated fields for the two models. No inconsistencies or missing metadata are observed, and the models share the same architecture, framework, version, and programming language. This uniformity supports ease of integration and management within the modeling environment.

Tables

| model | Modeling Technique | Modeling Framework | Framework Version | Programming Language |
|---|---|---|---|---|
| log_model_champion | SKlearnModel | sklearn | 1.8.0 | Python |
| rf_model | SKlearnModel | sklearn | 1.8.0 | Python |
2026-01-28 18:04:55,917 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.model_validation.ModelMetadata does not exist in model's document
validmind.model_validation.sklearn.ModelParameters

Model Parameters

The Model Parameters test extracts and displays all configuration parameters for each model to ensure transparency and reproducibility. The results present a structured table listing parameter names and their corresponding values for both the logistic regression model (log_model_champion) and the random forest model (rf_model). Each parameter setting is shown explicitly, providing a comprehensive view of the model configurations at the time of testing.

Key insights:

  • Explicit parameterization for both models: All parameters for log_model_champion and rf_model are listed, including regularization, solver, and iteration settings for the logistic regression model, and tree construction, sampling, and random state settings for the random forest model.
  • No missing or undefined parameters: Every parameter extracted via the scikit-learn API is present in the results, with no missing or null values observed.
  • Use of non-default and default values: The logistic regression model specifies penalty='l1' and solver='liblinear', while the random forest model uses n_estimators=50 and random_state=42, indicating a mix of explicitly set and default parameter values.

The extracted parameter set provides a complete and transparent record of model configuration for both the logistic regression and random forest models. All relevant parameters are captured, supporting reproducibility and facilitating future audits or validation efforts. The results indicate that both models are fully parameterized with no missing configuration elements.

Tables

| model | Parameter | Value |
|---|---|---|
| log_model_champion | C | 1 |
| log_model_champion | dual | False |
| log_model_champion | fit_intercept | True |
| log_model_champion | intercept_scaling | 1 |
| log_model_champion | max_iter | 100 |
| log_model_champion | penalty | l1 |
| log_model_champion | solver | liblinear |
| log_model_champion | tol | 0.0001 |
| log_model_champion | verbose | 0 |
| log_model_champion | warm_start | False |
| rf_model | bootstrap | True |
| rf_model | ccp_alpha | 0.0 |
| rf_model | criterion | gini |
| rf_model | max_features | sqrt |
| rf_model | min_impurity_decrease | 0.0 |
| rf_model | min_samples_leaf | 1 |
| rf_model | min_samples_split | 2 |
| rf_model | min_weight_fraction_leaf | 0.0 |
| rf_model | n_estimators | 50 |
| rf_model | oob_score | False |
| rf_model | random_state | 42 |
| rf_model | verbose | 0 |
| rf_model | warm_start | False |
2026-01-28 18:05:03,226 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.model_validation.sklearn.ModelParameters does not exist in model's document
validmind.model_validation.sklearn.ROCCurve

ROC Curve

The ROC Curve test evaluates the binary classification performance of the log_model_champion by plotting the Receiver Operating Characteristic (ROC) curve and calculating the Area Under the Curve (AUC) for both the training and test datasets. The resulting plots display the trade-off between the true positive rate and false positive rate at various thresholds, with the AUC quantifying the model's ability to distinguish between classes. The ROC curves and corresponding AUC values are presented for both the train_dataset_final and test_dataset_final.

Key insights:

  • AUC indicates moderate discriminative power on training data: The AUC for the train_dataset_final is 0.68, reflecting moderate ability to separate positive and negative classes.
  • Improved AUC on test data: The AUC for the test_dataset_final is 0.73, indicating stronger discriminative performance on unseen data compared to the training set.
  • ROC curves consistently above random baseline: Both ROC curves remain above the random classifier line (AUC = 0.5), demonstrating that the model provides meaningful class separation across thresholds.

The ROC Curve test results show that log_model_champion achieves moderate to good discrimination between classes, with AUC values of 0.68 on the training set and 0.73 on the test set. The model's performance is consistently above the random baseline, and the higher AUC on the test data suggests effective generalization to new samples. No indications of model collapse or lack of discriminative ability are observed in these results.

Figures

ValidMind Figure validmind.model_validation.sklearn.ROCCurve:364c
ValidMind Figure validmind.model_validation.sklearn.ROCCurve:5bfd
2026-01-28 18:05:14,617 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.model_validation.sklearn.ROCCurve does not exist in model's document
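
For reference, an ROC curve and its AUC can be reproduced directly with scikit-learn and matplotlib. The sketch below fits a quick throwaway logistic regression on a one-hot encoding of the balanced raw data; it is not the champion model evaluated above and will not reproduce the reported AUC values:

```python
# Illustrative sketch only -- a throwaway model, not the champion model above.
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.model_selection import train_test_split

X = pd.get_dummies(balanced_raw_df.drop(columns=["Exited"]), drop_first=True)
y = balanced_raw_df["Exited"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_score = clf.predict_proba(X_test)[:, 1]  # probability of the positive class

fpr, tpr, _ = roc_curve(y_test, y_score)
auc = roc_auc_score(y_test, y_score)

plt.plot(fpr, tpr, label=f"ROC curve (AUC = {auc:.2f})")
plt.plot([0, 1], [0, 1], linestyle="--", label="Random classifier (AUC = 0.5)")
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()
```
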
validmind.model_validation.sklearn.MinimumROCAUCScore

✅ Minimum ROCAUC Score

The Minimum ROC AUC Score test evaluates whether the model's ROC AUC score meets or exceeds a specified minimum threshold, providing an assessment of the model's ability to distinguish between classes. The results table presents ROC AUC scores for both the training and test datasets, alongside the applied threshold and the corresponding pass/fail outcome. Both datasets are evaluated against a minimum threshold of 0.5, with the observed scores and test outcomes reported for each.

Key insights:

  • ROC AUC scores exceed threshold on both datasets: The training (0.6766) and test (0.7271) datasets both achieve ROC AUC scores above the minimum threshold of 0.5.
  • Test dataset shows higher discriminative performance: The ROC AUC score on the test dataset (0.7271) is higher than on the training dataset (0.6766), indicating improved class separation on unseen data.
  • Test passed for both evaluated datasets: The pass/fail status is "Pass" for both datasets, confirming that the model meets the minimum ROC AUC requirement across splits.

The results indicate that the model demonstrates adequate discriminative capability on both training and test datasets, with ROC AUC scores consistently above the defined threshold. The higher score on the test dataset suggests stable or improved generalization performance, and the model satisfies the minimum ROC AUC criterion across all evaluated data partitions.

Parameters:

{
  "min_threshold": 0.5
}

Tables

Dataset               Score    Threshold   Pass/Fail
train_dataset_final   0.6766   0.5         Pass
test_dataset_final    0.7271   0.5         Pass
2026-01-28 18:05:20,533 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.model_validation.sklearn.MinimumROCAUCScore does not exist in model's document
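Threshold tests such as this one also accept parameters. As a hedged sketch, the result above could be reproduced by passing min_threshold explicitly; again, vm_log_model, vm_train_ds, and vm_test_ds are assumed from the earlier setup in this series.

# Illustrative sketch — check that the champion model's ROC AUC clears the
# 0.5 threshold on both splits, then log the pass/fail table as evidence.
result = vm.tests.run_test(
    "validmind.model_validation.sklearn.MinimumROCAUCScore",
    input_grid={
        "model": [vm_log_model],
        "dataset": [vm_train_ds, vm_test_ds],
    },
    params={"min_threshold": 0.5},
)
result.log()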

In summary

In this final notebook, you learned how to run supplementary validation tests against the champion and challenger models, interpret their results, and log them as additional evidence for your validation report.

With our ValidMind for model validation series of notebooks, you learned how to validate a model end-to-end with the ValidMind Library by running through some common scenarios in a typical model validation setting:

  • Verifying the data quality steps performed by the model development team
  • Independently replicating the champion model's results and conducting additional tests to assess performance, stability, and robustness
  • Setting up test inputs and a challenger model for comparative analysis
  • Running validation tests, analyzing results, and logging artifacts to ValidMind

Next steps

Work with your validation report

Now that you've logged all your test results and verified the work done by the model development team, head to the ValidMind Platform to wrap up your validation report. Continue working on it by:

  • Inserting additional test results: Click Link Evidence to Report under any section of 2. Validation in your validation report. (Learn more: Link evidence to reports)

  • Making qualitative edits to your test descriptions: Expand any linked evidence under Validator Evidence and click See evidence details to review and edit the ValidMind-generated test descriptions for quality and accuracy. (Learn more: Preparing validation reports)

  • Adding more findings: Click Link Finding to Report in any validation report section, then click + Create New Finding. (Learn more: Add and manage model findings)

  • Adding risk assessment notes: Click under Risk Assessment Notes in any validation report section to access the text editor and content editing toolbar, including an option to generate a draft with AI. Once generated, edit the draft so it adheres to your organization's requirements. (Learn more: Work with content blocks)

  • Assessing compliance: Under the Guideline for any validation report section, click ASSESSMENT and select the compliance status from the drop-down menu. (Learn more: Provide compliance assessments)

  • Collaborating with other stakeholders: Use the ValidMind Platform's real-time collaborative features to work seamlessly with the rest of your organization, including model developers. Propose suggested changes, work with versioned history, and use comments to discuss specific portions of the model documentation. (Learn more: Collaborate with others)

When your validation report is complete and ready for review, submit it for approval from the same ValidMind Platform where you made your edits and collaborated with the rest of your organization, ensuring transparency and a thorough model validation history. (Learn more: Submit for approval)

Learn more

Now that you're familiar with the basics, you can explore the following notebooks to get a deeper understanding of how the ValidMind Library assists you in streamlining model validation:

More how-to guides and code samples

Discover more learning resources

All notebook samples can be found in the following directories of the ValidMind Library GitHub repository:

Or, visit our documentation to learn more about ValidMind.


Copyright © 2023-2026 ValidMind Inc. All rights reserved.
Refer to LICENSE for details.
SPDX-License-Identifier: AGPL-3.0 AND ValidMind Commercial