ValidMind for model validation 4 — Finalize testing and reporting
Learn how to use ValidMind for your end-to-end model validation process with our series of four introductory notebooks. In this last notebook, finalize the compliance assessment process and have a complete validation report ready for review.
This notebook will walk you through how to supplement ValidMind tests with your own custom tests and include them as additional evidence in your validation report. A custom test is any function that takes a set of inputs and parameters as arguments and returns one or more outputs:
The function can be as simple or as complex as you need it to be — it can use external libraries, make API calls, or do anything else that you can do in Python.
The only requirement is that the function signature and return values can be "understood" and handled by the ValidMind Library. As such, custom tests offer added flexibility by extending the default tests provided by ValidMind, enabling you to document any type of model or use case.
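As a quick illustration of that shape, here is a minimal, hypothetical sketch of a custom test; the test ID, function name, and parameter are illustrative only, and a full working example is built later in this notebook:

import pandas as pd
import validmind as vm


@vm.test("my_custom_tests.PositiveClassRate")  # hypothetical test ID, for illustration only
def positive_class_rate(dataset, positive_label=1):
    """Report how often the positive class appears in the dataset's target column."""
    # `dataset.y` exposes the true target values of a ValidMind dataset object
    rate = float((dataset.y == positive_label).mean())

    # Returning a DataFrame renders the result as a table alongside the test
    return pd.DataFrame({"Positive label": [positive_label], "Rate": [rate]})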
For a more in-depth introduction to custom tests, refer to our Implement custom tests notebook.
Learn by doing
Our course tailor-made for validators new to ValidMind combines this series of notebooks with a more in-depth introduction to the ValidMind Platform — Validator Fundamentals.
Prerequisites
In order to finalize validation and reporting, you'll need to first have:
Need help with the above steps?
Refer to the first three notebooks in this series:
# Make sure the ValidMind Library is installed
%pip install -q validmind

# Load your model identifier credentials from an `.env` file
%load_ext dotenv
%dotenv .env

# Or replace with your code snippet
import validmind as vm

vm.init(
    # api_host="...",
    # api_key="...",
    # api_secret="...",
    # model="...",
)
Note: you may need to restart the kernel to use updated packages.
2026-01-28 17:58:32,957 - INFO(validmind.api_client): 🎉 Connected to ValidMind!
📊 Model: [ValidMind Academy] Model validation (ID: cmalguc9y02ok199q2db381ib)
📁 Document Type: validation_report
Import the sample dataset
Next, we'll load the same sample Bank Customer Churn Prediction dataset that was used to develop the champion model, which we will independently preprocess:
# Load the sample dataset
from validmind.datasets.classification import customer_churn as demo_dataset

print(f"Loaded demo dataset with: \n\n\t• Target column: '{demo_dataset.target_column}' \n\t• Class labels: {demo_dataset.class_labels}")

raw_df = demo_dataset.load_data()
Loaded demo dataset with:
• Target column: 'Exited'
• Class labels: {'0': 'Did not exit', '1': 'Exited'}
# Initialize the raw dataset for use in ValidMind tests
vm_raw_dataset = vm.init_dataset(
    dataset=raw_df,
    input_id="raw_dataset",
    target_column="Exited",
)
import pandas as pd

raw_copy_df = raw_df.sample(frac=1)  # Create a copy of the raw dataset

# Create a balanced dataset with the same number of exited and not exited customers
exited_df = raw_copy_df.loc[raw_copy_df["Exited"] == 1]
not_exited_df = raw_copy_df.loc[raw_copy_df["Exited"] == 0].sample(n=exited_df.shape[0])

balanced_raw_df = pd.concat([exited_df, not_exited_df])
balanced_raw_df = balanced_raw_df.sample(frac=1, random_state=42)
Let’s also quickly remove highly correlated features from the dataset using the output from a ValidMind test:
# Register new data and now 'balanced_raw_dataset' is the new dataset object of interest
vm_balanced_raw_dataset = vm.init_dataset(
    dataset=balanced_raw_df,
    input_id="balanced_raw_dataset",
    target_column="Exited",
)
# Run HighPearsonCorrelation test with our balanced dataset as input and return a result object
corr_result = vm.tests.run_test(
    test_id="validmind.data_validation.HighPearsonCorrelation",
    params={"max_threshold": 0.3},
    inputs={"dataset": vm_balanced_raw_dataset},
)
❌ High Pearson Correlation
The High Pearson Correlation test identifies pairs of features in the dataset that exhibit strong linear relationships, with the aim of detecting potential feature redundancy or multicollinearity. The results table presents the top ten feature pairs ranked by the absolute value of their Pearson correlation coefficients, along with a Pass or Fail status based on a threshold of 0.3. Only one feature pair exceeds the threshold, while the remaining pairs display lower correlation values and pass the test.
Key insights:
One feature pair exceeds correlation threshold: The pair (Age, Exited) has a correlation coefficient of 0.3467, surpassing the 0.3 threshold and resulting in a Fail status.
All other feature pairs below threshold: The remaining nine feature pairs have absolute correlation coefficients ranging from 0.207 to 0.0367, all below the 0.3 threshold and marked as Pass.
Predominantly weak linear relationships: Most feature pairs exhibit weak linear associations, with coefficients clustered near zero.
The test results indicate that the dataset contains minimal evidence of strong linear relationships among most feature pairs, with only the (Age, Exited) pair exceeding the specified correlation threshold. The overall correlation structure suggests a low risk of widespread multicollinearity or feature redundancy, as the majority of feature pairs demonstrate weak linear dependencies.
Parameters:
{
"max_threshold": 0.3
}
Tables
| Columns | Coefficient | Pass/Fail |
|---|---|---|
| (Age, Exited) | 0.3467 | Fail |
| (IsActiveMember, Exited) | -0.2070 | Pass |
| (Balance, NumOfProducts) | -0.1793 | Pass |
| (Balance, Exited) | 0.1542 | Pass |
| (NumOfProducts, Exited) | -0.0577 | Pass |
| (Age, Balance) | 0.0550 | Pass |
| (Tenure, EstimatedSalary) | 0.0531 | Pass |
| (NumOfProducts, IsActiveMember) | 0.0485 | Pass |
| (Age, NumOfProducts) | -0.0394 | Pass |
| (Tenure, Balance) | -0.0367 | Pass |
# From result object, extract table from `corr_result.tables`
features_df = corr_result.tables[0].data
features_df
|   | Columns | Coefficient | Pass/Fail |
|---|---|---|---|
| 0 | (Age, Exited) | 0.3467 | Fail |
| 1 | (IsActiveMember, Exited) | -0.2070 | Pass |
| 2 | (Balance, NumOfProducts) | -0.1793 | Pass |
| 3 | (Balance, Exited) | 0.1542 | Pass |
| 4 | (NumOfProducts, Exited) | -0.0577 | Pass |
| 5 | (Age, Balance) | 0.0550 | Pass |
| 6 | (Tenure, EstimatedSalary) | 0.0531 | Pass |
| 7 | (NumOfProducts, IsActiveMember) | 0.0485 | Pass |
| 8 | (Age, NumOfProducts) | -0.0394 | Pass |
| 9 | (Tenure, Balance) | -0.0367 | Pass |
# Extract list of features that failed the test
high_correlation_features = features_df[features_df["Pass/Fail"] == "Fail"]["Columns"].tolist()
high_correlation_features
['(Age, Exited)']
# Extract feature names from the list of strings
high_correlation_features = [feature.split(",")[0].strip("()") for feature in high_correlation_features]
high_correlation_features
['Age']
# Remove the highly correlated features from the dataset
balanced_raw_no_age_df = balanced_raw_df.drop(columns=high_correlation_features)

# Re-initialize the dataset object
vm_raw_dataset_preprocessed = vm.init_dataset(
    dataset=balanced_raw_no_age_df,
    input_id="raw_dataset_preprocessed",
    target_column="Exited",
)
# Re-run the test with the reduced feature set
corr_result = vm.tests.run_test(
    test_id="validmind.data_validation.HighPearsonCorrelation",
    params={"max_threshold": 0.3},
    inputs={"dataset": vm_raw_dataset_preprocessed},
)
✅ High Pearson Correlation
The High Pearson Correlation test evaluates the linear relationships between feature pairs to identify potential redundancy or multicollinearity. The results table presents the top ten absolute Pearson correlation coefficients among feature pairs, each accompanied by a Pass/Fail status based on a threshold of 0.3. All reported coefficients are below the threshold, and all feature pairs received a Pass status.
Key insights:
No high correlations detected: All absolute Pearson correlation coefficients are below the 0.3 threshold, with the highest magnitude observed at 0.207 between IsActiveMember and Exited.
Consistent Pass status across all pairs: Every feature pair in the top ten correlations received a Pass, indicating no evidence of strong linear relationships among the evaluated features.
Low to moderate relationships observed: The reported coefficients range from -0.207 to 0.1542, reflecting only weak to very weak linear associations between the examined feature pairs.
The test results indicate an absence of strong linear dependencies among the evaluated features, with all pairwise correlations falling well below the specified threshold. This suggests a low risk of feature redundancy or multicollinearity within the dataset based on linear relationships, supporting the interpretability and stability of subsequent modeling efforts.
Parameters:
{
"max_threshold": 0.3
}
Tables
| Columns | Coefficient | Pass/Fail |
|---|---|---|
| (IsActiveMember, Exited) | -0.2070 | Pass |
| (Balance, NumOfProducts) | -0.1793 | Pass |
| (Balance, Exited) | 0.1542 | Pass |
| (NumOfProducts, Exited) | -0.0577 | Pass |
| (Tenure, EstimatedSalary) | 0.0531 | Pass |
| (NumOfProducts, IsActiveMember) | 0.0485 | Pass |
| (Tenure, Balance) | -0.0367 | Pass |
| (Tenure, IsActiveMember) | -0.0360 | Pass |
| (Tenure, Exited) | -0.0265 | Pass |
| (HasCrCard, IsActiveMember) | -0.0260 | Pass |
Split the preprocessed dataset
With our raw dataset rebalanced and highly correlated features removed, let's now split our dataset into train and test sets in preparation for model evaluation testing:
# Encode categorical features in the dataset
balanced_raw_no_age_df = pd.get_dummies(
    balanced_raw_no_age_df, columns=["Geography", "Gender"], drop_first=True
)
balanced_raw_no_age_df.head()
|      | CreditScore | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited | Geography_Germany | Geography_Spain | Gender_Male |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 5613 | 558 | 7 | 121235.05 | 2 | 1 | 1 | 116253.10 | 0 | False | False | False |
| 4434 | 432 | 2 | 135559.80 | 2 | 1 | 1 | 71856.30 | 0 | True | False | True |
| 6484 | 769 | 6 | 117852.26 | 2 | 1 | 0 | 147668.64 | 0 | False | False | False |
| 5796 | 762 | 10 | 168920.75 | 1 | 1 | 0 | 31445.03 | 1 | False | True | False |
| 1668 | 460 | 7 | 0.00 | 2 | 1 | 0 | 156150.08 | 1 | False | False | False |
from sklearn.model_selection import train_test_split

# Split the dataset into train and test
train_df, test_df = train_test_split(balanced_raw_no_age_df, test_size=0.20)

X_train = train_df.drop("Exited", axis=1)
y_train = train_df["Exited"]
X_test = test_df.drop("Exited", axis=1)
y_test = test_df["Exited"]
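Before predictions can be assigned to them below, the split DataFrames also need to be registered as ValidMind dataset objects. A minimal sketch of that step, assuming the input IDs train_dataset_final and test_dataset_final that appear in the test result descriptions later in this notebook:

# Register the split datasets with ValidMind
# (the input_id values are assumptions inferred from the result descriptions below)
vm_train_ds = vm.init_dataset(
    dataset=train_df,
    input_id="train_dataset_final",
    target_column="Exited",
)
vm_test_ds = vm.init_dataset(
    dataset=test_df,
    input_id="test_dataset_final",
    target_column="Exited",
)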
With our raw dataset assessed and preprocessed, let's go ahead and import the champion model submitted by the model development team as a .pkl file: lr_model_champion.pkl
# Import the champion model
import pickle as pkl

with open("lr_model_champion.pkl", "rb") as f:
    log_reg = pkl.load(f)
/opt/hostedtoolcache/Python/3.11.14/x64/lib/python3.11/site-packages/sklearn/base.py:463: InconsistentVersionWarning:
Trying to unpickle estimator LogisticRegression from version 1.3.2 when using version 1.8.0. This might lead to breaking code or invalid results. Use at your own risk. For more info please refer to:
https://scikit-learn.org/stable/model_persistence.html#security-maintainability-limitations
Train potential challenger model
We'll also train our random forest classification challenger model to see how it compares:
# Import the Random Forest Classification model
from sklearn.ensemble import RandomForestClassifier

# Create the model instance with 50 decision trees
rf_model = RandomForestClassifier(
    n_estimators=50,
    random_state=42,
)

# Train the model
rf_model.fit(X_train, y_train)
In addition to the initialized datasets, you'll also need to initialize a ValidMind model object (vm_model) for each of our two models, so that they can be passed to other functions for analysis and tests on the data:
# Initialize the champion logistic regression model
vm_log_model = vm.init_model(
    log_reg,
    input_id="log_model_champion",
)

# Initialize the challenger random forest classification model
vm_rf_model = vm.init_model(
    rf_model,
    input_id="rf_model",
)
# Assign predictions to Champion — Logistic regression model
vm_train_ds.assign_predictions(model=vm_log_model)
vm_test_ds.assign_predictions(model=vm_log_model)

# Assign predictions to Challenger — Random forest classification model
vm_train_ds.assign_predictions(model=vm_rf_model)
vm_test_ds.assign_predictions(model=vm_rf_model)
2026-01-28 17:58:51,388 - INFO(validmind.vm_models.dataset.utils): Running predict_proba()... This may take a while
2026-01-28 17:58:51,390 - INFO(validmind.vm_models.dataset.utils): Done running predict_proba()
2026-01-28 17:58:51,390 - INFO(validmind.vm_models.dataset.utils): Running predict()... This may take a while
2026-01-28 17:58:51,392 - INFO(validmind.vm_models.dataset.utils): Done running predict()
2026-01-28 17:58:51,394 - INFO(validmind.vm_models.dataset.utils): Running predict_proba()... This may take a while
2026-01-28 17:58:51,395 - INFO(validmind.vm_models.dataset.utils): Done running predict_proba()
2026-01-28 17:58:51,396 - INFO(validmind.vm_models.dataset.utils): Running predict()... This may take a while
2026-01-28 17:58:51,397 - INFO(validmind.vm_models.dataset.utils): Done running predict()
2026-01-28 17:58:51,399 - INFO(validmind.vm_models.dataset.utils): Running predict_proba()... This may take a while
2026-01-28 17:58:51,419 - INFO(validmind.vm_models.dataset.utils): Done running predict_proba()
2026-01-28 17:58:51,420 - INFO(validmind.vm_models.dataset.utils): Running predict()... This may take a while
2026-01-28 17:58:51,442 - INFO(validmind.vm_models.dataset.utils): Done running predict()
2026-01-28 17:58:51,445 - INFO(validmind.vm_models.dataset.utils): Running predict_proba()... This may take a while
2026-01-28 17:58:51,457 - INFO(validmind.vm_models.dataset.utils): Done running predict_proba()
2026-01-28 17:58:51,458 - INFO(validmind.vm_models.dataset.utils): Running predict()... This may take a while
2026-01-28 17:58:51,469 - INFO(validmind.vm_models.dataset.utils): Done running predict()
Implementing custom tests
Thanks to the model documentation, we know that the model development team implemented a custom test to further evaluate the performance of the champion model.
In a typical model validation situation, you would load a saved custom test provided by the model development team. In the following section, we'll have you implement the same custom test and make it available for reuse, to familiarize you with the process.
Let's implement the same custom inline test that calculates the confusion matrix for a binary classification model that the model development team used in their performance evaluations.
An inline test refers to a test written and executed within the same environment as the code being tested — in this case, right in this Jupyter Notebook — without requiring a separate test file or framework.
You'll note that the custom test function is just a regular Python function that can include and require any Python library as you see fit.
Create a confusion matrix plot
Let's first create a confusion matrix plot using the confusion_matrix function from the sklearn.metrics module:
import matplotlib.pyplot as plt
from sklearn import metrics

# Get the predicted classes
y_pred = log_reg.predict(vm_test_ds.x)

confusion_matrix = metrics.confusion_matrix(y_test, y_pred)

cm_display = metrics.ConfusionMatrixDisplay(
    confusion_matrix=confusion_matrix, display_labels=[False, True]
)
cm_display.plot()
Next, create a @vm.test wrapper that will allow you to create a reusable test. Note the following changes in the code below:
The function confusion_matrix takes two arguments, dataset and model. These are a VMDataset and a VMModel object, respectively.
VMDataset objects allow you to access the dataset's true (target) values by accessing the .y attribute.
VMDataset objects allow you to access the predictions for a given model by accessing the .y_pred() method.
The function docstring provides a description of what the test does. This will be displayed along with the result in this notebook as well as in the ValidMind Platform.
The function body calculates the confusion matrix using the sklearn.metrics.confusion_matrix function as we just did above.
The function then returns the ConfusionMatrixDisplay.figure_ object — this is important as the ValidMind Library expects the output of the custom test to be a plot or a table.
The @vm.test decorator is doing the work of creating a wrapper around the function that will allow it to be run by the ValidMind Library. It also registers the test so it can be found by the ID my_custom_tests.ConfusionMatrix.
@vm.test("my_custom_tests.ConfusionMatrix")def confusion_matrix(dataset, model):"""The confusion matrix is a table that is often used to describe the performance of a classification model on a set of data for which the true values are known. The confusion matrix is a 2x2 table that contains 4 values: - True Positive (TP): the number of correct positive predictions - True Negative (TN): the number of correct negative predictions - False Positive (FP): the number of incorrect positive predictions - False Negative (FN): the number of incorrect negative predictions The confusion matrix can be used to assess the holistic performance of a classification model by showing the accuracy, precision, recall, and F1 score of the model on a single figure. """ y_true = dataset.y y_pred = dataset.y_pred(model=model) confusion_matrix = metrics.confusion_matrix(y_true, y_pred) cm_display = metrics.ConfusionMatrixDisplay( confusion_matrix=confusion_matrix, display_labels=[False, True] ) cm_display.plot() plt.close() # close the plot to avoid displaying itreturn cm_display.figure_ # return the figure object itself
You can now run the newly created custom test on both the training and test datasets for both models using the run_test() function:
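A hedged sketch of what that call could look like for the champion model; the result ID suffix :champion matches the log output below, and both datasets are passed via input_grid:

# Champion with train and test datasets (sketch; exact inputs are an assumption)
vm.tests.run_test(
    test_id="my_custom_tests.ConfusionMatrix:champion",
    input_grid={
        "dataset": [vm_train_ds, vm_test_ds],
        "model": [vm_log_model],
    },
).log()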
The Confusion Matrix test evaluates the classification performance of the model by comparing predicted and true labels for both the training and test datasets. The resulting matrices display the counts of true positives, true negatives, false positives, and false negatives, providing a comprehensive view of the model's prediction accuracy and error types. The first matrix corresponds to the training dataset, while the second matrix summarizes results for the test dataset.
Key insights:
Balanced true positive and true negative rates: Both training and test datasets show similar counts for true positives and true negatives (train: 794 TP, 835 TN; test: 215 TP, 215 TN), indicating consistent model performance across classes.
Moderate false positive and false negative rates: The number of false positives and false negatives is comparable within each dataset (train: 488 FP, 488 FN; test: 98 FP, 119 FN), suggesting balanced misclassification rates.
Consistent performance across datasets: The relative proportions of each confusion matrix cell are similar between training and test datasets, indicating stable generalization from training to test data.
The confusion matrix results demonstrate that the model maintains balanced classification performance across both training and test datasets, with similar rates of correct and incorrect predictions for each class. The observed symmetry in true and false prediction counts suggests that the model does not exhibit a strong bias toward either class and generalizes consistently between datasets.
Figures
2026-01-28 17:59:08,777 - INFO(validmind.vm_models.result.result): Test driven block with result_id my_custom_tests.ConfusionMatrix:champion does not exist in model's document
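The equivalent call for the challenger model would look much the same; again a sketch, matching the :challenger result ID in the log output that follows:

# Challenger with train and test datasets (sketch; exact inputs are an assumption)
vm.tests.run_test(
    test_id="my_custom_tests.ConfusionMatrix:challenger",
    input_grid={
        "dataset": [vm_train_ds, vm_test_ds],
        "model": [vm_rf_model],
    },
).log()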
The Confusion Matrix:challenger test evaluates the classification performance of the model by comparing predicted and true labels for both the training and test datasets. The confusion matrices display the counts of true positives, true negatives, false positives, and false negatives, providing a comprehensive view of the model's ability to correctly classify each class. The results are presented separately for the train and test datasets, allowing for assessment of both in-sample and out-of-sample performance.
Key insights:
Perfect classification on training data: The training dataset confusion matrix shows 1,303 true negatives and 1,282 true positives, with zero false positives and zero false negatives, indicating no misclassifications on the training set.
Reduced accuracy on test data: The test dataset confusion matrix shows 227 true negatives and 229 true positives, with 86 false positives and 105 false negatives, indicating the presence of both types of misclassification in out-of-sample predictions.
Balanced class distribution in test set: The test set contains a similar number of true positives (229) and true negatives (227), suggesting a balanced representation of both classes in the evaluation.
The confusion matrix results indicate that the model achieves perfect separation of classes on the training data, with no observed misclassifications. However, performance on the test data shows a reduction in accuracy, with both false positives and false negatives present. The balanced class distribution in the test set supports a reliable assessment of model generalization. The observed discrepancy between training and test performance highlights the importance of evaluating out-of-sample results to understand model behavior under realistic conditions.
Figures
2026-01-28 17:59:24,095 - INFO(validmind.vm_models.result.result): Test driven block with result_id my_custom_tests.ConfusionMatrix:challenger does not exist in model's document
Note the output returned indicating that a test-driven block doesn't currently exist in your model's documentation for some test IDs.
That's expected: when we run validation tests, the logged results need to be manually added to your report as part of your compliance assessment process within the ValidMind Platform.
Add parameters to custom tests
Custom tests can take parameters just like any other function. To demonstrate, let's modify the confusion_matrix function to take an additional parameter normalize that will allow you to normalize the confusion matrix:
@vm.test("my_custom_tests.ConfusionMatrix")def confusion_matrix(dataset, model, normalize=False):"""The confusion matrix is a table that is often used to describe the performance of a classification model on a set of data for which the true values are known. The confusion matrix is a 2x2 table that contains 4 values: - True Positive (TP): the number of correct positive predictions - True Negative (TN): the number of correct negative predictions - False Positive (FP): the number of incorrect positive predictions - False Negative (FN): the number of incorrect negative predictions The confusion matrix can be used to assess the holistic performance of a classification model by showing the accuracy, precision, recall, and F1 score of the model on a single figure. """ y_true = dataset.y y_pred = dataset.y_pred(model=model)if normalize: confusion_matrix = metrics.confusion_matrix(y_true, y_pred, normalize="all")else: confusion_matrix = metrics.confusion_matrix(y_true, y_pred) cm_display = metrics.ConfusionMatrixDisplay( confusion_matrix=confusion_matrix, display_labels=[False, True] ) cm_display.plot() plt.close() # close the plot to avoid displaying itreturn cm_display.figure_ # return the figure object itself
Pass parameters to custom tests
You can pass parameters to custom tests by providing a dictionary of parameters to the run_test() function.
The parameters will override any default parameters set in the custom test definition. Note that dataset and model are still passed as inputs.
Since these are VMDataset or VMModel inputs, they have a special meaning.
Re-running and logging the custom confusion matrix with normalize=True for both models and our testing dataset looks like this:
# Champion with test dataset and normalize=True
vm.tests.run_test(
    test_id="my_custom_tests.ConfusionMatrix:test_normalized_champion",
    input_grid={
        "dataset": [vm_test_ds],
        "model": [vm_log_model],
    },
    params={"normalize": True},
).log()
Confusion Matrix Test Normalized Champion
The Confusion Matrix test evaluates the classification performance of the log_model_champion on the test_dataset_final by displaying the normalized proportions of true positives, true negatives, false positives, and false negatives. The matrix presents the fraction of predictions in each category, with values normalized such that each cell represents the proportion of total predictions. The results show the distribution of correct and incorrect predictions for both positive and negative classes, providing a comprehensive view of model accuracy and error types.
Key insights:
Balanced correct classification rates: The model correctly classifies both negative (True Negative: 0.33) and positive (True Positive: 0.33) classes at equal rates, indicating symmetric performance across classes.
Moderate false negative and false positive rates: The false negative rate (0.18) and false positive rate (0.15) are of similar magnitude, suggesting that misclassification is distributed relatively evenly between the two error types.
No class dominates prediction errors: The normalized values indicate that neither class is disproportionately affected by misclassification, with all matrix entries within the range of 0.15 to 0.33.
The confusion matrix reveals that the model exhibits balanced performance across both classes, with correct and incorrect predictions distributed symmetrically. The rates of true positives and true negatives are equal, and the error rates for false positives and false negatives are comparable, indicating no significant bias toward either class in prediction outcomes.
Parameters:
{
"normalize": true
}
Figures
2026-01-28 17:59:35,649 - INFO(validmind.vm_models.result.result): Test driven block with result_id my_custom_tests.ConfusionMatrix:test_normalized_champion does not exist in model's document
# Challenger with test dataset and normalize=True
vm.tests.run_test(
    test_id="my_custom_tests.ConfusionMatrix:test_normalized_challenger",
    input_grid={
        "dataset": [vm_test_ds],
        "model": [vm_rf_model],
    },
    params={"normalize": True},
).log()
Confusion Matrix Test Normalized Challenger
The Confusion Matrix test evaluates the classification performance of the rf_model on the test_dataset_final by displaying the normalized proportions of true positives, true negatives, false positives, and false negatives. The matrix presents the fraction of predictions in each category, with values normalized such that the sum of all cells equals 1. The results provide a visual summary of the model’s ability to correctly and incorrectly classify both positive and negative cases.
Key insights:
Balanced correct classification rates: The model correctly classifies both negative (True Negative: 0.35) and positive (True Positive: 0.35) cases at equal rates, indicating symmetric performance across classes.
Moderate false negative and false positive rates: The proportion of false negatives (0.16) and false positives (0.13) are similar in magnitude, suggesting that misclassification is distributed relatively evenly between the two error types.
No class dominance observed: The normalized confusion matrix does not indicate a strong bias toward either class, as the correct and incorrect prediction rates are comparable for both positive and negative labels.
The confusion matrix reveals that the rf_model demonstrates balanced classification performance, with equal rates of correct predictions for both classes and similar proportions of false positives and false negatives. This indicates that the model does not favor one class over the other and maintains consistent error rates across both positive and negative predictions.
Parameters:
{
"normalize": true
}
Figures
2026-01-28 17:59:49,481 - INFO(validmind.vm_models.result.result): Test driven block with result_id my_custom_tests.ConfusionMatrix:test_normalized_challenger does not exist in model's document
Use external test providers
Sometimes you may want to reuse the same set of custom tests across multiple models and share them with others in your organization, much as the model development team would have done with you in the example workflow featured in this series of notebooks. In this case, you can create an external custom test provider that lets you load custom tests from a local folder or a Git repository.
In this section, you will learn how to declare a local filesystem test provider that allows loading tests from a local folder, following these high-level steps:
Create a folder of custom tests from existing inline tests (tests that exist in your active Jupyter Notebook)
Let's start by creating a new folder that will contain reusable custom tests from your existing inline tests.
The following code snippet will create a new my_tests directory in the current working directory if it doesn't exist:
tests_folder ="my_tests"import os# create tests folderos.makedirs(tests_folder, exist_ok=True)# remove existing testsfor f in os.listdir(tests_folder):# remove files and pycacheif f.endswith(".py") or f =="__pycache__": os.system(f"rm -rf {tests_folder}/{f}")
After running the command above, confirm that a new my_tests directory was created successfully. For example:
~/notebooks/tutorials/model_validation/my_tests/
Save an inline test
The @vm.test decorator we used in Implementing custom tests above to register one-off custom tests also includes a convenience method on the function object that allows you to simply call <func_name>.save() to save the test to a Python file at a specified path.
While save() will get you started by creating the file and saving the function code with the correct name, it won't automatically include any imports, or other functions or variables defined outside of the function, that are needed for the test to run. To solve this, pass in the optional imports argument to ensure the necessary imports are added to the file.
The confusion_matrix test requires the following additional imports:
import matplotlib.pyplot as plt
from sklearn import metrics
Let's pass these imports to the save() method to ensure they are included in the file with the following command:
confusion_matrix.save(
    # Save it to the custom tests folder we created
    tests_folder,
    imports=["import matplotlib.pyplot as plt", "from sklearn import metrics"],
)
2026-01-28 17:59:50,042 - INFO(validmind.tests.decorator): Saved to /home/runner/work/documentation/documentation/site/notebooks/EXECUTED/model_validation/my_tests/ConfusionMatrix.py!
Be sure to add any necessary imports to the top of the file.
2026-01-28 17:59:50,043 - INFO(validmind.tests.decorator): This metric can be run with the ID: <test_provider_namespace>.ConfusionMatrix
# Saved from __main__.confusion_matrix
# Original Test ID: my_custom_tests.ConfusionMatrix
# New Test ID: <test_provider_namespace>.ConfusionMatrix
Now that your my_tests folder has a sample custom test, let's initialize a test provider that will tell the ValidMind Library where to find your custom tests:
ValidMind offers out-of-the-box test providers for local tests (tests in a folder) or a GitHub provider for tests in a GitHub repository.
You can also create your own test provider by creating a class that has a load_test method that takes a test ID and returns the test function matching that ID.
For most use cases, using a LocalTestProvider that allows you to load custom tests from a designated directory should be sufficient.
The most important attribute for a test provider is its namespace. This is a string that will be used to prefix test IDs in model documentation. This allows you to have multiple test providers with tests that can even share the same ID, but are distinguished by their namespace.
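To make the load_test contract described above concrete, here is a minimal, hedged sketch of what a hand-rolled test provider could look like; the class name and folder layout are illustrative, and for most use cases the built-in LocalTestProvider below is all you need:

import importlib.util
from pathlib import Path


class MyFolderTestProvider:
    """Illustrative test provider that loads `<TestName>.py` files from a folder.

    Any object exposing a `load_test(test_id)` method that returns the matching
    test function can act as a test provider.
    """

    def __init__(self, root_folder):
        self.root_folder = Path(root_folder)

    def load_test(self, test_id):
        # `test_id` is the portion after the provider namespace, e.g. "ConfusionMatrix"
        # or "classification.ConfusionMatrix" for tests kept in subfolders
        path = self.root_folder / (test_id.replace(".", "/") + ".py")

        # Import the module from the file and, by this provider's own convention,
        # return the function that shares the file's name
        spec = importlib.util.spec_from_file_location(test_id, path)
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)

        return getattr(module, path.stem)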
Let's go ahead and load the custom tests from our my_tests directory:
from validmind.tests import LocalTestProvider

# initialize the test provider with the tests folder we created earlier
my_test_provider = LocalTestProvider(tests_folder)

vm.tests.register_test_provider(
    namespace="my_test_provider",
    test_provider=my_test_provider,
)
# `my_test_provider.load_test()` will be called for any test ID that starts with `my_test_provider`
# e.g. `my_test_provider.ConfusionMatrix` will look for a function named `ConfusionMatrix` in `my_tests/ConfusionMatrix.py` file
Run test provider tests
Now that we've set up the test provider, we can run any test that's located in the tests folder by using the run_test() method as with any other test:
For tests that reside in a test provider directory, the test ID will be the namespace specified when registering the provider, followed by the path to the test file relative to the tests folder.
For example, the Confusion Matrix test we created earlier will have the test ID my_test_provider.ConfusionMatrix. You could organize the tests in subfolders, say classification and regression, and the test ID for the Confusion Matrix test would then be my_test_provider.classification.ConfusionMatrix.
Let's go ahead and re-run the confusion matrix test with our testing dataset for our two models by using the test ID my_test_provider.ConfusionMatrix. This should load the test from the test provider and run it as before.
# Champion with test dataset and test provider custom test
vm.tests.run_test(
    test_id="my_test_provider.ConfusionMatrix:champion",
    input_grid={
        "dataset": [vm_test_ds],
        "model": [vm_log_model],
    },
).log()
Confusion Matrix Champion
The Confusion Matrix test evaluates the classification performance of the log_model_champion on the test_dataset_final by comparing predicted and true class labels. The resulting 2x2 matrix displays the counts of true positives, true negatives, false positives, and false negatives, providing a comprehensive view of the model's prediction accuracy and error distribution. The matrix quantifies the model's ability to correctly identify both positive and negative cases, as well as the frequency of misclassifications.
Key insights:
Balanced true positive and true negative counts: The model correctly classified 215 positive cases (true positives) and 215 negative cases (true negatives), indicating symmetric performance across both classes.
Notable false negative and false positive rates: There are 119 false negatives and 98 false positives, reflecting a moderate level of misclassification in both directions.
Comparable error distribution across classes: The number of false negatives and false positives are of similar magnitude, suggesting that the model does not exhibit a strong bias toward over- or under-predicting either class.
The confusion matrix reveals that the log_model_champion demonstrates balanced classification performance, with equal accuracy in identifying both positive and negative cases. Misclassification rates are moderate and distributed similarly between false positives and false negatives, indicating no pronounced skew in prediction errors. This pattern suggests consistent model behavior across both classes on the test dataset.
Figures
2026-01-28 18:00:03,845 - INFO(validmind.vm_models.result.result): Test driven block with result_id my_test_provider.ConfusionMatrix:champion does not exist in model's document
# Challenger with test dataset and test provider custom test
vm.tests.run_test(
    test_id="my_test_provider.ConfusionMatrix:challenger",
    input_grid={
        "dataset": [vm_test_ds],
        "model": [vm_rf_model],
    },
).log()
Confusion Matrix Challenger
The Confusion Matrix test evaluates the classification performance of the rf_model on the test_dataset_final by comparing predicted and true labels. The resulting 2x2 matrix displays the counts of true positives, true negatives, false positives, and false negatives, providing a comprehensive view of the model's prediction accuracy and error distribution. The matrix enables assessment of the model's ability to correctly identify both positive and negative cases, as well as the types and frequencies of misclassifications.
Key insights:
Balanced correct classification of both classes: The model correctly classified 227 negative cases (true negatives) and 229 positive cases (true positives), indicating similar performance across both classes.
Moderate false positive and false negative rates: There were 86 false positives and 105 false negatives, reflecting a moderate level of misclassification for both types of errors.
Comparable error distribution: The counts of false positives and false negatives are of similar magnitude, suggesting that the model does not exhibit a strong bias toward over-predicting either class.
The confusion matrix reveals that the rf_model demonstrates balanced classification performance, with similar accuracy for both positive and negative classes. The distribution of misclassifications is relatively even, indicating no pronounced skew toward either false positives or false negatives. Overall, the model maintains a consistent error profile across both classes.
Figures
2026-01-28 18:00:17,870 - INFO(validmind.vm_models.result.result): Test driven block with result_id my_test_provider.ConfusionMatrix:challenger does not exist in model's document
Verify test runs
Our final task is to verify that all the tests provided by the model development team were run and reported accurately. Note the appended result_ids, which delineate which dataset each test was run with where relevant.
Here, we'll specify all the tests we'd like to independently rerun in a dictionary called test_config. Note here that inputs and input_grid expect the input_id of the dataset or model as the value rather than the variable name we specified:
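As a hedged sketch of the shape such a test_config dictionary could take, using a few of the test IDs and input_ids that appear in the results below (the full set of entries provided by the development team may differ):

# Sketch of a test_config mapping test IDs to their inputs and params
# (entries are illustrative; input_id strings refer to datasets registered earlier)
test_config = {
    "validmind.data_validation.DatasetDescription:raw_data": {
        "inputs": {"dataset": "raw_dataset"},
    },
    "validmind.data_validation.ClassImbalance:raw_data": {
        "inputs": {"dataset": "raw_dataset"},
        "params": {"min_percent_threshold": 10},
    },
    "validmind.data_validation.DescriptiveStatistics:preprocessed_data": {
        "inputs": {"dataset": "raw_dataset_preprocessed"},
    },
}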
for t in test_config:
    print(t)
    try:
        # Check if test has input_grid
        if 'input_grid' in test_config[t]:
            # For tests with input_grid, pass the input_grid configuration
            if 'params' in test_config[t]:
                vm.tests.run_test(t, input_grid=test_config[t]['input_grid'], params=test_config[t]['params']).log()
            else:
                vm.tests.run_test(t, input_grid=test_config[t]['input_grid']).log()
        else:
            # Original logic for regular inputs
            if 'params' in test_config[t]:
                vm.tests.run_test(t, inputs=test_config[t]['inputs'], params=test_config[t]['params']).log()
            else:
                vm.tests.run_test(t, inputs=test_config[t]['inputs']).log()
    except Exception as e:
        print(f"Error running test {t}: {str(e)}")
The DatasetDescription test provides a comprehensive summary of the dataset's structure, completeness, and feature characteristics. The results table details each column's data type, count, missingness, and the number of distinct values, offering a clear overview of the dataset composition. All columns are fully populated with no missing values, and the distinct value counts highlight the diversity and granularity of each feature. This summary enables a thorough understanding of the dataset's readiness for modeling and potential areas of complexity.
Key insights:
No missing values across all columns: All 11 columns report 0 missing entries, indicating complete data coverage for every feature.
High cardinality in key numeric features: The Balance and EstimatedSalary columns exhibit high distinct value counts (5088 and 8000 respectively), reflecting continuous or near-continuous distributions.
Low cardinality in categorical features: Categorical columns such as Geography, Gender, HasCrCard, IsActiveMember, and Exited have between 2 and 3 distinct values, supporting straightforward encoding and analysis.
Moderate diversity in demographic and behavioral features: CreditScore and Age show moderate distinct counts (452 and 69), while Tenure and NumOfProducts have lower diversity (11 and 4 distinct values).
The dataset is fully complete with no missing data, supporting robust downstream analysis. Numeric features display a range of cardinalities, from highly granular (EstimatedSalary, Balance) to more discretized (Tenure, NumOfProducts). Categorical features are well-defined with low cardinality, facilitating efficient encoding. The overall structure indicates a dataset suitable for machine learning applications, with no immediate data quality concerns observed in the summary statistics.
Tables
Dataset Description
| Name | Type | Count | Missing | Missing % | Distinct | Distinct % |
|---|---|---|---|---|---|---|
| CreditScore | Numeric | 8000.0 | 0 | 0.0 | 452 | 0.0565 |
| Geography | Categorical | 8000.0 | 0 | 0.0 | 3 | 0.0004 |
| Gender | Categorical | 8000.0 | 0 | 0.0 | 2 | 0.0002 |
| Age | Numeric | 8000.0 | 0 | 0.0 | 69 | 0.0086 |
| Tenure | Numeric | 8000.0 | 0 | 0.0 | 11 | 0.0014 |
| Balance | Numeric | 8000.0 | 0 | 0.0 | 5088 | 0.6360 |
| NumOfProducts | Numeric | 8000.0 | 0 | 0.0 | 4 | 0.0005 |
| HasCrCard | Categorical | 8000.0 | 0 | 0.0 | 2 | 0.0002 |
| IsActiveMember | Categorical | 8000.0 | 0 | 0.0 | 2 | 0.0002 |
| EstimatedSalary | Numeric | 8000.0 | 0 | 0.0 | 8000 | 1.0000 |
| Exited | Categorical | 8000.0 | 0 | 0.0 | 2 | 0.0002 |
2026-01-28 18:00:27,744 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.DatasetDescription:raw_data does not exist in model's document
The Descriptive Statistics test evaluates the distributional characteristics and diversity of both numerical and categorical variables in the dataset. The results present summary statistics for eight numerical variables, including measures of central tendency, dispersion, and range, as well as frequency-based summaries for two categorical variables. The numerical table details counts, means, standard deviations, and percentiles, while the categorical table reports unique value counts and the dominance of the most frequent category. These results provide a comprehensive overview of the dataset’s structure and highlight key aspects of variable distributions.
Key insights:
Wide range and skewness in balance values: The Balance variable exhibits a minimum of 0.0, a median of 97,264.0, and a maximum of 250,898.0, with a mean (76,434.10) substantially below the median, indicating a right-skewed distribution and a significant proportion of zero balances.
High concentration in categorical variables: Geography is dominated by France (50.12% of records), and Gender is predominantly Male (54.95%), indicating limited diversity in these categorical features.
Binary variables with balanced representation: HasCrCard and IsActiveMember are binary, with means of 0.70 and 0.52, respectively, suggesting moderate balance between categories.
Consistent sample sizes and low missingness: All variables report a count of 8,000, indicating no missing data across both numerical and categorical fields.
Substantial spread in estimated salary: EstimatedSalary ranges from 12.0 to 199,992.0, with a mean of 99,790.19 and a standard deviation of 57,520.51, reflecting high variability in this feature.
The dataset demonstrates complete data coverage with no missing values and a broad range of values across key numerical variables. Several variables, such as Balance and EstimatedSalary, display substantial dispersion and skewness, while categorical variables show limited diversity due to the dominance of specific categories. These characteristics provide important context for understanding the underlying data structure and potential sources of model risk related to feature distribution and representativeness.
Tables
Numerical Variables
| Name | Count | Mean | Std | Min | 25% | 50% | 75% | 90% | 95% | Max |
|---|---|---|---|---|---|---|---|---|---|---|
| CreditScore | 8000.0 | 650.1596 | 96.8462 | 350.0 | 583.0 | 652.0 | 717.0 | 778.0 | 813.0 | 850.0 |
| Age | 8000.0 | 38.9489 | 10.4590 | 18.0 | 32.0 | 37.0 | 44.0 | 53.0 | 60.0 | 92.0 |
| Tenure | 8000.0 | 5.0339 | 2.8853 | 0.0 | 3.0 | 5.0 | 8.0 | 9.0 | 9.0 | 10.0 |
| Balance | 8000.0 | 76434.0965 | 62612.2513 | 0.0 | 0.0 | 97264.0 | 128045.0 | 149545.0 | 162488.0 | 250898.0 |
| NumOfProducts | 8000.0 | 1.5325 | 0.5805 | 1.0 | 1.0 | 1.0 | 2.0 | 2.0 | 2.0 | 4.0 |
| HasCrCard | 8000.0 | 0.7026 | 0.4571 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| IsActiveMember | 8000.0 | 0.5199 | 0.4996 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| EstimatedSalary | 8000.0 | 99790.1880 | 57520.5089 | 12.0 | 50857.0 | 99505.0 | 149216.0 | 179486.0 | 189997.0 | 199992.0 |
Categorical Variables
| Name | Count | Number of Unique Values | Top Value | Top Value Frequency | Top Value Frequency % |
|---|---|---|---|---|---|
| Geography | 8000.0 | 3.0 | France | 4010.0 | 50.12 |
| Gender | 8000.0 | 2.0 | Male | 4396.0 | 54.95 |
2026-01-28 18:00:39,250 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.DescriptiveStatistics:raw_data does not exist in model's document
validmind.data_validation.MissingValues:raw_data
✅ Missing Values Raw Data
The Missing Values test evaluates dataset quality by measuring the proportion of missing values in each feature and comparing it to a predefined threshold. The results table presents, for each column, the number and percentage of missing values, along with a Pass/Fail status based on whether the missingness exceeds the threshold. All features in the dataset are shown with zero missing values, and each column is marked as passing the test.
Key insights:
No missing values detected: All features, including CreditScore, Geography, Gender, Age, Tenure, Balance, NumOfProducts, HasCrCard, IsActiveMember, EstimatedSalary, and Exited, have zero missing values.
Universal pass status: Every column meets the missing value threshold criterion, with 0.0% missingness and a Pass status across the dataset.
The dataset demonstrates complete data integrity with respect to missing values, as all features contain full data coverage and satisfy the established threshold. This indicates a high level of data quality for subsequent modeling or analysis steps.
Parameters:
{
"min_threshold": 1
}
Tables
| Column | Number of Missing Values | Percentage of Missing Values (%) | Pass/Fail |
|---|---|---|---|
| CreditScore | 0 | 0.0 | Pass |
| Geography | 0 | 0.0 | Pass |
| Gender | 0 | 0.0 | Pass |
| Age | 0 | 0.0 | Pass |
| Tenure | 0 | 0.0 | Pass |
| Balance | 0 | 0.0 | Pass |
| NumOfProducts | 0 | 0.0 | Pass |
| HasCrCard | 0 | 0.0 | Pass |
| IsActiveMember | 0 | 0.0 | Pass |
| EstimatedSalary | 0 | 0.0 | Pass |
| Exited | 0 | 0.0 | Pass |
2026-01-28 18:00:47,579 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.MissingValues:raw_data does not exist in model's document
validmind.data_validation.ClassImbalance:raw_data
✅ Class Imbalance Raw Data
The Class Imbalance test evaluates the distribution of target classes within the dataset to identify potential imbalances that could impact model performance. The results table presents the percentage of records for each class in the "Exited" target variable, alongside a pass/fail assessment based on a minimum threshold of 10%. The accompanying bar plot visually depicts the proportion of each class, with class 0 and class 1 shown as distinct bars representing their respective frequencies.
Key insights:
Both classes exceed the minimum threshold: Class 0 constitutes 79.80% and class 1 constitutes 20.20% of the dataset, with both surpassing the 10% minimum threshold.
No classes flagged for imbalance: The pass/fail assessment indicates that neither class is under-represented according to the defined criterion.
Class distribution is visually asymmetric: The bar plot highlights a notable difference in class proportions, with class 0 being the majority class.
The results indicate that, while the dataset is not perfectly balanced, both classes meet the minimum representation threshold set for this test. The observed class distribution is asymmetric, with a substantially higher proportion of class 0 compared to class 1, but no class falls below the specified risk threshold for imbalance.
Parameters:
{
"min_percent_threshold": 10
}
Tables
Exited Class Imbalance
| Exited | Percentage of Rows (%) | Pass/Fail |
|---|---|---|
| 0 | 79.80% | Pass |
| 1 | 20.20% | Pass |
Figures
2026-01-28 18:00:59,333 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.ClassImbalance:raw_data does not exist in model's document
validmind.data_validation.Duplicates:raw_data
✅ Duplicates Raw Data
The Duplicates test evaluates the presence of duplicate rows within the dataset to assess data quality and mitigate risks associated with redundant information. The results table presents the absolute number and percentage of duplicate rows detected, providing a quantitative overview of dataset uniqueness. The test was executed with a minimum threshold parameter set to 1, and the results are summarized in the table titled "Duplicate Rows Results for Dataset."
Key insights:
No duplicate rows detected: The dataset contains 0 duplicate rows, as indicated by the "Number of Duplicates" value.
Zero percent duplication: The "Percentage of Rows (%)" is 0.0%, confirming the absence of redundant entries in the dataset.
The results demonstrate that the dataset is free from duplicate rows, indicating a high level of data uniqueness and integrity. The absence of duplication supports reliable model training and reduces the risk of overfitting due to repeated information.
Parameters:
{
"min_threshold": 1
}
Tables
Duplicate Rows Results for Dataset
| Number of Duplicates | Percentage of Rows (%) |
|---|---|
| 0 | 0.0 |
2026-01-28 18:01:05,724 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.Duplicates:raw_data does not exist in model's document
The High Cardinality test evaluates the number of unique values in categorical columns to identify potential risks of overfitting and data noise. The results table presents the number and percentage of distinct values for each categorical column, along with a pass/fail status based on a threshold of 10% distinct values. Both "Geography" and "Gender" columns are assessed, with their respective distinct value counts and percentages reported.
Key insights:
All categorical columns pass cardinality threshold: Both "Geography" (3 distinct values, 0.0375%) and "Gender" (2 distinct values, 0.025%) are well below the 10% threshold, resulting in a "Pass" status for each.
Low cardinality observed across features: The number of unique values in both columns is minimal relative to the total sample size, indicating low cardinality throughout the assessed categorical features.
The results indicate that all evaluated categorical columns exhibit low cardinality, with distinct value counts and percentages substantially below the defined threshold. No evidence of high cardinality or associated overfitting risk is present in the tested features.
2026-01-28 18:01:12,152 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.HighCardinality:raw_data does not exist in model's document
validmind.data_validation.Skewness:raw_data
❌ Skewness Raw Data
The Skewness:raw_data test evaluates the asymmetry of numerical data distributions by calculating skewness values for each numeric column and comparing them to a maximum threshold of 1. The results table presents skewness values and pass/fail outcomes for each column, indicating whether the distributional asymmetry exceeds the defined threshold. Columns with skewness values below 1 are marked as "Pass," while those exceeding the threshold are marked as "Fail." This assessment provides a quantitative overview of distributional characteristics relevant to data quality and model performance.
Key insights:
Most columns exhibit low skewness: The majority of numeric columns, including CreditScore, Tenure, Balance, NumOfProducts, HasCrCard, IsActiveMember, and EstimatedSalary, have skewness values well below the threshold of 1 and pass the test.
Two columns exceed skewness threshold: Age (skewness = 1.0245) and Exited (skewness = 1.4847) exceed the maximum threshold, resulting in a fail outcome for these columns.
Skewness values are generally close to zero: Several columns, such as Tenure (0.0077), EstimatedSalary (0.0095), and CreditScore (‑0.062), display skewness values near zero, indicating near-symmetric distributions.
The skewness assessment reveals that most numeric columns in the dataset have distributions that are approximately symmetric or only moderately skewed, remaining within the defined threshold. However, Age and Exited display higher levels of asymmetry, exceeding the maximum skewness threshold and indicating notable distributional skew in these variables. The overall distributional profile suggests that, aside from these exceptions, the dataset maintains a balanced structure with respect to skewness.
Parameters:
{
"max_threshold": 1
}
Tables
Skewness Results for Dataset
| Column | Skewness | Pass/Fail |
|---|---|---|
| CreditScore | -0.0620 | Pass |
| Age | 1.0245 | Fail |
| Tenure | 0.0077 | Pass |
| Balance | -0.1353 | Pass |
| NumOfProducts | 0.7172 | Pass |
| HasCrCard | -0.8867 | Pass |
| IsActiveMember | -0.0796 | Pass |
| EstimatedSalary | 0.0095 | Pass |
| Exited | 1.4847 | Fail |
2026-01-28 18:01:24,737 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.Skewness:raw_data does not exist in model's document
validmind.data_validation.UniqueRows:raw_data
❌ Unique Rows Raw Data
The UniqueRows test evaluates the diversity of the dataset by measuring the proportion of unique values in each column and comparing it to a minimum percentage threshold. The results table presents, for each column, the number and percentage of unique values, along with a pass/fail outcome based on whether the percentage exceeds the 1% threshold. Columns such as CreditScore, Balance, and EstimatedSalary show high percentages of unique values and pass the test, while most categorical and low-cardinality columns do not meet the threshold and fail.
Key insights:
High uniqueness in continuous variables: EstimatedSalary (100%), Balance (63.6%), and CreditScore (5.65%) exceed the 1% uniqueness threshold, indicating substantial diversity in these columns.
Low uniqueness in categorical variables: Columns such as Geography (0.0375%), Gender (0.025%), HasCrCard (0.025%), IsActiveMember (0.025%), and Exited (0.025%) have very low percentages of unique values and fail the test.
Majority of columns fail uniqueness threshold: Only 3 out of 11 columns pass the test, with the remaining 8 columns—including Age (0.8625%), Tenure (0.1375%), and NumOfProducts (0.05%)—falling below the 1% threshold.
The results indicate that while continuous variables in the dataset exhibit high diversity, the majority of categorical and low-cardinality columns do not meet the minimum uniqueness threshold. This pattern reflects a concentration of unique values in a subset of features, with limited diversity observed in most categorical variables. The overall data structure is characterized by high uniqueness in select columns and low uniqueness in the majority of others.
Parameters:
{
"min_percent_threshold": 1
}
Tables
| Column | Number of Unique Values | Percentage of Unique Values (%) | Pass/Fail |
|---|---|---|---|
| CreditScore | 452 | 5.6500 | Pass |
| Geography | 3 | 0.0375 | Fail |
| Gender | 2 | 0.0250 | Fail |
| Age | 69 | 0.8625 | Fail |
| Tenure | 11 | 0.1375 | Fail |
| Balance | 5088 | 63.6000 | Pass |
| NumOfProducts | 4 | 0.0500 | Fail |
| HasCrCard | 2 | 0.0250 | Fail |
| IsActiveMember | 2 | 0.0250 | Fail |
| EstimatedSalary | 8000 | 100.0000 | Pass |
| Exited | 2 | 0.0250 | Fail |
2026-01-28 18:01:34,678 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.UniqueRows:raw_data does not exist in model's document
The TooManyZeroValues test identifies numerical columns with a proportion of zero values exceeding a defined threshold, set here at 0.03%. The results table summarizes the number and percentage of zero values for each numerical column, along with a pass/fail status based on the threshold. All four evaluated columns—Tenure, Balance, HasCrCard, and IsActiveMember—are reported with their respective zero value counts and fail the test due to exceeding the threshold.
Key insights:
All evaluated columns exceed zero value threshold: Each of the four numerical columns has a percentage of zero values significantly above the 0.03% threshold, resulting in a fail status for all.
High concentration of zeros in Balance and IsActiveMember: Balance contains 36.4% zero values, and IsActiveMember contains 48.01%, indicating substantial sparsity in these features.
Binary indicator columns show elevated zero rates: HasCrCard and IsActiveMember, likely representing binary indicators, have 29.74% and 48.01% zero values, respectively, reflecting a high proportion of one class.
Tenure column also affected: Tenure registers 4.04% zero values, which, while lower than other columns, still exceeds the threshold and results in a fail.
All assessed numerical columns display zero value proportions well above the defined threshold, with Balance and IsActiveMember exhibiting particularly high rates of zeros. The presence of elevated zero counts across both continuous and likely binary columns indicates a pattern of data sparsity or class imbalance in these features. This distribution warrants consideration in subsequent modeling steps, as it may influence feature utility and model performance.
Parameters:
{
"max_percent_threshold": 0.03
}
Tables
| Variable | Row Count | Number of Zero Values | Percentage of Zero Values (%) | Pass/Fail |
|---|---|---|---|---|
| Tenure | 8000 | 323 | 4.0375 | Fail |
| Balance | 8000 | 2912 | 36.4000 | Fail |
| HasCrCard | 8000 | 2379 | 29.7375 | Fail |
| IsActiveMember | 8000 | 3841 | 48.0125 | Fail |
2026-01-28 18:01:43,906 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.TooManyZeroValues:raw_data does not exist in model's document
The Interquartile Range Outliers Table (IQROutliersTable) test identifies and summarizes outliers in numerical features using the IQR method, with the outlier threshold set to 5. The results are presented in a summary table that would list, for each numerical feature, the count and distributional statistics of detected outliers. In this test run, the summary table contains no entries, indicating the absence of detected outliers across all evaluated numerical features.
Key insights:
No outliers detected in any feature: The summary table is empty, confirming that no numerical features exhibited values outside the IQR-based outlier thresholds.
Uniform data distribution within threshold: All numerical feature values fall within the calculated IQR bounds, given the threshold of 5.
The absence of detected outliers indicates that the dataset's numerical features are uniformly distributed within the specified IQR threshold. No evidence of extreme or anomalous values was observed under the applied test parameters.
Parameters:
{
"threshold": 5
}
Tables
Summary of Outliers Detected by IQR Method
2026-01-28 18:01:51,989 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.IQROutliersTable:raw_data does not exist in model's document
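For reference, the IQR rule with `threshold=5` flags values outside [Q1 − 5·IQR, Q3 + 5·IQR]. A hedged pandas sketch of that check (the ValidMind implementation may differ in details such as which columns it scans):

```python
import pandas as pd

def iqr_outliers(series: pd.Series, threshold: float = 5.0) -> pd.Series:
    """Return values outside [Q1 - threshold*IQR, Q3 + threshold*IQR]."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return series[(series < q1 - threshold * iqr) | (series > q3 + threshold * iqr)]

# With such a wide threshold, none of the numeric columns is expected to produce outliers,
# which is consistent with the empty summary table above.
for col in raw_df.select_dtypes(include="number").columns:
    print(col, len(iqr_outliers(raw_df[col], threshold=5)))
```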
The Descriptive Statistics test evaluates the distributional characteristics and diversity of both numerical and categorical variables in the preprocessed dataset. The results are presented in two summary tables: one for numerical variables, detailing central tendency, dispersion, and range; and one for categorical variables, summarizing value counts, unique value diversity, and the dominance of top categories. These tables provide a comprehensive overview of the dataset’s structure, supporting assessment of data quality and potential risk factors.
Key insights:
Wide range and skewness in Balance: The Balance variable exhibits a minimum of 0.0, a median of 102,386.0, and a maximum of 250,898.0, with a mean (81,585.1) substantially below the median, indicating right-skewness and a concentration of lower values.
CreditScore distribution is broad and symmetric: CreditScore ranges from 350.0 to 850.0, with a mean (647.7) closely aligned to the median (649.0), suggesting a relatively symmetric distribution.
Categorical variables show moderate diversity: Geography has three unique values, with France as the most frequent (46.04%), and Gender is nearly balanced (Male: 51.27%), indicating no single category dominates excessively.
Binary variables are well represented: HasCrCard and IsActiveMember are binary, with HasCrCard showing 69.5% positive responses and IsActiveMember at 47.3%, reflecting reasonable class balance.
The dataset demonstrates a broad and well-populated range for key numerical variables, with some evidence of skewness in Balance. Categorical variables display moderate diversity, with no overwhelming dominance by a single category. Binary variables are distributed without extreme imbalance. Overall, the data structure supports robust modeling, with distributional characteristics that warrant monitoring for potential skewness or concentration effects.
Tables

Numerical Variables

| Name | Count | Mean | Std | Min | 25% | 50% | 75% | 90% | 95% | Max |
|---|---|---|---|---|---|---|---|---|---|---|
| CreditScore | 3232.0 | 647.6711 | 98.7191 | 350.0 | 581.0 | 649.0 | 717.0 | 779.0 | 817.0 | 850.0 |
| Tenure | 3232.0 | 5.0285 | 2.9152 | 0.0 | 3.0 | 5.0 | 8.0 | 9.0 | 10.0 | 10.0 |
| Balance | 3232.0 | 81585.0888 | 61380.7207 | 0.0 | 0.0 | 102386.0 | 128886.0 | 150802.0 | 164733.0 | 250898.0 |
| NumOfProducts | 3232.0 | 1.5108 | 0.6703 | 1.0 | 1.0 | 1.0 | 2.0 | 2.0 | 3.0 | 4.0 |
| HasCrCard | 3232.0 | 0.6952 | 0.4604 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| IsActiveMember | 3232.0 | 0.4728 | 0.4993 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| EstimatedSalary | 3232.0 | 100296.6888 | 57771.2303 | 12.0 | 50956.0 | 100642.0 | 150557.0 | 179122.0 | 188759.0 | 199992.0 |

Categorical Variables

| Name | Count | Number of Unique Values | Top Value | Top Value Frequency | Top Value Frequency % |
|---|---|---|---|---|---|
| Geography | 3232.0 | 3.0 | France | 1488.0 | 46.04 |
| Gender | 3232.0 | 2.0 | Male | 1657.0 | 51.27 |
2026-01-28 18:02:02,433 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.DescriptiveStatistics:preprocessed_data does not exist in model's document
The Descriptive Statistics test evaluates the distribution, completeness, and data types of numerical and categorical variables in the dataset. The results present summary statistics for eight numerical variables and two categorical variables, including measures of central tendency, range, missingness, and unique value counts. All variables are reported with their respective data types and observed value ranges, providing a comprehensive overview of the dataset’s structure and integrity.
Key insights:
No missing values detected: All numerical and categorical variables report 0.0% missing values, indicating complete data coverage across all fields.
Consistent data types across variables: Numerical variables are represented as int64 or float64, while categorical variables are of object type, aligning with their expected formats.
Limited cardinality in categorical variables: Geography contains three unique values (France, Germany, Spain), and Gender contains two unique values (Female, Male), supporting straightforward categorical encoding.
Wide value ranges in numerical variables: CreditScore ranges from 350 to 850, Balance from 0.0 to 250,898.09, and EstimatedSalary from 11.58 to 199,992.48, reflecting substantial spread in key financial indicators.
Binary encoding for indicator variables: HasCrCard, IsActiveMember, and Exited are encoded as binary (0/1) int64 variables, supporting direct use in binary classification or indicator analysis.
The dataset exhibits complete data with no missing values and appropriate data types for all variables. Categorical variables display low cardinality, and numerical variables cover broad value ranges, particularly in financial fields. The structure and integrity of the data are well-documented, providing a reliable foundation for subsequent modeling and analysis.
Tables

| Numerical Variable | Num of Obs | Mean | Min | Max | Missing Values (%) | Data Type |
|---|---|---|---|---|---|---|
| CreditScore | 3232 | 647.6711 | 350.00 | 850.00 | 0.0 | int64 |
| Tenure | 3232 | 5.0285 | 0.00 | 10.00 | 0.0 | int64 |
| Balance | 3232 | 81585.0888 | 0.00 | 250898.09 | 0.0 | float64 |
| NumOfProducts | 3232 | 1.5108 | 1.00 | 4.00 | 0.0 | int64 |
| HasCrCard | 3232 | 0.6952 | 0.00 | 1.00 | 0.0 | int64 |
| IsActiveMember | 3232 | 0.4728 | 0.00 | 1.00 | 0.0 | int64 |
| EstimatedSalary | 3232 | 100296.6888 | 11.58 | 199992.48 | 0.0 | float64 |
| Exited | 3232 | 0.5000 | 0.00 | 1.00 | 0.0 | int64 |

| Categorical Variable | Num of Obs | Num of Unique Values | Unique Values | Missing Values (%) | Data Type |
|---|---|---|---|---|---|
| Geography | 3232.0 | 3.0 | ['France' 'Germany' 'Spain'] | 0.0 | object |
| Gender | 3232.0 | 2.0 | ['Female' 'Male'] | 0.0 | object |
2026-01-28 18:02:10,565 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.TabularDescriptionTables:preprocessed_data does not exist in model's document
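Both sets of summary tables can be sanity-checked directly against the preprocessed dataframe. A sketch, where `preprocessed_df` is a placeholder name for the dataframe behind the `preprocessed_data` input:

```python
# Cross-check summary statistics, data types, and missingness with pandas
print(preprocessed_df.describe(percentiles=[0.25, 0.5, 0.75, 0.9, 0.95]).T.round(4))
print(preprocessed_df.dtypes)
print((preprocessed_df.isnull().mean() * 100).round(2))  # missing values (%) per column
```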
The Missing Values test evaluates dataset quality by measuring the proportion of missing values in each feature and comparing it to a predefined threshold. The results table presents, for each column, the number and percentage of missing values, along with a Pass/Fail status based on whether the missingness exceeds the specified threshold. All features in the dataset are listed with their corresponding missing value statistics and test outcomes.
Key insights:
No missing values detected: All features report zero missing values, with both the number and percentage of missing entries recorded as 0.0%.
Universal test pass across features: Every column meets the missing value threshold criterion, resulting in a Pass status for all features.
The dataset demonstrates complete data integrity with respect to missing values, as all features contain fully populated entries and satisfy the established threshold. This indicates a high level of data quality for subsequent modeling or analysis steps.
Parameters:
{
"min_threshold": 1
}
Tables

| Column | Number of Missing Values | Percentage of Missing Values (%) | Pass/Fail |
|---|---|---|---|
| CreditScore | 0 | 0.0 | Pass |
| Geography | 0 | 0.0 | Pass |
| Gender | 0 | 0.0 | Pass |
| Tenure | 0 | 0.0 | Pass |
| Balance | 0 | 0.0 | Pass |
| NumOfProducts | 0 | 0.0 | Pass |
| HasCrCard | 0 | 0.0 | Pass |
| IsActiveMember | 0 | 0.0 | Pass |
| EstimatedSalary | 0 | 0.0 | Pass |
| Exited | 0 | 0.0 | Pass |
2026-01-28 18:02:15,261 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.MissingValues:preprocessed_data does not exist in model's document
The TabularNumericalHistograms:preprocessed_data test provides a visual summary of the distribution of each numerical feature in the dataset using histograms. The resulting plots display the frequency distribution for variables such as CreditScore, Tenure, Balance, NumOfProducts, HasCrCard, IsActiveMember, and EstimatedSalary. These visualizations enable assessment of central tendency, spread, skewness, and the presence of outliers or unusual patterns in the input data.
Key insights:
CreditScore distribution is unimodal and slightly right-skewed: The CreditScore histogram shows a single peak centered around 650–700, with a longer tail extending toward higher values, indicating mild right skewness and a concentration of scores in the mid-to-high range.
Tenure is nearly uniform except at endpoints: The Tenure variable displays an approximately uniform distribution across most values, with lower frequencies at the minimum and maximum tenure values.
Balance exhibits a strong spike at zero: The Balance histogram reveals a pronounced spike at zero, with the remainder of the distribution forming a bell-shaped curve centered around 120,000, indicating a substantial subset of accounts with no balance.
NumOfProducts is highly concentrated at lower values: The distribution of NumOfProducts is heavily concentrated at 1 and 2 products, with very few instances at 3 or 4, indicating limited product diversification among most customers.
HasCrCard and IsActiveMember are binary with class imbalance: Both HasCrCard and IsActiveMember show binary distributions, with HasCrCard skewed toward 1 (majority have a credit card) and IsActiveMember showing a moderate imbalance between active and inactive members.
EstimatedSalary is uniformly distributed: The EstimatedSalary histogram is approximately flat across the range, indicating a uniform distribution of salary values in the dataset.
The histograms collectively indicate that most numerical features are either uniformly or unimodally distributed, with some variables exhibiting notable skewness or concentration at specific values. The presence of a large number of zero balances and the concentration of product counts at lower values are prominent characteristics. Binary features display varying degrees of class imbalance. No extreme outliers or highly irregular distributions are observed in the visualized features.
Figures
2026-01-28 18:02:35,003 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.TabularNumericalHistograms:preprocessed_data does not exist in model's document
The TabularCategoricalBarPlots test evaluates the distribution of categorical variables by generating bar plots for each category within the dataset. The resulting plots display the frequency counts for each category in the "Geography" and "Gender" features, providing a visual summary of the dataset's categorical composition. These visualizations facilitate the identification of category balance and potential representation issues within the data.
Key insights:
Geography distribution is imbalanced: The "Geography" feature shows the highest count for France, followed by Germany and then Spain, with France having approximately 50% more instances than Spain.
Gender distribution is relatively balanced: The "Gender" feature displays similar counts for Male and Female categories, with only a modest difference between the two.
The categorical composition of the dataset reveals a notable imbalance in the "Geography" feature, with France being the most represented and Spain the least. In contrast, the "Gender" feature demonstrates a near-equal distribution between categories. These patterns provide a clear overview of category representation, highlighting areas where category imbalance may influence downstream modeling.
Figures
2026-01-28 18:02:45,130 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.TabularCategoricalBarPlots:preprocessed_data does not exist in model's document
The TargetRateBarPlots test visualizes the distribution and target rates of categorical features to provide insight into model decision patterns. The results display paired bar plots for each categorical variable, with the left plot showing the frequency of each category and the right plot depicting the mean target rate (proportion of positive class) for each category. The features analyzed include Geography and Gender, with each category’s count and corresponding target rate presented side by side for direct comparison.
Key insights:
Distinct target rate variation by Geography: The target rate for Germany is notably higher than for France and Spain, with Germany exceeding 0.6 while France and Spain are both near 0.4.
Balanced category representation in Gender: Male and Female categories have similar sample counts, each above 1500, indicating balanced representation in the dataset.
Gender target rate disparity: The target rate for Female is higher than for Male, with Female above 0.5 and Male below 0.45.
Uneven category counts in Geography: France has the highest count, followed by Germany and then Spain, indicating some imbalance in category frequencies.
The results reveal pronounced differences in target rates across both Geography and Gender categories, with Germany and Female categories exhibiting higher proportions of positive class outcomes. Category representation is balanced for Gender but shows moderate imbalance for Geography. These patterns highlight areas where model predictions and data composition differ across categorical groups.
Figures
2026-01-28 18:02:56,308 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.TargetRateBarPlots:preprocessed_data does not exist in model's document
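The per-category target rates shown in these plots can be reproduced with a simple group-by. Again a sketch using the `preprocessed_df` placeholder:

```python
# Category counts and mean target rate (share of Exited == 1) per category
for col in ["Geography", "Gender"]:
    print(preprocessed_df.groupby(col)["Exited"].agg(count="size", target_rate="mean").round(4), "\n")
```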
The Descriptive Statistics test evaluates the distributional characteristics of numerical variables in the development and test datasets. The results present summary statistics for each variable, including count, mean, standard deviation, minimum, maximum, and key percentiles. These statistics provide a comprehensive overview of the central tendency, dispersion, and range for each feature, enabling assessment of data quality and potential risk factors.
Key insights:
Consistent central tendencies across datasets: Means and medians (50th percentiles) for key variables such as CreditScore, Tenure, Balance, NumOfProducts, HasCrCard, IsActiveMember, and EstimatedSalary are closely aligned between the development and test datasets, indicating stable distributions.
Wide range and high variance in Balance and EstimatedSalary: Both Balance and EstimatedSalary exhibit large standard deviations (Balance: ~61,500–61,800; EstimatedSalary: ~57,300–57,900) and wide ranges, with minimum values near zero and maximums exceeding 199,000, reflecting substantial dispersion and potential for outliers.
Binary variables show expected distribution: HasCrCard and IsActiveMember are binary variables with means near 0.7 and 0.47, respectively, and standard deviations close to 0.46–0.50, consistent with their binary nature and indicating no missingness.
No evidence of missing data: All variables report counts equal to the total number of records in their respective datasets, indicating complete data coverage for the analyzed features.
The descriptive statistics indicate that the numerical variables in both the development and test datasets are well-aligned in terms of central tendency and spread, with no evidence of missing data. The wide dispersion observed in Balance and EstimatedSalary highlights the presence of substantial variability, which may influence model sensitivity to these features. Binary variables display balanced distributions, and overall, the datasets exhibit stable and consistent statistical properties across splits.
Tables

| dataset | Name | Count | Mean | Std | Min | 25% | 50% | 75% | 90% | 95% | Max |
|---|---|---|---|---|---|---|---|---|---|---|---|
| train_dataset_final | CreditScore | 2585.0 | 648.3965 | 98.5054 | 350.0 | 581.0 | 648.0 | 717.0 | 779.0 | 818.0 | 850.0 |
| train_dataset_final | Tenure | 2585.0 | 4.9988 | 2.9000 | 0.0 | 3.0 | 5.0 | 8.0 | 9.0 | 10.0 | 10.0 |
| train_dataset_final | Balance | 2585.0 | 81084.2264 | 61520.0012 | 0.0 | 0.0 | 101487.0 | 128744.0 | 150547.0 | 164361.0 | 250898.0 |
| train_dataset_final | NumOfProducts | 2585.0 | 1.5122 | 0.6729 | 1.0 | 1.0 | 1.0 | 2.0 | 2.0 | 3.0 | 4.0 |
| train_dataset_final | HasCrCard | 2585.0 | 0.6917 | 0.4619 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| train_dataset_final | IsActiveMember | 2585.0 | 0.4723 | 0.4993 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| train_dataset_final | EstimatedSalary | 2585.0 | 100084.7896 | 57898.2146 | 12.0 | 50073.0 | 99504.0 | 151317.0 | 179252.0 | 189373.0 | 199992.0 |
| test_dataset_final | CreditScore | 647.0 | 644.7728 | 99.5921 | 350.0 | 578.0 | 650.0 | 716.0 | 779.0 | 809.0 | 850.0 |
| test_dataset_final | Tenure | 647.0 | 5.1468 | 2.9744 | 0.0 | 3.0 | 5.0 | 8.0 | 9.0 | 10.0 | 10.0 |
| test_dataset_final | Balance | 647.0 | 83586.2161 | 60827.1306 | 0.0 | 0.0 | 105420.0 | 129698.0 | 151837.0 | 165360.0 | 210433.0 |
| test_dataset_final | NumOfProducts | 647.0 | 1.5054 | 0.6604 | 1.0 | 1.0 | 1.0 | 2.0 | 2.0 | 3.0 | 4.0 |
| test_dataset_final | HasCrCard | 647.0 | 0.7094 | 0.4544 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| test_dataset_final | IsActiveMember | 647.0 | 0.4745 | 0.4997 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| test_dataset_final | EstimatedSalary | 647.0 | 101143.3033 | 57297.7411 | 143.0 | 52326.0 | 104153.0 | 149426.0 | 178913.0 | 188084.0 | 199662.0 |
2026-01-28 18:03:14,102 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.DescriptiveStatistics:development_data does not exist in model's document
The Descriptive Statistics test evaluates the distributional characteristics and completeness of numerical variables in the train and test datasets. The results present summary statistics, including mean, minimum, maximum, and missing value percentages, for each numerical variable across both datasets. All variables are reported with their respective data types and observation counts, providing a comprehensive overview of the dataset structure and integrity.
Key insights:
No missing values across all variables: All numerical variables in both train and test datasets report 0.0% missing values, indicating complete data coverage for these fields.
Consistent data types and observation counts: All variables are typed as either int64 or float64, with observation counts matching the dataset sizes (2585 for train, 647 for test), confirming structural consistency.
Stable variable ranges between datasets: Minimum and maximum values for variables such as CreditScore, Tenure, Balance, NumOfProducts, HasCrCard, IsActiveMember, EstimatedSalary, and Exited are closely aligned between train and test datasets, with only minor variations.
Comparable central tendencies: Mean values for all variables are similar between train and test datasets, with differences remaining within a narrow range (e.g., CreditScore mean: 648.40 in train vs. 644.77 in test; EstimatedSalary mean: 100,084.79 in train vs. 101,143.30 in test).
The descriptive statistics indicate that the numerical variables are complete, consistently typed, and structurally aligned across both train and test datasets. Central tendencies and value ranges are stable, supporting the integrity and comparability of the data for subsequent modeling and analysis. No data quality issues or anomalies are observed in the reported statistics.
Tables

| dataset | Numerical Variable | Num of Obs | Mean | Min | Max | Missing Values (%) | Data Type |
|---|---|---|---|---|---|---|---|
| train_dataset_final | CreditScore | 2585 | 648.3965 | 350.00 | 850.00 | 0.0 | int64 |
| train_dataset_final | Tenure | 2585 | 4.9988 | 0.00 | 10.00 | 0.0 | int64 |
| train_dataset_final | Balance | 2585 | 81084.2264 | 0.00 | 250898.09 | 0.0 | float64 |
| train_dataset_final | NumOfProducts | 2585 | 1.5122 | 1.00 | 4.00 | 0.0 | int64 |
| train_dataset_final | HasCrCard | 2585 | 0.6917 | 0.00 | 1.00 | 0.0 | int64 |
| train_dataset_final | IsActiveMember | 2585 | 0.4723 | 0.00 | 1.00 | 0.0 | int64 |
| train_dataset_final | EstimatedSalary | 2585 | 100084.7896 | 11.58 | 199992.48 | 0.0 | float64 |
| train_dataset_final | Exited | 2585 | 0.4959 | 0.00 | 1.00 | 0.0 | int64 |
| test_dataset_final | CreditScore | 647 | 644.7728 | 350.00 | 850.00 | 0.0 | int64 |
| test_dataset_final | Tenure | 647 | 5.1468 | 0.00 | 10.00 | 0.0 | int64 |
| test_dataset_final | Balance | 647 | 83586.2161 | 0.00 | 210433.08 | 0.0 | float64 |
| test_dataset_final | NumOfProducts | 647 | 1.5054 | 1.00 | 4.00 | 0.0 | int64 |
| test_dataset_final | HasCrCard | 647 | 0.7094 | 0.00 | 1.00 | 0.0 | int64 |
| test_dataset_final | IsActiveMember | 647 | 0.4745 | 0.00 | 1.00 | 0.0 | int64 |
| test_dataset_final | EstimatedSalary | 647 | 101143.3033 | 142.81 | 199661.50 | 0.0 | float64 |
| test_dataset_final | Exited | 647 | 0.5162 | 0.00 | 1.00 | 0.0 | int64 |
2026-01-28 18:03:24,312 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.TabularDescriptionTables:development_data does not exist in model's document
The Class Imbalance test evaluates the distribution of target classes within the training and test datasets to identify potential imbalances that could affect model performance. The results present the percentage representation of each class in both datasets, with a minimum threshold of 10% set for each class to pass. Bar plots visualize the class proportions, supporting interpretation of the class distribution.
Key insights:
Balanced class distribution in training data: In the train_dataset_final, class 0 constitutes 50.41% and class 1 constitutes 49.59% of the records, both well above the 10% threshold.
Balanced class distribution in test data: In the test_dataset_final, class 1 constitutes 51.62% and class 0 constitutes 48.38% of the records, both exceeding the minimum threshold.
All classes pass imbalance criteria: Both classes in both datasets are marked as "Pass" for the class imbalance test, indicating no class falls below the specified threshold.
Visual confirmation of near-equal proportions: Bar plots for both datasets show visually similar heights for each class, confirming the quantitative findings.
The results indicate that both the training and test datasets exhibit a balanced distribution of the target classes, with each class representing approximately half of the records. No evidence of class imbalance is observed, and all classes meet the minimum representation criteria. This distribution supports the development and evaluation of models without risk of bias due to class under-representation.
Parameters:
{
"min_percent_threshold": 10
}
Tables

| dataset | Exited | Percentage of Rows (%) | Pass/Fail |
|---|---|---|---|
| train_dataset_final | 0 | 50.41% | Pass |
| train_dataset_final | 1 | 49.59% | Pass |
| test_dataset_final | 1 | 51.62% | Pass |
| test_dataset_final | 0 | 48.38% | Pass |
Figures
2026-01-28 18:03:35,823 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.ClassImbalance:development_data does not exist in model's document
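A quick way to confirm these proportions outside of ValidMind, as a sketch where `train_df` and `test_df` are placeholder names for the dataframes behind `train_dataset_final` and `test_dataset_final`:

```python
# Recompute class proportions for the final train/test splits
for name, df in [("train_dataset_final", train_df), ("test_dataset_final", test_df)]:
    pct = df["Exited"].value_counts(normalize=True).mul(100).round(2)
    print(name, pct.to_dict())
```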
The UniqueRows test evaluates the diversity of each column in the dataset by measuring the proportion of unique values relative to the total number of rows, with a minimum threshold set at 1%. The results table presents the number and percentage of unique values for each column in both the training and test datasets, along with a pass/fail outcome based on the threshold. Columns with a percentage of unique values below 1% are marked as "Fail," while those meeting or exceeding the threshold are marked as "Pass."
Key insights:
High uniqueness in continuous variables: Columns such as EstimatedSalary and Balance exhibit high percentages of unique values (100% and 67.7% in training; 100% and 69.4% in test), consistently passing the uniqueness threshold.
Low uniqueness in categorical variables: Columns representing categorical or binary features (e.g., HasCrCard, IsActiveMember, Geography_Germany, Geography_Spain, Gender_Male, Exited) have very low percentages of unique values (all below 1%) and fail the test in both datasets.
Mixed results for ordinal variables: CreditScore passes the threshold in both datasets (16.5% in training, 47.3% in test), while Tenure passes in the test set (1.7%) but fails in the training set (0.43%).
Consistent failure in product count features: NumOfProducts fails the uniqueness threshold in both datasets, with unique value percentages well below 1%.
The results indicate that continuous variables in both training and test datasets demonstrate high diversity, consistently exceeding the uniqueness threshold. In contrast, categorical and binary variables, as well as certain ordinal features, exhibit low uniqueness and do not meet the threshold, resulting in test failures for these columns. This pattern reflects the inherent nature of categorical and binary variables, which are limited in the number of possible unique values, and highlights the distinction in uniqueness profiles between variable types within the dataset.
Parameters:
{
"min_percent_threshold": 1
}
Tables

| dataset | Column | Number of Unique Values | Percentage of Unique Values (%) | Pass/Fail |
|---|---|---|---|---|
| train_dataset_final | CreditScore | 427 | 16.5184 | Pass |
| train_dataset_final | Tenure | 11 | 0.4255 | Fail |
| train_dataset_final | Balance | 1750 | 67.6983 | Pass |
| train_dataset_final | NumOfProducts | 4 | 0.1547 | Fail |
| train_dataset_final | HasCrCard | 2 | 0.0774 | Fail |
| train_dataset_final | IsActiveMember | 2 | 0.0774 | Fail |
| train_dataset_final | EstimatedSalary | 2585 | 100.0000 | Pass |
| train_dataset_final | Geography_Germany | 2 | 0.0774 | Fail |
| train_dataset_final | Geography_Spain | 2 | 0.0774 | Fail |
| train_dataset_final | Gender_Male | 2 | 0.0774 | Fail |
| train_dataset_final | Exited | 2 | 0.0774 | Fail |
| test_dataset_final | CreditScore | 306 | 47.2952 | Pass |
| test_dataset_final | Tenure | 11 | 1.7002 | Pass |
| test_dataset_final | Balance | 449 | 69.3972 | Pass |
| test_dataset_final | NumOfProducts | 4 | 0.6182 | Fail |
| test_dataset_final | HasCrCard | 2 | 0.3091 | Fail |
| test_dataset_final | IsActiveMember | 2 | 0.3091 | Fail |
| test_dataset_final | EstimatedSalary | 647 | 100.0000 | Pass |
| test_dataset_final | Geography_Germany | 2 | 0.3091 | Fail |
| test_dataset_final | Geography_Spain | 2 | 0.3091 | Fail |
| test_dataset_final | Gender_Male | 2 | 0.3091 | Fail |
| test_dataset_final | Exited | 2 | 0.3091 | Fail |
2026-01-28 18:03:51,084 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.UniqueRows:development_data does not exist in model's document
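The same uniqueness percentages can be recomputed in a few lines per dataset, as a sketch using the `train_df` / `test_df` placeholders introduced above:

```python
import pandas as pd

# Percentage of unique values per column and the corresponding pass/fail outcome
for name, df in [("train_dataset_final", train_df), ("test_dataset_final", test_df)]:
    unique_pct = (100 * df.nunique() / len(df)).round(4)
    outcome = (unique_pct >= 1).map({True: "Pass", False: "Fail"})
    print(name)
    print(pd.concat([unique_pct.rename("Percentage of Unique Values (%)"),
                     outcome.rename("Pass/Fail")], axis=1))
```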
The TabularNumericalHistograms test provides a visual assessment of the distribution of each numerical feature in both the training and test datasets. The resulting histograms display the frequency distribution for each variable, enabling identification of skewness, outliers, and other distributional characteristics. This visualization supports the evaluation of input data quality and the detection of potential risks related to data distribution.
Key insights:
CreditScore distribution is right-skewed: Both training and test datasets show a right-skewed distribution for CreditScore, with a concentration of values between 550 and 750 and a long tail toward higher scores.
Tenure is nearly uniform except at boundaries: Tenure displays an approximately uniform distribution across most values, with lower frequencies at the minimum and maximum values in both datasets.
Balance shows a high frequency at zero: A substantial proportion of records have a zero balance, with the remainder forming a bell-shaped distribution centered around 120,000–130,000.
NumOfProducts is heavily concentrated at lower values: The majority of records have one or two products, with very few instances at three or four products.
HasCrCard and IsActiveMember are binary with class imbalance: Both features are binary; HasCrCard is skewed toward 1, while IsActiveMember is more evenly split but still shows a higher count for 0 in the training set.
EstimatedSalary is uniformly distributed: EstimatedSalary appears to be evenly distributed across its range in both datasets, with no significant skewness or outliers.
Geography and Gender features show categorical splits: Geography_Germany and Geography_Spain indicate more records in the 'false' category, while Gender_Male is nearly balanced between true and false.
The histograms reveal that most numerical features exhibit stable and interpretable distributions across both training and test datasets. Notable characteristics include a high proportion of zero values in Balance, right-skewness in CreditScore, and strong concentration at lower values for NumOfProducts. Binary and categorical features display varying degrees of class imbalance. No extreme outliers or unexpected distributional anomalies are observed, supporting the overall integrity of the input data distributions.
Figures
2026-01-28 18:04:11,858 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.TabularNumericalHistograms:development_data does not exist in model's document
The Mutual Information test evaluates the statistical dependency between each feature and the target variable to quantify feature relevance for model development. The results are presented as normalized mutual information scores for both the training and test datasets, with a threshold of 0.01 used to distinguish features with minimal information content. Bar plots display the relative importance of each feature, highlighting the distribution and magnitude of mutual information across the feature set.
Key insights:
NumOfProducts consistently highest information score: NumOfProducts exhibits the highest mutual information score in both training (≈0.09) and test (≈0.096) datasets, indicating strong relevance to the target variable.
Feature importance distribution shifts between datasets: In the training set, Geography_Germany and Balance follow as the next most informative features, while in the test set, Balance, IsActiveMember, and Gender_Male show increased scores and prominence.
Several features below information threshold: In the training set, CreditScore, IsActiveMember, Geography_Spain, Gender_Male, Tenure, HasCrCard, and EstimatedSalary all fall below the 0.01 threshold, indicating limited information content. In the test set, only HasCrCard, EstimatedSalary, and Geography_Spain remain below this threshold.
Increased information content in test set for multiple features: IsActiveMember, Gender_Male, Geography_Germany, and Tenure display higher mutual information scores in the test set compared to the training set, with all exceeding the 0.01 threshold.
The mutual information analysis reveals a concentrated distribution of feature relevance, with NumOfProducts consistently providing the highest information content across both datasets. The test set demonstrates a broader set of features exceeding the information threshold, suggesting increased relevance for IsActiveMember, Gender_Male, Geography_Germany, and Tenure relative to the training set. Several features, including HasCrCard, EstimatedSalary, and Geography_Spain, consistently exhibit minimal information content, indicating limited direct association with the target variable. The observed shifts in feature importance between datasets highlight potential changes in feature-target relationships that warrant ongoing monitoring.
Parameters:
{
"min_threshold": 0.01
}
Figures
2026-01-28 18:04:31,027 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.MutualInformation:development_data does not exist in model's document
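Feature relevance scores of this kind can be approximated with scikit-learn's `mutual_info_classif`; exact values will differ from the test output depending on the estimator settings and normalization ValidMind applies internally. A sketch with the `train_df` placeholder:

```python
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

# Estimate mutual information between each feature and the target
X = train_df.drop(columns=["Exited"])
y = train_df["Exited"]
mi = pd.Series(mutual_info_classif(X, y, random_state=42), index=X.columns)
print(mi.sort_values(ascending=False).round(4))
print("Features at or above the 0.01 threshold:", list(mi[mi >= 0.01].index))
```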
The Pearson Correlation Matrix test evaluates the linear dependency between all pairs of numerical variables in the dataset, providing a matrix of Pearson correlation coefficients visualized as a heatmap. The results display the correlation structure for both the training and test datasets, with coefficients ranging from -1 to 1, and highlight any high correlations (absolute value > 0.7) for further review. The heatmaps show the magnitude and direction of relationships between variables, allowing for identification of potential redundancy or multicollinearity.
Key insights:
No high correlations detected: All pairwise correlation coefficients in both training and test datasets are below the 0.7 threshold, indicating the absence of strong linear dependencies among variables.
Consistent correlation structure across splits: The correlation patterns and magnitudes are stable between the training and test datasets, with the highest observed correlations (e.g., Balance and Geography_Germany, approximately 0.41–0.42) remaining moderate and consistent.
Low risk of multicollinearity: The lack of high-magnitude correlations suggests that the variables are largely independent, minimizing the risk of redundancy or instability in model estimation.
The correlation analysis reveals a stable and moderate correlation structure across both training and test datasets, with no evidence of strong linear relationships or multicollinearity among the variables. This indicates that the dataset maintains a high degree of variable independence, supporting reliable model estimation and interpretability.
Figures
2026-01-28 18:04:43,647 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.PearsonCorrelationMatrix:development_data does not exist in model's document
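The heatmap values can be reproduced with `DataFrame.corr()`. The sketch below lists the largest absolute pairwise correlations for the training split (placeholder `train_df`):

```python
import numpy as np

corr = train_df.corr(numeric_only=True)
mask = np.triu(np.ones(corr.shape, dtype=bool))      # drop the diagonal and duplicate pairs
top_pairs = corr.where(~mask).stack().abs().sort_values(ascending=False)
print(top_pairs.head(10).round(4))                   # highest values remain well below 0.7
```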
The High Pearson Correlation test identifies pairs of features in the dataset that exhibit strong linear relationships, with the aim of detecting potential feature redundancy or multicollinearity. The results present the top ten absolute Pearson correlation coefficients for both the training and test datasets, indicating whether each pair exceeds the predefined threshold of 0.3. Correlation coefficients, feature pairs, and pass/fail status are reported for each dataset, highlighting the most significant relationships observed.
Key insights:
Two feature pairs exceed correlation threshold: In both the training and test datasets, the pairs (Balance, Geography_Germany) and (Geography_Germany, Geography_Spain) display absolute correlation coefficients above the 0.3 threshold, with values ranging from 0.3627 to 0.421. These pairs are flagged as "Fail" by the test.
Consistent correlation structure across datasets: The same feature pairs exceed the threshold in both the training and test datasets, with similar coefficient magnitudes, indicating stable correlation patterns between these variables.
All other feature pairs below threshold: The remaining top correlations in both datasets have absolute coefficients below 0.3, resulting in a "Pass" status and indicating no further high linear relationships among other feature pairs.
The test results reveal that only two feature pairs—(Balance, Geography_Germany) and (Geography_Germany, Geography_Spain)—exhibit moderate linear correlations exceeding the specified threshold in both the training and test datasets. The consistency of these correlations across datasets suggests a stable relationship between these variables. All other feature pairs remain below the threshold, indicating limited risk of widespread multicollinearity or feature redundancy within the evaluated features.
2026-01-28 18:04:52,126 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.HighPearsonCorrelation:development_data does not exist in model's document
validmind.model_validation.ModelMetadata
Model Metadata
The ModelMetadata test compares the metadata of different models to assess consistency in architecture, framework, framework version, and programming language. The summary table presents metadata fields for each model, including modeling technique, framework, version, and programming language. Both models, log_model_champion and rf_model, are included in the comparison, with all relevant metadata fields populated.
Key insights:
Consistent modeling technique and framework: Both models use the SKlearnModel technique and the sklearn framework.
Identical framework versions: Framework Version 1.8.0 is used for both models.
Uniform programming language: Python is the programming language for both models.
The metadata comparison indicates complete alignment across all evaluated fields for the two models. No inconsistencies or missing metadata are observed, and the models share the same architecture, framework, version, and programming language. This uniformity supports ease of integration and management within the modeling environment.
Tables

| model | Modeling Technique | Modeling Framework | Framework Version | Programming Language |
|---|---|---|---|---|
| log_model_champion | SKlearnModel | sklearn | 1.8.0 | Python |
| rf_model | SKlearnModel | sklearn | 1.8.0 | Python |
2026-01-28 18:04:55,917 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.model_validation.ModelMetadata does not exist in model's document
The Model Parameters test extracts and displays all configuration parameters for each model to ensure transparency and reproducibility. The results present a structured table listing parameter names and their corresponding values for both the logistic regression model (log_model_champion) and the random forest model (rf_model). Each parameter setting is shown explicitly, providing a comprehensive view of the model configurations at the time of testing.
Key insights:
Explicit parameterization for both models: All parameters for log_model_champion and rf_model are listed, including regularization, solver, and iteration settings for the logistic regression model, and tree construction, sampling, and random state settings for the random forest model.
No missing or undefined parameters: Every parameter extracted via the scikit-learn API is present in the results, with no missing or null values observed.
Use of non-default and default values: The logistic regression model specifies penalty='l1' and solver='liblinear', while the random forest model uses n_estimators=50 and random_state=42, indicating a mix of explicitly set and default parameter values.
The extracted parameter set provides a complete and transparent record of model configuration for both the logistic regression and random forest models. All relevant parameters are captured, supporting reproducibility and facilitating future audits or validation efforts. The results indicate that both models are fully parameterized with no missing configuration elements.
Tables

| model | Parameter | Value |
|---|---|---|
| log_model_champion | C | 1 |
| log_model_champion | dual | False |
| log_model_champion | fit_intercept | True |
| log_model_champion | intercept_scaling | 1 |
| log_model_champion | max_iter | 100 |
| log_model_champion | penalty | l1 |
| log_model_champion | solver | liblinear |
| log_model_champion | tol | 0.0001 |
| log_model_champion | verbose | 0 |
| log_model_champion | warm_start | False |
| rf_model | bootstrap | True |
| rf_model | ccp_alpha | 0.0 |
| rf_model | criterion | gini |
| rf_model | max_features | sqrt |
| rf_model | min_impurity_decrease | 0.0 |
| rf_model | min_samples_leaf | 1 |
| rf_model | min_samples_split | 2 |
| rf_model | min_weight_fraction_leaf | 0.0 |
| rf_model | n_estimators | 50 |
| rf_model | oob_score | False |
| rf_model | random_state | 42 |
| rf_model | verbose | 0 |
| rf_model | warm_start | False |
2026-01-28 18:05:03,226 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.model_validation.sklearn.ModelParameters does not exist in model's document
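These values come straight from scikit-learn's `get_params()`, so the configuration can also be inspected directly on the estimators. A sketch where `log_reg` and `rf` are placeholder names for the fitted champion and challenger models:

```python
# Print the scikit-learn parameters that the ModelParameters test surfaces
for name, estimator in [("log_model_champion", log_reg), ("rf_model", rf)]:
    print(name)
    for param, value in sorted(estimator.get_params().items()):
        print(f"  {param} = {value}")
```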
validmind.model_validation.sklearn.ROCCurve
ROC Curve
The ROC Curve test evaluates the binary classification performance of the log_model_champion by plotting the Receiver Operating Characteristic (ROC) curve and calculating the Area Under the Curve (AUC) for both the training and test datasets. The resulting plots display the trade-off between the true positive rate and false positive rate at various thresholds, with the AUC quantifying the model's ability to distinguish between classes. The ROC curves and corresponding AUC values are presented for both the train_dataset_final and test_dataset_final.
Key insights:
AUC indicates moderate discriminative power on training data: The AUC for the train_dataset_final is 0.68, reflecting moderate ability to separate positive and negative classes.
Improved AUC on test data: The AUC for the test_dataset_final is 0.73, indicating stronger discriminative performance on unseen data compared to the training set.
ROC curves consistently above random baseline: Both ROC curves remain above the random classifier line (AUC = 0.5), demonstrating that the model provides meaningful class separation across thresholds.
The ROC Curve test results show that log_model_champion achieves moderate to good discrimination between classes, with AUC values of 0.68 on the training set and 0.73 on the test set. The model's performance is consistently above the random baseline, and the higher AUC on the test data suggests effective generalization to new samples. No indications of model collapse or lack of discriminative ability are observed in these results.
Figures
2026-01-28 18:05:14,617 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.model_validation.sklearn.ROCCurve does not exist in model's document
The Minimum ROC AUC Score test evaluates whether the model's ROC AUC score meets or exceeds a specified minimum threshold, providing an assessment of the model's ability to distinguish between classes. The results table presents ROC AUC scores for both the training and test datasets, alongside the applied threshold and the corresponding pass/fail outcome. Both datasets are evaluated against a minimum threshold of 0.5, with the observed scores and test outcomes reported for each.
Key insights:
ROC AUC scores exceed threshold on all datasets: Both the training (0.6766) and test (0.7271) datasets register ROC AUC scores above the minimum threshold of 0.5.
Test dataset shows higher discriminative performance: The ROC AUC score on the test dataset (0.7271) is higher than on the training dataset (0.6766), indicating improved class separation on unseen data.
Test passed for all evaluated datasets: The pass/fail status is "Pass" for both datasets, confirming that the model meets the minimum ROC AUC requirement across splits.
The results indicate that the model demonstrates adequate discriminative capability on both training and test datasets, with ROC AUC scores consistently above the defined threshold. The higher score on the test dataset suggests stable or improved generalization performance, and the model satisfies the minimum ROC AUC criterion across all evaluated data partitions.
Parameters:
{
"min_threshold": 0.5
}
Tables

| dataset | Score | Threshold | Pass/Fail |
|---|---|---|---|
| train_dataset_final | 0.6766 | 0.5 | Pass |
| test_dataset_final | 0.7271 | 0.5 | Pass |
2026-01-28 18:05:20,533 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.model_validation.sklearn.MinimumROCAUCScore does not exist in model's document
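The reported AUCs can be cross-checked with scikit-learn, as a sketch where `log_reg`, `X_train`, `y_train`, `X_test`, and `y_test` are placeholder names for the champion estimator and the final split features and targets:

```python
from sklearn.metrics import roc_auc_score

# Recompute ROC AUC on both splits and apply the same 0.5 minimum threshold
for name, X, y in [("train_dataset_final", X_train, y_train), ("test_dataset_final", X_test, y_test)]:
    auc = roc_auc_score(y, log_reg.predict_proba(X)[:, 1])
    print(f"{name}: ROC AUC = {auc:.4f} -> {'Pass' if auc >= 0.5 else 'Fail'}")
```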
In summary
In this final notebook and across our ValidMind for model validation series, you learned how to validate a model end-to-end with the ValidMind Library by running through some common scenarios in a typical model validation setting:
Verifying the data quality steps performed by the model development team
Independently replicating the champion model's results and conducting additional tests to assess performance, stability, and robustness
Setting up test inputs and a challenger model for comparative analysis
Running validation tests, analyzing results, and logging artifacts to ValidMind
Next steps
Work with your validation report
Now that you've logged all your test results and verified the work done by the model development team, head to the ValidMind Platform to wrap up your validation report. Continue to work on your validation report by:
Inserting additional test results: Click Link Evidence to Report under any section of 2. Validation in your validation report. (Learn more: Link evidence to reports)
Making qualitative edits to your test descriptions: Expand any linked evidence under Validator Evidence and click See evidence details to review and edit the ValidMind-generated test descriptions for quality and accuracy. (Learn more: Preparing validation reports)
Adding more findings: Click Link Finding to Report in any validation report section, then click + Create New Finding. (Learn more: Add and manage model findings)
Adding risk assessment notes: Click under Risk Assessment Notes in any validation report section to access the text editor and content editing toolbar, including an option to generate a draft with AI. Once generated, edit your ValidMind-generated test descriptions to adhere to your organization's requirements. (Learn more: Work with content blocks)
Assessing compliance: Under the Guideline for any validation report section, click ASSESSMENT and select the compliance status from the drop-down menu. (Learn more: Provide compliance assessments)
Collaborating with other stakeholders: Use the ValidMind Platform's real-time collaborative features to work seamlessly with the rest of your organization, including model developers. Propose suggested changes in the model documentation, work with versioned history, and use comments to discuss specific portions of the model documentation. (Learn more: Collaborate with others)
When your validation report is complete and ready for review, submit it for approval from the same ValidMind Platform where you made your edits and collaborated with the rest of your organization, ensuring transparency and a thorough model validation history. (Learn more: Submit for approval)
Learn more
Now that you're familiar with the basics, you can explore the following notebooks to get a deeper understanding of how the ValidMind Library assists you in streamlining model validation: