ValidMind for model validation 2 — Start the model validation process

Learn how to use ValidMind for your end-to-end model validation process with our series of four introductory notebooks. In this second notebook, independently verify the data quality tests performed on the dataset used to train the champion model.

You'll learn how to run relevant validation tests with ValidMind, log the results of those tests to the ValidMind Platform, and insert your logged test results as evidence into your validation report. You'll become familiar with the tests available in ValidMind, as well as how to run them. Running tests during model validation is crucial to the effective challenge process, as we want to independently evaluate the evidence and assessments provided by the model development team.

While running our tests in this notebook, we'll focus on independently assessing the quality of the data used to train the champion model.

For a full list of out-of-the-box tests, refer to our Test descriptions or try the interactive Test sandbox.

Learn by doing

Our course tailor-made for validators new to ValidMind combines this series of notebooks with a more in-depth introduction to the ValidMind Platform — Validator Fundamentals

Prerequisites

In order to independently assess the quality of your datasets with this notebook, you'll need to first have:

Need help with the above steps?

Refer to the first notebook in this series: 1 — Set up the ValidMind Library for validation

Setting up

Initialize the ValidMind Library

First, let's connect the ValidMind Library to the model we previously registered in the ValidMind Platform:

  1. In a browser, log in to ValidMind.

  2. In the left sidebar, navigate to Inventory and select the model you registered for this "ValidMind for model validation" series of notebooks.

  3. Go to Getting Started and click Copy snippet to clipboard.

Next, load your model identifier credentials from an .env file or replace the placeholder with your own code snippet:

# Make sure the ValidMind Library is installed

%pip install -q validmind

# Load your model identifier credentials from an `.env` file

%load_ext dotenv
%dotenv .env

# Or replace with your code snippet

import validmind as vm

vm.init(
    # api_host="...",
    # api_key="...",
    # api_secret="...",
    # model="...",
)
Note: you may need to restart the kernel to use updated packages.
2025-12-31 22:31:39,654 - INFO(validmind.api_client): 🎉 Connected to ValidMind!
📊 Model: [ValidMind Academy] Model validation (ID: cmalguc9y02ok199q2db381ib)
📁 Document Type: validation_report

Load the sample dataset

Let's first import the public Bank Customer Churn Prediction dataset from Kaggle, which was used to develop the dummy champion model.

We'll use this dataset to review steps that should have been conducted during the initial development and documentation of the model to ensure that the model was built correctly. By independently performing steps taken by the model development team, we can confirm whether the model was built using appropriate and properly processed data.

In the example below, note that:

  • The target column, Exited, has a value of 1 when a customer has churned and 0 otherwise.
  • The ValidMind Library provides a wrapper to automatically load the dataset as a Pandas DataFrame object. A Pandas DataFrame is a two-dimensional tabular data structure organized into rows and columns.
from validmind.datasets.classification import customer_churn as demo_dataset

print(
    f"Loaded demo dataset with: \n\n\t• Target column: '{demo_dataset.target_column}' \n\t• Class labels: {demo_dataset.class_labels}"
)

raw_df = demo_dataset.load_data()
raw_df.head()
Loaded demo dataset with: 

    • Target column: 'Exited' 
    • Class labels: {'0': 'Did not exit', '1': 'Exited'}
CreditScore Geography Gender Age Tenure Balance NumOfProducts HasCrCard IsActiveMember EstimatedSalary Exited
0 619 France Female 42 2 0.00 1 1 1 101348.88 1
1 608 Spain Female 41 1 83807.86 1 0 1 112542.58 0
2 502 France Female 42 8 159660.80 3 1 0 113931.57 1
3 699 France Female 39 1 0.00 2 0 0 93826.63 0
4 850 Spain Female 43 2 125510.82 1 1 1 79084.10 0

Verifying data quality adjustments

Let's say that, thanks to the documentation submitted by the model development team, we know the sample dataset was modified before being used to train the champion model: after performing data quality assessments on the raw dataset, the development team determined that the dataset required rebalancing, and they also removed highly correlated features.

Identify qualitative tests

During model validation, we apply the same data processing logic and training procedure to confirm that the model's results can be reproduced independently. Let's start with some data quality assessments, running a few individual tests just as the development team did.

Use the vm.tests.list_tests() function introduced by the first notebook in this series in combination with vm.tests.list_tags() and vm.tests.list_tasks() to find which prebuilt tests are relevant for data quality assessment:

  • tasks represent the kind of modeling task associated with a test. Here we'll focus on classification tasks.
  • tags are free-form descriptions providing more details about the test, for example, what category the test falls into. Here we'll focus on the data_quality tag.
# Get the list of available task types
sorted(vm.tests.list_tasks())
['classification',
 'clustering',
 'data_validation',
 'feature_extraction',
 'monitoring',
 'nlp',
 'regression',
 'residual_analysis',
 'text_classification',
 'text_generation',
 'text_qa',
 'text_summarization',
 'time_series_forecasting',
 'visualization']
# Get the list of available tags
sorted(vm.tests.list_tags())
['AUC',
 'analysis',
 'anomaly_detection',
 'bias_and_fairness',
 'binary_classification',
 'calibration',
 'categorical_data',
 'classification',
 'classification_metrics',
 'clustering',
 'correlation',
 'credit_risk',
 'data_analysis',
 'data_distribution',
 'data_quality',
 'data_validation',
 'descriptive_statistics',
 'dimensionality_reduction',
 'distribution',
 'embeddings',
 'feature_importance',
 'feature_selection',
 'few_shot',
 'forecasting',
 'frequency_analysis',
 'kmeans',
 'linear_regression',
 'llm',
 'logistic_regression',
 'metadata',
 'model_comparison',
 'model_diagnosis',
 'model_explainability',
 'model_interpretation',
 'model_performance',
 'model_predictions',
 'model_selection',
 'model_training',
 'model_validation',
 'multiclass_classification',
 'nlp',
 'normality',
 'numerical_data',
 'outliers',
 'qualitative',
 'rag_performance',
 'ragas',
 'regression',
 'retrieval_performance',
 'scorecard',
 'seasonality',
 'senstivity_analysis',
 'sklearn',
 'stationarity',
 'statistical_test',
 'statistics',
 'statsmodels',
 'tabular_data',
 'text_data',
 'threshold_optimization',
 'time_series_data',
 'unit_root_test',
 'visualization',
 'zero_shot']

You can pass tags and tasks as parameters to the vm.tests.list_tests() function to filter the tests based on the tags and task types.

For example, to find tests related to tabular data quality for classification models, you can call list_tests() like this:

vm.tests.list_tests(task="classification", tags=["tabular_data", "data_quality"])
ID Name Description Has Figure Has Table Required Inputs Params Tags Tasks
validmind.data_validation.ClassImbalance Class Imbalance Evaluates and quantifies class distribution imbalance in a dataset used by a machine learning model.... True True ['dataset'] {'min_percent_threshold': {'type': 'int', 'default': 10}} ['tabular_data', 'binary_classification', 'multiclass_classification', 'data_quality'] ['classification']
validmind.data_validation.DescriptiveStatistics Descriptive Statistics Performs a detailed descriptive statistical analysis of both numerical and categorical data within a model's... False True ['dataset'] {} ['tabular_data', 'time_series_data', 'data_quality'] ['classification', 'regression']
validmind.data_validation.Duplicates Duplicates Tests dataset for duplicate entries, ensuring model reliability via data quality verification.... False True ['dataset'] {'min_threshold': {'type': '_empty', 'default': 1}} ['tabular_data', 'data_quality', 'text_data'] ['classification', 'regression']
validmind.data_validation.HighCardinality High Cardinality Assesses the number of unique values in categorical columns to detect high cardinality and potential overfitting.... False True ['dataset'] {'num_threshold': {'type': 'int', 'default': 100}, 'percent_threshold': {'type': 'float', 'default': 0.1}, 'threshold_type': {'type': 'str', 'default': 'percent'}} ['tabular_data', 'data_quality', 'categorical_data'] ['classification', 'regression']
validmind.data_validation.HighPearsonCorrelation High Pearson Correlation Identifies highly correlated feature pairs in a dataset suggesting feature redundancy or multicollinearity.... False True ['dataset'] {'max_threshold': {'type': 'float', 'default': 0.3}, 'top_n_correlations': {'type': 'int', 'default': 10}, 'feature_columns': {'type': 'list', 'default': None}} ['tabular_data', 'data_quality', 'correlation'] ['classification', 'regression']
validmind.data_validation.MissingValues Missing Values Evaluates dataset quality by ensuring missing value ratio across all features does not exceed a set threshold.... False True ['dataset'] {'min_threshold': {'type': 'int', 'default': 1}} ['tabular_data', 'data_quality'] ['classification', 'regression']
validmind.data_validation.MissingValuesBarPlot Missing Values Bar Plot Assesses the percentage and distribution of missing values in the dataset via a bar plot, with emphasis on... True False ['dataset'] {'threshold': {'type': 'int', 'default': 80}, 'fig_height': {'type': 'int', 'default': 600}} ['tabular_data', 'data_quality', 'visualization'] ['classification', 'regression']
validmind.data_validation.Skewness Skewness Evaluates the skewness of numerical data in a dataset to check against a defined threshold, aiming to ensure data... False True ['dataset'] {'max_threshold': {'type': '_empty', 'default': 1}} ['data_quality', 'tabular_data'] ['classification', 'regression']
validmind.plots.BoxPlot Box Plot Generates customizable box plots for numerical features in a dataset with optional grouping using Plotly.... True False ['dataset'] {'columns': {'type': 'Optional', 'default': None}, 'group_by': {'type': 'Optional', 'default': None}, 'width': {'type': 'int', 'default': 1800}, 'height': {'type': 'int', 'default': 1200}, 'colors': {'type': 'Optional', 'default': None}, 'show_outliers': {'type': 'bool', 'default': True}, 'title_prefix': {'type': 'str', 'default': 'Box Plot of'}} ['tabular_data', 'visualization', 'data_quality'] ['classification', 'regression', 'clustering']
validmind.plots.HistogramPlot Histogram Plot Generates customizable histogram plots for numerical features in a dataset using Plotly.... True False ['dataset'] {'columns': {'type': 'Optional', 'default': None}, 'bins': {'type': 'Union', 'default': 30}, 'color': {'type': 'str', 'default': 'steelblue'}, 'opacity': {'type': 'float', 'default': 0.7}, 'show_kde': {'type': 'bool', 'default': True}, 'normalize': {'type': 'bool', 'default': False}, 'log_scale': {'type': 'bool', 'default': False}, 'title_prefix': {'type': 'str', 'default': 'Histogram of'}, 'width': {'type': 'int', 'default': 1200}, 'height': {'type': 'int', 'default': 800}, 'n_cols': {'type': 'int', 'default': 2}, 'vertical_spacing': {'type': 'float', 'default': 0.15}, 'horizontal_spacing': {'type': 'float', 'default': 0.1}} ['tabular_data', 'visualization', 'data_quality'] ['classification', 'regression', 'clustering']
validmind.stats.DescriptiveStats Descriptive Stats Provides comprehensive descriptive statistics for numerical features in a dataset.... False True ['dataset'] {'columns': {'type': 'Optional', 'default': None}, 'include_advanced': {'type': 'bool', 'default': True}, 'confidence_level': {'type': 'float', 'default': 0.95}} ['tabular_data', 'statistics', 'data_quality'] ['classification', 'regression', 'clustering']
Want to learn more about navigating ValidMind tests?

Refer to our notebook outlining the utilities available for viewing and understanding available ValidMind tests: Explore tests

Initialize the ValidMind datasets

With the individual tests we want to run identified, the next step is to connect your data with a ValidMind Dataset object. This step is necessary whenever you want to connect a dataset to documentation and produce test results through ValidMind, but you only need to do it once per dataset.

Initialize a ValidMind dataset object using the init_dataset function from the ValidMind (vm) module. For this example, we'll pass in the following arguments:

  • dataset — The raw dataset that you want to provide as input to tests.
  • input_id — A unique identifier that allows tracking what inputs are used when running each individual test.
  • target_column — A required argument if tests require access to true values. This is the name of the target column in the dataset.
# vm_raw_dataset is now a VMDataset object that you can pass to any ValidMind test
vm_raw_dataset = vm.init_dataset(
    dataset=raw_df,
    input_id="raw_dataset",
    target_column="Exited",
)

Run data quality tests

Now that we know how to initialize a ValidMind dataset object, we're ready to run some tests!

You run individual tests by calling the run_test function provided by the validmind.tests module. For the examples below, we'll pass in the following arguments:

  • test_id — The ID of the test to run, as seen in the ID column when you run list_tests.
  • params — A dictionary of parameters for the test. These will override any default_params set in the test definition.

Run tabular data tests

The inputs expected by a test can also be found in the test definition — let's take validmind.data_validation.DescriptiveStatistics as an example.

Note that the output of the describe_test() function below shows that this test expects a dataset as input:

vm.tests.describe_test("validmind.data_validation.DescriptiveStatistics")
Test: Descriptive Statistics ('validmind.data_validation.DescriptiveStatistics')

Now, let's run a few tests to assess the quality of the dataset:

result2 = vm.tests.run_test(
    test_id="validmind.data_validation.ClassImbalance",
    inputs={"dataset": vm_raw_dataset},
    params={"min_percent_threshold": 30},
)

❌ Class Imbalance

Class Imbalance is designed to evaluate the distribution of target classes in a dataset used by a machine learning model. Its primary purpose is to ensure that the classes are not overly skewed, which could lead to bias in the model's predictions. A balanced training dataset is crucial to avoid creating a model that performs well for the majority class but poorly for the minority class.

The test operates by calculating the frequency of each class in the target column of the dataset, expressed as a percentage of the total records. It checks whether each class appears in at least a set minimum percentage of the total records, with the default threshold set at 10%. This threshold is adjustable to accommodate different use cases. The test identifies any class that falls below this threshold as high risk, indicating potential class imbalance. The methodology involves counting the occurrences of each class, dividing by the total number of records, and comparing the result to the threshold. A class distribution that meets or exceeds the threshold is considered balanced, while those that do not are flagged as imbalanced.

The primary advantages of this test include its ability to quickly identify under-represented classes that could affect the efficiency of a machine learning model. The straightforward calculation makes it efficient and easy to implement. Additionally, the test is highly informative as it not only spots imbalance but also quantifies the degree of imbalance. The adjustable threshold provides flexibility, allowing adaptation to different domains or specific needs. Furthermore, the test generates a visually insightful plot that shows the classes and their corresponding proportions, enhancing interpretability and comprehension of the data.

It should be noted that the test might struggle with datasets containing a high number of classes, where imbalance could be inherent due to the natural distribution. The sensitivity to the threshold value might lead to incorrect detection of imbalance if set too high. Additionally, the test does not account for varying costs or impacts of misclassifying different classes, which can vary based on specific applications. While it identifies imbalances, it does not provide direct methods to address or correct them. The test is applicable only for classification tasks and is unsuitable for regression or clustering.

This test shows the results in both tabular and graphical formats. The table presents the percentage of rows for each class and indicates whether each class passes or fails the threshold test. The plot visually represents the class distribution, with the x-axis showing the class labels and the y-axis indicating the percentage of records. The table reveals that class '0' comprises 79.80% of the dataset and passes the threshold, while class '1' comprises 20.20% and fails, given the threshold is set at 30%. The plot corroborates this, showing a significant imbalance between the two classes, with class '0' dominating the dataset.

The test results reveal the following key insights:

  • Significant Class Imbalance: Class '0' constitutes 79.80% of the dataset, while class '1' makes up only 20.20%, indicating a significant imbalance.
  • Threshold Failure for Minority Class: Class '1' fails to meet the 30% threshold, highlighting a potential risk for model bias towards the majority class.

Based on these results, the dataset exhibits a notable class imbalance, with class '0' being significantly over-represented compared to class '1'. This imbalance suggests a potential risk for the model to be biased towards predicting the majority class more accurately. The failure of class '1' to meet the threshold indicates that the model may struggle with accurately predicting this minority class, potentially leading to lower performance in real-world applications where this class is critical. The insights emphasize the need for strategies to address this imbalance to ensure the model's predictions are fair and accurate across all classes.

Parameters:

{
  "min_percent_threshold": 30
}
            

Tables

Exited Class Imbalance

Exited Percentage of Rows (%) Pass/Fail
0 79.80% Pass
1 20.20% Fail

Figures

ValidMind Figure validmind.data_validation.ClassImbalance:5c94

The output above shows that the class imbalance test did not pass according to the value we set for min_percent_threshold — great, this matches what was reported by the model development team.

To address this issue, we'll re-run the test on some processed data. In this case, let's apply a very simple rebalancing technique to the dataset:

import pandas as pd

raw_copy_df = raw_df.sample(frac=1)  # Create a shuffled copy of the raw dataset

# Create a balanced dataset with the same number of exited and not exited customers
exited_df = raw_copy_df.loc[raw_copy_df["Exited"] == 1]
not_exited_df = raw_copy_df.loc[raw_copy_df["Exited"] == 0].sample(n=exited_df.shape[0])

balanced_raw_df = pd.concat([exited_df, not_exited_df])
balanced_raw_df = balanced_raw_df.sample(frac=1, random_state=42)
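
As an optional sanity check (not part of the original workflow), you can confirm with pandas that both classes now have equal counts before re-running the test:

# Optional: verify the rebalanced class counts with pandas
print(balanced_raw_df["Exited"].value_counts())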

With this new balanced dataset, you can re-run the individual test to see if it now passes the class imbalance test requirement.

As this is technically a different dataset, remember to first initialize a new ValidMind Dataset object to pass in as input, as required by run_test():

# Register new data and now 'balanced_raw_dataset' is the new dataset object of interest
vm_balanced_raw_dataset = vm.init_dataset(
    dataset=balanced_raw_df,
    input_id="balanced_raw_dataset",
    target_column="Exited",
)
# Pass the initialized `balanced_raw_dataset` as input into the test run
result = vm.tests.run_test(
    test_id="validmind.data_validation.ClassImbalance",
    inputs={"dataset": vm_balanced_raw_dataset},
    params={"min_percent_threshold": 30},
)

✅ Class Imbalance

Class Imbalance is designed to evaluate the distribution of target classes in a dataset used by a machine learning model. Its primary purpose is to ensure that the classes are not overly skewed, which could lead to bias in the model's predictions. A balanced training dataset is crucial to avoid creating a model that performs well for the majority class but poorly for the minority class.

The test operates by calculating the frequency of each class in the target column of the dataset, expressed as a percentage. It checks whether each class appears in at least a set minimum percentage of the total records, with the default threshold set at 10%. This involves counting the occurrences of each class and dividing by the total number of records to obtain the percentage. The test then compares these percentages against the threshold to determine if any class is under-represented. A class that falls below the threshold is marked as high risk, indicating potential imbalance. The typical range for these percentages is from 0% to 100%, with values below the threshold considered poor and indicative of imbalance.

The primary advantages of this test include its ability to quickly identify under-represented classes that could affect model efficiency. The straightforward calculation and the informative nature of the test make it particularly useful, as it not only spots imbalance but also quantifies it. The adjustable threshold allows for flexibility, adapting to different use-cases or domain-specific needs. Additionally, the test provides a visual plot showing class proportions, enhancing interpretability and comprehension of the data.

It should be noted that the test might struggle with datasets containing a high number of classes, where imbalance could be inherent. Sensitivity to the threshold value might lead to incorrect imbalance detection if set too high. The test does not account for varying costs or impacts of misclassifying different classes, which can vary by application. While it identifies imbalances, it does not offer direct methods to address them. The test is applicable only for classification tasks and is unsuitable for regression or clustering.

This test shows the results in both tabular and graphical formats. The table presents the percentage of rows for each class and indicates whether each class passes or fails based on the threshold. The plot visually represents the class distribution, with the x-axis showing the classes and the y-axis showing the percentage. Both classes, 0 and 1, have equal representation at 50%, which is above the 30% threshold, resulting in a "Pass" for both. The plot confirms this balance, with both bars reaching the 0.5 mark on the y-axis, indicating equal distribution.

The test results reveal the following key insights:

  • Balanced Class Distribution: Both classes, 0 and 1, have an equal distribution of 50%, which is well above the 30% threshold, indicating no class imbalance.
  • Pass for All Classes: Each class meets the minimum percentage requirement, resulting in a "Pass" for both classes, suggesting a well-balanced dataset.

Based on these results, the dataset demonstrates a balanced class distribution, with both classes equally represented. This balance suggests that the model is unlikely to be biased towards any particular class, supporting robust and fair predictions. The equal distribution across classes ensures that the model can learn effectively from all available data, reducing the risk of skewed predictions and enhancing overall model performance.

Parameters:

{
  "min_percent_threshold": 30
}
            

Tables

Exited Class Imbalance

Exited Percentage of Rows (%) Pass/Fail
0 50.00% Pass
1 50.00% Pass

Figures

ValidMind Figure validmind.data_validation.ClassImbalance:3f7e

Remove highly correlated features

Next, let's also remove highly correlated features from our dataset as outlined by the development team. Removing highly correlated features helps make the model simpler, more stable, and easier to understand.

You can reuse the output of a ValidMind test for further processing — in the example below, to retrieve the list of features with the highest correlation coefficients and use it to reduce the final set of features for modeling.

First, we'll run validmind.data_validation.HighPearsonCorrelation with the previously initialized balanced_raw_dataset as input, as-is, to establish a baseline for comparison with later runs:

corr_result = vm.tests.run_test(
    test_id="validmind.data_validation.HighPearsonCorrelation",
    params={"max_threshold": 0.3},
    inputs={"dataset": vm_balanced_raw_dataset},
)

❌ High Pearson Correlation

High Pearson Correlation is designed to identify highly correlated feature pairs in a dataset, suggesting feature redundancy or multicollinearity. The primary purpose of this test is to measure the linear relationship between features, which can indicate potential issues such as multicollinearity that may affect the performance and interpretability of machine learning models.

The test operates by calculating pairwise Pearson correlations for all features in the dataset. It measures the strength and direction of the linear relationship between two variables, with the correlation coefficient ranging from -1 to 1. A value close to 1 indicates a strong positive linear relationship, while a value close to -1 indicates a strong negative linear relationship. A value around 0 suggests no linear relationship. The test sorts these correlations, removing duplicates and self-correlations, and evaluates them against a pre-set threshold, which is 0.3 by default. If the absolute value of a correlation exceeds this threshold, it is flagged as a potential issue. The test also returns the top n strongest correlations, providing a clear view of the most significant relationships in the dataset.

The primary advantages of this test include its ability to quickly and effectively identify linear relationships between feature pairs, which is crucial for understanding potential multicollinearity in the dataset. This transparency allows developers and risk management teams to address these issues early in the model development process, potentially improving model performance and interpretability. The test's output is straightforward, displaying pairs of correlated variables along with their correlation coefficients and a Pass or Fail status, making it easy to interpret and act upon.

It should be noted that this test has limitations, including its focus solely on linear relationships, which means it cannot detect nonlinear dependencies. Additionally, the Pearson correlation coefficient is sensitive to outliers, which can skew results and lead to misleading conclusions. The test is also limited to identifying redundancy within feature pairs, potentially missing more complex relationships involving three or more variables. High correlation coefficients exceeding the threshold indicate a risk of multicollinearity, which can lead to model overfitting and reduced interpretability.

This test shows a table format output that lists feature pairs, their Pearson correlation coefficients, and a Pass or Fail status based on the threshold of 0.3. The table is straightforward to read, with each row representing a pair of features and their corresponding correlation coefficient. The key measurement is the correlation coefficient, which quantifies the linear relationship between the features. Notable observations include the feature pair (Age, Exited) with a coefficient of 0.3441, which fails the test due to exceeding the threshold, indicating a potential issue with multicollinearity. Other feature pairs, such as (IsActiveMember, Exited) and (Balance, NumOfProducts), have coefficients below the threshold, passing the test and suggesting no immediate concern for multicollinearity.

The test results reveal the following key insights:

  • Age and Exited Correlation: The feature pair (Age, Exited) has a correlation coefficient of 0.3441, which exceeds the threshold of 0.3, indicating a potential multicollinearity issue that could affect model performance.
  • General Pass for Other Pairs: Most feature pairs, such as (IsActiveMember, Exited) and (Balance, NumOfProducts), have correlation coefficients below the threshold, suggesting no significant linear relationship and passing the test.
  • Negative Correlations Observed: Several feature pairs, including (Balance, NumOfProducts) and (IsActiveMember, Exited), exhibit negative correlations, indicating inverse relationships, though these are not strong enough to fail the test.

Based on these results, the test highlights a specific concern with the (Age, Exited) feature pair, which may require further investigation to address potential multicollinearity. The majority of feature pairs pass the test, indicating that linear relationships are not a widespread issue in the dataset. The presence of negative correlations suggests some inverse relationships, but these do not pose a significant risk under the current threshold. Overall, the test provides valuable insights into the dataset's structure, guiding further analysis and model development efforts.

Parameters:

{
  "max_threshold": 0.3
}
            

Tables

Columns Coefficient Pass/Fail
(Age, Exited) 0.3441 Fail
(IsActiveMember, Exited) -0.1917 Pass
(Balance, NumOfProducts) -0.1741 Pass
(Balance, Exited) 0.1398 Pass
(NumOfProducts, Exited) -0.0526 Pass
(Tenure, IsActiveMember) -0.0382 Pass
(CreditScore, Exited) -0.0364 Pass
(HasCrCard, IsActiveMember) -0.0347 Pass
(NumOfProducts, IsActiveMember) 0.0299 Pass
(Balance, HasCrCard) -0.0299 Pass

The output above shows that the test did not pass according to the value we set for max_threshold — as reported and expected.

corr_result is an object of type TestResult. We can inspect the result object to see what the test has produced:

print(type(corr_result))
print("Result ID: ", corr_result.result_id)
print("Params: ", corr_result.params)
print("Passed: ", corr_result.passed)
print("Tables: ", corr_result.tables)
<class 'validmind.vm_models.result.result.TestResult'>
Result ID:  validmind.data_validation.HighPearsonCorrelation
Params:  {'max_threshold': 0.3}
Passed:  False
Tables:  [ResultTable]

Let's remove the highly correlated features and create a new VM dataset object.

We'll begin by checking out the table in the result and extracting a list of features that failed the test:

# Extract table from `corr_result.tables`
features_df = corr_result.tables[0].data
features_df
Columns Coefficient Pass/Fail
0 (Age, Exited) 0.3441 Fail
1 (IsActiveMember, Exited) -0.1917 Pass
2 (Balance, NumOfProducts) -0.1741 Pass
3 (Balance, Exited) 0.1398 Pass
4 (NumOfProducts, Exited) -0.0526 Pass
5 (Tenure, IsActiveMember) -0.0382 Pass
6 (CreditScore, Exited) -0.0364 Pass
7 (HasCrCard, IsActiveMember) -0.0347 Pass
8 (NumOfProducts, IsActiveMember) 0.0299 Pass
9 (Balance, HasCrCard) -0.0299 Pass
# Extract list of features that failed the test
high_correlation_features = features_df[features_df["Pass/Fail"] == "Fail"]["Columns"].tolist()
high_correlation_features
['(Age, Exited)']

Next, extract the feature names from the list of strings (example: (Age, Exited) > Age):

high_correlation_features = [feature.split(",")[0].strip("()") for feature in high_correlation_features]
high_correlation_features
['Age']

Now, it's time to re-initialize the dataset with the highly correlated features removed.

Note the use of a different input_id. This allows tracking the inputs used when running each individual test.

# Remove the highly correlated features from the dataset
balanced_raw_no_age_df = balanced_raw_df.drop(columns=high_correlation_features)

# Re-initialize the dataset object
vm_raw_dataset_preprocessed = vm.init_dataset(
    dataset=balanced_raw_no_age_df,
    input_id="raw_dataset_preprocessed",
    target_column="Exited",
)

Re-running the test on the reduced feature set should now produce a passing result:

corr_result = vm.tests.run_test(
    test_id="validmind.data_validation.HighPearsonCorrelation",
    params={"max_threshold": 0.3},
    inputs={"dataset": vm_raw_dataset_preprocessed},
)

✅ High Pearson Correlation

High Pearson Correlation is designed to identify highly correlated feature pairs in a dataset, suggesting feature redundancy or multicollinearity. The primary purpose of this test is to measure the linear relationship between features, which can indicate potential issues such as multicollinearity that may affect the performance and interpretability of machine learning models.

The test operates by calculating pairwise Pearson correlations for all features in the dataset. It measures the strength and direction of the linear relationship between two variables, with the correlation coefficient ranging from -1 to 1. A value of 1 indicates a perfect positive linear relationship, -1 indicates a perfect negative linear relationship, and 0 indicates no linear relationship. The test sorts these correlations, removing duplicates and self-correlations, and evaluates them against a pre-set threshold (defaulted at 0.3). If the absolute value of a correlation exceeds this threshold, it suggests a significant linear relationship. The test then returns the top n strongest correlations, providing a Pass or Fail status based on the threshold.

The primary advantages of this test include its ability to quickly and effectively identify linear relationships between feature pairs, which is crucial for detecting multicollinearity early in the model development process. This transparency allows developers to understand which features may be redundant, potentially simplifying the model and improving its interpretability. By highlighting these relationships, the test aids in preventing overfitting and ensures that the model's predictive power is not compromised by redundant features. This makes it particularly useful in scenarios where model simplicity and interpretability are prioritized.

It should be noted that the test is limited to identifying linear relationships and does not account for nonlinear dependencies, which may also impact model performance. Additionally, the Pearson correlation coefficient is sensitive to outliers, which can skew results and lead to misleading interpretations. The test only examines pairwise relationships, potentially missing more complex interactions among three or more variables. High correlation coefficients exceeding the threshold indicate a risk of multicollinearity, which can lead to overfitting and reduce the model's interpretability by obscuring the individual predictive power of features.

This test shows a table format output, listing feature pairs, their correlation coefficients, and a Pass or Fail status. Each row represents a pair of features, with the "Columns" field indicating the feature pair, the "Coefficient" field showing the Pearson correlation coefficient, and the "Pass/Fail" field indicating whether the correlation exceeds the threshold. The coefficients range from -0.1917 to 0.1398, all of which are below the threshold of 0.3, resulting in a Pass status for all pairs. Notable observations include the highest correlation between "IsActiveMember" and "Exited" at -0.1917, and the lowest between "Tenure" and "Exited" at -0.026. The results suggest that none of the feature pairs exhibit a strong linear relationship, indicating low risk of multicollinearity.

The test results reveal the following key insights:

  • Low Correlation Across Features: All feature pairs have correlation coefficients below the threshold of 0.3, indicating a low risk of multicollinearity.
  • Negative Correlations Predominate: Most feature pairs exhibit negative correlations, with "IsActiveMember" and "Exited" showing the strongest negative relationship at -0.1917.
  • Minimal Positive Correlations: The only positive correlation of note is between "Balance" and "Exited" at 0.1398, which is still well below the threshold.

Based on these results, the dataset exhibits low levels of linear correlation among the features, suggesting minimal risk of multicollinearity affecting the model's performance. The predominance of negative correlations indicates that as one feature increases, the other tends to decrease, though not strongly enough to warrant concern. The absence of strong positive correlations further supports the conclusion that feature redundancy is not a significant issue in this dataset. This analysis provides confidence in the dataset's suitability for modeling without the need for immediate feature reduction or transformation to address multicollinearity.

Parameters:

{
  "max_threshold": 0.3
}
            

Tables

Columns Coefficient Pass/Fail
(IsActiveMember, Exited) -0.1917 Pass
(Balance, NumOfProducts) -0.1741 Pass
(Balance, Exited) 0.1398 Pass
(NumOfProducts, Exited) -0.0526 Pass
(Tenure, IsActiveMember) -0.0382 Pass
(CreditScore, Exited) -0.0364 Pass
(HasCrCard, IsActiveMember) -0.0347 Pass
(NumOfProducts, IsActiveMember) 0.0299 Pass
(Balance, HasCrCard) -0.0299 Pass
(Tenure, Exited) -0.0260 Pass

You can also plot the correlation matrix to visualize the correlations between the remaining features:

corr_result = vm.tests.run_test(
    test_id="validmind.data_validation.PearsonCorrelationMatrix",
    inputs={"dataset": vm_raw_dataset_preprocessed},
)

Pearson Correlation Matrix

Pearson Correlation Matrix is designed to evaluate the extent of linear dependency between numerical variables in a dataset. The primary purpose is to identify potential redundancy by revealing high correlations, which can help in reducing dimensionality without significantly impacting model performance.

The test operates by generating a correlation matrix for all numerical variables using the Pearson correlation formula. This formula measures the linear relationship between two variables, providing a coefficient that ranges from -1 to 1. A value of 1 indicates a perfect positive correlation, -1 a perfect negative correlation, and 0 no correlation. The test visualizes these relationships in a heat map, where the color intensity represents the magnitude and direction of the correlation. High correlations, typically above 0.7 in absolute terms, are highlighted to indicate potential redundancy.

The primary advantages of this test include its ability to detect and quantify linear relationships between variables, which aids in identifying redundant variables. This can simplify models and potentially improve performance by reducing complexity. The heatmap visualization offers an intuitive overview of correlations, making it accessible even to those not comfortable with numerical matrices. This visual representation helps in quickly identifying areas of concern or interest within the dataset.

It should be noted that this test is limited to detecting linear relationships, potentially missing non-linear dependencies that could be important for dimensionality reduction. It measures only the degree of linear relationship, not the strength of one variable's effect on another. The threshold of 0.7 for high correlation is arbitrary and might exclude valid dependencies with lower coefficients. Additionally, a large number of highly correlated variables can indicate redundancy, posing a risk of overfitting.

This test shows a heat map representing the Pearson correlation coefficients between pairs of numerical variables. The axes of the heat map list the variables, and each cell shows the correlation coefficient between the variables at that intersection. The color scale ranges from -1 to 1, with darker colors indicating stronger correlations. Notable observations include the absence of any correlations exceeding the 0.7 threshold, suggesting minimal redundancy. The highest correlation observed is between "Balance" and "Exited" at 0.14, which is relatively low. The heat map provides a clear visual representation of these relationships, allowing for easy identification of any significant correlations.

The test results reveal the following key insights:

  • Low Overall Correlation: The dataset shows generally low correlations between variables, with no coefficients exceeding 0.7.
  • Balance and Exited Relationship: The highest correlation is between "Balance" and "Exited" at 0.14, indicating a weak relationship.
  • Minimal Redundancy: The lack of high correlations suggests minimal redundancy among the variables.

Based on these results, the dataset exhibits low linear dependency among its numerical variables, indicating minimal redundancy. This suggests that the variables are likely contributing unique information to the model. The weak correlation between "Balance" and "Exited" may warrant further investigation, but overall, the dataset appears well-suited for modeling without significant risk of overfitting due to redundancy. The heat map effectively highlights these relationships, providing a clear and accessible overview of the dataset's structure.

Figures

ValidMind Figure validmind.data_validation.PearsonCorrelationMatrix:73be

Documenting test results

Now that we've done some analysis on two different datasets, we can use ValidMind to document why certain adjustments were made to our raw data, with test results as supporting evidence. Every test result returned by the run_test() function has a .log() method that can be used to send the test results to the ValidMind Platform.

When logging validation test results to the platform, you'll need to manually add those results to the desired section of the validation report. To demonstrate how to add test results to your validation report, we'll log our data quality tests and insert the results via the ValidMind Platform.
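
For example, an individual result object returned by run_test() can be sent to the platform with a single call (a minimal sketch reusing the balanced class imbalance result from earlier; the comparison tests below are logged the same way):

# Sketch: send an individual test result to the ValidMind Platform
# (`result` is the object returned by the balanced ClassImbalance run above)
result.log()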

Configure and run comparison tests

Below, we'll perform comparison tests between the original raw dataset (raw_dataset) and the final preprocessed dataset (raw_dataset_preprocessed), again logging the results to the ValidMind Platform.

We can specify all the tests we'd like to run in a dictionary called test_config, and we'll pass in the following arguments for each test:

  • params: Individual test parameters.
  • input_grid: Individual test inputs to compare. In this case, we'll input our two datasets for comparison.

Note here that the input_grid expects the input_id of the dataset as the value rather than the variable name we specified:

# Individual test config with inputs specified
test_config = {
    "validmind.data_validation.ClassImbalance": {
        "input_grid": {"dataset": ["raw_dataset", "raw_dataset_preprocessed"]},
        "params": {"min_percent_threshold": 30}
    },
    "validmind.data_validation.HighPearsonCorrelation": {
        "input_grid": {"dataset": ["raw_dataset", "raw_dataset_preprocessed"]},
        "params": {"max_threshold": 0.3}
    },
}

Then batch run and log our tests in test_config:

for t in test_config:
    print(t)
    try:
        # Check if test has input_grid
        if 'input_grid' in test_config[t]:
            # For tests with input_grid, pass the input_grid configuration
            if 'params' in test_config[t]:
                vm.tests.run_test(t, input_grid=test_config[t]['input_grid'], params=test_config[t]['params']).log()
            else:
                vm.tests.run_test(t, input_grid=test_config[t]['input_grid']).log()
        else:
            # Original logic for regular inputs
            if 'params' in test_config[t]:
                vm.tests.run_test(t, inputs=test_config[t]['inputs'], params=test_config[t]['params']).log()
            else:
                vm.tests.run_test(t, inputs=test_config[t]['inputs']).log()
    except Exception as e:
        print(f"Error running test {t}: {str(e)}")
validmind.data_validation.ClassImbalance

❌ Class Imbalance

Class Imbalance is designed to evaluate the distribution of target classes in a dataset used by a machine learning model. Its primary purpose is to ensure that the classes are not overly skewed, which could lead to bias in the model's predictions. A balanced training dataset is crucial to avoid creating a model that is biased with high accuracy for the majority class and low accuracy for the minority class.

The test operates by calculating the frequency of each class in the target column of the dataset, expressed as a percentage. It checks whether each class appears in at least a set minimum percentage of the total records, with the default threshold set at 10%. The test uses the target column data to compute the percentage of each class, comparing these values against the threshold. A class is marked as high risk if it represents less than the threshold, indicating potential imbalance. The test provides a pass/fail outcome for each class based on this criterion, with a typical range of 0% to 100%. A class percentage below the threshold is considered poor, suggesting imbalance, while percentages above are deemed acceptable.

The primary advantages of this test include its ability to spot under-represented classes that could affect the efficiency of a machine learning model. The calculation is straightforward and swift, making it highly informative as it not only spots imbalance but also quantifies the degree of imbalance. The adjustable threshold allows flexibility and adaptation to different use-cases or domain-specific needs. Additionally, the test creates a visually insightful plot showing the classes and their corresponding proportions, enhancing interpretability and comprehension of the data.

It should be noted that the test might struggle to perform well or provide vital insights for datasets with a high number of classes, where imbalance could be inevitable due to inherent class distribution. Sensitivity to the threshold value might result in faulty detection of imbalance if the threshold is set excessively high. Regardless of the percentage threshold, it doesn't account for varying costs or impacts of misclassifying different classes, which might fluctuate based on specific applications or domains. While it can identify imbalances in class distribution, it doesn't provide direct methods to address or correct these imbalances. The test is only applicable for classification operations and unsuitable for regression or clustering tasks.

This test shows the class distribution in both the raw and preprocessed datasets through tables and plots. The tables present the percentage of rows for each class and indicate whether each class passes or fails the threshold test. The plots visually represent the class proportions, with the y-axis showing the percentage and the x-axis representing the class labels. In the raw dataset, Class 0 constitutes 79.80% of the data, passing the threshold, while Class 1, at 20.20%, fails. The preprocessed dataset shows an even distribution, with both classes at 50%, passing the threshold. The visualizations highlight the imbalance in the raw dataset and the balanced distribution in the preprocessed dataset.

The test results reveal the following key insights:

  • Raw Dataset Imbalance: The raw dataset shows a significant imbalance, with Class 0 making up 79.80% and Class 1 only 20.20%, failing the threshold.
  • Preprocessed Dataset Balance: The preprocessed dataset achieves a perfect balance, with both classes at 50%, meeting the threshold requirements.

Based on these results, the raw dataset exhibits a clear class imbalance, with Class 1 under-represented, which could lead to biased model predictions. The preprocessing step successfully addresses this issue, resulting in a balanced dataset where both classes are equally represented. This balance is crucial for ensuring that the model can learn effectively from all classes, reducing the risk of bias and improving predictive performance across different class labels. The insights highlight the importance of preprocessing in achieving a balanced dataset, which is essential for robust model training and evaluation.

Parameters:

{
  "min_percent_threshold": 30
}
            

Tables

dataset Exited Percentage of Rows (%) Pass/Fail
raw_dataset 0 79.80% Pass
raw_dataset 1 20.20% Fail
raw_dataset_preprocessed 0 50.00% Pass
raw_dataset_preprocessed 1 50.00% Pass

Figures

ValidMind Figure validmind.data_validation.ClassImbalance:4551
ValidMind Figure validmind.data_validation.ClassImbalance:4177
2025-12-31 22:33:03,381 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.ClassImbalance does not exist in model's document
validmind.data_validation.HighPearsonCorrelation

❌ High Pearson Correlation

High Pearson Correlation is designed to identify highly correlated feature pairs in a dataset, suggesting feature redundancy or multicollinearity. The primary purpose of this test is to measure the linear relationship between features, which can indicate potential issues such as feature redundancy or multicollinearity that may affect the performance and interpretability of machine learning models.

The test operates by calculating pairwise Pearson correlations for all features in the dataset. It then sorts these correlations, removing duplicates and self-correlations. The Pearson correlation coefficient measures the strength and direction of the linear relationship between two variables, ranging from -1 to 1. A value close to 1 indicates a strong positive linear relationship, while a value close to -1 indicates a strong negative linear relationship. A value around 0 suggests no linear relationship. The test assigns a Pass or Fail status based on whether the absolute value of the correlation coefficient exceeds a pre-set threshold, which is 0.3 by default. The test also returns the top n strongest correlations, providing insights into the most significant relationships within the dataset.

The primary advantages of this test include its ability to quickly and simply identify relationships between feature pairs, which is crucial for early detection of multicollinearity issues that could disrupt model training. The test generates a transparent output, displaying pairs of correlated variables along with their Pearson correlation coefficients and Pass or Fail status. This transparency aids developers and risk management teams in understanding potential feature redundancies and their implications for model performance and interpretability. By identifying these relationships early, the test helps in refining feature selection and improving model robustness.

It should be noted that the test is limited to detecting linear relationships and may not capture nonlinear dependencies between features. Additionally, the Pearson correlation coefficient is sensitive to outliers, which can significantly affect the results. The test focuses on pairwise feature relationships, potentially missing more complex interactions involving three or more variables. High correlation coefficients indicate a risk of multicollinearity, which can lead to model overfitting and reduced interpretability, as it becomes challenging to discern the individual predictive power of correlated features.

This test shows the results in a tabular format, listing feature pairs, their correlation coefficients, and Pass or Fail status. The table includes data from both raw and preprocessed datasets, allowing for a comprehensive view of feature relationships. Each row represents a pair of features, with columns indicating the dataset, feature pair, correlation coefficient, and whether the correlation exceeds the threshold. The coefficients range from -0.3045 to 0.281, with the threshold set at 0.3. Notably, the pair (Balance, NumOfProducts) in the raw dataset fails the test with a coefficient of -0.3045, indicating a significant negative correlation. The table provides a clear overview of which feature pairs exhibit strong linear relationships, guiding further analysis and potential feature selection adjustments.

The test results reveal the following key insights:

  • Significant Negative Correlation in Raw Dataset: The feature pair (Balance, NumOfProducts) in the raw dataset shows a significant negative correlation with a coefficient of -0.3045, failing the test threshold.
  • Moderate Positive Correlation with Age and Exited: The pair (Age, Exited) in the raw dataset has a moderate positive correlation of 0.281, which is below the threshold but noteworthy for its potential impact on model behavior.
  • Consistent Pass Status in Preprocessed Dataset: All feature pairs in the preprocessed dataset pass the test, indicating that preprocessing may have mitigated some of the linear relationships present in the raw data.

Based on these results, the test highlights the presence of a significant negative correlation between Balance and NumOfProducts in the raw dataset, suggesting potential multicollinearity issues that could affect model performance. The moderate positive correlation between Age and Exited, although below the threshold, may still influence model predictions and should be considered in feature selection. The consistent pass status in the preprocessed dataset suggests that preprocessing steps have effectively reduced linear dependencies, enhancing the dataset's suitability for modeling. These insights underscore the importance of addressing feature correlations to ensure robust and interpretable machine learning models.

Parameters:

{
  "max_threshold": 0.3
}
            

Tables

dataset Columns Coefficient Pass/Fail
raw_dataset (Balance, NumOfProducts) -0.3045 Fail
raw_dataset (Age, Exited) 0.2810 Pass
raw_dataset (IsActiveMember, Exited) -0.1515 Pass
raw_dataset (Balance, Exited) 0.1174 Pass
raw_dataset (Age, IsActiveMember) 0.0873 Pass
raw_dataset (NumOfProducts, Exited) -0.0523 Pass
raw_dataset (Age, NumOfProducts) -0.0306 Pass
raw_dataset (CreditScore, IsActiveMember) 0.0306 Pass
raw_dataset (Tenure, IsActiveMember) -0.0293 Pass
raw_dataset (Age, Balance) 0.0290 Pass
raw_dataset_preprocessed (IsActiveMember, Exited) -0.1917 Pass
raw_dataset_preprocessed (Balance, NumOfProducts) -0.1741 Pass
raw_dataset_preprocessed (Balance, Exited) 0.1398 Pass
raw_dataset_preprocessed (NumOfProducts, Exited) -0.0526 Pass
raw_dataset_preprocessed (Tenure, IsActiveMember) -0.0382 Pass
raw_dataset_preprocessed (CreditScore, Exited) -0.0364 Pass
raw_dataset_preprocessed (HasCrCard, IsActiveMember) -0.0347 Pass
raw_dataset_preprocessed (NumOfProducts, IsActiveMember) 0.0299 Pass
raw_dataset_preprocessed (Balance, HasCrCard) -0.0299 Pass
raw_dataset_preprocessed (Tenure, Exited) -0.0260 Pass
2025-12-31 22:33:15,770 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.HighPearsonCorrelation does not exist in model's document
Note the output returned indicating that a test-driven block doesn't currently exist in your model's documentation for some test IDs.

That's expected: when we run validation tests, the logged results need to be manually added to your validation report as part of your compliance assessment process within the ValidMind Platform.

Log tests with unique identifiers

Next, we'll use the previously initialized vm_balanced_raw_dataset (which still includes the highly correlated Age column) as input to run an individual test, then log the result to the ValidMind Platform.

When running individual tests, you can use a custom result_id to tag the individual result with a unique identifier:

  • This result_id can be appended to test_id with a : separator.
  • The balanced_raw_dataset result identifier will correspond to the balanced_raw_dataset input, the dataset that still has the Age column.
result = vm.tests.run_test(
    test_id="validmind.data_validation.HighPearsonCorrelation:balanced_raw_dataset",
    params={"max_threshold": 0.3},
    inputs={"dataset": vm_balanced_raw_dataset},
)
result.log()

❌ High Pearson Correlation Balanced Raw Dataset

High Pearson Correlation: Balanced Raw Dataset is designed to identify highly correlated feature pairs in a dataset, which may suggest feature redundancy or multicollinearity. The primary purpose of this test is to measure the linear relationship between features, allowing developers and risk management teams to address potential impacts on a machine learning model's performance and interpretability.

The test operates by calculating pairwise Pearson correlations for all features in the dataset. It then sorts these correlations, removing duplicates and self-correlations. The Pearson correlation coefficient, a statistical measure, quantifies the degree of linear relationship between two variables, ranging from -1 to 1. A value close to 1 or -1 indicates a strong linear relationship, while a value near 0 suggests no linear relationship. The test assigns a Pass or Fail status based on whether the absolute value of the correlation coefficient exceeds a pre-set threshold, which is 0.3 by default. The test also returns the top n strongest correlations, with n set to 10 by default, but this can be adjusted using the top_n_correlations parameter.

The primary advantages of this test include its ability to quickly and simply identify relationships between feature pairs, providing a transparent output that displays pairs of correlated variables, the Pearson correlation coefficient, and a Pass or Fail status for each. This transparency aids in the early identification of potential multicollinearity issues that could disrupt model training. By highlighting these relationships, the test helps ensure that the model remains interpretable and that each feature's predictive power is authentic and not overshadowed by redundancy.

It should be noted that the test is limited to identifying linear relationships and does not account for nonlinear dependencies. It is sensitive to outliers, which can significantly affect the correlation coefficient, potentially leading to misleading results. Additionally, the test only identifies redundancy within feature pairs and may not detect more complex relationships involving three or more variables. High correlation coefficients exceeding the threshold indicate a high risk of multicollinearity, which can lead to model overfitting and reduced interpretability.

This test shows the results in a tabular format, listing feature pairs, their Pearson correlation coefficients, and a Pass or Fail status based on the threshold of 0.3. The table provides a clear view of which feature pairs exhibit strong linear relationships. The columns represent the feature pairs, the calculated correlation coefficient, and the Pass/Fail status. The coefficients range from -0.3441 to 0.3441, indicating varying degrees of linear relationships. Notably, the pair (Age, Exited) fails the test with a coefficient of 0.3441, suggesting a significant linear relationship that exceeds the threshold. Other pairs, such as (IsActiveMember, Exited) and (Balance, NumOfProducts), show weaker correlations and pass the test.

The test results reveal the following key insights:

  • Significant Correlation Between Age and Exited: The feature pair (Age, Exited) exhibits a Pearson correlation coefficient of 0.3441, which exceeds the threshold, indicating a significant linear relationship that may suggest multicollinearity or feature redundancy.
  • Weak Correlations Among Other Feature Pairs: Other feature pairs, such as (IsActiveMember, Exited) and (Balance, NumOfProducts), have correlation coefficients below the threshold, indicating weaker linear relationships that do not pose a risk of multicollinearity.
  • Overall Low Correlation Across Dataset: The majority of feature pairs show low correlation coefficients, suggesting that the dataset generally lacks strong linear relationships, which is favorable for model interpretability and performance.

Based on these results, the dataset exhibits a generally low level of linear correlation among most feature pairs, which is beneficial for maintaining model interpretability and reducing the risk of multicollinearity. However, the significant correlation between Age and Exited suggests a potential area of concern that may require further investigation to ensure that the model's predictive power is not compromised by feature redundancy. The overall pattern of low correlations supports the dataset's suitability for modeling, with minimal risk of overfitting due to multicollinearity.

Parameters:

{
  "max_threshold": 0.3
}

Tables

Columns Coefficient Pass/Fail
(Age, Exited) 0.3441 Fail
(IsActiveMember, Exited) -0.1917 Pass
(Balance, NumOfProducts) -0.1741 Pass
(Balance, Exited) 0.1398 Pass
(NumOfProducts, Exited) -0.0526 Pass
(Tenure, IsActiveMember) -0.0382 Pass
(CreditScore, Exited) -0.0364 Pass
(HasCrCard, IsActiveMember) -0.0347 Pass
(NumOfProducts, IsActiveMember) 0.0299 Pass
(Balance, HasCrCard) -0.0299 Pass
2025-12-31 22:33:26,245 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.HighPearsonCorrelation:balanced_raw_dataset does not exist in model's document
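For intuition, the pairwise check this test performs can be reproduced directly with pandas and NumPy. The sketch below is illustrative only and is not how the ValidMind test is implemented internally; it uses the balanced_raw_no_age_df DataFrame simply because it is already in scope, but any DataFrame works.

import numpy as np

# Illustrative sketch: pairwise Pearson correlations with a 0.3 threshold
corr = balanced_raw_no_age_df.corr(numeric_only=True)

# Keep the upper triangle only: each feature pair once, no self-correlations
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().rename("Coefficient").reset_index()
pairs.columns = ["Feature A", "Feature B", "Coefficient"]
pairs["Pass/Fail"] = np.where(pairs["Coefficient"].abs() <= 0.3, "Pass", "Fail")

# Ten strongest absolute correlations, mirroring the default top_n_correlations=10
top_10 = pairs.reindex(pairs["Coefficient"].abs().sort_values(ascending=False).index).head(10)
print(top_10)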

Add test results to reporting

With some test results logged, let's head to the model we connected to at the beginning of this notebook and learn how to insert a test result into our validation report (Need more help?).

While the example below focuses on a specific test result, you can follow the same general procedure for your other results:

  1. From the Inventory in the ValidMind Platform, go to the model you connected to earlier.

  2. In the left sidebar that appears for your model, click Validation Report under Documents.

  3. Locate the Data Preparation section and click on 2.2.1. Data Quality to expand that section.

  4. Under the Class Imbalance Assessment section, locate Validator Evidence, then click Link Evidence to Report:

    Screenshot showing the validation report with the link validator evidence to report option highlighted

  5. Select the Class Imbalance test results we logged: ValidMind Data Validation Class Imbalance

    Screenshot showing the ClassImbalance test selected

  6. Click Update Linked Evidence to add the test results to the validation report.

    Confirm that the results for the Class Imbalance test have been correctly inserted into section 2.2.1. Data Quality of the report:

    Screenshot showing the ClassImbalance test inserted into the validation report

  7. Note that these test results are flagged as Requires Attention, as they include comparative results from our initial raw dataset.

    Click See evidence details to review the LLM-generated description that summarizes the test results, confirming that our final preprocessed dataset actually passes our test:

    Screenshot showing the ClassImbalance test generated description in the text editor

In this text editor, you can make qualitative edits to the draft that ValidMind generated to finalize the test results.

Learn more: Work with content blocks

Split the preprocessed dataset

With our raw dataset rebalanced and highly correlated features removed, let's now split our dataset into training and test sets in preparation for model evaluation testing.

To start, let's grab the first few rows from the balanced_raw_no_age_df dataset we initialized earlier:

balanced_raw_no_age_df.head()
CreditScore Geography Gender Tenure Balance NumOfProducts HasCrCard IsActiveMember EstimatedSalary Exited
6609 675 France Male 1 0.00 3 1 0 85901.09 0
3797 487 Spain Male 6 61691.45 1 1 1 53087.98 0
6470 593 France Male 6 171740.69 1 0 0 20893.61 0
1932 774 France Female 2 56580.93 1 1 0 113266.28 1
1951 660 Germany Male 1 129901.21 1 1 0 26025.60 1

Before training the model, we need to encode the categorical features in the dataset:

  • One-hot encode the categorical features. The code below uses pandas' get_dummies with drop_first=True; scikit-learn's OneHotEncoder achieves the same encoding (see the sketch after the output below).
  • The categorical features in the dataset are Geography and Gender.
balanced_raw_no_age_df = pd.get_dummies(
    balanced_raw_no_age_df, columns=["Geography", "Gender"], drop_first=True
)
balanced_raw_no_age_df.head()
CreditScore Tenure Balance NumOfProducts HasCrCard IsActiveMember EstimatedSalary Exited Geography_Germany Geography_Spain Gender_Male
6609 675 1 0.00 3 1 0 85901.09 0 False False True
3797 487 6 61691.45 1 1 1 53087.98 0 False True True
6470 593 6 171740.69 1 0 0 20893.61 0 False False True
1932 774 2 56580.93 1 1 0 113266.28 1 False False False
1951 660 1 129901.21 1 1 0 26025.60 1 True False True
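As noted above, scikit-learn's OneHotEncoder produces the same encoding as get_dummies. The minimal sketch below is illustrative only: it assumes scikit-learn 1.2+ (for the sparse_output argument) and a hypothetical raw_df that still contains the unencoded Geography and Gender columns, since the cell above has already replaced balanced_raw_no_age_df with its encoded version.

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

categorical = ["Geography", "Gender"]

# Fit a reusable encoder so the same mapping can be applied to new data later
ohe = OneHotEncoder(drop="first", sparse_output=False)
encoded = ohe.fit_transform(raw_df[categorical])  # raw_df is hypothetical here

encoded_df = pd.DataFrame(
    encoded, columns=ohe.get_feature_names_out(categorical), index=raw_df.index
)
encoded_df = raw_df.drop(columns=categorical).join(encoded_df)

The practical difference is that a fitted OneHotEncoder can be persisted and applied consistently to new data at inference time, whereas get_dummies re-derives the columns from whatever data it is given.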

Splitting our dataset into training and testing is essential for proper validation testing, as this helps assess how well the model generalizes to unseen data:

  • We start by dividing our balanced_raw_no_age_df dataset into training and test subsets using train_test_split, with 80% of the data allocated to training (train_df) and 20% to testing (test_df).
  • From each subset, we separate the features (all columns except "Exited") into X_train and X_test, and the target column ("Exited") into y_train and y_test.
from sklearn.model_selection import train_test_split

train_df, test_df = train_test_split(balanced_raw_no_age_df, test_size=0.20)

X_train = train_df.drop("Exited", axis=1)
y_train = train_df["Exited"]
X_test = test_df.drop("Exited", axis=1)
y_test = test_df["Exited"]
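Note that train_test_split shuffles the rows randomly, so the exact split will differ between runs. If you want a reproducible split that also preserves the churn rate in both subsets, one optional variant (not used here; the seed value of 42 is arbitrary) looks like this:

# Optional variant: fix the random seed and stratify on the target column
train_df, test_df = train_test_split(
    balanced_raw_no_age_df,
    test_size=0.20,
    random_state=42,  # illustrative seed for reproducibility
    stratify=balanced_raw_no_age_df["Exited"],
)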

Initialize the split datasets

Next, let's initialize the training and testing datasets so they are available for use:

vm_train_ds = vm.init_dataset(
    input_id="train_dataset_final",
    dataset=train_df,
    target_column="Exited",
)

vm_test_ds = vm.init_dataset(
    input_id="test_dataset_final",
    dataset=test_df,
    target_column="Exited",
)
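As an optional sanity check, you can reuse the same run_test pattern from earlier against these newly initialized inputs, tagging the result with its own result_id. For example, on the final training dataset:

# Optional: re-run the correlation check on the final training dataset and log it
result = vm.tests.run_test(
    test_id="validmind.data_validation.HighPearsonCorrelation:train_dataset_final",
    params={"max_threshold": 0.3},
    inputs={"dataset": vm_train_ds},
)
result.log()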

In summary

In this second notebook, you learned how to:

  • Independently run data quality tests against the champion model's dataset and log the results to the ValidMind Platform
  • Tag individual test results with unique result identifiers
  • Link logged test results as evidence into your validation report
  • Encode and split the preprocessed dataset, and initialize the resulting training and test datasets for use in later notebooks

Next steps

Develop potential challenger models

Now that you're familiar with the basics of using the ValidMind Library, let's use it to develop a challenger model: 3 — Developing a potential challenger model