ValidMind for model development 2 — Start the model development process

Learn how to use ValidMind for your end-to-end model documentation process with our series of four introductory notebooks. In this second notebook, you'll run tests and investigate results, then add the results or evidence to your documentation.

You'll become familiar with the individual tests available in ValidMind, as well as how to run them and change parameters as necessary. Using ValidMind's repository of individual tests as building blocks helps you ensure that a model is being built appropriately.

For a full list of out-of-the-box tests, refer to our Test descriptions or try the interactive Test sandbox.

Learn by doing

Our course tailor-made for developers new to ValidMind combines this series of notebooks with more a more in-depth introduction to the ValidMind Platform — Developer Fundamentals

Prerequisites

In order to log test results or evidence to your model documentation with this notebook, you'll need to first have:

Registered a model within the ValidMind Platform with a predefined documentation template
Installed the ValidMind Library in your local environment, allowing you to access all its features

Need help with the above steps?

Refer to the first notebook in this series: 1 — Set up the ValidMind Library

Setting up

Initialize the ValidMind Library

First, let's connect up the ValidMind Library to our model we previously registered in the ValidMind Platform:

In a browser, log in to ValidMind.
In the left sidebar, navigate to Inventory and select the model you registered for this "ValidMind for model development" series of notebooks.
Go to Getting Started and click Copy snippet to clipboard.

Next, load your model identifier credentials from an .env file or replace the placeholder with your own code snippet:

# Make sure the ValidMind Library is installed

%pip install -q validmind

# Load your model identifier credentials from an `.env` file

%load_ext dotenv
%dotenv .env

# Or replace with your code snippet

import validmind as vm

vm.init(
    # api_host="...",
    # api_key="...",
    # api_secret="...",
    # model="...",
)

Note: you may need to restart the kernel to use updated packages.

2025-12-31 22:13:14,675 - INFO(validmind.api_client): 🎉 Connected to ValidMind!
📊 Model: [ValidMind Academy] Model development (ID: cmalgf3qi02ce199qm3rdkl46)
📁 Document Type: model_documentation

Import sample dataset

Then, let's import the public Bank Customer Churn Prediction dataset from Kaggle.

In our below example, note that:

The target column, Exited has a value of 1 when a customer has churned and 0 otherwise.
The ValidMind Library provides a wrapper to automatically load the dataset as a Pandas DataFrame object. A Pandas Dataframe is a two-dimensional tabular data structure that makes use of rows and columns.

from validmind.datasets.classification import customer_churn as demo_dataset

print(
    f"Loaded demo dataset with: \n\n\t• Target column: '{demo_dataset.target_column}' \n\t• Class labels: {demo_dataset.class_labels}"
)

raw_df = demo_dataset.load_data()
raw_df.head()

Loaded demo dataset with: 

    • Target column: 'Exited' 
    • Class labels: {'0': 'Did not exit', '1': 'Exited'}

	CreditScore	Geography	Gender	Age	Tenure	Balance	NumOfProducts	HasCrCard	IsActiveMember	EstimatedSalary	Exited
0	619	France	Female	42	2	0.00	1	1	1	101348.88	1
1	608	Spain	Female	41	1	83807.86	1	0	1	112542.58	0
2	502	France	Female	42	8	159660.80	3	1	0	113931.57	1
3	699	France	Female	39	1	0.00	2	0	0	93826.63	0
4	850	Spain	Female	43	2	125510.82	1	1	1	79084.10	0

Identify qualitative tests

Next, let's say we want to do some data quality assessments by running a few individual tests.

Use the vm.tests.list_tests() function introduced by the first notebook in this series in combination with vm.tests.list_tags() and vm.tests.list_tasks() to find which prebuilt tests are relevant for data quality assessment:

tasks represent the kind of modeling task associated with a test. Here we'll focus on classification tasks.
tags are free-form descriptions providing more details about the test, for example, what category the test falls into. Here we'll focus on the data_quality tag.

# Get the list of available task types
sorted(vm.tests.list_tasks())

['classification',
 'clustering',
 'data_validation',
 'feature_extraction',
 'monitoring',
 'nlp',
 'regression',
 'residual_analysis',
 'text_classification',
 'text_generation',
 'text_qa',
 'text_summarization',
 'time_series_forecasting',
 'visualization']

# Get the list of available tags
sorted(vm.tests.list_tags())

['AUC',
 'analysis',
 'anomaly_detection',
 'bias_and_fairness',
 'binary_classification',
 'calibration',
 'categorical_data',
 'classification',
 'classification_metrics',
 'clustering',
 'correlation',
 'credit_risk',
 'data_analysis',
 'data_distribution',
 'data_quality',
 'data_validation',
 'descriptive_statistics',
 'dimensionality_reduction',
 'distribution',
 'embeddings',
 'feature_importance',
 'feature_selection',
 'few_shot',
 'forecasting',
 'frequency_analysis',
 'kmeans',
 'linear_regression',
 'llm',
 'logistic_regression',
 'metadata',
 'model_comparison',
 'model_diagnosis',
 'model_explainability',
 'model_interpretation',
 'model_performance',
 'model_predictions',
 'model_selection',
 'model_training',
 'model_validation',
 'multiclass_classification',
 'nlp',
 'normality',
 'numerical_data',
 'outliers',
 'qualitative',
 'rag_performance',
 'ragas',
 'regression',
 'retrieval_performance',
 'scorecard',
 'seasonality',
 'senstivity_analysis',
 'sklearn',
 'stationarity',
 'statistical_test',
 'statistics',
 'statsmodels',
 'tabular_data',
 'text_data',
 'threshold_optimization',
 'time_series_data',
 'unit_root_test',
 'visualization',
 'zero_shot']

You can pass tags and tasks as parameters to the vm.tests.list_tests() function to filter the tests based on the tags and task types.

For example, to find tests related to tabular data quality for classification models, you can call list_tests() like this:

vm.tests.list_tests(task="classification", tags=["tabular_data", "data_quality"])

ID	Name	Description	Has Figure	Has Table	Required Inputs	Params	Tags	Tasks
validmind.data_validation.ClassImbalance	Class Imbalance	Evaluates and quantifies class distribution imbalance in a dataset used by a machine learning model....	True	True	['dataset']	{'min_percent_threshold': {'type': 'int', 'default': 10}}	['tabular_data', 'binary_classification', 'multiclass_classification', 'data_quality']	['classification']
validmind.data_validation.DescriptiveStatistics	Descriptive Statistics	Performs a detailed descriptive statistical analysis of both numerical and categorical data within a model's...	False	True	['dataset']	{}	['tabular_data', 'time_series_data', 'data_quality']	['classification', 'regression']
validmind.data_validation.Duplicates	Duplicates	Tests dataset for duplicate entries, ensuring model reliability via data quality verification....	False	True	['dataset']	{'min_threshold': {'type': '_empty', 'default': 1}}	['tabular_data', 'data_quality', 'text_data']	['classification', 'regression']
validmind.data_validation.HighCardinality	High Cardinality	Assesses the number of unique values in categorical columns to detect high cardinality and potential overfitting....	False	True	['dataset']	{'num_threshold': {'type': 'int', 'default': 100}, 'percent_threshold': {'type': 'float', 'default': 0.1}, 'threshold_type': {'type': 'str', 'default': 'percent'}}	['tabular_data', 'data_quality', 'categorical_data']	['classification', 'regression']
validmind.data_validation.HighPearsonCorrelation	High Pearson Correlation	Identifies highly correlated feature pairs in a dataset suggesting feature redundancy or multicollinearity....	False	True	['dataset']	{'max_threshold': {'type': 'float', 'default': 0.3}, 'top_n_correlations': {'type': 'int', 'default': 10}, 'feature_columns': {'type': 'list', 'default': None}}	['tabular_data', 'data_quality', 'correlation']	['classification', 'regression']
validmind.data_validation.MissingValues	Missing Values	Evaluates dataset quality by ensuring missing value ratio across all features does not exceed a set threshold....	False	True	['dataset']	{'min_threshold': {'type': 'int', 'default': 1}}	['tabular_data', 'data_quality']	['classification', 'regression']
validmind.data_validation.MissingValuesBarPlot	Missing Values Bar Plot	Assesses the percentage and distribution of missing values in the dataset via a bar plot, with emphasis on...	True	False	['dataset']	{'threshold': {'type': 'int', 'default': 80}, 'fig_height': {'type': 'int', 'default': 600}}	['tabular_data', 'data_quality', 'visualization']	['classification', 'regression']
validmind.data_validation.Skewness	Skewness	Evaluates the skewness of numerical data in a dataset to check against a defined threshold, aiming to ensure data...	False	True	['dataset']	{'max_threshold': {'type': '_empty', 'default': 1}}	['data_quality', 'tabular_data']	['classification', 'regression']
validmind.plots.BoxPlot	Box Plot	Generates customizable box plots for numerical features in a dataset with optional grouping using Plotly....	True	False	['dataset']	{'columns': {'type': 'Optional', 'default': None}, 'group_by': {'type': 'Optional', 'default': None}, 'width': {'type': 'int', 'default': 1800}, 'height': {'type': 'int', 'default': 1200}, 'colors': {'type': 'Optional', 'default': None}, 'show_outliers': {'type': 'bool', 'default': True}, 'title_prefix': {'type': 'str', 'default': 'Box Plot of'}}	['tabular_data', 'visualization', 'data_quality']	['classification', 'regression', 'clustering']
validmind.plots.HistogramPlot	Histogram Plot	Generates customizable histogram plots for numerical features in a dataset using Plotly....	True	False	['dataset']	{'columns': {'type': 'Optional', 'default': None}, 'bins': {'type': 'Union', 'default': 30}, 'color': {'type': 'str', 'default': 'steelblue'}, 'opacity': {'type': 'float', 'default': 0.7}, 'show_kde': {'type': 'bool', 'default': True}, 'normalize': {'type': 'bool', 'default': False}, 'log_scale': {'type': 'bool', 'default': False}, 'title_prefix': {'type': 'str', 'default': 'Histogram of'}, 'width': {'type': 'int', 'default': 1200}, 'height': {'type': 'int', 'default': 800}, 'n_cols': {'type': 'int', 'default': 2}, 'vertical_spacing': {'type': 'float', 'default': 0.15}, 'horizontal_spacing': {'type': 'float', 'default': 0.1}}	['tabular_data', 'visualization', 'data_quality']	['classification', 'regression', 'clustering']
validmind.stats.DescriptiveStats	Descriptive Stats	Provides comprehensive descriptive statistics for numerical features in a dataset....	False	True	['dataset']	{'columns': {'type': 'Optional', 'default': None}, 'include_advanced': {'type': 'bool', 'default': True}, 'confidence_level': {'type': 'float', 'default': 0.95}}	['tabular_data', 'statistics', 'data_quality']	['classification', 'regression', 'clustering']

Want to learn more about navigating ValidMind tests?

Refer to our notebook outlining the utilities available for viewing and understanding available ValidMind tests: Explore tests

Initialize the ValidMind datasets

With the individual tests we want to run identified, the next step is to connect your data with a ValidMind Dataset object. This step is always necessary every time you want to connect a dataset to documentation and produce test results through ValidMind, but you only need to do it once per dataset.

Initialize a ValidMind dataset object using the init_dataset function from the ValidMind (vm) module. For this example, we'll pass in the following arguments:

dataset — The raw dataset that you want to provide as input to tests.
input_id — A unique identifier that allows tracking what inputs are used when running each individual test.
target_column — A required argument if tests require access to true values. This is the name of the target column in the dataset.

# vm_raw_dataset is now a VMDataset object that you can pass to any ValidMind test
vm_raw_dataset = vm.init_dataset(
    dataset=raw_df,
    input_id="raw_dataset",
    target_column="Exited",
)

Running tests

Now that we know how to initialize a ValidMind dataset object, we're ready to run some tests!

You run individual tests by calling the run_test function provided by the validmind.tests module. For the examples below, we'll pass in the following arguments:

test_id — The ID of the test to run, as seen in the ID column when you run list_tests.
params — A dictionary of parameters for the test. These will override any default_params set in the test definition.

Run tabular data tests

The inputs expected by a test can also be found in the test definition — let's take validmind.data_validation.DescriptiveStatistics as an example.

Note that the output of the describe_test() function below shows that this test expects a dataset as input:

vm.tests.describe_test("validmind.data_validation.DescriptiveStatistics")

▶ Test: Descriptive Statistics ('validmind.data_validation.DescriptiveStatistics')

Descriptive Statistics

Performs a detailed descriptive statistical analysis of both numerical and categorical data within a model's dataset.

Purpose

The purpose of the Descriptive Statistics metric is to provide a comprehensive summary of both numerical and categorical data within a dataset. This involves statistics such as count, mean, standard deviation, minimum and maximum values for numerical data. For categorical data, it calculates the count, number of unique values, most common value and its frequency, and the proportion of the most frequent value relative to the total. The goal is to visualize the overall distribution of the variables in the dataset, aiding in understanding the model's behavior and predicting its performance.

Test Mechanism

The testing mechanism utilizes two in-built functions of pandas dataframes: describe() for numerical fields and value_counts() for categorical fields. The describe() function pulls out several summary statistics, while value_counts() accounts for unique values. The resulting data is formatted into two distinct tables, one for numerical and another for categorical variable summaries. These tables provide a clear summary of the main characteristics of the variables, which can be instrumental in assessing the model's performance.

Signs of High Risk

Skewed data or significant outliers can represent high risk. For numerical data, this may be reflected via a significant difference between the mean and median (50% percentile).
For categorical data, a lack of diversity (low count of unique values), or overdominance of a single category (high frequency of the top value) can indicate high risk.

Strengths

Provides a comprehensive summary of the dataset, shedding light on the distribution and characteristics of the variables under consideration.
It is a versatile and robust method, applicable to both numerical and categorical data.
Helps highlight crucial anomalies such as outliers, extreme skewness, or lack of diversity, which are vital in understanding model behavior during testing and validation.

Limitations

While this metric offers a high-level overview of the data, it may fail to detect subtle correlations or complex patterns.
Does not offer any insights on the relationship between variables.
Alone, descriptive statistics cannot be used to infer properties about future unseen data.
Should be used in conjunction with other statistical tests to provide a comprehensive understanding of the model's data.

Required Inputs: dataset

How to Run:

Code:

        
import validmind as vm

# inputs dictionary maps your inputs to the expected input names
# keys are the expected input names and values are the actual inputs
# values may be string input_ids or the actual VMDataset or VMModel objects
inputs = {
    "dataset": "my_vm_dataset"
}
params = {}

# to run and view the result of this test, run the following code:
result = vm.tests.run_test(
  "validmind.data_validation.DescriptiveStatistics", inputs=inputs, params=params
)

# To see the result of the test, ensure that you have called `vm.init()` and then run:
result.log()

Now, let's run a few tests to assess the quality of the dataset:

result = vm.tests.run_test(
    test_id="validmind.data_validation.DescriptiveStatistics",
    inputs={"dataset": vm_raw_dataset},
)

Descriptive Statistics

Descriptive Statistics is designed to provide a comprehensive summary of both numerical and categorical data within a dataset. The primary purpose of this test is to visualize the overall distribution of the variables, which aids in understanding the model's behavior and predicting its performance. By summarizing key statistics such as count, mean, standard deviation, and frequency, the test offers insights into the data's characteristics and potential anomalies.

The test operates by utilizing two in-built functions of pandas dataframes: describe() for numerical fields and value_counts() for categorical fields. The describe() function extracts summary statistics including count, mean, standard deviation, minimum, and maximum values, as well as percentiles for numerical data. These metrics help in understanding the central tendency, dispersion, and overall range of the data. For categorical data, value_counts() calculates the count of each category, the number of unique values, the most common value, and its frequency. This provides insights into the distribution and dominance of categories within the dataset. The results are formatted into two distinct tables, one for numerical and another for categorical variable summaries, offering a clear overview of the dataset's main characteristics.

The primary advantages of this test include its ability to provide a detailed summary of the dataset, highlighting the distribution and characteristics of the variables under consideration. It is a versatile and robust method applicable to both numerical and categorical data, making it particularly useful for identifying anomalies such as outliers, extreme skewness, or lack of diversity. These insights are crucial for understanding model behavior during testing and validation, as they can indicate potential areas of concern or interest that may affect model performance.

It should be noted that while this test offers a high-level overview of the data, it may fail to detect subtle correlations or complex patterns. It does not provide insights into the relationships between variables, which can be critical for understanding interactions within the dataset. Additionally, descriptive statistics alone cannot infer properties about future unseen data, and should be used in conjunction with other statistical tests to provide a comprehensive understanding of the model's data. Signs of high risk include skewed data or significant outliers, which may be reflected by a significant difference between the mean and median for numerical data, or a lack of diversity and overdominance of a single category for categorical data.

This test shows the results in two tables: one for numerical variables and another for categorical variables. The numerical table includes columns for count, mean, standard deviation, minimum, maximum, and various percentiles (25%, 50%, 75%, 90%, 95%). These metrics provide a detailed view of the distribution and spread of each numerical variable. For instance, the CreditScore variable has a mean of 650.16 and a standard deviation of 96.85, indicating a relatively wide spread around the mean. The categorical table includes columns for the count of each category, the number of unique values, the most common value, and its frequency and percentage. For example, the Geography variable shows that "France" is the most common category, representing 50.12% of the data. These tables allow for a quick assessment of the data's distribution and highlight any potential areas of concern, such as skewness or lack of diversity.

The test results reveal the following key insights:

CreditScore Distribution: The CreditScore variable has a mean of 650.16 with a standard deviation of 96.85, indicating a wide distribution. The median (50th percentile) is 652, suggesting a relatively symmetric distribution around the mean.
Age Range and Distribution: The Age variable shows a mean of 38.95 and a standard deviation of 10.46, with ages ranging from 18 to 92. The median age is 37, indicating a slightly younger population.
Balance Variability: The Balance variable has a mean of 76,434.10 and a high standard deviation of 62,612.25, with a median of 97,264, indicating significant variability and potential skewness.
Geography Dominance: In the categorical data, "France" is the most common value for Geography, accounting for 50.12% of the data, suggesting a potential lack of diversity in this variable.
Gender Imbalance: The Gender variable shows a slight imbalance, with "Male" being the most common category at 54.95%.

Based on these results, the dataset exhibits a relatively balanced distribution for most numerical variables, with some notable variability in Balance and a slight skew in Age. The categorical variables show a dominance of certain categories, particularly in Geography and Gender, which may influence model behavior. These insights suggest that while the dataset is generally well-distributed, attention should be paid to the variability in Balance and the dominance in categorical variables, as these could impact the model's predictive performance and generalizability.

Tables

Numerical Variables

Name	Count	Mean	Std	Min	25%	50%	75%	90%	95%	Max
CreditScore	8000.0	650.1596	96.8462	350.0	583.0	652.0	717.0	778.0	813.0	850.0
Age	8000.0	38.9489	10.4590	18.0	32.0	37.0	44.0	53.0	60.0	92.0
Tenure	8000.0	5.0339	2.8853	0.0	3.0	5.0	8.0	9.0	9.0	10.0
Balance	8000.0	76434.0965	62612.2513	0.0	0.0	97264.0	128045.0	149545.0	162488.0	250898.0
NumOfProducts	8000.0	1.5325	0.5805	1.0	1.0	1.0	2.0	2.0	2.0	4.0
HasCrCard	8000.0	0.7026	0.4571	0.0	0.0	1.0	1.0	1.0	1.0	1.0
IsActiveMember	8000.0	0.5199	0.4996	0.0	0.0	1.0	1.0	1.0	1.0	1.0
EstimatedSalary	8000.0	99790.1880	57520.5089	12.0	50857.0	99505.0	149216.0	179486.0	189997.0	199992.0

Categorical Variables

Name	Count	Number of Unique Values	Top Value	Top Value Frequency	Top Value Frequency %
Geography	8000.0	3.0	France	4010.0	50.12
Gender	8000.0	2.0	Male	4396.0	54.95

result2 = vm.tests.run_test(
    test_id="validmind.data_validation.ClassImbalance",
    inputs={"dataset": vm_raw_dataset},
    params={"min_percent_threshold": 30},
)

❌ Class Imbalance

Class Imbalance is designed to evaluate the distribution of target classes in a dataset used by a machine learning model. Its primary purpose is to ensure that the classes are not overly skewed, which could lead to bias in the model's predictions. A balanced training dataset is crucial to avoid creating a model that is biased with high accuracy for the majority class and low accuracy for the minority class.

The test operates by calculating the frequency of each class in the target column of the dataset, expressed as a percentage. It checks whether each class appears in at least a set minimum percentage of the total records, with the default threshold set at 10%. The test uses this threshold to determine if any class is under-represented, marking it as high risk if it falls below the threshold. The methodology involves counting the occurrences of each class and dividing by the total number of records to derive the percentage. A class distribution is considered balanced if all classes meet or exceed the threshold, while any class below it is flagged as imbalanced.

The primary advantages of this test include its ability to quickly identify under-represented classes that could affect the efficiency of a machine learning model. The calculation is straightforward and swift, providing a clear quantification of class imbalance. The test is highly informative, not only spotting imbalance but also quantifying its degree. The adjustable threshold allows flexibility and adaptation to different use-cases or domain-specific needs. Additionally, the test provides a visually insightful plot showing the classes and their corresponding proportions, enhancing interpretability and comprehension of the data.

It should be noted that the test might struggle to provide vital insights for datasets with a high number of classes, where imbalance could be inevitable due to inherent class distribution. Sensitivity to the threshold value might result in faulty detection of imbalance if the threshold is set excessively high. Regardless of the percentage threshold, it doesn't account for varying costs or impacts of misclassifying different classes, which might fluctuate based on specific applications or domains. While it can identify imbalances in class distribution, it doesn't provide direct methods to address or correct these imbalances. The test is only applicable for classification operations and unsuitable for regression or clustering tasks.

This test shows the class distribution in the dataset through both a table and a plot. The table titled "Exited Class Imbalance" presents the percentage of rows for each class and indicates whether each class passes or fails the set threshold of 30%. The plot visually represents the same data, with the x-axis showing the class labels and the y-axis showing the percentage of each class. The table reveals that class '0' comprises 79.80% of the dataset and passes the threshold, while class '1' comprises 20.20% and fails. The plot corroborates this, with a significantly taller bar for class '0' compared to class '1', indicating a clear imbalance. The key metric here is the percentage of rows, which highlights the disparity between the classes and the failure of class '1' to meet the threshold.

The test results reveal the following key insights:

Significant Class Imbalance: The dataset shows a significant imbalance, with class '0' making up 79.80% of the data, while class '1' only accounts for 20.20%.
Threshold Failure for Minority Class: Class '1' fails to meet the 30% threshold, indicating a potential risk of bias in model predictions due to under-representation.

Based on these results, the dataset exhibits a clear class imbalance, with class '0' being the majority and class '1' failing to meet the minimum threshold. This imbalance suggests that the model may perform well on the majority class but poorly on the minority class, potentially leading to biased predictions. The visual and tabular data both highlight the disparity, emphasizing the need for strategies to address this imbalance to ensure fair and accurate model performance across all classes.

Parameters:

{
  "min_percent_threshold": 30
}

Tables

Exited Class Imbalance

Exited	Percentage of Rows (%)	Pass/Fail
0	79.80%	Pass
1	20.20%	Fail

Figures

ValidMind Figure validmind.data_validation.ClassImbalance:69a3

The output above shows that the class imbalance test did not pass according to the value we set for min_percent_threshold.

To address this issue, we'll re-run the test on some processed data. In this case let's apply a very simple rebalancing technique to the dataset:

import pandas as pd

raw_copy_df = raw_df.sample(frac=1)  # Create a copy of the raw dataset

# Create a balanced dataset with the same number of exited and not exited customers
exited_df = raw_copy_df.loc[raw_copy_df["Exited"] == 1]
not_exited_df = raw_copy_df.loc[raw_copy_df["Exited"] == 0].sample(n=exited_df.shape[0])

balanced_raw_df = pd.concat([exited_df, not_exited_df])
balanced_raw_df = balanced_raw_df.sample(frac=1, random_state=42)

With this new balanced dataset, you can re-run the individual test to see if it now passes the class imbalance test requirement.

As this is technically a different dataset, remember to first initialize a new ValidMind Dataset object to pass in as input as required by run_test():

# Register new data and now 'balanced_raw_dataset' is the new dataset object of interest
vm_balanced_raw_dataset = vm.init_dataset(
    dataset=balanced_raw_df,
    input_id="balanced_raw_dataset",
    target_column="Exited",
)

# Pass the initialized `balanced_raw_dataset` as input into the test run
result = vm.tests.run_test(
    test_id="validmind.data_validation.ClassImbalance",
    inputs={"dataset": vm_balanced_raw_dataset},
    params={"min_percent_threshold": 30},
)

✅ Class Imbalance

The test operates by calculating the frequency of each class in the target column of the dataset, expressed as a percentage. It checks whether each class appears in at least a set minimum percentage of the total records, with the default threshold set at 10%. This percentage is adjustable to accommodate different use cases. The test identifies any class that falls below this threshold as high risk, indicating potential class imbalance. The methodology involves counting the occurrences of each class and dividing by the total number of records to derive the percentage. A balanced class distribution is generally considered when all classes meet or exceed the threshold, while a poor result is when one or more classes fall below it.

The primary advantages of this test include its ability to quickly identify under-represented classes that could affect the efficiency of a machine learning model. The straightforward calculation makes it swift and easy to implement. It is highly informative, not only spotting imbalance but also quantifying the degree of imbalance. The adjustable threshold allows flexibility and adaptation to different domains or specific needs. Additionally, the test provides a visual plot showing the classes and their proportions, enhancing interpretability and comprehension of the data.

It should be noted that the test might struggle with datasets containing a high number of classes, where imbalance could be inherent due to the class distribution. Sensitivity to the threshold value might lead to incorrect detection of imbalance if set too high. The test does not account for varying costs or impacts of misclassifying different classes, which might vary based on specific applications. While it identifies imbalances, it does not provide direct methods to address or correct them. The test is applicable only for classification tasks and is unsuitable for regression or clustering.

This test shows the results in both tabular and graphical formats. The table presents the percentage of rows for each class and indicates whether each class passes or fails based on the threshold. The plot visually represents the class distribution, with the x-axis showing the classes and the y-axis displaying the percentage. Both classes, 0 and 1, have an equal distribution of 50%, which is well above the 30% threshold, resulting in a "Pass" for both. The plot confirms this balance, with both bars reaching the 0.5 mark on the y-axis, indicating equal representation.

The test results reveal the following key insights:

Balanced Class Distribution: Both classes, 0 and 1, have an equal representation of 50%, indicating a perfectly balanced dataset.
Threshold Compliance: Each class exceeds the 30% minimum threshold, confirming that no class is under-represented.

Based on these results, the dataset demonstrates a balanced class distribution, with both classes equally represented at 50%. This balance suggests that the model is unlikely to be biased towards any particular class, as both meet the set threshold comfortably. The equal distribution ensures that the model can potentially perform well across all classes, reducing the risk of skewed predictions. This balance is crucial for maintaining model accuracy and fairness, particularly in applications where class representation is critical.

Parameters:

{
  "min_percent_threshold": 30
}

Tables

Exited Class Imbalance

Exited	Percentage of Rows (%)	Pass/Fail
0	50.00%	Pass
1	50.00%	Pass

Figures

ValidMind Figure validmind.data_validation.ClassImbalance:1103

Utilize test output

You can utilize the output from a ValidMind test for further use, for example, if you want to remove highly correlated features. Removing highly correlated features helps make the model simpler, more stable, and easier to understand.

Below we demonstrate how to retrieve the list of features with the highest correlation coefficients and use them to reduce the final list of features for modeling.

First, we'll run validmind.data_validation.HighPearsonCorrelation with the balanced_raw_dataset we initialized previously as input as is for comparison with later runs:

corr_result = vm.tests.run_test(
    test_id="validmind.data_validation.HighPearsonCorrelation",
    params={"max_threshold": 0.3},
    inputs={"dataset": vm_balanced_raw_dataset},
)

❌ High Pearson Correlation

High Pearson Correlation is designed to identify highly correlated feature pairs in a dataset, suggesting feature redundancy or multicollinearity. The primary purpose of this test is to measure the linear relationship between features, which can indicate potential issues such as feature redundancy or multicollinearity that may affect the performance and interpretability of machine learning models.

The test operates by calculating pairwise Pearson correlations for all features in the dataset. It measures the strength and direction of the linear relationship between two variables, with the correlation coefficient ranging from -1 to 1. A value of 1 indicates a perfect positive linear relationship, -1 indicates a perfect negative linear relationship, and 0 indicates no linear relationship. The test sorts these correlations, removing duplicates and self-correlations, and evaluates them against a pre-set threshold (defaulted at 0.3). If the absolute value of a correlation exceeds this threshold, it is flagged as a potential issue. The test also returns the top n strongest correlations, providing a clear view of the most significant relationships in the dataset.

The primary advantages of this test include its ability to quickly and effectively identify linear relationships between feature pairs, which is crucial for understanding potential multicollinearity issues. This transparency allows developers and risk management teams to address these issues early in the model development process. The test's output is straightforward, displaying pairs of correlated variables along with their Pearson correlation coefficients and a Pass or Fail status. This makes it particularly useful for scenarios where model interpretability and performance are critical, as it helps ensure that the model is not unduly influenced by redundant features.

It should be noted that the test is limited to identifying linear relationships and may not detect nonlinear dependencies between features. Additionally, the Pearson correlation coefficient is sensitive to outliers, which can significantly affect the results. The test only identifies redundancy within feature pairs, potentially missing more complex relationships involving three or more variables. High correlation coefficients exceeding the threshold indicate a high risk of multicollinearity, which can lead to model overfitting and reduced interpretability.

This test shows the results in a tabular format, where each row represents a pair of features with their corresponding Pearson correlation coefficient and a Pass or Fail status. The table includes columns for the feature pairs, the calculated correlation coefficient, and whether the correlation exceeds the threshold of 0.3. The coefficients range from -0.1819 to 0.3287, with only one pair, (Age, Exited), failing the test due to a coefficient of 0.3287, indicating a potential issue with multicollinearity. The other pairs have coefficients below the threshold, suggesting no immediate concerns. The table provides a clear and concise view of the relationships between features, allowing for easy identification of potential multicollinearity issues.

The test results reveal the following key insights:

Age and Exited Correlation: The pair (Age, Exited) has a correlation coefficient of 0.3287, which exceeds the threshold, indicating a potential multicollinearity issue.
Low Correlation Among Other Features: Other feature pairs, such as (IsActiveMember, Exited) and (Balance, NumOfProducts), have correlation coefficients well below the threshold, suggesting minimal risk of multicollinearity.
Negative Correlations Observed: Several feature pairs, including (Balance, NumOfProducts) and (NumOfProducts, Exited), exhibit negative correlations, indicating inverse relationships, though these are not strong enough to pose a risk.

Based on these results, the test highlights a potential multicollinearity issue between Age and Exited, which may require further investigation to ensure model robustness. The other feature pairs show low correlation coefficients, indicating that multicollinearity is not a significant concern for these variables. The presence of negative correlations suggests some inverse relationships, but these are not strong enough to impact the model adversely. Overall, the test provides valuable insights into the linear relationships between features, helping to ensure that the model remains interpretable and performs effectively without being influenced by redundant features.

Parameters:

{
  "max_threshold": 0.3
}

Tables

Columns	Coefficient	Pass/Fail
(Age, Exited)	0.3287	Fail
(IsActiveMember, Exited)	-0.1831	Pass
(Balance, NumOfProducts)	-0.1819	Pass
(Balance, Exited)	0.1629	Pass
(Tenure, IsActiveMember)	-0.0547	Pass
(NumOfProducts, Exited)	-0.0485	Pass
(HasCrCard, IsActiveMember)	-0.0401	Pass
(Age, Balance)	0.0388	Pass
(Age, IsActiveMember)	0.0377	Pass
(Age, NumOfProducts)	-0.0333	Pass

The output above shows that the test did not pass according to the value we set for max_threshold.

corr_result is an object of type TestResult. We can inspect the result object to see what the test has produced:

print(type(corr_result))
print("Result ID: ", corr_result.result_id)
print("Params: ", corr_result.params)
print("Passed: ", corr_result.passed)
print("Tables: ", corr_result.tables)

<class 'validmind.vm_models.result.result.TestResult'>
Result ID:  validmind.data_validation.HighPearsonCorrelation
Params:  {'max_threshold': 0.3}
Passed:  False
Tables:  [ResultTable]

Let's remove the highly correlated features and create a new VM dataset object.

We'll begin by checking out the table in the result and extracting a list of features that failed the test:

# Extract table from `corr_result.tables`
features_df = corr_result.tables[0].data
features_df

	Columns	Coefficient	Pass/Fail
0	(Age, Exited)	0.3287	Fail
1	(IsActiveMember, Exited)	-0.1831	Pass
2	(Balance, NumOfProducts)	-0.1819	Pass
3	(Balance, Exited)	0.1629	Pass
4	(Tenure, IsActiveMember)	-0.0547	Pass
5	(NumOfProducts, Exited)	-0.0485	Pass
6	(HasCrCard, IsActiveMember)	-0.0401	Pass
7	(Age, Balance)	0.0388	Pass
8	(Age, IsActiveMember)	0.0377	Pass
9	(Age, NumOfProducts)	-0.0333	Pass

# Extract list of features that failed the test
high_correlation_features = features_df[features_df["Pass/Fail"] == "Fail"]["Columns"].tolist()
high_correlation_features

['(Age, Exited)']

Next, extract the feature names from the list of strings (example: (Age, Exited) > Age):

high_correlation_features = [feature.split(",")[0].strip("()") for feature in high_correlation_features]
high_correlation_features

['Age']

Now, it's time to re-initialize the dataset with the highly correlated features removed.

Note the use of a different input_id. This allows tracking the inputs used when running each individual test.

# Remove the highly correlated features from the dataset
balanced_raw_no_age_df = balanced_raw_df.drop(columns=high_correlation_features)

# Re-initialize the dataset object
vm_raw_dataset_preprocessed = vm.init_dataset(
    dataset=balanced_raw_no_age_df,
    input_id="raw_dataset_preprocessed",
    target_column="Exited",
)

Re-running the test with the reduced feature set should pass the test:

corr_result = vm.tests.run_test(
    test_id="validmind.data_validation.HighPearsonCorrelation",
    params={"max_threshold": 0.3},
    inputs={"dataset": vm_raw_dataset_preprocessed},
)

✅ High Pearson Correlation

High Pearson Correlation is designed to identify highly correlated feature pairs in a dataset, suggesting feature redundancy or multicollinearity. The primary purpose of this test is to measure the linear relationship between features, which can indicate potential issues such as multicollinearity that may affect the performance and interpretability of machine learning models.

The test operates by calculating pairwise Pearson correlations for all features in the dataset. It measures the strength and direction of the linear relationship between two variables, with the correlation coefficient ranging from -1 to 1. A value of 1 indicates a perfect positive linear relationship, -1 indicates a perfect negative linear relationship, and 0 indicates no linear relationship. The test sorts these correlations, removing duplicates and self-correlations, and evaluates them against a pre-set threshold (defaulted at 0.3). If the absolute value of a correlation exceeds this threshold, it suggests a significant linear relationship. The test then returns the top n strongest correlations, providing a Pass or Fail status based on the threshold.

The primary advantages of this test include its ability to quickly and effectively identify linear relationships between feature pairs, which is crucial for detecting multicollinearity early in the model development process. This transparency allows developers to understand which features may be redundant, thus aiding in feature selection and model simplification. By highlighting these relationships, the test helps maintain model interpretability and prevents overfitting, which can occur when redundant features are included in the model.

It should be noted that the test is limited to detecting only linear relationships, which means it may not capture nonlinear dependencies between features. Additionally, the Pearson correlation coefficient is sensitive to outliers, which can skew results and lead to misleading interpretations. The test also focuses on pairwise feature relationships, potentially missing more complex interactions involving three or more variables. High correlation coefficients exceeding the threshold indicate a risk of multicollinearity, which can complicate model interpretation and reduce predictive accuracy.

This test shows the results in a tabular format, listing feature pairs, their correlation coefficients, and a Pass or Fail status based on the threshold of 0.3. Each row represents a pair of features, with the correlation coefficient indicating the strength and direction of their linear relationship. The table is sorted by the absolute value of the correlation coefficients, highlighting the strongest relationships. All listed correlations are below the threshold, resulting in a Pass status for each pair. The coefficients range from -0.1831 to 0.0331, indicating weak linear relationships across the dataset. Notably, the strongest correlation is between "IsActiveMember" and "Exited" with a coefficient of -0.1831, suggesting a weak negative relationship.

The test results reveal the following key insights:

Weak Negative Correlation Between IsActiveMember and Exited: The strongest correlation observed is -0.1831 between "IsActiveMember" and "Exited", indicating a weak negative linear relationship.
Balance and NumOfProducts Show Weak Negative Correlation: A correlation of -0.1819 between "Balance" and "NumOfProducts" suggests a weak negative relationship, which is not significant enough to indicate redundancy.
Overall Weak Correlations: All feature pairs exhibit weak correlations, with coefficients well below the threshold of 0.3, indicating no significant linear relationships that would suggest multicollinearity.

Based on these results, the dataset does not exhibit any strong linear relationships between feature pairs, as all correlations are below the threshold of 0.3. This suggests that multicollinearity is not a significant concern in this dataset, and the features are likely to contribute independently to the model's predictive power. The weak correlations observed indicate that the features are not redundant, supporting the model's interpretability and robustness. These insights provide confidence that the dataset is well-suited for modeling without the need for extensive feature reduction due to multicollinearity.

Parameters:

{
  "max_threshold": 0.3
}

Tables

Columns	Coefficient	Pass/Fail
(IsActiveMember, Exited)	-0.1831	Pass
(Balance, NumOfProducts)	-0.1819	Pass
(Balance, Exited)	0.1629	Pass
(Tenure, IsActiveMember)	-0.0547	Pass
(NumOfProducts, Exited)	-0.0485	Pass
(HasCrCard, IsActiveMember)	-0.0401	Pass
(Tenure, EstimatedSalary)	0.0331	Pass
(NumOfProducts, IsActiveMember)	0.0330	Pass
(CreditScore, EstimatedSalary)	-0.0291	Pass
(Tenure, HasCrCard)	0.0264	Pass

You can also plot the correlation matrix to visualize the new correlation between features:

corr_result = vm.tests.run_test(
    test_id="validmind.data_validation.PearsonCorrelationMatrix",
    inputs={"dataset": vm_raw_dataset_preprocessed},
)

Pearson Correlation Matrix

Pearson Correlation Matrix is designed to evaluate the extent of linear dependency between numerical variables in a dataset. The primary purpose is to identify potential redundancy by revealing high correlations, which can help in reducing dimensionality without significantly impacting model performance.

The test operates by generating a correlation matrix for all numerical variables using the Pearson correlation formula. This formula measures the linear relationship between two variables, producing a coefficient ranging from -1 to 1. A value of 1 indicates a perfect positive correlation, -1 a perfect negative correlation, and 0 no correlation. The test visualizes these relationships in a heat map, where colors represent the magnitude and direction of correlations. High correlations, above 0.7 in absolute terms, are highlighted in white, indicating potential redundancy. The matrix requires numerical data inputs, and the coefficients are derived by comparing each pair of variables to assess their linear dependency.

The primary advantages of this test include its ability to detect and quantify linear relationships between variables, which is crucial for identifying redundant variables. This simplification can lead to more efficient models by reducing complexity and potentially improving performance. The heatmap visualization offers an intuitive overview of correlations, making it accessible even to those less familiar with numerical matrices. This visual representation helps in quickly identifying areas of concern or interest, facilitating better decision-making in model development.

It should be noted that this test is limited to detecting linear relationships, potentially missing non-linear dependencies that could be valuable for dimensionality reduction. It measures only the degree of linear relationship, not the strength of one variable's effect on another. The chosen threshold of 0.7 for high correlation is arbitrary and might exclude meaningful dependencies with lower coefficients. High correlations across many variables can indicate redundancy, posing a risk of overfitting if not addressed.

This test shows a heat map representing the Pearson correlation coefficients between various numerical variables. The matrix is symmetric, with each cell showing the correlation between the variables on the corresponding row and column. The color scale ranges from -1 (dark blue) to 1 (light blue), with white indicating high correlations above 0.7 in absolute terms. The diagonal represents self-correlation, always equal to 1. Notable observations include the lack of any high correlations, as no cells are highlighted in white. The highest correlation observed is between "Balance" and "Exited" at 0.16, which is relatively low. Most correlations are close to zero, indicating weak linear relationships between the variables.

The test results reveal the following key insights:

Low Overall Correlation: Most variables exhibit low correlation coefficients, indicating weak linear relationships.
Balance and Exited: The highest correlation is 0.16 between "Balance" and "Exited", suggesting a slight positive relationship.
Minimal Redundancy: The absence of high correlations suggests minimal redundancy among the variables.

Based on these results, the dataset shows minimal linear dependency between variables, indicating a low risk of redundancy. The weak correlations suggest that each variable may contribute unique information to the model. This lack of strong linear relationships implies that dimensionality reduction through variable removal may not be necessary, as the variables do not exhibit significant overlap in the information they provide. The model can potentially benefit from retaining all variables, as they appear to offer distinct insights into the data.

Figures

ValidMind Figure validmind.data_validation.PearsonCorrelationMatrix:9198

Documenting test results

Now that we've done some analysis on two different datasets, we can use ValidMind to easily document why certain things were done to our raw data with testing to support it.

Every test result returned by the run_test() function has a .log() method that can be used to send the test results to the ValidMind Platform:

When using run_documentation_tests(), documentation sections will be automatically populated with the results of all tests registered in the documentation template.
When logging individual test results to the platform, you'll need to manually add those results to the desired section of the model documentation.

To demonstrate how to add test results to your model documentation, we'll populate the entire Data Preparation section of the documentation using the clean vm_raw_dataset_preprocessed dataset as input, and then document an additional individual result for the highly correlated dataset vm_balanced_raw_dataset.

Run and log multiple tests

run_documentation_tests() allows you to run multiple tests at once and automatically log the results to your documentation. Below, we'll run the tests using the previously initialized vm_raw_dataset_preprocessed as input — this will populate the entire Data Preparation section for every test that is part of the documentation template.

For this example, we'll pass in the following arguments:

inputs: Any inputs to be passed to the tests.
config: A dictionary <test_id>:<test_config> that allows configuring each test individually. Each test config requires the following:
- params: Individual test parameters.
- inputs: Individual test inputs. This overrides any inputs passed from the run_documentation_tests() function.

When including explicit configuration for individual tests, you'll need to specify the inputs even if they mirror what is included in your global configuration.

# Individual test config with inputs specified
test_config = {
    "validmind.data_validation.ClassImbalance": {
        "params": {"min_percent_threshold": 30},
        "inputs": {"dataset": vm_raw_dataset_preprocessed},
    },
    "validmind.data_validation.HighPearsonCorrelation": {
        "params": {"max_threshold": 0.3},
        "inputs": {"dataset": vm_raw_dataset_preprocessed},
    },
}

# Global test config
tests_suite = vm.run_documentation_tests(
    inputs={
        "dataset": vm_raw_dataset_preprocessed,
    },
    config=test_config,
    section=["data_preparation"],
)

Test suite complete!

26/26 (100.0%)

Test Suite Results: Binary Classification V2

Check out the updated documentation on ValidMind.

Template for binary classification models.

▶ Data Preparation

Data Preparation

▶ Test Result: Dataset Description (validmind.data_validation.DatasetDescription)

Dataset Description

Dataset Description is designed to provide a comprehensive analysis and statistical summary of each column in a machine learning model's dataset. The primary purpose of this test is to offer a detailed overview of the dataset's characteristics, which includes vital statistics such as count, distinct values, missing values, and histograms for various data types. This analysis aids in understanding the data's structure and distribution, which is crucial for model training and evaluation.

The test operates by first inferring the data type of each column in the dataset, categorizing them as numerical, categorical, boolean, or text. For each column, the test collects statistical information, including the total count of entries, the number of missing values and their proportion, and the count of unique values along with their proportion. For numerical columns, histograms are generated to depict the distribution of data, while categorical, boolean, and text columns have histograms showing the frequency of each unique value. The test also identifies unsupported data types, raising an error if encountered. The summary table aggregates these insights, providing a holistic view of the dataset's features. The metrics used, such as count and distinct values, help in assessing data completeness and variability, with typical ranges indicating the extent of data diversity and potential issues like high cardinality or missing data.

The primary advantages of this test include its ability to deliver a detailed dataset analysis with versatile summaries, such as counts, unique values, and histograms. It is particularly useful for identifying potential data quality issues, such as missing values or unsupported data types, and for understanding data distribution patterns. The test's flexibility in handling different data types—numerical, categorical, boolean, and text—makes it a valuable tool for developers to make informed decisions about data preprocessing and model training. By providing a comprehensive understanding of dataset features, the test supports the identification of potential challenges and opportunities for data-driven decision-making.

It should be noted that the test has certain limitations and potential risks. The computational cost can be high, especially for large datasets with numerous columns, which may impact resource efficiency. The use of an arbitrary number of bins for histograms might not always capture the optimal data distribution, potentially leading to misinterpretation. Unsupported data types result in errors, limiting the evaluation of the dataset. Additionally, columns with all null or missing values are excluded from histogram computation, which might overlook certain data characteristics. While the test validates dataset quality, it does not directly address model performance, necessitating further analysis for comprehensive model evaluation.

This test shows a detailed table format output that provides a statistical summary of each column in the dataset. The table includes columns for the name, type, count, missing values, missing percentage, distinct values, and distinct percentage for each dataset feature. For instance, the "CreditScore" column is numeric with 3,232 entries, no missing values, and 434 distinct values, representing 13.43% of the total. The "Geography" column is categorical with three distinct values, while "EstimatedSalary" is numeric with each entry being unique. The table format allows for easy reading, with each row representing a dataset feature and its corresponding statistics. Key measurements include the count of entries, which indicates data completeness, and the distinct percentage, which highlights data variability. Notable observations include the high distinct percentage for "Balance" and "EstimatedSalary," suggesting significant variability, while categorical columns like "Gender" and "HasCrCard" show limited distinct values, indicating potential binary or low-cardinality features.

The test results reveal the following key insights:

Complete Data Coverage: All columns have a complete count of 3,232 entries with no missing values, indicating a fully populated dataset.
High Variability in Numeric Columns: Columns like "CreditScore" and "Balance" exhibit high distinct percentages, suggesting significant variability and potential for diverse data patterns.
Low Cardinality in Categorical Columns: Categorical columns such as "Gender" and "HasCrCard" have low distinct percentages, indicating binary or limited category options.
Unique Values in Estimated Salary: The "EstimatedSalary" column has a distinct percentage of 100%, highlighting that each entry is unique, which may impact model feature importance.

Based on these results, the dataset exhibits a robust structure with complete data coverage and significant variability in numeric columns, which can enhance model training by providing diverse data patterns. The low cardinality in categorical columns suggests potential binary features, which may simplify model complexity. The unique values in the "EstimatedSalary" column could influence feature importance, necessitating careful consideration during model development. Overall, the dataset's characteristics provide a solid foundation for machine learning applications, with the potential for rich insights and predictive capabilities.

Tables

Dataset Description

Name	Type	Count	Distinct	Distinct %
CreditScore	Numeric	3232.0	434	0.1343
Geography	Categorical	3232.0	3	0.0009
Gender	Categorical	3232.0	2	0.0006
Tenure	Numeric	3232.0	11	0.0034
Balance	Numeric	3232.0	2190	0.6776
NumOfProducts	Numeric	3232.0	4	0.0012
HasCrCard	Categorical	3232.0	2	0.0006
IsActiveMember	Categorical	3232.0	2	0.0006
EstimatedSalary	Numeric	3232.0	3232	1.0000
Exited	Categorical	3232.0	2	0.0006

▶ Test Result: Class Imbalance (validmind.data_validation.ClassImbalance)

✅ Class Imbalance

The test operates by calculating the frequency of each class in the target column of the dataset, expressed as a percentage. It checks whether each class appears in at least a set minimum percentage of the total records, with the default threshold set at 10%. This involves counting the occurrences of each class and dividing by the total number of records to obtain the percentage. The test then compares these percentages against the threshold to determine if any class is under-represented. A class that falls below the threshold is marked as high risk, indicating potential imbalance. The typical range for these percentages is from 0% to 100%, with values below the threshold considered poor and indicative of imbalance.

The primary advantages of this test include its ability to spot under-represented classes that could affect the efficiency of a machine learning model. The calculation is straightforward and swift, making it easy to implement. The test is highly informative as it not only spots imbalance but also quantifies the degree of imbalance. The adjustable threshold allows for flexibility and adaptation to different use-cases or domain-specific needs. Additionally, the test provides a visually insightful plot showing the classes and their corresponding proportions, enhancing interpretability and comprehension of the data.

It should be noted that the test might struggle to perform well or provide vital insights for datasets with a high number of classes, where imbalance could be inevitable due to inherent class distribution. Sensitivity to the threshold value might result in faulty detection of imbalance if the threshold is set excessively high. Regardless of the percentage threshold, it doesn't account for varying costs or impacts of misclassifying different classes, which might fluctuate based on specific applications or domains. While it can identify imbalances in class distribution, it doesn't provide direct methods to address or correct these imbalances. The test is only applicable for classification operations and unsuitable for regression or clustering tasks.

This test shows the results in both tabular and graphical formats. The table titled "Exited Class Imbalance" presents the percentage of rows for each class and whether they pass or fail the threshold test. Each class, labeled as "Exited" with values 0 and 1, shows a percentage of 50.00%, both passing the test. The accompanying plot visually represents these percentages, with the x-axis showing the class labels and the y-axis showing the percentage. The bars for both classes reach the 50% mark, indicating equal distribution. The plot's scale ranges from 0% to 50%, and the equal height of the bars confirms the balanced distribution. Notably, both classes meet the minimum threshold of 30%, indicating no imbalance.

The test results reveal the following key insights:

Balanced Class Distribution: Both classes, labeled as 0 and 1, have an equal distribution of 50.00%, indicating a perfectly balanced dataset.
Threshold Compliance: Each class surpasses the minimum percentage threshold of 30%, resulting in a pass for both classes.

Based on these results, the dataset demonstrates a balanced class distribution, with each class meeting the set threshold. This balance suggests that the model is unlikely to be biased towards any particular class, supporting fair and accurate predictions. The equal distribution of classes ensures that the model can learn effectively from the data, reducing the risk of skewed predictions. This balance is crucial for maintaining the model's performance and reliability across different scenarios.

Parameters:

{
  "min_percent_threshold": 30
}

Tables

Exited Class Imbalance

Exited	Percentage of Rows (%)	Pass/Fail
0	50.00%	Pass
1	50.00%	Pass

Figures

ValidMind Figure validmind.data_validation.ClassImbalance:008e

▶ Test Result: Duplicates (validmind.data_validation.Duplicates)

✅ Duplicates

Duplicates is designed to ensure the reliability of a model by verifying the quality of the dataset through the detection of duplicate entries. The primary purpose of this test is to identify and quantify duplicate rows within the dataset, which can adversely affect model performance by introducing bias or redundancy. By identifying duplicates, the test helps maintain the integrity of the dataset, ensuring that the model is trained on unique and diverse data points, thereby enhancing its generalization capabilities.

The test operates by examining each row in the dataset to identify duplicates. If a specific text column is designated, the test focuses on that column; otherwise, it evaluates all feature columns. The test calculates the number and percentage of duplicate rows, presenting these metrics in a DataFrame. The methodology involves comparing each row against all others to detect exact matches, which are then counted and expressed as a percentage of the total dataset. The test is considered successful if the number of duplicates is below a user-defined threshold, indicating a dataset free from significant redundancy. The percentage of duplicates can range from 0% to 100%, with lower percentages indicating better data quality. A result of 0% duplicates is ideal, suggesting no redundancy in the dataset.

The primary advantages of this test include its ability to enhance model training by ensuring the dataset is not compromised by duplicate entries, which can skew statistical analyses and lead to overfitting. By providing both absolute and percentage values of duplicate rows, the test offers a comprehensive overview of data quality. This dual metric approach allows for a nuanced understanding of the dataset's integrity, making it particularly useful in scenarios where data quality is paramount. Additionally, the test's customizable threshold feature enables users to define acceptable levels of duplicates, tailoring the test to specific project requirements and ensuring flexibility in its application.

It should be noted that the test has limitations, such as its inability to differentiate between benign duplicates, which may occur naturally, and problematic duplicates resulting from data collection or processing errors. As the dataset size increases, the test becomes more computationally demanding, potentially limiting its applicability to very large datasets. Furthermore, the test only identifies exact duplicates, potentially overlooking semantically similar entries that are not identical but may still impact model performance. High numbers or percentages of duplicates can indicate data collection issues, leading to overfitting and poor model generalization on unseen data.

This test shows the results in a tabular format, specifically focusing on the number of duplicate rows and their percentage relative to the total dataset. The table titled "Duplicate Rows Results for Dataset" presents two key metrics: the "Number of Duplicates" and the "Percentage of Rows (%)". In this case, the table indicates that there are zero duplicate entries, with a corresponding percentage of 0.0%. This suggests that the dataset is free from redundancy, which is a positive indicator of data quality. The absence of duplicates implies that the dataset is well-prepared for model training, reducing the risk of overfitting and enhancing the model's ability to generalize to new data. The table is straightforward to interpret, with the key metrics clearly labeled, allowing for quick assessment of the dataset's integrity.

The test results reveal the following key insights:

No Duplicate Entries Detected: The dataset contains zero duplicate rows, indicating a high level of data quality and integrity.
Optimal Data Quality: With 0.0% of the dataset consisting of duplicates, the data is well-suited for training purposes, minimizing the risk of overfitting.

Based on these results, the dataset demonstrates excellent data quality, with no duplicate entries detected. This suggests that the data collection and processing methods are robust, providing a solid foundation for model training. The absence of duplicates ensures that the model will not be biased by redundant information, allowing it to learn more effectively from the unique data points available. This enhances the model's potential to generalize well to new, unseen data, thereby improving its overall performance and reliability. The results underscore the importance of thorough data quality checks in the pre-processing phase, ensuring that the dataset is optimized for model development.

Tables

Duplicate Rows Results for Dataset

Number of Duplicates	Percentage of Rows (%)
0	0.0

▶ Test Result: High Cardinality (validmind.data_validation.HighCardinality)

✅ High Cardinality

High Cardinality is designed to evaluate the number of unique values present in the categorical columns of a dataset to detect high cardinality and potential overfitting. The primary purpose of this test is to identify columns with a large number of unique, non-repetitive values, which can introduce noise and lead to overfitting in predictive models.

The test operates by first identifying columns classified as "Categorical" within the dataset. For each of these columns, it calculates the number of distinct values (n_distinct) and the percentage of distinct values (p_distinct). The test then compares n_distinct against a predefined numeric threshold to determine if the column passes or fails. A column passes the test if n_distinct is less than the threshold, indicating manageable cardinality. The percentage of distinct values is calculated by dividing n_distinct by the total number of entries in the column, providing a relative measure of cardinality. This percentage typically ranges from 0% to 100%, where lower percentages suggest fewer unique values relative to the dataset size, which is generally favorable for model stability.

The primary advantages of this test include its ability to detect potential overfitting early in the model development process by identifying columns with high cardinality. This early detection helps in maintaining data quality by highlighting potential outliers and inconsistencies. The test's versatility allows it to be applied to both classification and regression tasks, making it a valuable tool across various modeling scenarios. By focusing on categorical data, it provides a targeted approach to managing the complexity and noise that can arise from high cardinality features.

It should be noted that the test is limited to categorical data types, which restricts its applicability to numerical or continuous features. This limitation means that the test may overlook important patterns in non-categorical data. Additionally, the static threshold used for determining high cardinality may not be optimal for all datasets, potentially leading to false positives or negatives. The test also does not account for the contextual importance of unique values, which could result in the exclusion of critical data points that are significant for the model's predictive power.

This test shows the results in a tabular format, presenting the column name, number of distinct values, percentage of distinct values, and pass/fail status for each categorical column. The table is straightforward to read, with each row representing a different column from the dataset. The key measurements include the number of distinct values, which indicates the absolute cardinality, and the percentage of distinct values, which provides a relative measure. Notable observations from the table include the "Geography" column with 3 distinct values and a 0.0928% distinct value percentage, and the "Gender" column with 2 distinct values and a 0.0619% distinct value percentage. Both columns pass the test, suggesting that their cardinality is within acceptable limits.

The test results reveal the following key insights:

Geography Column Passes with Low Cardinality: The "Geography" column has 3 distinct values, representing 0.0928% of the total entries, which is well below the threshold, indicating low cardinality and a pass status.
Gender Column Passes with Minimal Distinct Values: The "Gender" column contains only 2 distinct values, accounting for 0.0619% of the dataset, also passing the test and suggesting minimal risk of overfitting.

Based on these results, the dataset's categorical columns exhibit low cardinality, which reduces the risk of overfitting and suggests that the model is less likely to be influenced by noise from these features. The "Geography" and "Gender" columns both pass the test, indicating that their unique value counts are within acceptable limits, contributing to the model's stability and reliability. This analysis provides confidence that the categorical features are unlikely to introduce significant noise or complexity, supporting robust model performance.

Tables

Column	Number of Distinct Values	Percentage of Distinct Values (%)	Pass/Fail
Geography	3	0.0928	Pass
Gender	2	0.0619	Pass

▶ Test Result: Missing Values (validmind.data_validation.MissingValues)

✅ Missing Values

Missing Values is designed to evaluate the quality of a dataset by measuring the number of missing values across all features. The primary purpose of this test is to ensure that the ratio of missing data to total data is less than a predefined threshold, thereby maintaining the data quality necessary for reliable predictive strength in a machine learning model.

The test operates by iterating through each column of the dataset, counting missing values represented as NaNs, and calculating the percentage they represent against the total number of rows. This percentage is then compared to a predefined min_threshold to determine if the data quality is acceptable. The test results are presented in a table format, summarizing each column, the number of missing values, the percentage of missing values, and a Pass/Fail status based on the threshold comparison. The typical range for the percentage of missing values is from 0% to 100%, with lower percentages indicating better data quality. A result below the threshold is generally considered good, while exceeding it suggests potential data quality issues.

The primary advantages of this test include its ability to quickly and granularly identify missing data across each feature in the dataset. This capability is particularly useful for maintaining data quality, which is essential for constructing efficient machine learning models. By providing a straightforward means of assessing missing data, the test helps ensure that datasets are robust enough to support reliable model predictions. Additionally, the test's simplicity and clarity make it accessible for users to interpret and act upon the results without requiring extensive statistical knowledge.

It should be noted that the test does not suggest the root causes of the missing values or recommend ways to impute or handle them. This limitation means that while the test can identify the presence of missing data, it does not provide guidance on how to address it. Furthermore, the test may overlook features with significant missing data that still fall below the min_threshold, potentially impacting the model's performance. Additionally, the test does not account for data encoded as values like "-999" or "None," which might not technically classify as missing but could bear similar implications for data quality.

This test shows the results in a table format, where each row represents a feature of the dataset, and columns include the feature name, the number of missing values, the percentage of missing values, and a Pass/Fail status. The table is straightforward to read, with each feature's missing data characteristics clearly outlined. Key measurements include the absolute number of missing values and their percentage relative to the total dataset size. Notable observations from the table indicate that all features have zero missing values, resulting in a 0.0% missing value percentage for each feature. Consequently, every feature passes the test, suggesting excellent data quality across the dataset.

The test results reveal the following key insights:

Complete Data Integrity: All features in the dataset have zero missing values, indicating complete data integrity and reliability for model training and evaluation.
Uniform Data Quality: The uniformity in the absence of missing values across all features suggests consistent data collection and preprocessing practices, which are crucial for maintaining model performance.
Pass Status Across Features: Each feature passes the test, confirming that the dataset meets the predefined quality threshold, ensuring that no feature is at risk of introducing bias or inaccuracies due to missing data.

Based on these results, the dataset demonstrates a high level of data quality, with no missing values detected across any of the features. This uniformity in data integrity suggests that the dataset is well-prepared for use in machine learning applications, as it minimizes the risk of model bias or inaccuracies due to incomplete data. The consistent pass status across all features indicates robust data collection and preprocessing practices, which are essential for maintaining reliable model performance. These insights collectively affirm the dataset's readiness for further analysis and model development, providing a solid foundation for predictive modeling efforts.

Tables

Column	Number of Missing Values	Percentage of Missing Values (%)	Pass/Fail
CreditScore	0	0.0	Pass
Geography	0	0.0	Pass
Gender	0	0.0	Pass
Tenure	0	0.0	Pass
Balance	0	0.0	Pass
NumOfProducts	0	0.0	Pass
HasCrCard	0	0.0	Pass
IsActiveMember	0	0.0	Pass
EstimatedSalary	0	0.0	Pass
Exited	0	0.0	Pass

▶ Test Result: Skewness (validmind.data_validation.Skewness)

❌ Skewness

Skewness is designed to evaluate the asymmetry in the distribution of numerical data within a dataset, ensuring data quality and optimizing model performance. The primary purpose of this test is to identify deviations from a normal distribution, which can impact the performance of machine learning models, particularly in classification and regression tasks.

The test operates by calculating the skewness of each numerical column in the dataset. Skewness is a statistical measure that quantifies the degree of asymmetry of a distribution around its mean. A skewness value of zero indicates a perfectly symmetrical distribution, while positive or negative values indicate right or left skew, respectively. The test compares these calculated skewness values against a predefined threshold, typically set at 1. If a column's skewness exceeds this threshold, it is flagged as failing the test, indicating potential data quality issues. The test results include the skewness values and pass/fail status for each column, providing a clear indication of which columns may require further attention.

The primary advantages of this test include its ability to quickly and efficiently identify skewed data distributions, which can be detrimental to model performance. By providing a quantitative measure of skewness, the test allows for the customization of the threshold to suit specific user needs, making it adaptable to various datasets and modeling scenarios. This adaptability is particularly useful in ensuring that the foundational assumptions of machine learning models are not violated, thereby reducing the risk of biased predictions and improving overall model reliability.

It should be noted that the test is limited to evaluating numeric columns, potentially overlooking skewness or bias in categorical data. Additionally, the assumption that data should follow a normal distribution may not always be applicable, as real-world data often exhibit different distribution patterns. The subjective nature of the threshold for skewness also requires expert input to ensure it is appropriately set for the specific context, necessitating recurrent iterations and refinements to maintain accuracy and relevance.

This test shows the skewness results for various columns in the dataset, presented in a tabular format. Each row of the table corresponds to a specific column, with columns for the column name, skewness value, and pass/fail status. The skewness values indicate the degree of asymmetry in the data distribution, with values close to zero suggesting symmetry. The pass/fail status is determined by comparing the skewness value against the threshold of 1. Notable observations include the "NumOfProducts" column, which has a skewness of 1.2471, exceeding the threshold and resulting in a fail status. Other columns, such as "CreditScore" and "Balance," exhibit skewness values of -0.0729 and -0.2475, respectively, both passing the test and indicating relatively symmetrical distributions.

The test results reveal the following key insights:

Overall Symmetry in Most Columns: Most columns, such as "CreditScore," "Tenure," and "Balance," exhibit skewness values close to zero, indicating symmetrical distributions and passing the test.
Significant Skewness in NumOfProducts: The "NumOfProducts" column shows a skewness of 1.2471, exceeding the threshold and failing the test, suggesting a right-skewed distribution that may impact model performance.
Uniformity in EstimatedSalary and Exited: The "EstimatedSalary" and "Exited" columns have skewness values of 0.0005 and 0.0, respectively, indicating near-perfect symmetry and passing the test.

Based on these results, the dataset generally exhibits symmetrical distributions across most columns, which is favorable for maintaining the assumptions of normality in machine learning models. However, the "NumOfProducts" column's significant skewness suggests a potential area for data preprocessing or transformation to mitigate its impact on model performance. The overall symmetry in other columns supports the reliability of the dataset for predictive modeling, with minimal risk of skewness-induced bias. This analysis provides a clear understanding of the data distribution characteristics, guiding further data preparation and model development efforts.

Tables

Skewness Results for Dataset

Column	Skewness	Pass/Fail
CreditScore	-0.0729	Pass
Tenure	0.0383	Pass
Balance	-0.2475	Pass
NumOfProducts	1.2471	Fail
HasCrCard	-0.8742	Pass
IsActiveMember	0.1577	Pass
EstimatedSalary	0.0005	Pass
Exited	0.0000	Pass

▶ Test Result: Unique Rows (validmind.data_validation.UniqueRows)

❌ Unique Rows

Unique Rows is designed to verify the diversity of the dataset by ensuring that the count of unique rows exceeds a prescribed threshold. This test is crucial for assessing the quality of data used in training machine learning models, as it ensures a varied collection of data, which is essential for developing unbiased and robust models capable of generalizing well to new data.

The test operates by first calculating the total number of rows in the dataset. It then determines the count of unique rows for each column. The percentage of unique rows is calculated as the ratio of unique rows to the total row count. This percentage is compared against a predefined minimum threshold. If the percentage of unique rows for a column is less than this threshold, the column fails the test. The test results are compiled to provide a pass or fail verdict for each column, ensuring that all columns meet the diversity requirement. This mechanism helps identify columns with insufficient diversity, which could lead to overfitting and poor model generalization.

The primary advantages of this test include its efficiency in evaluating data diversity across each column in the dataset. By systematically assessing the uniqueness of data, the test provides a quick and reliable method to gauge data quality, which is pivotal in developing effective machine learning models. The test's ability to highlight columns with low diversity helps in identifying potential data quality issues early in the model development process, allowing for timely interventions to improve data collection or preprocessing strategies.

It should be noted that the Unique Rows test assumes that data quality is directly proportional to its uniqueness, which may not always be the case. In some contexts, non-unique rows might be essential and should not be disregarded. Additionally, the test does not account for the relative importance of each column in predicting the output, treating all columns equally. This limitation can be particularly pronounced in datasets with categorical variables, where the number of unique categories is inherently limited, potentially leading to misleading results regarding data quality.

This test shows the results in a tabular format, where each row represents a column from the dataset, and the columns of the table include the column name, the number of unique values, the percentage of unique values, and a pass/fail status. The table provides a clear overview of the diversity of each column, with the percentage of unique values indicating how much of the dataset's total row count is represented by unique entries in that column. A higher percentage suggests greater diversity, which is generally desirable. Notable observations include columns like "EstimatedSalary" with 100% unique values, indicating high diversity, while columns like "Geography" and "Gender" have very low percentages, failing the test due to insufficient diversity.

The test results reveal the following key insights:

High Diversity in Estimated Salary: The "EstimatedSalary" column shows 100% unique values, indicating excellent diversity and passing the test.
Low Diversity in Categorical Columns: Columns such as "Geography," "Gender," and "HasCrCard" exhibit very low percentages of unique values, failing the test and suggesting limited diversity.
Moderate Diversity in Credit Score and Balance: The "CreditScore" and "Balance" columns have moderate diversity, with percentages of 13.43% and 67.76%, respectively, both passing the test.
Uniformity in Binary Columns: Columns like "IsActiveMember" and "Exited" show minimal diversity with only 2 unique values each, failing the test due to their binary nature.

Based on these results, the dataset exhibits a mix of high and low diversity across different columns. The high diversity in the "EstimatedSalary" column suggests that this feature is well-represented and varied, which is beneficial for model training. However, the low diversity in categorical and binary columns indicates potential areas of concern, as these features may not provide sufficient variability to support robust model development. The moderate diversity in "CreditScore" and "Balance" suggests these features are reasonably varied, contributing positively to the dataset's overall quality. These insights highlight the importance of ensuring adequate diversity in all features to enhance the model's ability to generalize effectively.

Tables

Column	Number of Unique Values	Percentage of Unique Values (%)	Pass/Fail
CreditScore	434	13.4282	Pass
Geography	3	0.0928	Fail
Gender	2	0.0619	Fail
Tenure	11	0.3403	Fail
Balance	2190	67.7599	Pass
NumOfProducts	4	0.1238	Fail
HasCrCard	2	0.0619	Fail
IsActiveMember	2	0.0619	Fail
EstimatedSalary	3232	100.0000	Pass
Exited	2	0.0619	Fail

▶ Test Result: Too Many Zero Values (validmind.data_validation.TooManyZeroValues)

❌ Too Many Zero Values

Too Many Zero Values is designed to identify numerical columns in a dataset that contain an excessive number of zero values, which may indicate data sparsity or a lack of variation. The primary purpose of this test is to highlight columns where the proportion of zero values exceeds a predefined threshold, potentially impacting the effectiveness of a machine learning model.

The test operates by iterating through each numerical column in the dataset, calculating the total number of zero values, and determining their ratio to the total number of rows. This ratio is then compared against a threshold, set by default at 0.03%. If the percentage of zero values in a column surpasses this threshold, the column is flagged as having too many zero values. The test results are presented in a summary format, indicating the count and percentage of zero values for each column, along with a pass or fail status. This mechanism allows for the identification of columns that may require further investigation due to potential data quality issues.

The primary advantages of this test include its ability to efficiently pinpoint columns with a high concentration of zero values, which might otherwise be overlooked in large datasets. By providing both counts and percentages of zero values, the test offers a detailed view of the data distribution, facilitating a deeper understanding of the dataset's characteristics. Additionally, the test's flexibility in adjusting the threshold for what constitutes 'too many' zero values allows it to be tailored to specific analytical needs, making it a versatile tool for data quality assessment in various contexts.

It should be noted that this test is limited to detecting zero values and does not account for other potential data issues such as outliers, missing values, or extreme values. It also does not consider the context in which zero values may be meaningful, potentially leading to misinterpretation if zero values are expected in certain columns. Furthermore, the test does not evaluate non-numerical columns, which may have their own data quality concerns. The presence of a high percentage of zero values in a column may not always indicate poor data quality, as it could be a normal characteristic of the data in specific scenarios.

This test shows the results in a tabular format, where each row represents a numerical column from the dataset. The table includes columns for the variable name, total row count, number of zero values, percentage of zero values, and a pass or fail status based on the threshold. The key measurements are the number and percentage of zero values, which are critical for assessing data sparsity. Notable observations include the fact that all tested columns have failed the test, indicating a higher than acceptable percentage of zero values. The "Tenure" column, for instance, has 3.8985% zero values, while the "Balance" column has a significantly higher percentage at 32.271%. The "IsActiveMember" column shows the highest percentage of zero values at 53.9295%, suggesting a potential issue with data variation in these columns.

The test results reveal the following key insights:

High Zero Value Percentage in 'IsActiveMember': The "IsActiveMember" column exhibits the highest percentage of zero values at 53.9295%, indicating a significant lack of variation and potential data quality concerns.
Significant Zero Values in 'Balance': The "Balance" column has 32.271% zero values, suggesting that a substantial portion of the dataset may have no recorded balance, which could impact model predictions.
Moderate Zero Values in 'HasCrCard': With 29.9814% zero values, the "HasCrCard" column also fails the test, pointing to a notable proportion of entries without credit card information.
'Tenure' Column Fails Threshold: Although the "Tenure" column has the lowest percentage of zero values at 3.8985%, it still exceeds the threshold, indicating potential issues with data completeness.

Based on these results, the dataset exhibits a significant presence of zero values across multiple columns, which may affect the model's ability to learn effectively from the data. The high percentage of zero values in the "IsActiveMember" and "Balance" columns suggests potential data sparsity issues that could lead to biased or inaccurate model outputs. The "HasCrCard" and "Tenure" columns also show a concerning level of zero values, which may require further investigation to determine the underlying causes. These insights highlight the need for careful consideration of data quality and completeness when developing models, as the presence of excessive zero values can significantly impact model performance and reliability.

Tables

Variable	Row Count	Number of Zero Values	Percentage of Zero Values (%)	Pass/Fail
Tenure	3232	126	3.8985	Fail
Balance	3232	1043	32.2710	Fail
HasCrCard	3232	969	29.9814	Fail
IsActiveMember	3232	1743	53.9295	Fail

▶ Test Result: IQR Outliers Table (validmind.data_validation.IQROutliersTable)

IQR Outliers Table

IQR Outliers Table is designed to identify and summarize outliers within numerical features of a dataset using the Interquartile Range (IQR) method. This test is essential in data preprocessing as it helps detect anomalies that could skew statistical analyses and affect the performance of machine learning models.

The test operates by calculating the IQR for each numerical feature, which is the range between the first quartile (25th percentile) and the third quartile (75th percentile). An outlier is defined as any data point that falls below "Q1 - 1.5 * IQR" or above "Q3 + 1.5 * IQR". This method is robust because it relies on quartile calculations, which are less sensitive to extreme values. The test outputs the number of outliers and their summary statistics, including minimum, 25th percentile, median, 75th percentile, and maximum values for each feature. The default threshold for identifying outliers is set at 1.5 but can be adjusted based on user requirements. This approach is particularly useful for datasets where numerical features are prone to extreme values that could distort analysis.

The primary advantages of this test include its ability to provide a detailed summary of outliers for each numerical feature, which aids in identifying features with potential data quality issues. The IQR method is advantageous because it is less affected by extremely high or low values, making it a robust choice for datasets with outliers. Additionally, the test can be customized to focus on specific features and adjust the threshold for outlier detection, offering flexibility in its application. This makes it particularly useful in scenarios where data integrity is crucial, and outliers could significantly impact model performance.

It should be noted that the test has limitations, such as the potential for false positives in datasets with non-normal distributions, particularly those that are skewed. The IQR method does not provide interpretations or recommendations for handling outliers, leaving further analysis to data scientists. It is also limited to numerical features and does not apply to categorical data. The default thresholds may not be suitable for datasets with significant preprocessing or those with high kurtosis, where the tails of the distribution are heavier than normal.

This test shows a summary of outliers detected by the IQR method, presented in a table format. The table includes columns for each variable, detailing the total count of outliers, mean value of the variable, and summary statistics of the outliers, such as minimum, 25th percentile, median, 75th percentile, and maximum values. For the "CreditScore" variable, there are 8 outliers with a mean value of 646.4957, and outlier values ranging from 350 to 367. The "NumOfProducts" variable has 44 outliers, all with a value of 4, indicating a potential data entry issue or a feature with limited variability. The table provides a clear overview of the outlier distribution, allowing for easy identification of features with significant outlier presence.

The test results reveal the following key insights:

CreditScore Outliers Concentration: The "CreditScore" variable has 8 outliers, all concentrated at the lower end of the distribution, with values ranging from 350 to 367, significantly below the mean of 646.4957.
NumOfProducts Uniform Outliers: The "NumOfProducts" variable shows a uniform outlier value of 4 across all 44 detected outliers, suggesting a potential issue with data entry or a feature with limited variability.

Based on these results, the analysis highlights that the "CreditScore" variable has a small number of outliers concentrated at the lower end, which could indicate a subset of the population with significantly lower credit scores. In contrast, the "NumOfProducts" variable shows a uniform outlier value, suggesting a potential data quality issue or a feature with limited variability. These insights are crucial for understanding the distribution and potential anomalies within the dataset, which can inform further data cleaning and preprocessing steps to ensure robust model performance.

Tables

Summary of Outliers Detected by IQR Method

Variable	Total Count of Outliers	Mean Value of Variable	Minimum Outlier Value	Outlier Value at 25th Percentile	Outlier Value at 50th Percentile	Outlier Value at 75th Percentile	Maximum Outlier Value
CreditScore	8	646.4957	350	350.0	350.0	360.5	367
NumOfProducts	44	1.5046	4	4.0	4.0	4.0	4

▶ Test Result: IQR Outliers Bar Plot (validmind.data_validation.IQROutliersBarPlot)

IQR Outliers Bar Plot

IQR Outliers Bar Plot is designed to visually analyze and evaluate the extent of outliers in numeric variables based on percentiles. Its primary purpose is to clarify the dataset's distribution, flag possible abnormalities, and gauge potential risks associated with processing potentially skewed data, which can affect the machine learning model's predictive prowess.

The test operates by calculating the 25th percentile (Q1) and 75th percentile (Q3) for each numeric feature in the dataset to derive the Interquartile Range (IQR), which is the difference between Q1 and Q3. It then determines the lower and upper thresholds by subtracting Q1 from the threshold times IQR and adding Q3 to the threshold times IQR, with the default threshold set at 1.5. Values falling below the lower threshold or exceeding the upper threshold are labeled as outliers. The number of outliers is tallied for different percentiles, such as [0-25], [25-50], [50-75], and [75-100], and these counts are used to construct a bar plot for each feature, showcasing the distribution of outliers across different percentiles.

The primary advantages of this test include its ability to effectively identify outliers in the data through visual means, facilitating easier comprehension and offering insights into the outliers' possible impact on the model. It provides flexibility by accommodating all numeric features or a chosen subset and is task-agnostic, making it viable for both classification and regression tasks. Additionally, it can handle large datasets as its operation does not rely on computationally heavy processes, making it efficient and scalable.

It should be noted that the test is limited to numerical variables and does not extend to categorical ones. It only reveals the presence and distribution of outliers without providing insights into how these outliers might affect the model's predictive performance. The assumption that data is unimodal and symmetric may not always hold true, and in cases with non-normal distributions, the results can be misleading. Signs of high risk include a prevalence of outliers potentially skewing the distribution, outliers dominating higher percentiles (75-100), and certain features harboring most of their values as outliers, which may not contribute positively to the model's forecasting ability.

This test shows the distribution of outliers across different percentiles for numeric features using bar plots. Each plot represents a feature, with the x-axis indicating the percentile ranges [0-25], [25-50], [50-75], and [75-100], and the y-axis showing the outlier count. The plots provide a visual representation of where outliers are concentrated within the data distribution. For instance, in the "CreditScore" plot, outliers are primarily concentrated in the 50-75 percentile range, with a smaller number in the 75-100 range. In contrast, the "NumOfProducts" plot shows a significant concentration of outliers in the 75-100 percentile range. These visualizations help identify features with potential skewness and extreme values, which could impact model performance.

The test results reveal the following key insights:

Concentration of Outliers in CreditScore: The "CreditScore" feature shows a significant concentration of outliers in the 50-75 percentile range, indicating potential skewness in this segment of the data.
Extreme Values in NumOfProducts: The "NumOfProducts" feature has a high concentration of outliers in the 75-100 percentile range, suggesting the presence of extreme values that could influence model predictions.

Based on these results, the insights indicate that certain features, such as "CreditScore" and "NumOfProducts," exhibit notable concentrations of outliers in specific percentile ranges. This suggests potential skewness and the presence of extreme values that could impact the model's predictive performance. Understanding these distributions is crucial for assessing the risk of skewed data and its influence on model behavior, guiding data preprocessing and feature engineering efforts to mitigate potential adverse effects.

Figures

ValidMind Figure validmind.data_validation.IQROutliersBarPlot:1c6a

ValidMind Figure validmind.data_validation.IQROutliersBarPlot:f0b8

▶ Test Result: Descriptive Statistics (validmind.data_validation.DescriptiveStatistics)

Descriptive Statistics

The test operates by utilizing two primary functions from the pandas library: describe() for numerical data and value_counts() for categorical data. The describe() function calculates summary statistics including count, mean, standard deviation, minimum, and maximum values, as well as quartiles, which help in understanding the central tendency and dispersion of numerical variables. The value_counts() function, on the other hand, provides a count of unique values, the most common value, and its frequency for categorical variables. These metrics are crucial for identifying patterns such as skewness, outliers, and dominance of certain categories. The typical range for these metrics varies; for instance, the mean and standard deviation can take any real number, while frequency percentages range from 0% to 100%. A significant difference between mean and median may indicate skewness, while a high frequency of a single category could suggest a lack of diversity.

The primary advantages of this test include its ability to offer a detailed overview of the dataset, highlighting the distribution and characteristics of the variables. This versatility makes it applicable to both numerical and categorical data, providing a robust method for identifying anomalies such as outliers, extreme skewness, or lack of diversity. These insights are vital for understanding model behavior during testing and validation, as they can indicate potential areas of concern that may affect model performance. By offering a clear summary of the data, the test aids in the initial stages of data exploration and model development.

It should be noted that while this test provides a high-level overview of the data, it may not detect subtle correlations or complex patterns. It does not offer insights into the relationships between variables, which are crucial for understanding interactions within the dataset. Additionally, descriptive statistics alone cannot infer properties about future unseen data, making it necessary to use this test in conjunction with other statistical analyses to gain a comprehensive understanding of the model's data. High risk is indicated by skewed data or significant outliers, as well as a lack of diversity in categorical data, which can affect the model's predictive capabilities.

This test shows the results in two tables: one for numerical variables and another for categorical variables. The numerical table includes columns for count, mean, standard deviation, minimum, maximum, and various percentiles (25%, 50%, 75%, 90%, 95%). For example, the "CreditScore" variable has a mean of 646.50, a standard deviation of 98.88, and ranges from 350 to 850. The categorical table lists the count of observations, the number of unique values, the most common value, and its frequency percentage. For instance, the "Geography" variable has three unique values, with "France" being the most common at 46.78%. These tables provide a clear snapshot of the data's distribution, allowing for easy identification of key metrics and notable observations, such as the dominance of certain categories or the presence of outliers.

The test results reveal the following key insights:

Credit Score Distribution: The "CreditScore" variable shows a mean of 646.50 with a standard deviation of 98.88, indicating a relatively wide spread around the mean. The scores range from 350 to 850, with the median at 648, suggesting a slight right skew.
Balance Variability: The "Balance" variable has a mean of 81,049.48 and a high standard deviation of 61,372.21, indicating significant variability. The median balance is 102,038, with a minimum of 0 and a maximum of 250,898, highlighting the presence of accounts with no balance.
Geographical Dominance: In the categorical data, "France" is the most common geography, representing 46.78% of the dataset. This suggests a potential geographical bias that could influence model predictions.
Gender Proportion: The "Gender" variable is fairly balanced, with "Male" being the most common at 51.92%, indicating a slight male dominance in the dataset.

Based on these results, the dataset exhibits a diverse range of numerical values with some variables showing significant variability, such as "Balance". The categorical data reveals a potential geographical bias with a high concentration of observations from "France". The gender distribution is relatively balanced, which may help in reducing gender bias in model predictions. These insights suggest that while the dataset provides a broad representation of the population, certain variables may require further investigation to ensure they do not adversely affect the model's performance. Understanding these characteristics is crucial for developing robust models that can generalize well to new data.

Tables

Numerical Variables

Name	Count	Mean	Std	Min	25%	50%	75%	90%	95%	Max
CreditScore	3232.0	646.4957	98.8847	350.0	578.0	648.0	715.0	776.0	814.0	850.0
Tenure	3232.0	4.9737	2.8570	0.0	3.0	5.0	7.0	9.0	9.0	10.0
Balance	3232.0	81049.4830	61372.2075	0.0	0.0	102038.0	128745.0	150212.0	164566.0	250898.0
NumOfProducts	3232.0	1.5046	0.6699	1.0	1.0	1.0	2.0	2.0	3.0	4.0
HasCrCard	3232.0	0.7002	0.4582	0.0	0.0	1.0	1.0	1.0	1.0	1.0
IsActiveMember	3232.0	0.4607	0.4985	0.0	0.0	0.0	1.0	1.0	1.0	1.0
EstimatedSalary	3232.0	100001.4120	57674.2043	12.0	50526.0	100141.0	149898.0	178998.0	189084.0	199992.0

Categorical Variables

Name	Count	Number of Unique Values	Top Value	Top Value Frequency	Top Value Frequency %
Geography	3232.0	3.0	France	1512.0	46.78
Gender	3232.0	2.0	Male	1678.0	51.92

▶ Test Result: Pearson Correlation Matrix (validmind.data_validation.PearsonCorrelationMatrix)

Pearson Correlation Matrix

The test operates by generating a correlation matrix for all numerical variables using the Pearson correlation formula. This formula measures the linear relationship between two variables, providing a coefficient that ranges from -1 to 1. A value of 1 indicates a perfect positive correlation, -1 a perfect negative correlation, and 0 no correlation. The test visualizes these relationships in a heat map, where the color intensity represents the magnitude and direction of the correlation. High correlations, typically above 0.7 in absolute terms, are highlighted to indicate potential redundancy.

The primary advantages of this test include its ability to detect and quantify linear relationships between variables, which aids in identifying redundant variables. This can simplify models and potentially improve performance by reducing complexity. The heatmap visualization offers an intuitive overview of correlations, making it accessible even to those not comfortable with numerical matrices. This visual representation helps in quickly identifying areas of concern or interest within the dataset.

It should be noted that this test is limited to detecting linear relationships, potentially missing non-linear dependencies that could be important for dimensionality reduction. It measures only the degree of linear relationship, not the strength of one variable's effect on another. The threshold of 0.7 for high correlation is arbitrary and might exclude valid dependencies with lower coefficients. Additionally, a large number of highly correlated variables can indicate redundancy, posing a risk of overfitting.

This test shows a heat map representing the Pearson correlation coefficients between pairs of numerical variables in the dataset. The axes of the heat map list the variables, and each cell shows the correlation coefficient between the variables at that intersection. The color scale ranges from -1 to 1, with darker colors indicating stronger correlations. Notable observations include the absence of any high correlations above 0.7, suggesting minimal redundancy. The highest correlation observed is between "Balance" and "Exited" at 0.16, which is relatively low. The heat map provides a clear visual representation of the correlations, with no white cells indicating high correlation, suggesting that the dataset does not have significant linear dependencies that could lead to redundancy.

The test results reveal the following key insights:

Low Overall Correlation: The dataset shows generally low correlation coefficients, with most values close to zero, indicating weak linear relationships between variables.
No High Correlations: There are no correlation coefficients exceeding the 0.7 threshold, suggesting minimal redundancy and a low risk of overfitting due to linear dependencies.
Specific Variable Relationships: The highest observed correlation is between "Balance" and "Exited" at 0.16, which is not significant enough to warrant concern for redundancy.

Based on these results, the dataset exhibits low linear dependencies among its numerical variables, indicating minimal redundancy. This suggests that the variables are likely contributing unique information to the model, reducing the risk of overfitting due to linear correlations. The absence of high correlations implies that dimensionality reduction through variable removal may not be necessary based on linear relationships alone. This analysis provides a clear understanding of the dataset's structure, supporting informed decisions about model complexity and feature selection.

Figures

ValidMind Figure validmind.data_validation.PearsonCorrelationMatrix:7d08

▶ Test Result: High Pearson Correlation (validmind.data_validation.HighPearsonCorrelation)

✅ High Pearson Correlation

The test operates by calculating pairwise Pearson correlations for all features in the dataset. It measures the strength and direction of the linear relationship between two variables, with the correlation coefficient ranging from -1 to 1. A value of 1 indicates a perfect positive linear relationship, -1 indicates a perfect negative linear relationship, and 0 indicates no linear relationship. The test sorts these correlations, removing duplicates and self-correlations, and evaluates them against a pre-set threshold of 0.3. If the absolute value of a correlation exceeds this threshold, it suggests a significant linear relationship. The test outputs the top n strongest correlations, providing a Pass or Fail status based on the threshold.

The primary advantages of this test include its ability to quickly and effectively identify linear relationships between feature pairs, which is crucial for detecting multicollinearity and feature redundancy. This transparency allows developers and risk management teams to address potential issues early in the model development process. The test's straightforward output, which includes the correlated variable pairs, their correlation coefficients, and Pass or Fail status, makes it particularly useful for scenarios where model interpretability and performance are critical.

It should be noted that the test is limited to identifying linear relationships and does not account for nonlinear dependencies. It is sensitive to outliers, which can significantly affect the correlation coefficient, potentially leading to misleading results. Additionally, the test only identifies redundancy within feature pairs and may not detect more complex relationships involving three or more variables. High correlation coefficients exceeding the threshold indicate a risk of multicollinearity, which can lead to model overfitting and reduced interpretability.

This test shows a table format output that lists feature pairs, their Pearson correlation coefficients, and a Pass or Fail status based on the threshold of 0.3. Each row represents a pair of features, with the "Columns" column indicating the feature pair, the "Coefficient" column showing the calculated Pearson correlation coefficient, and the "Pass/Fail" column indicating whether the correlation exceeds the threshold. The coefficients range from -0.1831 to 0.0331, all of which are below the threshold, resulting in a Pass status for all pairs. Notable observations include the highest correlation of -0.1831 between "IsActiveMember" and "Exited" and the lowest correlation of 0.0264 between "Tenure" and "HasCrCard."

The test results reveal the following key insights:

Low Correlation Across Features: All feature pairs exhibit low correlation coefficients, with none exceeding the threshold of 0.3, indicating minimal linear relationships.
Negative Correlations Predominate: The majority of the correlations are negative, with the strongest being -0.1831 between "IsActiveMember" and "Exited."
Positive Correlations Are Weak: The positive correlations observed, such as 0.0331 between "Tenure" and "EstimatedSalary," are weak and well below the threshold.

Based on these results, the dataset does not exhibit significant linear relationships between feature pairs, suggesting a low risk of multicollinearity affecting the model's performance. The predominance of weak negative correlations indicates that the features are largely independent, which is beneficial for model interpretability and robustness. The absence of strong correlations suggests that feature redundancy is not a concern, allowing for a more straightforward interpretation of each feature's contribution to the model's predictions.

Parameters:

{
  "max_threshold": 0.3
}

Tables

Columns	Coefficient	Pass/Fail
(IsActiveMember, Exited)	-0.1831	Pass
(Balance, NumOfProducts)	-0.1819	Pass
(Balance, Exited)	0.1629	Pass
(Tenure, IsActiveMember)	-0.0547	Pass
(NumOfProducts, Exited)	-0.0485	Pass
(HasCrCard, IsActiveMember)	-0.0401	Pass
(Tenure, EstimatedSalary)	0.0331	Pass
(NumOfProducts, IsActiveMember)	0.0330	Pass
(CreditScore, EstimatedSalary)	-0.0291	Pass
(Tenure, HasCrCard)	0.0264	Pass

Run and log an individual test

Next, we'll use the previously initialized vm_balanced_raw_dataset (that still has a highly correlated Age column) as input to run an individual test, then log the result to the ValidMind Platform.

When running individual tests, you can use a custom result_id to tag the individual result with a unique identifier:

This result_id can be appended to test_id with a : separator.
The balanced_raw_dataset result identifier will correspond to the balanced_raw_dataset input, the dataset that still has the Age column.

result = vm.tests.run_test(
    test_id="validmind.data_validation.HighPearsonCorrelation:balanced_raw_dataset",
    params={"max_threshold": 0.3},
    inputs={"dataset": vm_balanced_raw_dataset},
)
result.log()

❌ High Pearson Correlation Balanced Raw Dataset

High Pearson Correlation: Balanced Raw Dataset is designed to identify highly correlated feature pairs in a dataset, which may suggest feature redundancy or multicollinearity. The primary purpose of this test is to measure the linear relationship between features, allowing developers and risk management teams to address potential impacts on a machine learning model's performance and interpretability.

The test operates by calculating pairwise Pearson correlations for all features in the dataset. It measures the strength and direction of the linear relationship between two variables, with the correlation coefficient ranging from -1 to 1. A value of 1 indicates a perfect positive linear relationship, -1 indicates a perfect negative linear relationship, and 0 indicates no linear relationship. The test sorts these correlations, removing duplicates and self-correlations, and assigns a Pass or Fail status based on whether the absolute value of the correlation coefficient exceeds a pre-set threshold, which is 0.3 by default. The test also returns the top n strongest correlations, providing a clear view of potential multicollinearity issues.

The primary advantages of this test include its ability to quickly and simply identify relationships between feature pairs, making it particularly useful for early detection of multicollinearity issues that could disrupt model training. The test generates transparent outputs, displaying pairs of correlated variables along with their Pearson correlation coefficients and Pass or Fail status. This transparency aids in understanding the relationships within the dataset and allows for informed decision-making regarding feature selection and model design.

It should be noted that the test is limited to identifying linear relationships and may not capture nonlinear dependencies. It is sensitive to outliers, which can significantly affect the correlation coefficient, potentially leading to misleading results. Additionally, the test only identifies redundancy within feature pairs and may not detect more complex relationships involving three or more variables. High correlation coefficients exceeding the threshold indicate a risk of multicollinearity, which can lead to model overfitting and reduced interpretability.

This test shows the results in a tabular format, where each row represents a pair of features with their corresponding Pearson correlation coefficient and Pass or Fail status. The table includes columns for the feature pairs, the calculated correlation coefficient, and whether the correlation exceeds the threshold of 0.3. The coefficients range from -0.1819 to 0.3287, with only one pair, (Age, Exited), failing the test due to a coefficient of 0.3287, indicating a moderate positive linear relationship. The other pairs have coefficients below the threshold, suggesting weaker linear relationships. The table provides a clear overview of the linear dependencies between features, highlighting potential areas of concern for model development.

The test results reveal the following key insights:

Moderate Correlation Between Age and Exited: The feature pair (Age, Exited) shows a correlation coefficient of 0.3287, which exceeds the threshold, indicating a moderate positive linear relationship and potential multicollinearity risk.
Weak Correlations Among Other Features: All other feature pairs have correlation coefficients below the threshold, with values ranging from -0.1819 to 0.0377, suggesting weak linear relationships and minimal risk of multicollinearity.
Negative Correlations Observed: Several feature pairs, such as (IsActiveMember, Exited) and (Balance, NumOfProducts), exhibit negative correlation coefficients, indicating inverse relationships, though these are not strong enough to pose significant concerns.

Based on these results, the dataset exhibits a generally low level of linear correlation among most feature pairs, with the exception of the (Age, Exited) pair, which shows a moderate positive correlation. This suggests that while multicollinearity is not a widespread issue in this dataset, the relationship between Age and Exited may require further investigation to ensure it does not adversely affect model performance. The weak correlations among other features indicate that the dataset is relatively free from redundancy, supporting the interpretability and robustness of the model. These insights provide a foundation for making informed decisions about feature selection and model design, ensuring that the model remains both effective and interpretable.

Parameters:

{
  "max_threshold": 0.3
}

Tables

Columns	Coefficient	Pass/Fail
(Age, Exited)	0.3287	Fail
(IsActiveMember, Exited)	-0.1831	Pass
(Balance, NumOfProducts)	-0.1819	Pass
(Balance, Exited)	0.1629	Pass
(Tenure, IsActiveMember)	-0.0547	Pass
(NumOfProducts, Exited)	-0.0485	Pass
(HasCrCard, IsActiveMember)	-0.0401	Pass
(Age, Balance)	0.0388	Pass
(Age, IsActiveMember)	0.0377	Pass
(Age, NumOfProducts)	-0.0333	Pass

2025-12-31 22:15:18,722 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.HighPearsonCorrelation:balanced_raw_dataset does not exist in model's document

Note the output returned indicating that a test-driven block doesn't currently exist in your model's documentation for this particular test ID.

That's expected, as when we run individual tests the results logged need to be manually added to your documentation within the ValidMind Platform.

Add individual test results to model documentation

With the test results logged, let's head to the model we connected to at the beginning of this notebook and insert our test results into the documentation (Need more help?):

From the Inventory in the ValidMind Platform, go to the model you connected to earlier.
In the left sidebar that appears for your model, click Documentation under Documents.
Locate the Data Preparation section and click on 2.3. Correlations and Interactions to expand that section.
Hover under the Pearson Correlation Matrix content block until a horizontal dashed line with a + button appears, indicating that you can insert a new block.
Click + and then select Test-Driven Block under FROM LIBRARY:
- Click on VM Library under TEST-DRIVEN in the left sidebar.
- In the search bar, type in HighPearsonCorrelation.
- Select HighPearsonCorrelation:balanced_raw_dataset as the test.
A preview of the test gets shown:
Finally, click Insert 1 Test Result to Document to add the test result to the documentation.

Confirm that the individual results for the high correlation test has been correctly inserted into section 2.3. Correlations and Interactions of the documentation.
Finalize the documentation by editing the test result's description block to explain the changes you made to the raw data and the reasons behind them as shown in the screenshot below:

Model testing

So far, we've focused on the data assessment and pre-processing that usually occurs prior to any models being built. Now, let's instead assume we have already built a model and we want to incorporate some model results into our documentation.

Train simple logistic regression model

Using ValidMind tests, we'll train a simple logistic regression model on our dataset and evaluate its performance by using the LogisticRegression class from the sklearn.linear_model.

To start, let's grab the first few rows from the balanced_raw_no_age_df dataset with the highly correlated features removed we initialized earlier:

balanced_raw_no_age_df.head()

	CreditScore	Geography	Gender	Tenure	Balance	NumOfProducts	HasCrCard	IsActiveMember	EstimatedSalary	Exited
6665	553	Germany	Female	7	128524.19	2	1	0	20682.46	0
624	654	France	Male	1	0.00	1	1	0	180345.44	0
4647	608	Spain	Female	9	102406.76	1	0	1	57600.66	0
7743	617	Germany	Female	10	167273.71	1	0	0	93439.75	1
771	722	France	Female	3	168197.66	1	1	0	140765.57	1

Before training the model, we need to encode the categorical features in the dataset:

Use the OneHotEncoder class from the sklearn.preprocessing module to encode the categorical features.
The categorical features in the dataset are Geography and Gender.

balanced_raw_no_age_df = pd.get_dummies(
    balanced_raw_no_age_df, columns=["Geography", "Gender"], drop_first=True
)
balanced_raw_no_age_df.head()

	CreditScore	Tenure	Balance	NumOfProducts	HasCrCard	IsActiveMember	EstimatedSalary	Exited	Geography_Germany	Geography_Spain	Gender_Male
6665	553	7	128524.19	2	1	0	20682.46	0	True	False	False
624	654	1	0.00	1	1	0	180345.44	0	False	False	True
4647	608	9	102406.76	1	0	1	57600.66	0	False	True	False
7743	617	10	167273.71	1	0	0	93439.75	1	True	False	False
771	722	3	168197.66	1	1	0	140765.57	1	False	False	False

We'll split our preprocessed dataset into training and testing, to help assess how well the model generalizes to unseen data:

We start by dividing our balanced_raw_no_age_df dataset into training and test subsets using train_test_split, with 80% of the data allocated to training (train_df) and 20% to testing (test_df).
From each subset, we separate the features (all columns except "Exited") into X_train and X_test, and the target column ("Exited") into y_train and y_test.

from sklearn.model_selection import train_test_split

train_df, test_df = train_test_split(balanced_raw_no_age_df, test_size=0.20)

X_train = train_df.drop("Exited", axis=1)
y_train = train_df["Exited"]
X_test = test_df.drop("Exited", axis=1)
y_test = test_df["Exited"]

Then using GridSearchCV, we'll find the best-performing hyperparameters or settings and save them:

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

log_reg_params = {
    "l1_ratio": [0.0, 1.0],  # 0 = L2, 1 = L1
    "C": [0.001, 0.01, 0.1, 1, 10, 100, 1000],
}

grid_log_reg = GridSearchCV(
    LogisticRegression(
        solver="saga",
        penalty="elasticnet",   # required when using l1_ratio
        max_iter=5000,
    ),
    log_reg_params,
)

grid_log_reg.fit(X_train, y_train)

log_reg = grid_log_reg.best_estimator_

/opt/hostedtoolcache/Python/3.11.14/x64/lib/python3.11/site-packages/sklearn/linear_model/_logistic.py:1135: FutureWarning:

'penalty' was deprecated in version 1.8 and will be removed in 1.10. To avoid this warning, leave 'penalty' set to its default value and use 'l1_ratio' or 'C' instead. Use l1_ratio=0 instead of penalty='l2', l1_ratio=1 instead of penalty='l1', and C=np.inf instead of penalty=None.

/opt/hostedtoolcache/Python/3.11.14/x64/lib/python3.11/site-packages/sklearn/linear_model/_logistic.py:1135: FutureWarning:

'penalty' was deprecated in version 1.8 and will be removed in 1.10. To avoid this warning, leave 'penalty' set to its default value and use 'l1_ratio' or 'C' instead. Use l1_ratio=0 instead of penalty='l2', l1_ratio=1 instead of penalty='l1', and C=np.inf instead of penalty=None.

/opt/hostedtoolcache/Python/3.11.14/x64/lib/python3.11/site-packages/sklearn/linear_model/_logistic.py:1135: FutureWarning:

'penalty' was deprecated in version 1.8 and will be removed in 1.10. To avoid this warning, leave 'penalty' set to its default value and use 'l1_ratio' or 'C' instead. Use l1_ratio=0 instead of penalty='l2', l1_ratio=1 instead of penalty='l1', and C=np.inf instead of penalty=None.

/opt/hostedtoolcache/Python/3.11.14/x64/lib/python3.11/site-packages/sklearn/linear_model/_logistic.py:1135: FutureWarning:

'penalty' was deprecated in version 1.8 and will be removed in 1.10. To avoid this warning, leave 'penalty' set to its default value and use 'l1_ratio' or 'C' instead. Use l1_ratio=0 instead of penalty='l2', l1_ratio=1 instead of penalty='l1', and C=np.inf instead of penalty=None.

/opt/hostedtoolcache/Python/3.11.14/x64/lib/python3.11/site-packages/sklearn/linear_model/_logistic.py:1135: FutureWarning:

'penalty' was deprecated in version 1.8 and will be removed in 1.10. To avoid this warning, leave 'penalty' set to its default value and use 'l1_ratio' or 'C' instead. Use l1_ratio=0 instead of penalty='l2', l1_ratio=1 instead of penalty='l1', and C=np.inf instead of penalty=None.

/opt/hostedtoolcache/Python/3.11.14/x64/lib/python3.11/site-packages/sklearn/linear_model/_logistic.py:1135: FutureWarning:

'penalty' was deprecated in version 1.8 and will be removed in 1.10. To avoid this warning, leave 'penalty' set to its default value and use 'l1_ratio' or 'C' instead. Use l1_ratio=0 instead of penalty='l2', l1_ratio=1 instead of penalty='l1', and C=np.inf instead of penalty=None.

/opt/hostedtoolcache/Python/3.11.14/x64/lib/python3.11/site-packages/sklearn/linear_model/_logistic.py:1135: FutureWarning:

'penalty' was deprecated in version 1.8 and will be removed in 1.10. To avoid this warning, leave 'penalty' set to its default value and use 'l1_ratio' or 'C' instead. Use l1_ratio=0 instead of penalty='l2', l1_ratio=1 instead of penalty='l1', and C=np.inf instead of penalty=None.

/opt/hostedtoolcache/Python/3.11.14/x64/lib/python3.11/site-packages/sklearn/linear_model/_logistic.py:1135: FutureWarning:

'penalty' was deprecated in version 1.8 and will be removed in 1.10. To avoid this warning, leave 'penalty' set to its default value and use 'l1_ratio' or 'C' instead. Use l1_ratio=0 instead of penalty='l2', l1_ratio=1 instead of penalty='l1', and C=np.inf instead of penalty=None.

/opt/hostedtoolcache/Python/3.11.14/x64/lib/python3.11/site-packages/sklearn/linear_model/_logistic.py:1135: FutureWarning:

'penalty' was deprecated in version 1.8 and will be removed in 1.10. To avoid this warning, leave 'penalty' set to its default value and use 'l1_ratio' or 'C' instead. Use l1_ratio=0 instead of penalty='l2', l1_ratio=1 instead of penalty='l1', and C=np.inf instead of penalty=None.

/opt/hostedtoolcache/Python/3.11.14/x64/lib/python3.11/site-packages/sklearn/linear_model/_logistic.py:1135: FutureWarning:

'penalty' was deprecated in version 1.8 and will be removed in 1.10. To avoid this warning, leave 'penalty' set to its default value and use 'l1_ratio' or 'C' instead. Use l1_ratio=0 instead of penalty='l2', l1_ratio=1 instead of penalty='l1', and C=np.inf instead of penalty=None.

/opt/hostedtoolcache/Python/3.11.14/x64/lib/python3.11/site-packages/sklearn/linear_model/_logistic.py:1135: FutureWarning:

'penalty' was deprecated in version 1.8 and will be removed in 1.10. To avoid this warning, leave 'penalty' set to its default value and use 'l1_ratio' or 'C' instead. Use l1_ratio=0 instead of penalty='l2', l1_ratio=1 instead of penalty='l1', and C=np.inf instead of penalty=None.

/opt/hostedtoolcache/Python/3.11.14/x64/lib/python3.11/site-packages/sklearn/linear_model/_logistic.py:1135: FutureWarning:

'penalty' was deprecated in version 1.8 and will be removed in 1.10. To avoid this warning, leave 'penalty' set to its default value and use 'l1_ratio' or 'C' instead. Use l1_ratio=0 instead of penalty='l2', l1_ratio=1 instead of penalty='l1', and C=np.inf instead of penalty=None.

/opt/hostedtoolcache/Python/3.11.14/x64/lib/python3.11/site-packages/sklearn/linear_model/_logistic.py:1135: FutureWarning:

'penalty' was deprecated in version 1.8 and will be removed in 1.10. To avoid this warning, leave 'penalty' set to its default value and use 'l1_ratio' or 'C' instead. Use l1_ratio=0 instead of penalty='l2', l1_ratio=1 instead of penalty='l1', and C=np.inf instead of penalty=None.

/opt/hostedtoolcache/Python/3.11.14/x64/lib/python3.11/site-packages/sklearn/linear_model/_logistic.py:1135: FutureWarning:

'penalty' was deprecated in version 1.8 and will be removed in 1.10. To avoid this warning, leave 'penalty' set to its default value and use 'l1_ratio' or 'C' instead. Use l1_ratio=0 instead of penalty='l2', l1_ratio=1 instead of penalty='l1', and C=np.inf instead of penalty=None.

/opt/hostedtoolcache/Python/3.11.14/x64/lib/python3.11/site-packages/sklearn/linear_model/_logistic.py:1135: FutureWarning:

'penalty' was deprecated in version 1.8 and will be removed in 1.10. To avoid this warning, leave 'penalty' set to its default value and use 'l1_ratio' or 'C' instead. Use l1_ratio=0 instead of penalty='l2', l1_ratio=1 instead of penalty='l1', and C=np.inf instead of penalty=None.

/opt/hostedtoolcache/Python/3.11.14/x64/lib/python3.11/site-packages/sklearn/linear_model/_logistic.py:1135: FutureWarning:

'penalty' was deprecated in version 1.8 and will be removed in 1.10. To avoid this warning, leave 'penalty' set to its default value and use 'l1_ratio' or 'C' instead. Use l1_ratio=0 instead of penalty='l2', l1_ratio=1 instead of penalty='l1', and C=np.inf instead of penalty=None.

/opt/hostedtoolcache/Python/3.11.14/x64/lib/python3.11/site-packages/sklearn/linear_model/_logistic.py:1135: FutureWarning:

'penalty' was deprecated in version 1.8 and will be removed in 1.10. To avoid this warning, leave 'penalty' set to its default value and use 'l1_ratio' or 'C' instead. Use l1_ratio=0 instead of penalty='l2', l1_ratio=1 instead of penalty='l1', and C=np.inf instead of penalty=None.

/opt/hostedtoolcache/Python/3.11.14/x64/lib/python3.11/site-packages/sklearn/linear_model/_logistic.py:1135: FutureWarning:

'penalty' was deprecated in version 1.8 and will be removed in 1.10. To avoid this warning, leave 'penalty' set to its default value and use 'l1_ratio' or 'C' instead. Use l1_ratio=0 instead of penalty='l2', l1_ratio=1 instead of penalty='l1', and C=np.inf instead of penalty=None.

/opt/hostedtoolcache/Python/3.11.14/x64/lib/python3.11/site-packages/sklearn/linear_model/_logistic.py:1135: FutureWarning:

'penalty' was deprecated in version 1.8 and will be removed in 1.10. To avoid this warning, leave 'penalty' set to its default value and use 'l1_ratio' or 'C' instead. Use l1_ratio=0 instead of penalty='l2', l1_ratio=1 instead of penalty='l1', and C=np.inf instead of penalty=None.

/opt/hostedtoolcache/Python/3.11.14/x64/lib/python3.11/site-packages/sklearn/linear_model/_logistic.py:1135: FutureWarning:

'penalty' was deprecated in version 1.8 and will be removed in 1.10. To avoid this warning, leave 'penalty' set to its default value and use 'l1_ratio' or 'C' instead. Use l1_ratio=0 instead of penalty='l2', l1_ratio=1 instead of penalty='l1', and C=np.inf instead of penalty=None.

/opt/hostedtoolcache/Python/3.11.14/x64/lib/python3.11/site-packages/sklearn/linear_model/_logistic.py:1135: FutureWarning:

'penalty' was deprecated in version 1.8 and will be removed in 1.10. To avoid this warning, leave 'penalty' set to its default value and use 'l1_ratio' or 'C' instead. Use l1_ratio=0 instead of penalty='l2', l1_ratio=1 instead of penalty='l1', and C=np.inf instead of penalty=None.

/opt/hostedtoolcache/Python/3.11.14/x64/lib/python3.11/site-packages/sklearn/linear_model/_logistic.py:1135: FutureWarning:

'penalty' was deprecated in version 1.8 and will be removed in 1.10. To avoid this warning, leave 'penalty' set to its default value and use 'l1_ratio' or 'C' instead. Use l1_ratio=0 instead of penalty='l2', l1_ratio=1 instead of penalty='l1', and C=np.inf instead of penalty=None.

/opt/hostedtoolcache/Python/3.11.14/x64/lib/python3.11/site-packages/sklearn/linear_model/_logistic.py:1135: FutureWarning:

'penalty' was deprecated in version 1.8 and will be removed in 1.10. To avoid this warning, leave 'penalty' set to its default value and use 'l1_ratio' or 'C' instead. Use l1_ratio=0 instead of penalty='l2', l1_ratio=1 instead of penalty='l1', and C=np.inf instead of penalty=None.

/opt/hostedtoolcache/Python/3.11.14/x64/lib/python3.11/site-packages/sklearn/linear_model/_logistic.py:1135: FutureWarning:

'penalty' was deprecated in version 1.8 and will be removed in 1.10. To avoid this warning, leave 'penalty' set to its default value and use 'l1_ratio' or 'C' instead. Use l1_ratio=0 instead of penalty='l2', l1_ratio=1 instead of penalty='l1', and C=np.inf instead of penalty=None.

/opt/hostedtoolcache/Python/3.11.14/x64/lib/python3.11/site-packages/sklearn/linear_model/_logistic.py:1135: FutureWarning:

'penalty' was deprecated in version 1.8 and will be removed in 1.10. To avoid this warning, leave 'penalty' set to its default value and use 'l1_ratio' or 'C' instead. Use l1_ratio=0 instead of penalty='l2', l1_ratio=1 instead of penalty='l1', and C=np.inf instead of penalty=None.

/opt/hostedtoolcache/Python/3.11.14/x64/lib/python3.11/site-packages/sklearn/linear_model/_logistic.py:1135: FutureWarning:

'penalty' was deprecated in version 1.8 and will be removed in 1.10. To avoid this warning, leave 'penalty' set to its default value and use 'l1_ratio' or 'C' instead. Use l1_ratio=0 instead of penalty='l2', l1_ratio=1 instead of penalty='l1', and C=np.inf instead of penalty=None.

/opt/hostedtoolcache/Python/3.11.14/x64/lib/python3.11/site-packages/sklearn/linear_model/_logistic.py:1135: FutureWarning:

'penalty' was deprecated in version 1.8 and will be removed in 1.10. To avoid this warning, leave 'penalty' set to its default value and use 'l1_ratio' or 'C' instead. Use l1_ratio=0 instead of penalty='l2', l1_ratio=1 instead of penalty='l1', and C=np.inf instead of penalty=None.

/opt/hostedtoolcache/Python/3.11.14/x64/lib/python3.11/site-packages/sklearn/linear_model/_logistic.py:1135: FutureWarning:

'penalty' was deprecated in version 1.8 and will be removed in 1.10. To avoid this warning, leave 'penalty' set to its default value and use 'l1_ratio' or 'C' instead. Use l1_ratio=0 instead of penalty='l2', l1_ratio=1 instead of penalty='l1', and C=np.inf instead of penalty=None.

/opt/hostedtoolcache/Python/3.11.14/x64/lib/python3.11/site-packages/sklearn/linear_model/_logistic.py:1135: FutureWarning:

'penalty' was deprecated in version 1.8 and will be removed in 1.10. To avoid this warning, leave 'penalty' set to its default value and use 'l1_ratio' or 'C' instead. Use l1_ratio=0 instead of penalty='l2', l1_ratio=1 instead of penalty='l1', and C=np.inf instead of penalty=None.

/opt/hostedtoolcache/Python/3.11.14/x64/lib/python3.11/site-packages/sklearn/linear_model/_logistic.py:1135: FutureWarning:

'penalty' was deprecated in version 1.8 and will be removed in 1.10. To avoid this warning, leave 'penalty' set to its default value and use 'l1_ratio' or 'C' instead. Use l1_ratio=0 instead of penalty='l2', l1_ratio=1 instead of penalty='l1', and C=np.inf instead of penalty=None.

/opt/hostedtoolcache/Python/3.11.14/x64/lib/python3.11/site-packages/sklearn/linear_model/_logistic.py:1135: FutureWarning:

'penalty' was deprecated in version 1.8 and will be removed in 1.10. To avoid this warning, leave 'penalty' set to its default value and use 'l1_ratio' or 'C' instead. Use l1_ratio=0 instead of penalty='l2', l1_ratio=1 instead of penalty='l1', and C=np.inf instead of penalty=None.

/opt/hostedtoolcache/Python/3.11.14/x64/lib/python3.11/site-packages/sklearn/linear_model/_logistic.py:1135: FutureWarning:

'penalty' was deprecated in version 1.8 and will be removed in 1.10. To avoid this warning, leave 'penalty' set to its default value and use 'l1_ratio' or 'C' instead. Use l1_ratio=0 instead of penalty='l2', l1_ratio=1 instead of penalty='l1', and C=np.inf instead of penalty=None.

/opt/hostedtoolcache/Python/3.11.14/x64/lib/python3.11/site-packages/sklearn/linear_model/_logistic.py:1135: FutureWarning:

'penalty' was deprecated in version 1.8 and will be removed in 1.10. To avoid this warning, leave 'penalty' set to its default value and use 'l1_ratio' or 'C' instead. Use l1_ratio=0 instead of penalty='l2', l1_ratio=1 instead of penalty='l1', and C=np.inf instead of penalty=None.

/opt/hostedtoolcache/Python/3.11.14/x64/lib/python3.11/site-packages/sklearn/linear_model/_logistic.py:1135: FutureWarning:

'penalty' was deprecated in version 1.8 and will be removed in 1.10. To avoid this warning, leave 'penalty' set to its default value and use 'l1_ratio' or 'C' instead. Use l1_ratio=0 instead of penalty='l2', l1_ratio=1 instead of penalty='l1', and C=np.inf instead of penalty=None.

/opt/hostedtoolcache/Python/3.11.14/x64/lib/python3.11/site-packages/sklearn/linear_model/_logistic.py:1135: FutureWarning:

'penalty' was deprecated in version 1.8 and will be removed in 1.10. To avoid this warning, leave 'penalty' set to its default value and use 'l1_ratio' or 'C' instead. Use l1_ratio=0 instead of penalty='l2', l1_ratio=1 instead of penalty='l1', and C=np.inf instead of penalty=None.

/opt/hostedtoolcache/Python/3.11.14/x64/lib/python3.11/site-packages/sklearn/linear_model/_logistic.py:1135: FutureWarning:

'penalty' was deprecated in version 1.8 and will be removed in 1.10. To avoid this warning, leave 'penalty' set to its default value and use 'l1_ratio' or 'C' instead. Use l1_ratio=0 instead of penalty='l2', l1_ratio=1 instead of penalty='l1', and C=np.inf instead of penalty=None.

/opt/hostedtoolcache/Python/3.11.14/x64/lib/python3.11/site-packages/sklearn/linear_model/_logistic.py:1135: FutureWarning:

'penalty' was deprecated in version 1.8 and will be removed in 1.10. To avoid this warning, leave 'penalty' set to its default value and use 'l1_ratio' or 'C' instead. Use l1_ratio=0 instead of penalty='l2', l1_ratio=1 instead of penalty='l1', and C=np.inf instead of penalty=None.

/opt/hostedtoolcache/Python/3.11.14/x64/lib/python3.11/site-packages/sklearn/linear_model/_logistic.py:1135: FutureWarning:

'penalty' was deprecated in version 1.8 and will be removed in 1.10. To avoid this warning, leave 'penalty' set to its default value and use 'l1_ratio' or 'C' instead. Use l1_ratio=0 instead of penalty='l2', l1_ratio=1 instead of penalty='l1', and C=np.inf instead of penalty=None.

/opt/hostedtoolcache/Python/3.11.14/x64/lib/python3.11/site-packages/sklearn/linear_model/_logistic.py:1135: FutureWarning:

'penalty' was deprecated in version 1.8 and will be removed in 1.10. To avoid this warning, leave 'penalty' set to its default value and use 'l1_ratio' or 'C' instead. Use l1_ratio=0 instead of penalty='l2', l1_ratio=1 instead of penalty='l1', and C=np.inf instead of penalty=None.

/opt/hostedtoolcache/Python/3.11.14/x64/lib/python3.11/site-packages/sklearn/linear_model/_logistic.py:1135: FutureWarning:

'penalty' was deprecated in version 1.8 and will be removed in 1.10. To avoid this warning, leave 'penalty' set to its default value and use 'l1_ratio' or 'C' instead. Use l1_ratio=0 instead of penalty='l2', l1_ratio=1 instead of penalty='l1', and C=np.inf instead of penalty=None.

/opt/hostedtoolcache/Python/3.11.14/x64/lib/python3.11/site-packages/sklearn/linear_model/_logistic.py:1135: FutureWarning:

'penalty' was deprecated in version 1.8 and will be removed in 1.10. To avoid this warning, leave 'penalty' set to its default value and use 'l1_ratio' or 'C' instead. Use l1_ratio=0 instead of penalty='l2', l1_ratio=1 instead of penalty='l1', and C=np.inf instead of penalty=None.

/opt/hostedtoolcache/Python/3.11.14/x64/lib/python3.11/site-packages/sklearn/linear_model/_logistic.py:1135: FutureWarning:

'penalty' was deprecated in version 1.8 and will be removed in 1.10. To avoid this warning, leave 'penalty' set to its default value and use 'l1_ratio' or 'C' instead. Use l1_ratio=0 instead of penalty='l2', l1_ratio=1 instead of penalty='l1', and C=np.inf instead of penalty=None.

/opt/hostedtoolcache/Python/3.11.14/x64/lib/python3.11/site-packages/sklearn/linear_model/_logistic.py:1135: FutureWarning:

'penalty' was deprecated in version 1.8 and will be removed in 1.10. To avoid this warning, leave 'penalty' set to its default value and use 'l1_ratio' or 'C' instead. Use l1_ratio=0 instead of penalty='l2', l1_ratio=1 instead of penalty='l1', and C=np.inf instead of penalty=None.

/opt/hostedtoolcache/Python/3.11.14/x64/lib/python3.11/site-packages/sklearn/linear_model/_logistic.py:1135: FutureWarning:

'penalty' was deprecated in version 1.8 and will be removed in 1.10. To avoid this warning, leave 'penalty' set to its default value and use 'l1_ratio' or 'C' instead. Use l1_ratio=0 instead of penalty='l2', l1_ratio=1 instead of penalty='l1', and C=np.inf instead of penalty=None.

/opt/hostedtoolcache/Python/3.11.14/x64/lib/python3.11/site-packages/sklearn/linear_model/_logistic.py:1135: FutureWarning:

'penalty' was deprecated in version 1.8 and will be removed in 1.10. To avoid this warning, leave 'penalty' set to its default value and use 'l1_ratio' or 'C' instead. Use l1_ratio=0 instead of penalty='l2', l1_ratio=1 instead of penalty='l1', and C=np.inf instead of penalty=None.

/opt/hostedtoolcache/Python/3.11.14/x64/lib/python3.11/site-packages/sklearn/linear_model/_logistic.py:1135: FutureWarning:

'penalty' was deprecated in version 1.8 and will be removed in 1.10. To avoid this warning, leave 'penalty' set to its default value and use 'l1_ratio' or 'C' instead. Use l1_ratio=0 instead of penalty='l2', l1_ratio=1 instead of penalty='l1', and C=np.inf instead of penalty=None.

/opt/hostedtoolcache/Python/3.11.14/x64/lib/python3.11/site-packages/sklearn/linear_model/_logistic.py:1135: FutureWarning:

'penalty' was deprecated in version 1.8 and will be removed in 1.10. To avoid this warning, leave 'penalty' set to its default value and use 'l1_ratio' or 'C' instead. Use l1_ratio=0 instead of penalty='l2', l1_ratio=1 instead of penalty='l1', and C=np.inf instead of penalty=None.

/opt/hostedtoolcache/Python/3.11.14/x64/lib/python3.11/site-packages/sklearn/linear_model/_logistic.py:1135: FutureWarning:

'penalty' was deprecated in version 1.8 and will be removed in 1.10. To avoid this warning, leave 'penalty' set to its default value and use 'l1_ratio' or 'C' instead. Use l1_ratio=0 instead of penalty='l2', l1_ratio=1 instead of penalty='l1', and C=np.inf instead of penalty=None.

/opt/hostedtoolcache/Python/3.11.14/x64/lib/python3.11/site-packages/sklearn/linear_model/_logistic.py:1135: FutureWarning:

'penalty' was deprecated in version 1.8 and will be removed in 1.10. To avoid this warning, leave 'penalty' set to its default value and use 'l1_ratio' or 'C' instead. Use l1_ratio=0 instead of penalty='l2', l1_ratio=1 instead of penalty='l1', and C=np.inf instead of penalty=None.

/opt/hostedtoolcache/Python/3.11.14/x64/lib/python3.11/site-packages/sklearn/linear_model/_logistic.py:1135: FutureWarning:

'penalty' was deprecated in version 1.8 and will be removed in 1.10. To avoid this warning, leave 'penalty' set to its default value and use 'l1_ratio' or 'C' instead. Use l1_ratio=0 instead of penalty='l2', l1_ratio=1 instead of penalty='l1', and C=np.inf instead of penalty=None.

/opt/hostedtoolcache/Python/3.11.14/x64/lib/python3.11/site-packages/sklearn/linear_model/_logistic.py:1135: FutureWarning:

'penalty' was deprecated in version 1.8 and will be removed in 1.10. To avoid this warning, leave 'penalty' set to its default value and use 'l1_ratio' or 'C' instead. Use l1_ratio=0 instead of penalty='l2', l1_ratio=1 instead of penalty='l1', and C=np.inf instead of penalty=None.

/opt/hostedtoolcache/Python/3.11.14/x64/lib/python3.11/site-packages/sklearn/linear_model/_logistic.py:1135: FutureWarning:

'penalty' was deprecated in version 1.8 and will be removed in 1.10. To avoid this warning, leave 'penalty' set to its default value and use 'l1_ratio' or 'C' instead. Use l1_ratio=0 instead of penalty='l2', l1_ratio=1 instead of penalty='l1', and C=np.inf instead of penalty=None.

/opt/hostedtoolcache/Python/3.11.14/x64/lib/python3.11/site-packages/sklearn/linear_model/_logistic.py:1135: FutureWarning:

'penalty' was deprecated in version 1.8 and will be removed in 1.10. To avoid this warning, leave 'penalty' set to its default value and use 'l1_ratio' or 'C' instead. Use l1_ratio=0 instead of penalty='l2', l1_ratio=1 instead of penalty='l1', and C=np.inf instead of penalty=None.

/opt/hostedtoolcache/Python/3.11.14/x64/lib/python3.11/site-packages/sklearn/linear_model/_logistic.py:1135: FutureWarning:

'penalty' was deprecated in version 1.8 and will be removed in 1.10. To avoid this warning, leave 'penalty' set to its default value and use 'l1_ratio' or 'C' instead. Use l1_ratio=0 instead of penalty='l2', l1_ratio=1 instead of penalty='l1', and C=np.inf instead of penalty=None.

/opt/hostedtoolcache/Python/3.11.14/x64/lib/python3.11/site-packages/sklearn/linear_model/_logistic.py:1135: FutureWarning:

'penalty' was deprecated in version 1.8 and will be removed in 1.10. To avoid this warning, leave 'penalty' set to its default value and use 'l1_ratio' or 'C' instead. Use l1_ratio=0 instead of penalty='l2', l1_ratio=1 instead of penalty='l1', and C=np.inf instead of penalty=None.

/opt/hostedtoolcache/Python/3.11.14/x64/lib/python3.11/site-packages/sklearn/linear_model/_logistic.py:1135: FutureWarning:

'penalty' was deprecated in version 1.8 and will be removed in 1.10. To avoid this warning, leave 'penalty' set to its default value and use 'l1_ratio' or 'C' instead. Use l1_ratio=0 instead of penalty='l2', l1_ratio=1 instead of penalty='l1', and C=np.inf instead of penalty=None.

/opt/hostedtoolcache/Python/3.11.14/x64/lib/python3.11/site-packages/sklearn/linear_model/_logistic.py:1135: FutureWarning:

'penalty' was deprecated in version 1.8 and will be removed in 1.10. To avoid this warning, leave 'penalty' set to its default value and use 'l1_ratio' or 'C' instead. Use l1_ratio=0 instead of penalty='l2', l1_ratio=1 instead of penalty='l1', and C=np.inf instead of penalty=None.

/opt/hostedtoolcache/Python/3.11.14/x64/lib/python3.11/site-packages/sklearn/linear_model/_logistic.py:1135: FutureWarning:

'penalty' was deprecated in version 1.8 and will be removed in 1.10. To avoid this warning, leave 'penalty' set to its default value and use 'l1_ratio' or 'C' instead. Use l1_ratio=0 instead of penalty='l2', l1_ratio=1 instead of penalty='l1', and C=np.inf instead of penalty=None.

/opt/hostedtoolcache/Python/3.11.14/x64/lib/python3.11/site-packages/sklearn/linear_model/_logistic.py:1135: FutureWarning:

'penalty' was deprecated in version 1.8 and will be removed in 1.10. To avoid this warning, leave 'penalty' set to its default value and use 'l1_ratio' or 'C' instead. Use l1_ratio=0 instead of penalty='l2', l1_ratio=1 instead of penalty='l1', and C=np.inf instead of penalty=None.

/opt/hostedtoolcache/Python/3.11.14/x64/lib/python3.11/site-packages/sklearn/linear_model/_logistic.py:1135: FutureWarning:

'penalty' was deprecated in version 1.8 and will be removed in 1.10. To avoid this warning, leave 'penalty' set to its default value and use 'l1_ratio' or 'C' instead. Use l1_ratio=0 instead of penalty='l2', l1_ratio=1 instead of penalty='l1', and C=np.inf instead of penalty=None.

/opt/hostedtoolcache/Python/3.11.14/x64/lib/python3.11/site-packages/sklearn/linear_model/_logistic.py:1135: FutureWarning:

'penalty' was deprecated in version 1.8 and will be removed in 1.10. To avoid this warning, leave 'penalty' set to its default value and use 'l1_ratio' or 'C' instead. Use l1_ratio=0 instead of penalty='l2', l1_ratio=1 instead of penalty='l1', and C=np.inf instead of penalty=None.

/opt/hostedtoolcache/Python/3.11.14/x64/lib/python3.11/site-packages/sklearn/linear_model/_logistic.py:1135: FutureWarning:

'penalty' was deprecated in version 1.8 and will be removed in 1.10. To avoid this warning, leave 'penalty' set to its default value and use 'l1_ratio' or 'C' instead. Use l1_ratio=0 instead of penalty='l2', l1_ratio=1 instead of penalty='l1', and C=np.inf instead of penalty=None.

/opt/hostedtoolcache/Python/3.11.14/x64/lib/python3.11/site-packages/sklearn/linear_model/_logistic.py:1135: FutureWarning:

'penalty' was deprecated in version 1.8 and will be removed in 1.10. To avoid this warning, leave 'penalty' set to its default value and use 'l1_ratio' or 'C' instead. Use l1_ratio=0 instead of penalty='l2', l1_ratio=1 instead of penalty='l1', and C=np.inf instead of penalty=None.

/opt/hostedtoolcache/Python/3.11.14/x64/lib/python3.11/site-packages/sklearn/linear_model/_logistic.py:1135: FutureWarning:

'penalty' was deprecated in version 1.8 and will be removed in 1.10. To avoid this warning, leave 'penalty' set to its default value and use 'l1_ratio' or 'C' instead. Use l1_ratio=0 instead of penalty='l2', l1_ratio=1 instead of penalty='l1', and C=np.inf instead of penalty=None.

/opt/hostedtoolcache/Python/3.11.14/x64/lib/python3.11/site-packages/sklearn/linear_model/_logistic.py:1135: FutureWarning:

'penalty' was deprecated in version 1.8 and will be removed in 1.10. To avoid this warning, leave 'penalty' set to its default value and use 'l1_ratio' or 'C' instead. Use l1_ratio=0 instead of penalty='l2', l1_ratio=1 instead of penalty='l1', and C=np.inf instead of penalty=None.

/opt/hostedtoolcache/Python/3.11.14/x64/lib/python3.11/site-packages/sklearn/linear_model/_logistic.py:1135: FutureWarning:

'penalty' was deprecated in version 1.8 and will be removed in 1.10. To avoid this warning, leave 'penalty' set to its default value and use 'l1_ratio' or 'C' instead. Use l1_ratio=0 instead of penalty='l2', l1_ratio=1 instead of penalty='l1', and C=np.inf instead of penalty=None.

/opt/hostedtoolcache/Python/3.11.14/x64/lib/python3.11/site-packages/sklearn/linear_model/_logistic.py:1135: FutureWarning:

'penalty' was deprecated in version 1.8 and will be removed in 1.10. To avoid this warning, leave 'penalty' set to its default value and use 'l1_ratio' or 'C' instead. Use l1_ratio=0 instead of penalty='l2', l1_ratio=1 instead of penalty='l1', and C=np.inf instead of penalty=None.

/opt/hostedtoolcache/Python/3.11.14/x64/lib/python3.11/site-packages/sklearn/linear_model/_logistic.py:1135: FutureWarning:

'penalty' was deprecated in version 1.8 and will be removed in 1.10. To avoid this warning, leave 'penalty' set to its default value and use 'l1_ratio' or 'C' instead. Use l1_ratio=0 instead of penalty='l2', l1_ratio=1 instead of penalty='l1', and C=np.inf instead of penalty=None.

/opt/hostedtoolcache/Python/3.11.14/x64/lib/python3.11/site-packages/sklearn/linear_model/_logistic.py:1135: FutureWarning:

'penalty' was deprecated in version 1.8 and will be removed in 1.10. To avoid this warning, leave 'penalty' set to its default value and use 'l1_ratio' or 'C' instead. Use l1_ratio=0 instead of penalty='l2', l1_ratio=1 instead of penalty='l1', and C=np.inf instead of penalty=None.

/opt/hostedtoolcache/Python/3.11.14/x64/lib/python3.11/site-packages/sklearn/linear_model/_logistic.py:1135: FutureWarning:

'penalty' was deprecated in version 1.8 and will be removed in 1.10. To avoid this warning, leave 'penalty' set to its default value and use 'l1_ratio' or 'C' instead. Use l1_ratio=0 instead of penalty='l2', l1_ratio=1 instead of penalty='l1', and C=np.inf instead of penalty=None.

/opt/hostedtoolcache/Python/3.11.14/x64/lib/python3.11/site-packages/sklearn/linear_model/_logistic.py:1135: FutureWarning:

'penalty' was deprecated in version 1.8 and will be removed in 1.10. To avoid this warning, leave 'penalty' set to its default value and use 'l1_ratio' or 'C' instead. Use l1_ratio=0 instead of penalty='l2', l1_ratio=1 instead of penalty='l1', and C=np.inf instead of penalty=None.

Initialize model evaluation objects

The last step for evaluating the model's performance is to initialize the ValidMind Dataset and Model objects in preparation for assigning model predictions to each dataset.

# Initialize the datasets into their own dataset objects
vm_train_ds = vm.init_dataset(
    input_id="train_dataset_final",
    dataset=train_df,
    target_column="Exited",
)

vm_test_ds = vm.init_dataset(
    input_id="test_dataset_final",
    dataset=test_df,
    target_column="Exited",
)

You'll also need to initialize a ValidMind model object (vm_model) that can be passed to other functions for analysis and tests on the data for each of our three models.

You simply initialize this model object with vm.init_model():

# Register the model
vm_model = vm.init_model(log_reg, input_id="log_reg_model_v1")

Assign predictions

Once the model has been registered you can assign model predictions to the training and testing datasets.

The assign_predictions() method from the Dataset object can link existing predictions to any number of models.
This method links the model's class prediction values and probabilities to our vm_train_ds and vm_test_ds datasets.

If no prediction values are passed, the method will compute predictions automatically:

vm_train_ds.assign_predictions(model=vm_model)
vm_test_ds.assign_predictions(model=vm_model)

2025-12-31 22:16:02,987 - INFO(validmind.vm_models.dataset.utils): Running predict_proba()... This may take a while
2025-12-31 22:16:02,989 - INFO(validmind.vm_models.dataset.utils): Done running predict_proba()
2025-12-31 22:16:02,990 - INFO(validmind.vm_models.dataset.utils): Running predict()... This may take a while
2025-12-31 22:16:02,992 - INFO(validmind.vm_models.dataset.utils): Done running predict()
2025-12-31 22:16:02,994 - INFO(validmind.vm_models.dataset.utils): Running predict_proba()... This may take a while
2025-12-31 22:16:02,995 - INFO(validmind.vm_models.dataset.utils): Done running predict_proba()
2025-12-31 22:16:02,995 - INFO(validmind.vm_models.dataset.utils): Running predict()... This may take a while
2025-12-31 22:16:02,997 - INFO(validmind.vm_models.dataset.utils): Done running predict()

Run the model evaluation tests

In this next example, we'll focus on running the tests within the Model Development section of the model documentation. Only tests associated with this section will be executed, and the corresponding results will be updated in the model documentation.

Note the additional config that is passed to run_documentation_tests() — this allows you to override inputs or params in certain tests.
In our case, we want to explicitly use the vm_train_ds for the validmind.model_validation.sklearn.ClassifierPerformance:in_sample test, since it's supposed to run on the training dataset and not the test dataset.

test_config = {
    "validmind.model_validation.sklearn.ClassifierPerformance:in_sample": {
        "inputs": {
            "dataset": vm_train_ds,
            "model": vm_model,
        },
    }
}
results = vm.run_documentation_tests(
    section=["model_development"],
    inputs={
        "dataset": vm_test_ds,  # Any test that requires a single dataset will use vm_test_ds
        "model": vm_model,
        "datasets": (
            vm_train_ds,
            vm_test_ds,
        ),  # Any test that requires multiple datasets will use vm_train_ds and vm_test_ds
    },
    config=test_config,
)

Test suite complete!

34/34 (100.0%)

Test Suite Results: Binary Classification V2

Check out the updated documentation on ValidMind.

Template for binary classification models.

▶ Model Development

Model Development

▶ Test Result: Model Metadata (validmind.model_validation.ModelMetadata)

Model Metadata

Model Metadata is designed to compare the metadata of different models, focusing on their architecture, framework, framework version, and programming language. This test aims to provide a comprehensive overview of the models' technical specifications, facilitating easier comparison and identification of potential integration or deployment challenges.

The test operates by utilizing the get_model_info function to extract metadata from each model. This metadata includes key attributes such as the modeling technique, framework, framework version, and programming language. The function then standardizes these attributes by renaming columns according to a predefined set of labels, ensuring consistency across different models. The standardized metadata is compiled into a summary table, which serves as a clear and concise representation of the models' technical details. This process allows for the identification of discrepancies or variations in the metadata, which could indicate potential issues in model documentation or management. The test is particularly useful for assessing compatibility and consistency across models, as significant differences in framework versions or programming languages might pose challenges in model integration and deployment.

The primary advantages of this test include its ability to provide a clear comparison of essential model metadata, which is crucial for understanding the technical landscape of the models being evaluated. By standardizing metadata labels, the test ensures that the information is easily interpretable and comparable, which is particularly beneficial in environments where multiple models are used. This standardization helps in identifying potential compatibility or consistency issues across models, which can be critical for successful model deployment and integration. Additionally, the test's focus on high-level metadata rather than detailed parameter information allows for a broad overview of the models' technical specifications, making it a valuable tool for initial assessments and comparisons.

It should be noted that the test has certain limitations and potential risks. It assumes that the get_model_info function returns all necessary metadata fields, which may not always be the case. The test relies heavily on the correctness and completeness of the metadata provided by each model, meaning that any inaccuracies or omissions in the metadata could lead to misleading conclusions. Furthermore, the test does not include detailed parameter information, focusing instead on high-level metadata. This focus may limit the test's ability to provide insights into the models' performance characteristics or specific technical nuances. Additionally, inconsistent or missing metadata across models can indicate potential issues in model documentation or management, which could pose risks if not addressed.

This test shows a summary table output format, which presents the metadata of the models in a structured manner. The table includes columns for the modeling technique, modeling framework, framework version, and programming language. Each row represents a different model, allowing for easy comparison across these key attributes. The table format is straightforward to read, with each column clearly labeled to indicate the type of metadata being presented. Key measurements include the specific framework version and programming language used, which are critical for assessing compatibility and integration potential. Notable observations from the table may include any discrepancies in framework versions or programming languages, which could highlight potential challenges in model deployment or integration. The range and scale of the values are determined by the specific metadata attributes, with framework versions typically following a numerical format and programming languages being represented as text.

The test results reveal the following key insights:

Consistent Use of Python: The metadata indicates that all models utilize Python as the programming language, suggesting a uniform development environment that may facilitate easier integration and maintenance.
Framework Version Uniformity: The models are built using the same framework version, 1.8.0, which implies consistency in the underlying technology stack and reduces the risk of compatibility issues.
Standardized Modeling Technique: The use of the SKlearnModel technique across models suggests a standardized approach to model development, which can enhance comparability and streamline the evaluation process.

Based on these results, the metadata comparison highlights a consistent use of Python and a uniform framework version across the models, suggesting a stable and compatible development environment. The standardized modeling technique further supports this consistency, indicating a cohesive approach to model development. These insights suggest that the models are likely to integrate well within the existing infrastructure, minimizing potential deployment challenges. The uniformity in framework version and programming language also reduces the risk of compatibility issues, enhancing the models' reliability and maintainability. Overall, the metadata comparison provides a clear and concise overview of the models' technical specifications, facilitating informed decision-making regarding their deployment and integration.

Tables

Modeling Technique	Modeling Framework	Framework Version	Programming Language
SKlearnModel	sklearn	1.8.0	Python

▶ Test Result: Dataset Split (validmind.data_validation.DatasetSplit)

Dataset Split

Dataset Split is designed to evaluate and visualize the distribution of data among training, testing, and validation datasets within a machine learning model. The primary purpose of this test is to ensure that the datasets are split appropriately, as an imbalanced split can affect the model's ability to learn effectively and generalize to new, unseen data.

The test operates by first calculating the total size of all available datasets in the model. It then determines the size of each individual dataset and calculates its proportion relative to the total size. This involves summing the sizes of the training, testing, and validation datasets to obtain a total dataset size. Each dataset's size is then divided by this total to compute its proportion, which is expressed as a percentage. The results are summarized in a table that displays the dataset names, their sizes, and their proportions of the total dataset size. This approach provides a clear quantitative measure of how the data is distributed across different datasets, with proportions typically ranging from 0% to 100%. A balanced distribution is generally preferred, with a larger proportion allocated to the training dataset to ensure sufficient learning, while maintaining adequate sizes for testing and validation to assess model performance and generalization.

The primary advantages of this test include its ability to provide a clear and understandable visualization of dataset split proportions, which can quickly highlight any potential imbalances. This is particularly useful for a wide range of task types, including classification, regression, and text-related tasks. The test is versatile and not tied to any specific data type, making it applicable to tabular data, time series data, or text data. By offering a straightforward representation of dataset distribution, it aids in identifying whether the data allocation might lead to issues such as overfitting or underfitting, thereby supporting better model development and evaluation.

It should be noted that the Dataset Split test has certain limitations. It does not provide insights into the quality or diversity of the data within each split, focusing solely on size and proportion. Additionally, the test does not offer recommendations or adjustments for imbalanced datasets, leaving the interpretation and action to the user. Furthermore, the test may lack compatibility with more complex modes of data splitting, such as stratified or time-based splits, which could limit its applicability in scenarios where such methods are necessary. High-risk signs include a very small training dataset, which may hinder the model's learning capability, or a very large training dataset with a small test dataset, which could lead to overfitting and poor generalization.

This test shows the results in a tabular format, presenting the dataset names, their sizes, and their proportions of the total dataset size. The table includes columns for the dataset name, the absolute size of each dataset, and the proportion of the total dataset size that each represents. The "Size" column is measured in the number of data points, while the "Proportion" column is expressed as a percentage of the total dataset size. The table provides a clear and concise view of how the data is distributed across the training and testing datasets. In this specific test result, the training dataset, labeled as "train_dataset_final," consists of 2,585 data points, making up 79.98% of the total dataset. The testing dataset, labeled as "test_dataset_final," contains 647 data points, accounting for 20.02% of the total. The total dataset size is 3,232 data points, with the proportions summing to 100%. This distribution indicates a substantial allocation to the training dataset, which is generally favorable for model learning, while the testing dataset is adequately sized to evaluate model performance.

The test results reveal the following key insights:

Training Dataset Dominance: The training dataset comprises 79.98% of the total data, indicating a strong emphasis on model learning. This substantial allocation is beneficial for training robust models but requires careful monitoring to avoid overfitting.
Adequate Testing Proportion: The testing dataset accounts for 20.02% of the total data, providing a sufficient sample size to evaluate the model's generalization capabilities. This balance helps ensure that the model's performance is assessed accurately on unseen data.

Based on these results, the dataset split demonstrates a well-balanced distribution between training and testing datasets, with a significant portion allocated to training, which is advantageous for model development. The testing dataset's size is adequate for evaluating the model's performance, suggesting that the current split is likely to support effective learning and generalization. This distribution aligns with best practices in machine learning, where a larger training dataset is preferred to enhance model accuracy, while maintaining a sufficient testing dataset to validate the model's predictive capabilities. The insights from this test provide a clear understanding of the dataset allocation, supporting informed decisions in model training and evaluation processes.

Tables

Dataset	Size	Proportion
train_dataset_final	2585	79.98%
test_dataset_final	647	20.02%
Total	3232	100%

▶ Test Result: Population Stability Index (validmind.model_validation.sklearn.PopulationStabilityIndex)

Population Stability Index

Population Stability Index is designed to assess the stability of a machine learning model's predictions by comparing the output distributions across different datasets. This test is crucial for identifying any significant shifts in model performance over time or changes in the characteristics of the population for which the model is making predictions.

The test operates by calculating the Population Stability Index (PSI) for each feature between two datasets, typically a training and a test dataset. The data from both datasets is sorted into bins or quantiles, with boundaries determined by the training data distribution. The proportion of data in each bin is calculated for both datasets. The PSI is then derived for each bin through a logarithmic transformation of the ratio of these proportions. The PSI values, along with the data proportions for each bin, are presented in a summary table, a grouped bar chart, and a scatter plot. PSI values typically range from 0 to 1, where values closer to 0 indicate stability, and higher values suggest potential shifts in data distribution.

The primary advantages of this test include its ability to provide a quantitative measure of model stability over time or across different samples. This makes it particularly useful for evaluating changes in model performance. The PSI allows for direct comparisons across different features based on their PSI values, facilitating the identification of specific areas where the model may be experiencing shifts. The straightforward calculation and interpretation of PSI, along with visual aids such as tables and charts, enhance its utility in model risk management by simplifying the comprehension of complex data shifts.

It should be noted that the PSI test does not account for interdependencies between features, which may result in similar shifts in distributions and PSI values. Additionally, the test does not inherently provide insights into the reasons behind distribution differences or changes in PSI values. It may not handle features with significant outliers effectively. The PSI test focuses on model predictions rather than underlying data distributions, which can lead to misinterpretations. Changes in PSI could be due to model drift, concept drift, or both, but distinguishing between these causes is challenging.

This test shows the results in both tabular and graphical formats. The table presents the bin number, initial and new dataset counts, their respective percentages, and the PSI for each bin. The plot visually represents the population ratio for each bin, comparing the initial and new datasets, with the PSI values overlaid. The table and plot together provide a comprehensive view of how the data is distributed across bins and the stability of these distributions. The PSI values are generally low, indicating stability, with the total PSI being 0.0217, suggesting minimal shifts between the datasets.

The test results reveal the following key insights:

Overall Stability: The total PSI value of 0.0217 indicates a stable distribution between the initial and new datasets, suggesting minimal shifts.
Bin-Specific Variations: Bin 5 shows the highest PSI value of 0.0094, indicating a noticeable shift in this segment, while other bins remain relatively stable.
Proportional Differences: The largest proportional difference is observed in Bin 5, where the new dataset has a higher percentage (26.12%) compared to the initial dataset (21.39%).
Consistent Bins: Bins 7 and 9 show no PSI value, indicating consistent distributions between the datasets in these segments.

Based on these results, the model demonstrates overall stability with minimal shifts in data distribution across the datasets. The PSI values suggest that the model's predictions remain consistent over time, with only minor variations in specific bins. The insights highlight the importance of monitoring Bin 5, where a noticeable shift is observed, to ensure continued model reliability. The consistent distributions in other bins reinforce the model's robustness, indicating that it is well-calibrated to handle the data it encounters. These observations provide a clear understanding of the model's behavior and its ability to maintain performance across different datasets.

Tables

Population Stability Index for train_dataset_final and test_dataset_final Datasets

Bin	Count Initial	Percent Initial (%)	Count New	Percent New (%)	PSI
0	257	9.9420	59	9.1190	0.0007
1	540	20.8897	121	18.7017	0.0024
2	82	3.1721	28	4.3277	0.0036
3	129	4.9903	26	4.0185	0.0021
4	411	15.8994	106	16.3833	0.0001
5	553	21.3926	169	26.1206	0.0094
6	431	16.6731	95	14.6832	0.0025
7	147	5.6867	36	5.5641	0.0000
8	31	1.1992	6	0.9274	0.0007
9	4	0.1547	1	0.1546	0.0000
Total	2585	100.0000	647	100.0000	0.0217

Figures

ValidMind Figure validmind.model_validation.sklearn.PopulationStabilityIndex:9012

▶ Test Result: Confusion Matrix (validmind.model_validation.sklearn.ConfusionMatrix)

Confusion Matrix

Confusion Matrix is designed to evaluate and visually represent the classification ML model's predictive performance using a Confusion Matrix heatmap. The primary purpose is to assess how well the model can correctly classify True Positives, True Negatives, False Positives, and False Negatives, which are fundamental aspects of model accuracy.

The test operates by comparing the predicted results from the classification model against the actual values. A confusion matrix is constructed using the unique labels from the actual values, employing scikit-learn's metrics. This matrix is then visually rendered using Plotly's create_annotated_heatmap function, providing a two-dimensional graphical representation of the model's performance. The matrix highlights the distribution of True Positives, True Negatives, False Positives, and False Negatives. The values in the matrix indicate the number of instances for each category, with higher values for True Positives and True Negatives generally indicating better model performance. Conversely, higher values for False Positives and False Negatives suggest areas where the model may be misclassifying data.

The primary advantages of this test include its ability to provide a simplified yet comprehensive visual snapshot of the classification model's predictive performance. It distinctly highlights True Positives, True Negatives, False Positives, and False Negatives, making it easier to identify potential areas for improvement. The matrix is particularly useful for multi-class classification problems, offering a straightforward view of complex model performances. Additionally, it aids in understanding the different types of errors the model could make, providing insights into Type-I and Type-II errors, which are crucial for refining model accuracy and reliability.

It should be noted that the test has limitations, particularly in cases of unbalanced classes, where it might misinterpret the accuracy of a model that predominantly predicts the majority class. The confusion matrix does not provide a single unified statistic to evaluate overall model performance, requiring separate calculations for metrics like precision, recall, and F1-score. It mainly serves as a descriptive tool without the capability for statistical hypothesis testing. There is also a risk of misinterpretation, as the matrix does not directly provide precision, recall, or F1-score data, necessitating additional computations for a complete performance evaluation.

This test shows a heatmap representation of the confusion matrix, which is divided into four quadrants representing True Positives, True Negatives, False Positives, and False Negatives. The x-axis represents the predicted labels, while the y-axis represents the true labels. Each cell in the matrix contains the count of instances for that category. The heatmap uses color intensity to indicate the magnitude of each value, with darker shades representing higher counts. Key measurements include 230 True Positives, 150 True Negatives, 183 False Positives, and 84 False Negatives. The matrix provides a clear visual indication of the model's performance, with notable observations including a relatively high number of False Positives, suggesting potential areas for model refinement.

The test results reveal the following key insights:

High True Positives: The model correctly identifies 230 instances as positive, indicating strong performance in this area.
Significant False Positives: There are 183 instances incorrectly classified as positive, highlighting a potential area for improvement.
Moderate True Negatives: The model correctly classifies 150 instances as negative, showing reasonable performance.
Lower False Negatives: With 84 instances misclassified as negative, the model shows some room for improvement in identifying positive cases.

Based on these results, the model demonstrates a strong ability to identify positive instances, as evidenced by the high number of True Positives. However, the significant number of False Positives suggests that the model may be overly aggressive in classifying instances as positive, which could lead to misclassification. The moderate number of True Negatives indicates a reasonable performance in identifying negative instances, but the presence of False Negatives suggests that some positive cases are being overlooked. These insights collectively suggest that while the model performs well in certain areas, there is room for refinement, particularly in reducing False Positives and False Negatives to enhance overall accuracy and reliability.

Figures

ValidMind Figure validmind.model_validation.sklearn.ConfusionMatrix:d752

▶ Test Result: Classifier Performance In Sample (validmind.model_validation.sklearn.ClassifierPerformance:in_sample)

Classifier Performance In Sample

Classifier Performance: In Sample is designed to evaluate the performance of classification models by calculating key metrics such as precision, recall, F1-Score, accuracy, and ROC AUC scores. This test provides a comprehensive analysis of a model's ability to correctly classify instances, offering insights into both binary and multiclass classification scenarios.

The test operates by utilizing scikit-learn's classification_report to compute precision, recall, F1-Score, and accuracy. Precision measures the proportion of true positive results in all positive predictions, indicating the model's ability to avoid false positives. Recall, or sensitivity, assesses the proportion of true positives identified out of all actual positives, reflecting the model's capacity to capture all relevant instances. The F1-Score, a harmonic mean of precision and recall, balances these two metrics, providing a single measure of a model's accuracy. Accuracy represents the overall correctness of the model, calculated as the ratio of correctly predicted instances to the total instances. The ROC AUC score evaluates the model's ability to distinguish between classes, with values ranging from 0 to 1, where 1 indicates perfect discrimination and 0.5 suggests no discrimination.

The primary advantages of this test include its versatility in handling both binary and multiclass models, making it applicable across various classification tasks. By employing a range of performance metrics, the test offers a detailed view of model performance, highlighting strengths and weaknesses in different areas. The inclusion of ROC AUC is particularly beneficial for evaluating models on unbalanced datasets, as it provides a robust measure of a model's discriminatory power. This comprehensive approach ensures that the test can effectively assess models in diverse scenarios, providing valuable insights into their operational capabilities.

It should be noted that the test has limitations, such as assuming correctly identified labels for binary classification models, which may not always be the case in real-world applications. Additionally, the test is specifically designed for classification models and is not applicable to regression models, limiting its scope. The test's effectiveness can also be compromised if the test dataset does not adequately represent real-world scenarios, potentially leading to misleading performance evaluations. Furthermore, signs of high risk, such as low precision, recall, F1-Score, accuracy, and ROC AUC scores, indicate poor model performance and require careful interpretation to avoid erroneous conclusions.

This test shows the results in two tables: one for precision, recall, and F1-Score, and another for accuracy and ROC AUC. The first table presents these metrics for each class, as well as macro and weighted averages, allowing for a detailed examination of the model's performance across different categories. Precision, recall, and F1-Score values are provided for Class 0 and Class 1, with Class 0 showing a precision of 0.5885, recall of 0.4614, and F1-Score of 0.5173, while Class 1 has a precision of 0.5624, recall of 0.682, and F1-Score of 0.6165. The weighted and macro averages offer a summary of the model's overall performance. The second table displays the accuracy and ROC AUC scores, with accuracy at 0.5725 and ROC AUC at 0.5873, indicating moderate performance. These tables provide a comprehensive view of the model's classification capabilities, highlighting areas of strength and potential improvement.

The test results reveal the following key insights:

Class 0 Performance: Class 0 exhibits a precision of 0.5885, recall of 0.4614, and F1-Score of 0.5173, indicating a moderate ability to correctly identify positive instances but with a notable number of false negatives.
Class 1 Performance: Class 1 shows a precision of 0.5624, recall of 0.682, and F1-Score of 0.6165, suggesting a better balance between precision and recall compared to Class 0, with a higher recall indicating effective identification of positive instances.
Overall Model Performance: The weighted average precision, recall, and F1-Score are 0.5753, 0.5725, and 0.5672, respectively, reflecting the model's general performance across all classes.
Accuracy and ROC AUC: The model's accuracy is 0.5725, and the ROC AUC is 0.5873, both indicating moderate performance, with the ROC AUC suggesting limited discriminatory power between classes.

Based on these results, the model demonstrates moderate classification performance, with Class 1 showing relatively better precision and recall compared to Class 0. The overall metrics, including weighted averages and accuracy, suggest that the model performs adequately but may struggle with certain aspects of classification, particularly in distinguishing between classes as indicated by the ROC AUC score. These insights highlight the model's current capabilities and areas where further refinement may be necessary to enhance its performance in real-world applications.

Tables

Precision, Recall, and F1

Class	Precision	Recall	F1
0	0.5885	0.4614	0.5173
1	0.5624	0.6820	0.6165
Weighted Average	0.5753	0.5725	0.5672
Macro Average	0.5754	0.5717	0.5669

Accuracy and ROC AUC

Metric	Value
Accuracy	0.5725
ROC AUC	0.5873

▶ Test Result: Classifier Performance Out Of Sample (validmind.model_validation.sklearn.ClassifierPerformance:out_of_sample)

Classifier Performance Out Of Sample

Classifier Performance: Out of Sample is designed to evaluate the performance of classification models by calculating key metrics such as precision, recall, F1-Score, accuracy, and ROC AUC. This test provides a comprehensive analysis of a model's ability to correctly classify instances in both binary and multiclass scenarios, offering insights into its predictive power and reliability.

The test operates by utilizing scikit-learn's classification_report to compute precision, recall, F1-Score, and accuracy. Precision measures the proportion of true positive predictions among all positive predictions, indicating the model's ability to avoid false positives. Recall, or sensitivity, assesses the proportion of true positive predictions among all actual positives, reflecting the model's capacity to capture all relevant instances. The F1-Score, a harmonic mean of precision and recall, balances these two metrics, providing a single measure of a model's accuracy. Accuracy represents the overall correctness of the model's predictions, calculated as the ratio of correctly predicted instances to the total instances. The ROC AUC score evaluates the model's ability to distinguish between classes, with values ranging from 0 to 1, where 1 indicates perfect discrimination and 0.5 suggests no discrimination. These metrics collectively offer a detailed view of the model's performance across different dimensions.

The primary advantages of this test include its versatility in handling both binary and multiclass classification models, making it applicable across various domains and datasets. By employing a range of performance metrics, the test provides a holistic view of a model's strengths and weaknesses, allowing for a nuanced understanding of its predictive capabilities. The inclusion of ROC AUC is particularly beneficial for datasets with class imbalance, as it evaluates the model's ability to differentiate between classes regardless of their distribution. This comprehensive approach ensures that the test can effectively assess models in diverse scenarios, providing valuable insights into their operational effectiveness.

It should be noted that the test has certain limitations, including its reliance on correctly labeled data for accurate evaluation. It is specifically tailored for classification models and does not extend to regression models, limiting its applicability in contexts where continuous outcomes are of interest. Additionally, the test's insights may be constrained if the test dataset does not adequately represent real-world conditions, potentially leading to over- or underestimation of the model's performance. Signs of high risk, such as low precision, recall, F1-Score, accuracy, and ROC AUC, indicate poor model performance and may necessitate further investigation or model refinement.

This test shows the results in tabular format, with two tables presenting the key metrics. The first table, titled "Precision, Recall, and F1," displays these metrics for each class, including a weighted and macro average. Precision, recall, and F1-Score are provided for Class 0 and Class 1, with Class 0 showing a precision of 0.641, recall of 0.4505, and F1-Score of 0.5291, while Class 1 has a precision of 0.5569, recall of 0.7325, and F1-Score of 0.6327. The weighted and macro averages offer a summary across classes, with precision, recall, and F1-Score values around 0.58 to 0.60. The second table, "Accuracy and ROC AUC," presents an accuracy of 0.5873 and a ROC AUC of 0.5923, indicating moderate performance. These tables provide a clear view of the model's classification capabilities, with the metrics highlighting areas of strength and potential improvement.

The test results reveal the following key insights:

Class 0 Performance: Class 0 exhibits a precision of 0.641, indicating a moderate ability to correctly identify positive instances, but a lower recall of 0.4505 suggests difficulty in capturing all relevant instances, resulting in an F1-Score of 0.5291.
Class 1 Performance: Class 1 shows a precision of 0.5569 and a higher recall of 0.7325, leading to an F1-Score of 0.6327, reflecting better performance in identifying true positives compared to Class 0.
Overall Model Performance: The weighted average precision, recall, and F1-Score are approximately 0.60, indicating balanced performance across classes, with the macro average providing similar values, suggesting consistent model behavior.
Accuracy and Discrimination: The model's accuracy is 0.5873, and the ROC AUC is 0.5923, both of which are slightly above random chance, indicating moderate discriminatory power and overall performance.

Based on these results, the model demonstrates moderate classification capabilities, with Class 1 showing better recall and F1-Score compared to Class 0. The overall performance, as indicated by the weighted and macro averages, suggests balanced but not exceptional predictive power. The accuracy and ROC AUC scores, slightly above 0.5, reflect the model's ability to distinguish between classes, though there is room for improvement. These insights highlight the model's strengths in certain areas while also pointing to potential areas for enhancement, particularly in improving recall for Class 0 and overall accuracy.

Tables

Precision, Recall, and F1

Class	Precision	Recall	F1
0	0.6410	0.4505	0.5291
1	0.5569	0.7325	0.6327
Weighted Average	0.6002	0.5873	0.5794
Macro Average	0.5990	0.5915	0.5809

Accuracy and ROC AUC

Metric	Value
Accuracy	0.5873
ROC AUC	0.5923

▶ Test Result: Precision Recall Curve (validmind.model_validation.sklearn.PrecisionRecallCurve)

Precision Recall Curve

Precision Recall Curve is designed to evaluate the trade-off between precision and recall in binary classification models. Its primary purpose is to assess the model's ability to produce accurate results while capturing a majority of positive instances, providing insights into the model's performance in terms of precision and recall.

The test operates by extracting ground truth labels and prediction probabilities from the model's test dataset. It utilizes the precision_recall_curve method from the sklearn metrics module, which computes precision-recall pairs for each possible threshold. This involves calculating precision, the ratio of true positive predictions to the total positive predictions, and recall, the ratio of true positive predictions to the total actual positives. The resulting precision and recall scores are plotted against each other to form the Precision-Recall Curve. This curve is visually represented using Plotly's scatter plot, allowing for an intuitive understanding of the model's performance across different threshold levels. The values range from 0 to 1, where higher values indicate better performance.

The primary advantages of this test include its ability to effectively represent the balance between precision and recall, which is crucial in scenarios where both metrics are significant. The visual representation of the Precision-Recall Curve provides an intuitive understanding of how the model performs at various threshold levels, making it easier to identify the optimal balance between minimizing false positives and false negatives. This is particularly useful in applications where the cost of errors is high, such as medical diagnosis or fraud detection, where both precision and recall are critical.

It should be noted that this test is limited to binary classification models and is not applicable to multiclass classification models or Foundation models. Additionally, it may not fully represent the overall accuracy of the model if the costs of false positives and false negatives are significantly different, or if the dataset is heavily imbalanced. A lower area under the Precision-Recall Curve indicates high risk, as it suggests a high number of false positives and/or false negatives. If the curve is closer to the bottom left of the plot, it signifies a higher risk, indicating poor model performance.

This test shows a Precision-Recall Curve plotted with precision on the y-axis and recall on the x-axis. The curve provides a visual representation of the trade-off between precision and recall at various threshold levels. The plot ranges from 0 to 1 on both axes, with precision values starting high and gradually decreasing as recall increases. The curve's shape and position provide insights into the model's performance, with a curve closer to the top right indicating better performance. In this plot, the precision starts high but fluctuates significantly before stabilizing around 0.5 as recall increases. This suggests variability in the model's precision across different thresholds, with a tendency towards lower precision at higher recall levels.

The test results reveal the following key insights:

Initial High Precision: The curve starts with high precision values, indicating that the model initially makes accurate positive predictions.
Fluctuating Precision: There is significant fluctuation in precision as recall increases, suggesting variability in the model's performance across different thresholds.
Stabilization at Lower Precision: Precision stabilizes around 0.5 as recall approaches 1, indicating a trade-off where increasing recall results in lower precision.

Based on these results, the model demonstrates an initial capability to make accurate positive predictions, but this precision is not maintained as recall increases. The fluctuation in precision suggests that the model's performance is inconsistent across different thresholds, with a tendency towards lower precision at higher recall levels. This behavior indicates a trade-off between precision and recall, where achieving higher recall comes at the cost of reduced precision. The insights from the Precision-Recall Curve highlight the need to carefully consider the balance between precision and recall, especially in applications where both metrics are critical.

Figures

ValidMind Figure validmind.model_validation.sklearn.PrecisionRecallCurve:8674

▶ Test Result: ROC Curve (validmind.model_validation.sklearn.ROCCurve)

ROC Curve

ROC Curve is designed to evaluate the performance of binary classification models by plotting the Receiver Operating Characteristic (ROC) curve and calculating the Area Under Curve (AUC) score. This test aims to measure the model's ability to distinguish between two classes, providing insights into its discriminative power across various threshold levels.

The test operates by selecting the target model and datasets for binary classification, calculating predicted probabilities for the test set, and using these along with true outcomes to generate the ROC curve. The ROC curve plots the True Positive Rate (TPR) against the False Positive Rate (FPR) at different threshold levels, illustrating the trade-off between sensitivity and specificity. The AUC score, ranging from 0 to 1, quantifies the overall performance, with values closer to 1 indicating better discrimination. A score of 0.5 suggests no discriminative ability, akin to random guessing. The test also includes a line representing randomness (AUC of 0.5) for comparison. Infinite values in thresholds are removed to ensure accuracy, and the results are saved for future reference.

The primary advantages of this test include its ability to provide a comprehensive visual representation of a model's performance across all possible classification thresholds, unlike metrics that evaluate performance at a single threshold. The AUC score remains consistent regardless of class distribution, making it a robust choice for imbalanced datasets. This consistency allows for a reliable assessment of the model's discriminative power, offering a single metric that encapsulates the entire ROC curve's information.

It should be noted that this test is limited to binary classification tasks, restricting its applicability to other model types. Models producing probabilities skewed towards 0 or 1 may not perform well under this test. Additionally, the ROC curve can sometimes indicate high performance even when most classifications are incorrect, as long as the model maintains a correct ranking order. This is known as the "Class Imbalance Problem," where the ROC curve may not fully reflect the model's practical utility in imbalanced scenarios.

This test shows a plot of the ROC curve for the model log_reg_model_v1 on the test_dataset_final, with the True Positive Rate on the y-axis and the False Positive Rate on the x-axis. The curve is compared against a dashed line representing random performance (AUC = 0.5). The AUC score of 0.59 is displayed, indicating the model's ability to distinguish between classes. The ROC curve's proximity to the line of randomness suggests limited discriminative power. The plot provides a visual assessment of the model's performance across different thresholds, with the AUC score serving as a summary metric.

The test results reveal the following key insights:

Limited Discriminative Power: The AUC score of 0.59 indicates that the model's ability to distinguish between the positive and negative classes is only slightly better than random guessing.
Proximity to Randomness: The ROC curve's closeness to the line of randomness suggests that the model struggles to achieve a high true positive rate without also increasing the false positive rate significantly.

Based on these results, the model log_reg_model_v1 demonstrates limited effectiveness in distinguishing between classes within the test_dataset_final. The AUC score of 0.59, while above random performance, highlights the need for potential improvements in model training or feature selection to enhance discriminative power. The ROC curve's proximity to the line of randomness underscores the challenge in achieving a balance between sensitivity and specificity, suggesting that further refinement may be necessary to improve classification accuracy.

Figures

ValidMind Figure validmind.model_validation.sklearn.ROCCurve:5e05

▶ Test Result: Training Test Degradation (validmind.model_validation.sklearn.TrainingTestDegradation)

✅ Training Test Degradation

Training Test Degradation is designed to assess the extent of performance degradation between a model's training and test datasets, ensuring that the model's ability to generalize to unseen data remains within acceptable limits. The primary purpose of this test is to evaluate the model's robustness and reliability by comparing key classification metrics such as precision, recall, and f1 score across both datasets. By setting a predefined threshold for acceptable degradation, the test helps in identifying potential overfitting or underfitting issues that could affect the model's real-world performance.

The test operates by calculating several classification metrics, including precision, recall, and f1 score, for both the training and test datasets. Precision measures the accuracy of positive predictions, recall assesses the ability to identify all relevant instances, and f1 score provides a balance between precision and recall. These metrics are derived from the model's predictions, where precision is the ratio of true positive predictions to the total predicted positives, recall is the ratio of true positives to the total actual positives, and f1 score is the harmonic mean of precision and recall. The degradation is computed as the percentage difference between the training and test scores relative to the training score. A degradation percentage exceeding the threshold of 10% indicates a potential issue, while values within this range suggest acceptable generalization.

The primary advantages of this test include its ability to provide a quantitative measure of the model's generalization capabilities, which is crucial for predicting its performance on unseen data. By evaluating multiple metrics, the test offers a comprehensive view of the model's performance across different dimensions, allowing for a more nuanced assessment. The flexibility to adjust the degradation threshold makes it adaptable to various scenarios, enabling users to tailor the test to specific requirements and risk tolerances. This adaptability is particularly beneficial in dynamic environments where model performance needs to be closely monitored and adjusted.

It should be noted that the test has certain limitations, including its reliance on the raw performance comparison between training and test datasets without considering the underlying data distribution. This can lead to misleading results if the datasets are not well-balanced or representative of the real-world scenario. Additionally, the test is designed specifically for classification tasks, limiting its applicability to other types of models. High degradation percentages, especially those exceeding the 10% threshold, are signs of potential overfitting or underfitting, indicating that the model may not perform well on new data.

This test shows a detailed table format presenting the results for each class and metric, including precision, recall, and f1 score. Each row of the table corresponds to a specific class and metric, displaying the scores for both the training and test datasets, the calculated degradation percentage, and the pass/fail status based on the predefined threshold. The precision for Class 0 shows a negative degradation of -8.931%, indicating an improvement in the test dataset, while recall and f1 score have positive degradations of 2.377% and -2.2898%, respectively, all passing the threshold. For Class 1, precision and recall show degradations of 0.9745% and -7.398%, respectively, with f1 score at -2.6417%, all within acceptable limits. The table provides a clear and concise overview of the model's performance across different metrics, highlighting areas where the model maintains or improves its performance on the test dataset.

The test results reveal the following key insights:

Improved Precision for Class 0: The precision for Class 0 shows a negative degradation of -8.931%, indicating that the model performs better on the test dataset compared to the training dataset.
Stable Recall for Class 0: The recall for Class 0 has a slight positive degradation of 2.377%, suggesting that the model's ability to identify relevant instances remains consistent across datasets.
Consistent F1 Score for Class 0: The f1 score for Class 0 shows a minor negative degradation of -2.2898%, reflecting a balanced performance between precision and recall.
Slight Degradation in Precision for Class 1: The precision for Class 1 has a small positive degradation of 0.9745%, indicating a slight decrease in accuracy for positive predictions on the test dataset.
Improved Recall for Class 1: The recall for Class 1 shows a negative degradation of -7.398%, demonstrating an enhanced ability to identify relevant instances in the test dataset.
Consistent F1 Score for Class 1: The f1 score for Class 1 exhibits a negative degradation of -2.6417%, maintaining a balanced performance between precision and recall.

Based on these results, the model demonstrates a strong ability to generalize from the training to the test dataset, with all metrics for both classes passing the degradation threshold. The negative degradation percentages for certain metrics, such as precision for Class 0 and recall for Class 1, indicate improved performance on the test dataset, suggesting that the model is not overfitting to the training data. The consistent f1 scores across both classes further reinforce the model's balanced performance, highlighting its robustness and reliability in handling unseen data. These insights collectively suggest that the model is well-suited for deployment in real-world scenarios, with a low risk of performance degradation when exposed to new data.

Tables

Class	Metric	train_dataset_final Score	test_dataset_final Score	Degradation (%)	Pass/Fail
0	Precision	0.5885	0.6410	-8.9310	Pass
0	Recall	0.4614	0.4505	2.3770	Pass
0	F1-Score	0.5173	0.5291	-2.2898	Pass
1	Precision	0.5624	0.5569	0.9745	Pass
1	Recall	0.6820	0.7325	-7.3980	Pass
1	F1-Score	0.6165	0.6327	-2.6417	Pass

▶ Test Result: Minimum Accuracy (validmind.model_validation.sklearn.MinimumAccuracy)

❌ Minimum Accuracy

Minimum Accuracy is designed to ensure that a model's prediction accuracy meets or exceeds a specified threshold, which is crucial for validating the model's performance in making correct predictions. This test is particularly important in both binary and multiclass classification tasks, where accurate predictions are essential for reliable model outputs.

The test operates by calculating the model's accuracy score using the accuracy_score method from sklearn, which compares the true labels (y_true) with the predicted labels (class_pred). The accuracy score is the ratio of correct predictions to the total number of predictions, providing a straightforward measure of the model's performance. This score is then compared to a predefined threshold, typically set at 0.7. If the model's accuracy score meets or exceeds this threshold, the test is marked as passed. The accuracy score ranges from 0 to 1, where a score closer to 1 indicates better performance. A score below the threshold suggests that the model may not be reliable for making accurate predictions.

The primary advantages of this test include its simplicity and effectiveness in providing a clear measure of overall model performance. It is particularly useful in scenarios where the dataset is balanced, as it offers a straightforward assessment of the model's ability to correctly classify instances across all classes. This test is versatile, applicable to both binary and multiclass classification tasks, making it a valuable tool for initial model evaluation. Its straightforward nature allows for quick identification of models that meet basic performance criteria, facilitating efficient model selection and deployment.

It should be noted that this test has limitations, particularly in datasets with imbalanced classes, where accuracy can be misleading. In such cases, the model may appear to perform well by predominantly predicting the majority class, thus inflating the accuracy score without truly reflecting the model's ability to handle minority classes. Additionally, this test does not account for other important metrics such as precision, recall, or the model's handling of false positives and false negatives. These limitations highlight the importance of using additional metrics to gain a comprehensive understanding of the model's performance, especially in imbalanced datasets.

This test shows a table format presenting the model's accuracy score, the threshold used, and the pass/fail status. The table includes a single row with columns labeled "Score," "Threshold," and "Pass/Fail." The "Score" column indicates the model's accuracy score, which is 0.5873 in this case. The "Threshold" column shows the predefined accuracy threshold of 0.7. The "Pass/Fail" column indicates the test result, which is "Fail" due to the score being below the threshold. This table provides a clear and concise summary of the test outcome, highlighting the model's inability to meet the required accuracy level.

The test results reveal the following key insights:

Model Fails to Meet Accuracy Threshold: The model's accuracy score of 0.5873 falls below the required threshold of 0.7, resulting in a fail status. This indicates that the model's predictions are not sufficiently accurate to meet the predefined performance criteria.

Based on these results, the model demonstrates a significant shortfall in meeting the minimum accuracy requirement, as evidenced by the accuracy score of 0.5873, which is below the threshold of 0.7. This suggests that the model may not be reliable for making accurate predictions, particularly in scenarios where high accuracy is critical. The failure to meet the threshold highlights the need for further investigation into the model's performance, potentially exploring alternative models or additional metrics to ensure robust and reliable predictions.

Tables

Score	Threshold	Pass/Fail
0.5873	0.7	Fail

▶ Test Result: Minimum F1 Score (validmind.model_validation.sklearn.MinimumF1Score)

✅ Minimum F1 Score

Minimum F1 Score is designed to ensure that the model's F1 score on the validation set meets a predefined minimum threshold, thereby confirming balanced performance between precision and recall. This test is particularly crucial in classification tasks where the distribution of positive and negative classes is skewed, as it provides a more comprehensive measure of a model's effectiveness than accuracy alone.

The test operates by calculating the F1 score using scikit-learn's metrics in Python. For binary classification problems, the F1 score is computed directly, while for multi-class problems, macro averaging is employed. The F1 score is a harmonic mean of precision and recall, which are derived from the true positive, false positive, and false negative rates. Precision measures the accuracy of positive predictions, while recall assesses the ability to identify all positive instances. The F1 score ranges from 0 to 1, where 1 indicates perfect precision and recall, and 0 indicates the worst performance. A score above the predefined threshold is considered satisfactory, indicating a balanced trade-off between precision and recall.

The primary advantages of this test include its ability to provide a balanced measure of a model's performance by accounting for both false positives and false negatives. This is particularly useful in scenarios with imbalanced class distributions, where accuracy can be misleading. The flexibility in setting the threshold value allows practitioners to tailor the minimum acceptable performance standards to specific requirements, ensuring that the model meets the desired level of effectiveness in identifying positive cases while minimizing false positives.

It should be noted that the F1 score assumes an equal cost for false positives and false negatives, which may not be true in some real-world scenarios. Additionally, the test may not be suitable for all types of models and machine learning tasks, as it focuses solely on the balance between precision and recall. Practitioners might need to rely on other metrics such as precision, recall, or the ROC-AUC score that align more closely with specific requirements. A low F1 score, particularly one below the established threshold, is a sign of high risk, indicating that the model may not be effectively balancing precision and recall.

This test shows the results in a tabular format, where each row represents a specific evaluation of the model's F1 score against the predefined threshold. The table includes columns for the actual F1 score, the threshold value, and a pass/fail indicator. The F1 score is presented as a decimal value, typically ranging from 0 to 1, with higher values indicating better performance. The threshold column specifies the minimum acceptable F1 score, and the pass/fail column indicates whether the model's performance meets or exceeds this threshold. In this case, the table shows an F1 score of 0.6327, a threshold of 0.5, and a pass result, indicating that the model's performance is satisfactory according to the predefined criteria.

The test results reveal the following key insights:

Model Meets Performance Threshold: The model achieves an F1 score of 0.6327, which surpasses the predefined threshold of 0.5, indicating balanced performance between precision and recall.
Satisfactory Precision-Recall Balance: The pass result suggests that the model effectively balances precision and recall, minimizing false positives while accurately identifying positive instances.

Based on these results, the model demonstrates a satisfactory level of performance by achieving an F1 score that exceeds the predefined threshold. This indicates that the model maintains a balanced trade-off between precision and recall, effectively identifying positive instances while minimizing false positives. The results suggest that the model is well-suited for classification tasks with imbalanced class distributions, where a balanced measure of performance is crucial. The pass result confirms that the model meets the desired performance standards, providing confidence in its ability to perform effectively in real-world scenarios.

Tables

Score	Threshold	Pass/Fail
0.6327	0.5	Pass

▶ Test Result: Minimum ROCAUC Score (validmind.model_validation.sklearn.MinimumROCAUCScore)

✅ Minimum ROCAUC Score

Minimum ROC AUC Score is designed to validate the model's performance by ensuring that the Receiver Operating Characteristic Area Under the Curve (ROC AUC) score on the validation dataset meets or exceeds a predefined threshold. This test is crucial for assessing the model's ability to distinguish between different classes, which is a fundamental requirement in both binary and multiclass classification tasks.

The test operates by calculating the multiclass ROC AUC score using the true target values and the model's predictions. It employs the LabelBinarizer to convert multi-class target variables into a binary format, which is necessary for computing the ROC AUC score. The ROC AUC score is a measure of the model's ability to correctly classify positive instances while minimizing false positives. It ranges from 0 to 1, where a score of 0.5 indicates no discriminative power, akin to random guessing, and a score of 1 indicates perfect classification. The test passes if the ROC AUC score exceeds the predefined threshold, which is typically set at 0.5. This threshold ensures that the model performs better than random chance. The results, including the ROC AUC score, the threshold, and the pass/fail status, are encapsulated in a ThresholdTestResult object.

The primary advantages of this test include its comprehensive assessment of the model's performance by considering both the true positive rate and false positive rate. The ROC AUC score is threshold-independent, meaning it evaluates the model's quality across various classification thresholds, making it a robust measure for different scenarios. Additionally, the test is versatile, applicable to both binary and multiclass classification problems, which enhances its utility in diverse modeling contexts. This flexibility allows practitioners to apply the test across a wide range of datasets and classification challenges, ensuring consistent evaluation criteria.

It should be noted that the test has limitations, particularly in scenarios with highly imbalanced class distributions. In such cases, the ROC AUC score might still appear favorable even if the model fails to predict the minority class effectively. Furthermore, the test does not provide insights into specific aspects of the model that may be causing poor performance if the ROC AUC score is unsatisfactory. The use of a macro average for multiclass ROC AUC scoring implies equal weightage to each class, which might not be appropriate if the classes are imbalanced, potentially skewing the interpretation of the model's performance.

This test shows the results in a tabular format, presenting key metrics such as the ROC AUC score, the threshold, and the pass/fail status. The table is straightforward to read, with each row representing a single test instance. The "Score" column indicates the calculated ROC AUC score, which in this case is 0.5923. The "Threshold" column shows the predefined minimum score required for the test to pass, set at 0.5. The "Pass/Fail" column indicates whether the test was successful, with a "Pass" status confirming that the model's performance exceeds the threshold. The ROC AUC score of 0.5923 suggests that the model has a moderate ability to distinguish between classes, performing better than random chance but leaving room for improvement.

The test results reveal the following key insights:

Model Performance Above Threshold: The ROC AUC score of 0.5923 surpasses the threshold of 0.5, indicating that the model's performance is better than random guessing and meets the minimum acceptable standard for classification tasks.
Moderate Discriminative Power: With a score of 0.5923, the model demonstrates moderate discriminative power, suggesting it can distinguish between classes to a reasonable extent but may require further refinement to enhance its classification capabilities.

Based on these results, the model demonstrates a satisfactory level of performance by surpassing the minimum ROC AUC threshold, indicating it can distinguish between classes better than random chance. However, the moderate score of 0.5923 suggests there is potential for improvement in the model's discriminative power. This performance level is adequate for initial validation, but further analysis and potential model adjustments may be necessary to enhance its classification accuracy and robustness, especially in scenarios with imbalanced class distributions. The insights gained from this test provide a foundational understanding of the model's current capabilities and areas for potential enhancement.

Tables

Score	Threshold	Pass/Fail
0.5923	0.5	Pass

▶ Test Result: Permutation Feature Importance (validmind.model_validation.sklearn.PermutationFeatureImportance)

Permutation Feature Importance

Permutation Feature Importance is designed to assess the significance of each feature in a model by evaluating the impact on model performance when feature values are randomly rearranged. The primary purpose of this test is to determine how much each feature contributes to the model's predictive power by observing the decrease in performance when the feature's values are permuted. This helps in identifying which features are most influential in the model's decision-making process.

The test operates by using the permutation_importance method from the sklearn.inspection module. This method involves shuffling the values of each feature one at a time and measuring the change in the model's performance, typically using a metric like accuracy or mean squared error. The idea is that if permuting a feature's values leads to a significant drop in performance, the feature is considered important. Conversely, if the performance remains largely unchanged, the feature is deemed less important. The results are often presented as a bar chart, where the length of each bar represents the importance of the corresponding feature. The values typically range from 0 to 1, where higher values indicate greater importance.

The primary advantages of this test include its ability to provide insights into the importance of different features, potentially revealing underlying data structures. It is particularly useful in identifying overfitting, as it can highlight if a model is overly reliant on specific features. Additionally, it is model-agnostic, meaning it can be applied to any classifier that provides a measure of prediction accuracy before and after feature permutation. This flexibility makes it a valuable tool for feature selection and model interpretation across various machine learning applications.

It should be noted that this test does not imply causality; it only indicates the amount of information a feature provides for the prediction task. It also does not account for interactions between features, which can lead to misleading importance scores if features are correlated. Furthermore, the test cannot interact with certain libraries like statsmodels, pytorch, and catboost, limiting its applicability in some contexts. High-risk signs include a model heavily relying on a feature with highly variable values, indicating potential instability, or a feature expected to be significant based on domain knowledge not influencing the model's predictions.

This test shows a bar chart representing the permutation importances of various features. The x-axis displays the importance score, while the y-axis lists the features. The chart indicates that the feature "Balance" has the highest importance, with a score significantly larger than the others, suggesting it plays a crucial role in the model's predictions. Other features like "EstimatedSalary" and "CreditScore" show much lower importance scores, indicating they contribute less to the model's performance. The chart provides a clear visual representation of how each feature impacts the model, with the length of the bars corresponding to their importance. The values range from 0 to approximately 0.09, with "Balance" being the most notable feature due to its high score.

The test results reveal the following key insights:

Balance Dominates Feature Importance: The feature "Balance" shows a significantly higher importance score compared to other features, indicating it is the most influential in the model's predictions.
Low Importance of EstimatedSalary and CreditScore: "EstimatedSalary" and "CreditScore" have much lower importance scores, suggesting they have minimal impact on the model's performance.
Negligible Impact of Categorical Features: Features like "Gender_Male" and "Geography_Spain" show negligible importance, indicating they do not significantly affect the model's predictions.

Based on these results, the model heavily relies on the "Balance" feature, which suggests that this feature is critical for the model's decision-making process. The low importance of "EstimatedSalary" and "CreditScore" indicates that these features do not contribute significantly to the model's performance, which may align or conflict with domain expectations. The negligible impact of categorical features suggests that the model does not consider these variables as important, which could be due to their lack of predictive power or potential redundancy with other features. Overall, the results provide a clear understanding of which features are driving the model's predictions and highlight areas where feature selection or engineering might be refined.

Figures

ValidMind Figure validmind.model_validation.sklearn.PermutationFeatureImportance:cc2f

▶ Test Result: SHAP Global Importance (validmind.model_validation.sklearn.SHAPGlobalImportance)

SHAP Global Importance

SHAP Global Importance is designed to evaluate and visualize global feature importance using SHAP values for model explanation and risk identification. The primary purpose is to elucidate model outcomes by attributing them to contributing features, assigning a quantifiable global importance to each feature via their respective absolute Shapley values. This makes it suitable for tasks like classification, both binary and multiclass, and forms an essential part of model risk management.

The test operates by selecting a suitable explainer that aligns with the model's type. For tree-based models like XGBClassifier, RandomForestClassifier, and CatBoostClassifier, the TreeExplainer is used, whereas for linear models like LogisticRegression and LinearRegression, the LinearExplainer is employed. The explainer calculates the Shapley values, which are then visualized using two specific graphical representations: the Mean Importance Plot and the Summary Plot. The Mean Importance Plot portrays the significance of individual features based on their absolute Shapley values, calculating the average across all instances to highlight global importance. The Summary Plot combines feature importance with their effects, where each dot represents a Shapley value for a certain feature in a specific case, with a color gradient indicating the feature's value.

The primary advantages of this test include its ability to provide a detailed perspective on how different features shape the model's decision-making logic for each instance. SHAP does more than just illustrate global feature significance; it offers clear insights into model behavior, making it particularly useful for understanding complex models. By visualizing the contribution of each feature, stakeholders can gain a comprehensive understanding of the model's inner workings, which is crucial for tasks involving model risk management and explanation.

It should be noted that high-dimensional data can convolute interpretations, and associating importance with tangible real-world impact still involves a certain degree of subjectivity. Signs of high risk include overemphasis on certain features in SHAP importance plots, which might hint at model overfitting, and anomalies such as unexpected or illogical features showing high importance, suggesting incorrect or undesirable reasoning. A SHAP summary plot filled with high variability or scattered data points may also indicate a cause for concern, highlighting potential issues in model stability or feature interaction.

This test shows two key visualizations: the Mean Importance Plot and the Summary Plot. The Mean Importance Plot displays normalized feature importance, with the horizontal axis representing the normalized SHAP value as a percentage and the vertical axis listing the features. The Summary Plot shows the SHAP value's impact on model output, with the horizontal axis indicating the SHAP value and the vertical axis listing the features. A color gradient from blue to red represents the feature value from low to high. The plots reveal that "Balance" has the highest normalized feature importance, followed by "EstimatedSalary" and "CreditScore." The Summary Plot shows the distribution of SHAP values for each feature, with "Balance" having a significant impact on the model output, as indicated by the spread of SHAP values.

The test results reveal the following key insights:

Balance Dominates Feature Importance: The "Balance" feature shows the highest normalized SHAP value, indicating it is the most influential feature in the model's predictions.
Estimated Salary and Credit Score: These features also contribute significantly, though to a lesser extent than "Balance," suggesting they play important roles in the model's decision-making process.
Feature Value Impact: The Summary Plot illustrates that higher values of "Balance" tend to increase the model output, as shown by the concentration of red dots on the positive side of the SHAP value axis.

Based on these results, the model heavily relies on the "Balance" feature, which significantly influences its predictions. The importance of "EstimatedSalary" and "CreditScore" also highlights their roles in shaping the model's behavior. The visualizations provide a clear understanding of how these features contribute to the model's output, with "Balance" being a critical factor. This insight into feature importance and interaction is crucial for assessing model reliability and ensuring that the model's decisions align with expected reasoning.

Figures

ValidMind Figure validmind.model_validation.sklearn.SHAPGlobalImportance:c195

ValidMind Figure validmind.model_validation.sklearn.SHAPGlobalImportance:30d5

▶ Test Result: Weakspots Diagnosis (validmind.model_validation.sklearn.WeakspotsDiagnosis)

❌ Weakspots Diagnosis

Weakspots Diagnosis is designed to identify and visualize weak spots in a machine learning model's performance across various sections of the feature space. The primary purpose is to evaluate the model's performance within specific regions, highlighting areas where it falls below set thresholds for metrics such as accuracy, precision, recall, and F1 scores.

The test operates by dividing the feature space into numerous bins, each representing a specific range of feature values. For each bin, the model's performance metrics are calculated on both the training and test datasets. Metrics like accuracy, precision, recall, and F1 score are used to assess the model's predictive capabilities. Accuracy measures the proportion of correct predictions, precision indicates the ratio of true positive predictions to the total predicted positives, recall reflects the ratio of true positive predictions to the actual positives, and F1 score is the harmonic mean of precision and recall. These metrics typically range from 0 to 1, where higher values indicate better performance. A weak spot is identified if any metric falls below a predetermined threshold on the test dataset, and these results are visualized using bar charts.

The primary advantages of this test include its ability to pinpoint specific regions of the feature space where the model's performance is suboptimal, allowing for targeted improvements. The graphical representation of performance metrics provides an intuitive understanding of how the model performs across different feature areas. Additionally, the test is flexible, allowing users to set different thresholds for various metrics based on application-specific requirements. This adaptability makes it particularly useful for models deployed in diverse environments where performance needs may vary.

It should be noted that the test's limitations include the potential oversimplification of model behavior due to the binning system, which depends on the chosen 'bins' parameter. The effectiveness of the test is also contingent on the selection of performance thresholds, which may not be universally applicable. Furthermore, the test is limited to numerical or categorical data types, excluding datasets with text columns. While the test highlights problematic regions, it does not provide direct suggestions for model improvement, requiring further analysis to address identified issues.

This test shows the results in both tabular and graphical formats. The tables present detailed metrics for each feature slice, including accuracy, precision, recall, and F1 scores for both training and test datasets. The bar charts visually depict these metrics, with separate plots for each feature and metric. The x-axis represents the feature slices, while the y-axis shows the metric values. A red dashed line indicates the threshold for each metric, helping to quickly identify bins where performance falls short. Notable observations include regions where test performance significantly lags behind training performance, suggesting potential overfitting or data distribution issues.

The test results reveal the following key insights:

Balance Feature Performance: The model shows varying performance across different balance ranges, with accuracy and F1 scores often below thresholds in lower balance ranges.
Credit Score Stability: Performance metrics for credit score slices indicate consistent issues with precision and recall, particularly in mid-range scores.
Estimated Salary Variability: The model's performance on estimated salary slices shows fluctuations, with some bins failing to meet accuracy and F1 score thresholds.
Geographical Influence: Geography-related features like Germany and Spain show distinct performance patterns, with some regions exceeding thresholds while others fall short.
Product and Tenure Effects: The number of products and tenure features exhibit mixed results, with certain bins showing strong performance and others indicating potential weaknesses.

Based on these results, the model demonstrates varied performance across different feature slices, with specific regions consistently underperforming. The insights suggest that the model may struggle with certain types of input data, particularly in lower balance ranges and mid-range credit scores. The geographical features reveal distinct patterns, indicating that location may influence model predictions. The variability in performance across product and tenure features suggests that these factors may also impact the model's effectiveness. Overall, the results highlight areas where the model's behavior could be further analyzed and potentially improved to enhance predictive accuracy and reliability.

Tables

Slice	Number of Records	Feature	Accuracy	Precision	Recall	F1	Dataset
(-250.898, 25089.809]	196	Balance	0.6429	0.0000	0.0000	0.0000	test_dataset_final
(25089.809, 50179.618]	5	Balance	0.8000	0.0000	0.0000	0.0000	test_dataset_final
(50179.618, 75269.427]	33	Balance	0.5455	0.4000	0.5000	0.4444	test_dataset_final
(75269.427, 100359.236]	77	Balance	0.4416	0.4127	0.8125	0.5474	test_dataset_final
(100359.236, 125449.045]	150	Balance	0.5800	0.5839	0.9886	0.7342	test_dataset_final
(125449.045, 150538.854]	124	Balance	0.6371	0.6371	1.0000	0.7783	test_dataset_final
(150538.854, 175628.663]	50	Balance	0.4600	0.4600	1.0000	0.6301	test_dataset_final
(175628.663, 200718.472]	9	Balance	0.6667	0.6667	1.0000	0.8000	test_dataset_final
(200718.472, 225808.281]	2	Balance	1.0000	1.0000	1.0000	1.0000	test_dataset_final
(225808.281, 250898.09]	1	Balance	1.0000	1.0000	1.0000	1.0000	test_dataset_final
(-250.898, 25089.809]	850	Balance	0.6094	0.0000	0.0000	0.0000	train_dataset_final
(25089.809, 50179.618]	21	Balance	0.3810	0.0000	0.0000	0.0000	train_dataset_final
(50179.618, 75269.427]	92	Balance	0.4891	0.5862	0.3269	0.4198	train_dataset_final
(75269.427, 100359.236]	312	Balance	0.4872	0.4774	0.7785	0.5918	train_dataset_final
(100359.236, 125449.045]	580	Balance	0.6103	0.6101	0.9972	0.7570	train_dataset_final
(125449.045, 150538.854]	475	Balance	0.5516	0.5516	1.0000	0.7110	train_dataset_final
(150538.854, 175628.663]	188	Balance	0.5319	0.5319	1.0000	0.6944	train_dataset_final
(175628.663, 200718.472]	51	Balance	0.5098	0.5098	1.0000	0.6753	train_dataset_final
(200718.472, 225808.281]	15	Balance	0.9333	0.9333	1.0000	0.9655	train_dataset_final
(225808.281, 250898.09]	1	Balance	1.0000	1.0000	1.0000	1.0000	train_dataset_final
(349.5, 400.0]	6	CreditScore	0.5000	1.0000	0.5000	0.6667	test_dataset_final
(400.0, 450.0]	13	CreditScore	0.6154	0.5455	1.0000	0.7059	test_dataset_final
(450.0, 500.0]	27	CreditScore	0.4815	0.4286	0.8182	0.5625	test_dataset_final
(500.0, 550.0]	71	CreditScore	0.5634	0.5333	0.7059	0.6076	test_dataset_final
(550.0, 600.0]	78	CreditScore	0.5513	0.4717	0.7812	0.5882	test_dataset_final
(600.0, 650.0]	127	CreditScore	0.5512	0.4756	0.7358	0.5778	test_dataset_final
(650.0, 700.0]	126	CreditScore	0.6111	0.6098	0.7463	0.6711	test_dataset_final
(700.0, 750.0]	88	CreditScore	0.6477	0.6304	0.6744	0.6517	test_dataset_final
(750.0, 800.0]	62	CreditScore	0.5968	0.6279	0.7500	0.6835	test_dataset_final
(800.0, 850.0]	49	CreditScore	0.6531	0.6667	0.6923	0.6792	test_dataset_final
(349.5, 400.0]	8	CreditScore	0.7500	1.0000	0.7500	0.8571	train_dataset_final
(400.0, 450.0]	51	CreditScore	0.5294	0.5588	0.6786	0.6129	train_dataset_final
(450.0, 500.0]	133	CreditScore	0.5113	0.4891	0.7143	0.5806	train_dataset_final
(500.0, 550.0]	268	CreditScore	0.6045	0.6059	0.7254	0.6603	train_dataset_final
(550.0, 600.0]	397	CreditScore	0.5819	0.5766	0.7010	0.6327	train_dataset_final
(600.0, 650.0]	476	CreditScore	0.5504	0.5469	0.6955	0.6123	train_dataset_final
(650.0, 700.0]	466	CreditScore	0.5429	0.5307	0.6391	0.5799	train_dataset_final
(700.0, 750.0]	401	CreditScore	0.6185	0.6035	0.6850	0.6417	train_dataset_final
(750.0, 800.0]	232	CreditScore	0.5733	0.5423	0.6937	0.6087	train_dataset_final
(800.0, 850.0]	153	CreditScore	0.5882	0.5676	0.5753	0.5714	train_dataset_final
(-188.401, 20009.67]	61	EstimatedSalary	0.4426	0.3830	0.7826	0.5143	test_dataset_final
(20009.67, 40007.76]	75	EstimatedSalary	0.5733	0.5536	0.8158	0.6596	test_dataset_final
(40007.76, 60005.85]	67	EstimatedSalary	0.5522	0.4762	0.7143	0.5714	test_dataset_final
(60005.85, 80003.94]	74	EstimatedSalary	0.5541	0.5490	0.7368	0.6292	test_dataset_final
(80003.94, 100002.03]	51	EstimatedSalary	0.6863	0.6207	0.7826	0.6923	test_dataset_final
(100002.03, 120000.12]	71	EstimatedSalary	0.5352	0.4889	0.6875	0.5714	test_dataset_final
(120000.12, 139998.21]	58	EstimatedSalary	0.6034	0.5862	0.6071	0.5965	test_dataset_final
(139998.21, 159996.3]	63	EstimatedSalary	0.6984	0.6486	0.8000	0.7164	test_dataset_final
(159996.3, 179994.39]	63	EstimatedSalary	0.6349	0.5789	0.7586	0.6567	test_dataset_final
(179994.39, 199992.48]	64	EstimatedSalary	0.6250	0.7692	0.6667	0.7143	test_dataset_final
(-188.401, 20009.67]	252	EstimatedSalary	0.6429	0.6731	0.7292	0.7000	train_dataset_final
(20009.67, 40007.76]	241	EstimatedSalary	0.5436	0.5287	0.6975	0.6014	train_dataset_final
(40007.76, 60005.85]	263	EstimatedSalary	0.5703	0.5287	0.7480	0.6195	train_dataset_final
(60005.85, 80003.94]	271	EstimatedSalary	0.5535	0.5460	0.6934	0.6109	train_dataset_final
(80003.94, 100002.03]	258	EstimatedSalary	0.5581	0.5519	0.6538	0.5986	train_dataset_final
(100002.03, 120000.12]	259	EstimatedSalary	0.5792	0.5669	0.6846	0.6202	train_dataset_final
(120000.12, 139998.21]	268	EstimatedSalary	0.5933	0.5808	0.7132	0.6403	train_dataset_final
(139998.21, 159996.3]	238	EstimatedSalary	0.5798	0.5705	0.7025	0.6296	train_dataset_final
(159996.3, 179994.39]	298	EstimatedSalary	0.5436	0.5429	0.6291	0.5828	train_dataset_final
(179994.39, 199992.48]	237	EstimatedSalary	0.5654	0.5345	0.5586	0.5463	train_dataset_final
(-0.001, 0.1]	297	Gender_Male	0.6128	0.6298	0.7037	0.6647	test_dataset_final
(0.9, 1.0]	350	Gender_Male	0.5657	0.5000	0.7632	0.6042	test_dataset_final
(-0.001, 0.1]	1257	Gender_Male	0.5863	0.6382	0.6730	0.6552	train_dataset_final
(0.9, 1.0]	1328	Gender_Male	0.5595	0.4894	0.6937	0.5739	train_dataset_final
(-0.001, 0.1]	445	Geography_Germany	0.5281	0.4189	0.5345	0.4697	test_dataset_final
(0.9, 1.0]	202	Geography_Germany	0.7178	0.7173	0.9786	0.8278	test_dataset_final
(-0.001, 0.1]	1794	Geography_Germany	0.5307	0.4675	0.4937	0.4802	train_dataset_final
(0.9, 1.0]	791	Geography_Germany	0.6675	0.6680	0.9708	0.7914	train_dataset_final
(-0.001, 0.1]	506	Geography_Spain	0.6008	0.5817	0.7838	0.6678	test_dataset_final
(0.9, 1.0]	141	Geography_Spain	0.5390	0.4219	0.4909	0.4538	test_dataset_final
(-0.001, 0.1]	1999	Geography_Spain	0.5858	0.5760	0.7342	0.6455	train_dataset_final
(0.9, 1.0]	586	Geography_Spain	0.5273	0.4963	0.4873	0.4917	train_dataset_final
(-0.001, 0.1]	195	HasCrCard	0.6256	0.6063	0.7700	0.6784	test_dataset_final
(0.9, 1.0]	452	HasCrCard	0.5708	0.5350	0.7150	0.6120	test_dataset_final
(-0.001, 0.1]	774	HasCrCard	0.5581	0.5544	0.6726	0.6078	train_dataset_final
(0.9, 1.0]	1811	HasCrCard	0.5787	0.5658	0.6861	0.6202	train_dataset_final
(-0.001, 0.1]	336	IsActiveMember	0.6339	0.6667	0.7449	0.7036	test_dataset_final
(0.9, 1.0]	311	IsActiveMember	0.5370	0.4330	0.7119	0.5385	test_dataset_final
(-0.001, 0.1]	1407	IsActiveMember	0.5842	0.6374	0.6707	0.6536	train_dataset_final
(0.9, 1.0]	1178	IsActiveMember	0.5586	0.4712	0.7015	0.5638	train_dataset_final
(0.997, 1.3]	393	NumOfProducts	0.5165	0.5652	0.7380	0.6402	test_dataset_final
(1.9, 2.2]	217	NumOfProducts	0.6959	0.3977	0.7292	0.5147	test_dataset_final
(2.8, 3.1]	32	NumOfProducts	0.6562	1.0000	0.6562	0.7925	test_dataset_final
(3.7, 4.0]	5	NumOfProducts	1.0000	1.0000	1.0000	1.0000	test_dataset_final
(0.997, 1.3]	1485	NumOfProducts	0.5091	0.5764	0.6895	0.6279	train_dataset_final
(1.9, 2.2]	904	NumOfProducts	0.6515	0.3808	0.6592	0.4828	train_dataset_final
(2.8, 3.1]	157	NumOfProducts	0.6815	1.0000	0.6622	0.7967	train_dataset_final
(3.7, 4.0]	39	NumOfProducts	0.7179	1.0000	0.7179	0.8358	train_dataset_final
(-0.01, 1.0]	90	Tenure	0.5889	0.6230	0.7308	0.6726	test_dataset_final
(1.0, 2.0]	67	Tenure	0.6269	0.5789	0.7097	0.6377	test_dataset_final
(2.0, 3.0]	73	Tenure	0.5616	0.5882	0.7317	0.6522	test_dataset_final
(3.0, 4.0]	68	Tenure	0.6471	0.5870	0.8438	0.6923	test_dataset_final
(4.0, 5.0]	70	Tenure	0.5429	0.4667	0.7241	0.5676	test_dataset_final
(5.0, 6.0]	62	Tenure	0.5806	0.5946	0.6667	0.6286	test_dataset_final
(6.0, 7.0]	72	Tenure	0.5000	0.3864	0.6538	0.4857	test_dataset_final
(7.0, 8.0]	73	Tenure	0.5479	0.4898	0.7500	0.5926	test_dataset_final
(8.0, 9.0]	47	Tenure	0.6809	0.6552	0.7917	0.7170	test_dataset_final
(9.0, 10.0]	25	Tenure	0.7200	0.7692	0.7143	0.7407	test_dataset_final
(-0.01, 1.0]	367	Tenure	0.5831	0.6000	0.7050	0.6483	train_dataset_final
(1.0, 2.0]	267	Tenure	0.4981	0.4788	0.6220	0.5411	train_dataset_final
(2.0, 3.0]	281	Tenure	0.5694	0.5353	0.6842	0.6007	train_dataset_final
(3.0, 4.0]	267	Tenure	0.5543	0.5658	0.6187	0.5911	train_dataset_final
(4.0, 5.0]	259	Tenure	0.5676	0.5569	0.7099	0.6242	train_dataset_final
(5.0, 6.0]	258	Tenure	0.5736	0.5385	0.6364	0.5833	train_dataset_final
(6.0, 7.0]	242	Tenure	0.6033	0.5540	0.6937	0.6160	train_dataset_final
(7.0, 8.0]	268	Tenure	0.6418	0.5964	0.7734	0.6735	train_dataset_final
(8.0, 9.0]	250	Tenure	0.5600	0.5904	0.7000	0.6405	train_dataset_final
(9.0, 10.0]	126	Tenure	0.5714	0.6184	0.6528	0.6351	train_dataset_final

Figures

ValidMind Figure validmind.model_validation.sklearn.WeakspotsDiagnosis:bcc0

ValidMind Figure validmind.model_validation.sklearn.WeakspotsDiagnosis:1c0b

ValidMind Figure validmind.model_validation.sklearn.WeakspotsDiagnosis:8e1e

ValidMind Figure validmind.model_validation.sklearn.WeakspotsDiagnosis:1dcc

ValidMind Figure validmind.model_validation.sklearn.WeakspotsDiagnosis:0ca9

ValidMind Figure validmind.model_validation.sklearn.WeakspotsDiagnosis:d476

ValidMind Figure validmind.model_validation.sklearn.WeakspotsDiagnosis:82ee

ValidMind Figure validmind.model_validation.sklearn.WeakspotsDiagnosis:3f9f

ValidMind Figure validmind.model_validation.sklearn.WeakspotsDiagnosis:397e

ValidMind Figure validmind.model_validation.sklearn.WeakspotsDiagnosis:2e20

ValidMind Figure validmind.model_validation.sklearn.WeakspotsDiagnosis:4e7b

$ValidMind Figure validmind.model_validation.sklearn.WeakspotsDiagnosis:96c7$

ValidMind Figure validmind.model_validation.sklearn.WeakspotsDiagnosis:3d67

ValidMind Figure validmind.model_validation.sklearn.WeakspotsDiagnosis:11d2

ValidMind Figure validmind.model_validation.sklearn.WeakspotsDiagnosis:eb2f

ValidMind Figure validmind.model_validation.sklearn.WeakspotsDiagnosis:17c3

ValidMind Figure validmind.model_validation.sklearn.WeakspotsDiagnosis:1685

ValidMind Figure validmind.model_validation.sklearn.WeakspotsDiagnosis:07a0

ValidMind Figure validmind.model_validation.sklearn.WeakspotsDiagnosis:2189

ValidMind Figure validmind.model_validation.sklearn.WeakspotsDiagnosis:ff51

ValidMind Figure validmind.model_validation.sklearn.WeakspotsDiagnosis:20e2

ValidMind Figure validmind.model_validation.sklearn.WeakspotsDiagnosis:b3e3

ValidMind Figure validmind.model_validation.sklearn.WeakspotsDiagnosis:4ac5

ValidMind Figure validmind.model_validation.sklearn.WeakspotsDiagnosis:2b5c

ValidMind Figure validmind.model_validation.sklearn.WeakspotsDiagnosis:362a

ValidMind Figure validmind.model_validation.sklearn.WeakspotsDiagnosis:e14e

ValidMind Figure validmind.model_validation.sklearn.WeakspotsDiagnosis:f4af

ValidMind Figure validmind.model_validation.sklearn.WeakspotsDiagnosis:0a7a

ValidMind Figure validmind.model_validation.sklearn.WeakspotsDiagnosis:698d

ValidMind Figure validmind.model_validation.sklearn.WeakspotsDiagnosis:7f6d

ValidMind Figure validmind.model_validation.sklearn.WeakspotsDiagnosis:25a4

ValidMind Figure validmind.model_validation.sklearn.WeakspotsDiagnosis:bb33

ValidMind Figure validmind.model_validation.sklearn.WeakspotsDiagnosis:6268

ValidMind Figure validmind.model_validation.sklearn.WeakspotsDiagnosis:6f05

ValidMind Figure validmind.model_validation.sklearn.WeakspotsDiagnosis:a25a

ValidMind Figure validmind.model_validation.sklearn.WeakspotsDiagnosis:960a

ValidMind Figure validmind.model_validation.sklearn.WeakspotsDiagnosis:ecd4

ValidMind Figure validmind.model_validation.sklearn.WeakspotsDiagnosis:5532

ValidMind Figure validmind.model_validation.sklearn.WeakspotsDiagnosis:89ec

ValidMind Figure validmind.model_validation.sklearn.WeakspotsDiagnosis:7cad

▶ Test Result: Overfit Diagnosis (validmind.model_validation.sklearn.OverfitDiagnosis)

Overfit Diagnosis

Overfit Diagnosis is designed to assess potential overfitting in a model's predictions by identifying regions where performance between training and testing sets deviates significantly. The primary purpose of this test is to pinpoint specific regions or feature segments where the model may be overfitting, thereby allowing for targeted improvements in model generalization.

The test operates by comparing the model's performance on training versus test data, grouped by feature columns. It calculates the difference between the training and test performance for each group, using the AUC metric for classification models and the MSE metric for regression models. The test identifies regions where this difference exceeds a specified threshold, set to 0.04 by default. The AUC, or Area Under the Curve, measures the model's ability to distinguish between classes, with values ranging from 0 to 1, where higher values indicate better performance. The test involves calculating these metrics for each feature segment and visualizing regions where the performance gap exceeds the threshold, thus highlighting potential overfitting areas.

The primary advantages of this test include its ability to identify specific areas where overfitting occurs, providing a clear visualization of these segments. It supports multiple performance metrics, offering flexibility in application to both classification and regression models. This capability allows for a nuanced understanding of model behavior across different feature segments, aiding in debugging and model refinement. By visualizing overfitting segments, the test facilitates a more intuitive grasp of where the model's predictions may be overly tailored to the training data, thus guiding targeted interventions.

It should be noted that the test has limitations, including the default threshold, which may not be suitable for all use cases and might require tuning. Additionally, the test may not capture more subtle forms of overfitting that do not exceed the threshold. The assumption that the binning of features adequately represents the data segments can also be a limitation, as inappropriate binning might lead to misleading results. Furthermore, significant gaps between training and test performance metrics for specific feature segments are signs of high risk, indicating potential overfitting.

This test shows the results in both tabular and graphical formats. The tables present data on feature segments, including the number of training and test records, training and test AUC values, and the gap between them. The plots visualize the AUC gap for each feature segment, with a red line indicating the cut-off threshold of 0.04. The x-axis represents the feature segments, while the y-axis shows the AUC gap. Key metrics include the AUC values for training and test sets, and the gap, which highlights overfitting regions. Notable observations include segments where the gap exceeds the threshold, indicating potential overfitting.

The test results reveal the following key insights:

Significant Overfitting in NumOfProducts: The feature segment (2.8, 3.1] shows a substantial AUC gap of 0.8476, indicating severe overfitting.
High AUC Gap in Balance Segments: The segment (25089.809, 50179.618] has an AUC gap of 0.2885, suggesting overfitting in this balance range.
Moderate Overfitting in Tenure: Segments such as (6.0, 7.0] and (7.0, 8.0] show gaps of 0.0929 and 0.0432, respectively, indicating moderate overfitting.
Overfitting in EstimatedSalary: The segment (-188.401, 20009.67] has a gap of 0.1095, highlighting potential overfitting in this salary range.

Based on these results, the model exhibits varying degrees of overfitting across different feature segments. The most pronounced overfitting is observed in the NumOfProducts feature, with a significant gap indicating a need for model adjustment. Other features, such as Balance and EstimatedSalary, also show notable gaps, suggesting areas where the model's predictions are overly tailored to the training data. These insights provide a clear direction for refining the model to improve its generalization and reduce overfitting, particularly in the highlighted segments.

Tables

Overfit Diagnosis

Feature	Slice	Number of Training Records	Number of Test Records	Training AUC	Test AUC	Gap
CreditScore	(450.0, 500.0]	133	27	0.5238	0.4773	0.0465
Tenure	(-0.01, 1.0]	367	90	0.6033	0.5294	0.0739
Tenure	(6.0, 7.0]	242	72	0.6422	0.5493	0.0929
Tenure	(7.0, 8.0]	268	73	0.6545	0.6113	0.0432
Balance	(-250.898, 25089.809]	850	196	0.5329	0.4808	0.0520
Balance	(25089.809, 50179.618]	21	5	0.5385	0.2500	0.2885
Balance	(100359.236, 125449.045]	580	150	0.5337	0.3964	0.1373
Balance	(125449.045, 150538.854]	475	124	0.5245	0.4236	0.1009
Balance	(175628.663, 200718.472]	51	9	0.4938	0.4444	0.0494
Balance	(200718.472, 225808.281]	15	2	0.4286	0.0000	0.4286
NumOfProducts	(2.8, 3.1]	157	32	0.8476	0.0000	0.8476
EstimatedSalary	(-188.401, 20009.67]	252	61	0.6278	0.5183	0.1095
EstimatedSalary	(100002.03, 120000.12]	259	71	0.6157	0.5633	0.0524
Geography_Spain	(0.9, 1.0]	586	141	0.5515	0.5078	0.0437

Figures

ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:60e6

ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:1f8f

ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:831d

ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:5140

ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:da9a

ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:e8ae

ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:0a24

ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:41c4

ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:cfd8

ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:d996

▶ Test Result: Robustness Diagnosis (validmind.model_validation.sklearn.RobustnessDiagnosis)

✅ Robustness Diagnosis

Robustness Diagnosis is designed to evaluate the resilience of a machine learning model when exposed to noisy conditions. The primary purpose is to assess how well the model can maintain its performance when the input data is perturbed, simulating real-world scenarios where data may be imperfect or corrupted.

The test operates by introducing Gaussian noise to the numeric input features of the dataset at varying scales of standard deviation. This noise simulates potential real-world data imperfections. The model's performance is then evaluated using metrics such as the Area Under the Curve (AUC) for classification tasks. AUC measures the model's ability to distinguish between classes, with values ranging from 0 to 1, where 1 indicates perfect classification and 0.5 suggests no discriminative power. The test measures performance decay by comparing the AUC of the noisy data against the baseline, providing insights into how noise affects model accuracy.

The primary advantages of this test include its ability to provide insights into the model's robustness against noisy or corrupted data. By utilizing a variety of performance metrics suitable for both classification and regression tasks, it offers a comprehensive view of the model's resilience. The visualization of results helps in understanding the extent of performance degradation, making it easier to identify specific perturbation levels that significantly impact model performance.

It should be noted that the test has limitations, such as the assumption that Gaussian noise adequately represents all types of real-world data perturbations. The performance thresholds used to evaluate decay are somewhat arbitrary and may require tuning to align with specific application needs. Additionally, the test may not account for more complex or unstructured noise patterns that could affect model robustness, potentially leading to an incomplete assessment of the model's real-world performance.

This test shows the results in both tabular and graphical formats. The table presents data on perturbation size, dataset, row count, AUC, performance decay, and pass status. The graph plots AUC against perturbation size, illustrating how performance changes with increasing noise. The x-axis represents perturbation size, while the y-axis shows AUC values. Key measurements include AUC values for both training and test datasets, with performance decay indicating the change from baseline. Notable observations include variations in AUC across different perturbation sizes, with some datasets showing resilience while others exhibit more significant decay.

The test results reveal the following key insights:

Stable Performance at Low Noise Levels: At perturbation sizes of 0.1 and 0.2, both training and test datasets maintain AUC values close to the baseline, indicating robustness to minor noise.
Increased Variability at Higher Noise Levels: As perturbation size increases to 0.4 and 0.5, the test dataset shows more significant performance decay, with AUC dropping to 0.5805 at 0.4, highlighting sensitivity to higher noise levels.
Unexpected Improvement at 0.5 Perturbation: Interestingly, the test dataset AUC increases to 0.6046 at a perturbation size of 0.5, suggesting potential overfitting or noise-induced beneficial effects.

Based on these results, the model demonstrates a degree of robustness to low levels of noise, maintaining performance close to baseline. However, as noise levels increase, particularly beyond a perturbation size of 0.3, the model's performance becomes more variable, indicating potential vulnerabilities. The unexpected improvement at the highest noise level suggests complex interactions between noise and model behavior, warranting further investigation. These insights provide a nuanced understanding of the model's behavior under noisy conditions, highlighting areas for potential refinement to enhance robustness.

Tables

Perturbation Size	Dataset	Row Count	AUC	Performance Decay	Passed
Baseline (0.0)	train_dataset_final	2585	0.5873	0.0000	True
Baseline (0.0)	test_dataset_final	647	0.5923	0.0000	True
0.1	train_dataset_final	2585	0.5852	0.0021	True
0.1	test_dataset_final	647	0.5938	-0.0015	True
0.2	train_dataset_final	2585	0.5900	-0.0028	True
0.2	test_dataset_final	647	0.5954	-0.0031	True
0.3	train_dataset_final	2585	0.5851	0.0022	True
0.3	test_dataset_final	647	0.5875	0.0048	True
0.4	train_dataset_final	2585	0.5831	0.0042	True
0.4	test_dataset_final	647	0.5805	0.0118	True
0.5	train_dataset_final	2585	0.5888	-0.0015	True
0.5	test_dataset_final	647	0.6046	-0.0123	True

Figures

ValidMind Figure validmind.model_validation.sklearn.RobustnessDiagnosis:7ce7

In summary

In this second notebook, you learned how to:

Import a sample dataset
Identify which tests you might want to run with ValidMind
Initialize ValidMind datasets and model objects
Run individual tests
Utilize the output from tests you've run
Log test results from sets of or individual tests as evidence to the ValidMind Platform
Add supplementary individual test results to your documentation
Assign model predictions to your ValidMind model objects

Next steps

Integrate custom tests

Now that you're familiar with the basics of using the ValidMind Library to run and log tests to provide evidence for your model documentation, let's learn how to incorporate your own custom tests into ValidMind: 3 — Integrate custom tests