
Test descriptions

Published

May 12, 2025

Tests that are available as part of the ValidMind Library, grouped by type of validation or monitoring test.

Try the test sandbox beta

Explore our interactive sandbox to see what tests are available in the ValidMind Library.
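
The test names listed on this page can also be looked up from Python. The sketch below assumes that test identifiers follow a dotted `validmind.<group>.<TestName>` convention and that the installed library exposes `validmind.tests.describe_test`; both are assumptions to check against the ValidMind Library Python API reference. `build_test_id` is a hypothetical helper written for this example, not part of the library.

```python
def build_test_id(group: str, name: str) -> str:
    """Join a group and a test name into a dotted identifier.

    The "validmind.<group>.<name>" pattern is an assumed convention.
    """
    return f"validmind.{group}.{name}"


test_id = build_test_id("data_validation", "ClassImbalance")
print(test_id)

try:
    # Optional: only runs when the ValidMind Library is installed.
    import validmind as vm

    vm.tests.describe_test(test_id)
except ImportError:
    print("validmind is not installed; skipping describe_test")
```

If the library is not installed, the snippet only prints the assembled identifier.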

  • Data validation
  • Model validation
  • Prompt validation
  • Ongoing monitoring

Data validation

ACFandPACFPlot
Analyzes time series data using Autocorrelation Function (ACF) and Partial Autocorrelation Function (PACF) plots to reveal trends and correlations.
ADF
Assesses the stationarity of a time series dataset using the Augmented Dickey-Fuller (ADF) test.
AutoAR
Automatically identifies the optimal Autoregressive (AR) order for a time series using BIC and AIC criteria.
AutoMA
Automatically selects the optimal Moving Average (MA) order for each variable in a time series dataset based on minimal BIC and AIC values.
AutoStationarity
Automates the Augmented Dickey-Fuller test to assess stationarity across multiple time series in a DataFrame.
BivariateScatterPlots
Generates bivariate scatterplots to visually inspect relationships between pairs of numerical predictor variables in machine learning classification tasks.
BoxPierce
Detects autocorrelation in time-series data through the Box-Pierce test to validate model performance.
ChiSquaredFeaturesTable
Assesses the statistical association between categorical features and a target variable using the Chi-Squared test.
ClassImbalance
Evaluates and quantifies class distribution imbalance in a dataset used by a machine learning model.
DatasetDescription
Provides comprehensive analysis and statistical summaries of each column in a machine learning model's dataset.
DatasetSplit
Evaluates and visualizes the distribution proportions among training, testing, and validation datasets of an ML model.
DescriptiveStatistics
Performs a detailed descriptive statistical analysis of both numerical and categorical data within a model's dataset.
DickeyFullerGLS
Assesses stationarity in time series data using the Dickey-Fuller GLS test to determine the order of integration.
Duplicates
Tests the dataset for duplicate entries to verify data quality and support model reliability.
EngleGrangerCoint
Assesses the degree of co-movement between pairs of time series data using the Engle-Granger cointegration test.
FeatureTargetCorrelationPlot
Visualizes the correlation between input features and the model's target output in a color-coded horizontal bar plot.
HighCardinality
Assesses the number of unique values in categorical columns to detect high cardinality and potential overfitting.
HighPearsonCorrelation
Identifies highly correlated feature pairs in a dataset suggesting feature redundancy or multicollinearity.
IQROutliersBarPlot
Visualizes outlier distribution across percentiles in numerical data using the Interquartile Range (IQR) method.
IQROutliersTable
Determines and summarizes outliers in numerical features using the Interquartile Range method.
IsolationForestOutliers
Detects outliers in a dataset using the Isolation Forest algorithm and visualizes results through scatter plots.
JarqueBera
Assesses normality of dataset features in an ML model using the Jarque-Bera test.
KPSS
Assesses the stationarity of time-series data in a machine learning model using the KPSS unit root test.
LJungBox
Assesses autocorrelations in dataset features by performing a Ljung-Box test on each feature.
LaggedCorrelationHeatmap
Assesses and visualizes correlation between target variable and lagged independent variables in a time-series dataset.
MissingValues
Evaluates dataset quality by ensuring that the missing value ratio across all features does not exceed a set threshold.
MissingValuesBarPlot
Assesses the percentage and distribution of missing values in the dataset via a bar plot, with emphasis on identifying high-risk columns based on a user-defined threshold.
MutualInformation
Calculates mutual information scores between features and target variable to evaluate feature relevance.
PearsonCorrelationMatrix
Evaluates linear dependency between numerical variables in a dataset via a Pearson Correlation coefficient heat map.
PhillipsPerronArch
Assesses the stationarity of time series data in each feature of the ML model using the Phillips-Perron test.
ProtectedClassesCombination
Visualizes combinations of protected classes and their corresponding error metric differences.
ProtectedClassesDescription
Visualizes the distribution of protected classes in the dataset relative to the target variable and provides descriptive statistics.
ProtectedClassesDisparity
Investigates disparities in model performance across different protected class segments.
ProtectedClassesThresholdOptimizer
Obtains a classifier by applying group-specific thresholds to the provided estimator.
RollingStatsPlot
Evaluates the stationarity of time series data by plotting its rolling mean and standard deviation over a specified window.
RunsTest
Executes the Runs Test on an ML model to detect non-random patterns in the output data sequence.
ScatterPlot
Assesses visual relationships, patterns, and outliers among features in a dataset through scatter plot matrices.
ScoreBandDefaultRates
Analyzes default rates and population distribution across credit score bands.
SeasonalDecompose
Assesses patterns and seasonality in a time series dataset by decomposing its features into foundational components.
ShapiroWilk
Evaluates feature-wise normality of training data using the Shapiro-Wilk test.
Skewness
Evaluates the skewness of numerical data in a dataset to check against a defined threshold, aiming to ensure data quality and optimize model performance.
SpreadPlot
Assesses potential correlations between pairs of time series variables through visualization to enhance understanding of their relationships.
TabularCategoricalBarPlots
Generates and visualizes bar plots for each category in categorical features to evaluate the dataset's composition.
TabularDateTimeHistograms
Generates histograms to provide graphical insight into the distribution of time intervals in a model's datetime data.
TabularDescriptionTables
Summarizes key descriptive statistics for numerical, categorical, and datetime variables in a dataset.
TabularNumericalHistograms
Generates histograms for each numerical feature in a dataset to provide visual insights into data distribution and detect potential issues.
TargetRateBarPlots
Generates bar plots visualizing the default rates of categorical features for a classification machine learning model.
TimeSeriesDescription
Generates a detailed analysis for the provided time series dataset, summarizing key statistics to identify trends, patterns, and data quality issues.
TimeSeriesDescriptiveStatistics
Evaluates the descriptive statistics of a time series dataset to identify trends, patterns, and data quality issues.
TimeSeriesFrequency
Evaluates consistency of time series data frequency and generates a frequency plot.
TimeSeriesHistogram
Visualizes distribution of time-series data using histograms and Kernel Density Estimation (KDE) lines.
TimeSeriesLinePlot
Generates line plots of time-series data to reveal trends, patterns, and anomalies over time.
TimeSeriesMissingValues
Validates time-series data quality by confirming the count of missing values is below a certain threshold.
TimeSeriesOutliers
Identifies and visualizes outliers in time-series data using the z-score method.
TooManyZeroValues
Identifies numerical columns in a dataset that contain an excessive number of zero values, defined by a threshold percentage.
UniqueRows
Verifies the diversity of the dataset by ensuring that the count of unique rows exceeds a prescribed threshold.
WOEBinPlots
Generates visualizations of Weight of Evidence (WoE) and Information Value (IV) to show the predictive power of categorical variables in a dataset.
WOEBinTable
Assesses the Weight of Evidence (WoE) and Information Value (IV) of each feature to evaluate its predictive power in a binary classification model.
ZivotAndrewsArch
Evaluates the order of integration and stationarity of time series data using the Zivot-Andrews unit root test.
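
Several of the data validation tests above, such as IQROutliersTable and IQROutliersBarPlot, rely on the Interquartile Range (IQR) method. As a rough illustration of that rule only, here is a minimal standard-library sketch. It is not the ValidMind implementation; the function name `iqr_outliers` and the `threshold` parameter are assumptions made for this example.

```python
from statistics import quantiles


def iqr_outliers(values, threshold=1.5):
    """Return the values that fall outside the IQR fences.

    Fences are Q1 - threshold * IQR and Q3 + threshold * IQR.
    `threshold` is a hypothetical parameter name for this sketch.
    """
    # Quartiles computed with the inclusive method from the standard library.
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower = q1 - threshold * iqr
    upper = q3 + threshold * iqr
    return [v for v in values if v < lower or v > upper]


sample = [10, 12, 11, 13, 12, 11, 95]
print(iqr_outliers(sample))  # prints [95]
```

Different quartile conventions can shift the fences slightly, so results near the boundary may differ between tools.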

Model validation

BertScore
Assesses the quality of machine-generated text using BERTScore metrics and visualizes results through histograms and bar charts, alongside compiling a comprehensive table of descriptive statistics.
BleuScore
Evaluates the quality of machine-generated text using BLEU metrics and visualizes the results through histograms and bar charts, alongside compiling a comprehensive table of descriptive statistics for BLEU scores.
ClusterSizeDistribution
Assesses the performance of clustering models by comparing the distribution of cluster sizes in model predictions with the actual data.
ContextualRecall
Evaluates a Natural Language Generation model's ability to generate contextually relevant and factually correct text, visualizing the results through histograms and bar charts, alongside compiling a comprehensive table of descriptive statistics for…
FeaturesAUC
Evaluates the discriminatory power of each individual feature within a binary classification model by calculating the Area Under the Curve (AUC) for each feature separately.
MeteorScore
Assesses the quality of machine-generated translations by comparing them to human-produced references using the METEOR score, which evaluates precision, recall, and word order.
ModelMetadata
Compares metadata of different models and generates a summary table with the results.
ModelPredictionResiduals
Assesses normality and behavior of residuals in regression models through visualization and statistical tests.
RegardScore
Assesses the sentiment and potential biases in text generated by NLP models by computing and visualizing regard scores.
RegressionResidualsPlot
Evaluates regression model performance using residual distribution and actual vs. predicted plots.
RougeScore
Assesses the quality of machine-generated text using ROUGE metrics and visualizes the results to provide comprehensive performance insights.
TimeSeriesPredictionWithCI
Assesses predictive accuracy and uncertainty in time series models, highlighting breaches beyond confidence intervals.
TimeSeriesPredictionsPlot
Plots actual vs. predicted values for time series data to provide a visual comparison for the model.
TimeSeriesR2SquareBySegments
Evaluates the R-Squared values of regression models over specified time segments in time series data to assess segment-wise model performance.
TokenDisparity
Evaluates the token disparity between reference and generated texts, visualizing the results through histograms and bar charts, alongside compiling a comprehensive table of descriptive statistics for token counts.
ToxicityScore
Assesses the toxicity levels of texts generated by NLP models to identify and mitigate harmful or offensive content.
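
The FeaturesAUC description above treats each feature on its own as if it were a score for the binary target. A minimal sketch of that idea, assuming the pairwise (Mann-Whitney) definition of AUC, is shown below. The function name `feature_auc` is hypothetical, and this is not the library's implementation.

```python
def feature_auc(feature_values, labels):
    """AUC of one feature used directly as a score for a binary target.

    Computed as the share of (positive, negative) pairs in which the
    positive example has the larger feature value; ties count as half.
    """
    positives = [x for x, y in zip(feature_values, labels) if y == 1]
    negatives = [x for x, y in zip(feature_values, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


print(feature_auc([1, 2, 3, 4], [0, 0, 1, 1]))  # prints 1.0
```

A value near 0.5 suggests the feature alone does not separate the classes, while values near 0 or 1 indicate strong separation in one direction or the other. The pairwise loop is quadratic, so a rank-based computation would be preferable for large datasets.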

Prompt validation

Bias
Assesses potential bias in a Large Language Model by analyzing the distribution and order of exemplars in the prompt.
Clarity
Evaluates and scores the clarity of prompts in a Large Language Model based on specified guidelines.
Conciseness
Analyzes and grades the conciseness of prompts provided to a Large Language Model.
Delimitation
Evaluates the proper use of delimiters in prompts provided to Large Language Models.
NegativeInstruction
Evaluates and grades the use of affirmative, proactive language over negative instructions in LLM prompts.
Robustness
Assesses the robustness of prompts provided to a Large Language Model under varying conditions and contexts. This test specifically measures the model's ability to generate correct classifications with the given prompt even when the inputs are edge…
Specificity
Evaluates and scores the specificity of prompts provided to a Large Language Model (LLM), based on clarity, detail, and relevance.

Ongoing monitoring

CalibrationCurveDrift
Evaluates changes in probability calibration between reference and monitoring datasets.
ClassDiscriminationDrift
Compares classification discrimination metrics between reference and monitoring datasets.
ClassImbalanceDrift
Evaluates drift in class distribution between reference and monitoring datasets.
ClassificationAccuracyDrift
Compares classification accuracy metrics between reference and monitoring datasets.
ConfusionMatrixDrift
Compares confusion matrix metrics between reference and monitoring datasets.
CumulativePredictionProbabilitiesDrift
Compares cumulative prediction probability distributions between reference and monitoring datasets.
FeatureDrift
Evaluates changes in feature distribution over time to identify potential model drift.
PredictionAcrossEachFeature
Assesses differences in model predictions across individual features between reference and monitoring datasets through visual analysis.
PredictionCorrelation
Assesses correlation changes between model predictions from reference and monitoring datasets to detect potential target drift.
PredictionProbabilitiesHistogramDrift
Compares prediction probability distributions between reference and monitoring datasets.
PredictionQuantilesAcrossFeatures
Assesses differences in model prediction distributions across individual features between reference and monitoring datasets through quantile analysis.
ROCCurveDrift
Compares ROC curves between reference and monitoring datasets.
ScoreBandsDrift
Analyzes drift in population distribution and default rates across score bands.
ScorecardHistogramDrift
Compares score distributions between reference and monitoring datasets for each class.
TargetPredictionDistributionPlot
Assesses differences in prediction distributions between a reference dataset and a monitoring dataset to identify potential data drift.
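
The monitoring tests above compare a reference dataset with a monitoring dataset. One widely used way to put a number on such a shift is the Population Stability Index (PSI). The sketch below is an illustration under stated assumptions, not a description of how any test on this page is implemented: the function name, the equal-width binning over the reference range, and the `n_bins` and `eps` parameters are all choices made for this example.

```python
import math


def population_stability_index(reference, monitoring, n_bins=10, eps=1e-6):
    """PSI between two samples, using equal-width bins from the reference range."""
    lo, hi = min(reference), max(reference)
    # Fall back to a unit width when the reference sample is constant.
    width = (hi - lo) / n_bins or 1.0

    def proportions(values):
        counts = [0] * n_bins
        for v in values:
            idx = int((v - lo) / width)
            # Clamp values outside the reference range into the edge bins.
            idx = min(max(idx, 0), n_bins - 1)
            counts[idx] += 1
        total = len(values)
        # Floor each proportion at eps so the logarithm is always defined.
        return [max(c / total, eps) for c in counts]

    ref_p = proportions(reference)
    mon_p = proportions(monitoring)
    return sum((m - r) * math.log(m / r) for r, m in zip(ref_p, mon_p))


reference = list(range(100))
shifted = [v + 30 for v in reference]
print(population_stability_index(reference, reference))  # prints 0.0
print(population_stability_index(reference, shifted) > 0.25)  # prints True
```

Identical samples give a PSI of zero, and the value grows as the monitoring distribution moves away from the reference. The cut-off that counts as significant drift is a convention that varies between teams.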

© Copyright 2025 ValidMind Inc. All Rights Reserved.
