• About
  • Get Started
  • Guides
  • ValidMind Library
    • ValidMind Library
    • Supported Models
    • QuickStart Notebook

    • TESTING
    • Run Tests & Test Suites
    • Test Descriptions
    • Test Sandbox (BETA)

    • CODE SAMPLES
    • All Code Samples · LLM · NLP · Time Series · Etc.
    • Download Code Samples · notebooks.zip
    • Try it on JupyterHub

    • REFERENCE
    • ValidMind Library Python API
  • Support
  • Training
  • Releases
  • Documentation
    • About ​ValidMind
    • Get Started
    • Guides
    • Support
    • Releases

    • Python Library
    • ValidMind Library

    • ValidMind Academy
    • Training Courses
  • Log In
    • Public Internet
    • ValidMind Platform · US1
    • ValidMind Platform · CA1

    • Private Link
    • Virtual Private ValidMind (VPV)

    • Which login should I use?
  1. tests
  2. data_validation
  3. TextDescription

EU AI Act Compliance — Read our original regulation brief on how the EU AI Act aims to balance innovation with safety and accountability, setting standards for responsible AI use

  • ValidMind Library

  • Python API
  • 2.8.13
  • init
  • init_dataset
  • init_model
  • init_r_model
  • get_test_suite
  • log_metric
  • preview_template
  • print_env
  • reload
  • run_documentation_tests
  • run_test_suite
  • tags
  • tasks
  • test
  • log_text
  • RawData
    • RawData
    • inspect
    • serialize

  • Submodules
  • __version__
  • datasets
    • classification
      • customer_churn
      • taiwan_credit
    • credit_risk
      • lending_club
      • lending_club_bias
    • nlp
      • cnn_dailymail
      • twitter_covid_19
    • regression
      • fred
      • lending_club
  • errors
  • test_suites
    • classifier
    • cluster
    • embeddings
    • llm
    • nlp
    • parameters_optimization
    • regression
    • statsmodels_timeseries
    • summarization
    • tabular_datasets
    • text_data
    • time_series
  • tests
    • data_validation
      • ACFandPACFPlot
      • ADF
      • AutoAR
      • AutoMA
      • AutoStationarity
      • BivariateScatterPlots
      • BoxPierce
      • ChiSquaredFeaturesTable
      • ClassImbalance
      • CommonWords
      • DatasetDescription
      • DatasetSplit
      • DescriptiveStatistics
      • DickeyFullerGLS
      • Duplicates
      • EngleGrangerCoint
      • FeatureTargetCorrelationPlot
      • Hashtags
      • HighCardinality
      • HighPearsonCorrelation
      • IQROutliersBarPlot
      • IQROutliersTable
      • IsolationForestOutliers
      • JarqueBera
      • KPSS
      • LJungBox
      • LaggedCorrelationHeatmap
      • LanguageDetection
      • Mentions
      • MissingValues
      • MissingValuesBarPlot
      • MutualInformation
      • PearsonCorrelationMatrix
      • PhillipsPerronArch
      • PolarityAndSubjectivity
      • ProtectedClassesCombination
      • ProtectedClassesDescription
      • ProtectedClassesDisparity
      • ProtectedClassesThresholdOptimizer
      • Punctuations
      • RollingStatsPlot
      • RunsTest
      • ScatterPlot
      • ScoreBandDefaultRates
      • SeasonalDecompose
      • Sentiment
      • ShapiroWilk
      • Skewness
      • SpreadPlot
      • StopWords
      • TabularCategoricalBarPlots
      • TabularDateTimeHistograms
      • TabularDescriptionTables
      • TabularNumericalHistograms
      • TargetRateBarPlots
      • TextDescription
      • TimeSeriesDescription
      • TimeSeriesDescriptiveStatistics
      • TimeSeriesFrequency
      • TimeSeriesHistogram
      • TimeSeriesLinePlot
      • TimeSeriesMissingValues
      • TimeSeriesOutliers
      • TooManyZeroValues
      • Toxicity
      • UniqueRows
      • WOEBinPlots
      • WOEBinTable
      • ZivotAndrewsArch
      • nlp
    • model_validation
      • AdjustedMutualInformation
      • AdjustedRandIndex
      • AutoARIMA
      • BertScore
      • BleuScore
      • CalibrationCurve
      • ClassifierPerformance
      • ClassifierThresholdOptimization
      • ClusterCosineSimilarity
      • ClusterPerformanceMetrics
      • ClusterSizeDistribution
      • CompletenessScore
      • ConfusionMatrix
      • ContextualRecall
      • CumulativePredictionProbabilities
      • DurbinWatsonTest
      • FeatureImportance
      • FeaturesAUC
      • FowlkesMallowsScore
      • GINITable
      • HomogeneityScore
      • HyperParametersTuning
      • KMeansClustersOptimization
      • KolmogorovSmirnov
      • Lilliefors
      • MeteorScore
      • MinimumAccuracy
      • MinimumF1Score
      • MinimumROCAUCScore
      • ModelMetadata
      • ModelParameters
      • ModelPredictionResiduals
      • ModelsPerformanceComparison
      • OverfitDiagnosis
      • PermutationFeatureImportance
      • PopulationStabilityIndex
      • PrecisionRecallCurve
      • PredictionProbabilitiesHistogram
      • ROCCurve
      • RegardScore
      • RegressionCoeffs
      • RegressionErrors
      • RegressionErrorsComparison
      • RegressionFeatureSignificance
      • RegressionModelForecastPlot
      • RegressionModelForecastPlotLevels
      • RegressionModelSensitivityPlot
      • RegressionModelSummary
      • RegressionPerformance
      • RegressionPermutationFeatureImportance
      • RegressionR2Square
      • RegressionR2SquareComparison
      • RegressionResidualsPlot
      • RobustnessDiagnosis
      • RougeScore
      • SHAPGlobalImportance
      • ScoreProbabilityAlignment
      • ScorecardHistogram
      • SilhouettePlot
      • TimeSeriesPredictionWithCI
      • TimeSeriesPredictionsPlot
      • TimeSeriesR2SquareBySegments
      • TokenDisparity
      • ToxicityScore
      • TrainingTestDegradation
      • VMeasure
      • WeakspotsDiagnosis
      • sklearn
      • statsmodels
      • statsutils
    • prompt_validation
      • Bias
      • Clarity
      • Conciseness
      • Delimitation
      • NegativeInstruction
      • Robustness
      • Specificity
      • ai_powered_test
  • unit_metrics
  • vm_models

On this page

  • create_metrics_df
  • TextDescription
    • Purpose
    • Test Mechanism
    • Signs of High Risk
    • Strengths
    • Limitations
  • Edit this page
  • Report an issue
  1. tests
  2. data_validation
  3. TextDescription

validmind.TextDescription

create_metrics_df

defcreate_metrics_df(df,text_column,unwanted_tokens,lang):

TextDescription

@tags('nlp', 'text_data', 'visualization')

@tasks('text_classification', 'text_summarization')

defTextDescription(dataset:validmind.vm_models.VMDataset,unwanted_tokens:set={'s', "s'", 'mr', 'ms', 'mrs', 'dr', "'s", ' ', "''", 'dollar', 'us', '``'},lang:str='english'):

Conducts comprehensive textual analysis on a dataset using NLTK to evaluate various parameters and generate visualizations.

Purpose

The TextDescription test aims to conduct a thorough textual analysis of a dataset using the NLTK (Natural Language Toolkit) library. It evaluates various metrics such as total words, total sentences, average sentence length, total paragraphs, total unique words, most common words, total punctuations, and lexical diversity. The goal is to understand the nature of the text and anticipate challenges machine learning models might face in text processing, language understanding, or summarization tasks.

Test Mechanism

The test works by:

  • Parsing the dataset and tokenizing the text into words, sentences, and paragraphs using NLTK.
  • Removing stopwords and unwanted tokens.
  • Calculating parameters like total words, total sentences, average sentence length, total paragraphs, total unique words, total punctuations, and lexical diversity.
  • Generating scatter plots to visualize correlations between various metrics (e.g., Total Words vs Total Sentences).

Signs of High Risk

  • Anomalies or increased complexity in lexical diversity.
  • Longer sentences and paragraphs.
  • High uniqueness of words.
  • Large number of unwanted tokens.
  • Missing or erroneous visualizations.

Strengths

  • Essential for pre-processing text data in machine learning models.
  • Provides a comprehensive breakdown of text data, aiding in understanding its complexity.
  • Generates visualizations to help comprehend text structure and complexity.

Limitations

  • Highly dependent on the NLTK library, limiting the test to supported languages.
  • Limited customization for removing undesirable tokens and stop words.
  • Does not consider semantic or grammatical complexities.
  • Assumes well-structured documents, which may result in inaccuracies with poorly formatted text.
TargetRateBarPlots
TimeSeriesDescription

© Copyright 2025 ValidMind Inc. All Rights Reserved.

  • Edit this page
  • Report an issue
Cookie Preferences
  • validmind.com

  • Privacy Policy

  • Terms of Use