Distinct Ratio#

Check: distinct-ratio-check

Purpose: Validates that the ratio of distinct (non-null) values in a specified column exceeds a minimum threshold. A row set fails the check if the actual ratio of distinct values is lower than the configured min_ratio.

Python Configuration#

from sparkdq.checks import DistinctRatioCheckConfig
from sparkdq.core import Severity

DistinctRatioCheckConfig(
    check_id="passenger-count-uniqueness",
    column="passenger_count",
    min_ratio=0.8,
    severity=Severity.CRITICAL
)

Declarative Configuration#

- check: distinct-ratio-check
  check-id: passenger-count-uniqueness
  column: passenger_count
  min-ratio: 0.8
  severity: critical

Typical Use Cases#

  • ✅ Ensure that a column has a sufficiently high number of distinct (non-null) values.

  • ✅ Detect columns that may have too much repetition or lack of variability.

  • ✅ Identify potential issues such as constants, default-filled fields, or data entry errors.

  • ✅ Enforce entropy or uniqueness expectations for features used in ML models or analytics.