Unique Ratio
Check: unique-ratio-check
Purpose: Checks whether the ratio of unique (non-null) values in a specified column meets or exceeds a configured threshold. A row does not directly fail; instead, the dataset is considered invalid if the proportion of unique values is too low.
Note
If the configured min-ratio is not met, the check fails. Null values are excluded from the count of unique values.
The total number of rows, including rows with nulls, is used as the denominator.
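The arithmetic described in the note can be sketched in plain Python. This is an illustration only, not sparkdq's implementation: the helper names `unique_ratio` and `passes` are hypothetical, "unique" is assumed to mean distinct non-null values, and the handling of an empty column is an assumption.

```python
from typing import Hashable, Optional, Sequence


def unique_ratio(values: Sequence[Optional[Hashable]]) -> float:
    """Distinct non-null values divided by the total row count (nulls included)."""
    total_rows = len(values)
    if total_rows == 0:
        # Assumption: an empty column yields a ratio of 0.0.
        return 0.0
    # Nulls are left out of the numerator only.
    distinct_non_null = {v for v in values if v is not None}
    return len(distinct_non_null) / total_rows


def passes(values: Sequence[Optional[Hashable]], min_ratio: float) -> bool:
    """The check passes when the ratio meets or exceeds the threshold."""
    return unique_ratio(values) >= min_ratio


# 10 rows, 2 nulls, 6 distinct non-null values -> 6 / 10
vendor_ids = [1, 2, 2, 3, None, 4, 4, 5, None, 6]
ratio = unique_ratio(vendor_ids)
ok = passes(vendor_ids, min_ratio=0.7)
print(ratio, ok)
```

Because nulls stay in the denominator, a column with many nulls scores lower even if every non-null value is distinct.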
Python Configuration
from sparkdq.checks import UniqueRatioCheckConfig
from sparkdq.core import Severity

# Require distinct VendorID values to make up at least 70% of all rows.
UniqueRatioCheckConfig(
    check_id="vendor-id-uniqueness",
    column="VendorID",
    min_ratio=0.7,
    severity=Severity.CRITICAL,
)
Declarative Configuration
- check: unique-ratio-check
  check-id: vendor-id-uniqueness
  column: VendorID
  min-ratio: 0.7
  severity: critical
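The declarative keys mirror the Python keyword arguments, with hyphens in place of underscores. The sketch below shows that correspondence using a plain dict rather than a parsed file. The helper `to_kwargs` is hypothetical and not part of sparkdq, and how the library itself loads declarative configuration is not shown here.

```python
from typing import Any, Dict

# The declarative entry above, written as a plain Python dict.
declarative = {
    "check": "unique-ratio-check",
    "check-id": "vendor-id-uniqueness",
    "column": "VendorID",
    "min-ratio": 0.7,
    "severity": "critical",
}


def to_kwargs(entry: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical helper: rename hyphenated keys to Python-style names.

    The "check" key is dropped because it selects the check type
    rather than configuring it.
    """
    return {
        key.replace("-", "_"): value
        for key, value in entry.items()
        if key != "check"
    }


kwargs = to_kwargs(declarative)
print(kwargs)
```

The `severity` value stays a string in this sketch; in the Python configuration it corresponds to `Severity.CRITICAL`.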
Typical Use Cases
✅ Ensure that a column intended to be mostly unique (e.g., IDs, hashes) behaves as expected.
✅ Detect issues where only a few values are repeated frequently, reducing feature usefulness.
✅ Prevent downstream errors due to low-entropy or non-discriminative values.
✅ Support feature quality checks in ML preprocessing pipelines.