Count Max
Check: row-count-max-check
Purpose: Verifies that the input DataFrame does not exceed a specified maximum number of rows. This helps detect unexpected data growth or duplication early in the pipeline.
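As a rough illustration of the idea behind this check, the sketch below compares a row count against an upper bound. It is not sparkdq's implementation: the function name `row_count_within_max` and the `FakeFrame` stand-in are hypothetical, and the inclusive bound (a DataFrame with exactly `max_count` rows passes) is an assumption based on the wording "does not exceed". Any object with a Spark-style `count()` method, such as a PySpark DataFrame, could be passed in.

```python
from typing import Protocol


class SupportsCount(Protocol):
    """Anything exposing a Spark-style count() method, e.g. a PySpark DataFrame."""

    def count(self) -> int: ...


def row_count_within_max(df: SupportsCount, max_count: int) -> bool:
    """Return True if df has at most max_count rows.

    Hypothetical helper; the inclusive upper bound is an assumption.
    """
    if max_count < 0:
        raise ValueError("max_count must be non-negative")
    return df.count() <= max_count


class FakeFrame:
    """Stand-in for a DataFrame so the sketch runs without a Spark session."""

    def __init__(self, n_rows: int) -> None:
        self._n_rows = n_rows

    def count(self) -> int:
        return self._n_rows


print(row_count_within_max(FakeFrame(99_999), 100_000))   # True
print(row_count_within_max(FakeFrame(100_001), 100_000))  # False
```

Note that on a real Spark DataFrame `count()` triggers a full job, so a check of this kind has a cost proportional to the size of the input.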
Python Configuration
from sparkdq.checks import RowCountMaxCheckConfig
from sparkdq.core import Severity
RowCountMaxCheckConfig(
    check_id="prevent_oversize_batch",
    max_count=100000,
    severity=Severity.ERROR,
)
Declarative Configuration
- check: row-count-max-check
  check-id: prevent_oversize_batch
  max-count: 100000
  severity: error
Typical Use Cases
✅ Detect abnormal data growth that may indicate duplicates or incorrect joins.
✅ Prevent downstream systems (e.g., reports or dashboards) from processing unexpectedly large datasets.
✅ Catch unintentional full loads when only incremental data was expected.