Unique Rows#

Check: unique-rows-check

Purpose: Ensures that all rows in the dataset are unique, either across all columns or across a specified subset of columns. This check helps detect unintended data duplication and enforces row-level uniqueness constraints.

Note

If no subset is provided, the check considers all columns to determine uniqueness.
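
For intuition, the condition this check enforces can be expressed in plain PySpark. The following is a minimal sketch of the semantics only, not the library's internal implementation: rows sharing the same values in the subset columns count as duplicates.

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [("t1", "08:00"), ("t1", "08:00"), ("t2", "09:15")],
    ["trip_id", "pickup_time"],
)

# Group by the subset columns; any group with more than one row is a duplicate.
subset = ["trip_id", "pickup_time"]
duplicates = df.groupBy(*subset).count().filter(F.col("count") > 1)
duplicates.show()  # ("t1", "08:00") occurs twice, so it is flagged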

Python Configuration#

from sparkdq.checks import UniqueRowsCheckConfig
from sparkdq.core import Severity

UniqueRowsCheckConfig(
    check_id="no_duplicate_rows",
    # Uniqueness is evaluated over these columns only; omit to use all columns
    subset_columns=["trip_id", "pickup_time"],
    severity=Severity.CRITICAL,
)
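
The config object is then registered and executed as part of a validation run. A minimal sketch follows, assuming a CheckSet/BatchDQEngine-style API; the import paths and class names here are assumptions for illustration, so verify them against your installed sparkdq version.

from sparkdq.checks import UniqueRowsCheckConfig
from sparkdq.core import Severity

# Assumed entry points -- confirm these import paths in your sparkdq release.
from sparkdq.engine import BatchDQEngine
from sparkdq.management import CheckSet

check_set = CheckSet()
check_set.add_check(
    UniqueRowsCheckConfig(
        check_id="no_duplicate_rows",
        subset_columns=["trip_id", "pickup_time"],
        severity=Severity.CRITICAL,
    )
)

result = BatchDQEngine(check_set).run_batch(df)  # df: the DataFrame under validation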

Declarative Configuration#

- check: unique-rows-check
  check-id: no_duplicate_rows
  subset-columns:
    - trip_id
    - pickup_time
  severity: critical

Typical Use Cases#

  • ✅ Enforce uniqueness on primary key–like columns (e.g., trip_id, user_id)

  • ✅ Detect duplicated records caused by faulty joins, reprocessing, or ingestion errors

  • ✅ Ensure referential integrity before merging datasets or writing to transactional stores

  • ✅ Validate correctness of deduplication logic in preprocessing pipelines