Foreign Key Check#

Check: foreign-key-check

Purpose: Validates that all values in a specified column exist in a reference dataset’s column. This check ensures referential integrity between two datasets — typically used to verify that foreign keys are resolvable.

This check operates at the aggregate level. It computes the number and ratio of missing foreign key references and returns a single validation result for the entire dataset.

The check fails if any value in the source column is not found in the referenced column of the injected reference dataset.

Python Configuration#

from sparkdq.checks import ForeignKeyCheckConfig
from sparkdq.core import Severity

ForeignKeyCheckConfig(
    check_id="valid_customer_id",
    column="customer_id",
    reference_dataset="customers",
    reference_column="id",
    severity=Severity.CRITICAL
)

Declarative Configuration#

- check: foreign-key-check
  check-id: valid_customer_id
  column: customer_id
  reference-dataset: customers
  reference-column: id
  severity: critical

Typical Use Cases#

  • ✅ Ensure that every order.customer_id exists in the customers table.

  • ✅ Validate that foreign keys in fact tables (e.g., sales.product_id) refer to known dimension entries.

  • ✅ Guarantee relational integrity between joined datasets in data lakes or data warehouses.