Defining Checks

SparkDQ supports two types of checks: row-level checks that validate each record individually, and aggregate checks that evaluate the dataset as a whole. Both can be defined directly in Python or loaded declaratively from YAML or JSON.

Every check accepts an optional severity parameter. Severity.CRITICAL (the default) marks failing rows as invalid, whereas Severity.WARNING records the violation but keeps the rows in pass_df().
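The difference between the two severities can be sketched in plain Python. This is only an illustration of the semantics described above, not SparkDQ's implementation: the Severity enum here is a local stand-in, and split_rows, the check names, and the _violations field are hypothetical.

```python
from enum import Enum


class Severity(Enum):
    """Local stand-in for sparkdq.core.Severity, for illustration only."""
    CRITICAL = "critical"
    WARNING = "warning"


def split_rows(rows, checks):
    """Split rows into (passed, failed) given (name, predicate, severity) triples.

    A row fails only if it violates at least one CRITICAL check.
    WARNING violations are recorded on the row but do not fail it.
    """
    passed, failed = [], []
    for row in rows:
        violations = [(name, sev) for name, is_valid, sev in checks if not is_valid(row)]
        annotated = {**row, "_violations": [name for name, _ in violations]}
        if any(sev is Severity.CRITICAL for _, sev in violations):
            failed.append(annotated)
        else:
            passed.append(annotated)
    return passed, failed


rows = [
    {"id": 1, "email": "a@example.com", "age": 30},
    {"id": 2, "email": None, "age": 41},
    {"id": 3, "email": "c@example.com", "age": -5},
]
checks = [
    ("email-not-null", lambda r: r["email"] is not None, Severity.WARNING),
    ("age-non-negative", lambda r: r["age"] >= 0, Severity.CRITICAL),
]
passed, failed = split_rows(rows, checks)
```

Row 2 violates only the WARNING check, so it stays in the passing set with the violation recorded. Row 3 violates the CRITICAL check and is moved to the failing set.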

Python-native Configuration

For dynamic or code-driven use cases (e.g. notebooks, CI pipelines), you can define checks directly in Python using type-safe config classes. The CheckSet supports both the classic and the fluent API style.

The first example below uses the fluent API (recommended); the second uses the classic API.

from sparkdq.checks import NullCheckConfig, RowCountBetweenCheckConfig
from sparkdq.core import Severity
from sparkdq.management import CheckSet

check_set = (
    CheckSet()
    .add_check(
        NullCheckConfig(
            check_id="my-null-check",
            columns=["email"],
            severity=Severity.WARNING,  # optional; defaults to Severity.CRITICAL
        )
    )
    .add_check(
        RowCountBetweenCheckConfig(
            check_id="my-count-check",
            min_count=100,
            max_count=5000
        )
    )
)

from sparkdq.checks import NullCheckConfig, RowCountBetweenCheckConfig
from sparkdq.core import Severity
from sparkdq.management import CheckSet

check_set = CheckSet()
check_set.add_check(
    NullCheckConfig(
        check_id="my-null-check",
        columns=["email"],
        severity=Severity.WARNING,  # optional; defaults to Severity.CRITICAL
    )
)
check_set.add_check(
    RowCountBetweenCheckConfig(
        check_id="my-count-check",
        min_count=100,
        max_count=5000
    )
)
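Both styles register the same checks in the same order. A common way for one class to support both is to have add_check return the object itself. The toy class below sketches that pattern. MiniCheckSet is hypothetical and stands in for a check container only for illustration; SparkDQ's actual CheckSet may be implemented differently.

```python
class MiniCheckSet:
    """Toy stand-in for a check container; not SparkDQ's CheckSet."""

    def __init__(self):
        self.checks = []

    def add_check(self, config):
        self.checks.append(config)
        return self  # returning self is what makes chaining possible


# Fluent style: each call returns the set, so calls can be chained
fluent = MiniCheckSet().add_check("null-check").add_check("count-check")

# Classic style: the return value is simply ignored
classic = MiniCheckSet()
classic.add_check("null-check")
classic.add_check("count-check")
```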

Declarative Configuration

If you use a metadata-driven or config-as-code approach, SparkDQ also supports declarative check definitions via dictionaries, for example loaded from YAML or JSON files.

# dq_checks.yaml
- check: null-check
  check-id: my-null-check
  columns:
    - email
  severity: warning

- check: row-count-between-check
  check-id: my-count-check
  min-count: 100
  max-count: 5000

Note

SparkDQ does not install pyyaml or any other config parser. You are responsible for loading your config into a list of Python dictionaries (one per check); SparkDQ takes it from there.

To load the configuration into SparkDQ, use the following code:

import yaml
from sparkdq.management import CheckSet

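# Parse the YAML file into a list of dictionaries, one per check
with open("dq_checks.yaml", encoding="utf-8") as f:
    config = yaml.safe_load(f)

# Register every check described in the config
check_set = CheckSet()
check_set.add_checks_from_dicts(config)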
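The same checks can be expressed in JSON, which needs only the standard library to parse. The sketch below assumes the JSON keys mirror the YAML keys shown above. The inline string stands in for a file such as a hypothetical dq_checks.json, and the SparkDQ calls are left as comments so the snippet runs without the library installed.

```python
import json

# Same two checks as dq_checks.yaml, expressed as JSON
raw = """
[
  {"check": "null-check", "check-id": "my-null-check",
   "columns": ["email"], "severity": "warning"},
  {"check": "row-count-between-check", "check-id": "my-count-check",
   "min-count": 100, "max-count": 5000}
]
"""
config = json.loads(raw)

# config is a list of dicts, the same shape yaml.safe_load produces
# for the YAML file, so it can be passed on in the same way:
# check_set = CheckSet()
# check_set.add_checks_from_dicts(config)
```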