Validating DataFrames¶

Once your checks are defined and grouped into a CheckSet, pass them to the BatchDQEngine together with your Spark DataFrame. The engine evaluates all rules in a single pass and returns a structured BatchValidationResult.

from sparkdq.checks import NullCheckConfig
from sparkdq.engine import BatchDQEngine
from sparkdq.management import CheckSet

check_set = CheckSet()
check_set.add_check(NullCheckConfig(check_id="my-null-check", columns=["email"]))

result = BatchDQEngine(check_set).run_batch(df)

result.pass_df().show()
result.fail_df().select("_dq_errors").show(truncate=False)
print(result.summary())

What Happens Under the Hood¶

When run_batch() is called, the engine:

evaluates all row-level checks and marks failing rows individually
evaluates all aggregate checks against the full DataFrame
annotates every row with _dq_passed, _dq_errors, and _dq_validation_ts
returns a result object ready to be queried or routed

Working with the Result¶

Method	Description
`pass_df()`	Records that passed all critical checks
`fail_df()`	Records that failed at least one critical check
`warn_df()`	Passing records that triggered one or more warning-level violations
`summary()`	Structured summary with record counts, pass rate, and timestamp

Result Columns¶

SparkDQ automatically adds the following columns to your DataFrame:

Column	Description
`_dq_passed`	Boolean flag — `true` if the row passed all critical checks
`_dq_errors`	Array of structured errors per failed check (check-id, name, severity)
`_dq_validation_ts`	Timestamp of the validation run