Validating DataFrames¶
Once your checks are defined and grouped into a CheckSet, pass them to the BatchDQEngine together with your Spark DataFrame. The engine evaluates all rules in a single pass and returns a structured BatchValidationResult.
from sparkdq.checks import NullCheckConfig
from sparkdq.engine import BatchDQEngine
from sparkdq.management import CheckSet
check_set = CheckSet()
check_set.add_check(NullCheckConfig(check_id="my-null-check", columns=["email"]))
result = BatchDQEngine(check_set).run_batch(df)
result.pass_df().show()
result.fail_df().select("_dq_errors").show(truncate=False)
print(result.summary())
What Happens Under the Hood¶
When run_batch() is called, the engine:
- evaluates all row-level checks and marks failing rows individually
- evaluates all aggregate checks against the full DataFrame
- annotates every row with
_dq_passed,_dq_errors, and_dq_validation_ts - returns a result object ready to be queried or routed
Working with the Result¶
| Method | Description |
|---|---|
pass_df() |
Records that passed all critical checks |
fail_df() |
Records that failed at least one critical check |
warn_df() |
Passing records that triggered one or more warning-level violations |
summary() |
Structured summary with record counts, pass rate, and timestamp |
Result Columns¶
SparkDQ automatically adds the following columns to your DataFrame:
| Column | Description |
|---|---|
_dq_passed |
Boolean flag — true if the row passed all critical checks |
_dq_errors |
Array of structured errors per failed check (check-id, name, severity) |
_dq_validation_ts |
Timestamp of the validation run |