sparkdq.engine

class BatchDQEngine(check_set: CheckSet | None = None, fail_levels: List[Severity] = [Severity.CRITICAL])

Bases: BaseDQEngine

Engine for executing data quality checks on Spark DataFrames in batch mode.

This engine applies both row-level and aggregate-level checks using the BatchCheckRunner and annotates the DataFrame with error metadata.

run_batch(df: DataFrame) → BatchValidationResult

Run all registered checks against the given DataFrame.

This method applies both row-level and aggregate-level checks and returns a validation result containing the annotated DataFrame and the aggregated check results.

Parameters:

df (DataFrame) – The input Spark DataFrame to validate.

Returns:

Object containing the validated DataFrame, aggregate check results, and the original input column names.

Return type:

BatchValidationResult
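
Example usage (a minimal sketch; the import path for BatchDQEngine is assumed from this module's name, and the sample DataFrame is illustrative):

    from pyspark.sql import SparkSession

    from sparkdq.engine import BatchDQEngine  # assumed import location

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(1, "a"), (2, None)], ["id", "value"])

    # With the defaults from the signature above (no CheckSet yet,
    # fail_levels=[Severity.CRITICAL]), no checks are registered. A real
    # pipeline would pass a configured CheckSet via check_set.
    engine = BatchDQEngine()
    result = engine.run_batch(df)

    result.df.show()  # input rows, annotated with _dq_* metadata columns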

class BatchValidationResult(df: DataFrame, aggregate_results: List[AggregateCheckResult], input_columns: List[str], timestamp: datetime = <factory>)

Bases: object

Encapsulates the results of a batch data quality validation run.

Includes:

  • The validated Spark DataFrame, annotated with validation metadata.

  • A list of results from aggregate-level checks.

  • The original input column names, used to restore the pre-validation structure.

This class provides convenience methods to access only the passing or failing rows, making it easier to route or analyze validated data downstream.

df

The DataFrame after validation, including:

  • _dq_passed (bool): Row-level pass/fail status.

  • _dq_errors (array): Structured errors from failed checks.

  • _dq_aggregate_errors (array, optional): Errors from failed aggregates.

Type:

DataFrame
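
For orientation, a short sketch of inspecting these metadata columns on the annotated DataFrame (column names taken from the list above; result is a BatchValidationResult):

    from pyspark.sql import functions as F

    # Show the row-level status and structured error entries.
    result.df.select("_dq_passed", "_dq_errors").show(truncate=False)

    # Count failing rows directly on the annotated DataFrame.
    result.df.filter(~F.col("_dq_passed")).count()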

aggregate_results

Results of all aggregate checks.

Type:

List[AggregateCheckResult]

input_columns

Names of the original input columns.

Type:

List[str]

timestamp

Timestamp of when the validation result was created.

Type:

datetime

fail_df() → DataFrame

Return only the rows that failed one or more critical checks.

This includes validation metadata such as _dq_errors, _dq_passed, and, if present, _dq_aggregate_errors. Additionally, a _dq_validation_ts column is added for downstream auditing or tracking.

Returns:

DataFrame containing invalid rows and relevant error metadata.

Return type:

DataFrame
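
A sketch of quarantining failed rows for auditing; the output path and format are placeholders:

    failed = result.fail_df()

    # Persist invalid rows together with their error metadata
    # (_dq_errors, _dq_validation_ts, ...) for later inspection.
    failed.write.mode("append").parquet("/tmp/dq/quarantine")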

pass_df() → DataFrame

Return only the rows that passed all critical checks.

This method filters for rows where _dq_passed is True and restores the original column structure from the input DataFrame.

Returns:

DataFrame containing only valid rows, with the original schema.

Return type:

DataFrame
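
A sketch of forwarding the clean rows downstream; the sink is illustrative:

    valid = result.pass_df()

    # pass_df() restores the original input structure, so no _dq_*
    # metadata columns leak into downstream tables.
    assert valid.columns == result.input_columns

    valid.write.mode("append").parquet("/tmp/dq/clean")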

summary() → ValidationSummary

Create a summary of the validation results.

This includes total record count, pass/fail statistics, and number of rows with warning-level errors.

Returns:

Structured summary of the validation outcome.

Return type:

ValidationSummary
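
A sketch of using the summary as a simple quality gate; the attribute names on ValidationSummary are not documented on this page, so the gate is shown commented out as an assumption:

    summary = result.summary()
    print(summary)  # total records, pass/fail statistics, warning counts

    # Hypothetical gate; the field name is an assumption:
    # if summary.failed_records > 0:
    #     raise RuntimeError("Data quality gate failed")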

warn_df() → DataFrame

Return rows that passed all critical checks but contain warning-level violations.

These are rows where _dq_passed is True, but the _dq_errors array contains at least one entry with severity == WARNING.

Returns:

Filtered DataFrame of rows with warnings.

Return type:

DataFrame
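
A sketch of handling warning rows; these rows still pass, so they can be forwarded while their violations are logged for follow-up:

    warned = result.warn_df()

    # Inspect the warning-level entries in _dq_errors.
    warned.select("_dq_errors").show(truncate=False)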